Out-of-Period Terms in LdaSeqModel
Fan-chen04 opened this issue
Problem description
I'm working on a dynamic topic modeling project that uses LdaSeqModel to analyze 20,000 online Chinese reviews of tourist attractions. In the trained model, certain terms appear in time periods where they shouldn't, and I suspect this is caused by the model's smoothing across time slices. For example, terms like "COVID-19" show up in time periods before the outbreak of the COVID-19 pandemic, which is unexpected.
Here is my code
import ast

import pandas as pd
from gensim.corpora import Dictionary
from gensim.models import ldaseqmodel
from gensim.utils import simple_preprocess

# pos_df is the review DataFrame, loaded earlier (not shown).
pos_df['date'] = pd.to_datetime(pos_df['date'], unit='ms')
pos_df.sort_values(by='date', ascending=True, inplace=True)
pos_df.set_index("date", inplace=True)

# Number of documents (rows) in each period; the corpus must keep this date order.
pos_time_slice = [len(pos_df[:'2017-07-08']),
                  len(pos_df['2017-07-09':'2020-01-03']),
                  len(pos_df['2020-01-04':'2022-12-07']),
                  len(pos_df['2022-12-08':])]

def get_dic(data_df, col, no_below, no_above):
    # Assumes each cell is the string form of a token list; literal_eval
    # parses it without executing arbitrary code, unlike eval.
    texts = data_df[col].apply(lambda x: ' '.join(ast.literal_eval(x)))
    texts = [simple_preprocess(text) for text in texts]
    dictionary = Dictionary(texts)
    if len(texts) > 10000:
        dictionary.filter_extremes(no_below=no_below, no_above=no_above)
    else:
        dictionary.filter_extremes(no_below=10, no_above=0.4)
    pos_corpus = [dictionary.doc2bow(text) for text in texts]
    return texts, pos_corpus, dictionary

pos_texts, pos_corpus, pos_dictionary = get_dic(pos_df, "segmented comments", pos_below, pos_above)
pos_DTM = ldaseqmodel.LdaSeqModel(corpus=pos_corpus, id2word=pos_dictionary,
                                  time_slice=pos_time_slice, num_topics=pos_topic_num)
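Since LdaSeqModel assigns documents to periods purely by position (the first time_slice[0] documents go to the first period, and so on), a mismatch between the slice counts and the corpus order would put late reviews into early periods. A minimal sanity check, as a sketch: the helper name check_time_slices is hypothetical and not part of gensim, and the toy corpus only stands in for pos_corpus.

```python
def check_time_slices(corpus, time_slice):
    """Return the (start, end) document index range of each slice.

    Hypothetical helper, not a gensim API. Raises ValueError when the
    slices are empty or do not add up to the corpus length.
    """
    if any(n <= 0 for n in time_slice):
        raise ValueError("every time slice needs at least one document")
    if sum(time_slice) != len(corpus):
        raise ValueError(
            f"time_slice sums to {sum(time_slice)}, corpus has {len(corpus)} documents"
        )
    bounds, start = [], 0
    for n in time_slice:
        bounds.append((start, start + n))
        start += n
    return bounds


# Toy bag-of-words corpus in the doc2bow format: lists of (term_id, count).
toy_corpus = [[(0, 1)], [(1, 2)], [(0, 1), (1, 1)]]
print(check_time_slices(toy_corpus, [1, 2]))  # [(0, 1), (1, 3)]
```

With the real data the call would be check_time_slices(pos_corpus, pos_time_slice); the returned ranges can then be compared against the sorted dates in pos_df.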
print(my_model.lifecycle_events)
[{'fname_or_handle': 'D:/.../pos_DTM_topic5_below100_above0.5',
'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(),
'datetime': '2024-03-21T10:07:54.953576', 'gensim': '4.3.1',
'python': '3.8.10 (tags/v3.8.10:3d8993a, May 3 2021, 11:48:03) [MSC v.1928 64 bit (AMD64)]',
'platform': 'Windows-10-10.0.22631-SP0', 'event': 'saving'},
{'fname': 'D:/:.../pos_DTM_topic5_below100_above0.5',
'datetime': '2024-03-31T16:24:59.672631', 'gensim': '4.3.2',
'python': '3.8.10 (tags/v3.8.10:3d8993a, May 3 2021, 11:48:03) [MSC v.1928 64 bit (AMD64)]',
'platform': 'Windows-10-10.0.22631-SP0', 'event': 'loaded'}]
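To separate a smoothing effect from a data problem, it may help to count how many documents in each period actually contain the term. The sketch below works directly on the bag-of-words corpus and the time slices; term_doc_freq_by_slice is a hypothetical helper name, and the toy corpus stands in for pos_corpus.

```python
def term_doc_freq_by_slice(corpus, time_slice, term_id):
    """Count the documents in each time slice that contain term_id.

    Hypothetical helper, not a gensim API. corpus is a list of doc2bow
    documents, i.e. lists of (term_id, count) pairs, in time-slice order.
    """
    counts, start = [], 0
    for n in time_slice:
        docs = corpus[start:start + n]
        counts.append(sum(1 for doc in docs if any(tid == term_id for tid, _ in doc)))
        start += n
    return counts


# Toy corpus: term 1 occurs only in the last two documents.
toy_corpus = [[(0, 1)], [(0, 2)], [(0, 1), (1, 1)], [(1, 3)]]
print(term_doc_freq_by_slice(toy_corpus, [2, 2], term_id=1))  # [0, 2]
```

On the real data, the term id could be looked up with pos_dictionary.token2id, using whatever token form the preprocessing produced for the term. If the early periods come back as zero while the trained model still gives the term some weight there, that would be consistent with the model sharing one vocabulary across all periods and smoothing term weights over time, rather than with misdated documents.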
Versions
Windows-10-10.0.22631-SP0
Python 3.8.10 (tags/v3.8.10:3d8993a, May 3 2021, 11:48:03) [MSC v.1928 64 bit (AMD64)]
Bits 64
NumPy 1.24.2
SciPy 1.8.1
gensim 4.3.2
FAST_VERSION 1