piskvorky / gensim

Topic Modelling for Humans

Home Page: https://radimrehurek.com/gensim

Out-of-Period Terms in LdaSeqModel

Fan-chen04 opened this issue · comments

Problem description

I'm working on a dynamic topic modeling project that uses LdaSeqModel to analyze 20,000 online Chinese reviews of tourist attractions. After training, some terms show up in the topics of time slices where they should not occur. For example, "COVID-19" appears in time slices that end before the pandemic began. I suspect this is caused by the model's smoothing across time slices.

Here is my code:

import ast

import pandas as pd
from gensim.corpora import Dictionary
from gensim.models import ldaseqmodel
from gensim.utils import simple_preprocess

# pos_df, pos_below, pos_above and pos_topic_num are defined earlier (not shown).
pos_df['date'] = pd.to_datetime(pos_df['date'], unit='ms')
pos_df.sort_values(by='date', ascending=True, inplace=True)
pos_df.set_index("date", inplace=True)

# Number of documents in each time slice; the slices must cover every row exactly once.
pos_time_slice = [len(pos_df.loc[:'2017-07-08']),
                  len(pos_df.loc['2017-07-09':'2020-01-03']),
                  len(pos_df.loc['2020-01-04':'2022-12-07']),
                  len(pos_df.loc['2022-12-08':])]

def get_dic(data_df, col, no_below, no_above):
    # Each cell holds a stringified token list; literal_eval parses it without executing code.
    texts = data_df[col].apply(lambda x: ' '.join(ast.literal_eval(x)))
    texts = [simple_preprocess(text) for text in texts]
    dictionary = Dictionary(texts)
    if len(texts) > 10000:
        dictionary.filter_extremes(no_below=no_below, no_above=no_above)
    else:
        dictionary.filter_extremes(no_below=10, no_above=0.4)
    pos_corpus = [dictionary.doc2bow(text) for text in texts]
    return texts, pos_corpus, dictionary

pos_texts, pos_corpus, pos_dictionary = get_dic(pos_df, "segmented comments", pos_below, pos_above)
pos_DTM = ldaseqmodel.LdaSeqModel(corpus=pos_corpus, id2word=pos_dictionary, time_slice=pos_time_slice, num_topics=pos_topic_num)
print(my_model.lifecycle_events)
[{'fname_or_handle': 'D:/.../pos_DTM_topic5_below100_above0.5', 
'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 
'datetime': '2024-03-21T10:07:54.953576', 'gensim': '4.3.1', 
'python': '3.8.10 (tags/v3.8.10:3d8993a, May  3 2021, 11:48:03) [MSC v.1928 64 bit (AMD64)]', 
'platform': 'Windows-10-10.0.22631-SP0', 'event': 'saving'}, 
{'fname': 'D:/:.../pos_DTM_topic5_below100_above0.5', 
'datetime': '2024-03-31T16:24:59.672631', 'gensim': '4.3.2', 
'python': '3.8.10 (tags/v3.8.10:3d8993a, May  3 2021, 11:48:03) [MSC v.1928 64 bit (AMD64)]', 
'platform': 'Windows-10-10.0.22631-SP0', 'event': 'loaded'}]
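
To rule out a date or slicing mistake on my side, a check along the following lines could count how many documents in each time slice actually contain a given term. This is only a minimal sketch: `docs_per_slice_with_term` is a hypothetical helper written for illustration, not part of gensim, and the toy corpus stands in for `pos_corpus` and `pos_time_slice`.

```python
from itertools import accumulate


def docs_per_slice_with_term(corpus, time_slice, term_id):
    """Count, for each time slice, the documents that contain term_id.

    corpus: bag-of-words documents [(token_id, count), ...] in time order.
    time_slice: number of documents in each consecutive slice.
    """
    if sum(time_slice) != len(corpus):
        raise ValueError("time_slice must sum to the number of documents")
    ends = list(accumulate(time_slice))   # exclusive end index of each slice
    starts = [0] + ends[:-1]              # start index of each slice
    return [
        sum(any(tid == term_id for tid, _ in doc) for doc in corpus[s:e])
        for s, e in zip(starts, ends)
    ]


# Toy data (hypothetical): 4 documents, 2 slices of 2 documents each.
toy_corpus = [[(0, 1)], [(0, 2), (1, 1)], [(1, 1), (2, 3)], [(2, 1)]]
toy_time_slice = [2, 2]

counts = docs_per_slice_with_term(toy_corpus, toy_time_slice, term_id=2)
print(counts)  # [0, 2]: term 2 never occurs in the first slice
```

With the real data, `term_id` would come from `pos_dictionary.token2id`, using the token as it is stored after preprocessing. A count of 0 for the early slices would confirm that the term's presence in those slices' topics comes from the model rather than from misdated documents.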

Versions

Windows-10-10.0.22631-SP0
Python 3.8.10 (tags/v3.8.10:3d8993a, May 3 2021, 11:48:03) [MSC v.1928 64 bit (AMD64)]
Bits 64
NumPy 1.24.2
SciPy 1.8.1
gensim 4.3.2
FAST_VERSION 1