WZBSocialScienceCenter / tmtoolkit

Text Mining and Topic Modeling Toolkit for Python with parallel processing power

Home Page: https://tmtoolkit.readthedocs.io/

Corpus.from_tabular adds filenames as a prefix to doc_labels

fatihbozdag opened this issue

Greetings,
The Corpus.from_tabular function adds the source file name as a prefix to each doc_label identifier. Is there a way to prevent this?

from tmtoolkit.corpus import Corpus

# Document IDs come from 'docid_field', texts from 'text_field'
corpus = Corpus.from_tabular('/metadata_with_text.csv', id_column='docid_field', text_column='text_field')
x = corpus.doc_labels
x[0:10]

['metadata_with_text-BGSU1001',
 'metadata_with_text-BGSU1002',
 'metadata_with_text-BGSU1003',
 'metadata_with_text-BGSU1004',
 'metadata_with_text-BGSU1005',
 'metadata_with_text-BGSU1006',
 'metadata_with_text-BGSU1007',
 'metadata_with_text-BGSU1008',
 'metadata_with_text-BGSU1009',
 'metadata_with_text-BGSU1010']

Passing prefix=None, as one would to pd.read_csv, did not work.

Yes, you can change the default document label format by passing the parameter doc_label_fmt='{id}'. See https://tmtoolkit.readthedocs.io/en/latest/api.html#tmtoolkit.corpus.Corpus.add_tabular
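
As a rough illustration of what the format string does, here is a minimal sketch in plain Python. It assumes the label template is filled in with the fields `{basename}` and `{id}` and that the default is `'{basename}-{id}'`, which is consistent with the labels shown in the question but should be checked against the linked documentation. The helper `make_doc_label` is hypothetical and not part of tmtoolkit.

```python
import os

# Hypothetical helper, not part of tmtoolkit: it only mimics how a label
# template with the assumed fields {basename} and {id} would be filled in.
def make_doc_label(doc_label_fmt, path, doc_id):
    # Strip the directory and the file extension to get the base name.
    basename, _ext = os.path.splitext(os.path.basename(path))
    return doc_label_fmt.format(basename=basename, id=doc_id)

path = '/metadata_with_text.csv'

# Assumed default template: reproduces the prefixed labels from the question.
prefixed = make_doc_label('{basename}-{id}', path, 'BGSU1001')

# Template suggested above: only the value from the ID column remains.
plain = make_doc_label('{id}', path, 'BGSU1001')

print(prefixed)
print(plain)
```

With tmtoolkit itself, the call from the question would keep its arguments and additionally pass `doc_label_fmt='{id}'` to `Corpus.from_tabular`.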

Great! I do not know how I missed that parameter.

Meanwhile, another question: is it possible to attach custom metadata to the docs within a corpus object, such as speaker, speaker gender, native language, etc., which are provided in the CSV file along with the text? Honestly, I did not really understand how to use the preproc.add_metadata_per_doc function.

This issue is stale because it has been open for 30 days with no activity.

This issue was closed because it has been inactive for 14 days since being marked as stale.