Corpus.from_tabular adds filenames as a prefix to doc_labels

Question

fatihbozdag opened this issue 3 years ago

Greetings,
The Corpus.from_tabular function adds the source file name as a prefix to each doc_label identifier. Is there a way to prevent this?

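from tmtoolkit.corpus import Corpus

# Load documents from a CSV file; one column holds the IDs, another the texts
corpus = Corpus.from_tabular('/metadata_with_text.csv', id_column='docid_field', text_column='text_field')
x = corpus.doc_labels
x[0:10]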

['metadata_with_text-BGSU1001',
 'metadata_with_text-BGSU1002',
 'metadata_with_text-BGSU1003',
 'metadata_with_text-BGSU1004',
 'metadata_with_text-BGSU1005',
 'metadata_with_text-BGSU1006',
 'metadata_with_text-BGSU1007',
 'metadata_with_text-BGSU1008',
 'metadata_with_text-BGSU1009',
 'metadata_with_text-BGSU1010']

Passing prefix=None, as one would to pd.read_csv, did not work.

Markus Konrad · Answer 1 · Mon Jul 05 2021 19:45:58 GMT+0800 (China Standard Time)

Yes, you can change the default document label format by passing the parameter doc_label_fmt='{id}'. See https://tmtoolkit.readthedocs.io/en/latest/api.html#tmtoolkit.corpus.Corpus.add_tabular
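
To illustrate why the prefix shows up, here is a minimal standalone sketch of how a label format string like this could be applied. It does not use tmtoolkit itself: make_doc_label is a hypothetical helper, and the default format '{basename}-{id}' is an assumption inferred from the labels shown in the question.

```python
import os

# Assumed default label format; it matches the labels shown in the question.
DEFAULT_FMT = '{basename}-{id}'


def make_doc_label(path, doc_id, doc_label_fmt=DEFAULT_FMT):
    """Hypothetical helper: build a document label from a file path and a row ID."""
    # File name without directory and extension, e.g. 'metadata_with_text'
    basename = os.path.splitext(os.path.basename(path))[0]
    # Fill the placeholders; a format without {basename} simply ignores it
    return doc_label_fmt.format(basename=basename, id=doc_id)


# Default format: the file name is prepended to the ID
print(make_doc_label('/metadata_with_text.csv', 'BGSU1001'))

# Format with only the ID placeholder: no prefix
print(make_doc_label('/metadata_with_text.csv', 'BGSU1001', doc_label_fmt='{id}'))
```

Applied to the call from the question, this would presumably be Corpus.from_tabular('/metadata_with_text.csv', id_column='docid_field', text_column='text_field', doc_label_fmt='{id}').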

fatihbozdag · Answer 2 · Mon Jul 05 2021 19:59:20 GMT+0800 (China Standard Time)

Great! I do not know how I missed that parameter.

Meanwhile, another question: is it possible to attach custom metadata to the documents within a Corpus object, such as the speaker, speaker gender, native language, etc., which are provided in the CSV file along with the text? Honestly, I did not really understand how to use the preproc.add_metadata_per_doc function.

github-actions · Answer 3 · Wed Feb 09 2022 11:30:09 GMT+0800 (China Standard Time)

This issue is stale because it has been open for 30 days with no activity.

github-actions · Answer 4 · Wed Feb 23 2022 11:31:24 GMT+0800 (China Standard Time)

This issue was closed because it has been inactive for 14 days since being marked as stale.