Corpus.from_tabular adds filenames as prefix to doc_labels.
fatihbozdag opened this issue · comments
Greetings,
The Corpus.from_tabular function adds the source file name as a prefix to each doc_label identifier. Is there a way to prevent this?
corpus = Corpus.from_tabular('/metadata_with_text.csv', id_column = 'docid_field', text_column = 'text_field')
x = corpus.doc_labels
x[0:10]
['metadata_with_text-BGSU1001',
'metadata_with_text-BGSU1002',
'metadata_with_text-BGSU1003',
'metadata_with_text-BGSU1004',
'metadata_with_text-BGSU1005',
'metadata_with_text-BGSU1006',
'metadata_with_text-BGSU1007',
'metadata_with_text-BGSU1008',
'metadata_with_text-BGSU1009',
'metadata_with_text-BGSU1010']
Passing prefix = None, as one would to pd.read_csv, did not work.
Yes, you can change the default document label format by passing the parameter doc_label_fmt='{id}'. See https://tmtoolkit.readthedocs.io/en/latest/api.html#tmtoolkit.corpus.Corpus.add_tabular
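To illustrate what the format string does, here is a minimal standalone sketch. It does not call tmtoolkit; make_doc_label is a hypothetical helper, and the default template '{basename}-{id}' is an assumption inferred from the labels shown above, so check the linked API docs for the actual default.

```python
import os

# Hypothetical helper that mimics how a label template could be filled in.
# This is NOT tmtoolkit's implementation; the default template is an
# assumption inferred from labels such as 'metadata_with_text-BGSU1001'.
def make_doc_label(path, doc_id, doc_label_fmt="{basename}-{id}"):
    # File name without directory and extension, e.g. 'metadata_with_text'
    basename = os.path.splitext(os.path.basename(path))[0]
    # Unused keyword arguments are ignored by str.format, so '{id}' alone works
    return doc_label_fmt.format(basename=basename, id=doc_id)

default_label = make_doc_label("/metadata_with_text.csv", "BGSU1001")
id_only_label = make_doc_label("/metadata_with_text.csv", "BGSU1001",
                               doc_label_fmt="{id}")

print(default_label)   # label with the file name prefix
print(id_only_label)   # label with the document ID only
```

With the real library, the corresponding change would be to add doc_label_fmt='{id}' to the Corpus.from_tabular call from the original post.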
Great! I do not know how I missed that parameter.
Meanwhile, another question: is it possible to attach custom metadata to the docs within the corpus object, such as the speaker, speaker gender, native language, etc., which are provided in the CSV file along with the text? Honestly, I did not really understand how to use the preproc.add_metadata_per_doc function.
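One library-independent way to keep such metadata alongside the corpus is a plain lookup table keyed by document label. The sketch below uses only the standard library; the CSV content and the speaker_gender and native_language column names are made up for illustration, and it assumes the doc labels equal the values in docid_field (i.e. doc_label_fmt='{id}'). It does not use or describe preproc.add_metadata_per_doc.

```python
import csv
import io

# Hypothetical CSV content standing in for metadata_with_text.csv.
# Only docid_field and text_field appear in this thread; the other
# column names and all values are invented for illustration.
csv_text = """docid_field,text_field,speaker_gender,native_language
BGSU1001,First text.,F,Turkish
BGSU1002,Second text.,M,Spanish
"""

doc_metadata = {}
for row in csv.DictReader(io.StringIO(csv_text)):
    row = dict(row)
    doc_id = row.pop("docid_field")   # key the table by document ID
    row.pop("text_field")             # the text itself lives in the corpus
    doc_metadata[doc_id] = row        # remaining columns are the metadata

print(doc_metadata["BGSU1001"])
```

The metadata for any document can then be looked up by its label, for example doc_metadata[corpus.doc_labels[0]], provided the labels carry no file name prefix.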
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.