MIND-Lab / OCTIS

OCTIS: Comparing Topic Models is Simple! A Python package to optimize and evaluate topic models (accepted at the EACL 2021 demo track)

vocabulary: a .txt for custom dataset

SaraAmd opened this issue

How do I generate the vocabulary file from our CSV/TSV dataset?

Hi, you can load the TSV file, split each document into words on whitespace, and keep only the unique words. Like this:

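import pandas as pd

# dataset_path: folder that contains corpus.tsv (set this to your own path)
df = pd.read_csv(dataset_path + "/corpus.tsv", sep='\t', header=None)
vocabulary = set()
# The first column holds the document text; split on whitespace to get words
for document in df[0].tolist():
    for word in document.split():
        vocabulary.add(word)
# Write one unique word per line
with open(dataset_path + "/vocabulary.txt", 'w') as fw:
    for word in vocabulary:
        fw.write(word + "\n")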

Best,

Silvia
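
For reuse across datasets, the same idea can be wrapped in a small function. This is only a sketch, not part of OCTIS: the name `build_vocabulary` and its parameters (`sep`, `text_column`) are hypothetical, and it assumes the file has no header row and that the text lives in a single column, as in the snippet above. It also sorts the words so the output file is the same on every run.

```python
from pathlib import Path

import pandas as pd


def build_vocabulary(corpus_file, vocabulary_file, sep="\t", text_column=0):
    """Write the unique whitespace-separated words of one column, one per line.

    Hypothetical helper (not an OCTIS function). Assumes no header row.
    """
    df = pd.read_csv(corpus_file, sep=sep, header=None)
    vocabulary = set()
    # dropna/astype(str) guard against empty or numeric-looking cells
    for document in df[text_column].dropna().astype(str):
        vocabulary.update(document.split())
    # Sort so that the output file is reproducible between runs
    words = sorted(vocabulary)
    Path(vocabulary_file).write_text(
        "".join(word + "\n" for word in words), encoding="utf-8"
    )
    return words
```

For a comma-separated file, passing `sep=","` should be enough, for example `build_vocabulary("corpus.csv", "vocabulary.txt", sep=",")`. The words are taken exactly as they appear in the file, so any lowercasing or punctuation removal would need to be applied to the corpus first.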