Xirider / finetune-gpt2xl

Guide: Finetune GPT2-XL (1.5 Billion Parameters) and finetune GPT-NEO (2.7 B) on a single GPU with Huggingface Transformers using DeepSpeed

Multiple entries csv

kikirizki opened this issue · comments

Hi, I come from Upwork. Is this what you are looking for: splitting the dataset into a multi-row CSV?


```python
import csv

start_token = "|<start of text>|"
end_token = "|<end of text>|"

# Strip the start tokens, split on the end tokens, and drop the
# trailing empty chunk left after the final end token
with open('train.txt', encoding='utf-8') as txtfile:
    all_text = txtfile.read().replace(start_token, "").split(end_token)
    all_text = all_text[:-1]
with open('train.csv', mode='w', encoding='utf-8') as csv_file:
    fieldnames = ['text']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()
    for row in all_text:
        writer.writerow({'text': row})

with open('validation.txt', encoding='utf-8') as txtfile:
    all_text = txtfile.read().replace(start_token, "").split(end_token)
    all_text = all_text[:-1]
with open('validation.csv', mode='w', encoding='utf-8') as csv_file:
    fieldnames = ['text']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()
    for row in all_text:
        writer.writerow({'text': row})

print("created train.csv and validation.csv files")
```
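For a quick sanity check, the same splitting logic can be exercised on a small inline sample instead of `train.txt` (the sample text and the in-memory buffer here are illustrative, not part of the repo):

```python
import csv
import io

start_token = "|<start of text>|"
end_token = "|<end of text>|"

# Tiny inline sample standing in for the contents of train.txt
sample = (
    "|<start of text>|first document|<end of text>|"
    "|<start of text>|second document|<end of text>|"
)

# Same transformation as above: strip start tokens, split on end tokens,
# and drop the trailing empty chunk after the final end token
rows = sample.replace(start_token, "").split(end_token)[:-1]

# Write to an in-memory CSV instead of a file on disk
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["text"])
writer.writeheader()
for row in rows:
    writer.writerow({"text": row})

print(buf.getvalue())
```

Each document between a start and end token pair becomes one row under the single `text` column, which is the layout the training script expects.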

Yes, that looks correct, if you want the model to treat each segment (which you delimited with the start and end tokens) in the original text file as a separate document. That way the model will generate text similar to your examples, from the start token to the end token.