What's the training data formatting?
CoderReece opened this issue · comments
I want to use training data from a different platform, but I'm unsure what format the training data should be in.
Is it JSON, or something else?
If so, can I get an example?
Thanks.
Training data for Word2Vec or for Seq2Seq? If you're asking about Seq2Seq, take a look at Seq2SeqXTrain.npy and Seq2SeqYTrain.npy. They are numpy matrices with dimensions A x B, where A is the number of training examples and B is the sequence length.
You can see above that the first training input (x[0]) consists of the integers that represent the words (52780 for blank, 34931 for "the", etc) for the input message. y[0] will contain the sequence of words that is the response.
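If you want to check the shapes yourself, here is a minimal sketch for loading the two files and confirming they line up. The helper name `load_pair` is made up for illustration and is not part of the repo; the only things taken from this thread are the file names and the A x B layout.

```python
import numpy as np


def load_pair(x_path, y_path):
    """Load the input and response matrices and check that they line up.

    Hypothetical helper, not part of the repo. It only assumes what the
    thread states: both files hold 2-D integer matrices with one row per
    training example.
    """
    x = np.load(x_path)
    y = np.load(y_path)
    if x.ndim != 2 or y.ndim != 2:
        raise ValueError("expected 2-D matrices (examples x sequence length)")
    if x.shape[0] != y.shape[0]:
        raise ValueError("x and y must have the same number of training examples")
    return x, y
```

Calling `load_pair("Seq2SeqXTrain.npy", "Seq2SeqYTrain.npy")` and printing `x.shape`, `x[0]` and `y[0]` should show the number of examples, the sequence length, and the word IDs of the first message and its response.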
The Facebook chat data can't be converted because of big changes to the HTML layout, so I was curious how the Facebook data was parsed and stored. That way I can build the file manually and have createDataset.py process that instead.
Thanks for some insight on how Seq2Seq works though.
Oh, so you're referring to the HTML file that you get after downloading your data?
If so, you should probably check Dillon's repo.
Yeah the repo isn't maintained anymore:
UPDATE April 28th 2018: Facebook recently revamped the "download your data" feature to a much more usable state. This was probably in compliance with GDPR laws by the European Union, which will be enforced starting in May 2018. Facebook now allows you to download your message data in JSON format, which supersedes the purpose of this project.
In light of that this repository will no longer be maintained.
So I can't use my message data without knowing how to format it for createDataset.py.
Sorry I've not done a good job at explaining myself, I don't mean to take up your time.
Okay, Thanks for the assistance.
So have I got that right? I just want to be sure before I go off and try this.
From the data you've shown me, I assume the 08:00 doesn't change and probably isn't important?
[dateTtime-08:00] username: message
Yeah we end up ignoring everything in the brackets.
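For anyone building the file by hand, here is a rough sketch of reading one line in that shape and throwing away the bracketed part. This is not the code createDataset.py actually uses; `LINE_RE` and `parse_line` are hypothetical names, and the pattern assumes the username itself contains no colon.

```python
import re

# Assumed line shape, taken from this thread: [<anything>] username: message
# Everything inside the brackets is matched but not captured, so it is ignored.
LINE_RE = re.compile(r"^\[[^\]]*\]\s*([^:]+):\s*(.*)$")


def parse_line(line):
    """Return (username, message), or None if the line does not fit the shape."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    username, message = match.groups()
    return username.strip(), message
```

Only the first colon after the closing bracket separates the username from the message, so colons inside the message text are kept.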
Should the oldest messages be at the top or at the bottom of the file?
I think it depends on how Facebook organizes the data when you download it. From what I remember, it wasn't necessarily chronological.