adeshpande3 / Facebook-Messenger-Bot

Facebook chatbot that I trained to talk like me using Seq2Seq

What's the training data formatting?

CoderReece opened this issue · comments

I want to use training data from a different platform, but I'm unsure what format the training data should be in.

Is it JSON, or something else?
If so, can I get an example?

Thanks.

Training data for Word2Vec or for Seq2Seq? If you're asking about Seq2Seq, take a look at Seq2SeqXTrain.npy and Seq2SeqYTrain.npy. They are NumPy matrices with dimensions A x B, where A is the number of training examples and B is the sequence length.

image

You can see above that the first training input (x[0]) consists of the integers that represent the words (52780 for blank, 34931 for "the", etc) for the input message. y[0] will contain the sequence of words that is the response.
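To make the A x B layout concrete, here is a minimal sketch of how message/response pairs could be turned into integer matrices of that shape. The toy vocabulary, `PAD_ID`, `MAX_LEN`, and the `encode` helper are all made up for illustration; the project's own createDataset.py may pad differently or add extra tokens.

```python
import numpy as np

# Hypothetical toy vocabulary; the real project builds its own word list.
vocab = ["hi", "how", "are", "you", "good", "thanks"]
word_to_id = {word: i for i, word in enumerate(vocab)}

PAD_ID = len(vocab)  # assumption: one extra index stands for "blank"
MAX_LEN = 5          # assumption: the sequence length B


def encode(sentence, max_len=MAX_LEN):
    """Map each word to its integer ID, then pad with PAD_ID up to max_len."""
    ids = [word_to_id[word] for word in sentence.split()][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


# Each pair is (input message, response).
pairs = [("hi how are you", "good thanks"), ("hi", "hi")]
x_train = np.array([encode(msg) for msg, _ in pairs], dtype=np.int32)
y_train = np.array([encode(resp) for _, resp in pairs], dtype=np.int32)

print(x_train.shape)  # (2, 5): A = 2 examples, B = 5 tokens each
print(x_train[0])     # integer IDs for the first input message
print(y_train[0])     # integer IDs for its response
```

Arrays like these can be written with `np.save` and read back with `np.load`, which is how `.npy` files are normally handled.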

The Facebook chat data can't be converted because of big changes to the HTML layout, so I was curious how the Facebook data was parsed and stored. That way I could make a file manually and have createDataset.py process that instead.

Thanks for some insight on how Seq2Seq works though.

Oh, so you're referring to the HTML file that you get after downloading your data?

Like the below step?
image

If so, then you should probably check Dillon's repo.

Yeah, the repo isn't maintained anymore:

UPDATE April 28th 2018: Facebook recently revamped the "download your data" feature to a much more usable state. This was probably in compliance with GDPR laws by the European Union, which will be enforced starting in May 2018. Facebook now allows you to download your message data in JSON format, which supersedes the purpose of this project.

In light of that, this repository will no longer be maintained.

So I can't use my message data without knowing how to format it for createDataset.py.
Sorry, I've not done a good job of explaining myself. I don't mean to take up your time.

It's cool dw, I think the main thing you have to do is take that JSON and change it to a TXT file with the following format.

image

Okay, thanks for the assistance.
So have I got this right? I just want to be sure before I go off and try this.

From the data you've shown me, I assume the 08:00 doesn't change and probably isn't important?

[dateTtime-08:00] username: message

Yeah we end up ignoring everything in the brackets.
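For anyone checking their own TXT file, here is a rough sketch of reading a line in that shape while discarding the bracketed part. The regex and the `parse_line` helper are hypothetical and only follow the `[dateTtime-08:00] username: message` pattern quoted in this thread; this is not the parsing code from createDataset.py.

```python
import re

# Assumed line shape, taken from the thread: "[dateTtime-08:00] username: message"
LINE_RE = re.compile(r"^\[[^\]]*\]\s*([^:]+):\s*(.*)$")


def parse_line(line):
    """Return (username, message), ignoring the bracketed timestamp.

    Returns None when the line does not match the assumed shape.
    """
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return match.group(1).strip(), match.group(2)


sample = "[2018-04-28T10:15:00-08:00] CoderReece: is it a json format?"
print(parse_line(sample))  # ('CoderReece', 'is it a json format?')
```

Colons inside the message itself are kept, because only the first colon after the bracket is treated as the separator.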

Should the oldest messages be at the top or the bottom of the file?

I think it depends on how Facebook organizes the data when you download it. From what I remember, it wasn't necessarily chronological.
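If you'd rather not rely on the download order, one option is to sort the messages by timestamp yourself before writing the TXT file. The sketch below assumes each message is a dict with `sender_name`, `timestamp_ms`, and `content` keys, which is how the JSON export appeared to be laid out; check your own file, since those key names are an assumption here. The `to_lines` helper and the fixed -08:00 offset are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def to_lines(messages, utc_offset_hours=-8):
    """Format messages as "[timestamp] username: message", oldest first.

    Assumes each message dict has "sender_name", "timestamp_ms" and,
    for text messages, "content". Entries without text are skipped.
    """
    tz = timezone(timedelta(hours=utc_offset_hours))
    lines = []
    for msg in sorted(messages, key=lambda m: m["timestamp_ms"]):
        if "content" not in msg:
            continue  # e.g. a photo or sticker with no text
        stamp = datetime.fromtimestamp(msg["timestamp_ms"] / 1000, tz).isoformat()
        lines.append(f"[{stamp}] {msg['sender_name']}: {msg['content']}")
    return lines


# Toy data in deliberately non-chronological order.
messages = [
    {"sender_name": "bob", "timestamp_ms": 1524900060000, "content": "second"},
    {"sender_name": "alice", "timestamp_ms": 1524900000000, "content": "first"},
    {"sender_name": "bob", "timestamp_ms": 1524900030000},
]
for line in to_lines(messages):
    print(line)
```

Since everything in the brackets is ignored anyway, the exact timestamp text shouldn't matter much; it is only there to keep the line shape the same.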