castorini / MP-CNN-Torch

Multi-Perspective Convolutional Neural Networks for modeling textual similarity (He et al., EMNLP 2015)

Additional Directions

SciutoAlex opened this issue

Would you be willing to include some additional directions or information for using the project with other datasets? I have a large set of sentences I would like to find similarity scores for. I'm not sure if this project is appropriate or not. Seems like it should be. Thanks!

+1
I've gotten about halfway, but beyond that things get very vague indeed.

  1. Install Torch (already had it)
  2. Clone the repo from Git
  3. Run `sh fetch_and_preprocess.sh` to download the Stanford GloVe word embeddings
  4. ??? How do I train using my own sentences? Is there a parameter to pass in to the lua script to specify which data to use? And what are the requirements for formatting this text?

@SciutoAlex @lukemunn You need to generate a new dataset based on your own sentences. The format should follow our sample dataset, e.g., `data/msrvid/train`. The folder is a bit messy; I'd suggest that Hua clean it up. But basically, you only need four files in each train/dev/test subfolder (see the sketch after this list for a concrete example):

  1. `a.toks`: one sentence per line
  2. `b.toks`: one sentence per line
  3. `id.txt`: one pair ID per line, i.e., the line index of the corresponding sentence pair
  4. `sim.txt`: the similarity score of sentences a and b, e.g., a binary score (0 for irrelevant, 1 for relevant) or a score in any other range
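
To make the format concrete, here is a minimal, hypothetical Python snippet that writes a toy dataset in this four-file layout. The paths, sentences, and labels are purely illustrative, not part of the repo:

```python
# Hypothetical helper: writes a toy dataset in the four-file layout
# described above. All paths, sentences, and labels are illustrative.
import os

pairs = [
    ("a man is playing a guitar", "a person plays the guitar", 1),
    ("a dog runs in the park", "a cat sleeps on the couch", 0),
]

out_dir = "data/mydata/train"  # mirror this layout for dev/ and test/
os.makedirs(out_dir, exist_ok=True)

with open(os.path.join(out_dir, "a.toks"), "w") as fa, \
     open(os.path.join(out_dir, "b.toks"), "w") as fb, \
     open(os.path.join(out_dir, "id.txt"), "w") as fid, \
     open(os.path.join(out_dir, "sim.txt"), "w") as fsim:
    for i, (a, b, score) in enumerate(pairs, start=1):
        fa.write(a + "\n")             # one tokenized sentence per line
        fb.write(b + "\n")             # its paired sentence, same line number
        fid.write(str(i) + "\n")       # index of the corresponding pair
        fsim.write(str(score) + "\n")  # ground-truth similarity label
```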

After doing that, the dataset is ready. The remaining step is to generate the vocabulary using the following script (change line 16 to point to your own dataset):
https://github.com/Jeffyrao/pairwise-neural-network/blob/master/scripts/build_vocab.py
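
In case that link is hard to follow, here is a rough sketch of what such a vocabulary-building step amounts to. This is not the actual `build_vocab.py`; the directory layout, output filename, and whitespace tokenization are assumptions:

```python
# Sketch of a vocabulary builder: collect every token appearing in
# a.toks/b.toks across the splits and write one token per line.
# This only approximates build_vocab.py; the real script may differ.
base = "data/mydata"  # analogous to changing line 16 in build_vocab.py

vocab = set()
for split in ("train", "dev", "test"):
    for name in ("a.toks", "b.toks"):
        with open(f"{base}/{split}/{name}") as f:
            for line in f:
                vocab.update(line.split())  # assumes whitespace-tokenized text

with open(f"{base}/vocab.txt", "w") as out:
    out.write("\n".join(sorted(vocab)) + "\n")
```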

Then you should be ready to go. :)

As complementary information, here is a detailed walkthrough of how to adapt the code to run on your own dataset:
hohoCode#1

Thanks for the assistance, Jeffy. The library is on my other machine, so I'll take a look tonight.

From glancing at your response (admittedly without the code in front of me), I'm still confused by step 4 (`sim.txt`), which asks for a similarity score.

Isn't providing a similarity score the whole point of the algorithm/library? Why would that be part of the initial dataset? Or is this an empty value which gets populated?

@lukemunn The similarity score is the ground truth for a sentence pair. It's created by the data owner; for example, you can label each pair as 0 or 1 (binary classification), or use a 5-star scale, in which case any score in the range [0, 5] is valid.

The model generates a prediction score for each sentence pair and is trained so that its predictions match the ground-truth labels as closely as possible.
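
As a toy illustration of that idea (all numbers made up, and using mean squared error purely for demonstration; the model's actual training loss may differ):

```python
# Toy illustration: compare model predictions against the ground-truth
# labels from sim.txt. All numbers are made up.
gold = [0.0, 1.0, 1.0]  # ground-truth labels (one per sentence pair)
pred = [0.2, 0.9, 0.7]  # model's predicted similarity scores

mse = sum((g - p) ** 2 for g, p in zip(gold, pred)) / len(gold)
print(f"mean squared error: {mse:.3f}")  # lower means predictions match better
```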