castorini / MP-CNN-Torch

Multi-Perspective Convolutional Neural Networks for modeling textual similarity (He et al., EMNLP 2015)

Additional Directions

SciutoAlex opened this issue

Would you be willing to include some additional directions or information for using the project with other datasets? I have a large set of sentences I would like to find similarity scores for. I'm not sure if this project is appropriate or not. Seems like it should be. Thanks!

+1
I've gotten about halfway, but beyond that things get very vague indeed.

  1. Install Torch (already had it)
  2. Clone the repo from Git
  3. Run `sh fetch_and_preprocess.sh` to download the Stanford GloVe word embeddings
  4. ??? How do I train using my own sentences? Is there a parameter to pass in to the lua script to specify which data to use? And what are the requirements for formatting this text?

@SciutoAlex @lukemunn You need to generate a new dataset based on your own sentences. The format should follow our sample dataset, e.g., `data/msrvid/train`. The folder is a bit messy; I'd suggest that Hua clean it up. But basically, you only need four files in each train/dev/test subfolder (see the sketch after this list for a concrete example):

  1. `a.toks`: one sentence per line
  2. `b.toks`: one sentence per line
  3. `id.txt`: one pair ID per line, i.e., the line index of the corresponding sentence pair
  4. `sim.txt`: the similarity score of sentences a and b, e.g., a binary score (0 for irrelevant, 1 for relevant) or a score in any other range
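
To make the format concrete, here is a minimal, hypothetical Python snippet that writes a toy dataset in this four-file layout. The paths, sentences, and labels are purely illustrative, not part of the repo:

```python
# Hypothetical helper: writes a toy dataset in the four-file layout
# described above. All paths, sentences, and labels are illustrative.
import os

pairs = [
    ("a man is playing a guitar", "a person plays the guitar", 1),
    ("a dog runs in the park", "a cat sleeps on the couch", 0),
]

out_dir = "data/mydata/train"  # mirror this layout for dev/ and test/
os.makedirs(out_dir, exist_ok=True)

with open(os.path.join(out_dir, "a.toks"), "w") as fa, \
     open(os.path.join(out_dir, "b.toks"), "w") as fb, \
     open(os.path.join(out_dir, "id.txt"), "w") as fid, \
     open(os.path.join(out_dir, "sim.txt"), "w") as fsim:
    for i, (a, b, score) in enumerate(pairs, start=1):
        fa.write(a + "\n")             # one tokenized sentence per line
        fb.write(b + "\n")             # its paired sentence, same line number
        fid.write(str(i) + "\n")       # index of the corresponding pair
        fsim.write(str(score) + "\n")  # ground-truth similarity label
```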

After doing that, the dataset is ready. The remaining step is to generate the vocabulary using the following script (change line 16 to point to your own dataset):
https://github.com/Jeffyrao/pairwise-neural-network/blob/master/scripts/build_vocab.py
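
In case that link is hard to follow, here is a rough sketch of what such a vocabulary-building step amounts to. This is not the actual `build_vocab.py`; the directory layout, output filename, and whitespace tokenization are assumptions:

```python
# Sketch of a vocabulary builder: collect every token appearing in
# a.toks/b.toks across the splits and write one token per line.
# This only approximates build_vocab.py; the real script may differ.
base = "data/mydata"  # analogous to changing line 16 in build_vocab.py

vocab = set()
for split in ("train", "dev", "test"):
    for name in ("a.toks", "b.toks"):
        with open(f"{base}/{split}/{name}") as f:
            for line in f:
                vocab.update(line.split())  # assumes whitespace-tokenized text

with open(f"{base}/vocab.txt", "w") as out:
    out.write("\n".join(sorted(vocab)) + "\n")
```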

Then you should be ready to go. :)

As complementary information, here is a detailed walkthrough of how to adapt the code to run on your own dataset:
hohoCode#1

Thanks for the assistance, Jeffy. The library is on my other machine, so I'll take a look tonight.

From glancing at your response (admittedly without the code in front of me), I'm still confused by step 4 (`sim.txt`), which asks for a similarity score.

Isn't providing a similarity score the whole point of the algorithm/library? Why would that be part of the initial dataset? Or is this an empty value which gets populated?

@lukemunn The similarity score is the ground truth for a sentence pair. It's created by the data owner; for example, you can label each pair as 0 or 1 (binary classification), or use a 5-star scale, in which case any score in the range [0, 5] is valid.

The model generates a prediction score for each sentence pair and is trained so that its predictions match the ground-truth labels as closely as possible.
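
As a toy illustration of that idea (all numbers made up, and using mean squared error purely for demonstration; the model's actual training loss may differ):

```python
# Toy illustration: compare model predictions against the ground-truth
# labels from sim.txt. All numbers are made up.
gold = [0.0, 1.0, 1.0]  # ground-truth labels (one per sentence pair)
pred = [0.2, 0.9, 0.7]  # model's predicted similarity scores

mse = sum((g - p) ** 2 for g, p in zip(gold, pred)) / len(gold)
print(f"mean squared error: {mse:.3f}")  # lower means predictions match better
```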