eumesy / data-sts

data-sts

Format

sentence pairs

  • STS.input.foo.txt (original)

    • TSV file containing sentence pairs
    • each line is of the form: sent1[TAB]sent2
  • STS.input.foo.words.en_ewt.joblib (tokenized by StanfordNLP)

    • joblib formatted file

    • a list of tuples, one per sentence pair, each holding the two sentences as word lists

      [
          (['This', 'is', 'first', 'sentence', '.'], ['This', 'is', 'second', 'sentence', '.']),
          (['foo', 'bar', ...], ['hoge', 'fuga', ...]),
          ...
      ]
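Both input formats can be read in a few lines of Python. The following is a minimal sketch, not code from this repository: the helper names `read_sentence_pairs` and `load_word_pairs` are hypothetical, and `load_word_pairs` assumes the third-party `joblib` package is installed.

```python
from typing import List, Tuple


def read_sentence_pairs(path: str) -> List[Tuple[str, str]]:
    """Read a sent1[TAB]sent2 file into a list of (sent1, sent2) tuples."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Split on the first TAB only; a line without a TAB raises ValueError.
            sent1, sent2 = line.rstrip("\n").split("\t", 1)
            pairs.append((sent1, sent2))
    return pairs


def load_word_pairs(path: str):
    """Load a *.words.en_ewt.joblib file (assumes joblib is installed)."""
    import joblib  # imported lazily so the TSV reader works without it

    return joblib.load(path)
```

Every line is kept, including any unusual ones, so that line N of the input file still corresponds to line N of the gold score file.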

gold scores

  • STS.gs.foo.txt
    • each line holds the gold score (sentence similarity) for the pair on the same line of STS.input.foo.txt
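Because the two files are aligned line by line, scores and sentence pairs can be zipped together. A minimal sketch under two assumptions: every line of the gold file contains a number, and the function names `read_gold_scores` and `read_scored_pairs` are hypothetical rather than part of this repository.

```python
from typing import List, Tuple


def read_gold_scores(path: str) -> List[float]:
    """Read one similarity score per line."""
    with open(path, encoding="utf-8") as f:
        return [float(line) for line in f]


def read_scored_pairs(input_path: str, gs_path: str) -> List[Tuple[str, str, float]]:
    """Return (sent1, sent2, score) triples from two line-aligned files."""
    scores = read_gold_scores(gs_path)
    with open(input_path, encoding="utf-8") as f:
        pairs = [tuple(line.rstrip("\n").split("\t", 1)) for line in f]
    if len(pairs) != len(scores):
        raise ValueError(
            f"{len(pairs)} sentence pairs but {len(scores)} gold scores"
        )
    return [(s1, s2, score) for (s1, s2), score in zip(pairs, scores)]
```

The explicit length check is there because `zip` alone would silently truncate to the shorter file.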

List of Datasets

STS-B

data/stsbenchmark

Web

http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark

Stats

http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark

Log of Data Preparation (for reproducibility)

wget http://ixa2.si.ehu.es/stswiki/images/4/48/Stsbenchmark.tar.gz -P data
tar xvzf data/Stsbenchmark.tar.gz -C data

cd data/stsbenchmark
cut -f5 sts-test.csv > STS.gs.sts-test.txt
cut -f 6-7 sts-test.csv > STS.input.sts-test.txt
cut -f5 sts-dev.csv > STS.gs.sts-dev.txt
cut -f 6-7 sts-dev.csv > STS.input.sts-dev.txt
cut -f5 sts-train.csv > STS.gs.sts-train.txt
cut -f 6-7 sts-train.csv > STS.input.sts-train.txt
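For environments without `cut`, the same column extraction can be sketched in Python. This is an illustrative equivalent, not the procedure actually used to build the data: the function name `split_sts_file` is hypothetical, and it assumes every row is TAB-separated (unlike `cut`, it does not pass through lines that contain no delimiter at all).

```python
def split_sts_file(csv_path: str, gs_path: str, input_path: str) -> None:
    """Write field 5 to gs_path and fields 6-7 to input_path, line by line."""
    with open(csv_path, encoding="utf-8") as src, \
            open(gs_path, "w", encoding="utf-8") as gs, \
            open(input_path, "w", encoding="utf-8") as inp:
        for line in src:
            fields = line.rstrip("\n").split("\t")
            # cut -f5: the gold score (empty if the row is too short)
            gs.write("\t".join(fields[4:5]) + "\n")
            # cut -f 6-7: the sentence pair; any extra trailing fields are dropped
            inp.write("\t".join(fields[5:7]) + "\n")
```

For example, `split_sts_file("sts-dev.csv", "STS.gs.sts-dev.txt", "STS.input.sts-dev.txt")` mirrors the two `cut` commands for the dev split.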

Tokenize with StanfordNLP ("en_ewt")

scripts/tokenize_STSinput_by_stanfordnlp.py data/stsbenchmark/STS.input.sts-train{.txt,.words.en_ewt.joblib}
scripts/tokenize_STSinput_by_stanfordnlp.py data/stsbenchmark/STS.input.sts-dev{.txt,.words.en_ewt.joblib}
scripts/tokenize_STSinput_by_stanfordnlp.py data/stsbenchmark/STS.input.sts-test{.txt,.words.en_ewt.joblib}
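In bash, `STS.input.sts-train{.txt,.words.en_ewt.joblib}` expands to two arguments: the input TSV path followed by the output joblib path. The contents of the script itself are not shown here, so the following is only a sketch of what such a tokenization step might look like. The names `tokenize_pairs` and `make_stanfordnlp_tokenizer` are hypothetical, and the second function assumes `stanfordnlp` and its English models are installed.

```python
from typing import Callable, Iterable, List, Tuple

Tokenizer = Callable[[str], List[str]]


def tokenize_pairs(
    pairs: Iterable[Tuple[str, str]], tokenize: Tokenizer
) -> List[Tuple[List[str], List[str]]]:
    """Apply a tokenizer to both sentences of every pair."""
    return [(tokenize(s1), tokenize(s2)) for s1, s2 in pairs]


def make_stanfordnlp_tokenizer() -> Tokenizer:
    """Build a tokenizer from StanfordNLP (assumes models are downloaded)."""
    import stanfordnlp  # imported lazily; only needed for real tokenization

    nlp = stanfordnlp.Pipeline(processors="tokenize", lang="en")

    def tokenize(text: str) -> List[str]:
        doc = nlp(text)
        # Flatten all sentences the pipeline finds into one word list.
        return [word.text for sent in doc.sentences for word in sent.words]

    return tokenize


# Whitespace splitting stands in for the real tokenizer in this quick check.
example = tokenize_pairs([("foo bar", "hoge fuga")], str.split)
```

Passing the tokenizer in as a function keeps the pairing logic testable without loading any models. The result has the list-of-tuples shape described under Format and could be written out with `joblib.dump(result, path)`.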

Data Preparation (for reproducibility)

Tokenizer

StanfordNLP

https://stanfordnlp.github.io/stanfordnlp/

pip install stanfordnlp
# $ brew install libomp # for torch in macOS?
>>> import stanfordnlp
>>> stanfordnlp.download('en') # This downloads the English models

Using the default treebank "en_ewt" for language "en".
Would you like to download the models for: en_ewt now? (Y/n)

Default download directory: ~/stanfordnlp_resources
Hit enter to continue or type an alternate directory.
