data-sts

Format

sentence pairs

STS.input.foo.txt (original)
- TSV file including sentence pairs
- each line is of the form: sent1[TAB]sent2

STS.input.foo.words.en_ewt.joblib (tokenized by StanformdNLP)

joblib formatted file

list of words pairs

[
    (['This', 'is', 'first', 'sentence', '.'], ['This', 'is', 'second', 'sentence', '.']),
    (['foo', 'bar', ...], ['hoge', 'fuga', ...]),
    ...
]

gold scores

STS.gs.foo.txt
- each line gold scores (sentence similarities)

List of Dataset

STS-B

data/stsbenchmark

Web

http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark

Stats

http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark

Log of Data Preparation (for reproducibility)

wget http://ixa2.si.ehu.es/stswiki/images/4/48/Stsbenchmark.tar.gz -P data
tar xvzf data/Stsbenchmark.tar.gz -C data

cd data/stsbenchmark
cut -f5 sts-test.csv > STS.gs.sts-test.txt
cut -f 6-7 sts-test.csv > STS.input.sts-test.txt
cut -f5 sts-dev.csv > STS.gs.sts-dev.txt
cut -f 6-7 sts-dev.csv > STS.input.sts-dev.txt
cut -f5 sts-train.csv > STS.gs.sts-train.txt
cut -f 6-7 sts-train.csv > STS.input.sts-train.txt

Tokenize w/StanfordNLP ("en_ewt")

scripts/tokenize_STSinput_by_stanfordnlp.py data/stsbenchmark/STS.input.sts-train{.txt,.words.en_ewt.joblib}
scripts/tokenize_STSinput_by_stanfordnlp.py data/stsbenchmark/STS.input.sts-dev{.txt,.words.en_ewt.joblib}
scripts/tokenize_STSinput_by_stanfordnlp.py data/stsbenchmark/STS.input.sts-test{.txt,.words.en_ewt.joblib}

Data Preparation (for reproducibility)

Tokenizer

StanformdNLP

https://stanfordnlp.github.io/stanfordnlp/

pip install stanfordnlp
# $ brew install libomp # for torch in macOS?

>>> import stanfordnlp
>>> stanfordnlp.download('en') # This downloads the English models

Using the default treebank "en_ewt" for language "en". Would you like to download the models for: en_ewt now? (Y/n)

Default download directory: ~/stanfordnlp_resources Hit enter to continue or type an alternate directory.

eumesy / data-sts

data-sts

Format

sentence pairs

gold scores

List of Dataset

STS-B

Web

Stats

Log of Data Preparation (for reproducibility)

Tokenize w/StanfordNLP ("en_ewt")

Data Preparation (for reproducibility)

Tokenizer

StanformdNLP

About

Languages