piskvorky / gensim-data

Data repository for pretrained NLP models and NLP corpora.

Home Page: https://rare-technologies.com/new-api-for-pretrained-nlp-models-and-datasets-in-gensim/

SemEval 2016/2017 Task 3, English Subtask A unannotated datasets and English Subtask B datasets

Witiko opened this issue

Introduction

I converted the SemEval 2016 and 2017 question answering datasets into JSON for ease of use. The original datasets are in XML and scattered across several ZIP archives. The JSON files will be used right away in the Gensim documentation for the Soft Cosine Measure (see the respective pull request).

Description

Community Question Answering (CQA) forums are gaining popularity online. They are seldom moderated and rather open, so they place few restrictions, if any, on who can post and who can answer a question. On the positive side, this means that one can freely ask any question and expect some good, honest answers. On the negative side, it takes effort to go through all possible answers and to make sense of them. For example, it is not unusual for a question to have hundreds of answers, which makes it very time-consuming for the user to inspect and winnow them. The challenge we propose may help automate the process of finding good answers to new questions in a community-created discussion forum (e.g., by retrieving similar questions in the forum and identifying the posts in the answer threads of those questions that answer the question well).

We build on the success of the previous editions of our SemEval tasks on CQA, SemEval-2015 Task 3 and SemEval-2016 Task 3, and present an extended edition for SemEval-2017, which incorporates several novel facets.

Datasets

    [
        {
            "THREAD_SEQUENCE": "Q1",
            "RelQuestion": {
                "RELQ_CATEGORY": "Politics", 
                "RELQ_DATE": "2009-12-14 11:58:33", 
                "RELQ_ID": "Q1", 
                "RELQ_USERID": "U2", 
                "RELQ_USERNAME": "anonymous", 
                "RelQBody": "The state of Internet in Thailand:IT Minsitry blocks CNN; Facebook; Yahoo; Flickr Thai Immigration website listed as dangerousFull story: http://www.thaivisa.com/forum/Thai-Govt-Blocks-Cnn-Yahoo-Financ-t321851.html", 
                "RelQSubject": "Thailand:IT Minsitry blocks CNN; Facebook;"
            }, 
            "RelComments": [
                {
                    "RELC_DATE": "2009-12-14 12:00:59", 
                    "RELC_ID": "Q1_C1", 
                    "RELC_USERID": "U210", 
                    "RELC_USERNAME": "DaRuDe", 
                    "RelCText": "have they blocked porn??? <img src=\"http://www.qatarliving.com/files/images/Da.gif\">"
                }, 
                {
                    "RELC_DATE": "2009-12-14 12:07:04", 
                    "RELC_ID": "Q1_C2", 
                    "RELC_USERID": "U2", 
                    "RELC_USERNAME": "anonymous", 
                    "RelCText": "like trying to contain a tsunami with a hand towel ************************************ I'm Jack's complete lack of surprise"
                }, 
                {
                    "RELC_DATE": "2009-12-14 12:09:23", 
                    "RELC_ID": "Q1_C3", 
                    "RELC_USERID": "U114", 
                    "RELC_USERNAME": "GodFather.", 
                    "RelCText": "oops double post.. ----------------- \"HE WHO DARES WINS\" Derek Edward Trotter"
                }, 
  • semeval-2016_2017-task3-subtaskB-english.json.gz (6.05M) – Example:
    {
        "2016-dev": [
            {
                "ORGQ_ID": "Q268", 
                "OrgQBody": "Which is a good bank as per your experience in Doha", 
                "OrgQSubject": "Good Bank", 
                "Threads": [
                    { 
                        "THREAD_SEQUENCE": "Q268_R4",
                        "RelQuestion": {
                            "RELQ_CATEGORY": "Advice and Help", 
                            "RELQ_DATE": "2013-05-02 19:43:00", 
                            "RELQ_ID": "Q268_R4", 
                            "RELQ_RANKING_ORDER": "4", 
                            "RELQ_RELEVANCE2ORGQ": "PerfectMatch", 
                            "RELQ_USERID": "U4882", 
                            "RELQ_USERNAME": "ankukuma", 
                            "RelQBody": "Hi Guys; I need to open a new bank accoount. Which is the best bank in Qatar ? I assume all of them will roughly be the same; but stll which has a slight edge (Money transfer; benifits etc) Thanks !!!", 
                            "RelQSubject": "Best Bank"
                        }, 
                        "RelComments": [
                            {
                                "RELC_DATE": "2013-05-03 07:23:20", 
                                "RELC_ID": "Q268_R4_C1", 
                                "RELC_RELEVANCE2ORGQ": "Good", 
                                "RELC_RELEVANCE2RELQ": "Good", 
                                "RELC_USERID": "U594", 
                                "RELC_USERNAME": "Dilgeer", 
                                "RelCText": "Commercial bank/IBQ"
                            }, 
                            {
                                "RELC_DATE": "2013-05-03 12:58:13", 
                                "RELC_ID": "Q268_R4_C2", 
                                "RELC_RELEVANCE2ORGQ": "Good", 
                                "RELC_RELEVANCE2RELQ": "Good", 
                                "RELC_USERID": "U979", 
                                "RELC_USERNAME": "Speedysid", 
                                "RelCText": "The best bank in Qatar for you would be the one that fits in your requirements.I suggest you visit the major banks here; and approach the Customer Relations person there to guide you with the facilities the bank offers. They include: -Current Accounts facilities -Savings Account facilities - Money Transfer (However; I highly recommend using the bank transfer only in emergency cases. There are money transfer agents which offer better exchange rates; and lower service fees) - Tie-ups with any bank in your home country to ease transfers"
                            }, 
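
The nested layout of the subtask B file shown above can be walked with plain dictionaries and lists. The following is a minimal sketch and not part of the original issue: it assumes the file is a gzip-compressed JSON object keyed by split name (such as "2016-dev"), as the excerpt suggests, and the helper names load_json_gz and iter_related_questions are hypothetical.

```python
import gzip
import json


def load_json_gz(path):
    """Read a gzip-compressed JSON file into Python objects."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


def iter_related_questions(dataset):
    """Yield (split, original question ID, related question ID, relevance)."""
    for split, original_questions in dataset.items():
        for orgq in original_questions:
            for thread in orgq["Threads"]:
                relq = thread["RelQuestion"]
                yield (split, orgq["ORGQ_ID"], relq["RELQ_ID"],
                       relq["RELQ_RELEVANCE2ORGQ"])


# A trimmed-down record copied from the excerpt above, used as stand-in data.
sample = {
    "2016-dev": [
        {
            "ORGQ_ID": "Q268",
            "OrgQBody": "Which is a good bank as per your experience in Doha",
            "OrgQSubject": "Good Bank",
            "Threads": [
                {
                    "THREAD_SEQUENCE": "Q268_R4",
                    "RelQuestion": {
                        "RELQ_ID": "Q268_R4",
                        "RELQ_RELEVANCE2ORGQ": "PerfectMatch",
                        "RelQSubject": "Best Bank",
                    },
                    "RelComments": [],
                }
            ],
        }
    ]
}

pairs = list(iter_related_questions(sample))
print(pairs)
```

With the real file, load_json_gz("semeval-2016_2017-task3-subtaskB-english.json.gz") would take the place of sample, assuming the archive has been downloaded to the working directory.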

Papers

Code

License

These are the licensing notices found in the individual ZIP files with the original XML datasets:

  • semeval2016-task3-cqa-ql-traindev-v3.2.zip

    These datasets are free for general research use.

  • semeval2017_task3_test.zip
    • the scripts and all files released for the task are free for general research use

    • you should use the following citation in your publications whenever using these resources:

      @InProceedings{SemEval-2017:task3,
         author    = {Nakov, Preslav and Hoogeveen, Doris and M\`{a}rquez, Llu\'{i}s and Moschitti, Alessandro and Mubarak, Hamdy and Baldwin, Timothy and Verspoor, Karin},
         title     = {{SemEval}-2017 Task 3: Community Question Answering},
         booktitle = {Proceedings of the 11th International Workshop on Semantic Evaluation},
         series    = {SemEval '17},
         month     = {August},
         year      = {2017},
         address   = {Vancouver, Canada},
         publisher = {Association for Computational Linguistics},
       }
      

Nice! :)

@menshikh-iv I pushed an updated semeval-2016_2017-task3-subtaskB-english.json.gz, which now contains the RELQ_RANKING_ORDER field as an integer rather than a string. It is a minor but convenient change.
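
To illustrate why the integer type is convenient, here is a small sketch; the rank values are made up for illustration and do not come from the dataset. Sorting string ranks compares them character by character, so "10" sorts before "4", whereas integer ranks sort numerically.

```python
# Hypothetical rank values; in the dataset they appear under RELQ_RANKING_ORDER.
string_ranks = ["4", "10", "1"]
integer_ranks = [4, 10, 1]

# Strings compare lexicographically, so "10" lands before "4".
sorted_strings = sorted(string_ranks)

# Integers compare numerically, which is the intended ranking order.
sorted_integers = sorted(integer_ranks)

# With string ranks, an explicit conversion is needed to get the same order.
sorted_converted = sorted(string_ranks, key=int)

print(sorted_strings)
print(sorted_integers)
print(sorted_converted)
```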

Can this dataset be used directly as follows?

import gensim
import gensim.downloader as api

corpus = api.load('semeval-2016-2017-task3-subtaskA-unannotated')
word2vec = gensim.models.Word2Vec(corpus)

I get strange output when I check the vocabulary, word2vec.wv.vocab:

{'RelComments': <gensim.models.keyedvectors.Vocab at 0x7f1740e26a90>,
 'RelQuestion': <gensim.models.keyedvectors.Vocab at 0x7f16fad64cf8>,
 'THREAD_SEQUENCE': <gensim.models.keyedvectors.Vocab at 0x7f173ee5f128>}

@AMR-KELEG The dataset is not a corpus. You will need to extract the text data you are interested in:

import gensim
import gensim.downloader as api

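# Each item is a dictionary with the keys THREAD_SEQUENCE, RelQuestion, and RelComments.
questions = api.load('semeval-2016-2017-task3-subtaskA-unannotated')
# Keep only the body text of each question.
corpus = [question["RelQuestion"]["RelQBody"] for question in questions]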
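
A likely explanation for the vocabulary output above: each record in the dataset is a dictionary, and iterating over a dictionary yields its keys, which would be why the vocabulary contained only THREAD_SEQUENCE, RelQuestion, and RelComments. Word2Vec also expects each sentence as a list of tokens rather than a raw string, so the extracted bodies still need tokenizing. The sketch below uses a small in-memory stand-in for the downloaded records and a deliberately naive regular-expression tokenizer; the tokenize helper is hypothetical, and gensim.utils.simple_preprocess could serve instead.

```python
import re

# Stand-in for the records returned by api.load(...); fields are taken from
# the excerpt in the issue description, with the text shortened for brevity.
questions = [
    {
        "THREAD_SEQUENCE": "Q1",
        "RelQuestion": {
            "RelQSubject": "Thailand:IT Minsitry blocks CNN; Facebook;",
            "RelQBody": "The state of Internet in Thailand",
        },
        "RelComments": [
            {"RelCText": "oops double post.."},
        ],
    },
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (naive)."""
    return re.findall(r"[a-z]+", text.lower())


# Iterating over a record directly yields only its keys.
keys_seen = list(questions[0])

# One token list per question body, plus one per comment.
corpus = []
for record in questions:
    corpus.append(tokenize(record["RelQuestion"]["RelQBody"]))
    for comment in record["RelComments"]:
        corpus.append(tokenize(comment["RelCText"]))

print(keys_seen)
print(corpus)
```

The resulting corpus could then be passed to gensim.models.Word2Vec(corpus). On a sample this small, the default min_count would discard every word, so a real run needs the full dataset.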