TongLi3701 / NLP_Coursework

NLP CO490 coursework 2020 at Imperial College London

Sentence-Level Quality Estimation (QE) Task 2020

This repository provides our code for the NLP CO490 coursework 2020 at Imperial College London. In this task, we used the English-Chinese corpus and achieved a Pearson score of 0.4327 as our best performance on the hidden test set. The code runs on Google Colab.

In this repository, we present eight models:

Each notebook title indicates the model used, the embedding method chosen, and whether pre-processing is applied.

Getting Started

Upload one of the IPython notebook files to Colab, together with the files in the dataset folder.

Directly Runnable Models

Simply run all the modules to obtain the Pearson results of the models on the validation dataset.
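The Pearson score these notebooks report is the standard Pearson correlation between predicted and gold quality scores. As a minimal sketch (function and variable names are illustrative, not taken from the notebooks), it can be computed with NumPy alone:

```python
import numpy as np

def pearson(pred, gold):
    """Pearson correlation between predicted and gold quality scores."""
    pred = np.asarray(pred, dtype=float)
    gold = np.asarray(gold, dtype=float)
    # Centre both series, then take the normalised dot product.
    pc = pred - pred.mean()
    gc = gold - gold.mean()
    return float((pc @ gc) / np.sqrt((pc @ pc) * (gc @ gc)))
```

In practice the notebooks use the equivalent library routine (`scipy.stats.pearsonr`), which also returns a p-value.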

Models Using Two BERT Models

  • SVR+Sentence_embed_Bert_base+raw

  • SVR+Sentence_embed_Bert_large+raw

    1. Run the notebook up to the Save English results module.

    2. Download the produced english_train_berteb.csv, english_dev_berteb.csv and english_test_berteb.csv files.

    These files are also provided in the folders given above; if you use them, you can start directly from step 4.

    3. Disconnect the notebook in Colab.

    4. Reconnect, then upload the files in the dataset folder along with the CSV files downloaded in step 2.

    5. Run the modules before the English Model module and the modules following the Chinese Model module.

    Finally, we obtain the Pearson score on the validation dataset, along with the en-zh_svr.zip file (results on the test dataset) that can be submitted to CodaLab.

  • RNN+Word_embed_Bert+pre_process.ipynb

    1. Copy the folder with the BERT word embedding results into Google Drive.
    2. Start running from the module Model: LSTM (BERT word embedding).
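The SVR notebooks above end by packaging the test-set predictions into en-zh_svr.zip for CodaLab. As a hedged sketch of that final step (the inner file name "predictions.txt" and the placeholder scores are assumptions, not taken from the notebooks; the notebooks build the archive automatically):

```python
import zipfile

# Placeholder predictions; in practice these come from the trained SVR model.
preds = [0.81, 0.42, 0.67]

# Write one predicted score per line. The file name "predictions.txt" is
# an assumption -- check the CodaLab submission page for the exact name.
with open("predictions.txt", "w") as f:
    f.writelines(f"{p:.6f}\n" for p in preds)

# Package the predictions into the archive that gets handed in.
with zipfile.ZipFile("en-zh_svr.zip", "w") as zf:
    zf.write("predictions.txt")
```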

Authors
