zake7749 / WSDM-Cup-2019

[ACM-WSDM] 3rd place solution at WSDM Cup 2019, Fake News Classification on Kaggle.


Fake News Detection

This is the 3rd place solution to the ACM International Conference on Web Search and Data Mining (WSDM) Cup 2019, a challenge on fake news detection and sentence-pair modeling.

(Figure: solution overview)

Documents

Reproduce our results

1. Setup

  1. Clone this project.

  2. Download the dataset from the corresponding competition on Kaggle and extract it under the directory zake7749/data/dataset:

|-- dataset
    |-- sample_submission.csv
    |-- test.csv
    `-- train.csv
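
To sanity-check the extraction, the files can be loaded with pandas. The columns printed below are the ones from the Kaggle competition data; treat this as a quick sketch, not part of the pipeline:

```python
import pandas as pd

# Quick sanity check that the dataset extracted correctly.
train = pd.read_csv("zake7749/data/dataset/train.csv")
test = pd.read_csv("zake7749/data/dataset/test.csv")

print(train.shape, test.shape)
print(train.columns.tolist())
# Expect title pairs in Chinese/English plus a 3-class label:
# agreed / disagreed / unrelated.
print(train["label"].value_counts())
```
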
  3. Prepare the embedding models

We use two open-source pretrained word embeddings in this competition: the Tencent AI Lab Chinese embedding (Tencent_AILab_ChineseEmbedding.txt) and the Chinese word vectors sgns.merge.bigram.

Put these two embeddings under the folder zake7749/data/wordvec/:

|-- wordvec
    |-- Tencent_AILab_ChineseEmbedding.txt
    `-- sgns.merge.bigram
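
Both files appear to be text-format word2vec vectors, so they can be spot-checked with gensim. A minimal sketch (the Tencent file is several gigabytes, so limit keeps the load small):

```python
from gensim.models import KeyedVectors

# Load a small slice of each embedding to verify the files are intact.
tencent = KeyedVectors.load_word2vec_format(
    "zake7749/data/wordvec/Tencent_AILab_ChineseEmbedding.txt",
    binary=False, limit=50000)
sgns = KeyedVectors.load_word2vec_format(
    "zake7749/data/wordvec/sgns.merge.bigram",
    binary=False, limit=50000)

print(tencent.vector_size, sgns.vector_size)  # expected: 200 and 300
```
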

2. Instructions

The notebooks are under the folder zake7749/code.

Pre-processing

  1. Execute Stage 1.1. Preprocessing-on-word-level.ipynb
  2. Execute Stage 1.2. Preprocessing-on-char-level.ipynb

These notebooks generate 8 cleaned datasets under zake7749/data/processed_dataset (a minimal tokenization sketch follows the listing):

.
|-- engineered_chars_test.csv
|-- engineered_chars_train.csv
|-- engineered_words_test.csv
|-- engineered_words_train.csv
|-- processed_chars_test.csv
|-- processed_chars_train.csv
|-- processed_words_test.csv
`-- processed_words_train.csv
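
The word/char split works roughly as follows: word-level rows are segmented (e.g. with jieba), while char-level rows are split into individual characters. This is only a sketch of the idea, not the notebooks' exact cleaning:

```python
import jieba

title = "深度学习检测假新闻"

# Word-level: segment with jieba, as is typical for Chinese word models.
words = list(jieba.cut(title))  # e.g. ['深度', '学习', '检测', '假', '新闻'] (illustrative)

# Char-level: split into individual characters.
chars = list(title)             # ['深', '度', '学', ...]

print(" ".join(words))
print(" ".join(chars))
```
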

Train the char-level embeddings

Execute Stage 1.3. Train-char-embeddings, which outputs 3 char embeddings under zake7749/data/wordvec/:

|-- wordvec
    |-- Tencent_AILab_ChineseEmbedding.txt
    |-- fasttext-50-win3.vec
    |-- sgns.merge.bigram
    |-- zh-wordvec-50-cbow-windowsize50.vec
    `-- zh-wordvec-50-skipgram-windowsize7.vec
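
The filenames suggest 50-dimensional vectors trained with fastText, CBOW, and skip-gram, with window sizes read off the names. A minimal gensim sketch under those assumptions (the exact hyperparameters live in the Stage 1.3 notebook, and the inline corpus is a placeholder):

```python
from gensim.models import FastText, Word2Vec

# `char_corpus` stands in for the char-level titles from Stage 1.2:
# each title is a list of single characters.
char_corpus = [list("深度学习检测假新闻"), list("这是一条示例标题")]  # illustrative

# fastText, 50-dim, window 3  -> fasttext-50-win3.vec
ft = FastText(char_corpus, vector_size=50, window=3, min_count=1)
ft.wv.save_word2vec_format("zake7749/data/wordvec/fasttext-50-win3.vec")

# CBOW (sg=0) and skip-gram (sg=1); window sizes taken from the filenames.
cbow = Word2Vec(char_corpus, vector_size=50, window=50, sg=0, min_count=1)
cbow.wv.save_word2vec_format("zake7749/data/wordvec/zh-wordvec-50-cbow-windowsize50.vec")

sg = Word2Vec(char_corpus, vector_size=50, window=7, sg=1, min_count=1)
sg.wv.save_word2vec_format("zake7749/data/wordvec/zh-wordvec-50-skipgram-windowsize7.vec")
```
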

Train the base models (LB 0.84 ~ 0.86)

  • Execute Stage 2. First-Level-with-char-level.ipynb
  • Execute Stage 2. First-Level-with-word-level.ipynb
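
These two notebooks train a family of sentence-pair NNs on the char- and word-level data. The actual architectures live in the notebooks; the sketch below is only a generic example of such a pair encoder (shared BiLSTM, standard interaction features, softmax over the three classes), with all sizes illustrative:

```python
from tensorflow.keras import layers, Model

MAX_LEN, VOCAB, EMB_DIM = 48, 50000, 300  # illustrative sizes

def build_pair_model():
    t1 = layers.Input(shape=(MAX_LEN,), dtype="int32")
    t2 = layers.Input(shape=(MAX_LEN,), dtype="int32")

    # Shared embedding + BiLSTM encoder applied to both titles.
    embed = layers.Embedding(VOCAB, EMB_DIM)  # seed with pretrained vectors in practice
    encode = layers.Bidirectional(layers.LSTM(128))
    v1, v2 = encode(embed(t1)), encode(embed(t2))

    # Classic pair features: concatenation, difference, element-wise product.
    pair = layers.Concatenate()(
        [v1, v2, layers.Subtract()([v1, v2]), layers.Multiply()([v1, v2])])
    hidden = layers.Dense(256, activation="relu")(pair)
    out = layers.Dense(3, activation="softmax")(hidden)  # agreed / disagreed / unrelated

    model = Model(inputs=[t1, t2], outputs=out)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_pair_model()
model.summary()
```
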

Ensemble the predictions of the base models (LB 0.873)

  1. Execute Stage 3.1. First-level-ensemble-ridge-regression
  2. Execute Stage 3.2. First-level-ensemble-with-LGBM-each-side
  3. Execute Stage 3.3. First-level-ensemble-with-LGBM
  4. Execute Stage 3.4. First-level-ensemble-with-NN
  5. Execute Stage 3.5. Second-level-ensemble
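
These stages stack the base models' out-of-fold predictions with several second-level learners and then combine them. A minimal sketch of the ridge flavor (Stage 3.1), assuming the stacked features are each model's class probabilities; all arrays below are synthetic placeholders:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic placeholders: out-of-fold probabilities from 4 base models
# (n_models * 3 columns) and the matching averaged test probabilities.
rng = np.random.default_rng(0)
oof_preds, test_preds = rng.random((1000, 12)), rng.random((200, 12))
y = np.eye(3)[rng.integers(0, 3, 1000)]  # one-hot labels

# Fit one ridge regressor per class on the stacked probabilities.
stacked = np.column_stack([
    Ridge(alpha=1.0).fit(oof_preds, y[:, c]).predict(test_preds)
    for c in range(3)
])
final = stacked.argmax(axis=1)  # class indices for the test set
```
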

Fine-tune the [CLS] vector of BERT (LB 0.867)

  • Run the script hanshan/bert/train_wsdm.sh
  • To generate a submission file at this stage, run zake7749/bert/data/probs_to_preds.py
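
probs_to_preds.py converts BERT's class probabilities into the label strings Kaggle expects. A minimal sketch of that conversion; the file layout and column names here are assumptions, so check sample_submission.csv for the exact header:

```python
import pandas as pd

# Hypothetical layout: one probability column per class plus the test id.
probs = pd.read_csv("probs.csv")  # assumed columns: id, agreed, disagreed, unrelated
labels = probs[["agreed", "disagreed", "unrelated"]].idxmax(axis=1)

submission = pd.DataFrame({"Id": probs["id"], "Category": labels})
submission.to_csv("submission.csv", index=False)
```
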

Blend the predictions of ensemble NNs with BERT (LB 0.874)

  • Execute Stage 3.6. Bagging-with-BERT

**Note:** please change the path of sec_stacking_df to the corresponding file.
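
The bagging stage amounts to combining the second-level ensemble's class probabilities with BERT's. A sketch of a simple weighted blend, with hypothetical file names and an illustrative weight:

```python
import numpy as np

# Class probabilities from the Stage 3.5 ensemble and from BERT
# (hypothetical .npy files, shape (n_test, 3) each).
nn_probs = np.load("nn_probs.npy")
bert_probs = np.load("bert_probs.npy")

w = 0.6  # illustrative weight; tune on validation, not the notebook's value
blend = w * nn_probs + (1 - w) * bert_probs
final = blend.argmax(axis=1)
```
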

Fine-tune the base models with noisy labels (LB 0.86 ~ 0.875)

  • Execute Stage 4.1. Fine-tune-word-level-models.ipynb
  • Execute Stage 4.2. Fine-tune-char-level-models.ipynb

Fine-tune the [CLS] vector of BERT with noisy labels (LB 0.880)

  • Run hanshan/prep_pseudo_labels.py
  • Run script hanshan/bert/train_wsdm_pl.sh
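
"Noisy labels" here refers to pseudo-labeling: confident ensemble predictions on the test set are folded back into training. A common recipe is sketched below; the probability file, its column names, and the threshold are assumptions, and prep_pseudo_labels.py may differ:

```python
import pandas as pd

train = pd.read_csv("zake7749/data/dataset/train.csv")
test = pd.read_csv("zake7749/data/dataset/test.csv")
# Hypothetical probability table; rows assumed to align with test.csv.
probs = pd.read_csv("ensemble_test_probs.csv")[["agreed", "disagreed", "unrelated"]]

# Keep only confident test predictions as pseudo (noisy) labels.
pseudo = test.copy()
pseudo["label"] = probs.idxmax(axis=1)
pseudo = pseudo[probs.max(axis=1) > 0.9]  # illustrative confidence threshold

augmented = pd.concat([train, pseudo], ignore_index=True)
```
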

Ensemble the predictions of the fine-tuned base models (LB 0.879)

  1. Execute Stage 5.1. First-level-fine-tuned-ensemble-ridge-regression.ipynb
  2. Execute Stage 5.2. First-level-fine-tuned-ensemble-withNN.ipynb
  3. Execute Stage 5.3. First-level-fine-tuned-ensemble-with-LGBM.ipynb
  4. Execute Stage 5.4. Second-level-fine-tuned-ensemble.ipynb

Final Blending with post-processing (LB 0.881)

  1. Execute Stage 9. High-Ground.ipynb
  2. Execute Stage 42. Final Answer.ipynb

The final prediction final_answer.csv is generated under the folder zake7749/data/high_ground/.
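
The notebooks' exact post-processing is not described here. One post-processing widely used in this competition, sketched purely as an assumption, is to override predictions for test pairs whose (tid1, tid2) pair already appears in train:

```python
import pandas as pd

train = pd.read_csv("zake7749/data/dataset/train.csv")
test = pd.read_csv("zake7749/data/dataset/test.csv")
# Submission column names assumed to follow sample_submission.csv;
# rows assumed to align with test.csv.
sub = pd.read_csv("zake7749/data/high_ground/final_answer.csv")

# Labels for (tid1, tid2) pairs seen in training, in both orders
# (all three relations are symmetric).
known = {(r.tid1, r.tid2): r.label for r in train.itertuples()}
known.update({(r.tid2, r.tid1): r.label for r in train.itertuples()})

override = test.apply(lambda r: known.get((r.tid1, r.tid2)), axis=1)
sub.loc[override.notna(), "Category"] = override[override.notna()]
sub.to_csv("zake7749/data/high_ground/final_answer.csv", index=False)
```
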


License: Apache License 2.0


Languages

Jupyter Notebook 94.3%, Python 5.6%, Shell 0.1%