rcortx / kaggle-fake-news

Baseline TF-IDF solution to the Kaggle fake news competition: https://www.kaggle.com/c/fake-news/data

The classifier predicts fake news with a basic text-processing pipeline applied to the text column only:

  1. Text cleaning: accent removal, lowercasing
  2. Tokenization
  3. Stopword removal
  4. Lemmatization/stemming
  5. TF-IDF vectorization
  6. Classification: experiments with tree-based classifiers such as Decision Trees and Gradient Boosted Trees
  7. Evaluation: F1 of 87-91.5% on a 20% validation split, with the best F1 from Gradient Boosted Trees (NOTE: this is a single hold-out split, not k-fold cross-validation)
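
Steps 1 to 5 above can be sketched in plain Python. This is a minimal illustration using only the standard library, not the notebook's actual code: the stopword list, the `naive_stem` suffix stripper (a stand-in for a real lemmatizer or stemmer), and the exact TF-IDF weighting formula are all assumptions made for the example.

```python
import math
import re
import unicodedata
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real run would use a fuller list).
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}


def clean(text):
    """Step 1: strip accents and lowercase."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_accents.lower()


def tokenize(text):
    """Step 2: split cleaned text into alphabetic tokens."""
    return re.findall(r"[a-z]+", text)


def naive_stem(token):
    """Step 4 stand-in: strip a few common suffixes (hypothetical, not a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Steps 1-4 chained: clean, tokenize, drop stopwords, stem."""
    tokens = tokenize(clean(text))
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]


def tfidf(docs):
    """Step 5: one {term: weight} dict per document, weight = tf * (ln(N / df) + 1)."""
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * (math.log(n_docs / doc_freq[term]) + 1.0)
            for term, count in counts.items()
        })
    return vectors
```

Terms that appear in fewer documents receive a larger weight, which is what lets the downstream tree classifiers pick up on distinctive vocabulary. In practice a library vectorizer would normally replace `tfidf`, and its smoothing and normalization may differ from the formula assumed here.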

Next Steps:

  1. Use other columns, such as author and title, to improve classifier performance
  2. Use BERT-based vectorization instead of TF-IDF
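
One simple way to approach the first next step is to concatenate the extra columns with the text before vectorization. The sketch below assumes each row is a dict-like record with `title`, `author`, and `text` keys; the `combine_fields` helper is hypothetical and is not part of the repository.

```python
def combine_fields(row, fields=("title", "author", "text")):
    """Join the selected columns into one string, skipping missing or empty values."""
    parts = []
    for field in fields:
        value = row.get(field)
        # Missing cells may be None or a float NaN; only keep non-empty strings.
        if isinstance(value, str) and value.strip():
            parts.append(value.strip())
    return " ".join(parts)
```

The combined string can then go through the same cleaning and vectorization steps as the text column. Vectorizing each column separately and stacking the features is an alternative that keeps title and author terms distinct from body terms.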

