Baseline TFIDF solution to https://www.kaggle.com/c/fake-news/data
The classifier uses a basic text-processing pipeline over only the text
column to predict fake news:
- Text cleaning: accent removal, lower case
- Tokenization
- Stopword removal
- Lemmatization/Stemming
- TFIDF vectorization
- Experiments with tree classifiers like Decision Trees and Gradient Boosted Trees
- Achieved F1 of 87-91.5% on a 20% held-out validation set (best F1 with Gradient Boosted Trees). NOTE: this comes from a single split, not k-fold cross-validation.
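The first steps of the pipeline above (cleaning, tokenization, stopword removal, TFIDF weighting) can be sketched with the Python standard library alone. This is a minimal illustration, not the code used for the reported results: the stopword list, the sample sentences, the function names, and the exact weighting formula (term frequency times ln(N / df) + 1) are all assumptions. Lemmatization/stemming and the tree classifiers are left out because they would normally come from external libraries.

```python
import math
import re
import unicodedata
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real run would use a fuller list).
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and"}


def clean(text):
    """Strip accents and lower-case the text."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return no_accents.lower()


def tokenize(text):
    """Split cleaned text into alphanumeric tokens and drop stopwords."""
    return [tok for tok in re.findall(r"[a-z0-9]+", clean(text)) if tok not in STOPWORDS]


def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * (ln(N / df) + 1)."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        vectors.append({
            term: (count / total) * (math.log(n_docs / doc_freq[term]) + 1.0)
            for term, count in counts.items()
        })
    return vectors


# Made-up sample documents, for illustration only.
docs = ["Café owners deny the rumour", "The rumour is FALSE", "Owners of the café speak"]
vectors = tfidf(docs)
```

Terms that appear in fewer documents get a larger weight, so in the second sample document "false" outweighs "rumour". The resulting sparse vectors are what a tree classifier would be trained on.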
Next Steps:
- Use other columns, such as author and title, to improve classifier performance
- Use BERT based vectorization instead of TFIDF
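One simple way to bring in other columns such as author and title is to concatenate them with the text before vectorization. The sketch below is only one possible approach, not something already tried here. The helper name `combine_fields`, the sample row, and the handling of missing values are assumptions.

```python
def combine_fields(row, fields=("author", "title", "text")):
    """Join the chosen columns of one row into a single string, skipping missing values."""
    parts = []
    for field in fields:
        value = row.get(field)
        # Treat None, NaN (which is not equal to itself), and blank strings as missing.
        if value is None or value != value or not str(value).strip():
            continue
        parts.append(str(value).strip())
    return " ".join(parts)


# Made-up row, for illustration only.
row = {"author": "Jane Doe", "title": "Breaking story", "text": "Full article body."}
combined = combine_fields(row)
```

The combined string can then go through the same cleaning and vectorization steps as the text column alone. Keeping each column as a separate feature group is an alternative that lets the classifier weight them differently.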