In this project I built a model for an API that classifies disaster messages. The datasets provided by Figure Eight contain real messages sent during disaster events and their respective categories. The task was to train a supervised ML classifier to automate the categorization of new messages so that different disaster relief agencies receive only the relevant ones.
https://disaster-reponse-api.herokuapp.com
$ pip install -r requirements.txt
- Run the following commands in the project's root directory to set up the database and model.
- To run the ETL pipeline that cleans the data and stores it in the database:
python data/process_data.py data/disaster_messages.csv data/disaster_categories.csv data/DisasterResponse.db
- To run the ML pipeline that trains the classifier and saves it as a gzip pickle object:
python models/train_classifier.py data/DisasterResponse.db models/model.p.gz
- Run the following command to start the server locally:
localhost=1 python run.py
- Go to http://0.0.0.0:3001/
Step 1: ETL Pipeline
- Loaded the messages and categories datasets (disaster_messages.csv, disaster_categories.csv)
- Merged the two datasets
- Cleaned the data
- Saved it in a SQLite database DisasterResponse.db
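The ETL steps above can be sketched with the standard library alone. This is a minimal illustration, not the code in process_data.py: the tiny inline CSV contents, the "name-flag;name-flag" category encoding, the table name `messages`, and the helper names `read_rows` and `run_etl` are all assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical miniature inputs standing in for disaster_messages.csv and
# disaster_categories.csv. The "name-flag;name-flag" encoding is an assumption.
MESSAGES_CSV = "id,message\n1,We need water\n2,Hello there\n2,Hello there\n"
CATEGORIES_CSV = (
    "id,categories\n1,related-1;water-1\n2,related-0;water-0\n2,related-0;water-0\n"
)


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def run_etl(messages_csv, categories_csv, conn):
    """Merge the two datasets on id, clean them, and store them in SQLite."""
    messages = {row["id"]: row["message"] for row in read_rows(messages_csv)}
    cleaned = {}
    for row in read_rows(categories_csv):
        if row["id"] not in messages:
            continue  # keep only ids present in both datasets (inner join)
        # Expand "related-1;water-0" into {"related": 1, "water": 0}
        flags = {}
        for pair in row["categories"].split(";"):
            name, value = pair.rsplit("-", 1)
            flags[name] = int(value)
        # Keying by id drops duplicated rows
        cleaned[row["id"]] = {
            "id": int(row["id"]),
            "message": messages[row["id"]],
            **flags,
        }

    records = list(cleaned.values())
    columns = list(records[0].keys())
    col_defs = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f"CREATE TABLE messages ({col_defs})")
    conn.executemany(
        f"INSERT INTO messages VALUES ({placeholders})",
        [tuple(r[c] for c in columns) for r in records],
    )
    conn.commit()
    return len(records)


# In-memory database for the sketch; the project writes to DisasterResponse.db.
conn = sqlite3.connect(":memory:")
n_rows = run_etl(MESSAGES_CSV, CATEGORIES_CSV, conn)
print(n_rows, conn.execute("SELECT * FROM messages ORDER BY id").fetchall())
```

Each category becomes its own 0/1 column, which is the shape a multilabel classifier expects as its target.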
Step 2: ML Pipeline
- Loaded data from the SQLite database
- Split the dataset into training and test sets
- Built a text processing and ML pipeline using NLTK and scikit-learn's Pipeline
- Trained and tuned the model using GridSearchCV
- Evaluated results on the test set
- Exported the final model as a gzip pickle file model.p.gz
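The export step can be illustrated as follows. This is a minimal sketch of writing and reading a gzip-compressed pickle, assuming a plain dictionary as a stand-in for the trained model; the helper names `save_model` and `load_model` and the placeholder values are hypothetical and not taken from train_classifier.py.

```python
import gzip
import os
import pickle
import tempfile


def save_model(model, path):
    """Pickle an object and compress it with gzip in one step."""
    with gzip.open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load an object previously written by save_model."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)


# Placeholder for the tuned estimator (hypothetical values, illustration only).
model = {"name": "placeholder", "params": {"n_estimators": 100}}

# Write to a temporary directory so the sketch leaves the project untouched.
path = os.path.join(tempfile.mkdtemp(), "model.p.gz")
save_model(model, path)
restored = load_model(path)
print(restored == model)
```

Compression keeps the saved model small enough to ship with the app. As with any pickle, only load files from a source you trust.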
Step 3: Python Scripts
- Converted the Jupyter notebooks into Python scripts process_data.py and train_classifier.py
- Refactored the code to make it modular
Step 4: Flask App
- Uploaded the SQLite database file and the pickle file with the final model to a Flask app template
- Created data visualizations in the app using Plotly
Because messages are categorized into 36 different categories and the dataset contains only 26,216 messages, some categories have very few positive instances. After splitting the data into train/test sets and then into cross-validation folds, some of them might end up with no positive instances at all. Even the original dataset has a category, "child_alone", without a single message in it (which puts constraints on using certain ML models, such as those relying on gradient descent).
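A quick way to spot such categories is to count the positive instances per label, both overall and within a split. The sketch below uses a made-up label matrix; only the category names come from the project, and the helper `positives_per_category` is hypothetical.

```python
# Toy label matrix: one row per message, one 0/1 column per category.
# The values are invented for illustration.
categories = ["related", "water", "child_alone"]
labels = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
    [1, 0, 0],
]


def positives_per_category(rows, names):
    """Return {category: number of messages labeled 1}."""
    return {name: sum(row[i] for row in rows) for i, name in enumerate(names)}


counts = positives_per_category(labels, categories)
empty = [name for name, n in counts.items() if n == 0]

# A naive split: first half for training, second half for testing.
test_counts = positives_per_category(labels[2:], categories)

print(counts)
print("categories without positives:", empty)
print("positives in the test half:", test_counts)
```

Here "water" has a positive instance overall but none in the test half, which is the situation described above on a small scale.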
As a result, this project has very high class imbalance, which influences the classification results to a large degree. With few or no positive cases, the model predicts the '0' class most of the time and is usually correct, which leads to high accuracy scores but doesn't help identify the relevant messages in that category. Accuracy is therefore a poor evaluation metric for highly imbalanced tasks, with f1_score being a more appropriate one.
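The effect is easy to reproduce with a classifier that always predicts '0'. The sketch below uses invented labels (10 positives out of 1000) and computes both metrics by hand; the function names are hypothetical, and F1 is defined as 0.0 here when there are no true positives.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_positive(y_true, y_pred):
    """F1 score of the positive class (0.0 when there are no true positives)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented, highly imbalanced labels: 10 positives among 1000 messages.
y_true = [1] * 10 + [0] * 990
# A "model" that always predicts the majority class.
y_pred = [0] * 1000

acc = accuracy(y_true, y_pred)
f1 = f1_positive(y_true, y_pred)
print(f"accuracy={acc:.2f}, f1={f1:.2f}")
```

The always-zero predictor reaches 99% accuracy while finding none of the relevant messages, so its F1 score on the positive class is zero.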
There are several ways to handle an imbalanced dataset, with resampling being one of the most popular. With resampling, we either artificially increase the number of instances of the underrepresented class (e.g. with the SMOTE technique) or reduce the number of instances of the overrepresented class. Some classifiers, like RandomForestClassifier in scikit-learn, also have a class_weight parameter. (I tried tuning it, but got inferior results.)
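To make the resampling idea concrete, here is a minimal sketch of plain random oversampling for a single binary label. It is not SMOTE (which synthesizes new points rather than duplicating existing ones) and it is not what this project's model uses; the function name `oversample_minority` and the toy messages are assumptions for illustration.

```python
import random


def oversample_minority(samples, labels, seed=0):
    """Randomly duplicate minority-class samples until both classes are equal in size."""
    rng = random.Random(seed)
    pos = [s for s, y in zip(samples, labels) if y == 1]
    neg = [s for s, y in zip(samples, labels) if y == 0]
    # Decide which class is the smaller one
    minority, minority_label = (pos, 1) if len(pos) < len(neg) else (neg, 0)
    majority, majority_label = (neg, 0) if minority_label == 1 else (pos, 1)
    if not minority:
        raise ValueError("cannot oversample a class with no instances")
    # Draw duplicates with replacement to close the gap between the classes
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    new_samples = majority + minority + extra
    new_labels = [majority_label] * len(majority) + [minority_label] * (
        len(minority) + len(extra)
    )
    return new_samples, new_labels


# Invented toy data: 2 water-related messages out of 6.
messages = ["need water", "hello", "good morning", "thanks", "water please", "see you"]
is_water = [1, 0, 0, 0, 1, 0]

X_bal, y_bal = oversample_minority(messages, is_water)
print(sum(y_bal), len(y_bal) - sum(y_bal))
```

Resampling should be applied to the training split only, so that duplicated messages do not leak into the evaluation data. Note also that a category with no positive instances at all, such as "child_alone", cannot be oversampled.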
Since this is a multilabel classification task (meaning we simultaneously assign several labels to a message, and the labels are not mutually exclusive), I am not sure we need to artificially increase the prediction of certain labels. We risk having more false positives in that case, which means a bad allocation of resources, especially during disasters. For a disaster response project, I believe it is more important for the model to be able to discern relevant messages from irrelevant ones and to perform some high-level categorization well (e.g. earthquake vs flood).
The model trained in this project does just that. It has a relatively high f1_score on the 'relevant', 'aid_related' and 'weather_related' categories (0.89, 0.71 and 0.77 respectively for positive instances) and can discriminate between earthquake/flood/storm messages well (with f1_score values of 0.83, 0.67 and 0.66 respectively). It also identifies messages related to water, shelter, food and clothing relatively well (0.67, 0.63, 0.79 and 0.49).
This project is part of the Udacity Data Science Nanodegree Program.