Disaster-Tweet-Classification

This project applies natural language processing (NLP) techniques to a language-based problem: classifying disaster tweets.

The task is a binary classification problem: given a tweet, decide whether it describes a real disaster or not.

Prerequisites

  1. Python 3
  2. Jupyter Notebook
  3. Machine learning basics
  4. NLTK
  5. pandas
  6. scikit-learn
  7. Regular expressions (regex)
  8. Vectorization

Section-wise Implementation Guide

  1. Data Loading
  2. Exploratory Data Analysis
  3. Preprocessing
  4. Fitting the Model with Parameter Tuning
  5. Performance Evaluation

Data Loading

I like pandas; my basic approach is to load the data into a DataFrame and then perform operations and explorations such as EDA on it.

import pandas as pd

df = pd.read_csv("train.csv", engine="python", delimiter=",")

Exploratory Data Analysis

Dataset describe
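
The summary statistics shown below are presumably produced with pandas' describe() method:

df.describe()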

image

Dataset info
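
Likewise, the column types and non-null counts shown below correspond to pandas' info() method:

df.info()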

image

Top Locations used in Dataset

import seaborn as sns
import matplotlib.pyplot as plt

# locations_vc holds the value counts of the 'location' column
locations_vc = df['location'].value_counts()

sns.barplot(y=locations_vc[0:30].index, x=locations_vc[0:30], orient='h')
plt.title("Top 30 Locations")
plt.show()

image

Top Keywords used in Dataset

# keyword_vc holds the value counts of the 'keyword' column
keyword_vc = df['keyword'].value_counts()

sns.barplot(y=keyword_vc[0:30].index, x=keyword_vc[0:30], orient='h')
plt.title("Top 30 keyword")
plt.show()

image

Word Cloud of the Abbreviations Used
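
A minimal sketch of how such a word cloud can be generated, assuming the wordcloud package and a hypothetical abbreviations mapping (the notebook's actual source data may differ):

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Hypothetical abbreviation map; the real one is much larger
abbreviations = {"lol": "laughing out loud", "u": "you", "b4": "before"}

wc = WordCloud(background_color="white").generate(" ".join(abbreviations.keys()))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()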

image

Preprocessing

Preprocessing is the most important phase. Since we are dealing with NLP, it is a little different from numeric preprocessing techniques.

The list of filters used for preprocessing the tweets is as follows (a sketch of these steps appears after the list):

  1. URLs
  2. HTML tags
  3. Non-ASCII characters
  4. Abbreviation replacement
  5. Mention removal
  6. Numbers
  7. Punctuation
  8. Stop words

These filters were applied to clean the text before vectorization.
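
A minimal sketch of such a cleaning function, assuming NLTK's English stop word list and a hypothetical abbreviations map (the notebook's exact filters may differ):

import re
import string
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))  # requires nltk.download("stopwords")
abbreviations = {"u": "you", "b4": "before"}  # hypothetical abbreviation map

def clean_tweet(text):
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # remove URLs
    text = re.sub(r"<.*?>", "", text)  # remove HTML tags
    text = text.encode("ascii", "ignore").decode()  # drop non-ASCII characters
    text = " ".join(abbreviations.get(w.lower(), w) for w in text.split())  # expand abbreviations
    text = re.sub(r"@\w+", "", text)  # remove mentions
    text = re.sub(r"\d+", "", text)  # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    return " ".join(w for w in text.split() if w.lower() not in stop_words)  # drop stop words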

Below is a glimpse of the text before and after cleaning:

image

For vectorization of the textual data, the two most popular methods are listed below (a sketch of both follows the list):

  1. Count vectorization
  2. TF-IDF vectorization
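
A minimal sketch of both vectorizers with scikit-learn, assuming the cleaned tweets are stored in a df['text'] column (the actual column name may differ):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = df["text"]  # assumes cleaned tweets live in the 'text' column

# Count vectorization: raw term frequencies
X_counts = CountVectorizer().fit_transform(corpus)

# TF-IDF vectorization: term frequencies weighted by inverse document frequency
X_tfidf = TfidfVectorizer().fit_transform(corpus)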

Fitting the Model

To classify the tweets, a Random Forest classifier has been used for the demonstration:

from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=1000, random_state=0)

The number of estimators can be chosen by testing different values and checking which one gives the best result.

For example, the following model has been tested over multiple estimator sizes to determine which one gives the most accurate results.
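
A minimal sketch of such a sweep, assuming vectorized features X_counts and a train/test split (the variable names are illustrative):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Illustrative split; X_counts is the vectorized text, df['target'] the labels
X_train, X_test, y_train, y_test = train_test_split(X_counts, df["target"], test_size=0.2, random_state=0)

for n in [100, 200, 500, 1000]:
    clf = RandomForestClassifier(n_estimators=n, random_state=0)
    clf.fit(X_train, y_train)
    print(n, accuracy_score(y_test, clf.predict(X_test)))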

image

Performance Evaluation

from sklearn.metrics import classification_report, accuracy_score

y_pred = classifier.predict(X_test)  # predictions on the held-out test set
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

image

n_estimators = 1000 gave the best result.
