Disaster-Tweet-Classification
This project applies natural language processing (NLP) techniques to solve a language-based problem.
It deals with a classification task on tweets: deciding which tweets describe a real disaster.
Prerequisites
- Python3
- Jupyter Notebook
- Machine Learning
- NLTK
- Pandas
- Sklearn
- Regex
- Vectorization
Section-Wise Implementation Guide
- Data Loading
- Exploratory Data Analysis
- Preprocessing
- Fitting the Model with Parameter Tuning
- Performance Evaluation
Data Loading
I like pandas; my basic approach is to load the data into a DataFrame and then perform operations and explorations such as EDA.
df = pd.read_csv("train.csv",engine="python", delimiter=",")
Exploratory Data Analysis
Dataset describe
Dataset info
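A minimal sketch of these two checks, using a small toy DataFrame that stands in for the real train.csv (its columns follow the Kaggle competition schema, but the rows here are illustrative):

```python
import pandas as pd

# Toy stand-in for train.csv (columns follow the competition schema; rows are made up)
df = pd.DataFrame({
    "id": [1, 2, 3],
    "keyword": ["fire", None, "flood"],
    "location": ["USA", "UK", None],
    "text": ["Forest fire near LA", "I love sunny days", "Flood warning issued"],
    "target": [1, 0, 1],
})

print(df.describe())                 # summary statistics for the numeric columns
df.info()                            # column dtypes and non-null counts (prints directly)
print(df["target"].value_counts())   # class balance: disaster vs. non-disaster tweets
```

`describe` and `info` together give a quick view of missing values (note the nulls in `keyword` and `location`) and of how balanced the target classes are.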
Top Locations used in Dataset
import seaborn as sns
import matplotlib.pyplot as plt
locations_vc = df['location'].value_counts()  # frequency of each location
sns.barplot(y=locations_vc[0:30].index, x=locations_vc[0:30], orient='h')
plt.title("Top 30 Locations")
plt.show()
Top Keywords used in Dataset
keyword_vc = df['keyword'].value_counts()  # frequency of each keyword
sns.barplot(y=keyword_vc[0:30].index, x=keyword_vc[0:30], orient='h')
plt.title("Top 30 Keywords")
plt.show()
Word Cloud of the Abbreviations Used
Preprocessing
Preprocessing is the most important phase. Since we are dealing with NLP, it is a little different from numeric preprocessing techniques.
The list of filters used for preprocessing the tweets is as follows:
- URL removal
- HTML tag removal
- Non-ASCII character removal
- Abbreviation replacement
- Mention removal
- Number removal
- Punctuation removal
- Stop-word removal
The filters above were used to clean the text before vectorization.
Below is a glimpse of the text before and after cleaning.
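The filters above can be sketched with `re` and `string` alone. The abbreviation map and stop-word set below are small illustrative samples (the project's full table and NLTK's `stopwords.words("english")` would be used in practice):

```python
import re
import string

# Small illustrative abbreviation map (assumption, not the project's full table)
ABBREVIATIONS = {"u": "you", "r": "are", "pls": "please"}
# Tiny stand-in for NLTK's English stop-word list
STOP_WORDS = {"a", "an", "the", "is", "in", "on", "and"}

def clean_tweet(text):
    text = re.sub(r"https?://\S+|www\.\S+", "", text)       # URLs
    text = re.sub(r"<.*?>", "", text)                       # HTML tags
    text = text.encode("ascii", "ignore").decode()          # non-ASCII characters
    text = re.sub(r"@\w+", "", text)                        # mentions
    text = re.sub(r"\d+", "", text)                         # numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    words = [ABBREVIATIONS.get(w, w) for w in text.lower().split()]   # abbreviations
    words = [w for w in words if w not in STOP_WORDS]       # stop words
    return " ".join(words)

# Before and after:
raw = "Fire near <b>LA</b>! pls RT @user1 http://t.co/abc 100 homes"
print(raw)
print(clean_tweet(raw))
```

Each filter is a single pass over the string, so the order of the regex substitutions matters only slightly (URLs are stripped before punctuation so the `://` is not mangled first).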
For vectorization of textual data, the two most popular methods are
- count vectorization
- TF-IDF vectorization
Fitting the Model
A Random Forest classifier has been used for the demonstration:
classifier = RandomForestClassifier(n_estimators=1000, random_state=0)
The number of estimators can be chosen by testing different values and seeing which gives the best result.
For example, the following model has been tested over multiple estimator sizes to determine which one gives the most accurate results.
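One way to run that comparison is a simple loop over candidate estimator counts; the candidate sizes and the toy corpus below are illustrative stand-ins for the real cleaned tweets:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Toy corpus standing in for the cleaned tweets and their labels (assumption)
texts = ["forest fire evacuation", "lovely sunny day", "flood warning issued",
         "having coffee now", "earthquake hits city", "great movie tonight"] * 10
labels = [1, 0, 1, 0, 1, 0] * 10

X = CountVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0)

# Try several estimator counts and record the held-out accuracy of each
results = {}
for n in (10, 100, 1000):
    clf = RandomForestClassifier(n_estimators=n, random_state=0)
    clf.fit(X_train, y_train)
    results[n] = accuracy_score(y_test, clf.predict(X_test))
print(results)
```

Keeping `random_state` fixed makes the comparison fair: only the number of trees changes between runs.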
Performance Evaluation
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test, y_pred))
Setting n_estimators = 1000 gave the best result.