Disaster-Tweet-Classification
This project applies natural language processing (NLP) techniques to solve a language-based problem.
It deals with a classification task on tweets: deciding which tweets describe a real disaster.
Prerequisites
- Python3
- Jupyter Notebook
- Machine Learning
- NLTK
- Pandas
- Sklearn
- Regex
- Vectorization
Section-Wise Implementation Guide
- Data Loading
- Exploratory Data Analysis
- Preprocessing
- Fitting the Model with Parameter Tuning
- Performance Evaluation
Data Loading
I like pandas; my basic approach is to load the data into a DataFrame and then perform operations and explorations such as EDA.
df = pd.read_csv("train.csv",engine="python", delimiter=",")
Exploratory Data Analysis
Dataset describe
Dataset info
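A minimal sketch of these two checks, using a small toy DataFrame that stands in for the real train.csv (its columns follow the Kaggle competition schema, but the rows here are illustrative):

```python
import pandas as pd

# Toy stand-in for train.csv (columns follow the competition schema; rows are made up)
df = pd.DataFrame({
    "id": [1, 2, 3],
    "keyword": ["fire", None, "flood"],
    "location": ["USA", "UK", None],
    "text": ["Forest fire near LA", "I love sunny days", "Flood warning issued"],
    "target": [1, 0, 1],
})

print(df.describe())                 # summary statistics for the numeric columns
df.info()                            # column dtypes and non-null counts (prints directly)
print(df["target"].value_counts())   # class balance: disaster vs. non-disaster tweets
```

`describe` and `info` together give a quick view of missing values (note the nulls in `keyword` and `location`) and of how balanced the target classes are.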
Top Locations used in Dataset
import seaborn as sns
import matplotlib.pyplot as plt
locations_vc = df['location'].value_counts()  # frequency of each location
sns.barplot(y=locations_vc[0:30].index, x=locations_vc[0:30], orient='h')
plt.title("Top 30 Locations")
plt.show()
Top Keywords used in Dataset
keyword_vc = df['keyword'].value_counts()  # frequency of each keyword
sns.barplot(y=keyword_vc[0:30].index, x=keyword_vc[0:30], orient='h')
plt.title("Top 30 Keywords")
plt.show()
Word Cloud of the Abbreviations Used
Preprocessing
Preprocessing is the most important phase. Since we are dealing with NLP, it is a little different from numeric preprocessing techniques.
The list of filters used for preprocessing the tweets is as follows:
- URL removal
- HTML tag removal
- Non-ASCII character removal
- Abbreviation replacement
- Mention removal
- Number removal
- Punctuation removal
- Stop-word removal
The filters above were used to clean the text before vectorization.
Below is a glimpse of the text before and after cleaning.
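The filters above can be sketched with `re` and `string` alone. The abbreviation map and stop-word set below are small illustrative samples (the project's full table and NLTK's `stopwords.words("english")` would be used in practice):

```python
import re
import string

# Small illustrative abbreviation map (assumption, not the project's full table)
ABBREVIATIONS = {"u": "you", "r": "are", "pls": "please"}
# Tiny stand-in for NLTK's English stop-word list
STOP_WORDS = {"a", "an", "the", "is", "in", "on", "and"}

def clean_tweet(text):
    text = re.sub(r"https?://\S+|www\.\S+", "", text)       # URLs
    text = re.sub(r"<.*?>", "", text)                       # HTML tags
    text = text.encode("ascii", "ignore").decode()          # non-ASCII characters
    text = re.sub(r"@\w+", "", text)                        # mentions
    text = re.sub(r"\d+", "", text)                         # numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    words = [ABBREVIATIONS.get(w, w) for w in text.lower().split()]   # abbreviations
    words = [w for w in words if w not in STOP_WORDS]       # stop words
    return " ".join(words)

# Before and after:
raw = "Fire near <b>LA</b>! pls RT @user1 http://t.co/abc 100 homes"
print(raw)
print(clean_tweet(raw))
```

Each filter is a single pass over the string, so the order of the regex substitutions matters only slightly (URLs are stripped before punctuation so the `://` is not mangled first).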
For vectorization of textual data, the two most popular methods are
- count vectorization
- TF-IDF vectorization
Fitting the Model
A Random Forest classifier has been used for the demonstration:
classifier = RandomForestClassifier(n_estimators=1000, random_state=0)
The number of estimators can be chosen by testing different values and seeing which gives the best result.
For example, the following model has been tested over multiple estimator sizes to determine which one gives the most accurate results.
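One way to run that comparison is a simple loop over candidate estimator counts; the candidate sizes and the toy corpus below are illustrative stand-ins for the real cleaned tweets:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Toy corpus standing in for the cleaned tweets and their labels (assumption)
texts = ["forest fire evacuation", "lovely sunny day", "flood warning issued",
         "having coffee now", "earthquake hits city", "great movie tonight"] * 10
labels = [1, 0, 1, 0, 1, 0] * 10

X = CountVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0)

# Try several estimator counts and record the held-out accuracy of each
results = {}
for n in (10, 100, 1000):
    clf = RandomForestClassifier(n_estimators=n, random_state=0)
    clf.fit(X_train, y_train)
    results[n] = accuracy_score(y_test, clf.predict(X_test))
print(results)
```

Keeping `random_state` fixed makes the comparison fair: only the number of trees changes between runs.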
Performance Evaluation
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test, y_pred))
Setting n_estimators = 1000 gave the best result.