
DAB 402 CAPSTONE PROJECT

My capstone project, "Sentiment Analysis on Gaming-Covid with Python and NLP", in which I perform sentiment analysis on more than 1 million tweets about gaming during Covid-19 with the help of Python and NLP.

Tech Stack 🤓

Python | Jupyter | Twitter

Instructions to run

$ git clone https://github.com/roshank007/DAB_402_CAPSTONE_PROJECT.git

Download the dataset CSV from https://www.kaggle.com/datasets/erroshan/sentiment-analysis-on-twitter-data-during-covid, then start Jupyter Notebook:

$ jupyter notebook

Open the part1, part2, part3, and part4 notebooks in order and run all cells in each.


What is sentiment analysis?

Sentiment analysis (or opinion mining) is a natural language processing (NLP) technique used to determine whether data is positive, negative or neutral.

Sentiment analysis is often performed on textual data to help businesses monitor brand and product sentiment in customer feedback, and understand customer needs.


What is Natural Language Processing???

Natural language processing (NLP) is a branch of artificial intelligence that helps computers understand, interpret and manipulate human language.

NLP draws from many disciplines, including computer science and computational linguistics, in its pursuit to bridge the gap between human communication and computer understanding. NLP is used to analyze text, allowing machines to understand how humans speak. This human-computer interaction enables real-world applications like automatic text summarization, sentiment analysis, topic extraction, named entity recognition, part-of-speech tagging, relationship extraction, stemming, and more. NLP is commonly used for text mining, machine translation, and automated question answering.


Do we know any daily-life uses of NLP?


The answer is a big yes! The best-known everyday examples of NLP are smart assistants: Google Assistant ("Hey/OK Google"), Siri (on macOS/iOS devices), and Cortana on Windows.

Let's get back to my project:

I divided my project into several parts for better understanding and easy execution.

(Figure: project workflow flowchart)

Dataset Generation


The hashtags #Covid19 and #Gaming are the two main constraints for my dataset.

Scrapers gather information from unstructured websites and store it in readable formats such as TXT, JSON, and CSV. With the help of Python's snscrape library I built my dataset, initially in JSON, covering six months of data; scraping it took around 50 hours. The dataset contains 26 attributes and 1,122,440 records (more than 1 million), including the content of the tweet, username, source of the tweet, language, like count, retweet count, and so on.
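
As a rough sketch of how such a dataset can be collected with snscrape (my own illustration, not the notebook code; the query string, date range, output file, and tweet cap are assumptions, and tweet attribute names differ between snscrape versions):

```python
# Collect tweets matching the project's hashtags into line-delimited JSON.
import json
import snscrape.modules.twitter as sntwitter

query = "#Covid19 #Gaming since:2020-03-01 until:2020-09-01"  # illustrative range

with open("tweets.json", "w", encoding="utf-8") as f:
    for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
        if i >= 1000:  # drop this cap to collect the full six months of data
            break
        f.write(json.dumps({
            "date": tweet.date.isoformat(),
            "content": tweet.content,          # the tweet text
            "username": tweet.user.username,
            "likeCount": tweet.likeCount,
            "retweetCount": tweet.retweetCount,
            "lang": tweet.lang,
        }) + "\n")
```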

Data preprocessing and EDA

Alongside the usual steps we perform on any dataset (handling missing values, checking for duplicates, standardization, normalization), here I perform text-specific preprocessing on my dataset:

(Figure: text preprocessing steps)

🎖Remove foreign-language tweets: since I only know English, I restrict the analysis to English tweets.

🎖Check for null values and remove the columns that are not valuable for the data analysis.

🎖Remove noise that conveys no meaning in my dataset, such as extra whitespace, the '@' that begins Twitter usernames, punctuation marks, emojis, and flags (a small cleaning sketch follows below).
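
A minimal sketch of that noise removal (my own helper using only the standard library, not the notebook code):

```python
# Strip @mentions, URLs, emojis/flags, punctuation, and extra whitespace.
import re
import string

EMOJI_PATTERN = re.compile(
    "[\U0001F300-\U0001FAFF\U00002700-\U000027BF\U0001F1E6-\U0001F1FF]"
)

def clean_tweet(text: str) -> str:
    text = re.sub(r"@\w+", " ", text)              # drop @username mentions
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # drop URLs
    text = EMOJI_PATTERN.sub(" ", text)            # drop emojis and flag symbols
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()       # collapse whitespace

print(clean_tweet("@gamer Loving #Gaming during #Covid19!! 🎮 https://t.co/xyz"))
# -> "Loving Gaming during Covid19"
```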

Language Visualization

(Figure: tweet language distribution)

In my dataset, 78.01% of the tweets are in English, while the remaining 21.99% are in foreign languages.

Out of 1,122,440 records, 875,658 tweets were posted in English, which accounts for 78.01%.
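
As a hypothetical sketch of how that share can be computed (assuming the scraped data sits in a line-delimited JSON file with a `lang` field; both the path and the field name are my assumptions):

```python
# Compute the percentage of tweets per language with pandas.
import pandas as pd

df = pd.read_json("tweets.json", lines=True)  # illustrative path
share = df["lang"].value_counts(normalize=True).mul(100).round(2)
print(share.head())  # e.g. 'en' at ~78.01% on this dataset
```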

Source of Tweet

(Figure: tweet source distribution)

The Twitter Web App is the most common source with 258,493 tweets, followed by iPhone devices in second place and Android in third.

Sentiment Generation

Sentiments are generated with SentimentIntensityAnalyzer, one of the tools of VADER sentiment analysis provided by Python's NLP package 'nltk'.

If the polarity is higher than zero, the tweet is interpreted as Positive;

if the polarity is equal to zero, it is interpreted as Neutral;

if the polarity is less than zero, it is interpreted as Negative.
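
A minimal sketch of this labeling rule using NLTK's VADER (the thresholds mirror the rule above; the example tweet is made up):

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time lexicon download
sia = SentimentIntensityAnalyzer()

def label_sentiment(text: str) -> str:
    # VADER's 'compound' score ranges from -1 (most negative) to +1 (most positive)
    score = sia.polarity_scores(text)["compound"]
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"

print(label_sentiment("Gaming kept me sane during lockdown!"))  # Positive
```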


Sentiment Visualization

(Figure: sentiment distribution pie chart)

The pie chart shows that 46.47% of all tweets are positive, slightly below one half; neutral is the second most common at 36.68%, around one third; and the remaining 16.83% are negative.

(Figure: monthly sentiment trend line chart)

Where the pie chart showed the sentiment of all tweets combined, here the tweets are broken down by month. Negative tweets stay steady throughout the months, while the neutral and positive trends vary. Notably, all three sentiments record their highest counts in March (represented by 3 in the figure), which, as one would expect, follows the total number of tweets that month.

Topic Modeling

Topic modeling is a machine learning technique that automatically analyzes text data to discover clusters of words (topics) for a set of documents.

This is known as ‘unsupervised’ machine learning because it doesn’t require a predefined list of tags or training data that’s been previously classified by humans.

Since topic modeling doesn’t require training, it’s a quick and easy way to start analyzing your data. However, you can’t guarantee you’ll receive accurate results.

It’s simple, really. Topic modeling involves counting words and grouping similar word patterns to infer topics within unstructured data. For example, instead of spending hours going through heaps of feedback trying to deduce which texts talk about your topics of interest, you could analyze them with a topic modeling algorithm. By detecting patterns such as word frequency and distance between words, a topic model clusters feedback that is similar, along with the words and expressions that appear most often. With this information, you can quickly deduce what each set of texts is talking about. Remember, this approach is ‘unsupervised’, meaning that no training is required.

Latent Dirichlet Allocation (LDA) is based on the same underlying assumptions: the distributional hypothesis (i.e. similar topics make use of similar words) and the statistical mixture hypothesis (i.e. documents talk about several topics), for which a statistical distribution can be determined.

The purpose of LDA is to map each document in our corpus to a set of topics that covers most of the words in the document.
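
As a small sketch of the technique (using scikit-learn here purely for illustration; the project's own notebooks may use a different library, and the tiny corpus is made up):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "playing video games all day during lockdown",
    "covid cases rising again stay home stay safe",
    "new game release streaming live tonight",
]

# Bag-of-words counts feed the LDA model
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(X)

# Show the top words that characterize each inferred topic
terms = vec.get_feature_names_out()
for i, comp in enumerate(lda.components_):
    top = comp.argsort()[-5:][::-1]
    print(f"Topic {i}:", ", ".join(terms[j] for j in top))
```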

CLICK IT TO SEE MY TOPIC MODELING: pyldavis

🤝 Connect with me

LinkedIn | Kaggle | Instagram | Twitter | Gmail
