MicrosoftBing-search-query-prediction

Analysis and visualizations for COVID-19 Bing search-engine queries, plus a classifier pipeline for predicting the country of origin of a search query.

Home Page: https://souravsdlboy.pythonanywhere.com/bing_search

Problem statement:

The world is going through the COVID-19 pandemic caused by the novel coronavirus. With lockdowns in force, people across the world are mostly confined to their homes, and they are searching for all kinds of coronavirus-related information using popular search engines such as Google, Bing, Yahoo, and DuckDuckGo.

Now, the task is to explore people's intent using only the search queries issued between January and April 2020, and also to build a model that predicts the country from which a search query was issued. Search queries can reveal what is going on in people's minds during this pandemic, and exploring them at global as well as state-level granularity can help state and central authorities take appropriate action, ultimately helping and benefiting citizens.

Most people, and many authorities (government and private), were unprepared and underestimated the risk when the outbreak emerged, because nothing like it had been seen before, and so people panicked. Many became stressed or depressed due to salary cuts, job losses, or losing family members.
Others worried about masks, sanitizers, and food supplies (leading people to empty out malls and shops), and rumours about various vaccines and cures started circulating and needed to be fact-checked, while many people stayed silent out of depression or panic.
Analysing search data at such a scale can therefore be a great and effective tool for determining area-based, country-based, and similar measures to address the problems and situations described above.

Motivation:

As a Computer Engineering student with a deep interest in machine learning and deep learning, including NLP tasks, I am passionate about helping the researchers and scientists working day and night for us by contributing a technological solution with my knowledge. I have always believed AI can bring revolutionary ideas to the field of healthcare.

Dataset Collection:

After searching various platforms, I found a dataset relevant to this problem; it is currently the only dataset actually available for it, and the Microsoft Bing team has been very generous in providing it on their GitHub.
Dataset link: https://github.com/microsoft/BingCoronavirusQuerySet
Some info about the dataset (a loading sketch follows the list):

  • Data range: 2020-01-01 to 2020-04-30
  • All the private data has already been removed by the Bing Team.
  • Only searches that were issued many times by multiple users were included.
  • Dataset includes queries from all over the world that had an intent related to the Coronavirus or Covid-19.
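A minimal sketch of loading one of the dataset's tab-separated files with pandas. The file name and directory layout below are assumptions on my part; check the BingCoronavirusQuerySet repo for the real paths and column names.

```python
import pandas as pd

# Hypothetical file path -- verify the actual layout in the
# microsoft/BingCoronavirusQuerySet repository before running.
URL = ("https://raw.githubusercontent.com/microsoft/BingCoronavirusQuerySet/"
       "master/data/2020/QueriesByCountry_2020-01-01_2020-01-31.tsv")

# Files in the dataset are tab-separated.
df = pd.read_csv(URL, sep="\t")
print(df.shape)
print(df.head())
```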

Methods Explored :

  • Bag of Words model
  • N-grams model
  • tf-idf model
  • Continuous Bag of Words (CBOW) model
  • Skip-Gram model
  • Word2Vec model (a gensim sketch follows this list)
    The CBOW and Skip-Gram architectures are used to construct the training samples, which form both the embedding matrix and the context matrix (neighbouring words within the window).
    Supervised learning is then done on (input sample, target) pairs, and an error function is minimized to learn new embeddings for the input samples.

Here, the window size is an important hyperparameter which controls how many neighbouring words count as context.

The embedding matrix and the context matrix are used in the neural network architecture to learn the word vectors (embeddings).

Input words are one-hot encoded binary vectors of vocabulary size, and the same holds for the output vector.

Backpropagation is then used during neural network training to update the parameters from which the embeddings are learned.
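For illustration, here is a minimal, self-contained sketch (not taken from this repo) of how (center, context) training pairs can be constructed, with the window size controlling the context:

```python
# Construct (center, context) Skip-Gram training pairs from a token list.
# `window` is the context-window hyperparameter discussed above.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(["covid", "symptoms", "fever", "cough"], window=1))
# [('covid', 'symptoms'), ('symptoms', 'covid'), ('symptoms', 'fever'),
#  ('fever', 'symptoms'), ('fever', 'cough'), ('cough', 'fever')]
```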


The embedding matrix ends up holding the learnt embeddings after the training process has been cycled a number of times.
NOTE : The training samples constructed above are all positive samples, i.e. they contain correct context words, so the target is always "1". In reality we also do negative sampling: randomly sampling words from the vocabulary and giving them a target of "0" as incorrect context words. Without this, the model might simply predict "1" for any context word; with negative sampling, it actually has to learn the semantic relationships.

Figures: training-sample construction using Skip-Gram and using CBOW.

  • Clustering techniques
  • Classification techniques
  • WordCloud and more...
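Since the repo does not show its Word2Vec training code, here is a minimal sketch using gensim (the library choice is my assumption) that exposes the window-size and negative-sampling hyperparameters discussed above:

```python
from gensim.models import Word2Vec

# Toy tokenized queries for illustration.
sentences = [
    ["coronavirus", "symptoms", "fever", "cough"],
    ["covid", "lockdown", "rules", "india"],
    ["coronavirus", "vaccine", "news"],
]

# sg=1 selects Skip-Gram (sg=0 would select CBOW); `window` is the
# context-window size and `negative` the number of negative samples.
model = Word2Vec(sentences, vector_size=50, window=2, sg=1,
                 negative=5, min_count=1, epochs=20)

print(model.wv.most_similar("coronavirus", topn=2))
```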

Bag of Words model :

Characterizes the frequency of terms appearing within a document and across documents, without being concerned with word order.
(This is also the input to the models used in the classification pipeline.)

CountVectorizer and TfidfVectorizer (see the sketch below)
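A minimal sketch of both scikit-learn vectorizers on toy queries; setting `ngram_range=(1, 2)` also covers the N-grams model listed earlier:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

queries = [
    "coronavirus symptoms fever",
    "coronavirus vaccine news",
    "lockdown rules",
]

# Raw term counts (Bag of Words); ngram_range=(1, 2) adds bigrams.
bow = CountVectorizer(ngram_range=(1, 2))
X_bow = bow.fit_transform(queries)
print(X_bow.shape, bow.get_feature_names_out()[:5])

# tf-idf weighting: down-weights terms that appear in many queries.
tfidf = TfidfVectorizer(ngram_range=(1, 2))
X_tfidf = tfidf.fit_transform(queries)
print(X_tfidf.shape)
```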

Preprocessing steps :

  • Tokenization : splitting the raw query text into individual tokens (words).

  • Cleaning : removing punctuation, stopwords, HTML tags, XML tags, etc.

  • Normalization : converting dates, other symbols, acronyms, and abbreviations to text.

  • Stemming : chopping words down to their root form (e.g. "searching" → "search").

  • Lemmatization : reducing words to their dictionary form using vocabulary and morphology (e.g. "better" → "good").

  • See the reference links at the end for more details; a combined NLTK sketch of these steps follows.
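A minimal sketch of the steps above using NLTK (my choice for illustration; the repo may use different libraries):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the required NLTK resources.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(query):
    # Cleaning: lowercase, then strip everything except letters and spaces.
    query = re.sub(r"[^a-z\s]", " ", query.lower())
    # Tokenization + stopword removal.
    tokens = [t for t in nltk.word_tokenize(query) if t not in stop_words]
    # Stemming and lemmatization (usually you would pick one of the two).
    print("stemmed:   ", [stemmer.stem(t) for t in tokens])
    print("lemmatized:", [lemmatizer.lemmatize(t) for t in tokens])
    return tokens

preprocess("Searching for COVID-19 symptoms near me!")
```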

Modelling :

Models : (figure omitted; a minimal pipeline sketch follows)
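As a sketch of what the classification pipeline can look like: the repo's figure listing its actual models is omitted above, so LogisticRegression and the toy data here are my assumptions, not the repo's method.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data for illustration -- the real labels come from the
# dataset's country column.
queries = ["symptoms of coronavirus", "lockdown rules extension",
           "covid vaccine trial news"]
countries = ["United States", "India", "United Kingdom"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # Bag of Words + tf-idf features
    ("model", LogisticRegression(max_iter=1000)),    # assumed classifier
])
clf.fit(queries, countries)
print(clf.predict(["coronavirus symptoms fever"]))
```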

Deployment and Hosting:

Live WebApp : https://souravsdlboy.pythonanywhere.com/bing_search

Exploring Data with Sample visualizations :

(Sample visualizations omitted here.)

NOTE : The visualizations are available on the live webapp.

A few links for reference :

⭐️ this project if you liked it!

NOTE : Some images are taken from Google Images; copyrights remain with their respective owners.