Vidhi1290 / Word2Vec-and-FastText-Word-Embedding-with-Gensim-in-Python

This project explores the realm of Natural Language Processing (NLP) using Word2Vec and FastText models. Dive into domain-specific embeddings, analyze clinical trials data related to Covid-19, and uncover the power of AI and ML in understanding textual data.🌟

Word2Vec and FastText Word Embeddings with Gensim in Python

🚀 Business Objective

In Natural Language Processing (NLP), recovering context from raw text is a hard problem. Word embeddings address it by representing each word as a dense vector, so that words used in similar contexts end up close together. This project aims to construct domain-specific medical word embeddings using Word2Vec and FastText in Python.

📊 Data Description

The project uses a clinical trials dataset focused on Covid-19, obtained from Dimensions COVID-19 Publications, Datasets, and Clinical Trials. The dataset comprises 10,666 rows and 21 columns; the work focuses on the 'Title' and 'Abstract' columns.

🎯 Aim

The primary objective is to train Word2Vec (Skip-gram) and FastText models for word embeddings, then build a search engine on top of them with a Streamlit UI.

🛠️ Tech Stack

  • Language: Python
  • Libraries: Pandas, NumPy, Matplotlib, Plotly, Gensim, Streamlit, NLTK
  • Environment: Jupyter Notebook

🔍 Approach

  1. Import Essential Libraries.
  2. Read the Dataset.
  3. Pre-process the Data:
    • Remove URLs
    • Convert text to lowercase
    • Remove numerical values
    • Remove punctuation
    • Tokenization
    • Remove stop words
    • Lemmatization
    • Remove '\n' character from columns
  4. Conduct Exploratory Data Analysis (EDA):
    • Word cloud visualization
  5. Train the 'Skip-gram' Model.
  6. Train the 'FastText' Model.
  7. Inspect the Model Embeddings and Assess Word Similarity.
  8. Generate PCA Plots for Skip-gram and FastText Models.
  9. Convert Abstract and Title to Vectors using the Skip-gram and FastText Models.
  10. Utilize the Cosine Similarity Function.
  11. Pre-process the Input Query.
  12. Define a Function to Return Top 'n' Similar Results.
  13. Evaluate Results.
  14. Deploy the Streamlit Application.
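
The pre-processing steps in item 3 can be sketched with the standard library alone. This is a simplified stand-in under stated assumptions: the stop-word set is a tiny hand-written sample rather than NLTK's list, lemmatization is left out because it needs an external lemmatizer, newline removal is done first for convenience, and the function name `preprocess` is hypothetical.

```python
import re
import string

# Tiny illustrative stop-word set (assumption); the project lists NLTK in
# its stack, whose full English stop-word list would normally be used.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on",
              "the", "to", "with"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
# Map every punctuation character to a space so joined words get split.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text: str) -> list[str]:
    """Return cleaned tokens for one title or abstract."""
    text = text.replace("\n", " ")      # remove '\n' characters
    text = URL_PATTERN.sub(" ", text)   # remove URLs
    text = text.lower()                 # convert to lowercase
    text = re.sub(r"\d+", " ", text)    # remove numerical values
    text = text.translate(PUNCT_TABLE)  # remove punctuation
    tokens = text.split()               # whitespace tokenization
    # Lemmatization is omitted in this sketch.
    return [tok for tok in tokens if tok not in STOP_WORDS]


sample = "A Phase 3 trial of the COVID-19 vaccine\nin adults: see https://example.org/trial."
print(preprocess(sample))
# → ['phase', 'trial', 'covid', 'vaccine', 'adults', 'see']
```

The same function can be reused for the input query in step 11, so that queries and documents are tokenized identically before they are converted to vectors.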

📝 Project Takeaways

  1. Understanding the business problem.
  2. Grasping the architecture to build the Streamlit application.
  3. Mastery of Word2Vec and FastText models.
  4. Importing datasets and necessary libraries.
  5. Data Pre-processing.
  6. Basic Exploratory Data Analysis (EDA).
  7. Training Skip-gram model with varying parameters.
  8. Training FastText model with varying parameters.
  9. Understanding and implementing the embedding models.
  10. Plotting PCA plots.
  11. Obtaining vectors for each attribute.
  12. Executing the Cosine similarity function.
  13. Input query pre-processing.
  14. Result evaluation.
  15. Building a function to return top 'n' similar results for a given query.
  16. Understanding the Streamlit application code.
  17. Deployment of the Streamlit application.
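
The search-related items above (obtaining vectors, cosine similarity, and returning the top 'n' results) can be illustrated with a small pure-Python sketch. It assumes documents are represented by the average of their word vectors, which is one common choice and not necessarily the one used in the notebook. The `WORD_VECTORS` table is invented and stands in for a trained model's lookup, and all function names are hypothetical.

```python
import math

# Invented 3-dimensional vectors standing in for a trained model's lookup
# (for example `model.wv` in Gensim). Real embeddings have more dimensions.
WORD_VECTORS = {
    "vaccine": [0.9, 0.1, 0.0],
    "efficacy": [0.8, 0.2, 0.1],
    "mask": [0.1, 0.9, 0.0],
    "ventilation": [0.0, 0.8, 0.3],
    "trial": [0.5, 0.5, 0.5],
}


def document_vector(tokens):
    """Average the vectors of known tokens; return None if none are known."""
    known = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    if not known:
        return None
    return [sum(dim) / len(known) for dim in zip(*known)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_n(query_tokens, documents, n=2):
    """Return up to n (title, score) pairs, most similar first."""
    query_vec = document_vector(query_tokens)
    if query_vec is None:
        return []
    scored = []
    for title, tokens in documents.items():
        doc_vec = document_vector(tokens)
        if doc_vec is not None:
            scored.append((title, cosine_similarity(query_vec, doc_vec)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


documents = {
    "Vaccine efficacy trial": ["vaccine", "efficacy", "trial"],
    "Mask and ventilation study": ["mask", "ventilation"],
}
print(top_n(["vaccine"], documents, n=1))
```

With these toy vectors the query "vaccine" ranks "Vaccine efficacy trial" first. A query whose tokens are all unknown returns an empty list, a case FastText largely avoids because it can build vectors for unseen words.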

🔗 Get Connected

For more insightful projects and collaboration, connect with me on:

Languages

  • Jupyter Notebook: 99.5%
  • Python: 0.5%