fasttext-embeddings genism jupyter-notebook machine-learning matplotlib nlp nltk numpy pandas plotly python streamlit word2vec

Word2Vec and FastText Word Embeddings with Gensim in Python

🚀 Business Objective

In Natural Language Processing (NLP), extracting context from raw text is a hard problem. Word embeddings address it by mapping words to dense vectors in which semantically related words lie close together. This project builds domain-specific medical word embeddings using Word2Vec and FastText in Python.

📊 Data Description

The project uses a clinical trials dataset focused on Covid-19, obtained from Dimensions COVID-19 Publications, Datasets, and Clinical Trials. The dataset comprises 10666 rows and 21 columns; the work concentrates on the 'Title' and 'Abstract' columns.

🎯 Aim

The primary objective is to train Skip-gram (Word2Vec) and FastText word-embedding models, then build a search engine on top of them with a Streamlit UI.
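
As a rough illustration of the training step, the sketch below trains both model types with Gensim. It assumes the Gensim 4.x API (where the dimensionality parameter is `vector_size`); the hyperparameter values and the helper name `train_embeddings` are illustrative assumptions, not the settings used in the notebook.

```python
# Illustrative hyperparameters (assumptions, not the project's actual values).
# sg=1 selects the Skip-gram architecture instead of CBOW.
SKIPGRAM_PARAMS = {
    "vector_size": 50,
    "window": 3,
    "min_count": 1,
    "sg": 1,
    "seed": 42,
    "workers": 1,
}

# FastText additionally learns character n-grams of length min_n..max_n.
FASTTEXT_PARAMS = {**SKIPGRAM_PARAMS, "min_n": 3, "max_n": 6}


def train_embeddings(sentences):
    """Train a Skip-gram Word2Vec model and a FastText model.

    `sentences` is a list of token lists, e.g. [["covid", "vaccine"], ...].
    """
    # Imported lazily so this module still loads where gensim is not installed.
    from gensim.models import FastText, Word2Vec

    skipgram = Word2Vec(sentences=sentences, **SKIPGRAM_PARAMS)
    fasttext = FastText(sentences=sentences, **FASTTEXT_PARAMS)
    return skipgram, fasttext
```

After training, `skipgram.wv.most_similar("vaccine", topn=5)` lists the nearest neighbours of a word that is in the vocabulary, and `fasttext.wv["some_unseen_word"]` can still return a vector for a word that never appeared in training, because FastText composes vectors from character n-grams.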

🛠️ Tech Stack

Language: Python
Libraries: Pandas, NumPy, Matplotlib, Plotly, Gensim, Streamlit, NLTK
Environment: Jupyter Notebook

🔍 Approach

Import Essential Libraries.
Read the Dataset.
Pre-process the Data:
- Remove URLs
- Convert text to lowercase
- Remove numerical values
- Remove punctuation
- Tokenization
- Remove stop words
- Lemmatization
- Remove '\n' character from columns
Conduct Exploratory Data Analysis (EDA):
- Word cloud visualization
Train the 'Skip-gram' Model.
Train the 'FastText' Model.
Inspect the Model Embeddings and Assess Word Similarity.
Generate PCA Plots for Skip-gram and FastText Models.
Convert Abstract and Title to Vectors using the Skip-gram and FastText Models.
Utilize the Cosine Similarity Function.
Pre-process the Input Query.
Define a Function to Return Top 'n' Similar Results.
Evaluate Results.
Deploy the Streamlit Application.
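
The pre-processing steps listed above can be sketched with the standard library alone. This is a minimal sketch under stated assumptions: the stop-word set is a tiny hand-written stand-in for NLTK's stop-word list, lemmatization is left out because it needs the NLTK WordNet data, and the function name `preprocess` is hypothetical.

```python
import re
import string

# Tiny illustrative stop-word set (assumption). The project lists NLTK,
# whose stop-word corpus would normally be used instead.
STOP_WORDS = {
    "a", "an", "and", "the", "of", "in", "on", "for", "to", "with",
    "is", "are", "was", "were",
}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
DIGIT_RE = re.compile(r"\d+")
# Map every punctuation character to a space so words are not glued together.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text):
    """Return a list of cleaned tokens for one title or abstract."""
    text = URL_RE.sub(" ", text)        # remove URLs
    text = text.lower()                 # convert to lowercase
    text = DIGIT_RE.sub(" ", text)      # remove numerical values
    text = text.translate(PUNCT_TABLE)  # remove punctuation
    tokens = text.split()               # tokenize; also drops '\n' characters
    return [t for t in tokens if t not in STOP_WORDS]  # remove stop words


tokens = preprocess("A Trial of Remdesivir in 2020:\nsee https://example.org/x")
```

The same function would be applied to the user's search query, so that queries and documents are tokenized identically before being converted to vectors.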

📝 Project Takeaways

Understanding the business problem.
Grasping the architecture to build the Streamlit application.
Mastery of Word2Vec and FastText models.
Importing datasets and necessary libraries.
Data Pre-processing.
Basic Exploratory Data Analysis (EDA).
Training Skip-gram model with varying parameters.
Training FastText model with varying parameters.
Understanding and implementing embedding models.
Plotting PCA plots.
Obtaining vectors for each attribute.
Executing the Cosine similarity function.
Input query pre-processing.
Result evaluation.
Building a function to return top 'n' similar results for a given query.
Understanding the Streamlit application code.
Deployment of the Streamlit application.
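
The search-related takeaways (vectors per attribute, cosine similarity, top 'n' results) can be illustrated with a small pure-Python sketch. It assumes each document is represented by the average of its word vectors, which is one common choice and not necessarily what the notebook does; the toy embeddings, documents, and function names are all hypothetical. In practice the vectors would come from the trained models and NumPy would handle the arithmetic.

```python
import math


def average_vector(tokens, embeddings, dim):
    """Mean of the vectors of known tokens; a zero vector if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_n_similar(query_tokens, documents, embeddings, dim, n=3):
    """Return the n (score, title) pairs most similar to the query."""
    query_vec = average_vector(query_tokens, embeddings, dim)
    scored = []
    for title, tokens in documents:
        doc_vec = average_vector(tokens, embeddings, dim)
        scored.append((cosine_similarity(query_vec, doc_vec), title))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n]


# Toy 2-dimensional embeddings and documents, for illustration only.
TOY_EMBEDDINGS = {
    "vaccine": [1.0, 0.0],
    "immunization": [0.9, 0.1],
    "ventilator": [0.0, 1.0],
}
TOY_DOCUMENTS = [
    ("Vaccine trial", ["vaccine", "immunization"]),
    ("Ventilator study", ["ventilator"]),
]

results = top_n_similar(["vaccine"], TOY_DOCUMENTS, TOY_EMBEDDINGS, dim=2, n=1)
```

Here the query "vaccine" ranks the vaccine document first, because its averaged vector points in almost the same direction as the query vector.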
