Embedding Projector is a free web application that offers three common methods ("PCA", "t-SNE", and "custom linear projections") for visualizing high-dimensional data. It includes built-in examples for visualizing word embeddings in Natural Language Processing (NLP) and MNIST images in computer vision.
You may be wondering: what are PCA, t-SNE, and custom linear projections?
- PCA: Principal Component Analysis (PCA) is often effective at exploring the internal structure of the embeddings, revealing the most influential dimensions in the data.
- t-SNE: t-distributed Stochastic Neighbor Embedding (t-SNE) is useful for exploring local neighborhoods and finding clusters, allowing developers to check that an embedding preserves the meaning in the data.
- Custom Linear Projection: Helps discover meaningful "directions" in datasets, such as the distinction between a formal and a casual tone in a language generation model, which would allow the design of more adaptable ML systems.
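To make the first and last of these ideas a little more concrete, here is a minimal NumPy sketch. It is only an illustration on made-up data, not how Embedding Projector implements these views: the random "embeddings", the two groups, and the choice of the centroid difference as the custom direction are all assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for real embeddings: 100 points in 50 dimensions,
# split into two groups that are shifted apart from each other.
group_a = rng.normal(loc=0.0, scale=1.0, size=(50, 50))
group_b = rng.normal(loc=2.0, scale=1.0, size=(50, 50))
embeddings = np.vstack([group_a, group_b])

# PCA: center the data, then project onto the top two right-singular vectors.
centered = embeddings - embeddings.mean(axis=0)
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
pca_2d = centered @ vt[:2].T  # shape (100, 2)

# Fraction of the total variance captured by each of the two components.
explained = (singular_values[:2] ** 2) / np.sum(singular_values ** 2)

# Custom linear projection: choose a direction by hand. Here it is the unit
# vector pointing from the centroid of group A to the centroid of group B.
direction = group_b.mean(axis=0) - group_a.mean(axis=0)
direction /= np.linalg.norm(direction)
custom_1d = embeddings @ direction  # shape (100,)

print(pca_2d.shape, custom_1d.shape)
print(explained)
```

PCA lets the data pick the axes (the directions of largest variance), while a custom linear projection lets you pick the axis yourself, which is why it suits questions like "formal vs. casual".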
To add one more skill to my skill set, I experimented with a way to load sentence embeddings, along with their class labels, into this tool and explore them interactively. In this repo, I will explain the entire process with an example.
To make the use case concrete, let's take a subset of 200 movie reviews from the SST-2 dataset, each classified as positive or negative.
# import library
import pandas as pd
# import movie review dataset
df = pd.read_csv('http://bit.ly/dataset-sst2', nrows=200, sep='\t', names=['text', 'label'])
# map the numeric labels to readable names: 1 -> positive, 0 -> negative
df['label'] = df['label'].replace({1: 'positive', 0: 'negative'})
The dataframe contains the review text and a label indicating whether each movie review is positive or negative.
df.head()
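If you want to try the loading step without a network connection, the sketch below applies the same `read_csv` arguments to two made-up rows held in memory. It assumes, as the arguments above suggest, that the file has no header row and stores each review as text, a tab, and a 0/1 label; the sample sentences and the name `df_sample` are hypothetical.

```python
import io

import pandas as pd

# Two made-up rows in the assumed layout: text, a tab, then a 0/1 label.
sample = (
    "a charming and often affecting journey\t1\n"
    "the plot is a tedious mess\t0\n"
)

# Passing `names` tells pandas the data has no header row of its own.
df_sample = pd.read_csv(io.StringIO(sample), sep='\t', names=['text', 'label'])

# Same label mapping as above: 1 -> positive, 0 -> negative.
df_sample['label'] = df_sample['label'].replace({1: 'positive', 0: 'negative'})

print(df_sample)
print(df_sample['label'].value_counts())
```

Calling `value_counts()` on the real `df` in the same way is a quick check of how balanced the 200 reviews are.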
To add noise to our dataset, we will overwrite five of the reviews with random text. These rows will serve as outliers in our example.
Before Noise
df.loc[[10, 19, 154, 168, 181], 'text']
After Noise
df.loc[[10, 19, 154, 168, 181], 'text'] = 'asdfg qwerf zxcvb'
df.loc[[10, 19, 154, 168, 181], 'text']
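The same step can be wrapped in a small helper that works on a copy, so the original reviews stay available for comparison. This is just a sketch: `add_noise`, the `is_noise` flag column, and the four-row `toy` dataframe are hypothetical names introduced for illustration, not part of the dataset.

```python
import pandas as pd


def add_noise(frame, rows, noise_text='asdfg qwerf zxcvb'):
    """Return a copy of `frame` with the 'text' of the given rows overwritten."""
    noisy = frame.copy()
    noisy.loc[rows, 'text'] = noise_text
    # Flag the tampered rows so they are easy to find later.
    noisy['is_noise'] = noisy.index.isin(rows)
    return noisy


# Hypothetical miniature dataset standing in for the 200 reviews.
toy = pd.DataFrame({
    'text': ['great film', 'dull and slow', 'a joy to watch', 'not worth it'],
    'label': ['positive', 'negative', 'positive', 'negative'],
})

noisy_toy = add_noise(toy, rows=[1, 3])
print(noisy_toy)
```

On the real data, the call would be `add_noise(df, rows=[10, 19, 154, 168, 181])`. Keeping a flag column is a design choice: it gives you one more label to inspect the noisy rows by later on.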