id-shiv / text_process

text_tag

Problem Statement

  • For a given text, identify the most relevant tags.

Approach

  • For this project, we use the IMDB dataset of movie reviews and group the reviews into an optimal number of clusters.
  • To find the optimal number of clusters K, we compute the sum of squared distances for a range of K values, plot it against K, and pick the K at the elbow of the curve.
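
The elbow check described above can be sketched as follows. This is a minimal illustration, not the project's own code: it runs a plain NumPy k-means on stand-in 2-D points rather than real review embeddings, and the function names (`farthest_first_init`, `kmeans_sse`) and the deterministic farthest-first seeding are assumptions made for the example. scikit-learn's `KMeans` reports the same quantity as `inertia_`.

```python
import numpy as np


def farthest_first_init(points, k):
    """Deterministic seeding (an assumption for this sketch): start at the
    first point, then keep adding the point farthest from all chosen seeds."""
    chosen = [0]
    dist = ((points - points[0]) ** 2).sum(axis=1)
    for _ in range(1, k):
        idx = int(dist.argmax())
        chosen.append(idx)
        dist = np.minimum(dist, ((points - points[idx]) ** 2).sum(axis=1))
    return points[chosen].copy()


def kmeans_sse(points, k, n_iter=100):
    """Run Lloyd's k-means and return the sum of squared distances (SSE)."""
    centroids = farthest_first_init(points, k)
    for _ in range(n_iter):
        # Squared distance from every point to every centroid, shape (n, k).
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its members; keep it if the cluster is empty.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float(dists.min(axis=1).sum())


# Stand-in data: three well-separated 2-D blobs instead of review embeddings.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

curve = {k: kmeans_sse(points, k) for k in range(1, 7)}
for k, sse in curve.items():
    print(f"K={k}: SSE={sse:.1f}")
```

On this toy data the SSE falls sharply up to K=3 and flattens afterwards, which is the elbow. Plotting `curve` (for example with matplotlib) gives the visual version of the same check.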

Requirements

Configurations

DATA

DATA_PATH = '/Users/shiv/Documents/gitRepositories/iutils/input/data/IMDB Dataset.csv'
TEXT_COLUMN = 'review'
NUM_OF_SAMPLES = 100
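
A minimal sketch of how these three settings could be used to read the review texts. The helper `load_texts` is hypothetical, and the demo runs on a tiny temporary CSV (with a made-up `sentiment` column) so it does not depend on the real dataset path.

```python
import csv
import os
import tempfile


def load_texts(path, text_column, num_samples):
    """Return up to num_samples non-empty values from text_column of a CSV file."""
    texts = []
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            if len(texts) >= num_samples:
                break
            value = (row.get(text_column) or "").strip()
            if value:
                texts.append(value)
    return texts


# Demo on a small temporary file, so the sketch runs without the real dataset.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["review", "sentiment"])
        writer.writerow(["A tense, clever thriller.", "positive"])
        writer.writerow(["Flat characters, dull plot.", "negative"])
        writer.writerow(["Great score, weak ending.", "positive"])
    sample = load_texts(demo_path, text_column="review", num_samples=2)

print(sample)
```

With the configuration above, the call would be `load_texts(DATA_PATH, TEXT_COLUMN, NUM_OF_SAMPLES)`.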

ENCODER

import tensorflow_hub as hub
ENCODER_PATH = '/Users/shiv/Documents/gitRepositories/text_search/encoders/universal-sentence-encoder-large_5'
_encoder = hub.load(ENCODER_PATH)  # Load the encoder from the local path
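
The encoder turns each text into a fixed-length vector, so matching a search text to a cluster can be reduced to a nearest-centroid lookup. The sketch below is one plausible way to do that, not something the project confirms: `nearest_cluster` is a hypothetical helper, cosine similarity is an assumed choice of metric, and the 3-D vectors are stand-ins for real embeddings and cluster centroids.

```python
import numpy as np


def nearest_cluster(query_vec, centroids):
    """Return (index, cosine similarity) of the centroid closest to query_vec."""
    q = query_vec / np.linalg.norm(query_vec)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sims = c @ q  # cosine similarity of the query with each centroid
    best = int(np.argmax(sims))
    return best, float(sims[best])


# Stand-in 3-D vectors; real ones would come from the encoder and the clustering step.
centroids = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query_vec = np.array([0.1, 0.9, 0.2])

index, similarity = nearest_cluster(query_vec, centroids)
print(index, round(similarity, 3))
```

In a real run, `query_vec` would be the embedding of the search text and `centroids` the cluster centres found during clustering.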

Results

NOTE: Results depend on the number of samples in the dataset. The current project could be improved with some pre-processing of the text.

Search Text = 'psychological thriller is what i like'

Languages

Language: Python 100.0%