selimelawwa/texthero

Text preprocessing, representation and visualization from zero to hero.

From zero to hero • Installation • Getting Started • Examples • API • FAQ • Contributions

From zero to hero

Texthero is a Python toolkit that helps you work with text-based datasets quickly and effortlessly. Texthero is very simple to learn and designed to be used on top of Pandas.

You can think of Texthero as a tool to help you understand and work with text-based datasets. Given a tabular dataset, it's easy to grasp the main concepts. Given a text dataset, however, it's harder to get quick insights into the underlying data.

With Texthero, preprocessing text data, mapping it into vectors and visualizing the resulting vector space takes only a couple of lines.

Texthero is composed of only three Python modules (preprocessing.py, representation.py and visualization.py), and it's well documented.

Installation

Install texthero via pip:

pip install texthero

☝️ Under the hood, Texthero makes use of multiple NLP and machine learning toolkits such as Gensim, NLTK, spaCy and scikit-learn. You don't need to install them separately; pip will take care of that.

For fast performance, make sure you have spaCy version >= 2.1 installed!

Suggested python version: 3.7.7.

Getting started

The best way to learn Texthero is through the Getting Started docs.

If you are an advanced Python user, help(texthero) should do the job.

Examples

1. Text cleaning, TF-IDF representation and visualization

import texthero as hero
import pandas as pd

df = pd.read_csv(
    "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
)

df['pca'] = (
    df['text']
    .pipe(hero.clean)
    .pipe(hero.tfidf)
    .pipe(hero.pca)
)
hero.scatterplot(df, 'pca', color='topic', title="PCA BBC Sport news")

2. Text preprocessing, TF-IDF, K-means and visualization

import texthero as hero
import pandas as pd

df = pd.read_csv(
    "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
)

df['tfidf'] = (
    df['text']
    .pipe(hero.clean)
    .pipe(hero.tfidf)
)

df['kmeans_labels'] = (
    df['tfidf']
    .pipe(hero.kmeans, n_clusters=5)
    .astype(str)
)

df['pca'] = df['tfidf'].pipe(hero.pca)

hero.scatterplot(df, 'pca', color='kmeans_labels', title="K-means BBC Sport news")

3. Simple pipeline for text cleaning

Say we have some dirty text data that we want to clean.

>>> import texthero as hero
>>> import pandas as pd
>>> text = "This sèntencé    (123 /) needs to [OK!] be cleaned!   "
>>> s = pd.Series(text)
>>> s
0    This sèntencé    (123 /) needs to [OK!] be cleane...
dtype: object

Remove all digits:

>>> s = hero.remove_digits(s)
>>> s
0    This sèntencé    (  /) needs to [OK!] be cleaned!
dtype: object

By default, remove_digits removes only blocks of digits: the digits in the string "hello123" will not be removed. To remove all digits, set the argument only_blocks to False.

Remove all types of brackets and their content.
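The difference between the two modes can be sketched with the standard library. The helper below is hypothetical and works on a plain string rather than a Pandas Series; the regular expressions are an assumption about the behaviour described above, not necessarily Texthero's own implementation.

```python
import re


def strip_digits(text: str, only_blocks: bool = True) -> str:
    """Replace digits with a single space (illustrative helper, not part of Texthero)."""
    # \b\d+\b only matches runs of digits that stand alone as a word,
    # so the digits glued to "hello" in "hello123" are left untouched.
    pattern = r"\b\d+\b" if only_blocks else r"\d+"
    return re.sub(pattern, " ", text)


blocks_only = strip_digits("hello123 and 456")
all_digits = strip_digits("hello123 and 456", only_blocks=False)

print(repr(blocks_only))  # "hello123" survives, the standalone "456" does not
print(repr(all_digits))   # every run of digits is replaced
```

With Texthero itself, the equivalent switch is the only_blocks argument mentioned above.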

>>> s = hero.remove_brackets(s)
>>> s 
0    This sèntencé    needs to  be cleaned!
dtype: object

Remove diacritics.

>>> s = hero.remove_diacritics(s)
>>> s 
0    This sentence    needs to  be cleaned!
dtype: object
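For readers curious about what removing diacritics involves, here is a minimal sketch on a plain string using only the standard library. The function name is hypothetical, and Unicode decomposition is one common way to do this; it is not claimed to be how Texthero does it.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop accents while keeping the base letters (illustrative helper)."""
    # NFKD decomposition splits "è" into "e" plus a combining grave accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep every character that is not a combining mark.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("This sèntencé needs to be cleaned!"))
```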

Remove punctuation.

>>> s = hero.remove_punctuation(s)
>>> s 
0    This sentence    needs to  be cleaned
dtype: object

Remove extra white-spaces.

>>> s = hero.remove_whitespace(s)
>>> s 
0    This sentence needs to be cleaned
dtype: object

Sometimes we also want to get rid of stop-words.

>>> s = hero.remove_stopwords(s)
>>> s
0    This sentence needs cleaned
dtype: object
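Note that "This" is kept in the output above, which is consistent with case-sensitive matching against a lowercase stop-word list. The sketch below illustrates that idea on a plain string; the helper and the three-word stop-word set are made up for the example and are much smaller than any real list.

```python
def drop_stopwords(text: str, stopwords: set) -> str:
    """Remove whitespace-separated tokens found in `stopwords` (illustrative helper)."""
    # The comparison is case-sensitive: "This" does not match "this".
    return " ".join(word for word in text.split() if word not in stopwords)


# Tiny hypothetical stop-word set, for illustration only.
SAMPLE_STOPWORDS = {"this", "to", "be"}

print(drop_stopwords("This sentence needs to be cleaned", SAMPLE_STOPWORDS))
# Lowercasing first would also remove the leading "this".
print(drop_stopwords("This sentence needs to be cleaned".lower(), SAMPLE_STOPWORDS))
```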

API

Texthero is composed of three modules: preprocessing.py, representation.py and visualization.py.

1. Preprocessing

Scope: prepare the text data for further analysis.

Full documentation: preprocessing

2. Representation

Scope: map text data into vectors and do dimensionality reduction.

Supported representation algorithms:

Term frequency (count)
Term frequency-inverse document frequency (tfidf)
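To make the two representations concrete, here is a small pure-Python sketch. It uses the textbook weighting tf × log(N / df), which is an assumption made for illustration; the exact formula and normalization used by Texthero or the libraries underneath it may differ. The function names and sample documents are hypothetical.

```python
import math
from collections import Counter


def term_frequency(docs):
    """Count how often each whitespace-separated term occurs in each document."""
    return [Counter(doc.split()) for doc in docs]


def tfidf_weights(docs):
    """Weight each term count by log(N / df), the textbook inverse document frequency."""
    counts = term_frequency(docs)
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc_counts in counts for term in doc_counts)
    return [
        {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in doc_counts.items()}
        for doc_counts in counts
    ]


docs = ["football match report", "cricket match report", "football transfer news"]
weights = tfidf_weights(docs)

# "cricket" appears in one document only, so it outweighs the more common "match".
print(weights[1])
```

Terms that are rare across the collection receive a larger weight than terms that appear in most documents, which is the reason TF-IDF is often preferred over raw counts.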

Supported clustering algorithms:

K-means (kmeans)
Density-Based Spatial Clustering of Applications with Noise (dbscan)
Meanshift (meanshift)

Supported dimensionality reduction algorithms:

Principal component analysis (pca)
t-distributed stochastic neighbor embedding (tsne)
Non-negative matrix factorization (nmf)

Full documentation: representation

3. Visualization

Scope: summarize the main facts regarding the text data and visualize it. This module is opinionated. It's handy for anyone who needs a quick way to visualize text data on screen, for instance during exploratory data analysis (EDA) of text.

Supported functions:

Text scatterplot (scatterplot)
Most common words (top_words)

Full documentation: visualization
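The idea behind a "most common words" summary can be sketched in a few lines of standard-library Python. The helper below is hypothetical and works on a list of strings; Texthero's top_words operates on a Pandas Series, and its exact options and return type are covered in the visualization documentation rather than here.

```python
from collections import Counter


def most_common_words(texts, n=3):
    """Return the n most frequent lowercase tokens across all texts (illustrative helper)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return counts.most_common(n)


texts = ["football match report", "cricket match", "football match news"]
print(most_common_words(texts, n=2))
# [('match', 3), ('football', 2)]
```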

FAQ

Why Texthero?

Because I couldn't find anything like it.

What is Texthero in a nutshell?

When you get a bunch of text data, chances are it will need some cleaning and you will want to understand it somehow. Texthero helps you do that very efficiently.

I'm not an NLP expert, is texthero for me?

Yes, it is. Texthero is very easy to use and has also been designed with beginners in mind.

Contributions

Pull requests are amazing and most welcome. Start by forking this repository and opening an issue.

Texthero is also looking for maintainers and contributors. If you are interested, just drop a line at jonathanbesomi__AT__gmail.com
