dpoulopoulos / forma

Automatic format error detection on tabular data.

Home Page: https://dpoulopoulos.github.io/forma/

Forma

Forma is an open-source Python library for automatic, domain-agnostic detection of format errors in tabular data. It is a by-product of the BigDataStack research project.

Install

Run pip install forma to install the library in your environment.

How to use

We will work with the popular MovieLens dataset.

import pandas as pd

# load the data; the multi-character '::' delimiter requires the python parsing engine
col_names = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings_df = pd.read_csv('../data/ratings.dat', delimiter='::', names=col_names, engine='python')

# preview the first rows
ratings_df.head()
   user_id  movie_id  rating  timestamp
0        1      1193       5  978300760
1        1       661       3  978302109
2        1       914       3  978301968
3        1      3408       4  978300275
4        1      2355       5  978824291
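
The loading step can be tried without downloading the dataset. The following is a minimal sketch that assumes ratings.dat stores one rating per line with fields separated by '::', as the delimiter argument above suggests; the in-memory sample and the name sample_df are illustrative and not part of the original notebook.

```python
import io

import pandas as pd

# Five sample lines in the assumed '::'-separated layout of ratings.dat
# (values copied from the preview above).
sample = io.StringIO(
    "1::1193::5::978300760\n"
    "1::661::3::978302109\n"
    "1::914::3::978301968\n"
    "1::3408::4::978300275\n"
    "1::2355::5::978824291\n"
)

col_names = ['user_id', 'movie_id', 'rating', 'timestamp']

# A separator longer than one character is interpreted by pandas as a
# regular expression, which only the python parsing engine supports.
sample_df = pd.read_csv(sample, delimiter='::', names=col_names, engine='python')

print(sample_df)
```

Passing engine='python' explicitly avoids the warning pandas emits when it has to fall back from the default C engine for a multi-character separator.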

Let us introduce a few artificial format errors.

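# work on a string-typed copy so that malformed values can be stored
dirty_df = ratings_df.astype('str').copy()

# set each cell with a single .loc call (chained indexing can silently fail to write)
dirty_df.loc[3, 'timestamp'] = '9783000275'  # one digit too many
dirty_df.loc[2, 'movie_id'] = '914.'         # trailing dot
dirty_df.loc[4, 'rating'] = '10'             # two digits instead of one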
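
Cell assignments on a DataFrame can silently fail to write when the indexing is chained, so it can be worth confirming that the three edits actually landed. The sketch below rebuilds the five preview rows by hand (the names clean_df, dirty and changed_cells are illustrative, not part of the original notebook) and lists every cell that differs from the clean copy.

```python
import pandas as pd

# The five preview rows, stored as strings like dirty_df in the walkthrough.
clean_df = pd.DataFrame({
    'user_id': [1, 1, 1, 1, 1],
    'movie_id': [1193, 661, 914, 3408, 2355],
    'rating': [5, 3, 3, 4, 5],
    'timestamp': [978300760, 978302109, 978301968, 978300275, 978824291],
}).astype('str')

# Inject the same three format errors, one .loc call per cell.
dirty = clean_df.copy()
dirty.loc[3, 'timestamp'] = '9783000275'
dirty.loc[2, 'movie_id'] = '914.'
dirty.loc[4, 'rating'] = '10'

# Boolean mask marking the cells whose value differs from the clean copy.
changed = dirty.ne(clean_df)

# Collect the (row, column) label of every changed cell, row by row.
changed_cells = [
    (row, col)
    for row in changed.index
    for col in changed.columns
    if changed.loc[row, col]
]

print(changed_cells)  # [(2, 'movie_id'), (3, 'timestamp'), (4, 'rating')]
```

An empty list here would mean that none of the assignments reached the DataFrame.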

Initialize the detector, fit it, and run detection. The result is a pandas DataFrame with an extra column, p, which records the probability that a row contains a format error. Among the rows shown, the three where we introduced artificial mistakes (2, 3, and 4) receive higher probabilities than the untouched ones.

import numpy as np
# FormatDetector and PatternGenerator are provided by the forma package

# initialize detector
detector = FormatDetector()
# fit detector
detector.fit(dirty_df, generator=PatternGenerator(), n=3)
# detect error probability
assessed_df = detector.detect(reduction=np.mean)

# visualize results
assessed_df.head()
100%|██████████| 4/4 [02:58<00:00, 44.58s/it]
100%|██████████| 1000209/1000209 [07:28<00:00, 2230.59it/s]
  user_id movie_id rating   timestamp         p
0       1     1193      5   978300760  0.319957
1       1      661      3   978302109  0.456679
2       1     914.      3   978301968  0.509287
3       1     3408      4  9783000275  0.550982
4       1     2355     10   978824291  0.569957
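
One way to act on the p column is to rank rows by it and review those above a cut-off. The sketch below is a post-processing step in plain pandas and does not call forma. It rebuilds the five rows shown above by hand, and THRESHOLD is a hypothetical value picked for this small sample rather than a forma default; a sensible cut-off for the full dataset would need to be chosen by inspection.

```python
import pandas as pd

# The five assessed rows from the output above, rebuilt by hand.
assessed = pd.DataFrame({
    'user_id': ['1', '1', '1', '1', '1'],
    'movie_id': ['1193', '661', '914.', '3408', '2355'],
    'rating': ['5', '3', '3', '4', '10'],
    'timestamp': ['978300760', '978302109', '978301968', '9783000275', '978824291'],
    'p': [0.319957, 0.456679, 0.509287, 0.550982, 0.569957],
})

THRESHOLD = 0.5  # hypothetical cut-off for illustration, not a forma default

# Keep rows whose error probability exceeds the cut-off, most suspicious first.
suspects = assessed[assessed['p'] > THRESHOLD].sort_values('p', ascending=False)

print(suspects.index.tolist())  # [4, 3, 2]
```

On this sample the cut-off isolates exactly the three rows that were modified, though the margin over the clean row with p = 0.456679 is small.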

License: Apache License 2.0

