otiai10 / dataviper

Data Profiling and Basic Quality Assessment tool, for the very beginning phase of your project.

Home Page: https://pypi.org/project/dataviper/



dataviper

[Badges: PyPI version | GitHub Actions | codecov | Python 3.6]

dataviper is a SQL-based tool that makes basic data preparation easy. It can

  • Create a "Data Profile" report of a table
  • One-hot encode "Categorical Columns" and create a "one-hot" table
  • Check "Joinability" between 2 tables
  • // TODO: and more

Example

pip install dataviper

Your main.py will look like this

from dataviper import Client
from dataviper.source import MySQL

client = Client(source=MySQL({
    'host': 'localhost',
    'user': 'root',
    'password': 'xxxxxx',
    'database': 'your_database'
}))

with client.connect():
    profile = client.profile('Your_Table')
    profile.to_excel()
python3 main.py
# Then you will get 'profile_Your_Table.xlsx' 🤗


Why?

"Data Profiling" requires scanning every row of a table. If you do this naively with pandas, or with any library that uses pandas internally, you will inevitably burn through your local machine's memory and freeze your work.

dataviper is a SQL-based Data Profiling tool: it dynamically generates the SQL and lets the database server do the heavy calculation.

With dataviper, you don't need a massive local computer. All you need is a stable network and a reachable SQL database.
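
The snippet below is only a sketch of that idea, not dataviper's actual generated SQL: an aggregate query like this keeps the full-table scan on the server, and only a single summary row comes back over the network.

# Illustrative sketch (not dataviper's real internals): profile one numeric
# column of the Sales table entirely on the database server.
summary_sql = """
SELECT
    COUNT(*)                                       AS total_rows,
    SUM(CASE WHEN price IS NULL THEN 1 ELSE 0 END) AS null_count,
    COUNT(DISTINCT price)                          AS unique_count,
    MIN(price)                                     AS min_price,
    MAX(price)                                     AS max_price,
    AVG(price)                                     AS avg_price
FROM Sales
"""
# Run with any DB driver, this returns one row of statistics instead of
# streaming every row into local memory for pandas to crunch.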

Data Sources

You can choose your data source from

  • SQL Server
    • profile
    • pivot
    • joinability
    • histogram
  • MySQL
    • profile
    • pivot
    • joinability
    • histogram
  • PostgreSQL
  • CSV
    • profile
    • pivot
    • joinability
    • histogram
  • Excel

APIs

profile

Create "Data Profile" excel file of a specified table.

When you have Sales table like this

| id | region | sales_type | price | rep_id |
|----|--------|------------|-------|--------|
| 1  | jp     | phone      | 240   | 115723 |
| 2  | us     | web        | 90    | 125901 |
| 3  | jp     | web        | 560   | 8003   |
| 4  | us     | shop       | 920   | 182234 |
| 5  | jp     | NULL       | 90    | 92231  |
| 6  | us     | shop       | 180   | 100425 |
| 7  | us     | shop       | 70    | 52934  |

do

with client.connect() as conn:
    table_name = 'Sales'
    profile = client.profile(table_name, example_count=3)
    profile.to_excel()

then you will get a profile_Sales.xlsx file with

| column_name | data_type | null_count | null_% | unique_count | unique_% | min  | max    | avg        | std         | example_top_3        | example_last_3       |
|-------------|-----------|------------|--------|--------------|----------|------|--------|------------|-------------|----------------------|----------------------|
| id          | bigint    | 0          | 0      | 7            | 100.00   | 1    | 7      | 4.0        | 2.0         | [1,2,3]              | [5,6,7]              |
| region      | varchar   | 0          | 0      | 2            | 28.57    |      |        |            |             | [jp,us,jp]           | [jp,us,us]           |
| sales_type  | varchar   | 1          | 14.28  | 3            | 42.85    |      |        |            |             | [phone,web,web]      | [None,shop,shop]     |
| price       | int       | 0          | 0      | 6            | 85.71    | 70   | 920    | 307.1428   | 295.379     | [240,90,560]         | [90,180,70]          |
| rep_id      | int       | 0          | 0      | 7            | 100.00   | 8003 | 182234 | 96778.7142 | 51195.79065 | [115723,125901,8003] | [92231,100425,52934] |
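
Because the report is a plain Excel file, you can pull it back into pandas for a quick follow-up check. A minimal sketch, assuming the sheet keeps the column names shown above:

import pandas as pd

# Load the generated report and flag columns that may need attention,
# e.g. anything with more than 10% NULL values.
report = pd.read_excel('profile_Sales.xlsx')
needs_review = report[report['null_%'] > 10]
print(needs_review[['column_name', 'null_count', 'null_%']])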

pivot

Spread categorical columns into N binary columns.

When you have a Sales table like the one above, do

with client.connect() as conn:
    table_name = 'Sales'
    key = 'id'
    categorical_columns = ['region', 'sales_type']
    profile = client.get_schema(table_name)
    client.pivot(profile, key, categorical_columns)

then you will get a Sales_PIVOT_YYYYmmddHHMM table with

| id | region_jp | region_us | sales_type_phone | sales_type_web | sales_type_shop |
|----|-----------|-----------|------------------|----------------|-----------------|
| 1  | 1         | 0         | 1                | 0              | 0               |
| 2  | 0         | 1         | 0                | 1              | 0               |
| 3  | 1         | 0         | 0                | 1              | 0               |
| 4  | 0         | 1         | 0                | 0              | 1               |
| 5  | 1         | 0         | 0                | 0              | 0               |
| 6  | 0         | 1         | 0                | 0              | 1               |
| 7  | 0         | 1         | 0                | 0              | 1               |
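
Conceptually, this one-hot pivot boils down to a CASE expression per categorical value. The query below is only an illustration of that idea; the exact SQL and table-name suffix dataviper generates may differ.

# Illustrative only: what the pivot above corresponds to in plain SQL.
pivot_sql = """
CREATE TABLE Sales_PIVOT_YYYYmmddHHMM AS
SELECT
    id,
    CASE WHEN region = 'jp'        THEN 1 ELSE 0 END AS region_jp,
    CASE WHEN region = 'us'        THEN 1 ELSE 0 END AS region_us,
    CASE WHEN sales_type = 'phone' THEN 1 ELSE 0 END AS sales_type_phone,
    CASE WHEN sales_type = 'web'   THEN 1 ELSE 0 END AS sales_type_web,
    CASE WHEN sales_type = 'shop'  THEN 1 ELSE 0 END AS sales_type_shop
FROM Sales
"""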

joinability

Count how well 2 tables can be joined.

When you have the Sales table above and a Reps table like this

| id     | name    | tenure |
|--------|---------|--------|
| 8003   | Hiromu  | 9      |
| 8972   | Ochiai  | 6      |
| 52934  | Taro    | 1      |
| 92231  | otiai10 | 2      |
| 100425 | Hanako  | 7      |
| 125901 | Chika   | 3      |
| 182234 | Mary    | 5      |
| 199621 | Jack    | 1      |

do

with client.connect() as conn:
    report = client.joinability(on={'Sales': 'rep_id', 'Reps': 'id'})
    report.to_excel()

then you will get a join_Sales_Reps.xlsx file with

| table | key      | total | match | match_% | drop | drop_% |
|-------|----------|-------|-------|---------|------|--------|
| Sales | [rep_id] | 7     | 6     | 85.714  | 1    | 14.285 |
| Reps  | [id]     | 8     | 6     | 75.00   | 2    | 25.00  |
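
Again as an illustration only, the match counts in such a report can be computed with an EXISTS query on each side of the join; the percentages follow by dividing by each table's total row count.

# Illustrative only: count Sales rows that find a partner in Reps.
sales_match_sql = """
SELECT COUNT(*) AS sales_match
FROM Sales s
WHERE EXISTS (SELECT 1 FROM Reps r WHERE r.id = s.rep_id)
"""
# In the example above: 6 of 7 Sales rows match (85.7%, 1 dropped), and
# 6 of 8 Reps rows are referenced by Sales (75.0%, 2 dropped).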

Issues and TODOs

License

MIT License

