chrisdev / dataswim

Utilities to swim in a data lake


Dataswim

A simple API to explore, clean, transform and visualize data. This API is:

  • Minimalistic: short names, simple functionalities, minimal parameters
  • Pragmatic: focuses on raw efficiency rather than idiomatic code
  • Simple stupid: easy to understand, in both the code and the API

Dependencies

Pandas to work with data

Dataset and SQLAlchemy to work with databases

To chart data: Holoviews with Bokeh, Seaborn and Altair

Supported databases

  • PostgreSQL, SQLite and any other database supported by SQLAlchemy
  • InfluxDB
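Connection strings follow the standard SQLAlchemy URL format (e.g. postgresql://user:password@localhost/dbname, as shown later in the example). As a rough illustration of what such a URL points at, here is a plain-SQLAlchemy sketch against an in-memory SQLite database; the table and rows are made up for the demo and this is not dataswim's own API:

```python
from sqlalchemy import create_engine, text

# The same URL formats are accepted by ds.connect(); here we use
# plain SQLAlchemy with an in-memory SQLite database.
engine = create_engine("sqlite:///:memory:")
with engine.connect() as conn:
    conn.execute(text("CREATE TABLE auth_user (id INTEGER, username TEXT)"))
    conn.execute(text("INSERT INTO auth_user VALUES (1, 'ggg')"))
    rows = conn.execute(text("SELECT username FROM auth_user")).fetchall()
print(rows)  # [('ggg',)]
```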

Install

Using conda:

conda install pandas sqlalchemy seaborn
conda install -c ioam holoviews bokeh
conda install altair --channel conda-forge
pip install dataset pytablewriter goerr gencharts chartjspy
pip install dataswim --no-deps

Using pip:

pip install dataswim

To get Altair charts working in notebooks, this command is required:

jupyter nbextension enable vega --py --sys-prefix

Usage

Read the API documentation.

Note: the API is not stable yet and may change. The documentation is kept in sync with master. To build the docs for your installed version, clone the repository and run:

pip install sphinx

cd docs
make html
firefox _build/html/api_auto.html&

Example

Jupyter demo notebooks are available as examples.

Users demo notebook: it uses Django user data loaded from a CSV file or directly from a database. Note: the data in this demo is autogenerated and fake, but it is possible to run it on real user data by connecting to a Django database.

Resample data demo: demonstrates how to resample data by time periods.

To run the notebooks online click the badge: Binder

User registrations demo notebook:

Goal: chart user registrations over time from fake Django user data

from dataswim import ds

# Load from a Django database
#ds.connect('postgresql://user:xxxx@localhost/dbname')
#ds.connect('sqlite:////path/to/db.sqlite3')
#ds.load("auth_user")

# Load demo data from csv
ds.load_csv("data/users.csv")

# for a data description:
#ds.describe()
# for a quick look:
#ds.look()
# for a sample:
ds.show()
[START] Loading csv...
[END] Finished loading csv in 0.12 seconds
[INFO] The dataframe has 1007 rows and 11 columns:
id, password, last_login, is_superuser, first_name, last_name, email, is_staff, is_active, 
date_joined, username
id password last_login is_superuser first_name last_name email is_staff is_active date_joined username
0 1 pbkdf2_sha256$36000$TGjJvvfpuFhR$jVD3mP8MNxNYD... 2017-12-02 14:05:30.796573 True NaN NaN NaN True True 2017-12-02 13:57:53.393755 ggg
1 2 Cupiditate accusamus velit sit dolor. Doloribu... 1986-07-16 11:22:02.000000 False Lindsey Thompson farmermeghan@henderson.com False False 1997-08-06 16:23:37.000000 jrichards
2 3 Nobis quisquam voluptatibus nulla.\nEa archite... 1981-04-06 06:48:03.000000 False Kurt Black kellycharles@marsh.com True True 2013-08-29 18:04:10.000000 eroberts
3 4 Inventore ea quia ducimus eligendi quod. Velit... 1991-08-08 22:23:45.000000 True Sandra Wilson antoniobowers@hotmail.com True False 1981-04-22 05:32:55.000000 emilykelley
4 5 Veniam sequi aut nisi vitae. Quasi explicabo v... 1971-02-08 15:53:42.000000 True Andrew Guzman hannahconner@yahoo.com False False 2010-11-10 17:53:54.000000 lsmith

Transform the data

# Resample data by one year
# see Pandas frequencies for units: 
# https://github.com/pandas-dev/pandas/blob/master/pandas/tseries/frequencies.py#L98
# try "1D" for one day
ds.rsum("1A", dateindex="date_joined", num_col="Registrations", index_col="Date")
# Convert nulls to zeros
ds.zero_nan("Registrations")
# Keep only the fields we need
ds.keep("Date", "Registrations")
ds.backup()
ds.show()
[OK] Added a datetime index from column date_joined
[OK] Data resampled by 1A
[OK] Column Date added from index
[OK] Replaced 0 values by nan in column Registrations
[OK] Setting dataframe to columns Date Registrations
[OK] Dataframe backed up
[INFO] The dataframe has 48 rows and 2 columns:
Date, Registrations
Date Registrations
date_joined
1970-12-31 1970-12-31 20
1971-12-31 1971-12-31 21
1972-12-31 1972-12-31 20
1973-12-31 1973-12-31 25
1974-12-31 1974-12-31 26
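The rsum and zero_nan steps above boil down to a pandas resample-plus-fill; a minimal pandas-only sketch with synthetic timestamps, using the "1D" frequency mentioned in the comments:

```python
import pandas as pd

# Synthetic join dates (made-up data: three users over three days)
df = pd.DataFrame({
    "date_joined": pd.to_datetime(
        ["2017-01-01 08:00", "2017-01-01 17:30", "2017-01-03 09:15"]
    ),
    "n": [1, 1, 1],
})

# ds.rsum("1D", ...) is roughly: index by date, resample by day, sum
daily = df.set_index("date_joined").resample("1D")["n"].sum()

# ds.zero_nan(...) is roughly: make sure empty periods read as zero
daily = daily.fillna(0)
print(daily.tolist())  # [2, 0, 1]
```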

Draw charts

import holoviews as hv
hv.extension('bokeh')

Simple line

ds.chart("Date", "Registrations", label="Registrations")
ds.line_()

Users chart

Line and points

ds.opts(dict(width=940, height=300))
line = ds.line_()
style = dict(color="orange", size=10)
opts = dict(tools=["hover"])
point = ds.point_(style=style, opts=opts)
# The * operator merges the two charts into one
line*point

Users chart

Different types of chart

ds.width(300)
ds.height(200)
ds.color("orange")
area = ds.area_()
ds.color("green")
point = ds.point_()
ds.color("grey")
bar = ds.bar_()
area+point+bar

Users chart

Histogram and mean line

ds.color("green")
ds.width(850)
ds.height(300)
c7 = ds.hline_("Registrations")
c8 = ds.area_("Registrations").hist()
c8*c7

Users chart

Distribution chart

Prepare the data: convert all values to integers to make these charts work

ds.to_int("Date")
ds.to_int("Registrations")
    [OK] Converted column values to integers
    [OK] Converted column values to integers
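In plain pandas terms, ds.to_int is roughly an integer cast on the column; a minimal sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"Registrations": [20.0, 21.0, 25.0]})
# Rough equivalent of ds.to_int("Registrations"): cast the column to int
df["Registrations"] = df["Registrations"].astype(int)
print(df["Registrations"].tolist())  # [20, 21, 25]
```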

Draw the chart

%matplotlib inline
ds.chart("Registrations", "Date")
c = ds.distrib_()

Users chart

Density chart

ds.chart("Date", "Registrations")
c = ds.density_()

Users chart

Linear regression with marginal distributions

ds.opts(dict(xticks=(1970, 2017), yticks=(0, 35)))
c = ds.dlinear_()

Users chart

Models residuals chart

c = ds.residual_()

Users chart

About

Utilities to swim in a data lake

License: MIT


Languages

Language: Python 100.0%