chrisdev / dataswim

Utilities to swim in a data lake


Dataswim

A simple API to explore, clean, transform and visualize data. This API is:

  • Minimalistic: short names, simple functionalities, minimal parameters
  • Pragmatic: focuses on raw efficiency rather than idiomatic code
  • Simple stupid: easy to understand, in both the code and the API

Dependencies

Pandas to work with data

Dataset and SQLAlchemy to work with databases

To chart data: Holoviews with Bokeh, Seaborn and Altair

Supported databases

  • PostgreSQL, SQLite and any other database supported by SQLAlchemy
  • InfluxDB
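Connection strings follow the standard SQLAlchemy URL format (e.g. postgresql://user:password@localhost/dbname, as shown later in the example). As a rough illustration of what such a URL points at, here is a plain-SQLAlchemy sketch against an in-memory SQLite database; the table and rows are made up for the demo and this is not dataswim's own API:

```python
from sqlalchemy import create_engine, text

# The same URL formats are accepted by ds.connect(); here we use
# plain SQLAlchemy with an in-memory SQLite database.
engine = create_engine("sqlite:///:memory:")
with engine.connect() as conn:
    conn.execute(text("CREATE TABLE auth_user (id INTEGER, username TEXT)"))
    conn.execute(text("INSERT INTO auth_user VALUES (1, 'ggg')"))
    rows = conn.execute(text("SELECT username FROM auth_user")).fetchall()
print(rows)  # [('ggg',)]
```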

Install

Using conda:

conda install pandas sqlalchemy seaborn
conda install -c ioam holoviews bokeh
conda install altair --channel conda-forge
pip install dataset pytablewriter goerr gencharts chartjspy
pip install dataswim --no-deps

Using pip:

pip install dataswim

To get Altair charts working in notebooks, this command is required:

jupyter nbextension enable vega --py --sys-prefix

Usage

Read the API documentation.

Note: the API is not stable yet and may change. The documentation is kept in sync with master. To build the docs for your installed version, clone the repository and run:

pip install sphinx

cd docs
make html
firefox _build/html/api_auto.html&

Example

Jupyter demo notebooks are available as examples.

Users demo notebook: it uses Django user data loaded from a CSV file or directly from a database. Note: the data in this demo is autogenerated and fake, but it is possible to run it on real user data by connecting to a Django database.

Resample data demo: demonstrates how to resample data by time periods.

To run the notebooks online click the badge: Binder

User registrations demo notebook:

Goal: chart user registrations over time from fake Django user data

from dataswim import ds

# Load from a Django database
#ds.connect('postgresql://user:xxxx@localhost/dbname')
#ds.connect('sqlite:////path/to/db.sqlite3')
#ds.load("auth_user")

# Load demo data from csv
ds.load_csv("data/users.csv")

# for a data description:
#ds.describe()
# for a quick look:
#ds.look()
# for a sample:
ds.show()
[START] Loading csv...
[END] Finished loading csv in 0.12 seconds
[INFO] The dataframe has 1007 rows and 11 columns:
id, password, last_login, is_superuser, first_name, last_name, email, is_staff, is_active, 
date_joined, username
id password last_login is_superuser first_name last_name email is_staff is_active date_joined username
0 1 pbkdf2_sha256$36000$TGjJvvfpuFhR$jVD3mP8MNxNYD... 2017-12-02 14:05:30.796573 True NaN NaN NaN True True 2017-12-02 13:57:53.393755 ggg
1 2 Cupiditate accusamus velit sit dolor. Doloribu... 1986-07-16 11:22:02.000000 False Lindsey Thompson farmermeghan@henderson.com False False 1997-08-06 16:23:37.000000 jrichards
2 3 Nobis quisquam voluptatibus nulla.\nEa archite... 1981-04-06 06:48:03.000000 False Kurt Black kellycharles@marsh.com True True 2013-08-29 18:04:10.000000 eroberts
3 4 Inventore ea quia ducimus eligendi quod. Velit... 1991-08-08 22:23:45.000000 True Sandra Wilson antoniobowers@hotmail.com True False 1981-04-22 05:32:55.000000 emilykelley
4 5 Veniam sequi aut nisi vitae. Quasi explicabo v... 1971-02-08 15:53:42.000000 True Andrew Guzman hannahconner@yahoo.com False False 2010-11-10 17:53:54.000000 lsmith

Transform the data

# Resample data by one year
# see Pandas frequencies for units: 
# https://github.com/pandas-dev/pandas/blob/master/pandas/tseries/frequencies.py#L98
# try "1D" for one day
ds.rsum("1A", dateindex="date_joined", num_col="Registrations", index_col="Date")
# Convert nulls to zeros
ds.zero_nan("Registrations")
# Keep only the fields we need
ds.keep("Date", "Registrations")
ds.backup()
ds.show()
[OK] Added a datetime index from column date_joined
[OK] Data resampled by 1A
[OK] Column Date added from index
[OK] Replaced 0 values by nan in column Registrations
[OK] Setting dataframe to columns Date Registrations
[OK] Dataframe backed up
[INFO] The dataframe has 48 rows and 2 columns:
Date, Registrations
Date Registrations
date_joined
1970-12-31 1970-12-31 20
1971-12-31 1971-12-31 21
1972-12-31 1972-12-31 20
1973-12-31 1973-12-31 25
1974-12-31 1974-12-31 26
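The rsum and zero_nan steps above boil down to a pandas resample-plus-fill; a minimal pandas-only sketch with synthetic timestamps, using the "1D" frequency mentioned in the comments:

```python
import pandas as pd

# Synthetic join dates (made-up data: three users over three days)
df = pd.DataFrame({
    "date_joined": pd.to_datetime(
        ["2017-01-01 08:00", "2017-01-01 17:30", "2017-01-03 09:15"]
    ),
    "n": [1, 1, 1],
})

# ds.rsum("1D", ...) is roughly: index by date, resample by day, sum
daily = df.set_index("date_joined").resample("1D")["n"].sum()

# ds.zero_nan(...) is roughly: make sure empty periods read as zero
daily = daily.fillna(0)
print(daily.tolist())  # [2, 0, 1]
```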

Draw charts

import holoviews as hv
hv.extension('bokeh')

Simple line

ds.chart("Date", "Registrations", label="Registrations")
ds.line_()

Users chart

Line and points

ds.opts(dict(width=940, height=300))
line = ds.line_()
style = dict(color="orange", size=10)
opts = dict(tools=["hover"])
point = ds.point_(style=style, opts=opts)
# The * operator merges the two charts into one
line*point

Users chart

Different types of chart

ds.width(300)
ds.height(200)
ds.color("orange")
area = ds.area_()
ds.color("green")
point = ds.point_()
ds.color("grey")
bar = ds.bar_()
area+point+bar

Users chart

Histogram and mean line

ds.color("green")
ds.width(850)
ds.height(300)
c7 = ds.hline_("Registrations")
c8 = ds.area_("Registrations").hist()
c8*c7

Users chart

Distribution chart

Prepare the data: convert all values to integers to make these charts work

ds.to_int("Date")
ds.to_int("Registrations")
    [OK] Converted column values to integers
    [OK] Converted column values to integers
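In plain pandas terms, ds.to_int is roughly an integer cast on the column; a minimal sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"Registrations": [20.0, 21.0, 25.0]})
# Rough equivalent of ds.to_int("Registrations"): cast the column to int
df["Registrations"] = df["Registrations"].astype(int)
print(df["Registrations"].tolist())  # [20, 21, 25]
```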

Draw the chart

%matplotlib inline
ds.chart("Registrations", "Date")
c = ds.distrib_()

Users chart

Density chart

ds.chart("Date", "Registrations")
c = ds.density_()

Users chart

Linear regression with marginal distributions

ds.opts(dict(xticks=(1970, 2017), yticks=(0, 35)))
c = ds.dlinear_()

Users chart

Models residuals chart

c = ds.residual_()

Users chart

About

Utilities to swim in a data lake

License: MIT


Languages

Language: Python 100.0%