bjpop / gurita

A convenient and expressive tool for data analytics and plotting on the command line

Gurita: a command line data analytics and plotting tool

Gurita is a command line tool for analysing and visualising tabular data in CSV or TSV format.

At its core Gurita provides a suite of commands, each of which carries out a common data analytics or plotting task.

A unique and powerful feature of Gurita is that commands can be chained together into flexible analysis pipelines. See the advanced example below.

It is designed to be fast and convenient, and is particularly suited to data exploration tasks. Input files with very large numbers of rows (millions or more) are readily supported.

Gurita commands are highly customisable, but sensible defaults are applied, so simple tasks are easy to express and complex tasks remain possible.

Gurita is implemented in Python and makes extensive use of the Pandas, Seaborn, and Scikit-learn libraries for data processing and plot generation.

Documentation

Please consult the Gurita Documentation for detailed information about installation and usage.

Examples

Simple example

Box plot of sepal_length for each species in the classic iris dataset:

cat iris.csv | gurita box -x species -y sepal_length

Figure: box plot of sepal_length for each species in the classic iris dataset
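
As a rough illustration of what the command above summarises, the following Python sketch groups sepal_length by species and computes the three quartiles that make up each box. It is not Gurita's implementation: the function name box_summary and the handful of rows in CSV_TEXT are assumptions made for this example, and it uses only the Python standard library rather than Pandas or Seaborn. Whiskers and outliers are left out because plotting libraries differ in how they draw them.

```python
import csv
import io
import statistics
from collections import defaultdict

# A few illustrative rows in the same column layout as iris.csv
# (an assumption for this sketch, not the full dataset).
CSV_TEXT = """species,sepal_length
setosa,5.1
setosa,4.9
setosa,4.7
setosa,4.6
setosa,5.0
versicolor,7.0
versicolor,6.4
versicolor,6.9
versicolor,5.5
versicolor,6.5
"""


def box_summary(csv_text, x, y):
    """Group the numeric column y by the categorical column x and
    return the first quartile, median and third quartile per group."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row[x]].append(float(row[y]))

    summary = {}
    for name, values in groups.items():
        # "inclusive" treats the values as the whole population of interest.
        q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
        summary[name] = {"q1": q1, "median": median, "q3": q3}
    return summary


# Mirrors the roles of -x species and -y sepal_length in the command above.
summary = box_summary(CSV_TEXT, x="species", y="sepal_length")
for name, stats in summary.items():
    print(name, stats)
```

Each entry in summary corresponds to one box in the plot: the box spans q1 to q3 and the line inside it marks the median.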

Advanced example

The following example illustrates Gurita's ability to chain commands together.

Commands in a chain are separated by the plus sign (+) and data flows from left to right in the chain.

cat iris.csv | gurita filter 'species != "virginica"' \
                      + sample 0.9 \
                      + pca \
                      + scatter -x pc1 -y pc2 --hue species

Figure: scatter plot comparing principal components pc1 and pc2 from a filtered iris dataset

In this example four commands are executed in the following order:

  1. The filter command selects all rows where species is not equal to virginica.
  2. The filtered rows are then passed to the sample command which randomly selects 90% of the remaining rows.
  3. The sampled rows are then passed to the pca command which performs principal component analysis (PCA) as a data reduction step, yielding two extra columns in the data called pc1 and pc2.
  4. Finally the pca-transformed data is passed to the scatter command which generates a scatter plot of pc1 and pc2 (the first two principal components).
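
The left-to-right data flow described in the steps above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Gurita's actual parser or command set: split_chain, run_chain, and the mini-commands filter_ne and sample are hypothetical names invented for this example, the rows in CSV_TEXT are a small illustrative table, and the PCA and plotting steps are omitted. The point is only to show how an argument list can be split at each plus sign and how each command receives the output of the one before it.

```python
import csv
import io
import random

# A small illustrative table in the same column layout as iris.csv
# (an assumption for this sketch).
CSV_TEXT = """species,sepal_length
setosa,5.1
setosa,4.9
versicolor,7.0
versicolor,6.4
virginica,6.3
virginica,5.8
"""


def split_chain(argv):
    """Split a flat argument list into commands at each standalone '+'."""
    commands, current = [], []
    for arg in argv:
        if arg == "+":
            commands.append(current)
            current = []
        else:
            current.append(arg)
    commands.append(current)
    return commands


def filter_ne(rows, column, value):
    """Hypothetical command: keep rows where column is not equal to value."""
    return [row for row in rows if row[column] != value]


def sample(rows, fraction, seed=0):
    """Hypothetical command: randomly keep the given fraction of rows."""
    k = round(len(rows) * float(fraction))
    return random.Random(seed).sample(rows, k)


HANDLERS = {"filter_ne": filter_ne, "sample": sample}


def run_chain(argv, rows):
    """Run each command in order, feeding its output to the next command."""
    for name, *args in split_chain(argv):
        rows = HANDLERS[name](rows, *args)
    return rows


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
argv = ["filter_ne", "species", "virginica", "+", "sample", "0.5"]
result = run_chain(argv, rows)
print(result)
```

Because every command takes rows in and returns rows out, commands can be reordered or extended without changing run_chain, which is the property that makes chaining convenient.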

Licence

This program is released as open source software under the terms of the MIT License.

Authors
