This is a personal library, allowing more functional programming in Python data-science. Mostly, it's focused on writing code like this:
from yo_extensions import *
import json
(Query
.file.text('data.jsonlines') # read file and create a 'stream' of lines
.select(json.loads) # parse each line with JSON
.where(lambda z: maybe(z,'status')=='OK') # only items with status equals OK, maybe is Elvis operator
.select(lambda z: (z['id'],z['message']))
.to_dataframe(columns=['id','message']) # seamless integration with pandas
.groupby('message')
.size()
.feed(plots.series.pie()) # extension method, draws a pie chart with custom settings
)
The key principles are:
- Fluent interface
- Type annotations
- Extendability
Contents:
- Yet another port of
C# LINQ
to Python. The closest analogue isasq
. The key differences are: type annotation support and different extendability mechanism - Extension methods for better data-science: plotting, status reporting, algorithms on pandas
- A few useful classes for machine-learning
- Wide test coverage for most of the implemented funcionality
The port of C# LINQ
to Python with type annotations. The usual methods (select
, where
) are implemented as methods of Queryable
class.
The extension methods are challenging due to Python restrictions. I couldn't use monkey-patching, because it does not preserve type-annotations, and injected methods are not seen by IDE. Thus, the following mechanism is employed:
- Consider the function
f(q,X)
whereq
isQueryable
andX
is a tuple of additional argument. - Lets Curry
q
, introducingh(X)
such thath(X)
returnsg(q)
and soh(X)(q)=g(q)=f(q,X)
- To inject
h
intoq
,q.feed
method acceptsg
, soq.feed(h(X)) = h(X)(q) = f(q,X)
This mechanism preserves the type annotation, allows to add any functionality to Queryable
and almost preserves Fluent interface: you need to add feed
instead of just chaining methods.
To avoid coding of both g
and h
function for any functionality, the suggested way of implementation for h
is a class, X
is provided in __init__
, and also h
is Callable
so it can accept q
.
The same mechanism employed for pd.DataFrame
, pd.Series
, pd.DataFrameGroupBy
and pd.SeriesGroupBy
. For these classes, feed
is monkey-patced and does not preserve the type annotation.
- Several extensions for
fluq
: input/output to various file types, partitioning, etc. - Few extensions for
pandas
: adding ordering inside groups, stratifying order for Dataframes, etc - Plots: several plots I like to use in research, implemented in
feed
-compatible mode.
yo_extensions/__init__.py
provides the demonstration on how better include fluq
with extensions into the side project.
Small utilities:
kraken
: Executes method with the various arguments (plan) and returns the result aspd.DataFrame
for futher analysismetrics
: computes lots of metrics for predicted/actual values and returns them aspd.DataFrame
.keras
: wrapper overkeras
generators.