Polars Extension for General Data Science Use

A Polars plugin aiming to simplify common numerical and string data analysis procedures. Basic data science, statistics, and NLP tasks can be done natively inside a dataframe, minimizing the number of dependencies.

Its goal is not to replace SciPy or NumPy; rather, it aims to improve runtime for common tasks and to reduce the amount of Python code and UDFs.

See examples here.

Read the docs here.

Currently in Beta. Feel free to submit feature requests in the issues section of the repo.

Disclaimer: this plugin is not tested with streaming mode.

Getting Started

pip install polars_ds

and

import polars_ds as pld

to register the expression namespaces (`stats`, `str2`, `num`, etc.) used in the examples below.

Examples

In-dataframe statistical testing

df.select(
    pl.col("group1").stats.ttest_ind(pl.col("group2"), equal_var = True).alias("t-test"),
    pl.col("category_1").stats.chi2(pl.col("category_2")).alias("chi2-test"),
    pl.col("category_1").stats.f_test(pl.col("group1")).alias("f-test")
)

shape: (1, 3)
┌───────────────────┬──────────────────────┬────────────────────┐
│ t-test            ┆ chi2-test            ┆ f-test             │
│ ---               ┆ ---                  ┆ ---                │
│ struct[2]         ┆ struct[2]            ┆ struct[2]          │
╞═══════════════════╪══════════════════════╪════════════════════╡
│ {-0.004,0.996809} ┆ {37.823816,0.386001} ┆ {1.354524,0.24719} │
└───────────────────┴──────────────────────┴────────────────────┘
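Each `struct[2]` packs the test statistic and the p-value into a single column. As a reference point, here is a minimal pure-Python sketch of the pooled (equal-variance) two-sample t statistic that `ttest_ind(..., equal_var=True)` corresponds to; `ttest_ind_stat` is an illustrative helper name, not part of the plugin:

```python
import math

def ttest_ind_stat(x: list[float], y: list[float]) -> float:
    # Pooled (equal-variance) two-sample t statistic:
    # t = (mean_x - mean_y) / sqrt(s_p^2 * (1/n_x + 1/n_y))
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)  # sample variance of x
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)  # sample variance of y
    sp2 = ((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)  # pooled variance
    return (mx - my) / math.sqrt(sp2 * (1 / nx + 1 / ny))
```

The plugin evaluates this in Rust over whole columns; the sketch is only to show which statistic is being reported.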

Generating random numbers according to reference column

df.with_columns(
    # Sample from normal distribution, using reference column "a" 's mean and std
    pl.col("a").stats.sample_normal().alias("test1") 
    # Sample from uniform distribution, with low = 0 and high = "a"'s max, and respect the nulls in "a"
    , pl.col("a").stats.sample_uniform(low = 0., high = None, respect_null=True).alias("test2")
).head()

shape: (5, 3)
┌───────────┬───────────┬──────────┐
│ a         ┆ test1     ┆ test2    │
│ ---       ┆ ---       ┆ ---      │
│ f64       ┆ f64       ┆ f64      │
╞═══════════╪═══════════╪══════════╡
│ null      ┆ 0.459357  ┆ null     │
│ null      ┆ 0.038007  ┆ null     │
│ -0.826518 ┆ 0.241963  ┆ 0.968385 │
│ 0.737955  ┆ -0.819475 ┆ 2.429615 │
│ 1.10397   ┆ -0.684289 ┆ 2.483368 │
└───────────┴───────────┴──────────┘

Blazingly fast string similarity comparisons. (Thanks to RapidFuzz)

df.select(
    pl.col("word").str2.levenshtein("asasasa", return_sim=True).alias("asasasa"),
    pl.col("word").str2.levenshtein("sasaaasss", return_sim=True).alias("sasaaasss"),
    pl.col("word").str2.levenshtein("asdasadadfa", return_sim=True).alias("asdasadadfa"),
    pl.col("word").str2.fuzz("apples").alias("LCS based Fuzz match - apples"),
    pl.col("word").str2.osa("apples", return_sim = True).alias("Optimal String Alignment - apples"),
    pl.col("word").str2.jw("apples").alias("Jaro-Winkler - apples"),
)
shape: (5, 6)
┌──────────┬───────────┬─────────────┬────────────────┬───────────────────────────┬────────────────┐
│ asasasa  ┆ sasaaasss ┆ asdasadadfa ┆ LCS based Fuzz ┆ Optimal String Alignment  ┆ Jaro-Winkler - │
│ ---      ┆ ---       ┆ ---         ┆ match - apples ┆ - apple…                  ┆ apples         │
│ f64      ┆ f64       ┆ f64         ┆ ---            ┆ ---                       ┆ ---            │
│          ┆           ┆             ┆ f64            ┆ f64                       ┆ f64            │
╞══════════╪═══════════╪═════════════╪════════════════╪═══════════════════════════╪════════════════╡
│ 0.142857 ┆ 0.111111  ┆ 0.090909    ┆ 0.833333       ┆ 0.833333                  ┆ 0.966667       │
│ 0.428571 ┆ 0.333333  ┆ 0.272727    ┆ 0.166667       ┆ 0.0                       ┆ 0.444444       │
│ 0.111111 ┆ 0.111111  ┆ 0.090909    ┆ 0.555556       ┆ 0.444444                  ┆ 0.5            │
│ 0.875    ┆ 0.666667  ┆ 0.545455    ┆ 0.25           ┆ 0.25                      ┆ 0.527778       │
│ 0.75     ┆ 0.777778  ┆ 0.454545    ┆ 0.25           ┆ 0.25                      ┆ 0.527778       │
└──────────┴───────────┴─────────────┴────────────────┴───────────────────────────┴────────────────┘
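`return_sim=True` returns a normalized similarity rather than a raw edit distance; for Levenshtein this is commonly defined as `1 - distance / max(len(a), len(b))`. A pure-Python sketch of that computation (the plugin delegates the real work to Rust/RapidFuzz; `levenshtein_sim` is an illustrative helper):

```python
def levenshtein_sim(a: str, b: str) -> float:
    # Classic dynamic-programming edit distance, kept to one row at a
    # time, then normalized to a similarity in [0, 1].
    m, n = len(a), len(b)
    prev = list(range(n + 1))  # distances from a[:0] to every prefix of b
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (a[i - 1] != b[j - 1]),  # substitution
            )
        prev = cur
    return 1.0 - prev[n] / max(m, n, 1)
```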

Even in-dataframe nearest neighbors queries! 😲

df.with_columns(
    pl.col("id").num.knn_ptwise(
        pl.col("val1"), pl.col("val2"), 
        k = 3, dist = "haversine", parallel = True
    ).alias("nearest neighbor ids")
)

shape: (5, 6)
┌─────┬──────────┬──────────┬──────────┬──────────┬──────────────────────┐
│ id  ┆ val1     ┆ val2     ┆ val3     ┆ val4     ┆ nearest neighbor ids │
│ --- ┆ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---                  │
│ i64 ┆ f64      ┆ f64      ┆ f64      ┆ f64      ┆ list[u64]            │
╞═════╪══════════╪══════════╪══════════╪══════════╪══════════════════════╡
│ 0   ┆ 0.804226 ┆ 0.937055 ┆ 0.401005 ┆ 0.119566 ┆ [0, 3, … 0]          │
│ 1   ┆ 0.526691 ┆ 0.562369 ┆ 0.061444 ┆ 0.520291 ┆ [1, 4, … 4]          │
│ 2   ┆ 0.225055 ┆ 0.080344 ┆ 0.425962 ┆ 0.924262 ┆ [2, 1, … 1]          │
│ 3   ┆ 0.697264 ┆ 0.112253 ┆ 0.666238 ┆ 0.45823  ┆ [3, 1, … 0]          │
│ 4   ┆ 0.227807 ┆ 0.734995 ┆ 0.225657 ┆ 0.668077 ┆ [4, 4, … 0]          │
└─────┴──────────┴──────────┴──────────┴──────────┴──────────────────────┘
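Under the hood this is a k-nearest-neighbor query per row. A brute-force pure-Python sketch of the idea (illustrative only: this toy version is O(n²) and uses squared Euclidean distance, while the plugin runs in Rust, supports `haversine`, and can parallelize):

```python
def knn_ptwise(ids: list[int], xs: list[float], ys: list[float],
               k: int = 3) -> list[list[int]]:
    # For each point, sort all points by squared Euclidean distance and
    # keep the ids of the k closest. Each point is its own nearest
    # neighbor, matching the output above where row i's list starts with i.
    pts = list(zip(xs, ys))
    out = []
    for x, y in pts:
        order = sorted(
            range(len(pts)),
            key=lambda j: (pts[j][0] - x) ** 2 + (pts[j][1] - y) ** 2,
        )
        out.append([ids[j] for j in order[:k]])
    return out
```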

And a lot more!

Credits

  1. Rust Snowball Stemmer is taken from Tsoding's Seroost project (MIT). See here
  2. Some statistics functions are taken from Statrs (MIT). See here

Other related Projects

  1. Take a look at our friendly neighbor functime.
  2. My other project dsds. It is currently paused while I develop polars-ds, but some modules in DSDS, such as the diagnosis one, are quite stable.
  3. String similarity metrics are so fast and easy to use thanks to RapidFuzz.

About

Polars extension for general data science use cases

License: MIT License

