pwwang / datar

A Grammar of Data Manipulation in python

Home Page: https://pwwang.github.io/datar/

Decile calculation with "ntile"...

coforfe opened this issue · comments

Hi,

Thanks for your excellent package to port R (dplyr) flow of processing to Python. I have been using other alternatives, and yours is the one that offers the most extensive and equivalent to what is possible now with dplyr.

I have an issue with how ntile() calculates the different groups for a vector of probabilities ("p2").

This is the output of that calculation.

from datar.all import f, select, mutate, ntile, count
kk = trainnonFun >> select(f.p2) >> mutate(decil=ntile(f.p2, n=10))
kk
            p2      decil
      <float64> <category>
6535   0.971462         10
7523   0.971462         10
48441  0.970154         10
48417  0.970154         10
...         ...        ...
13971  0.970154         10
38140  0.409739          1
13400  0.409739          1
45999  0.405575          1
26150  0.372226          1
29939  0.357850          1

But when you count how many values fall into each bucket, the result looks strange:

pp = kk >> count(f.decil)
pp
      decil       n
  <category> <int64>
0          1       7
1          2     542
2          3    1361
3          4     924
4          5    1240
5          6    1655
6          7    3080
7          8    2647
8          9    1571
9         10    1345

The bucket sizes are very unequal.

For the sake of reproducibility, this file contains the dataframe with the probabilities and the calculated decile.

For now, I am calculating the deciles with the pandas qcut() function, which gives the right output, with a much more balanced number of elements in each bucket.

Thanks again,
Carlos.

Hi, could you show the versions of datar:

from datar import get_versions
get_versions()

Hi,
Yes, this is what I get.

>>> from datar import get_versions
>>> get_versions()
python      : 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]
datar       : 0.10.2
simplug     : 0.2.1
executing   : 1.2.0
pipda       : 0.10.0
datar-numpy : 0.0.0
numpy       : 1.23.4
datar-pandas: 0.0.0
pandas      : 1.5.2

Thanks!
Carlos.

This is a nice catch!
The ntile implementation should use pd.qcut instead of pd.cut.
This is fixed in datar-pandas v0.1.1.
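
To illustrate why the choice matters, here is a minimal sketch on synthetic data (not the reporter's dataframe) comparing equal-width binning with pd.cut against quantile-based binning with pd.qcut. The beta-distributed sample and the variable names are assumptions made for illustration, and this is not the datar-pandas implementation itself.

```python
import numpy as np
import pandas as pd

# Synthetic, skewed "probabilities" (illustrative only, not the reporter's data)
rng = np.random.default_rng(0)
p2 = pd.Series(rng.beta(8, 2, size=10_000), name="p2")

# pd.cut with an integer splits [min, max] into 10 equal-WIDTH intervals,
# so the bucket sizes simply follow the shape of the distribution.
equal_width = pd.cut(p2, bins=10, labels=False) + 1

# pd.qcut places the bin edges at the sample quantiles,
# so each bucket receives roughly the same NUMBER of values.
equal_count = pd.qcut(p2, q=10, labels=False) + 1

width_counts = equal_width.value_counts().sort_index()
count_counts = equal_count.value_counts().sort_index()

print("pd.cut bucket sizes:\n", width_counts)
print("pd.qcut bucket sizes:\n", count_counts)
```

On skewed input like this, the pd.cut buckets come out very unbalanced, much like the counts reported above, while the pd.qcut buckets are close to equal in size.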

Try updating datar by:

pip install -U datar[pandas]

and then run get_versions() to confirm that datar-pandas v0.1.1 is installed.

By the way, thanks for the compliments:

Thanks for your excellent package to port R (dplyr) flow of processing to Python. I have been using other alternatives, and yours is the one that offers the most extensive and equivalent to what is possible now with dplyr.

Do you mind if I put it as a testimonial in the README file?

Thanks a lot for your quick fix!

No, I do not mind at all.
Thanks to you.
Carlos.

Thanks!

Please confirm if this is fixed and feel free to close it if so.
Feel free to open new issues if you have other questions.

Thanks,

Yes, I have just updated datar following your instructions, and the problem is now fixed.

       decil       n
  <category> <int64>
0          1    1438
1          2    1437
2          3    1437
3          4    1437
4          5    1437
5          6    1439
6          7    1435
7          8    1585
8          9    1293
9         10    1434

Thanks again,
Carlos.
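
A note on the remaining small imbalance (for example 1585 versus 1293 in buckets 8 and 9): one plausible explanation, given the repeated p2 values visible in the original output, is that quantile-based binning keeps tied values in the same bucket. The sketch below uses a made-up series to show that effect with pd.qcut, and contrasts it with ranking the values first so that ties can be split. Whether the fixed ntile treats ties exactly this way is an assumption; the thread only states that it should use pd.qcut.

```python
import pandas as pd

# Toy data with a run of tied values (hypothetical, for illustration only)
values = pd.Series([0.1, 0.2, 0.2, 0.2, 0.2, 0.3, 0.4, 0.5])

# Quantile edges computed on the raw values: two edges coincide at 0.2,
# so duplicates="drop" is needed and all tied values share one bucket.
by_quantile = pd.qcut(values, q=4, labels=False, duplicates="drop") + 1

# Ranking first (ties broken by order of appearance) yields distinct ranks,
# so the buckets are equal in size, at the cost of splitting tied values.
by_rank = pd.qcut(values.rank(method="first"), q=4, labels=False) + 1

print(by_quantile.value_counts().sort_index())
print(by_rank.value_counts().sort_index())
```

With heavily tied data, the first approach gives fewer or uneven buckets, while the second keeps bucket sizes equal but assigns identical values to different buckets.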