Decile calculation with "ntile"...

Question

coforfe opened this issue 2 years ago

Hi,

Thanks for your excellent package, which ports the R (dplyr) processing flow to Python. I have been using other alternatives, and yours is the most extensive and the closest equivalent to what is currently possible with dplyr.

I have an issue with how ntile() calculates the different groups for a vector of probabilities ("p2").

This is the output of that calculation.

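from datar.all import count, f, mutate, ntile, select
# trainnonFun is my DataFrame; p2 holds the probabilities
kk = trainnonFun >> select(f.p2) >> mutate(decil=ntile(f.p2, n=10))
kk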
            p2      decil
      <float64> <category>
6535   0.971462         10
7523   0.971462         10
48441  0.970154         10
48417  0.970154         10
...         ...        ...
13971  0.970154         10
38140  0.409739          1
13400  0.409739          1
45999  0.405575          1
26150  0.372226          1
29939  0.357850          1

But when you count how many values fall in each bucket, the result looks strange:

pp = kk >> count(f.decil)
pp
      decil       n
  <category> <int64>
0          1       7
1          2     542
2          3    1361
3          4     924
4          5    1240
5          6    1655
6          7    3080
7          8    2647
8          9    1571
9         10    1345

The group sizes are very uneven: bucket 1 holds only 7 values, while bucket 7 holds 3080.

For the sake of reproducibility, this file contains that dataframe, with the probabilities and the calculated decile:

https://github.com/coforfe/deciles/blob/main/deciles.csv

For now, I am calculating the deciles with the pandas qcut() function, which gives the expected output, with a much more balanced number of elements in each bucket.

Thanks again,
Carlos.
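
A minimal sketch of the qcut() workaround described above. The data here is made up (the real p2 column lives in the linked CSV), and the seed and the Beta distribution are assumptions chosen only to get skewed values between 0 and 1:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real data: 1,000 skewed values in (0, 1).
rng = np.random.default_rng(0)
df = pd.DataFrame({"p2": rng.beta(8, 2, size=1000)})

# pd.qcut places the bin edges at the empirical quantiles of p2.
# labels=False returns integer codes 0..9, so add 1 to get deciles 1..10.
df["decil"] = pd.qcut(df["p2"], q=10, labels=False) + 1

# Rows per decile; with distinct values these come out close to equal.
counts = df.groupby("decil").size()
print(counts)
```

With real data that contains many repeated values, the counts can be less even than in this sketch.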

pwwang · Answer 1 · Mon Dec 12 2022 22:01:13 GMT+0800 (China Standard Time)

Hi, could you show the versions of datar:

from datar import get_versions
get_versions()

Carlos Ortega · Answer 2 · Tue Dec 13 2022 02:06:33 GMT+0800 (China Standard Time)

Hi,
Yes, this is what I get.

>>> from datar import get_versions
>>> get_versions()
python      : 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]
datar       : 0.10.2
simplug     : 0.2.1
executing   : 1.2.0
pipda       : 0.10.0
datar-numpy : 0.0.0
numpy       : 1.23.4
datar-pandas: 0.0.0
pandas      : 1.5.2

Thanks!
Carlos.

pwwang · Answer 3 · Tue Dec 13 2022 02:59:00 GMT+0800 (China Standard Time)

This is a nice catch!
The ntile implementation should use pd.qcut (quantile-based buckets) instead of pd.cut (equal-width bins).
This should be fixed in datar-pandas v0.1.1.
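
To see why the choice between the two functions matters, here is a small side-by-side comparison. It is only an illustration on synthetic skewed data, not the actual datar-pandas code; the sample size, seed, and distribution are assumptions:

```python
import numpy as np
import pandas as pd

# Synthetic scores that pile up near the top of the range (assumed data).
rng = np.random.default_rng(42)
scores = pd.Series(rng.beta(8, 2, size=10_000))

# pd.cut: 10 bins of equal WIDTH between the minimum and the maximum.
by_width = pd.cut(scores, bins=10, labels=False) + 1
# pd.qcut: 10 bins with edges at the quantiles, so roughly equal COUNTS.
by_count = pd.qcut(scores, q=10, labels=False) + 1

# Count rows per bucket; reindex so an empty bucket shows up as 0.
buckets = range(1, 11)
summary = pd.DataFrame({
    "cut": by_width.value_counts().reindex(buckets, fill_value=0),
    "qcut": by_count.value_counts().reindex(buckets, fill_value=0),
})
print(summary)
```

On skewed input, the equal-width column has sparse low buckets and crowded high ones, which is the same pattern as in the counts reported above.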

Try updating datar with:

pip install -U datar[pandas]

Then run get_versions() again to confirm that datar-pandas v0.1.1 is installed.
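
If get_versions() is not at hand, the installed version can also be checked with the standard library. This is a rough sketch: version_tuple and has_at_least are hypothetical helper names, and the parsing only handles plain dotted numbers such as 0.1.1:

```python
from importlib.metadata import PackageNotFoundError, version


def version_tuple(text):
    """Turn '0.1.1' into (0, 1, 1); non-numeric parts are dropped."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def has_at_least(package, minimum):
    """True if `package` is installed at version `minimum` or newer."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        # Package is not installed in this environment.
        return False
    return version_tuple(installed) >= version_tuple(minimum)


print(has_at_least("datar-pandas", "0.1.1"))
```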

pwwang · Answer 4 · Tue Dec 13 2022 03:01:14 GMT+0800 (China Standard Time)

By the way, thanks for the compliments:

Thanks for your excellent package, which ports the R (dplyr) processing flow to Python. I have been using other alternatives, and yours is the most extensive and the closest equivalent to what is currently possible with dplyr.

Do you mind if I put it as a testimonial in the README file?

Carlos Ortega · Answer 5 · Tue Dec 13 2022 04:03:17 GMT+0800 (China Standard Time)

Thanks a lot for your quick fix!

No, I do not mind at all.
Thanks to you.
Carlos.

pwwang · Answer 6 · Tue Dec 13 2022 04:22:54 GMT+0800 (China Standard Time)

Thanks!

Please confirm if this is fixed and feel free to close it if so.
Feel free to open new issues if you have other questions.

Carlos Ortega · Answer 7 · Tue Dec 13 2022 06:35:28 GMT+0800 (China Standard Time)

Thanks,

Yes, I have just updated datar following your instructions, and the problem is now fixed.

       decil       n
  <category> <int64>
0          1    1438
1          2    1437
2          3    1437
3          4    1437
4          5    1437
5          6    1439
6          7    1435
7          8    1585
8          9    1293
9         10    1434

Thanks again,
Carlos.
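
The thread does not explain why buckets 8 and 9 still differ (1585 vs 1293). One possible cause, given that p2 contains repeated values, is that quantile-based binning keeps equal values in the same bucket. The toy example below illustrates that effect on made-up numbers and contrasts it with a rank-based split that breaks ties by position, which is how dplyr's ntile() is understood to behave; neither is taken from the datar source:

```python
import pandas as pd

# Made-up data: the value 4 is repeated four times and straddles the median.
values = pd.Series([1, 2, 3, 4, 4, 4, 4, 8, 9, 10])

# Quantile-based split into 2 buckets: all the 4s land in the same bucket.
by_value = pd.qcut(values, q=2, labels=False) + 1
print(by_value.value_counts().sort_index().tolist())  # [7, 3]

# Rank-based split: ties are ordered by position first, then cut in half.
ranks = values.rank(method="first")
by_rank = pd.qcut(ranks, q=2, labels=False) + 1
print(by_rank.value_counts().sort_index().tolist())  # [5, 5]
```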