postgresml / postgresml

The GPU-powered AI application database. Get your app to market faster using the simplicity of SQL and the latest NLP, ML + LLM models.

Home Page: https://postgresml.org

Potential performance issue: Unreliable performance of .loc in pandas 2.0.3

TendouArisu opened this issue

Issue Description:

Hello.
I have found a performance degradation in .loc in pandas 2.0.3 when it is used on a large DataFrame with a non-unique index. When .loc is given more than 4 index values, the time it takes increases drastically, by a factor of roughly 1000. I also noticed that pgml-sdks/pgml/python/examples/requirements.txt depends on pandas 2.0.3. I am not sure whether this performance problem in pandas affects this repository. I found some discussions on GitHub related to this issue, including #54550 and #54746.
I also found that pgml-extension/tests/xgboost_python.py uses the affected API. Other files may use it as well.
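For anyone who wants to check their own environment, here is a rough timing sketch. It is not taken from the pandas discussions. The DataFrame size, the number of distinct labels, and the helper name time_loc are my own assumptions for illustration, and the actual slowdown will depend on the pandas version and the data.

```python
import time

import numpy as np
import pandas as pd


def time_loc(df, labels, repeats=3):
    """Return the best wall-clock time (seconds) of df.loc[labels] over a few runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        df.loc[labels]
        best = min(best, time.perf_counter() - start)
    return best


# Assumed sizes, chosen only so the script finishes quickly.
n_rows, n_labels = 200_000, 1_000

# Every label appears n_rows // n_labels times, so the index is non-unique.
index = np.arange(n_rows) % n_labels
df = pd.DataFrame({"value": np.arange(n_rows, dtype="float64")}, index=index)

print("pandas", pd.__version__)
for k in (1, 4, 5, 8):
    labels = list(range(k))  # k labels that are all present in the index
    print(f"{k} labels: {time_loc(df, labels):.6f} s")
```

If the timings jump sharply between 4 and 5 labels, the installed pandas is probably affected.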

Suggestion

I would recommend upgrading to pandas >= 2.1, or exploring other ways to optimize the performance of .loc.
Any other workarounds or solutions would be greatly appreciated.
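One possible workaround, as a sketch only: select the rows with a boolean mask built from Index.isin instead of passing a list of labels to .loc. I have not measured whether this is actually faster on pandas 2.0.3, so that would need to be timed. Note also that the mask keeps the DataFrame's original row order, while .loc returns rows grouped in the order of the requested labels. The data and labels below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data: 1,000 distinct labels, each repeated 200 times (non-unique index).
n_rows, n_labels = 200_000, 1_000
df = pd.DataFrame(
    {"value": np.arange(n_rows, dtype="float64")},
    index=np.arange(n_rows) % n_labels,
)

labels = [3, 7, 11, 19, 23]  # more than 4 labels, all present in the index

via_loc = df.loc[labels]               # rows grouped in the order of `labels`
via_mask = df[df.index.isin(labels)]   # same rows, in original DataFrame order

# Both selections contain exactly the same rows once the order is normalized.
same_rows = via_loc.sort_values("value").equals(via_mask.sort_values("value"))
print("same rows:", same_rows)
```

If the code relies on the row order returned by .loc, the mask result would have to be reordered before it can be used as a drop-in replacement.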
Thank you!