data-apis / python-record-api

Inferring Python API signatures from tracing usage.

Pandas Ran out of memory again!

saulshanabrook opened this issue

So the pandas test suite ran out of memory again in Kubernetes. It used up ~13 GB and was then killed, since that is all the memory the pods have available.
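One way to track down the culprit locally, instead of letting Kubernetes OOM-kill the pod, is to cap the test process's address space so that the offending test raises MemoryError in-process. A rough sketch (Linux-specific, and not anything from this repo; the helper name is made up):

```python
import resource

# Current address-space limits for this process; RLIM_INFINITY means unlimited.
soft, hard = resource.getrlimit(resource.RLIMIT_AS)

def cap_memory(max_gb: float) -> None:
    """Cap this process's virtual address space (Linux only).

    With a cap in place, an over-allocating test raises MemoryError
    instead of the whole container being OOM-killed, so pytest can
    report exactly which test blew up.
    """
    limit = int(max_gb * 1024 ** 3)
    resource.setrlimit(resource.RLIMIT_AS, (limit, hard))
```

Calling something like cap_memory(12) at the top of a conftest.py would turn a silent pod kill into an ordinary test failure.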

I am a bit hesitant to just raise the pod memory limit again... If anyone knows whether this is a reasonable amount of memory for Pandas to use when testing (cc @datapythonista), that would be helpful! It's also possible that the tracing has some sort of memory leak which is blowing things up for pandas, although none of the other test suites seem to have the same problem.

Maybe I can run the Pandas test suite with some flags to skip the high-memory tests? These are my current flags:

CMD [ "pytest", "pandas", "--skip-slow", "--skip-network", "--skip-db", "-m", "not single", "-r", "sxX", "--strict", "--suppress-tests-failed-exit-code" ]

I copied it from the test-fast script, or whatever that is, in the Pandas repo.

That's strange. There is a flag --run-high-memory to run the high-memory tests, which are not run by default. I'm not sure how much memory the suite requires, but surely not 13 GB.

I'm not sure if test-fast is up to date; I'd probably use the settings the CI uses, in ci/azure/posix.yml.

pytest -m "not slow and not network and not clipboard" pandas is what I'd use. I don't think that should be very different from what you've got, but you can give it a try, just in case.

I'm not sure if we've got many more markers you can play with, but if you uninstall optional libraries you'll be running fewer tests. You can set up an environment with just numpy, dateutil and pytz and give it a try; this should run the core tests only. For what it's worth, this approach makes the test setup quite unreliable, and it sometimes has problems with tests being silently skipped. I wouldn't recommend it, but it is how it works now in pandas.

> I'm not sure if we've got many more markers you can play with, but if you uninstall optional libraries you'll be running fewer tests. You can set up an environment with just numpy, dateutil and pytz and give it a try; this should run the core tests only.

Interesting, I will give that a go. Just those three? Is that documented anywhere or used anywhere, or just something you try locally when you are running fewer tests?

> Just those three? Is that documented anywhere or used anywhere, or just something you try locally when you are running fewer tests?

Those three are documented as the minimal required dependencies (e.g. https://pandas.pydata.org/docs/dev/getting_started/install.html#dependencies), and since we automatically skip tests for optional dependencies (not something that is explicitly documented, I think), having only those installed is the way to run only the tests that don't rely on any optional dependency.
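For reference, that skipping boils down to a "skip if the import fails" check: pytest exposes it as pytest.importorskip, and pandas wraps it in its own test decorators. A rough sketch of the underlying probe (the helper name here is made up for illustration):

```python
import importlib.util

# The minimal required dependencies; tests relying only on these always run.
CORE_DEPENDENCIES = ("numpy", "dateutil", "pytz")

def has_dependency(name: str) -> bool:
    """Return True if a dependency is importable, without importing it."""
    return importlib.util.find_spec(name) is not None

# In a test, pytest.importorskip("lxml") does this kind of check and marks
# the test as skipped (rather than failed) when the module is absent.
missing_core = [mod for mod in CORE_DEPENDENCIES if not has_dependency(mod)]
```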

Still happening :( It is now exceeding 14 GB... #94 (comment)