dovpanda-dev / dovpanda

Directions overlay for working with pandas in an analysis environment

Low_Memory Warning

AnzorGozalishvili opened this issue · comments

Brief Description

When reading data into a DataFrame (for example with read_csv), there is a parameter called low_memory, which is set to True by default. As far as I understand, its function is to decide the minimal data type required to fit the values of each column, which seems to be for memory optimization. To detect the correct data type, every value in a column has to be considered, which is not optimal for a big DataFrame for two reasons, I guess: memory and loading time. My assumption is that pandas optimizes for both, and that is why this parameter is True by default. I didn't dig into how the optimized version detects data types (maybe it reads a chunk of the data and takes the minimal requirement from that).
The problem is that it sometimes gives unexpected results. I once spent a week on heavy calculations over chunks of data, hoping to reassemble the results using an index that was definitely unique. But I missed one detail: the index had 8 digits at the beginning of the data and later grew to 16 digits (it was taken from a database whose primary key differed between versions). While reading the chunks, I was actually getting only the first 8 digits of the 16-digit index, since low_memory was True by default and did not check all index values. In the end I had the results of the calculations but no way to reassemble them and merge them back into the original data.
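One way to guard against this kind of surprise is to stop relying on type inference for key columns. The sketch below is only an illustration with made-up data (the column names and values are hypothetical, not the original dataset): it declares the index column's dtype explicitly, and separately shows low_memory=False, which the pandas documentation suggests as the other way to avoid mixed type inference.

```python
import io

import pandas as pd

# Hypothetical data: IDs that grow from 8 to 16 digits partway through the file.
csv_text = "id,value\n12345678,1.0\n1234567890123456,2.0\n"

# Declaring the dtype up front means pandas does not have to infer it,
# so the column is read the same way no matter how the file is split up.
df = pd.read_csv(io.StringIO(csv_text), dtype={"id": str})

# low_memory=False parses the file in one pass instead of internal chunks,
# at the cost of higher peak memory use while parsing.
df_full = pd.read_csv(io.StringIO(csv_text), low_memory=False)
```

An explicit dtype also applies when the file is read in pieces with chunksize, so every piece should end up with the same column type.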
I told such a long and dramatic story because the low_memory option is very strange: nobody takes it seriously, but it becomes critical in some cases.
So please consider this case and add a warning about it to dovpanda.
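As a rough idea of what such a check could look for, here is a minimal sketch. The function name mixed_type_columns is hypothetical and is not part of dovpanda or pandas; it simply flags object columns that hold more than one Python type, which is one symptom of inconsistent type inference.

```python
import pandas as pd


def mixed_type_columns(df):
    """Return names of object-dtype columns that hold more than one Python type."""
    mixed = []
    for name in df.columns:
        col = df[name]
        # Only object columns can hold values of different Python types.
        if col.dtype == object and len(set(map(type, col.dropna()))) > 1:
            mixed.append(name)
    return mixed


# Hypothetical frame: an ID column that ended up holding both int and str values.
example = pd.DataFrame({"id": [12345678, "1234567890123456"], "value": [1.0, 2.0]})
flagged = mixed_type_columns(example)
```

A hint could then be shown for each flagged column, suggesting an explicit dtype or low_memory=False.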