QST: Why does deepchecks use NumPy to store the NLP text list? That can easily cause a memory overflow.
san5167 opened this issue · comments
Research
- I have searched the [deepchecks] tag on StackOverflow for similar questions.
- I have asked my usage-related question on StackOverflow.

Link to question on StackOverflow
no
Question about deepchecks
Desc:
NumPy pre-allocates memory for both the pointers and the stored data. For string dtypes, the width allocated to every element is determined by the longest string in the input (at 4 bytes per character for the fixed-width unicode dtype), so a single very long row inflates the storage of every other row.
This can easily lead to excessive memory allocation and out-of-memory errors.
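A minimal standalone sketch of this padding behavior, with no deepchecks involved (the string lengths here are arbitrary illustrative values):

```python
import numpy as np

# Two strings of very different lengths.
texts = ["hi", "x" * 1_000_000]

# np.asarray on a list of str produces a fixed-width unicode dtype:
# every element is padded to the longest string, at 4 bytes per character.
arr = np.asarray(texts)
print(arr.dtype)   # <U1000000
print(arr.nbytes)  # 2 elements * 1_000_000 chars * 4 bytes = 8_000_000

# An object-dtype array stores only pointers to the Python string objects,
# so the short string costs almost nothing extra.
obj_arr = np.asarray(texts, dtype=object)
print(obj_arr.nbytes)  # typically 16 bytes (2 pointers) on a 64-bit platform
```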
Example:
I converted this data (https://huggingface.co/datasets/bigcode/the-stack/blob/main/data/abap/train-00000-of-00001.parquet) to TextData. This dataset has only 23,512 rows, but requires 66.4 GB of memory to be allocated.
Test Code:
import numpy as np
import pandas as pd
from deepchecks.nlp import TextData
abap = pd.read_parquet(path="D:/worker/dataset/stack/train-00000-of-00001.parquet")
raw_text = abap['content'].to_list()
max_len = 0
max_index = 0
for i in range(len(raw_text)):
    if max_len < len(raw_text[i]):
        max_len = len(raw_text[i])
        max_index = i
abap = TextData(raw_text)
# np.asarray([str(x) for x in raw_text]) allocates the same amount of memory as a
# fixed-width array of len(raw_text) elements of max_len characters each.
# (np.empty((len(raw_text), max_len), dtype=np.str_) does not model this:
# np.str_ without an explicit length has itemsize 0.)
empty_numpy = np.empty(len(raw_text), dtype=f'<U{max_len}')
print(empty_numpy.nbytes)  # len(raw_text) * max_len * 4 bytes
empty_numpy[:] = raw_text  # fill() expects a scalar; elementwise assignment works
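For reference, the reported 66.4 GB figure is consistent with this fixed-width layout. A back-of-the-envelope check (the `max_len` value here is hypothetical, back-solved from the reported figure rather than measured from the dataset):

```python
# Rough check of the reported allocation: NumPy's unicode dtype stores
# 4 bytes per character, and every row is padded to the longest string.
n_rows = 23_512    # rows in the abap parquet file (from the issue)
max_len = 706_000  # hypothetical longest-string length, back-solved
                   # from the reported 66.4 GB; not measured
expected_bytes = n_rows * max_len * 4
print(f"{expected_bytes / 1e9:.1f} GB")  # ~66.4 GB
```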
Hi @san5167, very good point. Fixing this will be a pretty serious refactor, so will take some time to get to.