QST: Why does deepchecks use NumPy to store the NLP text list? That can easily cause a memory overflow.
san5167 opened this issue · comments
Research
- I have searched the [deepchecks] tag on StackOverflow for similar questions.
- I have asked my usage-related question on StackOverflow.

Link to question on StackOverflow
no
Question about deepchecks
Desc:
NumPy pre-allocates memory for both the pointers and the stored data. For string dtypes, the width allocated to every element is determined by the longest string in the input (at 4 bytes per character for the fixed-width unicode dtype), so a single very long row inflates the storage of every other row.
This can easily lead to excessive memory allocation and out-of-memory errors.
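A minimal standalone sketch of this padding behavior, with no deepchecks involved (the string lengths here are arbitrary illustrative values):

```python
import numpy as np

# Two strings of very different lengths.
texts = ["hi", "x" * 1_000_000]

# np.asarray on a list of str produces a fixed-width unicode dtype:
# every element is padded to the longest string, at 4 bytes per character.
arr = np.asarray(texts)
print(arr.dtype)   # <U1000000
print(arr.nbytes)  # 2 elements * 1_000_000 chars * 4 bytes = 8_000_000

# An object-dtype array stores only pointers to the Python string objects,
# so the short string costs almost nothing extra.
obj_arr = np.asarray(texts, dtype=object)
print(obj_arr.nbytes)  # typically 16 bytes (2 pointers) on a 64-bit platform
```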
Example:
I converted this data (https://huggingface.co/datasets/bigcode/the-stack/blob/main/data/abap/train-00000-of-00001.parquet) to TextData. This dataset has only 23,512 rows, but requires 66.4 GB of memory to be allocated.
Test Code:
import numpy as np
import pandas as pd
from deepchecks.nlp import TextData
abap = pd.read_parquet(path="D:/worker/dataset/stack/train-00000-of-00001.parquet")
raw_text = abap['content'].to_list()
max_len = 0
max_index = 0
for i in range(len(raw_text)):
    if max_len < len(raw_text[i]):
        max_len = len(raw_text[i])
        max_index = i
abap = TextData(raw_text)
# np.asarray([str(x) for x in raw_text]) allocates the same amount of memory as a
# fixed-width array of len(raw_text) elements of max_len characters each.
# (np.empty((len(raw_text), max_len), dtype=np.str_) does not model this:
# np.str_ without an explicit length has itemsize 0.)
empty_numpy = np.empty(len(raw_text), dtype=f'<U{max_len}')
print(empty_numpy.nbytes)  # len(raw_text) * max_len * 4 bytes
empty_numpy[:] = raw_text  # fill() expects a scalar; elementwise assignment works
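For reference, the reported 66.4 GB figure is consistent with this fixed-width layout. A back-of-the-envelope check (the `max_len` value here is hypothetical, back-solved from the reported figure rather than measured from the dataset):

```python
# Rough check of the reported allocation: NumPy's unicode dtype stores
# 4 bytes per character, and every row is padded to the longest string.
n_rows = 23_512    # rows in the abap parquet file (from the issue)
max_len = 706_000  # hypothetical longest-string length, back-solved
                   # from the reported 66.4 GB; not measured
expected_bytes = n_rows * max_len * 4
print(f"{expected_bytes / 1e9:.1f} GB")  # ~66.4 GB
```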
Hi @san5167, very good point. Fixing this will be a pretty serious refactor, so will take some time to get to.