deepchecks / deepchecks

Deepchecks: Tests for Continuous Validation of ML Models & Data. Deepchecks is a holistic open-source solution for all of your AI & ML validation needs, enabling you to thoroughly test your data and models from research to production.

Home Page: https://docs.deepchecks.com/stable


QST: Why does deepchecks use NumPy to store the NLP text list? That can easily cause a memory overflow.

san5167 opened this issue · comments

Research

Link to question on StackOverflow

No

Question about deepchecks

Desc:

NumPy pre-allocates memory for both the array metadata and the stored data. Moreover, with a fixed-width string dtype, the memory allocated to every row is determined by the longest string in the original data: each element is padded to that maximum length.
This can easily cause memory allocation to blow up.
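A minimal standalone sketch of this behavior (plain NumPy, not deepchecks code): when one string is much longer than the rest, every element is padded to the longest length.

import numpy as np

texts = ["hi", "x" * 1_000_000]  # one short string, one very long one

arr = np.asarray(texts)  # fixed-width unicode dtype: '<U1000000'
print(arr.dtype)         # <U1000000
print(arr.nbytes)        # 2 elements * 1,000,000 chars * 4 bytes = 8,000,000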

Example:

I converted this data (https://huggingface.co/datasets/bigcode/the-stack/blob/main/data/abap/train-00000-of-00001.parquet) to TextData. The dataset has only 23,512 rows, but 66.4 GB of memory had to be allocated.
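(For scale, assuming NumPy's 4 bytes per character for a fixed-width unicode dtype and GB as 10^9 bytes: 66.4 GB / (23,512 rows × 4 bytes per character) ≈ 706,000, so the longest file in the dataset is roughly 700k characters and every one of the 23,512 rows gets padded to that width.)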

Test Code:

import numpy as np
import pandas as pd
from deepchecks.nlp import TextData

abap = pd.read_parquet(path="D:/worker/dataset/stack/train-00000-of-00001.parquet")
raw_text = abap['content'].to_list()

# Length of the longest document in the dataset
max_len = max(len(text) for text in raw_text)

# TextData converts the list into a NumPy array internally
abap = TextData(raw_text)

# np.asarray([str(x) for x in raw_text]) produces a fixed-width unicode array
# (dtype '<U{max_len}'), so it allocates the same amount of memory as this
# explicitly sized empty array.
empty_numpy = np.empty(len(raw_text), dtype=f'<U{max_len}')
print(empty_numpy.nbytes)   # len(raw_text) * max_len * 4 bytes
empty_numpy[:] = raw_text
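For comparison, one possible direction for a fix (my sketch, not the deepchecks API): an object-dtype array stores only one pointer per element and leaves the Python strings where they are, so memory stays proportional to the total text size rather than rows × longest string.

import numpy as np

texts = ["hi", "x" * 1_000_000]  # same skewed lengths as above

obj_arr = np.asarray(texts, dtype=object)  # pointers to the Python strings
print(obj_arr.nbytes)  # 16 bytes on a 64-bit build: just two pointers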

Hi @san5167, very good point. Fixing this will be a pretty serious refactor, so will take some time to get to.

OK, I'm really looking forward to this follow-up optimization.