memoryError: bad allocation for rapidfuzz.process.cdist
al-yakubovich opened this issue · comments
Hi, the following code raises a `MemoryError`:
```python
import numpy as np
import pandas as pd
from rapidfuzz import process, fuzz

d_test = {
    'name': ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat', 'Fish', 'Dry Fish', 'Fish'],
    'cluster_number': [1, 2, 3, 3, 2, 1, 4, 2, 2, 2, 2]
}
df_test = pd.DataFrame(d_test)
names = df_test["name"]
scores = pd.DataFrame(process.cdist(names, names, workers=-1), columns=names, index=names)
x, y = np.where(scores > 50)
groups = (pd.DataFrame(scores.index[x], scores.index[y])
          .groupby(level=0)
          .agg(frozenset)
          .drop_duplicates()
          .reset_index(drop=True)
          .reset_index()
          .explode("name"))
groups.rename(columns={'index': 'restaurant_id'}, inplace=True)
groups.restaurant_id += 1
df_test = df_test.merge(groups, how="left")
```
The error occurs on the line `scores = pd.DataFrame(process.cdist(names, names, workers=-1), columns=names, index=names)` when `df_test` is replaced with a dataframe of 1 million rows. My PC has 12 GB of free RAM. Any ideas how to avoid this error?
`cdist` returns a matrix of `len(queries) x len(choices) x size(dtype)` bytes. By default this dtype is `float` or `int32_t` depending on the scorer (for the default scorer you are using, it is `float`). So for 1 million names, the result matrix would require 3.6 terabytes of memory.
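That figure follows directly from the matrix shape; a quick back-of-envelope check, assuming 4 bytes per score (as for a 32-bit float):

```python
# Memory needed for a full 1M x 1M score matrix at 4 bytes per entry.
n = 1_000_000
bytes_needed = n * n * 4
tib = bytes_needed / 2**40   # convert bytes to tebibytes
print(f"{tib:.2f} TiB")      # ≈ 3.64 TiB
```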
You will need to process your data in smaller chunks and store the results on disk in between.