vecxoz / vecstack

Python package for stacking (machine learning technique)

Error in `python': free(): invalid next size (normal)

lukyanenkomax opened this issue · comments

Using any model except GaussianNB causes an error in stacking():
task: [classification]
n_classes: [2]
metric: [log_loss]
mode: [oof_pred_bag]
n_models: [1]
model 0: [LogisticRegression]
/opt/conda/lib/python3.6/site-packages/sklearn/linear_model/base.py:297: RuntimeWarning: overflow encountered in exp
np.exp(prob, prob)
----
MEAN: [0.56676799] + [0.01295934]
FULL: [0.56677227]

*** Error in `python': free(): invalid next size (normal): 0x0000564aaa718ea0 ***
How can I debug this to find the cause of the error?

Hi!
I can't reproduce this error.
It looks like a low-level memory management error that is not related to the vecstack package.
Please try to run the code below and post the complete output you get, including the whole traceback.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
from sklearn.linear_model import LogisticRegression
from vecstack import stacking

# make binary classification data
X, y = make_classification(n_classes=2, n_samples=500,
                           n_features=5, n_informative=3,
                           n_redundant=1, flip_y=0, random_state=0)

# split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=0)

# models
models = [LogisticRegression(random_state=0)]

# run stacking
S_train, S_test = stacking(models, X_train, y_train, X_test, 
                           regression=False, metric=log_loss, 
                           needs_proba=True, mode='oof_pred_bag',
                           random_state=0, verbose=2)

My output:

task:         [classification]
n_classes:    [2]
metric:       [log_loss]
mode:         [oof_pred_bag]
n_models:     [1]

model  0:     [LogisticRegression]
    fold  0:  [0.49452530]
    fold  1:  [0.43760224]
    fold  2:  [0.45353286]
    fold  3:  [0.41709172]
    ----
    MEAN:     [0.45068803] + [0.02841544]
    FULL:     [0.45068803]

I got the same output as you with the code above, and the script executed successfully. My dataset is not that large: 5000 rows, ~800 columns, about 50 MB in RAM. I reduced my dataset to the first 500 rows and about 10 columns and now it executes successfully. I run my script in a Kaggle kernel, which should not have memory issues with such a small dataset. I have no idea what the cause of this error is.

It's hard to tell what the cause might be.
First, let's try to fit and predict on your data using LogisticRegression without stacking.
Do you get the same error? Please post the full output and traceback.

from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
from sklearn.linear_model import LogisticRegression

# create numpy arrays from your full data
X, y = 

# split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=0)

# fit and predict
model = LogisticRegression(random_state=0)
model = model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)
print(y_pred[:10])
print(y_pred_proba[:10])

It seems your package is not affected by this error; I got the same error when I ran a different kernel with a similar ensemble approach.

It's an interesting problem anyway.
One possibility is that there are some issues with your dataset,
for example extremely large or extremely small numbers, missing values, or unusual encoding.
So you can try to investigate: check for missing values,
try to scale the data using StandardScaler or MinMaxScaler,
or just look at some random examples to see what's going on.
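A minimal sketch of those checks, using synthetic data as a stand-in for your real `X` and `y` (the variable names are placeholders for your arrays):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

# stand-in for your real data
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# check for missing or non-finite values
print('any NaN:', np.isnan(X).any())
print('all finite:', np.isfinite(X).all())

# check for extreme magnitudes that can overflow exp()
print('min / max:', X.min(), X.max())

# scale features to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

# inspect a few random rows
rng = np.random.RandomState(0)
print(X_scaled[rng.choice(len(X_scaled), 3, replace=False)])
```

If `np.isfinite(X).all()` is False or the min/max values are enormous, that would be consistent with the `overflow encountered in exp` warning in your log.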

Good luck!