vecxoz / vecstack

Python package for stacking (machine learning technique)

Error in `python': free(): invalid next size (normal)

lukyanenkomax opened this issue · comments

Using any model except GaussianNB causes an error in stacking():
task: [classification]
n_classes: [2]
metric: [log_loss]
mode: [oof_pred_bag]
n_models: [1]
model 0: [LogisticRegression]
/opt/conda/lib/python3.6/site-packages/sklearn/linear_model/base.py:297: RuntimeWarning: overflow encountered in exp
np.exp(prob, prob)
----
MEAN: [0.56676799] + [0.01295934]
FULL: [0.56677227]

*** Error in `python': free(): invalid next size (normal): 0x0000564aaa718ea0 ***
How can I debug this to find the cause of the error?

Hi!
I can't reproduce this error.
It looks like a low-level memory management error that is not related to the vecstack package.
Please try to run the code below and post the complete output you get, including the whole traceback.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
from sklearn.linear_model import LogisticRegression
from vecstack import stacking

# make binary classification data
X, y = make_classification(n_classes=2, n_samples=500,
                           n_features=5, n_informative=3,
                           n_redundant=1, flip_y=0, random_state=0)

# split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=0)

# models
models = [LogisticRegression(random_state=0)]

# run stacking
S_train, S_test = stacking(models, X_train, y_train, X_test, 
                           regression=False, metric=log_loss, 
                           needs_proba=True, mode='oof_pred_bag',
                           random_state=0, verbose=2)

My output:

task:         [classification]
n_classes:    [2]
metric:       [log_loss]
mode:         [oof_pred_bag]
n_models:     [1]

model  0:     [LogisticRegression]
    fold  0:  [0.49452530]
    fold  1:  [0.43760224]
    fold  2:  [0.45353286]
    fold  3:  [0.41709172]
    ----
    MEAN:     [0.45068803] + [0.02841544]
    FULL:     [0.45068803]

I got the same output as you with the code above, and the script executed successfully. My dataset is not that large: 5000 rows, ~800 columns, about 50 MB in RAM. I reduced my dataset to the first 500 rows and about 10 columns and now it executes successfully. I run my script in a Kaggle kernel, which should not have memory issues with such a small dataset. I have no idea what the cause of this error is.

It's hard to tell what the cause might be.
First, let's try to fit and predict on your data using LogisticRegression without stacking.
Do you get the same error? Please post the full output and traceback.

from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
from sklearn.linear_model import LogisticRegression

# create numpy arrays from your full data
X, y = 

# split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=0)

# fit and predict
model = LogisticRegression(random_state=0)
model = model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)
print(y_pred[:10])
print(y_pred_proba[:10])

It seems your package is not affected by this error; I got the same error when I ran a different kernel with a similar ensemble approach.

It's an interesting problem anyway.
One possibility is that there are some issues with your dataset,
for example extremely large or extremely small numbers, missing values, or unusual encoding.
So you can try to investigate: check for missing values,
try to scale the data using StandardScaler or MinMaxScaler,
or just look at some random examples to see what's going on.
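A minimal sketch of those checks, using synthetic data as a stand-in for your real `X` and `y` (the variable names are placeholders for your arrays):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

# stand-in for your real data
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# check for missing or non-finite values
print('any NaN:', np.isnan(X).any())
print('all finite:', np.isfinite(X).all())

# check for extreme magnitudes that can overflow exp()
print('min / max:', X.min(), X.max())

# scale features to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

# inspect a few random rows
rng = np.random.RandomState(0)
print(X_scaled[rng.choice(len(X_scaled), 3, replace=False)])
```

If `np.isfinite(X).all()` is False or the min/max values are enormous, that would be consistent with the `overflow encountered in exp` warning in your log.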

Good luck!