AutoViML / featurewiz

Use advanced feature engineering strategies and select best features from your data set with a single line of code. Created by Ram Seshadri. Collaborators welcome.

Output exceeds the size limit. Open the full output data in a text editor

eromoe opened this issue · comments

Hello, I hit this error while testing featurewiz. I want to do some automatic feature engineering, so I chose the old way, but unfortunately got: Output exceeds the size limit. Open the full output data in a text editor.

Detail:

  • X shape: (128463, 1341), with mixed string, int, float, and NaN values.
  • code:
import featurewiz as FW
outputs = FW.featurewiz(dataname=X.reset_index(drop=True), target=y.reset_index(drop=True),
                        corr_limit=0.70, verbose=2, sep=',', header=0, test_data='',
                        feature_engg='', category_encoders='',
                        dask_xgboost_flag=False, nrows=None)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
f:\Work\jupyter_pipeline\pj01\1.1.0 clean_data.ipynb Cell 126 in <cell line: 1>()
      1 if Config.add_feature:
      2     # # Add feature
      3     # from jinshu_model.build_models import HighDimensionFeatureAdder
   (...)
      8     # ce = HighDimensionFeatureAdder(max_gmm_component=4, onehot=False)
      9     # X = ce.fit_transform(X)
     10     import featurewiz as FW
---> 11     outputs = FW.featurewiz(dataname=X.reset_index(drop=True), target=y.reset_index(drop=True), corr_limit=0.70, verbose=2, sep=',', 
     12             header=0, test_data='',feature_engg='', category_encoders='',
     13             dask_xgboost_flag=False, nrows=None)
     14 else:
     15     ce = CategoricalEncoder()

File c:\Users\ufo\anaconda3\lib\site-packages\featurewiz\featurewiz.py:793, in featurewiz(dataname, target, corr_limit, verbose, sep, header, test_data, feature_engg, category_encoders, dask_xgboost_flag, nrows, **kwargs)
    791     print('Classifying features using a random sample of %s rows from dataset...' %nrows_limit)
    792     ##### you can use nrows_limit to select a small sample from data set ########################
--> 793     train_small = EDA_randomly_select_rows_from_dataframe(dataname, targets, nrows_limit, DS_LEN=dataname.shape[0])
    794     features_dict = classify_features(train_small, target)
    795 else:

File c:\Users\ufo\anaconda3\lib\site-packages\featurewiz\featurewiz.py:2977, in EDA_randomly_select_rows_from_dataframe(train_dataframe, targets, nrows_limit, DS_LEN)
   2975     test_size = 0.9
...
-> 5842     raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   5844 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
   5845 raise KeyError(f"{not_found} not in index")

KeyError: "None of [Int64Index([0, 0, 0, 0, 1, 1, 0, 0, 0, 1,\n            ...\n            0, 0, 0, 0, 0, 0, 1, 0, 0, 0],\n           dtype='int64', length=128463)] are in the [columns]"

Hi @eromoe 👍
There is something wrong with the input. I would suggest you avoid the y.reset_index(drop=True) statement. Instead, just send X and y as is. Featurewiz doesn't mind if you send dataframes with an index in them. Can you try that and report back to me?

Thanks
Autovimal

@AutoViML The index is a string, so I dropped it; without dropping I got the same error.
PS: The new-style feature selection works well with the same input.

hello @eromoe
I figured out the problem in the first statement. 👍

You must send in the entire train dataframe in the statement below, and the target must be the name of your target column within that dataframe. Instead, you sent in X and y separately. That's the issue!

import featurewiz as FW
outputs = FW.featurewiz(dataname=X.reset_index(drop=True), target=y.reset_index(drop=True), corr_limit=0.70, verbose=2, sep=',', 
          header=0, test_data='',feature_engg='', category_encoders='',
          dask_xgboost_flag=False, nrows=None)
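For anyone landing on this issue later: the fix is to merge the features and the target into a single dataframe and pass the target's column name as a string. A minimal sketch of the idea (the column names here are made up, and the featurewiz call itself is left commented since it requires the real dataset and library):

```python
import pandas as pd

# Hypothetical stand-ins for the X (features) and y (target) from this thread
X = pd.DataFrame({"f1": [1, 2, 3, 4], "f2": ["a", "b", "a", "b"]})
y = pd.Series([0, 1, 0, 1], name="label")

# Combine features and target into ONE training dataframe
train = X.copy()
train["label"] = y.values   # .values sidesteps index-alignment surprises

# Then pass the combined frame plus the target COLUMN NAME (a string),
# not a separate Series:
# import featurewiz as FW
# outputs = FW.featurewiz(dataname=train, target="label", corr_limit=0.70,
#                         verbose=2, dask_xgboost_flag=False, nrows=10000)
print(train.columns.tolist())   # the target column now lives inside train
```

This also explains the original KeyError: when target was passed as a Series of 0/1 values, featurewiz tried to look those values up as column names, hence "None of [Int64Index([0, 0, 0, ...])] are in the [columns]".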

Also, I noticed that this is a big dataframe, so you might want to set nrows to 10000 or some other small value so that your dataframe can be handled in pandas without blowing up.
Hope this finally solves it.
AutoVimal

Oh, that's my mistake. Thank you for pointing it out!

[screenshot attachment]

help me