identify_zero_importance with asymmetrical data

Question

RustyBrain opened this issue 6 years ago

Hi,

Firstly, I want to say thank you for such an amazing piece of work. I really need this!

Secondly, I am trying to identify zero-importance features in my dataset, but when I run fs.identify_zero_importance() I get the following:

fs.identify_zero_importance('classification', eval_metric='auc')
Training Gradient Boosting Model
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Users\...\venv\lib\site-packages\feature_selector\feature_selector.py", line 306, in identify_zero_importance
    train_features, valid_features, train_labels, valid_labels = train_test_split(features, labels, test_size = 0.15)
  File "C:\Users\...\venv\lib\site-packages\sklearn\model_selection\_split.py", line 2031, in train_test_split
    arrays = indexable(*arrays)
  File "C:\Users\...\venv\lib\site-packages\sklearn\utils\validation.py", line 229, in indexable
    check_consistent_length(*result)
  File "C:\Users\...\venv\lib\site-packages\sklearn\utils\validation.py", line 204, in check_consistent_length
    " samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [150, 144]

Taking a look at the errors in the validation.py script, it looks as though this error will be thrown when the length and width of the dataframe passed into the FeatureSelector(data=df) at the beginning are not equal. Is this correct? How can I fix this?

Will Koehrsen · Answer 1 · Wed Dec 12 2018 06:14:35 GMT+0800 (China Standard Time)

This is going to occur when the number of training data points is not equal to the number of labels. For each data point (row) you need to have a corresponding label. Can you make sure that is the case for your problem?
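The check described above can be sketched in plain Python. This is only an illustration, not the library's own validation: `check_same_length` and the toy data are hypothetical names made up for this example.

```python
def check_same_length(features, labels):
    """Raise ValueError if the number of rows and the number of labels differ."""
    n_rows, n_labels = len(features), len(labels)
    if n_rows != n_labels:
        raise ValueError(
            f"Found {n_rows} observations but {n_labels} labels; "
            "each row needs exactly one label."
        )
    return n_rows


# Hypothetical toy data: 3 observations (rows) with 2 features each.
features = [[0.1, 1.2], [0.4, 0.9], [0.3, 1.1]]
labels = ["north", "south", "north"]

# Passes because there is one label per row.
n_checked = check_same_length(features, labels)
```

Running a check like this before calling identify_zero_importance should surface a mismatch such as [150, 144] earlier and with a clearer message.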

RustyBrain · Answer 2 · Thu Dec 13 2018 16:47:14 GMT+0800 (China Standard Time)

Thanks for the reply. Just for my understanding, if I pass in a standard dataframe, is it identifying rows or columns that have zero importance? I have tried passing labels in for both and I either get the above error (passing df.columns in) or ValueError: y contains new labels: (passing df.index in). Each row and column has a label. I understand that these errors are being thrown by sklearn, but any advice would be appreciated.

Will Koehrsen · Answer 3 · Fri Dec 14 2018 03:19:59 GMT+0800 (China Standard Time)

In machine learning, features are in the columns with observations in the rows. As we want to identify features with zero importance, we check the columns. You should be passing in the entire dataframe (with observations in the rows and features in the columns) along with the labels for identifying zero importance features. You need to have the same number of observations in the dataframe and in the labels.

RustyBrain · Answer 4 · Tue Dec 18 2018 00:27:32 GMT+0800 (China Standard Time)

Hi Will, thanks for the reply. I know you are not here to teach ignorami like me, but I do appreciate your advice. My dataframe has that shape: features as columns and observations as rows. Is the labels argument the label of the feature or of the observation? I have tried both and get errors either way: the error from my first post when passing in the feature names as labels, and ValueError: y contains new labels: [A list of observation names] when passing in the observation names as labels. My dataset is wide rather than long (around 250 features and 150 observations); could this be the source of the errors? I have checked that the length of the labels and index are the same.

RustyBrain · Answer 5 · Tue Dec 18 2018 01:05:22 GMT+0800 (China Standard Time)

I have also just done some more evaluation and it appears that when I pass the observation names as labels I get the contains new labels error, and it lists 23 (15.333%) of the labels as new, and these change each time I attempt identify_zero_importance. Is this something to do with the test/train split?
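The 23 (15.333%) figure matches the 15% hold-out seen in the traceback (test_size = 0.15) applied to 150 observations. The sketch below is a plain-Python illustration under two assumptions: that the split size is rounded up to a whole number of rows, and that validation labels are checked against the labels seen in training. The `area_` names are a hypothetical stand-in for df.index.

```python
import math
import random

n_samples = 150
test_size = 0.15

# A fractional hold-out size rounded up to a whole number of rows.
n_valid = math.ceil(test_size * n_samples)

# Hypothetical stand-in for df.index: one unique name per observation.
index_labels = [f"area_{i}" for i in range(n_samples)]

# Shuffle, then hold out the first n_valid labels for validation.
random.seed(0)
shuffled = random.sample(index_labels, k=n_samples)
valid_labels = shuffled[:n_valid]
train_labels = shuffled[n_valid:]

# Every label is unique, so no held-out label also appears in training.
unseen = set(valid_labels) - set(train_labels)
share_unseen = 100 * len(unseen) / n_samples
```

Because the split is random, a different set of 23 names lands in the hold-out on each run, which would explain why the reported labels change every time.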

Will Koehrsen · Answer 6 · Tue Dec 18 2018 04:19:21 GMT+0800 (China Standard Time)

Could you share the code that is giving you errors?

RustyBrain · Answer 7 · Wed Dec 19 2018 19:43:59 GMT+0800 (China Standard Time)

There is a lot of wrangling to get the dataframe in shape, then I call

fs = FeatureSelector(data=df, labels=df.index)
fs.identify_zero_importance(task = 'classification', eval_metric = 'auc', 
                            n_iterations = 10, early_stopping = True)

Which leads to the error:

ValueError: y contains new labels: [`a list of 23 (15.333%) of the items in the index`]

Will Koehrsen · Answer 8 · Thu Dec 20 2018 00:23:08 GMT+0800 (China Standard Time)

The labels should be in a separate array, not in the dataframe itself. What kind of labels do you have, binary, multiclass, or continuous?
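A minimal sketch of keeping the labels in a separate array, assuming the target lives in a dataframe column. The column name "region" and the toy values are hypothetical; only the FeatureSelector(data=..., labels=...) call shape comes from this thread.

```python
import pandas as pd

# Hypothetical data: "region" is the target, the other columns are features.
df = pd.DataFrame(
    {
        "feature_a": [0.1, 0.4, 0.3, 0.8],
        "feature_b": [1.2, 0.9, 1.1, 0.2],
        "region": ["north", "south", "north", "south"],
    }
)

# Pull the target out into its own array, one label per row.
labels = df["region"].to_numpy()

# Keep only the feature columns in the dataframe.
features = df.drop(columns=["region"])

# Sanity check: same number of observations and labels.
assert len(features) == len(labels)

# These would then be passed as FeatureSelector(data=features, labels=labels).
```

Note that the labels here repeat across rows (several observations per class). A label that is unique to every row, such as an index of names, gives a classifier nothing to generalize from.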

RustyBrain · Answer 9 · Thu Dec 20 2018 00:57:13 GMT+0800 (China Standard Time)

Hi Will,

They are multiclass labels - strings of geographical area names. I have tried passing the index values in as a list rather than a direct call to the df, but it is giving the same error as before.

cujo0072 · Answer 10 · Fri Feb 14 2020 08:41:16 GMT+0800 (China Standard Time)

FYI: I found this write-up describing the problem. It helped me get past it:
https://datascience.stackexchange.com/questions/20199/train-test-split-error-found-input-variables-with-inconsistent-numbers-of-sam