dancrew32 / resume_classifier

Logistic regression on resume .doc files

How to solve the ValueError: Shape of passed values is (27, 3), indices imply (27, 5)

bhuvanshukla opened this issue · comments

NOTE: I added one more y_label: support. I also tried predicting the same way as you have shown, and it works perfectly. But if I remove the files (resumes) from the embed and neither folders, I get this error:
ValueError: Shape of passed values is (27, 3), indices imply (27, 5)

There are 9 resumes each in dev, web_dev, and support; the embed and neither folders are empty.
I can solve that by:
pd.DataFrame(pipe.predict_proba(trial['text']), columns=['dev','web_dev','support'])

So should I create a copy of y_labels by checking which folders are empty, and then proceed?

y_labels should be the only thing you need to update at the top of the notebook.

It used to look like:

y_labels = ('web_dev', 'dev', 'embed', 'neither')

I would revert the change you made to:

 pd.DataFrame(pipe.predict_proba(trial['text']), columns=['dev','web_dev','support'])

back to

pd.DataFrame(pipe.predict_proba(trial['text']), columns=y_labels)

then update the y_labels variable to be:

y_labels = ('dev','web_dev','support')
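To avoid keeping y_labels in sync with the folders by hand, you could also take the column names from the fitted pipeline itself: the columns of predict_proba always follow the classifier's classes_ order. A minimal sketch with a stand-in corpus (the texts, labels, and pipeline below are illustrative, not the notebook's actual data):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for the resume text; only three labels are present.
texts = (['python backend api'] * 9
         + ['html css javascript'] * 9
         + ['helpdesk tickets phone'] * 9)
labels = ['dev'] * 9 + ['web_dev'] * 9 + ['support'] * 9

pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipe.fit(texts, labels)

# pipe.classes_ is exactly the column order of predict_proba, so the
# DataFrame shape always matches, no matter which folders were empty.
proba = pd.DataFrame(pipe.predict_proba(['python api backend']),
                     columns=pipe.classes_)
print(proba.shape)  # (1, 3)
```

Note that classes_ is sorted alphabetically, so the column order may differ from a hand-written tuple.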

Scenario
Say the ./doc/web-dev folder is empty. When we predict using pipe.predict_proba(), the output will have shape (something, 3), but the DataFrame expects (something, 4), hence the ValueError.
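The mismatch is easy to reproduce with pandas alone; a sketch, where the random array stands in for a real predict_proba output:

```python
import numpy as np
import pandas as pd

proba = np.random.rand(27, 3)  # stand-in for predict_proba over 3 seen labels

# Five column names for three columns of data -> the same ValueError.
try:
    pd.DataFrame(proba,
                 columns=['dev', 'web_dev', 'embed', 'neither', 'support'])
except ValueError as err:
    print(err)

# Matching the column names to the data works fine.
df = pd.DataFrame(proba, columns=['dev', 'web_dev', 'support'])
print(df.shape)  # (27, 3)
```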

I want to know if I am right.

You're right. You can't have zero data for any given label.

The test/train split doesn't help with this problem at low volumes. Say you had one resume in web-dev: after you run the train/test split, that one resume could end up in X_train but not in X_test, so you'd hit another shape mismatch.
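A sketch of that failure mode with made-up four-row data (the strings below are illustrative): a plain split strands the lone web-dev sample on one side, and asking scikit-learn for a stratified split makes it complain immediately.

```python
from sklearn.model_selection import train_test_split

X = ['resume 1', 'resume 2', 'resume 3', 'the only web-dev resume']
y = ['dev', 'dev', 'dev', 'web_dev']

# A plain 50/50 split: the single 'web_dev' sample lands on only one
# side, so the other side never sees that label at all.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)
print(sorted(set(y_train)), sorted(set(y_test)))

# A stratified split refuses to run with a 1-member class.
try:
    train_test_split(X, y, test_size=0.5, stratify=y)
except ValueError as err:
    print(err)  # least populated class in y has only 1 member
```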

When constructing this example, I added 10 files per label to avoid this error.

Realistically, you're only going to see strong classification scores with hundreds or thousands of sample resumes per label, so definitely take the time to construct your data set.

The most difficult problem in machine learning is getting well-labeled data. Maybe hit linkedin.com and download a couple thousand resumes ;)


Thank you, have a nice weekend 😊.