How to solve the ValueError: Shape of passed values is (27, 3), indices imply (27, 5)
bhuvanshukla opened this issue · comments
NOTE: I added one more y_label: support. I also tried predicting the same way you showed, and it works perfectly. But if I remove the files (resumes) from the embed and neither folders, I get this error:
ValueError: Shape of passed values is (27, 3), indices imply (27, 5)
There are 9 resumes each in dev, web_dev and support; the embed and neither folders are empty.
I can solve that by:
pd.DataFrame(pipe.predict_proba(trial['text']), columns=['dev','web_dev','support'])
So should I create a copy of y_label by checking which folders are empty, and then proceed?
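The check described in the question could look roughly like the sketch below: keep only the labels whose folder holds at least one file. The helper name non_empty_labels and the one-folder-per-label layout built in a temporary directory are assumptions for illustration, not names taken from the notebook.

```python
from pathlib import Path
import tempfile


def non_empty_labels(root, candidates):
    """Return the candidate labels whose folder under root holds at least one file."""
    root = Path(root)
    return tuple(
        label
        for label in candidates
        if (root / label).is_dir()
        and any(p.is_file() for p in (root / label).iterdir())
    )


all_labels = ("dev", "web_dev", "support", "embed", "neither")

# Demo on a throwaway directory tree (hypothetical layout: one folder per label).
with tempfile.TemporaryDirectory() as tmp:
    for label in all_labels:
        (Path(tmp) / label).mkdir()
    # Only three of the five folders receive a file; embed and neither stay empty.
    for label in ("dev", "web_dev", "support"):
        (Path(tmp) / label / "resume_1.txt").write_text("sample text")

    y_labels = non_empty_labels(tmp, all_labels)

print(y_labels)  # ('dev', 'web_dev', 'support')
```

This only trims the label list; it does not make up for labels that have no training data.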
y_labels should be the only thing to update at the top of the notebook.
It used to look like:
y_labels = ('web_dev', 'dev', 'embed', 'neither')
I would revert the change you made to:
pd.DataFrame(pipe.predict_proba(trial['text']), columns=['dev','web_dev','support'])
back to
pd.DataFrame(pipe.predict_proba(trial['text']), columns=y_labels)
then update the y_labels variable to be:
y_labels = ('dev','web_dev','support')
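One caveat may be worth checking, sketched here with a stand-in pipeline: predict_proba returns its columns in the order of the fitted classifier's classes_ attribute, which scikit-learn sorts. If the targets are string labels, a hand-written tuple can therefore name the columns in the wrong order. The toy texts and the TfidfVectorizer plus LogisticRegression pipeline below are assumptions; the notebook's actual pipe and target encoding may differ.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in data (hypothetical); the real notebook reads resumes from folders.
texts = [
    "python backend services and apis",
    "java microservices and databases",
    "html css javascript frontend",
    "react and css web pages",
    "customer tickets and troubleshooting",
    "help desk phone support",
]
labels = ["dev", "dev", "web_dev", "web_dev", "support", "support"]

pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipe.fit(texts, labels)

# Take the column names from the fitted model instead of a hand-written tuple,
# so the names always match the number and order of probability columns.
proba = pd.DataFrame(pipe.predict_proba(texts), columns=pipe.classes_)
print(list(proba.columns))  # ['dev', 'support', 'web_dev']
```

Note that the sorted order puts support before web_dev, which differs from the order in the tuple above.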
Scenario
Say the ./doc/web-dev folder is empty. Now when we predict using pipe.predict_proba(), the output has shape (something, 3), but the column labels passed to the DataFrame imply (something, 4), hence the ValueError.
I want to know if I am right.
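The mismatch can be reproduced without any model, as a small sketch: build a 27 x 3 array in place of the predict_proba output and pass five column names. The array contents are made up; only the shapes matter.

```python
import numpy as np
import pandas as pd

# Stand-in for pipe.predict_proba(...): 27 rows, one column per label seen in training.
proba = np.full((27, 3), 1 / 3)

# Five names, although the model only knows three classes.
five_labels = ["dev", "web_dev", "support", "embed", "neither"]

message = None
try:
    pd.DataFrame(proba, columns=five_labels)
except ValueError as err:
    message = str(err)

print(message)
```

The printed message should match the shape error quoted in this thread, because pandas compares the array's shape with the number of column labels.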
You're right. You can't have zero data for any given label.
The test/train split doesn't help with this problem at low volumes. Say you had one resume in web-dev. After you run the train/test split, that one resume could end up in X_train but not in X_test, so you'll have another shape mismatch.
When constructing this example, I added 10 files per label to avoid this error.
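To catch this before fitting, one option is to count the samples per label and stop early when a label is missing or too thin. This is a minimal sketch; the function name check_label_counts and the min_per_label threshold are hypothetical, and the label list mirrors the five folders discussed above.

```python
from collections import Counter


def check_label_counts(y, expected_labels, min_per_label=2):
    """Return {label: count} for every expected label with too few samples."""
    counts = Counter(y)
    return {
        label: counts.get(label, 0)
        for label in expected_labels
        if counts.get(label, 0) < min_per_label
    }


# 9 resumes each in three folders, none in embed or neither.
y = ["dev"] * 9 + ["web_dev"] * 9 + ["support"] * 9
problems = check_label_counts(
    y, ("dev", "web_dev", "support", "embed", "neither")
)
print(problems)  # {'embed': 0, 'neither': 0}
```

scikit-learn's train_test_split also accepts a stratify argument that keeps class proportions across the split, but it cannot help when a label has only a single sample.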
Realistically, you're only going to see strong classification scores with hundreds or thousands of sample resumes per label, so definitely take the time to construct your data set.
The most difficult problem in machine learning is getting well-labeled data. Maybe hit linkedin.com and download a couple thousand resumes ;)
Thank you, have a nice weekend 😊.