dancrew32 / resume_classifier

Logistic regression on resume .doc files

How to solve the ValueError: Shape of passed values is (27, 3), indices imply (27, 5)

bhuvanshukla opened this issue · comments

NOTE: I added one more y_label: support. I also tried predicting the same way as you have shown, and it works perfectly. But if I remove the files (resumes) from the embed and neither folders, I get this error:
ValueError: Shape of passed values is (27, 3), indices imply (27, 5)

There are 9 resumes each in dev, web_dev, and support; the embed and neither folders are empty.
I can solve that by:
pd.DataFrame(pipe.predict_proba(trial['text']), columns=['dev','web_dev','support'])

So should I create a copy of y_labels by checking which folders are empty, and then proceed?

y_labels should be the only thing you need to update at the top of the notebook.

It used to look like:

y_labels = ('web_dev', 'dev', 'embed', 'neither')

I would revert the change you made to:

 pd.DataFrame(pipe.predict_proba(trial['text']), columns=['dev','web_dev','support'])

back to

pd.DataFrame(pipe.predict_proba(trial['text']), columns=y_labels)

then update the y_labels variable to be:

y_labels = ('dev','web_dev','support')
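To avoid keeping y_labels in sync with the folders by hand, you could also take the column names from the fitted pipeline itself: the columns of predict_proba always follow the classifier's classes_ order. A minimal sketch with a stand-in corpus (the texts, labels, and pipeline below are illustrative, not the notebook's actual data):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for the resume text; only three labels are present.
texts = (['python backend api'] * 9
         + ['html css javascript'] * 9
         + ['helpdesk tickets phone'] * 9)
labels = ['dev'] * 9 + ['web_dev'] * 9 + ['support'] * 9

pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipe.fit(texts, labels)

# pipe.classes_ is exactly the column order of predict_proba, so the
# DataFrame shape always matches, no matter which folders were empty.
proba = pd.DataFrame(pipe.predict_proba(['python api backend']),
                     columns=pipe.classes_)
print(proba.shape)  # (1, 3)
```

Note that classes_ is sorted alphabetically, so the column order may differ from a hand-written tuple.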

Scenario
Say the ./doc/web-dev folder is empty. When we predict using pipe.predict_proba(), the output will have shape (something, 3), but the DataFrame expects (something, 4), hence the ValueError.
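The mismatch is easy to reproduce with pandas alone; a sketch, where the random array stands in for a real predict_proba output:

```python
import numpy as np
import pandas as pd

proba = np.random.rand(27, 3)  # stand-in for predict_proba over 3 seen labels

# Five column names for three columns of data -> the same ValueError.
try:
    pd.DataFrame(proba,
                 columns=['dev', 'web_dev', 'embed', 'neither', 'support'])
except ValueError as err:
    print(err)

# Matching the column names to the data works fine.
df = pd.DataFrame(proba, columns=['dev', 'web_dev', 'support'])
print(df.shape)  # (27, 3)
```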

I want to know if I am right.

You're right. You can't have zero data for any given label.

The test/train split doesn't help with this problem at low volumes. Say you had one resume in web-dev: after you run the train/test split, that one resume could end up in X_train but not in X_test, so you'd hit another shape mismatch.
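A sketch of that failure mode with made-up four-row data (the strings below are illustrative): a plain split strands the lone web-dev sample on one side, and asking scikit-learn for a stratified split makes it complain immediately.

```python
from sklearn.model_selection import train_test_split

X = ['resume 1', 'resume 2', 'resume 3', 'the only web-dev resume']
y = ['dev', 'dev', 'dev', 'web_dev']

# A plain 50/50 split: the single 'web_dev' sample lands on only one
# side, so the other side never sees that label at all.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)
print(sorted(set(y_train)), sorted(set(y_test)))

# A stratified split refuses to run with a 1-member class.
try:
    train_test_split(X, y, test_size=0.5, stratify=y)
except ValueError as err:
    print(err)  # least populated class in y has only 1 member
```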

When constructing this example, I added 10 files per label to avoid this error.

Realistically, you're only going to see strong classification scores with hundreds or thousands of sample resumes per label, so definitely take the time to construct your data set.

The most difficult problem in machine learning is getting well-labeled data. Maybe hit linkedin.com and download a couple thousand resumes ;)


Thank you, have a nice weekend 😊.