MODOC - Mental Organism Designed Only for Classifying

This project uses natural language processing to classify the curricula vitae (CVs) of candidates for job vacancies. It was initially built for the Project Management II course, but it ended up as a kind of commercial product for HR consulting companies. The project is optimised for Python 3.6.

Project Structure - Directories

Data: datasets directory;
Drivers: webdrivers and webcrawlers;
Scripts: python scripts directory.

Modules

scraper: The web scraping module (responsible for extracting the CVs from the web);
pdf_converter: The PDF-to-image module (because the OCR cannot extract text directly from PDF files, they must first be converted into image files);
ocr: The image-to-text module (a machine learning model for image-to-text extraction);
classifier: The test classifier module (just an experimental module), to be replaced in the future;
val_alg: The machine learning fitting module, to be implemented in the future;
main: The machine learning classifiers module, to be implemented in the future.
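
The modules above suggest a flow of PDF conversion, OCR, and classification. A minimal sketch of how such stages could be chained follows. The names `run_pipeline`, `fake_convert`, `fake_ocr`, and `fake_classify` are hypothetical stand-ins for illustration and are not the project's actual API.

```python
from typing import Callable, Iterable, List


def run_pipeline(
    pdf_paths: Iterable[str],
    convert: Callable[[str], List[str]],
    extract_text: Callable[[str], str],
    classify: Callable[[str], str],
) -> List[str]:
    """Chain the stages: PDF -> images -> text -> label (one label per PDF)."""
    labels = []
    for pdf_path in pdf_paths:
        image_paths = convert(pdf_path)  # role played by pdf_converter
        text = "\n".join(extract_text(p) for p in image_paths)  # role played by ocr
        labels.append(classify(text))  # role played by classifier
    return labels


# Hypothetical stand-in stages so the sketch runs without OCR or ML dependencies.
def fake_convert(pdf_path: str) -> List[str]:
    return [pdf_path.replace(".pdf", "_page1.png")]


def fake_ocr(image_path: str) -> str:
    return "text extracted from " + image_path


def fake_classify(text: str) -> str:
    return "Finalised" if "cv_a" in text else "Not Finalised"


labels = run_pipeline(["cv_a.pdf", "cv_b.pdf"], fake_convert, fake_ocr, fake_classify)
print(labels)
```

Passing the stages in as callables keeps each module replaceable, which fits the note that the experimental classifier is meant to be swapped out later.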

Note: due to the low number of CVs available, the presentation to investors was made using dividend receipts from the Argentina Stock Exchange (Bolsar).

Requirements

This project requires the following Python libraries:

scikit-learn;
pandas;

To install them in your Anaconda environment or virtual environment, run the following command:

  pip install scikit-learn pandas

Results

Model Accuracy

The Random Forest model's accuracy was 83.33%.
The dumb algorithm's accuracy was 50.00%. Regardless of the attributes, this model always predicts Finalised.
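
A constant predictor of this kind scores exactly the share of Finalised examples in the set it is evaluated on. The sketch below illustrates this with plain Python. The toy data `X` and `y` and the function names are hypothetical and are not taken from the project.

```python
from typing import List, Sequence


def always_finalised(features: Sequence[Sequence[float]]) -> List[str]:
    """Ignore the attributes entirely and predict 'Finalised' for every row."""
    return ["Finalised" for _ in features]


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical, perfectly balanced toy labels (not the project's data).
X = [[0.0], [1.0], [2.0], [3.0]]
y = ["Finalised", "Not Finalised", "Finalised", "Not Finalised"]

baseline_accuracy = accuracy(y, always_finalised(X))
print(f"{baseline_accuracy:.2%}")  # 50.00%
```

A baseline like this gives a floor for comparison: a trained model is only useful to the extent that it beats the constant guess on the same data.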

Confusion Matrix

	Finalised	Not Finalised
Finalised	5	0
Not Finalised	1	0
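
As a quick check, accuracy can be recomputed from the matrix as the diagonal sum divided by the total count. This holds whichever axis represents the true labels, which the source does not state. The variable names below are illustrative only.

```python
# Confusion matrix copied from the table above.
confusion = [
    [5, 0],  # row "Finalised"
    [1, 0],  # row "Not Finalised"
]

# Correct predictions sit on the diagonal.
correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)

accuracy = correct / total
print(f"{accuracy:.2%}")  # 83.33%
```

The result, 5 correct out of 6, matches the 83.33% reported for the Random Forest model.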

Brunopaes / modoc