scikit-learn nlp nltk machine-learning python dataset amazon scraper virtualenv retail-products-classifier text-classification-python

Retail products classifier

This project helps classify retail products into categories. Although in this example the categories are structured in a hierarchy, to keep it simple I considered all subcategories as top-level. The main packages used in this projects are: sklearn, nltk and dataset.

You can read the post explaining this project here.

You will need Python3+ to use this project.

Installation

1. Download

Now, you need the text-classification-python project files in your workspace:

$ git clone https://github.com/joaorafaelm/text-classification-python;
$ cd text-classification-python;

2. Virtualenv (Optional)

You should already know what is virtualenv at this stage. So, simply create it for the project:

$ virtualenv venv;
$ source venv/bin/activate;

3. Requirements

You will find the requirements.txt. To install them, simply type:

$ pip install -r requirements.txt

Running the scraper

To run the scraper you will need a csv of ASINS (amazons product identifier). Just search the webz for it. And then run:

python amazon_scrape.py

All data will be saved into sqlite (file database.db), table products.

Dump the database

datafreeze .datafreeze.yaml

This will create a json file under the directory dumps/.

Data preparation

python data_prep.py

The script will create a new file called products.json at the root of the project, and print out the category tree structure. Change the value of the variables default_depth, min_samples and domain if you need more data.

Classify and get prediction results.

python classify.py

It will print out the accuracy of each category, along with the confusion matrix.

About

An example of retails products classification using scikit and nltk -

https://joaorafaelm.github.io/blog/text-classification-with-python

scikit-learn nlp nltk machine-learning python dataset amazon scraper virtualenv retail-products-classifier text-classification-python

Languages

Language:Python 100.0%