watson-document-classifier - Work in progress

Augumented Classification and Attribution of Document with Watson Natural Language Understanding and Jupyter Notebooks

In this developer journey we will use Jupyter notebooks to augument IBM Watson Natural Language Understanding API output through configurable mechanism for text classification.

When the reader has completed this journey, they will understand how to:

Create and run a Jupyter notebook in DSX.
Use DSX Object Storage to access a data and configuration files.
Use IBM Watson Watson Natural Language Understanding API to extract metadata from document in Jupyter notebooks.
Extract and format unstructured data using simplified Python functions.
Use a configuration file to build configurable and layered classification grammer.
Use the combination of grammatical classification and regex patterns from configuration file to classify word token classes.
Store the processed output JSON in DSX Object Storage.

The intended audience for this journey is developers who want to learn method for augumenting classification metadata obtained from Watson Natural Language Understanding API, in situations when there is scarcity of historical data. The traditional approach of training a Text Analytics model yields less than expected results. The distinguishing factor of this journey is, it allows a head start even if the domain in question is a specialized area, which might not be supported by generally available English parsers.

Included components

IBM Data Science Experience: Analyze data using RStudio, Jupyter, and Python in a configured, collaborative environment that includes IBM value-adds, such as managed Spark.
Bluemix Object Storage: A Bluemix service that provides an unstructured cloud data store to build and deliver cost effective apps and services with high reliability and fast speed to market.

Featured technologies

Jupyter Notebooks: An open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.

Watch the Video

Steps

Follow these steps to setup and run this developer journey. The steps are described in detail below.

Sign up for the Data Science Experience
Create Bluemix services
Create the notebook
Add the data and configuraton file
Update the notebook with service credentials
Run the notebook
Download the results

1. Sign up for the Data Science Experience

Sign up for IBM's Data Science Experience. By signing up for the Data Science Experience, two services: DSX-Spark and DSX-ObjectStore will be created in your Bluemix account.

2. Create Bluemix services

Create the following Bluemix service by following the links to use the Bluemix UI and create it.

3. Create the notebook

Use the menu on the left to select My Projects and then Default Project. Click on Add notebooks (upper right) to create a notebook.

Select the From URL tab.
Enter a name for the notebook.
Optionally, enter a description for the notebook.
Enter this Notebook URL: https://github.com/IBM/watson-document-classifier/blob/master/notebooks/watson_document_classifier.ipynb
Click the Create Notebook button.

4. Add the data and configuration file

5. Update the notebok with service credentials

6. Run the notebook

When a notebook is executed, what is actually happening is that each code cell in the notebook is executed, in order, from top to bottom.

Each code cell is selectable and is preceded by a tag in the left margin. The tag format is In [x]:. Depending on the state of the notebook, the x can be:

A blank, this indicates that the cell has never been executed.
A number, this number represents the relative order this code step was executed.
A *, this indicates that the cell is currently executing.

There are several ways to execute the code cells in your notebook:

One cell at a time.
- Select the cell, and then press the Play button in the toolbar.
Batch mode, in sequential order.
- From the Cell menu bar, there are several options available. For example, you can Run All cells in your notebook, or you can Run All Below, that will start executing from the first cell under the currently selected cell, and then continue executing all cells that follow.
At a scheduled time.
- Press the Schedule button located in the top right section of your notebook panel. Here you can schedule your notebook to be executed once at some future time, or repeatedly at your specified interval.

vijaysanjosalexander / watson-document-classifier