Labelling Topics with Wikipedia

Overview

After extracting topics (e.g. with LDA topic modeling), the next step is to label them. This template labels topics using Wikipedia as the knowledge base. The engine is implemented with Spark MLlib 1.5.1, PredictionIO 0.10.0-incubating-rc1 and Scala 2.10.5.

Topic labelling can be treated as a classification problem in which the training set consists of Wikipedia pages, each with a page title (the label) and page content. Preprocessing and feature extraction from the page content are carried out in the Data Preparator part of the DASE model, which returns labelled points. These are used to train a Naive Bayes classifier in the Algorithm part of the DASE model. The classification algorithm can be replaced with another one.
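To make the classification framing concrete, here is a minimal pure-Python sketch of a multinomial Naive Bayes classifier over word counts. It is an illustration only, not the template's implementation (which uses Spark MLlib in Scala); the function names, the regex tokenizer and the smoothing value alpha are assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train(pages, alpha=1.0):
    """Count words per page title. `pages` is an iterable of (title, content)."""
    word_counts = {}        # title -> Counter of token frequencies
    doc_counts = Counter()  # title -> number of training pages
    vocab = set()
    for title, content in pages:
        tokens = tokenize(content)
        word_counts.setdefault(title, Counter()).update(tokens)
        doc_counts[title] += 1
        vocab.update(tokens)
    return {"words": word_counts, "docs": doc_counts,
            "vocab": vocab, "alpha": alpha}


def predict(model, topic_words):
    """Return the title with the highest smoothed log-probability."""
    total_docs = sum(model["docs"].values())
    vocab_size = len(model["vocab"])
    alpha = model["alpha"]
    best_title, best_score = None, -math.inf
    for title, counts in model["words"].items():
        # Log prior of the label plus log likelihood of each topic word.
        score = math.log(model["docs"][title] / total_docs)
        denom = sum(counts.values()) + alpha * vocab_size
        for word in topic_words:
            for token in tokenize(word):
                score += math.log((counts[token] + alpha) / denom)
        if score > best_score:
            best_title, best_score = title, score
    return best_title


# Tiny hypothetical training set, far smaller than a real Wikipedia corpus.
sample_pages = [
    ("Apple Inc.", "Apple Inc. designs the iPhone smartphone and the Safari browser"),
    ("Politics", "Politics is the process of making decisions in groups and governance"),
]
model = train(sample_pages)
label = predict(model, ["apple", "iphone", "safari", "smartphone"])
```

With this toy data, label is "Apple Inc." because all four topic words occur only in that page's content.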

Usage

Event Data Requirements

By default, the template requires the following events to be collected (see data/import_eventserver.py):

train event: supplies a labelled Wikipedia page used to train the model on a specific dataset

Input Query

{"topics": [["apple","iphone","safari","smartphone"]]}

Output Predicted Result

The engine returns the category (Wikipedia page title) to which the topic words belong.

{"Category" : "Apple Inc."}

Dataset

Collect the latest Wikipedia page names from [Wikipedia latest page titles](https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-all-titles-in-ns0.gz)

Run data/get_wikipages.py to fetch the content of each Wikipedia page and build a corpus that contains the page title (as the label) and the page content.

$ pip install wikipedia

import wikipedia
...
page = wikipedia.page(pageName)
titles.append(page.title)
content.append(page.content)
...

Training data sample:

title, content
Politics, "Politics (from Greek: πολιτικός politikos, definition ""of, for, or relating to citizens"") is the process of making  decisions applying to all members of each group. More narrowly, it refers to achieving and exercising positions of governance
A variety of methods are deployed in politics"
Apple Inc., "Apple Inc. is an American multinational technology company headquartered in Cupertino, California, that designs, develops, and sells consumer electronics, computer software, and online services"
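A corpus in the shape of the sample above could be assembled roughly as follows. This is a sketch under assumptions, not the contents of data/get_wikipages.py: the function names, the two-column layout with a header row, and the choice to skip pages that cannot be resolved are all illustrative. The wikipedia package is imported lazily so that the CSV-writing part can be exercised without it.

```python
import csv
import io


def fetch_page(page_name):
    """Fetch (title, content) with the `wikipedia` package, or None on failure."""
    import wikipedia  # lazy import: only needed when pages are really fetched

    try:
        page = wikipedia.page(page_name)
    except (wikipedia.exceptions.DisambiguationError,
            wikipedia.exceptions.PageError):
        # Ambiguous or missing titles are skipped (an assumption of this sketch).
        return None
    return page.title, page.content


def write_corpus(page_names, out_file, fetch=fetch_page):
    """Write one `title, content` CSV row per page that could be fetched."""
    writer = csv.writer(out_file)
    writer.writerow(["title", "content"])
    written = 0
    for name in page_names:
        result = fetch(name)
        if result is None:
            continue
        title, content = result
        # csv.writer quotes fields and doubles embedded quotes, as in the sample.
        writer.writerow([title, content])
        written += 1
    return written


# Offline demonstration with a stand-in fetch function instead of real requests.
fake_pages = {
    "Politics": ("Politics", 'definition "of, for, or relating to citizens"'),
    "Missing": None,
}
buffer = io.StringIO()
count = write_corpus(["Politics", "Missing"], buffer, fetch=lambda n: fake_pages[n])
```

For a real run, out_file would be a file opened with newline="" and fetch would be left at its default.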

The training events have the following format (generated by data/import_eventserver.py):

client.create_event(
    event="train",
    entity_type="wiki_page",
    entity_id=rownum,
    properties={"category": row[1], "content": row[2]}
)

Install and Run PredictionIO

Install PredictionIO from Apache PredictionIO. Let's say you have installed PredictionIO at /home/yourname/PredictionIO/. For convenience, add PredictionIO's binary command path, i.e. /home/yourname/PredictionIO/bin, to your PATH:

$ PATH=$PATH:/home/yourname/PredictionIO/bin; export PATH

Once you have completed the installation process, please make sure all the components (PredictionIO Event Server, Elasticsearch, and HBase) are up and running.

$ pio-start-all

You can check the status by running:

$ pio status

Download Template

To get the template, clone the repository below by running the following command in the directory where you want the code to reside:

git clone https://github.com/peoplehum/template-Labelling-Topics-with-wikipedia

Generate an App ID and Access Key

Let's assume you want to use this engine in an application named "testApp". You will need to collect some training data for machine learning modeling. You can generate an App ID and Access Key that represent "testApp" on the Event Server easily:

$ pio app new testApp

You should find the following in the console output:

...
[INFO] [App$] Initialized Event Store for this app ID: 1.
[INFO] [App$] Created new app:
[INFO] [App$]       Name: testApp
[INFO] [App$]         ID: 1
[INFO] [App$] Access Key: Oc6X6nxSR7xhuX4S8-0InJUF90T-KBicGbXnWSP3yECVac52GpkPW2OJL0L5DU3x

Take note of the Access Key and App ID. You will need the Access Key to refer to "testApp" when you collect data. Running pio app list returns the names and IDs of the apps created in the Event Server.

$ pio app list
[INFO] [App$]                 Name |   ID |                                                       Access Key | Allowed Event(s)
[INFO] [App$]               testApp |    1 | Oc6X6nxSR7xhuX4S8-0InJUF90T-KBicGbXnWSP3yECVac52GpkPW2OJL0L5DU3x | (all)
[INFO] [App$]               MyApp |    2 | io5lz6Eg4m3Xe4JZTBFE13GMAf1dhFl6ZteuJfrO84XpdOz9wRCrDU44EUaYuXq5 | (all)
[INFO] [App$] Finished listing 2 app(s).

To use the template with the application created above, set appName in engine.json:

"datasource": {
    "params" : {
      "appName" : "testApp"
    }
  }

Collecting Data

Next, let's collect some training data. By default, the engine template reads two properties of each wiki_page record: the page title (category) and the page content.

You can send this data to the PredictionIO Event Server in real time by making an HTTP request or through the EventClient of an SDK.

A Python import script, import_eventserver.py, is provided in the template to import the data into the Event Server using the Python SDK. Replace the value of the access_key parameter with your application's Access Key and run:

$ pip install predictionio
$ cd template-Labelling-Topics-with-wikipedia
$ python data/import_eventserver.py --access_key Oc6X6nxSR7xhuX4S8-0InJUF90T-KBicGbXnWSP3yECVac52GpkPW2OJL0L5DU3x --file data/sample_wiki_pages_data.csv

You should see the following output:

Importing data...
100 events are imported.

This Python script converts the data file into the event format required by the Event Server. The training data is now stored as events inside the Event Store.
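For the plain HTTP route, a single train event can be posted to the Event Server's events.json endpoint. The sketch below only uses the Python standard library; the helper names are hypothetical, and the server URL assumes the Event Server's default port 7070 on localhost, which may differ in your setup.

```python
import json
import urllib.parse
import urllib.request


def build_event_request(access_key, entity_id, category, content,
                        server_url="http://localhost:7070"):
    """Build (but do not send) a POST request for one `train` event."""
    payload = {
        "event": "train",
        "entityType": "wiki_page",
        "entityId": str(entity_id),
        "properties": {"category": category, "content": content},
    }
    query = urllib.parse.urlencode({"accessKey": access_key})
    return urllib.request.Request(
        server_url + "/events.json?" + query,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_event(request):
    """Send the request to a running Event Server and decode its JSON reply."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request needs no running server; YOUR_ACCESS_KEY is a placeholder.
request = build_event_request(
    "YOUR_ACCESS_KEY", 1, "Apple Inc.",
    "Apple Inc. is an American multinational technology company",
)
```

Calling send_event(request) requires the Event Server to be running and the placeholder to be replaced with a real Access Key.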

Deploy the Engine as a Service

Now you can build, train, and deploy the engine. First, make sure you are in the template directory.

Build

Start by building your LDA topic labelling engine.

$ pio build

This command may take a few minutes the first time; subsequent builds should take less than a minute. You can also run it with --verbose to see all log messages.

Upon successful build, you should see a console message similar to the following.

[INFO] [Console$] Your engine is ready for training.

Training the Predictive Model

Train your engine.

$ pio train

For a very large dataset, allocate more memory:

$ SPARK_MEM="6g" pio train

When your engine is trained successfully, you should see a console message similar to the following.

[INFO] [CoreWorkflow$] Training completed successfully.

Deploying the Engine

Now your engine is ready to deploy.

$ pio deploy

This will deploy an engine that binds to http://localhost:8000. You can visit that page in your web browser to check its status.

You can also specify the port to deploy on:

$ pio deploy --port 8088

Execute Query

Run the request below to send a query to the serving layer. It returns the category (Wikipedia page name) to which the topic words belong.

curl -H "Content-Type: application/json" -d '{"topics": [["apple","iphone","safari","smartphone"]]}' http://localhost:8000/queries.json
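The same query can be sent from Python. This is a minimal standard-library sketch; the function names are hypothetical, and the URL assumes the engine was deployed on the default http://localhost:8000.

```python
import json
import urllib.request


def build_query(topics):
    """Encode a list of topics (each a list of words) as the JSON query body."""
    return json.dumps({"topics": topics}).encode("utf-8")


def query_engine(topics, url="http://localhost:8000/queries.json"):
    """POST the query to a deployed engine and return the decoded JSON reply."""
    request = urllib.request.Request(
        url,
        data=build_query(topics),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# The body matches the payload used in the curl command.
body = build_query([["apple", "iphone", "safari", "smartphone"]])
```

With an engine running, query_engine([["apple", "iphone", "safari", "smartphone"]]) should return a dictionary in the shape shown under Output Predicted Result.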

Related Issues

If you run into a problem, you can create an issue here; to contribute changes, open a pull request. For any further questions, contact bansari.jan93@gmail.com

License

This algorithm is released under the Apache 2 license.
