kaorism / pantip-libr

:books: Sentiment analysis hack project for Pantip.com Q&A site

Pantip-Libr

Pantip librarian!


What is Pantip?

Pantip is the biggest online Q&A community in Thailand, founded in 1996. It stores a very large body of user-generated questions and answers on numerous topics, e.g., lifestyle, health, trading, technology, science, sports, movies, and many others.


What does this do?

Pantip Librarian downloads a bulk of Pantip's user-generated questions and answers and analyses them with text mining techniques. The ultimate (experimental) purpose of the project is to extract the patterns that make a question popular or draw negative reactions from users.


Prerequisites

Before running the tasks, these dependencies need to be met:

Make sure you have all the above prerequisites installed and up and running.


Prepare development environment

Suppose you have all the major dependencies listed in the previous section installed properly. You can then run the following script to install all development dependencies:

$ bash dev-setup.sh

The script collects and installs all the Python libraries you need to run the library.


Try it

Pantip-Libr is not a complex module, so hopefully you can get started quickly. The following are the common tasks.


1. Download Pantip threads

We have a script that fetches Pantip topics (in a specified range of IDs) and stores them in a certain format in CouchDB on your local machine. Simply run the following command:

$ ./fetch

The script will download a series of Pantip threads in the specified range of topic IDs and store them in CouchDB.

Caveat: Please accept my apologies. The download script doesn't guard against HTTP connection failures. If a network glitch happens, the script ends execution ungracefully.
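One way to soften this caveat is a small retry wrapper around the per-topic download. The following is only a minimal sketch, not the project's actual fetch code: `fetch_with_retry`, `fetch_range`, and the injected `fetch_one` callable are hypothetical names introduced for illustration, and it assumes network failures surface as `OSError` (the base class of `urllib.error.URLError` and `ConnectionError`).

```python
import time


def fetch_with_retry(fetch_one, topic_id, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch_one(topic_id), retrying when the network fails.

    fetch_one is a hypothetical callable that downloads one topic and
    raises OSError on a connection failure.
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch_one(topic_id)
        except OSError:
            if attempt == retries:
                raise  # give up after the last attempt
            sleep(delay * attempt)  # linear back-off before retrying


def fetch_range(fetch_one, first_id, last_id, **retry_options):
    """Fetch every topic ID in [first_id, last_id] without aborting the run."""
    fetched, failed = {}, []
    for topic_id in range(first_id, last_id + 1):
        try:
            fetched[topic_id] = fetch_with_retry(fetch_one, topic_id, **retry_options)
        except OSError:
            failed.append(topic_id)  # record the failure and carry on
    return fetched, failed
```

With a wrapper like this, a single glitch costs a short pause, and a topic that keeps failing ends up in `failed` for a later re-run rather than stopping the whole download.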


2. Process the downloaded threads

To process the downloaded threads, execute the following command. (Note that fetch.py must have been run at least once before calling this.)

$ ./process

The script spawns several child processes to do the feature vectorisation, classification, and other processing tasks. The entire process takes some time to finish.

Hint: The subprocesses leave their access logs in the root directory of the repo.

Steps of operation

| Step | Script | Role |
|------|--------|------|
| 1 | core/process.py | Tokenise the downloaded records and push to MQ |
| 2 | core/textprocess.py | Takes the dataset out of MQ and runs machine learning |
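The two steps above follow a producer/consumer pattern, which can be sketched as below. This is an illustration under stated assumptions, not the code in `core/process.py` or `core/textprocess.py`: an in-process `queue.Queue` stands in for the real MQ, a whitespace split stands in for a proper Thai tokeniser, and counting tokens stands in for the machine learning step. All function names are hypothetical.

```python
import queue
import threading


def tokenise(text):
    # Placeholder tokeniser: Thai text needs a real word segmenter.
    return text.split()


def producer(records, mq):
    """Step 1: tokenise each record and push it to the queue."""
    for record in records:
        mq.put({"id": record["id"], "tokens": tokenise(record["text"])})
    mq.put(None)  # sentinel: no more records


def consumer(mq, results):
    """Step 2: take records off the queue and process them."""
    while True:
        item = mq.get()
        if item is None:
            break
        # Stand-in for feature vectorisation and classification.
        results.append((item["id"], len(item["tokens"])))


def run_pipeline(records):
    mq = queue.Queue()
    results = []
    worker = threading.Thread(target=consumer, args=(mq, results))
    worker.start()
    producer(records, mq)
    worker.join()
    return results
```

Decoupling the two steps through a queue lets the slow learning step run at its own pace while tokenisation continues.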

Analysis Services

Service Diagram

To process any seen or unseen topic with the trained models, you need to start the analysis services with:

$ ./start_server

The command starts the following services:

  • MQ Monitor service (monitor.py)
  • REST server (server/terminal.py)

For the time being, the services have to be stopped manually.


How far has it got?

Still in the experimental phase. The training time is painful and the models are huge (over 4GB ...), yet the accuracy still needs improvement.

Hashing with Truncated SVD : 4000 samples

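| DIM | K | TAG | % Total | Class=0 | Class=1 | Class=10 | Class=-1 |
|-----|---|-----|---------|---------|---------|----------|----------|
| 1000 | 3 | 1024 | 62.83 | 80.73 | 0.82 | 51.92 | 31.25 |
| 1000 | 3 | 512 | 63.25 | 78.92 | 8.60 | 51.92 | 68.75 |
| 1000 | 3 | 256 | 63.45 | 78.65 | 10.60 | 51.92 | 62.50 |
| 1000 | 3 | 128 | 62.88 | 78.55 | 8.01 | 51.92 | 75.00 |
| 1000 | 3 | 64 | 62.15 | 70.31 | 34.28 | 52.88 | 56.25 |
| 1000 | 3 | 32 | 60.23 | 68.39 | 32.04 | 50.00 | 75.00 |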

Hashing with LDA : 4000 samples

| DIM | K | TAG | % Total | Class=0 | Class=1 | Class=10 | Class=-1 |
|-----|---|-----|---------|---------|---------|----------|----------|
| 30 | 3 | 32 | 58.12 | 71.13 | 15.19 | 32.69 | 37.50 |
| 20 | 3 | 32 | 57.23 | 68.23 | 21.44 | 31.73 | 37.50 |
| 10 | 3 | 32 | 61.12 | 74.33 | 17.79 | 33.65 | 37.50 |
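The README does not define the result columns. Assuming "% Total" is the overall accuracy and each "Class=c" column is the share of samples of class c that were predicted correctly, figures of this kind could be computed roughly as in the sketch below. `per_class_accuracy` is a hypothetical helper, not a function from the project.

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred):
    """Return (overall %, {class: %}) for paired true and predicted labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("no samples given")
    totals = Counter(y_true)  # samples per true class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    overall = 100.0 * sum(correct.values()) / len(y_true)
    by_class = {c: 100.0 * correct[c] / n for c, n in totals.items()}
    return overall, by_class
```

Reading the per-class figures next to the total matters here: a model can score a reasonable total while doing poorly on a minority class, as the Class=1 column shows.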

Caveat

The process needs fairly high computational power. It will probably break on some workstations that lack the capacity. YMMV.


Significant 3rd parties

These are our brilliant prerequisites.


Licence

Creative Commons License
pantip-libr by starcolon is licensed under a Creative Commons Attribution 4.0 International License.

The module pantip-libr is distributed under the Creative Commons Attribution 4.0 licence. Forking, modification, and redistribution are welcome.

Languages

Python 94.2%, Shell 4.0%, Ruby 1.7%