cemunds / ABSA-TensorFlow


Optimize Syntaxnet POS Tagging

DeastinY opened this issue · comments

Currently the posts are piped into Parsey McParseface one at a time, which is horribly slow.
It could easily be sped up by, e.g., writing the posts to files that are processed and returned annotated.
The overhead comes from tearing down and building up TensorFlow for every single post.
Furthermore, there should be some syntax that returns the sentence in a simple, annotated form.
Maybe you could look into that @FullLifeGames @dnerger

could you define "horribly slow"?

It runs in a couple of hours; I just programmed it pretty inefficiently ^^
Processing should be done in a few minutes with a proper setup :)

Here is information on Parsey, how to use and set it up: https://github.com/tensorflow/models/tree/master/syntaxnet
I see two approaches:

  • either the sentences are unwrapped from our 'database' and passed to Parsey as a single long sentence (but this would probably be limited to ~500 characters due to the maximal command length), or
  • the sentences are written to a temporary file, which is passed to Parsey, processed, and then the annotations are written back into the 'database'.
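The second approach could be sketched roughly like this: collect all sentences into one temporary file and run Parsey over it in a single invocation, so TensorFlow is built and torn down only once. `syntaxnet/demo.sh` is the entry point named in the SyntaxNet README linked above; the function names and output handling here are illustrative assumptions.

```python
import subprocess
import tempfile

def build_batch(sentences):
    """One sentence per line, the input format demo.sh reads from stdin."""
    return "\n".join(s.strip() for s in sentences) + "\n"

def annotate_file(sentences):
    """Pipe the whole batch through Parsey in one subprocess call.

    Assumes a working SyntaxNet checkout; demo.sh reads sentences on
    stdin and prints CoNLL-style annotation rows on stdout.
    """
    with tempfile.NamedTemporaryFile("w+", suffix=".txt") as f:
        f.write(build_batch(sentences))
        f.seek(0)
        result = subprocess.run(
            ["syntaxnet/demo.sh"],
            stdin=f,
            capture_output=True,
            text=True,
            check=True,
        )
    return result.stdout
```

This keeps the per-post overhead down to string handling; the expensive TensorFlow startup happens exactly once per batch instead of once per post.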
commented

I added two new methods to use parsey.

  • In "parsey_every8.py" every 8 posts get merged together and get added to ParseyMcParseface. The number "8" comes from testing.

  • In "parsey_5000_char.py" the posts get added up, until the combined length is greater than 5000 characters. They together get added to ParseyMcParseface. The number 5000 comes from testing different input lengths and testing the limit.

Pros:

  • Much faster than reading in every post individually

Cons:

  • This is only a "proof of concept". After combining the posts, there needs to be a mechanism to separate them from each other again. This could be done after the analysis.

  • The parse function for the output of ParseyMcParseface has to be reworked, since with the added bits it no longer extracts the tags correctly.

The 5000-character solution is better than the 8-posts-at-once one, since it is more general and processes more posts per batch.
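One possible mechanism for the separation problem noted above: insert a sentinel token between posts before batching, then split the annotated output on the rows that carry that token. The sentinel string and the tab-separated CoNLL-style row format are assumptions for this sketch.

```python
SENTINEL = "XSPLITX"  # hypothetical marker word, chosen to survive tokenization

def merge_posts(posts):
    """Join posts with a sentinel word so post boundaries survive parsing."""
    return (" %s " % SENTINEL).join(posts)

def split_annotations(lines):
    """Split annotated output rows back into per-post groups.

    A row whose tab-separated token column equals the sentinel marks
    a boundary between two posts and is dropped from the output.
    """
    groups, current = [], []
    for line in lines:
        if SENTINEL in line.split("\t"):
            groups.append(current)
            current = []
        else:
            current.append(line)
    groups.append(current)
    return groups
```

A real implementation would have to verify that the parser passes the sentinel through as a single token and does not let it distort the surrounding tags.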

commented

Since we switched from Parsey to NLTK and from NLTK to OpenNLP, this is no longer being worked on.