cemunds / ABSA-TensorFlow


Optimize Syntaxnet POS Tagging

DeastinY opened this issue · comments

Currently the posts are piped into Parsey McParseface one at a time, which is horribly slow.
It could easily be sped up by, e.g., writing the posts to files that are processed and returned annotated.
The overhead comes from tearing down and building up TensorFlow for every single post.
Furthermore, there should be some syntax that returns the sentence in a simple, annotated form.
Maybe you could look into that @FullLifeGames @dnerger

could you define "horribly slow"?

It runs in a couple of hours; I just programmed it pretty inefficiently ^^
Processing should be done in a few minutes with a proper setup :)

Here is information on Parsey, how to use and set it up: https://github.com/tensorflow/models/tree/master/syntaxnet
I see two approaches:

  • either the sentences are unwrapped from our 'database' and passed to Parsey as a single long sentence (but this would probably be limited to ~500 characters due to the maximal command length), or
  • the sentences are written to a temporary file, which is passed to Parsey, processed, and then the annotations are written back into the 'database'.
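The second approach could be sketched roughly like this: collect all sentences into one temporary file and run Parsey over it in a single invocation, so TensorFlow is built and torn down only once. `syntaxnet/demo.sh` is the entry point named in the SyntaxNet README linked above; the function names and output handling here are illustrative assumptions.

```python
import subprocess
import tempfile

def build_batch(sentences):
    """One sentence per line, the input format demo.sh reads from stdin."""
    return "\n".join(s.strip() for s in sentences) + "\n"

def annotate_file(sentences):
    """Pipe the whole batch through Parsey in one subprocess call.

    Assumes a working SyntaxNet checkout; demo.sh reads sentences on
    stdin and prints CoNLL-style annotation rows on stdout.
    """
    with tempfile.NamedTemporaryFile("w+", suffix=".txt") as f:
        f.write(build_batch(sentences))
        f.seek(0)
        result = subprocess.run(
            ["syntaxnet/demo.sh"],
            stdin=f,
            capture_output=True,
            text=True,
            check=True,
        )
    return result.stdout
```

This keeps the per-post overhead down to string handling; the expensive TensorFlow startup happens exactly once per batch instead of once per post.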
commented

I added two new methods to use parsey.

  • In "parsey_every8.py" every 8 posts get merged together and get added to ParseyMcParseface. The number "8" comes from testing.

  • In "parsey_5000_char.py" the posts get added up, until the combined length is greater than 5000 characters. They together get added to ParseyMcParseface. The number 5000 comes from testing different input lengths and testing the limit.

Pros:

  • Much faster than reading in every post individually

Cons:

  • This is only a "proof of concept". After combining the posts, there needs to be a mechanism to separate them from each other again. This could be done after the analysis.

  • The parse function for the output of ParseyMcParseface has to be reworked, since with the added bits it no longer extracts the tags correctly.

The 5000-character solution is better than the 8-posts-at-once one, since it is more general and processes more posts per batch.
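One possible mechanism for the separation problem noted above: insert a sentinel token between posts before batching, then split the annotated output on the rows that carry that token. The sentinel string and the tab-separated CoNLL-style row format are assumptions for this sketch.

```python
SENTINEL = "XSPLITX"  # hypothetical marker word, chosen to survive tokenization

def merge_posts(posts):
    """Join posts with a sentinel word so post boundaries survive parsing."""
    return (" %s " % SENTINEL).join(posts)

def split_annotations(lines):
    """Split annotated output rows back into per-post groups.

    A row whose tab-separated token column equals the sentinel marks
    a boundary between two posts and is dropped from the output.
    """
    groups, current = [], []
    for line in lines:
        if SENTINEL in line.split("\t"):
            groups.append(current)
            current = []
        else:
            current.append(line)
    groups.append(current)
    return groups
```

A real implementation would have to verify that the parser passes the sentinel through as a single token and does not let it distort the surrounding tags.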

commented

Since we switched from Parsey to NLTK and from NLTK to OpenNLP, this is no longer being worked on.