guile2912 / boilerpipe

Automatically exported from code.google.com/p/boilerpipe

timeout and fallback strategy for boilerpipe

GoogleCodeExporter opened this issue

I don't see a newsgroup or other forum for asking questions like this, so 
please forgive me for making this an issue ticket.

Is there a best practice example for managing boilerpipe with a timeout and 
falling back to a series of less sophisticated extractors?  

For example, when boilerpipe's ArticleExtractor says:
Warning: SAX input contains nested A elements -- You have probably hit a bug in 
your HTML parser (e.g., NekoHTML bug #2909310). Please clean the HTML 
externally and feed it to boilerpipe again. Trying to recover somehow...

and hits an infinite loop, I need to kill it and hammer the text in another way.

Should I just run it inside a thread and kill the thread after allotted time 
passes?  Or does boilerpipe have tools for doing this kind of thing for me?

What sequence of extractors would you recommend?

Thanks!

John

Original issue reported on code.google.com by postsh...@gmail.com on 6 Feb 2012 at 5:17
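
The thread never answers the timeout question directly, so here is a minimal
sketch of one way to do it in plain Java: run each extractor through an
ExecutorService, wait with Future.get(timeout), and move on to the next
extractor on timeout or error. TimedExtraction, TextExtractor and
extractWithFallback are hypothetical names, not boilerpipe API. In real use
each chain entry might wrap a call such as ArticleExtractor.INSTANCE.getText(html),
followed by DefaultExtractor and KeepEverythingExtractor; that ordering is only
a suggestion and is not confirmed anywhere in this thread.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper, not part of boilerpipe. */
class TimedExtraction {

    /** Hypothetical stand-in for any extractor call that may throw or hang. */
    interface TextExtractor {
        String extract(String html) throws Exception;
    }

    /**
     * Tries each extractor in order, allowing each at most timeoutMillis.
     * Returns the first non-blank result, or null if every extractor
     * timed out, threw, or produced no text.
     */
    static String extractWithFallback(String html, long timeoutMillis,
                                      List<TextExtractor> chain) {
        for (TextExtractor extractor : chain) {
            // One daemon thread per attempt: a stuck attempt cannot block
            // the next attempt or prevent the JVM from exiting.
            ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
                Thread t = new Thread(r, "extractor-attempt");
                t.setDaemon(true);
                return t;
            });
            Callable<String> task = () -> extractor.extract(html);
            Future<String> future = pool.submit(task);
            try {
                String text = future.get(timeoutMillis, TimeUnit.MILLISECONDS);
                if (text != null && !text.trim().isEmpty()) {
                    return text;
                }
            } catch (TimeoutException e) {
                // cancel(true) only interrupts the worker; a loop that never
                // checks the interrupt flag keeps running in the background.
                future.cancel(true);
            } catch (ExecutionException e) {
                // The extractor threw; fall through to the next one.
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return null;
            } finally {
                pool.shutdownNow();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Toy extractors standing in for real ones (assumptions for the demo).
        TextExtractor hangs = html -> { Thread.sleep(60_000); return "never"; };
        TextExtractor fails = html -> { throw new IllegalStateException("parse error"); };
        TextExtractor stripTags = html -> html.replaceAll("<[^>]*>", " ").trim();

        String text = extractWithFallback("<p>Hello <b>world</b></p>", 200,
                Arrays.asList(hangs, fails, stripTags));
        System.out.println(text);
    }
}
```

One caveat with this design: Java has no safe way to kill a thread, so a truly
infinite loop that ignores interruption will keep burning CPU after the
timeout. If that happens often, running the extraction in a separate process
that can be destroyed is the more robust option.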

Hi John,

boilerpipe comes with a patched version of NekoHTML. So unless you are using a 
different SAX parser or you have an unpatched NekoHTML on your classpath, 
you should not see this error at all.

Could you please give me a URL that I can check against my local installation 
of boilerpipe?

Best,
Christian

Original comment by ckkohl79 on 6 Feb 2012 at 5:33

I'm running into this issue with 
http://heraldnews.suntimes.com/news/10325442-418/voting-map-redrawn-by-those-in-power-with-hope-of-keeping-it.html

Original comment by cary...@gmail.com on 29 Feb 2012 at 8:09

Hi carylee,

thanks for this feedback. This page works fine here with the latest version 
from trunk as well as with the previous version on 
http://boilerpipe-web.appspot.com/
Could you please check out that version from SVN and try again?

I am pretty sure that this is a classpath issue. Please ensure that you really 
have the patched versions of NekoHTML's HTMLElements and HTMLTagBalancer (which 
come with boilerpipe-core) included in your classpath *before* the original 
nekohtml-1.9.13.jar.

Original comment by ckkohl79 on 21 Mar 2012 at 9:18
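
To check the classpath ordering described above, one option is to ask the JVM
where it actually loaded the two NekoHTML classes from. The sketch below is a
hypothetical diagnostic (WhichJar and locate are made-up names); it assumes the
classes live in the usual org.cyberneko.html package and should be run with the
same classpath as the application. If the printed location is the original
nekohtml jar rather than boilerpipe-core, the unpatched classes are winning.

```java
import java.security.CodeSource;

/** Hypothetical diagnostic, not part of boilerpipe or NekoHTML. */
class WhichJar {

    /** Returns where the named class was loaded from, or a short note if unknown. */
    static String locate(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers.
            Class<?> c = Class.forName(className, false, WhichJar.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            if (src == null || src.getLocation() == null) {
                return "(no code source, e.g. a JDK bootstrap class)";
            }
            return src.getLocation().toString();
        } catch (ClassNotFoundException e) {
            return "(not on the classpath)";
        }
    }

    public static void main(String[] args) {
        String[] names = {
            "org.cyberneko.html.HTMLTagBalancer",
            "org.cyberneko.html.HTMLElements"
        };
        for (String name : names) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```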

[deleted comment]
[deleted comment]

Original comment by ckkohl79 on 27 Jun 2012 at 4:19

  • Changed state: Fixed