timeout and fallback strategy for boilerpipe
GoogleCodeExporter opened this issue
Google Code Exporter commented
I don't see a newsgroup or other forum for asking questions like this, so
please forgive me for making this an issue ticket.
Is there a best practice example for managing boilerpipe with a timeout and
falling back to a series of less sophisticated extractors?
For example, when boilerpipe's ArticleExtractor says:
Warning: SAX input contains nested A elements -- You have probably hit a bug in
your HTML parser (e.g., NekoHTML bug #2909310). Please clean the HTML
externally and feed it to boilerpipe again. Trying to recover somehow...
and hits an infinite loop, I need to kill it and extract the text some other way.
Should I just run it inside a thread and kill the thread after the allotted time
passes? Or does boilerpipe have tools for doing this kind of thing for me?
What sequence of extractors would you recommend?
Thanks!
John
Original issue reported on code.google.com by postsh...@gmail.com
on 6 Feb 2012 at 5:17
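The thread does not mention a built-in timeout, so here is a minimal sketch of the "run it in a thread with a deadline, then fall back" idea from the question, using only java.util.concurrent. The class name TimedExtraction, the method extractWithFallback, and the two stand-in extractors are hypothetical. Real boilerpipe extractors would have to be wrapped as Function<String, String> by the caller, for example around ArticleExtractor.INSTANCE.getText(html) and its checked exception. That wiring is an assumption and is not shown.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

// Hypothetical helper, not part of boilerpipe.
final class TimedExtraction {

    // Daemon threads, so a stuck extractor cannot keep the JVM alive at exit.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "extractor");
        t.setDaemon(true);
        return t;
    });

    /** Tries each extractor in order; returns the first non-blank result, or null. */
    static String extractWithFallback(String html,
                                      List<Function<String, String>> extractors,
                                      long timeoutMillis) {
        for (Function<String, String> extractor : extractors) {
            Callable<String> task = () -> extractor.apply(html);
            Future<String> future = POOL.submit(task);
            try {
                String text = future.get(timeoutMillis, TimeUnit.MILLISECONDS);
                if (text != null && !text.trim().isEmpty()) {
                    return text;
                }
            } catch (TimeoutException e) {
                // Requests interruption only; code that ignores interrupts keeps running.
                future.cancel(true);
            } catch (ExecutionException e) {
                // The extractor threw; move on to the next, cruder one.
            } catch (InterruptedException e) {
                future.cancel(true);
                Thread.currentThread().interrupt();
                return null;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Stand-in for an extractor that hangs (assumption for the demo).
        Function<String, String> stuck = html -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return null;
        };
        // Crude last resort: strip tags and collapse whitespace.
        Function<String, String> crude = html ->
                html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();

        String text = extractWithFallback("<p>Hello <b>world</b></p>",
                Arrays.asList(stuck, crude), 200);
        System.out.println(text); // prints "Hello world"
    }
}
```

One caveat about this design: Future.cancel(true) can only interrupt the worker. A loop that never checks for interruption will keep consuming CPU in the background, so for truly runaway parsing a separate process that can be killed is the more robust option.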
Google Code Exporter commented
Hi John,
boilerpipe comes with a patched version of NekoHTML. So unless you are using a
different SAX parser or you have an unpatched NekoHTML on your classpath,
you should not see this error at all.
Could you please give me a URL that I can check against my local installation
of boilerpipe?
Best,
Christian
Original comment by ckkohl79
on 6 Feb 2012 at 5:33
Google Code Exporter commented
I'm running into this issue with
http://heraldnews.suntimes.com/news/10325442-418/voting-map-redrawn-by-those-in-power-with-hope-of-keeping-it.html
Original comment by cary...@gmail.com
on 29 Feb 2012 at 8:09
Google Code Exporter commented
Hi carylee,
thanks for this feedback. This page works fine here with the latest version
from trunk as well as with the previous version on
http://boilerpipe-web.appspot.com/
Could you please check out that version from SVN and try again?
I am pretty sure that this is a classpath issue. Please ensure that you really
have the patched versions of NekoHTML's HTMLElements and HTMLTagBalancer (which
come with boilerpipe-core) included in your classpath *before* the original
nekohtml-1.9.13.jar.
Original comment by ckkohl79
on 21 Mar 2012 at 9:18
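One way to check the classpath ordering described above is to ask the JVM where it actually loads the two classes from. The following is a small sketch, not something from boilerpipe itself. The helper name ClassOrigin is hypothetical, and the package org.cyberneko.html is assumed to be where NekoHTML 1.9.x keeps HTMLElements and HTMLTagBalancer.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper, not part of boilerpipe or NekoHTML.
final class ClassOrigin {

    /** Returns the location (jar or directory) the named class is loaded from. */
    static String whereLoaded(String className) {
        try {
            // initialize=false: locate the class without running its static initializers.
            Class<?> c = Class.forName(className, false, ClassOrigin.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            if (src == null || src.getLocation() == null) {
                return "unknown (bootstrap or no code source)";
            }
            return src.getLocation().toString();
        } catch (ClassNotFoundException e) {
            return "not found on classpath";
        }
    }

    public static void main(String[] args) {
        // Assumed fully qualified names of the two patched classes.
        String[] names = {
            "org.cyberneko.html.HTMLElements",
            "org.cyberneko.html.HTMLTagBalancer"
        };
        for (String name : names) {
            System.out.println(name + " <- " + whereLoaded(name));
        }
    }
}
```

If the printed location is the nekohtml jar rather than the boilerpipe jar or classes directory, the unpatched classes are probably winning and the classpath order needs to change.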
Google Code Exporter commented
Original comment by ckkohl79
on 27 Jun 2012 at 4:19
- Changed state: Fixed