Evaluate the use of an alternative html parser for better performance
GoogleCodeExporter opened this issue
I have been thinking about how to speed up the HTML parsing and found this article
comparing Python HTML parsers:
http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
According to it, lxml is the fastest Python parser, because it is essentially a
thin Python binding to the underlying libxml2 and libxslt C libraries.
Further investigation shows that the latest beta of BeautifulSoup 4.x supports
lxml as its underlying parsing engine.
This leads me to the conclusion that switching jsunpack to use lxml as the
HTML parser would only require a small patch, something like this:
From (in html.py):
import BeautifulSoup
...
soup = BeautifulSoup.BeautifulSoup(data)
soup.findAll(tag,attrib)
To:
import bs4
soup = bs4.BeautifulSoup(data, "lxml")  # explicitly select the lxml builder
soup.find_all(tag, attrib)
(And tests/test_lxml.py contains a sample of how to use lxml as a bs4.builder)
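To make the before/after concrete without requiring bs4 or lxml to be installed, here is a minimal sketch of the tag-plus-attribute lookup that html.py performs through BeautifulSoup, built on only the standard library's html.parser (the same module bs4 uses for its pure-Python fallback builder). The helper name find_tags and the sample markup are made up for illustration; this is not jsunpack's actual code.

```python
from html.parser import HTMLParser  # stdlib parser (Python 3)

class TagCollector(HTMLParser):
    """Collect the attributes of start tags matching a tag name and
    a dict of required attribute values."""

    def __init__(self, tag, attrib):
        super().__init__()
        self.tag = tag
        self.attrib = attrib  # e.g. {"rel": "nofollow"}
        self.matches = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == self.tag and all(
            attrs.get(key) == value for key, value in self.attrib.items()
        ):
            self.matches.append(attrs)

def find_tags(data, tag, attrib):
    # Hypothetical helper mirroring soup.find_all(tag, attrib)
    collector = TagCollector(tag, attrib)
    collector.feed(data)
    return collector.matches

markup = '<a href="evil.js" rel="nofollow">x</a><a href="ok.js">y</a>'
print(find_tags(markup, "a", {"rel": "nofollow"}))
```

The bs4 call above does the same lookup, but with lxml's C parser underneath instead of this pure-Python event loop, which is where the speedup comes from.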
What do you think?
Regards
Ali
Original issue reported on code.google.com by ali.iki...@gmail.com
on 20 Jul 2011 at 6:11
Ali, thanks for the suggestion! I'll be testing this to see whether I want to
integrate it.
Original comment by urul...@gmail.com
on 25 Jul 2011 at 2:33
Added support for BeautifulSoup v4 with built-in lxml support. It makes a huge
performance difference.
Original comment by ali.iki...@gmail.com
on 29 Oct 2011 at 5:22
- Changed state: Fixed
- Added labels: Type-Enhancement
- Removed labels: Type-Defect