jedie / django-phpBB3

django database models of phpBB3 **unmaintained**

HTMLParseError: malformed start tag

jedie opened this issue:

  File "/home/jedie/DjangoBB_env/src/django-phpbb3/django_phpBB3/management/commands/phpbb2djangobb.py", line 610, in migrate_posts
    user_ip=phpbb_post.poster_ip,
  File "/home/jedie/DjangoBB_env/lib/python2.6/site-packages/django/db/models/manager.py", line 137, in create
    return self.get_query_set().create(**kwargs)
  File "/home/jedie/DjangoBB_env/lib/python2.6/site-packages/django/db/models/query.py", line 377, in create
    obj.save(force_insert=True, using=self.db)
  File "/home/jedie/DjangoBB_env/src/djangobb/djangobb_forum/models.py", line 222, in save
    self.body_html = smiles(self.body_html)
  File "/home/jedie/DjangoBB_env/src/djangobb/djangobb_forum/util.py", line 188, in smiles
    parser.feed(data)
  File "/home/jedie/DjangoBB_env/src/djangobb/djangobb_forum/util.py", line 160, in feed
    HTMLParser.feed(self, data)
  File "/usr/lib64/python2.6/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib64/python2.6/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib64/python2.6/HTMLParser.py", line 229, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "/usr/lib64/python2.6/HTMLParser.py", line 304, in check_for_whole_start_tag
    self.error("malformed start tag")
  File "/usr/lib64/python2.6/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParseError: malformed start tag, at line 1, column 155

The real problem is the old HTMLParser in Python < 2.7.3.

e.g.:

import sys
from HTMLParser import HTMLParser, HTMLParseError  # Python 2 module name
print sys.version
# Start tags with a stray quote inside or right after the attribute value
tests = (
    '<a href="foo"bar"></a>',
    '<a href="foo".bar"></a>',
    '<a href="foo."bar"></a>',
    '<a href="foo.".bar"></a>',
)
for html in tests:
    print
    print repr(html)
    parser = HTMLParser()
    try:
        # feed() parses the buffered text, close() forces the rest to be handled
        parser.feed(html)
        parser.close()
    except HTMLParseError, err:
        print "HTMLParseError: %s" % err
    else:
        print "OK"
print "--END--"

Output with Python v2.7.2:

2.7.2 (default, Jun 24 2011, 12:21:10) [MSC v.1500 32 bit (Intel)]

'<a href="foo"bar"></a>'
HTMLParseError: EOF in middle of construct, at line 1, column 1

'<a href="foo".bar"></a>'
HTMLParseError: malformed start tag, at line 1, column 14

'<a href="foo."bar"></a>'
HTMLParseError: EOF in middle of construct, at line 1, column 1

'<a href="foo.".bar"></a>'
HTMLParseError: malformed start tag, at line 1, column 15
--END--

Output with Python v2.7.3:

2.7.3 (default, Aug  1 2012, 05:14:39) 
[GCC 4.6.3]

'<a href="foo"bar"></a>'
OK

'<a href="foo".bar"></a>'
OK

'<a href="foo."bar"></a>'
OK

'<a href="foo.".bar"></a>'
OK
--END--

Output with Python v2.6:

2.6.6 (r266:84292, Sep 11 2012, 08:34:23) 
[GCC 4.4.6 20120305 (Red Hat 4.4.6-4)]

'<a href="foo"bar"></a>'
HTMLParseError: EOF in middle of construct, at line 1, column 1

'<a href="foo".bar"></a>'
HTMLParseError: malformed start tag, at line 1, column 14

'<a href="foo."bar"></a>'
HTMLParseError: EOF in middle of construct, at line 1, column 1

'<a href="foo.".bar"></a>'
HTMLParseError: malformed start tag, at line 1, column 15
--END--
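
The three runs above suggest a simple guard: check the interpreter version before starting a migration. The following is only a sketch. The function name `has_lenient_htmlparser` and the constant `FIXED_IN` are made up for illustration, and the 2.7.3 threshold is taken from this thread, which only covers Python 2.

```python
import sys

# Assumption from this thread: 2.7.3 is the first Python 2 release
# whose HTMLParser tolerates these malformed start tags.
FIXED_IN = (2, 7, 3)


def has_lenient_htmlparser(version_info=None):
    """Return True if the given (or running) interpreter is >= 2.7.3."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor, micro); ignore release level and serial.
    return tuple(version_info[:3]) >= FIXED_IN
```

A migration command could call `has_lenient_htmlparser()` at startup and print a warning instead of failing halfway through the posts.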

Added a new troubleshooting section to the README in 87b4116: https://github.com/jedie/django-phpBB3#troubleshooting

See also http://support.djangobb.org/post/1404/

The HTMLParser in Python 2.7.3 received several bug fixes. Search for "HTMLParser" in the change log: http://hg.python.org/cpython/file/d46c1973d3c4/Misc/NEWS

- HTMLParser is now able to handle slashes in the start tag.
- Issue #13987: HTMLParser is now able to handle EOFs in the middle of a construct and malformed start tags.
- Issue #13993: HTMLParser is now able to handle broken end tags.
- Issue #13358: HTMLParser now calls handle_data only once for each CDATA.
- Issues #1745761, #755670, #13357, #12629, #1200313: HTMLParser now correctly handles non-valid attributes, including adjacent and unquoted attributes.
- Issue #670664: Fix HTMLParser to correctly handle the content of ``<script>...</script>`` and ``<style>...</style>``.
- Issue #7311: Fix HTMLParser to accept non-ASCII attribute values.
- Patch #912410: Replace HTML entity references for attribute values in HTMLParser.
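
If upgrading the interpreter is not an option, one possible workaround is to catch the parse error and leave the affected post body unprocessed. This is a rough sketch, not what DjangoBB or this project actually does. `try_feed` is a hypothetical helper, and the import fallback only exists so the snippet also loads on newer Pythons, where `HTMLParseError` is no longer raised.

```python
try:
    from HTMLParser import HTMLParser, HTMLParseError  # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3
    try:
        from html.parser import HTMLParseError  # older Python 3 only
    except ImportError:
        class HTMLParseError(Exception):
            """Placeholder so the except clause below stays valid."""


def try_feed(parser, data):
    """Feed data to parser; return False if a strict parser rejects it."""
    try:
        parser.feed(data)
        parser.close()
    except HTMLParseError:
        return False
    return True
```

A caller could skip the smiley replacement for a post whenever `try_feed` returns False and log the post for manual review.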