classilla / classilla

Building a secure browser for classic Mac OS.

Home Page: http://www.classilla.org/

www.newsmax.com/m may suck, but it should parse

GoogleCodeExporter opened this issue

Reported by Walt. Their mobile site does parse on TenFourFox. On Classilla it 
gets an error:

XML Parsing Error: not well-formed
Location: http://www.newsmax.com/m
Line Number 107, Column 71:<img 
src="C:\inetpub\wwwroot\ProdCMSV3_0\ga.aspx?utmac=UA-31221-1&utmn=2095532446&utm
r=-&utmp=%2fCMSTemplates%2fNewsmax%2fMobileSiteCMS%2fDefault.aspx%3faliaspath%3d
%252fmobilehome%252fDefault&guid=ON" />
----------------------------------------------------------------------^

This is clearly awful XML, but it should parse.

Original issue reported on code.google.com by classi...@floodgap.com on 21 Jan 2012 at 1:37

It looks like the parser is treating the bare & in the URL as the start of an
entity reference. That is correct for XML, but the server wants the page
identified as HTML, and parsed as HTML this would work. So we are sniffing the
document wrong.
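
A minimal way to see the difference, sketched in Python rather than in the browser's own code: Python's xml.etree.ElementTree is built on expat, so it rejects a bare & in an attribute value, while the lenient html.parser accepts the same markup. SNIPPET is a shortened stand-in for the failing line, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Shortened stand-in for the failing line: bare "&" inside an attribute value.
SNIPPET = '<img src="ga.aspx?utmac=UA-31221-1&utmn=2095532446&guid=ON" />'


class SrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag seen."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "img":
            self.sources.extend(v for k, v in attrs if k == "src")


def parses_as_xml(text):
    """Return True if a strict (expat-based) XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def img_sources_as_html(text):
    """Return the <img> src values found by a lenient HTML parser."""
    parser = SrcCollector()
    parser.feed(text)
    parser.close()
    return parser.sources


if __name__ == "__main__":
    print("accepted as XML:", parses_as_xml(SNIPPET))
    print("img sources as HTML:", img_sources_as_html(SNIPPET))
```

Escaping each & as &amp; makes the same snippet acceptable to the XML parser, which is what the page would need to do to be well-formed.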

Original comment by classi...@floodgap.com on 21 Jan 2012 at 1:47

Trying...
Connected to newsmax.com.
Escape character is '^]'.
GET /m HTTP/1.0
Host: www.newsmax.com
Connection: close

HTTP/1.1 200 OK
Cache-Control: no-cache,private, no-store, must-revalidate
Content-Length: 8077
Content-Type: text/html; charset=utf-8
Server: Microsoft-IIS/7.0
X-AspNet-Version: 2.0.50727
Set-Cookie: CMSPreferredCulture=en-US; expires=Mon, 21-Jan-2013 01:45:20 GMT; 
path=/
Set-Cookie: ASP.NET_SessionId=d1u3e245ilhwhc550zbeoham; path=/; HttpOnly
X-Powered-By: ASP.NET
X-UA-Compatible: IE=7
Date: Sat, 21 Jan 2012 01:45:19 GMT
Connection: close

Original comment by classi...@floodgap.com on 21 Jan 2012 at 1:47

<!DOCTYPE html PUBLIC "-//WAPFORUM//DTD XHTML Mobile 1.0//EN" 
"http://www.wapforum.org/DTD/xhtml-mobile10.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

Original comment by classi...@floodgap.com on 21 Jan 2012 at 1:48

Current suspect: htmlparser/src/nsParser.cpp:DetermineParseMode

We'll throw a breakpoint in there when we're ready to debug this.

Original comment by classi...@floodgap.com on 21 Jan 2012 at 2:51

Actually, MIME type detection is not failing, because newsmax declares itself
as XML:

<!-- Mobile Meta Tags -->
    <meta http-equiv="Content-type" content="application/xhtml+xml; charset=utf-8" />

The only way around this is to relax the parser. Yuck.
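
To make the mismatch concrete, here is a small Python sketch (not how Classilla does it) that extracts the media type from the HTTP header value seen earlier and from the meta tag above. The function and class names are hypothetical.

```python
from email.message import Message
from html.parser import HTMLParser


def media_type(content_type_value):
    """Return the bare, lowercased media type from a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type_value
    return msg.get_content_type()


class MetaContentType(HTMLParser):
    """Records the first <meta http-equiv="Content-Type"> declaration."""

    def __init__(self):
        super().__init__()
        self.declared = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.declared is not None:
            return
        attr = dict(attrs)  # attribute names arrive lowercased
        equiv = (attr.get("http-equiv") or "").lower()
        content = attr.get("content") or ""
        if equiv == "content-type" and content:
            self.declared = media_type(content)


def declared_in_markup(html_text):
    """Return the media type declared by a meta tag, or None if absent."""
    parser = MetaContentType()
    parser.feed(html_text)
    parser.close()
    return parser.declared


# Values copied from the response header and the page source in this issue.
HEADER = "text/html; charset=utf-8"
META = ('<meta http-equiv="Content-type" '
        'content="application/xhtml+xml; charset=utf-8" />')

if __name__ == "__main__":
    print("header says:", media_type(HEADER))
    print("markup says:", declared_in_markup(META))
```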

Original comment by classi...@floodgap.com on 1 Feb 2012 at 2:37

Altering expat so that XML_TOK_INVALID is accepted leads to "success" but
leaves holes in the page.

Maybe the simplest way is just to force application/xhtml+xml to be parsed as 
HTML. This is wrong, but no more wrong than other hacks we do.

Original comment by classi...@floodgap.com on 1 Feb 2012 at 3:26

This is what we did, and now the site works.

Let's see if this breaks anything.

Original comment by classi...@floodgap.com on 19 Feb 2012 at 5:07

  • Changed state: Started

It breaks about: (since about: needs to be parsed as XHTML). Maybe we add an
exception for this.

Original comment by classi...@floodgap.com on 4 Mar 2012 at 2:47

Implemented better solution from issue 189: fudge content types in
HttpChannel::ProcessNormal(). Since about: is loaded from jar:, it will not get
its content type changed, and is parsed as proper XHTML. Since the Newsmax page
loads from the network, its content type does get changed.
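
The idea can be sketched roughly as follows. This is an illustration in Python, not the actual change to HttpChannel::ProcessNormal(): fudge_content_type is a hypothetical name, the jar: path is made up, and treating both http and https as "network" is an assumption.

```python
from urllib.parse import urlsplit

# Assumption: only these schemes count as "loaded from the network".
NETWORK_SCHEMES = {"http", "https"}


def fudge_content_type(url, content_type):
    """Downgrade application/xhtml+xml to text/html for network loads only."""
    scheme = urlsplit(url).scheme.lower()
    media, sep, params = content_type.partition(";")
    is_xhtml = media.strip().lower() == "application/xhtml+xml"
    if scheme in NETWORK_SCHEMES and is_xhtml:
        # Keep any parameters (such as charset) as they were.
        return "text/html" + sep + params
    return content_type


if __name__ == "__main__":
    xhtml = "application/xhtml+xml; charset=utf-8"
    # Network load: the type is rewritten, so the page goes to the HTML parser.
    print(fudge_content_type("http://www.newsmax.com/m", xhtml))
    # Hypothetical jar: URL standing in for about:, left as XHTML.
    print(fudge_content_type("jar:file:///app/chrome.jar!/about.xhtml", xhtml))
```

Keying the decision on where the document came from, rather than on the parser, is what lets internal XHTML pages keep strict parsing.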

Original comment by classi...@floodgap.com on 5 Mar 2012 at 12:37

Original comment by classi...@floodgap.com on 19 Oct 2012 at 4:49

  • Changed state: Verified