atifaziz / Fizzler

.NET CSS Selector Engine

QuerySelectorAll on HtmlNode for FORM returns 0 nodes child INPUTs

atifaziz opened this issue · comments

Originally reported on Google Code with ID 24

What steps will reproduce the problem?

1. var nodes = formNode.QuerySelectorAll("textarea,input,button,select");
2. Console.WriteLine(nodes.Count());

What is the expected output? What do you see instead?

The method should return all child nodes of "formNode" matching the CSS 
selector. 0 nodes are returned at the moment.

Reported by asbjornu on 2009-05-06 14:31:05

Do you have a test HTML document that you can attach here as a file and where this
can be reproduced?

Reported by azizatif on 2009-05-06 14:34:06

It looks like this is a bug in the HtmlAgilityPack. I tested with the Google home
page, which contains a form with some input fields, and using the attached IronPython
script. The result of my interactive test was:

IronPython 2.0 (2.0.0.0) on .NET 2.0.50727.3074
Type "help", "copyright", "credits" or "license" for more information.
>>> import clr
>>> clr.AddReference('HtmlAgilityPack')
>>> from HtmlAgilityPack import HtmlDocument
>>> from System.Net import WebClient
>>> doc = HtmlDocument()
>>> doc.LoadHtml(WebClient().DownloadString('http://www.google.com/'))
>>> root = doc.DocumentNode
>>> print 'FORM tag count = ', root.SelectNodes('//form').Count
FORM tag count =  1
>>> print 'INPUT tag count = ', root.SelectNodes('//input').Count
INPUT tag count =  8
>>> form = root.SelectSingleNode('//form')
>>> print 'FORM tag child count', form.ChildNodes.Count
FORM tag child count 0

The problem is that the ChildNodes property of a form returns an empty collection!
As a result, Fizzler fails to find anything within the descendants or immediate
children of a form.
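
To make the symptom concrete, here is a toy model in plain Python 3 (not IronPython, and not HtmlAgilityPack's actual parser). It assumes the parser treats FORM as an element that cannot hold children, which is one plausible reading of the numbers above: the INPUT elements still exist in the document, but they end up as siblings of the form rather than its descendants. The names `Node`, `ToyTreeBuilder` and `count_inputs` are hypothetical and exist only for this sketch.

```python
from html.parser import HTMLParser


class Node:
    """Minimal DOM node: a tag name plus an ordered list of children."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []

    def descendants(self):
        """Yield every node below this one, depth first."""
        for child in self.children:
            yield child
            yield from child.descendants()


class ToyTreeBuilder(HTMLParser):
    """Builds a Node tree; tags listed in `empty_tags` never receive children."""

    def __init__(self, empty_tags):
        super().__init__()
        self.empty_tags = set(empty_tags)
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in self.empty_tags:
            self.current = node  # descend only into elements that may hold children

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        node = self.current
        while node is not self.root and node.name != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def count_inputs(html, empty_tags):
    """Return (inputs in document, inputs under the form, direct form children)."""
    builder = ToyTreeBuilder(empty_tags)
    builder.feed(html)
    builder.close()
    all_nodes = list(builder.root.descendants())
    form = next(n for n in all_nodes if n.name == "form")
    in_document = sum(1 for n in all_nodes if n.name == "input")
    in_form = sum(1 for n in form.descendants() if n.name == "input")
    return in_document, in_form, len(form.children)


HTML = '<body><form action="/search"><input name="q"><input type="submit"></form></body>'
VOID = {"input", "br"}

print(count_inputs(HTML, VOID | {"form"}))  # (2, 0, 0): form treated as childless
print(count_inputs(HTML, VOID))             # (2, 2, 2): form treated as a container
```

In the first call the document-wide count still sees both inputs while the form reports no children, which mirrors the `//input` count of 8 next to a FORM child count of 0 in the session above.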

It looks like this issue is already logged with HtmlAgilityPack (but unfortunately
with no resolution):

http://htmlagilitypack.codeplex.com/WorkItem/View.aspx?WorkItemId=21782

Reported by azizatif on 2009-05-06 15:03:00


- _Attachment: [issue24.py](https://storage.googleapis.com/google-code-attachments/fizzler/issue-24/comment-2/issue24.py)_

Reported by azizatif on 2009-05-06 15:04:39

  • Labels added: Component-External

There are alternatives to HtmlAgilityPack available. How much work would it be to
drop HtmlAgilityPack and use something else as our default?

We should unit test this problem if we do swap.

Reported by info%colinramsay.co.uk@gtempaccount.com on 2009-05-06 15:08:34

> drop HtmlAgilityPack 

I don't suggest dropping it. Leave it in as it is, but perhaps stop using it as the
default if a more robust implementation is available.

> How much work would it be to [...] use something else as our default?

Shouldn't be a whole lot as long as the other supports a reasonable API providing
access to attributes, children and siblings of a node.

Reported by azizatif on 2009-05-06 15:18:04

> drop HtmlAgilityPack and use something else as our default?

Now tracked separately as issue #25.


Reported by azizatif on 2009-05-06 15:23:53

Reported by azizatif on 2009-05-06 15:24:49

  • Status changed: WontFix

Guys, this is not an HTML agility pack comment. Check back the
http://htmlagilitypack.codeplex.com/WorkItem/View.aspx?WorkItemId=21782&ProjectName=htmlagilitypack
page.

Reported by simon_mourier@hotmail.com on 2009-05-18 06:29:14

Thanks Simon. We'll take a look at this, as HtmlAgilityPack actually worked just fine
apart from this item. For now we have created a new SgmlReader wrapper and are using
that as our default. I can see us receiving further bug reports because of this
behaviour, so we might just stick with SgmlReader as the default, but I'll re-open
this issue for now.

Reported by info%colinramsay.co.uk@gtempaccount.com on 2009-05-18 08:29:09

  • Status changed: Accepted

Simon, thanks for your input on this issue. What would you recommend for the value of
HtmlElementFlag for FORM? By default, it seems to be CanOverlap OR Empty. I tried
also turning on the Closed flag, and that made it work. That is, with CanOverlap OR
Closed OR Empty, one sees INPUT elements appear within descendants of FORM:

IronPython 2.0 (2.0.0.0) on .NET 2.0.50727.3074
Type "help", "copyright", "credits" or "license" for more information.
>>> import clr
>>> clr.AddReference('HtmlAgilityPack')
>>> from HtmlAgilityPack import *
>>> print HtmlNode.ElementsFlags['form']
10
>>> HtmlNode.ElementsFlags['form'] |= HtmlElementFlag.Closed
>>> print HtmlNode.ElementsFlags['form']
14
>>> from System.Net import WebClient
>>> doc = HtmlDocument()
>>> doc.LoadHtml(WebClient().DownloadString('http://www.google.com/'))
>>> root = doc.DocumentNode
>>> print 'FORM tag count = ', root.SelectNodes('//form').Count
FORM tag count =  1
>>> print 'INPUT tag count = ', root.SelectNodes('//input').Count
INPUT tag count =  8
>>> form = root.SelectSingleNode('//form')
>>> print 'FORM tag child count', form.ChildNodes.Count
FORM tag child count 1
>>> def dump(node, level = 0):
...     print ' ' * level, node.Name
...     for child in node.ChildNodes:
...         dump(child, level + 1)
...
>>> dump(form)
 form
  table
   tr
    td
     #text
    td
     input
     input
     input
     br
     input
     input
    td
     font
      #text
      a
       #text
      br
      #text
      a
       #text
      br
      #text
      a
       #text
   tr
    td
     font
      span
       #text
       input
       label
        #text
       input
       label
        #text
       input
       label
        #text
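
The two numbers printed for `HtmlNode.ElementsFlags['form']` in the session above are bit-flag sums. A minimal sketch of the arithmetic in plain Python 3 follows. `ElementFlag` is a hypothetical stand-in for `HtmlElementFlag`, not the library type. Only `Closed = 4` follows directly from the session (14 minus 10); splitting 10 into `CanOverlap = 8` and `Empty = 2` is an assumption.

```python
from enum import IntFlag


class ElementFlag(IntFlag):
    """Hypothetical stand-in for HtmlElementFlag; values inferred from the session."""
    EMPTY = 2        # assumed value
    CLOSED = 4       # 14 - 10, as printed in the session
    CAN_OVERLAP = 8  # assumed value


# Default for FORM as described in the thread: CanOverlap OR Empty.
default_form = ElementFlag.CAN_OVERLAP | ElementFlag.EMPTY

# The change tried in the session: additionally turn on Closed.
patched_form = default_form | ElementFlag.CLOSED

print(int(default_form))  # 10
print(int(patched_form))  # 14
print(ElementFlag.CLOSED in patched_form)
```

Because the flags are combined with OR, turning on Closed leaves the existing CanOverlap and Empty bits untouched.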

Also, I see that this does not affect Fizzler directly, only its clients. It does,
however, affect the Visual and Console Fizzler utilities, which happen to be Fizzler
clients and which should perhaps now have an option to opt in to one behavior or the
other with regard to FORM.


Reported by azizatif on 2009-05-18 08:49:04

Reported by azizatif on 2009-09-30 23:23:22

  • Status changed: Started

Reported by azizatif on 2009-10-01 06:19:40

Fixed in r256.

Reported by azizatif on 2009-10-01 06:20:29

  • Status changed: Fixed

Reported by azizatif on 2009-12-08 23:11:08

What alternatives are there to Html Agility Pack and SgmlReader? I have found
SgmlReader to be pretty slow.

Has anyone used HtmlUnit? Could we use http://www.ikvm.net/ to convert that library
to .NET?

Reported by jake.net on 2009-12-09 04:26:29

IKVM.net is excellent and so is HtmlUnit, but the tests I've done show that the 
converted code is awfully slow to initialize and somewhat slower during execution 
than the unconverted code. That's not to say a patch from you implementing the 
conversion and HtmlUnit as a DOM engine in Fizzler should be declined, though. :)

Reported by asbjornu on 2009-12-09 10:27:32

I'm still noticing an issue where it seems to return null (no results) when passing
"form" or the ID of a form (e.g. "#form1"), even when using LoadHtml2 or in Visual Fizzler.


Has anyone else still had this issue?

Reported by mmezzacca on 2010-07-23 16:14:44

This issue was closed by revision 073aa958b22b.

Reported by azizatif on 2013-01-04 08:31:04

This issue was closed by revision 9c7132c82f3c.

Reported by azizatif on 2013-01-04 10:56:38