lhtml

lhtml is a lenient HTML parser for Go.

It differs from the standard html package because it will not re-order any of the encountered elements, nor will it try to sanitize your HTML file. This package is intended to be used for HTML-template based systems which want to process their own custom tags and attributes.

Features
API
Usage
Examples
Hacking
Changelog
Related Projects
License

Features

lhtml differs from the standard html parser in the following ways:

A single parsing function that handles both documents and fragments
- ParseHtml
Tags may optionally carry multiple attributes with the same name
- ParseOption#AllowMultipleAttributesWithSameName
No sanitization of the resulting DOM
- example
Provides node discovery functions
- GetElementById
- GetElementsByName
- GetBefore
- GetAfter
- Get (at index)
- First
- Last
Manipulation functions
- InsertFirst
- InsertLast
- EmptyChildren
- Remove
- Replace
Visitor functions, either while building the tree or to walk it afterwards
- Traverse(visitor) (example)
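
To see why allowing repeated attribute names matters, here is a minimal standalone sketch. The Attr type and valuesFor helper are hypothetical and are not lhtml's types or internals. The sketch only shows that an ordered list of name/value pairs keeps both class values, while a plain map keeps just the last one.

```go
package main

import "fmt"

// Attr is a hypothetical name/value pair; it is NOT lhtml's attribute type.
type Attr struct {
	Name  string
	Value string
}

// valuesFor returns every value stored under name, in source order.
func valuesFor(attrs []Attr, name string) []string {
	var out []string
	for _, a := range attrs {
		if a.Name == name {
			out = append(out, a.Value)
		}
	}
	return out
}

// lastWins collapses the attributes into a map, so duplicates overwrite each other.
func lastWins(attrs []Attr) map[string]string {
	m := make(map[string]string, len(attrs))
	for _, a := range attrs {
		m[a.Name] = a.Value
	}
	return m
}

func main() {
	// Mirrors <html class='test1' class='test2' custom:title='hello'>.
	attrs := []Attr{
		{Name: "class", Value: "test1"},
		{Name: "class", Value: "test2"},
		{Name: "custom:title", Value: "hello"},
	}
	fmt.Println(valuesFor(attrs, "class")) // [test1 test2]
	fmt.Println(lastWins(attrs)["class"])  // test2
}
```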

API

lhtml exposes a single parsing function that works on both full HTML documents and HTML fragments:

func ParseHtml(reader io.Reader) (*core.HtmlDocument, error)

We also expose a convenience function in case you would like to pass a string instead of an io.Reader:

func ParseHtmlString(html string) (*core.HtmlDocument, error)

Usage

Simply add the library to your project:

$ go get github.com/sangupta/lhtml@v0.1.0

And then, use it to parse your HTML markup:

import (
    "fmt"

    "github.com/sangupta/lhtml"
    "github.com/sangupta/lhtml/core"
)

func test() {
    html := "<html class='test1' class='test2' custom:title='hello'>Hello World <custom:PageBody /></html>"
    doc, err := lhtml.ParseHtmlString(html)
    if err != nil {
        panic(err)
    }

    // visitor prints the tag name of every element node it is given
    visitor := func(node *core.HtmlNode) bool {
        if node.NodeType == core.ElementNode {
            fmt.Println(node.TagName)
        }

        return true
    }

    // walk the parsed document with the visitor
    doc.Traverse(visitor)
}

Examples

No DOM sanitization

For example, the HTML title tag cannot contain another tag. Given the following HTML:

<html>
    <head>
        <title>
            <custom:PageTitle />
        </title>
    </head>
</html>

The standard Go implementation will parse it to:

<html>
    <head>
        <title>
            &lt;custom:PageTitle /&gt;
        </title>
    </head>
</html>

However, when using lhtml you will get the exact markup as defined above. It is left to the calling code to decide how to interpret and use the parsed DOM nodes.
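
The escaped form in the standard parser's output comes from the contents of title being treated as plain text and then escaped on serialization. As a rough illustration of just that escaping step, the sketch below uses html.EscapeString from the Go standard library. It does not invoke any parser, and escapeText is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"html"
)

// escapeText shows what happens to markup once it is handled as plain text:
// the angle brackets are replaced by character references.
func escapeText(s string) string {
	return html.EscapeString(s)
}

func main() {
	fmt.Println(escapeText("<custom:PageTitle />")) // &lt;custom:PageTitle /&gt;
}
```

Once a custom tag has been flattened like this, a template system can no longer find it as an element, which is the situation lhtml is meant to avoid.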

Traversing the DOM

func test() {
    doc, err := lhtml.ParseHtmlString("<html><head><title>Example</title></head><body><h1>Hello World</h1></body></html>")
    if err != nil {
        panic(err)
    }

    s := ""
    called := 0
    // visitor counts every node and collects the names of element nodes
    visitor := func(node *core.HtmlNode) bool {
        called++
        if node.NodeType != core.ElementNode {
            return true
        }
        s = s + " " + node.NodeName()
        return true
    }

    doc.Traverse(visitor)
    fmt.Println(s)          // " html head title body h1"
    fmt.Println(called)     // 7 (5 element nodes, 2 text nodes)
}
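
The counts in the example above follow from a document-order walk over five elements and two text nodes. The standalone sketch below reproduces that walk on a hand-built tree. Node, sampleTree and collectNames are hypothetical stand-ins, not lhtml's core.HtmlNode or its Traverse. The sketch also assumes that a false return from the visitor stops the whole walk, which this README does not state.

```go
package main

import "fmt"

// Node is a hypothetical stand-in for a parsed DOM node (NOT core.HtmlNode).
type Node struct {
	Name     string // tag name; empty for text nodes
	Text     string // text content; empty for element nodes
	Children []*Node
}

// Traverse visits n and then its children in document order.
// Assumption: a false return from visit stops the entire walk.
func (n *Node) Traverse(visit func(*Node) bool) bool {
	if !visit(n) {
		return false
	}
	for _, child := range n.Children {
		if !child.Traverse(visit) {
			return false
		}
	}
	return true
}

// sampleTree builds the same structure as
// <html><head><title>Example</title></head><body><h1>Hello World</h1></body></html>.
func sampleTree() *Node {
	title := &Node{Name: "title", Children: []*Node{{Text: "Example"}}}
	h1 := &Node{Name: "h1", Children: []*Node{{Text: "Hello World"}}}
	head := &Node{Name: "head", Children: []*Node{title}}
	body := &Node{Name: "body", Children: []*Node{h1}}
	return &Node{Name: "html", Children: []*Node{head, body}}
}

// collectNames returns the space-prefixed element names and the number of visitor calls.
func collectNames(root *Node) (string, int) {
	names, called := "", 0
	root.Traverse(func(n *Node) bool {
		called++
		if n.Name != "" {
			names += " " + n.Name
		}
		return true
	})
	return names, called
}

func main() {
	names, called := collectNames(sampleTree())
	fmt.Println(names)  // " html head title body h1"
	fmt.Println(called) // 7
}
```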

Hacking

To build the Go docs locally:
- $ godoc -http=:6060
- Open http://localhost:6060/pkg/github.com/sangupta/lhtml
To run all tests along with a code coverage report:
- $ go test ./... -v -coverprofile coverage.out
- $ go tool cover -html=coverage.out
To publish the Go module:
- $ git tag v0.x.0
- $ git push origin v0.x.0
- $ GOPROXY=proxy.golang.org go list -m github.com/sangupta/lhtml@v0.x.0
