rowbot/dom

rowbot/dom is an attempt to implement the Document Object Model (DOM) in PHP in a way that is more in line with current standards. PHP already ships its own DOM implementation, but it is somewhat outdated and geared towards XML/XHTML/HTML4. This is very much a work in progress, so things may be broken.

Requirements
Features
The DocumentBuilder class
Usage
Caveats
Turning your tree back into a string

Requirements

PHP >= 7.1
ext-mbstring
rowbot/url

Features

Does not rely on ext-dom
Robust HTML5 tokenizer and parser
Supports the <template> element
Supports innerHTML and outerHTML
Supports live ranges
Extensive test suite ported from Web platform tests

The DocumentBuilder class

The primary entry point is the DocumentBuilder class. It allows you to create a document while specifying things such as the Document's base URL and whether or not scripting should be emulated.

DocumentBuilder::create(): static

Returns a new instance of the DocumentBuilder.

DocumentBuilder::setContentType(string $contentType): $this

Required. Sets the content type of the document. If the given content type is invalid, a TypeError will be thrown. This will determine the type of document returned as well as what parser to use. The content type can be one of the following:

'text/html'
'text/xml'
'application/xml'
'application/xhtml+xml'
'image/svg+xml'
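
Because an unsupported content type raises a TypeError, it may be worth guarding the call when the value comes from outside your own code. The following is a minimal sketch, assuming the package is installed through Composer and autoloaded from vendor/autoload.php; the $contentType value is a made-up example.

```php
<?php

require_once 'vendor/autoload.php';

use Rowbot\DOM\DocumentBuilder;

// Hypothetical value, e.g. taken from an HTTP response header.
$contentType = 'text/plain';

try {
    $document = DocumentBuilder::create()
        ->setContentType($contentType)
        ->createEmptyDocument();
} catch (TypeError $e) {
    // 'text/plain' is not in the list above, so this branch is expected to run.
    $document = null;
}
```

How to recover is up to the caller; setting $document to null here is only one option.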

DocumentBuilder::setDocumentUrl(string $url): $this

Sets the URL of the document. This is used for resolving links in tags such as <a href="/index.php"></a> and for resolving any links specified by <base> elements in the document. If not set, the document will default to the "about:blank" URL. This must be an absolute URL. If the given URL fails parsing, a TypeError will be thrown. Note that not every URL that parses successfully makes sense as a document URL. For example, "mailto:me@example.com" is accepted without complaint, so take care when setting this value.

Examples

DocumentBuilder::create()->setDocumentUrl('http://example.com/');

DocumentBuilder::create()->setDocumentUrl('file:///C:/example.html');

DocumentBuilder::create()->setDocumentUrl('https://my.domain.net/index.php');

DocumentBuilder::create()->setDocumentUrl('https://searchengine.fr/search');
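
Since setDocumentUrl() accepts any URL that parses, including ones like "mailto:me@example.com", one way to take the care mentioned above is to check the scheme yourself before passing the value in. The sketch below uses only PHP's built-in parse_url(); hasAllowedScheme() and its default allow-list are hypothetical and not part of rowbot/dom, so adjust them to your needs.

```php
<?php

/**
 * Hypothetical helper (not part of rowbot/dom): returns true when $url
 * starts with one of the allowed schemes.
 */
function hasAllowedScheme(string $url, array $allowed = ['http', 'https', 'file']): bool
{
    $scheme = parse_url($url, PHP_URL_SCHEME);

    // parse_url() yields null when there is no scheme and false on
    // seriously malformed input; neither is acceptable here.
    if (!is_string($scheme)) {
        return false;
    }

    return in_array(strtolower($scheme), $allowed, true);
}

var_dump(hasAllowedScheme('https://example.com/'));  // bool(true)
var_dump(hasAllowedScheme('mailto:me@example.com')); // bool(false)
var_dump(hasAllowedScheme('/index.php'));            // bool(false)
```

A caller could run this check first and only then call setDocumentUrl(), still wrapping that call in a try/catch for the TypeError described above.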

DocumentBuilder::emulateScripting(bool $enable): $this

Enables or disables scripting emulation. Enabling it does not cause any scripts to be executed; it only affects how the parser and serializer handle <noscript> tags. If scripting emulation is enabled, their content is seen by the DOM as plain text. If emulation is disabled, which is the default, their content is parsed as part of the DOM.

Example with scripting emulation enabled

$document = DocumentBuilder::create()
    ->setContentType('text/html')
    ->emulateScripting(true)
    ->createEmptyDocument();

$el = $document->createElement('div');
$el->innerHTML = '<noscript><p id="foo">You must enable scripting!</p></noscript>';

$el->textContent; // <p id="foo">You must enable scripting!</p>
$foo = $el->getElementById('foo'); // null
$el->firstChild->firstChild->nodeName; // #text

Example with scripting emulation disabled

$document = DocumentBuilder::create()
    ->setContentType('text/html')
    ->emulateScripting(false)
    ->createEmptyDocument();

$el = $document->createElement('div');
$el->innerHTML = '<noscript><p id="foo">You must enable scripting!</p></noscript>';

$el->textContent; // You must enable scripting!
$foo = $el->getElementById('foo'); // HTMLParagraphElement
$el->firstChild->firstChild->nodeName; // P

DocumentBuilder::createFromString(string $input): Document

Parses the input string and returns the resulting Document object. This will throw a TypeError if the content type is not specified.

DocumentBuilder::createEmptyDocument(): Document

Returns an empty Document object. The type of Document object returned is dependent on the specified content type. This will throw a TypeError if the content type is not specified.
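
As a small illustration of createFromString(), the sketch below parses an inline HTML string instead of a file. It assumes the package is installed through Composer and autoloaded from vendor/autoload.php; the markup and the "greeting" id are invented for the example.

```php
<?php

require_once 'vendor/autoload.php';

use Rowbot\DOM\DocumentBuilder;

// Made-up markup used only for this example.
$html = '<!DOCTYPE html><html><head><title>Demo</title></head>'
    . '<body><p id="greeting">Hello</p></body></html>';

// The content type must be set before parsing, otherwise a TypeError is thrown.
$document = DocumentBuilder::create()
    ->setContentType('text/html')
    ->createFromString($html);

// Look up the paragraph by its id and print its text.
$greeting = $document->getElementById('greeting');

echo $greeting->textContent, "\n";
```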

Usage

Recommended way to create a Document

<?php

require_once 'vendor/autoload.php';

use Rowbot\DOM\DocumentBuilder;

// Creates a new DocumentBuilder and saves the resulting document to $document.
$document = DocumentBuilder::create()

  // This is required. Tells the builder what type of document and parser should be used.
  ->setContentType('text/html')

  // Sets the document's URL, for more accurate link parsing. Not setting this will cause the
  // document to default to the "about:blank" URL. This must be a valid URL.
  ->setDocumentUrl('https://example.com')

  // Whether or not the environment should emulate scripting, which mostly affects how <noscript>
  // tags are parsed and serialized. The default is false.
  ->emulateScripting(true)

  // Returns a new document using the input string.
  ->createFromString(file_get_contents('path/to/my/index.html'));

// Do some things with the document
$document->getElementById('foo');

Parsing an HTML Document using DOMParser

<?php

require_once "vendor/autoload.php";

use Rowbot\DOM\DOMParser;

$parser = new DOMParser();

// Currently "text/html" is the only supported option.
$document = $parser->parseFromString(file_get_contents('/path/to/file.html'), 'text/html');

// Do some things with the document
$document->getElementById('foo');

Creating an empty Document

<?php
require_once "vendor/autoload.php";

use Rowbot\DOM\DocumentBuilder;

/**
 * This creates a new empty HTML Document.
 */
$doc = DocumentBuilder::create()
    ->setContentType('text/html')
    ->createEmptyDocument();

/**
 * Want a skeleton framework for an HTML Document?
 */
$doc = $doc->implementation->createHTMLDocument();

// Set the page title
$doc->title = "My HTML Document!";

// Create an HTML anchor tag
$a = $doc->createElement("a");
$a->href = "http://www.example.com/";

// Insert it into the document
$doc->body->appendChild($a);

// Convert the DOM tree into an HTML string
echo $doc->toString();

Caveats

Only UTF-8 encoded documents are supported.
All string input is expected to be in UTF-8.
All strings returned to the user, such as those returned from Text.data, are in UTF-8, rather than UTF-16.
All string offsets and lengths, such as those in Text.replaceData() or Text.length, are expressed in Unicode code points, rather than UTF-16 code units.
No XML parser exists at this time. However, XML documents can be built manually and serialized.
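
The difference between bytes, code points, and UTF-16 code units only shows up for non-ASCII text. The sketch below uses plain PHP and ext-mbstring, not this library, to count the same string in all three ways. Going by the caveats above, offsets and lengths in this library would correspond to the code point count, whereas a browser DOM would report the UTF-16 count.

```php
<?php

// "a" followed by U+1F600, a character outside the Basic Multilingual Plane.
$text = "a\u{1F600}";

// Number of bytes in the UTF-8 encoding.
$bytes = strlen($text);

// Number of Unicode code points.
$codePoints = mb_strlen($text, 'UTF-8');

// Number of UTF-16 code units: re-encode, then count 2-byte units.
$utf16Units = intdiv(strlen(mb_convert_encoding($text, 'UTF-16LE', 'UTF-8')), 2);

var_dump($bytes);      // int(5)
var_dump($codePoints); // int(2)
var_dump($utf16Units); // int(3)
```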

Turning your tree back into a string

For the entire Document:
- You may call the toString() method on the Document, e.g. $document->toString(), or you may cast the Document to a string, e.g. (string) $document.
For Elements:
- Depending on your needs, you may use the innerHTML property to get all of the Element's descendants, e.g. $element->innerHTML, or you may use the outerHTML property to get the Element itself and all its descendants, e.g. $element->outerHTML.
For Text nodes:
- You may use the data property, e.g. $textNode->data, to get the text data from the node.
For Ranges:
- You may call the toString() method, e.g. $range->toString(), or you may cast the Range to a string, e.g. (string) $range.
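
The Element and Text options above can be sketched together as follows. This assumes the package is installed through Composer and autoloaded from vendor/autoload.php; the list markup is invented for the example, and the comments describe what each property is meant to return rather than verified output.

```php
<?php

require_once 'vendor/autoload.php';

use Rowbot\DOM\DocumentBuilder;

$document = DocumentBuilder::create()
    ->setContentType('text/html')
    ->createEmptyDocument();

// Build a small tree to serialize; the markup is made up.
$list = $document->createElement('ul');
$list->innerHTML = '<li>One</li><li>Two</li>';

// Descendants of the element only.
echo $list->innerHTML, "\n";

// The element itself plus its descendants.
echo $list->outerHTML, "\n";

// The data of the Text node inside the first <li>.
echo $list->firstChild->firstChild->data, "\n";
```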

TRowbotham / PHPDOM