fb55 / htmlparser2

The fast & forgiving HTML and XML parser

Home Page: https://feedic.com/htmlparser2

AST improvements

alexander-akait opened this issue · comments

Hello, and thanks for the great project. I am from the webpack organization. We want to improve our integration with HTML, but we have run into difficulties because the AST lacks some information, and we would like to help improve this.

  1. Update startIndex and endIndex on onattribute

Our concept is to transform HTML into JS:

file.html

<div>
  <h1>Text</h1>
  <img src="./image.png" alt="alt" />
</div>

Transform to:

import file from './image.png';

export default "<div><h1>Text</h1><img src=\"" + file + "\" alt=\"alt\" /></div>";

But we lack some information: the positions of the attribute values, which we need in order to replace them with the imported content.

For now we use a hack: https://github.com/webpack-contrib/html-loader/blob/master/src/plugins/source-plugin.js#L436

{
  onattribute(name, value) {
    const endIndex = parser._tokenizer._index;
    const startIndex = endIndex - value.length;
    const unquoted = html[endIndex] !== '"' && html[endIndex] !== "'";

    attributesMeta[name] = { startIndex, unquoted };
  }
}

But it is a very hacky and dirty solution. I think it would be very easy to fix, and I can help with this.
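
With reliable offsets, the transform described above comes down to cutting the HTML at each attribute value and emitting string concatenation. A minimal sketch, assuming each replacement carries a startIndex/endIndex pair (end exclusive, covering only the value without its quotes) plus the name of an imported binding. The helper name buildModuleSource and the shape of the replacements array are hypothetical and not part of htmlparser2 or html-loader:

```javascript
// Hypothetical helper: turns HTML plus a list of value ranges into the
// source text of a JS module. Offsets are assumed to cover only the
// attribute value (no surrounding quotes); endIndex is exclusive.
function buildModuleSource(html, replacements) {
  // Work left to right so each literal chunk can be sliced in order.
  const sorted = [...replacements].sort((a, b) => a.startIndex - b.startIndex);
  const imports = [];
  const parts = [];
  let cursor = 0;

  for (const { startIndex, endIndex, identifier, request } of sorted) {
    imports.push(`import ${identifier} from ${JSON.stringify(request)};`);
    // JSON.stringify escapes quotes and newlines, so the chunk is a
    // valid JS string literal.
    parts.push(JSON.stringify(html.slice(cursor, startIndex)));
    parts.push(identifier);
    cursor = endIndex;
  }
  // Remaining HTML after the last replaced value.
  parts.push(JSON.stringify(html.slice(cursor)));

  return `${imports.join("\n")}\n\nexport default ${parts.join(" + ")};\n`;
}
```

The sketch leaves the original quotes in the literal chunks, which is exactly why the quote information discussed in the next point matters.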

  2. No information about quotes.

A developer can set any transformer for import file from './image.png'; and can even change the name of the file, so we need specific logic to decide whether to keep the quotes, or to insert them when the filename contains characters that are not allowed in an unquoted value (for example, a space, as in image of something.png).

You can see our hack above. It would be great to add this information and extend onattribute like so:

{
  onattribute(name, value, quotes) {}
}

The quotes argument can be:

  • undefined - no quotes
  • ' or " - type of quotes
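
To illustrate why the quote style matters to a consumer: once the value has been replaced, the original quoting may no longer be valid. A sketch, assuming the quotes argument proposed above (undefined, ' or "); chooseQuote is a hypothetical helper, not an htmlparser2 API. It relies on the HTML rule that an unquoted attribute value must be non-empty and must not contain whitespace, ", ', =, <, > or a backtick:

```javascript
// Characters that are not allowed in an unquoted HTML attribute value.
const UNSAFE_UNQUOTED = /[\s"'=<>`]/;

// Hypothetical helper: returns the quote character to wrap `newValue` in,
// or "" when the value can stay unquoted. `originalQuote` follows the
// proposal above: undefined (no quotes), "'" or '"'.
// Note: a value containing BOTH quote characters would need entity
// escaping, which this sketch does not handle.
function chooseQuote(newValue, originalQuote) {
  if (originalQuote === undefined) {
    const safe = newValue !== "" && !UNSAFE_UNQUOTED.test(newValue);
    if (safe) return "";
    // Quotes are required now; prefer double quotes unless the value
    // itself contains one.
    return newValue.includes('"') ? "'" : '"';
  }
  // Keep the author's quote style unless the new value contains it.
  if (!newValue.includes(originalQuote)) return originalQuote;
  return originalQuote === '"' ? "'" : '"';
}
```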

  3. Duplicate attributes

This is very much an edge case, and I think it is a breaking change.

For example: <img src="./image.png" src="./other-image.png" alt="alt" />

Currently the parser returns:

{
  attribs: {
    src: "./image.png",
    alt: "alt"
  },
}

But onattribute is called twice, as expected.

It would be great to improve it to:

{
    attribs: [
        {
            src: "./image.png",
            alt: "alt"
        },
        {
            src: "./other-image.png",
            alt: "alt"
        }
    ]
}
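
Because onattribute already fires once per occurrence, duplicates can be recovered in user land without changing attribs. A minimal sketch, assuming the event order onopentagname, then onattribute for each attribute, then onopentag; createAttributeCollector and findDuplicates are hypothetical names, not part of htmlparser2:

```javascript
// Hypothetical user-land collector built on the per-attribute events.
// Assumed event order per tag: onopentagname, onattribute (once per
// attribute, duplicates included), then onopentag.
function createAttributeCollector() {
  const tags = [];
  let pending = [];

  const handler = {
    onopentagname() {
      pending = []; // a new tag starts: forget the previous attributes
    },
    onattribute(name, value) {
      pending.push({ name, value });
    },
    onopentag(name) {
      tags.push({ name, attributes: pending });
      pending = [];
    },
  };

  return { handler, tags };
}

// Returns the attribute names that occur more than once in one tag.
function findDuplicates(attributes) {
  const seen = new Set();
  const duplicates = new Set();
  for (const { name } of attributes) {
    if (seen.has(name)) duplicates.add(name);
    seen.add(name);
  }
  return [...duplicates];
}
```

Passing handler to new Parser(handler) and feeding it the HTML with write() and end() should fill tags; each entry can then be checked with findDuplicates.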

Thank you again for the good project; I will be happy to get any feedback. I am ready to help with any of these problems, as they do not seem complicated.

Hi @evilebottnawi, very interesting use-case, thanks for providing some insights!

(1) should definitely be fixed. I haven't gone through the responsible code in a while, but adding some _updatePosition calls to src/Parser.ts should be a good start.

For (2): So far, I've been pretty adamant to not add any output that does not relate to the semantic meaning. Once (1) is fixed, this should be much easier to implement.

As you said, (3) is a pretty big breaking change. This is the exact use-case of per-attribute events, so generating this yourself is hopefully not too bad.

> (1) should definitely be fixed. I haven't gone through the responsible code in a while, but adding some _updatePosition calls to src/Parser.ts should be a good start.

👍

> For (2): So far, I've been pretty adamant to not add any output that does not relate to the semantic meaning. Once (1) is fixed, this should be much easier to implement.

I think quotation marks have a very important semantic meaning. Besides, you already have this information, so why not provide it to the developer? It would make life easier. I found that postcss-html uses this package too, and they use hacks for the same purpose. If one of the biggest consumers uses hacks, maybe it's really worth considering an improvement.

> As you said, (3) is a pretty big breaking change. This is the exact use-case of per-attribute events, so generating this yourself is hopefully not too bad.

Maybe we can postpone it until the next major release, though I find this decision a little unfortunate. I can even imagine a developer trying to create a linter with this package and being unable to implement a no-duplicate-attributes rule 😄

I've also run into a use case where I think we can benefit from this (at least if cheerio and dom-serializer make use of it).

For my new project, integrity-matters, a tool to check hashes and auto-update HTML integrity attributes and CDN version URLs (based on what is present in node_modules), I'd like to keep inter-attribute whitespace in place, e.g., to have:

  <script src="https://unpkg.com/leaflet@1.4.0/dist/leaflet.js"
    integrity="sha512-QVftwZFqvtRNi0ZyCtsznlKSWOStnDORoefr1enyq5mVL4tmKB3S/EnC3rRJcxCPavG10IcrVGSmPh6Qw5lwrg=="
    crossorigin=""></script>

...not be overwritten after an update into a one-liner like:

  <script src="https://unpkg.com/leaflet@1.6.0/dist/leaflet.js" integrity="sha512-gZwIG9x3wUXg2hdXF6+rVkLF/0Vi9U8D2Ntg4Ga5I5BZpVkVxlJWbSQtXPSiUTtC0TjtGOmxa1AJPuV0CPthew==" crossorigin></script>
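
One way to keep the whitespace is to avoid re-serializing altogether and patch the original text in place, which is where per-attribute offsets would help. A sketch, assuming the start and end offsets of each value are already known (end exclusive); applyEdits is a hypothetical helper and not part of htmlparser2, cheerio or dom-serializer:

```javascript
// Hypothetical helper: applies non-overlapping edits to `source`.
// Each edit replaces the range [startIndex, endIndex) with `text`.
function applyEdits(source, edits) {
  // Apply from the end of the file backwards, so earlier offsets stay
  // valid even when a replacement changes the length of the text.
  const ordered = [...edits].sort((a, b) => b.startIndex - a.startIndex);
  let result = source;
  for (const { startIndex, endIndex, text } of ordered) {
    result = result.slice(0, startIndex) + text + result.slice(endIndex);
  }
  return result;
}
```

Everything outside the edited ranges, including line breaks and indentation between attributes, is left untouched.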

Finally getting around to addressing this. The one thing that definitely can be added is information about quotes. We actually have four states here:

  1. Single quotes (foo='bar')
  2. Double quotes (foo="bar")
  3. No quotes around the value (foo=bar)
  4. No value (foo)

Should they be handled separately?


Adding an array of attributes is not something I want to add here, as it would always be a breaking change. To add this to the existing DOM could be done easily outside of this module. Something like this should do the job:

class DomWithAttributeArrayHandler extends DomHandler {
    _attributes = [];

    onattribute(name, value, quote) {
        this._attributes.push([name, value, quote]);
    }

    onopentag(name, attribs) {
        super.onopentag(name, attribs);
        this._tagStack[
            this._tagStack.length - 1
        ].attributeList = this._attributes;
        this._attributes = [];
    }
}

Finally, adding location information to attributes is also a bit tricky, as onopentag is emitted after all of the attributes. The start of the section would actually have to track back, which is not something that is supported right now. Happy to accept PRs for a more holistic solution here; I am struggling to come up with a good one.

>   1. Single quotes (foo='bar')
>   2. Double quotes (foo="bar")
>   3. No quotes around the value (foo=bar)
>   4. No value (foo)

Maybe?

  1. quote - "'"
  2. quote - '"'
  3. quote - null
  4. quote - undefined
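
With that mapping, a consumer could write an attribute back exactly as it appeared in the source. A minimal sketch, assuming the four values suggested here ("'", '"', null, undefined); serializeAttribute is a hypothetical helper and performs no escaping of the value:

```javascript
// Hypothetical serializer for the four states suggested above:
//   "'"       -> foo='bar'
//   '"'       -> foo="bar"
//   null      -> foo=bar   (value without quotes)
//   undefined -> foo       (no value at all)
function serializeAttribute(name, value, quote) {
  if (quote === undefined) return name;
  if (quote === null) return `${name}=${value}`;
  return `${name}=${quote}${value}${quote}`;
}
```

Distinguishing null from undefined is what keeps crossorigin="" and a bare crossorigin apart when writing the markup back.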

> Adding an array of attributes is not something I want to add here, as it would always be a breaking change. To add this to the existing DOM could be done easily outside of this module.

Maybe we can postpone it? It is not a high priority. Using this._attributes has always felt unsafe to me, because it looks like a private variable.

> Finally, adding location information to attributes is also a bit tricky, as onopentag is emitted after all of the attributes. The start of the section would actually have to track back, which is not something that is supported right now. Happy to accept PRs for a more holistic solution here; I am struggling to come up with a good one.

I'll try to look at it soon; maybe I can find a good solution. But without this information it is very difficult to use the package for future generations. We use this hack, which may help: https://github.com/webpack-contrib/html-loader/blob/master/src/plugins/source-plugin.js#L46

I pushed a change that adds quotes to the onattribute event.

> Using this._attributes has always felt unsafe to me, because it looks like a private variable.

this._attributes doesn't actually exist on DomHandler instances, this would be a private property for the extended class 🤷

@fb55 Just curious: do we have the quote property for the attributes argument in the onopentag(tag, attributes) callback? It would be great to have this property everywhere we have access to an attribute.

> do we have the quote property for the attributes argument in the onopentag(tag, attributes) callback

onopentag is just a thin layer over onopentagname, onattribute and onopentagend and should be pretty easy to replicate in user land. I'd prefer not to make any changes that either break the existing API or allocate memory that won't be used by a large set of users.

@fb55 We can always introduce option(s) for this if you're worried about performance. I don't need this information in some cases, but in others it would be very useful or even required; the same goes for locations (startIndex/endIndex). I understand perfectly well that I sacrifice performance when I ask for more information, but otherwise I have to look for dirty solutions, which is not great.

A key question for me, @fb55 , as far as making a PR to add location info (optionally or otherwise), is whether you'd accept changes to dom-serializer and DomHandler to take advantage of the feature. If so, I can see about a PR, as I have energy, as that is my real interest (I figured you'd want it solved at the source here, however, rather than hacked into DOMHandler or wherever the hack could be applied in those projects). I might also see about passing on the attribute quotes if that API is ready. Thanks!

To give a definite answer here: I don't think the existing DOM structure can support these use-cases. It seems like there might be enough interest to create a separate handler, perhaps as a fork of the existing one. Happy to promote something for that use-case.

I'm closing this ticket as I don't have a good way forward in the existing project.