Wrong attribute detection when space at the end of attribute value (version 8.3.0)

Question

TrumanRu opened this issue 2 years ago

TrumanRu commented 2 years ago

Package's name
string-strip-html@8.3.0

Describe the bug
Attributes are detected incorrectly when an attribute value ends with a space.

To Reproduce
Steps to reproduce the behavior:

<table style="font-family: &quot;Times New Roman&quot;;" data-mce-selected="1"><tbody><tr><td class="b-productParams ">without cover</td></tr></tbody></table>
(note the space between 'b-productParams' and the closing quote)
See error:

  ranges-push/Ranges/add(): [THROW_ID_10] "to" value, the second input argument, must be a natural number or zero! Currently it's of a type "number" equal to: null
  TypeError: ranges-push/Ranges/add(): [THROW_ID_10] "to" value, the second input argument, must be a natural number or zero! Currently it's of a type "number" equal to: null
     at Ranges.add (/app/node_modules/ranges-push/dist/ranges-push.cjs.js:125:17)
     at Ranges.push (/app/node_modules/ranges-push/dist/ranges-push.cjs.js:132:12)
     at /app/app/controllers/socket.js:127:33
     at Array.forEach (<anonymous>)
     at Object.cb (/app/app/controllers/socket.js:79:28)
     at _loop2 (/app/node_modules/string-strip-html/dist/string-strip-html.cjs.js:606:16)
     at stripHtml (/app/node_modules/string-strip-html/dist/string-strip-html.cjs.js:747:17)

Attribute values:

{
  "nameStarts": 117,
  "nameEnds": 118,
  "name": "\""
}

Expected behavior
I would expect the trailing space to be ignored, with no error thrown.

Additional context
I need the 8.x branch because I import the package with require().

Roy Revelt · Answer 1 · Wed Apr 13 2022 01:20:21 GMT+0800 (China Standard Time)

Hi! Sorry about that! Thank you for reporting, I'll fix it.

Roy Revelt · Answer 2 · Thu Apr 14 2022 02:37:03 GMT+0800 (China Standard Time)

Hm, I looked at it, it's a complex situation.

In short, you should set opts.skipHtmlDecoding to true and use it as-is.
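
As a minimal sketch of that workaround, the snippet below forces skipHtmlDecoding on for every call. The wrapper name stripWithoutDecoding is hypothetical, and the sketch assumes that the v8 stripHtml function returns an object with a result property. The stripping function is passed in as an argument so the sketch does not depend on how the package is loaded.

```javascript
// Hypothetical helper (not part of string-strip-html): always passes
// skipHtmlDecoding: true, whatever other options the caller supplies.
// Assumption: stripHtml(html, opts) returns an object with a `result` string.
function stripWithoutDecoding(stripHtml, html, extraOpts = {}) {
  const opts = { ...extraOpts, skipHtmlDecoding: true };
  return stripHtml(html, opts).result;
}

// Assumed usage on the 8.x CommonJS build:
//   const { stripHtml } = require("string-strip-html");
//   const text = stripWithoutDecoding(stripHtml, someHtmlString);
```

Spreading extraOpts first means a caller cannot accidentally switch decoding back on.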

I didn't anticipate your case, and it's a valid one. Looking back, opts.skipHtmlDecoding should have defaulted to true. But it's not worth changing now, especially in the middle of the global ESM migration, when the vast majority of users are still consuming v8.

What's going on is that we recursively decode HTML entities before processing the input. There are two cases:

An entity is within a tag, for example, <a href="z"> (similar to your example); this also covers weird cases related to infosec.
An entity is outside a tag, for example, <span>£</span>.

Now, in theory, the algorithm should be aware of tag locations and decode HTML entities only outside tags, and only upon request. But that's not the case. On top of that, this program has a lot of whitespace-control logic and a callback interface. There are 513 unit tests with 100% coverage (c8 is missing an else clause, so we can't do anything about that, but coverage was 100% while we were on istanbul).
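
To illustrate why decoding before tag parsing is risky, here is a small self-contained sketch. It assumes the quotes around the font name arrived HTML-encoded as &quot;. Both decodeQuotes and naiveAttributes are hypothetical stand-ins for illustration, not the library's actual code.

```javascript
// Hypothetical stand-in for a full entity decoder: handles only &quot;.
function decodeQuotes(html) {
  return html.replace(/&quot;/g, '"');
}

// Naive attribute scanner: collects name="value" pairs from a tag string.
function naiveAttributes(tag) {
  const attrs = [];
  const re = /([^\s"=<>\/]+)="([^"]*)"/g;
  let match;
  while ((match = re.exec(tag)) !== null) {
    attrs.push({ name: match[1], value: match[2] });
  }
  return attrs;
}

const raw = '<table style="font-family: &quot;Times New Roman&quot;;">';

// Parsing the raw input keeps the whole style value intact.
const before = naiveAttributes(raw);

// Decoding first turns &quot; into real quotes, which end the value early.
const after = naiveAttributes(decodeQuotes(raw));

console.log(before[0].value); // font-family: &quot;Times New Roman&quot;;
console.log(after[0].value);  // "font-family: " (truncated at the decoded quote)
```

Once the entities are decoded, the scanner can no longer tell the attribute's delimiting quotes from the quotes inside its value, which is the kind of confusion the error above points to.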

My concern is that the current API, whatever it is, has set expectations, especially in the context of infosec. What if somebody is using this program to sanitise inputs? That's a bad idea, and this is the wrong program for that, but it's possible, right?

Another point: your example has a very unusual pattern, escaped double quotes. This does not happen in "business as usual" commercial code, because we all use Prettier, and Prettier autofixes that to an inverted-quotes pattern instead. Here's the playground proof.

Do you see?

So, I can't/won't fix, sorry.