getAggregateText function sticks together text between different nodes

Question

getAggregateText function sticks together text between different nodes

ilya-khorev opened this issue 9 years ago · comments

Hi

Please, look at my test case
https://jsfiddle.net/weirrrdy/xLcfnrpk/

I'd like to match the exact word "Highlighted", that's why there is a special metacharacter "\b" wrapping the regex expression. Semantically it doesn't necessary to put '\b' in this case but I'm using it in my application and it's required for the test case.

The expected behaviour here would be selecting both 'highlighted' words, however only the second phrase has matched.

The problem is the getAggregateText function has combined text of first two paragraphs into the following text: "HighlightedPhrase". Therefore the current regexp expression has skipped the word "Highlighted".

I would really appreciate if you fix this.

James Padolsey · Answer 1 · Mon Apr 20 2015 20:58:25 GMT+0800 (China Standard Time)

Sorry for the massive delay on this. Am investigating now.

Chen Yijun · Answer 2 · Mon Apr 20 2015 22:16:51 GMT+0800 (China Standard Time)

A paragraph can be styled presentationally as an inline element in CSS. Likewise, an emphasis text run can be styled as a block element too.

Two adjacent ps should be handled just like two adjacent ems by default. Maybe it's better to add an additional option (element/selector-wise) for such cases.

James Padolsey · Answer 3 · Mon Apr 20 2015 22:26:36 GMT+0800 (China Standard Time)

This is a tough one...

I'm happy to make it configurable, but it seems like the default behaviour should be what people typically expect, and people probably don't typically expect <p> elements to be treated as inline. But I'm very wary of making this assumption because it could lead people to a lot of undesirable behaviour.

James Padolsey · Answer 4 · Mon Apr 20 2015 22:28:26 GMT+0800 (China Standard Time)

So, yeh, defaulting to "everything is inline" makes sense. Just trying to figure out how to make this easily configurable and obvious.

James Padolsey · Answer 5 · Mon Apr 20 2015 22:35:40 GMT+0800 (China Standard Time)

The alternative is to assume everything is a block-level element by default, and then provide the opportunity to configure inline elements.

Chen Yijun · Answer 6 · Tue Apr 21 2015 00:32:43 GMT+0800 (China Standard Time)

The alternative is to assume everything is a block-level element by default, and then provide the opportunity to configure inline elements.

Do you mean with the boundary matcher \b only?

James Padolsey · Answer 7 · Tue Apr 21 2015 00:47:14 GMT+0800 (China Standard Time)

Ok, so, this is what I've come up with: (Please let me know what you both think)

PROPOSAL

Everything stays the same as it currently is, by default. The findAndReplaceDOMText function will assume that it can bleed matches through ALL elements.

If someone wants to explicitly disallow this, e.g. in the case of block-level <p> elements, then they can set a custom boundary function that returns true in the case that either of its arguments are paragraphs (the boundary function will always be passed two elements so you can decide whether to create a boundary between them):

findAndReplaceDOMText(document.getElementById('test'), {
  find: RegExp('\\b' + 'highlighted' + '\\b', 'gi'),
  boundary: function(a, b) {
    return a.nodeName.toLowerCase() === 'p' || b.nodeName.toLowerCase() === 'p';
  },
  wrap: 'em'
});

You can also set boundary to true which will mean that NO matches bleed through elements, e.g. a match must exist in contiguous text nodes.

Additionally there will be an easy way to set ALL block-level borders as boundaries:

findAndReplaceDOMText(document.getElementById('test'), {
  find: RegExp('\\b' + 'highlighted' + '\\b', 'gi'),
  boundary: findAndReplaceDOMText.BLOCK_LEVEL_ELEMENTS,
  wrap: 'em'
});

(I imagine this will be the most common usage)

Chen Yijun · Answer 8 · Tue Apr 21 2015 01:17:39 GMT+0800 (China Standard Time)

Would you care to use Element.prototype.matches method instead of the fairly long nodeName comparison?

https://github.com/ethantw/fibre.js/blob/6857deac21/dist/fibre.js#L18-28

James Padolsey · Answer 9 · Tue Apr 21 2015 01:20:09 GMT+0800 (China Standard Time)

Oo, yep, didn't realize that had such good support. Still a shame about IE8.

Ilya Khorev · Answer 10 · Tue Apr 21 2015 04:00:22 GMT+0800 (China Standard Time)

Thanks for the answer.

The second option is the best one, at least to me. That's what I'm looking for, if I understood you correctly.
boundary: findAndReplaceDOMText.BLOCK_LEVEL_ELEMENTS,

James Padolsey · Answer 11 · Wed Apr 22 2015 22:41:47 GMT+0800 (China Standard Time)

Some progress: d41ae22 introduces something similar to my proposal above. Instead of 'boundaries' I've chosen the term 'contexts' and specifically a new option: forceContext which accepts either a boolean or a function that returns a boolean. With this concept you can prescribe that any specific element creates a new matching context (so that no matches bleed through the element in either direction).

So, for the block-level use-case, you can do:

findAndReplaceDOMText(document.getElementById('test'), {
  find: RegExp('\\b' + 'highlighted' + '\\b', 'gi'),
  forceContext: findAndReplaceDOMText.BLOCK_LEVEL_MATCH,
  wrap: 'em'
});

This means that any block-level elements will create new matching contexts. The findAndReplaceDOMText.BLOCK_LEVEL_MATCH function returns true if passed a traditionally block-level element.

You can set forceContext to your own function too, and even use Element.matches to figure out which elements to force a new context on.

If this looks good to you, I'll merge into master and push a new release. (Shall timeout in 48h and assume we're all happy to go ahead with it if I don't hear anything).

@weirdy: Here's your JSFiddle with the change added: https://jsfiddle.net/8dcqgcoa/1/

Ilya Khorev · Answer 12 · Wed Apr 22 2015 23:42:55 GMT+0800 (China Standard Time)

Awesome! Thank you, good work.

Chen Yijun · Answer 13 · Wed Apr 22 2015 23:43:29 GMT+0800 (China Standard Time)

Great! Thanks for the significant fix. I have two questions though.

Even without the \b matcher they'll still be considered as another context?
Also, I believe that br, details, dt, img, li, summary, tbody, td, th, thead, tr (, etc) are accidentally missed out?

The br and img (like hr) are necessary for cases like,

Highlight<img src="./path"><span>xxx</span>

https://developer.mozilla.org/en/docs/Web/HTML/Element

James Padolsey · Answer 14 · Wed Apr 22 2015 23:50:14 GMT+0800 (China Standard Time)

Ah, good point; I'll review the list of block elements. Actually, I think what we want is not a list of block elements, per se, but instead a list of all elements that are not inline textual elements (including <img> and <br> and <hr> etc.). I'll work on this.

Chen Yijun · Answer 15 · Wed Apr 22 2015 23:56:22 GMT+0800 (China Standard Time)

Agree. 👍

James Padolsey · Answer 16 · Sun Apr 26 2015 00:28:42 GMT+0800 (China Standard Time)

Until I find a better name for this it's gonna be called "NON_INLINE_PROSE":

findAndReplaceDOMText(document.getElementById('test'), {
  find: RegExp('\\b' + 'highlighted' + '\\b', 'gi'),
  forceContext: findAndReplaceDOMText.NON_INLINE_PROSE,
  wrap: 'em'
});

This will force a context on all of the following elements:

NON_PROSE_ELEMENTS = {
    // Block Elements
    address:1, article:1, aside:1, blockquote:1, canvas:1, dd:1, div:1,
    dl:1, fieldset:1, figcaption:1, figure:1, footer:1, form:1, h1:1, h2:1, h3:1,
    h4:1, h5:1, h6:1, header:1, hgroup:1, hr:1, main:1, nav:1, noscript:1, ol:1,
    output:1, p:1, pre:1, section:1, ul:1,
    // Other misc. elements that are not part of continuous inline prose:
    br:1, li: 1, summary: 1, dt:1, details:1,
    // Media / Source elements:
    script:1, style:1, img:1, video:1, audio:1, canvas:1, svg:1, map:1, object:1,
    // Input elements
    input:1, textarea:1, select:1, option:1, optgroup: 1, button:1,
    // Table related elements:
    table:1, tbody:1, thead:1, th:1, tr:1, td:1, caption:1, col:1, tfoot:1, colgroup:1
};

It's probably missing a few (let me know if you notice any). Thanks!

PR happening in #30 (shall merge soon I hope). I also need to add a chunk to the readme to explain what on earth this is all about :P

Chen Yijun · Answer 17 · Sun Apr 26 2015 01:03:27 GMT+0800 (China Standard Time)

Until I find a better name for this it's gonna be called "NON_INLINE_PROSE":

Awesome, I think that's comprehensible enough. 👍

It's probably missing a few (let me know if you notice any).

I would like to suggest we add rp, rt and rtc, but exclude ruby and rb. The rt element is used to wrap the annotation for its corresponding inline text run, while rp is used to provide fallback. Their box display attribute would not be inline as well. For example,

He invented <ruby>WWW<rp>(<rt>world wide web<rp>)</ruby> that helps people connect to one another.

Well, maybe they belong to those who need ignoring instead?

As for ruby and rb, they are both inline and part of the main content. Would make more sense for them to remain outside of the list.

And what do you think about the elements designed for web component such as template?

James Padolsey · Answer 18 · Sun Apr 26 2015 04:05:34 GMT+0800 (China Standard Time)

Cool, I've added rp, rt and rtc.

Right now there's no easy way to filter out non-prose elements completely. So, even though they can now have "forced contexts" via NON_INLINE_PROSE, things like <style> and <script> are, by default, still searched.

So... I think it'd be worth adding a "prose" preset, like:

findAndReplaceDOMText(..., {
  preset: 'prose'
});

And it could do the following:

Only searches textual elements (not STYLE, SCRIPT, OBJECT, etc.)
- I.e. they're completely avoided/skipped
Forces contexts using the NON_INLINE_PROSE set of elements

It would also avoid <template>.

Seems like this is the most sensible/common usage of the lib, and it wouldn't be a breaking change since it's a brand new option. Thoughts?

Ilya Khorev · Answer 19 · Sun Apr 26 2015 17:28:18 GMT+0800 (China Standard Time)

Excellent!

I thought styles, scripts, etc are already skipped by your function. If not, it would be a good option.
It's worth making this as a default behavior.

Chen Yijun · Answer 20 · Sun Apr 26 2015 20:48:04 GMT+0800 (China Standard Time)

Seems like this is the most sensible/common usage of the lib, and it wouldn't be a breaking change since it's a brand new option. Thoughts?

Agree. 🌈

James Padolsey · Answer 21 · Tue Apr 28 2015 19:07:16 GMT+0800 (China Standard Time)

Ok, all shipped (0.4.3). 🐑