padolsey / findAndReplaceDOMText

:mag: Find and replace DOM text

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

getAggregateText function sticks together text between different nodes

ilya-khorev opened this issue Β· comments

Hi

Please, look at my test case
https://jsfiddle.net/weirrrdy/xLcfnrpk/

screenshot_1

I'd like to match the exact word "Highlighted", that's why there is a special metacharacter "\b" wrapping the regex expression. Semantically it doesn't necessary to put '\b' in this case but I'm using it in my application and it's required for the test case.

The expected behaviour here would be selecting both 'highlighted' words, however only the second phrase has matched.

The problem is the getAggregateText function has combined text of first two paragraphs into the following text: "HighlightedPhrase". Therefore the current regexp expression has skipped the word "Highlighted".

I would really appreciate if you fix this.

Sorry for the massive delay on this. Am investigating now.

A paragraph can be styled presentationally as an inline element in CSS. Likewise, an emphasis text run can be styled as a block element too.

Two adjacent ps should be handled just like two adjacent ems by default. Maybe it's better to add an additional option (element/selector-wise) for such cases.

This is a tough one...

I'm happy to make it configurable, but it seems like the default behaviour should be what people typically expect, and people probably don't typically expect <p> elements to be treated as inline. But I'm very wary of making this assumption because it could lead people to a lot of undesirable behaviour.

So, yeh, defaulting to "everything is inline" makes sense. Just trying to figure out how to make this easily configurable and obvious.

The alternative is to assume everything is a block-level element by default, and then provide the opportunity to configure inline elements.

The alternative is to assume everything is a block-level element by default, and then provide the opportunity to configure inline elements.

Do you mean with the boundary matcher \b only?

Ok, so, this is what I've come up with: (Please let me know what you both think)


PROPOSAL

Everything stays the same as it currently is, by default. The findAndReplaceDOMText function will assume that it can bleed matches through ALL elements.

If someone wants to explicitly disallow this, e.g. in the case of block-level <p> elements, then they can set a custom boundary function that returns true in the case that either of its arguments are paragraphs (the boundary function will always be passed two elements so you can decide whether to create a boundary between them):

findAndReplaceDOMText(document.getElementById('test'), {
  find: RegExp('\\b' + 'highlighted' + '\\b', 'gi'),
  boundary: function(a, b) {
    return a.nodeName.toLowerCase() === 'p' || b.nodeName.toLowerCase() === 'p';
  },
  wrap: 'em'
});

You can also set boundary to true which will mean that NO matches bleed through elements, e.g. a match must exist in contiguous text nodes.

Additionally there will be an easy way to set ALL block-level borders as boundaries:

findAndReplaceDOMText(document.getElementById('test'), {
  find: RegExp('\\b' + 'highlighted' + '\\b', 'gi'),
  boundary: findAndReplaceDOMText.BLOCK_LEVEL_ELEMENTS,
  wrap: 'em'
});

(I imagine this will be the most common usage)

Would you care to use Element.prototype.matches method instead of the fairly long nodeName comparison?

https://github.com/ethantw/fibre.js/blob/6857deac21/dist/fibre.js#L18-28

Oo, yep, didn't realize that had such good support. Still a shame about IE8.

Thanks for the answer.

The second option is the best one, at least to me. That's what I'm looking for, if I understood you correctly.
boundary: findAndReplaceDOMText.BLOCK_LEVEL_ELEMENTS,

Some progress: d41ae22 introduces something similar to my proposal above. Instead of 'boundaries' I've chosen the term 'contexts' and specifically a new option: forceContext which accepts either a boolean or a function that returns a boolean. With this concept you can prescribe that any specific element creates a new matching context (so that no matches bleed through the element in either direction).

So, for the block-level use-case, you can do:

findAndReplaceDOMText(document.getElementById('test'), {
  find: RegExp('\\b' + 'highlighted' + '\\b', 'gi'),
  forceContext: findAndReplaceDOMText.BLOCK_LEVEL_MATCH,
  wrap: 'em'
});

This means that any block-level elements will create new matching contexts. The findAndReplaceDOMText.BLOCK_LEVEL_MATCH function returns true if passed a traditionally block-level element.

You can set forceContext to your own function too, and even use Element.matches to figure out which elements to force a new context on.


If this looks good to you, I'll merge into master and push a new release. (Shall timeout in 48h and assume we're all happy to go ahead with it if I don't hear anything).

@weirdy: Here's your JSFiddle with the change added: https://jsfiddle.net/8dcqgcoa/1/

Awesome! Thank you, good work.

Great! Thanks for the significant fix. I have two questions though.

  • Even without the \b matcher they'll still be considered as another context?
  • Also, I believe that br, details, dt, img, li, summary, tbody, td, th, thead, tr (, etc) are accidentally missed out?

The br and img (like hr) are necessary for cases like,

Highlight<img src="./path"><span>xxx</span>

https://developer.mozilla.org/en/docs/Web/HTML/Element

Ah, good point; I'll review the list of block elements. Actually, I think what we want is not a list of block elements, per se, but instead a list of all elements that are not inline textual elements (including <img> and <br> and <hr> etc.). I'll work on this.

Agree. πŸ‘

Until I find a better name for this it's gonna be called "NON_INLINE_PROSE":

findAndReplaceDOMText(document.getElementById('test'), {
  find: RegExp('\\b' + 'highlighted' + '\\b', 'gi'),
  forceContext: findAndReplaceDOMText.NON_INLINE_PROSE,
  wrap: 'em'
});

This will force a context on all of the following elements:

NON_PROSE_ELEMENTS = {
    // Block Elements
    address:1, article:1, aside:1, blockquote:1, canvas:1, dd:1, div:1,
    dl:1, fieldset:1, figcaption:1, figure:1, footer:1, form:1, h1:1, h2:1, h3:1,
    h4:1, h5:1, h6:1, header:1, hgroup:1, hr:1, main:1, nav:1, noscript:1, ol:1,
    output:1, p:1, pre:1, section:1, ul:1,
    // Other misc. elements that are not part of continuous inline prose:
    br:1, li: 1, summary: 1, dt:1, details:1,
    // Media / Source elements:
    script:1, style:1, img:1, video:1, audio:1, canvas:1, svg:1, map:1, object:1,
    // Input elements
    input:1, textarea:1, select:1, option:1, optgroup: 1, button:1,
    // Table related elements:
    table:1, tbody:1, thead:1, th:1, tr:1, td:1, caption:1, col:1, tfoot:1, colgroup:1
};

It's probably missing a few (let me know if you notice any). Thanks!

PR happening in #30 (shall merge soon I hope). I also need to add a chunk to the readme to explain what on earth this is all about :P

Until I find a better name for this it's gonna be called "NON_INLINE_PROSE":

Awesome, I think that's comprehensible enough. πŸ‘

It's probably missing a few (let me know if you notice any).

I would like to suggest we add rp, rt and rtc, but exclude ruby and rb. The rt element is used to wrap the annotation for its corresponding inline text run, while rp is used to provide fallback. Their box display attribute would not be inline as well. For example,

He invented <ruby>WWW<rp>(<rt>world wide web<rp>)</ruby> that helps people connect to one another.

2015-04-26 00 58 46

Well, maybe they belong to those who need ignoring instead?

As for ruby and rb, they are both inline and part of the main content. Would make more sense for them to remain outside of the list.

And what do you think about the elements designed for web component such as template?

Cool, I've added rp, rt and rtc.

Right now there's no easy way to filter out non-prose elements completely. So, even though they can now have "forced contexts" via NON_INLINE_PROSE, things like <style> and <script> are, by default, still searched.

So... I think it'd be worth adding a "prose" preset, like:

findAndReplaceDOMText(..., {
  preset: 'prose'
});

And it could do the following:

  • Only searches textual elements (not STYLE, SCRIPT, OBJECT, etc.)
    • I.e. they're completely avoided/skipped
  • Forces contexts using the NON_INLINE_PROSE set of elements

It would also avoid <template>.

Seems like this is the most sensible/common usage of the lib, and it wouldn't be a breaking change since it's a brand new option. Thoughts?

Excellent!

I thought styles, scripts, etc are already skipped by your function. If not, it would be a good option.
It's worth making this as a default behavior.

Seems like this is the most sensible/common usage of the lib, and it wouldn't be a breaking change since it's a brand new option. Thoughts?

Agree. 🌈

Ok, all shipped (0.4.3). πŸ‘