Markdown comparison includes headings unnecessarily

Question

Markdown comparison includes headings unnecessarily

adaboese opened this issue 9 months ago · comments

it('diffs markdown (2)', () => {
  const a =
    '## AIMD generated content\n\nJust for fun, here is the same article scrambled using AIMD anti-AI detection techniques:';
  const b =
    "## AIMD generated content\n\nJust for fun, here is the same article scrambled using AIMD anti-AI detection techniques. If you're curious about building depth on related topics and establishing yourself as an authority, check out our detailed guide on [Understanding Topical Authority](https://aimd.app/blog/2023-12-26-revolutionizing-the-marketing-hierarchy-why-topical-authority-is-the-new-currency).";

  const diff = diffSentences(a, b);
});

Produces:

[
  {
    count: 1,
    added: undefined,
    removed: true,
    value: '## AIMD generated content\n' +
      '\n' +
      'Just for fun, here is the same article scrambled using AIMD anti-AI detection techniques:'
  },
  {
    count: 4,
    added: true,
    removed: undefined,
    value: '## AIMD generated content\n' +
      '\n' +
      "Just for fun, here is the same article scrambled using AIMD anti-AI detection techniques. If you're curious about building depth on related topics and establishing yourself as an authority, check out our detailed guide on [Understanding Topical Authority](https://aimd.app/blog/2023-12-26-revolutionizing-the-marketing-hierarchy-why-topical-authority-is-the-new-currency)."
  }
]

It is not clear why ## AIMD generated content\n is marked as removed.

Ada Boese · Answer 1 · Tue Jan 02 2024 00:03:56 GMT+0800 (China Standard Time)

Interestingly, if I paste the same content into http://incaseofstairs.com/jsdiff/, I cannot replicate the issue.

## AIMD generated content

Just for fun, here is the same article scrambled using AIMD anti-AI detection techniques:

## AIMD generated content

Just for fun, here is the same article scrambled using AIMD anti-AI detection techniques. If you're curious about building depth on related topics and establishing yourself as an authority, check out our detailed guide on [Understanding Topical Authority](https://aimd.app/blog/2023-12-26-revolutionizing-the-marketing-hierarchy-why-topical-authority-is-the-new-currency).

Ada Boese · Answer 2 · Tue Jan 02 2024 00:06:02 GMT+0800 (China Standard Time)

@kpdecker does http://incaseofstairs.com/jsdiff/ do something different than what's in my code?

Ada Boese · Answer 3 · Tue Jan 02 2024 00:13:07 GMT+0800 (China Standard Time)

oh, it is not using sentences. It uses words diff. I will evaluate that.

Ada Boese · Answer 4 · Tue Jan 02 2024 00:19:19 GMT+0800 (China Standard Time)

Looks like switching to words algorithm comes with tradeoffs though.

it.only('recognizes different sentences', () => {
  const a = 'This is a sentence. This is another sentence.';
  const b = 'This is a sentence. And this is a different sen.';
  const abDiff = diff(a, b);

  expect(abDiff).toStrictEqual([
    {
      count: 17,
      value: 'This is a sentence. This is ',
    },
    {
      added: true,
      count: 9,
      removed: undefined,
      value: 'a different',
    },
    {
      count: 9,
      value: ' sentence.',
    },
  ]);
});

This produces a fairly unreadable diff.

[
  { count: 9, value: 'This is a sentence. ' },
  { count: 1, added: undefined, removed: true, value: 'This' },
  { count: 1, added: true, removed: undefined, value: 'And' },
  { count: 1, value: ' ' },
  { count: 2, added: true, removed: undefined, value: 'this ' },
  { count: 2, value: 'is ' },
  { count: 1, added: undefined, removed: true, value: 'another' },
  { count: 1, added: true, removed: undefined, value: 'a' },
  { count: 1, value: ' ' },
  { count: 1, added: undefined, removed: true, value: 'sentence' },
  { count: 3, added: true, removed: undefined, value: 'different sen' },
  { count: 1, value: '.' }
]

Ada Boese · Answer 5 · Tue Jan 02 2024 00:50:08 GMT+0800 (China Standard Time)

Based on my tests https://github.com/google/diff-match-patch is producing the best results when used with diff_cleanupSemantic. Looks I could continue using jsdiff if I just extract logic of diff_cleanupSemantic.

Ada Boese · Answer 6 · Tue Jan 02 2024 00:57:59 GMT+0800 (China Standard Time)

Hm. That won't work as easy as I thought it would. It self-references the constructor and bunch of prototype functions. It would be nice if jsdiff supported equivalent logic as that package does not appear to be actively maintained.

Mark Amery · Answer 7 · Tue Aug 20 2024 19:29:14 GMT+0800 (China Standard Time)

diffSentences, right now, purports to simply diff two strings of "text" - not of Markdown. But the rules you want only make sense if the text is Markdown and the tokenizer knows that.

We could make newlines be treated as a sentence break, but that wouldn't actually give you what you want here, because Markdown lets you hard-wrap paragraphs with newlines that have no effect on the rendered result. Nor would treating only double-newlines as sentence breaks work, since a heading can be terminated with a single newline. The only solution is a tokenizer that recognises the start of a heading (or perhaps also a bullet point or numbered list item, if you want to treat list items as always being distinct sentences).

I thus don't really see a possible change to jsdiff that would give you what you want without also baking an assumption into diffSentences that its arguments are Markdown, thus I think I probably shouldn't try to "fix" this issue with a change to the library. To get the behaviour you want, you can instead write some custom logic to split the Markdown documents you're diffing into "sentences" (per whatever precise notion of "sentence" you want to use and then pass the result to diffArrays, and my instinct (admittedly without too much research or deep thought) is that that's the only strategy the library should support, for three reasons:

the exact rules about what terminates a heading or list item or whatever may well be dependent on the Markdown dialect you're using; I know Stack Overflow's flavour of Markdown and GitHub's flavour of Markdown used to be quite different, and don't know whether things are more standardized today.
there are several questions I can think of about how to split Markdown into "sentences" that don't have obvious, objective answer (e.g. is each list item in a bulleted or numbered list a distinct sentence? Do inline images in a paragraph constitute sentence breaks? How should code blocks be handled?)
In general, I would like to encourage people diffing text in computer languages to take an approach of running a language-specific parser/tokenizer, combining or splitting language tokens as they see fit for their particular purposes, and passing an array of tokens to diffArrays, rather than bloating jsdiff with more language-specific diff functions.