Word differ returns out-of-bounds column numbers

Question

Word differ returns out-of-bounds column numbers

sesse opened this issue 2 months ago · comments

Steinar H. Gunderson commented 2 months ago

It seems that when a large token is split into multiple Syntax atoms, and that token spans multiple lines, column numbers that are out-of-bounds can be created (this may be a bug in the line-numbers crate).

To reproduce, run difftastic on old.xml and new.xml (they are from the Chromium repository). Files in files.tar.gz

You will get:

The badness happens somewhere in split_atom_words(), and I would assume that it is related to the comment “TODO: don't assume this atom is on a single line”. For instance, we have a single atom '.' at some point, at offset 67. We have pos that looks like this (though we only use the first element, pos[0]):

[
  SingleLineSpan { line: LineNumber: 1102 (zero-indexed: 1101), start_col: 0, end_col: 80 },
  SingleLineSpan { line: LineNumber: 1103 (zero-indexed: 1102), start_col: 0, end_col: 64 },
  SingleLineSpan { line: LineNumber: 1104 (zero-indexed: 1103), start_col: 0, end_col: 79 },
  SingleLineSpan { line: LineNumber: 1105 (zero-indexed: 1104), start_col: 0, end_col: 15 },
  SingleLineSpan { line: LineNumber: 1106 (zero-indexed: 1105), start_col: 0, end_col: 2 }
]

What we get back, and put in word_pos, is:

[SingleLineSpan { line: LineNumber: 1103 (zero-indexed: 1102), start_col: 66, end_col: 67 }]

This is nonsensical; line 1103 has only 64 columns, so column “66 to 67” doesn't exist and the diff gets confused. But on column 67 on line 1102, there is indeed a period. Why it ends up with line 1103 is a bit beyond me, because it's not entirely clear to me exactly what from_region_relative_to is supposed to be doing.

Steinar H. Gunderson · Answer 1 · Tue Apr 09 2024 20:17:25 GMT+0800 (China Standard Time)

Aha:

        // Don't create a MatchedPos for empty positions at the start                                                                                 
        // or end. We still want empty positions in the middle of                                                                                     
        // multiline atoms, as a multiline string literal may include                                                                                 
        // empty lines.                                                                                                                               
        let pos = filter_empty_ends(pos);

This ends badly, since the content of the tag starts with a newline and there is no corresponding filtering of that. So we take the byte offset of the . character, find its line:col within \nLINE 1\nLINE 2 and then use that line:col to offset from the start of LINE 1\nLINE 2 (no leading newline) in the document.

I guess a possible fix (which also would save some RAM) would be to just work with byte offsets everywhere instead of line:col, but that would probably be hard. I guess you also have alternatives like this hack (which is mostly for illustration purposes):

diff --git a/src/parse/syntax.rs b/src/parse/syntax.rs
index 4d9e22fa6..d25d28058 100644
--- a/src/parse/syntax.rs
+++ b/src/parse/syntax.rs
@@ -4,6 +4,7 @@
 
 use std::{cell::Cell, env, fmt, hash::Hash, num::NonZeroU32};
 
+use line_numbers::LineNumber;
 use line_numbers::LinePositions;
 use line_numbers::SingleLineSpan;
 use typed_arena::Arena;
@@ -673,6 +674,7 @@ pub(crate) struct MatchedPos {
 fn split_atom_words(
     content: &str,
     pos: &[SingleLineSpan],
+    skipped_first: bool,
     opposite_content: &str,
     opposite_pos: &[SingleLineSpan],
     kind: AtomKind,
@@ -700,6 +702,15 @@ fn split_atom_words(
 
     let content_newlines = LinePositions::from(content);
     let opposite_content_newlines = LinePositions::from(opposite_content);
+    let root_pos = if skipped_first {
+        SingleLineSpan {
+            line: LineNumber::from(pos[0].line.0 - 1),
+            start_col: pos[0].start_col,
+            end_col: pos[0].end_col,
+        }
+    } else {
+        pos[0]
+    };
 
     let mut offset = 0;
     let mut opposite_offset = 0;
@@ -716,7 +727,7 @@ fn split_atom_words(
                         },
                         pos: content_newlines.from_region_relative_to(
                             // TODO: don't assume a single line atom.
-                            pos[0],
+                            root_pos,
                             offset,
                             offset + word.len(),
                         )[0],
@@ -728,7 +739,7 @@ fn split_atom_words(
                 // This word is present on both sides.
                 // TODO: don't assume this atom is on a single line.
                 let word_pos =
-                    content_newlines.from_region_relative_to(pos[0], offset, offset + word.len())
+                    content_newlines.from_region_relative_to(root_pos, offset, offset + word.len())
                         [0];
                 let opposite_word_pos = opposite_content_newlines.from_region_relative_to(
                     opposite_pos[0],
@@ -787,18 +798,22 @@ fn has_common_words(word_diffs: &Vec<myers_diff::DiffResult<&&str>>) -> bool {
 }
 
 /// Skip line spans at the beginning or end that have zero width.
-fn filter_empty_ends(line_spans: &[SingleLineSpan]) -> Vec<SingleLineSpan> {
+fn filter_empty_ends(line_spans: &[SingleLineSpan]) -> (Vec<SingleLineSpan>, bool) {
     let mut spans: Vec<SingleLineSpan> = vec![];
+    let mut skipped_first = false;
 
     for (i, span) in line_spans.iter().enumerate() {
         if (i == 0 || i == line_spans.len() - 1) && span.start_col == span.end_col {
+            if i == 0 {
+                skipped_first = true;
+            }
             continue;
         }
 
         spans.push(*span);
     }
 
-    spans
+    (spans, skipped_first)
 }
 
 impl MatchedPos {
@@ -812,7 +827,7 @@ impl MatchedPos {
         // or end. We still want empty positions in the middle of
         // multiline atoms, as a multiline string literal may include
         // empty lines.
-        let pos = filter_empty_ends(pos);
+        let (pos, skipped_first) = filter_empty_ends(pos);
 
         match ck {
             ReplacedComment(this, opposite) | ReplacedString(this, opposite) => {
@@ -839,7 +854,14 @@ impl MatchedPos {
                     AtomKind::Comment
                 };
 
-                split_atom_words(this_content, &pos, opposite_content, opposite_pos, kind)
+                split_atom_words(
+                    this_content,
+                    &pos,
+                    skipped_first,
+                    opposite_content,
+                    opposite_pos,
+                    kind,
+                )
             }
             Unchanged(opposite) => {
                 let opposite_pos = match opposite {