Word differ returns out-of-bounds column numbers
sesse opened this issue · comments
It seems that when a large token is split into multiple Syntax atoms, and that token spans multiple lines, column numbers that are out-of-bounds can be created (this may be a bug in the line-numbers crate).
To reproduce, run difftastic on old.xml and new.xml (they are from the Chromium repository). Files in files.tar.gz
The badness happens somewhere in split_atom_words(), and I would assume that it is related to the comment “TODO: don't assume this atom is on a single line”. For instance, we have a single atom '.' at some point, at offset 67. We have pos
that looks like this (though we only use the first element, pos[0]):
[
SingleLineSpan { line: LineNumber: 1102 (zero-indexed: 1101), start_col: 0, end_col: 80 },
SingleLineSpan { line: LineNumber: 1103 (zero-indexed: 1102), start_col: 0, end_col: 64 },
SingleLineSpan { line: LineNumber: 1104 (zero-indexed: 1103), start_col: 0, end_col: 79 },
SingleLineSpan { line: LineNumber: 1105 (zero-indexed: 1104), start_col: 0, end_col: 15 },
SingleLineSpan { line: LineNumber: 1106 (zero-indexed: 1105), start_col: 0, end_col: 2 }
]
What we get back, and put in word_pos, is:
[SingleLineSpan { line: LineNumber: 1103 (zero-indexed: 1102), start_col: 66, end_col: 67 }]
This is nonsensical; line 1103 has only 64 columns, so column “66 to 67” doesn't exist and the diff gets confused. But on column 67 on line 1102, there is indeed a period. Why it ends up with line 1103 is a bit beyond me, because it's not entirely clear to me exactly what from_region_relative_to
is supposed to be doing.
Aha:
// Don't create a MatchedPos for empty positions at the start
// or end. We still want empty positions in the middle of
// multiline atoms, as a multiline string literal may include
// empty lines.
let pos = filter_empty_ends(pos);
This ends badly, since the content of the tag starts with a newline and there is no corresponding filtering of that. So we take the byte offset of the .
character, find its line:col within \nLINE 1\nLINE 2
and then use that line:col to offset from the start of LINE 1\nLINE 2
(no leading newline) in the document.
I guess a possible fix (which also would save some RAM) would be to just work with byte offsets everywhere instead of line:col, but that would probably be hard. I guess you also have alternatives like this hack (which is mostly for illustration purposes):
diff --git a/src/parse/syntax.rs b/src/parse/syntax.rs
index 4d9e22fa6..d25d28058 100644
--- a/src/parse/syntax.rs
+++ b/src/parse/syntax.rs
@@ -4,6 +4,7 @@
use std::{cell::Cell, env, fmt, hash::Hash, num::NonZeroU32};
+use line_numbers::LineNumber;
use line_numbers::LinePositions;
use line_numbers::SingleLineSpan;
use typed_arena::Arena;
@@ -673,6 +674,7 @@ pub(crate) struct MatchedPos {
fn split_atom_words(
content: &str,
pos: &[SingleLineSpan],
+ skipped_first: bool,
opposite_content: &str,
opposite_pos: &[SingleLineSpan],
kind: AtomKind,
@@ -700,6 +702,15 @@ fn split_atom_words(
let content_newlines = LinePositions::from(content);
let opposite_content_newlines = LinePositions::from(opposite_content);
+ let root_pos = if skipped_first {
+ SingleLineSpan {
+ line: LineNumber::from(pos[0].line.0 - 1),
+ start_col: pos[0].start_col,
+ end_col: pos[0].end_col,
+ }
+ } else {
+ pos[0]
+ };
let mut offset = 0;
let mut opposite_offset = 0;
@@ -716,7 +727,7 @@ fn split_atom_words(
},
pos: content_newlines.from_region_relative_to(
// TODO: don't assume a single line atom.
- pos[0],
+ root_pos,
offset,
offset + word.len(),
)[0],
@@ -728,7 +739,7 @@ fn split_atom_words(
// This word is present on both sides.
// TODO: don't assume this atom is on a single line.
let word_pos =
- content_newlines.from_region_relative_to(pos[0], offset, offset + word.len())
+ content_newlines.from_region_relative_to(root_pos, offset, offset + word.len())
[0];
let opposite_word_pos = opposite_content_newlines.from_region_relative_to(
opposite_pos[0],
@@ -787,18 +798,22 @@ fn has_common_words(word_diffs: &Vec<myers_diff::DiffResult<&&str>>) -> bool {
}
/// Skip line spans at the beginning or end that have zero width.
-fn filter_empty_ends(line_spans: &[SingleLineSpan]) -> Vec<SingleLineSpan> {
+fn filter_empty_ends(line_spans: &[SingleLineSpan]) -> (Vec<SingleLineSpan>, bool) {
let mut spans: Vec<SingleLineSpan> = vec![];
+ let mut skipped_first = false;
for (i, span) in line_spans.iter().enumerate() {
if (i == 0 || i == line_spans.len() - 1) && span.start_col == span.end_col {
+ if i == 0 {
+ skipped_first = true;
+ }
continue;
}
spans.push(*span);
}
- spans
+ (spans, skipped_first)
}
impl MatchedPos {
@@ -812,7 +827,7 @@ impl MatchedPos {
// or end. We still want empty positions in the middle of
// multiline atoms, as a multiline string literal may include
// empty lines.
- let pos = filter_empty_ends(pos);
+ let (pos, skipped_first) = filter_empty_ends(pos);
match ck {
ReplacedComment(this, opposite) | ReplacedString(this, opposite) => {
@@ -839,7 +854,14 @@ impl MatchedPos {
AtomKind::Comment
};
- split_atom_words(this_content, &pos, opposite_content, opposite_pos, kind)
+ split_atom_words(
+ this_content,
+ &pos,
+ skipped_first,
+ opposite_content,
+ opposite_pos,
+ kind,
+ )
}
Unchanged(opposite) => {
let opposite_pos = match opposite {