Potential improvement to semantic cleanup?
GoogleCodeExporter opened this issue
What steps will reproduce the problem?
1. Go to:
http://neil.fraser.name/software/diff_match_patch/svn/trunk/demos/demo_diff.html
2. Enter the following text for version 1:
The centrifugal moult is modified in the tail feathers. The general pattern
seen in passerines is that the primaries are replaced outward, secondaries
inward, and the tail from center outward.
3. Enter the following text for version 2:
The centrifugal moult is modified in the tail feathers. The greater primary
coverts are moulted in synchrony with the primary that they overlap. The
general pattern seen in passerines is that the primaries are replaced outward,
secondaries inward, and the tail from center outward.
What is the expected output? What do you see instead?
The output is correct. However, from a human / semantic point of view the
algorithm picks the wrong "The" to be the inserted one. Looking at the output I
would have expected that the second "The" in version 2 would be the inserted
one, not the third "The".
Could semantic cleanup include a step that compares the last character of the
preceding equal run with the last character of the inserted run and, if they
match, shifts the insertion one character to the left, repeating while the two
characters are the same?
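The shift proposed here can be sketched in a few lines of Python. This is only an illustration of the idea, not code from diff_match_patch: the function name `shift_insert_left` and the three-string representation (equal run, inserted run, following equal run) are assumptions made for the example.

```python
def shift_insert_left(prefix, inserted, suffix):
    """Slide an insertion left while the last character of the preceding
    equal run matches the last character of the inserted run.

    Hypothetical helper: prefix + suffix is the old text and
    prefix + inserted + suffix is the new text, before and after the shift.
    """
    while prefix and inserted and prefix[-1] == inserted[-1]:
        ch = prefix[-1]
        prefix = prefix[:-1]            # equal run loses its last character
        inserted = ch + inserted[:-1]   # insertion rotates one step left
        suffix = ch + suffix            # following equal run gains it
    return prefix, inserted, suffix


# Shortened stand-in for the sentences in this report.
print(shift_insert_left("A. The ", "dog. The ", "cat."))
# -> ('A', '. The dog', '. The cat.')
```

Note that the loop only stops when the characters differ, so it slides past the sentence boundary and ends with the insertion starting at the full stop. A plain "repeat while they are the same" rule would therefore still need some way to choose the best stopping point among the positions it passes through.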
What version of the product are you using? On what operating system?
This version:
http://neil.fraser.name/software/diff_match_patch/svn/trunk/demos/demo_diff.html
Please provide any additional information below.
Original issue reported on code.google.com by m...@dixon.se
on 11 Nov 2011 at 2:07
That's odd. diff_cleanupSemanticScore already has a weighting for hugging
punctuation. It should already be doing exactly what you suggest. I'll look
into it. Thanks!
Original comment by neil.fra...@gmail.com
on 11 Nov 2011 at 5:47
- Changed state: Started
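As a rough illustration of what a weighting for hugging punctuation could look like, the sketch below scores each candidate position of a sliding insertion and keeps the best one. It is not the library's diff_cleanupSemanticScore: the names `boundary_score` and `best_shift` and the weights 0 to 3 are invented for this example.

```python
def boundary_score(left, right):
    """Toy score for a split between `left` and `right`; higher is better.
    The weights are made up for illustration."""
    if not left or not right:
        return 3                      # edge of the text
    a, b = left[-1], right[0]
    if a in ".!?" and b.isspace():
        return 2                      # sentence end followed by whitespace
    if a.isspace() or b.isspace():
        return 1                      # word boundary
    return 0                          # split inside a word


def best_shift(prefix, inserted, suffix):
    """Slide the insertion left while characters match and return the
    position whose two boundaries score highest."""
    def total(p, i, s):
        return boundary_score(p, i) + boundary_score(i, s)

    best, best_total = (prefix, inserted, suffix), total(prefix, inserted, suffix)
    while prefix and inserted and prefix[-1] == inserted[-1]:
        ch = prefix[-1]
        prefix, inserted, suffix = prefix[:-1], ch + inserted[:-1], ch + suffix
        score = total(prefix, inserted, suffix)
        if score > best_total:
            best, best_total = (prefix, inserted, suffix), score
    return best


print(best_shift("A. The ", "dog. The ", "cat."))
# -> ('A.', ' The dog.', ' The cat.')
```

With these toy weights the insertion settles right after the full stop, so the inserted run is a whole sentence instead of beginning partway into one.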
Revision 100 should resolve this issue.
Cool bug, thank you again!
Original comment by neil.fra...@gmail.com
on 18 Nov 2011 at 12:15
- Changed state: Fixed