finediff is eating my words when showing the comparisons.

Question

newpen opened this issue 10 years ago

After reading #18, I can now use finediff for Chinese successfully, but sometimes it drops part of my text!

Example:

(before)
1根據相對論，信息的傳播速度有限，因此在某些情况下，例如在發生宇宙膨胀時，距离我们非常遥远的區域中我們將只能收到一小部分区域的信息，其他部分的信息将永远无法传播到我们的區域。可以被我們觀測到的时空稱為“可觀測宇宙”、“可見宇宙”或“我們的宇宙”。應該強調，這是由於時空本身的结構造成的，與我們所用的觀測設備没有關係。 T'尸F.

(after)
根據相對論，信息的傳播速度有限，因此在某些情况下，例如在發生宇宙膨胀時，距离我们非常遥远的區中我們將只能收到一小部分区域的信息，其他部分的信息将永远无法传播到我们的區域。可以被我們觀測到的時

2空稱為“可觀測宇宙”、“可見宇宙”或“我們的宇宙”。應該強調，這是由於時空本身的结構造成的，與我們所用的觀測設備没有關係。

Using FineDiff::renderDiffToHTMLFromOpcodes($a, $opcodes), the result will just be:
根據相對論，信息的傳播速度有限，因此在某些情况下，例如在發生宇宙膨胀時，距离我们非常遥远的區域空稱為“可觀測宇宙”、“可見宇宙”或“我們的宇宙”。應該強調，這是由於時空本身的结構造成的，與我們所用的觀測設備没有關係。 T'尸F.

The whole "中我們將只能收到一小部分区域的信息，其他部分的信息将永远无法传播到我们的區域。可以被我們觀測到的時

2" is missing! I don't know where to look to solve the problem. Please give me some guidance. Thanks!

Raymond Hill · Answer 1 · Fri Jan 09 2015 14:20:39 GMT+0800 (China Standard Time)

Probably one of the HTML tags ends up being inserted in the middle of a multibyte character.

FineDiff works on a binary byte basis; it doesn't know about characters. It happens to display fine for ASCII characters because they are single-byte. renderToTextFromOpcodes($from, $opcodes) should work fine, except that you will have to render to HTML yourself.

Not sure if you could find where an HTML tag splits a character and shift back or forth (depending on whether it is the opening or closing tag) to a proper character boundary.
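To illustrate the failure mode described above, here is a minimal sketch (not FineDiff code) of what happens when a byte-based cut lands inside a UTF-8 character. It assumes the mbstring extension is available and that the HTML escaping step is told the text is UTF-8. Whether FineDiff's own renderer escapes in exactly this way is an assumption, and the variable names are made up for the example.

```php
<?php
// Two CJK characters; each one occupies three bytes in UTF-8.
$text = "時空";

// substr() counts bytes, so a length of 4 keeps all of the first character
// plus only the first byte of the second one.
$piece = substr($text, 0, 4);

$byteLength   = strlen($text);                       // counts bytes
$charLength   = mb_strlen($text, 'UTF-8');           // counts characters
$pieceIsValid = mb_check_encoding($piece, 'UTF-8');  // is the slice valid UTF-8?

// With a UTF-8 charset and neither ENT_SUBSTITUTE nor ENT_IGNORE set,
// htmlspecialchars() returns an empty string when its input contains an
// invalid byte sequence, so one stray byte can blank out a whole segment.
$escaped = htmlspecialchars($piece, ENT_QUOTES, 'UTF-8');

printf(
    "bytes=%d chars=%d valid=%s escaped_length=%d\n",
    $byteLength,
    $charLength,
    var_export($pieceIsValid, true),
    strlen($escaped)
);
```

If the escaping step behaves like this, it would be one plausible reason why whole stretches of text, rather than a single character, can vanish from the rendered HTML.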

newpen · Answer 2 · Mon Jan 12 2015 17:59:09 GMT+0800 (China Standard Time)

Where can I try and set the character boundary? I tried to look into the code but it was a bit too difficult for me to follow... Thanks in advance!

Raymond Hill · Answer 3 · Mon Jan 12 2015 21:37:54 GMT+0800 (China Standard Time)

Create your own rendering handler:

public static function renderFromOpcodes($from, $opcodes, $callback);

See the code. Each time your callback is called, you may want to check whether the start and end of the segment fall on valid Unicode character boundaries, and if not, look around to fetch the missing preceding or following bytes. Frankly, it's just an untested idea, but if I had time, that's what I would look into.

Changing FineDiff code is not an option: it's completely designed to work on bytes, and these bytes could be anything. FineDiff doesn't care about their meaning.
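A callback along the following lines might be a starting point for the check described above. It is only a sketch: the argument order (opcode, source string, byte offset, byte length) is inferred from the call_user_func() call in the snippet posted later in this thread, and the function name report_segment is hypothetical. It repairs nothing; it only flags segments whose bytes are not valid UTF-8 on their own, and it assumes the mbstring extension.

```php
<?php
// Hypothetical callback. The argument order mirrors the call_user_func() call
// shown later in this thread and is an assumption about what FineDiff passes.
function report_segment($opcode, $from, $from_offset, $from_len)
{
    $segment = substr($from, $from_offset, $from_len); // byte-based slice
    $status  = mb_check_encoding($segment, 'UTF-8') ? 'ok' : 'split character';

    return sprintf('%s offset=%d len=%d %s', $opcode, $from_offset, $from_len, $status);
}

// Calling it by hand on a six-byte string made of two three-byte characters.
$from = "區域";
echo call_user_func('report_segment', 'c', $from, 0, 3), "\n"; // whole first character
echo call_user_func('report_segment', 'd', $from, 3, 2), "\n"; // cut inside the second one
```

A real rendering callback would emit output instead of returning a string; returning it here just makes the sketch easy to inspect.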

newpen · Answer 4 · Tue Jan 13 2015 11:44:18 GMT+0800 (China Standard Time)

OK thanks, but the thing is that it is missing more than one character (sometimes a few sentences), so if I shift back and forth, most likely I would just get one more character back, which doesn't really help much...

Raymond Hill · Answer 5 · Tue Jan 13 2015 11:51:06 GMT+0800 (China Standard Time)

it is missing more than one character (sometimes a few sentences)

Probably because you are looking at the broken HTML result. Look at the binary string internally, not the broken rendered HTML. Putting an HTML renderer in there was my biggest mistake. I should not have created this helper method, because FineDiff is completely binary and doesn't care what the data is. Originally it was used just to save storage, by saving only what changed.

Many users of the library think it is meant to render diffs visually on screen. That wasn't my intention at all originally.

Isn't it true that if you use renderToTextFromOpcodes($from, $opcodes), the output will be as expected?

Edit: Out of curiosity, what granularity do you use?

newpen · Answer 6 · Tue Jan 13 2015 12:17:38 GMT+0800 (China Standard Time)

I played with it for a bit. The following code seems to work in my case, but I'm not sure about other cases.

        if ( $opcode === 'c' ) { // copy n characters from source
            $shift = 0;
            $char = mb_substr($from, $from_offset+$n-3, 3);
            while ( strlen($char) == strlen(utf8_decode($char)) ) {
                $shift++;
                $char = mb_substr($from, $from_offset+$n-3-$shift, 3);
            }

            call_user_func($callback, 'c', $from, $from_offset, $n - $shift, '');
            $from_offset += $n;
        }
        else if ( $opcode === 'd' ) { // delete n characters from source
            $shift = 0;
            $char = mb_substr($from, $from_offset, 3);
            while ( strlen($char) == strlen(utf8_decode($char)) ) {
                $shift++;
                $char = mb_substr($from, $from_offset-$shift, 3);
            }

            call_user_func($callback, 'd', $from, $from_offset - $shift, $n + $shift, '');
            $from_offset += $n;
        }

Raymond Hill · Answer 7 · Tue Jan 13 2015 12:20:26 GMT+0800 (China Standard Time)

Are Chinese characters always 3 bytes long? (including whitespace, etc.)

newpen · Answer 8 · Tue Jan 13 2015 12:37:14 GMT+0800 (China Standard Time)

No... they can be 1-4 bytes... I guess I may have to refine it a bit to cover more cases... But how about insertion? I can't get it right using the same technique...
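As a quick illustration of why a fixed width of 3 cannot be relied on, the sketch below prints the UTF-8 byte length of a few sample characters. The samples are arbitrary, and the source file is assumed to be saved as UTF-8. Two of them are written as byte escapes so that the example does not depend on how an editor stores them.

```php
<?php
// strlen() reports bytes, so it shows how many bytes UTF-8 uses per character.
$samples = array(
    array(" ", 'ASCII space'),
    array("\xC3\xA9", 'e with acute accent, U+00E9'),
    array("區", 'CJK ideograph'),
    array("，", 'fullwidth comma, U+FF0C'),
    array("\xF0\xA0\x80\x80", 'CJK Extension B ideograph, U+20000'),
);

$lengths = array();
foreach ($samples as $sample) {
    list($char, $label) = $sample;
    $lengths[] = strlen($char);
    printf("%s (%s): %d byte(s)\n", $char, $label, strlen($char));
}
```

Text that mixes ASCII spaces or digits with CJK characters therefore mixes 1-byte and 3-byte characters, which is why stepping back by a fixed 3 bytes can land in the wrong place.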

Raymond Hill · Answer 9 · Tue Jan 13 2015 12:56:09 GMT+0800 (China Standard Time)

Alright, looking at the UTF-8 encoding, finding the beginning of a character seems pretty easy: if bits 7-6 of the byte are binary 10 (i.e. (byte & 0xC0) === 0x80), go back one byte and check again. As soon as bits 7-6 are not binary 10, you have the beginning of your character.

Now use the distance from the beginning of the character to the passed $from_offset to correct the start and the end, i.e. $from_offset - $distance and $n + $distance. Do this for each segment regardless of whether it is an insert, delete, or copy. I believe this should work fine, with much less overhead than what you have above.

It has been a while since I wrote PHP, so I would have to check the PHP reference again. Can we check a single byte in a string using array notation? If so, this becomes very easy.

You could do all of this without changing FineDiff, just by providing your own callback to renderFromOpcodes($from, $opcodes, $callback). It's up to you.
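The boundary check described above could look roughly like the sketch below. Single bytes can indeed be read with array notation ($str[$i]) and converted to a number with ord(). The helper names utf8_char_start and utf8_snap_segment are hypothetical, the input is assumed to be valid UTF-8, and moving both ends of a range backward is just one possible policy. It only covers ranges taken from a single string; a character whose bytes straddle a copy and an insert is not handled here.

```php
<?php
// Hypothetical helper: walk back from $offset until the byte there is not a
// UTF-8 continuation byte (continuation bytes look like 10xxxxxx in binary).
// The parentheses around the & expression matter: === binds tighter than &.
function utf8_char_start($str, $offset)
{
    $len = strlen($str);
    if ($offset >= $len) {
        return $len; // the end of the string always counts as a boundary
    }
    while ($offset > 0 && (ord($str[$offset]) & 0xC0) === 0x80) {
        $offset--;
    }
    return $offset;
}

// Hypothetical helper: move both ends of a byte range back to character
// boundaries, so consecutive ranges still cover the string without overlap.
function utf8_snap_segment($str, $offset, $length)
{
    $start = utf8_char_start($str, $offset);
    $end   = utf8_char_start($str, $offset + $length);
    return array($start, $end - $start);
}

// Byte layout: "a" at 0, first CJK character at 1..3, second at 4..6, "b" at 7.
$from = "a時空b";
list($start, $length) = utf8_snap_segment($from, 2, 4); // raw range is bytes 2..5
echo substr($from, $start, $length), "\n";
```

Because the end of one range is snapped with the same rule as the start of the next, the trailing bytes dropped from one segment are picked up by its neighbour.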

Edit: fixed mistakes

newpen · Answer 10 · Tue Jan 13 2015 13:19:27 GMT+0800 (China Standard Time)

Thanks! I also just googled it and discovered this fork
https://github.com/xrstf/PHP-FineDiff

It is working well with my Chinese characters!

Raymond Hill · Answer 11 · Tue Jan 13 2015 13:24:05 GMT+0800 (China Standard Time)

Just be aware that you won't get the same kind of performance, however, as there is no equivalent of strspn/strcspn among the mb_ functions.

Raymond Hill · Answer 12 · Wed Jan 14 2015 09:45:38 GMT+0800 (China Standard Time)

I've worked a bit on this today, because I wanted to test the idea above about nudging the boundary back and forth. The idea works, but it's all in the details. It's not perfect yet. I have figured out how to make it work perfectly, but I don't know when it will be ready.

newpen · Answer 13 · Wed Jan 14 2015 10:26:58 GMT+0800 (China Standard Time)

Cool! Thanks! I realize that the fork isn't as efficient as this one, but it can still serve as a temporary solution. Looking forward to your updates!

Erfan Attarzadeh · Answer 14 · Tue Aug 18 2015 03:26:10 GMT+0800 (China Standard Time)

See this https://github.com/xrstf/PHP-FineDiff
I used it for Farsi/Persian language and it works perfectly.