gorhill / PHP-FineDiff

A PHP implementation of a Fine granularity Diff engine: Diff can be computed up to character-level

Home Page: http://www.raymondhill.net/finediff/

finediff is eating my words when showing the comparisons.

newpen opened this issue · comments

After reading #18, I can now use FineDiff for Chinese successfully, but sometimes it eats part of my text!

Example:


(before)
1根據相對論,信息的傳播速度有限,因此在某些情况下,例如在發生宇宙膨胀時,距离我们非常遥远的區域中我們將只能收到一小部分区域的信息,其他部分的信息将永远无法传播到我们的區域。可以被我們觀測到的时空稱為“可觀測宇宙”、“可見宇宙”或“我們的宇宙”。應該強調,這是由於時空本身的结構造成的,與我們所用的觀測設備没有關係。 T'尸F.


(after)
根據相對論,信息的傳播速度有限,因此在某些情况下,例如在發生宇宙膨胀時,距离我们非常遥远的區中我們將只能收到一小部分区域的信息,其他部分的信息将永远无法传播到我们的區域。可以被我們觀測到的時

2空稱為“可觀測宇宙”、“可見宇宙”或“我們的宇宙”。應該強調,這是由於時空本身的结構造成的,與我們所用的觀測設備没有關係。


Using FineDiff::renderDiffToHTMLFromOpcodes($a, $opcodes), the result will just be:
根據相對論,信息的傳播速度有限,因此在某些情况下,例如在發生宇宙膨胀時,距离我们非常遥远的區域空稱為“可觀測宇宙”、“可見宇宙”或“我們的宇宙”。應該強調,這是由於時空本身的结構造成的,與我們所用的觀測設備没有關係。 T'尸F.


The whole "中我們將只能收到一小部分区域的信息,其他部分的信息将永远无法传播到我们的區域。可以被我們觀測到的時

2" is missing! I don't know where to look to solve the problem. Please give me some guidance. Thanks!

Probably one of the HTML tags ends up being inserted in the middle of a multibyte character.

FineDiff works on a binary byte basis; it doesn't know about characters. It happens to display fine for ASCII characters because they are single-byte. renderToTextFromOpcodes($from, $opcodes) should work fine, except that you will have to render to HTML yourself.

I'm not sure whether you could find where an HTML tag splits a character and shift back or forth (depending on whether it is the opening or closing tag) to a proper character boundary.
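To illustrate the byte-versus-character point, here is a minimal sketch that does not use FineDiff at all. It assumes the text is UTF-8 and that the mbstring extension is available; the variable names are made up for the example.

```php
<?php
// The bytes below spell "時空" in UTF-8 ("時" = E6 99 82, "空" = E7 A9 BA).
$text = "\xE6\x99\x82\xE7\xA9\xBA";

// strlen() counts bytes; mb_strlen() counts characters.
$byteLength = strlen($text);
$charLength = mb_strlen($text, 'UTF-8');

// A byte-level cut after 2 bytes lands inside the first character.
$head = substr($text, 0, 2);
$tail = substr($text, 2);

// Neither half is valid UTF-8 on its own, although together they are.
$headIsValid  = mb_check_encoding($head, 'UTF-8');
$tailIsValid  = mb_check_encoding($tail, 'UTF-8');
$wholeIsValid = mb_check_encoding($head . $tail, 'UTF-8');
```

Wrapping either half in an HTML tag therefore puts markup in the middle of a character, which is the situation described above.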

Where can I try to set the character boundary? I tried to look into the code, but it was a bit too difficult for me to follow... Thanks in advance!

Create your own rendering handler:

public static function renderFromOpcodes($from, $opcodes, $callback);

See the code. Each time your callback is called, you may want to check whether the start and end of the segment fall on valid Unicode character boundaries, and if not, look around to fetch the missing previous/following bytes. Frankly, it's just an untested idea, but if I had time, that's what I would look into.

Changing the FineDiff code is not an option: it's completely designed to work on bytes, and these bytes could be anything. FineDiff doesn't care about their meaning.
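As a starting point for such a handler, the sketch below only detects segments that are not valid UTF-8 on their own. The function name is hypothetical, and the parameter list (opcode, buffer, byte offset, byte length, plus an ignored fifth argument) is an assumption based on the call_user_func() calls quoted in this thread; the calls at the bottom stand in for the ones the renderer would make.

```php
<?php
// Hypothetical diagnostic callback (not part of FineDiff).
// Assumed arguments: opcode, buffer, byte offset, byte length, ignored extra.
function isSegmentValidUtf8($opcode, $buffer, $offset, $length, $extra = '')
{
    $segment = substr($buffer, $offset, $length);
    $valid = mb_check_encoding($segment, 'UTF-8');
    if (!$valid) {
        // Log the raw bytes in hex so the broken boundary is visible.
        error_log(sprintf(
            '%s segment at byte %d (length %d) is not valid UTF-8: %s',
            $opcode, $offset, $length, bin2hex($segment)
        ));
    }
    return $valid;
}

// Stand-in calls; the bytes spell "時空" in UTF-8.
$from = "\xE6\x99\x82\xE7\xA9\xBA";
$results = array(
    isSegmentValidUtf8('c', $from, 0, 2), // ends inside the first character
    isSegmentValidUtf8('d', $from, 2, 4), // starts inside the first character
    isSegmentValidUtf8('c', $from, 0, 3), // exactly the first character
);
```

A real handler would go on to repair the offsets instead of just reporting them.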

OK thanks, but the thing is that it is missing more than one character (sometimes a few sentences), so if I shift back and forth, most likely I would just get one more character back, which doesn't really help much...

it is missing more than one character (sometimes a few sentences)

Probably because you are looking at the broken HTML result. Look at the binary string internally, not the broken rendered HTML. Putting an HTML renderer in there was my biggest mistake. I should not have created this helper method, because FineDiff is really completely binary and doesn't care about what the data is. Originally it was used just to save storage, by saving only what changed.

Many users of the library think it is meant to render diffs visually on screen; that wasn't my intention at all originally.
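One possible reason a whole passage vanishes rather than a single character: PHP's htmlspecialchars() and htmlentities() return an empty string when the input is not valid in the given encoding, unless ENT_IGNORE or ENT_SUBSTITUTE is set. Whether the HTML helper escapes each segment this way is an assumption here, not something confirmed in this thread. The sketch passes the flags explicitly because the default flags changed in PHP 8.1 to include ENT_SUBSTITUTE.

```php
<?php
// A segment whose last character was cut after 2 of its 3 bytes.
$segment = "a long sentence " . "\xE6\x99";

// Without ENT_SUBSTITUTE or ENT_IGNORE the whole result is empty,
// so the valid text before the broken bytes is lost as well.
$escaped = htmlspecialchars($segment, ENT_QUOTES, 'UTF-8');

// With ENT_SUBSTITUTE the valid part survives and the broken bytes
// are replaced by a substitution character.
$substituted = htmlspecialchars($segment, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
```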

Isn't it true that if you use renderToTextFromOpcodes($from, $opcodes), the output will be as expected?

Edit: Out of curiosity, what granularity do you use?

I played with it for a bit. The following code seems to work in my case, but I'm not sure about other cases...

        if ( $opcode === 'c' ) { // copy n characters from source

            $shift = 0;
            $char = mb_substr($from, $from_offset+$n-3, 3);
            while ( strlen($char) == strlen(utf8_decode($char)) ) {
                $shift++;
                $char = mb_substr($from, $from_offset+$n-3-$shift, 3);
            }

            call_user_func($callback, 'c', $from, $from_offset, $n - $shift, '');
            $from_offset += $n;
        }
        else if ( $opcode === 'd' ) { // delete n characters from source

            $shift = 0;
            $char = mb_substr($from, $from_offset, 3);
            while ( strlen($char) == strlen(utf8_decode($char)) ) {
                $shift++;
                $char = mb_substr($from, $from_offset-$shift, 3);
            }

            call_user_func($callback, 'd', $from, $from_offset - $shift, $n + $shift, '');
            $from_offset += $n;
        }

Are Chinese characters always 3 bytes long? (including whitespace, etc.)

No... they can be 1-4 bytes... I guess I may have to refine it a bit to suit more cases... But how about insertion? I can't get it right using the same technique...

Alright, looking at the UTF-8 encoding, finding the beginning of a character seems pretty easy: if bits 7-6 of a byte are binary 10 (i.e. (byte & 0xC0) === 0x80), it is a continuation byte, so go back one byte and check again. As soon as (byte & 0xC0) !== 0x80, you have the beginning of your character.

Now use the distance from the beginning of the character to the passed $from_offset to correct the start and the end, i.e. $from_offset - $distance and $n + $distance. Do this for each segment regardless of whether it is an insert, delete, or copy. I believe this should work fine, with much less overhead than what you have above.

It has been a while since I wrote PHP, so I would have to check the PHP reference again. I forgot: can we check a single byte in a string using array notation? If so, this becomes very easy.
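A minimal sketch of that boundary check, under a few assumptions. PHP string offsets ($s[$i]) do return a single byte, and ord() gives its numeric value. The function names are hypothetical, the input is assumed to be valid UTF-8, and moving both the start and the end of a segment back to a character start is one possible design choice, picked so that adjacent segments of the same buffer still tile without overlapping.

```php
<?php
// Move a byte offset back to the first byte of the character it falls in.
// Continuation bytes in UTF-8 have the bit pattern 10xxxxxx.
function snapToCharStart($s, $offset)
{
    $len = strlen($s);
    while ($offset > 0 && $offset < $len && (ord($s[$offset]) & 0xC0) === 0x80) {
        $offset--;
    }
    return $offset;
}

// Align a byte segment (offset, length) to character boundaries by
// snapping both of its ends backwards. Returns array(offset, length).
function alignSegment($s, $offset, $n)
{
    $start = snapToCharStart($s, $offset);
    $end   = snapToCharStart($s, $offset + $n);
    return array($start, $end - $start);
}

// The bytes spell "時空"; the two raw segments split "時" after 2 bytes.
$from   = "\xE6\x99\x82\xE7\xA9\xBA";
$first  = alignSegment($from, 0, 2); // becomes empty
$second = alignSegment($from, 2, 4); // now covers both characters
```

With this choice a split character moves entirely into the following segment. Inserted text comes from a different buffer than $from, so it would need separate handling that this sketch does not attempt.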

You could do all this without changing FineDiff, just by providing your own callback to renderFromOpcodes($from, $opcodes, $callback). It's up to you.

Edit: fixed mistakes

Thanks! I also just googled it and discovered this fork
https://github.com/xrstf/PHP-FineDiff

It is working well with my Chinese characters!

Just be aware that you won't get the same kind of performance, however, as there is no equivalent of strspn/strcspn among the mb_ functions.

I've worked a bit on this today; I wanted to test the idea above about nudging the boundary back/forth. The idea works, though it's all in the details. It's not perfect yet. I have figured out how to make it work perfectly, but I don't know when it will be ready.

Cool! Thanks! I realize that the fork doesn't perform as efficiently as this one, but it can still serve as a temporary solution. Looking forward to your updates!
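For context on why these functions do not carry over to multibyte text: strspn() and strcspn() treat the mask as a set of single bytes, not a set of characters. A small sketch, assuming UTF-8 input and the mbstring extension; how the fork replaces these calls is not shown here.

```php
<?php
$text = "\xE4\xB8\x80\xE3\x80\x82"; // "一。" in UTF-8
$stop = "\xE3\x80\x82";             // "。" (ideographic full stop)

// The mask is the byte set {E3, 80, 82}. The third byte of "一" is 0x80,
// so the scan stops after 2 bytes, in the middle of that character.
$byteSpan = strcspn($text, $stop);

// A character-aware search finds the full stop at character index 1.
$charPos = mb_strpos($text, $stop, 0, 'UTF-8');
```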

See this https://github.com/xrstf/PHP-FineDiff
I used it for Farsi/Persian language and it works perfectly.