ren_placeholder to find character width is wrong

Question

kyx0r opened this issue 3 years ago · comments

Hello @aligrudi, I was looking at the commit history and I found that commit 0d6850f has a bug. The problem is that the function conf_placeholder can modify the wid variable without the character actually being a placeholder. So, for example, if the character is 'a' and the last placeholder in the placeholders array has a width of 5, the 'a' character will be treated as having a width of 5, which is obviously wrong. The only saving grace keeping this bug from appearing is that currently there isn't any placeholder in the array wider than 1.

Also see my code for a proper implementation.
https://github.com/kyx0r/neatvi/blob/fee6b5a1acccc4bec8ed14eff61f5a362b1b8d75/ren.c#L176

To be honest, I am quite skeptical of this change in the first place, as those extra checks have astounding performance implications. What is your take on the issue?

Kyryl · Answer 1 · Sun Mar 07 2021 10:45:53 GMT+0800 (China Standard Time)

Are those uc_iscomb(s) || uc_isbell(s) calls even necessary? For now I have only mirrored the change to have the same functionality, but I don't understand what problem they solve. uc_isbell may be able to solve the problem of badly encoded data, so the width should be 1 for it, which looks right. But what's up with the uc_iscomb call? Is that something that was overlooked?

Kyryl · Answer 2 · Sun Mar 07 2021 11:03:19 GMT+0800 (China Standard Time)

uc_wid has 3 possible return states: 0, 1, and 2. A binary search is performed on dwchars, so it would safely return 1 for anything not listed there; zwchars are also handled. I don't see a reason for those checks, and they are quite a performance hurdle!
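For readers who have not looked at this kind of lookup, here is a minimal sketch of a three-state width function backed by binary searches over sorted range tables. Everything in it is an assumption made for illustration: the names `toy_range`, `zw_ranges`, `dw_ranges`, `in_ranges` and `toy_wid` are hypothetical, and the tables hold only a tiny sample of ranges. It is not neatvi's uc_wid() or its data.

```c
#include <stddef.h>

/* Inclusive range of code points. */
struct toy_range {
	int lo, hi;
};

/* Tiny illustrative samples, sorted by lo; NOT neatvi's real tables. */
static const struct toy_range zw_ranges[] = {
	{0x0300, 0x036f},	/* combining diacritical marks */
	{0x200b, 0x200f},	/* zero-width space, joiners, direction marks */
};
static const struct toy_range dw_ranges[] = {
	{0x1100, 0x115f},	/* Hangul Jamo initial consonants */
	{0x4e00, 0x9fff},	/* CJK unified ideographs */
};

/* Binary search: return 1 if c lies inside one of the n sorted ranges. */
static int in_ranges(const struct toy_range *tab, size_t n, int c)
{
	size_t l = 0, h = n;
	while (l < h) {
		size_t m = l + (h - l) / 2;
		if (c < tab[m].lo)
			h = m;
		else if (c > tab[m].hi)
			l = m + 1;
		else
			return 1;
	}
	return 0;
}

/* Three-state width: 0 for zero-width, 2 for double-width, otherwise 1. */
static int toy_wid(int c)
{
	if (in_ranges(zw_ranges, sizeof(zw_ranges) / sizeof(zw_ranges[0]), c))
		return 0;
	if (in_ranges(dw_ranges, sizeof(dw_ranges) / sizeof(dw_ranges[0]), c))
		return 2;
	return 1;
}
```

In this sketch an ordinary character such as 'a' costs two failed binary searches before falling through to 1, which is the baseline the extra checks are being compared against.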

Ali Gholami Rudi · Answer 3 · Sun Mar 07 2021 20:25:15 GMT+0800 (China Standard Time)

Hi, Kyryl <notifications@github.com> wrote:

> Hello @aligrudi, I was looking at the commit history and I found that commit 0d6850f has a bug. The problem is that the function conf_placeholder can modify the wid variable without the character actually being a placeholder. So, for example, if the character is 'a' and the last placeholder in the placeholders array has a width of 5, the 'a' character will be treated as having a width of 5, which is obviously wrong. The only saving grace keeping this bug from appearing is that currently there isn't any placeholder in the array wider than 1.
>
> Also see my code for a proper implementation.
> https://github.com/kyx0r/neatvi/blob/fee6b5a1acccc4bec8ed14eff61f5a362b1b8d75/ren.c#L176
>
> To be honest, I am quite skeptical of this change in the first place, as those extra checks have astounding performance implications. What is your take on the issue?

Thanks for reporting. Does this patch fix the issue?

diff --git a/ren.c b/ren.c
index bc6b489..e06fb16 100644
--- a/ren.c
+++ b/ren.c
@@ -129,11 +129,11 @@ static char *ren_placeholder(char *s, int *wid)
 {
 	char *src, *dst;
 	int i;
-	if (wid)
-		*wid = 1;
 	for (i = 0; !conf_placeholder(i, &src, &dst, wid); i++)
 		if (src[0] == s[0] && uc_code(src) == uc_code(s))
 			return dst;
+	if (wid)
+		*wid = 1;
 	if (uc_iscomb(s)) {
 		static char buf[16];
 		char cbuf[8] = "";

Thanks,
Ali
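To make the effect of moving those two lines concrete, here is a small self-contained model of the lookup. It is only a sketch under stated assumptions: `toy_ph`, `toy_table`, `toy_conf_placeholder`, `lookup_before` and `lookup_after` are hypothetical names, the table is made up (with one deliberately wide entry), and the stand-in for conf_placeholder() is modeled on the behavior described in this issue, namely that it writes *wid for every index it visits.

```c
#include <stddef.h>

/* Hypothetical placeholder entry: code point, replacement text, screen width. */
struct toy_ph {
	int code;
	const char *dst;
	int wid;
};

/* Made-up table; the last entry is deliberately wide to expose the bug. */
static const struct toy_ph toy_table[] = {
	{0x200c, "-", 1},
	{0x200d, "-----", 5},
};

/*
 * Stand-in for conf_placeholder(): returns nonzero once idx is past the
 * end of the table; otherwise fills the outputs, *wid included, for
 * entry idx whether or not that entry matches the caller's character.
 */
static int toy_conf_placeholder(size_t idx, int *code, const char **dst, int *wid)
{
	if (idx >= sizeof(toy_table) / sizeof(toy_table[0]))
		return 1;
	*code = toy_table[idx].code;
	*dst = toy_table[idx].dst;
	if (wid)
		*wid = toy_table[idx].wid;
	return 0;
}

/* Ordering before the patch: the default width is set before the scan. */
static const char *lookup_before(int c, int *wid)
{
	const char *dst;
	int code;
	size_t i;
	if (wid)
		*wid = 1;
	for (i = 0; !toy_conf_placeholder(i, &code, &dst, wid); i++)
		if (code == c)
			return dst;
	return NULL;	/* *wid still holds the last table entry's width */
}

/* Ordering after the patch: the default is applied once the scan has failed. */
static const char *lookup_after(int c, int *wid)
{
	const char *dst;
	int code;
	size_t i;
	for (i = 0; !toy_conf_placeholder(i, &code, &dst, wid); i++)
		if (code == c)
			return dst;
	if (wid)
		*wid = 1;
	return NULL;
}
```

With this table, lookup_before('a', &w) leaves w at 5, while lookup_after('a', &w) leaves it at 1. A real match such as 0x200d reports width 5 with either ordering.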

Ali Gholami Rudi · Answer 4 · Sun Mar 07 2021 20:31:44 GMT+0800 (China Standard Time)

Kyryl <notifications@github.com> wrote:

> uc_wid has 3 possible return states: 0, 1, and 2. A binary search is performed on dwchars, so it would safely return 1 for anything not listed there; zwchars are also handled. I don't see a reason for those checks, and they are quite a performance hurdle!

Double-width characters should appear as characters of width 2 on the screen. Zero-width and combining characters have width zero when rendered. However, it makes editing easier if they are rendered as characters that have width one (otherwise, you may not even know if they are present...).

Ali

Kyryl · Answer 5 · Sun Mar 07 2021 22:42:29 GMT+0800 (China Standard Time)

Thanks for the quick response. Does uc_wid() report an erroneous width in some cases? By that I mean, is a character of width 2 sometimes taken as width 1, or vice versa? If so, that means the lookup table is incomplete or wrong and should be fixed. As for zero width characters, they can be handled this way:

int wid = uc_wid(s);
return !wid ? 1 : wid;

This would solve the issue of them being 0, but more importantly, it takes only 1 if statement, which is 100000x faster than those uc_iscomb and uc_isbell checks. Am I missing some important piece of information?

Kyryl · Answer 6 · Sun Mar 07 2021 22:55:54 GMT+0800 (China Standard Time)

Now there are only 2 states: uc_wid() can be either 1 or 2. However, 2 should be returned only for double width, and if the function handles double width correctly, that leaves only 1 possible state for everything else, meaning that 1 will be returned regardless, and it doesn't matter what uc_iscomb and uc_isbell do.

Ali Gholami Rudi · Answer 7 · Mon Mar 08 2021 00:32:57 GMT+0800 (China Standard Time)

Kyryl <notifications@github.com> wrote:

> Thanks for the quick response. Does uc_wid() report an erroneous width in some cases? By that I mean, is a character of width 2 sometimes taken as width 1, or vice versa? If so, that means the lookup

It should not be so, unless there is a problem.

> table is incomplete or wrong and should be fixed. As for zero

Yes.

> width characters, they can be handled this way:
>
> int wid = uc_wid(s);
> return !wid ? 1 : wid;

Do you mean in the implementation of ren_cwid()? The reason for the existence of ren_placeholder() is to specify placeholders for characters that should not be printed in the output. The second argument allows returning the screen width of the placeholder. If a character is not handled in ren_placeholder(), it should appear in the output with width uc_wid(). Currently ren_placeholder() handles the following characters:

+ Placeholders specified in conf.h
+ Combining characters
+ Nonprintable characters
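The division of labor described above can be sketched as follows: the caller asks the placeholder function first and falls back to the raw width only when no placeholder applies. This is a toy model, not neatvi's code. All names (`toy_iscomb`, `toy_isctrl`, `toy_raw_wid`, `toy_placeholder`, `toy_cwid`) are hypothetical, the character classes are crude approximations, the placeholder strings are invented, and special cases such as tabs are ignored.

```c
#include <stddef.h>

/* Crude stand-ins for the character classes; assumptions for illustration. */
static int toy_iscomb(int c)
{
	return c >= 0x0300 && c <= 0x036f;	/* combining diacritical marks */
}

static int toy_isctrl(int c)
{
	return (c >= 0 && c < 0x20) || c == 0x7f;	/* ASCII control characters */
}

/* Raw width: 0 for combining, 2 for a sample CJK range, otherwise 1. */
static int toy_raw_wid(int c)
{
	if (toy_iscomb(c))
		return 0;
	if (c >= 0x4e00 && c <= 0x9fff)
		return 2;
	return 1;
}

/*
 * Return the placeholder text for c and store its screen width in *wid,
 * or return NULL when c should be printed as is.
 */
static const char *toy_placeholder(int c, int *wid)
{
	if (c == 0x200c) {	/* stands for an entry configured by the user */
		if (wid)
			*wid = 1;
		return "-";
	}
	if (wid)
		*wid = 1;
	if (toy_iscomb(c))
		return "~";	/* invented marker for combining characters */
	if (toy_isctrl(c))
		return "?";	/* invented marker for nonprintable characters */
	return NULL;
}

/* Screen width of c: the placeholder's width if it has one, else the raw width. */
static int toy_cwid(int c)
{
	int wid;
	if (toy_placeholder(c, &wid))
		return wid;
	return toy_raw_wid(c);
}
```

Here a combining character has raw width 0 but is shown with width 1 through its placeholder, while a double-width character has no placeholder and keeps width 2.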

> This would solve the issue of them being 0, but more importantly, it takes only 1 if statement, which is 100000x faster than those uc_iscomb and uc_isbell checks. Am I missing some important piece of information?

Is it a bottleneck? I think some profiling is needed to be sure.

Thanks,
Ali
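One rough way to get a first number before reaching for a full profiler is to time the per-character width computation over a long line with the standard clock() function. The sketch below assumes a hypothetical `width_under_test` function and a hypothetical helper `time_line_width`; neither exists in neatvi, and the width function would have to be swapped for whichever candidate is being measured.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical per-character width function being measured. */
static int width_under_test(int c)
{
	return (c >= 0x4e00 && c <= 0x9fff) ? 2 : 1;
}

/*
 * Sum the widths of the n code points in line, repeated reps times.
 * The CPU time spent is stored in *secs (if non-NULL); the sum is
 * returned so that the compiler cannot discard the loop.
 */
static long time_line_width(const int *line, size_t n, int reps, double *secs)
{
	long total = 0;
	clock_t t0 = clock();
	for (int r = 0; r < reps; r++)
		for (size_t i = 0; i < n; i++)
			total += width_under_test(line[i]);
	if (secs)
		*secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
	return total;
}
```

Running this on a line of 10000 code points with a large reps value, once for each candidate implementation, gives a crude comparison. A sampling profiler such as perf or gprof, run on the editor itself, would show how much of the total redraw time the width computation really takes.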

Kyryl · Answer 8 · Mon Mar 08 2021 01:01:40 GMT+0800 (China Standard Time)

It certainly is a bottleneck! In the worst case you are doing 4 binary search lookups for every character in the line, plus something like 15 more if statements, when it can clearly be 2 bsearches and 3 if statements. Even worse, if the character happens to pass the uc_iscomb condition, it runs into the sprintf function, which has something like 100+ if statements inside it, because it is a universal function that needs to check for all kinds of string conversions.

Now, I imagine you won't notice any performance difference if you are editing files that have ~50-80 characters in them, but what will happen if there are 10000 characters in a line? I tried it on my version, where I have features that require the screen to be redrawn completely all the time, and even on ASCII lines that were <80 characters long it lags pretty badly. Granted, that is because I have complex syntax highlighting rules that take 90% of the CPU time there, but because this is a problem, there is no reason for other things to be slow. I need all the performance I can get so that it can be used on more important things!

Kyryl · Answer 9 · Mon Mar 08 2021 01:23:55 GMT+0800 (China Standard Time)

> Do you mean in the implementation of ren_cwid()?

Yes.

I am a perfectionist; by nature I want things to be done the way they should be done. Even if it may not seem like a big performance issue right now, trust me, those things add up quickly as time goes on. I have worked on projects that are very bloated, with so much crap inside them that no profiler can ever figure out the bottleneck! That is because there actually isn't one; there are just a million things that programmers took as negligible, and they ended up costing a big price once there were so many of them. It ends up pretty bad, which is why I think software should get better over time, not worse. That's why I refactored a lot of code in neatvi that I did not have to; it was fine as is, but it wasn't optimal.

Ali Gholami Rudi · Answer 10 · Mon Mar 08 2021 02:55:06 GMT+0800 (China Standard Time)

Kyryl <notifications@github.com> wrote:

> > Do you mean in the implementation of ren_cwid()?
>
> I am a perfectionist; by nature I want things to be done the way they should be done. Even if it may not seem like a big performance issue right now, trust me, those things add up quickly as time goes on. I have worked on projects that are very bloated, with so much crap inside them that no profiler can ever figure out the bottleneck! That is because there actually isn't one; there are just a million things that programmers took as negligible, and they ended up costing a big price once there were so many of them. It ends up pretty bad, which is why I think software should get better over time, not worse. That's why I refactored a lot of code in neatvi that I did not have to; it was fine as is, but it wasn't optimal.

That is OK. Just to clarify, there are two points:

+ The job of the function ren_placeholder() is to identify characters that have a placeholder; the user may add placeholders for any characters he wants by modifying the placeholders[] array of conf.h. There is also a placeholder for combining and nonprintable characters for now.
+ The performance of supporting placeholders.

Therefore, the question is whether you think supporting placeholders as such is not a good idea, or whether it can be implemented with lower overhead.

About optimizing: sometimes (but not always) making code faster implies making it more complex. I usually prefer simplicity to having very low overhead. However, if the difference is noticeable, the additional complexity may be worth the speed gains. That is why I suggested profilers.

Thanks,
Ali

Kyryl · Answer 11 · Mon Mar 08 2021 06:06:32 GMT+0800 (China Standard Time)

I think that placeholder feature support is actually great in general. But it definitely can be implemented with lower overhead, as it was in the past, before that commit. I understand the motivation of trying to reduce the number of lines of code to make it simpler, but in this case the benefits don't outweigh the downsides.

ren_placeholder() shouldn't be trying to get the width in the first place; that is crammed-in functionality that is going to be used on only 1 occasion. Personally, it made it hard for me to understand the code, which is why I opened this issue: it's unclear why uc_iscomb and uc_isbell have to be invoked. And it will be unclear to anybody else who tries to read the code because, let's be real, nobody is going to run the binary search of a lookup table in their head to figure out whether those code paths affect the execution or not. It turns out they don't affect anything, but they will mislead people into thinking they are necessary, and they also hurt performance. Unix philosophy: it should do one thing and do it well. The same applies to functions in code.

If I didn't miss anything important about how the uc_wid function works, this is my final solution: https://github.com/kyx0r/neatvi/blob/208fa09f4094416d613a42d415b7a035c6ea9d35/ren.c#L149-L178

And as you can see, it's not that much more code; we are talking a net ~3 more lines.

Ali Gholami Rudi · Answer 12 · Tue Mar 09 2021 00:04:09 GMT+0800 (China Standard Time)

Kyryl <notifications@github.com> wrote:

> I think that placeholder feature support is actually great in general. But it definitely can be implemented with lower overhead, as it was in the past, before that commit. I understand the motivation of trying to reduce the number of lines of code to make it simpler, but in this case the benefits don't outweigh the downsides.

Actually, there was a problem before that commit, and that was the main reason for the change. It assigned width zero to combining characters, which made them disappear from the output. The goal of that commit was to prevent such mistakes.

> ren_placeholder() shouldn't be trying to get the width in the first place; that is crammed-in functionality that is going to be used on only 1 occasion. Personally, it made it hard for me to understand the code, which is why I opened this issue: it's unclear why uc_iscomb and uc_isbell have to be invoked. And it will be unclear to anybody else who tries to read the code because, let's be real, nobody is going to run the binary search of a lookup table in their head to figure out whether those code paths affect the execution or not. It turns out they don't affect anything, but they will mislead people into thinking they are necessary, and they also hurt performance. Unix philosophy: it should do one thing and do it well. The same applies to functions in code.

Well, ren_placeholder() returns the placeholder of a character and its width on the screen. The width is a property of the placeholder; the function is not performing two jobs. You may compute the width of these placeholders in ren_cwid(). However, when ren_placeholder() is modified, you then have to verify whether ren_cwid() needs to be updated as well. Why should ren_cwid() know what placeholders are defined or whether combining characters have a placeholder? Suppose, for instance, a new variable is defined that disables placeholders altogether. It can be implemented by adding only the following two lines to the beginning of ren_placeholder():

	if (!xplaceholder)
		return NULL;

In any case, I think this is a small issue and to some extent a matter of taste.

> If I didn't miss anything important about how the uc_wid function works, this is my final solution: https://github.com/kyx0r/neatvi/blob/208fa09f4094416d613a42d415b7a035c6ea9d35/ren.c#L149-L178
>
> And as you can see, it's not that much more code; we are talking a net ~3 more lines.

Looks good.

Thanks,
Ali

Kyryl · Answer 13 · Tue Mar 09 2021 00:16:25 GMT+0800 (China Standard Time)

Thank you for your time. I think that settles the issue then. But I will open another one soon; it's a different question.