vlang / v

Simple, fast, safe, compiled language for developing maintainable software. Compiles itself in <1s with zero library dependencies. Supports automatic C => V translation. https://vlang.io

`string.int()` Commas are not parsed properly.

V version: 0.2.4 e031096
OS: Arch Linux

What did you do?

fn main() {
	i := '1,000'.int() + '900'.int()
	println(i)
}

1. Commas are not parsed properly.
2. If the specification is that commas are not parsed, an error should be raised for the illegal argument.
3. If raising errors is considered a problem, the behavior should at least be documented in an easily visible place, with a warning that it produces results like this.
4. If performance is a concern, new methods such as string.formatted_int() should be added to handle these inputs properly.

What did you expect to see?

1900

What did you see instead?

901

string.int() and .f64() are frequently used basic methods.
It seems out of character with V's language philosophy to allow this kind of design.
It's just like the labor-intensive C language, not fun.

Of course, you can get around this by doing '1,000'.replace(',', '').int(), but that is not the essence of this matter.

There is also this kind of bug as a related matter.

fn main() {
	i := '1000K'.int() + '900'.int()
	println(i)
}

results in

1900

K is ignored.
This is also a hotbed of bugs.

I already answered this in the other issue, yet you're trying again.

The convenience methods (such as .int() and .f64()) act the same as their C counterparts (atoi and atof). They parse digits (and in V's case, _ separators) until they hit something that is not a digit or a _. Comma is not a valid separator for digits, so parsing stops when it is hit. The same thing happens when it hits a K (as in your example), or anything else that is not either a digit or _.

If you want your code to stop when it hits an "invalid" character, use the strconv routines that return optionals.
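As a rough illustration of the strconv route mentioned above, the following sketch feeds the inputs from this issue to strconv.atoi and handles the failure case with an `or` block. It assumes strconv.atoi returns an optional/result as the thread describes; the exact error text depends on the V version, so none is shown.

```v
import strconv

fn main() {
	// '900' is a plain decimal string; the other two contain characters
	// that strconv.atoi is expected to reject instead of silently truncating.
	for s in ['900', '1,000', '1000K'] {
		n := strconv.atoi(s) or {
			println('rejected "${s}": ${err}')
			continue
		}
		println('parsed "${s}" as ${n}')
	}
}
```

With this approach the caller decides what an invalid input means, rather than receiving a truncated number.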

@JalonSolov I will raise this issue again and again as long as you immediately close.

While I agree technically, I disagree with the argument that because C is the way it is, it is inevitable that V is the way it is.
It is not a fun language.

I believe this matter deserves to stay open until it has been fully considered: whether such input should be parsed, whether it should raise an error, whether the behavior should be clearly documented, and whether another method should be added.

I personally think that it would be best to issue an error regarding this specification and provide some sort of Formatter class.

Anyway, I thank you, I can now raise the issue in a more cohesive way.

Maybe this should throw an error since , is used instead of . in some natural languages, but then the function signature would have to be changed.

I didn't immediately close this one or the other one. I answered the question.

This was discussed on Discord a while ago, and it was decided that the convenience functions would act like the C functions because that is what people expect and are used to.

Yes, it could probably be better documented, with more details as to exactly how it works, and why it works the way it does.

I don't think this should be handled by the standard (more technical) conversion casts, but by a separate lib that is aware of localization. In my locale (German), "." and "," are used exactly the other way round from English. So applying my locale settings, "1,000" would just be 1 but "1.000,00" would be 1000. Handling anything other than scientific notation in a computer language either ties it to one natural language and locale or opens the door to many bugs.

On the other hand, having a lib to parse most data representations depending on a given locale is something different.
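To make the locale point concrete, here is a minimal sketch of what such a separate helper could look like. The name parse_localized and its parameters are hypothetical and not part of any V library; it simply strips the grouping separator, maps the decimal separator to ".", and then reuses .f64(). It performs no validation, so it only illustrates the separator handling.

```v
// parse_localized is a hypothetical helper, not a standard library function.
// grouping: the thousands separator of the input's locale
// decimal:  the decimal separator of the input's locale
fn parse_localized(s string, grouping string, decimal string) f64 {
	// Drop grouping separators first, then normalize the decimal separator
	// to '.', which is what .f64() understands.
	return s.replace(grouping, '').replace(decimal, '.').f64()
}

fn main() {
	// German-style input: '.' groups thousands, ',' marks decimals.
	println(parse_localized('1.000,00', '.', ','))
	// English-style input: ',' groups thousands, '.' marks decimals.
	println(parse_localized('1,000.00', ',', '.'))
}
```

Both calls should yield one thousand, because the caller states explicitly which convention the input follows instead of the conversion guessing it.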

This is also true in French.

I agree if the specification is that methods such as string.int(), .f64(), etc. throw an error (just a panic is ok) and accept nothing but digits, a period, and underscores.

The current problem is that the program can be executed without any errors, even if commas, K, or other characters are given, and the answer is not what is expected.

It is not about the natural language, but about the fact that the function's behavior is not clearly defined and returns unexpected results, i.e., it is a flawed method, a bug.
Hence, we expect a clear, unambiguous definition of how this function is supposed to work.

Who would expect 1,011 to be 1?
Even in French, you would expect this to return 1.011, right?
But in fact it returns 1 which is neither 1011 nor 1.011.
Who would use this function expecting this result?

Who would expect that result?

Anyone who used the atoi function in C. Which is hundreds of thousands of people (at least) over the last 60+ years...

I would expect neither 1.011 nor 1, but I would certainly expect either an error (no panic plz) or 1011.

Who would expect that result?

Anyone who used the atoi function in C. Which is hundreds of thousands of people (at least) over the last 60+ years...

That's sophistry.
It is bare C, not V in my opinion.
You like bare C?
It is in the world of C language users.
Do V users expect C language?
I wonder who among the V users expects such results? I said.
Do you really have to go this far to understand my intent?
Or are you being sarcastic?
For what reason did you misread my intent?
Was my English incorrect?
I'm using a translator, so I'm sorry if I didn't convey my intentions accurately.

sophistry:
1. the use of clever but false arguments, especially with the intention of deceiving.
2. a fallacious argument.

My statements are neither. I am simply stating facts. As I said, this was discussed before. If you disagree with that discussion, please feel free to give a well-reasoned argument as to why that decision should be changed... and why it will be worth it to change every existing piece of code that uses the current functionality.

There is a simple way to get what you want - use the strconv module routines.

I pointed out that it was sophistry because I was talking about V users and you brought up C users.
You have confused the purpose of the discussion.
That definition of sophistry is exactly what you did.
Why is it that string.int() is over the last 60+ years... ?

I have already stated my opinion and examples, so please read and try to understand.

Your own opinion is that '1,011'.int() being 1 is the expected result and correct behavior?
And that some of the core V developers intentionally set '1,011'.int() to 1?
And that the V core developers decided in the discord to do so?

Please post a link to that part of the discord, because I would like to know why you so stubbornly refuse to throw an error.

I am not talking about strconv etc, I am talking about why you made the decision to refuse to throw the error.

Who would use this function expecting this result?

Me.

Do you really have to go this far to understand my intent?
Or are you being sarcastic?

@kahsa, can you please stop being so paranoid every time you talk/respond to @JalonSolov ?

I do not see anything in what he wrote that would substantiate that response from you.

On the topic - V is not bound to have the same behavior of C's atoi, and if multiple people want another behavior, then .int() etc can be modified, or another method can be added that does what is needed (customizable with the locale's thousands and point separator etc).

However C's behavior is common, and in my personal opinion, .int() as it is, works fine.

@medvednikov what do you think?

In Go this results in an error:

strconv.Atoi: parsing "777,999": invalid syntax

There are hundreds of different locales, and they can't all be expected to be supported by string.int().

Like mentioned above, it's better to use a separate lib for that.

I think it's ok for V to emulate C's atoi.

We can add another function that returns an optional or a result type.

https://modules.vlang.io/strconv.html#atoi already has this alternate behavior, which is what I said all along.

If it is, then doc.md should be changed, as mentioned above.
To a beginning V programmer, the behavior of string.int() is unexpected.
Not all programmers are C users.
From the point of view of non-C users, string.int() returns unorthodox results.
To avoid confusing non-C users, the method to convert strings to numbers in the documentation should be string.atoi(), and string.int() should be hidden from the introductory documentation.

Who would use this function expecting this result?

Me.

Unbelievable.
https://modules.vlang.io/index.html#string.int
How can you expect such behavior just from reading this document?

The reason why I expected '1,011'.f64() to be 1011 is because 100K becomes 100 and 30% becomes 30.
I happened to be doing a data analysis that included those numbers and it was working well, and I noticed this strange behavior because I found a strange calculation result and checked the cause and found that 1,011 was 1.
If you had given me some kind of error with 100K from the beginning, I would have noticed it beforehand.

https://modules.vlang.io/index.html#string.int
From reading this document, I don't think that my guess is particularly strange.

In short, is the intention of having string.int() behave in such a buggy way to avoid having to do '1,111'.int() or { panic(err) } like this every time?
Is that why you dare to allow it to return strange answers without throwing an error?

In my personal opinion, it would be correct enough behavior if it would panic just like an array bounds error.

Not a single popular language I tried converted "1,011" to 1011.

So it's not strange.

Well, in practical use, there is a workaround, so I am not troubled, but this behavior seems to be a clear bug, judging from the documentation, so I reported it.
I am sure there are others who think like me.
I hope you can fix the behavior or provide clearer documentation.

Not a single popular language I tried converted "1,011" to 1011.

This is not a particular problem.
If that is the specification, so be it.

However, it is strange that 1,011 becomes 1.
Do all languages return such an answer?
I think it should be an error at least.

int() doesn't return an error, it can return 0.

C has this behavior.

That's still understandable.
We can consider 0 to be false.

But for example '2,222'.int() returns 2.

I'd be ok with it returning 0, but it would slow down the function a bit.

Also I'm not sure which behavior is preferable: C behavior or returning 0.

If you're concerned about speed, that's ok.
If that's the spec.

If it is, then the specification would need to be clearly documented, I say.

Or, instead of .int(), which is a method name that anyone would be tempted to come up with quickly and use, why not use a special name that is used only when speed is required?

Don't get me wrong, I also place great emphasis on performance.
I am fine with allowing strange behavior for the sake of performance, as long as I know what I am doing and use it with care.
But then what about the developer who uses it without knowing it?

In other words

.int() is a special method that is allowed to have this strange behavior for performance reasons, so please understand that when using it.

That means it needs to be documented as such.
And while we're at it, it would be nice to have a slightly different signature.

I just thought of something. For example, what about an internal convention that prefixes performance-oriented functions with _p.
string.int_p() or something like that.
A convention that says that because these functions are speed-oriented, they do not throw errors or checks, so they exhibit strange behavior, and you should know exactly how they behave before you use them.

But for example '2,222'.int() returns 2.

It returns 2 in that case because it parses the 2, then sees the comma (which is not a digit or _), so it quits and returns 2.

Yes, that is the specification for the convenience method.

A convention that says that because these functions are speed-oriented, they do not throw errors or checks, so they exhibit strange behavior, and you should know exactly how they behave before you use them.

Who will determine what is strange behavior? You?

The behavior where conversion stops at the first unrecognized character and returns the accumulated result is the simplest, if you implement it from scratch. It is no coincidence that most languages use it.

Whether it is strange or not is purely subjective imho.

I fully agree that it should be documented better, so that people can choose another method if that does not suit them.
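The "stop at the first unrecognized character" behavior described in this comment can be sketched from scratch as follows. parse_prefix_int is a hypothetical function written for illustration; it is not the actual implementation of string.int() and it ignores signs and overflow.

```v
// parse_prefix_int is a hypothetical illustration, not the standard library code.
// It accumulates leading decimal digits, skips `_` separators, and stops at
// the first byte that is neither, returning whatever was accumulated so far.
fn parse_prefix_int(s string) int {
	mut n := 0
	for ch in s {
		if ch == `_` {
			continue
		}
		if ch < `0` || ch > `9` {
			break
		}
		n = n * 10 + int(ch - `0`)
	}
	return n
}

fn main() {
	println(parse_prefix_int('2,222')) // 2: stops at the comma
	println(parse_prefix_int('1000K')) // 1000: stops at the K
	println(parse_prefix_int('1_000')) // 1000: underscores are skipped
}
```

The loop has no error path at all, which is why this style of parser is simple to write and why invalid input never surfaces as a failure.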

A convention that says that because these functions are speed-oriented, they do not throw errors or checks, so they exhibit strange behavior, and you should know exactly how they behave before you use them.

Who will determine what is strange behavior? You?

It's a document.

And language specifications.

The designer explains how to use the function.

What is not written depends on the common sense of the developers who use it.

The common sense of each user is different.

That's why, as only the core developers know, undocumented information is almost non-existent for the average user.

Your common sense is not my common sense.
Therefore, for functions that have different expectations from person to person, it is necessary to explain and share the designer's intentions and precautions in detail.

And it is desirable to make it invisible to general users.

The behavior where conversion stops at the first unrecognized character and returns the accumulated result is the simplest, if you implement it from scratch. It is no coincidence that most languages use it.

Whether it is strange or not is purely subjective imho.

So if you claim it's not a bug, then I'm asking why it's such a spec.
And it's for performance, I got an answer from Alex.
And I was convinced of it.

After that, as mentioned repeatedly above,
I am making two proposals.

  1. Add documentation stating that it is intentional, for performance reasons, that both '2,222'.int() and '2,222'.f64() return 2, and that characters other than digits, periods, and underscores should therefore not be used.

  2. For example, give the fast method a special name such as .int_p() (meaning a function without any validation, for performance reasons) and make .int() a method that raises an error at the expense of performance.

That would resolve this issue perfectly.
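For illustration only, a validating variant along the lines of the second proposal could be built on strconv.atoi today, without changing .int(). The name must_int is hypothetical, and panicking on bad input is just one possible policy.

```v
import strconv

// must_int is a hypothetical helper: it returns the parsed value, or panics
// when the whole string is not a valid integer, instead of returning a
// truncated number.
fn must_int(s string) int {
	return strconv.atoi(s) or { panic('must_int: cannot parse "${s}": ${err}') }
}

fn main() {
	println(must_int('1000') + must_int('900')) // 1900
	// must_int('1,000') would panic here rather than contribute 1 to the sum.
}
```

The trade-off is the one discussed in the thread: every call pays for full validation, while the existing convenience methods stay fast and permissive.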

@JalonSolov The perfect answer that would have immediately convinced me is as follows

Your third point is the answer.
string.int() and .f64() are designed to be as fast as possible because they are heavily used internally. That's why they don't check values and don't generate errors. That's why they return the results you pointed out.
So no characters should be used except numbers, a period and underscores.
This is an important thing that should be documented in detail, but unfortunately it is not yet documented.
So it is better to report this issue as a lack of documentation than to report that a function is not working properly.