evanw / float-toy

Use this to build intuition for the IEEE floating-point format

Insufficiently accurate

fire-eggs opened this issue · comments

The accuracy of the values goes downhill from bit 16 onward, with incorrect results.

I.e. using the 32-bit (float) toy:

  1. Set only bit 17 'on': 0x3F810000 : value is 1.0078125: correct!
  2. Set only bit 16 'on': 0x3F808000 : value is 1.0039062: incorrect! [Should be 1.00390625]
  3. Set only bit 15 'on': 0x3F804000 : value is 1.0019531: incorrect! [Should be 1.001953125]
  4. Set only bit 14 'on': 0x3F802000 : value is 1.0009766: incorrect! [Should be 1.0009765625]
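
As a quick way to reproduce this, a browser-console sketch along these lines (an illustration only, not part of the toy) decodes each of the four bit patterns above and prints the value it stores:

// Illustration: decode the bit patterns listed above and print their exact values.
// (DataView reads/writes big-endian by default, matching the hex as written.)
const f32 = bits => {
  const dv = new DataView(new ArrayBuffer(4));
  dv.setUint32(0, bits);
  return dv.getFloat32(0); // the float's value, widened losslessly to a double
};
console.log(f32(0x3F810000)); // 1.0078125
console.log(f32(0x3F808000)); // 1.00390625
console.log(f32(0x3F804000)); // 1.001953125
console.log(f32(0x3F802000)); // 1.0009765625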

The inaccuracies accumulate.

My specific egg-on-face incident was "1.28": the toy claims that 0x3FA3D70A correctly represents "1.28", but any C compiler will tell you the value is 1.27999973297.

[Or Google:
1+(2^-2)+(2^-6)+(2^-7)+(2^-8)+(2^-9)+(2^-11)+(2^-13)+(2^-14)+(2^-15)+(2^-20)]

Both 1.0039062 and 1.00390625 have the same 32-bit floating-point representation. Same thing with the other numbers. Some quick tests from my browser's JavaScript console:

> Array.from(new Int32Array(new Float32Array([1.0039062, 1.00390625]).buffer)).map(x => x.toString(16))
["3f808000", "3f808000"]
> Array.from(new Int32Array(new Float32Array([1.0019531, 1.001953125]).buffer)).map(x => x.toString(16))
["3f804000", "3f804000"]
> Array.from(new Int32Array(new Float32Array([1.0009766, 1.0009765625]).buffer)).map(x => x.toString(16))
["3f802000", "3f802000"]

From the computer's point of view, the strings 1.0039062 and 1.00390625 (and 1.003906225 and infinitely many other strings) are all the exact same 32-bit floating-point number.

Given infinitely many possible equivalent values to display, it seems to me like the most sensible thing to do would be to display the shortest string that still represents the number exactly, of which there is only one. That's what is being done here.
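
A minimal sketch of that idea (assuming an approach like the following; this is not the tool's actual code) is to add decimal digits one at a time until the string converts back to the same 32-bit value:

// Sketch: find the shortest decimal string that round-trips to the given float bits.
function shortestFloat32String(bits) {
  const dv = new DataView(new ArrayBuffer(4));
  dv.setUint32(0, bits);
  const value = dv.getFloat32(0);                    // the float's value as a double
  for (let digits = 1; digits <= 21; digits++) {
    const s = value.toPrecision(digits);
    if (Math.fround(Number(s)) === value) return s;  // converts back to the same float
  }
  return value.toString();
}
console.log(shortestFloat32String(0x3FA3D70A)); // "1.28"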

It's your program and you can do it how you want. I thought I had found a tool to see the exact value of a given floating point representation, which turns out not to be the case.

From the computer's point of view, the strings 1.0039062 and 1.00390625 (and 1.003906225
and infinitely many other strings) are all the exact same 32-bit floating-point number.

I hear what you are saying. This is the compiler picking the closest possible floating point representation of a string. I.e. the fundamental issue of floating point (im)precision.

But coming from the other direction, a given 32-bit floating point representation does in fact have one, and only one, value. "0x3F808000" represents "1.00390625" exactly.
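
One way to see that (a hand-decoding sketch, not taken from either tool): the exponent field of 0x3F808000 is 0x7F, giving 2^0, and the only mantissa bit set contributes 2^-8, so the stored value is exactly 1 + 2^-8 = 1.00390625:

// Sketch: decode 0x3F808000 field by field and sum the contributing powers of two.
const bits = 0x3F808000;
const sign = (bits >>> 31) ? -1 : 1;
const exponent = ((bits >>> 23) & 0xFF) - 127; // biased exponent 0x7F -> 0
const mantissa = bits & 0x7FFFFF;              // only one mantissa bit is set here
let significand = 1;                           // implicit leading 1 for normal numbers
for (let i = 1; i <= 23; i++) {
  if (mantissa & (1 << (23 - i))) significand += 2 ** -i;
}
console.log(sign * significand * 2 ** exponent); // 1.00390625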

I decided to give this a go and forked / hacked together https://shiona.github.io/float-toy/ (sources https://github.com/shiona/float-toy).

commented

I realize this is an old issue and that my comment may never reach the parties involved, but I think there is an important point missing here that pertains to floating-point representations. I am going to try to explain this step by step, so this post will be long and possibly boring.

When a floating-point number is encoded in the IEEE 754 32-bit format, its mantissa or fractional part has exactly 24 (23 for denormals) significant bits (each either 0 or 1). What this means is that the decimal number it represents is also limited in the number of significant decimal digits (each between 0 and 9) it contains. For the results of mathematical operations to be correct, numbers must be rounded to the appropriate number of bits/digits after the operations are performed.

Although it is true that calculations may provide bits/digits beyond significance, they are invalid and must be discarded because they are part of the error that is incurred when an exact mathematical calculation is approximated using inexact numbers. So the issue discussed here pertains to the number of decimal significant digits that can be obtained from a finite number of binary significant bits.

To get this number, let's first get some simple bounds. To encode the 10 decimal digits in binary you need 4 bits, but 4 bits can store numbers bigger than 9, so each decimal digit encodes somewhere between 3 and 4 bits. To obtain the exact number we write both representations in long form

binary = sum(n=-B..B) b_n*2^n
decimal = sum(n=-D..D) d_n*10^n

so a digit d_n generates a number

d_n*10^n = d_n*(2^x)^n = d_n*2^(x*n)

for the value of x such that 2^x = 10. The answer is x = log2(10) (or 1/log10(2)), which is ~ 3.3219280948873623478 and agrees with our estimate. It then follows that 24 significant bits produce 24/3.3219280948873623478 ~ 7.2 significant decimal digits.
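
As a quick sanity check of that arithmetic (illustration only):

// Illustration: bits per decimal digit and the resulting digit budget.
console.log(Math.log2(10));      // ≈ 3.3219 bits per decimal digit
console.log(24 / Math.log2(10)); // ≈ 7.22 significant decimal digits for a 32-bit float
console.log(53 / Math.log2(10)); // ≈ 15.95 for a 64-bit double
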
What this doesn't mean is that we cannot use all the bits the binary representation provides when calculating the decimal number. What it does mean is that, of the decimal digits computed from those bits, we can only keep about 7 if we claim the answer has any form of exactness.
This means that a statement like

"0x3F808000" represents "1.00390625" exactly.

is not entirely correct, because the quoted answer includes digits that are not significant (2 of them in this case), and that invalidates the asserted exactness. The answers provided by this app are more correct because they are rounded to the correct number of significant digits. The answers provided by the app that the last post refers to are less correct because they contain digits that are not significant. One obvious indicator of the misunderstanding is that the number of digits displayed in the decimal representations created from binary or hex inputs is the same for the half, single and double precision formats, which cannot possibly be correct. Each representation implies a different number of significant decimal digits, which you can now also calculate.
In light of this let's revisit the following statement:

My specific egg-on-face incident was "1.28": the toy claims that 0x3FA3D70A correctly represents "1.28", but any C compiler will tell you the value is 1.27999973297.

As this comes from a 32-bit floating-point number, the answer must be rounded to 7 or 8 digits. We round by multiplying the number by 10^7, rounding to the nearest integer, and dividing by 10^7 again. The answer is 1.28, as indicated by the app, and not the number with 13 digits, which has almost twice the number of digits it should have. What that number more likely represents is the result of casting the float 1.28 to a double precision number, and it has nothing to do with the IEEE 754 conversions of binaries into/from decimals that this app performs.
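
For illustration, that rounding step can be done in a browser console like this (Math.fround is used here only to obtain, as a double, the exact value of the 32-bit float nearest 1.28; the snippet is not from either tool):

// Illustration: round the exact value of 0x3FA3D70A to 7 decimal places as described above.
const exact = Math.fround(1.28);            // exact value of the float nearest 1.28
console.log(Math.round(exact * 1e7) / 1e7); // 1.28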

This is all very basic college-level numerical analysis material, well beyond any reasonable doubt. I feel that not posting this would be a disservice to the author of this app, which is both technically sound and aesthetically appealing.

commented

Thank you.

I agree that the number of digits my tool outputs when the LSB is 1 is ridiculous. Trying to read the IEEE 754 specification, I fail to find a requirement for rounding, or for limiting the precision in which a number is shown; I can only find minimum precision values. This of course does not mean those requirements don't exist, just that I'm not good at reading specs. https://irem.univ-reunion.fr/IMG/pdf/ieee-754-2008.pdf was the version of the spec I was able to find for free. If you can help pinpoint the part with the requirement, I'm more than happy to learn.

But the original reason I made my version was this post in our company Slack:

With following variable: "const x = 910.1807250976562" eslint raises an error "This number literal will lose precision at runtime eslint@typescript-eslint/no-loss-of-precision".

I found this intriguing, and remembering evanw's tool, I thought I'd quickly crack this problem. But to my surprise it didn't help me at all: it would not accept any more precision. Doing some quick math I realized to my horror that the tool actually did rounding, which clearly meant that I had found the reason for the eslint error. I quickly hacked together a Python script to get the "exact" value eslint wanted (which was 910.18072509765625, so one more 5 added to the end). I thought that I would come across this problem later in life, so I decided to hack this float-toy to do what I understood was "correct".
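
For what it's worth, the same "exact" value can also be pulled out of a JavaScript console without a separate script (a sketch, assuming the value in question is the double nearest the flagged literal):

// Sketch: print the stored double behind the flagged literal with enough digits to see it end.
console.log((910.1807250976562).toPrecision(20)); // "910.18072509765625000"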

Believing you are correct, I guess eslint has (or had) a bug.

I understand my tool might not be correct, and it most likely is not useful all that often, but it seems to help in some weird corner cases, so I stand by writing it. But I think I should probably add a note that people should default to evanw's version and only use mine if all else fails.

commented

The number of significant decimal digits is a function of the number of bits in the mantissa, and of that number alone. The same is true of a measurement made in any base: in decimal we call mantissa-exponent notation scientific notation, except that we use 0 as the leading digit because, unlike in binary, there is no unique non-zero digit. Thus the standard tells you how many significant decimal digits exist by fixing the number of mantissa bits that the different formats have:

16-bit half-float: 11 mantissa bits -> 11*log10(2) ~ 3 significant decimal digits
32-bit float: 24 mantissa bits -> 24*log10(2) ~ 7 significant decimal digits
64-bit double: 53 mantissa bits -> 53*log10(2) ~ 15 significant decimal digits
128-bit quad: 113 mantissa bits -> 113*log10(2) ~ 34 significant decimal digits

Those are for normal numbers; denormals have one less bit of precision, but after truncation the digit counts remain the same. These are the numbers of significant decimal digits for all numbers in each format, but the story doesn't end there.

Due to properties of numbers on the real line, it is not possible to express all rationals exactly using a finite number of symbols in any floating-point representation with a fixed radix (aka base). One must also consider the fact that IEEE 754 numbers exist as points on the real line, which means that entire intervals of real numbers have the same IEEE 754 representation. As long as the number of digits satisfies the condition, any number that belongs to the appropriate interval is a valid answer. Most implementations of IEEE 754 try to find the shortest decimal representation that exists in the interval, so for example, if you are looking at the single-precision number obtained by turning only the LSB on, then 1.401298e-45 is acceptable, but 1e-45 is preferred.
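
For example (an illustration in the style of the console snippets earlier in this thread), both of those strings convert to the same bit pattern:

// Illustration: both strings fall in the interval that rounds to the smallest
// positive subnormal float, so both map to bit pattern 0x00000001.
const bitsOf = x => new Uint32Array(new Float32Array([x]).buffer)[0];
console.log(bitsOf(1.401298e-45).toString(16)); // "1"
console.log(bitsOf(1e-45).toString(16));        // "1"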

I can't comment on the eslint message without further information. If one assumes that it is using the IEEE double precision format, then one expects 15 significant digits to be correct, so having to specify 17 does seem like something else is going on. Maybe a bug, or maybe something about the way numbers are manipulated by the software requires extra digits. To be sure, one would have to look at the source code of the part that generates the error for clues about what the error means and how it is detected to begin with.

I never meant to imply that your version is incorrect, but the first time I read all the posts here I was convinced that this tool was giving incorrect answers. As I am trying to implement something like it in C++, I reread the posts several times and noticed that my first assessment was incorrect. JavaScript, like Python, has a unique take on IEEE 754 that makes it very easy to find the shortest decimal representations, but in 'normal' compiled languages this is not an easy thing to do. My post was mainly to ensure that my mistake is not repeated, and in doing so I indicated that this tool gives answers that are more in tune with existing tools and algorithms like Dragon4 than the version you created, which gives answers with an arbitrarily large number of digits. But I didn't mean to imply that those answers were incorrect, and if I did, I apologize.