Are leading zeros allowed in the exponent part of a float?

Question

avakar opened this issue 9 years ago · comments

An issue has arisen in my parser: avakar/pytoml#9, and I find the spec ambiguous about this. In particular, it says

An exponent part is an E (upper or lower case) followed by an integer part (which may be prefixed with a plus or minus sign).

Does the phrase integer part mean it's an integer as specified in the Integer section of the specs and thus disallows leading zeros? In other words, is the following a valid TOML file?

maximum_error = 4.85e-06

Note the leading zero in the exponent.
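The two readings of "integer part" can be sketched as two regular expressions. This is only an illustration: the names STRICT_EXP, LENIENT_EXP, and exponent_ok are hypothetical, and underscores between digits are ignored for brevity.

```python
import re

# Reading 1 (assumption): the exponent follows the Integer rules, so no leading zeros.
STRICT_EXP = re.compile(r"[eE][+-]?(0|[1-9][0-9]*)")
# Reading 2 (assumption): the exponent is any run of decimal digits.
LENIENT_EXP = re.compile(r"[eE][+-]?[0-9]+")


def exponent_ok(literal, pattern):
    """Return True if the text from the first 'e'/'E' onward fully matches pattern."""
    idx = literal.lower().find("e")
    return idx != -1 and pattern.fullmatch(literal[idx:]) is not None


print(exponent_ok("4.85e-06", STRICT_EXP))   # False under the strict reading
print(exponent_ok("4.85e-06", LENIENT_EXP))  # True under the lenient reading
print(exponent_ok("4.85e-6", STRICT_EXP))    # True under both readings
```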

Maurizio Tomasi · Answer 1 · Tue Oct 06 2015 19:03:05 GMT+0800 (China Standard Time)

I am the initial reporter of the bug avakar/pytoml#9. I would advocate the correctness of using leading zeroes, as this is common practice in a number of languages. (It allows nicely aligned numbers.)

Here are a few examples:

Python 2.7/3.4:

print(4.86e-6)
# Prints "4.86e-06"

Ruby 2.1:

puts(4.86e-6)
# Prints 4.86e-06

Tom Preston-Werner · Answer 2 · Sun Nov 01 2015 04:53:41 GMT+0800 (China Standard Time)

Can someone provide a concrete example where allowing leading zeros is useful?

Maurizio Tomasi · Answer 3 · Sun Nov 01 2015 06:27:41 GMT+0800 (China Standard Time)

Here is my case. I found out that @avakar's TOML parser rejects leading zeros when it failed to parse some of the TOML files that a Python script of mine was producing automatically. It turned out that those files were the ones where a parameter had become so small that scientific notation was used for it. (As I said above, Python's print automatically pads the exponent with a leading zero if it has just one digit.)
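For a generating script that has to satisfy a parser following the strict reading, one possible workaround is to strip the padding before writing the value. A minimal sketch, assuming the script formats floats with repr(); the helper name toml_float is hypothetical and not part of any TOML library:

```python
import re


def toml_float(value):
    """Format a float with repr() and drop leading zeros from the exponent."""
    text = repr(float(value))
    # "e-06" becomes "e-6"; the lookahead keeps at least one exponent digit.
    return re.sub(r"e([+-]?)0+(?=\d)", r"e\1", text)


print(toml_float(4.85e-6))  # 4.85e-6
print(toml_float(1e16))     # 1e+16 (no padding to remove)
```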

Tom Preston-Werner · Answer 4 · Sun Nov 01 2015 06:52:41 GMT+0800 (China Standard Time)

@ziotom78 Oh, I see, I didn't realize that's what your example was pointing out. That's quite curious that Ruby and Python do that. Do you know what the rationale is behind that behavior?

HRXN · Answer 5 · Sun Nov 01 2015 17:32:52 GMT+0800 (China Standard Time)

I'd call that bad design...

Reminds me of that old thing, the C strcmp function and the implications it had for sorting...

Maurizio Tomasi · Answer 6 · Mon Nov 02 2015 17:53:33 GMT+0800 (China Standard Time)

I imagine that this way of formatting numbers might allow nicer alignment when printing numbers in columns, though I am not really sure. Python says nothing about this: https://docs.python.org/2/library/string.html#formatspec. (See this question on StackOverflow for some more interesting information: http://stackoverflow.com/questions/9910972/python-number-of-digits-in-exponent.)

Interestingly enough, there are cases where this behaviour is intentional and documented; see how .NET does floating-point formatting (https://msdn.microsoft.com/en-us/library/dwhawy9k.aspx#EFormatString):

The exponent always consists of a plus or minus sign and a minimum of three digits. The exponent is padded with zeros to meet this minimum, if required.

HRXN · Answer 7 · Thu Nov 05 2015 07:02:18 GMT+0800 (China Standard Time)

It's a mess...

https://en.wikipedia.org/wiki/Leading_zero#0_as_a_prefix

Decimal vs. octal, etc.

I think C#/VB.net etc. use three digits in the exponent, forcing the user to use custom string formats for 'correct' output..

AFAIK, strictly mathematically, using leading zeros is generally discouraged.

Programming languages have different conventions, obviously, so I don't know what an easy and elegant answer to the question at hand would be..

Maurizio Tomasi · Answer 8 · Thu Nov 05 2015 17:28:23 GMT+0800 (China Standard Time)

I agree that C's way of indicating octal numbers by prefixing them with a zero is really confusing. However, in this case we are discussing whether to allow leading zeroes in exponents. AFAIK, in this case there is no ambiguity at all, as every language I know always interprets exponents as decimal numbers, regardless of any leading zeros.
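The difference can be checked quickly in Python, taken here as one example language: the exponent is read in base 10 whether or not it is padded, while a zero-prefixed integer string only changes value when a base-8 reading is explicitly requested.

```python
# Exponent digits are read in base 10, with or without leading zeros.
padded = float("1e010")
plain = float("1e10")
print(padded == plain)  # True

# A zero-prefixed integer is where the octal ambiguity lives:
# the same digits give different values depending on the base.
as_octal = int("010", 8)     # 8
as_decimal = int("010", 10)  # 10
print(as_octal, as_decimal)
```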

Tom Preston-Werner · Answer 9 · Sat Jan 23 2016 08:18:02 GMT+0800 (China Standard Time)

In this case I think internal consistency between "integer parts" of the spec outweighs the benefits of accepting the generated output of certain languages. You could argue that integer values should allow leading zeros, which would then propagate to all integer parts, but I just don't agree with allowing such cruft. TOML is designed for readability as a primary goal, and leading zeros are counter to that. I'll submit a clarification to the float language.

Alexander Duda · Answer 10 · Wed May 09 2018 21:23:03 GMT+0800 (China Standard Time)

I would argue that C99 is a pretty strong standard and going against it will produce a lot of headaches for a lot of people. One of the reasons is that there is no easy way to manipulate the number of digits in the exponent for output functions like cout / printf, making TOML unsuitable to be output by many languages based on C.

"The exponent always contains at least two digits, and only as many more digits as necessary to represent the exponent."

A reconsideration would be welcome.
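The quoted rule can be observed without a C compiler, since Python's printf-style %e conversion produces the same exponent shape in practice. A small sketch (the variable names are only for illustration):

```python
# "%e" pads the exponent to at least two digits...
one = "%e" % 1.0
small = "%e" % 4.86e-6
# ...and uses more digits only when the exponent needs them.
huge = "%e" % 1e100

print(one)    # 1.000000e+00
print(small)  # 4.860000e-06
print(huge)   # 1.000000e+100
```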

Paul Anguiano · Answer 11 · Wed May 09 2018 22:14:43 GMT+0800 (China Standard Time)

"outweighs the benefits of accepting the generated output of certain languages." This is misleading, as in fact we're talking about most common languages. Exponents with leading zeroes have always been a standard in the computing world, and restricting them here reduces usability to no purpose. TOML is a terrible place to create new notation standards.

Dave Ostroske · Answer 12 · Thu May 10 2018 14:26:36 GMT+0800 (China Standard Time)

I'm sure this comes up with naive computer-generated TOML configurations, and not as much with human-written configurations. My knee-jerk reaction, valid or not, is that the programs generating TOML ought to write human-readable exponents. But leading zeros in exponents, valid or not, are human-readable, especially the ones that printf and cout make.

This is a case in which consistency for the sake of the spec isn't worth the cost to users writing configs with the tools they've got on hand. Integer values shouldn't have leading zeros, but exponent values could.

The spec could be changed to read "An exponent part is an E (upper or lower case) followed by an integer part (which follows the same rules as integer values but may include up to two leading zeros)."

I chose the "up to two" part because printf may use one leading 0, and printf on Windows may use two. We could drop the "up to two" part, though allowing an arbitrary number of leading zeros could lead to abuse. But a C99-based program writing seventy million leading zeros on exponents is about as hard to craft as a C99-based program writing no leading zeros on exponents.

The ABNF could use the following in place of the current definition of exp, which I've tested on Instaparse with the current version of toml.abnf as so modified:

exp = "e" float-exp-part
float-exp-part = [ minus / plus ] zero-prefixable-int

And if we really don't want any more than two leading zeros, we could instead do this:

exp = "e" float-exp-part
float-exp-part = [ minus / plus ] float-exp-int
float-exp-int  = [ %x30 [ underscore ] [ %x30 [ underscore ] ] ] unsigned-dec-int

There's probably a better way to write this, but it works. The underscores are ugly but consistent with unsigned-dec-int.
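For readers who want to try the two variants without an ABNF tool, they can be approximated as regular expressions. This is a rough translation, not the tested grammar: it assumes zero-prefixable-int means a digit followed by digits with optional single underscores between them, and unsigned-dec-int means either one digit or a non-zero digit followed by such a run. The names ANY_ZEROS and UP_TO_TWO are hypothetical.

```python
import re

# Variant 1: exp = "e" [ minus / plus ] zero-prefixable-int
ANY_ZEROS = re.compile(r"[eE][+-]?[0-9](?:_?[0-9])*")

# Variant 2: at most two leading zeros in front of an unsigned-dec-int
UP_TO_TWO = re.compile(
    r"[eE][+-]?(?:0_?(?:0_?)?)?(?:[1-9](?:_?[0-9])+|[0-9])"
)

for exp in ("e-6", "e-06", "e+006", "e0006"):
    print(exp, bool(ANY_ZEROS.fullmatch(exp)), bool(UP_TO_TWO.fullmatch(exp)))
```

Under these assumptions, only the last sample (three leading zeros) separates the two variants: the first pattern accepts it and the second rejects it.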

And so I also ask @mojombo to reconsider.

Alexander Duda · Answer 13 · Thu May 10 2018 18:25:44 GMT+0800 (China Standard Time)

Actually, Microsoft had a function called _set_output_format to print the exponent as two digits which is now obsolete because they also follow C99 starting with Visual Studio 2015.

https://msdn.microsoft.com/en-us/library/bb531344(v=vs.140).aspx

"Exponent formatting: The %e and %E format specifiers format a floating point number as a decimal mantissa and exponent. The %g and %G format specifiers also format numbers in this form in some cases. In previous versions, the CRT would always generate strings with three-digit exponents. For example, printf("%e\n", 1.0) would print 1.000000e+000. This was incorrect: C requires that if the exponent is representable using only one or two digits, then only two digits are to be printed.

In Visual Studio 2005 a global conformance switch was added: _set_output_format. A program could call this function with the argument _TWO_DIGIT_EXPONENT, to enable conforming exponent printing. The default behavior has been changed to the standards-conforming exponent printing mode."

John Gabriele · Answer 14 · Tue Aug 20 2019 14:51:36 GMT+0800 (China Standard Time)

Two observations:

1. Since before computers, real physical constants used in computation generally had exponents with 2 digits max. For example, consider the tiny mass of an electron (9.11e-31 kg) and the huge mass of the Sun (1.99e30 kg). So, if your program displays or prints out a list of floats, they typically line up most nicely and are easiest to read when you have 3 spots for the exponent: a +/- sign, and two digits. (Note, most programming languages print out the plus sign too, for positive exponents.)
2. With at least Lua, Python, and Haxe (probably many others, I haven't tried them), small numbers from 1e-5 to 1e-9 get printed "1e-05", "1e-06", ... "1e-09", with that extra zero. Note, numbers larger than 1e-5 (and up to around 1e13) most often are printed in decimal, since that's presumably considered more human-readable. So there's really not a huge range of numbers we're talking about that get that extra zero. These languages also accept as input the extra zero, and the plus sign too.

Anyhow, my point is, there is effectively zero ambiguity over what 1e-05 means; and for that matter, what 1e+05 means. They've been written that way since time immemorial, programming languages typically print them that way by default and read them that way as well, and even most non-graphing calculators display 2-digit zero-padded exponents. I'd be surprised if TOML didn't accept them that way. If I were using a toml file for a scientific program with floating point config values that I was getting from elsewhere (maybe output from another program), it would be a nuisance to have to edit them and remove the zeros and plus signs from the exponents.
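The round trip described above can be reproduced in Python, one of the languages mentioned. A short sketch; the variable names are only for illustration:

```python
# Default printing of the small values discussed above.
printed = [repr(float(f"1e-{n}")) for n in range(5, 10)]
print(printed)  # ['1e-05', '1e-06', '1e-07', '1e-08', '1e-09']

# The same strings are accepted back as input, as is an explicit plus sign.
round_trip = [float(s) for s in printed]
with_plus = float("1e+05")
print(with_plus)  # 100000.0
```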

Maurizio Tomasi · Answer 15 · Tue Aug 20 2019 15:36:43 GMT+0800 (China Standard Time)

"If I were using a toml file for a scientific program with floating point config values that I was getting from elsewhere (maybe output from another program), it would be a nuisance to have to edit them and remove the zeros and plus signs from the exponents."

This is indeed the main reason why I stopped using TOML in my scientific codes.

Christian Siefkes · Answer 16 · Wed Aug 21 2019 19:35:07 GMT+0800 (China Standard Time)

@uvtc is right, and considering their observations and @ziotom78's corroboration I would plead to re-open this issue and allow leading zeros. Even those who don't find them useful will supposedly find them harmless, and considering that others find them useful – or even essential – that's a clear case in favor of allowing them.

Pradyun Gedam · Answer 17 · Wed Aug 21 2019 19:54:11 GMT+0800 (China Standard Time)

@mojombo ^

Tom Preston-Werner · Answer 18 · Thu Aug 22 2019 05:14:38 GMT+0800 (China Standard Time)

Ok, you've all made a compelling argument. I would love to see a PR adding this capability.

Dave Ostroske · Answer 19 · Thu Aug 22 2019 07:07:07 GMT+0800 (China Standard Time)

Here goes nothing! I stuck to what I spelled out in my previous comment for any number of leading zeroes, and updated the changelog.

(And trimmed an unnecessary trailing whitespace elsewhere, whoops. Blame my trusty editor.)

Tom Preston-Werner · Answer 20 · Fri Aug 23 2019 02:44:27 GMT+0800 (China Standard Time)

Closed by #656.