toml-lang / toml

Tom's Obvious, Minimal Language

Home Page: https://toml.io

Are leading zeros allowed in the exponent part of a float?

avakar opened this issue

An issue has arisen in my parser: avakar/pytoml#9, and I find the spec ambiguous about this. In particular, it says

An exponent part is an E (upper or lower case) followed by an integer part (which may be prefixed with a plus or minus sign).

Does the phrase integer part mean it's an integer as specified in the Integer section of the specs and thus disallows leading zeros? In other words, is the following a valid TOML file?

maximum_error = 4.85e-06

Note the leading zero in the exponent.

I am the initial reporter of the bug avakar/pytoml#9. I would advocate the correctness of using leading zeroes, as this is common practice in a number of languages. (It allows nicely aligned numbers.)

Here are a few examples:

Python 2.7/3.4:

print(4.86e-6)
# Prints "4.86e-06"

Ruby 2.1:

puts(4.86e-6)
# Prints 4.86e-06

Can someone provide a concrete example where allowing leading zeros is useful?

Here is my case. I found out that @avakar's TOML parser rejects leading zeros when it failed to parse some of the TOML files that a Python script of mine was producing automatically. It turned out that these were the files in which a parameter had become so small that scientific notation was used for it. (As I said above, Python's print automatically adds a leading zero to the exponent if it has just one digit.)
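A workaround on the writing side, for parsers that reject padded exponents, is to strip the padding before emitting the value. The following is only a minimal sketch: `format_toml_float` is a hypothetical helper (not part of pytoml or any TOML library), and it assumes the value is serialized from Python's `repr()` output.

```python
import re

# Matches an exponent marker, an optional sign, and the zero padding in front
# of the remaining exponent digits (e.g. the "0" in "e-06").
_EXP_PADDING = re.compile(r"(e[+-]?)0+(\d)")


def format_toml_float(value: float) -> str:
    """Return repr(value) with leading zeros removed from the exponent.

    Hypothetical helper for illustration only.
    """
    text = repr(float(value))
    return _EXP_PADDING.sub(r"\1\2", text)


print(format_toml_float(4.85e-6))  # 4.85e-6
print(format_toml_float(1e22))     # 1e+22
print(format_toml_float(0.5))      # 0.5
```

Values that `repr()` prints without an exponent, or with an exponent that needs no padding, pass through unchanged.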

@ziotom78 Oh, I see, I didn't realize that's what your example was pointing out. That's quite curious that Ruby and Python do that. Do you know what the rationale is behind that behavior?

I'd call that bad design...

Reminds me of that old thing, the C strcmp function and the implications it had for sorting...

I imagine that this way of formatting numbers might allow nicer alignment when printing numbers in columns, though I am not really sure. Python says nothing about this: https://docs.python.org/2/library/string.html#formatspec. (See this question on StackOverflow for some more interesting information: http://stackoverflow.com/questions/9910972/python-number-of-digits-in-exponent.)

Interestingly enough, there are cases where this behaviour is intentional and documented; see how .NET does floating-point formatting (https://msdn.microsoft.com/en-us/library/dwhawy9k.aspx#EFormatString):

The exponent always consists of a plus or minus sign and a minimum of three digits. The exponent is padded with zeros to meet this minimum, if required.

It's a mess...

https://en.wikipedia.org/wiki/Leading_zero#0_as_a_prefix

Decimal vs. octal, etc.

I think C#/VB.net etc. use three digits in the exponent, forcing the user to use custom string formats for 'correct' output..

AFAIK, strictly mathematically, using leading zeros is generally discouraged.

Programming languages have different conventions, obviously, so I don't know what an easy and elegant answer to the question at hand would be..

I agree that C's way of indicating octal numbers by prefixing them with a zero is really confusing. However, here we are discussing whether to allow leading zeroes in exponents. AFAIK, in this case there is no ambiguity at all, as every language I know always interprets exponents as decimal numbers, regardless of leading zeros.

In this case I think internal consistency between "integer parts" of the spec outweighs the benefits of accepting the generated output of certain languages. You could argue that integer values should allow leading zeros, which would then propagate to all integer parts, but I just don't agree with allowing such cruft. TOML is designed for readability as a primary goal, and leading zeros are counter to that. I'll submit a clarification to the float language.

I would argue that C99 is a pretty strong standard and going against it will produce a lot of headaches for a lot of people. One reason is that there is no easy way to control the number of digits in the exponent for output functions like cout / printf, which makes TOML unsuitable as an output format for many C-based languages. As the standard puts it:

"The exponent always contains at least two digits, and only as many more digits as necessary to represent the exponent."
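Python's `%e` and `format()` output follows the same two-digit convention, which makes for a quick illustration of the rule quoted above. This is a small sketch of the formatting behaviour only; the variable names are arbitrary.

```python
# Exponent formatting: at least two exponent digits, more only when needed.
one = "%e" % 1.0
small = format(4.85e-6, ".2e")
huge = format(1e100, ".0e")

print(one)    # 1.000000e+00
print(small)  # 4.85e-06
print(huge)   # 1e+100
```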

A reconsideration would be welcome.

"outweighs the benefits of accepting the generated output of certain languages." This is misleading, as in fact we're talking about most common languages. Exponents with leading zeroes have always been a standard in the computing world, and restricting them here reduces usability to no purpose. TOML is a terrible place to create new notation standards.

I'm sure this comes up with naive computer-generated TOML configurations, and not as much with human-written configurations. My knee-jerk reaction, valid or not, is that the programs generating TOML ought to write human-readable exponents. But leading zeros in exponents, valid or not, are human-readable, especially the ones that printf and cout make.

This is a case in which consistency for the sake of the spec isn't worth the cost to users writing configs with the tools they've got on hand. Integer values shouldn't have leading zeros, but exponent values could.

The spec could be changed to read "An exponent part is an E (upper or lower case) followed by an integer part (which follows the same rules as integer values but may include up to two leading zeros)."

I chose the "up to two" part because printf may use one leading 0, and printf on Windows may use two. We could drop the "up to two" part, though allowing an arbitrary number of leading zeros could lead to abuse. But a C99-based program writing seventy million leading zeros on exponents is about as hard to craft as a C99-based program writing no leading zeros on exponents.

The ABNF could use the following in place of the current definition of exp, which I've tested on Instaparse with the current version of toml.abnf as so modified:

exp = "e" float-exp-part
float-exp-part = [ minus / plus ] zero-prefixable-int
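For readers who prefer regular expressions, the rule above can be approximated as follows. This is a sketch under an assumption: it takes `zero-prefixable-int` to mean a digit followed by any number of digits, each optionally preceded by a single underscore, which is my reading of toml.abnf rather than something stated in this thread. The name `EXP_ANY_ZEROS` is made up for the example.

```python
import re

# "e" is case-insensitive in ABNF, hence [eE].
# Assumed: zero-prefixable-int = DIGIT *( DIGIT / underscore DIGIT )
EXP_ANY_ZEROS = re.compile(r"[eE][+-]?[0-9](?:_?[0-9])*")


def is_valid_exp(text: str) -> bool:
    """True if text is a complete exponent part under the relaxed rule."""
    return EXP_ANY_ZEROS.fullmatch(text) is not None


for candidate in ("e-06", "E+5", "e0_6", "e0000007", "e_6", "e6_", "e"):
    print(candidate, is_valid_exp(candidate))
```

The first four candidates are accepted; the last three are rejected because an underscore must sit between two digits and at least one digit is required.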

And if we really don't want any more than two leading zeros, we could instead do this:

exp = "e" float-exp-part
float-exp-part = [ minus / plus ] float-exp-int
float-exp-int  = [ %x30 [ underscore ] [ %x30 [ underscore ] ] ] unsigned-dec-int

There's probably a better way to write this, but it works. The underscores are ugly but consistent with unsigned-dec-int.
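The restricted variant can be sketched as a regular expression too. Again this rests on an assumption: `unsigned-dec-int` is taken to be either a single digit, or a nonzero digit followed by one or more digits with optional single underscores between them, which is my reading of toml.abnf. The name `EXP_TWO_ZEROS` is hypothetical.

```python
import re

# Up to two leading zeros, each optionally followed by an underscore,
# then an (assumed) unsigned-dec-int: DIGIT / digit1-9 1*( DIGIT / "_" DIGIT )
EXP_TWO_ZEROS = re.compile(
    r"[eE][+-]?(?:0_?(?:0_?)?)?(?:[1-9](?:_?[0-9])+|[0-9])"
)


def is_valid_exp_limited(text: str) -> bool:
    """True if text is a complete exponent part with at most two padding zeros."""
    return EXP_TWO_ZEROS.fullmatch(text) is not None


for candidate in ("e-6", "e-06", "e-006", "e+010", "e-0_6", "e-0006"):
    print(candidate, is_valid_exp_limited(candidate))
```

Only the last candidate is rejected, since it carries three padding zeros. Note that `e000` is still accepted under this rule, because the final zero counts as the integer itself.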

And so I also ask @mojombo to reconsider.

Actually, Microsoft had a function called _set_output_format to print the exponent as two digits which is now obsolete because they also follow C99 starting with Visual Studio 2015.

https://msdn.microsoft.com/en-us/library/bb531344(v=vs.140).aspx

"Exponent formatting: The %e and %E format specifiers format a floating point number as a decimal mantissa and exponent. The %g and %G format specifiers also format numbers in this form in some cases. In previous versions, the CRT would always generate strings with three-digit exponents. For example, printf("%e\n", 1.0) would print 1.000000e+000. This was incorrect: C requires that if the exponent is representable using only one or two digits, then only two digits are to be printed.

In Visual Studio 2005 a global conformance switch was added: _set_output_format. A program could call this function with the argument _TWO_DIGIT_EXPONENT, to enable conforming exponent printing. The default behavior has been changed to the standards-conforming exponent printing mode."

Two observations:

  1. Since before computers, real physical constants used in computation generally had exponents with 2 digits max. For example, consider the tiny mass of an electron (9.11e-31 kg) and the huge mass of the Sun (1.99e30 kg). So, if your program displays or prints out a list of floats, they typically line up most nicely and are easiest to read when you have 3 spots for the exponent: a +/- sign, and two digits. (Note, most programming languages print out the plus sign too, for positive exponents.)

  2. With at least Lua, Python, and Haxe (probably many others, I haven't tried them), small numbers from 1e-5 to 1e-9 get printed "1e-05", "1e-06", ... "1e-09" --- with that extra zero. Note, numbers larger than 1e-5 (and up to around 1e13) most often are printed in decimal, since that's presumably considered more human-readable. So there's really not a huge range of numbers we're talking about that get that extra zero. These languages also accept as input the extra zero, and the plus sign too.
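The second observation is easy to check in Python, one of the languages mentioned. A small sketch showing both directions, printing and parsing:

```python
# Printing: small values switch to scientific notation with a padded exponent.
small_repr = repr(1e-5)
decimal_repr = repr(1e-4)
print(small_repr)    # 1e-05
print(decimal_repr)  # 0.0001

# Parsing: the padded exponent and the explicit plus sign are both accepted.
padded = float("1e-05")
signed = float("1e+05")
print(padded == 1e-5)      # True
print(signed == 100000.0)  # True
```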

Anyhow, my point is, there is effectively zero ambiguity over what 1e-05 means; and for that matter, what 1e+05 means. They've been written that way since time immemorial, programming languages typically print them that way by default and read them that way as well, and even most non-graphing calculators display 2-digit zero-padded exponents. I'd be surprised if TOML didn't accept them that way. If I were using a toml file for a scientific program with floating point config values that I was getting from elsewhere (maybe output from another program), it would be a nuisance to have to edit them and remove the zeros and plus signs from the exponents.

"If I were using a toml file for a scientific program with floating point config values that I was getting from elsewhere (maybe output from another program), it would be a nuisance to have to edit them and remove the zeros and plus signs from the exponents."

This is indeed the main reason why I stopped using TOML in my scientific codes.

@uvtc is right, and considering their observations and @ziotom78's corroboration I would plead to re-open this issue and allow leading zeros. Even those who don't find them useful will presumably find them harmless, and considering that others find them useful – or even essential – that's a clear case in favor of allowing them.

Ok, you've all made a compelling argument. I would love to see a PR adding this capability.

Here goes nothing! I stuck to what I spelled out in my previous comment for any number of leading zeroes, and updated the changelog.

(And trimmed an unnecessary trailing whitespace elsewhere, whoops. Blame my trusty editor.)

Closed by #656.