tree-sitter / tree-sitter

An incremental parsing system for programming tools

Home Page: https://tree-sitter.github.io

Support `atof` for WASM builds

kylegoetz opened this issue

Problem

Languages define rules for what constitutes a valid floating-point number. As such, part of the parsing process is determining whether a sequence of digits constitutes a valid float. atof is important for doing this. Without it, a user essentially must include the source code for atof in their own scanner.

However, per the C standard, the accuracy of atof's conversion is implementation-defined and can vary from platform to platform:

"The accuracy of the floating-point operations (+, -, *, /) and of the library functions in <math.h> and <complex.h> that return floating-point results is implementation-defined, as is the accuracy of the conversion between floating-point internal representations and string representations performed by the library functions in <stdio.h>, <stdlib.h>, and <wchar.h>. The implementation may state that the accuracy is unknown."

As such, pasting the atof source into a scanner is insufficient.

Expected behavior

No response

  1. grammar.js rules will generally suffice. See python.
  2. You don't actually need atof in the scanner. See rust.
  3. Using strtof / strtod is preferred over atof.
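One reason strtod is usually preferred over atof is that it reports where parsing stopped, so a caller can tell "0" apart from "not a number". A minimal sketch for a hosted C environment, where `parse_whole_double` is a hypothetical helper name and not part of tree-sitter:

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: returns true only if the whole string `text`
 * was consumed by strtod as a floating-point value. */
static bool parse_whole_double(const char *text, double *out) {
    char *end = NULL;
    double value = strtod(text, &end);
    if (end == text || *end != '\0') {
        return false; /* nothing parsed, or trailing characters remain */
    }
    *out = value;
    return true;
}
```

By contrast, atof("abc") simply returns 0.0 with no way to detect the failure. Note that strtod accepts C's own float syntax (leading whitespace, hex floats, "inf", "nan"), which may not match the language being parsed.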

I don't think using atof is a good idea in general. You should probably parse the float manually, in case atof doesn't handle it the way your language does (and/or for perf reasons).
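As a rough illustration of parsing a float manually, here is a sketch that matches float syntax by hand without any libc conversion function. The grammar (digits, a dot, digits, optional exponent) and the names `skip_digits` and `match_float` are assumptions made for this example, not Unison's or Odin's actual rules:

```c
#include <ctype.h>
#include <stddef.h>

/* Advance past a run of ASCII digits; returns the new position. */
static const char *skip_digits(const char *p) {
    while (isdigit((unsigned char)*p)) {
        p++;
    }
    return p;
}

/* Hypothetical matcher for the illustrative grammar
 *   digits+ '.' digits+ ([eE] [+-]? digits+)?
 * Returns the length of the matched prefix, or 0 if there is no match. */
static size_t match_float(const char *s) {
    const char *p = skip_digits(s);
    if (p == s || *p != '.') {
        return 0; /* need at least one digit, then a dot */
    }
    const char *frac = p + 1;
    p = skip_digits(frac);
    if (p == frac) {
        return 0; /* need at least one digit after the dot */
    }
    if (*p == 'e' || *p == 'E') {
        const char *exp = p + 1;
        if (*exp == '+' || *exp == '-') {
            exp++;
        }
        const char *exp_end = skip_digits(exp);
        if (exp_end != exp) {
            p = exp_end; /* accept the exponent only if it has digits */
        }
    }
    return (size_t)(p - s);
}
```

This version works on a NUL-terminated string so it can be compiled on its own. An external scanner would presumably do the same walk one character at a time through the lexer's `lookahead` and `advance` instead.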

You can also check out how I did it with Odin for another scanner example, but Odin has a wonky float format - https://github.com/tree-sitter-grammars/tree-sitter-odin/blob/master/src/scanner.c#L27

@amaanq It seems you aren't ensuring what you parse as a float is within the upper and lower bounds permitted by Odin. I looked at Odin's docs and it doesn't seem like there is an upper and lower bound (at least, the docs are silent on the question), which explains why you don't need to do that.

However, in Unison, a float must be IEEE 754 double precision, which means just checking for /numbers\.numbers/ is not an option, as illegally large numbers will mistakenly be accepted as a Float. For example, 1.7976931348623157E+309 would pass a simple "bunch of numbers in a row" check, but it is not a Float in Unison because it exceeds the upper bound.
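To make the range concern concrete, here is a sketch of how overflow can be detected with strtod in a hosted C environment. It relies on strtod returning plus or minus HUGE_VAL and setting errno to ERANGE on overflow. `fits_in_double` is a hypothetical name, and whether strtod is available in a WASM scanner build is not something this sketch assumes:

```c
#include <errno.h>
#include <math.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: true if `text` starts with a number that strtod
 * converts without reporting overflow. */
static bool fits_in_double(const char *text) {
    char *end = NULL;
    errno = 0;
    double value = strtod(text, &end);
    if (end == text) {
        return false; /* not a number at all */
    }
    if (errno == ERANGE && (value == HUGE_VAL || value == -HUGE_VAL)) {
        return false; /* magnitude too large for a double */
    }
    return true;
}
```

With this check, 1.7976931348623157E+308 is accepted while 1.7976931348623157E+309 is rejected, even though both have the same digit-dot-digit shape.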

That's why I'm using atof, which converts a string into a C double, and a C double is IEEE 754 double precision on most platforms. So this one function does exactly what Unison specifies.

So atof does exactly what I need, and I can't get the same behavior without it, short of copying and pasting the atof source. But atof might be implemented differently across architectures, even if the end result should be the same, so my concern is that a copy-pasted atof in my scanner will fail on some architectures.

Edit: Actually, looking at the atof source code for macOS (the first place I checked), it doesn't appear to make any bounds checks either, nor does it do safe addition/multiplication. Harrumph. Looks like I will need to implement it myself.

Huh, are you trying to ensure your floats are valid in the scanner? That's something a compiler/semantic analyzer should be doing, not something you should do with tree-sitter, unless exceeding a certain value makes it something else (some other node/type/whatever).

I'm pretty conservative with what's added to the standard lib symbols list, because each added symbol increases the size of the wasm module a decent amount, and only libc functions that are applicable/useful in a variety of places should be considered. I still don't think atof makes sense to add if you're trying to ensure that a float is within some range.

Huh, are you trying to ensure your floats are valid in the scanner? That's something a compiler/semantic analyzer should be doing, not something you should do with tree-sitter

If the language's rules define a float in a certain way, and someone has typed something that violates that, it's an error in syntax and so shouldn't be parsed as a float node at all, right?

In any case, it turns out atof doesn't do checked arithmetic anyway, so it's out as an option. :/

not exactly?

#include <stdio.h>

int main(void) {
    /* Larger than DBL_MAX (about 1.7976931348623157e+308). */
    double a = 1.8976931348623157e+308;
    printf("%f\n", a);
    return 0;
}

That's out of bounds for a double, yet it still compiles just fine, albeit with a warning. Idk what Unison does here or how it handles it, but I can't think of an instance where this would be a syntax error in any language I know. It's just not a syntax error in general to use a number that's out of bounds. That's something for a compiler or something with semantic info to deduce, not a parser.

Generally, I don’t think your parser needs to check that the numeric values are valid. Like @amaanq said, that would typically happen at a different layer.

If the language's rules define a float in a certain way, and someone has typed something that violates that, it's an error in syntax and so shouldn't be parsed as a float node at all, right?

It's a compilation error, not a lexing error. Lexers (usually) cannot and do not check bounds.
As far as I understand from the Haskell code, Unison's internal lexer doesn't check them either.