elm / parser

A parsing library, focused on simplicity and great error messages

Home Page: https://package.elm-lang.org/packages/elm/parser/latest

int parser fails on trailing decimal separator

aforemny opened this issue

Hi,
I have found that the int parser fails on the input "42.". I would have expected that int parses the "42", and leaves "." to be consumed later on.

Here is an Ellie demonstrating the problem: https://ellie-app.com/3gJS4VwRJrfa1

If this is not a bug, I would propose to add documentation describing that behavior and how to work around it.

I suspect this is not a bug, but it's definitely a legitimate concern. IP address parsing provides a concrete example of how the int parser can seem like the right tool, but isn't.

Potential solution

Note: Someone more skilled than I may know how to change Parser (Maybe Int) to Parser Int. I couldn't figure out how to convert the Nothing side of String.toInt into a parser error.

import Parser exposing (Parser, chompWhile, getChompedString)

digitChain : Parser (Maybe Int)
digitChain =
    getChompedString (chompWhile Char.isDigit)
        |> Parser.map String.toInt

Demo: https://ellie-app.com/4hJzJjnPQ3Ma1
Chomp approach, thanks to dmy.

Of course, even this isn't general purpose. As hinted at in the docs, you may still need to make choices about how to handle -. The answer is likely to be different if you're parsing, say, "123. -429. 3482." vs "123-429-3482.".
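To make the - choice concrete, here is a minimal sketch that builds on the digitChain idea. The name signedDigitChain is hypothetical and not part of elm/parser; it treats a leading - as a sign, which suits the first input style but not the second, where - is a separator.

```elm
import Parser exposing ((|.), (|=), Parser)


-- Digits only; yields Nothing when no digits were chomped.
digitChain : Parser (Maybe Int)
digitChain =
    Parser.getChompedString (Parser.chompWhile Char.isDigit)
        |> Parser.map String.toInt


-- Hypothetical variant: a leading "-" is read as a sign.
-- If there is no "-", the first branch fails without consuming
-- input, so oneOf falls through to plain digitChain.
signedDigitChain : Parser (Maybe Int)
signedDigitChain =
    Parser.oneOf
        [ Parser.succeed (Maybe.map negate)
            |. Parser.symbol "-"
            |= digitChain
        , digitChain
        ]
```

With this sketch, running signedDigitChain on "-429." stops before the period, while running digitChain on "123-429-3482." stops before the first -, leaving it to be consumed as a separator.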

Some thoughts on why this isn't a bug

Here's the underlying implementation of int:

From Advanced.elm

int : x -> x -> Parser c x Int
int expecting invalid =
  number
    { int = Ok identity
    , hex = Err invalid
    , octal = Err invalid
    , binary = Err invalid
    , float = Err invalid
    , invalid = invalid
    , expecting = expecting
    }

Here we see that the int parser, like hex et al, is actually a disambiguating layer built on top of the number parser, which is explicitly designed to handle number processing without backtracking.

I'm pretty sure this means that the digits-followed-by-a-dot-that's-not-a-decimal problem can't be solved without a fundamental change in both architecture and behavior for all of the parsers in the number family. I'm also pretty sure that such a shift would go against the library's core goals of simplicity and speed.

@clozach
In my opinion this is clearly a bug, caused by an incorrect implementation/architecture that tries to reuse the number parser where it is clearly not appropriate. A fundamental rewrite seems like the only correct option here.

Also, you are completely wrong about the current parser following the core goals. It's actually the exact opposite, as it does NOT follow the first two of the three core goals:

  • "Make writing parsers as simple and fun as possible" - strange special cases like not allowing a period after an int are neither simple nor fun
  • "Produce excellent error messages." - the error message is awful

Hey @malaire. Sounds like this is causing you some frustration. I'm tempted to mount an argument that I'm not completely wrong, but I'm too familiar with Evan's thoughts on building community to expect any joy from that direction. But if you'll concede that I'm only partially wrong, well…I'd be surprised if that were not the case! 😉

If you're currently stuck trying to resolve something that int can't help you with, would you be up for sharing it on elm-discourse or Slack?

Or, if you're primarily interested in proving a point (more power to you), how about implementing your own number parsing functions and posting them as a package so the whole community can benefit?

If neither of those appeal, how about writing up an experience report? I'd love to learn more about what it is you were trying to do, and what you experienced when int didn't behave as you expected.

As one of the individuals bitten by this (I was the Discourse thread author whom @malaire and others assisted), allow me to mount an argument for why this is unexpected behaviour.

Expectation based on Documentation

In the Parser library, there is the end function. The documentation here states:

Parsers can succeed without parsing the whole string. Ending your parser with end guarantees that you have successfully parsed the whole string.

The implication is that a parser ought to succeed greedily by consuming until the last character that would not produce a parse error. If the user expects an explicit end, one can use the end function to indicate that "nothing should come after this."

This would allow int to be chained with decimal delimiters for use cases like IP Addresses or even splitting the integer portion from the mantissa:

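import Html
import Parser exposing ((|.), (|=), Parser, int, run, succeed, symbol)


type alias IntegerAndMantissa =
    { integer : Int
    , mantissa : Int
    }


splitIntegerAndMantissa : Parser IntegerAndMantissa
splitIntegerAndMantissa =
    succeed IntegerAndMantissa
        |= int
        |. symbol "."
        |= int


-- Expected: Ok { integer = 42, mantissa = 56 }; with the current int this is an Err.
main =
    Html.text <| Debug.toString (run splitIntegerAndMantissa "42.56")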

Without the ability to follow int with symbol ".", it becomes extremely cumbersome to parse an int when all we want is the raw numeral.

If the user did not want the int to be followed by a period, then perhaps:

succeed identity
    |= int
    |. end

-- succeeds with the input "42" but fails with "42."

Potential Solution at the Library level

Changing the default behaviour of int would be a very disruptive change, as I'm sure there are plenty of existing users of this library depending on the fact that parsing 42. throws an error in the current Parser.

My solution avoids changing the current behaviour and instead introduces a new function: numeral

The behaviour described in the expectation above would be mapped onto this new function, leaving the current int behaviour intact (thereby not breaking existing parsers for consumers of this library).

The idea is that in cases where we want to just parse the raw number in a greedy fashion, ignoring interpretations of the period for decimal point (which isn't even universal, to begin with!), we can use numeral, and thus we can chain it with symbol "." or symbol "," as the need arises.

The expected behaviour for numeral would be to return an Int; however, the value will only ever be non-negative, as numeral will not attempt to parse and interpret signs (- and +) as part of the numeral.

Summary

Concepts such as "int" and "float" (as well as positive and negative handling) are great if the user is writing a programming language, and wishes for a standard interpretation of what an int and float are (e.g. IEEE-754), but there are many cases where we would like to simply parse the numeral without a regard for its interpretation (and let's be clear, treating . as a decimal marker is definitely an "interpretation"!).

Introducing a new function numeral that specifically consumes digit characters in a greedy fashion until it can no longer do so seems to be an obvious way to solve this problem. Leave the -/+/. interpretation as default for int and float functions. For every other use case, numeral can give us the raw digits as an Int.

Even this numeral function could have several variations, for example:

  • does it accept leading zeroes? What if it does and I don't want to, or vice-versa?
  • what should be the accepted range? What should be the behavior when the number is out of range, an error or an overflow?

Therefore, even if the current int behavior may not be fully intuitive, why not write the exact variant you need on a case-by-case basis? For example, your numeral function could be:

import Parser exposing (Parser)

numeral : Parser Int
numeral =
    Parser.getChompedString (Parser.chompWhile Char.isDigit)
        |> Parser.andThen
            (\str ->
                case String.toInt str of
                    Just n ->
                        Parser.succeed n

                    Nothing ->
                        Parser.problem "expected an integer"
            )

This one will only accept digits, including leading zeroes, without consuming more characters.
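As a usage sketch, a parser like this can be chained with symbol "." for dotted input such as a version string. The Version record and the version parser below are hypothetical names chosen for illustration, and numeral is repeated so the example stands alone.

```elm
import Parser exposing ((|.), (|=), Parser)


-- Same digit-chomping parser as above, repeated for self-containment.
numeral : Parser Int
numeral =
    Parser.getChompedString (Parser.chompWhile Char.isDigit)
        |> Parser.andThen
            (\str ->
                case String.toInt str of
                    Just n ->
                        Parser.succeed n

                    Nothing ->
                        Parser.problem "expected an integer"
            )


-- Hypothetical record for a three-part version string.
type alias Version =
    { major : Int
    , minor : Int
    , patch : Int
    }


-- Digits, ".", digits, ".", digits, then end of input.
version : Parser Version
version =
    Parser.succeed Version
        |= numeral
        |. Parser.symbol "."
        |= numeral
        |. Parser.symbol "."
        |= numeral
        |. Parser.end
```

Running version on "1.2.3" gives Ok { major = 1, minor = 2, patch = 3 }, because numeral stops at each period instead of trying to interpret it as a decimal point.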

One alternative could be to write a very configurable function in a user package, maybe something like number, but with even more options (leading zeros, minus sign, exponent support, etc.). Its configuration might be unnecessarily complex for it to be in elm/parser though. Or, if people find it useful enough, it could one day become part of a hypothetical elm-community/parser-extra package, for example, before its inclusion in elm/parser is considered.
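A rough sketch of what a small configurable variant might look like, covering only two of the options mentioned (leading zeros and an upper bound). DigitsConfig, digitsWith, checkDigits and octet are all hypothetical names, not an existing package API.

```elm
import Parser exposing (Parser)


-- Hypothetical configuration record.
type alias DigitsConfig =
    { allowLeadingZeros : Bool
    , maxValue : Int
    }


-- Chomp digits only, then validate them against the config.
digitsWith : DigitsConfig -> Parser Int
digitsWith config =
    Parser.getChompedString (Parser.chompWhile Char.isDigit)
        |> Parser.andThen (checkDigits config)


checkDigits : DigitsConfig -> String -> Parser Int
checkDigits config str =
    if not config.allowLeadingZeros && String.length str > 1 && String.startsWith "0" str then
        Parser.problem "leading zeros are not allowed"

    else
        case String.toInt str of
            Nothing ->
                Parser.problem "expected digits"

            Just n ->
                if n > config.maxValue then
                    Parser.problem ("expected a number no greater than " ++ String.fromInt config.maxValue)

                else
                    Parser.succeed n


-- Example configuration: one octet of a dotted IP address.
octet : Parser Int
octet =
    digitsWith { allowLeadingZeros = False, maxValue = 255 }
```

Here octet accepts "192" and stops before a following period, and it rejects "256" and "01". Sign and exponent handling are left out on purpose to keep the sketch short.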

@clozach My use case is parsing a version string like 1.2.3, which I solved by just replacing "." with ":" before parsing.

But note that using the number parser where it should not be used doesn't cause just this issue with the period: issues #25 and #28 are also caused by this incorrect design.

@clozach

Or, if you're primarily interested in proving a point (more power to you), how about implementing your own number parsing functions and posting them as a package so the whole community can benefit?

This library defines custom operators (|= and |.), which Elm 0.19 does not let third-party packages define, so its API cannot be replicated by third-party programmers.