toml-lang / toml

Tom's Obvious, Minimal Language

Home Page:https://toml.io

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Unclear whether control characters are allowed in comments

jorisvr opened this issue · comments

It is not clear from the specification whether control characters (U+0000 .. U+001F and U+007F) are allowed in comments.

The section Comment only states that "A hash symbol marks the rest of the line as a comment.".
This implies that any non-newline characters are allowed after the hash symbol.

Discussion about control characters only appears in the section String, which should have no implication on comments.

The ABNF definition forbids control characters (except tab) in comments, but the ABNF is not authoritative.

I think the section Comment should explicitly state which characters are allowed in comments.
Note that it would be unlogical to allow e.g. form-feed inside comments, because form-feed is traditionally a stronger separator than newline.

ABNF is not authoritative.

It is.

I think the section Comment should explicitly state which characters are allowed in comments.

If there's more people who think this would add value, we might do this. I don't want to complicate an otherwise straightforward explanation.

ABNF is not authoritative.

It is.

The general theme in issues #566, #567, #568, #569 is that I believe the specification to be ambiguous. I assumed the text document is the complete, stand-alone, authoritative specification, while the ABNF has only experimental status.

If indeed the ABNF is authoritative, perhaps these issues are resolved.
However I think in that case the specification text should explicitly declare the ABNF grammer to be authoritative and defer to it for details. Otherwise how will anybody know where to look for a precise definition of TOML ?

From the point of view of a parser, it's probably easier to allow control characters (except for newline) in comments, because then the rule is "Once you see a comment marker, ignore everything until the next newline". This is simpler (at least for some parser libraries) to implement than "Once you see the comment marker, ignore all valid characters but throw an error for 0x00-0x1f and 0x7f, and switch modes at the newline".

So I'd personally prefer for the rule to be "control characters are legal (and ignored) in comments; after a comment marker is seen, only the next newline matters."

@rmunn On the other hand, if control chars are prohibited in comments too, they would be prohibited anywhere in a TOML document since they are also prohibited in strings of all kinds (and they are certainly not allowed outside of strings or comments). So prohibiting them everywhere might actually be quite simple for many parsers.

Currently, I am ambivalent on this issue tbh. I don't see any major gains either way so I'm inclined to say status quo would win here.

I can be convinced either way though.

@pradyunsg I assume if you say "status quo" you mean the status quo as defined by the ABNF (which prohibits control chars in comments)?

The original poster's point was exactly that, in the human-readable spec, the status quo is not defined. I too would advice clarifying this in the written spec in addition to the ABNF, for people who can read English better than Backus–Naur.

Thanks everyone for the patience here -- I took a long time to come back around to this.

Yes, I think forbidding control characters in comments makes sense. If there's a significant use case for control characters in comments, do holler.

Meanwhile, I think what we need here is a PR clarifying in text what's already clear in the ABNF -- that control characters are not allowed in comments.

I understand this is closed now, and I also understand the logic behind the decision, however I think the following has not been considered and might make you reconsider the decision here:

If a comment with certain Unicode characters (except newline) is to be considered invalid, then this has the following drawbacks:

  • a comment can involuntarily invalidate the whole TOML file. I think such accidents should be prevented.
  • parsers must also parse each character in a comment. This may complicates parsing. Not because it's almost the same as 'string', but because of different handling of error situations, informing users why a comment is invalid.
  • on the same token, it slows down parsing,whatever way you look at it
  • I think the status quo requires strings to be valid Unicode strings. That's a little more parsing than just checking for control chars, and a parser really shouldn't be bothered with this added complexity when it encounters comments.
  • conceptually, people consider comments not part of the syntax, and feel free to dump whatever they like in them. Disallowing certain chars contradicts this very human 'feeling' about what comments are.

Bottom line of this argument, I'd like to propose a simplification: define a comment as any range of code points, until a newline is encountered.

This is simpler, and clearer for end users.

As an implementation author, I agree with @abelbraaksma above. Firstly, it's substantially more complicated to parse out control characters in comments and raise an error, and this kind of strictness provides no real advantage to users - it results in documents being rejected for content that is commented out, which is itself an indication by the user that the content should be ignored by the parser.

I don't seriously object to allowing control characters in comments. Are there any security concerns about them here? I could object to a delete char, 0x7F, immediately following the hash sign. But that's it.

If you think it's merited, let's reopen this issue, and I can compose a PR that would allow most everything but newlines in comments.

I understand these sentiments, but currently the spec says "Control characters other than tab ... are not permitted in comments." So if we change this again, all 1.0-compatible parsers would supposedly have to be changed again.

Also some implementations might conceivably check for control chars before doing any further parsing step? If so, like I wrote earlier: "prohibiting them everywhere might actually be quite simple for many parsers."

Maybe this is a good case to let implementations decide on their behavior? The spec could say something like:

TOML implementations may throw an error or issue a warning if encountering any control characters other than tab (U+0000 to U+0008, U+000A to U+001F, U+007F) in a comment, but they don't have to do so.

In that way, both rejecting control chars and ignoring them together with the rest of the comment would both be fine.

This language would have to be adjusted regarding Unicode validity too, but here I think the same logic applies: some parsers might get the rejection of invalid Unicode sequences for free, because they use an OS/library function that does it for them. For them, if we require that "any range of code points" is accepted in the comment, whether valid Unicode or not, might actually make things harder. So "it's implementation-dependent" may be the best course of action too.

@ChristianSi We don't need to say anything if implementations are going to choose their own behavior for handling comments. We just need to be simple and obvious. So we can simplify things to their essentials, and leave the rest to interpretation.

It's a given that any TOML document must be a valid Unicode document encoded in UTF-8. So we need not worry about invalid code points (like surrogates) or invalid byte strings, which would yield well-defined errors of their own.

One concern that wasn't addressed is whether or not the 0x0D in a Windows-type newline ought to be ignored. Since the line feed character is central to how comments work, we need only to mention that either type of newline can mark the end of a comment. I propose this text to replace the restrictions at the end of the Comments section in toml.md:

All characters except line feeds (0x0A) are permitted in comments and may be
ignored. At their discretion, parsers that read comments may exclude a final
carriage return (0x0D) appearing before a terminating line feed.

That's about all that needs to be said, I think. Simple parsers can just start from the #, scan ahead for 0x0A, and throw out all that stuff, excluding the 0x0A. More sophisticated parsers can take Windows-style newlines into account when preserving comments or interpreting them. Although this may require existing TOML v1.0 parsers to reimplement their comment handling code, these changes do ostensibly simplify parsing and address each of @abelbraaksma's concerns.

For now, I've implemented the stricter mechanism in TomlJ - but in a bit of a sketchy way: I lex out any comment as # through to any control character (COMMENT : '#' (~[\u0000-\u0008\u000A-\u001F\u007F])*).

That works only because I know control characters aren't allowed anywhere else, so will immediately generate an error (except for \r?\n). But the error isn't particularly descriptive - it'll only indicate that the control character isn't expected, e.g. Unexpected '\\r', expected a newline or end-of-input, when it should really indicate that it isn't allowed within a comment.

@cleishm, not sure if this is feasible, but if there are inconsistent line endings, I'd report that as a specific error. Something like "Inconsistent line endings detected in file". Personally, I think any combination of CR|LF|CRLF should be allowed for simplicity's sake, but that ship has sailed, as mentioned in #924.

@abelbraaksma For most parsers, it's probably not feasible. In my case, I use the ANTLR lexer (tokenizer) to detect newlines so that the parser only has to deal with the newline token. To give that kind of error would mean the lexer needs to be stateful - recording that it saw a newline of one type and then of a different type later. Not impossible, of course, but not trivial either.

@cleishm, I’m not sure how the lexer in your case defines the errors, but there ought to be a location where that exception is thrown. Since it’s only ever going to be a new line character that could possibly mean “wrong/corrupt line endings” and since it could mean nothing else, you could just make the method that throws the error itself smarter, ie by simply switching over whether the incorrect token is a \r, and adjust the message accordingly.

It’s been many years ago that I worked with ANTLR, so forgive me if I’m missing the obvious, and oversimplify things…

As it stands, a TOML document with mixed LFs and CRLFs for line endings should not produce an error. Either line ending would be handled properly as a newline. And within multiline strings, the parser will normalize the line endings in the resulting strings.

Yes. But I think we were talking about sole CRs here, which are explicitly disallowed, when not followed by an LF.

Which I agree with. My point is that the error message should not complain about "inconsistent" line endings, because they actually can be inconsistent, as long as only the two permitted line endings are used.

Totally, I didn’t mean to muddy the waters. Sorry for the confusion!

In TOMLJ, sole CRs are already raised as errors, and a newline is tokenized from \r?\n, meaning it will consume any CR immediately before the linefeed (https://github.com/tomlj/tomlj/blob/main/src/main/antlr/org/tomlj/internal/TomlLexer.g4#L32). So this fits the current specification.

It won't handle documents where only CRs are used as line endings, but that is not currently permitted by the spec (and such documents are very uncommon now anyway).