moment / moment

Parse, validate, manipulate, and display dates in javascript.

Home Page:momentjs.com

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

'H' format string broken in strict parsing

tp opened this issue · comments

I hit an unexpected error when using 'H' to parse 24h-time values without a leading zero:

> var moment = require('moment');
undefined
> moment.version
'2.5.0'
> moment().format('H')
'17'
> var parsed = moment('17', 'H');
undefined
> parsed.isValid()
true
> var parsed = moment('17', 'H', true);
undefined
> parsed.isValid()
false

I would expect H to be usable here, so that the code can also work with times before noon.

You're right. We made strict parsing a bit too strict in the last version.

Hmm, did we? If H eats 2 tokens whenever possible, that will lead to broken parses:

moment('3141113', 'HDDMMYY', true).format(); //=> '2013-11-14T03:00:00-05:00'
moment('3141113', 'HHDDMMYY', true).format(); //=> 'Invalid date'

That's a big part of why we made strict act the way it does. (Of course, the real solution would be to use a global regex so we get backtracking, but that's a separate discussion)

More generally, why even have "H" and "HH" if they're the same?

Well, we have H and HH for format. I don't think we need them for parsing. For parsing they can mean the same, I'm not sure if other libraries are doing something similar. The same applies for years -- if you use YYYY then you can not parse a year that has different number of digits, which means everybody using YYYY to parse is wrong :) According to the cldr token docs, the number of repetitions (in YYYY -- 4), means pad to that much, meaning bigger years will always be left intact. I guess if you print something with YYYY then you'd expect to parse it back with YYYY (same goes for H).

Meaning, you're proposing that in strict mode, H always parses two characters?

I'm OK with that, but it won't allow you to roundtrip H from the formatter back to the parser. That's never possible to guarantee, because the whole point of H as a formatting token is that it outputs a variable number of characters.

I think maybe the best thing to do would be to eliminate H as a parsing token altogether, but we probably can't because that would break too much for non-strict users.

@icambron H is NOT intended to produce a specific number of characters. Its intended to NOT pad to a specific number of characters. In the same way that Y doesn't mean a single digit year, it means the year, verbatim. (YY is a special case that always displays 2).

I see two types of issues here. The year family of tokens (/Y+/, /g+/, /G+/). The problem here is that you don't know how many digits they would generate. There is a lower bound (YYYY would produce at least 4 digits), but not an upper one. I can see these use cases:

  • regular years (meaning +/- 100 years from now) displayed with 2 digits (YY)
  • regular years displayed with 4 digits (YYYY)
  • irregular years (negative, too small, too big) displayed with any number of digits (rest of tokens)

So I suggest

  • YYYY to parse exactly 4 digits
  • YY to parse exactly 2 digits
  • Y to parse anything /[-+]?\d+/

This behavior is kind of different than the formatting behavior, but I think it has the least amount of surprise for most users. Because most users use regular years, and YY and YYYY make sense for them.

The second issue is with all other tokens (h, H, d, D, M, s, w, W..., DDD). These can be mostly one or two digits, and only padding is changed with the different formatting tokens. I think if you want rock solid parsing, you'd use HH (or equivalent) for both formatting and parsing. H would mean -- this comes from a user, and it should match any number of digits (until a non-digit is found). In general you don't pad units going to the user, and user won't pad units when entering. But user also won't put date and month numbers glued together, so greedy scanning works in this case.

So I propose:

  • non padded version (h, H, D, DDD, ...) mean eager /\d+/ tokenization
  • padded versions (hh, HH, DD, DDDD) mean strict /\d\d/ tokenization

@ichernev: I think if you want rock solid parsing, you'd use HH (or equivalent) for both formatting and parsing. H would mean -- this comes from a user, and it should match any number of digits (until a non-digit is found).

But one can neither use H for all 24 hours of the day (see above), nor HH if single-digit hours don't have a leading 0:

var parsed = moment('1', 'HH', true);
parsed.isValid() => false

So what would you suggest?

/\d+/ means any number of digits.

@ichernev I think that's what I said too. I just wanted to clarify that you won't be able to guarantee round tripping. It sounds like you're OK with that and only intend H (as a parsing token) to be for user-generated strings. I'm just imagining this ticket:

moment(moment().hours(1).format('HDD'), 'HDD', true); //WHY YOU NO WORK?

If you're OK with answering that with "dude, use HH", then I'm good with it too.

Well, you're shooting yourself in the foot with that code IMHO. So yeah :)

Btw the initial implementation of strict would have worked here -- because it first builds the entire regex, and the second token would be \d\d, so the first \d+ greedy would not match 2 digits -- just one ;-)

Right, the advantage of a global regex is that we get backtracking and we don't have to deal with greed as much.