jqlang / jq

Command-line JSON processor

Home Page:https://jqlang.github.io/jq/

Parsing of milliseconds in dates

juliangamble opened this issue · comments

Here the author writes:

Would it also parse ISO timestamps like 2015-03-05T19:40:53.324Z? Eg. what you get from javascript new Date().toISOString().

#364 (comment)

What we need is

strptime("%Y-%m-%dT%H:%M:%S.%fZ")

similar to how python does it.

Is the progress in #1413 (comment) more or less abandoned?

hi @juliangamble
jq 1.6 supports strptime; you can try it:

root@oss-001:test_jq#  echo '"2015-03-05T23:51:47Z"' | jq 'strptime("%Y-%m-%dT%H:%M:%SZ")'
[
  2015,
  2,
  5,
  23,
  51,
  47,
  4,
  63
]

You can see detailed usage in the manual or jq-1.6 manual.yml.
https://github.com/stedolan/jq/blob/master/docs/content/manual/v1.6/manual.yml

I think this issue is about %f, which jq 1.6 does not support.

 % echo '"2015-03-05T23:51:47.487Z"' | jq 'strptime("%Y-%m-%dT%H:%M:%S.%fZ")'
jq: error (at <stdin>:1): date "2015-03-05T23:51:47.487Z" does not match format "%Y-%m-%dT%H:%M:%S.%fZ"
 % python -c 'from datetime import datetime; print(datetime.strptime("2015-03-05T23:51:47.487Z", "%Y-%m-%dT%H:%M:%S.%fZ"))'
2015-03-05 23:51:47.487000

I struggled with this today. The problem is that the Unix strptime doesn't support milliseconds, so I switched to a regex replace:

.last_updated |= sub("(?<time>.*)\\..*Z"; "\(.time)Z")

So I'm transforming this "2015-03-05T23:51:47.487Z" to "2015-03-05T23:51:47Z"

When you want to format a timestamp that contains an offset:

sub("(?<time>.*)\\.[\\d]{3}(?<tz>.*)"; "\(.time)\(.tz)")

This will transform the following:
2015-03-05T23:51:47.487Z to 2015-03-05T23:51:47Z

And will work with this as well:
2015-03-05T23:51:47.487+0100 to 2015-03-05T23:51:47+0100
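
For comparison, the same strip-the-fraction workaround can be sketched outside jq. This is only an illustration in Python; the helper name strip_fraction is made up here, and unlike the jq regex above it accepts any number of fractional digits rather than exactly three.

```python
import re

# Matches ".digits" only when it is followed by the end of the string,
# optionally preceded by "Z" or a numeric offset such as +0100 or +01:00.
_FRACTION = re.compile(r"\.\d+(?=(?:Z|[+-]\d{2}:?\d{2})?$)")


def strip_fraction(timestamp):
    """Drop fractional seconds but keep the zone suffix (hypothetical helper)."""
    return _FRACTION.sub("", timestamp)


print(strip_fraction("2015-03-05T23:51:47.487Z"))
print(strip_fraction("2015-03-05T23:51:47.487+0100"))
```

As with the jq version, the millisecond value is discarded, so this only helps when second precision is enough.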

Was this solved?
I am still not able to parse milliseconds:

jq -n 'now | strftime("%FT%H:%M:%SZ")'
"2022-03-16T19:16:22Z"

You could use gojq:

echo '"2015-03-05T23:51:47.487Z"' | gojq 'strptime("%Y-%m-%dT%H:%M:%S.%fZ")'

[
  2015,
  2,
  5,
  23,
  51,
  47.486999988,
  4,
  63
]

If you do not care about preserving the milliseconds anyway, you can abuse the fact that parsing %G (the ISO 8601 week-based year) has no effect (in glibc, at least) and allows arbitrarily large numbers to be part of the input string.

$ echo '"2015-03-05T23:51:47.487Z"' | jq 'strptime("%Y-%m-%dT%H:%M:%S.%GZ")'
[
  2015,
  2,
  5,
  23,
  51,
  47,
  4,
  63
]

This seems to be platform-dependent. @hvdijk's example above does not work for me:

$ echo '"2015-03-05T23:51:47.487Z"' | jq 'strptime("%Y-%m-%dT%H:%M:%S.%GZ")'
jq: error (at <stdin>:1): date "2015-03-05T23:51:47.487Z" does not match format "%Y-%m-%dT%H:%M:%S.%GZ"

I think this is because I am on macOS. I have a similar problem with the date command in the shell.

We might need to import an implementation of strftime()/strptime(). By using the one from the OS we end up exposing differences in functionality between all the OSes, which clearly annoys a lot of users including me.

I ran into this issue today while massaging logs. Did this workaround:

Put the text below into fromdate.jq, add the argument -L <directory containing fromdate.jq>, and prepend include "fromdate"; to the filter.

# replace fromdateiso8601 and fromdate with versions that support fractional seconds
# NOTE: does not support timezones
# Usage:
# $ jq -n -L . 'include "fromdate"; "2024-02-13T11:10:32.123Z" | fromdate'
# 1707822632.123
def fromdateiso8601:
  ( capture("(?<y>\\d+)-(?<m>\\d+)-(?<d>\\d+)T(?<H>\\d+):(?<M>\\d+):(?<S>\\d+)(?<F>\\.\\d+)?Z") as {$y,$m,$d,$H,$M,$S,$F}
  | [$y,$m,$d,$H,$M,$S,0,0]
  | map(tonumber)
  | .[1] |= .-1 # month starts at 0
  | mktime + ($F | if . then tonumber else 0 end)
  ) // error("date \"\(.)\" does not match format");
def fromdate: fromdateiso8601;
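
The same idea (capture the fields, convert the integer part, then add the fraction back) can be sketched in Python to make the steps explicit. This is an illustrative translation, not part of jq; the name from_date_iso8601 is hypothetical, and like the jq version it only accepts a trailing Z.

```python
import calendar
import re

# Same shape as the jq capture above: six integer fields plus an optional fraction.
_ISO_UTC = re.compile(r"^(\d+)-(\d+)-(\d+)T(\d+):(\d+):(\d+)(\.\d+)?Z$")


def from_date_iso8601(text):
    """Return seconds since the epoch, keeping fractional seconds (hypothetical helper)."""
    match = _ISO_UTC.match(text)
    if match is None:
        raise ValueError(f'date "{text}" does not match format')
    year, month, day, hour, minute, second = (int(g) for g in match.groups()[:6])
    # calendar.timegm uses 1-based months, so no "month starts at 0" adjustment is needed.
    whole = calendar.timegm((year, month, day, hour, minute, second))
    fraction = float(match.group(7)) if match.group(7) else 0.0
    return whole + fraction


print(from_date_iso8601("2024-02-13T11:10:32.123Z"))
```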

Had a quick look into fixing it in jq; it is a bit messy. Maybe the least painful way is to modify jq's own strptime in utils.c to add fraction support, but then we would also need our own tm struct etc., as tm_sec is an int. Are there libc's that have a strptime with fraction support? How do they communicate fractions?

Are there libc's that have a strptime with fraction support?

I suspect not. The standard requires tm_sec to be an int, so no implementation can support fractions that way. The only way would be to add a new member (say, tm_nsec), but mktime cannot support such a member: standard programs may leave tm_nsec uninitialised, so it would have to be ignored, whereas sensible programs using that hypothetical platform extension would expect the usual mktime behaviour, where out-of-range values are valid and the struct tm is normalised by adjusting other fields. If mktime sees a tm_nsec value of e.g. -1, it has no way of knowing whether that was uninitialised and therefore whether it should be ignored.

ksh solves it by having a struct Tm_s which has all the standard struct tm members, plus extra. It has a tmscan function, mostly a wrapper around tmxscan, which does basically the same thing as strptime except for filling a struct Tm_s rather than struct tm. And it even has its own strptime function (I think it's from an era where libc could not be assumed to provide it) that works by calling tmscan and copying the standard fields over from struct Tm_s to struct tm, but since you actually want the new fields too, you would not use it. ksh code could be included in jq directly if its license (EPL) is acceptable, or the same approach can be taken with a custom implementation.

@hvdijk Thanks for the info and breakdown! If someone wants to look into this, I think there are at least two viable paths:

  • Modify the existing strptime implementation to add %f. That seems to be what other implementations call fractions, but I didn't find where it comes from. ksh seems to have %N?
  • Adopt some other strptime implementation, like the one from ksh

Another concern is how to support both integer seconds and optionally fractions. A maybe not optimal but straightforward solution is to redefine fromdate as def fromdateiso8601: strptime("%Y-%m-%dT%H:%M:%SZ")? // strptime("%Y-%m-%dT%H:%M:%S%fZ") | mktime; I guess?
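
The "try the integer format first, then fall back to the fractional one" idea can be sketched in Python, where %f already exists. This is only an illustration of the fallback pattern; the names FORMATS and parse_utc are hypothetical, and jq itself has no %f at the time of this discussion.

```python
from datetime import datetime, timezone

# Formats are tried in order; the second one relies on Python's %f.
FORMATS = ("%Y-%m-%dT%H:%M:%SZ", "%Y-%m-%dT%H:%M:%S.%fZ")


def parse_utc(text):
    """Return epoch seconds for a Z-suffixed timestamp, with or without a fraction."""
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue  # this format did not match, try the next one
        return parsed.replace(tzinfo=timezone.utc).timestamp()
    raise ValueError(f'date "{text}" does not match any known format')


print(parse_utc("2015-03-05T23:51:47Z"))
print(parse_utc("2015-03-05T23:51:47.487Z"))
```

The cost is a second parse attempt for every input that carries a fraction.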

I'm also not sure whether something like this should be coordinated with other date/time improvements; there are a bunch of issues and old open PRs touching similar things.

That seems to be what other implementations call fractions, but I didn't find where it comes from. ksh seems to have %N?

nicowilliams pointed out in #1413 that there is a lot of variance in other implementations.

Ruby appears to use %N, like ksh, to mean fractional seconds, allowing a width to be specified between the % and the N. It is up to the user to specify the decimal separator between %S and %N.

R appears to use %OS to mean seconds including the decimal separator and fractions, with the width after it. But ISO C specifies %OS to mean "the seconds, using the locale’s alternative numeric symbols", rather than "the seconds, including fractions", so repurposing it to mean something different seems questionable.

ISO 8601 puts a decimal separator after ss (as in ss,sss or ss.s) to specify fractional seconds, but of course that does not carry over to any strftime/strptime.

Python uses %f, yes. It seems to have gone with a different letter because Python specifically did not want to implement the then-established %N: supporting %3N, %6N, etc. complicated parsing, and no other specifier needed anything like that. So in Python, %f in strftime is simply always exactly six digits.

Presumably jq will want whatever is added to strptime to be added to strftime as well.
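
A short Python illustration of the %f behaviour described above: formatting always yields six digits, so shorter output has to be trimmed by hand, while parsing accepts fewer digits. The variable names are only for this example.

```python
from datetime import datetime

stamp = datetime(2015, 3, 5, 23, 51, 47, 487000)

# strftime: %f is always zero-padded to six digits (microseconds).
six_digits = stamp.strftime("%H:%M:%S.%f")

# There is no %3f; milliseconds are obtained by cutting the string.
millis = six_digits[:-3]

# strptime: %f accepts one to six digits and pads on the right.
parsed = datetime.strptime("23:51:47.487", "%H:%M:%S.%f")

print(six_digits, millis, parsed.microsecond)
```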

Another concern is how to support both integer seconds and optionally fractions.

In Python, there is an open issue to add support for this in some way, python/cpython#100929. No way has been picked yet.

In Ruby, I cannot find anything indicating support for this.

In R, with %OS, this is already handled automatically.

In .NET (DateTime.ParseExact), this can be handled by using the form that takes a list of permitted formats.

I'm also not sure whether something like this should be coordinated with other date/time improvements

Good point. Since you specifically say %Y-%m-%dT%H:%M:%SZ -- hardcoding Z there may or may not always be desirable. Issue #1053 may be good to consider at the same time.

Great summary, I learned a lot, and it seems we're not alone with this mess :) Personally, I like the R approach with the %OS variants that optionally support fractions. That would minimize the performance impact and also make it possible to use it in custom strptime formats.

@hvdijk is this something you would like to work on? I'm not sure how much time/motivation I have to put into it at the moment.

I'm happy to have a more in-depth look into this and related strftime/strptime issues next week. Whether I'd also be able to do any coding for it I don't know yet, it's possible that this more in-depth look reveals more problems that need to be accounted for, but in that case at least writing down those problems should be helpful.

👍 sounds good

Representation in struct tm

There are two ways we can represent fractional seconds in jq's equivalent of struct tm (an array): we can either include it in the seconds field, or we can add a new field. Experimentation reveals that jq already includes fractional seconds in the seconds field:

$ ~/jq/jq -rnc '0.25 | gmtime'
[1970,0,1,0,0,0.25,4,0]

Unless a compelling reason is given to change this, I would suggest keeping this as it is now.

Representation in time_t

It seems obvious that if time t represents a particular moment in time and t+1 represents one second later, then t+0.25 should represent 250 milliseconds later. But this is less obvious when we use negative timestamps. The current behaviour of jq appears to be that if gmtime is given a fractional value, t - floor(t) represents the fractional seconds to be added to the time represented by trunc(t). That is, currently we have:

$ ~/jq/jq -rn '[-1.25, -1, -0.75, -0.5, -0.25, 0, 0.25, 0.5, 0.75, 1, 1.25][] | tostring + ": " + (gmtime | tostring)'
-1.25: [1969,11,31,23,59,59.75,3,364]
-1: [1969,11,31,23,59,59,3,364]
-0.75: [1970,0,1,0,0,0.25,4,0]
-0.5: [1970,0,1,0,0,0.5,4,0]
-0.25: [1970,0,1,0,0,0.75,4,0]
0: [1970,0,1,0,0,0,4,0]
0.25: [1970,0,1,0,0,0.25,4,0]
0.5: [1970,0,1,0,0,0.5,4,0]
0.75: [1970,0,1,0,0,0.75,4,0]
1: [1970,0,1,0,0,1,4,0]
1.25: [1970,0,1,0,0,1.25,4,0]

This breaks monotonicity and does not match Python, which shows:

$ TZ=UTC python3 -c 'from datetime import datetime
print(datetime.fromtimestamp(-1.25))'
1969-12-31 23:59:58.750000

Python simply makes it so that a time_t value of -1.25 means 1.25 seconds before a time_t value of 0.

Implementing this in jq will result in a subtle incompatibility. In my opinion, this is justifiable.
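
The difference between the two interpretations comes down to how the timestamp is split into whole seconds and a fraction. A minimal Python sketch, with hypothetical function names, assuming the current jq behaviour is exactly as described above (trunc for the whole part, t - floor(t) for the fraction):

```python
import math


def split_as_observed(t):
    """Split described above for current jq: whole part truncates toward zero."""
    return math.trunc(t), t - math.floor(t)


def split_with_floor(t):
    """Proposed split: whole part rounds down, so whole + fraction == t."""
    whole = math.floor(t)
    return whole, t - whole


for t in (-1.25, -0.75, 0.25):
    print(t, split_as_observed(t), split_with_floor(t))
```

With the first split, -1.25 becomes second -1 plus 0.75, which is later than -1 itself; with the second, it becomes second -2 plus 0.75, matching the Python output shown above. For non-negative inputs the two splits agree.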

Locale

jq calls setlocale(LC_ALL, ""); at startup and uses the current locale for strftime/strptime. There is no way to change the locale within a jq script. It would be convenient to be able to do so, but I do not consider that necessary for the same PR.

I believe that as strftime and strptime are locale-aware, whatever is created for fractional seconds should also use the locale-specific decimal separator. Pretending for a moment that we have %OS for this, I believe the correct behaviour for jq will be:

$ export LC_ALL=en_US.UTF-8
$ ~/jq/jq -rn '1234.5 | strftime("%H:%M:%OS")'
00:20:34.500000 (hypothetical, not current jq output)
$ export LC_ALL=nl_NL.UTF-8
$ ~/jq/jq -rn '1234.5 | strftime("%H:%M:%OS")'
00:20:34,500000 (hypothetical, not current jq output)

If a user wishes to parse a timestamp that is formatted with a decimal separator other than the one used in the user's locale, the user should set the locale prior to invoking jq, as in e.g. LC_ALL=C jq.

As for which format specifier to use, %OS exists already:

$ export LC_TIME=fa_IR.UTF-8
$ ~/jq/jq -rn '1234567890 | strftime("%Oy/%Om/%Od %OH:%OM:%OS")'
۰۹/۰۲/۱۳ ۲۳:۳۱:۳۰

I believe it would be wrong to change this to print ۰۹/۰۲/۱۳ ۲۳:۳۱:30.000000 or such. %OS already has an established meaning, jq already uses it in that established meaning, and breaking that would lead to an inconsistency in jq.

Standard strftime/strptime have E and O as modifier characters. %O* means to use alternative numeric symbols; %E* means to use an alternative era-based format. Neither seems appropriate.

However, we can take inspiration from fprintf and support %.3S to print seconds with three decimals. We can also support %.S to print seconds with a reasonable to-be-determined number of decimals. As for strptime, we can support %S to read integer seconds as we do now, preserving compatibility as much as we reasonably can, but support %.S to read seconds either with or without fractions.

There is no way to use alternative numeric symbols with fractional seconds: the way alternative digits are specified in locale data only tells us how to format values 0-99. Because this is provably impossible to support, I propose not even trying.

For both strftime and strptime, it should be possible to have a custom implementation that parses the format string, handles each literal character itself, handles %.(n)S itself, and handles every other format specifier by calling the standard strftime and strptime functions.

Time zones

Time zones are permitted in strftime and strptime, using the %z and %Z specifiers, but these do not work properly in strftime and are ignored in strptime (on glibc). These are issues #2475 and #2195.

$ TZ=Europe/Amsterdam ~/jq/jq -cnr '1234567890 | strftime("%Y-%m-%dT%H:%M:%S%Z")'
2009-02-13T23:31:30CET
$ TZ=Europe/Amsterdam ~/jq/jq -cnr '1234567890 | strftime("%Y-%m-%dT%H:%M:%S%z")'
2009-02-13T23:31:30+0000
$ ~/jq/jq -cnr '["2001-01-01T12:34:56+0000", "2001-01-01T12:34:56+0100", "2001-01-01T12:34:56+0200"][] | strptime("%Y-%m-%dT%H:%M:%S%z")'
[2001,0,1,12,34,56,1,0]
[2001,0,1,12,34,56,1,0]
[2001,0,1,12,34,56,1,0]

jq tries to handle this in its my_mktime function if the system provides a struct tm definition that includes time zone information, but given that jv2tm and tm2jv are hardcoded to only preserve the standard fields of struct tm, I do not see how this can possibly work. I believe the right thing to do here is to ensure that any time zone information that the system's struct tm provides is saved and restored, at which point the existing functions should just do the right thing.

Because, at least for common implementations, this does not appear to require any changes to strftime/strptime itself, I suspect this would not conflict with the changes needed for fractional seconds, and we can keep them separate.
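
The effect of dropping zone information between parsing and conversion can be reproduced in Python. This is only an analogy to the jv2tm/tm2jv problem described above, not jq's actual code path: the offset survives while the parsed object is kept whole, and disappears once only the plain time fields are passed along.

```python
import calendar
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S%z"
stamp = datetime.strptime("2001-01-01T12:34:56+0100", FMT)

# The parsed object still carries the +0100 offset, so this is correct.
aware_epoch = stamp.timestamp()

# Passing only the broken-down fields on loses the offset: the wall-clock
# time is then interpreted as if it were UTC.
naive_epoch = calendar.timegm(stamp.timetuple())

print(aware_epoch, naive_epoch, naive_epoch - aware_epoch)
```

The two results differ by exactly the offset (3600 seconds here), which is the same kind of error as the three identical strptime outputs shown above.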

Summary

I think for this particular issue:

  • change the handling of negative time_t values to be what Python does
  • create wrapper functions for strftime and strptime that accept %.S and %.3S (and other widths) to format and parse fractional seconds, using the decimal separator of the current locale

I think independent of this issue:

  • preserve any platform-specific time zone information in struct tm (covered by existing issues)
  • add a way to change the locale from within a jq script (not covered by an existing issue, but related issue: #2218)

Do these seem reasonable, or do you prefer something else?

Thanks 👍 that summary will be very useful for me or someone who wants to implement support for this.

By wrapping, do you mean having functions that parse/massage the format so that strftime/strptime never see or have to care about %.S etc.?

I wonder if an alternative is to import some strftime implementation in addition to the existing strptime implementation, modify both to support fractions, and then use them unconditionally on all platforms?

@nicowilliams when you have time: any thought?

By wrapping, do you mean having functions that parse/massage the format so that strftime/strptime never see or have to care about %.S etc.?

Basically yes, I was thinking we can implement our_strftime("%H:%M:%.S", t) as (pseudo-code) strftime("%H", t) + ":" + strftime("%M", t) + ":" + our_strftime_seconds(t). Or optionally combine more to reduce the number of calls to the standard strftime function. The advantage here is that everything that the underlying libc supports, is already handled automatically, whereas a fully custom strftime implementation is hard to get right, especially that alternative digit stuff. And likewise for strptime (although that one is a little bit more complicated).
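
To make the wrapping idea concrete, here is a rough Python sketch of the strftime side. It is not jq code and not a proposed implementation: the name strftime_fractional, the default of six digits, and the choice to truncate rather than round are all assumptions made for the example. The %.S and %.3S forms are the ones proposed in this thread; every other specifier is handed to the platform's strftime unchanged.

```python
import locale
import math
import re
import time

# Either the proposed "%.S" / "%.<n>S", or any other specifier
# (optionally with glibc-style flags, a width, or an E/O modifier).
_SPEC = re.compile(r"%(?:\.(\d*)S|[-_0^#]*\d*[EO]?.)", re.DOTALL)


def strftime_fractional(fmt, t, default_digits=6):
    """Format epoch seconds t (UTC), handling %.S here and the rest via strftime."""
    whole = math.floor(t)            # floor, so negative inputs stay monotonic
    fraction = t - whole
    tm = time.gmtime(whole)
    point = locale.localeconv()["decimal_point"]  # locale-specific separator

    def expand(match):
        digits = match.group(1)
        if digits is None:
            # Ordinary specifier: delegate to the underlying strftime.
            return time.strftime(match.group(0), tm)
        n = int(digits) if digits else default_digits
        seconds = time.strftime("%S", tm)
        if n == 0:
            return seconds
        # Truncate instead of rounding so the result can never reach 60.
        scaled = int(fraction * 10 ** n)
        return f"{seconds}{point}{scaled:0{n}d}"

    # Literal characters between specifiers are left untouched by sub().
    return _SPEC.sub(expand, fmt)


print(strftime_fractional("%H:%M:%.S", 1234.5))
print(strftime_fractional("%H:%M:%.3S", 1234.5))
```

Because the fraction is computed in binary floating point, values such as .487 may print as .486 when truncated, the same effect visible in the gojq output earlier in the thread. The strptime side would need more work, since the wrapper also has to decide where each piece of the input ends.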

Yes, I think we're going to have to import an implementation of strftime()/strptime(). The alternative is to say "sorry, complain to your platform vendor/distro".