jqlang / jq

Command-line JSON processor

Home Page: https://jqlang.github.io/jq/

fromdate should support timezone offsets

CFoltin opened this issue

Hi,

parsing the following command:
"2015-12-24T09:29:30+01:00" | fromdate
gives the error
jq: error (at :49): date "2015-12-24T09:29:30+01:00" does not match format "%Y-%m-%dT%H:%M:%SZ"
although the date looks fine to me. Only the timezone "Z" seems to work.

BR, Chris from FreeMind

Supporting timezones in general is problematic. Currently we use the C library for dealing with parsing and formatting datetime strings. The more features of the C library we use, the harder it will be to stop using it later. Also, the C library's datetime capabilities vary quite a bit from one version/platform to another, so the more features we use, the more issues we can expect users to file that are ultimately caused by C library issues. Limiting datetime strings to UTC ISO8601 was an explicit choice to limit this pain. Adding timezone offset support when parsing is doable though, and probably worth the effort. I'm much less interested in adding timezone offsets when formatting datetime strings.

Hi,

thanks for the reply. AFAIK, the format I tried to parse is a valid ISO8601 date. I've got the impression that only the "timezone" 'Z' works in jq. Or could you give me an example of a different zone being parsed?

BR, Chris

One of the nice things about dates being in UTC is that then dates can be compared and ordered lexicographically. So I'm inclined to only support timezone offsets in fromdate.
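To illustrate the ordering point, the sketch below sorts a few made-up UTC timestamps as plain strings and checks that the result matches sorting by parsed epoch seconds. The sample values and the shell variable name `prog_sort` are invented for this example; only the jq builtins `sort`, `sort_by` and `fromdate` are used.

```shell
# Sort made-up UTC timestamps as strings and by parsed epoch seconds,
# then print the string-sorted list and whether both orders agree.
prog_sort='
  ["2015-12-24T09:29:30Z", "2015-03-06T04:21:47Z", "2015-12-24T08:29:30Z"]
  | (sort) as $by_string
  | (sort_by(fromdate)) as $by_time
  | $by_string, ($by_string == $by_time)
'
jq -cn "$prog_sort"
```

This prints the list in chronological order followed by `true`. The equivalence only holds while every string uses the same UTC "Z" form, which is the argument being made here.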

Incidentally, you can use strptime() and the %z format specifier to parse ISO8601 timezones. Is that good enough for you?

Hi,
yes, an alternative would be ok.
I tried
"2015-12-24T09:29:30+01:00" | strptime("%Y-%m-%dT%H:%M:%SZ"), but same error.
Do you have an example at hand?
BR, Chris

@CFoltin You should use "2015-12-24T09:29:30+01:00" | strptime("%Y-%m-%dT%H:%M:%S%z"), but note that glibc doesn't seem to support timezone offsets very well :(

(And, of course, %z is a glibc extension to the POSIX standard, so we can't rely on it.)

I see that glibc doesn't like a : in the tz offset, and it doesn't seem to adjust the result by the tz offset. I'm not inclined to work around this glibc bug, nor to implement timezone offsets in jq proper at this time, but I'll leave this open, and if someone submits a PR, we'll consider it!

+1 for having timezone support in fromdate, even if that means working with UTC (or the computer's locale) internally and as todate output timezone.

Although dates aren't in the JSON standard, I see a bunch of use cases for jq processing. My current use case is to use group_by to organize objects in day or hour bins.

All my incoming JSON dates include the %z timezone format (Ruby's datetime) and I cannot change these inputs. I'm pondering whether dates can be scanned and timezone-converted using sed or awk before piping the JSON objects to jq, but that would be an awful hack to work with.

@aliekens About your specific use case, since jq 1.5 has support for regular expressions and string substitutions and replacements, you can use this to rearrange the string so that the timezone appears where it can be parsed.

Yay, I have figured out a way to correctly parse datetimes with timezones in jq, but it requires a bit of hacking. Here's how I can now parse my (Ruby's) datetimes:

$ TZ=/usr/share/zoneinfo/UTC jq -n '"2015-12-24T09:29:30+00:00" | sub("(?<before>.*):"; .before ) | strptime("%Y-%m-%dT%H:%M:%S%z") | todate'
"2015-12-24T09:29:30Z"
$ TZ=/usr/share/zoneinfo/UTC jq -n '"2015-12-24T09:29:30+01:00" | sub("(?<before>.*):"; .before ) | strptime("%Y-%m-%dT%H:%M:%S%z") | todate'
"2015-12-24T08:29:30Z"

Some notes:

  • The example above is on a Mac. Behavior may be different in other environments because of C library differences. For example, jqplay (which runs on what platform?) always returns the same datetime, independent of the timezone.
  • Set the timezone of jq's environment to UTC with TZ=/usr/share/zoneinfo/UTC. Otherwise the result depends on your computer's local timezone and comes out wrong whenever that timezone is not UTC.
  • strptime's %z format does not support "+01:00" timezones, it needs to be formatted as "+0100" (some implementations of strptime have a %: flag to support timezones with a colon). The last colon in the string is therefore subbed using a regex. (BTW, the docs need info or an example on how to use named captures)

(BTW, the docs need info or an example on how to use named captures)

In the meantime, the FAQ has a question: Q: How are named capture variables used?
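For readers who land here with the named-capture question, a minimal sketch: `capture` returns an object keyed by the group names, and inside the replacement argument of `sub` the same names are available as fields of `.`. The regex, the group names `sign`, `hh` and `mm`, and the shell variable `prog_named` are made up for this illustration.

```shell
# Named groups become object fields after capture, and fields of "."
# inside the replacement string of sub.
prog_named='
  "2015-12-24T09:29:30+01:00"
  | ( capture("(?<sign>[-+])(?<hh>\\d{2}):(?<mm>\\d{2})$") | [.sign, .hh, .mm] ),
    ( sub("(?<sign>[-+])(?<hh>\\d{2}):(?<mm>\\d{2})$"; "\(.sign)\(.hh)\(.mm)") )
'
jq -cn "$prog_named"
```

The first output is `["+","01","00"]`, the second is the input with the colon removed from the offset, `"2015-12-24T09:29:30+0100"`.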

I tested aliekens's example with jq-1.5-1-a5b5cbe, and I get the same output every time:
"2015-12-24T09:29:30Z"

Another workaround, in case your gmt offset is already a float (remember to reverse the offset to get UTC):

echo '{"date": "2015-03-06T04:21:47Z", "offset": 6.5}' \
| jq '(.date | fromdate) - 3600 * .offset | todate'
"2015-03-05T21:51:47Z"

Ugly manual parsing, but, hey, it works:

echo '{"date":"2020-01-30T02:35:20-08:00"}' | \
  jq 'def parseDate(date):
        date
        | capture("(?<no_tz>.*)(?<tz_sgn>[-+])(?<tz_hr>\\d{2}):(?<tz_min>\\d{2})$")
        # Read the wall-clock part as UTC, then subtract the signed offset in seconds.
        | (.no_tz + "Z" | fromdateiso8601)
          - (.tz_sgn + "60" | tonumber) * ((.tz_hr | tonumber) * 60 + (.tz_min | tonumber));
      parseDate(.date)'

My main use case for this is consuming timestamps in JSON produced by Go, which will (without going to some extra work) serialize times as e.g. 2022-08-17T22:59:45.157237491-07:00. That is RFC 3339 format, which amounts to a subset of ISO-8601 (with a couple of exceptions which need to be supported in practice). I suspect there are lots of similar use cases.

It's entirely reasonable to only support UTC for the "broken down" datetime representation, but like it or not there's a lot of json data out there which is stored with a zone offset. There is also no need to support parsing named time zones (which are not permitted by either RFC3339 or ISO-8601). It's misleading to claim to support iso8601 date format if you don't support data in formats which conform to the spec and are produced by the standard libraries of common programming languages.

@erhhung's solution is a reasonable one. I'd extend it a little to

  1. Handle all time zone formats in the spec, including Z, ±hh:mm, ±hhmm, and ±hh timezone styles.
  2. Handle fractional seconds (which are not handled by at least most glibc implementations)
capture("(?<no_tz>[^.]*)(?<frac_sec>\\.\\d+)?(?:(?:(?<tz_sgn>[-+])(?<tz_hr>\\d{2}):?(?<tz_min>\\d{2})?)|Z)$") |
(.no_tz + "Z" | fromdateiso8601)
+ ("0"+.frac_sec | tonumber)
- (.tz_sgn + "60" | tonumber)
* ((.tz_hr // "0" | tonumber) * 60 + (.tz_min // "0" | tonumber))

That is,

$ echo '"2020-01-30T02:35:20.001-08:00"
"2020-01-30T02:35:20Z"
"2020-01-30T02:35:20+0330"
"2020-01-30T02:35:20+03"
' | jq 'capture("(?<no_tz>[^.]*)(?<frac_sec>\\.\\d+)?(?:(?:(?<tz_sgn>[-+])(?<tz_hr>\\d{2}):?(?<tz_min>\\d{2})?)|Z)$") | (.no_tz + "Z" | fromdateiso8601) + ("0"+.frac_sec |tonumber) - (.tz_sgn + "60" | tonumber) * ((.tz_hr // "0" | tonumber) * 60 + (.tz_min // "0" | tonumber))|todate'
"2020-01-30T10:35:20Z"
"2020-01-30T02:35:20Z"
"2020-01-29T23:05:20Z"
"2020-01-29T23:35:20Z"

Note that the above deliberately does not handle "implicitly local time" dates (those lacking either Z or a time zone suffix). The spec allows them, but supporting them would probably lead to incorrect results in most use cases. It will permit -00:00 (and interpret it as equivalent to UTC), which is illegal in ISO-8601 but is used in RFC3339 to indicate that the UTC time is known while the offset to local time is unknown.

Building that data massaging into fromdateiso8601 so that it can support all RFC3339 dates (or at least all of those which are also valid in ISO-8601) would make the function considerably more useful in many real-world use cases.
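Until something like that exists, one way to package the expression above for reuse is a user-defined function. The sketch below is only an illustration: the name `fromrfc3339` and the shell variable `def_fromrfc3339` are hypothetical (this is not a jq builtin), and the choice to raise an explicit error for strings without `Z` or a numeric offset is an assumption about what callers would want.

```shell
# Hypothetical helper "fromrfc3339": wraps the capture-based expression from
# this thread in a function and fails loudly on strings that carry no offset.
def_fromrfc3339='
def fromrfc3339:
  . as $in
  | ( capture("(?<no_tz>[^.]*)(?<frac_sec>\\.\\d+)?(?:(?:(?<tz_sgn>[-+])(?<tz_hr>\\d{2}):?(?<tz_min>\\d{2})?)|Z)$")
      // error("no Z or numeric offset in: \($in)") )
  | (.no_tz + "Z" | fromdateiso8601)
    + ("0" + .frac_sec | tonumber)
    - (.tz_sgn + "60" | tonumber)
      * ((.tz_hr // "0" | tonumber) * 60 + (.tz_min // "0" | tonumber));
'
echo '"2020-01-30T02:35:20-08:00" "2020-01-30T10:35:20Z" "2020-01-30T02:35:20"' \
  | jq "$def_fromrfc3339"'try fromrfc3339 catch .'
```

The first two inputs name the same instant and both yield 1580380520. The third has no offset, so `capture` produces nothing and the `//` fallback raises the error, which `try ... catch .` prints as a string.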

For a workaround, please check #1117

There are external small libraries in C that specifically do ISO8601 parsing. https://github.com/chansen/c-dt has great test suite and no dependencies AFAIK.

Would a PR building on that be accepted? Currently jq hardly supports ISO8601 at all: it does not accept a single line in the following list of valid ISO8601 strings:

20121224
2012-12-24 23:59:59
2012-12-24T00:00:00+00:00
2012359
2012359T235959+0130
2012-359
2012W521
2012-W52-1
2012Q485
2012-Q4-85
0001-Q1-01

Building on @chansen's lib would change that.

: ; jq -cnr '"2015-12-24T09:29:30+01:00" | strptime("%Y-%m-%dT%H:%M:%S%z")|todate'
2015-12-24T09:29:30Z

On platforms where %z is supported by strptime()/strftime() this works today.

But not all platforms have equally good C time function support, so, yes, @fatso83, I think we'd consider a replacement of the C library's time functions with something like @chansen's c-dt. However, we should make sure that that library handles the tzdata database on Unix, Linux, and Windows first.

Are people asking for fromdate to be flexible in the formats it parses?

I don't know about others and the general function for date parsing, but I assumed a function that indicated parsing iso8601 would do just that, so support for tz in strings - without involved workarounds - is much wanted 😃

That would be nice, but I think people primarily just want fromdateiso8601 to be able to handle at least the subset of ISO-8601 date strings which are also compliant with RFC3339.

I don't really see how tzdata enters into this, given that ISO-8601 doesn't allow named time zones, and RFC3339 does not allow unqualified local time. It only allows numeric offsets, which don't require a database lookup. And in general most use cases (e.g. sorting or filtering relative to some timestamp) don't even require that it keeps track of the offset after parsing.
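To make that concrete, a numeric offset can be turned into seconds with plain arithmetic and no zone database. The sketch below is illustrative only: the helper name `offset_seconds` and the shell variable `def_offset_seconds` are made up, and it assumes the offset has already been split off the timestamp in one of the spellings Z, ±hh:mm, ±hhmm or ±hh.

```shell
# Hypothetical helper "offset_seconds": converts a numeric UTC offset string
# to signed seconds using arithmetic only (no tzdata lookup).
def_offset_seconds='
def offset_seconds:
  if . == "Z" then 0
  else capture("^(?<sgn>[-+])(?<hh>\\d{2}):?(?<mm>\\d{2})?$")
       | (if .sgn == "-" then -1 else 1 end)
         * ((.hh | tonumber) * 3600 + ((.mm // "0") | tonumber) * 60)
  end;
'
jq -cn "$def_offset_seconds"'["Z", "+01:00", "-0800", "+0330", "+03"] | map(offset_seconds)'
```

This prints `[0,3600,-28800,12600,10800]`. Subtracting such a value from the timestamp parsed as if it were UTC gives epoch seconds, which is all that sorting or filtering needs.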

@fatso83 @adam-azarchs try this:

def fromdateiso8601: first(strptime("%Y-%m-%dT%H:%M:%S%z")?,strptime("%Y-%m-%dT%H:%M:%SZ"))|timegm;

The Z in UTC times can be lower case, but on Linux at least strptime() doesn't handle that. And for %z the : between hours and minutes in the offset is optional on Linux.
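If lowercase `t` or `z` does show up in input, one possible workaround (a sketch, not something proposed elsewhere in this thread) is to upper-case the string before handing it to `fromdate`. The sample timestamp and the shell variable `prog_upcase` are made up, and the trick is only safe for purely numeric timestamps like this one.

```shell
# Upper-case the "t" and "z" so the strict "%Y-%m-%dT%H:%M:%SZ" format matches.
prog_upcase='"2023-03-31t17:15:00z" | ascii_upcase | fromdate'
jq -n "$prog_upcase"
```

This prints 1680282900, the epoch seconds for 2023-03-31T17:15:00Z.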

@nicowilliams

Linux

$ uname -a
Linux litten 6.1.0-17-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1 (2023-12-30) x86_64 GNU/Linux
$ jq --version
jq-1.6
$ jq '. | first(strptime("%Y-%m-%dT%H:%M:%S%z")?,strptime("%Y-%m-%dT%H:%M:%SZ"))' <<< '"2023-03-31T17:15:00+00:00"'
[
  2023,
  2,
  31,
  17,
  15,
  0,
  5,
  89
]
jq: error (at <stdin>:1): date "2023-03-31T17:15:00+00:00" does not match format "%Y-%m-%dT%H:%M:%SZ"
$ jq-1.7.1 --version
jq-1.7.1
$ jq-1.7.1 '. | first(strptime("%Y-%m-%dT%H:%M:%S%z")?,strptime("%Y-%m-%dT%H:%M:%SZ"))' <<< '"2023-03-31T17:15:00+00:00"'
[
  2023,
  2,
  31,
  17,
  15,
  0,
  5,
  89
]

macOS

$ sw_vers
ProductName:		macOS
ProductVersion:		14.2.1
BuildVersion:		23C71
$ jq --version
jq-1.7.1
$ jq '. | first(strptime("%Y-%m-%dT%H:%M:%S%z")?,strptime("%Y-%m-%dT%H:%M:%SZ"))' <<< '"2023-03-31T17:15:00+00:00"'
jq: error (at <stdin>:1): date "2023-03-31T17:15:00+00:00" does not match format "%Y-%m-%dT%H:%M:%SZ"

Hello,

I ran into an unsupported date format.
It has milliseconds and a timezone offset, which fromdateiso8601 does not support.

The format is "YYYY-MM-DDTHH:MM:SS.xxx[+-]HHMM". See example below.

I made a fromdateiso8601gmt function to support it:

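def fromdateiso8601gmt:
        # Captured groups: .[0]-.[5] date and time fields, .[6] fractional seconds,
        # .[7] "Z" or the offset sign, .[8] offset hours, .[9] offset minutes.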
        scan("^(....)-(..)-(..)T(..):(..):(..)(\\.?[0-9]*)([Z+-])(.?.?)(.?.?)$")|
        (.[6]) as $ms|
        if .[7] == "Z" then
                "\(.[0])-\(.[1])-\(.[2])T\(.[3]):\(.[4]):\(.[5])Z"
                |fromdateiso8601
        else 
                ((.[7]+.[8]|tonumber*3600)+(.[7]+.[9]|tonumber*60)) as $offset
                |"\(.[0])-\(.[1])-\(.[2])T\(.[3]):\(.[4]):\(.[5])Z"
                |fromdateiso8601-$offset
        end| .+("0\($ms)"|tonumber)
;

That is far from optimal, but it is better than nothing.

$ echo '
"2024-02-06T16:53:19.1234Z"
"2024-02-06T17:53:19.1234+0100"
"2024-02-06T14:53:19.1234-0200"
"2024-02-05T23:53:19.123Z"
"2024-02-06T00:53:19.123+0100"
"2024-02-06T01:53:19.123+0200"
' | jq "$def_fromdateiso8601gmt"'fromdateiso8601gmt'
1707238399.1234
1707238399.1234
1707238399.1234
1707177199.123
1707177199.123
1707177199.123