AtlasOfLivingAustralia / biocache-store

Occurrence processing, indexing and batch processing

parse <start date>/<period> iso dates

M-Nicholls opened this issue · comments

Requested by Tasmanian Herbarium:

ALA should import ISO 8601 date ranges expressed as start date/period, such as 2017-11-01/P1M or 1898-01-01/P1Y.

Also add support for 'X' character to indicate uncertainty:

https://www.iso.org/standard/70908.html

https://www.loc.gov/standards/datetime/iso-tc154-wg5_n0039_iso_wd_8601-2_2016-02-16.pdf

which among other things proposes that "The character 'X' may be used as a replacement character, in place of a digit to indicate that the value of that digit is unspecified.", with the examples "1985-04-XX" and "1985-XX-XX".

The approach using 'X' offers more flexibility, in that records sometimes specify day and month but not year, and this cannot be expressed as a start time/duration expression.

From Bob:

Here's a free version of the new (Feb 2019) extensions to the standard, as represented by the US Library of Congress:

https://www.loc.gov/standards/datetime/edtf.html

The standard referenced above is ISO 8601-2:2019 Date and time -- Representations for information interchange -- Part 2: Extensions. It was just released, so there is no support for it in the JDK at this point, and we are restricted to JDK8 by our use of Scala-2.10. It also defines extensions to the core ISO8601 standard, so it is possible that it will never be implemented by the JDK itself.

That means we would need to customise the parsing ourselves to accommodate this example.

Could you give the expected values for eventDate/eventDateStart/eventDateEnd/year/month/day for the cases above so we can write tests to verify that the modified parsing is what you expect?

I'm going by the HISPID version of the DarwinCore archive: https://hiscom.github.io/hispid/terms/ as that is the only one I've worked with. It doesn't include eventDateStart/eventDateEnd/year/month/day, only eventDate. However:

| version | eventDate | eventDateStart | eventDateEnd | year | month | day |
| --- | --- | --- | --- | --- | --- | --- |
| any | 1967-02-07 | 1967-02-07 | 1967-02-07 | 1967 | 02 | 07 |
| pre-2019 | 2017-11-01/P1M | 2017-11-01 | 2017-11-30 | 2017 | 11 | |
| pre-2019 | 1898-01-01/P1Y | 1898-01-01 | 1898-12-31 | 1898 | | |
| 2019-rev | 2017-11-XX | 2017-11-01 | 2017-11-30 | 2017 | 11 | |
| 2019-rev | 1898-XX-XX | 1898-01-01 | 1898-12-31 | 1898 | | |
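As a rough illustration of the 2019-rev rows above, here is a minimal Python sketch that expands a date with an unspecified month and/or day into an inclusive start/end pair. The function name `masked_date_range` and the regular expression are hypothetical and are not part of biocache-store; only fully masked fields (`XX`) are handled, not partially masked ones such as `1X`.

```python
import calendar
import re
from datetime import date

# Hypothetical pattern: a four-digit year, then a month and day that are
# either two digits or the fully unspecified marker 'XX'.
_MASKED = re.compile(r"^(\d{4})-(\d{2}|XX)-(\d{2}|XX)$")


def masked_date_range(value):
    """Return (start, end) as inclusive dates for a possibly masked date."""
    m = _MASKED.match(value)
    if m is None:
        raise ValueError("unsupported date: %r" % value)
    year, month, day = m.groups()
    y = int(year)
    if month == "XX":
        # A known day under an unknown month is not a contiguous range.
        if day != "XX":
            raise ValueError("day given without month: %r" % value)
        return date(y, 1, 1), date(y, 12, 31)
    mo = int(month)
    if day == "XX":
        # monthrange returns (weekday of first day, number of days in month).
        last_day = calendar.monthrange(y, mo)[1]
        return date(y, mo, 1), date(y, mo, last_day)
    exact = date(y, mo, int(day))
    return exact, exact
```

With this sketch, `masked_date_range("2017-11-XX")` gives 2017-11-01 to 2017-11-30, matching the eventDateStart and eventDateEnd columns in the table.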

In addition records of unknown year but known month and day can be rendered using the 2019 revision:

| version | eventDate | eventDateStart | eventDateEnd | year | month | day |
| --- | --- | --- | --- | --- | --- | --- |
| 2019-rev | XXXX-12-25 | | | | 12 | 25 |
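The unknown-year case could be handled by extracting the components separately and leaving the unspecified ones empty, as in the row above. This is only a sketch under the assumption that each field is either all digits or all 'X'; `date_components` is a hypothetical name, not an existing biocache-store function.

```python
import re

# Hypothetical pattern: each field is either fully numeric or fully masked.
_PARTS = re.compile(r"^(\d{4}|XXXX)-(\d{2}|XX)-(\d{2}|XX)$")


def date_components(value):
    """Return (year, month, day), using None for any fully masked field."""
    m = _PARTS.match(value)
    if m is None:
        raise ValueError("unsupported date: %r" % value)
    # A field consisting only of 'X' characters is treated as unspecified.
    return tuple(None if set(part) == {"X"} else int(part) for part in m.groups())
```

For `XXXX-12-25` this yields `(None, 12, 25)`: month and day are populated, while year (and therefore any start/end range) stays empty.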

These are rare cases and constitute a tiny proportion of all records. However, they do not represent a time period.

Thanks for giving those examples. My interpretations, which are slightly different for the P1M/P1Y cases, would have been:

| version | eventDate | eventDateStart | eventDateEnd | year | month | day |
| --- | --- | --- | --- | --- | --- | --- |
| any | 1967-02-07 | 1967-02-07 | 1967-02-07 | 1967 | 02 | 07 |
| pre-2019 | 2017-11-01/P1M | 2017-11-01 | 2017-12-01 | 2017 | 11 | |
| pre-2019 | 1898-01-01/P1Y | 1898-01-01 | 1899-01-01 | 1898 | | |
| 2019-rev | 2017-11-XX | 2017-11-01 | 2017-11-30 | 2017 | 11 | |
| 2019-rev | 1898-XX-XX | 1898-01-01 | 1898-12-31 | 1898 | | |

I think you're correct and I'm not, since the duration would be one complete month or year and we're not using times (technically the eventDateStart and eventDateEnd would both be at T00:00:00). In any case, which is easier to code?
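On the "which is easier to code" question, the two readings differ by a single day, so either is cheap to implement once the period has been added to the start date. The Python sketch below is an illustration only: `add_period`, `interval_bounds` and the `inclusive_end` flag are hypothetical names, and it assumes periods built from whole years, months and days (no weeks or time components).

```python
import calendar
import re
from datetime import date, timedelta

# Hypothetical pattern for <start date>/<period>, e.g. 2017-11-01/P1M.
_INTERVAL = re.compile(
    r"^(\d{4})-(\d{2})-(\d{2})/P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$"
)


def add_period(start, years, months, days):
    """Add a calendar period to a date, clamping the day to the month length."""
    total_months = start.year * 12 + (start.month - 1) + years * 12 + months
    year, month_index = divmod(total_months, 12)
    month = month_index + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day) + timedelta(days=days)


def interval_bounds(value, inclusive_end=True):
    """Return (start, end) for a start/period expression.

    inclusive_end=True gives the last day covered by the period
    (2017-11-30 for 2017-11-01/P1M); False gives start + period (2017-12-01).
    """
    m = _INTERVAL.match(value)
    if m is None:
        raise ValueError("unsupported interval: %r" % value)
    y, mo, d, p_years, p_months, p_days = m.groups()
    if not (p_years or p_months or p_days):
        raise ValueError("empty period: %r" % value)
    start = date(int(y), int(mo), int(d))
    end = add_period(start, int(p_years or 0), int(p_months or 0), int(p_days or 0))
    if inclusive_end:
        end -= timedelta(days=1)
    return start, end
```

Calling `interval_bounds("2017-11-01/P1M")` returns an end of 2017-11-30, while `interval_bounds("2017-11-01/P1M", inclusive_end=False)` returns 2017-12-01, so the choice between the two tables comes down to one subtraction.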

I also wonder if these cases are not already adequately covered by supplying dates as YYYY-MM or YYYY for unknown day or month, respectively.
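To make that comparison concrete, here is a small Python sketch that expands reduced-precision dates (YYYY-MM or YYYY) into the same inclusive ranges discussed above. `reduced_precision_range` is a hypothetical helper for illustration and does not describe how biocache-store currently parses these values.

```python
import calendar
import re
from datetime import date

# Hypothetical pattern: a four-digit year with an optional two-digit month.
_REDUCED = re.compile(r"^(\d{4})(?:-(\d{2}))?$")


def reduced_precision_range(value):
    """Return the inclusive (start, end) dates covered by YYYY or YYYY-MM."""
    m = _REDUCED.match(value)
    if m is None:
        raise ValueError("unsupported date: %r" % value)
    year = int(m.group(1))
    if m.group(2) is None:
        # Year only: the range is the whole calendar year.
        return date(year, 1, 1), date(year, 12, 31)
    month = int(m.group(2))
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, 1), date(year, month, last_day)
```

Under this reading, `2017-11` covers 2017-11-01 to 2017-11-30 and `1898` covers 1898-01-01 to 1898-12-31, the same ranges given for `2017-11-XX` and `1898-XX-XX`.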

Specifying month and day without year was not allowed in the standard after 2004, and I cannot see that it will be allowed in the new standard, negating any advantage of the use of X to indicate missing data for an unknown year.

See this related discussion on the TDWG tracker: tdwg/dwc#220

biocache-store has been replaced by pipelines.