nla / outbackcdx

Web archive index server based on RocksDB

Handle invalid dates

ato opened this issue · comments

Currently we allow out-of-range dates to be inserted. This doesn't normally cause problems, as the dates aren't parsed. However, the access points system does need to interpret the dates, which causes queries to return no results.

One option is to pad/truncate them to something parsable, as in #82. Another is to just reject them outright up front.

@kris-sigur reports that truncated timestamps still cause problems with sorting:
We might need to pad them at indexing time rather than just at query time. This leaves the question of what to do about existing indexes, though.

I suspect you have avoided the sorting problems so far as generally shorter timestamps = older timestamps. Unfortunately, the data I have from IA comes from a mix of crawls over time.

I suggest padding out shorter timestamps on ingest. As for longer timestamps ... you can get all sorts of weirdness if you start truncating them. The order you feed them in will determine how collisions are handled, for example. So I suggest rejecting those exceeding 14 characters for now.

It is also worth noting that longer timestamps are poorly defined at the moment. The ISO standard represents milliseconds as a decimal fraction of the second, with the number of digits undefined. So it is probably best to treat timestamps over 14 chars as invalid for now.
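Put together, the rule being proposed is: pad timestamps shorter than 14 digits, reject anything longer or non-numeric. A minimal sketch of that rule in Java (illustrative names only, not the actual code from Kris' PR; whether to pad with trailing zeros or something else is an assumption here):

```java
// Sketch of timestamp normalisation on ingest:
//  - reject timestamps longer than 14 characters or containing non-digits
//  - pad shorter timestamps to 14 digits with trailing zeros
public class TimestampNormaliser {
    static String normalise(String ts) {
        if (ts.isEmpty() || ts.length() > 14 || !ts.matches("[0-9]+")) {
            throw new IllegalArgumentException("invalid timestamp: " + ts);
        }
        // pad so that e.g. "2004" becomes "20040000000000" and sorts
        // consistently alongside full 14-digit timestamps
        StringBuilder sb = new StringBuilder(ts);
        while (sb.length() < 14) {
            sb.append('0');
        }
        return sb.toString();
    }
}
```

Doing this once at ingest (rather than at query time) means the index itself is always lexicographically sortable, which is what fixes the mixed-crawl sorting problem described above.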

I've written up something along those lines and I'll submit a PR later today, or tomorrow, once I've tested it out a bit more.

As for existing indexes: I'm deleting all the affected records from mine before updating. That's easy in my case, since I know which set of CDXs they come from. A full rebuild wouldn't have been overly onerous either. I'm sure this varies.

I expect padding on ingest won't break the older indexes, but it will likely become impossible to delete those older items if the delete function also runs through the same padding procedure.

One possible option (no idea how practical this is) would be to have a housekeeping task that looks for timestamps that are "under" 14 digits long, multiplies them by 10, 100, etc. depending on the range they fall in, and updates the index. I guess this is only really viable if you can do a query based solely on the timestamp.
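The multiply-by-a-power-of-ten step above can be sketched numerically: a 14-digit timestamp is at least 10^13, so any smaller value just gets multiplied by 10 until it crosses that threshold (e.g. the 4-digit 2004 × 10^10 = 20040000000000). This is only an illustration of the arithmetic, not a proposed implementation:

```java
// Sketch of padding a short timestamp numerically rather than as a
// string: multiply by 10 until the value has 14 digits (i.e. >= 10^13).
public class NumericPad {
    static long pad(long ts) {
        if (ts <= 0) {
            throw new IllegalArgumentException("invalid timestamp: " + ts);
        }
        while (ts < 10_000_000_000_000L) { // 10^13, smallest 14-digit number
            ts *= 10;
        }
        return ts;
    }
}
```

This is equivalent to appending trailing zeros to the string form, which is why the housekeeping task could rewrite existing keys in place if timestamp-only queries were possible.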

@galgeek Any thoughts on this? Kris' PR seems reasonable to me but I don't have any indexes with truncated dates so I'm not really affected by this change.

apologies for my delay in responding, and Kris' PR looks fine to me!

I've extracted most, if not all, of AIT's shorter timestamps, padded them with 0s to 14 chars, and reingested.