open-contracting / kingfisher-collect

Downloads OCDS data and stores it on disk

Home Page: https://kingfisher-collect.readthedocs.io

Incremental updates with database store

jbothma opened this issue

I think there's a bug with the incremental update behaviour of the DatabaseStore.

If I understand correctly, crawl_time has to be set to the same value each time the spider is run for the updates to be incremental.

The first time,

  1. it crawls
  2. it saves data to the crawl_time directory in file names like start.json, PageNumber-2.json, ...
  3. it creates one CSV file from all the crawl files
  4. it creates the OCDS data table and inserts the data from the CSV file (a rough sketch of steps 3 and 4 follows this list)
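Roughly, I imagine steps 3 and 4 doing something like the sketch below. This is not the extension's actual code; the directory, table name and connection URL are just placeholders for illustration.

```python
# Placeholder sketch of steps 3 and 4, NOT kingfisher-collect's actual code.
import csv
import glob
import json

import psycopg2

crawl_directory = "data/example_spider/20230101_000000"  # hypothetical crawl_time directory

# Step 3: combine every crawl file into a single CSV, one JSON document per row.
with open("data.csv", "w", newline="") as csv_file:
    writer = csv.writer(csv_file)
    for path in sorted(glob.glob(f"{crawl_directory}/*.json")):
        with open(path) as json_file:
            writer.writerow([json.dumps(json.load(json_file))])

# Step 4: create the table and bulk-load the CSV into it.
connection = psycopg2.connect("postgresql://user:password@localhost/kingfisher")  # placeholder URL
with connection, connection.cursor() as cursor:
    cursor.execute("CREATE TABLE IF NOT EXISTS example_spider (data jsonb)")
    with open("data.csv") as csv_file:
        cursor.copy_expert("COPY example_spider (data) FROM STDIN WITH CSV", csv_file)
```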

Subsequent times,

  1. it gets the latest publish date from the data in the table
  2. it crawls from that publish date
    1. it saves data to the crawl_time directory in file names like start.json, PageNumber-2.json, ... (I think it's overwriting files here)
  3. it creates one CSV file from all the crawl files
  4. it deletes the existing data and inserts the data from the CSV file (a sketch of this run follows this list)
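Continuing the same placeholder sketch, a later run would then do something like this, which is where the overwritten files get lost:

```python
# Placeholder sketch of steps 1 and 4 of a later run; the table and column
# names are assumptions, not the extension's real schema.
import psycopg2

connection = psycopg2.connect("postgresql://user:password@localhost/kingfisher")  # placeholder URL

with connection, connection.cursor() as cursor:
    # Step 1: the newest publish date already stored becomes from_date for the next crawl.
    cursor.execute("SELECT max(data->>'date') FROM example_spider")
    from_date = cursor.fetchone()[0]  # passed to the spider as its from_date

# Step 2 then crawls from from_date, rewriting start.json, PageNumber-2.json, ...
# in the same crawl_time directory, so the earlier pages on disk are replaced.

# Step 4: existing rows are deleted and only the files now on disk are reloaded,
# so data that existed only in the overwritten files disappears from the table.
with connection, connection.cursor() as cursor:
    cursor.execute("DELETE FROM example_spider")
    with open("data.csv") as csv_file:
        cursor.copy_expert("COPY example_spider (data) FROM STDIN WITH CSV", csv_file)
```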

expected: All the data crawled previously plus the new data should be in the database
actual: Data in the overwritten files is missing from the database

Am I doing something wrong, or is the overwriting an issue here? If I change crawl_time for each crawl, none of the first crawl's data is included.

Some options I see:

  • parse out the latest PageNumber- index already on disk and save new files starting from the next number, i.e. decouple the API page number from the saved file name (see the sketch after this list)
  • include the existing data in the input to release compilation. That only solves the problem for people who enable compiling releases, but since I want that anyway, it's fine for me.
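For the first option, a helper along these lines could pick the next free index by inspecting the crawl_time directory (next_file_name is a hypothetical function, not an existing kingfisher-collect helper):

```python
import os
import re


def next_file_name(crawl_directory, prefix="PageNumber-"):
    """Return the next unused file name, so a later crawl never overwrites an earlier page."""
    pattern = re.compile(rf"^{re.escape(prefix)}(\d+)\.json$")
    indices = [
        int(match.group(1))
        for name in os.listdir(crawl_directory)
        if (match := pattern.match(name))
    ]
    # start.json is effectively page 1, so an empty directory yields PageNumber-2.json.
    return f"{prefix}{max(indices, default=1) + 1}.json"
```

That way, if the first crawl ended at PageNumber-10.json, a second crawl with the same crawl_time would continue at PageNumber-11.json instead of overwriting from PageNumber-2.json.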

Thank you, @jbothma, for reporting. This is indeed a bug. It happens for spiders that use "generic" names as file names. One approach could be to ensure that each file name is always unique (for example, by including a timestamp in the filename). The only issue with this approach is that the compile releases option would then be required to avoid duplicates in some cases.
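As a minimal illustration of the unique-name idea (not existing code), a timestamp could be appended so a re-crawl can never reuse an earlier name:

```python
from datetime import datetime, timezone


def unique_file_name(base="PageNumber-2"):
    # Hypothetical helper: append a UTC timestamp so two crawls never produce the same name.
    return f"{base}-{datetime.now(timezone.utc).strftime('%Y%m%d%H%M%S%f')}.json"
```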

If a crawl is performed twice with the same parameters, the filenames should be the same.

I think the simplest solution might be to prepend from_date to start.json and to set formatter in start_requests to something like join(pretty(self.from_date), parameters('page')) (where pretty is a new function that formats datetimes).
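A rough sketch of that idea, where pretty is the proposed new helper and the way it is combined with the existing join and parameters formatter utilities is only an assumption about the final shape:

```python
from kingfisher_scrapy.util import join, parameters


def pretty(value):
    """Format a datetime like 2023-01-01T00:00:00 as '20230101_000000' for use in file names."""
    return value.strftime("%Y%m%d_%H%M%S")


# Inside a spider's start_requests, the formatter could then become something like:
#
#     formatter=join(lambda url: pretty(self.from_date), parameters('page'))
#
# so that start.json and the page files include from_date in their names and are
# not overwritten when from_date differs between incremental runs.
```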

The path and qs:* spider arguments are the only other parameters that change the response, but I don't think they are changed between incremental updates, so they don't need to be included in the filename.

Amazing. Thanks both!