open-contracting / kingfisher-collect

Downloads OCDS data and stores it on disk

Home Page: https://kingfisher-collect.readthedocs.io

Incremental updates with database store

jbothma opened this issue

I think there's a bug with the incremental update behaviour of the DatabaseStore.

If I understand correctly, crawl_time has to be set to the same value each time the spider is run for the updates to be incremental.

The first time,

  1. it crawls
  2. it saves data to the crawl_time directory in file names like start.json, PageNumber-2.json, ...
  3. it creates one CSV file from all the crawl files
  4. it creates the OCDS data table and inserts the data from the CSV file (a rough sketch of steps 3 and 4 follows this list)
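Roughly, I imagine steps 3 and 4 doing something like the sketch below. This is not the extension's actual code; the directory, table name and connection URL are just placeholders for illustration.

```python
# Placeholder sketch of steps 3 and 4, NOT kingfisher-collect's actual code.
import csv
import glob
import json

import psycopg2

crawl_directory = "data/example_spider/20230101_000000"  # hypothetical crawl_time directory

# Step 3: combine every crawl file into a single CSV, one JSON document per row.
with open("data.csv", "w", newline="") as csv_file:
    writer = csv.writer(csv_file)
    for path in sorted(glob.glob(f"{crawl_directory}/*.json")):
        with open(path) as json_file:
            writer.writerow([json.dumps(json.load(json_file))])

# Step 4: create the table and bulk-load the CSV into it.
connection = psycopg2.connect("postgresql://user:password@localhost/kingfisher")  # placeholder URL
with connection, connection.cursor() as cursor:
    cursor.execute("CREATE TABLE IF NOT EXISTS example_spider (data jsonb)")
    with open("data.csv") as csv_file:
        cursor.copy_expert("COPY example_spider (data) FROM STDIN WITH CSV", csv_file)
```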

Subsequent times,

  1. it gets the latest publish date from the data in the table
  2. it crawls from that publish date
    1. it saves data to the crawl_time directory in file names like start.json, PageNumber-2.json, ... (I think it's overwriting files here)
  3. it creates one CSV file from all the crawl files
  4. it deletes the existing data and inserts the data from the CSV file (a sketch of this run follows this list)
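Continuing the same placeholder sketch, a later run would then do something like this, which is where the overwritten files get lost:

```python
# Placeholder sketch of steps 1 and 4 of a later run; the table and column
# names are assumptions, not the extension's real schema.
import psycopg2

connection = psycopg2.connect("postgresql://user:password@localhost/kingfisher")  # placeholder URL

with connection, connection.cursor() as cursor:
    # Step 1: the newest publish date already stored becomes from_date for the next crawl.
    cursor.execute("SELECT max(data->>'date') FROM example_spider")
    from_date = cursor.fetchone()[0]  # passed to the spider as its from_date

# Step 2 then crawls from from_date, rewriting start.json, PageNumber-2.json, ...
# in the same crawl_time directory, so the earlier pages on disk are replaced.

# Step 4: existing rows are deleted and only the files now on disk are reloaded,
# so data that existed only in the overwritten files disappears from the table.
with connection, connection.cursor() as cursor:
    cursor.execute("DELETE FROM example_spider")
    with open("data.csv") as csv_file:
        cursor.copy_expert("COPY example_spider (data) FROM STDIN WITH CSV", csv_file)
```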

expected: All the data crawled previously plus the new data should be in the database
actual: Data in the overwritten files is missing from the database

Am I doing something wrong, or is the overwriting an issue here? If I change crawl_time for each crawl, none of the first crawl's data is included.

Some options I see:

  • parse out the latest PageNumber- index already on disk and save new files starting from the next number, i.e. decouple the API page number from the saved file name (see the sketch after this list)
  • include the existing data in the input to release compilation. That only solves the problem for people who enable compiling releases, but since I want that anyway, it's fine for me.
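For the first option, a helper along these lines could pick the next free index by inspecting the crawl_time directory (next_file_name is a hypothetical function, not an existing kingfisher-collect helper):

```python
import os
import re


def next_file_name(crawl_directory, prefix="PageNumber-"):
    """Return the next unused file name, so a later crawl never overwrites an earlier page."""
    pattern = re.compile(rf"^{re.escape(prefix)}(\d+)\.json$")
    indices = [
        int(match.group(1))
        for name in os.listdir(crawl_directory)
        if (match := pattern.match(name))
    ]
    # start.json is effectively page 1, so an empty directory yields PageNumber-2.json.
    return f"{prefix}{max(indices, default=1) + 1}.json"
```

That way, if the first crawl ended at PageNumber-10.json, a second crawl with the same crawl_time would continue at PageNumber-11.json instead of overwriting from PageNumber-2.json.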

Thank you, @jbothma, for reporting. This is indeed a bug. It happens for spiders that use "generic" names as file names. One approach could be to ensure that each file name is always unique (for example, by including a timestamp in the filename). The only issue with this approach is that the compile releases option would then be required to avoid duplicates in some cases.
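As a minimal illustration of the unique-name idea (not existing code), a timestamp could be appended so a re-crawl can never reuse an earlier name:

```python
from datetime import datetime, timezone


def unique_file_name(base="PageNumber-2"):
    # Hypothetical helper: append a UTC timestamp so two crawls never produce the same name.
    return f"{base}-{datetime.now(timezone.utc).strftime('%Y%m%d%H%M%S%f')}.json"
```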

If a crawl is performed twice with the same parameters, the filenames should be the same.

I think the simplest solution might be to prepend from_date to start.json and to set formatter in start_requests to something like join(pretty(self.from_date), parameters('page')) (where pretty is a new function that formats datetimes).
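A rough sketch of that idea, where pretty is the proposed new helper and the way it is combined with the existing join and parameters formatter utilities is only an assumption about the final shape:

```python
from kingfisher_scrapy.util import join, parameters


def pretty(value):
    """Format a datetime like 2023-01-01T00:00:00 as '20230101_000000' for use in file names."""
    return value.strftime("%Y%m%d_%H%M%S")


# Inside a spider's start_requests, the formatter could then become something like:
#
#     formatter=join(lambda url: pretty(self.from_date), parameters('page'))
#
# so that start.json and the page files include from_date in their names and are
# not overwritten when from_date differs between incremental runs.
```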

The path and qs:* spider arguments are the only other parameters that change the response, but I don't think they are changed between incremental updates, so they don't need to be included in the filename.

Amazing. Thanks both!