webrecorder / browsertrix-crawler

Run a high-fidelity browser-based crawler in a single Docker container

Home Page: https://crawler.docs.browsertrix.com

WARC Validation Error appears from time to time

gitreich opened this issue

I am currently running version 0.12.4.
After every crawl, the warchaeology tool performs a validation check (see https://nlnwa.github.io/warchaeology/):
warc validate -r "crawls/collections/COLLECTION/archive/" --log-console error -L "log_file" --log-file error
Example Output:

000010 /home/netarchive/browsertrix/crawls/collections/parlament_politik_daily_20240306082001/archive/rec-20240306072011313412-0592d15c6129.warc.gz: records: 901, processed: 901, errors: 1, duplicates: 0
   rec num: 901, offset: 22264015, cause: unexpected EOF
000015 /home/netarchive/browsertrix/crawls/collections/parlament_politik_daily_20240306082001/archive/rec-20240306072011167493-0592d15c6129.warc.gz: records: 4067, processed: 4067, errors: 1, duplicates: 0
   rec num: 4067, offset: 234974739, cause: unexpected EOF
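Summary lines like the two above are easy to post-process in a scheduled job. The following is a hypothetical helper (not part of warchaeology) that pulls the per-file counters out of such lines, e.g. to alert only when errors > 0 after a crawl; the line format is inferred from the output shown here.

```python
import re

# Regex inferred from the validator output above; field names are mine.
LINE_RE = re.compile(
    r"(?P<path>\S+?\.warc\.gz): records: (?P<records>\d+), "
    r"processed: (?P<processed>\d+), errors: (?P<errors>\d+), "
    r"duplicates: (?P<duplicates>\d+)"
)

def parse_summary(line):
    """Return the counters from one validator summary line, or None."""
    m = LINE_RE.search(line)
    if not m:
        return None
    d = m.groupdict()
    # Keep the path as a string, convert all counters to int
    return {k: (v if k == "path" else int(v)) for k, v in d.items()}

sample = ("000010 /home/netarchive/browsertrix/crawls/collections/"
          "parlament_politik_daily_20240306082001/archive/"
          "rec-20240306072011313412-0592d15c6129.warc.gz: "
          "records: 901, processed: 901, errors: 1, duplicates: 0")
info = parse_summary(sample)
# info["errors"] is 1 here, which is what a cron wrapper would alert on
```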

I sent this specific crawl to @tw4l, also with debug logging in the logs folder.
But it is happening regularly on different servers (always using the same Browsertrix version, 0.12.4).
I tried many times to isolate this issue with small crawls using the same seeds but different limits, but I was never able to reproduce it.
I also rescheduled crawls with validation errors using the exact same parameters on the same date, but the same error never appeared on the same site (usually the crawls do not even end on the same seed).
One thing is always the same, though: the sizeLimit forced the crawl to end, and it always happens when the last page contained a video and the sizeLimit was overshot. For example:
{"timestamp":"2024-03-01T08:28:17.873Z","logLevel":"info","context":"pageStatus","message":"Page Finished","details":{"loadState":4,"page":"https://www.parlament.gv.at/person/52687","workerid":0}}
{"timestamp":"2024-03-01T08:28:17.877Z","logLevel":"info","context":"general","message":"Size threshold reached 1785990395 >= 1607286400, stopping","details":{}}
{"timestamp":"2024-03-01T08:28:17.899Z","logLevel":"info","context":"general","message":"Crawler interrupted, gracefully finishing current pages","details":{}}

When I open the WARC files that have the validation error with vi, I get a very similar error message:
[attached screenshot]
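For what it's worth, the "unexpected EOF" class of error reported by both tools is exactly what a gzip member with a missing tail produces. The sketch below is a minimal, self-contained reproduction of that failure mode (my assumption about what a cut-short write would look like, not a confirmed crawler bug):

```python
import gzip

# A tiny stand-in for one gzipped WARC record
record = b"WARC/1.1\r\nWARC-Type: response\r\n\r\nexample body\r\n\r\n\r\n"
member = gzip.compress(record)

# Simulate a write that was cut short: drop the 8-byte CRC32/ISIZE trailer
truncated = member[:-8]

try:
    gzip.decompress(truncated)
    outcome = "ok"
except EOFError:
    # Python reports "Compressed file ended before the end-of-stream
    # marker was reached" -- the same symptom the validators flag
    outcome = "unexpected EOF"
```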

We have now released Browsertrix Crawler 1.0.0, which includes a totally different WARC serialization mechanism. Can you try with 1.0.0? I looked at a couple of WARCs and was not able to find any issues (though warchaeology did crash occasionally). If you have the exact crawl command line used, that would also help with testing.

Hi, thanks for your answer.
The exact executed command is stored in the metadata folder under the name container_info.log; in the transferred crawl it was:

docker run -d --name ONB_Btrix_parlament_politik_daily_20240306082001 \
  -e NODE_OPTIONS="--max-old-space-size=32768" \
  -p 9071:9071 -p 18126:18126 \
  -v /home/netarchive/browsertrix/crawls/:/crawls/ \
  webrecorder/browsertrix-crawler:0.12.4 crawl \
  --screencastPort 9071 \
  --seedFile /crawls/config/parlament_politik_daily_seeds.txt \
  --scopeType prefix --depth 5 --extraHops 0 --workers 1 \
  --healthCheckPort 18126 --limit 15000 --sizeLimit 1607286400 \
  --timeLimit 20800 --delay 1 --waitUntil networkidle0 --saveState always \
  --logging "stats,jserrors,info,debug" \
  --logLevel "debug,info,warning,error" \
  --warcInfo ONB_CRAWL_parlament_politik_daily_Depth_5_20240306082001 \
  --userAgentSuffix "+ONB_Bot_Btrix_0.12.4, webarchiv@onb.ac.at" \
  --crawlId id_ONB_CRAWL_parlament_politik_daily_Depth_5_20240306082001 \
  --collection parlament_politik_daily_20240306082001

But I agree we should move the observation on to 1.0.0, so I will update one server with a repeating crawl with a sizeLimit on a video platform (I still think the videos are somehow causing it).

Interesting that warchaeology crashes on your side; I never observed a crash from it. But the validation errors also appear during WARC rewriting, which marks the same files as invalid (mostly the unexpected EOF zip error), just like warchaeology does.

Reported that issue here: nlnwa/warchaeology#99

@gitreich Is this still an issue with any WARCs created with 1.x versions? (from 1.0.1 on)

I have now run 78 crawls where the validation error previously occurred, and it did not occur again.
The only difference from the general setup was that the cronjob was now triggered as root.
I changed that back and will keep observing this issue a little longer with a regular user (and that user's crontab).
But it seems to be resolved at this point.
(Tested with 1.0.0, 1.0.2, 1.0.3, 1.0.4)

Closing, as there are no repros with the 1.x branch. Please re-open if this happens again with 1.x.