bitbot-irc / bitbot

https://bitbot.dev | Python3 event-driven modular IRCv3 bot 🤖

bitbot disconnects finding the title of really large pages

OrichalcumCosmonaut opened this issue

when given a URI like https://git.causal.agency/scooper/tree/sqlite3.c?h=vendor-sqlite3 (a 24MiB page), bitbot disconnects while fetching its title, probably because it parses the entire page just to find the title, which makes it time out.

this could probably be fixed by limiting the amount of the page to be parsed to 64KiB or thereabouts.
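A minimal sketch of that idea: cap how much of the page is downloaded before parsing. The requests usage, the fetch_head name, and the 5-second timeout are assumptions for illustration, not bitbot's actual code:

import requests

MAX_PARSE_BYTES = 64 * 1024  # parse at most the first 64KiB, per the suggestion above

def fetch_head(url: str) -> bytes:
    # stream the response so the whole 24MiB page is never buffered
    buf = bytearray()
    with requests.get(url, stream=True, timeout=5) as response:
        for chunk in response.iter_content(chunk_size=4096):
            buf.extend(chunk)
            if len(buf) >= MAX_PARSE_BYTES:
                break  # stop downloading once there's enough to find a title
    return bytes(buf[:MAX_PARSE_BYTES])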

deadline for slow transfers
https://github.com/jesopo/bitbot/blob/4a6037c77405f3584efadc10ae75826b6b9ac422/src/utils/http.py#L240

max file size for large files that could oom the bot
https://github.com/jesopo/bitbot/blob/4a6037c77405f3584efadc10ae75826b6b9ac422/src/utils/http.py#L224
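For reference, a deadline like that can be sketched with a SIGALRM timer; this is Unix-only, main-thread-only, and purely illustrative rather than the linked bitbot code:

import signal
from contextlib import contextmanager

@contextmanager
def deadline(seconds: int):
    # raise TimeoutError if the wrapped block runs longer than `seconds`
    def on_alarm(signum, frame):
        raise TimeoutError(f"deadline of {seconds}s exceeded")
    previous = signal.signal(signal.SIGALRM, on_alarm)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)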

don't know what problem you've managed to stumble on, but there's already code intended to handle what you've described. do you have a stacktrace?

https://github.com/jesopo/bitbot/blob/4a6037c77405f3584efadc10ae75826b6b9ac422/src/utils/http.py#L35

it seems that the default for that variable is 100MiB, which probably doesn’t help for a page smaller than that, especially when the title probably isn’t very far into the file.

i don’t have a stacktrace, this is tildebot that disconnected, so maybe ben has one, assuming it did crash?

I'm sure the machine it's running on can read and parse 100mib of html unless it's somehow akin to a zipbomb. can you get the stacktrace from ben? we're going to be blind without it

I think that limit ought to be configurable. Some people's stuff can handle that but others obviously can't.

commented

My guess is it spends forever in html5lib trying to parse the page. Pure python parser = sadness.

I've hit it with much worse in testing. eager to see a stacktrace

if it is a timeout, especially on the .soup() call outside the deadline, I'd be inclined to do a much less thorough parse, even just a regex to grab the <title>. it'd work the majority of the time
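That regex grab could look something like the following; quick_title is a hypothetical name, and it knowingly skips HTML entities, comments, and other markup a real parser would handle:

import re
from typing import Optional

# naive by design: no entity decoding, no handling of <title> inside comments
TITLE_RE = re.compile(rb"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

def quick_title(head: bytes) -> Optional[str]:
    match = TITLE_RE.search(head)
    if match is None:
        return None
    return match.group(1).decode("utf-8", "replace").strip()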

benchmarking lxml against html5lib puts the former far ahead of the latter, but I recall picking the latter for fault tolerance the former doesn't have
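One way to reproduce that comparison (page.html is a placeholder; number=1 because a single parse is the interesting measurement):

import timeit
import html5lib
from lxml import html as lxml_html

with open("page.html", "rb") as f:
    data = f.read()

# time one parse of the same document with each parser
print("html5lib:", timeit.timeit(lambda: html5lib.parse(data), number=1))
print("lxml:", timeit.timeit(lambda: lxml_html.document_fromstring(data), number=1))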

looks like my log level was too low; it just shut down

commented
>>> import html5lib
>>> import timeit
>>> def parseit():
...     with open("big.html", "rb") as f:
...         return html5lib.parse(f)
...
>>> timeit.timeit(parseit)  # note: timeit defaults to number=1000000 runs

This has been running for over 20 minutes...

😰

I don't think the correct solution is limiting file size; I imagine it's trivial to code-golf something html5lib finds hard to parse. I'd either deadline soupifying the results or switch to something closer to O(1). lxml is undoubtedly faster, but I can't remember what exact case caused me to switch away from it
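Those two ideas combined might look like this sketch, assuming the deadline() and quick_title() helpers sketched above; the 5-second budget is arbitrary:

import bs4

def get_title(data: bytes):
    try:
        # thorough, fault-tolerant parse, but bounded in time
        with deadline(5):
            soup = bs4.BeautifulSoup(data, "html5lib")
            if soup.title and soup.title.string:
                return soup.title.string.strip()
    except TimeoutError:
        pass
    # cheap regex fallback for pages the full parse can't finish in time
    return quick_title(data)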

commented

Well, I wanted to post the final count for timing html5lib for posterity, but it seems Python got OOM-killed while I wasn't looking 🙁