crawlsite.js crashes on PDFs

minthemiddle opened this issue 6 years ago · comments

When the script reaches a PDF, it crashes.

Example:

(node:23872) UnhandledPromiseRejectionWarning: Error: net::ERR_ABORTED at https://code.design/files/code-design-magazine-001.pdf
    at navigate (/Users/martin/Sites/crawlsite/node_modules/puppeteer/lib/Page.js:539:37)
    at process._tickCallback (internal/process/next_tick.js:68:7)
(node:23872) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:23872) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
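
The trace above shows the rejection coming out of Puppeteer's navigation call with nothing catching it. A minimal sketch of guarding that call might look like the following. `safeGoto` is a hypothetical helper, not something in crawlsite.js, and it assumes only that `page.goto(url)` returns a promise that rejects when navigation is aborted.

```javascript
// Hypothetical helper (not part of crawlsite.js): navigate, but treat a
// failed navigation (e.g. net::ERR_ABORTED on a PDF) as "skip this URL".
async function safeGoto(page, url) {
  try {
    // page.goto() resolves with the main response or rejects on failure.
    return await page.goto(url);
  } catch (err) {
    console.warn(`Skipping ${url}: ${err.message}`);
    return null; // The caller is expected to check for null and move on.
  }
}
```

This only stops the crash. The crawler would still need to decide what to record for the skipped URL.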

Eric Bidelman commented 6 years ago

Sure

Eric Bidelman · Answer 1 · Fri Jun 08 2018 00:55:17 GMT+0800 (China Standard Time)

Good catch. Do you have the starting page you were running it on? That'll help me debug.

Martin · Answer 2 · Fri Jun 08 2018 00:59:10 GMT+0800 (China Standard Time)

Yes, my non-profit: https://code.design

Anthony Amaker (LSA) · Answer 3 · Sat Oct 13 2018 10:13:52 GMT+0800 (China Standard Time)

@ebidel any progress on the PDF crash issue? This is a really cool project!

Anthony Amaker (LSA) · Answer 4 · Sat Oct 13 2018 11:19:40 GMT+0800 (China Standard Time)

I found a way around this by making the following modification:

.filter(el => el.localName === 'a' && el.href && el.href.indexOf('.pdf') < 0) // element is an anchor with an href.

... basically it checks to make sure the href of the a tag does NOT contain .pdf
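
A slightly stricter variant of the same idea, as a sketch: parse the URL and look only at the end of its path, so that query strings, fragments, and upper-case extensions do not confuse the check. `hasPdfExtension` is a hypothetical name, and the sketch assumes the href is an absolute URL (which `el.href` is in the browser).

```javascript
// Hypothetical helper: true when the URL's path ends in ".pdf".
// Parsing with URL ignores query strings and fragments, and the check is
// case-insensitive, unlike a plain indexOf('.pdf') on the whole href.
function hasPdfExtension(href) {
  try {
    return new URL(href).pathname.toLowerCase().endsWith('.pdf');
  } catch (err) {
    return false; // Not a parseable absolute URL; leave it to other filters.
  }
}
```

It would slot into the filter as `!hasPdfExtension(el.href)`, provided the helper is defined in the same context where the filter runs.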

Eric Bidelman · Answer 5 · Wed Oct 17 2018 00:33:33 GMT+0800 (China Standard Time)

@aamakerlsa Right, it would be something like that. However, not every PDF link contains ".pdf" in the name :)
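
For links without ".pdf" in the name, one option is to look at the response's `Content-Type` header instead of the URL. This is only a sketch: `isPdfContentType` is a hypothetical helper, and it assumes a plain headers object with lower-cased names, which is the shape Puppeteer's `response.headers()` returns.

```javascript
// Hypothetical helper: decide from response headers whether the body is a PDF.
// Assumes `headers` is a plain object whose keys are lower-cased header names.
function isPdfContentType(headers) {
  const contentType = (headers['content-type'] || '').toLowerCase();
  // Drop parameters such as "; charset=binary" before comparing.
  return contentType.split(';')[0].trim() === 'application/pdf';
}
```

A server can still mislabel a PDF (for example as `application/octet-stream`), so this catches more cases than the filename check but not all of them.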

Trupti Mujumdar · Answer 6 · Sun Jan 13 2019 17:19:34 GMT+0800 (China Standard Time)

Can I work on this issue?

Trupti Mujumdar · Answer 7 · Mon Jan 21 2019 10:28:20 GMT+0800 (China Standard Time)

@ebidel Thanks.
Can we just read the header bytes of the file the href points to (in hex) and figure out whether it is a PDF or not?
Pdf File Format Basic Structure

Trupti Mujumdar · Answer 8 · Tue Jan 29 2019 13:30:29 GMT+0800 (China Standard Time)

Hi @ebidel,
Did you get a chance to look into the above query?
Thanks.

Eric Bidelman · Answer 9 · Wed Jan 30 2019 05:04:02 GMT+0800 (China Standard Time)

Not sure if that would work but you could try. You'd have to read the response body of every request though :(
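
For reference, the header check being discussed could be sketched like this. PDF files begin with the signature `%PDF-`, so it is enough to compare the first five bytes. `startsWithPdfSignature` is a hypothetical helper, and it assumes the bytes are already available as a Node.js `Buffer` (for example from Puppeteer's `response.buffer()`), which is exactly the extra per-request cost mentioned above.

```javascript
// Hypothetical helper: true when the bytes start with the PDF signature "%PDF-".
// Assumes `bytes` is a Node.js Buffer holding at least the start of the body.
function startsWithPdfSignature(bytes) {
  const signature = Buffer.from('%PDF-', 'latin1'); // 25 50 44 46 2d in hex
  return bytes.length >= signature.length &&
      bytes.subarray(0, signature.length).equals(signature);
}
```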