puppeteer / examples

Use case-driven examples for using Puppeteer and headless Chrome

Home Page: https://developers.google.com/web/tools/puppeteer/

crawlsite.js crashes on PDFs

minthemiddle opened this issue

When the script reaches a PDF, it crashes.

Example:

(node:23872) UnhandledPromiseRejectionWarning: Error: net::ERR_ABORTED at https://code.design/files/code-design-magazine-001.pdf
    at navigate (/Users/martin/Sites/crawlsite/node_modules/puppeteer/lib/Page.js:539:37)
    at process._tickCallback (internal/process/next_tick.js:68:7)
(node:23872) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:23872) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
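
The trace above shows the navigation rejecting with nothing catching it. Here is a minimal sketch of one way to keep a crawl running past such a link, assuming the script navigates with `page.goto()`. `safeGoto` is a hypothetical helper name, not something from crawlsite.js:

```javascript
// Hypothetical helper, not part of crawlsite.js: wraps page.goto() so that a
// failed navigation (such as net::ERR_ABORTED on a PDF link) is logged
// instead of surfacing as an unhandled promise rejection.
async function safeGoto(page, url) {
  try {
    // page.goto() resolves with the main response, or rejects on failure.
    return await page.goto(url);
  } catch (err) {
    console.warn(`Skipping ${url}: ${err.message}`);
    return null; // The caller decides what to do with pages that did not load.
  }
}
```

A caller would then check for `null` and move on to the next URL rather than letting the whole crawl die.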

Good catch. Do you have the starting page you were running it on? That'll help me debug.

Yes, my non-profit: https://code.design

@ebidel Any progress on the PDF crash? This is a really cool project!

I found a workaround by making this modification:

.filter(el => el.localName === 'a' && el.href && el.href.indexOf('.pdf') < 0) // element is an anchor with an href.

Basically, it checks that the href of the a tag does NOT contain .pdf.
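
A slightly stricter variant of the same idea, as a sketch only: it looks at the URL path rather than the whole href, and ignores case. `hasPdfExtension` is a made-up name, and the example.com links are placeholders:

```javascript
// Hypothetical helper: true when the URL's path ends in ".pdf"
// (case-insensitive). The query string and fragment are ignored.
function hasPdfExtension(href) {
  try {
    return new URL(href).pathname.toLowerCase().endsWith('.pdf');
  } catch (err) {
    return false; // Not an absolute URL, so treat it as "not a PDF link".
  }
}

const links = [
  'https://code.design/files/code-design-magazine-001.pdf',
  'https://example.com/about',
  'https://example.com/report.PDF?download=1',
];
console.log(links.filter(href => !hasPdfExtension(href)));
// → [ 'https://example.com/about' ]
```

This still only catches links that advertise themselves as PDFs in the path.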

@aamakerlsa Right, it would be something like that. However, not every PDF link contains ".pdf" in the name :)
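
For links without ".pdf" in the name, one option might be to ask the server what it is about to send. This is a rough sketch, assuming a runtime with a global `fetch()` (Node 18+ or a browser) and a server that answers HEAD requests with an accurate Content-Type. Both function names are hypothetical:

```javascript
// Hypothetical helper: decides from a Content-Type header value whether the
// resource is a PDF. Parameters such as "; charset=..." are ignored.
function isPdfContentType(contentType) {
  if (!contentType) return false;
  const mediaType = contentType.split(';')[0].trim().toLowerCase();
  return mediaType === 'application/pdf';
}

// Assumption: global fetch() is available. A HEAD request asks for the
// headers only, so the document body is not downloaded.
async function linkPointsToPdf(url) {
  const response = await fetch(url, {method: 'HEAD'});
  return isPdfContentType(response.headers.get('content-type'));
}
```

Some servers reject HEAD or label downloads as `application/octet-stream`, so this would miss those cases.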

Can I work on this issue?

@ebidel Thanks.
Could we just read the first bytes of the file that the href points to (in hex) and check whether it is a PDF?
Reference: PDF File Format Basic Structure
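
To make the idea concrete: a PDF file begins with the ASCII signature `%PDF-` (hex `25 50 44 46 2D`). Here is a minimal sketch of that check on bytes that have already been read. `startsWithPdfSignature` is a hypothetical name, and how the bytes are obtained is left open:

```javascript
// PDF files start with the ASCII signature "%PDF-" followed by a version.
const PDF_SIGNATURE = Buffer.from('%PDF-', 'ascii');

// Hypothetical helper: true when `bytes` (a Buffer holding the start of a
// file) begins with the PDF signature.
function startsWithPdfSignature(bytes) {
  return bytes.length >= PDF_SIGNATURE.length &&
      bytes.subarray(0, PDF_SIGNATURE.length).equals(PDF_SIGNATURE);
}

console.log(startsWithPdfSignature(Buffer.from('%PDF-1.7\n...')));   // → true
console.log(startsWithPdfSignature(Buffer.from('<!DOCTYPE html>'))); // → false
```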

Hi @ebidel,
Did you get a chance to look into the above query?
Thanks.

Not sure if that would work but you could try. You'd have to read the response body of every request though :(
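
One way the cost might be reduced is an HTTP Range request, which asks for only the first bytes of each response. This is a sketch under assumptions: a global `fetch()` (Node 18+), and a server that honors `Range`. `fetchLeadingBytes` is a hypothetical helper, and `fetchImpl` is only there so a stand-in can be passed for testing:

```javascript
// Hypothetical helper: returns at most the first `byteCount` bytes of `url`.
// Assumption: the server honors the Range header. If it ignores it, the whole
// body is still downloaded and is only sliced afterwards.
async function fetchLeadingBytes(url, byteCount = 1024, fetchImpl = fetch) {
  const response = await fetchImpl(url, {
    headers: {Range: `bytes=0-${byteCount - 1}`},
  });
  const body = Buffer.from(await response.arrayBuffer());
  return body.subarray(0, byteCount);
}
```

The returned bytes could then be compared against the `%PDF-` signature before deciding whether to navigate to the link. That is still one extra request per link, so it trades the crash for slower crawling.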