Add inputUrl option when providing raw html

Question

kylealwyn opened this issue 2 years ago

Articles like https://www.dreamsongs.com/RiseOfWorseIsBetter.html or https://grugbrain.dev/ return content in the demo but bail out due to no links being found when providing html.

How do we feel about adding an option to provide a url in this case, similar to Postlight? https://github.com/postlight/parser#pre-fetched-html
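One thing a source URL makes possible when only raw HTML is supplied is resolving relative links against the page's origin. The sketch below illustrates that idea only; `absolutizeLinks` is a hypothetical helper, not part of article-extractor or Postlight, and the regex handles only double-quoted `href`/`src` attributes.

```javascript
// Hypothetical helper (an assumption for illustration, not a library API):
// rewrite relative href/src values to absolute URLs using the page's URL.
function absolutizeLinks(html, baseUrl) {
  return html.replace(/\b(href|src)="([^"]*)"/g, (match, attr, value) => {
    try {
      // The WHATWG URL constructor resolves `value` relative to `baseUrl`.
      return `${attr}="${new URL(value, baseUrl).href}"`
    } catch {
      // Invalid base or value: leave the attribute untouched.
      return match
    }
  })
}

const html = '<a href="/essays/worse.html">Essay</a> <img src="img/a.png">'
const fixed = absolutizeLinks(html, 'https://example.com/articles/index.html')
console.log(fixed)
// <a href="https://example.com/essays/worse.html">Essay</a> <img src="https://example.com/articles/img/a.png">
```

Without a base URL, neither link above can be turned into something fetchable, which is why a pre-fetched-HTML API typically takes the URL as a second input.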

Context: I'm prefetching the html with scrapingant and running it through a metascraper pipeline amongst other things

Dong Nguyen · Answer 1 · Wed Jan 11 2023 13:01:03 GMT+0800 (China Standard Time)

@kylealwyn thank you for raising this problem. Currently there is only a workaround.

I'm considering exposing a new method in the next release, something like:

import { extractFromHtml } from '@extractus/article-parser'

// html: the raw HTML string, url: the page's source URL string
const result = await extractFromHtml(html, url)

The Postlight API is a good reference too. I'm reading their docs. If you have any other advice, please share.

Kyle Alwyn · Answer 2 · Wed Jan 11 2023 13:06:24 GMT+0800 (China Standard Time)

Hi, thanks for the quick response! I like your suggestion of a separate entrypoint rather than overloading a single interface.

Kyle Alwyn · Answer 3 · Wed Jan 11 2023 14:37:08 GMT+0800 (China Standard Time)

It'd also be great to have some bindings into JSDOM or something similar. I'm running the html through article-extractor (great lib btw) unconditionally and encounter frequent errors such as fetch is not defined. Curious if there's a way to get around.

Dong Nguyen · Answer 4 · Wed Jan 11 2023 16:21:05 GMT+0800 (China Standard Time)

@kylealwyn fetch is now available by default on almost every platform. Could you share a little more about your environment?
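A quick way to answer the environment question is to check for a global `fetch` in the runtime that actually executes the crawler. This is a minimal sketch; `hasGlobalFetch` is a hypothetical name, and the check says nothing about sandboxed contexts (such as an emulated `window`) that do not share the runtime's globals.

```javascript
// Hypothetical helper: report whether a global fetch exists in this runtime.
function hasGlobalFetch() {
  return typeof globalThis.fetch === 'function'
}

// process is only defined in Node.js, so guard the version lookup.
const runtime = typeof process !== 'undefined' ? `node ${process.version}` : 'non-Node runtime'
console.log(`${runtime}: global fetch available = ${hasGlobalFetch()}`)
```

If this prints `true` and the error still appears, the `fetch is not defined` message is probably raised inside a different context than the one running this check.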

article-extractor depends heavily on linkedom for DOM manipulation. However, you can still use JSDOM to modify your raw HTML before passing it into this lib.

Kyle Alwyn · Answer 5 · Thu Jan 12 2023 03:01:39 GMT+0800 (China Standard Time)

I'm still digging in, but in case you see something: it's a variety of client JS errors that seem to stem from third-party scripts. This might actually be coming from metascraper.

crawler:dev: TypeError: window.requestAnimationFrame is not a function
crawler:dev:     at j (https://ads.blogherads.com/static/blogherads.js:2:118920)
crawler:dev:     at qi (https://ads.blogherads.com/static/blogherads.js:33:65498) {"date":"Wed Jan 11 2023 10:11:01 GMT-0800 (Pacific Standard Time)","error":{},"exception":true,"os":{"loadavg":[7.7177734375,6.7529296875,4.482421875],"uptime":1187037},"process":{"argv":["/Users/kyle/dev/playground/basis/node_modules/.pnpm/ts-node@10.9.1_vq46kxj6zfka4f6ijsosnft3hy/node_modules/ts-node/dist/child/child-entrypoint.js","/Users/kyle/dev/playground/test/apps/crawler/src/server.ts"],"cwd":"/Users/kyle/dev/playground/test/apps/crawler","execPath":"/Users/kyle/Library/Application Support/fnm/node-versions/v18.12.1/installation/bin/node","gid":20,"memoryUsage":{"arrayBuffers":5499315,"external":7243456,"heapTotal":351469568,"heapUsed":266899520,"rss":511655936},"pid":30447,"uid":502,"version":"v18.12.1"},"stack":"TypeError: window.requestAnimationFrame is not a function\n    at j (https://ads.blogherads.com/static/blogherads.js:2:118920)\n    at qi (https://ads.blogherads.com/static/blogherads.js:33:65498)","trace":[{"column":118920,"file":"https://ads.blogherads.com/static/blogherads.js","function":"j","line":2,"method":null,"native":false},{"column":65498,"file":"https://ads.blogherads.com/static/blogherads.js","function":"qi","line":33,"method":null,"native":false}]}

Kyle Alwyn · Answer 6 · Thu Jan 12 2023 03:06:44 GMT+0800 (China Standard Time)

7.2.8 works like a charm, amazingly fast turnaround, thank you!

Dong Nguyen · Answer 7 · Thu Jan 12 2023 09:59:38 GMT+0800 (China Standard Time)

@kylealwyn yeah, regarding the error you posted, it seems you are using something like a headless browser to parse web content.
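If the errors do come from third-party scripts being executed by a DOM emulation, one possible mitigation is to drop `<script>` elements from the raw HTML before handing it to any tool that might run them. The sketch below is an assumption-laden illustration: `stripScripts` is a hypothetical helper, and a regex is only an approximation of real HTML parsing.

```javascript
// Hypothetical helper: remove <script>...</script> elements from raw HTML.
// A regex is a rough approximation; a real DOM parser is more robust.
function stripScripts(html) {
  return html.replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, '')
}

const raw = '<p>a</p><script src="https://ads.example/x.js"></script><p>b</p>'
console.log(stripScripts(raw))
// <p>a</p><p>b</p>
```

This trades away any content that scripts would have rendered, so it only suits pages whose article text is already present in the fetched HTML.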