extractus / article-extractor

To extract main article from given URL with Node.js

Home Page: https://extractor-demos.pages.dev/article-extractor

Add inputUrl option when providing raw html

kylealwyn opened this issue

Articles like https://www.dreamsongs.com/RiseOfWorseIsBetter.html or https://grugbrain.dev/ return content in the demo, but extraction bails out because no links are found when raw html is provided instead of a URL.

How do we feel about adding an option to provide a url in this case, similar to Postlight? https://github.com/postlight/parser#pre-fetched-html

Context: I'm prefetching the html with scrapingant and running it through a metascraper pipeline amongst other things

@kylealwyn thank you for raising this problem. Currently there is only a workaround.

I'm considering exposing a new method in the next release, something like:

import { extractFromHtml } from '@extractus/article-extractor'

// html: the pre-fetched HTML string; url: the page's original URL
const result = await extractFromHtml(html, url)

Postlight's API is a good reference too. I'm reading their docs. If you have any other advice, please share.
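To illustrate why a caller-supplied URL helps here, below is a minimal sketch of the fallback idea. It is not the library's actual implementation: `pickArticleUrl`, `metaLinks`, and `inputUrl` are hypothetical names used only for this example. The assumption is that the parser first looks for absolute links in the page metadata (canonical, og:url and so on) and uses the URL passed by the caller only when none of them is usable.

```javascript
// Hypothetical helper, not part of article-extractor: choose the URL that
// identifies an article when only raw HTML is available.
function pickArticleUrl(metaLinks, inputUrl) {
  // Links found in the page come first; the caller-supplied URL is the fallback.
  const candidates = [...metaLinks, inputUrl].filter(Boolean)
  for (const candidate of candidates) {
    try {
      const parsed = new URL(candidate) // throws on relative or malformed input
      if (parsed.protocol === 'http:' || parsed.protocol === 'https:') {
        return parsed.href
      }
    } catch {
      // Not an absolute URL; try the next candidate.
    }
  }
  // Nothing usable: this is the case where extraction would bail out.
  return null
}

// A page with no link metadata still resolves thanks to the supplied URL.
console.log(pickArticleUrl([], 'https://grugbrain.dev/'))
```

With a separate entrypoint that takes both the html and the url, pages that carry no canonical or og:url metadata would no longer be rejected.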

Hi, thanks for the quick response! I like your suggestion of a separate entrypoint rather than overloading a single interface.

It'd also be great to have some bindings into JSDOM or something similar. I'm running the html through article-extractor (great lib btw) unconditionally and encounter frequent errors such as "fetch is not defined". Curious if there's a way to get around this.

@kylealwyn fetch is now available by default on almost every platform. Could you share a little more about your environment?
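A quick way to check this in a given environment is a small probe like the one below. It is only a sketch, and `describeFetchSupport` is a hypothetical name for this example. It reports whether a global fetch function exists and, when running under Node.js, which version is in use (Node.js 18 and later expose a global fetch by default).

```javascript
// Hypothetical helper: report whether the runtime exposes a global fetch.
// The runtime parameter defaults to globalThis and exists to ease testing.
function describeFetchSupport(runtime = globalThis) {
  const hasFetch = typeof runtime.fetch === 'function'
  // process.versions.node is only present when running under Node.js.
  const nodeVersion =
    typeof process !== 'undefined' && process.versions
      ? process.versions.node
      : null
  return { hasFetch, nodeVersion }
}

console.log(describeFetchSupport())
```

If `hasFetch` is false, the "fetch is not defined" error most likely comes from the runtime (or a sandboxed window object inside it) rather than from the library itself.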

article-extractor depends heavily on linkedom for DOM manipulation. However, you can still use JSDOM to modify your raw HTML before passing it into this lib.

I'm still digging in, but in case you see something: it's a variety of client-side JS errors that seem to stem from third-party scripts. This might actually be coming from metascraper.

crawler:dev: TypeError: window.requestAnimationFrame is not a function
crawler:dev:     at j (https://ads.blogherads.com/static/blogherads.js:2:118920)
crawler:dev:     at qi (https://ads.blogherads.com/static/blogherads.js:33:65498) {"date":"Wed Jan 11 2023 10:11:01 GMT-0800 (Pacific Standard Time)","error":{},"exception":true,"os":{"loadavg":[7.7177734375,6.7529296875,4.482421875],"uptime":1187037},"process":{"argv":["/Users/kyle/dev/playground/basis/node_modules/.pnpm/ts-node@10.9.1_vq46kxj6zfka4f6ijsosnft3hy/node_modules/ts-node/dist/child/child-entrypoint.js","/Users/kyle/dev/playground/test/apps/crawler/src/server.ts"],"cwd":"/Users/kyle/dev/playground/test/apps/crawler","execPath":"/Users/kyle/Library/Application Support/fnm/node-versions/v18.12.1/installation/bin/node","gid":20,"memoryUsage":{"arrayBuffers":5499315,"external":7243456,"heapTotal":351469568,"heapUsed":266899520,"rss":511655936},"pid":30447,"uid":502,"version":"v18.12.1"},"stack":"TypeError: window.requestAnimationFrame is not a function\n    at j (https://ads.blogherads.com/static/blogherads.js:2:118920)\n    at qi (https://ads.blogherads.com/static/blogherads.js:33:65498)","trace":[{"column":118920,"file":"https://ads.blogherads.com/static/blogherads.js","function":"j","line":2,"method":null,"native":false},{"column":65498,"file":"https://ads.blogherads.com/static/blogherads.js","function":"qi","line":33,"method":null,"native":false}]}

7.2.8 works like a charm, amazingly fast turnaround, thank you!

@kylealwyn yeah, regarding the error you posted, it seems you are using something like a headless browser to parse web content.