croqaz / clean-mark

Convert an article into clean text

Would it be possible to use clean-mark with readability.js or another similar tool?

alanalvarado opened this issue

A-extractor fails on a number of websites and requires a manually maintained database. Other tools, like Mercury and Readability, work fine on many websites. Would it be possible to integrate clean-mark with any of those tools?

Thanks in advance.
EDIT: Or maybe it would be possible to use a local a-extractor database in addition to the online one, so we can create our own rules?

Thanks!

Hi. Indeed, clean-mark has its own logic, which is kind of limited and may be outdated.
Clean-mark is written in JavaScript, so it should be possible to integrate it with other JavaScript tools, possibly Readability from Mozilla. I actually looked at it in 2017 when I started the project, but Readability needs either a fake DOM (under Node.js, as in this app) or a real DOM (in a browser) to work, and I intentionally didn't want to bloat this project: JSDOM is a HUGE library with tons of dependencies.
So there's no way to use any of those libraries without bloating the project, but like I mentioned in the previous PR, I'm considering using Puppeteer sometime in the future.

Yes, it is possible to use a local A-extractor DB, but it's a bit of work.
Basically, you clone https://github.com/croqaz/a-extractor
Then you install a-extractor locally with: npm install /the/local-folder-with/a-extractor (that is, you install your local package).
Or you can use npm link in your local folder.
You can add your own rules in the local folder and that's it. The local clean-mark app will use your local a-extractor with your own rules.
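
The steps above could be wrapped in a small shell function, roughly like this. It is only a sketch: the function name use_local_extractor and both directory arguments are made up for illustration, and it assumes git and npm are on your PATH and that you already have a local clean-mark checkout.

```shell
#!/bin/sh
# Sketch: make a local clean-mark checkout use a local a-extractor clone.
# use_local_extractor and its two arguments are hypothetical names,
# not something shipped by clean-mark or a-extractor.
use_local_extractor() {
  extractor_dir="$1"   # where the a-extractor clone lives (or should be created)
  cleanmark_dir="$2"   # your existing local clean-mark checkout

  # 1. Clone the rules repository, unless the folder is already there.
  if [ ! -d "$extractor_dir" ]; then
    git clone https://github.com/croqaz/a-extractor "$extractor_dir" || return 1
  fi

  # Resolve to an absolute path so it stays valid after changing directory.
  extractor_dir=$(cd "$extractor_dir" && pwd) || return 1

  # 2. Install the local folder as a package inside clean-mark.
  (cd "$cleanmark_dir" && npm install "$extractor_dir") || return 1
}
```

Called as, for example, use_local_extractor "$HOME/src/a-extractor" "$HOME/src/clean-mark" (both paths are placeholders). After that, any rules you add inside the a-extractor folder should be the ones your local clean-mark picks up; the npm link route mentioned above is an alternative to step 2.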

Hope that helps.