metafates / mangal

📖 The most advanced (yet simple) cli manga downloader in the entire universe! Lua scrapers, export formats, anilist integration, fancy TUI and more!

Advanced scraping system

metafates opened this issue · comments

Feature Description

The current scraping system is weak, unstable, and heavily restricted. If a site gets blocked, it is very hard to find a replacement that meets all the requirements. I therefore propose using embedded scripts, which would allow sources to define more complex actions.

Solution you would like

Ferret

Ferret is a web scraping system with its own declarative query language. It can scrape JS-rendered pages, handle page events, and emulate user interactions.

The syntax looks like this:

LET doc = DOCUMENT('https://github.com/topics')

FOR el IN ELEMENTS(doc, '.py-4.border-bottom')
    LIMIT 10

    LET url = ELEMENT(el, 'a')
    LET name = ELEMENT(el, '.f3')
    LET description = ELEMENT(el, '.f5')

    RETURN {
        name: TRIM(name.innerText),
        description: TRIM(description.innerText),
        url: 'https://github.com' + url.attributes.href
    }
            

Alternatives you have considered

Integrate Lua scripts with GopherLua. But that is far more complicated than Ferret and, to be honest, unnecessary.

Anko is a great alternative!

Additional context

No response