Advanced scraping system
metafates opened this issue
Feature Description
The current scraping system is weak and unstable, and it has a lot of restrictions. If a site gets blocked, it is very hard to find a replacement that meets all the requirements. So I propose using embedded scripts, which would make it possible to define more complex actions.
Solution you would like
Ferret
Ferret is a declarative query language. It has the ability to scrape JS rendered pages, handle all page events and emulate user interactions.
The syntax looks like this:
LET doc = DOCUMENT('https://github.com/topics')

FOR el IN ELEMENTS(doc, '.py-4.border-bottom')
    LIMIT 10
    LET url = ELEMENT(el, 'a')
    LET name = ELEMENT(el, '.f3')
    LET description = ELEMENT(el, '.f5')

    RETURN {
        name: TRIM(name.innerText),
        description: TRIM(description.innerText),
        url: 'https://github.com' + url.attributes.href
    }
Alternatives you have considered
Integrate Lua scripts with Gopher Lua. But that is far more complicated than Ferret and, to be honest, unnecessary.
Anko is a great alternative!
Additional context
No response