jina-ai / reader

Convert any URL to an LLM-friendly input with a simple prefix https://r.jina.ai/

Home Page: https://jina.ai/reader

Only non-relevant page components returned

RamXX opened this issue

Fantastic project. Thank you!

Here is a page that (one would think) is straightforward to parse: https://access.redhat.com/security/cve/CVE-2023-45853 . However, none of the relevant information in the page makes it to the parsed version, only the corporate links and "scaffolding".

I figured I'd report it in case this can highlight some areas of improvement. Thanks again!

Thanks for reporting, will dig in.

Found the problem: the relevant content doesn't even show up in Chrome's View Source (view-source:https://access.redhat.com/security/cve/CVE-2023-45853), because the page requires JavaScript to be running.

So using stream mode solves the problem:

curl -H "Accept: text/event-stream" -H 'x-no-cache: true' https://r.jina.ai/https://access.redhat.com/security/cve/CVE-2023-45853

Pay attention to the last chunk in the event stream; it should give you something like this:

[image: screenshot of the output]
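For anyone who wants to script this instead of reading the curl output by eye, here is a minimal Python sketch that sends the same two headers and keeps only the last chunk of the event stream. The function names `last_sse_data` and `fetch_last_chunk` are hypothetical, and the sketch assumes the stream uses standard `data:` lines separated by blank lines. The thread does not say how each chunk's payload is structured, so the payload is returned as a raw string.

```python
import urllib.request
from typing import Iterable, Optional


def last_sse_data(lines: Iterable[str]) -> Optional[str]:
    """Return the data payload of the last event in an SSE stream, or None."""
    last = None
    buf = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buf:
                last = "\n".join(buf)
                buf = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format drops one optional space after the colon.
            buf.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:, retry:) and ":" comments are ignored.
    if buf:
        # Tolerate a stream that ends without a trailing blank line.
        last = "\n".join(buf)
    return last


def fetch_last_chunk(target_url: str) -> Optional[str]:
    """Request target_url through r.jina.ai in stream mode; keep the last chunk."""
    req = urllib.request.Request(
        "https://r.jina.ai/" + target_url,
        # Same headers as the curl command in this thread.
        headers={"Accept": "text/event-stream", "x-no-cache": "true"},
    )
    with urllib.request.urlopen(req) as resp:
        # The response is iterated line by line as bytes.
        return last_sse_data(line.decode("utf-8") for line in resp)
```

Calling `fetch_last_chunk("https://access.redhat.com/security/cve/CVE-2023-45853")` would mirror the curl call. Earlier chunks are simply overwritten, which matches the advice to look at the last one.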

Thanks 🙏

Thanks a lot! I'll make a note to try this mechanism whenever I can't parse a site. I'm wondering if we should keep this open to make sure it gets into the documentation. Otherwise we can just close it. Thanks again!