Only non-relevant page components returned
RamXX opened this issue · comments
Fantastic project. Thank you!
Here is a page that (one would think) is straightforward to parse: https://access.redhat.com/security/cve/CVE-2023-45853 . However, none of the relevant information in the page makes it to the parsed version, only the corporate links and "scaffolding".
I figured I'd report it in case this can highlight some areas of improvement. Thanks again!
Thanks for reporting, will dig in.
![image](https://private-user-images.githubusercontent.com/2041322/323022655-7bdc9caa-50ce-44e8-bf45-d82a0828a508.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjI0MzUwMTgsIm5iZiI6MTcyMjQzNDcxOCwicGF0aCI6Ii8yMDQxMzIyLzMyMzAyMjY1NS03YmRjOWNhYS01MGNlLTQ0ZTgtYmY0NS1kODJhMDgyOGE1MDgucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDczMSUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA3MzFUMTQwNTE4WiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9M2EyYjhkNDJhMjViNjA0YmFjMDVlZjA2ZGIxMDg2MGFjNmUxZDgxZjEzYmU0MGJiNTBiNGE2NmVmNThjZmEwMSZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.YoKEVmtB4UMnBtjWIimPVX5P3Et95zyQDT_PAZ1DoLw)
Found the problem: this page doesn't even work in Chrome's view-source view (view-source:https://access.redhat.com/security/cve/CVE-2023-45853), because it requires JavaScript to be running. Using stream mode solves the problem:
curl -H "Accept: text/event-stream" -H 'x-no-cache: true' https://r.jina.ai/https://access.redhat.com/security/cve/CVE-2023-45853
Pay attention to the last chunk in the event stream; it should give you:
![image](https://private-user-images.githubusercontent.com/2041322/323022973-22231468-d807-461c-9eb2-cc8314a72db6.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjI0MzUwMTgsIm5iZiI6MTcyMjQzNDcxOCwicGF0aCI6Ii8yMDQxMzIyLzMyMzAyMjk3My0yMjIzMTQ2OC1kODA3LTQ2MWMtOWViMi1jYzgzMTRhNzJkYjYucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDczMSUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA3MzFUMTQwNTE4WiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9YjBiODM5ODllNzUzZmIxNWQzNzk0YjFkM2U0NzdmYmRlNWUyZDQyMTU3NmY0NmJjNjQ4ZTVlMDNjMjgwODUzZCZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.M3vFQhejQEZ2TOM8XtOdNqpHvZJ8koqXJp43CnbLYs0)
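For anyone scripting this instead of reading curl output by eye, here is a minimal Python sketch of the same stream-mode request that keeps only the last chunk. The endpoint and the two headers come from the curl command above. The helper names `last_event_data` and `fetch_rendered` are made up for illustration, and the sketch assumes the server closes the stream after the final chunk and that each chunk arrives in standard server-sent-events `data:` lines. The payload is returned as a raw string, since its exact format isn't described in this thread.

```python
import urllib.request
from typing import Optional

# Endpoint taken from the curl command in this thread.
READER_PREFIX = "https://r.jina.ai/"


def last_event_data(stream_text: str) -> Optional[str]:
    """Return the data payload of the last event in an SSE stream, or None."""
    # Normalize line endings so events split the same way for \r\n, \r, and \n.
    lines = stream_text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    last = None
    data_lines = []
    for line in lines:
        if line == "":
            # A blank line ends the current event.
            if data_lines:
                last = "\n".join(data_lines)
                data_lines = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # In the SSE format, one leading space after the colon is dropped.
            data_lines.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:, retry:) and comment lines are ignored.
    if data_lines:
        # Tolerate a stream that ends without a trailing blank line.
        last = "\n".join(data_lines)
    return last


def fetch_rendered(url: str, timeout: float = 60.0) -> Optional[str]:
    """Request `url` through the reader in stream mode; return the last chunk."""
    request = urllib.request.Request(
        READER_PREFIX + url,
        headers={"Accept": "text/event-stream", "x-no-cache": "true"},
    )
    # Assumption: the server closes the connection after the final chunk,
    # so read() returns the complete stream.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return last_event_data(response.read().decode("utf-8"))
```

Calling `fetch_rendered("https://access.redhat.com/security/cve/CVE-2023-45853")` would then return whatever the final chunk carries. Earlier chunks are dropped on purpose, since they may be snapshots taken before the page's JavaScript has finished running.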
Thank🙏
Thanks a lot! Whenever I can't parse a site, I'll make a note to try this mechanism. I wonder if we should keep this open to make sure it gets into the documentation. Otherwise we can just close it. Thanks again!