ScrapeGraphAI / Scrapegraph-ai

Python scraper based on AI

Home Page: https://scrapegraphai.com


Smart websites return messages like "... With JavaScript and cookies enabled ..."

Bandit253 opened this issue · comments

commented

I have come across a number of sites that seem to have implemented some clever JavaScript that just returns a useless message, basically saying they are not going to let you scrape their site.
Using SmartScraperGraph on https://cointelegraph.com/news/ethereum-due-new-all-time-high-ether-etf-nears-en
and asking for a summary results in:
'summary': 'In a groundbreaking development, cointelegraph.com has successfully verified the security of its connection. This verification process, which took a few seconds, has ensured the safety of user data on the website. With JavaScript and cookies enabled, users can now continue to access cointelegraph.com without any security concerns.'

It would be great to be able to get around this. I am almost wondering whether the page could be loaded and the text OCR-ed, but I am not that smart.

I don't know the solution, but a smarter way should be possible, as of course the content is readable in a browser.

But the link you provided is not valid; that page does not exist.


There is a "d" missing from the link: https://cointelegraph.com/news/ethereum-due-new-all-time-high-ether-etf-nears-end

commented

Ah thanks! I see the problem has been found minutes before I had the opportunity.
Sorry for the inconvenience.

I have just tried it again and got
{'summary': 'The story is about the need to enable JavaScript and cookies in order to continue on the website.'}

with prompt
prompt="Give a summary of the story",

Hey @Bandit253, try setting the headless flag to False in the graph_config. With that I got the right answer:

{'summary': "The article discusses the anticipation of Ethereum (ETH) reaching a new all-time high as the launch of spot Ether ETFs in the United States nears. Market analyst Michaël van de Poppe predicts that ETH/USD is likely to surpass its previous record peak, driven by the approval of these ETFs. This development is expected to reduce Bitcoin's market dominance, giving altcoins like Ethereum more room to grow. At the time of writing, ETH is trading at around $3,850, still below its record high of $4,900 set in late 2021. The article also mentions that BlackRock's IBIT has become the world's largest Bitcoin ETF, surpassing the Grayscale Bitcoin Trust (GBTC)."}
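
For reference, a minimal sketch of a config with the flag set as suggested. The `build_graph_config` helper and the placeholder key are hypothetical and only for illustration; the model name is the one used elsewhere in this thread, and the comment about the visible browser window is an assumption about how the loader treats `headless`.

```python
def build_graph_config(api_key, headless=False):
    """Return a graph_config dict with the headless flag set explicitly.

    build_graph_config is a hypothetical helper, not part of the library.
    """
    return {
        "llm": {
            "api_key": api_key,
            "model": "gpt-3.5-turbo",  # model name taken from this thread
        },
        # Assumption: headless=False makes the loader open a visible
        # browser window instead of running the browser in the background.
        "headless": headless,
    }


graph_config = build_graph_config("YOUR_OPENAI_API_KEY")
print(graph_config["headless"])
```

The resulting dict would then be passed as the `config` argument when constructing the `SmartScraperGraph`, alongside the prompt and the source URL.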
commented

Fabulous, I do too...
thanks so much!!

commented

I thought it was done but alas no,
https://www.coindesk.com/markets/2024/05/31/bitcoin-breaks-to-low-end-of-trading-range-but-june-data-could-be-next-catalyst/
returns
In a shocking turn of events, a website has been found with no body content, leaving readers puzzled. The mysterious disappearance of information has sparked curiosity and speculation. Stay tuned for updates on this developing story.

What gets me is the smugness of the site author!

another...
https://www.investing.com/news/cryptocurrency-news/colossal-253-billion-bitcoin-withdrawal-stuns-major-exchanges-3466201

Try adding a slow_mo param to the config, like:

graph_config = {
    "llm": {
...
    },
    "loader_kwargs": {
        "slow_mo": 10000
    }
}

I found this in another issue.

Yes, some websites need time before the page code can be retrieved.

commented

I like and understand the thinking, but alas the websites still elude me. I got this from https://www.coindesk.com/markets/2024/05/31/bitcoin-breaks-to-low-end-of-trading-range-but-june-data-could-be-next-catalyst/

A gripping tale with no body content leaves readers on the edge of their seats. What secrets lie within?

I slowed it down by the suggested 10000 then 20000, with the same result.

graph_config = {
    "llm": {
        "api_key": self.oAI_key,
        "model": "gpt-3.5-turbo",
    },
    "loader_kwargs": {
        "slow_mo": 20000
    },
    "headless": False,
}
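
A rough sketch of how this trial and error could be automated: build one config per `slow_mo` value and flag summaries that look like the placeholder pages quoted in this thread. All three helper names are hypothetical, the marker phrases are simply copied from the outputs above, and `scrape` stands in for whatever function actually runs the scraper with a given config. Assuming `loader_kwargs` is forwarded to Playwright's browser launch, `slow_mo` is a delay in milliseconds applied to each browser operation.

```python
# Phrases copied from the placeholder summaries quoted in this thread.
# The list is illustrative, not exhaustive.
BLOCK_PAGE_MARKERS = ("javascript and cookies", "no body content")


def build_config(api_key, slow_mo):
    """Build a graph_config like the one above for a given slow_mo value."""
    return {
        "llm": {"api_key": api_key, "model": "gpt-3.5-turbo"},
        "loader_kwargs": {"slow_mo": slow_mo},
        "headless": False,
    }


def looks_like_block_page(summary):
    """Return True if the summary resembles a placeholder page."""
    lowered = summary.lower()
    return any(marker in lowered for marker in BLOCK_PAGE_MARKERS)


def first_real_summary(scrape, api_key, slow_mo_values):
    """Try each slow_mo value in order; return (slow_mo, summary) or None.

    `scrape` is a hypothetical callable that takes a graph_config and
    returns the summary text, e.g. a thin wrapper around the scraper.
    """
    for slow_mo in slow_mo_values:
        summary = scrape(build_config(api_key, slow_mo))
        if not looks_like_block_page(summary):
            return slow_mo, summary
    return None
```

A result of `None` means every value produced a placeholder-looking summary, which would suggest the delay is not what is blocking the page.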

As usual thanks for the tips!

Try with 5k or 10k, just as a test.

commented

Yep, tried 50k and 100k, same result. :(
Unrelated but interesting observation: I opened the same page in Edge, and while it is open it thrashes my C: drive!
The red arrow indicates when I opened the page and when I closed the tab. Wondering if it is BTC mining.
image

Hi, please install the new beta and try the script again.