VinciGit00 / Scrapegraph-ai

Python scraper based on AI

Home Page: https://scrapegraphai.com

SmartScraperGraph always gives `ValueError: No HTML body content found, please try setting the 'headless' flag to False in the graph configuration.`

pdavis68 opened this issue · comments

commented

Describe the bug
Using a SmartScraperGraph example, I always get the error: ValueError: No HTML body content found, please try setting the 'headless' flag to False in the graph configuration.

This is the current code:

from scrapegraphai.graphs import SearchGraph

graph_config = {
    "llm": {
        "api_key": "<<key here>>",
        "model": "gpt-4o",
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434/",  # local Ollama URL
    },
    "max_results": 5,
    "headless": False
}

# Create the SearchGraph instance
search_graph = SearchGraph(
    prompt="List me all the traditional recipes from Chioggia",
    config=graph_config
)

# Run the graph
result = search_graph.run()
print(result)

It gives me the error whether or not I set headless to False.
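As a sanity check, fetching a page directly with Playwright (which ScrapeGraphAI uses for its browser-based fetching, as far as I can tell) can show whether the body really comes back empty. This is just a standalone sketch, not ScrapeGraphAI code; the URL is a placeholder for one of the failing pages:

# Standalone check (not ScrapeGraphAI code): fetch a page with Playwright
# and see whether the <body> actually comes back empty.
# "https://example.com" is a placeholder for a failing URL.
from playwright.sync_api import sync_playwright

def fetch_body_html(url: str, headless: bool = False) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=headless)
        page = browser.new_page()
        page.goto(url)
        body_html = page.inner_html("body")  # "" if the body is empty
        browser.close()
        return body_html

print(len(fetch_body_html("https://example.com")))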

To Reproduce
Run the above example

Expected behavior
I expected it to not throw an error and eventually return the results.

Desktop (please complete the following information):

  • OS: Windows
  • Browser Chrome
  • Version 125.0.6422.114

Trace

python app.py
--- Executing SearchInternet Node ---
Search Query: leading causes of death among Masai warriors
--- Executing GraphIterator Node with batchsize 16 ---
processing graph instances:   0%|                                                                                                                                          | 0/5 [00:00<?, ?it/s]--- Executing Fetch Node ---
--- Executing Fetch Node ---
--- Executing Fetch Node ---
--- Executing Fetch Node ---
--- Executing Fetch Node ---
--- Executing Parse Node ---
--- Executing RAG Node ---
--- (updated chunks metadata) ---
--- (tokens compressed and vector stored) ---
--- Executing GenerateAnswer Node ---
Processing chunks: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<?, ?it/s] 
--- Executing Parse Node ---                                                                                                                                               | 0/1 [00:00<?, ?it/s] 
--- Executing RAG Node ---
--- (updated chunks metadata) ---
processing graph instances:  20%|██████████████████████████                                                                                                        | 1/5 [00:05<00:23,  5.87s/it]--- (tokens compressed and vector stored) ---
--- Executing GenerateAnswer Node ---
Processing chunks: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<?, ?it/s] 
processing graph instances:  40%|████████████████████████████████████████████████████                                                                              | 2/5 [00:07<00:09,  3.30s/it]--- Executing Parse Node ---
--- Executing RAG Node ---
--- (updated chunks metadata) ---
--- (tokens compressed and vector stored) ---
--- Executing GenerateAnswer Node ---
Processing chunks: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 3995.53it/s] 
--- Executing Parse Node ---                                                                                                                                               | 0/4 [00:00<?, ?it/s] 
--- Executing RAG Node ---
--- (updated chunks metadata) ---
--- (tokens compressed and vector stored) ---
--- Executing GenerateAnswer Node ---
Processing chunks: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 3919.91it/s] 
D:\scrapegraphai\.venv\lib\site-packages\scrapegraphai\utils\cleanup_html.py:27: MarkupResemblesLocatorWarning: The input looks more like a filename than markup. You may want to open this file and pass the filehandle into Beautiful Soup.
  soup = BeautifulSoup(html_content, 'html.parser')
Traceback (most recent call last):
  File "D:\scrapegraphai\app.py", line 25, in <module>
    result = search_graph.run()
  File "D:\scrapegraphai\.venv\lib\site-packages\scrapegraphai\graphs\search_graph.py", line 124, in run
    self.final_state, self.execution_info = self.graph.execute(inputs)
  File "D:\scrapegraphai\.venv\lib\site-packages\scrapegraphai\graphs\base_graph.py", line 171, in execute
    return self._execute_standard(initial_state)
  File "D:\scrapegraphai\.venv\lib\site-packages\scrapegraphai\graphs\base_graph.py", line 110, in _execute_standard
    result = current_node.execute(state)
  File "D:\scrapegraphai\.venv\lib\site-packages\scrapegraphai\nodes\graph_iterator_node.py", line 74, in execute
    state = asyncio.run(self._async_execute(state, batchsize))
  File "c:\python310\lib\asyncio\runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "c:\python310\lib\asyncio\base_events.py", line 641, in run_until_complete
    return future.result()
  File "D:\scrapegraphai\.venv\lib\site-packages\scrapegraphai\nodes\graph_iterator_node.py", line 134, in _async_execute
    answers = await tqdm.gather(
  File "D:\scrapegraphai\.venv\lib\site-packages\tqdm\asyncio.py", line 79, in gather
    res = [await f for f in cls.as_completed(ifs, loop=loop, timeout=timeout,
  File "D:\scrapegraphai\.venv\lib\site-packages\tqdm\asyncio.py", line 79, in <listcomp>
    res = [await f for f in cls.as_completed(ifs, loop=loop, timeout=timeout,
  File "c:\python310\lib\asyncio\tasks.py", line 571, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "D:\scrapegraphai\.venv\lib\site-packages\tqdm\asyncio.py", line 76, in wrap_awaitable
    return i, await f
  File "D:\scrapegraphai\.venv\lib\site-packages\scrapegraphai\nodes\graph_iterator_node.py", line 123, in _async_run
    return await asyncio.to_thread(graph.run)
  File "c:\python310\lib\asyncio\threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
  File "c:\python310\lib\concurrent\futures\thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "D:\scrapegraphai\.venv\lib\site-packages\scrapegraphai\graphs\smart_scraper_graph.py", line 118, in run
    self.final_state, self.execution_info = self.graph.execute(inputs)
  File "D:\scrapegraphai\.venv\lib\site-packages\scrapegraphai\graphs\base_graph.py", line 171, in execute
    return self._execute_standard(initial_state)
  File "D:\scrapegraphai\.venv\lib\site-packages\scrapegraphai\graphs\base_graph.py", line 110, in _execute_standard
    result = current_node.execute(state)
  File "D:\scrapegraphai\.venv\lib\site-packages\scrapegraphai\nodes\fetch_node.py", line 162, in execute
    title, minimized_body, link_urls, image_urls = cleanup_html(
  File "D:\scrapegraphai\.venv\lib\site-packages\scrapegraphai\utils\cleanup_html.py", line 65, in cleanup_html
    raise ValueError("No HTML body content found, please try setting the 'headless' flag to False in the graph configuration.")
ValueError: No HTML body content found, please try setting the 'headless' flag to False in the graph configuration.
processing graph instances:  40%|████████████████████████████████████████████████████                                                                              | 2/5 [01:05<01:37, 32.56s/it]
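The MarkupResemblesLocatorWarning just before the traceback is a useful clue: whatever string reached BeautifulSoup looked like a filename or URL rather than markup, so no <body> could be parsed. Below is a hedged reconstruction of the check that raises, inferred only from the trace above (the real cleanup_html returns more values and may differ in detail):

# Hedged reconstruction of the failing check in
# scrapegraphai/utils/cleanup_html.py, inferred from the trace alone.
from bs4 import BeautifulSoup

def cleanup_html_sketch(html_content: str):
    # Passing a bare URL or file path here triggers the
    # MarkupResemblesLocatorWarning seen in the log above.
    soup = BeautifulSoup(html_content, 'html.parser')
    if soup.body is None:
        # No parseable <body> -> the ValueError raised at cleanup_html.py:65
        raise ValueError(
            "No HTML body content found, please try setting the 'headless' "
            "flag to False in the graph configuration."
        )
    return soup.body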

How did you solve the issue?

commented

I have not solved it. I get it with both the SearchGraph and the SmartScraperGraph (which SearchGraph uses internally), so I suspect the issue is actually with SmartScraperGraph. But I haven't found a solution.

Hey @pdavis68, can you give the URL that doesn't work for you in SmartScraperGraph? For the SearchGraph, maybe it lands on a website that isn't HTML but only an image or a document. Will investigate.
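To isolate it, a minimal SmartScraperGraph run on a single URL would be enough. A sketch — the source URL is a placeholder, and graph_config is the dict from the report above:

# Isolation test: bypass SearchGraph and run SmartScraperGraph on one URL.
# "https://example.com" is a placeholder for the failing URL.
from scrapegraphai.graphs import SmartScraperGraph

smart_scraper = SmartScraperGraph(
    prompt="List me all the traditional recipes from Chioggia",
    source="https://example.com",  # the failing URL goes here
    config=graph_config,           # same config as in the report
)
print(smart_scraper.run())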

It worked for me. Try with a Linux operating system.

commented

I don't run a Linux desktop. I run a Windows desktop.

commented

@PeriniM Maybe I was wrong about that one. SmartScraperGraph seems to be working for me now. It's just the SearchGraph that's failing.

Ok, can you try using a lower max_results number? In any case, I created a new branch to solve the issue; I will investigate whether the problem is related to fetching a non-HTML page and add some fault-tolerance mechanisms ;)
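Until that lands, a rough interim sketch of both suggestions, assuming the graph_config and prompt from the original report. Note this only catches the error at the top level; skipping individual bad URLs needs the fault tolerance inside the graph itself:

# Interim workaround sketch: fewer search results, plus a top-level guard.
from scrapegraphai.graphs import SearchGraph

graph_config["max_results"] = 2  # fewer pages fetched per search
search_graph = SearchGraph(
    prompt="List me all the traditional recipes from Chioggia",
    config=graph_config,
)

try:
    result = search_graph.run()
    print(result)
except ValueError as err:
    # One fetched result had no HTML body (e.g. an image or PDF).
    print(f"Run aborted: {err}")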

Hi, please install the new beta.