SmartScraperGraph always gives `ValueError: No HTML body content found, please try setting the 'headless' flag to False in the graph configuration.`
pdavis68 opened this issue · comments
Describe the bug
Using a SmartScraperGraph example, I always get the error: ValueError: No HTML body content found, please try setting the 'headless' flag to False in the graph configuration.
This is the current code:

    from scrapegraphai.graphs import SearchGraph

    graph_config = {
        "llm": {
            "api_key": "<<key here>>",
            "model": "gpt-4o",
        },
        "embeddings": {
            "model": "ollama/nomic-embed-text",
            "base_url": "http://localhost:11434/",  # set ollama URL arbitrarily
        },
        "max_results": 5,
        "headless": False
    }

    # Create the SearchGraph instance
    search_graph = SearchGraph(
        prompt="List me all the traditional recipes from Chioggia",
        config=graph_config
    )

    # Run the graph
    result = search_graph.run()
    print(result)
It gives me the error whether I set headless to True or False.
To Reproduce
Run the above example
Expected behavior
I expected it to not throw an error and eventually return the results.
Desktop (please complete the following information):
- OS: Windows
- Browser: Chrome
- Version: 125.0.6422.114
Trace
python app.py
--- Executing SearchInternet Node ---
Search Query: leading causes of death among Masai warriors
--- Executing GraphIterator Node with batchsize 16 ---
processing graph instances: 0%| | 0/5 [00:00<?, ?it/s]--- Executing Fetch Node ---
--- Executing Fetch Node ---
--- Executing Fetch Node ---
--- Executing Fetch Node ---
--- Executing Fetch Node ---
--- Executing Parse Node ---
--- Executing RAG Node ---
--- (updated chunks metadata) ---
--- (tokens compressed and vector stored) ---
--- Executing GenerateAnswer Node ---
Processing chunks: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<?, ?it/s]
--- Executing Parse Node --- | 0/1 [00:00<?, ?it/s]
--- Executing RAG Node ---
--- (updated chunks metadata) ---
processing graph instances: 20%|██████████████████████████ | 1/5 [00:05<00:23, 5.87s/it]--- (tokens compressed and vector stored) ---
--- Executing GenerateAnswer Node ---
Processing chunks: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<?, ?it/s]
processing graph instances: 40%|████████████████████████████████████████████████████ | 2/5 [00:07<00:09, 3.30s/it]--- Executing Parse Node ---
--- Executing RAG Node ---
--- (updated chunks metadata) ---
--- (tokens compressed and vector stored) ---
--- Executing GenerateAnswer Node ---
Processing chunks: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 3995.53it/s]
--- Executing Parse Node --- | 0/4 [00:00<?, ?it/s]
--- Executing RAG Node ---
--- (updated chunks metadata) ---
--- (tokens compressed and vector stored) ---
--- Executing GenerateAnswer Node ---
Processing chunks: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 3919.91it/s]
D:\scrapegraphai\.venv\lib\site-packages\scrapegraphai\utils\cleanup_html.py:27: MarkupResemblesLocatorWarning: The input looks more like a filename than markup. You may want to open this file and pass the filehandle into Beautiful Soup.
soup = BeautifulSoup(html_content, 'html.parser')
Traceback (most recent call last):
File "D:\scrapegraphai\app.py", line 25, in <module>
result = search_graph.run()
File "D:\scrapegraphai\.venv\lib\site-packages\scrapegraphai\graphs\search_graph.py", line 124, in run
self.final_state, self.execution_info = self.graph.execute(inputs)
File "D:\scrapegraphai\.venv\lib\site-packages\scrapegraphai\graphs\base_graph.py", line 171, in execute
return self._execute_standard(initial_state)
File "D:\scrapegraphai\.venv\lib\site-packages\scrapegraphai\graphs\base_graph.py", line 110, in _execute_standard
result = current_node.execute(state)
File "D:\scrapegraphai\.venv\lib\site-packages\scrapegraphai\nodes\graph_iterator_node.py", line 74, in execute
state = asyncio.run(self._async_execute(state, batchsize))
File "c:\python310\lib\asyncio\runners.py", line 44, in run
return loop.run_until_complete(main)
File "c:\python310\lib\asyncio\base_events.py", line 641, in run_until_complete
return future.result()
File "D:\scrapegraphai\.venv\lib\site-packages\scrapegraphai\nodes\graph_iterator_node.py", line 134, in _async_execute
answers = await tqdm.gather(
File "D:\scrapegraphai\.venv\lib\site-packages\tqdm\asyncio.py", line 79, in gather
res = [await f for f in cls.as_completed(ifs, loop=loop, timeout=timeout,
File "D:\scrapegraphai\.venv\lib\site-packages\tqdm\asyncio.py", line 79, in <listcomp>
res = [await f for f in cls.as_completed(ifs, loop=loop, timeout=timeout,
File "c:\python310\lib\asyncio\tasks.py", line 571, in _wait_for_one
return f.result() # May raise f.exception().
File "D:\scrapegraphai\.venv\lib\site-packages\tqdm\asyncio.py", line 76, in wrap_awaitable
return i, await f
File "D:\scrapegraphai\.venv\lib\site-packages\scrapegraphai\nodes\graph_iterator_node.py", line 123, in _async_run
return await asyncio.to_thread(graph.run)
File "c:\python310\lib\asyncio\threads.py", line 25, in to_thread
return await loop.run_in_executor(None, func_call)
File "c:\python310\lib\concurrent\futures\thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "D:\scrapegraphai\.venv\lib\site-packages\scrapegraphai\graphs\smart_scraper_graph.py", line 118, in run
self.final_state, self.execution_info = self.graph.execute(inputs)
File "D:\scrapegraphai\.venv\lib\site-packages\scrapegraphai\graphs\base_graph.py", line 171, in execute
return self._execute_standard(initial_state)
File "D:\scrapegraphai\.venv\lib\site-packages\scrapegraphai\graphs\base_graph.py", line 110, in _execute_standard
result = current_node.execute(state)
File "D:\scrapegraphai\.venv\lib\site-packages\scrapegraphai\nodes\fetch_node.py", line 162, in execute
title, minimized_body, link_urls, image_urls = cleanup_html(
File "D:\scrapegraphai\.venv\lib\site-packages\scrapegraphai\utils\cleanup_html.py", line 65, in cleanup_html
raise ValueError("No HTML body content found, please try setting the 'headless' flag to False in the graph configuration.")
ValueError: No HTML body content found, please try setting the 'headless' flag to False in the graph configuration.
processing graph instances: 40%|████████████████████████████████████████████████████ | 2/5 [01:05<01:37, 32.56s/it]
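For context on the traceback: the ValueError is raised by `cleanup_html`, which parses the fetched page with BeautifulSoup and fails when no `<body>` tag is found. That is exactly what happens when a search result points at a non-HTML resource (PDF, image, plain text). A minimal sketch of that failure mode — the `%PDF-1.4` string below is an illustrative stand-in for a non-HTML response, not real fetched data:

```python
# Reproduce the condition behind the error: BeautifulSoup finds no
# <body> when the fetched content is not HTML markup at all.
from bs4 import BeautifulSoup

html_page = "<html><body><p>Chioggia recipes</p></body></html>"
non_html = "%PDF-1.4 raw document bytes, not markup"  # stand-in for a PDF result

# html.parser does not invent missing tags, so a non-HTML string has no body
has_body = BeautifulSoup(html_page, "html.parser").body is not None
no_body = BeautifulSoup(non_html, "html.parser").body is None
```

This also explains why the `headless` hint in the error message is misleading here: the flag cannot help when the target resource simply has no HTML body.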
How did you solve the issue?
I have not solved it. I get it with both the SearchGraph and the SmartScraperGraph (which SearchGraph uses internally), so I suspect the issue is actually with SmartScraperGraph. But I haven't found a solution.
Hey @pdavis68, can you give the URL that doesn't work for you in SmartScraperGraph? For the SearchGraph, it may be landing on a page which is not HTML but only an image or a document. Will investigate.
It worked for me. Try with a Linux operating system.
I don't run a Linux desktop. I run a Windows desktop.
@PeriniM Maybe I was wrong about that one. SmartScraperGraph seems to be working for me now. It's just the SearchGraph that's failing.
Ok, can you try using a lower max_results number? In any case, I created a new branch to solve the issue; I will probably investigate whether the problem is related to fetching a non-HTML page and add some fault-tolerance mechanisms ;)
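For anyone hitting this before a fix lands, a hedged sketch of the fault-tolerance idea mentioned above: wrap each per-URL scrape in a try/except so one non-HTML search result doesn't abort the whole batch. This is not the library's API — the runner callables below are stand-ins for individual `SmartScraperGraph.run()` calls:

```python
# Hypothetical workaround: run each per-page scrape independently and
# skip pages that raise, instead of letting one failure kill the batch.
def run_with_fault_tolerance(runners):
    """Run each (name, callable) pair; collect results, record failures."""
    results, failures = [], []
    for name, run in runners:
        try:
            results.append((name, run()))
        except ValueError as exc:
            failures.append((name, str(exc)))  # log and move on
    return results, failures

# Simulated batch: two pages parse fine, one raises like cleanup_html does.
def good_page():
    return {"recipes": ["bigoli in salsa"]}

def bad_page():
    raise ValueError("No HTML body content found, ...")

results, failures = run_with_fault_tolerance(
    [("url1", good_page), ("url2", bad_page), ("url3", good_page)]
)
```

With this pattern the SearchGraph-style batch would return partial results plus a list of failed URLs, rather than the hard ValueError seen in the trace.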
Hi, please install the new beta.