Problem with scrapegraphai/graphs/pdf_scraper_graph.py
tindo2003 opened this issue · comments
Hello,
The output of FetchNode is fed directly into RAGNode, which won't work: RAGNode expects its second argument to be a list of str, but FetchNode outputs a list of langchain Document objects. When I call the run method on an instance of PDFScraperGraph, I get the following error:
ValidationError                           Traceback (most recent call last)
Cell In[3], line 13
      1 from scrapegraphai.graphs import PDFScraperGraph
      3 pdf_scraper = PDFScraperGraph(
      4     prompt="Which company sponsored the research?",
      5     source="/Users/tindo/Desktop/lang_graph/data/lorem_ipsum.pdf",
   (...)
     11     },
     12 )
---> 13 result = pdf_scraper.run()

File ~/Desktop/lang_graph/env/lib/python3.10/site-packages/scrapegraphai/graphs/pdf_scraper_graph.py:105, in PDFScraperGraph.run(self)
     97 """
     98 Executes the web scraping process and returns the answer to the prompt.
     99
    100 Returns:
    101     str: The answer to the prompt.
    102 """
    104 inputs = {"user_prompt": self.prompt, self.input_key: self.source}
--> 105 self.final_state, self.execution_info = self.graph.execute(inputs)
    107 return self.final_state.get("answer", "No answer found.")

File ~/Desktop/lang_graph/env/lib/python3.10/site-packages/scrapegraphai/graphs/base_graph.py:171, in BaseGraph.execute(self, initial_state)
    169     return (result["_state"], [])
    170 else:
--> 171     return self._execute_standard(initial_state)

File ~/Desktop/lang_graph/env/lib/python3.10/site-packages/scrapegraphai/graphs/base_graph.py:110, in BaseGraph._execute_standard(self, initial_state)
    107 current_node = next(node for node in self.nodes if node.node_name == current_node_name)
    109 with get_openai_callback() as cb:
--> 110     result = current_node.execute(state)
    111 node_exec_time = time.time() - curr_time
    112 total_exec_time += node_exec_time

File ~/Desktop/lang_graph/env/lib/python3.10/site-packages/scrapegraphai/nodes/rag_node.py:85, in RAGNode.execute(self, state)
     82 chunked_docs = []
     84 for i, chunk in enumerate(doc):
---> 85     doc = Document(
     86         page_content=chunk,
     87         metadata={
     88             "chunk": i + 1,
     89         },
     90     )
     91     chunked_docs.append(doc)
     93 self.logger.info("--- (updated chunks metadata) ---")

File ~/Desktop/lang_graph/env/lib/python3.10/site-packages/langchain_core/documents/base.py:22, in Document.__init__(self, page_content, **kwargs)
     20 def __init__(self, page_content: str, **kwargs: Any) -> None:
     21     """Pass page_content in as positional or named arg."""
---> 22     super().__init__(page_content=page_content, **kwargs)

File ~/Desktop/lang_graph/env/lib/python3.10/site-packages/pydantic/v1/main.py:341, in BaseModel.__init__(__pydantic_self__, **data)
    339 values, fields_set, validation_error = validate_model(__pydantic_self__.__class__, data)
    340 if validation_error:
--> 341     raise validation_error
    342 try:
    343     object_setattr(__pydantic_self__, '__dict__', values)

ValidationError: 1 validation error for Document
page_content
  str type expected (type=type_error.str)
I believe the intended pipeline is FetchNode -> ParseNode -> RAGNode instead, since text_splitter.split_text() in ParseNode turns the list of Document into a list of str, although that step may not make sense in the PDF scraping scenario. Thanks!
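Based on the diagnosis above, one workaround is to flatten the fetched Document objects into plain strings before they reach RAGNode, which is roughly what ParseNode's text splitter does for you. This is only an illustrative sketch: the Document dataclass below is a self-contained stand-in for langchain_core.documents.Document, and documents_to_chunks is a hypothetical helper, not part of scrapegraphai.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    # Stand-in for langchain_core.documents.Document, so the sketch
    # runs without langchain installed.
    page_content: str
    metadata: dict = field(default_factory=dict)


def documents_to_chunks(docs: list[Document]) -> list[str]:
    # RAGNode validates that every chunk is a str; extracting
    # page_content from each fetched Document satisfies that contract.
    return [doc.page_content for doc in docs]


docs = [
    Document("Lorem ipsum dolor sit amet."),
    Document("The research was sponsored by Example Corp."),
]
print(documents_to_chunks(docs))
# → ['Lorem ipsum dolor sit amet.', 'The research was sponsored by Example Corp.']
```

Wiring a ParseNode between FetchNode and RAGNode would achieve the same effect through the library's own pipeline.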
Can you give me the script you used?
I was trying to run the following:
from scrapegraphai.graphs import PDFScraperGraph
pdf_scraper = PDFScraperGraph(
prompt="Which company sponsored the research?",
source="data/lorem_ipsum.pdf",
config={
"llm": {
"model": "gpt-3.5-turbo",
"api_key": "my_key",
}
},
)
result = pdf_scraper.run()
and I received the error above. Thanks.
Please open it separately.
I am not sure what you mean. The code snippet I posted at the very top of this issue is the source code in pdf_scraper_graph.py, not the code that I tried to run. Sorry for the confusion if that is the case.
Yes, can I have the entire main?