VinciGit00 / Scrapegraph-ai

Python scraper based on AI

Home Page: https://scrapegraphai.com


Problem with scrapegraphai/graphs/pdf_scraper_graph.py

tindo2003 opened this issue

Hello,

[Screenshot of the pdf_scraper_graph.py source code, 2024-06-06]

Feeding the output of FetchNode directly into RAGNode won't work: RAGNode expects its second input to be a list of str, but FetchNode outputs a list of langchain Document objects.
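The mismatch can be reproduced without the library. The class below is a stand-in for langchain's Document (not the real class, which validates page_content via pydantic v1); iterating a list of such documents and passing each element back in as page_content fails, just as in the traceback below.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """Stand-in for langchain_core.documents.Document, for illustration only.
    The real class rejects a non-str page_content via pydantic validation;
    here we mimic that with an explicit type check."""
    page_content: str
    metadata: dict = field(default_factory=dict)

    def __post_init__(self):
        if not isinstance(self.page_content, str):
            raise TypeError("str type expected for page_content")

# FetchNode-style output: a list of Document, not a list of str
fetched = [Document(page_content="Lorem ipsum dolor sit amet.")]

# RAGNode-style loop: each "chunk" is itself a Document, so constructing
# Document(page_content=chunk) receives a Document where a str is expected
try:
    for i, chunk in enumerate(fetched):
        Document(page_content=chunk, metadata={"chunk": i + 1})
except TypeError as exc:
    print(exc)  # str type expected for page_content
```

This is the same failure mode pydantic reports as `str type expected (type=type_error.str)`.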

When I call the run method on an instance of PDFScraperGraph, I get the following error:

ValidationError                           Traceback (most recent call last)
Cell In[3], line 13
      1 from scrapegraphai.graphs import PDFScraperGraph
      3 pdf_scraper = PDFScraperGraph(
      4     prompt="Which company sponsored the research?",
      5     source="/Users/tindo/Desktop/lang_graph/data/lorem_ipsum.pdf",
   (...)
     11     },
     12 )
---> 13 result = pdf_scraper.run()

File ~/Desktop/lang_graph/env/lib/python3.10/site-packages/scrapegraphai/graphs/pdf_scraper_graph.py:105, in PDFScraperGraph.run(self)
     97 """
     98 Executes the web scraping process and returns the answer to the prompt.
     99
    100 Returns:
    101     str: The answer to the prompt.
    102 """
    104 inputs = {"user_prompt": self.prompt, self.input_key: self.source}
--> 105 self.final_state, self.execution_info = self.graph.execute(inputs)
    107 return self.final_state.get("answer", "No answer found.")

File ~/Desktop/lang_graph/env/lib/python3.10/site-packages/scrapegraphai/graphs/base_graph.py:171, in BaseGraph.execute(self, initial_state)
    169     return (result["_state"], [])
    170 else:
--> 171     return self._execute_standard(initial_state)

File ~/Desktop/lang_graph/env/lib/python3.10/site-packages/scrapegraphai/graphs/base_graph.py:110, in BaseGraph._execute_standard(self, initial_state)
    107 current_node = next(node for node in self.nodes if node.node_name == current_node_name)
    109 with get_openai_callback() as cb:
--> 110     result = current_node.execute(state)
    111     node_exec_time = time.time() - curr_time
    112     total_exec_time += node_exec_time

File ~/Desktop/lang_graph/env/lib/python3.10/site-packages/scrapegraphai/nodes/rag_node.py:85, in RAGNode.execute(self, state)
     82 chunked_docs = []
     84 for i, chunk in enumerate(doc):
---> 85     doc = Document(
     86         page_content=chunk,
     87         metadata={
     88             "chunk": i + 1,
     89         },
     90     )
     91     chunked_docs.append(doc)
     93 self.logger.info("--- (updated chunks metadata) ---")

File ~/Desktop/lang_graph/env/lib/python3.10/site-packages/langchain_core/documents/base.py:22, in Document.__init__(self, page_content, **kwargs)
     20 def __init__(self, page_content: str, **kwargs: Any) -> None:
     21     """Pass page_content in as positional or named arg."""
---> 22     super().__init__(page_content=page_content, **kwargs)

File ~/Desktop/lang_graph/env/lib/python3.10/site-packages/pydantic/v1/main.py:341, in BaseModel.__init__(__pydantic_self__, **data)
    339 values, fields_set, validation_error = validate_model(__pydantic_self__.__class__, data)
    340 if validation_error:
--> 341     raise validation_error
    342 try:
    343     object_setattr(__pydantic_self__, '__dict__', values)

ValidationError: 1 validation error for Document
page_content
  str type expected (type=type_error.str)

I believe the intended pipeline is FetchNode -> ParseNode -> RAGNode instead, although that may not make sense in the PDF-scraping scenario. That pipeline works because text_splitter.split_text() in ParseNode turns the list of Document into a list of str. Thanks!
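As a workaround sketch (a hypothetical helper, not part of scrapegraphai's API), the FetchNode output could be flattened into the list of str that RAGNode expects, mirroring what ParseNode's text_splitter.split_text() achieves:

```python
from dataclasses import dataclass, field

@dataclass
class FakeDocument:
    """Stand-in for langchain's Document, for illustration only."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def docs_to_strings(docs):
    """Flatten Document-like objects into plain strings; pass strings through unchanged."""
    return [d.page_content if hasattr(d, "page_content") else str(d) for d in docs]

fetched = [FakeDocument("Lorem ipsum dolor sit amet.")]
print(docs_to_strings(fetched))  # ['Lorem ipsum dolor sit amet.']
```

With this conversion applied between the two nodes, RAGNode's chunk loop would receive str values and the pydantic validation error would not fire.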

Can you give me the script you used?

I was trying to run the following:

from scrapegraphai.graphs import PDFScraperGraph

pdf_scraper = PDFScraperGraph(
    prompt="Which company sponsored the research?",
    source="data/lorem_ipsum.pdf",
    config={
        "llm": {
            "model": "gpt-3.5-turbo",
            "api_key": "my_key",
        }
    },
)
result = pdf_scraper.run()

and I received the error above. Thanks.

Please open it separately.

I am not sure what you mean. The code snippet I posted at the very top of this issue is the source code of pdf_scraper_graph.py, not the code that I tried to run. Sorry for any confusion.

Yes, can I have the entire main?