Trying to modify the PDF reader with "Sources" information

Question

MaxSychevskiy opened this issue a year ago · comments

Hi,

I wanted to modify the PDF bot slightly by removing the automatic clean-up of the previous information. Essentially I can load several PDFs and run questions across those.
It works in simple terms, but I'm struggling a bit with how to add "Source" information to the Neo4j graph so it can be used as part of the answer. The source could be as simple as the name of the file.

Any help from anyone?

MikePos1581 · Answer 1 · Mon Dec 18 2023 23:19:33 GMT+0800 (China Standard Time)

There's a RetrievalQAWithSourcesChain mentioned here https://python.langchain.com/docs/integrations/vectorstores/neo4jvector

I've tried swapping it in for RetrievalQA in pdf_bot.py but haven't managed to get it to work yet.

Michael Hunger · Answer 2 · Thu Jan 25 2024 01:02:24 GMT+0800 (China Standard Time)

@tomasonjo correct me if I'm wrong, but the main thing is to provide a {metadata: {source: source-link}} to the qa_chain_with_sources?

from langchain.chains.qa_with_sources import load_qa_with_sources_chain

https://python.langchain.com/docs/use_cases/question_answering/sources

Michael Hunger · Answer 3 · Thu Jan 25 2024 01:02:35 GMT+0800 (China Standard Time)

Feel free to send a PR

Tomaz Bratanic · Answer 4 · Thu Jan 25 2024 01:22:02 GMT+0800 (China Standard Time)

To store source information in Neo4j, you would need to use the from_documents method instead of from_texts to populate the vector index. You could do this with something like:

from langchain.schema import Document
# The text field is page_content, and metadata keys must be quoted strings
documents = [Document(page_content=text, metadata={"source": "PDF file name"}) for text in texts]

Every key-value pair in metadata is stored as an additional node property.
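
Since the original question is about loading several PDFs, here is a minimal sketch of tagging each chunk with the file it came from. The helper name build_documents, the dict layout, and the file names are assumptions for illustration and are not taken from pdf_bot.py. The fallback dataclass is only a stand-in so the sketch runs without langchain installed.

```python
from dataclasses import dataclass, field

try:
    # Same import as in the answer above
    from langchain.schema import Document
except ImportError:
    # Stand-in (assumption) so the sketch runs without langchain; not the real class
    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)


def build_documents(chunks_by_file):
    """Turn {file name: [chunk, ...]} into Documents tagged with their source file.

    Hypothetical helper: the name and input layout are illustrative only.
    """
    return [
        Document(page_content=chunk, metadata={"source": file_name})
        for file_name, chunks in chunks_by_file.items()
        for chunk in chunks
    ]


# Made-up file names and chunk texts, one entry per uploaded PDF
docs = build_documents({
    "report_a.pdf": ["first chunk of A", "second chunk of A"],
    "report_b.pdf": ["only chunk of B"],
})

for doc in docs:
    print(doc.metadata["source"], "->", doc.page_content)
```

The resulting list would then be passed to from_documents in place of the existing from_texts call, so each chunk node should carry a source property that a sources-aware chain can report back.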