Trying to modify the PDF reader with "Sources" information
MaxSychevskiy opened this issue
Hi,
I wanted to modify the PDF bot slightly by removing the automatic clean-up of the previous information, so that I can load several PDFs and run questions across them.
It works in simple terms, but I'm struggling a bit with how to add "Source" information to the Neo4j graph so it can be used as part of the answer. The source could be as simple as the name of the file.
Any help from anyone?
There's a RetrievalQAWithSourcesChain mentioned here https://python.langchain.com/docs/integrations/vectorstores/neo4jvector
I've tried swapping that in for RetrievalQA in pdf_bot.py but haven't managed to get it to work yet.
@tomasonjo correct me if I'm wrong, but the main thing is to provide a {metadata: {source: source-link}} to the qa_chain_with_sources?
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
https://python.langchain.com/docs/use_cases/question_answering/sources
Feel free to send a PR
To store source information in Neo4j, you would need to use the from_documents method instead of from_texts to populate the vector index. You could do this with something like:
from langchain.schema import Document
documents = [Document(page_content=text, metadata={"source": "PDF file name"}) for text in texts]
Any key-value pair in metadata is stored as additional node properties.
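For the multi-PDF case from the original question, here is a minimal sketch of how the snippet above might be extended so each chunk carries the name of the file it came from. The helper names (build_documents, unique_sources), the chunks_by_file input, and the sample file names are hypothetical and only for illustration. The fallback Document class is a stand-in so the sketch runs without LangChain installed. The commented from_documents call at the end assumes that embeddings, url, username, and password are already defined in your script.

```python
from dataclasses import dataclass, field

try:
    # Real class, same import as in the snippet above.
    from langchain.schema import Document
except ImportError:
    # Stand-in (not the real class) so the sketch runs without LangChain.
    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)


def build_documents(chunks_by_file):
    """Turn {file name: [text chunks]} into Documents tagged with their source."""
    return [
        Document(page_content=chunk, metadata={"source": file_name})
        for file_name, chunks in chunks_by_file.items()
        for chunk in chunks
    ]


def unique_sources(documents):
    """Return the distinct source names in first-seen order."""
    seen = []
    for doc in documents:
        source = doc.metadata.get("source")
        if source is not None and source not in seen:
            seen.append(source)
    return seen


# Hypothetical input: text already extracted and split, grouped per uploaded PDF.
chunks_by_file = {
    "report_a.pdf": ["First chunk of A.", "Second chunk of A."],
    "report_b.pdf": ["Only chunk of B."],
}
documents = build_documents(chunks_by_file)
print(unique_sources(documents))

# With a running Neo4j instance, this list would take the place of the
# from_texts input (variable names below are assumptions):
# Neo4jVector.from_documents(documents, embedding=embeddings, url=url,
#                            username=username, password=password)
```

Since each chunk keeps its own metadata, the file name should end up as a "source" property on the corresponding node, and a helper like unique_sources could be run over whatever documents the retriever returns to list the files behind an answer.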