This application converts PDF documents into a knowledge graph stored in Neo4j. It uses a large language model (LLM), either OpenAI GPT or Diffbot, to extract nodes, relationships, and properties from the text of the PDF, then organizes them into a structured knowledge graph using the LangChain framework. Files can be uploaded from the local machine or from an S3 bucket, and the user can choose which LLM builds the knowledge graph.
Run Docker Compose to build and start all components:
docker-compose up --build
Alternatively, you can run specific directories separately:
For the frontend:
cd frontend
yarn
yarn run dev
For the backend:
cd backend
python -m venv envName
source envName/bin/activate
pip install -r requirements.txt
uvicorn score:app --reload
To deploy the app and its packages on Google Cloud Platform, run the following commands for Google Cloud Run and answer the prompts as shown:
# Frontend deploy
gcloud run deploy
Source location: current directory > Frontend
Region: 32 [us-central1]
Allow unauthenticated requests: Yes
# Backend deploy
gcloud run deploy --set-env-vars "OPENAI_API_KEY=" --set-env-vars "DIFFBOT_API_KEY=" --set-env-vars "NEO4J_URI=" --set-env-vars "NEO4J_PASSWORD=" --set-env-vars "NEO4J_USERNAME="
Source location: current directory > Backend
Region: 32 [us-central1]
Allow unauthenticated requests: Yes
- PDF Upload: Users can upload PDF documents using the Drop Zone.
- S3 Bucket Integration: Users can also specify PDF documents stored in an S3 bucket for processing.
- Knowledge Graph Generation: The application uses an OpenAI or Diffbot LLM to extract relevant information from the PDFs and construct a knowledge graph.
- Neo4j Integration: The extracted nodes and relationships are stored in a Neo4j database for easy visualization and querying.
- Grid View: Source node files are listed with Name, Type, Size, Nodes, Relations, Duration, Status, Source, and Model.
Create a .env file and set the following environment variables:
OPENAI_API_KEY = ""
DIFFBOT_API_KEY = ""
NEO4J_URI = ""
NEO4J_USERNAME = ""
NEO4J_PASSWORD = ""
AWS_ACCESS_KEY_ID = ""
AWS_SECRET_ACCESS_KEY = ""
EMBEDDING_MODEL = ""
IS_EMBEDDING = "TRUE"
KNN_MIN_SCORE = ""
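
The variables above could be read on the backend roughly as follows. This is a minimal sketch, not the project's actual loader: the `Settings` class, the `load_settings` function, the choice of which keys are required, and the parsing of `KNN_MIN_SCORE` as a float are all assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumption: only the Neo4j connection values are treated as mandatory here.
REQUIRED_KEYS = ("NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD")


@dataclass
class Settings:
    """Hypothetical container for the values defined in the .env file."""
    neo4j_uri: str
    neo4j_username: str
    neo4j_password: str
    openai_api_key: str
    diffbot_api_key: str
    embedding_model: str
    is_embedding: bool
    knn_min_score: Optional[float]


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (defaults to the process environment)."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise ValueError("Missing required variables: " + ", ".join(missing))

    raw_score = env.get("KNN_MIN_SCORE", "")
    return Settings(
        neo4j_uri=env["NEO4J_URI"],
        neo4j_username=env["NEO4J_USERNAME"],
        neo4j_password=env["NEO4J_PASSWORD"],
        openai_api_key=env.get("OPENAI_API_KEY", ""),
        diffbot_api_key=env.get("DIFFBOT_API_KEY", ""),
        embedding_model=env.get("EMBEDDING_MODEL", ""),
        # "TRUE" in the .env file is compared case-insensitively.
        is_embedding=env.get("IS_EMBEDDING", "").strip().upper() == "TRUE",
        # An empty value means no minimum score was configured.
        knn_min_score=float(raw_score) if raw_score else None,
    )
```

Passing a plain dictionary instead of relying on `os.environ` keeps the function easy to exercise without touching the real environment.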
Extracts nodes, relationships, and properties from a PDF file using LLMs.
Args:
uri: URI of the graph to extract
userName: Username to use for graph creation (if None, the username from the config file is used)
password: Password to use for graph creation (if None, the password from the config file is used)
file: File object containing the PDF file path to be used
model: Type of model to use ('OpenAI GPT 3.5' or 'OpenAI GPT 4')
Returns:
JSON response to the API with fileName, nodeCount, relationshipCount, processingTime,
status, and model as attributes.
![neoooo](https://private-user-images.githubusercontent.com/118245454/309273153-01e731df-b565-4f4f-b577-c47e39dd1748.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjI2MDI2NDYsIm5iZiI6MTcyMjYwMjM0NiwicGF0aCI6Ii8xMTgyNDU0NTQvMzA5MjczMTUzLTAxZTczMWRmLWI1NjUtNGY0Zi1iNTc3LWM0N2UzOWRkMTc0OC5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwODAyJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDgwMlQxMjM5MDZaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT02YjZmNDk0YTYzMTkyMDgzMmQzYzhjZmJjNmJhN2E0M2JkMzY4NzFjZTEyNmQ1MmNmYmE0ZmIwY2VjYTQwMWMxJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.Mx4l2SMttpoaz4ptGWvFqbfTG2Qu2y6Kay7-Bb8-CQM)
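
To make the return value of the extraction call concrete, here is a small sketch of a response carrying the attributes listed above. The `ExtractResponse` class, the `build_extract_response` helper, the `"Completed"` status string, and the use of seconds for `processingTime` are assumptions for illustration, not the backend's actual code.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ExtractResponse:
    """Hypothetical shape of the JSON returned after extraction."""
    fileName: str
    nodeCount: int
    relationshipCount: int
    processingTime: float  # assumed to be seconds
    status: str
    model: str


def build_extract_response(
    file_name: str,
    node_count: int,
    relationship_count: int,
    started_at: float,
    model: str,
    finished_at: Optional[float] = None,
) -> str:
    """Serialize the extraction result to a JSON string."""
    finished_at = time.time() if finished_at is None else finished_at
    response = ExtractResponse(
        fileName=file_name,
        nodeCount=node_count,
        relationshipCount=relationship_count,
        processingTime=round(finished_at - started_at, 2),
        status="Completed",  # assumed status value
        model=model,
    )
    return json.dumps(asdict(response))
```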
Creates a source node in Neo4jGraph and sets its properties.
Args:
uri: URI of the Graph Service to connect to
userName: Username to connect to the Graph Service with (default: None)
password: Password to connect to the Graph Service with (default: None)
file: File object with information about the file to be added
Returns:
Success or failure message of node creation
![neo_workspace](https://private-user-images.githubusercontent.com/118245454/309272338-f2eb11cd-718c-453e-bec9-11410ec6e45d.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjI2MDI2NDYsIm5iZiI6MTcyMjYwMjM0NiwicGF0aCI6Ii8xMTgyNDU0NTQvMzA5MjcyMzM4LWYyZWIxMWNkLTcxOGMtNDUzZS1iZWM5LTExNDEwZWM2ZTQ1ZC5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwODAyJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDgwMlQxMjM5MDZaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT01ZDU1Zjg2NmJlYjE4MTU2ZmFjMGJkNzMyZThkOTVmZjkwNDBiZGEzYzJhNmQ1NDQ3Y2I1MWU2OWJiNTZiYTYxJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.8jyBL0iSvp0RIEG-vHQjFTxMTB6A8kW4RNhqFdDvJiU)
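
One way such a source node could be written is with a parameterized Cypher `MERGE`. The sketch below only builds the query and its parameters; the `Document` label, the property names, the `"New"` status value, and the `build_source_node_query` helper are all hypothetical and may differ from what the application really stores.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional, Tuple


def build_source_node_query(
    file_name: str,
    file_size: int,
    file_type: str,
    created_at: Optional[datetime] = None,
) -> Tuple[str, Dict[str, Any]]:
    """Return a Cypher statement and parameters for one source node."""
    created_at = created_at or datetime.now(timezone.utc)
    # MERGE keeps the operation idempotent: uploading the same file name
    # twice updates the existing node instead of creating a duplicate.
    query = (
        "MERGE (d:Document {fileName: $fileName}) "
        "SET d.fileSize = $fileSize, "
        "d.fileType = $fileType, "
        "d.status = $status, "
        "d.createdAt = $createdAt"
    )
    params = {
        "fileName": file_name,
        "fileSize": file_size,
        "fileType": file_type,
        "status": "New",  # assumed initial status
        "createdAt": created_at.isoformat(),
    }
    return query, params
```

The returned pair can then be handed to whichever Neo4j client the backend uses; passing values as parameters rather than formatting them into the query string avoids Cypher injection through file names.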
Returns a list of file sources in the database by querying the graph and
sorting the list by the last updated date.
![get_source](https://private-user-images.githubusercontent.com/118245454/309273465-1d8c7a86-6f10-4916-a4c1-8fdd9f312bcc.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjI2MDI2NDYsIm5iZiI6MTcyMjYwMjM0NiwicGF0aCI6Ii8xMTgyNDU0NTQvMzA5MjczNDY1LTFkOGM3YTg2LTZmMTAtNDkxNi1hNGMxLThmZGQ5ZjMxMmJjYy5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwODAyJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDgwMlQxMjM5MDZaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1mMzcyMWIzZjM0NmNkZGQwMDZkN2IyYTBlZTA5ODU5ZGQ1MTJmYWEzNjE1NDYwZDEzNDFhNGU5NjVkZjIzMGQ2JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.gIE-pluVlRpjhXLkUGS76xyUQibZ_yobLkkZDX7_RJM)
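
The sorting step described above can be sketched in a few lines. The `updatedAt` key, its ISO 8601 format, and the newest-first order are assumptions; the source only states that the list is sorted by the last updated date.

```python
from datetime import datetime
from typing import Dict, List


def sort_sources_by_updated(sources: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return a new list of sources ordered from most to least recently updated."""
    return sorted(
        sources,
        key=lambda source: datetime.fromisoformat(source["updatedAt"]),
        reverse=True,  # assumed: newest first
    )


sources = [
    {"fileName": "a.pdf", "updatedAt": "2024-02-01T09:00:00"},
    {"fileName": "b.pdf", "updatedAt": "2024-03-01T10:00:00"},
    {"fileName": "c.pdf", "updatedAt": "2024-01-15T08:30:00"},
]
ordered_names = [s["fileName"] for s in sort_sources_by_updated(sources)]
# ordered_names is ["b.pdf", "a.pdf", "c.pdf"]
```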
![chunking](https://private-user-images.githubusercontent.com/118245454/309275406-4d61479c-e5e9-415e-954e-3edf6a773e72.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjI2MDI2NDYsIm5iZiI6MTcyMjYwMjM0NiwicGF0aCI6Ii8xMTgyNDU0NTQvMzA5Mjc1NDA2LTRkNjE0NzljLWU1ZTktNDE1ZS05NTRlLTNlZGY2YTc3M2U3Mi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwODAyJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDgwMlQxMjM5MDZaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT0xODA5ZGQ0MTM0ZWFkMDNmMmZlMGQ3NDk0MWEyYTI0NGJkYmQ3NmVkNTUyMjYwZTkzY2Q0Y2EzYTAxZTM5MjVlJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.tqAzA2X7fW1zJmpTIOY85rfYUBwHHYiTQ1iZ5W5kHuI)
The public Google Cloud Run URL: Workspace URL