Quick guide to get started on analyzing documents using open source technologies
This project deals with open-source implementations of Microsoft Azure Cognitive services for extracting meaningful insights and keyword search service in a pool of documents.
- Documents are stored in a Azure container as blob items.
- Each document carries a unique Azure-Blob URL.
- Documents could be of following data types: pdf, png, jpg, jpeg, tiff, heic, txt, md.
- Components for this project are: - Azure Search services - Azure Cognitive Computer Vision services - OpenAI API endpoints
Give a input query, search serivce locates the top ranked documents, which are then sent to OCR services for extraction and lastly OpenAI uses Completion services to extract insights.
Make a get call to get search results.
url = /search/<<QUERY>>
r = requests.get(url)
OUTPUT:
[
{'metadata':....}
]
Make a get call to get the version info and check if the server is running...
url = /docai/version
r = requests.get(url)
OUTPUT:
{
openai_model: da-vinci-001,
openai_version: gpt-3,
read_api_version: v3.2
}
Make a post call to run OCR on a storage-azure-blob document, passing it's Blob-URL.
url = /docai/run_ocr
payload = {
file_path httpsdiscoveraiwork5625301287.blob.core.windows.netdocumentsThe%20Prospect%20of%20a%20Continued%20Correction.pdf
}
r = requests.post(url, json=payload)
ocr_response_json = r.json() # upload this in azure-blob as a JSON object. Need to pass this to next api!
OUTPUT:
{
title: DOC_NAME,
lines: INFO,
...
corpus: CORPUS
}
Make a post call to run doc-ai on a document, need to send document's OCR-RESPONSE-JSON Object (saved in #2).
# Option 1 ---- When user types a input query -----
url = /docai/extract_insights
payload = {
document ocr_response_json,
query Generate summary for this in 100 words!
}
r = requests.post(url, json=payload)
OUTPUT:
{ 'output' : 'The summary is.....'}
# Option 2 ----- When user clicks on a BUTTON -----
'default_insight': <<could be one of from the list>>
default-summary,
default-faq,
default-keypoints,
default-keywords,
default-phrases,
default-sentiment,
default-abbrv
...
url = /docai/extract_insights
payload = {
document ocr_response_json,
query ,
default_insight default-summary
}
r = requests.post(url, json=payload)
OUTPUT:
{ 'output' : 'The summary is.....'}