I developed this project in Python. It takes a patient-related document, converts it from PDF to JPG format, and extracts the meaningful information from it.
The front end supplies the patient's medical document, and the back end is a Python program. The front end calls an API that asks the Python program to extract fields from the medical document, and the program returns the extracted fields in its response.
We built the software, extract pro, which handles two types of documents: a patient's prescription and a patient's details. It extracts the meaningful information, i.e. patient address, medicines, directions, and refill.
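As a rough illustration of the response the back end could return for the two document types, here is a minimal sketch. The names ExtractionResponse, build_response, and the type keys "prescription" and "patient_details" are assumptions made for this example, not the project's actual API.

```python
import json
from dataclasses import dataclass, field, asdict

# The two document types named in the text; the string keys are assumed.
DOCUMENT_TYPES = ("prescription", "patient_details")


@dataclass
class ExtractionResponse:
    """Hypothetical response shape: the document type plus extracted fields."""
    document_type: str
    fields: dict = field(default_factory=dict)

    def to_json(self) -> str:
        # Serialize to the JSON string an API endpoint could send back.
        return json.dumps(asdict(self), sort_keys=True)


def build_response(document_type: str, fields: dict) -> ExtractionResponse:
    """Reject unknown document types, then wrap the extracted fields."""
    if document_type not in DOCUMENT_TYPES:
        raise ValueError(f"unsupported document type: {document_type!r}")
    return ExtractionResponse(document_type=document_type, fields=dict(fields))
```

A call such as build_response("prescription", {"refill": "2"}).to_json() would produce the JSON body for a prescription request.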
Step 1: Convert the PDF file into an image file using the Python module pdf2image.
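Step 1 might look like the sketch below. It uses pdf2image's convert_from_path, which returns one PIL image per page and needs the poppler utilities installed. The helper names, the 300 dpi default, and the output file naming are assumptions for illustration.

```python
from pathlib import Path


def page_image_path(pdf_path, out_dir, page_number):
    """Build the output path for one page, e.g. out/scan_page1.jpg."""
    stem = Path(pdf_path).stem
    return Path(out_dir) / f"{stem}_page{page_number}.jpg"


def pdf_to_jpgs(pdf_path, out_dir, dpi=300):
    """Convert every page of a PDF to a JPG file and return the saved paths."""
    # Imported here so the naming helper works even without pdf2image installed.
    from pdf2image import convert_from_path

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    pages = convert_from_path(str(pdf_path), dpi=dpi)  # one PIL image per page
    saved = []
    for number, page in enumerate(pages, start=1):
        target = page_image_path(pdf_path, out_dir, number)
        # JPEG has no alpha channel, so force RGB before saving.
        page.convert("RGB").save(target, "JPEG")
        saved.append(target)
    return saved
```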
Step 2: Extract text from the image file using the Python module pytesseract, a wrapper around Google's Tesseract OCR engine.
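A minimal sketch of Step 2 follows. pytesseract.image_to_string is the library's standard call and requires the Tesseract binary to be installed. The function names and the whitespace clean-up helper are assumptions added for this example and are not described in the project itself.

```python
def normalize_ocr_text(raw):
    """Strip surrounding whitespace from each line and drop blank lines."""
    lines = (line.strip() for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


def image_to_text(image_path, lang="eng"):
    """Run Tesseract OCR on one image file and return the cleaned text."""
    # Imported here so the clean-up helper works without the OCR stack installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        raw = pytesseract.image_to_string(image, lang=lang)
    return normalize_ocr_text(raw)
```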
Text extraction from the raw image has one problem: text in the dark areas of the image is not extracted. So, after Step 1, we convert the original image into a preprocessed (pixelated) image using the OpenCV framework, which is available in Python as the cv2 module.
Then we apply Step 2 again.
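The text does not say which OpenCV operation is applied before rerunning OCR. One plausible reading is adaptive thresholding, which binarizes each region relative to its local brightness so that text in dark areas becomes readable. The sketch below assumes that technique; the function names, block size, and offset values are illustrative guesses rather than the project's settings.

```python
def odd_block_size(size):
    """Return a valid neighbourhood size: odd and at least 3."""
    size = max(3, int(size))
    return size if size % 2 == 1 else size + 1


def preprocess_for_ocr(image_path, block_size=61, offset=11):
    """Return a black-and-white version of the image for OCR."""
    # Imported here so the helper above works without OpenCV installed.
    import cv2

    image = cv2.imread(str(image_path))
    if image is None:  # cv2.imread returns None instead of raising
        raise FileNotFoundError(f"could not read image: {image_path}")
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Threshold each pixel against a Gaussian-weighted local mean.
    return cv2.adaptiveThreshold(
        gray,
        255,
        cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY,
        odd_block_size(block_size),
        offset,
    )
```

The result can be written to disk with cv2.imwrite and then passed through Step 2 like any other image file.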
Step 3: After extracting the text from the image, convert it into structured data, from which individual fields such as patient name, patient address, and medicines can be extracted using regular expressions (regex).
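Step 3 can be sketched with Python's built-in re module. The labels assumed here ("Name:", "Address:", "Medicines:", "Directions:", "Refill:") are hypothetical; real documents will need patterns matched to their actual layout.

```python
import re

# Hypothetical labels, one field per line; adjust to the real document layout.
FIELD_PATTERNS = {
    "patient_name": re.compile(r"^Name:[ \t]*(.+)$", re.MULTILINE),
    "patient_address": re.compile(r"^Address:[ \t]*(.+)$", re.MULTILINE),
    "medicines": re.compile(r"^Medicines:[ \t]*(.+)$", re.MULTILINE),
    "directions": re.compile(r"^Directions:[ \t]*(.+)$", re.MULTILINE),
    "refill": re.compile(r"^Refill:[ \t]*(\d+)", re.MULTILINE),
}


def parse_fields(text):
    """Return a dict of extracted fields; missing fields map to None."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1).strip() if match else None
    return fields
```

Returning None for a missing field, rather than raising an error, lets the caller decide how to handle an incomplete scan.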