mayurcybercz/AI-Exam-evaluation

A CLI tool that recognises handwritten text from answer sheets using Tesseract OCR.
The extracted text is then used to evaluate marks using NLP.

Installation:
Install the Tesseract OCR engine: https://github.com/tesseract-ocr/tesseract/wiki
Install the Python dependencies: pytesseract, pillow, pandas, numpy, matplotlib, and lxml (imported by the HOCR conversion code)

Usage:
1) Clone the repository into your working directory.
2) Update the path to the tesseract executable in main.py.
3) Add the image you want to test to the images folder.
4) Run main.py imagename
It produces an HOCR file, which is very similar to XHTML.
5) Run file_conversion.py hocrfilename
It converts the HOCR file into a dataframe and stores the output in a pickle file and a JSON file.
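
To make the HOCR format in steps 4 and 5 more concrete, here is a minimal sketch that parses a small hand-written HOCR fragment with the standard library. The fragment, the function name parse_words, and the bounding-box values are illustrative assumptions. Real Tesseract output has more nesting and is not reproduced here. Each recognised word sits in a span of class ocrx_word, and its confidence is stored as x_wconf inside the title attribute.

```python
import xml.etree.ElementTree as ET

# Hand-written HOCR fragment for illustration only (an assumption, not real
# Tesseract output, which has more nesting and an XHTML namespace).
SAMPLE_HOCR = """<html>
  <body>
    <span class="ocr_line" title="bbox 250 180 2800 400">
      <span class="ocrx_word" title="bbox 250 190 700 380; x_wconf 93">Conductor</span>
      <span class="ocrx_word" title="bbox 720 190 1100 380; x_wconf 70">magnetic</span>
    </span>
  </body>
</html>"""


def parse_words(hocr_text):
    """Return (word, confidence) pairs from every ocrx_word span."""
    root = ET.fromstring(hocr_text)
    pairs = []
    for elem in root.iter():
        if elem.get("class") != "ocrx_word":
            continue
        # The title attribute holds semicolon-separated properties.
        for prop in elem.get("title", "").split(";"):
            parts = prop.split()
            if parts and parts[0] == "x_wconf":
                pairs.append((elem.text, int(parts[1])))
    return pairs


print(parse_words(SAMPLE_HOCR))
```

This is the same word/confidence pairing that the conversion script later stores in a dataframe.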

Phase 1: Demonstration of OCR on handwritten text and exporting the result to JSON
(Rendered Python notebook, converted to markdown using nbconvert)

Phase 2: Using nltk to create an NLP model to evaluate answers

Download all the packages using the nltk downloader

import nltk
nltk.download()
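
The repository text does not describe the Phase 2 model itself, so the following is only a rough sketch of one possible starting point: awarding marks in proportion to how many reference keywords appear in the OCR text. The functions tokenize and keyword_score, the keyword list, and the maximum of 2 marks are all hypothetical. It uses only the standard library rather than nltk, and it is not the author's evaluation method.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_score(student_answer, reference_keywords, max_marks):
    """Award marks in proportion to the reference keywords found (hypothetical scheme)."""
    keywords = {k.lower() for k in reference_keywords}
    if not keywords:
        return 0.0
    found = keywords & tokenize(student_answer)
    return max_marks * len(found) / len(keywords)


# OCR text taken from the notebook output; the keyword list is an assumption.
ocr_text = "Conductor wn magnetic Field Produce voltage :"
keywords = ["conductor", "magnetic", "field", "voltage"]
print(keyword_score(ocr_text, keywords, max_marks=2))
```

A real model would likely need stemming, synonym handling, and tolerance for OCR misreads, which is where nltk could come in.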

from pytesseract import pytesseract
import sys
import os

# Edit the path to the tesseract executable if your installation directory is different

pytesseract.tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract'

from datetime import datetime

def replaceMultiple(mainString, toBeReplaces, newString):
    # Replace every substring listed in toBeReplaces with newString
    for elem in toBeReplaces:
        if elem in mainString:
            mainString = mainString.replace(elem, newString)
    return mainString

mainStr=str(datetime.now())
file_name = replaceMultiple(mainStr, [':', '-', '.',' '] , "")

def generateFilename():
	mainStr=str(datetime.now())
	file_name = replaceMultiple(mainStr, [':', '-', '.',' '] , "")
	return file_name
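
As an alternative sketch, the same digits-only timestamp can be produced directly with strftime. The function name generate_filename_strftime and its optional now parameter are assumptions added for illustration. One behavioural difference: str(datetime.now()) omits the fractional part when the microsecond value is exactly 0, whereas %f always emits six digits, so this variant always returns 20 characters.

```python
from datetime import datetime


def generate_filename_strftime(now=None):
    """Return a timestamp such as 20181021042317089205 (date, time, microseconds)."""
    if now is None:
        now = datetime.now()
    # %f is the microsecond field, zero-padded to six digits.
    return now.strftime("%Y%m%d%H%M%S%f")


print(generate_filename_strftime(datetime(2018, 10, 21, 4, 23, 17, 89205)))
```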

from PIL import Image
from IPython.display import display
import matplotlib.pyplot as plt

im = Image.open("testfile1.jpg")
fig, ax = plt.subplots()
ax.imshow(im)
print("(width,height):"+str(im.size))

(width,height):(3000, 3115)

box=(250,180,2800,400)
cropped_image = im.crop(box)
display(cropped_image)
cropped_text= pytesseract.image_to_string(cropped_image, lang = 'eng')
print(cropped_text)

Conductor wn magnetic Field Produce voltage :
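
The box passed to crop is a (left, upper, right, lower) tuple in pixel coordinates. The sketch below checks such a box against the image size before cropping. The helper names clamp_box and box_size are hypothetical and are not part of Pillow or this repository.

```python
def clamp_box(box, image_size):
    """Clip a (left, upper, right, lower) box to the image bounds."""
    left, upper, right, lower = box
    width, height = image_size
    left, right = max(0, left), min(width, right)
    upper, lower = max(0, upper), min(height, lower)
    if left >= right or upper >= lower:
        raise ValueError("box does not overlap the image")
    return (left, upper, right, lower)


def box_size(box):
    """Return the (width, height) of the cropped region."""
    left, upper, right, lower = box
    return (right - left, lower - upper)


# Values from the notebook: a 3000 x 3115 image and the answer-line box.
box = clamp_box((250, 180, 2800, 400), (3000, 3115))
print(box, box_size(box))
```

For the notebook's values the box already lies inside the image, and the cropped strip is 2550 pixels wide and 220 pixels high.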

def createHOCR(imagepath):
	filename= generateFilename()
	pytesseract.run_tesseract(imagepath, filename, lang=None,extension='html', config="hocr")
	print("HOCR file generated: "+str(filename)+".hocr")

createHOCR("testfile.jpg")

HOCR file generated: 20181021042317089205.hocr

from lxml import etree
import pandas as pd
import os
import sys
import generate_filename as gf

def hocr_to_dataframe(fp):

    doc = etree.parse(fp)
    words = []
    wordConf = []

    for path in doc.xpath('//*'):
        if 'ocrx_word' in path.values():
            conf = [x for x in path.values() if 'x_wconf' in x][0]
            wordConf.append(int(conf.split('x_wconf ')[1]))
            words.append(path.text)

    dfReturn = pd.DataFrame({'word' : words,
                             'confidence' : wordConf})

    return(dfReturn)

filename=generateFilename()
dataframe=hocr_to_dataframe("20181021041156998790.hocr")

dataframe

	word	confidence
0		95
1		95
2	Q1.	89
3	Define	96
4	electromagnetic	96
5	induction.	95
6	Sane	23
7	|	90
8	Conductor	93
9	mM	42
10	magnetic	70
11	Field	63
12	produce	67
13	voltage	65
14	‘Seconaewctntmnstnn	0
15	esionainsnenaneenrenncconanniiti	0
16	Q2.	89
17	What	96
18	are	96
19	3	96
20	examples	96
21	of	95
22	transparent	95
23	objects?	96
24	(Professor	96
25	provides	96
26	5	96
27	as	95
28	input)	90
29		95
30	Q3.	92
31	Complete	96
32	the	96
33	network	95
34	tree.	96
35		95

dataframe.to_json(filename+".json",orient='columns')
print("JSON generated: "+filename+".JSON")
dataframe.to_pickle(filename+".pkl")
print("Pickle generated: "+filename+".pkl")

JSON generated: 20181021042319190731.JSON
Pickle generated: 20181021042319190731.pkl
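
With orient='columns', to_json writes one object per column, keyed by the row index as a string. The sketch below reads JSON of that shape with the standard library and keeps only the words above a confidence threshold. The sample string is a shortened hand-written stand-in for the generated file, and both the function confident_words and the threshold of 60 are assumptions for illustration.

```python
import json

# Shape written by DataFrame.to_json(orient='columns'):
# {column name: {row index as string: value}}
# Shortened hand-written sample using three rows from the table above.
SAMPLE_JSON = (
    '{"word": {"0": "Define", "1": "Sane", "2": "Conductor"},'
    ' "confidence": {"0": 96, "1": 23, "2": 93}}'
)


def confident_words(json_text, min_confidence=60):
    """Return non-empty words whose confidence is at least min_confidence."""
    data = json.loads(json_text)
    words, confs = data["word"], data["confidence"]
    kept = []
    for idx in sorted(words, key=int):  # restore the numeric row order
        word = (words[idx] or "").strip()
        if word and confs[idx] >= min_confidence:
            kept.append(word)
    return kept


print(confident_words(SAMPLE_JSON))
```

Filtering like this would drop misreads such as "Sane" (confidence 23) before any scoring step.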
