odia-dictionary

A repository for organizing contributions to the creation of an Odia Dictionary dataset for the Dictionary Augmented Translations project in C4GT'23.

Demo notebook

To get a better idea of the code flow, refer to the demo Colab notebook.

Description

This repository builds a parser that reads the Odia.Dictionary.pdf file and parses its definitions into a dataset. The Issues tab lists the parts of the current solution that still need to be refined, or that should be added in the future.

Dependencies / setup

pdf2image setup
Tesseract API setup
PyPDF2 PDF Reader setup
OpenAI API setup
FTFY (fixes Unicode) - run pip install ftfy
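
A quick way to confirm the setup is to check that the Python packages can be found and that the external binaries are on the PATH. The sketch below is not part of the repository. It assumes the packages are imported under the names pdf2image, PyPDF2, openai and ftfy, and that Tesseract and Poppler (which pdf2image relies on) are exposed as the tesseract and pdftoppm executables.

```python
import importlib.util
import shutil

# Assumed import names for the dependencies listed above.
REQUIRED_MODULES = ["pdf2image", "PyPDF2", "openai", "ftfy"]
# Assumed executable names for Tesseract OCR and Poppler.
REQUIRED_BINARIES = ["tesseract", "pdftoppm"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def missing_binaries(names):
    """Return the executables that are not available on the PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    for name in missing_modules(REQUIRED_MODULES):
        print(f"missing Python package: {name}")
    for name in missing_binaries(REQUIRED_BINARIES):
        print(f"missing executable: {name}")
```

If the script prints nothing, all of the assumed packages and executables were found.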

How to run

To parse the Odia Dictionary pdf:

python src/a-getting_page_images/pdf_to_imgs.py - Converts each page of the Odia Dictionary PDF to a 300 DPI image stored in pages.
Open the PNG files generated in ./pages with the Paint application and blank out the unwanted letter-section separator by selecting the relevant portion and pressing the Delete key.
python src/b-cropping_page_images/cropper.py - Crops out the columns of interest from pages 6-87. Outputs stored in pages_processed
python src/c-images_to_pdfs_with_text/pdfmaker.py - Runs Tesseract OCR on the images in pages_processed. Outputs PDFs to parsed_pdfs
python src/d-read_pdfs_with_text/reader.py - Gets unstructured OCR text output from PDFs in parsed_pdfs. Outputs .txt files to parsed_texts
rm GPT_outputs/* - Empties the GPT_outputs folder. This is required because sender.py does not call the API for text files that are already present in the folder.
python src/e-gpt_api_sender/sender.py - Calls the GPT API to structure the raw OCR text output files in parsed_texts. Outputs .txt files to GPT_outputs
python src/f-dataframe_maker/preprocess.py - Trims every .txt file in GPT_outputs so that it begins at the first occurrence of "|", which makes each file start with its CSV-style column header.
python src/f-dataframe_maker/maker.py - Compiles GPT_outputs to the desired .csv - parsed_dicts/parsed_dict_very_unclean.csv
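
The last two steps above can be sketched as follows. This is a minimal illustration and not the code in preprocess.py or maker.py: it assumes each GPT output is a pipe-delimited table, possibly preceded by free text and containing a dashed separator row. The function names and the sample column names are hypothetical.

```python
import csv
import io


def trim_to_table_start(text):
    """Drop everything before the first "|" so the text starts at the header."""
    start = text.find("|")
    return text[start:] if start != -1 else ""


def parse_pipe_table(text):
    """Split pipe-delimited lines into stripped cells, skipping separator rows."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # ignore anything that is not a table row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip separator rows such as |---|---| (assumed markdown-style table).
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    return rows


def rows_to_csv(rows):
    """Serialize the parsed rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical GPT output: chatter, then a table with made-up column names.
    sample = "Here is the table:\n| English | Odia |\n|---|---|\n| water | ପାଣି |\n"
    print(rows_to_csv(parse_pipe_table(trim_to_table_start(sample))))
```

Trimming first means that any chatter the model emits before the table never reaches the CSV, and the header row is always the first parsed row.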

Structure

Folders

GPT_outputs - stores parsed & formatted text tables
pages - stores images for every page of the pdf
pages_processed - stores cleaned and cropped column images for pages 6-88
parsed_dicts - stores the final output csv
parsed_pdfs - stores the intermediate pdfs generated using Tesseract OCR for each processed image
parsed_texts - stores the intermediate texts generated from the intermediate pdfs
src - stores the scripts necessary for parsing the dictionary.
testgpt - (IN TESTING) for prompt engineering tests
mergegpt - (IN TESTING) better parsing by merging the OCR outputs generated with two different Tesseract page segmentation modes (PSM), 3 and 6.
new_dict - (IN TESTING) for parsing en-or.pdf

Files

.env - stores the OPENAPI_SECRET_KEY variable
Odia.Dictionary - the pdf to be parsed - contains ~ 6,000 translations
en-or.pdf - the new pdf to be parsed - contains ~ 14,000 translations
run1.sh - converts the pdf to images
run2.sh - executes the rest of the scripts for parsing to a basic dictionary
sample.env - a reference for how the .env file should look (without the key)
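
For reference, a .env file of this kind is a list of KEY=VALUE lines. The sketch below shows one way such a file could be read with the standard library only. It is an assumption for illustration: the repository's scripts may load the key differently (for example through a dotenv package), and read_env_file is a hypothetical helper.

```python
from pathlib import Path


def read_env_file(path):
    """Parse KEY=VALUE lines from a .env-style file into a dict."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Remove surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values
```

With a .env file in place, read_env_file(".env").get("OPENAPI_SECRET_KEY") would return the stored key, or None if the variable is missing.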
