A repository for organizing contributions to the creation of an Odia Dictionary dataset for the Dictionary Augmented Translations project in C4GT'23.
To get a better idea of the code flow, refer to the demo Colab notebook.
This repository builds a parser that reads the Odia.Dictionary.pdf file and parses its definitions into a dataset. The Issues tab lists aspects of the current solution that need to be refined, or added in the future.
- pdf2image setup
- Tesseract API setup
- PyPDF2 PDF Reader setup
- OpenAI API setup
- FTFY (fixes Unicode): run pip install ftfy
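A quick way to confirm the setup items above are in place is sketched below. The import names (pdf2image, pytesseract, PyPDF2, openai, ftfy) and the binaries (pdftoppm from Poppler, which pdf2image relies on, and tesseract) are assumptions about how the scripts in src use these tools, so adjust the lists if the scripts import something else.

```python
import importlib.util
import shutil

# Assumed import names for the setup items above; adjust to match the scripts in src.
MODULES = ["pdf2image", "pytesseract", "PyPDF2", "openai", "ftfy"]
# Assumed external binaries: Poppler's pdftoppm (used by pdf2image) and tesseract.
BINARIES = ["pdftoppm", "tesseract"]


def missing_dependencies(modules=MODULES, binaries=BINARIES):
    """Return the names of modules and binaries that cannot be found."""
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    missing += [b for b in binaries if shutil.which(b) is None]
    return missing


if __name__ == "__main__":
    missing = missing_dependencies()
    print("All set" if not missing else "Missing: " + ", ".join(missing))
```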
To parse the Odia Dictionary PDF, run the following steps in order:
python src/a-getting_page_images/pdf_to_imgs.py
- Converts each page of the Odia Dictionary PDF to a 300 DPI image stored in pages.
- Open the PNG files generated in ./pages with the Paint application and blank out the unwanted letter-section separator by selecting the relevant portion and pressing the Delete key.
python src/b-cropping_page_images/cropper.py
- Crops out the columns of interest from pages 6-87. Outputs are stored in pages_processed.
python src/c-images_to_pdfs_with_text/pdfmaker.py
- Runs Tesseract OCR on the images in pages_processed. Outputs PDFs to parsed_pdfs.
python src/d-read_pdfs_with_text/reader.py
- Gets the unstructured OCR text output from the PDFs in parsed_pdfs. Outputs .txt files to parsed_texts.
rm GPT_outputs/*
- The GPT_outputs folder must be emptied first, because the API is not called again for text files already present in that folder (refer to sender.py).
python src/e-gpt_api_sender/sender.py
- Calls the GPT API to structure the raw OCR text files in parsed_texts. Outputs .txt files to GPT_outputs.
python src/f-dataframe_maker/preprocess.py
- Moves the file pointer of every .txt file in GPT_outputs to the first occurrence of "|", so that every .txt file in the GPT_outputs folder starts with a CSV-style column header.
python src/f-dataframe_maker/maker.py
- Compiles GPT_outputs into the desired .csv: parsed_dicts/parsed_dict_very_unclean.csv
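The last three steps can be illustrated with a minimal, standard-library-only sketch. It is not the code in sender.py, preprocess.py or maker.py; the function names are hypothetical, and it assumes the GPT outputs are pipe-delimited tables whose first row is the column header.

```python
import csv
from pathlib import Path


def pending_inputs(text_dir, out_dir):
    """Return input .txt files that have no same-named file in out_dir yet."""
    out_dir = Path(out_dir)
    return [p for p in sorted(Path(text_dir).glob("*.txt"))
            if not (out_dir / p.name).exists()]


def trim_to_header(text):
    """Drop everything before the first '|' so the text starts at the table header."""
    start = text.find("|")
    return text[start:] if start != -1 else ""


def parse_table(text):
    """Split a pipe-delimited table into rows of stripped cells."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if "|" not in line:
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        # Skip markdown-style separator rows such as |---|---|
        if all(set(c) <= set("-: ") for c in cells):
            continue
        rows.append(cells)
    return rows


def compile_csv(out_dir, csv_path):
    """Write one header row, then the data rows of every .txt file in out_dir."""
    header, data = None, []
    for path in sorted(Path(out_dir).glob("*.txt")):
        rows = parse_table(trim_to_header(path.read_text(encoding="utf-8")))
        if not rows:
            continue
        if header is None:
            header = rows[0]  # assumed: the first row of each file is the header
        data.extend(rows[1:])
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if header:
            writer.writerow(header)
        writer.writerows(data)
    return len(data)
```

Here pending_inputs("parsed_texts", "GPT_outputs") shows why the folder is emptied before a fresh run: any file that already has an output is skipped, so stale outputs would otherwise survive.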
Folders
- GPT_outputs - stores parsed & formatted text tables
- pages - stores images for every page of the pdf
- pages_processed - stores cleaned and cropped column images for pages 6-88
- parsed_dicts - stores the final output csv
- parsed_pdfs - stores the intermediate pdfs generated using Tesseract OCR for each processed image
- parsed_texts - stores the intermediate texts generated from the intermediate pdfs
- src - stores the scripts necessary for parsing the dictionary.
- testgpt - (IN TESTING) for prompt engineering tests
- mergegpt - (IN TESTING) better parsing by merging OCR outputs generated by two different PSM modes, 3 and 6.
- new_dict - (IN TESTING) for parsing en-or.pdf
Files
- .env - stores the OPENAPI_SECRET_KEY variable
- Odia.Dictionary - the pdf to be parsed - contains ~ 6,000 translations
- en-or.pdf - the new pdf to be parsed - contains ~ 14,000 translations
- run1.sh - converts the pdf to images
- run2.sh - executes the rest of the scripts for parsing to a basic dictionary
- sample.env - a reference for how the .env file should look (without the key)
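As a rough illustration of how a script could pick up the key from .env, the sketch below parses simple KEY=VALUE lines with the standard library only. The helper names are hypothetical, and the repository's scripts may load the file differently (for example through a dotenv library); only the variable name OPENAPI_SECRET_KEY comes from the list above.

```python
import os


def load_env(path=".env"):
    """Parse simple KEY=VALUE lines into a dict, ignoring blanks and # comments."""
    values = {}
    if not os.path.exists(path):
        return values
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_api_key(path=".env"):
    """Prefer an already-exported variable, then fall back to the .env file."""
    key = os.environ.get("OPENAPI_SECRET_KEY") or load_env(path).get("OPENAPI_SECRET_KEY")
    if not key:
        raise RuntimeError("OPENAPI_SECRET_KEY is not set; see sample.env")
    return key
```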