shradhayy / odia-dictionary

A repository for organizing contributions to the creation of an Odia Dictionary dataset for the Dictionary Augmented Translations project in C4GT'23.


Demo notebook

To get a better idea of the code flow, refer to the demo Colab notebook.

Description

This repository hosts a parser that reads the Odia.Dictionary.pdf file and parses its definitions into a dataset. The issues tab tracks aspects of the current solution that need to be refined, or added in the future.

Dependencies / setup

  1. pdf2image setup
  2. Tesseract API setup
  3. PyPDF2 PDF Reader setup
  4. OpenAI API setup
  5. FTFY (fixes Unicode text): run pip install ftfy

How to run

To parse the Odia Dictionary PDF:

  1. python src/a-getting_page_images/pdf_to_imgs.py - Converts each page of the Odia Dictionary PDF to a 300 DPI image stored in pages.
  2. Open the .png files generated in ./pages with the Paint application and blank out the unwanted letter-section separators by selecting the relevant portion and pressing the Delete key.
  3. python src/b-cropping_page_images/cropper.py - Crops the columns of interest out of pages 6-87. Outputs are stored in pages_processed.
  4. python src/c-images_to_pdfs_with_text/pdfmaker.py - Runs Tesseract OCR on the images in pages_processed. Outputs PDFs to parsed_pdfs.
  5. python src/d-read_pdfs_with_text/reader.py - Extracts unstructured OCR text from the PDFs in parsed_pdfs. Outputs .txt files to parsed_texts.
  6. rm GPT_outputs/* - The GPT_outputs folder must be emptied, as the API is not called for text files already present in the folder (refer to sender.py).
  7. python src/e-gpt_api_sender/sender.py - Calls the GPT API to structure the raw OCR text files in parsed_texts. Outputs .txt files to GPT_outputs.
  8. python src/f-dataframe_maker/preprocess.py - Moves the file pointer of every .txt file in GPT_outputs to the first occurrence of "|", so that every file in the folder starts with a CSV-style column header.
  9. python src/f-dataframe_maker/maker.py - Compiles GPT_outputs into the desired CSV: parsed_dicts/parsed_dict_very_unclean.csv.
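The trimming performed in step 8 can be sketched as follows. This is a minimal illustration, not the repo's actual preprocess.py; only the GPT_outputs folder name and the "|" convention are taken from the steps above:

```python
import glob
import os

def trim_to_header(folder: str) -> None:
    """Rewrite every .txt file in `folder` so it starts at the first '|',
    i.e. at the CSV-style column header emitted by the GPT step."""
    for path in glob.glob(os.path.join(folder, "*.txt")):
        with open(path, "r", encoding="utf-8") as f:
            text = f.read()
        cut = text.find("|")
        if cut > 0:  # there is junk before the header: drop it
            with open(path, "w", encoding="utf-8") as f:
                f.write(text[cut:])

# Example: trim_to_header("GPT_outputs")
```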

Structure

Folders

  • GPT_outputs - stores parsed & formatted text tables
  • pages - stores images for every page of the pdf
  • pages_processed - stores cleaned and cropped column images for pages 6-88
  • parsed_dicts - stores the final output csv
  • parsed_pdfs - stores the intermediate pdfs generated using Tesseract OCR for each processed image
  • parsed_texts - stores the intermediate texts generated from the intermediate pdfs
  • src - stores the scripts necessary for parsing the dictionary.
  • testgpt - (IN TESTING) for prompt engineering tests
  • mergegpt - (IN TESTING) better parsing by merging OCR outputs generated by two different PSM modes, 3 and 6.
  • new_dict - (IN TESTING) for parsing en-or.pdf

Files

  • .env - stores the OPENAPI_SECRET_KEY variable
  • Odia.Dictionary.pdf - the PDF to be parsed - contains ~6,000 translations
  • en-or.pdf - the new PDF to be parsed - contains ~14,000 translations
  • run1.sh - converts the pdf to images
  • run2.sh - executes the rest of the scripts for parsing to a basic dictionary
  • sample.env - a reference for how the .env file should look (without the key)
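For illustration, the secret key can be read from the .env file with a small hand-rolled helper like the one below. This is a sketch only; the repo may instead load it with a library such as python-dotenv:

```python
import os

def load_env(path: str = ".env") -> dict:
    """Parse simple KEY=VALUE lines from a .env-style file.

    Blank lines, comments, and lines without '=' are skipped.
    """
    env = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env

# Example: load_env("sample.env").get("OPENAPI_SECRET_KEY")
```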



Languages

Jupyter Notebook 91.1%, Python 8.7%, Shell 0.2%