Geelhem / OCR-import-tool-for-Digital-Record

This repo is part of the Guernsey French Language Preservation Project.

Overview

The Guernsey French Language Preservation Project aims to digitize and preserve resources related to the endangered Guernsey French language. By creating a digital record, we ensure that valuable texts, documents, and cultural heritage are accessible to future generations.

Aims

The objective is to digitally preserve a critically endangered language, which currently lacks a standardized spelling. The commission in charge of preserving the language has identified two priorities:

Priority 1 is to safeguard the language by recording native speakers and creating a digital library of audio recordings.

Priority 2 is to create a digital record of existing written work by creating a digital library of bilingual texts.

Goals and Objectives of the script

Preservation: Digitize Guernsey French Language resources, including PDFs and images.

The program for Priority 2 needs to import data from PDFs and common image files into the digital record of existing written work, running Tika and Tesseract to extract the text.

a. The code should be optimised so that it can ingest documents from a directory. The output should be post-processed so that scanned PDFs or images do not produce gibberish, and the user should be informed if the scan quality is not good enough. The process will either be automated or prompt the user for verification.
b. Data extracted from the images and PDFs should be converted to text files, with the output organised so that file names indicate whether the content is in English (containing "eng" before the file extension) or in the target language (containing "gf" before the file extension).
c. One part of the program should also allow the user to provide a web link. The program will scrape that website for its contents and include them in the digital library, separating the sections in English from those in the target language.
d. The digital library of bilingual texts should be accessible via the UI. The same text should be available side by side, in English on one side and in the target language on the other.
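The naming and quality requirements in items a and b above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the helper names `output_name`, `word_ratio` and `looks_like_gibberish`, the underscore separator, the 0.6 threshold and the heuristic itself (the share of tokens that look like words) are all assumptions. The OCR step is left out.

```python
from pathlib import Path
import string

# Language tags come from requirement b; the dictionary keys are hypothetical.
LANG_TAGS = {"english": "eng", "guernsey_french": "gf"}

# Characters stripped from the edges of a token before judging it.
EDGE_CHARS = string.punctuation + "’‘«»“”"


def output_name(source: Path, language: str) -> str:
    """Return a .txt file name with the language tag before the extension."""
    return f"{source.stem}_{LANG_TAGS[language]}.txt"


def word_ratio(text: str) -> float:
    """Share of whitespace-separated tokens that look like words."""
    tokens = text.split()
    if not tokens:
        return 0.0
    wordlike = 0
    for token in tokens:
        core = token.strip(EDGE_CHARS)
        # Allow apostrophes and hyphens inside a word, e.g. "l'ile".
        for ch in "'’-":
            core = core.replace(ch, "")
        if core.isalpha():
            wordlike += 1
    return wordlike / len(tokens)


def looks_like_gibberish(text: str, threshold: float = 0.6) -> bool:
    """Flag text whose word-like share is below the (assumed) threshold."""
    return word_ratio(text) < threshold
```

A caller could use `looks_like_gibberish` to decide whether to warn the user about scan quality or prompt for verification before writing the file named by `output_name`. Because the language has no standardized spelling, the check deliberately avoids any dictionary lookup.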

Community Access: Make the digital records available to the Guernsey French language-speaking community and researchers.

Raise Awareness: Highlight the importance of language preservation.

Installation

Check and Install Required Modules: Run all the Python scripts in https://github.com/Geelhem/OCR-for-Digital-Record/tree/main/workflow_code/Installation%20Checks to verify that the necessary modules are installed and to install them if needed.
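For readers who want to see what such a check might look like, here is a minimal sketch. It is not the repository's script: the module list in `REQUIRED` and the helper names `missing_modules`, `install` and `ensure_installed` are assumptions for illustration, and the real scripts may check a different set of modules.

```python
import importlib.util
import subprocess
import sys

# Import name -> pip package name. This list is an assumption, not the
# project's actual requirements.
REQUIRED = {"pytesseract": "pytesseract", "tika": "tika", "PIL": "Pillow"}


def missing_modules(import_names):
    """Return the top-level module names that cannot be found."""
    return [name for name in import_names
            if importlib.util.find_spec(name) is None]


def install(package: str) -> None:
    """Install a package with pip, using the current interpreter."""
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])


def ensure_installed(required=REQUIRED) -> None:
    """Install every required package whose module is missing."""
    for name in missing_modules(required):
        install(required[name])
```

Calling `ensure_installed()` would install only what is missing. Note that `pytesseract` and `tika` are Python wrappers: the Tesseract binary and a Java runtime still have to be installed separately.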

Contributing

We welcome contributions! If you’d like to help, follow these steps:

1. Fork this repository.
2. Create a new branch: git checkout -b feature/my-contribution
3. Commit your changes: git commit -m "Add feature"
4. Push to the branch: git push origin feature/my-contribution
5. Open a pull request.

License

This project is licensed under the MIT License.

Acknowledgments

We thank the community, linguists, and scholars for their support. This is a secondary school student-led programme by the Ladies College Guernsey in support of the Guernsey Language Commission. Our girls have set up a contact email, gf@ladiescollege.gg, and a form, https://forms.office.com/e/qtPAzwwP5z, for anyone who would like to submit audio recordings of people speaking Guernsey French.

https://github.com/Geelhem/OCR-import-tool-for-Digital-Record
https://github.com/Geelhem/Guernsey-French-Digital-Record

Guilhem Chene https://gg.linkedin.com/in/gchene

Languages

C 53.1%, C++ 40.8%, Roff 2.8%, Python 2.2%, Makefile 0.6%, CMake 0.5%