diamonddev107 / parcl_ml

udot-parcel-ml

A repository for processing UDOT parcel images and extracting parcel numbers using machine learning. Currently, this tool scans PDFs and images for circles. The detected circles are extracted, tiled, and stored so they can be run against the Google Cloud Document AI optical character recognition (OCR) processor.
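To illustrate the tiling step, here is a minimal sketch of how a set of extracted circle crops could be laid out on a grid before being sent to OCR. It assumes square tiles of equal size arranged in a near-square grid; the function name `mosaic_layout` and its parameters are hypothetical and are not taken from this repository, which may arrange its tiles differently.

```python
import math


def mosaic_layout(tile_count, tile_size, columns=None):
    """Return (width, height, offsets) for a grid of equally sized square tiles.

    offsets holds the (x, y) pixel position of each tile's top-left corner,
    filled left to right, then top to bottom. This is an assumed layout.
    """
    if tile_count <= 0:
        return 0, 0, []
    if columns is None:
        # Integer ceiling of the square root, giving a near-square grid.
        columns = math.isqrt(tile_count - 1) + 1
    rows = math.ceil(tile_count / columns)
    offsets = [
        ((i % columns) * tile_size, (i // columns) * tile_size)
        for i in range(tile_count)
    ]
    return columns * tile_size, rows * tile_size, offsets


width, height, offsets = mosaic_layout(tile_count=5, tile_size=100)
print(width, height, offsets)
```

Five tiles of 100 pixels fit a 3 by 2 grid, so the canvas is 300 by 200 pixels and the last cell stays empty.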

Example source image

image

Example output

image

This project is organized to work with buildpacks and Google Cloud Run Jobs, and its commands can also be run locally via a CLI.
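Cloud Run Jobs runs a job as a number of tasks and passes each task its position in the `CLOUD_RUN_TASK_INDEX` environment variable. The sketch below shows one way a task could pick its share of an index, assuming each task handles a contiguous block of `file_count` entries. The helper names and the slicing scheme are assumptions for illustration and are not confirmed by this repository.

```python
import os


def current_task_index(default=0):
    """Read the task index set by Cloud Run Jobs, or fall back for local runs."""
    return int(os.environ.get("CLOUD_RUN_TASK_INDEX", default))


def task_slice(entries, task_index, file_count):
    """Return the block of index entries assigned to one task.

    Assumed scheme: task 0 gets the first file_count entries, task 1 the
    next file_count, and so on. Tasks past the end get an empty list.
    """
    start = task_index * file_count
    return entries[start:start + file_count]


index = ["a.pdf", "b.png", "c.jpg", "d.pdf", "e.png", "f.jpg", "g.pdf"]
print(task_slice(index, task_index=1, file_count=3))
```

Running locally, the same slice can be selected by passing the task index explicitly instead of reading the environment variable.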

CLI

To work with the CLI:

  1. Create a Python environment and install requirements.dev.txt into it
  2. Execute the CLI to see the available commands and options
    • python row_cli.py

Workflow steps

  1. generate an index of all files

  2. filter the index to remove non-image files and deeds

    python row_cli.py index filter ./data/elephant/remaining_index.txt

  3. put the index in storage

  4. run the job referencing the index location (edit the job name, file size, and task count)

  5. generate another index from the resulting job

    python row_cli.py storage generate-index --from=gs://ut-dts-agrc-udot-parcels-dev --prefix=elephant/mosaics/ --save-to=./data/elephant

  6. use a logging sink to capture the files in which 0 circles were detected, query it for the file names, and add them to the index generated in the previous step so those files are not processed twice

  7. generate a remaining index between the original and the prior

    python row_cli.py storage generate-remaining-index --full-index=gs://ut-dts-agrc-udot-parcels-dev --processed-index=./data/elephant --save-to=./data/elephant

    This assumes the index in the bucket is the last remaining index, which is the one used for the comparison.

  8. filter out the deeds, which have no circles

    python row_cli.py index filter ./data/elephant/remaining_index.txt

  9. move the current index into the job and replace it with the remaining index, renamed to index.txt

  10. repeat steps 4-9 until there are no more files left to process

  11. authenticate for the Document AI job

    • activate your terminal as a service account

      gcloud auth activate-service-account email@address --key-file=/path/to/sa.json

  12. start the job

    python row_cli.py process circles --job=elephant --from=gs://bucket --save-to=bucket --index=gs://bucket --task-index=0 --file-count=1 --instances=1 --project=1234 --processor=123abc
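Steps 6 and 7 amount to a set difference: every entry in the full index that has neither produced output nor been logged with 0 circles still needs processing. The following is a minimal sketch, assuming each index is plain text with one file name per line and that processed names can be compared directly with names in the full index. The functions `parse_index` and `remaining_index` are hypothetical, and the repository's own generate-remaining-index command may match names differently.

```python
def parse_index(text):
    """Split index text into file names, ignoring blank lines and whitespace."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def remaining_index(full_index, processed, zero_circle_files=()):
    """Return entries of full_index that still need processing, in order.

    processed: names that already produced output (step 5).
    zero_circle_files: names logged with 0 circles detected (step 6).
    """
    done = set(processed) | set(zero_circle_files)
    return [name for name in full_index if name not in done]


full = parse_index("a.pdf\nb.png\n\nc.jpg\nd.pdf\n")
todo = remaining_index(full, processed=["b.png"], zero_circle_files=["c.jpg"])
print("\n".join(todo))
```

Keeping the original order means the remaining index can be diffed against the full index by eye, and an empty result signals that the loop in step 10 is finished.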

About

License:MIT License