iilei / hocr-to-json

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Hocr to JSON - WIP

alt text

WORK IN PROGRESS

Simple tool to convert .hocr files to json for further processing

Work in progress, so far tested with tesseract output

Example result JSON

The result will give some meta information and a representation of the hocr file in json.

Example output

{
  "contentType": "text/html;charset=utf-8",
  "ocrCapabilities": "ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf",
  "ocrSystem": "tesseract 4.1.0-rc1-752-g8b69",
  "pages": [
    {
      "bbox": [[ 0, 0 ], [ 640, 480 ]],
      "careas": [
        {
          "bbox": [[ 36, 92], [ 618, 361 ]],
          "id": "block_1_1",
          "pars": [
            {
              "id": "par_1_1",
              "lang": "eng",
              "lines": [
                {
                  "baseline": [ 0, -6 ],
                  "bbox": [[ 36, 92 ], [ 580, 122 ]],
                  "id": "line_1_1",
                  "words": [
                    {
                      "bbox": [[ 36, 92 ], [ 96, 116 ]],
                      "content": "This",
                      "id": "word_1_1",
                      "xWconf": 91
                    },
                    {
                      "bbox": [[ 109, 92 ], [ 129, 116 ]],
                      "content": "is",
                      "id": "word_1_2",
                      "xWconf": 92
                    },
                    {
                      "bbox": [[ 141, 98 ], [ 156, 116 ]],
                      "content": "a",
                      "id": "word_1_3",
                      "xWconf": 92
                    },
                    {
                      "bbox": [[ 169, 92 ], [ 201, 116 ]],
                      "content": "lot",
                      "id": "word_1_4",
                      "xWconf": 90
                    },
                    {
                      "bbox": [[ 212, 92 ], [ 240, 116 ]],
                      "content": "of",
                      "id": "word_1_5",
                      "xWconf": 93
                    },
                    // ...

Further example content can be found in /stub/ directory.

About

License:MIT License


Languages

Language:JavaScript 100.0%