Shoshin23/Envision-Computer-Vision-Engineer-Assignment

The Why

Document Text Recognition is one of the most used features of the Envision App and Glasses. We process millions of document text recognition requests a month, and users read a variety of documents (letters, textbooks, receipts, etc.) with Envision.

One of the most requested features from users is help identifying interesting regions in a document: headings, columns, tables, etc. Envision already does this in some cases, and those features work online. Going forward, Envision wants to build a single endpoint that detects all these regions of interest within a document and lays them out in an accessible way for our blind and visually impaired users.

The Assignment

This is a typical problem you'll face as a Computer Vision Engineer at Envision. We'd like you to build an algorithm/model that detects headings in the test set provided in this repo, given the OCR and bounding box output.
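To make the input and output concrete, here is a minimal sketch of one possible heuristic: flag OCR lines whose bounding box is noticeably taller than the median line. It is only an illustration, not the expected solution. The `OcrLine` structure, the `detect_headings` name, and the `ratio` threshold of 1.3 are all assumptions made for this example; your OCR provider will return its own response format that you'd need to convert first.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class OcrLine:
    """One line of OCR output (hypothetical structure, not a Cloud Vision type)."""
    text: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def height(self) -> float:
        # Box height is used as a rough proxy for font size.
        return self.y_max - self.y_min


def detect_headings(lines, ratio=1.3):
    """Return the text of lines at least `ratio` times taller than the median line."""
    if not lines:
        return []
    typical = median(line.height for line in lines)
    return [line.text for line in lines if line.height >= ratio * typical]


# Example: one tall line followed by three body lines of equal height.
sample = [
    OcrLine("Annual Report", 0, 0, 300, 40),
    OcrLine("This is the first body line.", 0, 50, 300, 70),
    OcrLine("This is the second body line.", 0, 75, 300, 95),
    OcrLine("This is the third body line.", 0, 100, 300, 120),
]
print(detect_headings(sample))
```

Box height alone will miss headings that are bold but the same size as body text, so treat this as a starting point rather than a complete approach.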

For OCR, you can use Google Cloud Vision OCR (or any cloud-based OCR that returns bounding box data). If you're using Cloud Vision, you can follow this really simple tutorial to set up a sample API key: https://codelabs.developers.google.com/codelabs/cloud-vision-api-python

The algorithm can be tailored to work with only the images that are provided; these represent the most common types of images that users tend to use with Envision. Your final result can be an executable Python/C++ script or be deployed as a Flask web server.

Your script/API endpoint should accept an image and return the headings found in the image. Below is a sample of how the Python script could work.

python detect_headings.py --image="/Users/EnvisionDev/test.jpg"

Heading 1
Heading 2
Heading 3
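A script with that interface could be wired up roughly as follows. This is a sketch under assumptions: only the `--image` flag comes from the sample above, while `build_parser`, `find_headings`, and `main` are hypothetical names, and `find_headings` is a placeholder for whatever OCR and detection logic you implement.

```python
import argparse


def build_parser():
    """Build the command-line parser; only --image appears in the sample above."""
    parser = argparse.ArgumentParser(
        description="Print the headings found in a document image."
    )
    parser.add_argument("--image", required=True,
                        help="Path to a cropped document image.")
    return parser


def find_headings(image_path):
    """Placeholder (hypothetical): run OCR on image_path and return heading strings."""
    raise NotImplementedError("plug in OCR and heading detection here")


def main(argv=None, detector=find_headings):
    """Parse arguments, run the detector, and print one heading per line."""
    args = build_parser().parse_args(argv)
    for heading in detector(args.image):
        print(heading)
    return 0
```

To run it from the shell, call `main()` under the usual `if __name__ == "__main__":` guard. Passing the detector in as a parameter keeps the command-line handling separate from the algorithm, which makes it easier to swap approaches while prototyping.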

Note: You can also take a deep learning approach to solve this problem. If you're taking this approach then please implement it using a Jupyter Notebook and push that instead.

Other Important Stuff

For the sake of simplicity, we'll ignore all the other layout information within the image, such as columns, unless you think it would help in solving the problem.
Also for simplicity's sake, you can assume that the images passed to the script are properly cropped documents. You don't have to concern yourself with documents/images that haven't gone through a cropping/perspective-transform process.
The emphasis for this assignment is on the core algorithm rather than code quality. Though code quality is very much emphasised at Envision, I understand that it can be time-consuming at the early prototyping stage.
It's not mandatory to distinguish between the heading types (H1, H2, H3, H4, etc.). Bonus points if you can distinguish them, but that can be considered outside the scope of the implementation.
If you think that there are other reasonable constraints you want to take into account, please do so and make a note of them in the code/readme.
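For the optional bonus of telling heading types apart, one simple idea is to rank the detected headings by text height and start a new level whenever the height drops noticeably. The sketch below is just one way to approach it, under stated assumptions: the `assign_levels` name, the `(text, height)` input format, and the 10% `tolerance` are hypothetical choices for illustration.

```python
def assign_levels(headings, tolerance=0.1):
    """Assign levels (1 = largest) to headings given as (text, height) pairs.

    Headings are visited from tallest to shortest. A new level starts when a
    height falls more than `tolerance` below the first height of the current
    level. Results are returned in the original input order.
    """
    order = sorted(range(len(headings)),
                   key=lambda i: headings[i][1], reverse=True)
    levels = [0] * len(headings)
    level = 0
    reference = None  # height of the first heading seen at the current level
    for i in order:
        height = headings[i][1]
        if reference is None or height < reference * (1 - tolerance):
            level += 1
            reference = height
        levels[i] = level
    return [(text, lvl) for (text, _), lvl in zip(headings, levels)]


# Example: heights in pixels, listed in reading order.
detected = [("Intro", 30), ("Title", 48), ("Methods", 31), ("Note", 24)]
print(assign_levels(detected))
```

The levels here are relative to a single image, so "level 1" means the largest heading on that page rather than a true H1.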

Deadline

There's no official deadline for this assignment; the emphasis is on the core algorithm rather than code quality. You can also submit the final task as a Jupyter Notebook.

Submission

Fork this repo and push your changes to the forked repo (please keep the repo private). Once you're done, please send me the link.
