Table-Generator

Tables provide a concise and systematic way of displaying and arrange data of interests. Although tables are easily identifiable and interpretable to humans this is not the case for machines. With the amount of digital data available growing exponentially, the amount of data represented in table format also continues to grow. Typical examples of everyday data that is represented in tables includes invoices, receipts, medical records and so on. While human beings can read and interpret tables , doing this manually is a time consuming task and can be error prone. Automated table detection methods are preffered over manual processing of tables.

The purpose of this project was to produce a synthetic dataset that can be used for developing table detection and table structure recognition methods for the purpose of extracting tabular data from scanned documents. This dataset features tables generated from latex, html and word. The actual contents of the table were not regardes as important only the different structures/styling of tables were taken into consideration. Follow steps below to generate a dataset

Dependencies

First install all the required components by running the following (I know a Docker Image would be better working on it). (Note you must have python 3.8 or later)

./configure.sh # note run as sudo
./setup.sh 
pipenv install # note if using pipenv (recommended)
pip install -r requirements.txt # note if not using pipenv to install libraries

Output Description

Each data point produces has three things, the actual image, a mask image and an annotation.

Raw Image - The actual image (PNG) with tables
Mask - A binary image (PNG) with tables localized.
Annotation - A json file containing more info about the tables , has number of tables , bounding boxes and structures.

Config Description

This a description of important parameters that are specified in the config.json file.

sample_size (int) - how many data points to be produce.
types (List[int]) - what sort of tables may appear in the dataset (description of each type can be found in types_map)
parallel (boolean) - specifies whether the dataset is to be generated sequentially/parallel
img_path (str) - path where output images will be saved.
mask_path (str) path where output masks will be saved.
annotaion_path (str) path where annotations will be saved.

Running

Run the main script i.e python main.py note if pipenv was used pipenv run python main.py

Benchmarks

An experiment was done in which the resulting dataset was as follows. We employed a machine with the following specs to run the code:

OS name : Ubuntu 20.04.3 LTS.
Processor : Intel® Core™ i7-8750H CPU @ 2.20GHz × 12
Memory: 15.5 GiB

First Header	Latex	Html	Word
Image Dimension	1700 × 2200 × 3	1653 × 2339 × 3	1700 × 2200 × 3
Image Count	10 000	10 000	10 000
Table Count	25 210	25 276	21 854
Size	11.5 GB	14.2 GB	13.8 GB
Time	2.35 hrs	1.5 hrs GB	1.15 hrs

The total run time for generating was 5 hrs and total storage was 40 GB.

About

This is a synthetic dataset generator. It generates an image dataset that can be used for developing table detections and structure recogntion models

Languages

Language:Jupyter Notebook 69.0%Language:Python 29.9%Language:Shell 0.4%Language:HTML 0.4%Language:Dockerfile 0.3%Language:TeX 0.1%