PedroBarcha / old-books-dataset

Old book pages (with groundtruth), formerly used for OCR studies. The set comes in several versions that differ in resolution and binarization. Noised and denoised sets (produced by several methods) will be uploaded later.

Old scanned books dataset with groundtruth. The groundtruth was built from Project Gutenberg ebooks. All the .tiff pages were converted from Internet Archive books (PDFs). The pages were selected from the following books:

-Betrayed Armenia, by Diana Agabeg Apcar

-The Boy Apprenticed to an Enchanter, by Padraic Colum

-The Child of the Moat, by Stoughton Holborn

-The Corset and the Crinoline, by W.B.L.

-Engraving of Lions, Tigers, Panthers, Leopards, Dogs, &C., by Thomas Landseer

-Half-Hours with Highwaymen, by Charles G. Harper

-Historical Sketches of Colonial Florida, by Richard L. Campbell

-Horton Genealogy, by Geo. F. Horton

-The Lusitania's Last Voyage, by Charles E. Lauriat

-Seat Weaving, by L. Day Perry

The dataset is available in several resolutions: 300 dpi, 500 dpi, and 1000 dpi. There are also several sets of 300 dpi pages binarized with different methods.
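For readers unfamiliar with binarization, here is a minimal sketch of one common method, Otsu's global threshold, in plain Python. This README does not say which binarization methods were used for the 300 dpi sets, so Otsu is only an assumed example, and the function names `otsu_threshold` and `binarize` are hypothetical. Loading a .tiff page into a flat list of 8-bit gray values is left out and would need an imaging library.

```python
def otsu_threshold(pixels):
    """Return an Otsu threshold (0-255) for an iterable of 8-bit gray values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # pixel count at or below the candidate threshold
    sum_bg = 0  # sum of gray values at or below the candidate threshold
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no dark pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break  # every pixel is already in the dark class
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance (up to a constant factor); Otsu maximizes it.
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold):
    """Map values at or below the threshold to 0 (ink), the rest to 255 (paper)."""
    return [0 if p <= threshold else 255 for p in pixels]


if __name__ == "__main__":
    # Synthetic "page": dark ink values mixed with bright paper values.
    page = [12, 15, 20, 18, 230, 240, 235, 250, 245, 238]
    t = otsu_threshold(page)
    print(t, binarize(page, t))
```

A global threshold like this tends to struggle on pages with uneven lighting or stains, which is one reason a dataset might offer several binarized variants for comparison.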

Feel free to use and study the sets contained here :)
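Since the pages come with groundtruth, a typical way to study them is to compare OCR output against the groundtruth text. The sketch below computes a character error rate (CER) from the Levenshtein edit distance. It is only an illustration: the function names are hypothetical, the README does not prescribe any metric, and reading the groundtruth files is left out because their layout is not described here.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(ocr_text, ground_truth):
    """Edit distance divided by the groundtruth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ocr_text, ground_truth) / len(ground_truth)


if __name__ == "__main__":
    # Made-up strings, not taken from the dataset.
    print(character_error_rate("The Lusitanla's Last Voyage",
                               "The Lusitania's Last Voyage"))
```

Running the same comparison on each resolution or binarized variant gives a simple way to see how image quality affects OCR accuracy.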

License: GNU General Public License v3.0
