pdf-association / pdf-corpora

An index of PDF-centric corpora

"pdf corpora" ...

Albretch opened this issue · comments

To my understanding, the idea of using PDF files as the resident format of a corpus is so "crazy", and so wasteful, that I wonder what it could possibly mean. I tried to start a discussion on these matters on the corpora list a few days ago:

// __ towards a "pan document format" (pun intended) ...

I also checked your:


and most (if not all) aren't really corpora but "text banks", apparently used for functional and compliance testing, forensic research and similar purposes.

Could you actually show examples of corpora whose resident format is PDF? Ideally, examples whose claims one could test for oneself?


This corpora index is targeted at PDF technology developers, as the corpora listed represent many of the realities of supporting the file format.