EleutherAI / the-pile

PDF parsing

leogao2 opened this issue

Existing PDF parsing solutions are often not of high enough quality. This is a wishlist of things we would want in a PDF parser.

  • multilingual - should work for all (or at least, most) languages
  • table detection - be able to detect tables and convert them to some kind of structured format (this item is more of a wishlist thing; I'd be OK if the first version just omitted tables. However, I really want to avoid ending up with a big mess of numbers where a table is supposed to be.)
  • handle both OCR and the hidden text layer - for documents with a high quality hidden text layer, that layer should be used. The parser should also detect when the hidden layer quality is bad and run SOTA OCR instead, and if the quality is still bad, discard the document altogether
  • error correction - it should correct minor OCR errors, and detect and reject irrecoverable ones. There are a lot of different failure cases, but a few off the top of my head: wrong-script extraction (e.g. Cyrillic gets turned into garbage that looks like üúôóöú by Latin-only OCR), lots of extra spaces between words (e.g. "ex t ra c tio n"), math getting irrecoverably mangled into a big mess, and OCR-introduced typos
  • multi-column or other nonlinear layouts - it should do something sane in these cases. Maybe sometimes the order is ambiguous, but at the very least it shouldn't interleave different text boxes
  • header and footer detection - explicitly separated from main content (a heuristic sketch follows this list)
  • output some kind of structured format - possibly pandoc-compatible, or possibly something else. We want this because we'd want to be able to produce all sorts of different output formats.
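
For the header/footer item, one crude but workable heuristic is to flag short lines that recur near the top or bottom of most pages. A minimal sketch, assuming pdfminer.six for per-page extraction; the helper itself is illustrative, not part of any existing library, and the thresholds are guesses:

```python
from collections import Counter
from pdfminer.high_level import extract_text

def normalise(line):
    # Map digits to '#' so e.g. "Phytother. Res. 13, 655" and "... 656"
    # count as the same running header across pages.
    return "".join("#" if c.isdigit() else c for c in line.strip())

def strip_repeated_edges(pages, edge_lines=3, min_fraction=0.6):
    """Drop lines that recur near the top or bottom of most pages
    (running headers, footers, page numbers)."""
    counts = Counter()
    for page in pages:
        lines = [l for l in page.splitlines() if l.strip()]
        for line in lines[:edge_lines] + lines[-edge_lines:]:
            counts[normalise(line)] += 1

    threshold = max(2, int(min_fraction * len(pages)))
    cleaned = []
    for page in pages:
        lines = [l for l in page.splitlines() if l.strip()]
        kept = [line for i, line in enumerate(lines)
                if not ((i < edge_lines or i >= len(lines) - edge_lines)
                        and counts[normalise(line)] >= threshold)]
        cleaned.append("\n".join(kept))
    return cleaned

# pdfminer.six can extract one page at a time via page_numbers=
# (the file name and page count here are placeholders):
pages = [extract_text("paper.pdf", page_numbers=[i]) for i in range(12)]
body_pages = strip_repeated_edges(pages)
```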

I have access to a commercial OCR system that does check all those boxes. I cannot share the code, but if you provide the PDFs I can generate .html or .txt for each file.

I will give it a try. Using pdfminer.six I converted a random sci-hub PDF to text; the output for its first page is appended below.
Issues:

  • Works only for PDFs that contain a text layer (it does no OCR)
  • Text flows in the correct order, but footnotes and similar pieces are mixed in
  • No distinction between headings and body text

In general I believe that a useful fraction is convertible (>10%). Perhaps we need "only" an automatic way to determine whether the output is fine or garbage? What is the expected minimum quality level?
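
For reference, the sample below comes from the one-line high-level API, and a first-pass "fine or garbage" check could score simple character statistics of the result. A minimal sketch; the file name is a placeholder and the thresholds are untuned guesses (pdfminer.six really does emit `(cid:NN)` for glyphs it cannot map to Unicode, as visible in the sample):

```python
from pdfminer.high_level import extract_text

def looks_like_garbage(text, min_letter_ratio=0.6, max_cids_per_kb=1.0):
    """Crude quality check: badly extracted text tends to have a low
    ratio of letters to total characters, and pdfminer marks unmapped
    glyphs with '(cid:NN)' escapes."""
    if not text.strip():
        return True
    letter_ratio = sum(c.isalpha() for c in text) / len(text)
    cid_density = text.count("(cid:") / max(1.0, len(text) / 1000)
    return letter_ratio < min_letter_ratio or cid_density > max_cids_per_kb

text = extract_text("random-scihub-paper.pdf")  # placeholder path
print(looks_like_garbage(text))
```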

```
PHYTOTHERAPY RESEARCH
Phytother. Res. 13, 655–659 (1999)

Effects of Mistletoe Lectin I and Ionizing
Radiation on the Glucose and Thymidine
Uptake in Tumour Cells in vitro

Tamara Kubasova,1* Ileana Petcu,2 U. Pfu¨ller3 and G. J. Ko¨teles1
1Fre´de´ric Joliot-Curie National Research Institute for Radiobiology and Radiohygiene, Budapest, P.O. Box 101, H-1775 Hungary
2Horia Hulubei Institute of Physics and Nuclear Engineering, P.O. Box MG-6, R-76900 Bucharest, Romania
3Institute of Phytochemistry, University of Witten/Herdecke, D-58448 Witten, Germany

The increased uptake of hexose by mammalian cells is considered to be a general response to stress.
Nowadays, mistletoe lectin separated from the extracts of the European mistletoe (Viscum album L.) is
often used in adjuvant cancer therapy. The present work studies the effect of the lectin on unirradiated
and x-irradiated tumour cells. The response of cultured human lung carcinoma cells (Calu-1) was fol-
lowed by radioactive glucose uptake as well as by tritiated thymidine incorporation. The cells were main-
tained either in a complete or a so-called restrictive medium.

Slight metabolic changes were found in the restrictive medium but not in the complete one. Mistletoe
lectin I at a very low concentration (0.001 ng/mL)
increased the glucose uptake and thymidine
incorporation. Ionizing radiation (1 Gy) did not influence the hexose uptake but it enhanced the
incorporation of thymidine. It seems that the actions of two different factors (mistletoe lectin I and
radiation) proved to be rather provoking stress effects for the tumour cells as detected in the restrictive
medium. Copyright # 1999 John Wiley & Sons, Ltd.

Keywords: Calu-1; mistletoe lectin I; ionizing radiation; thymidine; D-glucose; metabolic response.

INTRODUCTION

The hexose uptake of mammalian cells is known to
change under certain stress circumstances (Gray et al.,
1983; Weber et al., 1984; Warren et al., 1986; Pasternak
et al., 1991). This increased glucose uptake upon
environmental stress can be considered as a general
response of cells through changes in plasma membrane
function. Alteration of the physiological conditions of
membranes have also been shown in our earlier
experiments in vitro on different cell cultures and blood
cells exposed to ionizing radiation at relatively low doses
(0.25–2.5 Gy), as detected by the binding of radiolabelled
concanavalin A lectin to the cell surfaces (Ko¨teles et al.,
1976; Kubasova et al., 1981a, 1981b, 1984).
the use of different

treatments
(cytostatic drugs, radiation, adjuvant preparations) in
cancer therapy can lead to the alteration of plasma
membrane function and metabolic processes in both
malignant and normal cells. The favourable effects of
extracts from the European mistletoe Viscum album L.
have been known for over 70 years for the treatment of
inflammatory diseases and also cancer
hypertonia,
(Kwaja et al., 1986; Hajto et al., 1989; Franz, 1991;
Kuttan, 1993; Gabius et al., 1994). The effect of the
extracts is attributed to their main constituent, lectin.

is evident

that

It

  • Correspondence to: T. Kubasova, Budapest, P.O. Box 101, H-1775
    Hungary.
    Contract/grant sponsor: European Commission PECO Programme; Contract/
    grant number: ERBBMHICT 931238; Contract/grant number: ERBCIPDCT
    940224

The aim of the present work was to study the metabolic
changes in cultured tumour cells (human lung carcinoma
line Calu-1) on the effect of mistletoe lectin I (ML I) used
widely in cancer adjuvant therapy. Uptake of 3H-glucose
by the cells and incorporation of 3H-thymidine into them
were used to reflect the metabolic changes in Calu-1 cell
cultures. This experimental approach was intended to
reveal whether ML I treatment at very low lectin
concentrations (0.001 ng/mL) produces any modification
in the response of x-irradiated cells.

MATERIALS AND METHODS

Human lung carcinoma cell line Calu-1. This was a gift
of the Memorial Sloan-Kettering Cancer Center, since
1986 the cells have been adapted to RPMI-1640 medium
supplemented with 10% fetal calf serum (FCS), L-
glutamine and antibiotics (complete medium). The cells
were grown on tissue culture plates of 24 wells (Greiner,
Germany). In separate experiments, 1 (cid:2) 105–1.4 (cid:2) 105
cells/mL in a well were used for plating. All cells were
incubated in the complete medium for 4 h; then, for one
part of the cells (used in the deoxy-D-glucose uptake
assay),
the medium was replaced by the restrictive
medium containing 0.5 % FCS only and the incubation
was continued for 3 h. Half of the cultures, in both the
complete and the restrictive media, were irradiated with
x-rays. For the irradiation period the medium was
changed to a serum-free one. Starting immediately after
the radiation exposure the irradiated and unirradiated

CCC 0951–418X/99/080655–05 $17.50
Copyright # 1999 John Wiley & Sons, Ltd.

Received 25 November 1998
Accepted 28 January 1999

656
```

> I have access to a commercial OCR system that does check all those boxes. I cannot share the code, but if you provide the PDFs I can generate .html or .txt for each file.

Try it on this PDF: http://www.math.bas.bg/mathmod/Proceedings_CTF/CTF-1984/files_CTF-1984/CTF-1984-334-345.pdf

Attached is the result from ABBYY FineReader 15. Looks OK to me; I'd give it 90%. The document is pretty difficult (Russian math) and in bad shape (warped pages). Most of sci-hub will be much better. I estimate >90% of sci-hub will come out at 99% quality or better.

ABBYY has automation capability.

Is that "good enough"?

CTF-1984-334-345.docx
CTF-1984-334-345-from-abbyy.txt

Also, companies like ABBYY and OmniPage have built these OCR solutions over decades and have likely put well over $100M into research and development. We won't improve on that on our own in the short term. Either we use such a solution, or the output isn't good enough and the attempt can be retried in a decade.

> Attached is the result from ABBYY FineReader 15. Looks OK to me; I'd give it 90%. The document is pretty difficult (Russian math) and in bad shape (warped pages). Most of sci-hub will be much better. I estimate >90% of sci-hub will come out at 99% quality or better.
>
> ABBYY has automation capability.
>
> Is that "good enough"?
>
> CTF-1984-334-345.docx
> CTF-1984-334-345-from-abbyy.txt

Thanks for doing the conversion! I’ll take a look at it this weekend. Just an FYI, the text is in English not Russian. It’s from a Russian academic journal.

And yeah, we know it’s worse than most texts we will encounter. That’s what makes it a good test case :)

> I have access to a commercial OCR system that does check all those boxes. I cannot share the code, but if you provide the PDFs I can generate .html or .txt for each file.

@bratao What volume of PDFs are you able to process? We may be processing a very large amount of PDFs (think 100 TB in total).

Compute will be relevant. For the 7.3 MB test file my Core i7-7700K needs 15 s in FineReader, i.e. ~500 KB/s. At that rate, 100 TB of PDFs would take on the order of 6 years on this single machine.

> Compute will be relevant. For the 7.3 MB test file my Core i7-7700K needs 15 s in FineReader, i.e. ~500 KB/s. At that rate, 100 TB of PDFs would take on the order of 6 years on this single machine.

That is not an issue. We're not in a hurry, and we can obtain much, much more compute than a single 7700k.

(At present we have 64 cores at our immediate disposal, which would bring that down to roughly five months, and we can obtain more if necessary.)
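
Back-of-the-envelope, assuming the measured ~500 KB/s used all four cores of the 7700K and that throughput scales linearly with core count:

```python
corpus_bytes = 100e12      # 100 TB of PDFs
rate = 500e3               # ~500 KB/s measured on one 4-core i7-7700K

single_machine_days = corpus_bytes / rate / 86400
print(single_machine_days)            # ~2315 days, i.e. ~6.3 years

days_on_64_cores = single_machine_days * 4 / 64
print(days_on_64_cores)               # ~145 days, i.e. roughly 5 months
```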

I would recommend checking out JSL's PDF OCR - it is highly scalable, and I have tested it myself on 4 GB of PDFs with good results. Getting started requires some tweaking and fine-tuning (memory, cores, etc.), but once all the settings are in place it's fairly stable and reliable.

I went through several PDF readers, including the ones listed above (notable mention: parsr), and ultimately went with JSL.

@leogao2 The OCR I have access to will produce results very similar to, or better than, FineReader or OmniPage, so the results are practically identical to @hippke's. But I can tweak it further to try to eliminate the headers and footers.

It can process 30 pages per minute on a single VPS core.
If I rent an EPYC dedicated server from Hetzner, I think it will process the 100 TB in less than a month. I can do it.
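
The same back-of-the-envelope for the 30 pages/minute/core figure, assuming pages average roughly 600 KB (the 7.3 MB test file covers pages 334-345, i.e. ~12 pages) and a high-core-count EPYC box (the 128-core figure is an assumption):

```python
corpus_bytes = 100e12
bytes_per_page = 600e3      # assumed average, extrapolated from the test file
total_pages = corpus_bytes / bytes_per_page        # ~1.7e8 pages

pages_per_min_per_core = 30
cores = 128                 # assumed dedicated EPYC server
days = total_pages / (pages_per_min_per_core * cores * 60 * 24)
print(days)                 # ~30 days, so "about a month" is in the right ballpark
```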