axa-group / Parsr

Transforms PDF, Documents and Images into Enriched Structured Data

Not able to parse PDF

JiriSch opened this issue

Hello, and thank you for developing this library. I found a strange PDF that blocks Parsr at 100% CPU for hours without any progress. The PDF has only one page.

I tried to debug with the simplified config, but without success. The log looks as if processing has finished, yet the CPU stays at 100% and the process never exits:

`root@pdfparsr:/opt/Parsr# npm run run:debug -- --input-file samples/lukr.pdf --output-folder samples/ --document-name example --config server/defaultConfigSafe.json --pretty-logs

parsr@1.1.2 run:debug
ts-node server/bin/index.ts "--input-file" "samples/lukr.pdf" "--output-folder" "samples/" "--document-name" "example" "--config" "server/defaultConfigSafe.json" "--pretty-logs"

[2021-12-29T21:13:25] INFO (parsr): Current version: [3a2193b] (HEAD -> master, origin/master, origin/HEAD) - Merge pull request #553 from axa-group/dependabot/npm_and_yarn/aws-sdk-2.814.0 - (Sun Dec 5 12:38:33 2021 +0100, GitHub noreply@github.com)
[2021-12-29T21:13:25] INFO (parsr): Using config:
[2021-12-29T21:13:25] INFO (parsr): Config {
  version: 0.9,
  cleaner: [],
  extractor: {
    pdf: 'pdfminer',
    ocr: 'tesseract',
    language: [
      'ces',
      'eng',
      'deu'
    ]
  },
  output: {
    granularity: 'word',
    includeMarginals: false,
    includeDrawings: false,
    formats: {
      json: false,
      text: true,
      csv: false,
      markdown: false,
      pdf: false,
      simpleJson: false
    }
  }
}
[2021-12-29T21:13:25] INFO (parsr): Using extractor: PdfminerExtractor
[2021-12-29T21:13:25] INFO (parsr): executing command: qpdf --decrypt --no-warn /opt/Parsr/samples/lukr.pdf /tmp/4953f35fe137466ea693cb82367e3e.pdf
[2021-12-29T21:13:25] INFO (parsr): Qpdf repair succeed --> /tmp/4953f35fe137466ea693cb82367e3e.pdf
[2021-12-29T21:13:25] INFO (parsr): executing command: mutool clean -g /tmp/4953f35fe137466ea693cb82367e3e.pdf /tmp/854aed90a5fcd98dd61f1f8c10cc4d.pdf
[2021-12-29T21:13:25] INFO (parsr): Mutool clean succeed --> /tmp/854aed90a5fcd98dd61f1f8c10cc4d.pdf
[2021-12-29T21:13:25] INFO (parsr): executing command: python3 /opt/Parsr/server/assets/PdfPageNumber.py /tmp/854aed90a5fcd98dd61f1f8c10cc4d.pdf
[2021-12-29T21:13:25] INFO (parsr): Pages number extraction succeed
[2021-12-29T21:13:25] INFO (parsr): Extracting contents (1 pages) with pdfminer's pdf2txt.py tool...
[2021-12-29T21:13:25] INFO (parsr): executing command: mutool extract -r /tmp/854aed90a5fcd98dd61f1f8c10cc4d.pdf
[2021-12-29T21:13:25] INFO (parsr): PdfMiner extracting contents (pages 1 to 1)
[2021-12-29T21:13:26] INFO (parsr): executing command: python3 /usr/local/bin/pdf2txt.py -p 1 --detect-vertical -R 0 -c utf-8 -t xml --word-margin 0.2 -o /tmp/af1271a7531fb6b507bb70a52d7ec0.xml /tmp/854aed90a5fcd98dd61f1f8c10cc4d.pdf
[2021-12-29T21:13:26] INFO (parsr): Mutool extract succeed --> /tmp/95a5939731a01852ad5376c3c71af4
[2021-12-29T21:13:26] INFO (parsr): PdfMiner pdf2txt.py succeed --> /tmp/af1271a7531fb6b507bb70a52d7ec0.xml
[2021-12-29T21:13:26] INFO (parsr): Saving response for key: "pdf2txt -p 1 --detect-vertical -R 0 -c utf-8 -t xml --word-margin 0.2 -o /tmp/854aed90a5fcd98dd61f1f8c10cc4d.pdf"
[2021-12-29T21:13:26] INFO (parsr): PdfMiner xml: 0.452s
[2021-12-29T21:13:26] INFO (parsr): Sanitize XML: 0.007s
[2021-12-29T21:13:26] INFO (parsr): Saving response for key: "sanitizeXML-/tmp/af1271a7531fb6b507bb70a52d7ec0.xml"
[2021-12-29T21:13:26] INFO (parsr): Xml to Js: 0.1s
[2021-12-29T21:13:26] INFO (parsr): Js to Document: 0.016s
[2021-12-29T21:13:26] INFO (parsr): Page rotation detection and correction finished in 0.002 s
[2021-12-29T21:13:26] INFO (parsr): Extracting SVG paths with pdfminer...
[2021-12-29T21:13:26] INFO (parsr): PdfMiner extracting contents (pages 1 to 1)
[2021-12-29T21:13:26] INFO (parsr): Returning cached data for key: "pdf2txt -p 1 --detect-vertical -R 0 -c utf-8 -t xml --word-margin 0.2 -o /tmp/854aed90a5fcd98dd61f1f8c10cc4d.pdf"
[2021-12-29T21:13:26] INFO (parsr): PdfMiner xml: 0.001s
[2021-12-29T21:13:26] INFO (parsr): Returning cached data for key: "sanitizeXML-/tmp/af1271a7531fb6b507bb70a52d7ec0.xml"
[2021-12-29T21:13:26] INFO (parsr): SVGs extraction time: 0.059s
`
lukr.pdf
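
To reproduce the hang without watching the process for hours, one option is to wrap the command above in a hard time limit. The following is only a minimal sketch and not part of Parsr: `run_with_timeout` is a hypothetical helper, the 120-second limit in the usage comment is an arbitrary assumption, and killing the whole process group assumes a POSIX system such as the Linux host shown in the log.

```python
import os
import signal
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (status, exit_code, output).

    status is "finished" if the command exited within timeout_s seconds,
    or "hung" if it had to be killed. Hypothetical helper, POSIX only.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        # Put the command in its own session/process group so that
        # children it spawns (npm -> ts-node -> node) can be killed too.
        start_new_session=True,
    )
    try:
        output, _ = proc.communicate(timeout=timeout_s)
        return "finished", proc.returncode, output
    except subprocess.TimeoutExpired:
        # The process is its own group leader, so its pid is the group id.
        os.killpg(proc.pid, signal.SIGKILL)
        output, _ = proc.communicate()  # collect whatever was logged so far
        return "hung", None, output


# Usage, with the command copied from the report (run from /opt/Parsr):
#
# status, code, output = run_with_timeout(
#     ["npm", "run", "run:debug", "--",
#      "--input-file", "samples/lukr.pdf",
#      "--output-folder", "samples/",
#      "--document-name", "example",
#      "--config", "server/defaultConfigSafe.json",
#      "--pretty-logs"],
#     timeout_s=120,  # arbitrary limit, an assumption
# )
# print(status, code)
# print(output[-2000:])  # tail of the log, i.e. the last step reached
```

If the status comes back as "hung", the tail of the captured output shows the last step logged before the process stopped making progress, which for this file would be expected to match the final `SVGs extraction time` line above.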