chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

Docker Tika-server PDF OCR

RNWTenor opened this issue

Can someone assist? I am trying to get tika-python to return JSON with metadata and text when using the Docker image of Tika. I can get the results I want using the curl command, but not with Python, which returns only empty content.


RESULTS for a 2-page non-searchable PDF:

PYTHON:

from tika import parser

# Ask the Tika server to OCR the PDF rather than rely on its (empty) text layer.
headers = {"X-Tika-PDFextractInlineImages": "true", "X-Tika-PDFocrStrategy": "OCR_ONLY"}
parsed = parser.from_file(
    "sample_notext.pdf",
    serverEndpoint="http://localhost:9998/rmeta",
    headers=headers,
)
print(parsed["content"])

sample_notext.pdf

CURL:

>curl -H "X-Tika-PDFextractInlineImages: true" -H "X-Tika-PDFocrStrategy: OCR_ONLY" -H "Accept: application/json" -T sample_notext.pdf localhost:9998/tika | json_pp

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 81740    0  6296  100 75444   1253  15025  0:00:05  0:00:05 --:--:--  1648
{
   "Author" : "rober",
   "Content-Type" : "application/pdf",
   "Creation-Date" : "2022-02-26T15:38:16Z",
   "Last-Modified" : "2022-02-26T15:38:16Z",
   "Last-Save-Date" : "2022-02-26T15:38:16Z",
   "X-Parsed-By" : [
      "org.apache.tika.parser.DefaultParser",
      "org.apache.tika.parser.pdf.PDFParser",
      "class org.apache.tika.parser.ocr.TesseractOCRParser"
   ],
   "X-TIKA:content" : "<html xmlns=\"http://www.w3.org/1999/xhtml\">\n<head>\n<meta name=\"date\" content=\"2022-02-26T15:38:16Z\" />\n<meta name=\"pdf:PDFVersion\" content=\"1.7\" />\n<meta name=\"pdf:docinfo:title\" content=\"sample_notext.pdf\" />\n<meta name=\"pdf:hasXFA\" content=\"false\" />\n<meta name=\"access_permission:modify_annotations\" content=\"true\" />\n<meta name=\"access_permission:can_print_degraded\" content=\"true\" />\n<meta name=\"dc:creator\" content=\"rober\" />\n<meta name=\"dcterms:created\" content=\"2022-02-26T15:38:16Z\" />\n<meta name=\"Last-Modified\" content=\"2022-02-26T15:38:16Z\" />\n<meta name=\"dcterms:modified\" content=\"2022-02-26T15:38:16Z\" />\n<meta name=\"dc:format\" content=\"application/pdf; version=1.7\" />\n<meta name=\"Last-Save-Date\" content=\"2022-02-26T15:38:16Z\" />\n<meta name=\"access_permission:fill_in_form\" content=\"true\" />\n<meta name=\"pdf:docinfo:modified\" content=\"2022-02-26T15:38:16Z\" />\n<meta name=\"meta:save-date\" content=\"2022-02-26T15:38:16Z\" />\n<meta name=\"pdf:encrypted\" content=\"false\" />\n<meta name=\"dc:title\" content=\"sample_notext.pdf\" />\n<meta name=\"modified\" content=\"2022-02-26T15:38:16Z\" />\n<meta name=\"pdf:hasMarkedContent\" content=\"false\" />\n<meta name=\"Content-Type\" content=\"application/pdf\" />\n<meta name=\"pdf:docinfo:creator\" content=\"rober\" />\n<meta name=\"X-Parsed-By\" content=\"org.apache.tika.parser.DefaultParser\" />\n<meta name=\"X-Parsed-By\" content=\"org.apache.tika.parser.pdf.PDFParser\" />\n<meta name=\"X-Parsed-By\" content=\"class org.apache.tika.parser.ocr.TesseractOCRParser\" />\n<meta name=\"creator\" content=\"rober\" />\n<meta name=\"meta:author\" content=\"rober\" />\n<meta name=\"meta:creation-date\" content=\"2022-02-26T15:38:16Z\" />\n<meta name=\"created\" content=\"2022-02-26T15:38:16Z\" />\n<meta name=\"access_permission:extract_for_accessibility\" content=\"true\" />\n<meta name=\"access_permission:assemble_document\" 
content=\"true\" />\n<meta name=\"xmpTPg:NPages\" content=\"2\" />\n<meta name=\"Creation-Date\" content=\"2022-02-26T15:38:16Z\" />\n<meta name=\"pdf:hasXMP\" content=\"false\" />\n<meta name=\"access_permission:extract_content\" content=\"true\" />\n<meta name=\"access_permission:can_print\" content=\"true\" />\n<meta name=\"Author\" content=\"rober\" />\n<meta name=\"producer\" content=\"Microsoft: Print To PDF\" />\n<meta name=\"access_permission:can_modify\" content=\"true\" />\n<meta name=\"pdf:docinfo:producer\" content=\"Microsoft: Print To PDF\" />\n<meta name=\"pdf:docinfo:created\" content=\"2022-02-26T15:38:16Z\" />\n<title>sample_notext.pdf</title>\n</head>\n<body><div class=\"page\"><div class=\"ocr\">A Simple PDF File\n\nThis is a small demonstration .pdf file -\n\njust for use in the Virtual Mechanics tutorials. More text. And more\ntext. And more text. And more text. And more text.\n\nAnd more text. And more text. And more text. And more text. And more\ntext. And more text. Boring, zzzzz. And more text. And more text. And\nmore text. And more text. And more text. And more text. And more text.\nAnd more text. And more text.\n\nAnd more text. And more text. And more text. And more text. And more\ntext. And more text. And more text. Even more. Continued on page 2...\n</div>\n</div>\n<div class=\"page\"><div class=\"ocr\">simple PDF File 2\n\n...continued from page 1. Yet more text. And more text. And more text.\nAnd more text. And more text. And more text. And more text. And more\ntext. Oh, how boring typing this stuff. But not as boring as watching\npaint dry. And more text. And more text. And more text. And more text.\nBoring. More, a little more text. The end, and just as well.\n</div>\n</div>\n</body></html>",
   "access_permission:assemble_document" : "true",
   "access_permission:can_modify" : "true",
   "access_permission:can_print" : "true",
   "access_permission:can_print_degraded" : "true",
   "access_permission:extract_content" : "true",
   "access_permission:extract_for_accessibility" : "true",
   "access_permission:fill_in_form" : "true",
   "access_permission:modify_annotations" : "true",
   "created" : "2022-02-26T15:38:16Z",
   "creator" : "rober",
   "date" : "2022-02-26T15:38:16Z",
   "dc:creator" : "rober",
   "dc:format" : "application/pdf; version=1.7",
   "dc:title" : "sample_notext.pdf",
   "dcterms:created" : "2022-02-26T15:38:16Z",
   "dcterms:modified" : "2022-02-26T15:38:16Z",
   "meta:author" : "rober",
   "meta:creation-date" : "2022-02-26T15:38:16Z",
   "meta:save-date" : "2022-02-26T15:38:16Z",
   "modified" : "2022-02-26T15:38:16Z",
   "pdf:PDFVersion" : "1.7",
   "pdf:charsPerPage" : [
      "0",
      "0"
   ],
   "pdf:docinfo:created" : "2022-02-26T15:38:16Z",
   "pdf:docinfo:creator" : "rober",
   "pdf:docinfo:modified" : "2022-02-26T15:38:16Z",
   "pdf:docinfo:producer" : "Microsoft: Print To PDF",
   "pdf:docinfo:title" : "sample_notext.pdf",
   "pdf:encrypted" : "false",
   "pdf:hasMarkedContent" : "false",
   "pdf:hasXFA" : "false",
   "pdf:hasXMP" : "false",
   "pdf:unmappedUnicodeCharsPerPage" : [
      "0",
      "0"
   ],
   "producer" : "Microsoft: Print To PDF",
   "title" : "sample_notext.pdf",
   "xmpTPg:NPages" : "2"
}
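
For comparison, the curl call can be reproduced from Python without tika-python, which may help tell a server-side problem from a client-side one. The following is only a minimal sketch using the standard library: the helper names `curl_equivalent_headers` and `fetch_tika_json` are hypothetical, and the host, port, path, and headers are copied from the curl command above.

```python
import http.client
import json


def curl_equivalent_headers():
    """Headers copied from the curl command (hypothetical helper)."""
    return {
        "X-Tika-PDFextractInlineImages": "true",
        "X-Tika-PDFocrStrategy": "OCR_ONLY",
        "Accept": "application/json",
    }


def fetch_tika_json(path, host="localhost", port=9998):
    """PUT the file to /tika, as `curl -T` does, and decode the JSON reply."""
    with open(path, "rb") as fh:
        body = fh.read()
    # http.client sends header names exactly as written in the dict.
    conn = http.client.HTTPConnection(host, port, timeout=120)
    try:
        conn.request("PUT", "/tika", body=body, headers=curl_equivalent_headers())
        response = conn.getresponse()
        return json.loads(response.read().decode("utf-8"))
    finally:
        conn.close()
```

With a Tika server running, `fetch_tika_json("sample_notext.pdf")["X-TIKA:content"]` should correspond to the `X-TIKA:content` field in the curl output.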

Update: I have now been able to return both metadata and content in a single JSON dump, but I am still not able to force OCR using the Python CLI or API.

import json

from tika import parser

headers = {"X-Tika-PDFextractInlineImages": "true", "X-Tika-PDFocrStrategy": "OCR_ONLY"}
parsed = parser.from_file(
    "sample_notext.pdf",
    serverEndpoint="http://localhost:9998/rmeta",
    service="all",
    headers=headers,
)
# Pretty-print the combined metadata and content returned by the server.
pretty_parsed = json.dumps(parsed, indent=2)
print(pretty_parsed)

Returns:

{
  "metadata": {
    "Author": "rober",
    "Content-Type": "application/pdf",
    "Creation-Date": "2022-02-26T15:38:16Z",
    "Last-Modified": "2022-02-26T15:38:16Z",
    "Last-Save-Date": "2022-02-26T15:38:16Z",
    "X-Parsed-By": [
      "org.apache.tika.parser.DefaultParser",
      "org.apache.tika.parser.pdf.PDFParser",
      "class org.apache.tika.parser.ocr.TesseractOCRParser"
    ],
    "X-TIKA:EXCEPTION:runtime": "org.apache.commons.io.IOExceptionWithCause: org.apache.tika.exception.TikaException: Tesseract is not available. Please set the OCR_STRATEGY to NO_OCR or configure Tesseract correctly\n\tat org.apache.tika.parser.pdf.OCR2XHTML.processPage(OCR2XHTML.java:100)\n\tat org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:963)\n\tat org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)\n\tat org.apache.tika.parser.pdf.OCR2XHTML.process(OCR2XHTML.java:66)\n\tat org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:167)\n\tat org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)\n\tat org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:233)\n\tat org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:409)\n\tat org.apache.tika.server.resource.RecursiveMetadataResource.parseMetadata(RecursiveMetadataResource.java:147)\n\tat org.apache.tika.server.resource.RecursiveMetadataResource.getMetadata(RecursiveMetadataResource.java:123)\n\tat jdk.internal.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)\n\tat java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.base/java.lang.reflect.Method.invoke(Method.java:566)\n\tat org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)\n\tat org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)\n\tat org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:201)\n\tat org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:104)\n\tat org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)\n\tat 
org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)\n\tat org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)\n\tat org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)\n\tat org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)\n\tat org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)\n\tat org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1297)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1212)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:221)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)\n\tat org.eclipse.jetty.server.Server.handle(Server.java:500)\n\tat org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:383)\n\tat org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:547)\n\tat org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:375)\n\tat org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:270)\n\tat org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)\n\tat org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)\n\tat org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)\n\tat 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)\n\tat org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)\n\tat org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)\n\tat org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)\n\tat org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:388)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:806)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938)\n\tat java.base/java.lang.Thread.run(Thread.java:834)\nCaused by: org.apache.tika.exception.TikaException: Tesseract is not available. Please set the OCR_STRATEGY to NO_OCR or configure Tesseract correctly\n\tat org.apache.tika.parser.pdf.AbstractPDF2XHTML.doOCROnCurrentPage(AbstractPDF2XHTML.java:433)\n\tat org.apache.tika.parser.pdf.OCR2XHTML.processPage(OCR2XHTML.java:97)\n\t... 49 more\n",
    "X-TIKA:content_handler": "ToTextContentHandler",
    "X-TIKA:embedded_depth": "0",
    "X-TIKA:parse_time_millis": "8",
    "access_permission:assemble_document": "true",
    "access_permission:can_modify": "true",
    "access_permission:can_print": "true",
    "access_permission:can_print_degraded": "true",
    "access_permission:extract_content": "true",
    "access_permission:extract_for_accessibility": "true",
    "access_permission:fill_in_form": "true",
    "access_permission:modify_annotations": "true",
    "created": "2022-02-26T15:38:16Z",
    "creator": "rober",
    "date": "2022-02-26T15:38:16Z",
    "dc:creator": "rober",
    "dc:format": "application/pdf; version=1.7",
    "dc:title": "sample_notext.pdf",
    "dcterms:created": "2022-02-26T15:38:16Z",
    "dcterms:modified": "2022-02-26T15:38:16Z",
    "meta:author": "rober",
    "meta:creation-date": "2022-02-26T15:38:16Z",
    "meta:save-date": "2022-02-26T15:38:16Z",
    "modified": "2022-02-26T15:38:16Z",
    "pdf:PDFVersion": "1.7",
    "pdf:docinfo:created": "2022-02-26T15:38:16Z",
    "pdf:docinfo:creator": "rober",
    "pdf:docinfo:modified": "2022-02-26T15:38:16Z",
    "pdf:docinfo:producer": "Microsoft: Print To PDF",
    "pdf:docinfo:title": "sample_notext.pdf",
    "pdf:encrypted": "false",
    "pdf:hasMarkedContent": "false",
    "pdf:hasXFA": "false",
    "pdf:hasXMP": "false",
    "producer": "Microsoft: Print To PDF",
    "resourceName": "b'sample_notext.pdf'",
    "title": "sample_notext.pdf",
    "xmpTPg:NPages": "2"
  },
  "content": "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nsample_notext.pdf\n\n",
  "status": 200
}
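
Note that the result above reports status 200 even though the metadata carries an `X-TIKA:EXCEPTION:runtime` entry saying Tesseract is not available, so the failure is easy to miss. A small check along these lines could surface it. This is a sketch, not part of tika-python: `find_tika_exceptions` is a hypothetical helper, and the sample dictionary is a trimmed copy of the output above.

```python
def find_tika_exceptions(parsed):
    """Return {metadata key: first line of message} for X-TIKA:EXCEPTION:* entries."""
    metadata = parsed.get("metadata") or {}
    found = {}
    for key, value in metadata.items():
        if key.startswith("X-TIKA:EXCEPTION:"):
            # Values are normally strings; join lists defensively.
            text = value if isinstance(value, str) else " ".join(map(str, value))
            found[key] = text.splitlines()[0] if text else ""
    return found


# Trimmed copy of the result shown above (stack trace shortened).
sample = {
    "metadata": {
        "X-TIKA:EXCEPTION:runtime": (
            "org.apache.commons.io.IOExceptionWithCause: "
            "org.apache.tika.exception.TikaException: Tesseract is not available. "
            "Please set the OCR_STRATEGY to NO_OCR or configure Tesseract correctly\n"
            "\tat org.apache.tika.parser.pdf.OCR2XHTML.processPage(OCR2XHTML.java:100)"
        ),
        "title": "sample_notext.pdf",
    },
    "content": "\n\nsample_notext.pdf\n\n",
    "status": 200,
}

for key, message in find_tika_exceptions(sample).items():
    print(f"{key}: {message}")
```

Calling `find_tika_exceptions(parsed)` on the real result before using `parsed["content"]` would flag the Tesseract problem directly.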

Make sure you are using the Tika image that includes OCR. Currently the latest available is 1.28.2-full.

Just as a note, if you set X-Tika-PDFextractInlineImages and X-Tika-PDFocrStrategy at the same time, both will be executed, which may slow down text extraction.

Note: These two options are independent. If you set extractInlineImages to true and select an OcrStrategy that includes OCR on the rendered page, Tika will run OCR on the extracted inline images and the rendered page.
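
Since the two options are independent, one way to avoid the double OCR pass is to send only the header you actually need. The sketch below assumes a hypothetical helper, `build_pdf_ocr_headers`, that is not part of tika-python. The header names and the `OCR_ONLY` value are the ones used earlier in this thread.

```python
def build_pdf_ocr_headers(ocr_strategy=None, extract_inline_images=False):
    """Build X-Tika-PDF* request headers, adding each option only when requested."""
    headers = {}
    if ocr_strategy is not None:
        headers["X-Tika-PDFocrStrategy"] = ocr_strategy
    if extract_inline_images:
        headers["X-Tika-PDFextractInlineImages"] = "true"
    return headers


# OCR of the rendered page only, without a second pass over inline images.
page_only = build_pdf_ocr_headers(ocr_strategy="OCR_ONLY")

# Both options together, as in the original question.
both = build_pdf_ocr_headers(ocr_strategy="OCR_ONLY", extract_inline_images=True)

print(page_only)
print(both)
```

Either dictionary can be passed as `headers=` to `parser.from_file`, as in the snippets from the question.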

Source

@mfernaal is right, I think it had to do with the docker image that was being used. Thanks.