chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

Compatibility with Apache Tika version 2.1.0

bikashg opened this issue · comments

Hi @chrismattmann ,

Fantastic library! I was wondering whether you have near-term plans or a roadmap to make it compatible with Apache Tika version 2.1.0.

I used the tika-server-standard-2.1.0.jar file from https://tika.apache.org/download.html to run locally on my machine but get the following error:

```
>>> import os
>>> os.environ["TIKA_SERVER_JAR"] = "file:////home/bikashg/work/tealbear/tika-server-standard-2.1.0.jar"
>>> import tika
>>> tika.initVM()
>>> from tika import parser
>>> parsed1 = parser.from_file('notes.txt')
2021-11-16 16:46:31,249 [MainThread  ] [INFO ]  Retrieving file:////home/bikashg/work/tealbear/tika-server-standard-2.1.0.jar to /tmp/tika-server.jar.
2021-11-16 16:46:31,309 [MainThread  ] [INFO ]  Retrieving file:////home/bikashg/work/tealbear/tika-server-standard-2.1.0.jar.md5 to /tmp/tika-server.jar.md5.
2021-11-16 16:46:31,410 [MainThread  ] [INFO ]  Retrieving file:////home/bikashg/work/tealbear/tika-server-standard-2.1.0.jar to /tmp/tika-server.jar.
2021-11-16 16:46:31,456 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...
2021-11-16 16:46:36,462 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...
2021-11-16 16:46:41,467 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...
2021-11-16 16:46:46,472 [MainThread  ] [ERROR]  Tika startup log message not received after 3 tries.
2021-11-16 16:46:46,473 [MainThread  ] [ERROR]  Failed to receive startup confirmation from startServer.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bikashg/miniconda3/envs/temp_bikash2/lib/python3.9/site-packages/tika/parser.py", line 40, in from_file
    output = parse1(service, filename, serverEndpoint, headers=headers, config_path=config_path, requestOptions=requestOptions)
  File "/home/bikashg/miniconda3/envs/temp_bikash2/lib/python3.9/site-packages/tika/tika.py", line 336, in parse1
    status, response = callServer('put', serverEndpoint, service, f,
  File "/home/bikashg/miniconda3/envs/temp_bikash2/lib/python3.9/site-packages/tika/tika.py", line 531, in callServer
    serverEndpoint = checkTikaServer(scheme, serverHost, port, tikaServerJar, classpath, config_path)
  File "/home/bikashg/miniconda3/envs/temp_bikash2/lib/python3.9/site-packages/tika/tika.py", line 601, in checkTikaServer
    raise RuntimeError("Unable to start Tika server.")
RuntimeError: Unable to start Tika server.
>>> exit()```
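The log above shows tika-python retrieving both the jar named in `TIKA_SERVER_JAR` and a `.md5` file next to it. Before digging into the startup failure, it can help to rule out a bad local path. The following is a minimal sketch using only the Python standard library; `check_local_jar` is a hypothetical helper name, not part of tika-python, and the assumption that the `.md5` sidecar holds a hex digest as its first token is mine. It does not address the missing startup log message itself.

```python
import hashlib
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


def check_local_jar(jar_url):
    """Report whether a file:// jar URL and its .md5 sidecar exist and agree.

    Hypothetical diagnostic helper; not part of tika-python.
    """
    parsed = urlparse(jar_url)
    if parsed.scheme != "file":
        raise ValueError("expected a file:// URL, got scheme %r" % parsed.scheme)

    # Convert the URL path back into a local filesystem path.
    jar_path = Path(url2pathname(parsed.path))
    md5_path = Path(str(jar_path) + ".md5")

    report = {
        "jar_exists": jar_path.is_file(),
        "md5_exists": md5_path.is_file(),
        "md5_matches": None,  # stays None unless both files are present
    }
    if report["jar_exists"] and report["md5_exists"]:
        actual = hashlib.md5(jar_path.read_bytes()).hexdigest()
        # Assumption: the sidecar's first whitespace-separated token is the hex digest.
        tokens = md5_path.read_text().split()
        expected = tokens[0].lower() if tokens else ""
        report["md5_matches"] = actual == expected
    return report
```

For example, `check_local_jar(os.environ["TIKA_SERVER_JAR"])` returns a small dictionary that can be pasted into a bug report alongside the traceback.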

This should wait until the Tika team releases a version that fixes CVE-2021-44228.
Tika 2.1.0 is the first Tika release to use Log4j2, and its bundled version most likely falls within the Log4j2 versions identified as affected by CVE-2021-44228.

Apache Tika 2.2.1 ships Log4j2 2.17.0, which addresses all but the most recent issue, CVE-2021-44832, which requires an attacker to have access to the actual configuration file. It appears that Tika is in no hurry to release a new version with Log4j2 2.17.1. So the question is: will tika-python wait until that happens, even given the vulnerabilities that Apache Tika 1.24.1 has?

Tika 2.3.0 upgraded to Log4j 2.17.1, so that seems to resolve the remaining issue here.

Moreover, we're now at Tika 2.4.1, and 1.x will stop receiving updates in 3 weeks (Sept 30, 2022). So we really need to make this project compatible with the latest versions.

@chrismattmann thanks for all you've done, but could you please give us some guidance as to whether this project is completely abandoned now? Should those who are using it make other plans - be it forking it or something else?

@nickchomey @bikashg Our requests have been heard and it's now an active WIP (see #377).

Thanks, and sorry for the delays on updates. I will spend some time over the winter holidays getting this merged.

OK, not in this release (which is going to be 1.24.2), but I have 2 PRs I will look at for the 2.6.x release, which I will make next week. Thanks. This 1.24.2 release will include all the updates from the past 2 years that haven't been released.