fourdigits / wagtail_textract

Text extraction for Wagtail document search

Textract dependency issue; Wagtail version dependency

DanielSwain opened this issue

I’m working to set up Wagtail Textract. I use pipenv and was getting package mismatch errors because the Textract release on PyPI has not been updated from the latest repo at https://github.com/deanmalmgren/textract (there was a chardet dependency error). However, @deanmalmgren’s repo DOES have an updated chardet dependency (3.0.4, the latest at this point), so I was able to get around all but one of the errors by installing directly from the repo:
pip install git+https://github.com/deanmalmgren/textract.git --upgrade

One remaining error (I’m at the latest Wagtail, 2.4):

wagtail-textract 1.0 has requirement wagtail<2.2,>=2, but you'll have wagtail 2.4 which is incompatible.
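
To illustrate why this conflict is reported, here is a minimal sketch of the range check implied by `wagtail<2.2,>=2`. It is a simplification, not how pip actually resolves versions: the helper names `parse_version` and `satisfies` are hypothetical, and it only handles plain dotted numeric versions.

```python
def parse_version(text):
    """Turn a dotted version such as '2.4' into a tuple (2, 4) for numeric comparison."""
    return tuple(int(part) for part in text.split("."))


def satisfies(version, lower_inclusive, upper_exclusive):
    """Return True if lower_inclusive <= version < upper_exclusive."""
    parsed = parse_version(version)
    return parse_version(lower_inclusive) <= parsed < parse_version(upper_exclusive)


# The constraint reported for wagtail-textract 1.0: wagtail>=2,<2.2
for candidate in ("2.0", "2.1.3", "2.2", "2.4"):
    print(candidate, satisfies(candidate, "2", "2.2"))
```

Under these assumptions, 2.0 and 2.1.3 fall inside the allowed range, while 2.2 and 2.4 fall outside it, which is why Wagtail 2.4 is flagged as incompatible until the upper bound is removed.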

Would you be willing to remove the wagtail<2.2 dependency? If not, I could do a little testing for you by forking and removing that dependency and installing from my fork, but my testing wouldn’t be extensive. I would have around a hundred documents that I could run the transcription command on, but none of them would require OCR.

I would be willing to propose a rewrite of your installation instructions based on the above (you could likely drop the statements about incompatibility errors).

@DanAtShenTech Yes, I'm completely okay with removing that restriction. I'm not sure why it's in there; it may have been a conservative move. But it looks like there's no reason for it now.

We'd have to update the build matrix as well.

I'd be happy to accept a PR.

I've submitted a PR. Would you be willing to update the install script to install from git+https://github.com/deanmalmgren/textract.git rather than from PyPI? I imagine this is non-standard, but if it were done, I could remove the notes about errors from the install instructions and add a note that textract is installed from @deanmalmgren's GitHub repo because the PyPI release is not being kept up-to-date.

Hi Dan,

That does not seem to be the proper solution. But maybe you could document in the README the issues you have with textract itself, and show how users can install it directly from VCS to solve these issues?

OK Kees. As soon as you post to PyPI, I'll go through the whole process of installing and then provide a PR for an update to the README.

I wanted to bring some information that I just discovered to the attention of anyone reading this issue. Back in 2016 @deanmalmgren called for someone to take over the Textract repo. He tweeted about this need as recently as April 9, 2019. A review of his commit history shows his last commit to the Textract repo was in the summer of 2017. While I've been able to get document extraction to work somewhat well using wagtail_textract, it feels pretty brittle. I still haven't gotten OCR to work when uploading a file, though, and OCR'ed data is not saved with the PDF - see this issue. Also, I use pipenv and can't yet produce a Pipfile.lock to use in production because of dependency issues related to the repo not being kept up-to-date. I'm not at a point where I could take over maintenance of this repo, but I particularly wanted to point this problem out to @khink in case he is. One dependency that would be nice to update is Tesseract, moving from 3.x to the latest 4.x.