Textract dependency issue; Wagtail version dependency

Question

DanielSwain opened this issue 5 years ago

I’m working to set up Wagtail Textract. I use pipenv and was getting package mismatch errors because the Textract release on PyPI has not been updated to match the latest code in the repo at https://github.com/deanmalmgren/textract (there was a chardet dependency error). However, @deanmalmgren’s repo DOES have an updated chardet dependency (3.0.4, the latest at this point), so I was able to get around all but one of the errors by installing directly from the repo:
pip install git+https://github.com/deanmalmgren/textract.git --upgrade

One remaining error (I’m at the latest Wagtail, 2.4):

wagtail-textract 1.0 has requirement wagtail<2.2,>=2, but you'll have wagtail 2.4 which is incompatible.
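
To make the conflict concrete: the requirement `wagtail<2.2,>=2` is a pair of clauses that must both hold, and 2.4 passes `>=2` but fails `<2.2`. Below is a minimal sketch of that check, assuming plain dotted integer versions only. The helper names (`parse_version`, `compare`, `satisfies`) are made up for illustration; this is not how pip itself evaluates specifiers, which follows the full PEP 440 rules.

```python
import operator

# Two-character operators come first so "<=" is not mistaken for "<".
OPERATORS = [
    ("<=", operator.le),
    (">=", operator.ge),
    ("==", operator.eq),
    ("<", operator.lt),
    (">", operator.gt),
]


def parse_version(text):
    """Turn '2.4' into (2, 4). Handles plain dotted integers only (an assumption)."""
    return tuple(int(part) for part in text.strip().split("."))


def compare(version, bound, op):
    """Pad both tuples with zeros to equal length so that '2' compares like '2.0'."""
    width = max(len(version), len(bound))
    padded_version = version + (0,) * (width - len(version))
    padded_bound = bound + (0,) * (width - len(bound))
    return op(padded_version, padded_bound)


def satisfies(version, specifier):
    """Return True only if the version meets every comma-separated clause."""
    installed = parse_version(version)
    for clause in specifier.split(","):
        clause = clause.strip()
        for symbol, op in OPERATORS:
            if clause.startswith(symbol):
                bound = parse_version(clause[len(symbol):])
                if not compare(installed, bound, op):
                    return False
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True


print(satisfies("2.4", "<2.2,>=2"))  # False: 2.4 is not below 2.2
print(satisfies("2.1", "<2.2,>=2"))  # True
```

Dropping the `<2.2` clause, as requested in this issue, leaves only `>=2`, which 2.4 satisfies.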

Would you be willing to remove the wagtail<2.2 dependency? If not, I could do a little testing for you by forking and removing that dependency and installing from my fork, but my testing wouldn’t be extensive. I would have around a hundred documents that I could run the transcription command on, but none of them would require OCR.

I would be willing to propose a rewrite of your installation instructions based on the above (you could likely drop the notes about incompatibility errors).

Kees Hink · Answer 1 · Wed Apr 10 2019 20:13:04 GMT+0800 (China Standard Time)

@DanAtShenTech Yes, I'm completely okay with removing that restriction. Not sure why it's in there; it may have been a conservative move. But it looks like there's no reason for it now.

We'd have to update the build matrix as well.

I'd be happy to accept a PR.

Dan Swain · Answer 2 · Thu Apr 11 2019 00:26:12 GMT+0800 (China Standard Time)

I've submitted a PR. Would you be willing to update the install script to install from git+https://github.com/deanmalmgren/textract.git rather than from PyPI? I imagine this is non-standard, but if it were done, I could remove the notes about errors from the install instructions and add a note that textract is installed from @deanmalmgren's GitHub repo because the PyPI release is not being kept up to date.

Kees Hink · Answer 3 · Sat Apr 13 2019 18:37:16 GMT+0800 (China Standard Time)

Hi Dan,

That does not seem like the proper solution. But maybe you could document the issues you have with textract itself, and show how users can install it directly from VCS to solve these issues, in the README?
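
For a scripted setup, installing textract directly from VCS could look roughly like the sketch below. It only wraps the `pip install git+... --upgrade` command quoted at the top of this issue; the function names (`build_vcs_install_command`, `install_from_vcs`) are hypothetical and not part of wagtail-textract or textract, and it assumes `git` and `pip` are available in the active environment.

```python
import subprocess
import sys

# Repository URL taken from the issue text above.
TEXTRACT_REPO = "https://github.com/deanmalmgren/textract.git"


def build_vcs_install_command(repo_url, upgrade=True):
    """Return the pip command that installs a package straight from a git repo."""
    # sys.executable keeps the install inside the currently active environment.
    command = [sys.executable, "-m", "pip", "install", f"git+{repo_url}"]
    if upgrade:
        command.append("--upgrade")
    return command


def install_from_vcs(repo_url, upgrade=True):
    """Run the install; raises CalledProcessError if pip exits with an error."""
    subprocess.check_call(build_vcs_install_command(repo_url, upgrade))


# Example (not run here, since it needs network access and git):
# install_from_vcs(TEXTRACT_REPO)
```

Building the command separately from running it makes it easy to print or log the exact command before anything is installed.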

Dan Swain · Answer 4 · Mon Apr 15 2019 20:58:43 GMT+0800 (China Standard Time)

OK Kees. As soon as you post to PyPI, I'll go through the whole process of installing and then provide a PR for an update to the README.

Dan Swain · Answer 5 · Wed Apr 17 2019 08:18:50 GMT+0800 (China Standard Time)

I want to bring some information I just discovered to the attention of anyone reading this issue. Back in 2016 @deanmalmgren called for someone to take over the Textract repo. He tweeted about this need as recently as April 9, 2019. A review of his commit history shows his last commit to the Textract repo was in the summer of 2017.

While I've been able to get document extraction to work somewhat well using wagtail_textract, it feels pretty brittle. I still haven't gotten OCR to work when uploading a file, and OCR'ed data is not saved with the PDF - see this issue. Also, I use pipenv and can't yet produce a Pipfile.lock to use in production because of dependency issues related to the repo not being kept up to date.

I'm not at a point where I could take over maintenance of this repo, but I particularly wanted to point this problem out to @khink in case he is in a position to. One dependency update that would be nice is moving from Tesseract 3.x to the latest 4.x.