Search: Extract text from documents

Question

kaedroho opened this issue 10 years ago · comments

By the looks of it, this should be quick and easy to implement now that there's a good Python module that can do most of the work for us (https://github.com/deanmalmgren/textract).

Notes:

textract has many dependencies (including lxml), so it would be a good idea to make it an optional dependency only.
For large PDFs (e.g. ebooks), textract can take a few minutes to extract the text.

Implementation ideas:

Add a method to Document model called extract_text. Calling this will use the textract module to extract text from the document file and return it as a string.
Add a new field to Document called extracted_text. This is to store the extracted text in the database to make indexing the document into the database faster. This field should be nullable (which would indicate that the text hasn't been extracted from the document yet).
Create some kind of background task to extract text from documents.
Add a custom manager for Document and override the get_queryset method to run .defer('extracted_text') on the queryset (https://docs.djangoproject.com/en/dev/ref/models/querysets/#defer). This prevents Django from selecting the extracted text when it's not needed, so large documents don't slow anything down.
Add extracted_text into search_fields of Document.
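
A minimal sketch of the extract_text idea with textract kept optional. The helper name, its process parameter and the choice to return None when textract is missing are assumptions made for illustration, not existing Wagtail code; the only textract call used is textract.process, which returns the extracted text as bytes.

```python
from typing import Callable, Optional

try:
    import textract  # optional dependency: heavy to install (lxml etc.)
except ImportError:
    textract = None


def extract_text(
    path: str, process: Optional[Callable[[str], bytes]] = None
) -> Optional[str]:
    """Return the text of the file at `path`.

    Returns None when no extractor is available, mirroring the nullable
    extracted_text field ("text not extracted yet").
    `process` is a hypothetical hook so another extractor can be swapped in.
    """
    if process is None:
        if textract is None:
            return None
        process = textract.process
    raw = process(path)  # bytes
    return raw.decode("utf-8", errors="replace").strip()
```

Because extraction can take minutes for large PDFs, a helper like this would be called from the background task rather than during a request.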

Bertrand Bordage · Answer 1 · Wed Apr 12 2017 01:27:42 GMT+0800 (China Standard Time)

I agree with all your ideas @kaedroho.

I suggest creating a new command named transcribe_documents and naming the new field transcription instead of extracted_text.
This new command should use the default method for transcription, and do something special for PDFs. When the PDF transcription is extremely small compared to the file size, we fall back to using the Tesseract method.
It should also issue a warning when it meets an unsupported file type and advise users to install the missing dependency. For example, if a document is an OGG file and sox is not installed, we log a warning saying that sox can be installed in order to transcribe this document.
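
The PDF fallback described above could look roughly like this. The threshold value and the names looks_scanned and transcribe_pdf are hypothetical, and the two extractors are passed in as plain callables so the sketch does not depend on any particular library.

```python
import logging
import os
from typing import Callable, Optional

logger = logging.getLogger(__name__)

# Hypothetical threshold: fewer than 1 extracted character per 1000 bytes
# of file is treated as "extremely small compared to the file size".
MIN_CHARS_PER_BYTE = 0.001


def looks_scanned(text: str, file_size: int,
                  min_ratio: float = MIN_CHARS_PER_BYTE) -> bool:
    """Guess whether a PDF is mostly images, based on text/size ratio."""
    if file_size <= 0:
        return False
    return len(text.strip()) / file_size < min_ratio


def transcribe_pdf(path: str,
                   extract: Callable[[str], str],
                   ocr: Callable[[str], str],
                   file_size: Optional[int] = None) -> str:
    """Use the default extractor first, then fall back to OCR if needed."""
    size = os.path.getsize(path) if file_size is None else file_size
    text = extract(path)
    if looks_scanned(text, size):
        logger.warning("%s yielded almost no text; falling back to OCR", path)
        text = ocr(path)
    return text
```

With textract, extract would presumably wrap textract.process(path) and ocr would wrap textract.process(path, method='tesseract'); the exact method names should be checked against the textract documentation.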

We could also use tqdm in this new command to give an estimate of where we are in the transcription and how long it should take.
By the way, we could also use tqdm in update_index instead of the current system. It would be way shorter (only one progress bar for each model, for example) and it would also give an estimate of the time remaining.
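
As a rough illustration of the tqdm idea, a loop in such a command might look like the following. transcribe_all and its arguments are made up for this sketch; tqdm is treated as optional, with a no-op stand-in when it is not installed.

```python
from typing import Callable, Dict, Iterable

try:
    from tqdm import tqdm  # optional: progress bar with a time estimate
except ImportError:
    def tqdm(iterable, **kwargs):
        # Stand-in used when tqdm is not installed: no progress bar.
        return iterable


def transcribe_all(paths: Iterable[str],
                   transcribe: Callable[[str], str]) -> Dict[str, str]:
    """Transcribe every path, showing one progress bar for the whole run."""
    results = {}
    for path in tqdm(list(paths), desc="Transcribing", unit="doc"):
        results[path] = transcribe(path)
    return results
```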

François GUÉRIN · Answer 2 · Fri May 05 2017 22:13:16 GMT+0800 (China Standard Time)

What about PDF OCR?

Tika, combined with Elasticsearch, can extract text from PDFs, including image-only ones such as document scans (thanks to Tesseract). That's why I think an Elasticsearch-based solution would be more transparent.

What's more, the time-consuming analysis is performed inside Elasticsearch, in a separate process, which avoids having to use Celery or cron to manage that processing.

Thanks for your great CMS!

François GUÉRIN · Answer 3 · Fri May 05 2017 22:16:56 GMT+0800 (China Standard Time)

Oops, I just saw that Tesseract is also included in textract.

Bertrand Bordage · Answer 4 · Fri May 05 2017 23:17:48 GMT+0800 (China Standard Time)

Even if we use the elasticsearch mapper-attachments solution, we would still have to write a different solution for PostgreSQL, most probably using textract.

And of course two different solutions to do the same thing would lead to twice as many potential issues, plus different indexing quality/results. On top of that, using textract means we can easily improve it, as it’s written in Python.

Maybe we’ll end up also using mapper-attachments for performance reasons, but it’s not the right way to start, as we’re trying to create a uniform interface for search across backends.

Kees Hink · Answer 5 · Tue May 01 2018 19:56:40 GMT+0800 (China Standard Time)

Tom pointed out this issue to me after my question on Slack.

@kaedroho:

Would it make sense to place this in a separate package (something like wagtail-textract or wagtail-document-extraction)? This would make the dependency optional in an easy way. I'm just wondering if it's easy and recommendable to patch/modify Wagtail's (Abstract)Document model.
If not, and we'd need to get it in Wagtail itself, would we do this with a textract_extras in extras_require, and try: import; except ImportError clauses in the relevant methods?

@BertrandBordage Why transcribe rather than extract?

Kees Hink · Answer 6 · Tue May 01 2018 22:53:59 GMT+0800 (China Standard Time)

We have a use case for building this for a customer. I'd ideally like to build this in such a way that it can be easily used by others, so I'd appreciate any guidance.

As a first attempt, I made a package that overrides Wagtail's Document model. I couldn't find much documentation on it, but this approach seems to work: fourdigits/wagtail_textract@1ab2405#diff-212e85d6b805221168ffba19cccbdea7 and fourdigits/wagtail_textract@1ab2405#diff-55d35ded409e4ba2ffaa719e13674bd9

Is this wise, or is there a better way to override/extend a Wagtail model?

Kees Hink · Answer 7 · Tue May 01 2018 23:24:36 GMT+0800 (China Standard Time)

My colleague Tom suggested an alternative: instead of adding the field to the Document model, we might create a new model (not a Wagtail content type) that has a one-to-one relation with a Document. This model (DocumentText?) would hold the extracted_text.

We'd still need to be able to modify search_fields on the Document though.

Bertrand Bordage · Answer 8 · Wed May 02 2018 05:38:33 GMT+0800 (China Standard Time)

@khink transcription is the term used by librarians & digital humanities researchers for a plain text version of a document, whether that document is a photograph, a video or an audio recording.

In lots of cases in digital humanities, we want to manually write transcriptions instead of using OCR or extracting the already OCRed text from a PDF. For example, medievalists transcribe documents almost impossible to OCR, even today. Or musicians transcribe scores using languages such as LilyPond, again almost impossible to OCR.

That’s why I think it’s better to have an editable transcription field and, for consistency, to use the verb transcribe for the functions/commands that automatically fill the transcription. The transcription method itself should be configurable, of course, so we can specify the backend and its options, for example that we want Tesseract restricted to certain letters and a particular dictionary.
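
One possible shape for a configurable transcription method, as a sketch only. The registry, the BACKEND/OPTIONS keys and the "echo" backend are all hypothetical; a real backend would call Tesseract or textract with the given options.

```python
from typing import Any, Callable, Dict

# Hypothetical registry mapping a backend name to a transcription function.
TRANSCRIBERS: Dict[str, Callable[..., str]] = {}


def register(name: str):
    """Decorator that registers a transcription backend under `name`."""
    def decorator(func: Callable[..., str]) -> Callable[..., str]:
        TRANSCRIBERS[name] = func
        return func
    return decorator


@register("echo")
def echo_transcriber(path: str, **options: Any) -> str:
    # Stand-in backend for the sketch: it only reports what it was asked
    # to do instead of running real OCR.
    return f"{path} transcribed with {sorted(options.items())}"


def transcribe(path: str, config: Dict[str, Any]) -> str:
    """Look up the configured backend and pass its options through."""
    name = config["BACKEND"]
    if name not in TRANSCRIBERS:
        raise ValueError(f"Unknown transcription backend: {name!r}")
    return TRANSCRIBERS[name](path, **config.get("OPTIONS", {}))
```

A project could then keep something like {"BACKEND": "echo", "OPTIONS": {"letters": "abc"}} in its settings, while manually written transcriptions simply bypass the command.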

Kees Hink · Answer 9 · Wed May 02 2018 15:50:39 GMT+0800 (China Standard Time)

Thanks @BertrandBordage for that explanation, I'll include it.

Only now do I see the get_document_model method and the WAGTAILDOCS_DOCUMENT_MODEL setting. No need to override Wagtail's Document model.

Kees Hink · Answer 10 · Tue May 08 2018 20:59:43 GMT+0800 (China Standard Time)

Just a short note: we did an alpha release of https://github.com/fourdigits/wagtail_textract today. We're hoping this may scratch other people's itch as well. Maybe this helps pave the way for getting this functionality into Wagtail core, although there should be a fallback when textract's installation requirements aren't met. We welcome any comments, hints, PRs and other feedback. If at some point the package is deemed good enough that the repository can be placed in the Wagtail organisation on GitHub, I'd welcome that.

Kees Hink · Answer 11 · Mon Jul 23 2018 17:54:15 GMT+0800 (China Standard Time)

Update: https://github.com/fourdigits/wagtail_textract is now in beta, and it looks like we're going live with it in August.

I'd like this to be part of Wagtail somehow, for instance by putting it under https://github.com/wagtail/. My main concern is that it should be maintainable by different Wagtail developers, and keeping it under https://github.com/fourdigits/ is not the best way to achieve this. I also think that if more senior Wagtail developers have a say in it, it will make for a higher quality solution.

Any thoughts on this?

Kees Hink · Answer 12 · Wed Sep 05 2018 17:04:46 GMT+0800 (China Standard Time)

In addition, I'd like to add somebody outside our organization (preferably from the core team) as maintainer on PyPI.