wagtail / wagtail

A Django content management system focused on flexibility and user experience

Home Page: https://wagtail.org

Search: Extract text from documents

kaedroho opened this issue

By the looks of it, this should be quick and easy to implement now that there's a good Python module that can do most of the work for us (https://github.com/deanmalmgren/textract).

Notes:

  • textract has many dependencies (including lxml), so it would be a good idea to make it an optional dependency only
  • For large PDFs (e.g. ebooks), textract can take a few minutes to extract the text

Implementation ideas:

  • Add a method to Document model called extract_text. Calling this will use the textract module to extract text from the document file and return it as a string.
  • Add a new field to Document called extracted_text. This is to store the extracted text in the database to make indexing the document into the database faster. This field should be nullable (which would indicate that the text hasn't been extracted from the document yet).
  • Create some kind of background task to extract text from documents.
  • Add a custom manager for Document and override the get_queryset method to run .defer('extracted_text') on the queryset (https://docs.djangoproject.com/en/dev/ref/models/querysets/#defer). This prevents Django from selecting the extracted text when it's not needed, so large documents don't slow anything down.
  • Add extracted_text into search_fields of Document.
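The caching idea in the list above can be sketched without any framework. This is only an illustration in plain Python, not Wagtail's Document model: `CachedDocument`, `get_text` and the `extractor` callable are hypothetical names. In Django terms, `extracted_text` would be a nullable text field and the custom manager would call `.defer('extracted_text')`.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CachedDocument:
    """Hypothetical stand-in for a Document row; not Wagtail's model."""

    path: str
    # None means "not extracted yet", mirroring a nullable database column.
    extracted_text: Optional[str] = None

    def extract_text(self, extractor: Callable[[str], str]) -> str:
        """Run the (possibly slow) extractor and return the text as a string."""
        return extractor(self.path)

    def get_text(self, extractor: Callable[[str], str]) -> str:
        """Return the stored text, extracting and storing it on first use."""
        if self.extracted_text is None:
            self.extracted_text = self.extract_text(extractor)
        return self.extracted_text


calls = []


def fake_extractor(path: str) -> str:
    # Assumption: stands in for a textract call; records how often it runs.
    calls.append(path)
    return f"text of {path}"


doc = CachedDocument("report.pdf")
first = doc.get_text(fake_extractor)
second = doc.get_text(fake_extractor)  # served from the stored value
```

The second call does not run the extractor again, which is the point of storing the text: a slow extraction happens once, and indexing reads the stored value.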

I agree with all your ideas @kaedroho.

I suggest creating a new command named transcribe_documents and naming the new field transcription instead of extracted_text.
This new command should use the default method for transcription, and do something special for PDFs: when the PDF transcription is extremely small compared to the file size, we fall back to the Tesseract method.
It should also issue a warning when it meets an unsupported file type and advise users to install the missing dependency. For example, if a document is an OGG file and sox is not installed, we log a warning saying that sox can be installed in order to transcribe this document.
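A rough sketch of the two behaviours described above, under stated assumptions: the threshold value, the function names and the logger name are all made up for illustration, and the extractors are passed in as callables. With textract they might wrap `textract.process(path)` and `textract.process(path, method="tesseract")`.

```python
import logging
import shutil

logger = logging.getLogger("transcribe_documents")

# Assumption: below 1 character of text per 10 kB of PDF, treat it as a scan.
MIN_CHARS_PER_BYTE = 1 / 10_000


def needs_ocr(text: str, file_size: int, min_ratio: float = MIN_CHARS_PER_BYTE) -> bool:
    """True when the extracted text is tiny relative to the file size."""
    if file_size <= 0:
        return False
    return len(text.strip()) / file_size < min_ratio


def transcribe_pdf(path, file_size, default_method, ocr_method):
    """Try the default extractor first and fall back to OCR for scans."""
    text = default_method(path)
    if needs_ocr(text, file_size):
        text = ocr_method(path)
    return text


def warn_if_missing(binary: str, extension: str, which=shutil.which) -> bool:
    """Log a hint when the tool needed for this file type is not installed."""
    if which(binary) is None:
        logger.warning(
            "Cannot transcribe %s files: install %s to enable this.",
            extension,
            binary,
        )
        return True
    return False
```

The ratio is a heuristic and would need tuning on real documents; a PDF made mostly of diagrams could legitimately contain very little text.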

We could also use tqdm in this new command to give an estimate of where we are in the transcription and how long it should take.
By the way, we could also use tqdm in update_index instead of the current system. It would be way shorter (only one progress bar for each model, for example) and it would also give an estimate of the time remaining.
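For illustration, a minimal sketch of a tqdm-driven loop. `transcribe_all` and its arguments are hypothetical, and the fallback is only there so the snippet runs when tqdm is not installed.

```python
try:
    from tqdm import tqdm
except ImportError:
    # Fallback so the sketch runs without tqdm: iterate with no progress bar.
    def tqdm(iterable, **kwargs):
        return iterable


def transcribe_all(documents, transcribe):
    """Transcribe each document, showing one progress bar for the whole run."""
    results = {}
    for doc in tqdm(documents, desc="Transcribing", unit="doc"):
        results[doc] = transcribe(doc)
    return results
```

When given a sized iterable such as a list, tqdm works out the total itself and prints the rate and an estimate of the time remaining.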

What about PDF OCR?

Tika, combined with Elasticsearch, provides a way to extract text from PDFs, including images (document scans, thanks to Tesseract). That's why I think an Elasticsearch-based solution would be more transparent.

What's more, the time-consuming analysis is performed in Elasticsearch, in a delegated process, which avoids having to use Celery or cron to manage that processing.

Thanks for your great CMS!

Oops, I've just seen that Tesseract is also included in textract.

Even if we use the Elasticsearch mapper-attachments solution, we would still have to write a different solution for PostgreSQL, most probably using textract.

And of course, two different solutions for the same thing would mean twice as many potential issues, plus differences in indexing quality and results. Using textract also means we can easily improve it, as it’s written in Python.

Maybe we’ll end up also using mapper-attachments for performance reasons, but it’s not the right way to start, as we’re trying to create a uniform interface for search across backends.

Tom pointed out this issue to me after my question on Slack.

@kaedroho:

  • Would it make sense to place this in a separate package (something like wagtail-textract or wagtail-document-extraction)? This would make the dependency optional in an easy way. I'm just wondering if it's easy and advisable to patch/modify Wagtail's (Abstract)Document model.
  • If not, and we'd need to get it in Wagtail itself, would we do this with a textract_extras in extras_require, and try: import; except ImportError clauses in the relevant methods?
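The try/except ImportError approach from the second question could look roughly like this. It is a sketch, not code from Wagtail or wagtail_textract: `load_optional` and `extract_text` are hypothetical helpers, and the only textract call used is `textract.process(path)`, which returns bytes. On the packaging side it would pair with an extras_require entry such as `extras_require={"textract": ["textract"]}`, where the name of the extra is an assumption.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def load_optional(module_name):
    """Import a module by name, returning None when it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def extract_text(path, backend=None):
    """Return the document's text, or None when textract is unavailable.

    `backend` is any object with a textract-style `process(path)` method;
    it is looked up lazily so that importing this module never fails.
    """
    if backend is None:
        backend = load_optional("textract")
    if backend is None:
        logger.warning("textract is not installed; skipping %s", path)
        return None
    raw = backend.process(path)
    # textract returns bytes; decode defensively for indexing.
    if isinstance(raw, bytes):
        return raw.decode("utf-8", errors="replace")
    return raw
```

Importing lazily inside the function keeps the rest of the application working when the optional dependency is missing; only text extraction is skipped.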

@BertrandBordage Why transcribe rather than extract?

We have a use case for building this for a customer. I'd ideally like to build this in such a way that it can be easily used by others, so I'd appreciate any guidance.

As a first attempt, I made a package that overrides Wagtail's Document model. I couldn't find much documentation on it, but this approach seems to work: fourdigits/wagtail_textract@1ab2405#diff-212e85d6b805221168ffba19cccbdea7 and fourdigits/wagtail_textract@1ab2405#diff-55d35ded409e4ba2ffaa719e13674bd9

Is this wise, or is there a better way to override/extend a Wagtail model?

My colleague Tom suggested an alternative: instead of adding the field to the Document model, we might create a new model (not a Wagtail content type) that has a one-to-one relation with a Document. This model (DocumentText?) would keep the extracted_text.

We'd still need to be able to modify search_fields on the Document though.

@khink transcription is the term used by librarians & digital humanities researchers for a plain text version of a document, whether it is a photograph, a video or an audio document.

In lots of cases in digital humanities, we want to manually write transcriptions instead of using OCR or extracting the already OCRed text from a PDF. For example, medievalists transcribe documents that are almost impossible to OCR, even today. Or musicians transcribe scores using languages such as LilyPond, which are again almost impossible to OCR.

That’s why I think it’s better to have an editable transcription field. And for consistency, use the verb transcribe for the functions/commands that automatically fill the transcription. The transcription method itself should be configurable, of course, so we can specify the backend and its options, for example to say that we want Tesseract with these letters only and this dictionary, etc.
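One way to picture a configurable backend combined with an editable field is sketched below. Everything here is hypothetical: the `BACKENDS` registry, the `TRANSCRIPTION` setting and its shape, and the dummy backend, which only returns a placeholder string where a real one might call Tesseract with the given options.

```python
from typing import Callable, Dict, Optional

# Hypothetical registry mapping backend names to callables (path, **options).
BACKENDS: Dict[str, Callable[..., str]] = {}


def register(name):
    """Decorator that adds a transcription backend to the registry."""
    def decorator(func):
        BACKENDS[name] = func
        return func
    return decorator


@register("dummy")
def dummy_backend(path, language="eng"):
    # Placeholder; a real backend might run OCR with these options.
    return f"[{language}] transcription of {path}"


# Hypothetical setting; the name and shape are illustrative only.
TRANSCRIPTION = {"BACKEND": "dummy", "OPTIONS": {"language": "fra"}}


def auto_transcribe(path: str, current: Optional[str], config=TRANSCRIPTION) -> str:
    """Fill an empty transcription; never overwrite one written by hand."""
    if current:
        return current
    backend = BACKENDS[config["BACKEND"]]
    return backend(path, **config["OPTIONS"])
```

Leaving an existing value untouched is what keeps manual transcriptions safe when the automatic command is run again.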

Thanks @BertrandBordage for that explanation, I'll include it.

Only now do I see the get_document_model method and the WAGTAILDOCS_DOCUMENT_MODEL setting. No need to override Wagtail's Document model.

Just a short note: We did an alpha release of https://github.com/fourdigits/wagtail_textract today. We're hoping this may scratch other people's itch as well. Maybe this helps pave the way for getting this functionality into Wagtail core, although there should be a fallback when Textract's installation requirements aren't met. We welcome any comments, hints, PRs and other feedback. If at some point the package is deemed good enough that the repository can be placed in the Wagtail organisation on GitHub, I'd welcome that.

Update: https://github.com/fourdigits/wagtail_textract is now in beta, and it looks like we're going live with it in August.

I'd like for this to be part of Wagtail somehow, for instance by putting it under https://github.com/wagtail/. My main concern is that it should be maintainable for different Wagtail developers, and keeping it under https://github.com/fourdigits/ is not the best way to achieve this. I also think that if more senior Wagtail developers have a say in it, it will make for a higher quality solution.

Any thoughts on this?

In addition, I'd like to add somebody outside our organization (preferably from the core team) as maintainer on PyPI.