ljvmiranda921 / comments.ljvmiranda921.github.io

Blog comments for my personal blog: ljvmiranda921.github.io

A framework for designing document processing solutions

utterances-bot opened this issue

Document processing may not be the hottest problem of the century, but it may as well be one of the important ones. In this blog post, I'll discuss a framewo...

https://ljvmiranda921.github.io/notebook/2022/06/19/document-processing-framework/

Thanks for that great overview! We recently had a similar project in which we wanted to classify certain aspects of customer documents as relevant. We had used Prodigy before; I'm a big fan and would have loved to use it for this case as well. However, since it does not support OCR out of the box, we went with tagtog this time (also a nice tool, I have to admit).

Actually, for our classifier, text only was enough (adding spatial information led to a drop in the metrics). Interestingly enough, a simple SVM looking at a few linguistic properties, like the number of nouns and verbs (which we got with spaCy, obviously ;) ), already had an F1 score of 90%. So can't complain :D If we get more complex documents or tasks, I'll take a look at LayoutLMv3.
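For readers curious what such a baseline could look like, here is a minimal sketch of a linear SVM over noun and verb counts. It is not the commenter's actual code. It assumes the documents are already POS-tagged as `(token, tag)` pairs using universal tags such as `NOUN` and `VERB` (the kind of tags spaCy exposes via `token.pos_`), it uses a tiny hand-made dataset, and it trains with a plain-Python hinge-loss subgradient loop instead of a library SVM. The names `pos_counts`, `train_linear_svm`, and `predict` are hypothetical.

```python
from collections import Counter


def pos_counts(tagged_doc):
    """Return [number of nouns, number of verbs] for one tagged document."""
    counts = Counter(tag for _, tag in tagged_doc)
    return [counts["NOUN"], counts["VERB"]]


def train_linear_svm(X, y, epochs=200, lr=0.1, reg=0.01):
    """Fit a linear SVM (hinge loss + L2 penalty) by subgradient descent.

    Labels in y must be +1 (relevant) or -1 (not relevant).
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1:
                # Inside the margin or misclassified: hinge term is active.
                w = [wj - lr * (reg * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Correct with enough margin: only the L2 penalty applies.
                w = [wj - lr * reg * wj for wj in w]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Toy, hand-made data (an assumption for illustration only).
docs = [
    [("invoice", "NOUN"), ("number", "NOUN"), ("total", "NOUN"), ("is", "AUX")],
    [("customer", "NOUN"), ("address", "NOUN"), ("changed", "VERB")],
    [("please", "INTJ"), ("call", "VERB"), ("us", "PRON")],
    [("we", "PRON"), ("will", "AUX"), ("reply", "VERB"), ("soon", "ADV")],
]
y = [1, 1, -1, -1]

X = [pos_counts(doc) for doc in docs]
w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
```

On real data you would swap the toy list for tags produced by your tagger and evaluate on held-out documents; four training examples say nothing about the 90% F1 figure reported above.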

Hi @RichardSieg, thanks for sharing about your project :) It's nice hearing about different solutions and approaches to this problem.

It seems like a great idea. Can I pick this idea for my PhD problem?

Wondering how many annotated documents are needed to train the model. Is it in the hundreds or thousands? Thanks.