ui-libraries / flint

OCR issue: skipping text?

hzadeh17 opened this issue

I was looking further into Bcc fields and found that quite a few emails in the drive (particularly in deqs 1, 2, and 4) do include Bcc: in the email header. In many instances, though, this is not reflected in the OCR text. Perhaps more importantly, in the email output format where Bcc fields do show up, the OCR skips entire email bodies.

For example: this text file has the headers but not the bodies of the emails included in deq01_Part316 in the drive. The same goes for others like it, such as deq01_Part385 and this file.

(Note that as far as Bcc: goes, it does sometimes show up, as in this text file.)

Maybe this is not something we can fix, but it's still probably good to know. I wonder if there is a way to check how often this is happening? The body text that is being skipped is light blue, so maybe that is why, but that doesn't explain why some of the header text that was black (i.e. Bcc:) is also being skipped.

Another instance of skipping text: it appears that deq09, which from the drive seems to have only a few emails and a bunch of attachment pages, did not OCR correctly. I think that of the 30 files, only one has any text (this one), and it's nothing much.
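
One rough way to check how often this is happening might be to scan the OCR output and flag files that look like headers with no body, or that are empty or nearly so. The sketch below is only a heuristic under several assumptions: that the OCR output is plain `.txt` files in one folder, that header lines start with fields like `From:` or `Bcc:`, and that 20 words is a reasonable cutoff for "has a body". The names `classify_ocr_text`, `summarize`, and `min_body_words` are made up for illustration and are not part of any existing tooling here.

```python
import re
from collections import Counter
from pathlib import Path

# Assumed header field names; adjust to match how the emails were printed.
HEADER_RE = re.compile(r"^\s*(from|to|cc|bcc|subject|sent|date)\s*:", re.IGNORECASE)


def classify_ocr_text(text, min_body_words=20):
    """Label one OCR text as 'empty', 'headers_only', 'sparse', or 'has_body'.

    Heuristic only: wrapped header lines (e.g. a long To: list) count as
    body words, and a file holding several emails is judged as a whole.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return "empty"
    has_headers = any(HEADER_RE.match(ln) for ln in lines)
    # Count words on every non-blank line that is not a header line.
    body_words = sum(len(ln.split()) for ln in lines if not HEADER_RE.match(ln))
    if body_words < min_body_words:
        # Little text besides headers: either skipped bodies or a near-empty page.
        return "headers_only" if has_headers else "sparse"
    return "has_body"


def summarize(folder):
    """Count how many .txt files in a folder fall into each label."""
    counts = Counter()
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        counts[classify_ocr_text(text)] += 1
    return counts
```

Running `summarize` on each deq folder would give a count per label, so a folder like deq09 should show up as mostly `empty` or `sparse`, and the Bcc-style emails as `headers_only`. The flagged files would still need a manual spot check against the scans, since the cutoff is a guess.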