Issue with reading documents with double columns

Question

Hastyrush opened this issue a month ago

Hi, thanks for the amazing work done on MiniCPM!

I would like to ask whether the model is capable of extracting text (via OCR or otherwise) from documents with a double-column layout, such as research papers, i.e. where each column is meant to be read top to bottom before moving on to the next, rather than straight across the page. I experimented with different prompts, but the model does not seem to interpret double-column documents correctly. The result either omits the other column or combines a line from both columns (reading horizontally instead of vertically). I'm not sure if this can be mitigated, so some advice would be appreciated. Thanks!
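To make the two reading orders concrete, here is a minimal sketch on a toy layout. It only illustrates the difference between reading across the page and reading column by column; it is not how MiniCPM processes a page. The coordinates, `PAGE_WIDTH`, and both function names are hypothetical, and the split at the page midpoint assumes exactly two equal-width columns.

```python
# Hypothetical toy layout: each text line is (x, y, text), where x and y are
# the top-left corner of the line on the page, in arbitrary units.
PAGE_WIDTH = 100

lines = [
    (5, 10, "L1"), (55, 10, "R1"),
    (5, 20, "L2"), (55, 20, "R2"),
    (5, 30, "L3"), (55, 30, "R3"),
]


def row_major_order(lines):
    """Read top to bottom, left to right across the full page width.

    This interleaves the two columns, which is the failure described above.
    """
    return [text for _, _, text in sorted(lines, key=lambda l: (l[1], l[0]))]


def column_major_order(lines, page_width=PAGE_WIDTH):
    """Read the whole left column first, then the whole right column.

    Assumes two columns separated at the horizontal midpoint of the page.
    """
    mid = page_width / 2
    left = sorted((l for l in lines if l[0] < mid), key=lambda l: l[1])
    right = sorted((l for l in lines if l[0] >= mid), key=lambda l: l[1])
    return [text for _, _, text in left + right]


print(row_major_order(lines))     # ['L1', 'R1', 'L2', 'R2', 'L3', 'R3']
print(column_major_order(lines))  # ['L1', 'L2', 'L3', 'R1', 'R2', 'R3']
```

The second ordering is what a double-column research paper expects; the first is the "one line from both columns" output.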

Cui Junbo · Answer 1 · Fri Jun 14 2024 08:23:22 GMT+0800 (China Standard Time)

Can you give us an example or two so that we can get a clearer picture? Our model has some capacity for table extraction, but to make it perform very well in specific scenarios, it may require a small amount of data for fine-tuning.