X-PLUG / mPLUG-DocOwl

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

For document-understanding tasks specifically, would initializing the vision tower with LayoutLMv3 be more competitive than ViT?

whalefa1I opened this issue

As the title says.

Hi @whalefa1I, LayoutLMv3 relies on OCR-recognized text and its positions as input, whereas the DocOwl series does not depend on OCR at all.
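
To make the difference concrete, here is a minimal sketch of the two input contracts. The class and function names (`OcrDependentInput`, `OcrFreeInput`, `needs_ocr`) are hypothetical and exist only for illustration; they are not the API of LayoutLMv3 or DocOwl, and the nested list stands in for a real image tensor.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical types for illustration only; not the API of either project.
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) of one recognized word


@dataclass
class OcrDependentInput:
    """Input for a LayoutLMv3-style encoder: the image plus OCR output."""
    image: List[List[int]]  # placeholder pixel grid
    words: List[str]        # text produced by an external OCR engine
    boxes: List[Box]        # one bounding box per recognized word

    def __post_init__(self) -> None:
        # Text and layout must stay aligned word by word.
        if len(self.words) != len(self.boxes):
            raise ValueError("each OCR word needs exactly one bounding box")


@dataclass
class OcrFreeInput:
    """Input for an OCR-free model such as DocOwl: the image alone."""
    image: List[List[int]]  # placeholder pixel grid


def needs_ocr(model_input: object) -> bool:
    """Return True if an OCR pass must run before the model sees the page."""
    return isinstance(model_input, OcrDependentInput)


page = [[0, 255], [255, 0]]  # toy 2x2 "image"
layout_input = OcrDependentInput(page, ["Invoice"], [(10, 10, 120, 40)])
ocr_free_input = OcrFreeInput(page)
```

Swapping a LayoutLMv3-style encoder into the vision tower would mean every page has to go through an OCR step first to fill `words` and `boxes`, which is exactly the dependency the DocOwl series avoids.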