If the focus is on document understanding tasks, would initializing the vision tower from LayoutLMv3 be more competitive than using ViT?
whalefa1I opened this issue · comments
sunzheng commented
As in the title.
Anwen Hu commented
Hi @whalefa1I, LayoutLMv3 relies on OCR-recognized text and positions as input, whereas the DocOwl series is entirely OCR-free.
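To make the input difference concrete, here is a minimal sketch (not from the thread, and not either project's actual API). It assumes OCR results arrive as `(word, pixel_box)` pairs and uses the 0-1000 box normalization that LayoutLM-style models expect. The names `OcrDependentInput`, `OcrFreeInput`, and `build_ocr_dependent_input` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1)


def normalize_box(box: Box, width: int, height: int) -> Box:
    """Scale a pixel-space box to the 0-1000 range used by LayoutLM-style models."""
    x0, y0, x1, y1 = box
    return (
        1000 * x0 // width,
        1000 * y0 // height,
        1000 * x1 // width,
        1000 * y1 // height,
    )


@dataclass
class OcrDependentInput:
    """Hypothetical container: image plus OCR words and their boxes."""
    image_size: Tuple[int, int]
    words: List[str]
    boxes: List[Box]


@dataclass
class OcrFreeInput:
    """Hypothetical container: the image alone, with no OCR output."""
    image_size: Tuple[int, int]


def build_ocr_dependent_input(
    image_size: Tuple[int, int],
    ocr_results: List[Tuple[str, Box]],
) -> OcrDependentInput:
    """Pair each OCR word with its normalized box; requires an external OCR step."""
    width, height = image_size
    words = [word for word, _ in ocr_results]
    boxes = [normalize_box(box, width, height) for _, box in ocr_results]
    return OcrDependentInput(image_size, words, boxes)


# An OCR-dependent encoder needs words and boxes from an OCR engine first.
ocr_results = [("Invoice", (100, 50, 300, 100)), ("Total", (100, 400, 200, 450))]
with_ocr = build_ocr_dependent_input((1000, 500), ocr_results)

# An OCR-free pipeline passes only the image to the vision tower.
without_ocr = OcrFreeInput((1000, 500))
```

Under these assumptions, swapping in an OCR-dependent encoder as the vision tower would add an OCR stage in front of the model, which is the dependency the reply says DocOwl avoids.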
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding