PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkits based on PaddlePaddle (a practical, ultra-lightweight OCR system that supports recognition of 80+ languages, provides data annotation and synthesis tools, and supports training and deployment on server, mobile, embedded, and IoT devices)

Table recognition model: PubTabNet_2.0.0_train.jsonl is missing

s957995299 opened this issue 2 years ago

s957995299 commented 2 years ago

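Where can I download the PubTabNet_2.0.0_train.jsonl file provided by Paddle?
Without this file, I cannot train by following the table recognition model tutorial from the "OCR十讲" (Ten Lectures on OCR) course.
This page does not open and will not let me download the file:
https://aistudio.baidu.com/bd-cpu-01/user/995689/3481601//files/train_data%2Ftable%2Fpubtabnet%2FPubTabNet_2.0.0_train.jsonl?download=1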

zhoujun commented 2 years ago

It is here: https://github.com/ibm-aur-nlp/PubTabNet

s957995299 commented 2 years ago

> It is here: https://github.com/ibm-aur-nlp/PubTabNet

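Hello, do the annotations in the data downloaded from the official PubTabNet site need any special processing? After splitting the data myself, acc stays at 0 no matter how I train.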

zhoujun commented 2 years ago

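No special processing is needed; just split the data into training and validation sets. Acc typically reaches a little over 10% after 1 epoch.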
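
For reference, a minimal sketch of such a split. It assumes the official download is a single JSONL file in which every line is a JSON object carrying a "split" field (for example "train" or "val"), as described in the PubTabNet README. The function name `split_pubtabnet` and the output file naming are illustrative and not part of PaddleOCR.

```python
import json
from collections import Counter
from pathlib import Path


def split_pubtabnet(src, out_dir, prefix="PubTabNet_2.0.0"):
    """Write one JSONL file per value of each record's "split" field.

    Assumption: every non-empty line of `src` is a JSON object with a
    "split" key. Lines are copied unchanged, so annotations are not modified.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    handles = {}        # split name -> open output file
    counts = Counter()  # split name -> number of records written
    try:
        with open(src, encoding="utf-8") as f:
            for line in f:  # stream line by line; the full file is large
                line = line.strip()
                if not line:
                    continue
                name = json.loads(line)["split"]
                if name not in handles:
                    path = out_dir / f"{prefix}_{name}.jsonl"
                    handles[name] = open(path, "w", encoding="utf-8")
                handles[name].write(line + "\n")
                counts[name] += 1
    finally:
        for handle in handles.values():
            handle.close()
    return dict(counts)
```

Calling `split_pubtabnet("PubTabNet_2.0.0.jsonl", "train_data/table/pubtabnet")` (both paths hypothetical) would then write `PubTabNet_2.0.0_train.jsonl` next to one file for each other split value and return the record count per split, which is a quick check that neither set came out empty.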