Discrepancy in Data Count Between Paper and Huggingface Dataset

Question

nth2000 opened this issue 2 months ago · comments

nth2000 commented 2 months ago

First of all, thank you for your outstanding work!

I noticed that the chart_upload.json file in the Huggingface dataset contains 2,633,068 entries. However, Table 1 in your paper reports a total of 39M data samples. Which part of the data has not yet been released, and are there any plans to release the remaining samples?
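For anyone who wants to reproduce the entry count, here is a minimal sketch. It assumes chart_upload.json has been downloaded locally and is a single top-level JSON array (or object); that layout is an assumption, and `count_entries` is just an illustrative helper name, not part of the dataset tooling.

```python
import json
from pathlib import Path


def count_entries(path):
    """Return the number of top-level entries in a JSON file.

    Assumption: the file holds one top-level JSON array (or object),
    so len() of the parsed value is the entry count.
    """
    with Path(path).open("r", encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, (list, dict)):
        raise ValueError("expected a top-level JSON array or object")
    return len(data)
```

Calling `count_entries("chart_upload.json")` on a local copy (the path is hypothetical) gives the number to compare against the figure in the paper.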

Many thanks!

FanqingM · Answer 1 · Mon Jun 24 2024 22:32:58 GMT+0800 (China Standard Time)

Hello, we have not uploaded all of the image-table pairs, because they are simply generated with plotting APIs (matplotlib, etc.) and are easy to regenerate. Besides, we do not re-upload the open-source datasets.
Therefore, we uploaded the most important parts of our dataset: MathQA, ReferQA (which use CoT answers), and the Arxiv dataset.

We also found that ChartAst-S does not need all of the data: similar performance can be achieved using this released data together with other open-source datasets.
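To illustrate why the image-table pairs are easy to regenerate, here is a rough sketch of producing one pair with matplotlib. This is not the authors' generation pipeline: the function name, the bar-chart choice, the CSV table format, and the column names are all assumptions made for the example.

```python
import csv

import matplotlib

matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt


def make_image_table_pair(labels, values, image_path, table_path):
    """Render a bar chart and save the underlying table next to it.

    Hypothetical helper: the chart type and CSV layout are assumptions,
    not the format used for the actual dataset.
    """
    # Draw the chart image from the table values.
    fig, ax = plt.subplots(figsize=(4, 3))
    ax.bar(labels, values)
    ax.set_ylabel("value")
    fig.savefig(image_path, dpi=100)
    plt.close(fig)

    # Save the matching table so image and data stay paired.
    with open(table_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["label", "value"])
        writer.writerows(zip(labels, values))
```

For example, `make_image_table_pair(["A", "B", "C"], [3, 5, 2], "chart.png", "chart.csv")` writes one chart image and its source table; looping over randomly sampled tables would yield as many pairs as needed.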