vis-nlp / UniChart


Does the pretraining corpus contain the test and val images in ChartQA?

Evanwu1125 opened this issue

I've recently benchmarked UniChart on some of my own questions over ChartQA images, but I find that it does not perform as well as you report in the paper.

This is my test code, copied from the Hugging Face model card:

from transformers import DonutProcessor, VisionEncoderDecoderModel
from PIL import Image
import torch


model_name = "ahmed-masry/unichart-chartqa-960"
image_path = "../images/6.png"
input_prompt = "<chartqa> What is average value of all the 'Female' bars? <s_answer>"

# Load the ChartQA-finetuned checkpoint and its Donut processor
model = VisionEncoderDecoderModel.from_pretrained(model_name)
processor = DonutProcessor.from_pretrained(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Preprocess the chart image and tokenize the prompt
image = Image.open(image_path).convert("RGB")
decoder_input_ids = processor.tokenizer(input_prompt, add_special_tokens=False, return_tensors="pt").input_ids
pixel_values = processor(image, return_tensors="pt").pixel_values

outputs = model.generate(
    pixel_values.to(device),
    decoder_input_ids=decoder_input_ids.to(device),
    max_length=model.decoder.config.max_position_embeddings,
    early_stopping=True,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    use_cache=True,
    num_beams=4,
    bad_words_ids=[[processor.tokenizer.unk_token_id]],
    return_dict_in_generate=True,
)

# Decode and keep only the text after the <s_answer> tag
sequence = processor.batch_decode(outputs.sequences)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
sequence = sequence.split("<s_answer>")[1].strip()
print(sequence)

This is the chart I tested (the attached image, 6.png).
This is the answer UniChart returned:
20000
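
For what it's worth, the arithmetic can be sanity-checked outside the model by first asking for the underlying data table and then averaging the 'Female' column in Python. Below is a rough sketch continuing from the script above; if I'm reading the model card right, the base checkpoint ahmed-masry/unichart-base-960 exposes an <extract_data_table> prompt, but the "&" / "|" table delimiters I parse with are only a guess, so inspect the raw output first.

# Rough sketch: extract the chart's data table, then do the averaging in Python
# instead of asking the model to do the arithmetic.
# Assumptions: the <extract_data_table> prompt and the base checkpoint name are
# as I recall from the UniChart model card; the delimiters below are a guess.
base_name = "ahmed-masry/unichart-base-960"
model = VisionEncoderDecoderModel.from_pretrained(base_name).to(device)
processor = DonutProcessor.from_pretrained(base_name)

pixel_values = processor(image, return_tensors="pt").pixel_values
table_prompt = "<extract_data_table> <s_answer>"
decoder_input_ids = processor.tokenizer(table_prompt, add_special_tokens=False, return_tensors="pt").input_ids

outputs = model.generate(
    pixel_values.to(device),
    decoder_input_ids=decoder_input_ids.to(device),
    max_length=model.decoder.config.max_position_embeddings,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    use_cache=True,
    num_beams=4,
    return_dict_in_generate=True,
)

table = processor.batch_decode(outputs.sequences)[0]
table = table.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
table = table.split("<s_answer>")[1].strip()
print(table)  # inspect the raw table string before trusting the parsing below

# Hypothetical layout: "Year | Female | Male & 2015 | 20000 | 25000 & ..."
# (header row first, "&" between rows, "|" between cells) -- adjust to what you see.
rows = [[cell.strip() for cell in row.split("|")] for row in table.split("&")]
header, body = rows[0], rows[1:]
female_idx = header.index("Female")
female_values = [float(r[female_idx]) for r in body]
print("average of 'Female':", sum(female_values) / len(female_values))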

Hello,
Thanks for your interest in our work. No, we didn't fine-tune our model on the val or test sets; we made sure to filter them out before we trained the model.

We acknowledge that numerical reasoning questions are still quite challenging for the model. As you can see in the paper, performance on the human-written questions (which include numerical reasoning questions) is still very limited (~43%), so there is still a lot of room for improvement.
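
For context, that ~43% is under ChartQA's relaxed-accuracy metric, which accepts a numeric prediction within 5% of the gold value and requires an exact match otherwise. The helper below is only a rough sketch of that check, not the official evaluation script:

def relaxed_match(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    """Rough sketch of ChartQA-style relaxed accuracy (not the official script).

    Numeric answers count as correct when within `tolerance` (5% by default)
    of the gold value; non-numeric answers must match exactly after lowercasing.
    """
    try:
        pred, gold = float(prediction), float(target)
    except ValueError:
        # Non-numeric answer: fall back to exact string matching
        return prediction.strip().lower() == target.strip().lower()
    if gold == 0.0:
        return pred == 0.0
    return abs(pred - gold) / abs(gold) <= tolerance

# e.g. relaxed_match("98.5", "100") -> True, relaxed_match("20000", "30000") -> False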