yashkant / sam-textvqa

Official code for paper "Spatially Aware Multimodal Transformers for TextVQA" published at ECCV, 2020.

Home Page: https://yashkant.github.io/projects/sam-textvqa


Question about reproducing results.

lixiangpengcs opened this issue

I reproduced the baseline tvqa-c3 and the final accuracy is about 42.70% on the validation set, but the paper reports 43.9% on the val set. Are there any details that I missed? Or what is the reason for that?

Another question: why do we set quadrants 1 and 2 to 0 in 'train-tvqa-eval-tvqa-c3.yml'? Is there any corresponding explanation in the paper?

Hi Xiangpeng,

I reproduced the baseline tvqa-c3 and the final accuracy is about 42.70% on the validation set, but the paper reports 43.9% on the val set. Are there any details that I missed? Or what is the reason for that?

The difference could be due to the following reasons:

Another question: why do we set quadrants 1 and 2 to 0 in 'train-tvqa-eval-tvqa-c3.yml'? Is there any corresponding explanation in the paper?

Masking quadrants 1 and 2 corresponds to masking quadrants A and B in Figure 2 of the paper. The reasoning is described in Section 4.1.
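To make the idea concrete, here is a toy sketch (not the repo's actual code; the array values and names are made up) of what switching off quadrants amounts to: relations that fall in a masked quadrant are simply dropped from the spatial attention mask.

import numpy as np

# Toy 4x4 example: entry (i, j) holds the quadrant in which object j lies
# relative to object i (0 = no spatial relation). Values here are made up.
spatial_quadrants = np.array([
    [0, 1, 2, 3],
    [3, 0, 1, 4],
    [2, 4, 0, 1],
    [1, 2, 3, 0],
])

# Quadrants switched off in the config (1 and 2, i.e. A and B in Figure 2).
masked_quadrants = [1, 2]

# Keep a relation only if it exists and its quadrant is not masked.
attention_mask = (spatial_quadrants > 0) & np.isin(spatial_quadrants, masked_quadrants, invert=True)
print(attention_mask.astype(int))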

Hope these answers help!

Thanks a lot. I will try your suggestions and report back on their effect.

Hello, I uploaded my tvqa-c3 results to the EvalAI server. They attain 42.70% on val during training, but 42.52% is reported on the server. I also uploaded tvqa_stvqa-c3 results: they get 43.95% during training and 43.76% when evaluated on the EvalAI server. This is a little different from what you said.

Hi Xiangpeng,

They attain 42.70% on val during training, but 42.52% is reported on the server.

Sorry about this. I am not sure what has caused this negative shift between the training and EvalAI results. I recently turned off beam-search evaluation, and that might be one of the reasons.

I also uploaded tvqa_stvqa-c3 results: they get 43.95% during training and 43.76% when evaluated on the EvalAI server.

Can you please try evaluating the pre-trained checkpoint that I have uploaded with the following command:

python train.py \
--config configs/train-tvqa_stvqa-eval-tvqa-c3.yml \
--pretrained_eval data/pretrained-models/best_model.tar

Using this checkpoint, I achieve 44.25% on validation and 44.53% on the test set. Could you verify that you get these results as well?

The EvalAI files generated from this pre-trained checkpoint on my end are uploaded in sam-textvqa/pretrained-models/ for reference.

Although the validation accuracy (44.25%) is on the lower side, the test accuracy (44.53%) without beam search is quite good given that we reported 44.6% with beam search.

Also, I have noticed that these results are quite sensitive to hyper-parameter tuning, seeds, and the number of GPUs and workers used. I have tried to preserve most details as used for the reported results. Still, it is possible that you won't reproduce the exact numbers, but you should land in the same ballpark (+/- 1%).
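If it helps with the seed-related variance, here is a minimal sketch of pinning the usual RNGs in a PyTorch setup (a generic starting point; it is not necessarily the exact seeding used for the reported runs):

import random

import numpy as np
import torch

def set_seed(seed=0):
    # Fix the Python, NumPy, and PyTorch RNGs so repeated runs start identically.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade some speed for deterministic cuDNN kernels.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

Note that the number of dataloader workers and GPUs can still change batch ordering, so exact reproduction is not guaranteed even with fixed seeds.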

Closing due to inactivity, feel free to reopen.


I got a similar number to Xiangpeng. Are there any tricks not included in the released code?

@akira-l Same problem. The result of the provided pre-trained model is about 44.19% on the validation set on the server. I am not sure whether the model is pre-trained on tvqa or tvqa+stvqa. If the provided model is trained on tvqa+stvqa, it is still far from the 45.1% / 45.4% reported in the paper.

Hi @akira-l and @lixiangpengcs, I was wondering if you ran into any issues setting up your environment. I'm trying to run the pre-trained model on the TextVQA dataset on an AWS EC2 instance with this command:
python train.py --config configs/train-tvqa-eval-tvqa-c3.yml --pretrained_eval data/pretrained-models/best_model.tar

I'm running it without beam search and am not getting accuracy above 40%; I'm getting around 28%, so I'm wondering where my error is. I'm calculating my accuracy in my run_model_no_beam method here:

def run_model_no_beam(self, split):
    # Greedy (no beam search) evaluation over one split.
    scores, batch_sizes = [], []
    predictions = []
    self.model.eval()
    with torch.no_grad():
        for batch_dict in tqdm(self.dataloaders[split], desc=f"Eval on {split}"):
            loss, score, batch_size, batch_predictions = forward_model(
                {"loss": "textvqa", "metric": "textvqa"}, self.device, self.model, batch_dict=batch_dict
            )
            print("batch_acc", score, "batch_size", batch_size)
            # Weight each batch accuracy by its size for the overall average.
            scores.append(score * batch_size)
            batch_sizes.append(batch_size)
            predictions.extend(batch_predictions)
    print("accuracy:", sum(scores) / sum(batch_sizes))
    # Format predictions for the EvalAI submission file.
    evalai_preds = [{"question_id": x["question_id"], "answer": x["pred_answer"]} for x in predictions]
    return evalai_preds

I was wondering if this is how y'all did it. Also, did either of you do anything with the use_aux_heads parameter in the SAM4C initialization? I also had to install a newer version of Pytorch-Transformers (1.2.0) to get the model to work; I didn't know if you did that as well.

Another potential issue might be that I got a weird warning when installing apex on my instance; it had something to do with a sign bit possibly being flipped. I'm just trying to evaluate the model, and apex is mainly used for training, so I thought it shouldn't have any effect on the evaluation run.

Thanks in advance!

Hi!
Has anyone tried to migrate the SAM model to Facebook's MMF framework?
I wanted to use the Google OCR results in the MMF framework, so I replaced the original .npy and .lmdb files in the MMF framework with the author's own files and used the "google_ocr_tokens_filtered" and "google_ocr_info_filtered" fields from the tvqa_{}_imdb.npy files in my code. However, the result on the val set is only 39.94%, which is quite different from the 41.8% reported in the paper.
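For anyone trying the same thing, a sketch of reading those fields out of the imdb files might look like this (it assumes the .npy holds a pickled list of per-question dicts with a metadata header at index 0, as in the TextVQA imdbs; the path is illustrative):

import numpy as np

# Illustrative path; point it at wherever the SAM imdb files live locally.
imdb = np.load("data/imdb/tvqa_train_imdb.npy", allow_pickle=True)

# Entry 0 is usually a metadata header; per-question dicts start at index 1.
for entry in imdb[1:]:
    tokens = entry["google_ocr_tokens_filtered"]  # Google OCR token strings
    info = entry["google_ocr_info_filtered"]      # per-token info, e.g. bounding boxes
    print(len(tokens), len(info))
    break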

@JayZhu0104 Did you manage to reproduce the results with Google OCR in the MMF framework? Thanks in advance.

@JayZhu0104 Did you manage to reproduce the results with Google OCR in the MMF framework? Thanks in advance.
I did not reproduce the SA-M4C model with the MMF framework, but only used Google OCR with the MMF framework, and the result was lower than in the paper.

Sorry about the confusion. That is exactly what I wanted to ask. Thanks!

@JayZhu0104 How did you use the Google OCR LMDB files and .npy files from Spatially Aware in the MMF framework? We are trying it now and had to strip "train/" from the LMDB keys, but now we are running into the issue that the obj_bbox_coordinates data does not exist in the SampleList batch objects. We are having trouble troubleshooting this since we don't know how to read the bytes in the LMDB values.

Did you encounter these issues, and if not, how did you manage to migrate the data into MMF?
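In case it helps with the debugging, a sketch for inspecting these feature LMDBs might look like the following (it assumes the values are pickled dicts of numpy arrays, as in MMF-style feature databases; adjust the path and the prefix handling to your setup):

import pickle

import lmdb

# Illustrative path; point it at the SAM-provided feature LMDB.
env = lmdb.open("data/obj_features.lmdb", readonly=True, lock=False)

with env.begin(write=False) as txn:
    for raw_key, raw_value in txn.cursor():
        key = raw_key.decode("utf-8")
        # SAM's keys may carry a "train/" prefix that MMF does not expect.
        if key.startswith("train/"):
            key = key[len("train/"):]
        # Values are assumed here to be pickled objects (dicts of numpy arrays).
        value = pickle.loads(raw_value)
        print(key, sorted(value.keys()) if isinstance(value, dict) else type(value))
        break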

Has anyone else run into this issue: memory keeps rising until it hits the limit (64 GB on my server), and then the running speed slows down heavily?

@JayZhu0104 @HenryJunW Were you able to reproduce the results in the end?
I tried using the .lmdb files provided with SAM and a newly created .yaml file that points to them.
Then, I modified sam-mmf/mmf/datasets/databases/readers/feature_readers.py to strip "train/" from the filenames in SAM's lmdb so that the filenames matched the keys in the lmdb.
Lastly, I edited sam-mmf/mmf/datasets/builders/textvqa/dataset.py to use the new OCR tokens, bounding boxes, etc.

However, our results were well below 40% accuracy on the val set.
Is this exactly what you did? Are there any steps I am missing, or anything tricky that is easily overlooked?

@michaelzyang I didn't do the experiments. But I do think your steps were correct. Perhaps @yashkant can help answer your questions.

Thanks @HenryJunW. @yashkant any thoughts on this issue? Many thanks :)

Hi @michaelzyang,

IIUC, you are using both the object and OCR features from SAM's codebase; I think that's the right thing to do.

There are other things that might change the results --

  1. I used an earlier version of MMF available on this branch -- https://github.com/facebookresearch/mmf/tree/project/m4c.
  2. The evaluation criteria in the 2019 and 2020 TextVQA challenges hosted on EvalAI had a minor issue that affected results by +/- 1%. It was recently fixed in the 2021 iteration, so your numbers might not match exactly, but the difference should not be significant.

Hope this helps!

@michaelzyang Yes, I also modified these two .py files, and the result with the MMF framework was lower than in the paper.

@zachkitowski Some of the information is different with the MMF framework, and I did the same thing as michaelzyang.

Has anyone else run into this issue: memory keeps rising until it hits the limit (64 GB on my server), and then the running speed slows down heavily?

Same question. I run the code on my server with 62 GB of memory; after running for a while, training was interrupted.
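For anyone hitting this, a quick way to narrow it down is to log the process's resident memory during training and see whether it climbs per iteration or per epoch. A minimal sketch, assuming psutil is installed:

import os

import psutil

def log_memory(tag=""):
    # Resident set size of the current process, in GB.
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
    print(f"[mem] {tag}: {rss_gb:.2f} GB")

Calling this every few hundred iterations inside the training loop usually shows whether the growth comes from the dataloader workers or from something accumulating on the Python side (for example, holding on to tensors that still carry autograd history).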