yashkant / sam-textvqa

Official code for paper "Spatially Aware Multimodal Transformers for TextVQA" published at ECCV, 2020.

Home Page: https://yashkant.github.io/projects/sam-textvqa


Question about reproducing results.

lixiangpengcs opened this issue

I reproduced the baseline tvqa-c3 and the final accuracy is about 42.70% on the validation set, but the paper reports 43.9% on the val set. Are there any details that I missed? Or what is the reason for that?

Another question: why do we set quadrants 1 and 2 to 0 in 'train-tvqa-eval-tvqa-c3.yml'? Is there any corresponding explanation in the paper?

Hi Xiangpeng,

I reproduced the baseline tvqa-c3 and the final accuracy is about 42.70% on the validation set, but the paper reports 43.9% on the val set. Are there any details that I missed? Or what is the reason for that?

The difference could be due to the following reasons:

Another question: why do we set quadrants 1 and 2 to 0 in 'train-tvqa-eval-tvqa-c3.yml'? Is there any corresponding explanation in the paper?

Masking quadrants 1 and 2 corresponds to masking quadrants A and B in Figure 2 of the paper. The reasoning is described in Section 4.1.
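To make the idea concrete, here is a toy sketch (not the repo's actual code; the array values and names are made up) of what switching off quadrants amounts to: relations that fall in a masked quadrant are simply dropped from the spatial attention mask.

import numpy as np

# Toy 4x4 example: entry (i, j) holds the quadrant in which object j lies
# relative to object i (0 = no spatial relation). Values here are made up.
spatial_quadrants = np.array([
    [0, 1, 2, 3],
    [3, 0, 1, 4],
    [2, 4, 0, 1],
    [1, 2, 3, 0],
])

# Quadrants switched off in the config (1 and 2, i.e. A and B in Figure 2).
masked_quadrants = [1, 2]

# Keep a relation only if it exists and its quadrant is not masked.
attention_mask = (spatial_quadrants > 0) & np.isin(spatial_quadrants, masked_quadrants, invert=True)
print(attention_mask.astype(int))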

Hope these answers help!

Thanks a lot. I will try your suggestions and report back on their effect.

Hello, I uploaded my tvqa-c3 results to the EvalAI server. They attain 42.70% on val during training, but 42.52% is reported on the server. I also uploaded tvqa_stvqa-c3 results: they get 43.95% during training and 43.76% when evaluated on the EvalAI server. This is a little different from what you said.

Hi Xiangpeng,

They attain 42.70% on val during training, but 42.52% is reported on the server.

Sorry about this. I am not sure what has caused this negative shift between the training and EvalAI results. I recently turned off beam-search evaluation, and that might be one of the reasons.

I also uploaded tvqa_stvqa-c3 results: they get 43.95% during training and 43.76% when evaluated on the EvalAI server.

Can you please try evaluating the pre-trained checkpoint that I have uploaded with the following command:

python train.py \
--config configs/train-tvqa_stvqa-eval-tvqa-c3.yml \
--pretrained_eval data/pretrained-models/best_model.tar

Using this checkpoint, I achieve 44.25% on validation and 44.53% on the test set. Could you verify that you get these results as well?

The EvalAI files generated from this pre-trained checkpoint on my end are uploaded in sam-textvqa/pretrained-models/ for reference.

Although the validation accuracy (44.25%) is on the lower side, the test accuracy (44.53%) without beam search is quite good given that we reported 44.6% with beam search.

Also, I have noticed that these results are quite sensitive to hyper-parameter tuning, seeds, and the number of GPUs and workers used. I have tried to preserve most details as used for the reported results. Still, it is possible that you won't reproduce the exact numbers, but you should land in the same ballpark (+/- 1%).
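If it helps with the seed-related variance, here is a minimal sketch of pinning the usual RNGs in a PyTorch setup (a generic starting point; it is not necessarily the exact seeding used for the reported runs):

import random

import numpy as np
import torch

def set_seed(seed=0):
    # Fix the Python, NumPy, and PyTorch RNGs so repeated runs start identically.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade some speed for deterministic cuDNN kernels.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

Note that the number of dataloader workers and GPUs can still change batch ordering, so exact reproduction is not guaranteed even with fixed seeds.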

Closing due to inactivity, feel free to reopen.


I got a similar number to Xiangpeng. Are there any tricks not included in the released code?

@akira-l Same problem. The result of the provided pre-trained model is about 44.19% on the validation set on the server. I am not sure whether the model is pre-trained on tvqa or tvqa+stvqa. If the provided model is trained on tvqa+stvqa, it is still far from the 45.1% / 45.4% reported in the paper.

Hi @akira-l and @lixiangpengcs, I was wondering if you ran into any issues setting up your environment. I'm trying to run the pre-trained model on the TextVQA dataset on an AWS EC2 instance with this command:
python train.py --config configs/train-tvqa-eval-tvqa-c3.yml --pretrained_eval data/pretrained-models/best_model.tar

I'm running it without beam search and am not getting accuracy above 40%; I'm getting around 28%, so I'm wondering where my error is. I'm calculating my accuracy in my run_model_no_beam method here:

def run_model_no_beam(self, split):
    # Greedy (no beam search) evaluation over one split.
    scores, batch_sizes = [], []
    predictions = []
    self.model.eval()
    with torch.no_grad():
        for batch_dict in tqdm(self.dataloaders[split], desc=f"Eval on {split}"):
            loss, score, batch_size, batch_predictions = forward_model(
                {"loss": "textvqa", "metric": "textvqa"}, self.device, self.model, batch_dict=batch_dict
            )
            print("batch_acc", score, "batch_size", batch_size)
            # Weight each batch accuracy by its size for the overall average.
            scores.append(score * batch_size)
            batch_sizes.append(batch_size)
            predictions.extend(batch_predictions)
    print("accuracy:", sum(scores) / sum(batch_sizes))
    # Format predictions for the EvalAI submission file.
    evalai_preds = [{"question_id": x["question_id"], "answer": x["pred_answer"]} for x in predictions]
    return evalai_preds

I was wondering if this is how y'all did it. Also, did either of you do anything with the use_aux_heads parameter in the SAM4C initialization? I also had to install a newer version of Pytorch-Transformers (1.2.0) to get the model to work; I didn't know if you did that as well.

Another potential issue might be that I got a weird warning when installing apex on my instance; it had something to do with a sign bit possibly being flipped. I'm just trying to evaluate the model, and apex is mainly used for training, so I thought it shouldn't have any effect on the evaluation run.

Thanks in advance!

Hi!
Has anyone tried to migrate the SAM model to Facebook's MMF framework?
I wanted to use the Google OCR results in the MMF framework, so I replaced the original .npy and .lmdb files in the MMF framework with the author's own files and used the "google_ocr_tokens_filtered" and "google_ocr_info_filtered" fields from the tvqa_{}_imdb.npy files in my code. However, the result on the val set is only 39.94%, which is quite different from the 41.8% reported in the paper.
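For anyone trying the same thing, a sketch of reading those fields out of the imdb files might look like this (it assumes the .npy holds a pickled list of per-question dicts with a metadata header at index 0, as in the TextVQA imdbs; the path is illustrative):

import numpy as np

# Illustrative path; point it at wherever the SAM imdb files live locally.
imdb = np.load("data/imdb/tvqa_train_imdb.npy", allow_pickle=True)

# Entry 0 is usually a metadata header; per-question dicts start at index 1.
for entry in imdb[1:]:
    tokens = entry["google_ocr_tokens_filtered"]  # Google OCR token strings
    info = entry["google_ocr_info_filtered"]      # per-token info, e.g. bounding boxes
    print(len(tokens), len(info))
    break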

@JayZhu0104 Did you manage to reproduce the results with Google OCR in the MMF framework? Thanks in advance.

@JayZhu0104 Did you manage to reproduce the results with Google OCR in the MMF framework? Thanks in advance.
I did not reproduce the SA-M4C model with the MMF framework, but only used Google OCR with the MMF framework, and the result was lower than in the paper.

Sorry about the confusion. That is exactly what I wanted to ask. Thanks!

@JayZhu0104 How did you use the Google OCR LMDB files and .npy files from Spatially Aware in the MMF framework? We are trying it now and had to strip "train/" from the LMDB keys, but now we are running into the issue that the obj_bbox_coordinates data does not exist in the SampleList batch objects. We are having trouble troubleshooting this since we don't know how to read the bytes in the LMDB values.

Did you encounter these issues, and if not, how did you manage to migrate the data into MMF?
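In case it helps with the debugging, a sketch for inspecting these feature LMDBs might look like the following (it assumes the values are pickled dicts of numpy arrays, as in MMF-style feature databases; adjust the path and the prefix handling to your setup):

import pickle

import lmdb

# Illustrative path; point it at the SAM-provided feature LMDB.
env = lmdb.open("data/obj_features.lmdb", readonly=True, lock=False)

with env.begin(write=False) as txn:
    for raw_key, raw_value in txn.cursor():
        key = raw_key.decode("utf-8")
        # SAM's keys may carry a "train/" prefix that MMF does not expect.
        if key.startswith("train/"):
            key = key[len("train/"):]
        # Values are assumed here to be pickled objects (dicts of numpy arrays).
        value = pickle.loads(raw_value)
        print(key, sorted(value.keys()) if isinstance(value, dict) else type(value))
        break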

Has anyone else run into this issue: memory keeps rising until it hits the limit (64 GB on my server), and then the running speed slows down heavily?

@JayZhu0104 @HenryJunW Were you able to reproduce the results in the end?
I tried using the .lmdb files provided with SAM and a newly created .yaml file that points to them.
Then, I modified sam-mmf/mmf/datasets/databases/readers/feature_readers.py to strip "train/" from the filenames in SAM's lmdb so that the filenames matched the keys in the lmdb.
Lastly, I edited sam-mmf/mmf/datasets/builders/textvqa/dataset.py to use the new OCR tokens, bounding boxes, etc.

However, our results were well below 40% accuracy on the val set.
Is this exactly what you did? Are there any steps I am missing, or anything tricky that is easily overlooked?

@michaelzyang I didn't do the experiments. But I do think your steps were correct. Perhaps @yashkant can help answer your questions.

Thanks @HenryJunW. @yashkant any thoughts on this issue? Many thanks :)

Hi @michaelzyang,

IIUC, you are using both the object and OCR features from SAM's codebase; I think that's the right thing to do.

There are other things that might change the results --

  1. I used an earlier version of MMF available on this branch -- https://github.com/facebookresearch/mmf/tree/project/m4c.
  2. The evaluation criteria in the 2019 and 2020 TextVQA challenges hosted on EvalAI had a minor issue that affected results by +/- 1%. It was recently fixed in the 2021 iteration, so your numbers might not match exactly, but the difference should not be significant.

Hope this helps!

@michaelzyang Yes, I also modified these two .py files, and the result with the MMF framework was lower than in the paper.

@zachkitowski Some of the information is different with the MMF framework, and I did the same thing as michaelzyang.

Has anyone else run into this issue: memory keeps rising until it hits the limit (64 GB on my server), and then the running speed slows down heavily?

Same question. I run the code on my server with 62 GB of memory; after running for a while, training was interrupted.
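For anyone hitting this, a quick way to narrow it down is to log the process's resident memory during training and see whether it climbs per iteration or per epoch. A minimal sketch, assuming psutil is installed:

import os

import psutil

def log_memory(tag=""):
    # Resident set size of the current process, in GB.
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
    print(f"[mem] {tag}: {rss_gb:.2f} GB")

Calling this every few hundred iterations inside the training loop usually shows whether the growth comes from the dataloader workers or from something accumulating on the Python side (for example, holding on to tensors that still carry autograd history).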