yashkant / sam-textvqa

Official code for the paper "Spatially Aware Multimodal Transformers for TextVQA", published at ECCV 2020.

Home Page: https://yashkant.github.io/projects/sam-textvqa


Issue running textvqa dataset on AWS EC2 instance

zachkitowski opened this issue

Hi,
I'm trying to run the pretrained model on an AWS EC2 instance. I'm running it on a g4dn.4xlarge instance with 64 GB of RAM and 500 GB of disk space. I was trying to run the evaluation command, but my process got killed. I was running with num_workers=0. When I tried to rerun the command, I got an EOF error. I was wondering if you had any idea where my problem could be.

I ran this command: python train.py --config configs/train-tvqa-eval-tvqa-c3.yml --pretrained_eval data/pretrained-models/best_model.tar
[screenshot]

It loaded this:
[screenshot]

Made it all the way here and then the process was killed:
[screenshot]

Then when I tried to run the same command, I got this error:
[screenshot]

I thought it might be a memory error but I'm not sure.

Thank you for your consideration.

Hi @zachkitowski,

It appears to me that your cache file data/textvqa/tva_train_spat_cache_reset.pkl was not saved correctly. Can you try reading it manually with cPickle.load()? I am quite certain you will hit the same error as above.
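Something like this minimal check should reproduce it (assuming Python 3, where cPickle is available as the built-in pickle module):

import pickle  # cPickle is merged into pickle on Python 3

cache_path = "data/textvqa/tva_train_spat_cache_reset.pkl"
with open(cache_path, "rb") as f:
    try:
        cache = pickle.load(f)
        print("cache loaded fine:", type(cache))
    except EOFError:
        # An empty or truncated cache file lands here; delete it so the
        # dataset code rebuilds it on the next run.
        print("cache file is empty or truncated")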

And yes, these features are quite memory-intensive, so 64 GB of RAM seems to be the problem. You can try to mitigate this by breaking the spatial-features saving into multiple files here:

def process_spatials(self):
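A rough sketch of what I mean (the dict name, base path, and chunk size are illustrative, not the actual variables inside process_spatials):

import pickle

def save_spatials_in_chunks(spatial_cache, base_path, chunk_size=5000):
    # Split the big feature dict into several smaller pickles so a single
    # save (or load) never has to materialize everything at once.
    items = list(spatial_cache.items())
    for part, start in enumerate(range(0, len(items), chunk_size)):
        chunk = dict(items[start:start + chunk_size])
        with open(f"{base_path}.part{part}.pkl", "wb") as f:
            pickle.dump(chunk, f, protocol=pickle.HIGHEST_PROTOCOL)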

But even with this fix, I cannot guarantee that the limited RAM won't create problems during training; setting num_workers=0 is the right thing to do.

Hope this answer helps!

Thank you for the speedy reply!

Yes, this answer helps. I increased the amount of RAM and no longer ran into my previous issue. The EOF error occurred because the pickled file was empty.

However, I ran into another issue: the bert_tokenizer.encode call received an add_special_tokens argument that wasn't handled properly. This is in sam/datasets/processor.py, line 485. I took out the add_special_tokens argument and ran the code. I didn't hit any other errors, but when I evaluated the pretrained model I got an accuracy of 26.78, and I was wondering if that is to be expected. I see that the model was trained for 100 epochs in your paper while the pretrained model is only run for 31 epochs, so I'm not sure whether the lower accuracy is due to less training or whether I need to change the code/packages to support the add_special_tokens keyword in the bert_tokenizer.

[screenshot]

Thanks again!

Hi @zachkitowski,

During training, we save the best checkpoint based on validation performance, and that occurs after the 30th epoch. I can confirm that the uploaded checkpoint is the right one and gets you > 44% on the validation and test sets (see this).

So perhaps you should debug and fix that unavailable argument. Also, check that you have the right version of pytorch-transformers installed (see requirements.txt).

Hi @yashkant,

Thanks for the information. I checked, and the pytorch-transformers version in requirements.txt is 1.0.0, but add_special_tokens and one of the init_weights calls need a newer version. I installed the newest version (1.2.0), and that solved my add_special_tokens issue.
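For anyone who hits the same thing, a quick sanity check along these lines (my understanding is that encode() in 1.0.0 simply lacks the keyword, so the call below fails with a TypeError there):

import pytorch_transformers
from pytorch_transformers import BertTokenizer

print(pytorch_transformers.__version__)  # prints 1.2.0 after the upgrade

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# This keyword is what processor.py passes; it is missing from encode() in 1.0.0.
ids = tokenizer.encode("what does the sign say?", add_special_tokens=True)
print(ids)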

I reran the evaluation code on the val set and still wasn't getting above 40% accuracy (I'm getting around 28%). I'm running it with no beam search, and the only change I made to the config file was num_workers=0. In tracing through the code, I realized that we don't set use_aux_heads in this config file and instead use the default value.

I ran this command:
python train.py --config configs/train-tvqa-eval-tvqa-c3.yml --pretrained_eval data/pretrained-models/best_model.tar

I believe I'm calculating accuracy correctly but I've included it here for reference.

def run_model_no_beam(self, split):
    scores, batch_sizes = [], []
    predictions = []
    self.model.eval()
    with torch.no_grad():
        for batch_dict in tqdm(self.dataloaders[split], desc=f"Eval on {split}"):
            loss, score, batch_size, batch_predictions = forward_model(
                {"loss": "textvqa", "metric": "textvqa"}, self.device, self.model, batch_dict=batch_dict
            )
            print("batch_acc", score, "batch_size", batch_size)
            # Weight each batch accuracy by its size for the overall average.
            scores.append(score * batch_size)
            batch_sizes.append(batch_size)
            predictions.extend(batch_predictions)
    print("accuracy:", sum(scores) / sum(batch_sizes))
    evalai_preds = [{"question_id": x["question_id"], "answer": x["pred_answer"]} for x in predictions]
    return evalai_preds

Another potential issue: I got a weird warning when installing apex on my instance, something about a sign bit possibly being flipped. I'm just trying to evaluate the model, and apex is mainly used for training, so I assumed it shouldn't have any effect on the evaluation run.

Do you know of anything else I might look into?

Thanks in advance!

I figured out my issue: the lock.lmdb files were mixed up between my trainval_ocr and trainval_obj folders.
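For reference, a quick way to sanity-check the feature LMDB folders is to open each one read-only and peek at a few keys (a sketch; the paths are illustrative and should point at your actual trainval_ocr and trainval_obj folders):

import lmdb

for path in ["data/textvqa/trainval_ocr", "data/textvqa/trainval_obj"]:
    env = lmdb.open(path, readonly=True, lock=False)
    with env.begin() as txn:
        keys = []
        for i, (key, _) in enumerate(txn.cursor()):
            if i >= 3:
                break
            keys.append(key)
    print(path, "first keys:", keys)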

I now have a val accuracy of 44.19.

Hi @zachkitowski, can you share how much RAM you allocated?