LinWeizheDragon / Retrieval-Augmented-Visual-Question-Answering

This is the official repository for Retrieval Augmented Visual Question Answering


Question: minimum hardware requirements to run experiments

msharmavikram opened this issue

The repository provides no information about the minimum hardware requirements to run the project. Even the original paper gives no details on the NVIDIA A100 cluster (apart from mentioning that an A100 cluster was used, which is not helpful).

Please specify the minimum hardware requirements. Also, does this work with a single RTX 3090? Since a single Hugging Face transformer is supported, one could technically run a much smaller model like Llama-2-7B to generate the response, but I am unsure whether the initial object detection, caption generation, and OCR can be done on a single RTX GPU (and if not, why not)?

For training the FLMR retriever:
- A100 (40G) or equivalent GPUs with sufficient memory are required for bz=4, grad_accum=8, num_negative_examples=4, with in-batch negative sampling.
- A100 (80G) or equivalent GPUs with sufficient memory are required for bz=30, grad_accum=2, num_negative_examples=1, with in-batch negative sampling.
- The batch size can be reduced if the model does not fit in memory, at the cost of a potential slight decrease in performance. (A sketch after this list illustrates why memory grows with the batch size and the number of negatives.)
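To make the memory scaling concrete, here is a simplified, hypothetical sketch of a contrastive loss with in-batch negative sampling. FLMR's actual scoring is late-interaction (ColBERT-style) rather than a single dot product, so this only illustrates the structure: with in-batch negatives, every query scores against every other example's documents, so the score matrix grows with both the batch size and num_negative_examples.

```python
import torch
import torch.nn.functional as F

def in_batch_negative_loss(query_emb, pos_doc_emb, neg_doc_emb):
    """Simplified dense-retrieval loss with in-batch negatives.

    query_emb:   (B, D)    one query embedding per example
    pos_doc_emb: (B, D)    one positive document per query
    neg_doc_emb: (B, N, D) N hard negatives per query
    """
    B, N, D = neg_doc_emb.shape
    # Pool every example's positive and negatives: (B * (1 + N), D).
    docs = torch.cat([pos_doc_emb.unsqueeze(1), neg_doc_emb], dim=1).reshape(-1, D)
    # Each query scores against ALL documents in the batch, so the
    # score matrix is (B, B * (1 + N)) -- this is why GPU memory
    # grows with both bz and num_negative_examples.
    scores = query_emb @ docs.t()
    # The correct document for query i sits at column i * (1 + N).
    labels = torch.arange(B, device=scores.device) * (1 + N)
    return F.cross_entropy(scores, labels)
```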

For training the BLIP2 answer generator:
Since documents are pre-extracted, you can integrate any existing memory-reduction techniques, e.g. LoRA, DeepSpeed, etc., to reduce the GPU memory required. Note that the framework uses pytorch-lightning, so you can use the DeepSpeed utilities provided by pytorch-lightning with some code changes.
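As a rough illustration of the kind of code changes involved, the hypothetical sketch below wraps BLIP-2 in a pytorch-lightning module with LoRA (via peft) and enables pytorch-lightning's built-in DeepSpeed strategy. The checkpoint name, LoRA target modules, and hyperparameters are assumptions for illustration, not the repo's actual configuration.

```python
import pytorch_lightning as pl
import torch
from peft import LoraConfig, get_peft_model
from transformers import Blip2ForConditionalGeneration

class Blip2AnswerGenerator(pl.LightningModule):
    def __init__(self):
        super().__init__()
        base = Blip2ForConditionalGeneration.from_pretrained(
            "Salesforce/blip2-flan-t5-xl"  # illustrative checkpoint
        )
        # LoRA on the T5 language model's attention projections cuts
        # trainable parameters (and optimizer memory) dramatically.
        lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q", "v"])
        self.model = get_peft_model(base, lora)

    def training_step(self, batch, batch_idx):
        # batch is expected to carry pixel_values, input_ids, labels.
        out = self.model(**batch)
        self.log("train_loss", out.loss)
        return out.loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=1e-4)

trainer = pl.Trainer(
    strategy="deepspeed_stage_2",  # shards optimizer state across GPUs
    precision="16-mixed",
    accumulate_grad_batches=8,     # trade compute time for memory
    max_epochs=1,
)
# trainer.fit(Blip2AnswerGenerator(), train_dataloaders=train_loader)
```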

For inference:
An RTX 3090 is sufficient for FLMR retrieval, object detection, caption generation, OCR, and answer generation with BLIP-2. In the framework's code, the retrieval, object detection, caption generation, and OCR results are pre-extracted to reduce the memory required.

Of course, you can build a complete pipeline with your own code (input -> object detection, caption generation, OCR -> retrieval -> answer generation) on an RTX 3090, though some engineering is required to load and unload models to avoid GPU OOM issues.
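One way to do that engineering, sketched below under the assumption that each stage has its own loader and run function (the stage names here are hypothetical, not APIs from this repo), is to run the stages sequentially and free the GPU between them:

```python
import gc
import torch

def run_stage(load_fn, run_fn, *inputs):
    """Load a model, run one pipeline stage, then free the GPU so the
    next stage's model can fit on a single RTX 3090."""
    model = load_fn().to("cuda").eval()
    with torch.no_grad():
        outputs = run_fn(model, *inputs)
    del model
    gc.collect()
    torch.cuda.empty_cache()  # return cached blocks to the driver
    return outputs

# Hypothetical stage functions -- substitute your own detector,
# captioner, OCR model, FLMR retriever, and BLIP-2 generator:
# detections = run_stage(load_detector, detect_objects, image)
# caption    = run_stage(load_captioner, generate_caption, image)
# ocr_text   = run_stage(load_ocr_model, run_ocr, image)
# documents  = run_stage(load_flmr, retrieve, question, image, caption, ocr_text)
# answer     = run_stage(load_blip2, generate_answer, question, image, documents)
```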