leaderj1001 / Vision-Language

Vision-Language: solving the GQA (Visual Reasoning in the Real World) dataset.

gqa vision-language vqa

GQA: Visual Reasoning in the Real World

Data structure

├── Question Number
    ├── Annotations
    │   ├── answer
    │   ├── fullAnswer
    │   └── question
    │
    ├── answer
    ├── entailed
    ├── equivalent
    ├── fullAnswer
    ├── groups
    ├── imageId
    ├── isBalanced
    ├── question
    ├── semantic
    ├── semanticStr
    └── types
        ├── detailed
        ├── semantic
        └── structural

answer
imageId
question
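As a rough illustration of the structure above, the sketch below pulls the `answer`, `imageId`, and `question` fields out of a question dictionary keyed by question number. It assumes the questions are stored as JSON in that layout. The function names, the file path argument, and the sample record are hypothetical and are not taken from this repository.

```python
import json


def load_questions(path):
    """Load a GQA-style question file (the path is supplied by the caller)."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def extract_samples(questions):
    """Keep only the three fields listed above for each question number."""
    samples = []
    for question_id, record in questions.items():
        samples.append({
            "questionId": question_id,
            "imageId": record["imageId"],
            "question": record["question"],
            "answer": record["answer"],
        })
    return samples


# Hypothetical record, made up for illustration only.
example_questions = {
    "0001": {
        "imageId": "img_42",
        "question": "What color is the car?",
        "answer": "red",
        "fullAnswer": "The car is red.",
        "isBalanced": True,
    }
}

samples = extract_samples(example_questions)
print(samples[0]["question"], "->", samples[0]["answer"])
```

In practice the dictionary would come from `load_questions(...)` on a downloaded question file rather than from an inline example.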

Network Architecture

Image-Question Aggregator

Pretrained image model
- Tensornets (GitHub)
Pretrained question model
- ELMo via tensorflow-hub
Attention model (we use only the attention module)
- Self-Attention Generative Adversarial Networks (paper)
- Attention (GitHub)
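For readers unfamiliar with the attention module from the Self-Attention GAN paper, here is a minimal NumPy sketch of that style of self-attention over a feature map. It is not this repository's TensorFlow code. The 1x1 convolutions are written as plain matrix multiplications, the paper's final output projection is omitted for brevity, and all weight names and shapes are assumptions chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_f, w_g, w_h, gamma=0.0):
    """SAGAN-style self-attention on a feature map.

    x:        (H, W, C) feature map
    w_f, w_g: (C, C_k) projections used to score pairs of locations
    w_h:      (C, C) projection of the features that get mixed
    gamma:    scalar weight on the attention output (learned in the paper,
              initialised to 0 so the block starts as an identity mapping)
    """
    h, w, c = x.shape
    flat = x.reshape(h * w, c)           # N x C, with N = H * W locations
    f = flat @ w_f                       # N x C_k
    g = flat @ w_g                       # N x C_k
    v = flat @ w_h                       # N x C
    scores = g @ f.T                     # scores[j, i] = f(x_i) . g(x_j)
    beta = softmax(scores, axis=-1)      # row j: attention over all locations i
    o = beta @ v                         # attended features, N x C
    out = gamma * o + flat               # residual connection
    return out.reshape(h, w, c), beta


rng = np.random.RandomState(0)
x = rng.randn(4, 4, 8)
w_f = rng.randn(8, 2)
w_g = rng.randn(8, 2)
w_h = rng.randn(8, 8)

out, beta = self_attention(x, w_f, w_g, w_h, gamma=0.0)
print(out.shape, beta.shape)
```

With `gamma=0.0` the output equals the input, which is how the paper lets the network start from local features and gradually learn to rely on non-local ones.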

Requirements

tensorflow-gpu==1.13.1
numpy==1.16.2
tensorflow-hub==0.4.0
python==3.7.3
cv2==4.0.0 (OpenCV, installed via the opencv-python package)
tqdm==4.31.1
