Visual Genome Test Set Curation

Code used to curate the test set for the Visual Genome dataset.

Overview

Each time you send a set of images through the AMT pipeline, it is called an "experiment". For example, you may decide to send 10 images through the whole pipeline and call the experiment "10_images". This repo allows you to send these images through the whole pipeline, which consists of 3 stages. Each stage may need to be launched multiple times:

  • Initial launch: You first send the data through the initial launch of a stage. For example, a stage 1 initial launch on "10_images" would entail showing the 10 images to AMT workers and asking them "How many <s, p, o> are in this image?" for each image. In this example, because we have x tasks per HIT in stage 1, we would launch 1 HIT with 2 assignments, meaning up to 2 workers can work on the HIT.
  • Disagreement launch: Because we launch 2 assignments for each HIT in the initial launch, we could very well have disagreements. For example, after completing the stage 1 initial launch, say that Worker 1 said image_1 had 3 <s, p, o> while Worker 2 said the same image had 10 <s, p, o>. That's a big difference! We would want a third opinion. To that end, we collect all images from the initial launch that had disagreements between workers and send them to AMT again in what's called a disagreement launch (see the sketch after this list). In our example, a stage 1 disagreement launch would show a HIT with 1 task (i.e. image_1) to just 1 worker (i.e. assignments=1). Note: stage 2 cannot have a disagreement launch because of the nature of the task (see Stages: Stage 2 for more info).
  • Relaunches: In either the initial launch or the disagreement launch, a relaunch may be necessary to get acceptable HIT results after a worker fails an attention check (more info on each stage's attention checks in "Stages"). For example, say you run a stage 1 initial launch on "10_images" (recall: up to 2 workers could be working on these HITs). If one of the workers failed an attention check, we would want to relaunch the HIT they worked on so that we can get acceptable results. Note: even though it is a relaunch, we would still consider this part of the "stage 1 initial launch"! Another note: relaunches can, of course, also happen in the disagreement launch if a worker fails an attention check.
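To make the disagreement check concrete, here is a minimal sketch in Python. The helper name and the zero tolerance are hypothetical illustrations, not this repo's actual API; the real logic lives in the launch/knowledge scripts.

# Hypothetical sketch: decide which images from a stage 1 initial launch
# need a disagreement launch, given each image's two worker answers.
def needs_disagreement_launch(answer_1, answer_2, tolerance=0):
    # Assumed rule: any difference beyond `tolerance` counts as a disagreement
    # (e.g. 3 vs. 10 <s, p, o> is a clear disagreement).
    return abs(answer_1 - answer_2) > tolerance

initial_answers = {"image_1": (3, 10), "image_2": (2, 2)}
disagreed = {img: ans for img, ans in initial_answers.items()
             if needs_disagreement_launch(*ans)}
print(disagreed)  # {'image_1': (3, 10)} -> relaunch image_1 with assignments=1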

To begin an experiment, you must create a folder in the data folder with the experiment's name, and create the initial_data.json file inside it. As you send the experiment through these launches, the code will incrementally build out other files in the experiment folder. These files keep track of the launch results and update the data as we learn more about it through the AMT HITs. See "data Folder" for more info on these files.

There are various code files that you must interact with in order to successfully send an experiment through the full AMT pipeline. Each file has its own purpose, and they all ultimately touch (i.e. read from or write to) some file in the experiment folder. See "Pipeline" for info on each file.

While it is important that you follow the right steps, don't stress! There are many asserts throughout the code that ensure the steps are followed in the correct order. Even so, it's better to be safe than sorry. See "Workflow" for the detailed steps.

data Folder

The data folder must contain folders with the names of the experiments that you want to run. For example, if you want to run python launch.py --exp_name test_exp --stage 1 --initial_launch --sandbox, then there must be a folder named test_exp in the data folder. In the test_exp folder, you must supply one file called initial_data.json which contains information about the images that you want to launch to MTurk. Here is an example initial_data.json file:

[{"url": "https://cs.stanford.edu/people/rak248/VG_100K/2349753.jpg",
  "predicate": "near",
  "subject": {
    "name": "person"
  },
  "object": {
   "name": "bear"
  }
}]

In short, the JSON file should be a list of dictionaries, where each dictionary has url, predicate, subject, and object attributes. The latter two are themselves dictionaries, each with a name attribute.
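Before launching, it can be worth sanity-checking the file against this schema. Here is a minimal sketch of such a check; the validate_initial_data helper is hypothetical, but the path layout and field names match the description above.

import json

def validate_initial_data(exp_name):
    # Load data/<exp_name>/initial_data.json and check the expected shape.
    with open(f"data/{exp_name}/initial_data.json") as f:
        entries = json.load(f)
    assert isinstance(entries, list), "initial_data.json must be a list of dicts"
    for entry in entries:
        assert {"url", "predicate", "subject", "object"} <= entry.keys(), entry
        assert "name" in entry["subject"] and "name" in entry["object"], entry
    return entries

entries = validate_initial_data("test_exp")
print(f"{len(entries)} relationships ready to launch")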

As part of the launches, many files will be autogenerated and modified to keep track of the launch metadata and knowledge gained through the MTurk responses:

  • stage_x_launches contains metadata for each launch in the Xth stage (i.e. 1, 2, or 3), such as the HIT ID, the number of assignments in that HIT, whether or not it is done and dumped, which assignments were approved, and whether or not the knowledge from the HIT was extracted.
  • amt_dump.json is a massive dictionary of {hit_id: results}.
  • images_knowledge.json, arguably the most important file, contains all of the actual knowledge gained through MTurk responses. For each image, it contains, in a readable format, the knowledge we want to gain about that image.

For example, here is a sample from the file after a stage 1 launch:

{
  "2349753.jpg": {
    "stage_1": {
      "person_near_bear": {
        "hit_ids": [
          "3VAOOVPI3ZR829SRTJKDU4CW0PULLR",
          "36U4VBVNQOCMOXAY7H9A73HEJ8CURK"
        ],
        "worker_1": {
          "worker_id": "AU3NU1RYO2FVA",
          "answer": 2
        },
        "worker_2": {
          "worker_id": "AU3NU1RYO2FVB",
          "answer": 2
        },
        "final_answer": 2
      }
    }
  }
}

What this tells us is that, for the image 2349753.jpg, we did a stage 1 launch asking about just one relationship, <person, near, bear> (if we had asked about more relationships, whether in the same HIT or in a different HIT in that launch, they would appear here too). Two workers, AU3NU1RYO2FVA and AU3NU1RYO2FVB, answered the stage 1 question; they both said 2, so the final answer is 2. The HITs that allowed us to gather this knowledge are listed. Since there are two HIT IDs, there must have been a relaunch, because one of the worker assignments from the first launch failed an attention check (if none had, there would be only one HIT ID).
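Because the file is plain JSON, it is easy to inspect programmatically. Here is a minimal sketch that summarizes stage 1 results; the field names follow the sample above, and the experiment path is a hypothetical example.

import json

# Summarize stage 1 knowledge: final answer per relationship, plus whether
# a relaunch happened (more than one HIT ID means some assignment failed
# an attention check and the HIT was relaunched).
with open("data/test_exp/images_knowledge.json") as f:
    knowledge = json.load(f)

for image, stages in knowledge.items():
    for rel, info in stages.get("stage_1", {}).items():
        relaunched = len(info["hit_ids"]) > 1
        print(f"{image} {rel}: final_answer={info['final_answer']} "
              f"(relaunched: {relaunched})")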

Pipeline

Coming Soon

Workflow

Below is the pseudocode for how you should run the various files to send a batch of images (i.e. an experiment) through the pipeline. In summary: for each stage, run the launch, dump the HIT results, and extract the HIT knowledge until no attention checks are failing. For stages 1 and 3, you will have to do this for both an initial launch and a disagreement launch. The --sandbox flag shown in the stage 1 commands targets the MTurk sandbox rather than production; add or omit it consistently depending on whether you are testing or launching for real.

# Stage 1
# -------
do {
    python launch.py --exp_name name --stage 1 --initial_launch --sandbox
    while (there are incomplete HITs) python dump.py --exp_name name --stage 1 --sandbox
    python knowledge.py --exp_name name --stage 1 --sandbox
} while (there are failed attention checks)

do {
    python launch.py --exp_name name --stage 1 --disagreement_launch --sandbox
    while (there are incomplete HITs) python dump.py --exp_name name --stage 1 --sandbox
    python knowledge.py --exp_name name --stage 1 --sandbox
} while (there are failed attention checks)


# Stage 2
# -------
do {
    python launch.py --exp_name name --stage 2 --initial_launch
    while (there are incomplete HITs) python dump.py --exp_name name --stage 2
    python knowledge.py --exp_name name --stage 2
} while (there are failed attention checks)


# Stage 3
# -------
do {
    python launch.py --exp_name name --stage 3 --initial_launch
    while (there are incomplete HITs) python dump.py --exp_name name --stage 3
    python knowledge.py --exp_name name --stage 3
} while (there are failed attention checks)

do {
    python launch.py --exp_name name --stage 3 --disagreement_launch
    while (there are incomplete HITs) python dump.py --exp_name name --stage 3
    python knowledge.py --exp_name name --stage 3
} while (there are failed attention checks)
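If you want to automate these loops, a thin Python driver over subprocess is one option. The sketch below assumes each script signals its state through its exit code (e.g. dump.py exiting non-zero while HITs are incomplete, knowledge.py exiting non-zero when attention checks failed). This repo does not guarantee that behavior, so in practice you would likely inspect stage_x_launches and images_knowledge.json instead.

import subprocess

def run(script, exp_name, stage, *flags):
    # Run one pipeline script; report success via its exit code (assumption).
    cmd = ["python", script, "--exp_name", exp_name, "--stage", str(stage), *flags]
    return subprocess.run(cmd).returncode == 0

def run_stage_launch(exp_name, stage, launch_flag, sandbox=True):
    extra = ("--sandbox",) if sandbox else ()
    while True:  # repeat until no attention checks fail
        run("launch.py", exp_name, stage, launch_flag, *extra)
        while not run("dump.py", exp_name, stage, *extra):
            pass  # keep dumping until all HITs are complete
        if run("knowledge.py", exp_name, stage, *extra):
            break  # no failed attention checks; move on

run_stage_launch("test_exp", 1, "--initial_launch")
run_stage_launch("test_exp", 1, "--disagreement_launch")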

Other Folders and Files

Coming Soon
