google-research / scenic

Scenic: A Jax Library for Computer Vision Research and Beyond

[OWL-ViT] Issues running the training script

JKelle opened this issue · comments

I am having trouble running the training script as suggested by the README.

I'm running everything on my Ubuntu EC2 instance, not Colab. I followed the installation instructions in the README, and I'm using the same command it shows:

python -m scenic.projects.owl_vit.main \
  --alsologtostderr=true \
  --workdir=/tmp/training \
  --config=scenic/projects/owl_vit/configs/clip_b32.py

However, I've run into a series of issues when trying to run this training script. Here are the changes I've made to the code locally to resolve some of them so far:

  1. Upgraded to Python 3.10, since some of the dependencies now use type hints that require at least 3.10.
  2. Cloned the big_vision repo and added it to my PYTHONPATH (I prepend PYTHONPATH=$PYTHONPATH:/home/ubuntu/big_vision/ to the command above when I run the training script).
  3. Changed lvis:1.2.0 to lvis:1.3.0 in the DECODERS dictionary, since the TFDS dataset for LVIS is now at version 1.3.0.
  4. Added builder.download_and_prepare() before creating the Dataset, since the script otherwise failed to find the dataset. (Changes 3 and 4 are sketched below.)
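
For reference, here is a minimal sketch of changes 3 and 4. The exact file and surrounding code in Scenic's OWL-ViT input pipeline differ; this only shows the shape of the edits:

import tensorflow_datasets as tfds

# Change 3: the TFDS LVIS dataset is now versioned 1.3.0, so the decoder
# key has to match (illustrative; the real DECODERS dict maps to decoder
# functions in the OWL-ViT input pipeline).
DECODERS = {
    'lvis:1.3.0': ...,  # was 'lvis:1.2.0'
}

# Change 4: make sure the TFDS dataset is actually built on disk before
# the input pipeline tries to read it.
builder = tfds.builder('lvis')
builder.download_and_prepare()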

Now I'm stuck: downloading the LVIS annotation files fails with an AccessDenied error when fetching them from the dl.fbaipublicfiles.com S3 bucket.

The README suggests having the LVIS dataset locally instead of downloading it via --config.dataset_configs.train.decoder_kwarg_list='({"tfds_data_dir": "//your/data/dir"},)', but I'm not sure how to get a TFDS version of LVIS on disk.
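
If I understand that flag correctly, getting LVIS on disk would just mean preparing the TFDS dataset into whatever directory the flag points at. An untested sketch, using "/your/data/dir" as the placeholder from the README:

import tensorflow_datasets as tfds

# Build the TFDS version of LVIS into an explicit data dir; this is the
# directory the tfds_data_dir decoder kwarg should then point at.
builder = tfds.builder("lvis", data_dir="/your/data/dir")
builder.download_and_prepare()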

Am I going about this the wrong way? Is my next step to download the TFDS version of LVIS, or is there a better way to get local training to work?

I encountered a similar problem, and it took me two weeks to solve it.
First, download the TFDS source code and modify the LVIS download URLs as described in tensorflow/datasets#5094; the dl.fbaipublicfiles.com S3 bucket no longer works.
Second, pip install your modified version of TFDS. It should work.
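
Roughly, the steps look like this (untested sketch; the LVIS builder file location varies by TFDS version):

git clone https://github.com/tensorflow/datasets.git
# Edit the LVIS builder so its annotation URLs no longer point at
# dl.fbaipublicfiles.com (see tensorflow/datasets#5094 for working URLs),
# then install the patched checkout in editable mode.
pip install -e ./datasets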

@GangqiangZhao Thank you! I'm now able to download those annotation files.

However, I still cannot finish building the dataset. When I run builder.download_and_prepare(), it runs for a few hours before being killed, I suspect because it runs out of memory. I'm running it locally on a machine with 64 GB of RAM. Were you able to build the LVIS dataset, and did you build it locally or in Google Cloud Dataflow via Apache Beam?

import apache_beam as beam
import tensorflow_datasets as tfds

builder = tfds.builder("lvis")

# Run the TFDS generation pipeline locally with Beam's DirectRunner,
# using four worker processes.
flags = ["--direct_num_workers=4", "--direct_running_mode=multi_processing"]
builder.download_and_prepare(
    download_config=tfds.download.DownloadConfig(
        beam_runner="DirectRunner",
        beam_options=beam.options.pipeline_options.PipelineOptions(flags=flags),
    )
)
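
For comparison, here is an untested sketch of pointing the same build at Google Cloud Dataflow instead of the local DirectRunner; the project, region, and gs:// paths are placeholders:

import apache_beam as beam
import tensorflow_datasets as tfds

# Placeholder Dataflow settings; a requirements file listing
# tensorflow-datasets is needed so the remote workers can import it.
flags = [
    "--runner=DataflowRunner",
    "--project=my-project",
    "--region=us-central1",
    "--temp_location=gs://my-bucket/tmp",
    "--staging_location=gs://my-bucket/staging",
    "--requirements_file=/tmp/beam_requirements.txt",
]
builder = tfds.builder("lvis")
builder.download_and_prepare(
    download_config=tfds.download.DownloadConfig(
        beam_options=beam.options.pipeline_options.PipelineOptions(flags=flags),
    )
)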

As far as I can tell, downloading the TFRecord representation of the LVIS dataset directly isn't an option, unfortunately.

Edit: I filed a GitHub issue about this on the tensorflow/datasets repo: tensorflow/datasets#5113

@JKelle Did you manage to solve this issue? I am facing the exact same issue.