mlfoundations / datacomp

DataComp: In search of the next generation of multimodal datasets

Home Page: http://datacomp.ai/


Usage with AWS S3 and Ray

0x2b3bfa0 opened this issue

Usage

Cluster creation

ray up --yes cluster.yml
ray dashboard cluster.yml
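
The second command keeps a tunnel open that forwards the remote Ray dashboard to http://localhost:8265; the job submission below assumes this tunnel is still running. To confirm the cluster came up, one quick check:

ray exec cluster.yml 'ray status'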

Job submission

git clone https://github.com/mlfoundations/datacomp
ray job submit \
--address=http://localhost:8265 \
--working-dir=datacomp \
--runtime-env-json="$(
  jq --null-input '
    {
      conda: "datacomp/environment.yml",
      env_vars: {
        AWS_ACCESS_KEY_ID: env.AWS_ACCESS_KEY_ID,
        AWS_SECRET_ACCESS_KEY: env.AWS_SECRET_ACCESS_KEY,
        AWS_SESSION_TOKEN: env.AWS_SESSION_TOKEN
      }
    }
  '
)" \
-- \
python download_upstream.py \
--subjob_size=11520 \
--thread_count=128 \
--processes_count=1 \
--distributor=ray \
--metadata_dir=/tmp/metadata \
--data_dir=s3://datacomp-small \
--scale=small
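
For reference, the jq invocation above only assembles the runtime environment JSON from the local shell environment; it expands to something like the following (secret values elided here; jq emits null for any unset variable):

{
  "conda": "datacomp/environment.yml",
  "env_vars": {
    "AWS_ACCESS_KEY_ID": "…",
    "AWS_SECRET_ACCESS_KEY": "…",
    "AWS_SESSION_TOKEN": "…"
  }
}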

Note

Image shards will be saved to the datacomp-small AWS S3 bucket, specified with the --data_dir option.
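
Once the job completes, a quick sanity check with the AWS CLI (assuming credentials with read access to the bucket):

aws s3 ls s3://datacomp-small/ --recursive --human-readable | head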

Cluster deletion

ray down --yes cluster.yml

Configuration

Sample cluster.yml

cluster_name: datacomp-downloader

min_workers: 0
max_workers: 10
upscaling_speed: 1.0

docker:
  run_options: [--dns=127.0.0.1]
  image: rayproject/ray:2.6.1-py310
  container_name: ray

provider:
  type: aws
  region: us-east-1
  cache_stopped_nodes: false

available_node_types:
  ray.head.default:
    resources: {}
    node_config:
      InstanceType: m5.12xlarge
      ImageId: ami-068d304eca3399469
      BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
            DeleteOnTermination: true
            VolumeSize: 200
            VolumeType: gp2
  ray.worker.default:
    resources: {}
    node_config:
      InstanceType: m5.12xlarge
      ImageId: ami-068d304eca3399469
      BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
            DeleteOnTermination: true
            VolumeSize: 200
            VolumeType: gp2

initialization_commands:
  - wget https://secure.nic.cz/files/knot-resolver/knot-resolver-release.deb
  - sudo dpkg --install knot-resolver-release.deb
  - sudo apt-get update
  - sudo apt-get install --yes knot-resolver
  - echo $(hostname --all-ip-addresses) $(hostname) | sudo tee --append /etc/hosts
  - sudo systemctl start kresd@{1..48}.service
  - echo nameserver 127.0.0.1 | sudo tee /etc/resolv.conf
  - sudo systemctl stop systemd-resolved

setup_commands:
  - sudo apt-get update
  - sudo apt-get install --yes build-essential ffmpeg
  # Assumption: s3fs is the fsspec backend the downloader needs for s3:// output;
  # see "Obscure details" below
  - pip install s3fs
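
For context: most of the initialization_commands exist to run Knot Resolver as a local caching DNS resolver on every node (one kresd instance per vCPU of the m5.12xlarge machines), since downloading images at this scale generates far more DNS queries than a default resolver handles comfortably. A quick way to verify the local resolver on a node, assuming the dnsutils package is installed for dig:

cat /etc/resolv.conf                 # should contain: nameserver 127.0.0.1
dig @127.0.0.1 example.com +short    # should resolve through the local kresd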

Obscure details

  • When --data_dir points to cloud storage such as S3, we also have to specify a local --metadata_dir, because the downloader script doesn't support saving metadata to cloud storage.

  • The last pip install in the setup_commands section is needed for compatibility with AWS S3, because the required libraries aren't included in the conda environment file.

  • There is no need to provide additional AWS credentials if the destination bucket is in the same account as the cluster, because the cluster already has full S3 access through an instance profile.

    • Update: while the cluster does have a default instance profile that grants full S3 access, it didn't seem to work as intended (probably due to rate limiting on the IMDS endpoint), and I ended up having to pass my local AWS credentials as environment variables instead.
  • The Python version in environment.yml must match the Python version of the Ray cluster: make sure that the docker.image in cluster.yml ships exactly the same Python version as the environment.yml from this project (a quick check is sketched below).
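
One way to compare the two versions locally before launching, assuming Docker is available and environment.yml pins Python with a python= spec:

grep 'python=' environment.yml
docker run --rm rayproject/ray:2.6.1-py310 python --version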

Hey, why did you close it?

I think it's a good improvement, and people will review the PRs soon.

Hello! I closed the issue because it wasn't quite actionable, but rather a “note to my future self” that could eventually become documentation. 🙈 I'll reopen it if you wish, though.

Alternative version, without containers.

cluster_name: datacomp-downloader

min_workers: 0
max_workers: 10
upscaling_speed: 1.0

provider:
  type: aws
  region: us-east-1
  cache_stopped_nodes: false

available_node_types:
  ray.head.default:
    resources: {}
    node_config:
      InstanceType: m5.12xlarge
      ImageId: ami-068d304eca3399469
      BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
            DeleteOnTermination: true
            VolumeSize: 200
            VolumeType: gp2
  ray.worker.default:
    resources: {}
    node_config:
      InstanceType: m5.12xlarge
      ImageId: ami-068d304eca3399469
      BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
            DeleteOnTermination: true
            VolumeSize: 200
            VolumeType: gp2

initialization_commands:
  # Knot Resolver
  - wget https://secure.nic.cz/files/knot-resolver/knot-resolver-release.deb
  - sudo dpkg --install knot-resolver-release.deb
  - rm knot-resolver-release.deb
  - sudo apt-get update
  - sudo apt-get install --yes knot-resolver
  - echo $(hostname --all-ip-addresses) $(hostname) | sudo tee --append /etc/hosts
  - sudo systemctl start kresd@{1..48}.service
  - echo nameserver 127.0.0.1 | sudo tee /etc/resolv.conf
  - sudo systemctl stop systemd-resolved
  # Anaconda
  - sudo mkdir /opt/miniconda3 && sudo chown $USER /opt/miniconda3
  - wget https://repo.anaconda.com/miniconda/Miniconda3-py39_22.11.1-1-Linux-x86_64.sh
  - bash Miniconda3-py39_22.11.1-1-Linux-x86_64.sh -f -b -p /opt/miniconda3
  - rm Miniconda3-py39_22.11.1-1-Linux-x86_64.sh
  - /opt/miniconda3/bin/conda init bash
  # Ray
  - conda create --yes --name=ray python=3.10.8
  - echo conda activate ray >> ~/.bashrc
  - pip install 'ray[all]==2.7.0'

setup_commands:
  - sudo apt-get update
  - sudo apt-get install --yes build-essential ffmpeg
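
Job submission against this container-less config works the same way as before. If in doubt about which interpreter the nodes ended up with, one sanity check (this assumes the ray conda environment gets activated by the login shell, as configured above):

ray exec cluster.yml 'python --version'    # expected: Python 3.10.8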