entrpn / serving-model-cards

Collection of OSS models that are containerized into a serving container

Change aiplatform.gapic.AcceleratorType used from TPU to A100 GPU

StateGovernment opened this issue

commented

How do I change the default accelerator type used for Dreambooth training?

Simply changing the following line throws a cascade of RPC errors; please point me toward the right way to do this.

"accelerator_type": aiplatform.gapic.AcceleratorType.TPU_V3,

@StateGovernment please post the error message.

Is there a reason you want to use an A100? TPU trains really fast, and the model weights can easily be converted to PyTorch weights with diffusers later if needed.
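For reference, converting the trained Flax weights to PyTorch later is roughly a one-liner with diffusers; a sketch, with placeholder paths:

from diffusers import StableDiffusionPipeline

# Load the Flax (TPU-trained) pipeline and re-save it with PyTorch weights.
pipe = StableDiffusionPipeline.from_pretrained("path/to/flax-output", from_flax=True)
pipe.save_pretrained("path/to/pytorch-output")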

I haven't run this code with GPUs, but it should technically work. My guess is that the machine type needs to be changed to one that supports A100s. If you're using a single A100 (40GB), change the machine_type line to a2-highgpu-1g and call gcp_run_train.py with --accelerator-count=1.

For the compatibility of machine types with GPU types, take a look at this link.
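For concreteness, a minimal sketch of the machine_spec for a single A100; the rest of the worker pool spec and the custom job config stay the same:

from google.cloud import aiplatform

# Single A100 (40GB) on an a2-highgpu-1g host; only the machine_spec changes.
machine_spec = {
    "machine_type": "a2-highgpu-1g",
    "accelerator_type": aiplatform.gapic.AcceleratorType.NVIDIA_TESLA_A100,
    "accelerator_count": 1,
}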

You'll also need to install the CUDA build of jaxlib; change this line to:

RUN pip install --upgrade "jax[cuda]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
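To sanity-check that JAX actually sees the GPU inside the container, a quick check like this (optional, not part of the original steps) should print a GPU device rather than only a CPU:

python3 -c "import jax; print(jax.devices())"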

Rebuild the container, push it to GCR, and run gcp_run_train.py again.
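Roughly like this (a sketch; substitute your own project ID and image name, and check gcp_run_train.py for the exact flags it expects):

docker build -t gcr.io/YOUR_PROJECT/training-dreambooth:latest .
docker push gcr.io/YOUR_PROJECT/training-dreambooth:latest
python gcp_run_train.py --accelerator-count=1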

commented

@entrpn I only have a TPU quota of 8, so the training fails after 4-5 minutes. I requested a quota increase to 30, which will take a while. In the meantime I'd like to see how the model trains on A100s, and probably gather metrics to compare it with TPUs once I have some quota.

This was the error I ran into as I tried to change the accelerator type.
Screenshot 2023-03-16 at 10 13 48 AM

@StateGovernment that's because the accelerator count needs to be set to a minimum of 8; if you set the accelerator count to 8 with TPU, it should work.
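Concretely, the machine_spec should keep the TPU defaults with the count set to 8, along these lines:

machine_spec = {
    "machine_type": "cloud-tpu",
    "accelerator_type": aiplatform.gapic.AcceleratorType.TPU_V3,
    "accelerator_count": 8,  # Vertex AI custom jobs with TPU v3 expect 8 here
}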

commented

@entrpn The accelerator count was set to 8 by default, and I only have a TPU quota of 8 for my account. I tried to change the count to 6 through the CLI but it didn't let me, so I believe the count is hard-set to 8. Training still stops after 11 minutes; let me attach a screenshot of what I see on the console when the training stops.

Screenshot 2023-03-16 at 11 02 00 AM

commented

@entrpn I've successfully launched a training job with an A100 after changing the configuration as suggested above, but there is almost no activity in the console or logs; it has been almost 25 minutes and it still says in progress with zero activity. Please refer to the screenshots below, along with CPU utilisation and logs at the very end. Please help.

Screenshot 2023-03-16 at 3 08 57 PM

Screenshot 2023-03-16 at 3 11 32 PM

Screenshot 2023-03-16 at 3 12 05 PM

@StateGovernment I forgot to add another step: the container doesn't install the CUDA drivers, so it won't use the GPU and will be extremely slow. You'll need to change [this line](https://github.com/entrpn/serving-model-cards/blob/main/training-dreambooth/Dockerfile#L1) to something like:

FROM nvidia/cuda:11.3.1-base-ubuntu20.04

At this point you might need to make extra modifications to the Dockerfile; you can look at [this](https://github.com/entrpn/serving-model-cards/blob/main/stable-diffusion-batch-job/Dockerfile) Dockerfile for reference.

commented

@entrpn I see, I somehow missed that detail too, thank you for pointing it out.

I also believe this line needs to change. I'm not sure what to change it to though; please help me out.

I might even end up making a different Dockerfile altogether for GPUs.

commented

@entrpn I've followed the instructions above but the training won't start at all. Please refer to the screenshots below; I've also attached the Dockerfile I used to build the image and the config used to launch the job. Please help.

Screenshot 2023-04-05 at 5 56 20 PM

Screenshot 2023-04-05 at 5 56 35 PM

Dockerfile

FROM nvidia/cuda:11.3.1-base-ubuntu20.04

RUN apt-get update && \
    apt-get install -y software-properties-common && \
    add-apt-repository -y ppa:deadsnakes/ppa && \
    apt-get update && \
    apt install -y python3.8 && \
    apt-get -y install python3-pip

RUN apt-get update && apt-get -y upgrade \
  && apt-get install -y --no-install-recommends \
    git \
    wget \
    g++ \
    ca-certificates \
    && rm -rf /var/lib/apt/lists/*

RUN apt-get update && apt-get install -y curl
RUN echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] http://packages.cloud.google.com/apt cloud-sdk main" | \
    tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | \
    tee /usr/share/keyrings/cloud.google.gpg && apt-get update -y && apt-get install google-cloud-sdk -y

# RUN pip install "jax[tpu]>=0.2.16" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
RUN pip install --upgrade "jax[cuda]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
RUN pip install git+https://github.com/huggingface/diffusers.git
RUN pip install transformers flax optax torch torchvision ftfy tensorboard modelcards


WORKDIR 'training_dreambooth'

COPY . .

Config used to launch the training job

custom_job = {
        "display_name": "training-dreambooth-alisha-1000steps",
        "job_spec": {
            "worker_pool_specs": [
                {
                    "machine_spec": {
                        # "machine_type": "cloud-tpu",
                        # "accelerator_type": aiplatform.gapic.AcceleratorType.TPU_V3,
                        # "accelerator_count": 8,
                        "machine_type": "a2-highgpu-1g",
                        "accelerator_type": aiplatform.gapic.AcceleratorType.NVIDIA_TESLA_A100,
                        "accelerator_count": 1,
                    },
                    "replica_count": 1,
                    "disk_spec" : {
                        "boot_disk_type": "pd-ssd",
                        "boot_disk_size_gb" : 500
                    },
                    "container_spec": {
                        "image_uri": "gcr.io/dreamboothtest/training-dreambooth-new-gpu:latest",
                        "command": [],
                        "args": [],
                        "env" : [
                            {"name" : "MODEL_NAME", "value" : "runwayml/stable-diffusion-v1-5"},
                            {"name" : "INSTANCE_PROMPT", "value" : "a photo of al45 person"},
                            {"name" : "GCS_OUTPUT_DIR", "value" : "gs://alishadreamboothtest"},
                            {"name" : "RESOLUTION", "value" : "512"},
                            {"name" : "BATCH_SIZE", "value" : "1"},
                            {"name" : "LEARNING_RATE", "value" : "1e-6"},
                            {"name" : "MAX_TRAIN_STEPS", "value" : "1000"},
                            {"name" : "HF_TOKEN", "value" : "<>"},
                            {"name" : "CLASS_PROMPT", "value" : "A photo of a person"},
                            {"name" : "NUM_CLASS_IMAGES", "value" : "56"},
                            {"name" : "PRIOR_LOSS_WEIGHT", "value" : "1.0"},
                            {"name" : "GCS_INPUT_DIR", "value" : "gs://alishadreamboothtest/training_images"},
                        ]
                    },
                }
            ],
            "enable_web_access" : True
        },
    }

The reason your job completes without training is that the base TPU image knows to use main.sh as the entrypoint, but the nvidia/cuda base image doesn't. Add this to the end of your Dockerfile:

ENTRYPOINT ["./main.sh"]

This should start the job.
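One possible extra caveat (untested on my end): if main.sh isn't marked executable in your build context, the container will fail with a permission error; in that case, add this before the ENTRYPOINT line:

RUN chmod +x ./main.sh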