Docker image for training an autoencoder on Google Cloud AI Platform.
Make sure that you have completed the following steps:
- Set up your GCP project
- Create a Google Cloud Storage Bucket
- Enable AI Platform Training and Prediction, Container Registry, and Compute Engine APIs
- Install Docker
- Configure Docker for Cloud Container Registry
- Upload the training data in the TFRecord format to the GCS bucket. You can preprocess your audio files into this format using the ddsp_prepare_tfrecord tool, as described in Making a TFRecord dataset from your own sounds.
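For the last two prerequisites, the commands typically look like the sketch below. gcloud auth configure-docker is the standard Container Registry authentication step; the ddsp_prepare_tfrecord flag names follow the DDSP documentation, and the paths are placeholders, so verify them against your installed version:
# Authenticate Docker with Container Registry (one-time setup):
gcloud auth configure-docker
# Convert local audio into sharded TFRecords written to your GCS bucket:
ddsp_prepare_tfrecord \
  --input_audio_filepatterns=/path/to/audio/*wav \
  --output_tfrecord_path=gs://[YOUR_STORAGE_BUCKET_NAME]/[PATH_IN_STORAGE_BUCKET]/train.tfrecord \
  --num_shards=10 \
  --alsologtostderr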
We recommend setting $REGION according to your location. We also recommend setting the hostname in $IMAGE_URI based on your $REGION choice, because additional charges apply if your Docker images are stored in a different region than the one where the job runs (see the example after the snippet below).
export PROJECT_ID=[YOUR_PROJECT_ID]
export SAVE_DIR=gs://[YOUR_STORAGE_BUCKET_NAME]/[PATH_IN_STORAGE_BUCKET]
export FILE_PATTERN=gs://[YOUR_STORAGE_BUCKET_NAME]/[PATH_IN_STORAGE_BUCKET]/train.tfrecord*
export IMAGE_REPO_NAME=ddsp_train
export IMAGE_TAG=ai_platform
export IMAGE_URI=gcr.io/$PROJECT_ID/$IMAGE_REPO_NAME:$IMAGE_TAG
export REGION=us-central1
export JOB_NAME=ddsp_container_job_$(date +%Y%m%d_%H%M%S)
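For example, if you run the job in an EU region, you can store the image under Container Registry's eu.gcr.io hostname so that the image and the job live in the same multi-region (the region chosen below is just an illustration):
export REGION=europe-west1
export IMAGE_URI=eu.gcr.io/$PROJECT_ID/$IMAGE_REPO_NAME:$IMAGE_TAG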
In the folder containing the Dockerfile and task.py, run the following commands:
docker build -f Dockerfile -t $IMAGE_URI ./
docker push $IMAGE_URI
gcloud ai-platform jobs submit training $JOB_NAME \
--region $REGION \
--config config_single_vm.yaml \
--master-image-uri $IMAGE_URI \
-- \
--save_dir=$SAVE_DIR \
--file_pattern=$FILE_PATTERN \
--batch_size=16 \
--learning_rate=0.0001 \
--num_steps=40000 \
--early_stop_loss_value=5.0
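Once the job is submitted, you can check its status and stream its logs with the standard gcloud commands:
gcloud ai-platform jobs describe $JOB_NAME
gcloud ai-platform jobs stream-logs $JOB_NAME
The flags used in the submit command are described below.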
--region
- Region in which the training job runs.
--config
- Cluster configuration. In the example above, training on a single VM with a single GPU is set. For more information about various configurations, take a look at the note on cluster configuration and hyperparameters below.
--master-image-uri
- URI of the Docker image you've built and pushed to Container Registry.
--save_dir
- Mandatory flag. Bucket directory where checkpoints and summary events will be saved during training.
--file_pattern
- Mandatory flag. Pattern of the data file names. Must include the whole bucket directory.
--restore_dir
- Bucket directory from which checkpoints will be restored before training. When not provided, defaults to save_dir. If there are no checkpoints in the given directory, training will start from scratch.
--batch_size
- Batch size.
--learning_rate
- Learning rate.
--num_steps
- Number of training steps.
--early_stop_loss_value
- Early stopping. Training stops when total_loss drops below this value.
--steps_per_save
- Steps per model save.
--steps_per_summary
- Steps per training summary save.
--hypertune
- If True, enables metric reporting for hyperparameter tuning.
--gin_param
- Gin parameter bindings. Using this flag requires some familiarity with the Magenta DDSP source code. Take a look at the parameters you can specify in Gin config files.
--gin_search_path
- Additional Gin file search path. Must be a path inside the Docker container; the necessary Gin configs should be added at the Docker image build stage.
--gin_file
- Additional Gin config file. If the file is in a GCS bucket, specify the full GCS path.
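For illustration, a --gin_param binding has the following form (the configurable and parameter names here are hypothetical; use names that actually exist in the DDSP Gin configs):
--gin_param="SomeConfigurable.some_parameter=0.5"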
You can add your own Gin config files in two ways:
- Add the Gin config file to the GCS bucket and specify --gin_file as a GCS path:
--gin_file=gs://[YOUR_STORAGE_BUCKET_NAME]/[PATH_IN_STORAGE_BUCKET]/config_file.gin
- Copy local Gin configs into the image in the Dockerfile, then build and push the image. Specify the --gin_search_path flag as the directory inside the Docker container where the Gin file is located, and --gin_file as the copied file name (see the sketch after this list).
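For example (a hypothetical sketch; the folder and file names are illustrative), add the following line to the Dockerfile:
COPY my_gin_configs /root/gin
Then, after rebuilding and pushing the image, append these flags after the -- separator of the submit command:
--gin_search_path=/root/gin \
--gin_file=my_config.gin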
There are two prepared cluster configurations:
- config_single_vm.yaml - 1-VM configuration with 1 NVIDIA Tesla T4 GPU. Training with this configuration and the recommended parameters (batch_size=16, learning_rate=0.0001, num_steps=40000, early_stop_loss_value=5.0) takes around 10 hours and consumes around 19 ML units.
- config_multiple_vms.yaml - 4-VM configuration with 8 NVIDIA Tesla T4 GPUs. Training with this configuration and the recommended parameters (batch_size=128, learning_rate=0.001, num_steps=15000, early_stop_loss_value=5.0) takes around 5 hours and consumes around 44 ML units.
Feel free to experiment and define your own cluster configurations; a starting-point sketch is shown below.
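For instance, a custom single-VM configuration with one T4 GPU might look like the following sketch, which uses the standard AI Platform trainingInput format (the machine type is an assumption, not copied from the bundled configs):
trainingInput:
  scaleTier: CUSTOM
  masterType: n1-standard-8
  masterConfig:
    acceleratorConfig:
      count: 1
      type: NVIDIA_TESLA_T4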
Instead of uploading the preprocessed dataset to the GCS bucket, you can copy it inside the Docker container. To do so, place the folder with the data files in the same folder as the Dockerfile and rebuild the image with the following snippet added to the Dockerfile:
COPY [FOLDER_WITH_DATA] /root/data
Then set the file pattern variable as follows:
export FILE_PATTERN=/root/data/train.tfrecord*
and complete the remaining steps as described above.
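Putting it together, the rebuild-and-resubmit flow reuses the commands from above (re-export JOB_NAME first, since job names must be unique):
export JOB_NAME=ddsp_container_job_$(date +%Y%m%d_%H%M%S)
docker build -f Dockerfile -t $IMAGE_URI ./
docker push $IMAGE_URI
export FILE_PATTERN=/root/data/train.tfrecord*
gcloud ai-platform jobs submit training $JOB_NAME \
  --region $REGION \
  --config config_single_vm.yaml \
  --master-image-uri $IMAGE_URI \
  -- \
  --save_dir=$SAVE_DIR \
  --file_pattern=$FILE_PATTERN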