
DDSP Docker

Docker image for training a DDSP autoencoder on Google Cloud AI Platform.

Before you begin

Make sure that you have set up a Google Cloud project with billing enabled, installed the Cloud SDK and Docker, and created a Cloud Storage bucket containing your preprocessed dataset.

Quickstart

Define some environment variables

We recommend setting $REGION according to your location. We also recommend choosing the hostname in $IMAGE_URI based on your $REGION choice (e.g. eu.gcr.io instead of gcr.io), because if your Docker images are stored in a different region than the one where the job runs, additional charges apply.

export PROJECT_ID=[YOUR_PROJECT_ID]
export SAVE_DIR=gs://[YOUR_STORAGE_BUCKET_NAME]/[PATH_IN_STORAGE_BUCKET]
export FILE_PATTERN=gs://[YOUR_STORAGE_BUCKET_NAME]/[PATH_IN_STORAGE_BUCKET]/train.tfrecord*
export IMAGE_REPO_NAME=ddsp_train
export IMAGE_TAG=ai_platform
export IMAGE_URI=gcr.io/$PROJECT_ID/$IMAGE_REPO_NAME:$IMAGE_TAG
export REGION=us-central1
export JOB_NAME=ddsp_container_job_$(date +%Y%m%d_%H%M%S)
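
With the variables defined, it can also help to point the gcloud CLI at the same project. This is a convenience step, not strictly part of the quickstart:

gcloud config set project $PROJECT_ID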

Build the image and push it to Container Registry

In the folder containing the Dockerfile and task.py, run the following commands:

docker build -f Dockerfile -t $IMAGE_URI ./
docker push $IMAGE_URI
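
If the push is rejected with an authentication error, Docker is probably not yet configured to use your Google credentials for gcr.io; gcloud can register itself as a credential helper:

gcloud auth configure-docker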

Submit the training to AI Platform

gcloud ai-platform jobs submit training $JOB_NAME \
  --region $REGION \
  --config config_single_vm.yaml \
  --master-image-uri $IMAGE_URI \
  -- \
  --save_dir=$SAVE_DIR \
  --file_pattern=$FILE_PATTERN \
  --batch_size=16 \
  --learning_rate=0.0001 \
  --num_steps=40000 \
  --early_stop_loss_value=5.0
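
The job runs asynchronously. To check its state or follow the training logs from your terminal, you can use the standard AI Platform job commands:

gcloud ai-platform jobs describe $JOB_NAME
gcloud ai-platform jobs stream-logs $JOB_NAME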
AI Platform flags:
  • --region - Region where the training job is run
  • --config - Cluster configuration. In the example above, training on a single VM with one GPU is set. For more information about the available configurations, take a look at the Note on cluster configuration and hyperparameters below.
  • --master-image-uri - URI of the Docker image you've built and submitted to Container Registry
Program flags:
  • --save_dir - Mandatory flag. Bucket directory where checkpoints and summary events will be saved during training.
  • --file_pattern - Mandatory flag. Pattern of the data file names. Must include the full bucket path.
  • --restore_dir - Bucket directory from which checkpoints will be restored before training. When not provided, defaults to save_dir. If there are no checkpoints in the given directory, training will start from scratch.
  • --batch_size - Batch size.
  • --learning_rate - Learning rate.
  • --num_steps - Number of training steps.
  • --early_stop_loss_value - Early stopping. Training stops once total_loss drops below this value.
  • --steps_per_save - Steps per model save.
  • --steps_per_summary - Steps per training summary save.
  • --hypertune - If True enables metric reporting for hyperparameter tuning.
Additional configuration flags
  • --gin_param - Gin parameter bindings. Using this flag requires some familiarity with the Magenta DDSP source code. Take a look at the parameters you can specify in Gin config files.
  • --gin_search_path - Additional Gin file search path. Must be a path inside the Docker container, and the necessary Gin configs should be added when the Docker image is built.
  • --gin_file - Additional Gin config file. If the file is in a Cloud Storage bucket, specify the full gs:// path.
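
For example, a single binding could look like the line below. The parameter name is an illustration based on the DDSP data provider; check the DDSP source for the exact names available:

--gin_param="TFRecordProvider.example_secs=4"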

You can add your own Gin config files in two ways:

  • Add the Gin config file to the Cloud Storage bucket and specify --gin_file as a gs:// path:
--gin_file=gs://[YOUR_STORAGE_BUCKET_NAME]/[PATH_IN_STORAGE_BUCKET]/config_file.gin
  • Copy local Gin configs into the image by adding a COPY instruction to the Dockerfile, then build and push the image. Set --gin_search_path to the directory inside the Docker container where the Gin file is located and --gin_file to the copied file name, as in the sketch below.
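
A minimal sketch of the second approach, using hypothetical names (a local configs/ folder containing my_model.gin):

# In the Dockerfile: copy local Gin configs into the image.
COPY configs/ /root/configs

Then rebuild and push the image as before, and add to the program flags:

--gin_search_path=/root/configs \
--gin_file=my_model.gin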

Note on cluster configuration and hyperparameters

There are two cluster configurations prepared:

  • config_single_vm.yaml - 1 VM configuration with 1 NVIDIA Tesla T4 GPU. Training with this configuration and the recommended parameters (batch_size: 16, learning_rate: 0.0001, num_steps: 40000, early_stop_loss_value: 5.0) takes around 10 hours and consumes around 19 ML units.

  • config_multiple_vms.yaml - 4 VM configuration with 8 NVIDIA Tesla T4 GPUs. Training with this configuration and the recommended parameters (batch_size: 128, learning_rate: 0.001, num_steps: 15000, early_stop_loss_value: 5.0) takes around 5 hours and consumes around 44 ML units.

Feel free to experiment and define your own cluster configurations.
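
For reference, a single-VM, single-GPU setup expressed in the AI Platform config.yaml schema looks roughly like the sketch below; the machine type is an assumption, and the repo's actual config files may differ:

trainingInput:
  scaleTier: CUSTOM
  masterType: n1-standard-8  # assumed machine type
  masterConfig:
    acceleratorConfig:
      count: 1
      type: NVIDIA_TESLA_T4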

Note on dataset location

Instead of uploading the preprocessed dataset into the GCS bucket, you can copy it inside the Docker container. To do so, place the folder with the data files in the same folder as the Dockerfile and rebuild the image with the following snippet added to the Dockerfile:

COPY [FOLDER_WITH_DATA] /root/data

Then you should set the file pattern variable as follows:

export FILE_PATTERN=/root/data/train.tfrecord*

and complete the remaining steps as described above.
