The above picture illustrates the submission and debugging workflows of a TACC job.
Before using the tcloud SDK, make sure that you have submitted your public key to TACC. You may generate an SSH public key according to the steps.
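If you do not have a key pair yet, generating one can be sketched with the standard OpenSSH ssh-keygen tool. This is a minimal sketch, not TACC's official procedure: the key type (ed25519), the file name tacc_key, the comment "tacc", and the empty passphrase are all assumptions made for illustration, so check the linked steps for the key type TACC actually expects.

```shell
# Assumed location and name for the new key pair (hypothetical; pick your own).
KEY_FILE="$HOME/.ssh/tacc_key"

if ! command -v ssh-keygen >/dev/null 2>&1; then
  echo "ssh-keygen not found; install OpenSSH first" >&2
elif [ -f "$KEY_FILE" ]; then
  echo "$KEY_FILE already exists; not overwriting"
else
  mkdir -p "$(dirname "$KEY_FILE")"
  chmod 700 "$(dirname "$KEY_FILE")"
  # -N "" sets an empty passphrase so the sketch runs non-interactively;
  # drop -N to be prompted for a passphrase instead.
  ssh-keygen -t ed25519 -f "$KEY_FILE" -N "" -C "tacc"
fi

# The .pub file is the public key to submit; the other file stays private.
if [ -f "$KEY_FILE.pub" ]; then
  cat "$KEY_FILE.pub"
fi
```

The private half of the pair (the file without the .pub suffix) is the one whose path tcloud config [-f/--file] asks for.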
- Download tcloud SDK
  Download the latest tcloud SDK from tags.
- Install tcloud SDK
  Place setup.sh and tcloud in the same directory, and run setup.sh.
- Before using the tcloud CLI to submit ML jobs, you need to configure your TACC credentials. You can do this by running the tcloud config command:

      $ tcloud config [-u/--username] MYUSERNAME
      $ tcloud config [-f/--file] MYPRIVATEFILEPATH
- Run the tcloud init command to obtain the latest cluster hardware information from the TACC cluster.
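The configuration steps above can be chained into one small script. This is a sketch under stated assumptions: the username and key path are hypothetical placeholders, and the short flags -u and -f are used on the reading that [-u/--username] and [-f/--file] mean either spelling is accepted. The script skips everything if tcloud is not on your PATH.

```shell
# Hypothetical values: replace with your TACC username and private key path.
TACC_USERNAME="myusername"
TACC_KEY_FILE="$HOME/.ssh/id_rsa"

if ! command -v tcloud >/dev/null 2>&1; then
  echo "tcloud not found on PATH" >&2
  SETUP_STATUS="tcloud-missing"
elif tcloud config -u "$TACC_USERNAME" \
  && tcloud config -f "$TACC_KEY_FILE" \
  && tcloud init; then
  # All three commands succeeded: credentials set, cluster info pulled.
  SETUP_STATUS="configured"
else
  echo "a tcloud command failed; check the messages above" >&2
  SETUP_STATUS="failed"
fi
echo "setup status: $SETUP_STATUS"
```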
You can use tcloud init to pull the latest cluster configuration from TACC. tuxiv.conf has four parts, each configuring a different aspect of job submission. Note that tuxiv.conf follows the YAML format.
- Entrypoint
  In this section, enter the shell commands that run your code, one per line. The tcloud CLI runs the job according to these commands.

      entrypoint:
          - python ${TACC_WORKDIR}/mnist.py --epoch=3

- Environment
  In this section, you can specify your software requirements, including the environment name, dependencies, source channels and so on. The tcloud CLI will help build your environment with Miniconda.

      environment:
          name: torch-env
          dependencies:
              - pytorch=1.6.0
              - torchvision=0.7.0
          channels: pytorch

- Job
  In this section, you can specify your Slurm configuration for cluster resources, including the number of nodes, CPUs, GPUs, the output file and so on. All Slurm cluster configuration should be set in the general part.

      job:
          name: test
          general:
              - nodes=2
              - output=${TACC_SLURM_USERLOG}/output.log

  Note: You can modify the output log path in the Job section. For debugging purposes, we recommend that you set the output value under the ${TACC_USERDIR} directory and check it using tcloud ls and tcloud download.
- Datasets
  - tcloud makes the public datasets accessible under TACC_USERDIR. You can view the table of datasets at Dataset Info or check the table below.

    |   | Dataset Name |
    |---|---|
    | 0 | imagenet |
    | 1 | mnist |
    | 2 | cifar-10 |
    | 3 | coco17 |
    | 4 | more datasets upon request |

  - To access a public dataset, add the following to your tuxiv.conf file:

        datasets:
            - imagenet

  - Use this path as the dataset directory:

        ${TACC_USERDIR}/DATASET_NAME

  - User dataset: if you want to use your own dataset, you may skip this part and follow the instructions to upload and use your dataset.
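Putting the four sections together, a complete file might look like the sketch below. It only combines the fragments given for each section. The file name tuxiv.example.conf is a hypothetical placeholder chosen so that an existing tuxiv.conf is not overwritten, and the choice of mnist as the dataset is an assumption made to match the mnist.py entrypoint.

```shell
# Quoting 'EOF' stops the local shell from expanding ${TACC_...} variables,
# so they reach the file literally.
cat > tuxiv.example.conf <<'EOF'
entrypoint:
    - python ${TACC_WORKDIR}/mnist.py --epoch=3
environment:
    name: torch-env
    dependencies:
        - pytorch=1.6.0
        - torchvision=0.7.0
    channels: pytorch
job:
    name: test
    general:
        - nodes=2
        - output=${TACC_SLURM_USERLOG}/output.log
datasets:
    - mnist
EOF

# Quick sanity check: list the top-level sections that were written.
grep '^[a-z]*:$' tuxiv.example.conf
```

Rename the file to tuxiv.conf once you have adjusted it for your own job.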
- TACC_WORKDIR: TACC job workspace directory. Each job has a different workspace directory.
- TACC_USERDIR: TACC user directory.
- TACC_SLURM_USERLOG: Slurm log directory. The default value is ${TACC_USERDIR}/slurm_log.
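To see how these variables relate, the default for TACC_SLURM_USERLOG can be mimicked locally with shell parameter expansion. This is only an illustration: on TACC the variables are set for you, and the value given to TACC_USERDIR here is a hypothetical placeholder.

```shell
# Hypothetical value for illustration; on TACC this is provided by the platform.
TACC_USERDIR="/mnt/home/username"

# Use the documented default only when the variable is unset or empty.
TACC_SLURM_USERLOG="${TACC_SLURM_USERLOG:-${TACC_USERDIR}/slurm_log}"

# Where output=${TACC_SLURM_USERLOG}/output.log from the Job section would point.
echo "job log: ${TACC_SLURM_USERLOG}/output.log"
```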
In this section, we provide two methods to monitor the job log.
- Download
  You can either save your output files in USERDIR or copy them to USERDIR in your own code. After training, use tcloud ls [filepath] to find the output files and tcloud download [filepath] to download them. Note that you can only read and download files in USERDIR, and the files in WORKDIR may be removed after the job is finished.
- Run application services
  Here we give an example of an application service, which uses TensorBoard to monitor a job.

      ssh -p 30041 -L 10006:127.0.0.1:10006 username@ising.cse.ust.hk /mnt/home/username/.Miniconda3/envs/torch-env/bin/tensorboard --logdir=/mnt/home/username/WORKDIR/PyTorch/runs --port=10006
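The TensorBoard command is easier to adapt when its parts are named. The sketch below rebuilds the same command from variables and prints it instead of running it. The variable names are hypothetical, and username remains a placeholder for your own account.

```shell
# Values taken from the example command; "username" is a placeholder.
TACC_USER="username"
SSH_HOST="ising.cse.ust.hk"
SSH_PORT=30041
TB_PORT=10006
ENV_NAME="torch-env"
TENSORBOARD="/mnt/home/${TACC_USER}/.Miniconda3/envs/${ENV_NAME}/bin/tensorboard"
LOGDIR="/mnt/home/${TACC_USER}/WORKDIR/PyTorch/runs"

# -p selects the SSH port; -L forwards local port TB_PORT to 127.0.0.1:TB_PORT
# on the remote host, where TensorBoard is started on that same port.
SSH_CMD="ssh -p ${SSH_PORT} -L ${TB_PORT}:127.0.0.1:${TB_PORT} ${TACC_USER}@${SSH_HOST} ${TENSORBOARD} --logdir=${LOGDIR} --port=${TB_PORT}"

# Print the command for review; paste it into a terminal to open the tunnel.
echo "$SSH_CMD"
```

While the tunnel is open, TensorBoard should be reachable in a local browser at http://127.0.0.1:10006.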
The following videos will help you use the tcloud CLI to begin your TACC journey: demo video and conda-cache video.
Basic examples are provided under the example folder. These examples include: HelloWorld, TensorFlow, PyTorch and MXNet.