This is the associated code for Rowan's TPU tutorial. This is a limited version of a longer and more detailed set of tutorials, which you can find at this website. Here are the steps to get set up:
- Download the Google Cloud utilities. There are links to an interactive installer for OSX/Windows, and apt-get instructions for Ubuntu.
- Open a terminal and run
gcloud init
(Note that if you're following this tutorial on a server that doesn't have a display, you need to run
gcloud init --console-only
)
You'll need to sign up with a project. If you're at UW, use the private lab-only account. Otherwise, use the project for your team at AI2.
- We need to make sure the right zone and region are set, so use
gcloud config set compute/zone us-central1-b
gcloud config set compute/region us-central1
- Download the Cloud TPU tool. On OSX this is
curl -O https://dl.google.com/cloud_tpu/ctpu/latest/darwin/ctpu && chmod a+x ctpu && mv ctpu ~/google-cloud-sdk/bin/
On Linux this is
wget https://dl.google.com/cloud_tpu/ctpu/latest/linux/ctpu && chmod a+x ctpu && sudo mv ctpu /usr/bin/
- For some reason I often need to authenticate once more, so run
gcloud auth application-default login
(If you are on a remote server, use
gcloud auth application-default login --no-launch-browser
)
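If you set up several machines, it can help to pick the right ctpu download automatically. Here's a minimal sketch of that, assuming `uname -s` reports `Darwin` on OSX and `Linux` on Linux. The helper name `ctpu_url` is made up for illustration; the two URLs are the ones from the download step above. It only prints the URL and doesn't download anything.

```shell
#!/bin/sh
# ctpu_url is a hypothetical helper (not part of gcloud or ctpu).
# It prints the ctpu download URL for a kernel name as reported by `uname -s`.
ctpu_url() {
    case "$1" in
        Darwin) echo "https://dl.google.com/cloud_tpu/ctpu/latest/darwin/ctpu" ;;
        Linux)  echo "https://dl.google.com/cloud_tpu/ctpu/latest/linux/ctpu" ;;
        *)      echo "unsupported platform: $1" >&2; return 1 ;;
    esac
}

# Show which URL would be used on this machine; nothing is downloaded here.
ctpu_url "$(uname -s)" || true
```

You could then pass the printed URL to curl or wget and finish with the chmod and mv steps shown above for your platform.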
Finally done with setup! We're now ready to get started. The command that you want is:
ctpu up --name $(hostname) --tpu-size=v2-8 --preemptible --tf-version '1.12'
(replace $(hostname) with something better)
This will create a virtual machine and an associated TPU. We're using one of the older (more stable) TPUs. There are also additional TPU options if we wanted more compute. The --preemptible flag means that your TPU might get suddenly killed by Google. These TPUs are much cheaper though!
(If you run into errors with the above command try following these steps.)
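Since you'll probably want a name other than your hostname, here's a small sketch that prints the full command for a name you choose, so you can check it before running it. `ctpu_up_cmd` and the example name `my-tpu` are hypothetical; the flags are exactly the ones from the command above. It only prints the command and never calls ctpu.

```shell
#!/bin/sh
# ctpu_up_cmd is a hypothetical helper, not part of ctpu itself.
# It prints (does not run) the ctpu command for a given TPU/VM name.
ctpu_up_cmd() {
    name="$1"
    if [ -z "$name" ]; then
        echo "usage: ctpu_up_cmd NAME" >&2
        return 1
    fi
    # Same flags as in the tutorial: v2-8 TPU, preemptible, TensorFlow 1.12.
    printf "ctpu up --name %s --tpu-size=v2-8 --preemptible --tf-version '1.12'\n" "$name"
}

# Example with a made-up name; copy the printed line to actually create the TPU.
ctpu_up_cmd "my-tpu"
```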
At any time, you can look at the status of your VM and TPUs by going to the VM instances or TPUs page.
There's one more thing you'll need to do. Look at the VM instance you created and add your SSH key into SSH Keys there. You can get your SSH key by running cat ~/.ssh/id_rsa.pub.
If you don't have an SSH key, use this tutorial. You will also need the External IP of your server, available on that page. Mine is 104.154.97.121.
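If you're not sure whether you already have a key, a quick check like the one below can help. This is just a sketch: `print_pubkey` is a made-up helper, and it assumes your key lives at the default ~/.ssh/id_rsa.pub unless you pass another path. The hint it prints uses the standard `ssh-keygen -t rsa` command.

```shell
#!/bin/sh
# print_pubkey is a hypothetical helper.
# It prints the public key at the given path (default: ~/.ssh/id_rsa.pub),
# or explains how to create one if the file is missing.
print_pubkey() {
    key="${1:-$HOME/.ssh/id_rsa.pub}"
    if [ -f "$key" ]; then
        cat "$key"
    else
        echo "no public key at $key; create one with: ssh-keygen -t rsa" >&2
        return 1
    fi
}

# Print your default key if it exists; this is what you paste into SSH Keys.
print_pubkey || true
```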
Last, we need to create a cloud storage bucket. Visit the cloud storage page and make a cloud storage bucket in the us-central1 region. I named mine tpututorial. We'll also need to fix the permissions on this storage bucket. Grab the project number from the main Google Cloud console (mine is 335436385550) and edit the permissions (right hand side). Add
service-[PROJECT_NUMBER]@cloud-tpu.iam.gserviceaccount.com
to both the Storage Legacy Writer and Storage Legacy Reader groups.
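To avoid typos when filling in the project number, you can build the service account address with a tiny helper. This is only a sketch: `tpu_service_account` is a made-up name, and it assumes the project number is all digits, like mine above. It just prints the address to paste into the permissions page.

```shell
#!/bin/sh
# tpu_service_account is a hypothetical helper.
# It builds the Cloud TPU service account address from a numeric project number.
tpu_service_account() {
    case "$1" in
        # Reject empty input or anything containing a non-digit.
        ''|*[!0-9]*) echo "project number must be digits only" >&2; return 1 ;;
    esac
    echo "service-$1@cloud-tpu.iam.gserviceaccount.com"
}

# Example using the project number from this tutorial.
tpu_service_account 335436385550
```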
Let's upload this folder onto your new machine, and then SSH there. For me I'd run
scp -r ~/code/tpututorial rowanz@104.154.97.121:~
Usually I have PyCharm automatically upload stuff that's local to the cloud. So for me, I have:
- local (OSX): /Users/rowanz/code/tpututorial
- remote (server): /home/rowanz/tpututorial
Now, cd into your remote directory tpututorial, and we'll install some dependencies by running
chmod +x setup.sh && ./setup.sh && source ~/.bashrc
And that's it! We should be good to go. Let's train BERT on SWAG! Edit the file train_and_val_swag.sh with the name of your cloud storage bucket (mine was tpututorial) and then run
chmod +x train_and_val_swag.sh && ./train_and_val_swag.sh
and you're done!!