Chip 'n Scale: Queue Arranger
chip-n-scale-queue-arranger helps you run machine learning models over satellite imagery at scale. It is a collection of AWS CloudFormation templates deployed by kes, Lambda functions, and utility scripts for monitoring and managing the project.
Requirements
- node 8.x
- yarn (or npm)
- A TensorFlow Serving Docker image which accepts base64 encoded images. For a walkthrough of this process, check out this post.
- An XYZ raster tile endpoint
- A corresponding list of tiles over the area you'd like to predict on. If you know the extent of your prediction area as GeoJSON, you can use geodex, mercantile, or tile-cover
- An AWS account with sufficient privileges to deploy config/cloudformation.template.yml
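If you need to generate the tile list yourself, the standard Web Mercator (slippy-map) formulas convert a lon/lat point at a given zoom into x/y tile indices; this is the same math that tools like mercantile implement. A minimal sketch (the function name is ours, for illustration):

```javascript
// Compute slippy-map (XYZ) tile indices for a lon/lat point at a zoom level.
// Standard Web Mercator tiling math, as implemented by mercantile and similar tools.
function lonLatToTile(lon, lat, z) {
  const n = Math.pow(2, z); // number of tiles along each axis at this zoom
  const x = Math.floor(((lon + 180) / 360) * n);
  const latRad = (lat * Math.PI) / 180;
  const y = Math.floor(
    ((1 - Math.log(Math.tan(latRad) + 1 / Math.cos(latRad)) / Math.PI) / 2) * n
  );
  return { x, y, z };
}

console.log(lonLatToTile(0, 0, 1)); // { x: 1, y: 1, z: 1 }
```

Running a point-in-polygon check of tile centers against your GeoJSON extent, or using one of the tools listed above, produces the final tile list.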
Deploying
To create your own project, first install the node dependencies:
yarn install
Then add values to config/.env and config/config.yml to configure your project. Samples for each are provided, and you can find more information on the kes documentation page.
Deploy to AWS (takes ~10 minutes):
yarn deploy
...
CF operation is in state of CREATE_COMPLETE
The stack test-stack is deployed or updated.
- The database is available at: postgres://your-db-string
- The queue is available at https://your-queue-url
Is this the first time setting up this stack? Run the following command to set up the database:
$ yarn setup postgres://your-db-string
✨ Done in 424.62s.
The deploy output includes a database connection string; pass it to the setup script to run the migration:
yarn setup [DB_STRING]
If you'd like to confirm that everything is deployed correctly (recommended), run:
yarn verify
This will test a few portions of the deployed stack to ensure that it will function correctly. Once you're ready, begin pushing tile messages to the SQS queue.
Running
Once the stack is deployed, you can kick off the prediction by adding messages to the SQS queue. Each individual message will look like:
{ "x": 1, "y": 2, "z": 3 }
where x, y, and z specify an individual map tile. Because pushing these messages into the queue quickly is important to running the prediction at scale, we've included a utility script to assist this process:
yarn sqs-push [tiles.txt] [https://your-queue-url]
The first argument, tiles.txt, is a line-delimited file containing your tile indices in the format x-y-z, and the second argument is the URL of your SQS queue. If you have a lot of tiles to push to the queue, it's best to run this script in the background or on a separate computer.
Completion
After the prediction is complete, you should download the data from the AWS RDS database. Then it's okay to delete the stack:
yarn delete
Speed, Cost, and GPU Utilization
The primary costs of running this stack come from Lambda functions and GPU instances. The Lambdas parallelize the image downloading and database writing; the GPU instances provide the prediction capacity. To run the inference optimally, from both a speed and cost perspective, these two resources need to be scaled in tandem. Roughly four scenarios can occur:
- Lambda concurrency is much higher than GPU prediction capacity. When too many Lambdas call the prediction endpoint at once, many of them will time out and fail. The GPU instances will be fully utilized (good), but Lambda costs will be very high because the functions run longer and more times than necessary. This will also hit the satellite imagery tile endpoint more times than needed. If this is happening, Lambda errors will be high, Lambda run time will be high, GPU utilization will be high, and SQS messages will show up in the dead letter queue. To fix it, lower the maximum Lambda concurrency or increase GPU capacity.
- Lambda concurrency is slightly higher than GPU prediction capacity. Similar to the above case, if the Lambda concurrency is slightly too high compared to GPU prediction throughput, the Lambdas will run for longer than necessary but not timeout. If this is happening, Lambda errors will be low, Lambda run time will be high, and GPU utilization will be high. To fix it, lower the maximum Lambda concurrency or increase GPU capacity.
- Lambda concurrency is lower than GPU prediction capacity. In this case, the Lambda monitoring metrics will look normal (low errors and low run time) but the GPU prediction instances have the capacity to predict many more images. To see this, run
yarn gpu-util [ssh key]
which will show the GPU utilization of each instance/GPU in the cluster:
$ yarn gpu-util ~/.ssh/my-key.pem
yarn run v1.3.2
$ node scripts/gpu-util.js ~/.ssh/my-key.pem
┌────────────────────────┬────────────────────────┬────────────────────────┐
│ IP Address │ Instance Type │ GPU Utilization │
├────────────────────────┼────────────────────────┼────────────────────────┤
│ 3.89.130.180 │ p3.2xlarge │ 5 % │
├────────────────────────┼────────────────────────┼────────────────────────┤
│ 23.20.130.19 │ p3.2xlarge │ 2 % │
├────────────────────────┼────────────────────────┼────────────────────────┤
│ 54.224.113.60 │ p3.2xlarge │ 3 % │
├────────────────────────┼────────────────────────┼────────────────────────┤
│ 34.204.40.177 │ p3.2xlarge │ 12 % │
└────────────────────────┴────────────────────────┴────────────────────────┘
✨ Done in 3.30s.
To fix this, increase the number of concurrent Lambdas or decrease the GPU capacity. (Note that by default, the security group on the instances won't accept SSH connections. To use gpu-util, add a new rule to your EC2 security group.)
- Optimal 🎉
High GPU utilization, low Lambda errors, and low Lambda run time. 🚢
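A rough way to reason about the balance point in the scenarios above: multiply the cluster's prediction throughput by the per-message Lambda runtime to estimate how many Lambdas can be in flight without overwhelming the GPUs. All numbers below are illustrative assumptions, not measured values:

```javascript
// Back-of-envelope estimate of sustainable Lambda concurrency for a GPU cluster.
// The formula and example numbers are illustrative assumptions.
function maxLambdaConcurrency(gpus, predictionsPerSecPerGpu, lambdaSecondsPerMessage) {
  // The cluster absorbs (gpus * rate) predictions per second; multiplying by
  // the per-message Lambda runtime (Little's law) gives the number of Lambdas
  // that can be in flight without requests queueing at the endpoint.
  return Math.floor(gpus * predictionsPerSecPerGpu * lambdaSecondsPerMessage);
}

// e.g. 4 GPU instances at ~50 predictions/sec each, Lambdas taking ~2s per message:
console.log(maxLambdaConcurrency(4, 50, 2)); // 400
```

Setting the Lambda concurrency limit near this estimate, then watching GPU utilization and Lambda run time, is a reasonable starting point before tuning further.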
Motivation
Running machine learning inference at scale can be challenging. One bottleneck is that it's often hard to ingest/download images fast enough to keep a GPU fully utilized. This project seeks to relieve that bottleneck by parallelizing the imagery acquisition across AWS Lambda functions and running it separately from the machine learning predictions.
Acknowledgements
- The World Bank, Nethope, and UNICEF partnered with us on machine learning projects that provided opportunities to test these capabilities.
- Digital Globe assisted us in using their services to access ultra high-resolution satellite imagery at scale.
- Azavea's raster-vision-aws repo provides the base AMI for these EC2 instances (nvidia-docker + ECS enabled).