Data science brings undeniable value to business, academia, and other sectors. As organizations have come to realize this, they have been collecting and storing more data than ever before. The challenge for data science practitioners, then, often lies not in the sophistication of machine learning methods but in our ability to process massive volumes of data. Applying parallelization and distributed computing to machine learning use cases, such as computer vision tasks, can increase speed and productivity, allowing us to use high-volume data and complex methods to the fullest.
Harnessing distributed computing and translating standard Python into optimized parallel code can be challenging. After this course, students will have the foundational knowledge necessary to use distributed computing and parallelization to scale up their machine learning. The course includes a case study demonstrating these strategies on image classification with PyTorch, one of the many machine learning methods that benefit from parallelization.
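As a taste of what that translation can look like, here is a minimal sketch (not taken from the course materials) of a plain Python loop rewritten with Dask's delayed API; `process()` is a hypothetical stand-in for any expensive per-item computation:

```python
import dask
from dask import delayed

def process(x):
    """Placeholder for an expensive per-item computation."""
    return x ** 2

inputs = range(10)

# Standard, serial Python: each call runs one after another.
serial_results = [process(x) for x in inputs]

# The same loop, parallelized: delayed() wraps each call as a lazy
# task, and dask.compute() executes the task graph in parallel.
lazy_results = [delayed(process)(x) for x in inputs]
parallel_results = dask.compute(*lazy_results)

assert list(serial_results) == list(parallel_results)
```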
After this course, students will be able to:
- Describe what distributed computing is, and know how to access and interface with a cluster of machines in the cloud
- Explain the differences between a CPU and a GPU, and know when each is appropriate to use
- Apply parallelization to a Python workflow using Dask
- Apply parallelization and distributed computing together to create a high-volume PyTorch image classification workflow (a sketch of this combination follows this list)
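For a sense of where the course ends up, here is a minimal sketch of that final objective, combining a Dask cluster with PyTorch image classification. It is an illustration under assumed details, not the course's case study: the scheduler address is a placeholder, and the stock ResNet-18 model and random image batches are stand-ins for real data and models.

```python
import torch
from torchvision import models
from dask.distributed import Client

def classify_batch(batch):
    """Classify one batch of image tensors on a Dask worker."""
    # Use the worker's GPU when one is available; otherwise fall back to CPU.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    model = model.to(device).eval()
    with torch.no_grad():
        logits = model(batch.to(device))
    return logits.argmax(dim=1).cpu()

if __name__ == "__main__":
    # Connect to an existing cluster; the address is a placeholder.
    client = Client("tcp://scheduler-address:8786")

    # Stand-in data: four batches of eight fake 224x224 RGB images.
    batches = [torch.rand(8, 3, 224, 224) for _ in range(4)]

    # One task per batch, run in parallel across the cluster's workers.
    futures = client.map(classify_batch, batches)
    predictions = client.gather(futures)
    print(predictions)
```

Reloading the model inside each task keeps the sketch self-contained; in practice you would cache the model once per worker rather than once per batch.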
Software developers, data scientists, analysts, statisticians, and other data-related professionals:
- with some machine learning or deep learning experience who wish to scale up their workflows
- interested in Dask or in how parallelization works in Python
- interested in distributed computing and in understanding how a cluster works
Intermediate
I will be doing demonstrations in Python, so familiarity with that language will be helpful. I will also give only a brief discussion of deep learning as a concept, so students may want to review that subject matter before taking this course. That review is not necessary to get value out of the discussions of parallel and distributed computing, however.