aws / aws-ofi-nccl

This is a plugin that lets EC2 developers use libfabric as a network provider while running NCCL applications.

What are some AI/ML workloads users can run to test the performance of the plugin?

tmh97 opened this issue

Hey AWS team,

At Cornelis Networks, we have had good luck so far with the plugin. We are able to run all of NVIDIA's NCCL performance tests with the plugin and our OPX libfabric provider!

We want to start running some real PyTorch/TensorFlow workloads and assess performance for some 'real-world' applications. I was hoping you'd be able to point me towards some apps/workloads that you folks use for performance benchmarking :) I noticed in #240 that someone mentioned the 'PyTorch-FSDP' workload; more examples similar to that would be greatly appreciated.

Thanks again for accepting our patches! Also, if there is a more appropriate forum for general questions like this (email, Slack, etc.), please let me know.