Request to Support AWS Inferentia2 for More Cost-Effective and Faster Inference in MPT
anjiefang opened this issue · comments
Feature Request
Integrate support for AWS Inferentia2 into MPT, enabling users to leverage this powerful and cost-friendly inference solution through AWS.
Motivation
Refer to this post:
- AWS Inferentia2 is designed for cost-efficient inference compared to Nvidia chips. These cost savings can be significant for users who rely on MPT for various applications.
- AWS Inferentia2 has demonstrated the potential for faster inference, which would improve MPT's overall responsiveness and usability.
[Optional] Implementation
Additional context
Does the team already have a plan to leverage Inferentia2? If not, can the team provide any guidance on how to migrate to Inferentia2 chips?
@anjiefang: We do have plans to leverage inf2, but it's currently not a high priority. The MPT architecture is a standard GPT decoder-style architecture with one change: the attention module uses ALiBi. As long as inf2 supports attention with ALiBi, converting MPT to run on inf2 is not a huge lift. If inf2 doesn't support attention with ALiBi, we will have to ask the AWS team to support it. Please check out this script we used to convert MPT weights to FT format; something similar can be used for inf2.
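For anyone assessing inf2 support: the only non-standard piece is the ALiBi bias added to attention scores. Below is a minimal sketch of that bias, following the standard formulation from the ALiBi paper (a per-head geometric sequence of slopes, multiplied by the query–key distance). Function names are illustrative, not MPT's actual API; this is just to show what the inf2 attention kernel would need to accept.

```python
import math

def alibi_slopes(n_heads):
    """Per-head slopes: a geometric sequence starting at 2^(-8/n_heads)."""
    def slopes_power_of_2(n):
        start = 2 ** (-8.0 / n)
        return [start ** (i + 1) for i in range(n)]
    if math.log2(n_heads).is_integer():
        return slopes_power_of_2(n_heads)
    # Non-power-of-two head counts interleave slopes from the next power of two.
    closest = 2 ** math.floor(math.log2(n_heads))
    return (slopes_power_of_2(closest)
            + slopes_power_of_2(2 * closest)[0::2][: n_heads - closest])

def alibi_bias(n_heads, seq_len):
    """(heads, query, key) additive bias: -slope * distance for past keys.

    Future positions get 0 here; the causal mask handles them separately.
    """
    return [[[m * min(k - q, 0) for k in range(seq_len)]
             for q in range(seq_len)]
            for m in alibi_slopes(n_heads)]
```

If inf2's attention kernels accept an arbitrary additive bias of this shape before the softmax, ALiBi comes for free; otherwise the bias computation itself would need dedicated support.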