aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.

Home Page: https://github.com/aws/aws-parallelcluster

[Feedback Requested] PCluster 3 Python 3.X environments

stefan-maxar opened this issue · comments

We currently spin up clusters using the PCluster AMI for Amazon Linux 2, with either PCluster 3.7.2 or PCluster 3.8.0. Our current workflow has very few Python dependencies; in fact, we only really need boto3 (and possibly git-remote-codecommit), both of which are AWS-specific pip installs. We are looking to use Python 3.9.X+ to avoid boto3's deprecation of Python 3.7 (the "default" Python on the PCluster AMI).

Given how "lean" our required Python environment is (only AWS-specific modules), we have been reluctant to install Python ourselves (e.g., via pyenv), either through a post-install script or a custom AMI, given the extra time, complexity, and version locking involved. We have noticed the numerous PCluster-focused pyenv environments for various PCluster workflows (e.g., node_virtualenv, awsbatch_virtualenv, cfn_bootstrap_virtualenv), each of which has its own Python 3.9.X environment.

When logging in to a cluster head node or compute node as ec2-user, one of these virtual environments is sourced, but which one depends on the PCluster version you are using. For example, running "which pip3.9" as ec2-user returns the awsbatch virtual environment in PCluster 3.7.2, while the same command returns the cfn_bootstrap virtual environment in PCluster 3.8.0. So there doesn't seem to be any consistency in which one is loaded, but the existing environments could be useful for us given their current version pinning and installed packages.
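
To make the comparison above repeatable across PCluster versions, a small diagnostic along these lines could be run as ec2-user on each cluster. This is only a sketch: the function name describe_environment is made up for illustration, it uses nothing beyond the Python standard library, and it assumes the active interpreter is Python 3.3 or newer. The virtual environment check compares sys.prefix with sys.base_prefix, which works for venv and recent virtualenv releases but may not detect every way an environment can end up on the PATH.

```python
import importlib.util
import sys


def describe_environment(packages=("boto3",)):
    """Report which interpreter is running and which packages it can import.

    `packages` is a tuple of top-level module names to look for; the default
    of ("boto3",) matches the use case in this thread.
    """
    return {
        # Full path of the interpreter that is actually executing this script.
        "executable": sys.executable,
        "version": "{}.{}.{}".format(*sys.version_info[:3]),
        # sys.prefix differs from sys.base_prefix inside a venv/virtualenv.
        "in_virtualenv": sys.prefix != sys.base_prefix,
        "prefix": sys.prefix,
        # find_spec returns None when a top-level module cannot be located.
        "packages": {
            name: importlib.util.find_spec(name) is not None
            for name in packages
        },
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running this with the python3.9 that is first on the PATH on a 3.7.2 cluster and again on a 3.8.0 cluster should show whether the "prefix" value changes between versions and whether boto3 is importable in each case.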

So, we are looking for guidance on whether we could use one of these PCluster-created virtual environments to access boto3 (some of them already have it installed) and, if so, which one is recommended. We were thinking of adding the environment to our PATH via bash profiles to ensure a consistent "lean" AWS-focused Python environment, but could use further guidance. We don't want to break any of the on-cluster Python workflows (e.g., clustermgtd) by using these environments or by adding packages to them via pip.

If the answer to the above is "not recommended", this ticket turns into a feature request for PCluster to create a "blank" user-focused pyenv environment that could be sourced, added to, etc., so that users don't have to reinvent the wheel for "lean" Python environments like ours.

Thanks for your help! - Stefan

Hello Stefan,
Thank you for reaching out with the details!
If you are using ParallelCluster with the Slurm scheduler, feel free to use the awsbatch_virtualenv, since ParallelCluster does not use it with that scheduler. However, we cannot guarantee that these Python environments will remain compatible with, or stable for, your use case in future ParallelCluster updates.
I would recommend creating a script that configures your own Python virtual environment and running it via OnNodeConfigured: https://docs.aws.amazon.com/parallelcluster/latest/ug/HeadNode-v3.html#yaml-HeadNode-CustomActions-OnNodeConfigured. The script will be executed during cluster setup.

Thanks,
Wanyi
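
As a rough illustration of the "own virtual environment" suggestion above, the sketch below uses the standard library venv module to create a separate environment that does not touch the ParallelCluster-managed ones. It is not the ParallelCluster team's script: the function name create_user_env and the example path are hypothetical, and it assumes a Python 3.9+ interpreter with the venv module is already available on the node. Installing boto3 into the new environment would be a separate pip step, which is left out here.

```python
import os
import venv
from pathlib import Path


def create_user_env(target, with_pip=True):
    """Create a standalone virtual environment at `target`.

    Returns the path to the new environment's Python interpreter.
    `with_pip=True` bootstraps pip via ensurepip, which must be available
    in the interpreter running this function.
    """
    target = Path(target)
    builder = venv.EnvBuilder(
        with_pip=with_pip,
        # Mirror the `python -m venv` default: symlink except on Windows.
        symlinks=(os.name != "nt"),
        # Start from an empty directory if the target already exists.
        clear=True,
    )
    builder.create(target)
    # Interpreter location differs between POSIX and Windows layouts.
    if os.name == "nt":
        return target / "Scripts" / "python.exe"
    return target / "bin" / "python"
```

A script referenced from OnNodeConfigured could call something like create_user_env("/opt/user-python/venv") (an example path, not a ParallelCluster convention) and then add the resulting bin directory to the PATH in a bash profile, which keeps user packages away from the environments that clustermgtd and the other cluster daemons rely on.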

Thanks for the details! I did end up going with an "OnNodeConfigured" install of Python 3.11 Miniconda upon cluster configuration, as it's pretty "lean", has minimal requirements, and doesn't take too long to set up. I will definitely note the use of awsbatch_virtualenv, as we do use Slurm. Thanks again for the help.