SageMaker SSH Helper

SageMaker SSH Helper is a library that helps you to securely connect to Amazon SageMaker's training jobs, processing jobs, realtime inference endpoints, and SageMaker Studio notebook containers for fast interactive experimentation, remote debugging, and advanced troubleshooting.

The two most common scenarios for the library are:

Open a terminal session into a container running in SageMaker to diagnose a stuck training job, use CLI commands like nvidia-smi, or iteratively fix and re-execute your training script within seconds.
Remotely debug code running in SageMaker from your favorite local IDE, such as PyCharm Professional Edition or Visual Studio Code.

Other scenarios include, but are not limited to, connecting to a remote Jupyter Notebook in SageMaker Studio from your IDE, connecting with your browser to a TensorBoard process running in the cloud, or starting a VNC session to SageMaker Studio to run GUI apps.

Also see our Frequently Asked Questions, especially if you're using Windows on your local machine.

How it works

SageMaker SSH Helper uses AWS Systems Manager (SSM) Session Manager to register the SageMaker container in SSM and then create an SSM session between your client machine and the SageMaker container. You can then create an SSH connection on top of the SSM session, which lets you open a Linux shell and/or configure bidirectional SSH port forwarding for applications such as remote development, remote debugging, and remote desktop.

See detailed architecture diagrams of the complete flow of participating components in Training Diagram, and IDE integration with SageMaker Studio diagram.

Getting started

To get started, your AWS system administrator must set up the required IAM and SSM configuration in your AWS account, as shown in Setting up your AWS account with IAM and SSM configuration.

Note: This solution is sample AWS content. You should not use this content in your production accounts, in a production environment, or on production or other critical data. If you plan to use the solution in production, please review it carefully with your security team. You are responsible for testing, securing, and optimizing the sample content as appropriate for production grade use based on your specific business requirements, including any quality control practices and standards.

Use Cases

SageMaker SSH Helper supports a variety of use cases:

Connecting to SageMaker training jobs with SSM - open a shell to the training job to examine its file systems, monitor resources, produce thread dumps for stuck jobs, and interactively run your training script.
Connecting to SageMaker inference endpoints with SSM
Connecting to SageMaker processing jobs with SSM
Remote debugging with PyCharm Debug Server over SSH
Remote code execution with PyCharm / VSCode over SSH
Local IDE integration with SageMaker Studio over SSH for PyCharm / VSCode

If you want to add a new use case or a feature, see CONTRIBUTING.

Connecting to SageMaker training jobs with SSM

Download Demo (.mov)

Step 1: Install the library

Install the latest stable version of the library from the PyPI repository:

pip install sagemaker-ssh-helper

Step 2: Modify your start training job code

Add an import for SSHEstimatorWrapper.
Add a dependencies parameter to the Estimator object.
Add an SSHEstimatorWrapper.create(estimator, ...) call before calling fit(), with SageMaker SSH Helper added as a dependency.
Add a call to ssh_wrapper.get_instance_ids() to get the SSM instance IDs. We'll use an ID as the connection target later on.

For example:

from sagemaker_ssh_helper.wrapper import SSHEstimatorWrapper  # <--NEW--

estimator = PyTorch(entry_point='train.py',
                    source_dir='code',
                    dependencies=[SSHEstimatorWrapper.dependency_dir()],  # <--NEW--
                    role=role,
                    framework_version='1.9.1',
                    py_version='py38',
                    instance_count=1,
                    instance_type='ml.m5.xlarge')

ssh_wrapper = SSHEstimatorWrapper.create(estimator, connection_wait_time_seconds=600)  # <--NEW--

estimator.fit(wait=False)

instance_ids = ssh_wrapper.get_instance_ids()  # <--NEW--
print(f'To connect over SSM run: aws ssm start-session --target {instance_ids[0]}')  # <--NEW--

Note: connection_wait_time_seconds is the amount of time the SSH helper will wait inside SageMaker before it continues normal execution. It's useful for training jobs, when you want to connect before training starts. If you don't want to wait, set it to 0.
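One way to picture connection_wait_time_seconds is as a bounded wait that can be cut short. The sketch below is a conceptual model only, not the library's implementation; wait_for_connection and stop_event are hypothetical names introduced for illustration.

```python
import threading


def wait_for_connection(stop_event: threading.Event, wait_seconds: float) -> bool:
    """Block for up to wait_seconds, or until stop_event is set.

    Returns True if the wait was interrupted, False if it timed out.
    This is a conceptual model, not SageMaker SSH Helper's actual code.
    """
    if wait_seconds <= 0:
        # A value of 0 means "do not wait at all".
        return stop_event.is_set()
    return stop_event.wait(timeout=wait_seconds)


stop = threading.Event()
print(wait_for_connection(stop, 0))    # no waiting requested, returns at once
stop.set()                             # simulate a "stop waiting" request
print(wait_for_connection(stop, 600))  # event already set, returns at once
```

In this model a long timeout costs nothing once you are connected, because the wait can be ended early.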

Note: If you use distributed training (i.e., instance_count > 1), SSH Helper will start by default only on the first 2 nodes (e.g., on algo-1 and algo-2). If you want to connect over SSH to other nodes, you can log in to either of these nodes, e.g., algo-1, and then SSH from this node to any other node of the training cluster, e.g., algo-4, without running SSH Helper on those nodes.

Alternatively, pass the additional parameter ssh_instance_count with the desired instance count to SSHEstimatorWrapper.create().
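To make the default concrete, the following sketch lists which nodes would run SSH Helper for a given cluster size. It assumes that nodes are named algo-1, algo-2, ... and that ssh_instance_count selects the first N of them; the helper name ssh_helper_nodes is hypothetical and not part of the library.

```python
from typing import List


def ssh_helper_nodes(instance_count: int, ssh_instance_count: int = 2) -> List[str]:
    """Return the node names assumed to run SSH Helper (first N nodes)."""
    enabled = min(instance_count, ssh_instance_count)
    return [f"algo-{i}" for i in range(1, enabled + 1)]


print(ssh_helper_nodes(4))                        # default: first two nodes
print(ssh_helper_nodes(4, ssh_instance_count=4))  # all four nodes
```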

Step 3: Modify your training script

Add the following lines at the top of your train.py:

import sagemaker_ssh_helper
sagemaker_ssh_helper.setup_and_start_ssh()

The setup_and_start_ssh() call starts an SSM agent that connects the training instance to AWS Systems Manager.

Step 4: Connect over SSM

Once you have launched the job, you'll need to wait a few minutes for the SageMaker container to start and for the SSM agent to start successfully. You will then need the ID of the managed instance. The instance ID is prefixed with mi- and will appear in the job's CloudWatch log like this:

Successfully registered the instance with AWS SSM using Managed instance-id: mi-01234567890abcdef
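If you already have the log text at hand, a regular expression is enough to pull the ID out of a line like the one above. This is a minimal sketch; the pattern assumes that the ID is mi- followed by lowercase hexadecimal characters, and extract_instance_ids is a hypothetical helper, not a library function.

```python
import re
from typing import List

# Assumed format: "mi-" followed by lowercase hex digits.
MANAGED_INSTANCE_ID = re.compile(r"\bmi-[0-9a-f]+\b")


def extract_instance_ids(log_text: str) -> List[str]:
    """Return managed instance IDs found in log_text, without duplicates."""
    seen = []
    for instance_id in MANAGED_INSTANCE_ID.findall(log_text):
        if instance_id not in seen:
            seen.append(instance_id)
    return seen


line = ("Successfully registered the instance with AWS SSM "
        "using Managed instance-id: mi-01234567890abcdef")
print(extract_instance_ids(line))
```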

To fetch the instance IDs from the logs in an automated way, call the Python method of ssh_wrapper, as shown in Step 2:

instance_ids = ssh_wrapper.get_instance_ids()

This method accepts the optional parameter retry with the number of retry attempts (default is 60). It retries until the instance IDs appear in the CloudWatch logs or the number of attempts is reached. The pause between attempts is fixed at 10 seconds, so by default it waits no more than 10 minutes (60 attempts × 10 seconds).
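The retry behavior described above follows a common polling pattern, sketched here in generic form. This is not the library's code: poll_for_ids, fetch, and the injectable sleep are hypothetical names, and returning an empty list when attempts run out is an assumption made for the sketch.

```python
import time
from typing import Callable, List


def poll_for_ids(fetch: Callable[[], List[str]],
                 retry: int = 60,
                 pause_seconds: float = 10,
                 sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Call fetch() up to `retry` times, pausing between attempts."""
    for _ in range(retry):
        ids = fetch()
        if ids:
            return ids
        sleep(pause_seconds)
    return []  # assumption: give up quietly once attempts are exhausted


# Simulated log lookups: the ID shows up on the third attempt.
responses = iter([[], [], ["mi-01234567890abcdef"]])
found = poll_for_ids(lambda: next(responses), sleep=lambda seconds: None)
print(found)
print(60 * 10, "seconds is the worst-case wait with the defaults")
```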

With the instance id at hand, you will be able to connect to the training container using the command line or the AWS web console:

A. Connecting using command line:

On the local machine, make sure that the latest AWS CLI v2 is installed (v1 won't work), as described in the documentation for AWS CLI.
Verify that running aws --version prints aws-cli/2.x.x ...
Install AWS Session Manager CLI plugin as described in the SSM documentation.
Run this command (replace the target value with the instance id for your SageMaker job). Example:

aws ssm start-session --target mi-0d8404fdeef955ca3
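The version check and the connect command from the steps above can also be scripted. The sketch below only parses a version string and builds the argument list; is_aws_cli_v2 and start_session_command are hypothetical helper names, and the resulting list could be passed to subprocess.run if you want to launch the session from Python.

```python
import re
from typing import List


def is_aws_cli_v2(version_output: str) -> bool:
    """Return True if `aws --version` output reports major version 2 or later."""
    match = re.match(r"aws-cli/(\d+)\.", version_output.strip())
    return bool(match) and int(match.group(1)) >= 2


def start_session_command(instance_id: str) -> List[str]:
    """Build the argument list for `aws ssm start-session`."""
    return ["aws", "ssm", "start-session", "--target", instance_id]


print(is_aws_cli_v2("aws-cli/2.x.x ..."))
print(" ".join(start_session_command("mi-0d8404fdeef955ca3")))
```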

B. Connecting using the AWS Web Console:

In AWS Web Console, navigate to Systems Manager > Fleet Manager.
Select the node, then Node actions > Start terminal session.

Once connected to the container, you may want to switch to the root user with the sudo su - command.

Tip: Useful CLI commands

Here are some useful commands to run in a terminal session:

ps xfaww - Show running tree of processes
ps xfawwe - Show running tree of processes with environment variables
ls -l /opt/ml/input/data - Show input channels
ls -l /opt/ml/code - Show your training code
pip freeze | less - Show all Python packages installed
dpkg -l | less - Show all system packages installed

Tip: Generating a thread dump for stuck training jobs

If your training job is stuck, it can be useful to observe where its threads are waiting or busy. This can be done without attaching a Python debugger beforehand.

Having connected to the container as root, find the process id (pid) of the training process (assuming it's named train.py):
pgrep --newest -f train.py
Install GNU debugger:
apt-get -y install gdb python3.9-dbg
Start the GNU debugger with python support:
gdb python
source /usr/share/gdb/auto-load/usr/bin/python3.9-dbg-gdb.py
Connect to the process (replace 361 with your pid):
attach 361
Show C low-level thread dump:
info threads
Show Python high-level thread dump:
py-bt
It might also be useful to observe what system calls the process is making. Install strace:
apt-get install strace
Trace the process (replace 361 with your pid):
sudo strace -p 361
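As a complement to gdb, Python's standard faulthandler module can print the Python-level stack of every thread, provided you add it to the training script before the job starts. This is a general Python technique and not a feature of SageMaker SSH Helper; the function names below are hypothetical.

```python
import faulthandler
import os
import signal
import sys
import tempfile


def enable_thread_dump_on_signal() -> bool:
    """Dump all Python thread stacks to stderr whenever SIGUSR1 arrives."""
    if not hasattr(signal, "SIGUSR1"):
        return False  # platform without SIGUSR1, e.g. Windows
    faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)
    return True


def dump_all_threads(path: str) -> str:
    """Write the stacks of all threads to path right now and return the text."""
    with open(path, "w") as f:
        faulthandler.dump_traceback(file=f, all_threads=True)
    with open(path) as f:
        return f.read()


# Demonstration: dump the current (single-threaded) process to a temp file.
with tempfile.TemporaryDirectory() as tmp:
    report = dump_all_threads(os.path.join(tmp, "threads.txt"))
print(report.splitlines()[0])
```

With enable_thread_dump_on_signal() called at the top of train.py, running kill -USR1 followed by the pid from the terminal session would write the thread stacks to the job's stderr.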

Connecting to SageMaker inference endpoints with SSM

Adding SageMaker SSH Helper to an inference endpoint is similar to adding it to a training job, with the following differences.

Wrap your model into SSHModelWrapper before calling deploy() and add SSH Helper to dependencies:

from sagemaker_ssh_helper.wrapper import SSHModelWrapper  # <--NEW--

model = estimator.create_model(entry_point='inference.py',
                               source_dir='source_dir/',
                               dependencies=[SSHModelWrapper.dependency_dir()])  # <--NEW--

ssh_wrapper = SSHModelWrapper.create(model, connection_wait_time_seconds=0)  # <--NEW--

predictor = model.deploy(initial_instance_count=1,
                         instance_type='ml.m5.xlarge',
                         endpoint_name=endpoint_name)

Note: For an inference endpoint, which is always up and running, there is little value in setting connection_wait_time_seconds, so it's usually set to 0.

Similar to training jobs, you can fetch the instance ids for connecting to the endpoint with SSM with the following API:

instance_ids = ssh_wrapper.get_instance_ids()

Add the following lines at the top of your inference.py script:

import os
import sys
sys.path.append(os.path.join(os.path.dirname(__file__), "lib"))

import sagemaker_ssh_helper
sagemaker_ssh_helper.setup_and_start_ssh()

Note: Adding the lib directory to the Python path is required because SageMaker inference puts dependencies into the code/lib directory, while SageMaker training puts them directly into code.

Multi-model endpoints

For multi-model endpoints, the setup procedure is slightly different from regular endpoints:

from sagemaker_ssh_helper.wrapper import SSHModelWrapper, SSHMultiModelWrapper  # <--NEW--

model = estimator.create_model(entry_point='inference.py',
                               source_dir='source_dir/',
                               dependencies=[SSHModelWrapper.dependency_dir()])  # <--NEW--

mdm = MultiDataModel(
    name=model.name,
    model_data_prefix=model_data_prefix,
    model=model
)

ssh_wrapper = SSHMultiModelWrapper.create(mdm, connection_wait_time_seconds=0)  # <--NEW--

predictor: Predictor = mdm.deploy(initial_instance_count=1,
                                  instance_type='ml.m5.xlarge',
                                  endpoint_name=endpoint_name)


mdm.add_model(model_data_source=model.repacked_model_data, model_data_path=model_name)

predictor.predict(data=..., target_model=model_name)

Important: Make sure that you're passing to add_model() a model ready for deployment located at model.repacked_model_data, not the estimator.model_data.

Also note that SageMaker SSH Helper will be lazy loaded together with your model upon the first prediction request. So you should try to connect only after calling predict().

The inference.py script is the same as for regular endpoints.

If you are using PyTorch containers, make sure you select the latest versions, e.g. 1.12, 1.11, 1.10 (1.10.2), 1.9 (1.9.1). This code might not work if you use PyTorch 1.8, 1.7 or 1.6.

Connecting to SageMaker processing jobs with SSM

SageMaker SSH Helper supports both Script Processors and Framework Processors, and the setup procedure is similar to that for training jobs and inference endpoints.

A. Framework processors

The code to set up a framework processor (e.g. PyTorch) is the following:

from sagemaker_ssh_helper.wrapper import SSHProcessorWrapper  # <--NEW--

torch_processor = PyTorchProcessor(
    base_job_name='ssh-pytorch-processing',
    framework_version='1.9.1',
    py_version='py38',
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge"
)

ssh_wrapper = SSHProcessorWrapper.create(torch_processor, connection_wait_time_seconds=600)  # <--NEW--

torch_processor.run(
    source_dir="source_dir/",
    dependencies=[SSHProcessorWrapper.dependency_dir()],  # <--NEW--
    code="process.py"
)

Also add the following lines at the top of process.py:

import sagemaker_ssh_helper
sagemaker_ssh_helper.setup_and_start_ssh()

B. Script Processors

The code to set up a script processor (e.g. PySpark) is the following:

from sagemaker_ssh_helper.wrapper import SSHProcessorWrapper  # <--NEW--

spark_processor = PySparkProcessor(
    base_job_name='ssh-spark-processing',
    framework_version="3.0",
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge"
)

ssh_wrapper = SSHProcessorWrapper.create(spark_processor, connection_wait_time_seconds=600)  # <--NEW--

spark_processor.run(
    submit_app="source_dir/process.py",
    inputs=[ssh_wrapper.augmented_input()]  # <--NEW--
)

Also add the following lines at the top of process.py:

import sys
sys.path.append("/opt/ml/processing/input/")

import sagemaker_ssh_helper
sagemaker_ssh_helper.setup_and_start_ssh()

Remote debugging with PyCharm Debug Server over SSH

This procedure uses a PyCharm Professional feature: Remote debugging with the Python remote debug server configuration.

On the local machine, make sure that the latest AWS CLI v2 is installed (v1 won't work), as described in the documentation for AWS CLI.
Verify that running aws --version prints aws-cli/2.x.x ...
Install AWS Session Manager CLI plugin as described in the SSM documentation.
In PyCharm, go to the Run/Debug Configurations (Run -> Edit Configurations...) and add a new Python Debug Server. Choose a fixed port, e.g. 12345.
Take the correct version of the pydevd-pycharm package from the configuration window and install it either through requirements.txt or by calling pip from your source code.
Add commands to connect to the Debug Server to your code:

import pydevd_pycharm
pydevd_pycharm.settrace('localhost', port=12345, stdoutToServer=True, stderrToServer=True, suspend=True)

Tip: Check the descriptions of the arguments in the library source code.

Set extra breakpoints in your code with PyCharm, if needed
Start the Debug Server in PyCharm
Submit your code to SageMaker with SSH Helper as described in previous sections. Make sure you allow enough time for manually setting up the connection (do not set connection_wait_time_seconds to 0; the recommended minimum value is 600, i.e. 10 minutes). Don't hesitate to set it to a higher value, e.g. 30 minutes, because you will be able to terminate the waiting loop once you connect.
Start the port forwarding script once SSH helper connects to SSM and starts waiting inside the training job:

sm-local-ssh-training connect <<training_job_name>>

It will reverse-forward the remote debugger port 12345 to your local machine's Debug Server port. The local port 11022 will be connected to the remote SSH server port, so that you can easily connect with SSH from the command line.

Check the source code of the script sm-local-ssh-training if you want to change the default ports.

While this script is running, you may connect with SSH to the specified local port:

ssh -i ~/.ssh/sagemaker-ssh-gw -p 11022 root@localhost
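Before running ssh, it can help to confirm that the forwarded local port is actually accepting connections. The following sketch uses only the standard socket module; is_port_open is a hypothetical helper, and the demonstration opens a throwaway listener instead of a real tunnel.

```python
import socket


def is_port_open(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Demonstration with a temporary local listener on a free port.
with socket.socket() as server:
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    demo_port = server.getsockname()[1]
    print(is_port_open(demo_port))
```

With the port forwarding script running, is_port_open(11022) would tell you whether the tunnel is ready.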

Tip: If you log in to the node with SSH and don't see a sm-sleep process, the training script has already started and failed to connect to the PyCharm Debug Server, so you need to increase the connection_wait_time_seconds, otherwise the debugger will miss your breakpoints.

Tip: If you don't want the ssh command to complain about remote host keys when you switch to a different node, consider adding these two options to the command: -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null.
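If you launch ssh from a script, the command and the two optional host key settings can be assembled as an argument list. This is a sketch under the defaults mentioned in this section (port 11022, key ~/.ssh/sagemaker-ssh-gw); ssh_command is a hypothetical helper name.

```python
import os
import shlex
from typing import List


def ssh_command(port: int = 11022,
                key_path: str = "~/.ssh/sagemaker-ssh-gw",
                ignore_host_keys: bool = False) -> List[str]:
    """Build the ssh argument list for connecting through the local tunnel."""
    # Expand "~" explicitly, since no shell does it when using subprocess.
    cmd = ["ssh", "-i", os.path.expanduser(key_path), "-p", str(port)]
    if ignore_host_keys:
        cmd += ["-o", "StrictHostKeyChecking=no",
                "-o", "UserKnownHostsFile=/dev/null"]
    cmd.append("root@localhost")
    return cmd


print(shlex.join(ssh_command(ignore_host_keys=True)))
```

Disabling host key checks is reasonable here only because the connection goes to localhost through the tunnel; it is not a good default for ordinary SSH targets.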

Stop the waiting loop – connect to the instance and terminate the loop.

As already mentioned in step 8, make sure you've set a timeout long enough to allow the port forwarding script to set up a tunnel before execution of your script continues.

You can use the following CLI command from your local machine to stop the waiting loop (the sm-sleep remote process):

sm-local-ssh-training stop-waiting

After you stop the waiting loop, your code will continue running and will connect to your PyCharm Debug server.

Remote code execution with PyCharm / VSCode over SSH

Follow steps 1, 2 and 8, 9, 10 from the section Remote debugging with PyCharm Debug Server over SSH.

Before terminating the waiting loop (step 10), make sure you have configured your IDE to use the SSH host and port localhost:11022 as the remote Python interpreter and that it has connected. Provide ~/.ssh/sagemaker-ssh-gw as the private key.

Note that after you stop the waiting loop, your training script will run only once, and you will be able to execute additional code only while your script is running. Once the script finishes, you will need to submit another training job and repeat the procedure.

But there's a useful trick: submit a dummy script with an infinite loop, and while this loop is running, you can run your real training script again and again with the remote interpreter. Setting the max_run parameter of the estimator is highly recommended in this case.

A dummy script train_placeholder.py may look like this:

import time

import sagemaker_ssh_helper
sagemaker_ssh_helper.setup_and_start_ssh()

while True:
    time.sleep(10)
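A variation on the placeholder is a loop that ends by itself after a time budget, as a second safeguard alongside max_run. This is only a sketch: run_placeholder and FakeClock are hypothetical names, the clock and sleep functions are injectable so the loop can be tried without waiting, and in a real script the setup_and_start_ssh() call shown above would still come first.

```python
import time


def run_placeholder(max_seconds: float, poll_seconds: float = 10,
                    clock=time.monotonic, sleep=time.sleep) -> int:
    """Sleep in short intervals until max_seconds have passed."""
    deadline = clock() + max_seconds
    iterations = 0
    while clock() < deadline:
        # Never sleep past the deadline.
        sleep(min(poll_seconds, max(0.0, deadline - clock())))
        iterations += 1
    return iterations


class FakeClock:
    """Simulated clock so the example finishes instantly."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


fake = FakeClock()
print(run_placeholder(35, poll_seconds=10, clock=fake, sleep=fake.sleep))  # 4
```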

Also make sure that you're aware of the SageMaker Managed Warm Pools feature, which is helpful in such a scenario.

Local IDE integration with SageMaker Studio over SSH for PyCharm / VSCode

Download Demo (.mov)

Follow these steps to integrate your local IDE with SageMaker Studio:

Inside SageMaker Studio, check out (or unpack) this repo and run SageMaker_SSH_IDE.ipynb.
On the local machine, make sure that the latest AWS CLI v2 is installed (v1 won't work), as described in the documentation for AWS CLI.
Verify that running aws --version prints aws-cli/2.x.x ...
Install AWS Session Manager CLI plugin as described in the SSM documentation.
Install the library: pip install sagemaker-ssh-helper
Start SSH tunnel and port forwarding from a terminal session as follows:

sm-local-ssh-ide <<kernel_gateway_app_name>>

The parameter <<kernel_gateway_app_name>> is either taken from SageMaker Studio when you run the notebook SageMaker_SSH_IDE.ipynb, or you can find it in the list of running apps in the AWS Console under Amazon SageMaker -> Control Panel -> User Details. It looks like this: datascience-1-0-ml-g4dn-xlarge-afdb4b3051726e2ee18a399903fb.

The local port 10022 will be connected to the remote SSH server port, to let you connect with SSH from your IDE.
In addition, the local port 8889 will be connected to the remote Jupyter notebook port and the local port 5901 to the remote VNC server. Optionally, the remote port 443 will be connected to your local PyCharm license server address (check the source of the script sm-local-ssh-ide and modify it with your server address).

Connect your local PyCharm or VSCode to the remote Python interpreter by using root@localhost:10022 as the SSH parameters. Also provide ~/.ssh/sagemaker-ssh-gw as the private key.

Note: The SSH key is automatically generated on your local machine every time you run the sm-local-ssh-ide command from step 5.

You can check that the connection is working by running the SSH command in the command line:

ssh -i ~/.ssh/sagemaker-ssh-gw -p 10022 root@localhost

Now with PyCharm or VSCode you can run and debug the code remotely inside the kernel gateway app.

Moreover, you may now configure a remote Jupyter Server as http://127.0.0.1:8889/?token=<<your_token>>. You will find the full URL with the remote token in the SageMaker_SSH_IDE.ipynb notebook, in the output of the cell that runs the sm-ssh-ide start command.

Instructions for remote notebooks in PyCharm
Instructions for remote notebooks in VSCode (don't forget to switch kernel to remote after configuring the remote server).
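If you prefer to assemble the Jupyter Server URL yourself, a short helper can build it from the token. This sketch assumes the default local port 8889 mentioned above; jupyter_server_url is a hypothetical name, and the token in the example is a placeholder.

```python
from urllib.parse import urlencode


def jupyter_server_url(token: str, port: int = 8889, host: str = "127.0.0.1") -> str:
    """Build the remote Jupyter Server URL for the forwarded local port."""
    # urlencode escapes any characters in the token that are unsafe in a URL.
    return f"http://{host}:{port}/?{urlencode({'token': token})}"


print(jupyter_server_url("your_token"))
```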

You can also start a VNC session to vnc://localhost:5901 (e.g. on macOS with the Screen Sharing app) and run an IDE or any other GUI app on the remote desktop instead of your local machine.

Troubleshooting IDE integration

Check that the sshd process has started in the SageMaker Studio notebook by running this command in the image terminal:

ps xfa | grep sshd

If it's not started, there might be some errors in the output of the notebook, and you might get this error on the local machine:

Connection closed by UNKNOWN port 65535

Carefully check the notebook output in SageMaker Studio, or try to stop and start SSM and the services again.
