This file can be read online (with drawings and pictures) at https://tinyurl.com/ReadmeIDC.
Use https://github.com/bjodom/idc if you are blocked from the tinyurl redirect.
- Batch Mode Access to PVC-Enabled SPR Systems in the Intel® Developer Cloud
- Overview
- Simple 1-2-3
- Account Registration
- SSH Setup
- Head Node vs Compute Nodes
- Environment Setup
- Jupyter
- Additional Software
- Common Slurm Commands
- Sample GPU Test Code
- Some Example Scripts
- Running MPI
- If you use MobaXterm
- VTune, Advisor, PTrace - and other things you will not get
- Notable Known Issues
- Extend your access
- Where to get Support
- Revisit This Page for Tips
The Intel Developer Cloud (IDC) trial is open to pre-qualified Intel customers, approved developers, and all Intel employees. While we plan to formally launch in the future, today you will already be gaining access to a powerful, highly functional system, and one that can greatly benefit from you sharing your experiences so we can improve it. In other words, it is not perfect and we would appreciate your help finding the rough edges.
The picture below illustrates how to think of this system at a high level.
This is NOT a cluster, and it is not a system which will give you root access. The system is a shared system (please be respectful of others by not hogging resources unnecessarily), not a virtualized one.
This IS a system with round-the-clock access to systems with Intel GPU Max series GPUs (PVCs).
This Readme file has a lot of detail, but you should start by focusing on only three things:
- Get an account on Intel® Developer Zone. If you have one, you do not need to create a new one. If you need one, sign up now - it is free and instant. Note: Intel employees also have an employee login option that is only usable internally (in the office, or externally via VPN) - just look for the "Employee Sign In" and click that instead of entering a username, etc. If you are an Intel employee, you can create an account with any non-Intel email in order to sign in without being on the corporate network.
- Have a Public-Private Authentication Key Pair that has ed25519 or RSA 4096 level of strength. If you have one, you do not need to create a new one. If you need one, follow the SSH Setup instructions to create one (free and instant). With your key, we recommend setting up your config file to make ssh easy (see SSH .config Client Setup).
- Create and use the service - it is free and instant (no need to enter any payment information - no credit card needed). Everyone can schedule and deploy the service with the Intel® Developer Cloud management console. From there pick "Scheduled access - Intel® Max Series GPU (PVC) on 4th Gen Intel® Xeon® processors - 1100 series (4x)". Intel employees wanting to use their employee login should get on the internal network (VPN), go here, and click Sign In.
These three steps will get you ON the system. You'll find more instructions on setting up an environment, using Jupyter, and more later in this Readme.
Here is the process to get your account and access the service, described above, step by step.
To access the batch service, external users must register for an Intel® Developer Cloud user account, via the Sign Up button on the Intel® Developer Cloud landing page (http://cloud.intel.com) and follow the steps in the Intel Cloud registration process. Intel employees can use their existing intel.com credentials to access the Intel® Developer Cloud portal and select the batch service via the "Employee Sign In" link on this same page.
To register, press the 'Create the Account' button, as indicated:
In the new registration screen, fill out the registration input fields and press "Next: Verify your email" button.
A confirmation is sent to the email address that was entered in the registration screen. The email contains a single-use code, illustrated below. You should get the email within minutes; please look in your SPAM and JUNK folders if you do not see it promptly.
Enter the verification code in the field, as illustrated below, and press the 'Create an account' button.
Next, we proceed to https://scheduler.cloud.intel.com/#/systems to go to the cloud management console.
Here we need to make sure our SSH public key is in our profile. Click the person/profile icon on the blue bar (NOT the one higher up on the same page).
Paste in your public key
Click "Save Key" and then Select the "Instances" Tab.
Check the "Scheduled..." instance, and click "Launch Instance"
Request access by clicking "Request Access." Note: once you have an instance, this page will show you the information (username, etc.) in case you have forgotten it.
Enter your organization/affiliation/company and an explanation. Once your explanation is at least 35 characters long, you can click "Request Access" and you will be granted access immediately.
Make note of your user ID; you will need it.
This is a good time to follow the SSH .config Client Setup instructions using your new user ID information.
Please, please, please be sure your .ssh file permissions are set correctly. Failure to do so is the NUMBER ONE REASON for FAILURE to be able to ssh to the instance.
Now you can ssh to the node. "ssh myidc" (use the name you set in your ~/.ssh/config).
From here, the most likely thing you want to do is "srun --pty bash" to open a live session on a 4th Gen Xeon system with Intel Max GPUs (PVCs) - where you can compile and run code (including single node multirank MPI programs), launch Jupyter notebooks, and more!
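Putting those pieces together, a minimal first session might look like the sketch below (sycl-ls ships with the oneAPI Base Toolkit and lists the devices your code can target):
ssh myidc                             # head node; name comes from your ~/.ssh/config
srun --pty bash                       # interactive shell on a compute node
source /opt/intel/oneapi/setvars.sh   # initialize the oneAPI environment
sycl-ls                               # list SYCL devices; the Max GPUs should appear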
ssh-keygen is a tool for creating new authentication key pairs for SSH. Such key pairs are used for automating logins, single sign-on, and for authenticating hosts. The IDC uses SSH Keys exclusively and you will never use a password for authentication.
To create a key, use the ssh-keygen utility found in your terminal application: Windows PowerShell, a Windows Subsystem for Linux (WSL) terminal, or a Linux or Mac terminal.
For WSL, Linux, and Mac clients, enter the command below:
ssh-keygen -o -a 100 -t ed25519 -f ~/.ssh/id_ed25519_idc -C "you@email.com"
For PowerShell enter:
ssh-keygen -o -a 100 -t ed25519 -f C:\Users\YourID\.ssh\id_ed25519_idc
The passphrase is optional and you can hit enter for no passphrase. This will result in two files being generated: id_ed25519_idc and id_ed25519_idc.pub. Take care of these files, as they are the private and public key pair that will be tied to your IDC account.
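When the management console asks you to paste your public key (see below), you can print it and copy it from the terminal (assuming the file name used above):
cat ~/.ssh/id_ed25519_idc.pub   # paste ONLY the .pub contents; never share the private key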
To make accessing the IDC convenient, it is recommended to set up a ~/.ssh/config file:
Host myidc #←YOU CAN CALL IT ANYTHING
Hostname idcbetabatch.eglb.intel.com
User uXXXXXX #← Request "scheduled access" at https://scheduler.cloud.intel.com/#/systems to get your user identifier.
IdentityFile ~/.ssh/id_ed25519_idc
#ProxyCommand /usr/bin/nc -x YourProxy:XXXX %h %p # Uncomment if necessary
ServerAliveInterval 60
ServerAliveCountMax 10
StrictHostKeyChecking no # Frequent changes in the setup are taking place now; this will help reduce known-hosts errors.
UserKnownHostsFile=/dev/null
Visit the Internal Wiki for a rundown of settings, which may differ based on your location.
ProxyCommand "C:\Program Files\Git\mingw64\bin\connect.exe" -S proxy-dmz.intel.com:1080 %h %p
Host myidc
Hostname idcbetabatch.eglb.intel.com
User uXXXXXX #← Request "scheduled access" at https://scheduler.cloud.intel.com/#/systems to get your user identifier.
IdentityFile ~/.ssh/id_ed25519_idc
ServerAliveInterval 60
ServerAliveCountMax 10
StrictHostKeyChecking no
UserKnownHostsFile=/dev/null
Ensure that ~/.ssh has 700 permission bits set (drwx------; the directory needs the execute bit).
Ensure that ~/.ssh/config has 600 permission bits set (-rw-------).
Ensure that ~/.ssh/id_ed25519_idc and ~/.ssh/id_ed25519_idc.pub have 400 permission bits set (-r--------).
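On WSL, Linux, and Mac clients, a quick sketch for setting these bits (adjust the file names if you chose different ones):
chmod 700 ~/.ssh                                            # directory: owner-only, with execute bit
chmod 600 ~/.ssh/config                                     # config: owner read/write
chmod 400 ~/.ssh/id_ed25519_idc ~/.ssh/id_ed25519_idc.pub   # keys: owner read-only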
With this configuration, future connections are established from the terminal by entering:
ssh myidc
You are allowed up to 4 connections to the IDC. If you lose a connection and rejoin, you may want to look for other lost login processes and kill them off (look with ps aux).
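A quick sketch for cleaning up after a dropped connection (the PID is the second column of ps output):
ps aux | grep "$USER"   # look for leftover shells from lost logins
kill <PID>              # replace <PID> with the stale process id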
Upon initial connection to the IDC, you are connected to the head node. This environment is a standard Ubuntu 22.04.2 LTS environment including build-essential and the Intel oneAPI Base Toolkit. The IDC utilizes SLURM to manage job scheduling and resource management. As a cluster workload manager, SLURM has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
The head node is a login node and a method of authentication; no development work can occur on the head node. There are no accelerators on the head node. From the head node you can do file management, launch and manage SLURM jobs on the dedicated partition, or launch an interactive job on one of the worker nodes. Your home directory is automatically mounted on the worker node when you connect. It has a maximum of 20GB of storage. There is a data directory where training data and datasets can be stored: /home/common/data
To open an interactive session on a shared worker node and initialize the oneAPI environment, enter:
srun -p pvc-shared --pty bash
source /opt/intel/oneapi/setvars.sh
The interactive worker nodes are resource constrained in that they are shared resources, so please be courteous. Running code in the interactive session is just like being in a local session, and you can run code without submitting to a queue. There is at least one PVC in each worker node. For maximum performance, submit your job to the pvc partition, which will run your code using all resources available; if your code can make use of Intel(R) Data Center GPU Max 1100's, there are 4 in each non-interactive node. Keep in mind this is a one-at-a-time job, so you might have to wait a while.
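For example, assuming a batch script named myjob.sh (a placeholder name; see the sample script later in this Readme), submitting to the pvc partition and checking the queue looks like:
sbatch -p pvc myjob.sh   # queue the job on the dedicated PVC partition
squeue -u $USER          # see where your job sits in the queue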
Enter source /opt/intel/oneapi/setvars.sh
and the oneAPI development environment will be initialized.
Enter conda env list
and activate the Python environment of your choice. Both TensorFlow and PyTorch environments have Jupyter installed. If you don't like those environments, create your own conda environment and customize it to your liking.
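A minimal sketch of creating your own environment (myenv is a placeholder name; pick your own packages):
conda create -n myenv python=3.10 jupyterlab   # new environment in your home directory
conda activate myenv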
This will get better; there are security issues that need to be overcome in future versions of the IDC. For now, here is an overview of the steps to run jupyter-lab:
- Login to the head node.
- Launch an interactive session
- Find the IP of the interactive session.
- Activate a conda environment that has jupyter-lab
- Launch jupyter-lab and take note of the port
- From another terminal port forward to your localhost
- Launch a browser and paste the long link from the other terminal tab
- Have fun with Jupyter-lab!
The details, enter these from a terminal:
ssh myidc
srun --pty bash
echo $(ip a | grep -v -e "127.0.0.1" -e "inet6" | grep "inet" | awk '{print $2}' | sed 's/\/.*//')
conda activate pytorch_xpu
jupyter-lab --ip 10.10.10.X
Take note of the IP and the port that Jupyter launches on; it will look something like this:
http://10.10.10.8:8888/lab?token=9d83e1d8a0eb3ffed84fa3428aae01e592cab170a4119130
Your port will likely be different, so replace 8888 with what was provided to you. From a new terminal, enter:
ssh myidc -L 8888:10.10.10.X:8888
Open your browser and enter localhost:8888, or shift+click the link in the other terminal, or paste the token that was provided when you initialized the server as your password, and use Jupyter lab as usual.
It's possible to install additional software if regular user permissions are the only requirement. For example, to install the Intel® Distribution for Python, follow these steps. Miniconda is already installed, but you will be creating a virtual environment in your home directory.
conda activate base
conda update conda
conda config --add channels intel
conda create -n idp intelpython3_core python=3.10
conda activate idp
Keep in mind you have 20GB of storage in your home directory for all software and data.
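To see how much of the 20GB you are using, du is a quick check (the ~/.conda path is an assumption; your conda environments may live elsewhere):
du -sh ~          # total usage of your home directory
du -sh ~/.conda   # conda environments are often the largest consumer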
sinfo -al (What Nodes are available)
PARTITION AVAIL TIMELIMIT JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE NODELIST
pvc* up 2:00:00 1 no NO all 3 idle idc-beta-batch-pvc-node-[01-03]
squeue -al (How many jobs are in the queue)
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
sbatch -p {PARTITION-NAME} {SCRIPT-NAME}
srun -p {PARTITION-NAME} {SCRIPT-NAME}
scancel {JOB-ID}
Go interactive with a compute node
srun -p {PARTITION-NAME} -n 1 -t 00-00:10 --pty bash -i (with a task count and time limit.)
srun --job-name "u-pick" --pty bash -i (First available.)
Here is a sample GPU test code that demonstrates functionality and how to offload the application execution to a compute node. Follow these steps:
Step 1. Copy the code below into a file.
#include <sycl/sycl.hpp>
using namespace sycl;
int main() {
//# Create a device queue with device selector
queue q(gpu_selector_v);
//# Print the device name
std::cout << "Device: " << q.get_device().get_info<info::device::name>() << "\n";
return 0;
}
Step 2. Save the file as getdev.cpp
Step 3. Enter the following commands to compile and run the application.
source /opt/intel/oneapi/setvars.sh
icpx -fsycl getdev.cpp
srun ./a.out
If successful, it should return Device: Intel(R) Data Center GPU Max 1100, demonstrating that you successfully compiled a SYCL application and offloaded its execution to a GPU on the compute node.
Here is a sample batch script; submit it with sbatch, and Slurm will write stdout and stderr to the files named by the --output and --error patterns.
#!/bin/bash
#SBATCH -A <account>
#SBATCH -p pvc
#SBATCH --error=job.%J.err
#SBATCH --output=job.%J.out
srun ./my_a.out
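To use it, save the script (e.g. as job.sh, a placeholder name), submit it, and read the output files:
sbatch job.sh         # prints "Submitted batch job <JOBID>"
cat job.<JOBID>.out   # stdout, per the --output pattern above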
When using MPI, you should set these environment variables (put in your ~/.bashrc to always have them):
export I_MPI_PORT_RANGE=50000:50500
export btl_tcp_port_min_v4=1024
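One way to append them to your ~/.bashrc in a single step (a sketch; skip it if you prefer editing the file directly):
cat >> ~/.bashrc <<'EOF'
export I_MPI_PORT_RANGE=50000:50500
export btl_tcp_port_min_v4=1024
EOF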
MPI is currently limited to a single node, and must be run without SLURM. Since SLURM (srun) will be the default, you need to specify a different launcher using the -launcher option.
For instance - either of these should work:
mpirun -launcher ssh -n 128 ./a.out
mpirun -launcher fork -n 128 ./a.out
These are probably the same (ssh and fork), but honestly we don't know. They seem to run in about the same time. Let us know if you decide one is a superior choice.
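If you want to compare the two launchers yourself, a quick timing sketch (a.out is whatever MPI binary you built):
time mpirun -launcher ssh -n 128 ./a.out    # time the ssh launcher
time mpirun -launcher fork -n 128 ./a.out   # time the fork launcher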
Visit the MPI with SYCL example page for a quick example of how to get a SYCL Hello World from 40 different connections to GPUs (40 ranks).
If you like using MobaXterm, here are notes from a user (thank you Yuning!) on the steps to make it fully work:
Step 1: Get the IP address
echo $(ip a | grep -v -e "127.0.0.1" -e "inet6" | grep "inet" | awk '{print $2}' | sed 's/\/.*//')
example output: 10.10.10.8
Step 2: Select the tunneling tab in MobaXterm
Select your own private key
If you need a proxy (Intel employees do when on internal network or VPN) - set up proxy (Intel is “proxy-dmz.intel.com:1080”).
Edit this tunnel
Step 3: Launch Jupyter notebook in MobaXterm (use the IP address 10.10.10.X you were assigned)
jupyter-lab --ip 10.10.10.X --no-browser
We will not install or offer tools that give system wide insight, due to serious security concerns that exist when you have a very diverse community of users. This means that VTune, Advisor, and PTrace will not be installed or activated. We do plan to offer more isolated systems in the future which will host these highly useful tools. We love them and it pains us to have to leave them off these systems.
We really need your feedback - so keep it coming (to submit feedback, see the section below: Where to get Support).
Right now, here are a few things we know are not working:
- emacs is on the head node, but missing on the other nodes (oops) - forcing the humiliation of using vim or nano for now (highest priority to fix in my book)
- renew before your last day (free to do so - see the section below: Extend your access) - because on the last day of your allocation you can still log in, but things like SLURM will stop working
- ulimits are forcing jobs to end within an hour, instead of running a full 4 hours as we intend
- getpwuid() is broken on nodes - you may see error messages or warnings like "username unknown" - mostly harmless, other than a few apps which will refuse to run
- many additional conda packages would be nice to have preinstalled (we will add more)
- the whole HPC toolkit (including Fortran) should be installed by default
- node 01 has a nasty habit of losing track of its PVC cards - we are investigating
- unzip needs installing - but gunzip is there as a capable alternative (see the workaround sketch after this list)
- ssh directory is owned by root to force use of web GUI to install ssh keys, but the web GUI is broken; clever users are working around it, others need to wait
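Until unzip is installed, Python's standard-library zipfile module can extract zip archives, and gunzip covers .gz files; a quick sketch (archive.zip is a placeholder name):
python -m zipfile -e archive.zip .   # extract archive.zip into the current directory
gunzip file.gz                       # .gz files are handled by the preinstalled gunzip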
This is subject to change - here is where we are now:
- The system should auto-extend your account if you are using it during the last week of your allocated time. If it is idle in the days before it expires, your account will disappear and we cannot restore it (asking would be futile).
- In the final week, you can visit https://scheduler.cloud.intel.com and request an extension. Ideally, a button will appear next to the instance on the "View Instances" tab. Click it, fill in the form, and submit. If nothing appears on the "View Instances" tab, then you need to go to the "Launch Instance" tab, check the box in front of "Scheduled access...", and click "Launch Instance." You should now see a button to request an extension; click it and follow the instructions to submit a request. If all else fails, please request support - see instructions.
We have a small team ramping to respond quickly: please go to the support address shown when you login and click "Submit Service Request."
We really hope you will contact us with feedback and requests by clicking "Submit Service Request" (goes to a small team) at intel.com/content/www/us/en/support/contact-intel.html#support-intel-products_67709:59441:231482.
You may ALSO send an email to ReadmeIDC@intel.com until July 27 for anything. WARNING: James and Ben may be much slower to respond than the ticket system, but we may be able to help with tough questions more quickly. Feel free to send it both ways until July 27. After July 27, please escalate to ReadmeIDC@intel.com if you do not get a response to your ticket within a day, or a plan for resolution within 3 days. Please reference your ticket numbers.
We are enhancing, extending, and refining daily! Please check back at https://tinyurl.com/ReadmeIDC often for new tips, and inevitable changes as we get better together!