ttt50966 / CUDA_Course

Usage of the twcp10 cluster:

1. System configuration:
   - Hardware:  Nvidia GeForce GTX 970

   - OS:        Debian GNU/Linux version 7.11           

   - Software:
     - CUDA 9.1:        /usr/local/nvidia
     - Intel Compilers: /opt/intel

2. All jobs should be submitted and run through the condor system.

3. In the condor system, the available queues and list of machines can be
   queried by the "nodelist" command. Here is an example output:

  Name          MTYPE       Config     State     NP     JobID  Activity  LoadAv
-------------------------------------------------------------------------------
vm1.1@twqcd7    TEST        2GPU-24G   Claimed    1    486601    Busy     1.000
vm1.2@twqcd7    TEST        2GPU-24G   Claimed    1    486602    Busy     1.000
  vm1@twqcd58   TEST        2GPU-24G   Unclaimed  2       n/a    Idle     0.030
  vm1@twqcd87   TEST        2GPU-24G   Unclaimed  2       n/a    Idle     0.010

Number of TEST:       total=06, busy=02, free=04

   where:

   - Name:     the machine hostname, together with the job slot ID on that
               machine.
   - MTYPE:    the queue name.
   - Config:   the hardware configuration summary (number of GPUs and the
               size of host memory) of that machine.
   - State:    the current state of that machine: "Claimed" means occupied
               by a job, and "Unclaimed" means unoccupied.
   - NP:       number of GPUs in that machine.
   - JobID:    the ID of the job running in that machine.
   - Activity: the machine activity, Busy or Idle.
   - LoadAv:   the machine load average.

   Finally, the "Number of <queue_name>" line counts the total number of GPUs
   belonging to that queue.


6. To run jobs in the cluster, please follow these guidelines:

   - Create a working directory under /work/<account>/ for your job.

   - Put all the necessary data files and input files in the working
     directory.

   - Prepare a job description file to tell condor the requirements of
     your job, such as which queue to run in, how many GPUs are needed,
     etc. An example job description file named "cmd" is available in
     the ~/example directory. It is self-documented. Please use it as
     a template and modify it to fulfill your job requirements (a hedged
     sketch of such a file is shown after this list).

   - To submit your job, please run (suppose that your job description
     filename is "cmd"):

     condor_submit cmd

   - After that, you can use the command "jview" to check the job status.
     Here is an example output:

 JobID  User     RunTime NGPU  ST    Host     Queue   Config    Wdir
---------------------------------------------------------------------------------------------
   100  twchiu     48.4h    2   R    twqcd7    TEST   2GPU-32G  /work/twchiu/jobs/testrun

     where:

     - JobID:   The ID number of this job.
     - User:    The owner of this job.
     - RunTime: The running time of this job.
     - NGPU:    The number of GPUs used in this job.
     - ST:      The job state. R: Running, H: Holding (waiting for available
                computing resources), I: Ready to start.
      - Host:    The node running this job; a numeric ID such as 802 refers
                 to the node "twqcd802".
     - Queue:   The queue name which runs this job.
     - Config:  The short configuration description of the computing node.
     - Wdir:    The working directory of this job.

   - If you want to kill a job, please use the command:

     condor_rm <JobID>
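
   For reference, the following is a minimal sketch of what a job
   description file may look like. It is only an illustration based on
   standard HTCondor submit syntax and on the attributes shown by
   "nodelist"; the actual ~/example/cmd file is the authoritative template
   and may use different settings (e.g. for requesting the number of GPUs).

     # Hypothetical sketch -- consult ~/example/cmd for the real template.
     Universe     = vanilla
     Executable   = /opt/bin/runjob           # job start-up script (see item 7)
     Arguments    = ./jexec inp.txt out.txt   # i.e. ./jexec < inp.txt >> out.txt
     Initialdir   = /work/<account>/testrun   # your working directory
     Requirements = (MTYPE == "TEST")         # queue name, as listed by "nodelist"
     Output       = condor.out
     Error        = condor.err
     Log          = condor.log
     Queue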

7. About GPU_ID

   Each computing node has two GPU cards. A job can either use one GPU
   (so that the other GPU can run another single-GPU job), or use both GPUs
   together. To prevent conflicts with existing jobs, each job should query
   for a free GPU ID and use it before running.

   In this cluster, the job start-up script "/opt/bin/runjob" provides the
   available GPU IDs for each job. 

   Suppose that your code or script is named "jexec". In the job
   description file, the arguments line looks like:

   Arguments = ./jexec inp.txt out.txt

   which is equivalent to running:  ./jexec < inp.txt >> out.txt, where:

      - jexec:   the code or script that runs your computation.
      - inp.txt: the input file fed to the STDIN of your code "jexec".
      - out.txt: the output file receiving the STDOUT of your code "jexec".

   Your input file should contain a parameter with the keyword "GPU_ID".
   When the job starts, this keyword in the input file is replaced with the
   actual GPU ID and passed to your code "jexec".
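
   As an illustration, the sketch below (an assumption, not the actual code
   used in this course) shows how a CUDA program built as "jexec" could read
   the substituted GPU ID from its input and bind to that device with
   cudaSetDevice(). The input line format "gpu_id <N>" is hypothetical;
   follow whatever format your own input file uses.

     // jexec.cu -- minimal sketch; build with:  nvcc -o jexec jexec.cu
     #include <cstdio>
     #include <cuda_runtime.h>

     int main(void)
     {
         char key[64];
         int  gpu_id = 0;

         // STDIN is redirected from inp.txt by the start-up script; the
         // GPU_ID keyword there has already been replaced by the real ID,
         // e.g. a line of the (hypothetical) form "gpu_id 0".
         if (scanf("%63s %d", key, &gpu_id) != 2) {
             fprintf(stderr, "failed to read the GPU ID from the input\n");
             return 1;
         }

         // Bind this process to the assigned GPU before doing any CUDA work.
         cudaError_t err = cudaSetDevice(gpu_id);
         if (err != cudaSuccess) {
             fprintf(stderr, "cudaSetDevice(%d): %s\n",
                     gpu_id, cudaGetErrorString(err));
             return 1;
         }

         printf("running on GPU %d\n", gpu_id);   // goes to out.txt via STDOUT
         return 0;
     }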

8. To perform a test run in ~/example

   - Edit the parameter "Initialdir" in the "cmd" file.

   - Use the command "condor_submit cmd" to submit the job.
