ttt50966 / CUDA_Course

Usage of the twcp10 cluster:

1. System configuration:
   - Hardware:  Nvidia GeForce GTX 970

   - OS:        Debian GNU/Linux version 7.11           

   - Software:
     - CUDA 9.1:        /usr/local/nvidia
     - Intel Compilers: /opt/intel

2. All jobs should be submitted and run through the condor system.

3. In the condor system, the available queues and list of machines can be
   queried by the "nodelist" command. Here is an example output:

  Name          MTYPE       Config     State     NP     JobID  Activity  LoadAv
-------------------------------------------------------------------------------
vm1.1@twqcd7    TEST        2GPU-24G   Claimed    1    486601    Busy     1.000
vm1.2@twqcd7    TEST        2GPU-24G   Claimed    1    486602    Busy     1.000
  vm1@twqcd58   TEST        2GPU-24G   Unclaimed  2       n/a    Idle     0.030
  vm1@twqcd87   TEST        2GPU-24G   Unclaimed  2       n/a    Idle     0.010

Number of TEST:       total=06, busy=02, free=04

   where:

   - Name:     the machine hostname, together with the job slot ID on that
               machine.
   - MTYPE:    the queue name.
   - Config:   the hardware configuration summary (number of GPUs and the
               size of host memory) of that machine.
   - State:    the current state of that machine: "Claimed" means occupied
               by a job, and "Unclaimed" means unoccupied.
   - NP:       number of GPUs in that machine.
   - JobID:    the ID of the job running in that machine.
   - Activity: the machine activity, Busy or Idle.
   - LoadAv:   the machine load average.

   Finally, the "Number of <queue_name>" line counts the total number of GPUs
   belonging to that queue.


6. To run jobs in the cluster, please follow these guidelines:

   - Create a working directory under /work/<account>/ for your job.

   - Put all the necessary data files and input files in the working
     directory.

   - Prepare a job description file to tell condor the requirements of
     your job, such as which queue to run in, how many GPUs are needed,
     etc. An example job description file named "cmd" is available in
     the ~/example directory. It is self-documented. Please use it as
     a template and modify it to fulfill your job requirements (a hedged
     sketch of such a file is shown after this list).

   - To submit your job, please run (suppose that your job description
     filename is "cmd"):

     condor_submit cmd

   - After that, you can use the command "jview" to check the job status.
     Here is an example output:

 JobID  User     RunTime NGPU  ST    Host     Queue   Config    Wdir
---------------------------------------------------------------------------------------------
   100  twchiu     48.4h    2   R    twqcd7    TEST   2GPU-32G  /work/twchiu/jobs/testrun

     where:

     - JobID:   The ID number of this job.
     - User:    The owner of this job.
     - RunTime: The running time of this job.
     - NGPU:    The number of GPUs used in this job.
     - ST:      The job state. R: Running, H: Holding (waiting for available
                computing resources), I: Ready to start.
      - Host:    The node running this job; a numeric ID such as 802 refers
                 to the node "twqcd802".
     - Queue:   The queue name which runs this job.
     - Config:  The short configuration description of the computing node.
     - Wdir:    The working directory of this job.

   - If you want to kill a job, please use the command:

     condor_rm <JobID>
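
   For reference, the following is a minimal sketch of what a job
   description file may look like. It is only an illustration based on
   standard HTCondor submit syntax and on the attributes shown by
   "nodelist"; the actual ~/example/cmd file is the authoritative template
   and may use different settings (e.g. for requesting the number of GPUs).

     # Hypothetical sketch -- consult ~/example/cmd for the real template.
     Universe     = vanilla
     Executable   = /opt/bin/runjob           # job start-up script (see item 7)
     Arguments    = ./jexec inp.txt out.txt   # i.e. ./jexec < inp.txt >> out.txt
     Initialdir   = /work/<account>/testrun   # your working directory
     Requirements = (MTYPE == "TEST")         # queue name, as listed by "nodelist"
     Output       = condor.out
     Error        = condor.err
     Log          = condor.log
     Queue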

7. About GPU_ID

   Each computing node has two GPU cards. A job can either use one GPU
   (so that the other GPU can run another single-GPU job), or use both GPUs
   together. To prevent conflicts with existing jobs, each job should query
   for a free GPU ID and use it before running.

   In this cluster, the job start-up script "/opt/bin/runjob" provides the
   available GPU IDs for each job. 

   Suppose that your code or script is named "jexec". In the job
   description file, the arguments line looks like:

   Arguments = ./jexec inp.txt out.txt

   which is equivalent to running:  ./jexec < inp.txt >> out.txt, where:

      - jexec:   the code or script that runs your computation.
      - inp.txt: the input file fed to the STDIN of your code "jexec".
      - out.txt: the output file receiving the STDOUT of your code "jexec".

   Your input file should contain a parameter with the keyword "GPU_ID".
   When the job starts, this keyword in the input file is replaced with the
   actual GPU ID and passed to your code "jexec".
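
   As an illustration, the sketch below (an assumption, not the actual code
   used in this course) shows how a CUDA program built as "jexec" could read
   the substituted GPU ID from its input and bind to that device with
   cudaSetDevice(). The input line format "gpu_id <N>" is hypothetical;
   follow whatever format your own input file uses.

     // jexec.cu -- minimal sketch; build with:  nvcc -o jexec jexec.cu
     #include <cstdio>
     #include <cuda_runtime.h>

     int main(void)
     {
         char key[64];
         int  gpu_id = 0;

         // STDIN is redirected from inp.txt by the start-up script; the
         // GPU_ID keyword there has already been replaced by the real ID,
         // e.g. a line of the (hypothetical) form "gpu_id 0".
         if (scanf("%63s %d", key, &gpu_id) != 2) {
             fprintf(stderr, "failed to read the GPU ID from the input\n");
             return 1;
         }

         // Bind this process to the assigned GPU before doing any CUDA work.
         cudaError_t err = cudaSetDevice(gpu_id);
         if (err != cudaSuccess) {
             fprintf(stderr, "cudaSetDevice(%d): %s\n",
                     gpu_id, cudaGetErrorString(err));
             return 1;
         }

         printf("running on GPU %d\n", gpu_id);   // goes to out.txt via STDOUT
         return 0;
     }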

8. To perform a test run in ~/example

   - Edit the parameter "Initialdir" in the "cmd" file.

   - Use the command "condor_submit cmd" to submit the job.
