Changes between Initial Version and Version 1 of Workshops/IntroToHpc2015/using


Timestamp: 10/12/15 15:56:10
Author: pdejesus

[[PageOutline]]

= Submitting Jobs on Cypress =

In this section we will examine how to submit jobs on Cypress using the SLURM resource manager. We’ll begin with the basics and proceed to examples of jobs that employ MPI, OpenMP, and hybrid parallelization schemes.


== Quick Start for PBS users ==

Cypress uses SLURM to schedule jobs and manage resources. Full documentation and tutorials for SLURM can be found on the SLURM website at:

http://slurm.schedmd.com/documentation.html

Additionally, those who are familiar with the Torque-PBS manager used on Aries and Sphynx may find the "SLURM Rosetta Stone" particularly useful:

http://slurm.schedmd.com/rosetta.html
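
As a quick, non-exhaustive sketch of how a few common Torque-PBS commands and directives map onto SLURM (see the Rosetta Stone above for the full list):

{{{#!bash
# Torque-PBS                      # SLURM equivalent
qsub myjob.pbs                    # sbatch myjob.srun
qstat -u tulaneID                 # squeue -u tulaneID
qdel <jobID>                      # scancel <jobID>

#PBS -N myJob                     # #SBATCH --job-name=myJob
#PBS -l nodes=1:ppn=1             # #SBATCH --nodes=1 --ntasks-per-node=1
#PBS -l walltime=01:00:00         # #SBATCH --time=01:00:00
#PBS -q myqueue                   # #SBATCH --qos=myqueue
}}}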

Lastly, resource limits on Cypress are divided into separate Quality of Service (QOS) levels. These are analogous to the queues on Sphynx. You may choose a QOS by using the appropriate script directive in your submission script, e.g.

{{{#!bash
#SBATCH --qos=long
}}}

The default QOS is normal. For a list of the available QOS levels and their associated limits, please see the [https://wiki.hpc.tulane.edu/trac/wiki/cypress/about about] section of this wiki.
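
You can also query the QOS levels directly from SLURM with the `sacctmgr` utility (a minimal sketch; the exact fields worth displaying depend on how accounting is configured on the cluster):

{{{#!bash
# List the QOS levels known to SLURM along with a few of their limits
sacctmgr show qos format=Name,Priority,MaxWall
}}}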

== Using SLURM on Cypress ==
=== Introduction to Managed Cluster Computing ===

For those who are new to cluster computing and resource management, let's begin with an explanation of what a resource manager is and why it is necessary. Suppose you have a piece of C code that you would like to compile and execute, for example a helloworld program.

{{{#!c
#include <stdio.h>

int main(){
    printf("Hello World\n");
    return 0;
}
}}}

On your desktop you would open a terminal, compile the code using your favorite C compiler, and execute the code. You can do this without worry as you are the only person using your computer and you know what demands are being made on your CPU and memory at the time you run your code. On a cluster, many users must share the available resources equitably and simultaneously. It's the job of the resource manager to choreograph this sharing of resources by accepting a description of your program and the resources it requires, searching the available hardware for resources that meet your requirements, and making sure that no one else is given those resources while you are using them.
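
On a standalone machine that workflow might look like the following (a minimal sketch, assuming the GNU C compiler and a source file named helloworld.c):

{{{#!bash
# Compile the source file into an executable named helloworld, then run it
gcc -o helloworld helloworld.c
./helloworld
}}}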

Occasionally the manager will be unable to find the resources you need due to usage by other users. In those instances your job will be "queued"; that is, the manager will wait until the needed resources become available before running your job. This will also occur if the total resources you request for all your jobs exceed the limits set by the cluster administrator. This ensures that all users have equal access to the cluster.

The take-home point here is this: in a cluster environment a user submits jobs to a resource manager, which in turn runs one or more executables for the user. So how do you submit a job request to the resource manager? Job requests take the form of scripts, called job scripts. These scripts contain script directives, which tell the resource manager what resources the executable requires. The user then submits the job script to the scheduler.

The syntax of these script directives is manager specific. For the SLURM resource manager, all script directives begin with "#SBATCH". Let's look at a basic SLURM script requesting one node and one core on which to run our helloworld program.

{{{#!bash
#!/bin/bash
#SBATCH --job-name=HiWorld    ### Job Name
#SBATCH --output=Hi.out       ### File in which to store job output
#SBATCH --error=Hi.err        ### File in which to store job error messages
#SBATCH --qos=workshop        ### Quality of Service (like a queue in PBS)
#SBATCH --partition=workshop  ### Partition to run on (not needed with normal and long queues)
#SBATCH --time=0-00:01:00     ### Wall clock time limit in Days-HH:MM:SS
#SBATCH --nodes=1             ### Node count required for the job
#SBATCH --ntasks-per-node=1   ### Number of tasks to be launched per node

./helloworld
}}}

Notice that the SLURM script begins with #!/bin/bash. This tells the Linux shell which shell interpreter to run; in this example we use Bash (the Bourne Again SHell). The choice of interpreter (and subsequent syntax) is up to the user, but every SLURM script should begin this way. This is followed by a collection of #SBATCH script directives telling the manager about the resources needed by our code and where to put the code's output. Lastly, we have the executable we wish the manager to run (note: this script assumes it is located in the same directory as the executable).

With our SLURM script complete, we’re ready to run our program on the cluster. To submit our script to SLURM, we invoke the '''sbatch''' command. Suppose we saved our script in the file helloworld.srun (the extension is not important). Then our submission would look like:

{{{#!comment
[[Image(sbatch.png, 50%, center)]]
}}}

{{{
[tulaneID@cypress1 ~]$ sbatch helloworld.srun
Submitted batch job 6041
[tulaneID@cypress1 ~]$
}}}

Our job was successfully submitted and was assigned the job number 6041. We can check the output of our job by examining the contents of our output and error files. Referring back to the helloworld.srun SLURM script, notice the lines

{{{#!bash
#SBATCH --output=Hi.out       ### File in which to store job output
#SBATCH --error=Hi.err        ### File in which to store job error messages
}}}

These specify files in which to store the output written to standard out and standard error, respectively. If our code ran without issue, then the Hi.err file should be empty and the Hi.out file should contain our greeting.

{{{#!comment
[[Image(Hi_output.png, 50%, center)]]
}}}
{{{
[tulaneID@cypress1 ~]$ cat Hi.err
[tulaneID@cypress1 ~]$ cat Hi.out
Hello World
[tulaneID@cypress1 ~]$
}}}

There are two more commands we should familiarize ourselves with before we begin. The first is the “squeue” command. This shows us the list of jobs that have been submitted to SLURM and are either currently running or waiting in the queue to run. The second is the “scancel” command. This allows us to terminate a job that is currently in the queue. To see these commands in action, let's simulate a one hour job by using the sleep command at the end of a new submission script.
{{{#!bash
#!/bin/bash
#SBATCH --job-name=OneHourJob ### Job Name
#SBATCH --time=0-01:00:00     ### Wall clock time limit in Days-HH:MM:SS
#SBATCH --nodes=1             ### Node count required for the job
#SBATCH --ntasks-per-node=1   ### Number of tasks to be launched per node

sleep 3600
}}}

Notice that we've omitted some of the script directives from our hello world submission script. We will still run on the normal QOS as that's the default on Cypress. However, when no output directives are given, SLURM will redirect the output of our executable (including any error messages) to a file labeled with our job's ID number. This number is assigned upon submission. Let's suppose that the above is stored in a file named oneHourJob.srun and we submit our job using the '''sbatch''' command. Then we can check on the progress of our job using squeue and we can cancel the job by executing scancel on the assigned job ID.

[[Image(squeue_scancel2.png, 50%, center)]]

Notice that when we run the squeue command, our job status is marked R for running and has been running for 7 seconds. The squeue command also tells us what node our job is being run on, in this case node 123. When running squeue in a research environment you will usually see a long list of users running multiple jobs. To single out your own jobs you can use the "-u" option flag to specify your user name.
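
The whole sequence might look something like the following (a sketch with an illustrative job ID and node name; your own output will differ):

{{{
[tulaneID@cypress1 ~]$ sbatch oneHourJob.srun
Submitted batch job 6042
[tulaneID@cypress1 ~]$ squeue -u tulaneID
  JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
   6042      defq OneHourJ tulaneID  R   0:07      1 cypress01-123
[tulaneID@cypress1 ~]$ scancel 6042
[tulaneID@cypress1 ~]$
}}}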

Congratulations, you are ready to begin running jobs on Cypress!

=== MPI Jobs ===

Now let’s look at how to run an MPI based job across multiple nodes. SLURM does a nice job of interfacing with the mpirun command to minimize the amount of information the user needs to provide. For instance, SLURM will automatically provide a host list and the number of processes based on the script directives provided by the user.

Let’s say that we would like to run an MPI based executable named myMPIexecutable. Let’s further suppose that we wished to run it using a total of 80 MPI processes. Recall that each node of Cypress is equipped with two Intel Xeon 10 core processors. Then a natural way of breaking up our problem would be to run it on four nodes using 20 processes per node. Here we run into the semantics of SLURM. We would ask SLURM for four nodes and 20 “tasks” per node.
{{{#!bash
#!/bin/bash
#SBATCH --qos=normal
#SBATCH --job-name=MPI_JOB
#SBATCH --time=0-01:00:00
#SBATCH --output=MPIoutput.out
#SBATCH --error=MPIerror.err
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=20

module load intel-psxe/2015-update1

############ THE JOB ITSELF #############################
echo Start Job

echo nodes: $SLURM_JOB_NODELIST
echo job id: $SLURM_JOB_ID
echo Number of tasks: $SLURM_NTASKS

mpirun myMPIexecutable

echo End Job
}}}

Again, notice that we did not need to feed any of the usual information to mpirun regarding the number of processes, hostfiles, etc., as this is handled automatically by SLURM. Another thing to note is the loading of the intel-psxe (Parallel Studio) module. This loads the Intel instantiation of MPI, including mpirun. If you would like to use OpenMPI then you should load the openmpi/gcc/64/1.8.2-mlnx-ofed2 module or one of the other OpenMPI versions currently available on Cypress. We also take advantage of a couple of SLURM's output environment variables to automate our record keeping. Now a record of what nodes we ran on, our job ID, and the number of tasks used will be written to the MPIoutput.out file. While this is certainly not necessary, it often pays dividends when errors arise.
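
For completeness, the MPI executable itself would be built ahead of time with the wrapper compiler provided by whichever MPI module you load. A minimal sketch, assuming a source file named myMPIcode.c and the mpicc wrapper:

{{{#!bash
# Load an MPI implementation, then compile with its wrapper compiler
module load intel-psxe/2015-update1
mpicc -o myMPIexecutable myMPIcode.c
}}}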


=== OpenMP Jobs ===

When running OpenMP (OMP) jobs on Cypress, it’s necessary to set your environment variables to reflect the resources you’ve requested. Specifically, you must export the variable OMP_NUM_THREADS so that its value matches the number of cores you have requested from SLURM. This can be accomplished through the use of SLURM's built-in export environment variables.

{{{#!bash
#!/bin/bash
#SBATCH --qos=normal
#SBATCH --job-name=OMP_JOB
#SBATCH --time=1-00:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=20

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

./myOMPexecutable
}}}


In the script above we request 20 cores on one node of Cypress (which is all the cores available on any node). As SLURM regards tasks as being analogous to MPI processes, it’s better to use the cpus-per-task directive when employing OpenMP parallelism. Additionally, the SLURM export variable $SLURM_CPUS_PER_TASK stores whatever value we assign to cpus-per-task, and is therefore our candidate for passing to OMP_NUM_THREADS.
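
If you are new to OpenMP, a minimal program that would behave correctly under this script might look like the following (an illustrative sketch, not the actual myOMPexecutable; it would be compiled with an OpenMP flag such as gcc -fopenmp):

{{{#!c
#include <stdio.h>
#include <omp.h>

int main(){
    /* The parallel region spawns OMP_NUM_THREADS threads,
       i.e. the value exported from $SLURM_CPUS_PER_TASK above. */
    #pragma omp parallel
    {
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}
}}}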

=== Hybrid Jobs ===

When running MPI/OpenMP hybrid jobs on Cypress, the submission script might look like the following:

{{{#!bash
#!/bin/bash
#SBATCH --qos=normal            # Quality of Service
#SBATCH --job-name=hybridTest   # Job Name
#SBATCH --time=00:10:00         # WallTime
#SBATCH --nodes=2               # Number of Nodes
#SBATCH --ntasks-per-node=2     # Number of tasks (MPI processes)
#SBATCH --cpus-per-task=10      # Number of threads per task (OMP threads)

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

mpirun ./myHybridExecutable
}}}

In the script above we request 2 tasks per node and 10 cpus per task, which comes to 20 cores per node, all the cores available on a node.
We request 2 nodes, so we run 4 MPI processes in total, each of which can use 10 OpenMP threads.
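
To make the division of labor concrete, here is an illustrative hybrid hello-world (a sketch of what myHybridExecutable could contain, not the actual code; it would be compiled with an MPI wrapper compiler plus an OpenMP flag):

{{{#!c
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv){
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each of the 4 MPI processes opens its own team of 10 OpenMP threads
       (the count comes from OMP_NUM_THREADS, set in the job script). */
    #pragma omp parallel
    {
        printf("MPI rank %d of %d, thread %d of %d\n",
               rank, size, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
}}}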


=== MIC Native Jobs ===

There are two ''Intel Xeon Phi'' co-processors (MICs) on each node of Cypress. To run your code natively on the MIC, you must compile the code with the "-mmic" option.
An executable built for the MIC cannot run on the host CPU. To launch a MIC native executable from the host, use the "micnativeloadex" command.
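
Building such an executable might look like the following (a minimal sketch, assuming the Intel compiler from the intel-psxe module and a source file named myNativeCode.c; the -openmp flag is only needed if the code uses OpenMP):

{{{#!bash
module load intel-psxe/2015-update1
# -mmic tells the Intel compiler to target the Xeon Phi rather than the host CPU
icc -mmic -openmp -o myNativeExecutable myNativeCode.c
}}}
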
A SLURM job script for a MIC native run looks like, for example:

{{{#!bash
#!/bin/bash
#SBATCH --qos=normal            # Quality of Service
#SBATCH --job-name=nativeTest   # Job Name
#SBATCH --time=00:10:00         # WallTime
#SBATCH --nodes=1               # Number of Nodes
#SBATCH --ntasks-per-node=1     # Number of tasks (MPI processes)
#SBATCH --cpus-per-task=1       # Number of processors per task (OpenMP threads)
#SBATCH --gres=mic:1            # Number of Co-Processors

micnativeloadex ./myNativeExecutable -e "OMP_NUM_THREADS=100" -d 0 -v
}}}


In the script above we request one MIC device, which will be device number 0.
The "micnativeloadex" command launches the MIC native executable. The -e "OMP_NUM_THREADS=100" option sets the number of threads on the MIC device to 100.
For more options, see below.

{{{#!bash
[fuji@cypress01-090 nativeTest]$ micnativeloadex -h

Usage:
micnativeloadex [ -h | -V ] AppName -l -t timeout -p -v -d coprocessor -a "args" -e "environment"
  -a "args" An optional string of command line arguments to pass to
            the remote app.
  -d The (zero based) index of the Intel(R) Xeon Phi(TM) coprocessor to run the app on.
  -e "environment" An optional environment string to pass to the remote app.
      Multiple environment variable may be specified using spaces as separators:
        -e "LD_LIBRARY_PATH=/lib64/ DEBUG=1"
  -h Print this help message
  -l Do not execute the binary on the coprocessor. Instead, list the shared library
     dependency information.
  -p Disable console proxy.
  -t Time to wait for the remote app to finish (in seconds). After the timeout
     is reached the remote app will be terminated.
  -v Enable verbose mode. Note that verbose output will be displayed
     if the remote app terminates abnormally.
  -V Show version and build information
}}}

== Submitting Interactive Jobs ==

For those who develop their own codes, we provide the app `idev`, which gives interactive access to a set of compute nodes so that you can quickly compile, run, and validate MPI or other applications multiple times in rapid succession.

=== The app `idev` (Interactive DEVelopment) ===

The `idev` application creates an interactive development environment from the user's login window. In the `idev` window the user is connected directly to a compute node, from which the user can launch executables directly.
The `idev` command submits a batch job that creates a copy of the batch environment and then goes to sleep. After the job begins, `idev` acquires a copy of the batch environment, SSHes to the master node of the job, and then re-creates the batch environment there.

==== How to use `idev` ====

On Cypress login nodes (cypress1 or cypress2),
{{{#!bash
[user@cypress1 ~]$ idev
}}}

By default, `idev` submits a job requesting '''one node for one hour'''. It also requests two ''Intel Phi'' (MIC) co-processors.
If there is an available node, your job will become active immediately and the `idev` app initiates an SSH session to the compute node. For example:

{{{#!bash
[fuji@cypress1 ~]$ idev
Requesting 1 node(s)  task(s) to normal queue of defq partition
1 task(s)/node, 20 cpu(s)/task, 2 MIC device(s)/node
Time: 0 (hr) 60 (min).
Submitted batch job 8981
JOBID=8981 begin on cypress01-100
--> Creating interactive terminal session (login) on node cypress01-100.
--> You have 0 (hr) 60 (min).
Last login: Mon Apr 27 14:45:38 2015 from cypress1.cm.cluster
[fuji@cypress01-100 ~]$
}}}

Note the prompt, "cypress01-100", in the above session. It is your interactive compute-node prompt; there you can load modules, compile code, and test it.
`idev` transfers your environment variables to the compute node. Therefore, if you have loaded some modules on the login node, you don't have to load the same modules again.

'''FOR WORKSHOP'''
{{{#!bash
export MY_PARTITION=workshop
export MY_QUEUE=workshop
idev -c 4 --gres=mic:0
}}}

==== Options ====

By default only a single node is requested for 60 minutes. However, you can change the limits with command line options, using syntax similar to the request specifications used in a job script.
The syntax is conveniently described in the `idev` help display:

{{{#!bash
[fuji@cypress1 ~]$ idev --help
-c|--cpus-per-task=     : Cpus per Task
-N|--nodes=             : Number of Nodes
-n|--ntasks-per-node=   : Number of Tasks per Node
--gres=                 : Number of MIC per Node
-t|--time=              : Wall Time
}}}

For example, if you want to use 4 nodes for 4 hours,

{{{#!bash
[fuji@cypress1 ~]$ idev -N 4 -t 4:00:00
Requesting 4 node(s)  task(s) to normal queue of defq partition
1 task(s)/node, 20 cpu(s)/task, 2 MIC device(s)/node
Time: 04 (hr) 00 (min).
Submitted batch job 8983
JOBID=8983 begin on cypress01-100
--> Creating interactive terminal session (login) on node cypress01-100.
--> You have 04 (hr) 00 (min).
Last login: Mon Apr 27 14:48:45 2015 from cypress1.cm.cluster
[fuji@cypress01-100 ~]$
}}}

==== MIC native run ====

You can log in to a MIC device and run native code there. There are two MIC devices on each node, mic0 and mic1.

{{{#!bash
[fuji@cypress01-100 nativeTest]$ ssh mic0
fuji@cypress01-100-mic0:~$
}}}

The prompt, "cypress01-100-mic0", in the above session is your interactive MIC device prompt. Note that you cannot run CPU code there; in particular, you cannot compile code on the device, even when compiling for native code.
Environment variables are not set and the ''module'' command does not work there. To run a native executable that uses shared libraries, you have to set the environment variables manually, for example:
{{{#!bash
export LD_LIBRARY_PATH=/share/apps/intel_parallel_studio_xe/2015_update1/lib/mic:$LD_LIBRARY_PATH
}}}

==== Questions or Suggestions ====

If you have ideas for enhancing `idev` with new features or any questions, please send email to hpcadmin@tulane.edu.