Changes between Initial Version and Version 1 of Workshops/cypress/SlurmPractice


Timestamp:
Aug 22, 2018 12:46:46 PM (3 years ago)
Author:
fuji
  • Workshops/cypress/SlurmPractice

    v1 v1  
     1= Work with SLURM on Cypress =
     2If you haven't done so yet, download the sample files with:
     3
     4{{{svn co file:///home/fuji/repos/workshop ./workshop}}}
     5
     6To check out the sample files onto your local machine (Linux shell):
     7
     8{{{svn co svn+ssh://USERID@cypress1.tulane.edu/home/fuji/repos/workshop ./workshop}}}
     9
     10
     11== Introduction to Managed Cluster Computing ==
     12On your desktop you would open a terminal, compile the code using your favorite C compiler, and execute the code. You can do this without worry because you are the only person using your computer and you know what demands are being made on your CPU and memory at the time you run your code. On a cluster, many users must share the available resources equitably and simultaneously. It's the job of the resource manager to choreograph this sharing of resources by accepting a description of your program and the resources it requires, searching the available hardware for resources that meet your requirements, and making sure that no one else is given those resources while you are using them.
     13
     14Occasionally the manager will be unable to find the resources you need because they are in use by other users. In those instances your job will be "queued"; that is, the manager will wait until the needed resources become available before running your job. This will also occur if the total resources you request for all your jobs exceed the limits set by the cluster administrator. This ensures that all users have equal access to the cluster.
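
The queueing behaviour described above can be pictured with a toy model. This is only a sketch under simplifying assumptions: the function `schedule` and the job names are hypothetical, jobs are handled strictly in arrival order, and only node counts are considered. The actual scheduling policy on Cypress is not described here and is certainly more elaborate.

```python
def schedule(free_nodes, requests):
    """Start jobs in arrival order while nodes remain; queue the rest.

    requests: list of (job_name, nodes_needed) tuples (hypothetical jobs).
    Returns (running, queued) as lists of job names.
    """
    running, queued = [], []
    for name, nodes in requests:
        # A job waits if an earlier job is already waiting (strict
        # first-come, first-served) or if not enough nodes are free.
        if queued or nodes > free_nodes:
            queued.append(name)
        else:
            free_nodes -= nodes
            running.append(name)
    return running, queued


running, queued = schedule(4, [("a", 2), ("b", 3), ("c", 1)])
print("running:", running)
print("queued:", queued)
```

Here job `b` cannot start because only two nodes are left after `a`, so it is queued, and `c` waits behind it.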
     15
     16[[Image(https://docs.google.com/drawings/d/e/2PACX-1vQL7pibkwB5EK2z6d2I9wIu28baQt8Mu3U4FCfwOttWncEwurGa8r-sP2wQxNA1no0j_ik3bVV5s0X8/pub?w=480&h=360)]]
     17
     18== Serial Job Submission ==
     19Under the 'workshop' directory,
     20{{{
     21[fuji@cypress1 ~]$ cd workshop
     22[fuji@cypress1 workshop]$ ls
     23BlasLapack  Eigen3        HeatMass    JobArray1  JobDependencies  MPI     PETSc  precision  Python  ScaLapack  SimpleExample  TestCodes  uBLAS
     24CUDA        FlowInCavity  hybridTest  JobArray2  Matlab           OpenMP  PI     PSE        R       SerialJob  SLU40          TextFiles  VTK
     25}}}
     26
     27Under the 'SerialJob' directory,
     28{{{
     29[fuji@cypress1 workshop]$ cd SerialJob
     30[fuji@cypress1 SerialJob]$ ls
     31hello.py  slurmscript1  slurmscript2
     32}}}
     33
     34When your code runs on a single core only, your job script should request a single core.  The Python code 'hello.py' runs on a single core:
     35{{{
     36# HELLO PYTHON
     37import datetime
     38import socket
     39
     40now = datetime.datetime.now()
     41print 'Hello, world!'
     42print now.isoformat()
     43print socket.gethostname()
     44}}}
     45
     46Since this runs for a short time, you can try running it on the login node.
     47{{{
     48[fuji@cypress1 SerialJob]$ python ./hello.py
     49Hello, world!
     502018-08-22T11:46:05.394952
     51cypress1
     52}}}
     53This code prints a message, the current time, and the host name.
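
The `print` statements in 'hello.py' use Python 2 syntax. Which Python version a given environment provides is not stated here, so the following is only a sketch of an equivalent script that should run under both Python 2 and Python 3. The helper name `hello_lines` is hypothetical and not part of the workshop files.

```python
import datetime
import socket


def hello_lines():
    """Return the three lines that hello.py prints."""
    now = datetime.datetime.now()
    return ["Hello, world!", now.isoformat(), socket.gethostname()]


for line in hello_lines():
    # print(...) with a single argument behaves the same in Python 2 and 3.
    print(line)
```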
     54
     55Look at 'slurmscript1':
     56{{{
     57[fuji@cypress1 SerialJob]$ more slurmscript1
     58#!/bin/bash
     59#SBATCH --qos=workshop            # Quality of Service
     60#SBATCH --partition=workshop      # partition
     61#SBATCH --job-name=python       # Job Name
     62#SBATCH --time=00:01:00         # WallTime
     63#SBATCH --nodes=1               # Number of Nodes
     64#SBATCH --ntasks-per-node=1     # Number of tasks (MPI processes)
     65#SBATCH --cpus-per-task=1       # Number of threads per task (OMP threads)
     66
     67module load anaconda
     68python hello.py
     69}}}
     70
     71Notice that the SLURM script begins with '''#!/bin/bash'''. This line tells the Linux shell which interpreter to run the script with. In this example we use Bash (Bourne Again Shell).
     72The choice of interpreter (and the corresponding syntax) is up to the user, but every SLURM script should begin with such a line.
     73
     74For more on Bash shell scripting, see
     75[https://en.wikibooks.org/wiki/Bash_Shell_Scripting]
     76
     77In a Bash script, '''#''' and everything after it on the line is a comment.
     78So all the '''#SBATCH''' lines in the script above are comments as far as Bash is concerned,
     79but they are directives for the '''SLURM''' job scheduler.
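
To make the double role of these lines concrete, the sketch below picks the `#SBATCH` options out of a script's text. It is an illustration only: `sbatch_directives` is a hypothetical helper, it handles only the `--key=value` form used in 'slurmscript1', and it is not how SLURM itself parses job scripts.

```python
def sbatch_directives(script_text):
    """Collect '#SBATCH --key=value' options from a job script into a dict."""
    options = {}
    for line in script_text.splitlines():
        line = line.strip()
        if not line.startswith("#SBATCH"):
            continue  # ordinary command, blank line, or plain comment
        # Drop the '#SBATCH' prefix and any trailing '# comment'.
        body = line[len("#SBATCH"):].split("#", 1)[0].strip()
        if body.startswith("--") and "=" in body:
            key, value = body[2:].split("=", 1)
            options[key] = value.strip()
    return options


script = """#!/bin/bash
#SBATCH --qos=workshop            # Quality of Service
#SBATCH --time=00:01:00         # WallTime
#SBATCH --nodes=1               # Number of Nodes

module load anaconda
python hello.py
"""

print(sbatch_directives(script))
```

Bash skips every line that the helper picks up, while the helper ignores the two command lines that Bash actually executes.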
     80
     81=== qos, partition ===
     82These two lines determine the quality of service and the partition.
     83{{{
     84#SBATCH --qos=workshop            # Quality of Service
     85#SBATCH --partition=workshop      # partition
     86}}}
     87The default partition is '''defq'''. In '''defq''', you can choose either '''normal''' or '''long''' for '''qos'''.
     88||||||||= '''QOS limits''' =||
     89|| '''QOS name''' || '''maximum job size (node-hours)''' || '''maximum walltime per job''' || '''maximum nodes per user''' ||
     90|| normal      || N/A ||24 hours || 18 ||
     91|| long        || 168 ||168 hours ||  8 ||
     92
     93The differences between '''normal''' and '''long''' are the number of nodes you can request and the length of time your code can run.
     94The details will be explained in Parallel Jobs below.
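
As a rough illustration of how the limits in the QOS table interact, the sketch below checks a single job request against them. The names `QOS_LIMITS` and `qos_violations` are hypothetical, and the sketch assumes that job size is the number of nodes multiplied by the walltime in hours, as the "node-hours" unit in the table suggests.

```python
# Limits copied from the QOS table; None stands for "N/A" (no limit).
QOS_LIMITS = {
    "normal": {"max_node_hours": None, "max_walltime_hours": 24, "max_nodes": 18},
    "long": {"max_node_hours": 168, "max_walltime_hours": 168, "max_nodes": 8},
}


def qos_violations(qos, nodes, walltime_hours):
    """Return a list of the QOS limits that a single job request exceeds."""
    limits = QOS_LIMITS[qos]
    problems = []
    if nodes > limits["max_nodes"]:
        problems.append("too many nodes")
    if walltime_hours > limits["max_walltime_hours"]:
        problems.append("walltime too long")
    # Assumed definition: job size = nodes x walltime in hours.
    node_hours = nodes * walltime_hours
    if limits["max_node_hours"] is not None and node_hours > limits["max_node_hours"]:
        problems.append("job size too large")
    return problems


print(qos_violations("normal", 2, 100))
print(qos_violations("long", 2, 100))
print(qos_violations("long", 1, 168))
```

A 2-node, 100-hour job is too long for '''normal''' and too large (200 node-hours) for '''long''', while a 1-node, 168-hour job fits within '''long'''.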
     95
     96If you are using a workshop account, you can use only '''workshop''' qos and partition.
     97
     98=== job-name ===
     99{{{
     100#SBATCH --job-name=python       # Job Name
     101}}}
     102This sets the job name, which you can choose freely.
     103
     104=== time ===
     105{{{
     106#SBATCH --time=00:01:00         # WallTime
     107}}}
     108The maximum walltime is specified by '''#SBATCH --time=T''', where T has the format h:m:s. 
     109Normally, a job is expected to finish before the specified maximum walltime. 
     110Once the maximum walltime is reached, the job is terminated regardless of whether its processes are still running.
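
For a quick sanity check of a walltime value, the h:m:s string can be converted to seconds. This is a minimal sketch: `walltime_seconds` is a hypothetical helper that handles only the three-field h:m:s form described above, not the other time formats that `sbatch` accepts.

```python
def walltime_seconds(text):
    """Convert an 'h:m:s' walltime string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


print(walltime_seconds("00:01:00"))  # the one-minute limit in slurmscript1
print(walltime_seconds("24:00:00"))  # a 24-hour limit
```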
     111
     112=== Resource Request ===
     113{{{
     114#SBATCH --nodes=1               # Number of Nodes
     115#SBATCH --ntasks-per-node=1     # Number of tasks (MPI processes)
     116#SBATCH --cpus-per-task=1       # Number of threads per task (OMP threads)
     117}}}
     118
     119The resource request '''#SBATCH --nodes=N''' determines how many compute nodes the scheduler allocates to the job; this job is allocated only 1 node. 
     120
     121'''#SBATCH --ntasks-per-node=n''' determines the number of tasks for MPI jobs. The details will be explained in Parallel Jobs below.
     122
     123'''#SBATCH --cpus-per-task=c'''  determines the number of cores/threads for a task. The details will be explained in Parallel Jobs below.
     124
     130This script requests one core on one node.
     131
     132[[Image(https://docs.google.com/drawings/d/e/2PACX-1vSlffILDUxxzh_QpD4M7P5-bY_tCkYNjA9xIYWuUUqz_HBBczQ18o5AWA9OZ5_w5Q0bwQJbdgmUCuMJ/pub?w=594&h=209)]]
     133
     134There are 124 nodes on the Cypress system. Each node has 20 cores.
     135
     136[[Image(https://docs.google.com/drawings/d/e/2PACX-1vQR7ztCNSIQhIjyW28FyYaQn92XC4Zq_vZzoPwALkywmXoyRl8qC2MEpT1t68zMopZv2yHNt2unMf-i/pub?w=155&h=134)]]
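
The arithmetic behind the resource request can be sketched as follows. The function `cores_requested` is hypothetical; it assumes that each task uses `cpus-per-task` cores and that the tasks on one node cannot use more cores than the node has.

```python
NODES_ON_CYPRESS = 124  # from the text above
CORES_PER_NODE = 20     # from the text above


def cores_requested(nodes, ntasks_per_node, cpus_per_task):
    """Total cores implied by --nodes, --ntasks-per-node and --cpus-per-task."""
    per_node = ntasks_per_node * cpus_per_task
    if per_node > CORES_PER_NODE:
        raise ValueError("request needs more cores than one node provides")
    return nodes * per_node


print(cores_requested(1, 1, 1))           # the request in slurmscript1
print(NODES_ON_CYPRESS * CORES_PER_NODE)  # total cores on the system
```

With the values from 'slurmscript1' this gives 1 core, out of 124 x 20 = 2480 cores on the whole system.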