Changes between Version 3 and Version 4 of Matlab-Slurm


Timestamp:
Jul 13, 2021 4:29:01 PM (5 months ago)
Author:
fuji
  • Matlab-Slurm

    v3 v4  
    7171=== Optional ===
    7272{{{
    73 >> % Specify an account to use for MATLAB jobs
    74 >> c.AdditionalProperties.AccountName = 'account-name';
    75 }}}
    76 
    77 {{{
    7873>> % Specify e-mail address to receive notifications about your job
    7974>> c.AdditionalProperties.EmailAddress = 'user-id@tulane.edu';
     
    9186
    9287{{{
    93 >> % Specify processors per node
    94 >> c.AdditionalProperties.ProcsPerNode = '2';
     88>> % Specify processors per node (maximum 20 on Cypress)
     89>> c.AdditionalProperties.ProcsPerNode = 20;
    9590}}}
    9691
     
    9994>> c.AdditionalProperties.QoS = 'qos-value';
    10095}}}
    101 
    102 {{{
    103 >> % Specify a queue to use for MATLAB jobs                             
    104 >> c.AdditionalProperties.QueueName = 'queue-name';
    105 }}}
     96The default is ''normal''. See [https://wiki.hpc.tulane.edu/trac/wiki/cypress/about#SLURMresourcemanager here] for other options.
    10697
    10798{{{
     
    109100>> c.AdditionalProperties.WallTime = '05:00:00';
    110101}}}
     102See [https://wiki.hpc.tulane.edu/trac/wiki/cypress/about#SLURMresourcemanager here] for the maximum walltime.
    111103
    112104Save changes after modifying !AdditionalProperties for the above changes to persist between MATLAB sessions.
     
    137129}}}
    138130{{{
    139 >> % Open a pool of 64 workers on the cluster
    140 >> p = c.parpool(64);
     131>> % Open a pool of 4 workers on the cluster (works up to 12?)
     132>> p = c.parpool(4);
    141133}}}
    142134Rather than running on the local machine, the pool can now run across multiple nodes on the cluster.
     
    144136>> % Run a parfor over 1000 iterations
    145137>> parfor idx = 1:1000
    146       a(idx) =
     138      a(idx) = idx;
    147139   end
    148140}}}
     
    154146
    155147== INDEPENDENT BATCH JOB ==
    156 Use the batch command to submit asynchronous jobs to the cluster.  The batch command will return a job object which is used to access the output of the submitted job.  See the MATLAB documentation for more help on batch.
    157 {{{
    158 >> % Get a handle to the cluster
    159 >> c = parcluster;
    160 }}}
    161 {{{
    162 >> % Submit job to query where MATLAB is running on the cluster
    163 >> j = c.batch(@pwd, 1, {}, ...
    164        'CurrentFolder','.', 'AutoAddClientPath',false);
    165 }}}
    166 {{{
    167 >> % Query job for state
    168 >> j.State
    169 }}}
    170 {{{
    171 >> % If state is finished, fetch the results
    172 >> j.fetchOutputs{:}
    173 }}}
    174 {{{
    175 >> % Delete the job after results are no longer needed
    176 >> j.delete
    177 }}}
    178 To retrieve a list of currently running or completed jobs, call parcluster to retrieve the cluster object.  The cluster object stores an array of jobs that were run, are running, or are queued to run.  This allows us to fetch the results of completed jobs.  Retrieve and view the list of jobs as shown below.
    179 {{{
    180 >> c = parcluster;
    181 >> jobs = c.Jobs;
    182 }}}
    183 Once we’ve identified the job we want, we can retrieve the results as we’ve done previously.
    184 fetchOutputs is used to retrieve function output arguments; if calling batch with a script, use load instead.   Data that has been written to files on the cluster needs to be retrieved directly from the file system (e.g. via ftp).
    185 To view results of a previously completed job:
    186 {{{
    187 >> % Get a handle to the job with ID 2
    188 >> j2 = c.Jobs(2);
    189 }}}
    190 NOTE: You can view a list of your jobs, as well as their IDs, using the above c.Jobs command. 
    191 {{{
    192 >> % Fetch results for job with ID 2
    193 >> j2.fetchOutputs{:}
    194 }}}
    195 
    196 == PARALLEL BATCH JOB ==
    197 Users can also submit parallel workflows with the batch command.  Let’s use the following example for a parallel job, which is saved as {{{parallel_example.m}}}.   
    198 
    199 {{{
    200 function t = parallel_example(iter)
    201 
    202 if nargin==0, iter = 8; end
    203 
     148Use the {{{batch}}} command to submit asynchronous jobs to the cluster.  The batch command will return a job object which is used to access the output of the submitted job.  See the MATLAB documentation for more help on [https://www.mathworks.com/help/parallel-computing/batch.html batch].
     149
     150=== Running Serial Job ===
     151Let’s use the following example for a serial job, which is saved as {{{serial_example.m}}}.   
     152{{{
     153% Serial Example
    204154disp('Start sim')
    205155
    206156t0 = tic;
    207 parfor idx = 1:iter
     157for idx = 1:8
    208158     A(idx) = idx;
    209159     pause(2)
     
    214164}}}
    215165
    216 This time when we use the batch command, to run a parallel job, we’ll also specify a MATLAB Pool.   
     166
     167{{{
     168>> % Get a handle to the cluster
     169>> c = parcluster;
     170}}}
     171
     172{{{
     173>> % Below, submit a batch job that calls the 'serial_example.m' script.
     174>> % Also set the parameter AutoAddClientPath to false so that MATLAB won't complain when paths on
     175>> % your desktop don't exist on the cluster compute nodes (this is expected and can be ignored).
     176
     177>> myjob = batch(c,'serial_example','AutoAddClientPath',false)
     178}}}
     179
     180{{{
     181>> % Wait for the job to finish. 
     182>> wait(myjob)
     183}}}
     184
     185{{{
     186>> % Display the job diary (the MATLAB standard output text, if any)
     187>> diary(myjob)
     188--- Start Diary ---
     189Start sim
     190Sim Completed
     191
     192--- End Diary ---
     193}}}
     194
     195{{{
     196>> % load the 'A' array (computed in serial_example) from the results of job 'myjob':
     197>> load(myjob,'A');
     198>> A
     199
     200A =
     201
     202     1     2     3     4     5     6     7     8
     203}}}
     204
     205
     206{{{
     207>> % Query job for state
     208>> myjob.State
     209}}}
     210
     211{{{
     212>> % If state is finished, fetch the results
     213>> myjob.fetchOutputs{:}
     214ans =
     215
     216  struct with fields:
     217
     218      A: [1 2 3 4 5 6 7 8]
     219    ans: [1×1 struct]
     220    idx: 8
     221      t: 16.0127
     222     t0: 1626209674755060
     223}}}
     224{{{
     225>> % Delete the job after results are no longer needed
     226>> myjob.delete
     227}}}
     228To retrieve a list of currently running or completed jobs, call parcluster to retrieve the cluster object.  The cluster object stores an array of jobs that were run, are running, or are queued to run.  This allows us to fetch the results of completed jobs.  Retrieve and view the list of jobs as shown below.
     229
     230{{{
     231>> c = parcluster;
     232>> jobs = c.Jobs;
     233}}}
     234
     235Once we’ve identified the job we want, we can retrieve the results as we’ve done previously.
     236fetchOutputs is used to retrieve function output arguments; if calling batch with a script, use load instead.   Data that has been written to files on the cluster needs to be retrieved directly from the file system (e.g. via ftp).
     237To view results of a previously completed job:
     238
     239{{{
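     240>> % Get a handle to the second job in the list
     241>> j2 = c.Jobs(2);
     242}}}
     243
     244NOTE: c.Jobs is indexed by position, which can differ from the job ID once earlier jobs have been deleted. Use the above c.Jobs command to view the list of your jobs and their IDs, or look a job up by ID with c.findJob('ID', 2).
     245
     246{{{
     247>> % Fetch results for that job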
     248>> j2.fetchOutputs{:}
     249}}}
     250
     251=== Running Parallel Job ===
     252Users can also submit parallel workflows with the batch command.  Let’s use the following example for a parallel job, saved as {{{parallel_example.m}}}, which uses the '''parfor''' statement to parallelize the '''for''' loop.
     253
     254{{{
     255disp('Start sim')
     256
     257t0 = tic;
     258parfor idx = 1:8
     259     A(idx) = idx;
     260     pause(2)
     261end
     262t = toc(t0);
     263
     264disp('Sim Completed')
     265}}}
     266
     267In the next example, we will run a parallel job using 8 processors on a single node.
     268This time, when we use the batch command to run a parallel job, we’ll also specify a MATLAB Pool.
     269 
    217270{{{
    218271>> % Get a handle to the cluster
    219272>> c = parcluster;
    220273}}}
    221 {{{
    222 >> % Submit a batch pool job using 4 workers for 16 simulations
    223 >> j = c.batch(@parallel_example, 1, {16}, 'Pool',4, ...
    224        'CurrentFolder','.', 'AutoAddClientPath',false);
    225 }}}
     274
     275{{{
     276>> % Submit a batch pool job using 8 workers for 8 iterations
     277>> myjob = batch(c,'parallel_example','pool', 8, 'AutoAddClientPath',false)
     278}}}
     279
    226280{{{
    227281>> % View current job status
    228 >> j.State
    229 }}}
     282>> myjob.State
     283}}}
     284
    230285{{{
    231286>> % Fetch the results after a finished state is retrieved
    232 >> j.fetchOutputs{:}
    233 ans =
    234         8.8872
    235 }}}
    236 The job ran in 8.89 seconds using four workers.  Note that these jobs will always request N+1 CPU cores, since one worker is required to manage the batch job and pool of workers.   For example, a job that needs eight workers will consume nine CPU cores.   
    237 We’ll run the same simulation but increase the Pool size.  This time, to retrieve the results later, we’ll keep track of the job ID.
    238 NOTE: For some applications, there will be a diminishing return when allocating too many workers, as the overhead may exceed computation time.
    239 {{{   
    240 >> % Get a handle to the cluster
    241 >> c = parcluster;
    242 }}}
    243 {{{
    244 >> % Submit a batch pool job using 8 workers for 16 simulations
    245 >> j = c.batch(@parallel_example, 1, {16}, 'Pool', 8, ...
    246        'CurrentFolder','.', 'AutoAddClientPath',false);
    247 }}}
     287>> myjob.fetchOutputs{:}
     288ans =
     289
     290  struct with fields:
     291
     292     A: [1 2 3 4 5 6 7 8]
     293     t: 2.3438
     294    t0: 1626210921701584
     295}}}
     296
     297The job ran in 2.3438 seconds using eight workers.  '''Note that these jobs will always request N+1 CPU cores''', since one worker is required to manage the batch job and pool of workers.   For example, a job that needs eight workers will consume nine CPU cores.         
     298
     299
     300==== Retrieve the Results Later ====
    248301{{{
    249302>> % Get the job ID
     
    252305        4
    253306}}}
    254 {{{
    255 >> % Clear j from workspace (as though we quit MATLAB)
    256 >> clear j
     307
     308{{{
     309>> % Clear myjob from workspace (as though we quit MATLAB)
     310>> clear myjob
    257311}}}
    258312
    259313Once we have a handle to the cluster, we’ll call the findJob method to search for the job with the specified job ID.   
    260 {{{
    261 >> % Get a handle to the cluster
    262 >> c = parcluster;
    263 }}}
     314
     315{{{
     316>> % Get a handle to the cluster
     317>> c = parcluster;
     318}}}
     319
    264320{{{
    265321>> % Find the old job
    266 >> j = c.findJob('ID', 4);
    267 }}}
     322>> myjob = c.findJob('ID', 4);
     323}}}
     324
    268325{{{
    269326>> % Retrieve the state of the job
    270 >> j.State
     327>> myjob.State
    271328ans =
    272329finished
    273330}}}
     331
    274332{{{
    275333>> % Fetch the results
    276 >> j.fetchOutputs{:}
    277 ans =
    278 4.7270
    279 }}}
    280 The job now runs in 4.73 seconds using eight workers.  Run code with different number of workers to determine the ideal number to use.
     334>> myjob.fetchOutputs{:}
     335
     336ans =
     337
     338  struct with fields:
     339
     340     A: [1 2 3 4 5 6 7 8]
     341     t: 2.3438
     342    t0: 1626210921701584
     343}}}
     344
    281345Alternatively, to retrieve job results via a graphical user interface, use the Job Monitor (Parallel > Monitor Jobs).
    282346