Version 29 (modified by cbaribault, 42 hours ago)

Running R on Cypress

About R

"R is ‘GNU S’, a freely available language and environment for statistical computing and graphics which provides a wide variety of statistical and graphical techniques: linear and nonlinear modelling, statistical tests, time series analysis, classification, clustering, etc." (See https://cran.r-project.org/.)

R Modules

You can list the versions of R available on Cypress with the module avail command, as in the following example; appending the "/" character restricts the listing to R modules only.

[tulaneID@cypress1 ~]$module avail R/

------------------------------------------------- /share/apps/modulefiles --------------------------------------------------
R/3.1.2(default) R/3.2.5-intel    R/3.4.1-intel    R/3.6.1-intel    R/4.1.1-intel
R/3.2.4          R/3.3.1-intel    R/3.5.2-intel    R/4.1.0-intel

--------------------------------------------- /share/apps/centos7/modulefiles ----------------------------------------------
R/4.3.2 R/4.4.1

Using Intel's Math Kernel Library (MKL)

Observe in the output above that some R modules have names ending with the string 'intel'. These modules are linked against Intel's Math Kernel Library (MKL), which can perform certain computations using the Xeon Phi coprocessors. See cypress/XeonPhi.

Using CentOS 7 Operating System

Also, the more recent versions of R are available only on compute nodes running the later CentOS 7 version of the operating system, which are grouped in a separate SLURM partition. For more information, see Requesting partition centos7 (batch) and also Requesting partition centos7 (interactive).

Running R Interactively

Start an interactive session using idev

In the following we'll use the latest version of R available, which runs only on compute nodes running the CentOS 7 operating system.

For Workshop

If your account is in the group workshop, then request only 2 CPUs per node, so that the few available nodes in the workshop7 partition can be shared among many users:

[tulaneID@cypress1 ~]$export MY_PARTITION=workshop7
[tulaneID@cypress1 ~]$export MY_QUEUE=workshop
[tulaneID@cypress1 ~]$idev -c 2
Requesting 1 node(s)  task(s) to workshop queue of workshop7 partition
1 task(s)/node, 2 cpu(s)/task, 0 MIC device(s)/node
Time: 0 (hr) 60 (min).
0d 0h 60m
Submitted batch job 2706829
JOBID=2706829 begin on cypress01-009
--> Creating interactive terminal session (login) on node cypress01-009.
--> You have 0 (hr) 60 (min).
--> Assigned Host List : /tmp/idev_nodes_file_tulaneID
Last login: Tue Aug 19 10:30:58 2025 from cypress2.cm.cluster
[tulaneID@cypress01-009 ~ at 12:12:10]$

Non-workshop

[tulaneID@cypress1 ~]$ export MY_PARTITION=centos7
[tulaneID@cypress1 ~]$ idev 
Requesting 1 node(s)  task(s) to workshop queue of workshop partition
1 task(s)/node, 20 cpu(s)/task, 2 MIC device(s)/node
Time: 0 (hr) 60 (min).
Submitted batch job 1164332
JOBID=1164332 begin on cypress01-121
--> Creating interactive terminal session (login) on node cypress01-121.
--> You have 0 (hr) 60 (min).
--> Assigned Host List : /tmp/idev_nodes_file_tulaneID
Last login: Wed Aug 21 15:56:37 2019 from cypress1.cm.cluster
[tulaneID@cypress01-121 ~]$ 

Load the R module

[tulaneID@cypress01-121 ~]$ module load R/4.4.1
[tulaneID@cypress01-121 ~]$ module list
Currently Loaded Modulefiles:
  1) slurm/14.03.0           6) mpc/1.2.1              11) pcre2/10.38            16) libtiff/4.6.0
  2) idev                    7) gcc/9.5.0              12) tcl/8.6.11             17) tre/0.8.0
  3) bbcp/amd64_rhel60       8) zlib/1.2.8             13) tk/8.6.11              18) binutils/2.37
  4) gmp/6.2.1               9) bzip2/1.0.6            14) libpng/1.6.37          19) java-openjdk/17.0.7+7
  5) mpfr/4.1.0             10) xz/5.2.2               15) libjpeg-turbo/3.0.1    20) R/4.4.1

Run R in the command line window

[tulaneID@cypress01-121 ~]$R

R version 4.4.1 (2024-06-14) -- "Race for Your Life"
Copyright (C) 2024 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>

R Package Dependency

To run any of the SLURM scripts below, we would submit the script via the sbatch command, but in general we should not expect the code to run successfully the very first time without additional setup. If we try running a script without that setup, we can expect an error such as the one in the following R session.

> library(doParallel)
Error in library(doParallel) : there is no package called ‘doParallel’

To resolve the above error, we must first ensure that the required R package, in this case doParallel, is installed and available in your environment. For a range of options for installing R packages, depending on the desired level of reproducibility, see the section Installing R Packages on Cypress.

For Workshop : If your account is in the group workshop, use Alternative 1, responding to the R prompts as needed, to install R packages as in the following.

Thus we need to install the R package doParallel.

> install.packages("doParallel")
...
(respond to prompts as needed)
...
> library(doParallel)
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
> q()
Save workspace image? [y/n/c]: n
[tulaneID@cypress01-121 ~]$exit
[tulaneID@cypress1 ~]$

Now that we have resolved our package dependency, we can expect future jobs requiring doParallel to run without errors.

Also, notice above that we exited the interactive session, since it is no longer needed in order to submit batch jobs.
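
The install-if-missing pattern above can also be scripted, so that a package is installed only when it is actually absent. The following is a minimal sketch; the CRAN mirror URL is an assumption and should be adjusted to your preferred mirror.

```r
# Sketch: load doParallel, installing it first only if it is missing.
# The repos URL is an assumption; adjust to your preferred CRAN mirror.
if (!requireNamespace("doParallel", quietly = TRUE)) {
  install.packages("doParallel", repos = "https://cloud.r-project.org")
}
library(doParallel)
```

Running this non-interactively avoids the install prompts entirely, which is convenient when preparing an environment for batch jobs.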

Download Sample Scripts

If you have not already done so, download the sample files:

[tulaneID@cypress1 ~]$ git clone https://hidekiCCS:@bitbucket.org/hidekiCCS/hpc-workshop.git

Then use the cp command to copy the batch scripts and R scripts to your current directory.

cp hpc-workshop/R/* .

Running an R Script in Batch Mode

Besides running R interactively in an idev session, you can also submit your R job to the batch nodes (compute nodes) on Cypress. Inside your SLURM script, include a command to load the desired R module. Then invoke the Rscript command on your R script.

#!/bin/bash
#SBATCH --partition=centos7     # Partition
#SBATCH --qos=normal            # Quality of Service
#SBATCH --job-name=R            # Job Name
#SBATCH --time=00:01:00         # WallTime
#SBATCH --nodes=1               # Number of Nodes
#SBATCH --ntasks-per-node=1     # Number of tasks (MPI processes)
#SBATCH --cpus-per-task=1       # Number of threads per task (OMP threads)

module load R/4.4.1
Rscript myRscript.R
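
The script above assumes a file myRscript.R in the submission directory; its contents are not part of this page, so the following is a hypothetical stand-in for illustration.

```r
# myRscript.R -- a hypothetical example script (the file name comes from
# the SLURM script above; these contents are an illustration only)
x <- rnorm(1000)              # draw 1000 standard-normal samples
cat("mean:", mean(x), "\n")   # print the sample mean
cat("sd:  ", sd(x), "\n")     # print the sample standard deviation
```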

For Workshop : If your account is in the group workshop, modify the SLURM script like:

#!/bin/bash
#SBATCH --partition=workshop7   # Partition
#SBATCH --qos=workshop          # Quality of Service
##SBATCH --qos=normal           ### Quality of Service (like a queue in PBS)
#SBATCH --job-name=R            # Job Name
#SBATCH --time=00:01:00         # WallTime
#SBATCH --nodes=1               # Number of Nodes
#SBATCH --ntasks-per-node=1     # Number of tasks (MPI processes)
#SBATCH --cpus-per-task=1       # Number of threads per task (OMP threads)

module load R/4.4.1
Rscript myRscript.R

Running a Parallel R Job

Starting with version 2.14.0, R has offered direct support for parallel computation through the "parallel" package. We will present two examples of running a parallel job in batch mode. They differ in how they communicate the number of cores reserved by SLURM to R. Both are based on code found in "Getting Started with doParallel and foreach" by Steve Weston and Rich Calaway and modified by the University of Chicago Research Computing Center.
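
As a minimal illustration of the base "parallel" package itself, separate from the doParallel examples below (the worker count here is arbitrary):

```r
library(parallel)

# Square the integers 1..8 across two forked worker processes.
# mclapply() uses fork-based parallelism, available on Linux systems
# such as the Cypress compute nodes.
res <- mclapply(1:8, function(i) i^2, mc.cores = 2)
unlist(res)   # a numeric vector of the squares 1, 4, 9, ..., 64
```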

Passing (SLURM) Environment Variables

In the first example, we will use the built-in R function Sys.getenv() to read the SLURM environment variable from the operating system.

Let's look at the downloaded sample file bootstrap.R containing the following code.

#Based on code from the UCRCC website

library(doParallel)

# use the environment variable SLURM_CPUS_PER_TASK to set the number of cores
# (Sys.getenv() returns a character string, so convert it with as.integer())
registerDoParallel(cores=as.integer(Sys.getenv("SLURM_CPUS_PER_TASK")))

# Bootstrapping iteration example
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
iterations <- 10000  # Number of iterations to run

# Parallel version of code 
# Note the '%dopar%' instruction
part <- system.time({
  r <- foreach(icount(iterations), .combine=cbind) %dopar% {
    ind <- sample(100, 100, replace=TRUE)
    result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
    coefficients(result1)
  }
})[3]

# Shows the number of Parallel Workers to be used
getDoParWorkers()
# Executes the functions
part

The above script will obtain the number of CPUs per task via the SLURM environment variable SLURM_CPUS_PER_TASK set in our SLURM script and will pass that value to the registerDoParallel() function. To implement this we need only set the correct parameters in our SLURM script. Suppose we wanted to use 16 cores. Then the correct SLURM script would be as follows.

#!/bin/bash
#SBATCH --partition=centos7     # Partition
#SBATCH --qos=normal            # Quality of Service
#SBATCH --job-name=R            # Job Name
#SBATCH --time=00:01:00         # WallTime
#SBATCH --nodes=1               # Number of Nodes
#SBATCH --ntasks-per-node=1     # Number of Tasks per Node
#SBATCH --cpus-per-task=16      # Number of threads per task (OMP threads)

module load R/4.4.1

Rscript bootstrap.R
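
Since Sys.getenv() always returns a character string, and an empty string when the variable is unset, a slightly more defensive way to read the core count is sketched below; the fallback default of 1 core is an assumption for illustration.

```r
# Read the core count from SLURM, falling back to 1 when the variable
# is unset (e.g. when testing the script outside a SLURM job); the
# default value is an assumption.
ncores <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "1"))
ncores
```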

For Workshop : If your account is in the group workshop, modify the SLURM script like:

#!/bin/bash
#SBATCH --partition=workshop7   # Partition
#SBATCH --qos=workshop          # Quality of Service
##SBATCH --qos=normal          ### Quality of Service (like a queue in PBS)
#SBATCH --job-name=R            # Job Name
#SBATCH --time=00:01:00         # WallTime
#SBATCH --nodes=1               # Number of Nodes
#SBATCH --ntasks-per-node=1    # Number of Tasks per Node
#SBATCH --cpus-per-task=16      # Number of threads per task (OMP threads)

module load R/4.4.1

Rscript bootstrap.R

Modify the downloaded sample file bootstrap.sh to contain the above SLURM script code; then submit it as shown below.

Also, note that since we did not specify an output file in the SLURM script, the output will be written to slurm-<JobNumber>.out. For example:

[tulaneID@cypress2 ~]$ sbatch bootstrap.sh
Submitted batch job 774081
[tulaneID@cypress2 ~]$ cat slurm-774081.out
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
[1] "16"
elapsed
  2.954
[tulaneID@cypress2 ~]$
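
If you prefer a predictable output file name, SLURM's --output directive can be added to the script header. This is a config sketch; the file name pattern here is an assumption.

```shell
#SBATCH --output=bootstrap-%j.out   # %j expands to the SLURM job ID
```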

Passing Parameters

The disadvantage of the above approach is that it is system specific. If we move our code to a machine that uses PBS-Torque as its resource manager (LONI QB2, for example), we have to change our source code. An alternative method that yields a more portable code base is to use command-line arguments to pass the values of our environment variables into the script.

Let's look at the downloaded sample file bootstrapWargs.R containing the following code.

#Based on code from the UCRCC website

library(doParallel)
# Enable command line arguments
args <- commandArgs(TRUE)

# use the first command line argument to set the number of cores
registerDoParallel(cores=(as.integer(args[1])))

# Bootstrapping iteration example
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
iterations <- 10000  # Number of iterations to run

# Parallel version of code 
# Note the '%dopar%' instruction
part <- system.time({
  r <- foreach(icount(iterations), .combine=cbind) %dopar% {
    ind <- sample(100, 100, replace=TRUE)
    result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
    coefficients(result1)
  }
})[3]

# Shows the number of Parallel Workers to be used
getDoParWorkers()
# Executes the functions
part

Note the use of args <- commandArgs(TRUE) and of as.integer(args[1]). This allows us to pass in a value from the command line when we call the script, and the number of cores will be set to that value. Using the same basic submission script as last time, we need only pass the value of the correct SLURM environment variable to the script at runtime.

#!/bin/bash
#SBATCH --partition=centos7     # Partition
#SBATCH --qos=normal            # Quality of Service
#SBATCH --job-name=R       # Job Name
#SBATCH --time=00:01:00         # WallTime
#SBATCH --nodes=1               # Number of Nodes
#SBATCH --ntasks-per-node=1    # Number of Tasks per Node
#SBATCH --cpus-per-task=16      # Number of threads per task (OMP threads)

module load R/4.4.1

Rscript bootstrapWargs.R $SLURM_CPUS_PER_TASK 
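
Because args[1] also arrives as a character string and may be absent if the script is run without arguments, a slightly more defensive parsing sketch is shown below; the fallback of 1 core is an assumption for illustration.

```r
# Parse the first command-line argument as the core count, defaulting
# to 1 when no argument is supplied (the default is an assumption).
args <- commandArgs(TRUE)
ncores <- if (length(args) >= 1) as.integer(args[1]) else 1L
stopifnot(!is.na(ncores), ncores >= 1)
```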

For Workshop : If your account is in the group workshop, modify the SLURM script like:

#!/bin/bash
#SBATCH --partition=workshop7   # Partition
#SBATCH --qos=workshop          # Quality of Service
##SBATCH --qos=normal          ### Quality of Service (like a queue in PBS)
#SBATCH --job-name=R       # Job Name
#SBATCH --time=00:01:00         # WallTime
#SBATCH --nodes=1               # Number of Nodes
#SBATCH --ntasks-per-node=1    # Number of Tasks per Node
#SBATCH --cpus-per-task=16      # Number of threads per task (OMP threads)

module load R/4.4.1

Rscript bootstrapWargs.R $SLURM_CPUS_PER_TASK 

Modify the downloaded sample file bootstrapWargs.sh to contain the above SLURM script code.

Now submit as in the following.

[tulaneID@cypress1 ~]$ sbatch bootstrapWargs.sh 
Submitted batch job 52481
[tulaneID@cypress1 ~]$ cat slurm-52481.out 
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
[1] "16"
elapsed 
  3.282 
[tulaneID@cypress1 ~]$ 
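
As a portability illustration: on a PBS-Torque system such as LONI QB2, the same R script could be reused by passing the analogous PBS variable instead. The variable name below is an assumption and should be checked against your site's documentation.

```shell
# Hypothetical PBS-Torque invocation of the same script; PBS_NUM_PPN
# (Torque's per-node process count) is an assumption to verify locally.
Rscript bootstrapWargs.R $PBS_NUM_PPN
```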

Installing R Packages on Cypress

See here.

Next Section: Running Python on Cypress
