# The R Statistical Computing Environment

## What is R?

R is an open-source programming language designed for statistical computing and graphics. It is a de facto standard for developing new statistical methodology and has a large, active user base.

For more information, consult the R Project website.

Once on a compute node, R can be loaded using the appropriate module (link). The standard version of R on Peregrine is built with the GNU toolchain. To avoid possible conflicts, remove any Intel compiler modules before loading R. One way to do this is:

```
$ module purge
$ module load R
```

To access the R interactive console, type R at the command line. The familiar R console will start in your terminal window.

```
$ R

R version 3.1.0 (2014-04-10) -- "Spring Dance"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-unknown-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

R will attempt to install into ~/R/library by default (if it exists).
If it doesn't exist, please make this directory and restart R. Send
an email to Ryan Elmore if you have questions. Use R (or Rscript)
--no-site-file to suppress this message.

>
```

### Running R Scripts

Since running R programs line by line in the interactive console can be tedious, it is often better to combine R commands into a single script and have R execute them all at once. R scripts are plain-text files of R commands, conventionally given the file extension .R.

```
[user@foo cur.dir]$ cat hello_world.R
# hello_world.R
message = "Hi there!"
nums = sample(1:100, 5)
cat(message, "\n")
cat("Here are some random numbers: ", paste(nums, sep = ", "), "\n")
```

There are several options for running R scripts:

**source()**

The source() function will execute R scripts from inside the interactive console.

```
> source("hello_world.R")
Hi there!
Here are some random numbers:  100 41 14 82 63
```

**Rscript**

The `Rscript` command runs R scripts from the command line; output is written to stdout.

```
[user@foo cur.dir]$ Rscript hello_world.R

R will attempt to install into ~/R/library by default (if it exists).
If it doesn't exist, please make this directory and restart R. Send
an email to Ryan Elmore if you have questions. Use R (or Rscript)
--no-site-file to suppress this message.

Hi there!
Here are some random numbers:  71 37 50 24 90
```

**R CMD BATCH**

R CMD BATCH is an older interface that behaves similarly to Rscript. All output is written to a corresponding .Rout file.

```
[user@foo cur.dir]$ R CMD BATCH hello_world.R --no-site-file
[user@foo cur.dir]$ cat hello_world.Rout
> # hello_world.R
>
> message = "Hi there!"
> nums = sample(1:100, 5)
> cat(message, "\n")
Hi there!
> cat("Here are some random numbers: ", paste(nums, sep = ", "), "\n")
Here are some random numbers:  41 51 61 70 43
>
> proc.time()
   user  system elapsed
  0.188   0.024   0.277
```

## Submitting Jobs to Peregrine

Another option for using R on Peregrine is to submit R scripts as part of batch job scripts to be run on compute nodes (link).

An example job script for running the hello_world.R example is below.

```bash
#!/bin/bash
#PBS -l walltime=3600    # WALLTIME LIMIT
#PBS -N helloworld       # Name of job

cd $PBS_O_WORKDIR
module purge
module load R
Rscript hello_world.R
```

## Versions and Packages

R is a popular open-source language with an active development community, and new versions are released frequently. The most current version of R on Peregrine is 3.2.2, which can be loaded using the R/3.2.2 or R/3.2.2-base modules. These modules are located in the candidate directory; the following commands make them available:

```
[user@foo cur.dir]$ module use /nopt/nrel/apps/modules/candidate/modulefiles/R
[user@foo cur.dir]$ module avail R

---------------- /nopt/nrel/apps/modules/candidate/modulefiles/ ----------------
R/3.2.1       R/3.2.2       R/3.2.2-base

----------------- /nopt/nrel/apps/modules/default/modulefiles ------------------
R/2.15.3           R/3.1.0(default)
```

The real strength of R is the ability to install and load packages addressing a vast assortment of problems. The R/3.2.2 module contains all packages in the base distribution along with many additional common packages (listed below). The R/3.2.2-base module contains only the packages from the base R installation.

```
  [1] "assertthat"    "BH"            "brew"          "chron"
  [5] "colorspace"    "crayon"        "curl"          "data.table"
  [9] "DBI"           "devtools"      "dichromat"     "digest"
 [13] "dplyr"         "evaluate"      "foreach"       "formatR"
 [17] "ggplot2"       "git2r"         "gridExtra"     "gtable"
 [21] "highr"         "htmltools"     "httpuv"        "httr"
 [25] "iterators"     "jsonlite"      "KernSmooth"    "knitr"
 [29] "labeling"      "lazyeval"      "lme4"          "magrittr"
 [33] "markdown"      "MASS"          "memoise"       "mime"
 [37] "minqa"         "munsell"       "mvtnorm"       "nloptr"
 [41] "pbdBASE"       "pbdDEMO"       "pbdDMAT"       "pbdMPI"
 [45] "pbdPROF"       "pbdSLAP"       "plyr"          "praise"
 [49] "proto"         "R6"            "RColorBrewer"  "Rcpp"
 [53] "RcppEigen"     "reshape2"      "rjson"         "rlecuyer"
 [57] "Rmpi"          "roxygen2"      "rstudioapi"    "rversions"
 [61] "scales"        "shiny"         "snow"          "stringi"
 [65] "stringr"       "testthat"      "tidyr"         "whisker"
 [69] "xml2"          "xtable"        "yaml"          "zoo"
 [73] "base"          "BiocInstaller" "boot"          "class"
 [77] "cluster"       "codetools"     "compiler"      "datasets"
 [81] "foreign"       "graphics"      "grDevices"     "grid"
 [85] "KernSmooth"    "lattice"       "MASS"          "Matrix"
 [89] "methods"       "mgcv"          "nlme"          "nnet"
 [93] "parallel"      "rhdf5"         "rpart"         "spatial"
 [97] "splines"       "stats"         "stats4"        "survival"
[101] "tcltk"         "tools"         "utils"         "zlibbioc"
```

**Installing new packages**

The `install.packages()` command in R downloads packages from a CRAN mirror and installs them into your personal library. They will not be visible to other users.

**Checking installed packages**

The `installed.packages()` command in R lists details about all packages visible to the current R session.

**Loading packages**

Packages are loaded into the current R environment with the `library()` function.
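The three commands above can be combined into a short session. This is only a sketch: the package name is an example, and `install.packages()` requires a CRAN mirror that is reachable from the node you are on.

```r
# Install a package from CRAN into your personal library (~/R/library).
# "ggplot2" is just an example package name; pick the package you need.
install.packages("ggplot2", repos = "https://cran.r-project.org")

# installed.packages() returns a matrix with one row per visible package
pkgs <- installed.packages()
head(pkgs[, c("Package", "Version")])

# Load an installed package into the current session
library(ggplot2)
```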

## Graphics

R is commonly used to produce high-quality graphics based on data. This capability is built in and can be extended through packages such as ggplot2. To produce graphics on Peregrine, the easiest method is to write graphical output to a file in an appropriate format (PDF, PNG, JPEG, etc.). The file can then be moved to your local machine with command-line tools such as scp or rsync.

```r
# Example R script for graphics output
library(ggplot2)

set.seed(8675309)
numbers = rnorm(200, sd = 2)
more.numbers = rnorm(100, mean = 10, sd = 2)
df = data.frame(values = c(numbers, more.numbers))

p = ggplot(df, aes(x = values, y = ..density..)) +
    geom_histogram(fill = "dodgerblue",
                   colour = "black",
                   alpha = .5,
                   binwidth = .5) +
    geom_density(size = 1.5) +
    labs(y = "Density", x = "Value",
         title = "Histogram Example")

png(file = "histogram_example.png")
print(p)
dev.off()
```
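Once the script has produced histogram_example.png, the file can be pulled back with scp. The username, hostname, and remote path below are illustrative placeholders, not actual Peregrine values:

```shell
# Run this from your LOCAL machine, not from Peregrine.
# username, hostname, and remote path are placeholders.
scp username@peregrine.nrel.gov:/home/username/histogram_example.png .
```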

## Parallel Programming in R

Programming in R on Peregrine has two distinct advantages. First, running jobs on a remote system means you do not have to tie up your local machine, which is particularly useful for jobs that take considerable time and resources. Second, Peregrine's computational capabilities provide an opportunity to improve performance through parallel processing. Like code in many programming languages, R code is typically written and executed serially, so the benefit of having multiple processing cores available is usually lost.

A major goal of the R community in recent years has been the development of specialized libraries and programming paradigms to better leverage modern HPC systems. The CRAN task view for High Performance Computing and Parallel Programming contains a detailed list of packages that address various aspects of these problems (link). Notable examples are:

- parallel
- foreach
- multicore
- snow
- pbdR
- Rmpi

Each package includes in-depth documentation and examples of how to implement parallel processing in R code. Learning these packages requires a moderate investment of time, but for many large problems the improvement in computational efficiency dramatically outweighs that initial investment.
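As a small illustration of the first item in the list, the parallel package (bundled with R since version 2.14) can spread independent tasks across the cores of a single node with `mclapply()`. The Monte Carlo task here is only a toy example:

```r
library(parallel)

# Toy CPU-bound task: Monte Carlo estimate of pi from n random points
estimate_pi <- function(n) {
  x <- runif(n)
  y <- runif(n)
  4 * mean(x^2 + y^2 <= 1)
}

# Serial: chunks run one after another
serial_est <- lapply(rep(1e5, 8), estimate_pi)

# Parallel: the same chunks forked across 4 cores
# (mclapply() uses forking, so it does not parallelize on Windows)
parallel_est <- mclapply(rep(1e5, 8), estimate_pi, mc.cores = 4)

mean(unlist(parallel_est))  # an estimate close to pi
```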

### Using the pbdR Project

The pbdR project "enables high-level distributed data parallelism in R, so that it can easily utilize large HPC platforms with thousands of cores, making the R language scale to unparalleled heights." There are several packages within this project: pbdMPI for easy MPI work, pbdDMAT for distributed data matrices and associated functions, and pbdDEMO for a tutorial/vignette describing most of the project's details.

The following script is a simple helloworld.R example using the pbdMPI package.

```r
library(pbdMPI, quiet = TRUE)
init()

.comm.size <- comm.size()
.comm.rank <- comm.rank()
.hostname <- Sys.info()["nodename"]

msg <- sprintf("I am %d of %d on %s.\n", .comm.rank, .comm.size, .hostname)

comm.cat(msg, all.rank = TRUE, quiet = TRUE)

# Ranks run from 0 to comm.size - 1, so sample from that range
comm.cat(msg, rank.print = sample(0:(.comm.size - 1), size = 1))
comm.cat(msg, rank.print = sample(0:(.comm.size - 1), size = 1), quiet = TRUE)

finalize()
```

You could run this interactively from a compute node, or submit it to the job scheduler with a shell script similar to the one below. For example, you would submit using qsub helloworld.sh from a login node, provided you name the script accordingly.

```bash
#!/bin/bash
#PBS -A XXXXXX
#PBS -l walltime=00:00:05:00   # WALLTIME limit
#PBS -l nodes=2:ppn=24         # Number of nodes, use 24 processes on each
                               # Specify :ppn=x in previous line if you want
                               # to use "x" processors on each node
                               # if there are core/memory concerns
#PBS -M first.last@nrel.gov
#PBS -m be                     # (b) begin, (e) end, (a) abort

cd $PBS_O_WORKDIR              # Run the job in the current working directory
set -x

module purge
module load openmpi-gcc
module load R                  # Load the module for the current R version

INPUT_BASENAME=helloworld      # JOB NAME - USER INPUT PARAMETER
JOB_FILE=$INPUT_BASENAME.R
OUT_FILE=$INPUT_BASENAME.Rout

mpirun -np 48 Rscript --no-site-file $JOB_FILE > $OUT_FILE
```

In either case (interactive or queue submission), the output produced from the simple helloworld.R script should look like this.

```
I am 0 of 48 on n1094.
I am 1 of 48 on n1094.
I am 2 of 48 on n1094.
I am 3 of 48 on n1094.
I am 4 of 48 on n1094.
I am 5 of 48 on n1094.
I am 6 of 48 on n1094.
I am 7 of 48 on n1094.
I am 8 of 48 on n1094.
I am 9 of 48 on n1094.
...
```

A more complicated example (dd-regression.R) computes the coefficients of a linear regression using distributed matrices. You can vary `p` to increase the dimension of the design matrix.

```r
library(pbdDMAT, quiet = TRUE)
init.grid()

### Generate balanced fake data.
comm.set.seed(1234, diff = TRUE)

N <- 100
p <- 2

dx <- ddmatrix(rnorm(N * p), ncol = p)
dbeta <- ddmatrix(1:p, ncol = 1)
depsilon <- ddmatrix(rnorm(N), ncol = 1)
dy <- dx %*% dbeta + depsilon

dols <- solve(t(dx) %*% dx) %*% t(dx) %*% dy
ols <- as.matrix(dols, proc.dest = 0)
comm.cat("Straight matrix multiplication:\n", quiet = TRUE)
comm.print(ols, quiet = TRUE)

## alternatively
dres <- lm.fit(dx, dy)
res <- as.matrix(dres$coef, proc.dest = 0)
comm.cat("\nUsing lm.fit:\n", quiet = TRUE)
comm.print(res, quiet = TRUE)

## Undistribute and compute (on a single core)
x <- as.matrix(dx, proc.dest = 0)
y <- as.matrix(dy, proc.dest = 0)
if (comm.rank() == 0) {
  serial.coef <- lm(y ~ x - 1)$coef
}
comm.cat("\nSerial fit:\n", quiet = TRUE)
comm.print(serial.coef, quiet = TRUE)

finalize()
```

The output of running the dd-regression.R script:

```
Using 8x6 for the default grid size

Straight matrix multiplication:
          [,1]
[1,] 0.9422302
[2,] 2.0726872

Using lm.fit:
          [,1]
[1,] 0.9422302
[2,] 2.0726872

Serial fit:
       x1        x2
0.9422302 2.0726872
```

### A Simple Scaling Study for PCA Using the pbdR Packages

As part of the NREL LDRD project "A Framework for Comparison of Spatiotemporal and Time Series Datasets" (Dan Getman, PI), we investigated several scaling factors involved in running a principal components analysis (PCA) on Peregrine. Specifically, we recorded the execution time of PCA for datasets of varying size (7.6 megabytes [MB], 76 MB, 763 MB, and 7.45 gigabytes [GB]), differing numbers of cores (96, 144, 192, 240, and 288), and numbers of cores per node (16 and 24). The results given in the figures below are based on the average timing over ten independent trials.

Be careful when requesting the number of compute nodes for a given job; more nodes do not always yield better outcomes! For small datasets (< 100 MB), requesting fewer nodes yields better results. However, a substantial speedup is attained when using 288 cores on the 7+ GB datasets relative to just using 96 cores.

The skeleton R script for performing this PCA experiment is given below.

```r
library(pbdDMAT, quiet = TRUE)
init.grid()

## Prelims
comm.set.seed(19838, diff = TRUE)
kRow <- 10e3
kCols <- 10^c(2:5)

# Normal parms
kMean <- 0
kSD <- 1

# blocking
kBlock <- 4

# replications
kReps <- 10

for (i in 1:4) {
  comm.cat(sprintf("\n%d rows and %d columns: %s\n", kRow, kCols[i],
                   Sys.time()), quiet = T)

  # benchmark
  datatimes <- system.time(
    dx <- ddmatrix("rnorm",
                   nrow = kRow,
                   ncol = kCols[i],
                   bldim = kBlock,
                   mean = kMean,
                   sd = kSD,
                   ICTXT = 0))[3]
  datatimes <- allreduce(datatimes, op = 'max')

  size <- kCols[i] * kRow * 8 / 1024
  unit <- "kb"
  if (log10(size) > 3) {
    size <- size / 1024
    unit <- "mb"
  }
  if (log10(size) > 3) {
    size <- size / 1024
    unit <- "gb"
  }

  comm.cat(sprintf("\n%.2f %s of data generated in %.3f seconds\n\n",
                   size, unit, datatimes), quiet = T)

  times <- sapply(1:kReps, function(.) system.time(prcomp(dx))[3])
  total <- allreduce(sum(times), op = 'max')
  avg <- total / kReps

  bench <- data.frame(operation = "prcomp(dx)", mean.runtime = avg,
                      total.runtime = total)
  row.names(bench) <- ""
  comm.print(bench, quiet = T)
}

finalize()
```

The shell script used to run the PCA comparison is based on the following. Note that the openmpi-gcc module is necessary for these examples and is loaded in this script.

```bash
#!/bin/bash
#PBS -A XXXXXX
#PBS -l walltime=00:06:00:00   # WALLTIME limit
#PBS -l nodes=6:ppn=24         # Number of nodes, use 24 processes on each
                               # Specify :ppn=x in previous line if you want
                               # to use "x" processors on each node
                               # if there are core/memory concerns
#PBS -M first.last@nrel.gov
#PBS -m be                     # (b) begin, (e) end, (a) abort

cd $PBS_O_WORKDIR

module purge
module load openmpi-gcc
module load R

INPUT_BASENAME=pca             # JOB NAME - USER INPUT PARAMETER
JOB_FILE=$INPUT_BASENAME.R
OUT_FILE=$INPUT_BASENAME-$PBS_NP.Rout

mpirun -np $PBS_NP Rscript $JOB_FILE > $OUT_FILE
```

A number of benchmarking examples ship with the pbdDEMO package; on Peregrine they can be found in /nopt/nrel/apps/R/3.1.0-gcc/lib64/R/library/pbdDEMO/Benchmarks.

### Running R in Parallel with Rmpi

One of several ways to run R in parallel is with the Rmpi package:

- http://cran.r-project.org/web/packages/Rmpi/index.html
- http://www.stats.uwo.ca/faculty/yu/Rmpi/

To test parallel R in the interactive queue, start an interactive session with multiple nodes:

```
$ qsub -A ABC000 -q inter -I -l nodes=2:ppn=8,walltime=30:00
```

Once the resources are allocated, load the R module and start R:

```
$ module load R
$ R -q
```

This starts an interactive R session. Once you are in R, load the Rmpi package and spawn the slaves:

```r
> library("Rmpi")
> mpi.spawn.Rslaves(nslaves=16)
```

Sample output:

```
16 slaves are spawned successfully. 0 failed.
master (rank 0 , comm 1) of size 17 is running on: rr14
slave1 (rank 1 , comm 1) of size 17 is running on: rr14
slave2 (rank 2 , comm 1) of size 17 is running on: rr14
slave3 (rank 3 , comm 1) of size 17 is running on: rr14
... ... ...
slave15 (rank 15, comm 1) of size 17 is running on: rr15
slave16 (rank 16, comm 1) of size 17 is running on: rr15
```

Each slave can then report its rank:

```r
> mpi.remote.exec(paste("I am",mpi.comm.rank(),"of",mpi.comm.size()))
```

Sample output:

```
$slave1
[1] "I am 1 of 17"

$slave2
[1] "I am 2 of 17"

... ... ...

$slave16
[1] "I am 16 of 17"
```

When finished:

```r
> mpi.close.Rslaves()
> mpi.quit()
```

The `mpi.close.Rslaves()` and `mpi.quit()` commands stop the MPI processes and exit R.

### Example: Parallel Hello World Program

To run the script non-interactively:

```
$ R --no-save -q < Hello.R > Hello.out
```

The contents of `Hello.R`:

```r
# Taken from this site
# http://math.acadiau.ca/ACMMaC/Rmpi/sample.html

# Load the R MPI package if it is not already loaded.
if (!is.loaded("mpi_initialize")) {
  library("Rmpi")
}

# Spawn as many slaves as possible
# mpi.spawn.Rslaves()
# Above option does not work for now, so specify the number
# of slaves as below.
mpi.spawn.Rslaves(nslaves=16)

# In case R exits unexpectedly, have it automatically clean up
# resources taken up by Rmpi (slaves, memory, etc...)
.Last <- function(){
  if (is.loaded("mpi_initialize")){
    if (mpi.comm.size(1) > 0){
      print("Please use mpi.close.Rslaves() to close slaves.")
      mpi.close.Rslaves()
    }
    print("Please use mpi.quit() to quit R")
    .Call("mpi_finalize")
  }
}

# Tell all slaves to return a message identifying themselves
mpi.remote.exec(paste("I am",mpi.comm.rank(),"of",mpi.comm.size()))

# Tell all slaves to close down, and exit the program
mpi.close.Rslaves()
mpi.quit()
```

### Parallel R in batch mode

Create a PBS script file, myscript.pbs, as below.

```bash
#!/bin/bash
#PBS -l walltime=00:20:00    # WALLTIME
#PBS -l nodes=2:ppn=16       # NUMBER OF NODES
#PBS -A <allocation>         # ALLOCATION ID
#PBS -N your_simulation      # NAME OF JOB
#PBS -j oe                   # JOIN stdout AND stderr TO SINGLE PIPE
#PBS -q short                # USE short QUEUE

module load R                # Load the module for the current R version

INPUT_BASENAME=Hello         # JOB NAME - USER INPUT PARAMETER
JOB_FILE=$INPUT_BASENAME.R   # INPUT FILE NAME - USER INPUT PARAMETER
OUT_FILE=$INPUT_BASENAME.Rout

# Run Job
echo "------- Running R ------"
date
R --no-save -q < $JOB_FILE > $OUT_FILE
date
echo "------- Finished R job ------"
```

Submit it to the queue:

```
$ qsub myscript.pbs
```

The status of the submitted job can be checked with:

```
$ qstat -u $USER
```

## Additional Information

For general R or statistics questions, contact Bruce.Bugbee@nrel.gov. Additionally, NREL has an internal R Users Group that meets periodically to highlight interesting packages and problems and to share experiences related to R programming. Contact Daniel.Inman@nrel.gov for more details.

## References

- http://cran.r-project.org/web/packages/Rmpi/index.html
- http://www.stats.uwo.ca/faculty/yu/Rmpi/
- http://math.acadiau.ca/ACMMaC/Rmpi/Rmpi.pdf
- M. Schmidberger, M. Morgan, D. Eddelbuettel, H. Yu, L. Tierney, and U. Mansmann, "State of the Art in Parallel Computing with R," *Journal of Statistical Software*, vol. 31, issue 1 (2009). http://www.jstatsoft.org/v31/i01