Version 13 (modified by cmaggio, 6 years ago) (diff)

Programming for the Xeon Phi Coprocessor on Cypress

Workshop Reminder

To take advantage of the workshop QOS:

export MY_PARTITION=workshop
export MY_QUEUE=workshop
idev -c 4 --gres=mic:0

The Xeon Phi coprocessor is an accelerator used to provide many cores to parallel applications. While it has fewer threads available than a typical GPU accelerator, the processors are much "smarter" and the programming paradigm is similar to what you experience coding for CPUs.

Xeon Phi Coprocessor Hardware

Each compute node of Cypress is equipped with two (2) Xeon Phi 7120P coprocessors.

Xeon Phi 7120p coprocessor

The 7120p is equipped with

  • 61 physical x86 cores running at 1.238 GHz
  • Four (4) Hardware threads on each core
  • 16GB GDDR5 memory
  • Uniquely wide SIMD capabilities via 512-bit wide vectors (8 doubles or 16 floats!)
  • Unique IMCI instruction set
  • Connected via PCIe Bus
  • Fully coherent L1 and L2 cache

All this adds up to about 2 TFLOP/s single precision (1 TFLOP/s double precision) of potential computing power.
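The peak figure can be reproduced from the hardware list above as cores x clock x floating-point operations per cycle. The sketch below is an illustration, not vendor code: the function name `peak_gflops` is hypothetical, and it assumes each core completes one fused multiply-add (counted as 2 operations) per vector lane per cycle, which is not stated in the list above.

```c
#include <assert.h>

/* Theoretical peak in GFLOP/s: cores * clock (GHz) * flops per cycle per core. */
double peak_gflops(int cores, double clock_ghz, int vector_bits, int element_bits)
{
    assert(cores > 0 && clock_ghz > 0.0);
    assert(element_bits > 0 && vector_bits % element_bits == 0);

    int lanes = vector_bits / element_bits;  /* elements held by one vector register */
    int flops_per_cycle = 2 * lanes;         /* ASSUMPTION: one FMA (2 flops) per lane per cycle */

    return cores * clock_ghz * flops_per_cycle;
}
```

With the numbers from the list, `peak_gflops(61, 1.238, 512, 64)` gives roughly 1208 GFLOP/s for doubles and `peak_gflops(61, 1.238, 512, 32)` roughly 2417 GFLOP/s for floats, in line with the rounded figures quoted above.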

Each Xeon Phi can be regarded as its own small machine (a cluster, really) running a stripped-down version of Linux. We can ssh onto them, we can run code on them, and we can treat them as another asset to be recruited into our MPI executions.

What Do I Call It?

The 7120p is referred to by many names, all of them correct

  • The Phi
  • The coprocessor
  • The Xeon Phi
  • The MIC (pronounced both Mick, as in Jagger, and Mike), which stands for Many Integrated Core
  • Knights Corner (current gen)
  • Knights Landing (next gen)

You'll typically hear us call the 7120p either the MIC or the Phi. This is to help distinguish it from the Xeon E5 processors which we'll refer to as the host.

Xeon Phi Usage Models

The Intel suite provides compilers and parallel libraries that support three distinct programming models:

  • Automatic Offloading (AO) - the Intel MKL library sends certain calculations to the Phi without any changes to your source code.
  • Native Programming - Code is compiled to run on the Xeon Phi Coprocessor and ONLY on the Xeon Phi Coprocessor.
  • Offloading - Certain Parallel sections of your source code are identified for offloading to the coprocessor. This provides the greatest amount of control and allows for the CPUs and coprocessors to work in tandem.
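To make the third model concrete, here is a minimal sketch of an explicitly offloaded loop using the Intel compiler's `#pragma offload` extension. The function name `scaled_sum` is hypothetical. The pragmas are guarded with `__INTEL_OFFLOAD` and `_OPENMP` (an assumption about how you build it), so that with any other compiler the same code simply runs on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the sum of 2*x[i] over n elements.
   With the Intel compiler the loop is shipped to coprocessor 0;
   otherwise the pragmas are skipped and the loop runs on the host. */
double scaled_sum(double *x, int n)
{
    double sum = 0.0;
    int i;

    assert(x != NULL && n > 0);

    /* in(): copy x to the coprocessor; inout(): copy sum there and back */
#ifdef __INTEL_OFFLOAD
    #pragma offload target(mic:0) in(x : length(n)) inout(sum)
#endif
    {
#ifdef _OPENMP
        #pragma omp parallel for reduction(+:sum)
#endif
        for (i = 0; i < n; i++) {
            sum += 2.0 * x[i];
        }
    }
    return sum;
}
```

The `in` and `inout` clauses spell out exactly which data crosses the PCIe bus, which is the control this model gives you over the other two.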

Automatic Offloading


As we saw yesterday during our Matlab tutorial, any program/code that makes use of the Intel MKL library may take advantage of Automatic Offloading (AO) to the MIC. However, not every MKL routine will automatically offload. The routines that are eligible for AO are:

  • BLAS level-3 subroutines: ?SYMM, ?TRMM, ?TRSM, ?GEMM
  • LAPACK factorization functions: LU (?GETRF), Cholesky ((S/D)POTRF), and QR (?GEQRF)

However, AO will only kick in if MKL deems the problem to be of sufficient size (i.e. the gain from the extra parallelism will outweigh the added overhead). For instance, SGEMM will use AO only if the matrix size exceeds 2048x2048. For more information on which routines are eligible for AO, see the white paper "MKL Automatic Offload enabled functions for Intel Xeon Phi coprocessors".
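One way to see why size matters is to compare the arithmetic in a matrix multiply with the data that has to be moved. The sketch below is a back-of-the-envelope model, not MKL's actual decision rule: the function name is hypothetical, and it assumes the three N x N matrices each cross the bus once.

```c
#include <assert.h>

/* Floating-point operations per byte moved for an N x N single-precision GEMM. */
double sgemm_flops_per_byte(long n)
{
    assert(n > 0);

    double flops = 2.0 * n * n * n;              /* N^3 multiplies plus N^3 adds */
    double bytes = 3.0 * n * n * sizeof(float);  /* ASSUMPTION: A, B and C each moved once */

    return flops / bytes;
}
```

The ratio works out to N/6, so it grows linearly with the matrix dimension: larger problems do more work per byte transferred, which is consistent with AO being reserved for large matrices.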

Enabling Offloading

To enable AO on Cypress you must

  • Load the Intel Parallel Studio XE module
  • Turn on MKL AO by setting the environment variable MKL_MIC_ENABLE to 1 (0 or nothing will turn off MKL AO)
  • (OPTIONAL) Turn on offload reporting to track your use of the MIC by setting OFFLOAD_REPORT to either 1 or 2. Setting OFFLOAD_REPORT to 2 adds more detail than 1 and will give you information on data transfers.
    [tulaneID@cypress1]$ module load intel-psxe
    [tulaneID@cypress1]$ export MKL_MIC_ENABLE=1
    [tulaneID@cypress1]$ export OFFLOAD_REPORT=2

Example using SGEMM

Let's do a small example using SGEMM to test the behavior of MKL AO.

[tuhpc002@cypress01-089 Day2]$ cat sgemm_example.c 
/* System headers */
#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>
#include <stdint.h>
#include <sys/time.h> /* gettimeofday, struct timeval */

#include "mkl.h"

// dtime
// returns the current wall clock time in seconds
double dtime()
{
    double tseconds = 0.0;
    struct timeval mytime;
    gettimeofday(&mytime,(struct timezone*)0);
    tseconds = (double)(mytime.tv_sec + mytime.tv_usec*1.0e-6);
    return( tseconds );
}

int main(int argc, char **argv)
{
        float *A, *B, *C; /* Matrices */
        double workdivision;
        double tstart, tstop, ttime;

        MKL_INT N = 2560; /* Matrix dimensions */
        MKL_INT LD = N; /* Leading dimension */
        int matrix_bytes; /* Matrix size in bytes */
        int matrix_elements; /* Matrix size in elements */

        float alpha = 1.0, beta = 1.0; /* Scaling factors */
        char transa = 'N', transb = 'N'; /* Transposition options */

        int i, j; /* Counters */

        matrix_elements = N * N;
        matrix_bytes = sizeof(float) * matrix_elements;

        /* Allocate the matrices */
        A = malloc(matrix_bytes);
        B = malloc(matrix_bytes);
        C = malloc(matrix_bytes);
        if (A == NULL || B == NULL || C == NULL) {
                fprintf(stderr, "Could not allocate the matrices\n");
                return 1;
        }

        /* Initialize the matrices */
        for (i = 0; i < matrix_elements; i++) {
                A[i] = 1.0; B[i] = 2.0; C[i] = 0.0;
        }

        /* Time a single call: C = alpha*A*B + beta*C */
        tstart = dtime();
        sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N,
                        &beta, C, &N);
        tstop = dtime();

        /* Free the matrix memory */
        free(A); free(B); free(C);

        // elapsed time
        ttime = tstop - tstart;
        // Print the results
        if ((ttime) > 0.0)
                printf("Time spent on SGEMM = %10.3lf\n",ttime);
    return 0;
}

To test MKL AO

  • Get onto a compute node using idev
    [tuhpc002@cypress1 Day2]$ export MY_PARTITION=workshop
    [tuhpc002@cypress1 Day2]$ export MY_QUEUE=workshop
    [tuhpc002@cypress1 Day2]$ idev -c 4 --gres=mic:0
    Requesting 1 node(s)  task(s) to workshop queue of workshop partition
    1 task(s)/node, 4 cpu(s)/task, 2 MIC device(s)/node
    Time: 0 (hr) 60 (min).
    Submitted batch job 54982
    JOBID=54982 begin on cypress01-089
    --> Creating interactive terminal session (login) on node cypress01-089.
    --> You have 0 (hr) 60 (min).
    Last login: Fri Aug 21 07:16:58 2015 from
    [tuhpc002@cypress01-089 Day2]$ 

Note: We will be sharing MICs so expect some resource conflicts

  • Load the Intel module containing MKL and set your environment variables
    [tuhpc002@cypress01-089 Day2]$ module load intel-psxe
    [tuhpc002@cypress01-089 Day2]$ export MKL_MIC_ENABLE=0
    [tuhpc002@cypress01-089 Day2]$ export OFFLOAD_REPORT=2

Notice that automatic offloading is turned OFF. This will set our baseline.

  • Compile the example code being sure to link to the MKL library
  • Run the executable
  • Turn on MKL AO and run it again
    [tuhpc002@cypress01-089 Day2]$ icc -O3 -mkl -openmp sgemm_example.c -o AOtest
    [tuhpc002@cypress01-089 Day2]$ ./AOtest 
    Time spent on SGEMM =      0.835
    [tuhpc002@cypress01-089 Day2]$ export MKL_MIC_ENABLE=1
    [tuhpc002@cypress01-089 Day2]$ ./AOtest 
    [MKL] [MIC --] [AO Function]	SGEMM
    [MKL] [MIC --] [AO SGEMM Workdivision]	0.60 0.20 0.20
    [MKL] [MIC 00] [AO SGEMM CPU Time]	2.858848 seconds
    [MKL] [MIC 00] [AO SGEMM MIC Time]	0.104307 seconds
    [MKL] [MIC 00] [AO SGEMM CPU->MIC Data]	31457280 bytes
    [MKL] [MIC 00] [AO SGEMM MIC->CPU Data]	5242880 bytes
    [MKL] [MIC 01] [AO SGEMM CPU Time]	2.858848 seconds
    [MKL] [MIC 01] [AO SGEMM MIC Time]	0.113478 seconds
    [MKL] [MIC 01] [AO SGEMM CPU->MIC Data]	31457280 bytes
    [MKL] [MIC 01] [AO SGEMM MIC->CPU Data]	5242880 bytes
    Time spent on SGEMM =      3.436
    [tuhpc002@cypress01-089 Day2]$ 
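The byte counts in the offload report can be checked against the size of the matrices. The sketch below encodes a pattern inferred from the numbers in this report, not a documented MKL formula, and the function names are hypothetical. A coprocessor given a share w of the work appears to receive one full matrix plus a slice of size w, and to return a slice of size w.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes in one N x N single-precision matrix. */
size_t matrix_bytes(size_t n)
{
    return n * n * sizeof(float);
}

/* Bytes returned by a coprocessor that was given `share` (0..1) of the work. */
size_t bytes_from_mic(size_t n, double share)
{
    assert(share >= 0.0 && share <= 1.0);
    return (size_t)(share * (double)matrix_bytes(n) + 0.5);  /* round to nearest byte */
}

/* Bytes sent to that coprocessor (INFERRED: one full matrix plus one slice). */
size_t bytes_to_mic(size_t n, double share)
{
    if (share == 0.0) {
        return 0;  /* an idle coprocessor receives nothing */
    }
    return matrix_bytes(n) + bytes_from_mic(n, share);
}
```

For N = 2560 and a share of 0.20 this gives 31457280 bytes sent and 5242880 bytes returned, matching the report. The hand-tuned run later in this section (share 1.00) follows the same pattern.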

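The Point: This example gets at some of the challenges of coding for the Xeon Phi. Utilization is simple, but optimization can be a real challenge. Note that with AO enabled the SGEMM call took 3.436 seconds, compared with 0.835 seconds on the host alone, so simply turning offloading on made this problem slower. Let's look at a few more options we can manipulate through environment variables: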

  • The work division among the Host and MICs can also be tuned by hand using MKL_MIC_<0,1>_WORKDIVISION
    [tuhpc002@cypress01-089 Day2]$ export MKL_MIC_0_WORKDIVISION=1.0
    [tuhpc002@cypress01-089 Day2]$ ./AOtest 
    [MKL] [MIC --] [AO Function]	SGEMM
    [MKL] [MIC --] [AO SGEMM Workdivision]	0.00 1.00 0.00
    [MKL] [MIC 00] [AO SGEMM CPU Time]	2.831957 seconds
    [MKL] [MIC 00] [AO SGEMM MIC Time]	0.141694 seconds
    [MKL] [MIC 00] [AO SGEMM CPU->MIC Data]	52428800 bytes
    [MKL] [MIC 00] [AO SGEMM MIC->CPU Data]	26214400 bytes
    [MKL] [MIC 01] [AO SGEMM CPU Time]	2.831957 seconds
    [MKL] [MIC 01] [AO SGEMM MIC Time]	0.000000 seconds
    [MKL] [MIC 01] [AO SGEMM CPU->MIC Data]	0 bytes
    [MKL] [MIC 01] [AO SGEMM MIC->CPU Data]	0 bytes
    Time spent on SGEMM =      3.394
  • The number of threads used on each MIC can be controlled using MIC_OMP_NUM_THREADS
    [tuhpc002@cypress01-089 Day2]$ export MIC_OMP_NUM_THREADS=122
    [tuhpc002@cypress01-089 Day2]$ ./AOtest 
    [MKL] [MIC --] [AO Function]	SGEMM
    [MKL] [MIC --] [AO SGEMM Workdivision]	0.60 0.20 0.20
    [MKL] [MIC 00] [AO SGEMM CPU Time]	1.625511 seconds
    [MKL] [MIC 00] [AO SGEMM MIC Time]	0.102266 seconds
    [MKL] [MIC 00] [AO SGEMM CPU->MIC Data]	31457280 bytes
    [MKL] [MIC 00] [AO SGEMM MIC->CPU Data]	5242880 bytes
    [MKL] [MIC 01] [AO SGEMM CPU Time]	1.625511 seconds
    [MKL] [MIC 01] [AO SGEMM MIC Time]	0.089364 seconds
    [MKL] [MIC 01] [AO SGEMM CPU->MIC Data]	31457280 bytes
    [MKL] [MIC 01] [AO SGEMM MIC->CPU Data]	5242880 bytes
    Time spent on SGEMM =      2.288
    [tuhpc002@cypress01-089 Day2]$
  • We can control the distribution of threads using MIC_KMP_AFFINITY
    [tuhpc002@cypress01-089 Day2]$ export MIC_KMP_AFFINITY=scatter
    [tuhpc002@cypress01-089 Day2]$ ./AOtest 
    [MKL] [MIC --] [AO Function]	SGEMM
    [MKL] [MIC --] [AO SGEMM Workdivision]	0.60 0.20 0.20
    [MKL] [MIC 00] [AO SGEMM CPU Time]	1.631954 seconds
    [MKL] [MIC 00] [AO SGEMM MIC Time]	0.101270 seconds
    [MKL] [MIC 00] [AO SGEMM CPU->MIC Data]	31457280 bytes
    [MKL] [MIC 00] [AO SGEMM MIC->CPU Data]	5242880 bytes
    [MKL] [MIC 01] [AO SGEMM CPU Time]	1.631954 seconds
    [MKL] [MIC 01] [AO SGEMM MIC Time]	0.105702 seconds
    [MKL] [MIC 01] [AO SGEMM CPU->MIC Data]	31457280 bytes
    [MKL] [MIC 01] [AO SGEMM MIC->CPU Data]	5242880 bytes
    Time spent on SGEMM =      2.028
    [tuhpc002@cypress01-089 Day2]$
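To picture what the thread count and affinity settings do together, here is a simplified model of thread placement. It is an illustration only: the function name is hypothetical, the runtime's real numbering may differ, and it assumes "scatter" deals threads round-robin across cores while "compact" fills all hardware threads of one core before moving to the next.

```c
#include <assert.h>

/* Core index for thread `tid` under a simplified placement model.
   scatter != 0: round-robin over cores.
   scatter == 0: compact, fill each core's hardware threads first. */
int core_for_thread(int tid, int cores, int threads_per_core, int scatter)
{
    assert(cores > 0 && threads_per_core > 0);
    assert(tid >= 0 && tid < cores * threads_per_core);

    return scatter ? tid % cores : tid / threads_per_core;
}
```

With 61 cores and 4 hardware threads per core, 122 threads under this model land two per core across all 61 cores with scatter, but occupy only cores 0 through 30 with compact.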

Native Programming

The native model centers around the notion that each MIC is its own machine with its own architecture. The first challenge is to compile code to run specifically on the hardware of the MIC.
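As a minimal sketch of a native program, the function below reports which target it was built for. The function name is hypothetical. It relies on the `__MIC__` macro, which the Intel compiler defines when building for the coprocessor (for example with the `-mmic` flag, an addition here that is not shown elsewhere on this page), and it falls back to a single thread if OpenMP is not enabled.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Prints the build target and returns the number of OpenMP threads available. */
int report_target(void)
{
    int threads = 1;  /* fallback when built without OpenMP */

#ifdef _OPENMP
    threads = omp_get_max_threads();
#endif
    assert(threads >= 1);

#ifdef __MIC__
    printf("Built for the MIC, %d thread(s) available\n", threads);
#else
    printf("Built for the host, %d thread(s) available\n", threads);
#endif
    return threads;
}
```

Called from a one-line `main`, the same source built for the host and built natively for the coprocessor should print different messages, which is a quick way to confirm that the native build and launch worked.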

An example SLURM job script:

#!/bin/bash
#SBATCH --qos=normal            # Quality of Service
#SBATCH --job-name=nativeTest   # Job Name
#SBATCH --time=00:10:00         # WallTime
#SBATCH --nodes=1               # Number of Nodes
#SBATCH --ntasks-per-node=1     # Number of tasks (MPI processes)
#SBATCH --cpus-per-task=1       # Number of processors per task (OpenMP threads)
#SBATCH --gres=mic:1            # Number of Co-Processors

micnativeloadex ./myNativeExecutable -e "OMP_NUM_THREADS=100" -d 0 -v

In the script above we request one MIC device, which will be device number 0. The "micnativeloadex" command launches a MIC native executable. The -e "OMP_NUM_THREADS=100" option sets the number of threads on the MIC device to 100. For more options, see the help output below.

[fuji@cypress01-090 nativeTest]$ micnativeloadex -h

micnativeloadex [ -h | -V ] AppName -l -t timeout -p -v -d coprocessor -a "args" -e "environment"
  -a "args" An optional string of command line arguments to pass to
            the remote app.
  -d The (zero based) index of the Intel(R) Xeon Phi(TM) coprocessor to run the app on.
  -e "environment" An optional environment string to pass to the remote app.
      Multiple environment variable may be specified using spaces as separators:
        -e "LD_LIBRARY_PATH=/lib64/ DEBUG=1"
  -h Print this help message
  -l Do not execute the binary on the coprocessor. Instead, list the shared library
     dependency information.
  -p Disable console proxy.
  -t Time to wait for the remote app to finish (in seconds). After the timeout
     is reached the remote app will be terminated.
  -v Enable verbose mode. Note that verbose output will be displayed
     if the remote app terminates abnormally.
  -V Show version and build information


Programming Considerations

The number one thing to keep in mind is that all data traffic to and from the coprocessors must travel over PCIe. This is a relatively slow connection when compared to memory, and the more you can minimize this communication, the faster your code will run.
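A rough way to budget for this cost is to divide the bytes you plan to move by the link bandwidth. The helper below is a sketch with a hypothetical name. The bandwidth you pass in is an assumption you should replace with a measured value for your node.

```c
#include <assert.h>

/* Seconds needed to move `bytes` over a link with the given
   bandwidth in GB/s (1 GB = 1e9 bytes). Ignores latency and setup cost. */
double transfer_seconds(double bytes, double bandwidth_gb_per_s)
{
    assert(bytes >= 0.0 && bandwidth_gb_per_s > 0.0);
    return bytes / (bandwidth_gb_per_s * 1.0e9);
}
```

For example, assuming a purely illustrative 6 GB/s, `transfer_seconds(31457280, 6.0)` comes to a little over 5 milliseconds. Repeating such a transfer inside a loop adds up quickly, so it pays to move data once and reuse it on the coprocessor.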

Future Training

We've only scratched the surface on the potential of the Xeon Phi coprocessor. If you are interested in learning more, Colfax International will be giving two days of instruction on coding for the Xeon Phi at Tulane at the end of September. Interested parties can register at

CDT 101:

CDT 102:
