= Programming for the Xeon Phi Coprocessor on Cypress =

==== Workshop Reminder ====
To take advantage of the workshop QOS:
{{{#!bash
export MY_PARTITION=workshop
export MY_QUEUE=workshop
idev -c 4 --gres=mic:0
}}}

To follow along, you can copy the source code and submission scripts:
{{{
cp -r /lustre/project/workshop/PhiExamples/ .
}}}

The Xeon Phi coprocessor is an accelerator used to provide many cores to parallel applications. While it has fewer threads available than a typical GPU accelerator, the processors are much "smarter" and the programming paradigm is similar to what you experience coding for CPUs.

== Xeon Phi Coprocessor Hardware ==

Each compute node of Cypress is equipped with two (2) Xeon Phi 7120P coprocessors.

[[Image(xeonPhi.jpg, center)]]

The 7120P is equipped with:
* 61 physical x86 cores running at 1.238 GHz
* Four (4) hardware threads on each core
* 16GB GDDR5 memory
* Unusually wide SIMD capabilities via 512-bit wide vector registers (8 doubles or 16 floats at a time)
* A unique IMCI instruction set
* A PCIe bus connection to the host
* Fully coherent L1 and L2 caches

All this adds up to roughly 2 TFLOP/s (about 1 TFLOP/s double precision) of potential computing power.
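As a rough sanity check on that number: 61 cores x 1.238 GHz x 8 double-precision vector lanes x 2 floating-point operations per fused multiply-add comes to about 1.2 TFLOP/s in double precision, and roughly twice that in single precision.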

Each Xeon Phi can be regarded as its own small machine (a small cluster, really) running a stripped-down version of Linux. We can ssh onto the Phis, we can run code on them, and we can treat them as another asset to be recruited into our MPI executions.

== What Do I Call It? ==
The 7120P is referred to by many names, all of them correct:
* The Phi
* The coprocessor
* The Xeon Phi
* The MIC (pronounced both "Mick" as in Jagger and "Mike"), which stands for Many Integrated Cores
* Knights Corner, the codename for this generation (Knights Landing and Knights Hill are the follow-on generations)

You'll typically hear us call the 7120P either the ''MIC'' or the ''Phi''. This is to help distinguish it from the Xeon E5 processors, which we'll refer to as the ''host''.

== Xeon Phi Usage Models ==
The Intel suite provides parallel libraries and compilers that support three distinct programming models:

* Automatic Offloading (AO) - the Intel MKL library sends certain calculations to the Phi without any user input.
* Native Programming - code is compiled to run on the Xeon Phi coprocessor and ONLY on the Xeon Phi coprocessor.
* Offloading - specific parallel sections of your source code are marked for offloading to the coprocessor. This provides the greatest amount of control and allows the CPUs and coprocessors to work in tandem.

=== Automatic Offloading ===
==== Eligibility ====
As we saw yesterday during our Matlab tutorial, any program/code that makes use of the Intel MKL library may take advantage of Automatic Offloading (AO) to the MIC. However, not every MKL routine will automatically offload. The routines that are eligible for AO are:
* BLAS:
 * BLAS level-3 subroutines - ?SYMM, ?TRMM, ?TRSM, ?GEMM
* LAPACK:
 * LU (?GETRF), Cholesky ((S/D)POTRF), and QR (?GEQRF) factorization functions

However, AO will only kick in if MKL deems the problem to be of sufficient size (i.e. the increase in parallelism will outweigh the increase in overhead). For instance, SGEMM will use AO only if the matrix size exceeds 2048x2048. For more information on which routines are eligible for AO, see the white paper [https://software.intel.com/en-us/articles/intel-mkl-automatic-offload-enabled-functions-for-intel-xeon-phi-coprocessors Intel MKL Automatic Offload enabled functions for Intel Xeon Phi coprocessors].

==== Enabling Offloading ====
To enable AO on Cypress you must:
* Load the Intel Parallel Studio XE module
* Turn on MKL AO by setting the environment variable MKL_MIC_ENABLE to 1 (0 or nothing will turn off MKL AO)
* (OPTIONAL) Turn on offload reporting to track your use of the MIC by setting OFFLOAD_REPORT to either 1 or 2. Setting OFFLOAD_REPORT to 2 adds more detail than 1 and will give you information on data transfers.
{{{
[tulaneID@cypress1]$ module load intel-psxe
[tulaneID@cypress1]$ export MKL_MIC_ENABLE=1
[tulaneID@cypress1]$ export OFFLOAD_REPORT=2
}}}

==== Example using SGEMM ====
Let's do a small example using SGEMM to test the behavior of MKL AO.

{{{#!c
/* System headers */
#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>
#include <stdint.h>
#include <sys/time.h>   /* for gettimeofday(), used by dtime() */

#include "mkl.h"

// dtime
//
// returns the current wall clock time
//
double dtime()
{
    double tseconds = 0.0;
    struct timeval mytime;
    gettimeofday(&mytime,(struct timezone*)0);
    tseconds = (double)(mytime.tv_sec +
                mytime.tv_usec*1.0e-6);
    return( tseconds );
}

int main(int argc, char **argv)
{
        float *A, *B, *C; /* Matrices */
        double tstart, tstop, ttime;

        MKL_INT N = 2560; /* Matrix dimensions */
        int matrix_bytes; /* Matrix size in bytes */
        int matrix_elements; /* Matrix size in elements */

        float alpha = 1.0, beta = 1.0; /* Scaling factors */
        char transa = 'N', transb = 'N'; /* Transposition options */

        int i; /* Counter */

        matrix_elements = N * N;
        matrix_bytes = sizeof(float) * matrix_elements;

        /* Allocate the matrices */
        A = malloc(matrix_bytes);
        B = malloc(matrix_bytes);
        C = malloc(matrix_bytes);

        /* Initialize the matrices */
        for (i = 0; i < matrix_elements; i++) {
                A[i] = 1.0; B[i] = 2.0; C[i] = 0.0;
        }

        /* Time a single SGEMM call: C = alpha*A*B + beta*C */
        tstart = dtime();
        sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N,
                        &beta, C, &N);
        tstop = dtime();

        /* Free the matrix memory */
        free(A); free(B); free(C);

        // elapsed time
        ttime = tstop - tstart;
        //
        // Print the results
        //
        if ((ttime) > 0.0)
        {
                printf("Time spent on SGEMM = %10.3lf\n",ttime);
        }
        printf("Done\n");

    return 0;
}
}}}
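
Note that the example uses N = 2560, which is larger than the 2048x2048 threshold mentioned above, so this SGEMM call is a candidate for automatic offload once MKL_MIC_ENABLE is set.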

To test MKL AO:
* Get onto a compute node using idev
{{{
[tuhpc002@cypress1 Day2]$ export MY_PARTITION=workshop
[tuhpc002@cypress1 Day2]$ export MY_QUEUE=workshop
[tuhpc002@cypress1 Day2]$ idev -c 4 --gres=mic:0
Requesting 1 node(s)  task(s) to workshop queue of workshop partition
1 task(s)/node, 4 cpu(s)/task, 2 MIC device(s)/node
Time: 0 (hr) 60 (min).
Submitted batch job 54982
JOBID=54982 begin on cypress01-089
--> Creating interactive terminal session (login) on node cypress01-089.
--> You have 0 (hr) 60 (min).
Last login: Fri Aug 21 07:16:58 2015 from cypress1.cm.cluster
[tuhpc002@cypress01-089 Day2]$
}}}

Note: We will be sharing MICs, so expect some resource conflicts.

* Load the Intel module containing MKL and set your environment variables
{{{
[tuhpc002@cypress01-089 Day2]$ module load intel-psxe
[tuhpc002@cypress01-089 Day2]$ export MKL_MIC_ENABLE=0
[tuhpc002@cypress01-089 Day2]$ export OFFLOAD_REPORT=2
}}}

Notice that automatic offloading is turned OFF. This will set our baseline.
* Compile the example code, being sure to link to the MKL library
* Run the executable
* Turn on MKL AO and run it again
{{{
[tuhpc002@cypress01-089 Day2]$ icc -O3 -mkl -openmp sgemm_example.c -o AOtest
[tuhpc002@cypress01-089 Day2]$ ./AOtest
Time spent on SGEMM =      0.835
Done
[tuhpc002@cypress01-089 Day2]$ export MKL_MIC_ENABLE=1
[tuhpc002@cypress01-089 Day2]$ ./AOtest
[MKL] [MIC --] [AO Function]    SGEMM
[MKL] [MIC --] [AO SGEMM Workdivision]  0.60 0.20 0.20
[MKL] [MIC 00] [AO SGEMM CPU Time]      2.858848 seconds
[MKL] [MIC 00] [AO SGEMM MIC Time]      0.104307 seconds
[MKL] [MIC 00] [AO SGEMM CPU->MIC Data] 31457280 bytes
[MKL] [MIC 00] [AO SGEMM MIC->CPU Data] 5242880 bytes
[MKL] [MIC 01] [AO SGEMM CPU Time]      2.858848 seconds
[MKL] [MIC 01] [AO SGEMM MIC Time]      0.113478 seconds
[MKL] [MIC 01] [AO SGEMM CPU->MIC Data] 31457280 bytes
[MKL] [MIC 01] [AO SGEMM MIC->CPU Data] 5242880 bytes
Time spent on SGEMM =      3.436
Done
[tuhpc002@cypress01-089 Day2]$
}}}

The Point: This example gets at some of the challenges of coding for the Xeon Phi. Utilization is simple, but optimization can be a real challenge. Let's look at a few more options we can manipulate through environment variables:

* The work division among the host and the MICs can also be tuned by hand using MKL_MIC_<0,1>_WORKDIVISION
{{{
[tuhpc002@cypress01-089 Day2]$ export MKL_MIC_0_WORKDIVISION=1.0
[tuhpc002@cypress01-089 Day2]$ ./AOtest
[MKL] [MIC --] [AO Function]    SGEMM
[MKL] [MIC --] [AO SGEMM Workdivision]  0.00 1.00 0.00
[MKL] [MIC 00] [AO SGEMM CPU Time]      2.831957 seconds
[MKL] [MIC 00] [AO SGEMM MIC Time]      0.141694 seconds
[MKL] [MIC 00] [AO SGEMM CPU->MIC Data] 52428800 bytes
[MKL] [MIC 00] [AO SGEMM MIC->CPU Data] 26214400 bytes
[MKL] [MIC 01] [AO SGEMM CPU Time]      2.831957 seconds
[MKL] [MIC 01] [AO SGEMM MIC Time]      0.000000 seconds
[MKL] [MIC 01] [AO SGEMM CPU->MIC Data] 0 bytes
[MKL] [MIC 01] [AO SGEMM MIC->CPU Data] 0 bytes
Time spent on SGEMM =      3.394
}}}
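
In the report above, the three workdivision columns give the fraction of the work assigned to the host, MIC 0, and MIC 1, respectively; setting MKL_MIC_0_WORKDIVISION=1.0 pushed all of the work onto MIC 0, which is why MIC 1 shows zero time and zero bytes transferred.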

* The number of threads used on each MIC can be controlled using MIC_OMP_NUMTHREADS
{{{
[tuhpc002@cypress01-089 Day2]$ export MIC_OMP_NUMTHREADS=122
[tuhpc002@cypress01-089 Day2]$ ./AOtest
[MKL] [MIC --] [AO Function]    SGEMM
[MKL] [MIC --] [AO SGEMM Workdivision]  0.60 0.20 0.20
[MKL] [MIC 00] [AO SGEMM CPU Time]      1.625511 seconds
[MKL] [MIC 00] [AO SGEMM MIC Time]      0.102266 seconds
[MKL] [MIC 00] [AO SGEMM CPU->MIC Data] 31457280 bytes
[MKL] [MIC 00] [AO SGEMM MIC->CPU Data] 5242880 bytes
[MKL] [MIC 01] [AO SGEMM CPU Time]      1.625511 seconds
[MKL] [MIC 01] [AO SGEMM MIC Time]      0.089364 seconds
[MKL] [MIC 01] [AO SGEMM CPU->MIC Data] 31457280 bytes
[MKL] [MIC 01] [AO SGEMM MIC->CPU Data] 5242880 bytes
Time spent on SGEMM =      2.288
Done
[tuhpc002@cypress01-089 Day2]$
}}}

* We can control the distribution of threads using MIC_KMP_AFFINITY
{{{
[tuhpc002@cypress01-089 Day2]$ export MIC_KMP_AFFINITY=scatter
[tuhpc002@cypress01-089 Day2]$ ./AOtest
[MKL] [MIC --] [AO Function]    SGEMM
[MKL] [MIC --] [AO SGEMM Workdivision]  0.60 0.20 0.20
[MKL] [MIC 00] [AO SGEMM CPU Time]      1.631954 seconds
[MKL] [MIC 00] [AO SGEMM MIC Time]      0.101270 seconds
[MKL] [MIC 00] [AO SGEMM CPU->MIC Data] 31457280 bytes
[MKL] [MIC 00] [AO SGEMM MIC->CPU Data] 5242880 bytes
[MKL] [MIC 01] [AO SGEMM CPU Time]      1.631954 seconds
[MKL] [MIC 01] [AO SGEMM MIC Time]      0.105702 seconds
[MKL] [MIC 01] [AO SGEMM CPU->MIC Data] 31457280 bytes
[MKL] [MIC 01] [AO SGEMM MIC->CPU Data] 5242880 bytes
Time spent on SGEMM =      2.028
Done
[tuhpc002@cypress01-089 Day2]$
}}}
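
The usual KMP_AFFINITY values apply here: ''scatter'' spreads threads round-robin across the physical cores, ''compact'' packs them onto as few cores as possible, and on the Phi there is also a ''balanced'' setting that spreads threads across cores while keeping consecutive thread IDs adjacent.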

=== Native Programming ===

The native model centers around the notion that each MIC is its own machine with its own architecture. The first challenge is to compile code to run specifically on the hardware of the MIC.

Let's start out in interactive mode and then we will look at job submission:
{{{#!bash
export MY_PARTITION=workshop
export MY_QUEUE=workshop
idev -c 4 --gres=mic:0
}}}

* Load the Intel modules:
 * intel-psxe
 * intel/mic/sdk/3.3
 * intel/mic/runtime/3.3

And let's examine the code helloflops3.c from Jim Jeffers and James Reinders' seminal text
[[http://lotsofcores.com/|Intel Xeon Phi Coprocessor High-Performance Programming]]

{{{#!c
//
//
// helloflops3
//
// A simple example that gets lots of Flops (Floating Point Operations) on
// Intel(r) Xeon Phi(tm) co-processors using openmp to scale
//

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>
#include <sys/time.h>

// dtime
//
// returns the current wall clock time
//
double dtime()
{
    double tseconds = 0.0;
    struct timeval mytime;
    gettimeofday(&mytime,(struct timezone*)0);
    tseconds = (double)(mytime.tv_sec + mytime.tv_usec*1.0e-6);
    return( tseconds );
}

#define FLOPS_ARRAY_SIZE (1024*1024)
#define MAXFLOPS_ITERS 100000000
#define LOOP_COUNT 128

// number of float pt ops per calculation
#define FLOPSPERCALC 2
// define some arrays -
// make sure they are 64 byte aligned
// for best cache access
float fa[FLOPS_ARRAY_SIZE] __attribute__((aligned(64)));
float fb[FLOPS_ARRAY_SIZE] __attribute__((aligned(64)));
//
// Main program - pedal to the metal...calculate using tons o'flops!
//
int main(int argc, char *argv[] )
{
    int i,j,k;
    int numthreads;
    double tstart, tstop, ttime;
    double gflops = 0.0;
    float a=1.1;

    //
    // initialize the compute arrays
    //

#pragma omp parallel
#pragma omp master
    numthreads = omp_get_num_threads();

    printf("Initializing\r\n");
#pragma omp parallel for
    for(i=0; i<FLOPS_ARRAY_SIZE; i++)
    {
        fa[i] = (float)i + 0.1;
        fb[i] = (float)i + 0.2;
    }
    printf("Starting Compute on %d threads\r\n",numthreads);

    tstart = dtime();

    // scale the calculation across threads requested
    // need to set environment variables OMP_NUM_THREADS and KMP_AFFINITY

#pragma omp parallel for private(j,k)
    for (i=0; i<numthreads; i++)
    {
        // each thread will work on its own array section
        // calc offset into the right section
        int offset = i*LOOP_COUNT;

        // loop many times to get lots of calculations
        for(j=0; j<MAXFLOPS_ITERS; j++)
        {
            // scale 1st array and add in the 2nd array
            for(k=0; k<LOOP_COUNT; k++)
            {
                fa[k+offset] = a * fa[k+offset] + fb[k+offset];
            }
        }
    }
    tstop = dtime();
    // # of gigaflops we just calculated
    gflops = (double)( 1.0e-9*numthreads*LOOP_COUNT*
                        MAXFLOPS_ITERS*FLOPSPERCALC);

    // elapsed time
    ttime = tstop - tstart;
    //
    // Print the results
    //
    if ((ttime) > 0.0)
    {
        printf("GFlops = %10.3lf, Secs = %10.3lf, GFlops per sec = %10.3lf\r\n",
               gflops, ttime, gflops/ttime);
    }
    return( 0 );
}
}}}

Let's begin by compiling the code for the host and running it on the host processor:
{{{
[tuhpc002@cypress01-089 Day2]$ icc -O3 -openmp helloflops3.c -o helloflops3_host
[tuhpc002@cypress01-089 Day2]$ ./helloflops3_host
Initializing
Starting Compute on 20 threads
GFlops =    512.000, Secs =      6.349, GFlops per sec =     80.645
[tuhpc002@cypress01-089 Day2]$
}}}

Not bad. Now let's compile a native version by adding the architecture flag '''-mmic'''.
{{{
[tuhpc002@cypress01-089 Day2]$ icc -O3 -openmp -mmic helloflops3.c -o helloflops3_mic
[tuhpc002@cypress01-089 Day2]$ ./helloflops3_mic
-bash: ./helloflops3_mic: cannot execute binary file
[tuhpc002@cypress01-089 Day2]$
}}}

We could not execute our binary because we tried to run it on the host architecture. We could ssh onto one of the MICs and manually set our library environment variables to run the code, but a cleaner method (and the recommended method) is to use the Intel program '''micnativeloadex'''.
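In short, '''micnativeloadex''' uploads the native binary to the chosen coprocessor, checks its shared-library dependencies, runs it there, and proxies the console output back to your host session (the '''-l''' and '''-p''' options in the help text below correspond to the dependency check and the console proxy).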

{{{
[tuhpc002@cypress01-089 Day2]$ micnativeloadex ./helloflops3_mic
Initializing
Starting Compute on 240 threads
GFlops =   6144.000, Secs =      2.630, GFlops per sec =   2335.925

[tuhpc002@cypress01-089 Day2]$
}}}

'''micnativeloadex''' has a number of options, which can be seen using the '''-h''' or help flag.

{{{
[tulaneID@cypress01]$ micnativeloadex -h

Usage:
micnativeloadex [ -h | -V ] AppName -l -t timeout -p -v -d coprocessor -a "args" -e "environment"
  -a "args" An optional string of command line arguments to pass to
            the remote app.
  -d The (zero based) index of the Intel(R) Xeon Phi(TM) coprocessor to run the app on.
  -e "environment" An optional environment string to pass to the remote app.
      Multiple environment variable may be specified using spaces as separators:
        -e "LD_LIBRARY_PATH=/lib64/ DEBUG=1"
  -h Print this help message
  -l Do not execute the binary on the coprocessor. Instead, list the shared library
     dependency information.
  -p Disable console proxy.
  -t Time to wait for the remote app to finish (in seconds). After the timeout
     is reached the remote app will be terminated.
  -v Enable verbose mode. Note that verbose output will be displayed
     if the remote app terminates abnormally.
  -V Show version and build information
}}}

Notice that we can use the '''-d''' flag to select which MIC we want to run on and the '''-e''' flag to set environment variables on the MIC (separated by whitespace). For example, we can choose to run on MIC 0 and set the number of threads and their affinity with:

{{{
[tuhpc002@cypress01-089 Day2]$ micnativeloadex ./helloflops3_mic -e "OMP_NUM_THREADS=120 KMP_AFFINITY=scatter" -d 0
Initializing
Starting Compute on 120 threads
GFlops =   3072.000, Secs =      1.500, GFlops per sec =   2048.143

[tuhpc002@cypress01-089 Day2]$
}}}

We've been using idev as an instructional tool, but we won't normally run our MIC native jobs interactively. Rather, we'll submit jobs that we want to run in native mode on a compute node. An example SLURM job script for our code would look like:

{{{#!bash
#!/bin/bash
#SBATCH --qos=workshop          # Quality of Service
#SBATCH --partition=workshop    # Partition
#SBATCH --job-name=nativeTest   # Job Name
#SBATCH --time=00:10:00         # WallTime
#SBATCH --nodes=1               # Number of Nodes
#SBATCH --ntasks-per-node=1     # Number of tasks (MPI processes)
#SBATCH --cpus-per-task=1       # Number of processors per task (OpenMP threads)
#SBATCH --gres=mic:1            # Number of Co-Processors

module load intel-psxe/2015-update1
module load intel/mic/sdk/3.3
module load intel/mic/runtime/3.3

micnativeloadex ./helloflops3_mic -e "OMP_NUM_THREADS=120 KMP_AFFINITY=scatter" -d 0
}}}
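
The script is submitted like any other SLURM job, e.g. with ''sbatch nativeTest.sh'' (the file name here is just an example).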

=== Offloading ===

Offloading allows us to designate specific sections of our code that we wish to have executed on the MIC. Unlike the first two methods, this requires (minimal) alteration of the source code. Let's take our helloflops3.c example and modify it so that it offloads the area of heavy computation.

{{{#!c
//
//
// helloflops3offl
//
// A simple example that gets lots of Flops (Floating Point Operations) on
// Intel(r) Xeon Phi(tm) co-processors using offload plus openmp to scale
//

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>
#include <sys/time.h>

// dtime
//
// returns the current wall clock time
//
double dtime()
{
    double tseconds = 0.0;
    struct timeval mytime;
    gettimeofday(&mytime,(struct timezone*)0);
    tseconds = (double)(mytime.tv_sec + mytime.tv_usec*1.0e-6);
    return( tseconds );
}

#define FLOPS_ARRAY_SIZE (1024*512)
#define MAXFLOPS_ITERS 100000000
#define LOOP_COUNT 128

// number of float pt ops per calculation
#define FLOPSPERCALC 2
// define some arrays -
// make sure they are 64 byte aligned
// for best cache access
__declspec ( target (mic)) float fa[FLOPS_ARRAY_SIZE] __attribute__((aligned(64)));
__declspec ( target (mic)) float fb[FLOPS_ARRAY_SIZE] __attribute__((aligned(64)));
//
// Main program - pedal to the metal...calculate using tons o'flops!
//
int main(int argc, char *argv[] )
{
    int i,j,k;
    int numthreads;
    double tstart, tstop, ttime;
    double gflops = 0.0;
    float a=1.1;

    //
    // initialize the compute arrays
    //

#pragma offload target (mic)
#pragma omp parallel
#pragma omp master
    numthreads = omp_get_num_threads();

    printf("Initializing\r\n");

#pragma omp parallel for
    for(i=0; i<FLOPS_ARRAY_SIZE; i++)
    {
        fa[i] = (float)i + 0.1;
        fb[i] = (float)i + 0.2;
    }
    printf("Starting Compute on %d threads\r\n",numthreads);

    tstart = dtime();

    // scale the calculation across threads requested
    // need to set environment variables OMP_NUM_THREADS and KMP_AFFINITY

#pragma offload target (mic)
#pragma omp parallel for private(j,k)
    for (i=0; i<numthreads; i++)
    {
        // each thread will work on its own array section
        // calc offset into the right section
        int offset = i*LOOP_COUNT;

        // loop many times to get lots of calculations
        for(j=0; j<MAXFLOPS_ITERS; j++)
        {
            // scale 1st array and add in the 2nd array
            #pragma vector aligned
            for(k=0; k<LOOP_COUNT; k++)
            {
                fa[k+offset] = a * fa[k+offset] + fb[k+offset];
            }
        }
    }
    tstop = dtime();
    // # of gigaflops we just calculated
    gflops = (double)( 1.0e-9*numthreads*LOOP_COUNT*
                        MAXFLOPS_ITERS*FLOPSPERCALC);

    // elapsed time
    ttime = tstop - tstart;
    //
    // Print the results
    //
    if ((ttime) > 0.0)
    {
        printf("GFlops = %10.3lf, Secs = %10.3lf, GFlops per sec = %10.3lf\r\n",
               gflops, ttime, gflops/ttime);
    }
    return( 0 );
}
}}}

Changes to take note of:
* The addition of the offload directive before the section of code that we wish to run on the MIC (by default, data referenced in the offloaded region is copied to the coprocessor on entry and back on exit; see the sketch below for finer control)
 {{{#!c
#pragma offload target (mic)
}}}
* The alteration of our array declarations indicating that they will be part of offload use, e.g.
{{{
__declspec ( target (mic)) float fa[FLOPS_ARRAY_SIZE] __attribute__((aligned(64)));
}}}
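
If you want explicit control over what is transferred, the offload pragma also accepts data clauses such as ''in'', ''out'', and ''length''. The snippet below is a minimal, hypothetical sketch (not one of the workshop examples); the array names and sizes are purely illustrative.

{{{#!c
#include <stdio.h>

#define N 1024

// Arrays that will be visible on both the host and the coprocessor
__declspec(target(mic)) float a[N], b[N], c[N];

int main(void)
{
    int i;

    for (i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    // Copy a and b to the coprocessor on entry; copy only c back on exit
#pragma offload target(mic:0) in(a, b : length(N)) out(c : length(N))
    {
#pragma omp parallel for
        for (i = 0; i < N; i++)
            c[i] = a[i] + b[i];
    }

    printf("c[0] = %f\n", c[0]);
    return 0;
}
}}}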

Let's compile our new code:

{{{
icc -openmp -O3 helloflops3offload.c -o helloflops3offload
}}}

And let's take a look at a submission script for our offloading example:

{{{#!bash
#!/bin/bash
#SBATCH --qos=workshop          # Quality of Service
#SBATCH --partition=workshop    # Partition
#SBATCH --job-name=offloadTest  # Job Name
#SBATCH --time=00:10:00         # WallTime
#SBATCH --nodes=1               # Number of Nodes
#SBATCH --ntasks-per-node=1     # Number of tasks (MPI processes)
#SBATCH --cpus-per-task=20      # Number of processors per task (OpenMP threads)
#SBATCH --gres=mic:1            # Number of Co-Processors

module load intel-psxe/2015-update1
module load intel/mic/sdk/3.3
module load intel/mic/runtime/3.3

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MIC_ENV_PREFIX=MIC
export MIC_OMP_NUM_THREADS=120
export MIC_KMP_AFFINITY=scatter

./helloflops3offload
}}}
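
Here MIC_ENV_PREFIX=MIC tells the offload runtime to forward every environment variable that starts with MIC_ to the coprocessor with the prefix stripped, so the offloaded regions see OMP_NUM_THREADS=120 and KMP_AFFINITY=scatter while the host keeps its own OMP_NUM_THREADS.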

== Programming Considerations ==

As we've seen, getting code to execute on the Xeon Phi coprocessor can be accomplished in a matter of minutes or hours. However, getting production code to run optimally on the MICs is often an effort spanning weeks or months. A few of the key considerations that can speed up your code (the last of which is illustrated in the sketch below) are:
* Cache alignment
* Vectorization
* Blocking
* Minimal data transfer
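
On the data-transfer point, the offload pragma's ''alloc_if'', ''free_if'', and ''nocopy'' modifiers let you keep data resident on the coprocessor across several offload regions so it only crosses the PCIe bus when it has to. The following is a minimal, hypothetical sketch (not part of the workshop code); the array name, size, and iteration counts are purely illustrative.

{{{#!c
#include <stdio.h>
#include <stdlib.h>

#define N (1024*1024)

int main(void)
{
    float *data;
    int i, step;

    // 64-byte aligned allocation on the host (helps vectorized access)
    if (posix_memalign((void **)&data, 64, N * sizeof(float)) != 0)
        return 1;
    for (i = 0; i < N; i++)
        data[i] = 1.0f;

    // First offload: allocate a buffer on the MIC, copy data in,
    // and keep the buffer alive after the region ends (free_if(0))
#pragma offload target(mic:0) in(data : length(N) alloc_if(1) free_if(0))
    {
#pragma omp parallel for
        for (i = 0; i < N; i++)
            data[i] *= 2.0f;
    }

    // Later offloads reuse the resident buffer: nocopy means no host<->MIC transfer
    for (step = 0; step < 10; step++) {
#pragma offload target(mic:0) nocopy(data : length(N) alloc_if(0) free_if(0))
        {
#pragma omp parallel for
            for (i = 0; i < N; i++)
                data[i] += 1.0f;
        }
    }

    // Final offload: one more pass, then copy the result back and free the MIC buffer
#pragma offload target(mic:0) out(data : length(N) alloc_if(0) free_if(1))
    {
#pragma omp parallel for
        for (i = 0; i < N; i++)
            data[i] += 1.0f;
    }

    printf("data[0] = %f\n", data[0]);
    free(data);
    return 0;
}
}}}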

== Future Training ==
We've only scratched the surface of the potential of the Xeon Phi coprocessor. If you are interested in learning more, Colfax International will be giving two days of instruction on coding for the Xeon Phi at Tulane at the end of September. Interested parties can register at:

CDT 101:  http://events.r20.constantcontact.com/register/event?oeidk=a07eayq4gvha16a1237&llr=kpiwi7pab

CDT 102:  http://events.r20.constantcontact.com/register/event?oeidk=a07eayqb5mwf5397895&llr=kpiwi7pab

[[Image(ColfaxInvite.png, center, 80%)]]