Explicit Offloading with OpenMP
Note that "host" is the CPU, and "device" is MIC/GPU.
This is a simple OpenMP code:
#include <iostream>
#include <omp.h>
int main( void ) {
    int totalProcs;
    totalProcs = omp_get_num_procs();
    std::cout << "Number of Threads = " << totalProcs << std::endl;
    return 0;
}
If run on a Cypress compute node, the "Number of Threads" will be 20.
Adding a one-line #pragma directive offloads a single line of executable code to the device:
#include <iostream>
#include <omp.h>
int main( void ) {
    int totalProcs;
    #pragma omp target device(0)
    totalProcs = omp_get_num_procs();
    std::cout << "Number of Threads = " << totalProcs << std::endl;
    return 0;
}
The code now returns "240". Note that the host pauses until the device code is finished. This code offloads only the single line
totalProcs = omp_get_num_procs();
to the device. Use { } to offload a block of code. What happens to 'totalProcs'?
Primitive variables are automatically transferred to/from the device.
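For example, a minimal sketch (assuming device 0 is the coprocessor) that offloads a block in braces and reads back two scalars:
#include <iostream>
#include <omp.h>
int main( void ) {
    int totalProcs = 0;
    int maxThreads = 0;
    // The block in { } runs on the device; the scalars are copied back to the host
    // when the offload region ends, as described above.
    #pragma omp target device(0)
    {
        totalProcs = omp_get_num_procs();
        maxThreads = omp_get_max_threads();
    }
    std::cout << "Number of Processors = " << totalProcs << std::endl;
    std::cout << "Max Threads = " << maxThreads << std::endl;
    return 0;
}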
Parallel Loop
An OpenMP region is defined by an omp directive. The following for-loop runs on the device:
int main( void ) {
    double a[500000];
    // static arrays are allocated on the stack; the literal size here is important
    int i;
    #pragma omp target device(0)
    #pragma omp parallel for
    for ( i=0; i<500000; i++ ) {
        a[i] = (double)i;
    }
    ...
What happens to “a”?
- Detect a device.
- Allocate 'a' in the device memory.
- The static array “a” is transferred to the device memory.
- Execute the device-side code.
- Values in “a” in the device memory are transferred back to the host memory.
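To confirm where a region actually ran, omp_is_initial_device() (available since OpenMP 4.0) can be queried inside the offload region; a minimal sketch that could be added to the program above:
int onHost = 1;
#pragma omp target device(0) map(from:onHost)
{
    // Returns 0 when executing on the device, nonzero on the host.
    onHost = omp_is_initial_device();
}
std::cout << (onHost ? "ran on host" : "ran on device") << std::endl;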
Controlling the Offload
Get the number of devices
const int num_dev = omp_get_num_devices();
std::cout << "number of devices : " << num_dev << std::endl;
Control data transfer
Transfer data from the device at the end of the offload section
int main( void ) {
    double a[500000];
    // static arrays are allocated on the stack; the literal size here is important
    int i;
    #pragma omp target device(0) map(from:a)
    #pragma omp parallel for
    for ( i=0; i<500000; i++ ) {
        a[i] = (double)i;
    }
}
Transfer data to the device at the beginning of the offload section
#pragma omp target device(0) map(to:a)
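For example, when the device only reads the data, map(to:...) avoids the copy back; a minimal sketch (map(tofrom:sum) is written out explicitly here so the reduction result returns to the host):
double a[500000];
double sum = 0.0;
for ( int i = 0; i < 500000; i++ ) a[i] = (double)i;
// Only the host-to-device copy of a is needed; the device never modifies it.
#pragma omp target device(0) map(to:a) map(tofrom:sum)
#pragma omp parallel for reduction(+:sum)
for ( int i = 0; i < 500000; i++ ) {
    sum += a[i];
}
std::cout << "sum = " << sum << std::endl;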
If no map clause is specified, both transfers are performed (equivalent to map(tofrom:a)).
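A minimal sketch writing the default out explicitly:
double a[500000];
// Copy a to the device at entry and back to the host at exit (the default behavior).
#pragma omp target device(0) map(tofrom:a)
#pragma omp parallel for
for ( int i = 0; i < 500000; i++ ) {
    a[i] = 2.0 * a[i];
}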
Transfer dynamic arrays
For dynamically allocated arrays, you have to specify the range of elements in the array.
#pragma omp target device(0) map(from:phi[0:num * num])
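For example, a sketch built around the directive above (the value of num is chosen arbitrarily here):
#include <iostream>
#include <omp.h>
int main( void ) {
    const int num = 1000;
    double *phi = new double[num * num];
    // For a pointer, the size is unknown to the compiler,
    // so the element range [start:length] must be given in the map clause.
    #pragma omp target device(0) map(from:phi[0:num * num])
    #pragma omp parallel for
    for ( int i = 0; i < num * num; i++ ) {
        phi[i] = (double)i;
    }
    std::cout << phi[num * num - 1] << std::endl;
    delete [] phi;
    return 0;
}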
Keeping Data on Device Memory
A directive with a map(to:a) clause, such as the one below, allocates space for the array a in device memory.
#pragma omp target if (dev != num_dev) device(dev) map(to:a) map(from:a[dev:dev+1])
The memory block for a on the device is freed when the offload section ends.
To keep data in device memory across offload sections, we have to allocate the array on the device explicitly.
void *data;
// First offload: allocate the array on the device and save its address.
#pragma omp target device(0) map(from:data)
{
    double *vdata = new double[100];
    #pragma omp parallel for
    for (int i = 0 ; i < 100 ; i++) vdata[i] = i;
    data = (void *)vdata;
}
// Second offload: reuse the device-side array through the saved address.
#pragma omp target device(0) map(to:data)
{
    double *vdata = (double *)data;
    for (int i = 0 ; i < 100 ; i++){
        std::cout << vdata[i] << std::endl;
    }
}
Use a void * pointer variable to store the address of the array in device memory.
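The device-side allocation is never released in the sketch above; a third offload region could free it, under the same assumptions:
// Free the device-side array through the saved address.
#pragma omp target device(0) map(to:data)
{
    delete [] (double *)data;
}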
Controlling data transfer
#pragma omp target data map(to:aArray[0:num], bArray[0:num]) map(alloc:cArray[0:num])
{   // aArray, bArray, cArray are allocated on Device memory, and the elements of aArray & bArray are transferred from CPU to Device
    #pragma omp target                            // Use aArray, bArray, cArray on Device memory
    #pragma omp parallel for                      // Runs on Device
    for (int i = 0 ; i < num ; i++){
        double sum = 0.0;
        for (int j = 0 ; j < num ; j++){
            sum += aArray[i] * bArray[j];
        }
        cArray[i] = sum;
    }
    // Compute ||C||. Host gets the result.
    double cNorm = 0.0;
    #pragma omp target                            // Use aArray, bArray, cArray on Device memory
    #pragma omp parallel for reduction(+:cNorm)   // Runs on Device
    for (int i = 0 ; i < num ; i++){
        cNorm += cArray[i] * cArray[i];
    }
    cNorm = std::sqrt(cNorm);                     // Runs on CPU
    std::cout << "||C||=" << cNorm << std::endl;  // Runs on CPU
    // Do the same on the CPU
    cNorm = 0.0;
    #pragma omp target update from(cArray[0:num]) // Transfer cArray from Device to CPU
    #pragma omp parallel for reduction(+:cNorm)   // Runs on CPU
    for (int i = 0 ; i < num ; i++){
        cNorm += cArray[i] * cArray[i];
    }
    cNorm = std::sqrt(cNorm);
    std::cout << "||C||=" << cNorm << std::endl;
}   // aArray, bArray, cArray on Device memory are freed
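In the opposite direction, target update to(...) pushes host-side changes to the device copies; a sketch, assuming it is placed inside the target data region above before its closing brace:
// Host modifies aArray, then refreshes the device copy before another offloaded loop.
for (int i = 0 ; i < num ; i++) aArray[i] *= 2.0;
#pragma omp target update to(aArray[0:num])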
Host-Device Parallelism
#include <iostream>
#include <cmath>
#include <omp.h>
int main(const int argc, const char** argv) {
    omp_set_nested(1);                           // enable nested parallelism
    int num_dev = omp_get_num_devices();
    std::cout << "number of devices " << num_dev << std::endl;
    int a[10] = { 0 };
    // One host thread per device, plus one for the host itself.
    #pragma omp parallel firstprivate(num_dev) num_threads(num_dev + 1)
    #pragma omp single
    {
        for (int dev = 0; dev < num_dev + 1; dev++) {
            #pragma omp task firstprivate(dev)   // one task per device (and one for the host)
            {
                // When dev == num_dev, the if clause disables offloading and this block runs on the host.
                #pragma omp target if (dev != num_dev) device(dev) map(to:a) map(from:a[dev:dev+1])
                {
                    #pragma omp parallel
                    {
                        #pragma omp master
                        a[dev] = omp_get_num_threads();   // thread count seen by this device (or the host)
                    }
                }
            }
        }
    }
    for (int i = 0; i < num_dev + 1; i++) {
        std::cout << a[i] << std::endl;
    }
    return 0;
}
'if' in pragma directive
#pragma omp target if (dev != num_dev) device(dev) map(to:a) map(from:a[dev:dev+1])
In this case, when dev is equal to num_dev, the target directive is ignored, so the following block of code runs on the host (CPU).
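More generally, the if clause can fall back to the host at run time, for example when the problem is too small to justify the transfer cost; a minimal sketch (the threshold is arbitrary):
// Offload only for large problems; otherwise the loop runs on the host.
#pragma omp target if (num > 100000) device(0) map(tofrom:a[0:num])
#pragma omp parallel for
for (int i = 0; i < num; i++) {
    a[i] = 0.5 * a[i];
}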
