OpenACC enables rapid transition of serial C/C++/Fortran code into GPU-enabled parallel code. However, due to its high-level nature, OpenACC does not expose GPU-specific features that are useful for debugging, optimization, and other purposes. In this article we demonstrate how to call CUDA device functions from within OpenACC kernels, using two examples: retrieving the GPU compute grid configuration and device-side printf.
In the OpenACC source file, add forward declarations of our CUDA device functions:
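A minimal sketch of what these declarations might look like in the OpenACC C source (the function names and signatures here are illustrative assumptions, not the article's exact code). The `#pragma acc routine seq` directive tells the OpenACC compiler that each function is a device routine callable from within a parallel region:

```c
/* Forward declarations of the CUDA device functions defined in
   compute_grid.cu and print.cu (names and signatures are illustrative). */
#pragma acc routine seq
extern void compute_grid(void);

#pragma acc routine seq
extern void print_value(int i, float value);
```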
Now, we can call these functions from the OpenACC parallel loop:
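For example, a loop along these lines (the array, its size, and the exact call sites are assumptions) exercises both helpers from inside the OpenACC kernel:

```c
#include <stdlib.h>

int main(void)
{
    const int n = 128;
    float* x = (float*)malloc(n * sizeof(float));

    #pragma acc parallel loop copy(x[0:n])
    for (int i = 0; i < n; i++)
    {
        x[i] = 2.0f * i;
        compute_grid();        /* report the kernel's grid configuration */
        print_value(i, x[i]);  /* device-side printf */
    }

    free(x);
    return 0;
}
```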
The CUDA device functions themselves are defined in separate compute_grid.cu and print.cu files:
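Sketches of the two files are shown below; the exact bodies are assumptions, but the key points are the `__device__` qualifier and `extern "C"` linkage (the .cu files are compiled as C++, while the OpenACC source above is C):

```cuda
/* compute_grid.cu: report the compute grid configuration from the device. */
#include <cstdio>

extern "C" __device__ void compute_grid(void)
{
    /* Print once per kernel launch, from the very first thread. */
    if (blockIdx.x == 0 && threadIdx.x == 0)
        printf("gridDim = (%d, %d, %d), blockDim = (%d, %d, %d)\n",
            gridDim.x, gridDim.y, gridDim.z,
            blockDim.x, blockDim.y, blockDim.z);
}
```

```cuda
/* print.cu: device-side printf helper. */
#include <cstdio>

extern "C" __device__ void print_value(int i, float value)
{
    printf("x[%d] = %f\n", i, value);
}
```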
The last building block is a Makefile to compile and link everything together. Note the "rdc" flag, which is needed to produce relocatable (linkable) CUDA device code:
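A minimal sketch of such a Makefile, assuming the NVIDIA HPC SDK compilers (nvc for the OpenACC part, nvcc for the CUDA part); flag spellings vary between compiler versions, so treat this as a starting point rather than the article's exact build:

```makefile
# Build the OpenACC main program and the CUDA device functions,
# then link them together with relocatable device code (rdc) enabled.
all: test

test: main.o compute_grid.o print.o
	nvc -acc -gpu=rdc -cuda -o $@ $^

main.o: main.c
	nvc -acc -gpu=rdc -c $<

compute_grid.o: compute_grid.cu
	nvcc -rdc=true -c $<

print.o: print.cu
	nvcc -rdc=true -c $<

clean:
	rm -f test *.o
```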
Now our test program can show the compute grid configuration and print from within the OpenACC code, neither of which is directly supported by OpenACC itself.
A small problem (n = 128) takes only one block:
A larger problem (n = 256) takes two blocks (OpenACC evidently uses blocks of (128, 1, 1) threads):