  • Over 60 trainings across Europe for universities and industry
  • On-site trainings on the whole range of GPU computing technologies
  • Each lecture accompanied by a practical session on a remote GPU cluster
  • Best recipes of GPU code optimization, based on our 5-year development experience

We have multiple training programs and even books! Check out our catalogue here.

New 4-day Training Course on GPU-enabled Neural Networks

Category: CUDA
Published: Thursday, 17 November 2016


Deep Learning on GPUs is currently boosting AI's massive leap into real-life applications: autopilots, intelligent automated assistants, real-time translation, image recognition, data sequencing and clustering. With the unprecedented computing power of NVIDIA GPUs, many automotive, robotics and big data companies are creating products and services based on a new class of intelligent machines.

Applied Parallel Computing LLC offers a specialized 4-day course on GPU-enabled Neural Networks. The course is intended for developers who want to rapidly bring NVIDIA-based AI technology into new and existing software solutions. In 4 days we will go from the necessary theory of Recurrent and Convolutional Neural Networks to practical recipes for using the TensorFlow and Caffe frameworks. Finally, course attendees will be guided through building up an AI software stack for a real robot.

Training Course Program


Remote profiling with NVIDIA Visual Profiler on a SLURM-based cluster

Category: CUDA
Published: Wednesday, 16 November 2016

GPU-equipped clusters are often managed by the SLURM job control system. Essentially, the developer logs into the frontend node via SSH, builds the application, and then requests compute node allocation from SLURM. Once compute nodes are granted, the application is executed on them. In order to debug or profile an application, the developer is allowed to SSH from the frontend node to the individual compute nodes granted for execution. Now, how can this pipeline play together with NVIDIA Visual Profiler?
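
The pipeline in question can be sketched as follows (host and node names are hypothetical; one possible bridge to Visual Profiler is to collect an nvprof timeline on the compute node and then import it into the GUI):

```shell
# 1. Log into the frontend node and build the application there
ssh user@cluster-frontend
nvcc -O3 -lineinfo -o myapp myapp.cu

# 2. Ask SLURM for one node with a GPU (resource names vary per site)
salloc -N 1 --gres=gpu:1

# 3. Find out which compute node was granted and SSH into it
squeue -u $USER
ssh node042        # hypothetical node name

# 4. Collect a timeline with the command-line profiler;
#    the resulting file can be imported into Visual Profiler
nvprof -o myapp.nvvp ./myapp
```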

Read more: Remote profiling with NVIDIA Visual Profiler on a SLURM-based cluster

CUDA-like runtime interface for Xeon Phi

Category: CUDA
Published: Tuesday, 12 April 2016

The performance power of GPUs can be exposed to applications through two principal kinds of programming interfaces: manual parallel programming (CUDA or OpenCL), or directive-based extensions relying on the compiler's capabilities for semi-automatic parallelization (OpenACC and OpenMP 4). Unlike for GPUs, Intel has never offered an explicit CUDA-like interface for its Xeon Phi accelerators to the general public, leaving OpenMP offloading directives as the only programming option.

Based on liboffloadmic, we have prototyped "micrt" - a programming interface to execute memory transfers and kernels, similar to the CUDA runtime. Find the code example and building instructions here.


OpenMP 4.0 on NVIDIA CUDA GPUs

Category: CUDA
Published: Wednesday, 14 October 2015

Multiple presentations about OpenMP 4.0 support on NVIDIA GPUs date back to 2012. There is, however, still very limited availability of production-ready OpenMP 4.0 tools for NVIDIA devices: Intel's compilers target Xeon Phi only, PGI and Cray offer only OpenACC, and GCC support is only planned. Fortunately, a usable compiler can be built on top of LLVM and Clang. In this blog post we provide complete instructions to build an OpenMP 4.0 compiler with NVIDIA support on Linux. Note these instructions could be especially useful for building an OpenMP 4.0 compiler on platforms where OpenACC is not available, e.g. Jetson TK1.

Read more: OpenMP 4.0 on NVIDIA CUDA GPUs

Use CUDA 7.0 NVRTC with Thrust

Category: CUDA
Published: Wednesday, 29 April 2015

Runtime Compilation (NVRTC), introduced in CUDA 7.0, allows CUDA kernels to be compiled dynamically during program execution (see example). This functionality makes it possible to perform additional GPU code optimization/specialization using runtime context, e.g. to substitute constant loop bounds and unroll loops, or to eliminate divergent branches known never to be taken. However, NVRTC is not fully equivalent to offline nvcc: it only compiles CUDA device code into PTX assembly. Thus, NVRTC is not directly usable with GPU-enabled frameworks that combine both host and device code, e.g. Thrust. In this demo we show how to use NVRTC to replace a certain device function in Thrust code.
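
The NVRTC part of such a setup can be sketched as below (requires the CUDA 7.0 toolkit; error checking omitted, and the kernel string is illustrative - at runtime the loop bound would be assembled from actual context, letting the compiler fully unroll the loop):

```cpp
#include <nvrtc.h>
#include <cstdio>

int main()
{
    // Device code as a string; the loop bound is baked in at runtime.
    const char* src =
        "extern \"C\" __global__ void scale(float* x) {\n"
        "  #pragma unroll\n"
        "  for (int i = 0; i < 256; i++) x[i] *= 2.0f;\n"
        "}\n";

    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "scale.cu", 0, NULL, NULL);

    const char* opts[] = { "--gpu-architecture=compute_35" };
    nvrtcCompileProgram(prog, 1, opts);

    // NVRTC stops at PTX: loading and launching it is up to the
    // application, e.g. via the CUDA Driver API (cuModuleLoadData).
    size_t ptxSize;
    nvrtcGetPTXSize(prog, &ptxSize);
    char* ptx = new char[ptxSize];
    nvrtcGetPTX(prog, ptx);
    printf("%s", ptx);

    delete[] ptx;
    nvrtcDestroyProgram(&prog);
    return 0;
}
```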

Read more: Use CUDA 7.0 NVRTC with Thrust

Get extra 8% perf in bilinear interpolation on GPU using __restrict__ keyword

Category: CUDA
Published: Thursday, 26 March 2015

Starting from GK110 (Tesla Kepler), the "const __restrict__" annotation on a kernel argument has an extra GPU-specific meaning: accesses to that argument should go through the texture cache. As an example, we use GPU bilinear interpolation, which is a compute-bound problem. Replacing "RGBApixel* pixels" with "const RGBApixel* __restrict__ pixels" in __global__ and __device__ functions instructs the compiler to emit LDG instructions for pixel loads instead of generic LD:

$ cuobjdump -sass no_restrict | grep LD
        /*01d8*/                   LD.E R13, [R8+0x4];
        /*0210*/                   LD.E R14, [R8];
        /*0248*/                   LD.E R8, [R4+0x4];
        /*0290*/                   LD.E R11, [R4];
$ cuobjdump -sass restrict | grep LD
        /*01a0*/                   LDG.E R7, [R16];
        /*01b0*/                   LDG.E R9, [R8];
        /*01e0*/                   LDG.E R8, [R16];
        /*01f0*/                   LDG.E R5, [R14];

If a __restrict__-annotated pointer does not produce LDG instructions, the keyword is either not applicable or used incorrectly.

The code version with "const __restrict__" demonstrates an extra speedup of 8%:

$ ./no_restrict hst_lagoon_detail.bmp
Image: BMP 4778 x 4856
GPU kernel time = 0.013719 sec
$ ./restrict hst_lagoon_detail.bmp
Image: BMP 4778 x 4856
GPU kernel time = 0.012686 sec

Download our demo code here.
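
In kernel code the change amounts to a single annotation on the pointer argument; a sketch (the struct layout and kernel signature here are illustrative - see the demo code for the real version):

```cuda
// Illustrative 4-byte pixel type, as in the BMP demo.
struct RGBApixel { unsigned char Blue, Green, Red, Alpha; };

// const __restrict__ promises the compiler that the data is read-only
// and not aliased by any other pointer, so on GK110+ loads of "pixels"
// can be issued as LDG through the texture cache.
__global__ void bilinear(
    RGBApixel* output, int outWidth, int outHeight,
    const RGBApixel* __restrict__ pixels,  // was: RGBApixel* pixels
    int width, int height)
{
    // ... bilinear interpolation body unchanged (see demo code) ...
}
```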

Thrust/CUDA tip: reuse temporary buffer across multiple transforms

Category: CUDA
Published: Thursday, 09 October 2014

Thrust is a very handy STL-like template library for rapid data processing on GPUs.

In real applications it is often necessary to perform the same data processing (in Thrust's terminology, a transform) multiple times on different datasets. Transforms containing a reduction (reduce, sort, transform-reduce, etc.) require temporary array allocations. By default these allocations are performed for each individual transform, adding cudaMalloc/cudaFree operations that can be quite expensive:

Read more: Thrust/CUDA tip: reuse temporary buffer across multiple transforms
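
One way to reuse the temporary buffer across transforms is to hand Thrust a custom allocator through the execution policy, so the scratch storage is allocated once and kept alive; a minimal sketch:

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>

// Minimal caching allocator: keeps one buffer alive across transforms
// instead of letting every Thrust algorithm call cudaMalloc/cudaFree.
struct cached_allocator
{
    typedef char value_type;

    char* buffer;
    size_t bytes;

    cached_allocator() : buffer(0), bytes(0) {}

    char* allocate(std::ptrdiff_t n)
    {
        if ((size_t)n > bytes)  // grow only when needed
        {
            cudaFree(buffer);
            cudaMalloc((void**)&buffer, n);
            bytes = n;
        }
        return buffer;
    }

    void deallocate(char*, size_t) {}  // keep buffer for the next transform

    ~cached_allocator() { cudaFree(buffer); }
};

int main()
{
    cached_allocator alloc;
    thrust::device_vector<int> data(1 << 20);

    for (int i = 0; i < 10; i++)
    {
        // Temporary storage for sort now comes from the cached buffer.
        thrust::sort(thrust::cuda::par(alloc), data.begin(), data.end());
    }
    return 0;
}
```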

On-the-fly modification of LLVM IR code of CUDA sources

Category: CUDA
Published: Tuesday, 23 September 2014

Largely thanks to LLVM, recent years have seen a significant increase of interest in research & development of domain-specific compilation tools. With the release of NVIDIA's PTX backends (the open-source NVPTX and the proprietary libNVVM), constructing custom LLVM-driven compilers that generate GPU binaries has also become possible. However, two questions still remain:

  1. How to customize the CUDA source compilation?
  2. What is the NVIDIA's best set of GPU-specific LLVM optimizations and how to continue modifying IR after applying them?

In order to answer these two questions, we have created a special dynamic library. Attached to the NVIDIA CUDA compiler, this library exposes unoptimized and optimized LLVM IR code to the user and allows its on-the-fly modification. As a result, a domain-specific compiler developer gains the flexibility, e.g., to re-target CUDA-generated LLVM IR to different architectures, or to make additional modifications to the IR after NVIDIA's optimizations have run.

Source code, sample modification and description are available on our GitHub page. Tested with CUDA 6.0.


NVIDIA Visual Profiler allows connecting to a 64-bit Linux server from 32-bit Windows

Category: CUDA
Published: Sunday, 13 July 2014

In the CUDA 6.0 release, an extremely handy feature has been added to Visual Profiler: support for remote profiling. This means that you can run the profiler GUI on your local machine - a laptop or tablet (even without a GPU!) - while profiling is performed on a remote machine, e.g. a company server or university cluster.

Moreover, it does not matter which operating system the local machine is running. In our screenshot, remote profiling of a 64-bit Linux server is performed from Visual Profiler running on 32-bit Windows 8.


Jetson TK1: bandwidthTest

Category: CUDA
Published: Sunday, 15 June 2014


The chart on the left shows the bandwidths of memory transfers on Jetson TK1 (click to enlarge). As a baseline we also added GTX680M's host-device and device-host bandwidths (its peak device-device bandwidth of ~90 GB/s is too large for this chart).

Summary of Jetson TK1's bandwidthTest results, two of which are quite unusual:

  • peak device-device is ~7.5x smaller than GTX680M's
  • peak pinned host-device is ~12x smaller than GTX680M's
  • peak pinned device-host is only ~2x smaller than GTX680M's (!)
  • peak pageable host-device is ~1.5x higher than pinned host-device (!)
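
These numbers come from the stock CUDA samples' bandwidthTest; for example, the individual measurements above correspond to runs like:

```shell
# Pageable vs pinned host-to-device bandwidth
./bandwidthTest --memory=pageable --htod
./bandwidthTest --memory=pinned --htod

# Device-to-host and device-to-device peaks
./bandwidthTest --memory=pinned --dtoh
./bandwidthTest --dtod
```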


Download raw data table in ODS format.