Workshop: Scaling CUDA C++ Applications to Multiple Nodes

Timezone: Europe/Ljubljana
Domen Verber, Jani Dugonik
Description

Present-day high-performance computing (HPC) and deep learning applications benefit from, and even require, cluster-scale GPU compute power. Writing CUDA® applications that can correctly and efficiently utilize GPUs across a cluster requires a distinct set of skills. In this workshop, you will learn the tools and techniques needed to write CUDA C++ applications that scale efficiently to clusters of NVIDIA GPUs.

You’ll do this by working on code from several CUDA C++ applications in an interactive cloud environment backed by several NVIDIA GPUs. You’ll gain exposure to a handful of multi-GPU programming methods, including CUDA-aware Message Passing Interface (MPI), before proceeding to the main focus of this course, NVSHMEM™.
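As a taste of the MPI portion, here is a minimal sketch of the CUDA-aware MPI idea: device pointers are passed directly to MPI calls, so no staging through host buffers is needed. This is not workshop code; it assumes a CUDA-aware MPI build (for example, Open MPI with UCX support) and one GPU per rank, and the buffer size is illustrative.

#include <cuda_runtime.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaSetDevice(rank);  // assumes one GPU per rank on a single node

    const int N = 1 << 20;
    float *d_buf;
    cudaMalloc(&d_buf, N * sizeof(float));

    if (rank == 0) {
        // A CUDA-aware MPI library accepts the device pointer directly.
        MPI_Send(d_buf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}

A program like this would be launched with one rank per GPU, for example mpirun -np 2 ./a.out.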

NVSHMEM is a parallel programming interface based on OpenSHMEM that provides efficient and scalable communication for NVIDIA GPU clusters. NVSHMEM creates a global address space for data that spans the memory of multiple GPUs and can be accessed with fine-grained GPU-initiated operations, CPU-initiated operations, and operations on CUDA streams. NVSHMEM's asynchronous, GPU-initiated data transfers eliminate synchronization overheads between the CPU and the GPU. They also enable long-running kernels that include both communication and computation, reducing overheads that can limit an application's performance under strong scaling.
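To make that model concrete, here is a minimal NVSHMEM sketch (again, not workshop code): each PE owns one GPU, allocates a symmetric integer, and a kernel performs a GPU-initiated put of its PE ID into its right neighbor's copy. The device-selection scheme and buffer contents are simplifying assumptions.

#include <cstdio>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

__global__ void put_to_neighbor(int *sym_buf, int my_pe, int n_pes) {
    // GPU-initiated, fine-grained put into the right neighbor's copy
    // of the symmetric buffer.
    nvshmem_int_p(sym_buf, my_pe, (my_pe + 1) % n_pes);
}

int main() {
    nvshmem_init();
    int my_pe = nvshmem_my_pe();
    int n_pes = nvshmem_n_pes();
    cudaSetDevice(my_pe);  // assumes all PEs share one node, numbered from 0

    // Symmetric allocation: the same buffer exists on every PE and is
    // remotely addressable by all of them.
    int *sym_buf = (int *)nvshmem_malloc(sizeof(int));

    put_to_neighbor<<<1, 1>>>(sym_buf, my_pe, n_pes);
    cudaDeviceSynchronize();
    nvshmem_barrier_all();  // ensure all remote writes have completed

    int received = -1;
    cudaMemcpy(&received, sym_buf, sizeof(int), cudaMemcpyDeviceToHost);
    printf("PE %d received %d\n", my_pe, received);

    nvshmem_free(sym_buf);
    nvshmem_finalize();
    return 0;
}

Such a program is compiled with nvcc, linked against NVSHMEM, and launched with one process per GPU, for example via nvshmrun -np 2 ./a.out.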

At the end of the workshop, participants can obtain an official certificate from the NVIDIA Deep Learning Institute.


Workflow: The workshop takes place remotely; participants work in a browser connected to AWS cloud infrastructure.

Difficulty: Basic 

Language: English

Target audience: HPC developers using CUDA on clusters or in the cloud.

Prerequisite knowledge: Intermediate experience writing CUDA C/C++ applications.

Skills to be gained: 

By participating in this workshop, you’ll learn how to:


– Use concurrent CUDA streams to overlap memory transfers with GPU computation (see the sketch after this list).
– Scale workloads across all available GPUs on a single node.
– Combine copy/compute overlap with multiple GPUs.
– Rely on the NVIDIA® Nsight™ Systems timeline to observe improvement opportunities and the impact of the techniques covered in the workshop.
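
The first skill above, copy/compute overlap, is worth a concrete illustration. The following is a minimal sketch, not workshop code: the array size, chunk count, and kernel are illustrative assumptions, and pinned host memory is required for the asynchronous copies to actually overlap.

#include <cuda_runtime.h>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int N = 1 << 24, kStreams = 4, chunk = N / kStreams;
    float *h_data, *d_data;
    cudaMallocHost(&h_data, N * sizeof(float));  // pinned host memory
    cudaMalloc(&d_data, N * sizeof(float));

    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&streams[s]);

    // Each chunk's copy-in, kernel, and copy-out go into its own stream,
    // so chunk k's transfers can overlap chunk k-1's computation.
    for (int s = 0; s < kStreams; ++s) {
        int off = s * chunk;
        cudaMemcpyAsync(d_data + off, h_data + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_data + off, chunk);
        cudaMemcpyAsync(h_data + off, d_data + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < kStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFreeHost(h_data);
    cudaFree(d_data);
    return 0;
}

On the Nsight Systems timeline, the per-stream copies and kernels of a program like this appear staggered rather than serialized.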
 

Maximum number of participants: 30

Virtual location: MS Teams

Organizer:

Lecturers:

Name: Domen Verber

Domen Verber is an assistant professor at the Faculty of Electrical Engineering and Computer Science of the University of Maribor (UM FERI), an NVIDIA Deep Learning Institute ambassador for the University of Maribor, and its HPC specialist. He has been working on HPC and artificial intelligence for more than 25 years.

Email: domen.verber@um.si, deep.learning@um.si

 

Name: Jani Dugonik

Jani Dugonik is an academic researcher at the Faculty of Electrical Engineering, Computer Science and Informatics of the University of Maribor (UM FERI). He has been working in the fields of natural language processing and evolutionary algorithms for more than 10 years.

Email: jani.dugonik@um.si

Registration
    • 9:00 AM – 9:30 AM
      Introduction:
      – Meet the instructors.
      – Get familiar with your GPU-accelerated interactive JupyterLab environment.
      Conveners: Domen Verber, Jani Dugonik
    • 9:30 AM – 11:30 AM
      Multi-GPU Programming Paradigms: Survey multiple techniques for programming CUDA C++ applications for multiple GPUs, using a Monte Carlo π-approximation CUDA C++ program:
      – Use CUDA to utilize multiple GPUs.
      – Learn how to enable and use direct peer-to-peer memory communication (see the sketch after the timetable).
      – Write an SPMD version with CUDA-aware MPI.
    • 11:30 AM – 12:30 PM
      Lunch break
    • 12:30 PM – 2:30 PM
      Introduction to NVSHMEM: Learn how to write code with NVSHMEM and understand its symmetric memory model:
      – Use NVSHMEM to write SPMD code for multiple GPUs.
      – Utilize symmetric memory to let all GPUs access data on other GPUs.
      – Make GPU-initiated memory transfers.
    • 2:30 PM – 2:45 PM
      Coffee break
    • 2:45 PM – 4:30 PM
      Halo Exchanges with NVSHMEM: Practice common coding motifs such as halo exchanges and domain decomposition using NVSHMEM, and work on the assessment:
      – Write an NVSHMEM implementation of a Laplace equation Jacobi solver.
      – Refactor a single-GPU 1D wave equation solver with NVSHMEM.
    • 4:30 PM – 5:00 PM
      Final Review:
      – Complete the assessment and earn a certificate.
      – Review key learnings and wrap up questions.
      – Learn about application tradeoffs on GPU clusters.
      – Take the workshop survey.
      Conveners: Domen Verber, Jani Dugonik
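
For reference, the peer-to-peer technique mentioned in the Multi-GPU Programming Paradigms session boils down to the following minimal sketch. The device IDs (0 and 1) and buffer size are illustrative assumptions.

#include <cuda_runtime.h>

int main() {
    // Check whether device 0 can address device 1's memory directly.
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, 0, 1);

    if (can_access) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);  // second argument (flags) must be 0
    }

    const int N = 1 << 20;
    float *d0, *d1;
    cudaSetDevice(0);
    cudaMalloc(&d0, N * sizeof(float));
    cudaSetDevice(1);
    cudaMalloc(&d1, N * sizeof(float));

    // Direct device-to-device copy; with peer access enabled it travels
    // over NVLink/PCIe without staging through host memory.
    cudaMemcpyPeer(d1, 1, d0, 0, N * sizeof(float));

    cudaFree(d1);
    cudaSetDevice(0);
    cudaFree(d0);
    return 0;
}

Note that cudaMemcpyPeer falls back to staging through the host when direct peer access is unavailable, so the call is safe either way.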