6th Programming and Abstractions for Data Locality Workshop

5 September 2023, Tuesday (DAY 2)


09:00 – 10:20 | Session 3 – Programming Models and Runtime Systems

09:00 – 09:20 | ‘Towards an Automated Task-Size Adapting Runtime System’ by Wu Feng (Virginia Tech)

As evidenced in part by the 5th Programming and Abstractions for Data Locality Workshop in 2022, heterogeneity in high-performance computing (HPC) continues to increase at all levels, from laptops to supercomputers. In turn, this heterogeneity leads to two levels of machine imbalance: (1) inter-heteroprocessor imbalance, e.g., the CPU idles while the GPU executes, and (2) intra-heteroprocessor imbalance, e.g., an irregular workload executes on the GPU. As a consequence, efficiently managing such machine imbalance has become increasingly complex. We propose to reduce this complexity by adaptively worksharing across compute resources at runtime without requiring any transformation of the code. The experimental results of our adaptive runtime system on regular workloads show performance improvements of up to 3x over a current state-of-the-art heterogeneous task scheduler, as well as linear performance scaling from a single GPU to four GPUs. Next steps include testing on irregular workloads (with and without significant data movement between heterogeneous processors) and porting the adaptive runtime system from OpenMP/OpenACC and loop-based regions to SYCL.
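As a rough illustration of the adaptive worksharing idea described above (a hedged sketch, not the authors' runtime; the SAXPY kernel, the function names, and the throughput-based rebalancing rule are all assumptions for illustration), the CUDA/OpenMP fragment below splits each pass of a loop between GPU and CPU and adjusts the split from the measured execution times. It assumes a device that allows concurrent access to managed memory.

```cuda
// Hypothetical sketch of adaptive CPU/GPU worksharing (not the authors' runtime):
// each pass, iterations [0, split) run on the GPU and [split, n) on the CPU
// concurrently; the split is rebalanced from measured throughputs so both sides
// finish at roughly the same time on the next pass.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy_gpu(float a, const float* x, float* y, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

void saxpy_cpu(float a, const float* x, float* y, size_t lo, size_t hi) {
    #pragma omp parallel for
    for (size_t i = lo; i < hi; ++i) y[i] = a * x[i] + y[i];
}

// One worksharing pass over managed-memory arrays; gpu_frac is updated in place.
void adaptive_pass(float a, float* x, float* y, size_t n, double& gpu_frac) {
    size_t split = std::clamp<size_t>((size_t)(gpu_frac * n), 1, n - 1);

    cudaEvent_t g0, g1;
    cudaEventCreate(&g0); cudaEventCreate(&g1);
    cudaEventRecord(g0);
    saxpy_gpu<<<(unsigned)((split + 255) / 256), 256>>>(a, x, y, split);
    cudaEventRecord(g1);

    auto c0 = std::chrono::steady_clock::now();
    saxpy_cpu(a, x, y, split, n);                 // overlaps with the GPU kernel
    auto c1 = std::chrono::steady_clock::now();
    cudaEventSynchronize(g1);

    float gpu_ms = 0.0f;
    cudaEventElapsedTime(&gpu_ms, g0, g1);
    double cpu_ms = std::chrono::duration<double, std::milli>(c1 - c0).count();

    // Rebalance by relative throughput (iterations per millisecond).
    double gpu_rate = split / std::max(gpu_ms, 1e-3f);
    double cpu_rate = (n - split) / std::max(cpu_ms, 1e-3);
    gpu_frac = gpu_rate / (gpu_rate + cpu_rate);
    cudaEventDestroy(g0); cudaEventDestroy(g1);
}

int main() {
    size_t n = 1 << 24;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));     // assumes concurrent managed access
    cudaMallocManaged(&y, n * sizeof(float));
    std::fill(x, x + n, 1.0f); std::fill(y, y + n, 2.0f);

    double gpu_frac = 0.5;                        // start with an even split
    for (int pass = 0; pass < 10; ++pass) {
        adaptive_pass(3.0f, x, y, n, gpu_frac);
        std::printf("pass %d: GPU share = %.2f\n", pass, gpu_frac);
    }
    cudaFree(x); cudaFree(y);
    return 0;
}
```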

09:20 – 09:40 | ‘Locality-Aware Task Scheduling and Global Address Space in the Itoyori Runtime System’ by Shumpei Shiina (The University of Tokyo)

Itoyori is a global-view task-parallel runtime system for distributed memory. It supports global fork-join task parallelism, in which tasks are dynamically migrated across computing nodes, and it implements the partitioned global address space (PGAS) model. However, naively combining these aspects can lead to poor performance due to problems related to data locality. The first data-locality problem arises because most PGAS systems cannot exploit the spatial and temporal locality of global memory access from multiple tasks scheduled on the same process. The second data-locality problem is that the commonly used work-stealing scheduler does not take the memory hierarchy into account, resulting in suboptimal data locality on deep memory hierarchies. Itoyori addresses both of these data-locality problems, thereby achieving high scalability and efficiency. We demonstrate its high productivity and performance by porting an existing shared-memory task-parallel implementation of the fast multipole method (FMM) to distributed memory.
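To make the first data-locality problem concrete, here is a hedged, single-process mock in plain C++ (not the Itoyori API; `GlobalArray`, `checkout`, and the block size are invented for illustration) of checkout/checkin-style software caching: the first task that touches a block of global memory pays one coarse-grained remote fetch, and subsequent tasks scheduled on the same process are served from the local cache.

```cpp
// Hypothetical single-process mock of software caching for PGAS accesses:
// repeated tasks on the same process reuse locally cached blocks instead of
// issuing fine-grained remote accesses for every element.
#include <cstdio>
#include <unordered_map>
#include <vector>

constexpr size_t BLOCK = 1024;                    // granularity of remote fetches

struct GlobalArray {
    std::vector<double> remote;                   // stands in for memory on another node
    std::unordered_map<size_t, std::vector<double>> cache;  // block id -> local copy
    size_t fetches = 0;

    explicit GlobalArray(size_t n) : remote(n, 1.0) {}

    // "Checkout": return a cached block, fetching it once per process if missing.
    const std::vector<double>& checkout(size_t block_id) {
        auto it = cache.find(block_id);
        if (it == cache.end()) {
            ++fetches;                            // one coarse-grained transfer, not per element
            std::vector<double> buf(remote.begin() + block_id * BLOCK,
                                    remote.begin() + (block_id + 1) * BLOCK);
            it = cache.emplace(block_id, std::move(buf)).first;
        }
        return it->second;
    }
};

// Each "task" reads one block; tasks scheduled on the same process share the cache.
double task_sum(GlobalArray& a, size_t block_id) {
    const auto& blk = a.checkout(block_id);
    double s = 0.0;
    for (double v : blk) s += v;
    return s;
}

int main() {
    GlobalArray a(8 * BLOCK);
    double total = 0.0;
    for (int rep = 0; rep < 4; ++rep)             // repeated tasks touch the same blocks
        for (size_t b = 0; b < 8; ++b) total += task_sum(a, b);
    std::printf("sum = %.0f, remote fetches = %zu (instead of %zu)\n",
                total, a.fetches, (size_t)(4 * 8));
    return 0;
}
```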

09:40 – 10:00 | ‘Leveraging Ray Casting for Task Splitting over Processing Elements’ by Mohamed Wahib (RIKEN Center for Computational Science)

Task splitting based on task dependency analysis is a critical aspect of task-based runtime systems, as it significantly impacts performance. An effective task splitting algorithm should allocate tasks to processing elements (PEs) in a way that improves data locality and minimizes the overhead caused by data communication. Traditionally, this analysis is performed on a task dependency graph, represented as a sparse matrix and processed with complex algorithms, which makes it difficult to accelerate. We propose a novel approach that enhances performance by modeling the task dependency analysis problem as a visibility problem and employing ray casting to extract the dependencies and split the tasks.

In this presentation, we delve into the concept of using ray casting for task dependency analysis and splitting, explaining why it offers a promising alternative to conventional methods and what potential advantages it holds over existing techniques.

10:00 – 10:20 | ‘Give us cache, we give you bandwidth!’ by Hatem Ltaief (KAUST)

This talk presents a tour de force of recent numerical algorithms and their impact on solving scientific problems on complex hardware architectures with increasingly imbalanced flops-to-words ratios. Based on algebraic compression, we illustrate the algorithmic approach on seismic imaging, computational astronomy, and climate modeling using x86 and accelerator-based systems. We also assess the impact on energy efficiency and identify the need for cross-disciplinary expertise to further address one of the most urgent challenges faced by the scientific community.

10:20 – 11:00 | Break

11:00 – 11:20 | ‘CPU-Free Execution Model to Program Multi-GPUs’ by Didem Unat (Koç University)

In typical multi-GPU setups, the host manages execution, kernel launches, communication, and synchronization. However, this orchestration leads to unnecessary overhead. We propose a CPU-free model that delegates control to the devices, enhancing communication-heavy applications. By employing techniques like persistent kernels, specialized thread blocks, and device-initiated communication, we create autonomous multi-GPU code with significantly reduced communication overhead. Demonstrated on two popular solvers, a 2D/3D Jacobi stencil and Conjugate Gradient (CG), our CPU-free model improves 3D stencil communication latency by 58.8% and achieves a 1.63x CG speedup on 8 NVIDIA A100 GPUs. Code is available at: https://github.com/ParCoreLab/CPU-Free-model.
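As a hedged sketch of the persistent-kernel technique mentioned above (illustrative only; the actual code is in the repository linked above, and this example assumes a 1D Jacobi update and a device with cooperative-launch support), the host makes a single cooperative launch and the time-step loop, including grid-wide synchronization, runs entirely on the device, so no per-step kernel launches or host round trips are needed.

```cuda
// Minimal persistent-kernel sketch (illustrative only, not the authors' code):
// one cooperative launch; the time loop and grid-wide barrier live on the device.
#include <cooperative_groups.h>
#include <cuda_runtime.h>
namespace cg = cooperative_groups;

__global__ void jacobi_persistent(float* in, float* out, int n, int steps) {
    cg::grid_group grid = cg::this_grid();
    int tid = (int)grid.thread_rank();
    int stride = (int)grid.size();
    for (int t = 0; t < steps; ++t) {
        for (int i = tid + 1; i < n - 1; i += stride)
            out[i] = 0.5f * (in[i - 1] + in[i + 1]);   // 1D Jacobi update
        grid.sync();                                    // device-side barrier replaces a new launch
        float* tmp = in; in = out; out = tmp;           // swap buffers without returning to the host
    }
    // After an even number of steps, the latest data is back in `in`.
}

int main() {
    int n = 1 << 20, steps = 100;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMemset(a, 0, n * sizeof(float));
    cudaMemset(b, 0, n * sizeof(float));

    // A cooperative launch must be fully co-resident: size the grid to the
    // number of blocks the device can run at once.
    int dev = 0, sms = 0, blocksPerSm = 0, block = 256;
    cudaGetDevice(&dev);
    cudaDeviceGetAttribute(&sms, cudaDevAttrMultiProcessorCount, dev);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, jacobi_persistent, block, 0);

    void* args[] = { &a, &b, &n, &steps };
    cudaLaunchCooperativeKernel((void*)jacobi_persistent,
                                dim3(sms * blocksPerSm), dim3(block), args, 0, 0);
    cudaDeviceSynchronize();
    cudaFree(a); cudaFree(b);
    return 0;
}
```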

11:20 – 11:40 | ‘Overcoming the Gap Between Compute and Memory Bandwidth in Modern GPUs’ by Lingqi Zhang (Tokyo Institute of Technology)

The imbalance between compute and memory bandwidth has been a long-standing issue. Despite efforts to address it, the gap between the two has continued to widen. This has led to the categorization of many applications as memory-bound kernels.

We seek to exploit the latest GPU features to optimize memory-bound kernels. Specifically, we introduce strategies that extend the lifetime of kernels across time steps to take advantage of the large volume of on-chip resources. Additionally, we propose exploiting only the minimal level of parallelism needed, in order to maximize the on-chip resources available to the resident work.

This talk will cover our proposals, built on top of these strategies, which have shown outstanding performance on recent GPU architectures (NVIDIA V100 and A100 GPUs).
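As a hedged, generic illustration of extending a kernel's lifetime across time steps (not the speaker's implementation; the 1D Jacobi update, tile size, and fused-step count are assumptions), the sketch below keeps a tile plus halo in shared memory and advances several time steps on chip before writing back, trading a slightly wider load for fewer global-memory round trips.

```cuda
// Generic temporal-blocking sketch (illustrative only): each block loads a 1D
// tile plus a halo of TSTEPS cells per side into shared memory, advances TSTEPS
// Jacobi time steps on chip, then writes back the valid interior once.
// Domain-boundary handling is simplified in this sketch.
#include <cuda_runtime.h>

constexpr int TILE = 256;    // interior cells written back per block
constexpr int TSTEPS = 4;    // time steps fused into one launch
constexpr int HALO = TSTEPS; // one extra cell per side per fused step

__global__ void jacobi_temporal(const float* in, float* out, int n) {
    __shared__ float buf[2][TILE + 2 * HALO];
    int base = blockIdx.x * TILE - HALO;             // leftmost cell this block needs
    // Cooperative load of tile + halo (clamped at the domain boundary).
    for (int j = threadIdx.x; j < TILE + 2 * HALO; j += blockDim.x) {
        int g = min(max(base + j, 0), n - 1);
        buf[0][j] = in[g];
    }
    __syncthreads();
    // Advance TSTEPS steps entirely in shared memory; the valid region shrinks
    // by one cell per side per step, which the halo was sized to absorb.
    for (int t = 0; t < TSTEPS; ++t) {
        int src = t & 1, dst = 1 - src;
        for (int j = threadIdx.x + 1; j < TILE + 2 * HALO - 1; j += blockDim.x)
            buf[dst][j] = 0.5f * (buf[src][j - 1] + buf[src][j + 1]);
        __syncthreads();
    }
    // Write back only the interior cells this block owns.
    for (int j = threadIdx.x; j < TILE; j += blockDim.x) {
        int g = blockIdx.x * TILE + j;
        if (g > 0 && g < n - 1) out[g] = buf[TSTEPS & 1][j + HALO];
    }
}

int main() {
    int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMemset(a, 0, n * sizeof(float));
    jacobi_temporal<<<(n + TILE - 1) / TILE, 128>>>(a, b, n);  // one launch covers TSTEPS steps
    cudaDeviceSynchronize();
    cudaFree(a); cudaFree(b);
    return 0;
}
```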

11:40 – 12:30 | PANEL 3 (Moderator: John Shalf)