6th Programming and Abstractions for Data Locality Workshop

4 September 2023, Monday (DAY 1)


13:30 – 15:15 | Session 1 – Hardware Perspective

13:30 – 13:50 | ‘Performance Portability in the Age of Extreme Heterogeneity’ by John Shalf (Lawrence Berkeley National Laboratory)

Moore’s Law is a techno-economic model that has enabled the IT industry to double the performance and functionality of digital electronics roughly every two years at fixed cost, power, and area. This expectation has led to a relatively stable ecosystem (e.g. electronic design automation tools, compilers, simulators and emulators) built around general-purpose processor technologies, such as the x86, ARM and Power instruction set architectures. However, the historical performance improvements offered by successive generations of lithography are waning, while the costs of new chip generations are growing rapidly. In the near term, the most practical path to continued performance growth will be architectural specialization in the form of many different kinds of accelerators. New software implementations, and in many cases new mathematical models and algorithmic approaches, are necessary to advance the science that can be done with these specialized architectures. This trend will not only continue but intensify: the transition from multi-core systems to hybrid systems has already caused many teams to re-factor and redesign their implementations. The next step, to systems that exploit not just one type of accelerator but a full range of heterogeneous architectures, will require more fundamental and disruptive changes in algorithms and software. This applies to the broad range of algorithms used in simulation, data analysis and learning. New programming models or low-level software constructs that hide the details of the architecture from the implementation can make future programming less time-consuming, but they will neither eliminate nor, in many cases, even mitigate the need to redesign algorithms. Future software development will not be tractable if a completely different code base is required for each variant of a specialized system.

The aspirational goal of “minimizing the number of lines of code that must be changed to migrate to different systems with different arrangements of specialization” is encapsulated in the loaded phrase “Performance Portability.” However, performance portability is likely not achievable with imperative languages like Fortran and C/C++: there is simply not enough flexibility built into the specification of the algorithm for a compiler to do anything other than what the algorithm designer explicitly stated in the code. Making this future of diverse accelerators usable and accessible will require the co-design of new compiler technology and domain-specific languages (DSLs) designed around the requirements of the target computational motifs. The higher levels of abstraction and declarative semantics offered by DSLs provide more degrees of freedom to map algorithms optimally onto diverse hardware than traditional imperative languages, which over-prescribe the solution. Because this will drastically increase the complexity of the mapping problem, new mathematics for optimization will be needed, along with better performance introspection (both hardware and software mechanisms for online performance introspection) through extensions to the roofline model. Use of ML/AI technologies will be essential to enable analysis and automation of dynamic optimizations.
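
To make the roofline bound concrete, the short Python sketch below computes the classic model: attainable performance is the minimum of the machine’s peak compute rate and its memory bandwidth multiplied by a kernel’s arithmetic intensity. The peak figures are invented placeholders rather than measurements of any particular system, and the sketch illustrates only the basic model, not the extensions the abstract alludes to.

    # Minimal roofline-model sketch: attainable performance is bounded by
    # either peak compute or peak memory bandwidth times arithmetic intensity.
    # All numbers below are hypothetical placeholders for illustration.

    PEAK_GFLOPS = 19500.0   # assumed peak double-precision rate (GFLOP/s)
    PEAK_BW_GBS = 1555.0    # assumed peak memory bandwidth (GB/s)

    def attainable_gflops(arithmetic_intensity: float) -> float:
        """Roofline bound for a kernel with the given FLOP/byte ratio."""
        return min(PEAK_GFLOPS, PEAK_BW_GBS * arithmetic_intensity)

    # A kernel is memory-bound left of the "ridge point" and compute-bound to its right.
    ridge = PEAK_GFLOPS / PEAK_BW_GBS
    for ai in (0.25, 1.0, ridge, 64.0):
        bound = "memory-bound" if ai < ridge else "compute-bound"
        print(f"AI = {ai:6.2f} FLOP/byte -> {attainable_gflops(ai):9.1f} GFLOP/s ({bound})")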

13:50 – 14:10 | ‘Modular Supercomputing: balancing applications on disaggregated heterogeneous resources’ by Estela Suarez (Forschungszentrum Juelich GmbH)

The Modular Supercomputing Architecture (MSA) integrates diverse hardware elements, including CPUs, GPUs, many-core accelerators, and disruptive technologies, within compute modules tailored to optimize performance for specific application classes and user demands. These modules are interconnected via a high-speed network and share a common software stack, creating a unified machine that enables users to customize their applications’ hardware resources by dynamically choosing the number of nodes in each module.

The presentation will delve into the core features of MSA, highlighting its advanced scheduler and dynamic resource manager, which maximize system utilization by intelligently allocating resources to jobs. Participants will gain insights into the global system-software and programming environment that facilitates the execution of multi-physics or multi-scale simulations across compute modules. By distributing application workflows to the most suitable hardware, the MSA ensures a balanced workload across the system, leveraging the intrinsic concurrency of each application part.

Real-world examples of running MSA systems will be presented to showcase the architecture’s hardware and software elements in action. Moreover, the talk will offer an outlook on the MSA’s potential in the Exascale computing era, promising new possibilities for high-performance computing.
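
As a toy illustration of the workflow-to-module matching the abstract describes, the Python sketch below greedily places each part of a hypothetical multi-physics workflow on the module declared to suit it best. The module names, node counts, and matching rule are all invented for illustration; the actual MSA scheduler and dynamic resource manager are far more sophisticated.

    # Toy sketch of the MSA idea: route each part of a workflow to the compute
    # module whose hardware best matches it. Module names, node counts, and the
    # matching rule are invented for illustration; this is not the real scheduler.

    MODULES = {              # hypothetical module -> free nodes
        "cluster":  500,     # general-purpose CPU nodes
        "booster": 1000,     # GPU-accelerated nodes
        "ml":        50,     # many-core / AI-oriented nodes
    }

    # Each workflow component declares which module suits it and how many nodes it needs.
    WORKFLOW = [
        ("mesh_refinement",  "cluster", 64),   # latency-sensitive, branchy code
        ("pde_solver",       "booster", 256),  # dense, highly parallel kernels
        ("inline_inference", "ml",      8),    # learned surrogate model
    ]

    def allocate(workflow, modules):
        """Greedy allocation: grant each component its preferred module if nodes remain."""
        placement, free = {}, dict(modules)
        for name, module, nodes in workflow:
            if free.get(module, 0) < nodes:
                raise RuntimeError(f"not enough free nodes in module '{module}' for {name}")
            free[module] -= nodes
            placement[name] = (module, nodes)
        return placement

    for task, (module, nodes) in allocate(WORKFLOW, MODULES).items():
        print(f"{task:16s} -> {nodes:4d} nodes on the {module} module")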

14:10 – 14:30 | ‘Locality abstractions from GPUs to data centers’ by CJ Newburn (NVIDIA)

Programming abstractions are needed to accommodate users’ limited capacity for dealing with underlying complexity and to create freedom for tuning by implementation experts. One source of complexity in modern GPU-based data centers is that parallel work and data must be mapped onto one or more of a dozen hierarchical layers, each of which uniquely leverages locality among computing resources through higher-bandwidth, lower-latency access to shared data structures and lower-latency control coordination. This is far too many layers to tailor for individually, so architects seek to create a small number of programming abstractions, each of which effectively spans a range of underlying parallelism support. Problems arise when there is a mismatch between the range of scale needed by applications using a given abstraction and the range of scale effectively supported for that abstraction by the underlying hardware. Those problems worsen when the scales needed by applications and the scales supported by underlying hardware structures shift over time independently of one another, leading to brittle performance characteristics. Architecting to maximize the portable longevity of code without hampering hardware evolution and innovation presents a considerable challenge to co-design!

In this talk, we’ll ground a discussion of programming abstractions for locality in the various structures and properties that shape hierarchy, from subdivisions of a GPU all the way up to an entire data center. We’ll offer a definition of self-similarity among layers and highlight where it matters to programming abstractions. We’ll highlight trade-offs between abstraction and locality control over specific resources for different roles. We’ll also show how many of the principles that are relevant within a GPU also apply at the data center level, as we examine the relative roles of workload managers, dataset services, and data orchestrators and discuss trade-offs.
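
The Python sketch below is a toy model of the placement question the abstract raises: given a hierarchy of layers from a GPU subdivision up to a data center, map a job onto the innermost layer that can hold it, since deeper layers offer higher bandwidth and lower latency to shared data. The layer names, capacities, and bandwidth figures are illustrative assumptions, not vendor specifications.

    # Toy sketch of hierarchical locality: layers from a GPU subdivision up to a
    # data center, each trading scale for bandwidth/latency to shared data.
    # The layer names and figures are illustrative, not vendor specifications.

    # (layer, max parallel workers, assumed bandwidth to shared data in GB/s)
    LAYERS = [
        ("thread block",              1_024, 20_000),
        ("thread block cluster",     16_384, 10_000),
        ("single GPU",              150_000,  3_000),
        ("multi-GPU node",        1_200_000,    900),
        ("rack",                 10_000_000,    100),
        ("data center",         500_000_000,     10),
    ]

    def smallest_layer(workers_needed: int):
        """Prefer the innermost layer that can hold the job: locality maximizes
        bandwidth and minimizes coordination latency among the workers."""
        for layer, capacity, bw in LAYERS:
            if capacity >= workers_needed:
                return layer, bw
        raise ValueError("job exceeds even the outermost layer")

    for n in (512, 100_000, 5_000_000):
        layer, bw = smallest_layer(n)
        print(f"{n:>11,d} workers -> map to one {layer} (~{bw:,} GB/s to shared data)")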

14:30 – 15:15 | PANEL 1 (Moderator: Anshu Dubey)