Deep Learning models are compute-and-memory-intensive. Due to their widespread, training them efficiently while obeying the memory constraints of the used processing elements has invaluable benefits. While meeting the memory constraints permits exploring new architectures; efficient training enables conducting faster, cost-and-energy-effective research. This project focuses on applying generic, and system-level optimizations to achieve these two goals.
Collaborators: Wahib Attia (AIST, Japan)
In single-node multi-GPU systems, communication is a critical programming component and performance contributor. To handle the communication between multiple GPUs, CUDA API offers various data transfer options to the programmer under the hood of Unified Virtual Addressing (UVA), Zero-copy Memory and Unified Memory paradigms. This project focuses on monitoring, identifying, and quantify different types of communication among GPU devices.
The tool is available at https://github.com/ParCoreLab/ComScribe
In a multicore environment, inter-thread communication can provide valuable insights about application performance. Literature detecting inter-thread communication either employs hardware simulators or binary instrumentation. Those techniques bring both space and time overhead, which makes them impractical to use on real-life applications. Instead, we take a completely different approach that leverages hardware performance counters and debug registers to detect communication volume between threads. This work is nominated for the best paper and the best student paper at SC19. Tool is available at github.
Collaborators: Milind Chabbi (Uber)
The parallel implementation of SpTRSV is a challenging task due to the sequential nature of the steps involved. Many approaches for efficient SpTRSV on CPU and GPU systems have been proposed in the literature. However, no single implementation or platform (CPU or GPU) gives the fastest solution for all input sparse matrices. In this work, we propose a machine learning-based framework to predict the SpTRSV implementation giving the fastest execution time for a given sparse matrix based on its structural features. The framework is tested with six SpTRSV implementations on a state-of-the-art CPU-GPU machine (Intel Xeon Gold CPU, NVIDIA V100 GPU). Predicted SpTRSV implementations achieve average speedups (harmonic mean) in the range of 1.4–2.7× against the six SpTRSV implementations used in the evaluation.
Heterogeneous memory systems are equipped with two or more types of memories, which work in tandem to complement the capabilities of each other. We study various data placement scheme to assist the programmer in making decisions about program object allocations on heterogeneous memory systems
TiDA is a programming abstraction that centralizes tiling information within array data types with minimal changes to the source code. The metadata about the data layout can be used by the compiler and runtime to automatically manage parallelism and optimize data locality. TiDA targets NUMA and coherence domain issues on the massively parallel multicore chips.
Collaborators: Tan Nguyen and John Shalf at Berkeley Lab
Perilla is a data-driven task graph-based runtime system that exploits the meta-data information extended from the AMRex AMR framework and TiDA tiling library. Perilla utilizes meta-data of AMRex to enable various optimizations at the communication layer facilitating programmers to achieve significant performance improvements with only a modest amount of programming effort.
Collaborators: Tan Nguyen and John Shalf at Berkeley Lab
EmbedSanitizer is a tool for detecting concurrency data races in 32-bit ARM-based multithreaded C/C++ applications. We motivate the idea of detecting data races in embedded systems software natively; without virtualization or emulation. This provides more precise results and increased throughput and hence enhanced developer productivity.
More information: https://github.com/hassansalehe/embedsanitizer
Contributors: Hassan Salehe Matar, Didem Unat, Serdar Tasiran
Front tracking is an Eulerian-Lagrangian method for simulation of multiphase flows. The method is known for its accurate calculation of interfacial physics and conservation of mass. Parallelization of front tracking method is challenging because two types (structured and unstructured) of grids need to be handled at the same time. Scalable 3d front tracking method is implemented to optimize different types of communication that arises with parallel implementation of the method.
Collaborators: Metin Muradoğlu and Daulet Izbassarov at Koç University
ExaSAT is a comprehensive modeling framework to qualitatively assess the sensitivity of exascale applications to different hardware resources. It can statically analyze an application and gather key characteristics about the computation, communication, data access patterns and data locality. The framewore explores design trade-offs, and extrapolate application requirements to potential hardware realizations in the exascale timeframe (2020). Finally, ExaSAT forms the groundwork for more detailed studies involving architectural simulations of different system design points. Project webiste: http://www.codexhpc.org/?p=98
Collaborators: Cy Chan, John Shalf and John Bell at Berkeley Lab
Mint is a domain-specific programming model and translator that generates highly optimized CUDA C from annotated C source. Mint includes an optimizer that targets 3D stencil methods. The translator generates both host and device code and handles the memory management details including host-device and shared memory optimizations. For more info, visit Mint website, read our paper and thesis.