A Prediction Framework for Fast Sparse Triangular Solves

Sparse triangular solve (SpTRSV) is an important linear algebra kernel, finding extensive uses in numerical and scientific computing. The parallel implementation of SpTRSV is a challenging task due to the sequential nature of the steps involved. This makes it, in many cases, one of the most time-consuming operations in an application. Many approaches for efficient SpTRSV on CPU and GPU systems have been proposed in the literature. However, no single implementation or platform (CPU or GPU) gives the fastest solution for all input sparse matrices. In this work, we propose a machine learning-based framework to predict the SpTRSV implementation giving the fastest execution time for a given sparse matrix based on its structural features. The framework is tested with six SpTRSV implementations on a state-of-the-art CPU-GPU machine (Intel Xeon Gold CPU, NVIDIA V100 GPU). Experimental results, with 998 matrices taken from the SuiteSparse Matrix Collection, show the classifier prediction accuracy of 87% for the fastest SpTRSV algorithm for a given input matrix. Predicted SpTRSV implementations achieve average speedups (harmonic mean) in the range of 1.4–2.7× against the six SpTRSV implementations used in the evaluation.

Our paper received the best artifact award at the prestigious Europar 2020 conference, virtually held at Warsaw, Poland (August 24-28). GitHub Paper Presentation YouTube

Performance Optimizations For Machine Learning Applications

Machine learning algorithms successfully address different types of problems in various fields. Since machine learning algorithms consist of complex data structures processed in an iterative fashion, any performance optimizations play a crucial role to reduce their execution time. We develop performance optimizations and performance models for machine learning applications.

Collaborators: Wahib Attia

Data Placement On Heterogeneous Memory Systems

Heterogeneous memory systems are equipped with two or more types of memories, which work in tandem to complement the capabilities of each other. We study various data placement scheme to assist the programmer in making decisions about program object allocations on heterogeneous memory systems

TiDA And TiDA-Acc: Tiling Abstraction For Data Arrays For CPU And GPU

TiDA is a programming abstraction that centralizes tiling information within array data types with minimal changes to the source code. The metadata about the data layout can be used by the compiler and runtime to automatically manage parallelism and optimize data locality. TiDA targets NUMA and coherence domain issues on the massively parallel multicore chips.

Collaborators: Tan Nguyen and John Shalf at Berkeley Lab

Asynchronous Runtime System For AMR

Perilla is a data-driven task graph-based runtime system that exploits the meta-data information extended from the AMRex AMR framework and TiDA tiling library. Perilla utilizes meta-data of AMRex to enable various optimizations at the communication layer facilitating programmers to achieve significant performance improvements with only a modest amount of programming effort.

Collaborators: Tan Nguyen and John Shalf at Berkeley Lab

EmbedSanitizer: Runtime Race Detection For 32-Bit Embedded ARM

EmbedSanitizer is a tool for detecting concurrency data races in 32-bit ARM-based multithreaded C/C++ applications. We motivate the idea of detecting data races in embedded systems software natively; without virtualization or emulation or use of alternative architecture. This provides more precise results and increased throughput and hence enhanced developer productivity.

More information: https://github.com/hassansalehe/embedsanitizer

Contributors: Hassan Salehe Matar, Didem Unat, Serdar Tasiran

Scalable 3D Front Tracking Method

Front tracking is an Eulerian-Lagrangian method for simulation of multiphase flows. The method is known for its accurate calculation of interfacial physics and conservation of mass. Parallelization of front tracking method is challenging because two types (structured and unstructured) of grids need to be handled at the same time. Scalable 3d front tracking method is implemented to optimize different types of communication that arises with parallel implementation of the method.

Collaborators: Metin Muradoğlu and Daulet Izbassarov at Koç University

ExaSAT: A Performance Modeling Framework For ExaScale Co-Design

ExaSAT is a comprehensive modeling framework to qualitatively assess the sensitivity of exascale applications to different hardware resources. It can statically analyze an application and gather key characteristics about the computation, communication, data access patterns and data locality. The framewore explores design trade-offs, and extrapolate application requirements to potential hardware realizations in the exascale timeframe (2020). Finally, ExaSAT forms the groundwork for more detailed studies involving architectural simulations of different system design points. Project webiste: http://www.codexhpc.org/?p=98

Collaborators: Cy Chan, John Shalf and John Bell at Berkeley Lab

Mint Programming Model For GPUs

Mint is a domain-specific programming model and translator that generates highly optimized CUDA C from annotated C source. Mint includes an optimizer that targets 3D stencil methods. The translator generates both host and device code and handles the memory management details including host-device and shared memory optimizations. Mint parallelizes loop-nests in appropriately annotated C source, performing domain-specific optimizations important in three-dimensional problems. For more info, visit Mint website, read our paper and thesis.

Collaborators: Scott Baden at Univ. of California, San Diego and Xing Cai at Simula Research Lab