Detecting Inter-Thread Communication

In a multicore environment, inter-thread communication can provide valuable insights into application performance. Prior work on detecting inter-thread communication relies on either hardware simulators or binary instrumentation; both techniques incur substantial space and time overheads, which makes them impractical for real-life applications. Instead, we take a completely different approach that leverages hardware performance counters and debug registers to detect the communication volume between threads. The information generated by our tool can be used to guide optimizations, understand performance behavior, and compare architectural features. We presented the design details of our tool, along with experimental results on small to very large applications, in a paper published at SC'19, where the work was nominated for both the Best Paper and Best Student Paper awards. The tool is available on GitHub.

Collaborators: Milind Chabbi


Performance Optimizations for Machine Learning Applications

Machine learning algorithms successfully address different types of problems in various fields. Since machine learning algorithms consist of complex data structures processed in an iterative fashion, performance optimizations play a crucial role in reducing their execution time. We develop performance optimizations and performance models for machine learning applications.


Collaborators: Wahib Attia

Prior Projects

Data Placement on Heterogeneous Memory Systems

Heterogeneous memory systems are equipped with two or more types of memories, which work in tandem to complement each other's capabilities. We study various data placement schemes to assist the programmer in making decisions about program object allocations on heterogeneous memory systems.


TiDA and TiDA-acc: Tiling Abstraction for Data Arrays for CPU and GPU

TiDA is a programming abstraction that centralizes tiling information within array data types with minimal changes to the source code. The metadata about the data layout can be used by the compiler and runtime to automatically manage parallelism and optimize data locality. TiDA targets NUMA and coherence-domain issues on massively parallel multicore chips.

Collaborators: Tan Nguyen and John Shalf at Berkeley Lab


Asynchronous Runtime System for AMR


Perilla is a data-driven, task-graph-based runtime system that exploits metadata from the AMReX AMR framework and the TiDA tiling library. Perilla uses this metadata to enable various optimizations at the communication layer, allowing programmers to achieve significant performance improvements with only a modest amount of programming effort.

Collaborators: Tan Nguyen and John Shalf at Berkeley Lab



EmbedSanitizer: Runtime Race Detection for 32-bit Embedded ARM


EmbedSanitizer is a tool for detecting data races in 32-bit ARM-based multithreaded C/C++ applications. We motivate the idea of detecting data races in embedded systems software natively, without virtualization, emulation, or an alternative host architecture. This provides more precise results and higher throughput, and hence enhanced developer productivity.


Contributors: Hassan Salehe Matar, Didem Unat, Serdar Tasiran

Scalable 3D Front Tracking Method


Front tracking is an Eulerian-Lagrangian method for simulating multiphase flows. The method is known for its accurate calculation of interfacial physics and its conservation of mass. Parallelizing the front tracking method is challenging because two types of grids (structured and unstructured) must be handled at the same time. Our scalable 3D front tracking implementation optimizes the different types of communication that arise in the parallel method.

Collaborators: Metin Muradoğlu and Daulet Izbassarov at Koç University



ExaSAT: A Performance Modeling Framework for ExaScale Co-design


ExaSAT is a comprehensive modeling framework for qualitatively assessing the sensitivity of exascale applications to different hardware resources. It can statically analyze an application and gather key characteristics about its computation, communication, data access patterns, and data locality. The framework explores design trade-offs and extrapolates application requirements to potential hardware realizations in the exascale timeframe (2020). Finally, ExaSAT forms the groundwork for more detailed studies involving architectural simulations of different system design points.

Collaborators: Cy Chan, John Shalf and John Bell at Berkeley Lab

Mint Programming Model for GPUs

Mint is a domain-specific programming model and translator that generates highly optimized CUDA C from annotated C source. Mint includes an optimizer that targets 3D stencil methods. The translator generates both host and device code and handles the memory management details, including host-device and shared-memory optimizations. Mint parallelizes loop nests in appropriately annotated C source, performing domain-specific optimizations important in three-dimensional problems. For more information, see the Mint website, our paper, and the thesis.

Collaborators: Scott Baden at Univ. of California, San Diego and Xing Cai at Simula Research Lab