In a multicore environment, inter-thread communication can provide valuable insights into application performance. Prior work on detecting inter-thread communication employs either hardware simulators or binary instrumentation. These techniques incur both space and time overhead, which makes them impractical for real-life applications. Instead, we take a different approach that leverages hardware performance counters and debug registers to detect the communication volume between threads. The information generated by our tool can be used to guide optimizations, understand performance behavior, and compare architectural features. We presented the design details of our tool, along with experimental results on small to very large applications, in a paper published at SC’19. This work was nominated for the best paper and best student paper awards at SC’19. The tool is available on GitHub.
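To make "communication volume between threads" concrete, the sketch below builds a pairwise communication matrix from a complete memory-access trace. It is only an illustration of the quantity being measured, not the tool's mechanism: the tool relies on hardware performance counters and debug registers precisely to avoid collecting such a trace. The function name, the trace format, and the 64-byte cache-line size are all assumptions made for this example.

```python
CACHE_LINE_BYTES = 64  # assumed cache-line size for this illustration


def communication_matrix(trace, num_threads):
    """Count inter-thread communication events in a memory-access trace.

    trace: iterable of (thread_id, address, is_write) tuples in program order
           (a hypothetical format, not the tool's input).
    Returns matrix[w][t]: number of accesses by thread t to a cache line
    whose most recent writer was a different thread w.
    """
    matrix = [[0] * num_threads for _ in range(num_threads)]
    last_writer = {}  # cache-line index -> thread id of the most recent writer
    for tid, addr, is_write in trace:
        line = addr // CACHE_LINE_BYTES
        writer = last_writer.get(line)
        if writer is not None and writer != tid:
            matrix[writer][tid] += 1
        if is_write:
            last_writer[line] = tid
    return matrix


if __name__ == "__main__":
    demo_trace = [
        (0, 0x1000, True),   # thread 0 writes line A
        (1, 0x1008, False),  # thread 1 reads line A  -> 0 to 1
        (1, 0x2000, True),   # thread 1 writes line B
        (0, 0x2010, True),   # thread 0 writes line B -> 1 to 0
        (1, 0x2000, False),  # thread 1 reads line B  -> 0 to 1
    ]
    print(communication_matrix(demo_trace, 2))
```

Counting at cache-line granularity means two threads touching different bytes of the same line are also recorded as communicating; whether to separate such false sharing from true sharing is a design choice this sketch leaves out.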
Collaborators: Milind Chabbi
Deep learning models are compute- and memory-intensive. Because they are so widely used, training them efficiently while obeying the memory constraints of the underlying processing elements is highly beneficial. Meeting the memory constraints permits exploring new architectures, while efficient training enables faster, more cost- and energy-effective research. This project focuses on applying generic, system-level optimizations to achieve these two goals.
In single-node multi-GPU systems, communication is a critical programming component and performance contributor. To handle communication between multiple GPUs, the CUDA API offers the programmer various data transfer options under the Unified Virtual Addressing (UVA), zero-copy memory, and Unified Memory paradigms. This project focuses on monitoring, identifying, and quantifying the different types of communication among GPU devices.
The project attempts to increase the efficiency of solving the task assignment problem by enhancing the classical population-based metaheuristic approach with recently introduced quantum annealing devices. The stochastic nature of the quantum annealing process provides an extra source of diversification, which is essential for a thorough exploration of the search space. Additionally, it rapidly produces a large number of candidate solutions. The classical component of the algorithm, on the other hand, is capable of guiding the search, which makes it possible to ensure the validity of the solution and to scale the efficiency of the assignment at the cost of CPU time and/or the number of quantum annealing device queries.
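The division of labor described above can be sketched as a small loop in which a stochastic sampler proposes many candidates and a classical step repairs and selects them. This is a minimal sketch under stated assumptions, not the project's algorithm: `sample_candidates` is a hypothetical stand-in that uses a pseudo-random generator in place of a quantum annealing device, and the repair rule, population size, and noise rate are illustrative choices.

```python
import random


def assignment_cost(cost, perm):
    """Total cost when task i is assigned to agent perm[i]."""
    return sum(cost[task][agent] for task, agent in enumerate(perm))


def sample_candidates(guide, num_samples, noise, rng):
    """Hypothetical stand-in for one annealer query.

    Returns noisy variants of `guide`; agents may repeat, so the
    candidates are not necessarily valid assignments.
    """
    n = len(guide)
    return [
        [rng.randrange(n) if rng.random() < noise else agent for agent in guide]
        for _ in range(num_samples)
    ]


def repair(candidate, cost):
    """Classical step: turn a candidate into a valid one-to-one assignment."""
    n = len(candidate)
    used = set()
    perm = [None] * n
    for task, agent in enumerate(candidate):
        if agent not in used:  # keep the first task that claims an agent
            perm[task] = agent
            used.add(agent)
    free = [a for a in range(n) if a not in used]
    for task in range(n):
        if perm[task] is None:  # give conflicting tasks the cheapest free agent
            best = min(free, key=lambda a: cost[task][a])
            perm[task] = best
            free.remove(best)
    return perm


def hybrid_search(cost, queries=20, samples_per_query=30, pop_size=10,
                  noise=0.3, seed=0):
    """Population-based search driven by repeated sampler queries."""
    rng = random.Random(seed)
    n = len(cost)
    population = [list(range(n))]  # start from the identity assignment
    for _ in range(queries):
        guide = population[0]  # classical component steers the sampler
        candidates = sample_candidates(guide, samples_per_query, noise, rng)
        population += [repair(c, cost) for c in candidates]
        population.sort(key=lambda p: assignment_cost(cost, p))
        population = population[:pop_size]
    return population[0]


if __name__ == "__main__":
    cost = [[4, 1, 3], [2, 0, 5], [3, 2, 2]]
    best = hybrid_search(cost)
    print(best, assignment_cost(cost, best))
```

In this sketch, `queries` and `samples_per_query` play the role of the number of device queries, while the repair and selection steps account for the CPU time, mirroring the trade-off mentioned in the description.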
Collaborators: Anastasiia Butko (Berkeley Lab)