Dr. Arjun Chandra
Graphcore, Norway

28 April 2022

Dr. Arjun Chandra is part of the Applications team at Graphcore (https://www.graphcore.ai), and is based in Oslo. His work focuses on engineering state-of-the-art machine intelligence models and methods that leverage, examine, and inform the development of the ever-expanding capabilities of Graphcore’s Poplar SDK and hardware systems. Arjun received a doctorate in Computer Science from the University of Birmingham, UK, in 2011, and has academic research, industry R&D, and entrepreneurial experience across a broad range of themes within machine intelligence.

Deep learning models are growing in size and complexity. One of the ways to train them efficiently at scale is to use low precision arithmetic and number formats. This talk will cover some of the key techniques engineered at Graphcore to provide numerically stable training of neural networks in reduced precision whilst maintaining the target FP32 accuracy. These techniques are available for use with Graphcore IPUs via our Poplar SDK.


Topic covered during the seminar:

  • Numerically stable deep learning in reduced precision

Dr. Anshu S. Ananda
Indian Institute of Information Technology, Allahabad

31 March 2022

One of the key challenges for exascale systems is optimizing data movement, which is much more expensive than computation. The performance and energy implications are even more significant in heterogeneous HPC systems. Improving locality of reference through data reuse reduces both memory access time and the memory bandwidth required between processors/nodes. However, in the absence of suitable abstractions for managing data locality, the onus is on the programmer to manage it manually using low-level techniques.

In this talk, I will first show how Powerlist, a data structure that enables us to specify parallel algorithms concisely, can be used as an abstraction for parallelism by describing a method to schedule computations (e.g., matrix multiplication) across a cluster of GPUs. This is realized by implementing Powerlist as a library that automatically partitions the matrices and uses the cuBLAS API for efficient multiplication of the sub-matrices on the individual GPUs. In the second part, I will discuss the prospects of Powerlist as a locality abstraction.
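As a rough illustration of the partitioning idea, the following minimal Python sketch splits the rows of the left matrix recursively with the two standard powerlist operators (tie for concatenation, zip for interleaving) and multiplies each leaf block independently. It is not the speaker's library: the names `tie`, `untie`, `zip_`, `unzip`, `multiply_block`, and `powerlist_matmul` are hypothetical, and `multiply_block` is a plain-Python stand-in for the per-GPU cuBLAS call mentioned in the abstract.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def tie(p: list, q: list) -> list:
    """Powerlist 'tie': concatenate two lists of equal length."""
    assert len(p) == len(q), "tie needs equal-length operands"
    return p + q


def untie(xs: list) -> Tuple[list, list]:
    """Inverse of tie: split a list into its first and second halves."""
    half = len(xs) // 2
    return xs[:half], xs[half:]


def zip_(p: list, q: list) -> list:
    """Powerlist 'zip': interleave two lists of equal length."""
    assert len(p) == len(q), "zip needs equal-length operands"
    out = []
    for a, b in zip(p, q):
        out.extend((a, b))
    return out


def unzip(xs: list) -> Tuple[list, list]:
    """Inverse of zip: elements at even and at odd positions."""
    return xs[0::2], xs[1::2]


def multiply_block(a: Matrix, b: Matrix) -> Matrix:
    """Stand-in for the GEMM that a single GPU would run on its block."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def powerlist_matmul(a: Matrix, b: Matrix, leaf_rows: int = 1) -> Matrix:
    """Multiply a @ b by recursively halving the rows of a.

    Assumes len(a) is a power of two, as powerlists require.
    """
    n = len(a)
    assert n > 0 and n & (n - 1) == 0, "row count must be a power of two"
    if n <= leaf_rows:
        return multiply_block(a, b)  # leaf: one block per device
    top, bottom = untie(a)
    # The two halves are independent, so they could run on different GPUs.
    return tie(powerlist_matmul(top, b, leaf_rows),
               powerlist_matmul(bottom, b, leaf_rows))
```

For example, `powerlist_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])` splits the left matrix into two single-row blocks, multiplies each against the right matrix, and ties the results back together. Row-wise splitting is only one possible partitioning; it is chosen here because each block's result needs no further reduction.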


Topic covered during the seminar:

  • Powerlist as a high-level parallel programming model

Alexander Geiß, M.Sc.
Technical University of Darmstadt

3 March 2022

Alexander Geiß holds an M.Sc. degree in computer science from the Technical University of Darmstadt and works there as a research associate at the Laboratory for Parallel Programming. He is also the leader of the work package on “Measuring, Modelling, Mapping and Monitoring” in the DEEP-SEA project. His research area is performance modeling and application mapping with a focus on heterogeneous systems and modular supercomputing.

The topic of this talk is Extra-P, an automatic performance-modeling tool that supports the user in the identification of scalability bottlenecks. This talk will give an overview of performance modeling with Extra-P. We start with a brief motivation for performance models and the need for assistance in creating them, followed by an explanation of the most important parts of the underlying method and a discussion of its limitations. Finally, we will discuss the recommended workflow for performance modeling with Extra-P based on a small demonstration.
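To make the idea of an automatically generated performance model more concrete, here is a minimal Python sketch. It is not Extra-P's implementation or API. It assumes single-term hypotheses of the form c0 + c1 * p^i * log2(p)^j, in the spirit of the performance model normal form from the SC ’13 paper listed below. The function names, the candidate exponent sets, and the choice of the residual sum of squares as the selection criterion are all simplifying assumptions made for illustration.

```python
import math
from itertools import product


def fit_linear(xs, ys):
    """Ordinary least squares for y = c0 + c1 * x; returns (c0, c1, rss)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    c1 = sxy / sxx if sxx else 0.0
    c0 = mean_y - c1 * mean_x
    rss = sum((y - (c0 + c1 * x)) ** 2 for x, y in zip(xs, ys))
    return c0, c1, rss


def best_model(procs, times,
               exponents=(0, 0.5, 1, 2),   # assumed candidates for i
               log_exponents=(0, 1, 2)):   # assumed candidates for j
    """Try every hypothesis c0 + c1 * p^i * log2(p)^j and keep the best fit.

    Returns a tuple (rss, i, j, c0, c1) for the lowest residual sum of squares.
    """
    best = None
    for i, j in product(exponents, log_exponents):
        if i == 0 and j == 0:
            continue  # a pure constant is already covered by c0
        xs = [p ** i * math.log2(p) ** j for p in procs]
        c0, c1, rss = fit_linear(xs, times)
        if best is None or rss < best[0]:
            best = (rss, i, j, c0, c1)
    return best
```

Given runtimes measured at a handful of process counts, `best_model(procs, times)` returns the exponents and coefficients of the best-fitting hypothesis. A term that grows faster than expected, such as p^2 where p * log2(p) was anticipated, is the kind of signal that points to a scalability bottleneck.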

Research papers covered during the seminar:

  • Using Automated Performance Modeling to Find Scalability Bugs in Complex Codes. SC ’13.
  • Fast Multi-Parameter Performance Modeling. CLUSTER 2016.
  • Learning Cost-Effective Sampling Strategies for Empirical Performance Modeling. IPDPS 2020.

Prof. Paul H.J. Kelly
Imperial College London

13 December 2021

Paul Kelly leads the Software Performance Optimisation group at Imperial College London. His research focus is domain-specific program optimisation, leading to close engagement with colleagues in computational science, robotics, and computer vision. This talk covers joint work with many such collaborators.

The topic of this talk: Domain-specific languages enable us to automate the generation of high-performance code from a high-level abstraction. This talk will show, through a couple of example projects (Firedrake and Devito), that DSLs can deliver productivity, performance, and performance-portability. The key to success is compiler architecture – designing intermediate representations that make optimisations easy and analysis trivial. But the DSL software ecosystem is dysfunctional: DSL compilers (including ours) are typically standalone projects, reliant on support from a narrow developer base. Few, if any, components are shared between DSLs. The talk will conclude with a manifesto for fixing this – building on MLIR to establish community support for code generation tools that underpin multiple front-end DSLs. I will argue that this is in fact the only way we can tackle the complexity involved in achieving high performance for complex applications on diverse hardware.


Research paper covered during the seminar:

  • Architecture and Performance of Devito, a System for Automated Stencil Computation. ACM TOMS, April 2020

Dr. Nehir Sönmez
Barcelona Supercomputing Center

8 November 2021

Dr. Nehir Sönmez holds a PhD in Computer Engineering (2012) from the Technical University of Catalonia (UPC), Spain, an MS degree from Bogazici University, Turkey (2006), and a BS degree from Syracuse University, USA (2003). He is currently a senior researcher in the High Performance Domain-Specific Architectures research group at BSC, where he is the coordinator of the EuroHPC-JU eProcessor project. He is also active in the architecture design and verification efforts in the EPI and DRAC projects, on RISC-V processors and vector accelerators. His other research interests include reconfigurable computing, computer architecture and multicores, transactional memory, disaggregated computing, and database acceleration.


Research paper covered during the seminar:

  • A RISC-V Simulator and Benchmark Suite for Designing and Evaluating Vector Architectures. TACO ’17

Dr. Luc Jaulmes
Barcelona Supercomputing Center

4 October 2021

Luc Jaulmes received a double MSc degree in engineering in computer science from Ecole Polytechnique, Paris and the Royal Institute of Technology (KTH), Stockholm, specializing in computer architecture. He received his PhD degree at the Barcelona Supercomputing Center (BSC) in 2019. His thesis focused on the resilience of high performance computing (HPC) applications and memories on extreme scale machines, using novel programming models, runtime systems and runtime-aware architectures.


Research paper covered during the seminar:

Dr. Aleksandar Ilic
INESC-ID

4 October 2021

Aleksandar Ilic (PhD’14) is an Assistant Professor at the Instituto Superior Técnico (IST), Universidade de Lisboa, and a researcher at INESC-ID, Lisbon, Portugal. He has contributed to more than 50 international journal and conference publications, and received several Excellence in Teaching awards. Besides his teaching experience, he has organized and participated in more than 20 roofline-related tutorials, invited talks and seminars held at different scientific events, such as SC, ISC, Intel oneAPI Dev Summits, and PACT. The integration of his scientific contribution (Cache-aware Roofline Model) in industry software tools (Intel Advisor) received the HiPEAC Tech Transfer award. His research interests include high-performance and energy-efficient computing and modeling of parallel heterogeneous systems.


Research papers covered during the seminar:

Dr. Milind Chabbi
Uber Technologies, Inc.

13 July 2021

Milind Chabbi conducts research in the areas of high-performance parallel computing, shared-memory synchronization algorithms, performance analysis tools, and compiler optimizations. He is currently employed as a senior researcher at Uber Technologies in Palo Alto, USA, and is also the president of his independent research company, Scalable Machine Research. Previously, Milind worked at Baidu Research, Hewlett Packard Labs, and Microsoft. He obtained his doctoral degree in computer science from Rice University, working in the areas of software tools and algorithms for high-performance parallel computing. Milind has published over 30 conference and journal publications, received numerous best paper awards, and holds eight USPTO patents.


Research papers covered during the seminar:

Dr. Tan Nguyen
Lawrence Berkeley National Laboratory

21 June 2021

Tan Nguyen is a research scientist at Lawrence Berkeley National Laboratory. His recent research focuses on performance analysis and code optimizations for various processor architectures, including multi- and many-core CPUs, GPUs, FPGAs, and CGRAs. He is also interested in compiler analysis and code generation, programming models, and runtime systems for scientific applications. Nguyen received his Ph.D. degree in Computer Science from the University of California, San Diego, in 2014.


Research papers covered during the seminar:

Dr. Mohamed Wahib
AIST/TokyoTech Open Innovation Laboratory

25 May 2021

Mohamed Wahib is a senior scientist at AIST/TokyoTech Open Innovation Laboratory, Tokyo, Japan. Prior to that, he worked as a researcher at the RIKEN Center for Computational Science (RIKEN-CCS). He received his Ph.D. in Computer Science in 2012 from Hokkaido University, Japan. Prior to his graduate studies, he worked as a researcher at Texas Instruments (TI) R&D labs in Dallas, TX for four years. His research interests revolve around the central topic of “Performance-centric Software Development” in the context of HPC. He is actively working on several projects, including high-level frameworks for programming traditional scientific applications, as well as high-performance AI and data analytics.


Research papers covered during the seminar:

  • ParDNN: An Oracle for Characterizing and Guiding Large-Scale Training of Deep Neural Networks. HPDC’21
  • Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA. SC’20
  • A Study of Single and Multi-device Synchronization Methods in Nvidia GPUs. IPDPS’20