Check out Prof. Hwu and Prof. El Hajj's YouTube channel!
Keynote: LLM Training and Inference on GPU & HPC Systems
Training and inference of large transformer models is one of the most important computational challenges of modern AI. Systems for training these models must be highly scalable and run at extreme efficiency, because the amount of work necessary to converge a model can be extraordinarily large. Inference needs to be fast and accommodate different query sizes. In this talk, I'll discuss the work we have been doing at NVIDIA to optimize systems for Large Language Model training and inference on GPUs. I will present different parallelism techniques we are using in our LLM framework Megatron-LM and will discuss how parallelism techniques can be combined to maximize the training throughput of large models while retaining strict optimizer semantics. I will also discuss optimization techniques for inference, including methods to accelerate inference and reduce memory fragmentation.
Dr. Mohammad Shoeybi is the Director of Applied Research at NVIDIA. His team focuses on building large foundational models and improving them for downstream applications. His team has built Megatron-LM, a framework for efficiently training LLMs, and used it to train several large-scale models such as Nemotron-4, with 340 billion parameters. He received his PhD from Stanford University in 2010. Prior to NVIDIA, he worked at DeepMind and Baidu USA, leading efforts on bringing deep learning and reinforcement learning to applications.
Instructors
Wen-mei W. Hwu received the PhD degree in computer science from the University of California, Berkeley, in 1987. He is the Walter J. (“Jerry”) Sanders III-Advanced Micro Devices endowed chair of electrical and computer engineering at the University of Illinois at Urbana-Champaign. His research interests include the areas of architecture, implementation, software for high-performance computer systems, and parallel processing. He is a principal investigator (PI) for the petascale Blue Waters system, a codirector of the Intel- and Microsoft-funded Universal Parallel Computing Research Center (UPCRC), and PI for the world’s first NVIDIA CUDA Center of Excellence. He is the chief scientist of the Illinois Parallel Computing Institute and the director of the IMPACT lab.
For his contributions to the areas of compiler optimization and computer architecture, he received the 1993 Eta Kappa Nu Outstanding Young Electrical Engineer Award, the 1994 University Scholar Award of the University of Illinois, the 1997 Eta Kappa Nu Holmes MacDonald Outstanding Teaching Award, the 1998 ACM SigArch Maurice Wilkes Award, the 1999 ACM Grace Murray Hopper Award, the 2001 Tau Beta Pi Daniel C. Drucker Eminent Faculty Award, the 2006 most influential ISCA paper award, and the University of California, Berkeley distinguished alumni in computer science award. From 1997 to 1999, he was the chairman of the Computer Engineering Program at the University of Illinois. In 2007, he introduced a new engineering course in massively parallel processing with David Kirk of NVIDIA. He is a fellow of IEEE and of the ACM.
Juan Gómez Luna is a senior researcher and lecturer at SAFARI Research Group @ ETH Zürich. He received the BS and MS degrees in Telecommunication Engineering from the University of Sevilla, Spain, in 2001, and the PhD degree in Computer Science from the University of Córdoba, Spain, in 2012. Between 2005 and 2017, he was a faculty member of the University of Córdoba. His research interests focus on GPU and heterogeneous computing, processing-in-memory, memory systems, and hardware and software acceleration of medical imaging and bioinformatics. He is the lead author of PrIM (https://github.com/CMU-SAFARI/prim-benchmarks), the first publicly-available benchmark suite for a real-world processing-in-memory architecture, and Chai (https://github.com/chai-benchmarks/chai), a benchmark suite for heterogeneous systems with CPU/GPU/FPGA.
Izzat El Hajj is an Assistant Professor in the Department of Computer Science at the American University of Beirut. His research interests are in application acceleration and programming support for parallel processors and memory technologies, with a particular interest in GPUs and processing-in-memory. He is a co-author of the textbook Programming Massively Parallel Processors: A Hands-on Approach, 4th edition. He received his M.S. and Ph.D. in Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign, where he worked with the IMPACT Research Group led by Prof. Wen-mei Hwu and received the Dan Vivoli Endowed Fellowship. Prior to that, he received his B.E. in Electrical and Computer Engineering at the American University of Beirut, where he graduated with high distinction and received the Distinguished Graduate Award.
Antonio J. Peña holds a BS + MS degree in Computer Engineering (2006), and MS and PhD degrees in Advanced Computer Systems (2010, 2013), from Jaume I University of Castellón, Spain. He is currently a Leading Researcher at Barcelona Supercomputing Center (BSC), Computer Sciences Department, where he leads the “Accelerators and Communications for HPC” Group. Antonio is a Ramón y Cajal Fellow, former Marie Sklodowska-Curie Fellow, and former Juan de la Cierva Fellow, and a recipient of the 2017 IEEE TCHPC Award for Excellence for Early Career Researchers in High Performance Computing. He is also an ERC Consolidator Laureate and Sr. IEEE/ACM member. Antonio is also Teaching and Research Staff at Universitat Politècnica de Catalunya (UPC). His research interests in the area of runtime systems and programming models for high performance computing include resource heterogeneity and communications.
Leonidas Kosmidis is a Leading Researcher at the Barcelona Supercomputing Center (BSC) and the Universitat Politècnica de Catalunya (UPC). He holds a PhD and an MSc degree in Computer Architecture from UPC and a BSc in Computer Science from the University of Crete, Greece. He leads the research on embedded GPUs for safety-critical systems, at both the hardware and system software levels, within the CAOS (Computer Architecture/Operating Systems) group. He is the PI of several projects funded by the European Space Agency (ESA) such as GPU4S (GPU for Space) and the Horizon Europe METASAT project, as well as projects funded by industry, such as Airbus Defence and Space, which focus on the adoption of GPUs in space and avionics systems. He also participates in several standardisation efforts regarding GPU programming in safety-critical systems. Dr. Kosmidis is the recipient of the RISC-V Educator of the Year Award in 2019 from the RISC-V Foundation and an Honourable Mention for the EuroSys Roger Needham PhD Award in 2018, which is awarded to the best PhD thesis in Europe.
Xavier Martorell received the M.S. and Ph.D. degrees in Computer Science from the Universitat Politècnica de Catalunya (UPC) in 1991 and 1999, respectively. Since 1992 he has lectured on operating systems, parallel runtime systems, OS administration, and systems for data science. He has been an associate professor in the Computer Architecture Department at UPC since 2001. His research interests cover the areas of operating systems, runtime systems, compilers and applications for high-performance multiprocessor systems. In 2003 he joined the IBM TJ Watson Research Center as a Visiting Scientist, and participated in the development of system software for the IBM BlueGene/L Supercomputer, which ranked first in the Top500 list during 2004 and 2005. Since 2005 he has been the Manager of the Parallel Programming Models team at the Barcelona Supercomputing Center. He has participated in several EU projects related to the use of FPGAs for HPC (AXIOM, EuroEXA, and LEGaTO) and to the use of FPGAs for RISC-V emulation (Textarossa and MEEP). He is now participating in the Zettascale project, leading the porting of the OS and drivers to the Lagarto RISC-V developed by BSC. He has coauthored more than 80 publications in international journals and conferences. He has co-advised eight Ph.D. theses and is currently co-advising three Ph.D. students.
Xavier Teruel received the Technical Engineering and the Engineering degrees in Computer Science from the Universitat Politècnica de Catalunya (UPC) in 2003 and 2006, respectively. Since 2006, Xavier has been working as a researcher in the Parallel Programming Models group of the Computer Sciences department at the Barcelona Supercomputing Center (BSC). His research interests include the areas of operating systems, programming languages, compilers, runtime systems and applications for high-performance architectures and multiprocessor systems, focused mostly on shared-memory environments.
Marc Jordà received his M.S. in Computer Architecture, Networks and Systems in 2012 from the Universitat Politècnica de Catalunya, Barcelona. Since then, he has been a research engineer at the Barcelona Supercomputing Center – Centro Nacional de Supercomputación, working on several topics in the field of high-performance computing, including application acceleration with GPUs, GPU hardware simulation, and performance analysis.