Project Details
Scheduling adaptive HPC jobs to navigate through the energy doldrums
Applicant
Professor Dr. Felix Wolf
Subject Area
Computer Architecture, Embedded and Massively Parallel Systems
Term
since 2024
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 545608190
High-performance computing (HPC) is crucial for advances in science and engineering, and demand for it has surged with the rise of artificial intelligence (AI). However, escalating electricity costs and the need to reduce carbon emissions constrain computing resources worldwide. The intermittent nature of renewable energy sources complicates matters further, introducing variability on the scale of hours. Consequently, HPC providers may need to adjust system capacity dynamically based on cost, emissions, and demand trade-offs, or to restrict resources temporarily to meet energy constraints.
This adaptability introduces a new dimension to HPC systems, making them malleable, a characteristic previously associated with specific job classes. Malleability allows jobs to adjust their resource usage dynamically in response to scheduler requests, even when the total system capacity is fixed. While traditional HPC workloads have been slow to adopt malleability owing to limited support, AI training workloads, for which malleability is easy to achieve, provide an opportunity to exploit this feature widely. In addition, AI training jobs may alter their resource requirements as training progresses, which classifies them as evolving. For instance, in computer vision tasks, the ideal batch size may grow as learning progresses, suggesting that resources be redistributed in favor of jobs beyond their early training stages. The concept of adaptive jobs, encompassing both malleable and evolving jobs, is conducive to the efficient operation of systems with dynamic capacity.
This project develops scheduling algorithms for adaptive workloads on systems with variable capacity. The first step is a comprehensive formalization of the problem, including system modeling and the definition of objective functions. A multi-criteria approach will be pursued, combining system-oriented metrics such as energy efficiency with user-oriented metrics such as quality of service. The algorithm design will be grounded in theoretical analysis, encompassing complexity analysis, approximation or inapproximability results, and lower or upper bounds. Empirical evaluation will be conducted through simulation using ElastiSim, a simulator designed for adaptive workloads, which will be extended to account for systems with variable capacity. To exploit the findings in a realistic scenario, the project will implement a simple overlay resource manager for distributed deep learning. This manager will operate atop standard resource managers, orchestrating single-node jobs and allocating resources to multi-node learning tasks on demand. The primary use case is a scheduling algorithm that optimizes the speed and efficiency of distributed deep learning on systems with variable capacity by adjusting the resource sets of individual learning tasks.
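To make the planned formalization more concrete, a minimal sketch of one plausible bi-criteria objective is given below. All symbols here are assumptions introduced for illustration; the project's actual model and metrics remain to be defined. With $C(t)$ the variable system capacity, $c_j(t)$ the nodes held by job $j$ at time $t$, $E(\sigma)$ the energy consumed under schedule $\sigma$, and $C_j$, $r_j$, $p_j$ the completion, release, and processing times of job $j$, a weighted combination of an energy term and a bounded-slowdown term might read

$$ \min_{\sigma}\; \alpha\,\frac{E(\sigma)}{E_{\min}} \;+\; (1-\alpha)\,\frac{1}{n}\sum_{j=1}^{n} \frac{C_j - r_j}{\max(p_j,\tau)} \qquad \text{s.t.}\quad \sum_{j} c_j(t) \le C(t)\ \ \forall t, $$

where $\alpha \in [0,1]$ balances the system-oriented and user-oriented criteria and $\tau$ bounds the reported slowdown of very short jobs.

Likewise, the following Python sketch illustrates the kind of malleability the abstract describes: a training loop that honors grow/shrink requests at iteration boundaries. The stub resource manager, the function names, and the resizing policy are all hypothetical placeholders for the overlay resource manager the project plans to build; none of this reflects ElastiSim's or any real resource manager's API.

```python
"""Illustrative sketch only: a malleable training loop that honors
grow/shrink requests from a (stubbed) resource manager at safe points.
All names (ResourceManagerStub, reconfigure, ...) are hypothetical."""

import random


class ResourceManagerStub:
    """Stands in for a scheduler that adjusts allocations as system
    capacity varies, e.g., with electricity price or supply."""

    def pending_request(self, current_nodes: int) -> int | None:
        # Occasionally ask the job to grow or shrink; None = no change.
        if random.random() < 0.3:
            return max(1, current_nodes + random.choice([-2, 2]))
        return None


def reconfigure(nodes: int, batch_size: int) -> int:
    """Hypothetical redistribution step: rebalance data shards and
    keep the global batch size divisible by the node count."""
    per_node = max(1, batch_size // nodes)
    return per_node * nodes  # effective global batch size


def train(steps: int, nodes: int, batch_size: int) -> None:
    rm = ResourceManagerStub()
    for step in range(steps):
        # ... one training step on `nodes` nodes would run here ...
        # Malleability: act on requests only at iteration boundaries,
        # where model state is consistent and cheap to repartition.
        request = rm.pending_request(nodes)
        if request is not None and request != nodes:
            nodes = request
            batch_size = reconfigure(nodes, batch_size)
            print(f"step {step}: resized to {nodes} nodes, "
                  f"global batch {batch_size}")


if __name__ == "__main__":
    train(steps=20, nodes=8, batch_size=256)
```

Acting only at iteration boundaries keeps the model state consistent during repartitioning; an evolving job would additionally grow its preferred batch size over time, as in the computer-vision example above.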
DFG Programme
Research Grants
International Connection
France
Partner Organisation
Agence Nationale de la Recherche / The French National Research Agency
Cooperation Partners
Professor Dr. Anne Benoit; Frédéric Vivien, Ph.D.
Co-Investigator
Dr.-Ing. Arya Mazaheri