Project Details
Scheduling adaptive HPC jobs to navigate through the energy doldrums
Applicant
Professor Dr. Felix Wolf
Subject Area
Computer Architecture, Embedded and Massively Parallel Systems
Term
since 2024
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 545608190
High-performance computing (HPC) is crucial for advances in science and engineering, and demand for it has surged with the rise of artificial intelligence (AI). However, escalating electricity costs and the need to reduce carbon emissions constrain computing resources worldwide. The intermittent nature of renewable energy sources complicates matters further, introducing variability on the scale of hours. Consequently, HPC providers may need to adjust system capacity dynamically based on cost, emissions, and demand trade-offs, or to restrict resources temporarily to meet energy constraints.
This adaptability introduces a new dimension to HPC systems, making them malleable, a characteristic previously associated with specific job classes. Malleability allows jobs to adjust their resource usage dynamically in response to scheduler requests, even when the total system capacity is fixed. While traditional HPC workloads have been slow to adopt malleability owing to limited support, AI training workloads, for which malleability is easy to achieve, provide an opportunity to exploit this feature widely. In addition, AI training jobs may alter their resource requirements as training progresses, which classifies them as evolving. For instance, in computer vision tasks, the ideal batch size may grow as learning progresses, suggesting that resources be redistributed in favor of jobs beyond their early training stages. The concept of adaptive jobs, encompassing both malleable and evolving jobs, is conducive to the efficient operation of systems with dynamic capacity.
This project develops scheduling algorithms for adaptive workloads on systems with variable capacity. The first step is a comprehensive formalization of the problem, including system modeling and the definition of objective functions. A multi-criteria approach will be pursued, combining system-oriented metrics such as energy efficiency with user-oriented metrics such as quality of service. The algorithm design will be grounded in theoretical analysis, encompassing complexity analysis, approximation or inapproximability results, and lower or upper bounds. Empirical evaluation will be conducted through simulation using ElastiSim, a simulator designed for adaptive workloads, which will be extended to account for systems with variable capacity. To exploit the findings in a realistic scenario, the project will implement a simple overlay resource manager for distributed deep learning. This manager will operate atop standard resource managers, orchestrating single-node jobs and allocating resources to multi-node learning tasks on demand. The primary use case is a scheduling algorithm that optimizes the speed and efficiency of distributed deep learning on systems with variable capacity by adjusting the resource sets of individual learning tasks.
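To make the planned formalization more concrete, a minimal sketch of one plausible bi-criteria objective is given below. All symbols here are assumptions introduced for illustration; the project's actual model and metrics remain to be defined. With $C(t)$ the variable system capacity, $c_j(t)$ the nodes held by job $j$ at time $t$, $E(\sigma)$ the energy consumed under schedule $\sigma$, and $C_j$, $r_j$, $p_j$ the completion, release, and processing times of job $j$, a weighted combination of an energy term and a bounded-slowdown term might read

$$ \min_{\sigma}\; \alpha\,\frac{E(\sigma)}{E_{\min}} \;+\; (1-\alpha)\,\frac{1}{n}\sum_{j=1}^{n} \frac{C_j - r_j}{\max(p_j,\tau)} \qquad \text{s.t.}\quad \sum_{j} c_j(t) \le C(t)\ \ \forall t, $$

where $\alpha \in [0,1]$ balances the system-oriented and user-oriented criteria and $\tau$ bounds the reported slowdown of very short jobs.

Likewise, the following Python sketch illustrates the kind of malleability the abstract describes: a training loop that honors grow/shrink requests at iteration boundaries. The stub resource manager, the function names, and the resizing policy are all hypothetical placeholders for the overlay resource manager the project plans to build; none of this reflects ElastiSim's or any real resource manager's API.

```python
"""Illustrative sketch only: a malleable training loop that honors
grow/shrink requests from a (stubbed) resource manager at safe points.
All names (ResourceManagerStub, reconfigure, ...) are hypothetical."""

import random


class ResourceManagerStub:
    """Stands in for a scheduler that adjusts allocations as system
    capacity varies, e.g., with electricity price or supply."""

    def pending_request(self, current_nodes: int) -> int | None:
        # Occasionally ask the job to grow or shrink; None = no change.
        if random.random() < 0.3:
            return max(1, current_nodes + random.choice([-2, 2]))
        return None


def reconfigure(nodes: int, batch_size: int) -> int:
    """Hypothetical redistribution step: rebalance data shards and
    keep the global batch size divisible by the node count."""
    per_node = max(1, batch_size // nodes)
    return per_node * nodes  # effective global batch size


def train(steps: int, nodes: int, batch_size: int) -> None:
    rm = ResourceManagerStub()
    for step in range(steps):
        # ... one training step on `nodes` nodes would run here ...
        # Malleability: act on requests only at iteration boundaries,
        # where model state is consistent and cheap to repartition.
        request = rm.pending_request(nodes)
        if request is not None and request != nodes:
            nodes = request
            batch_size = reconfigure(nodes, batch_size)
            print(f"step {step}: resized to {nodes} nodes, "
                  f"global batch {batch_size}")


if __name__ == "__main__":
    train(steps=20, nodes=8, batch_size=256)
```

Acting only at iteration boundaries keeps the model state consistent during repartitioning; an evolving job would additionally grow its preferred batch size over time, as in the computer-vision example above.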
DFG Programme
Research Grants
International Connection
France
Partner Organisation
Agence Nationale de la Recherche / The French National Research Agency
Cooperation Partners
Professor Dr. Anne Benoit; Frédéric Vivien, Ph.D.
Co-Investigator
Dr.-Ing. Arya Mazaheri