Project Details

Scheduling adaptive HPC jobs to navigate through the energy doldrums

Subject Area: Computer Architecture, Embedded and Massively Parallel Systems
Term: since 2024
Project identifier: Deutsche Forschungsgemeinschaft (DFG) - Project number 545608190
 
High-performance computing (HPC) is crucial for advancements in science and engineering, and the demand for it has surged with the rise of artificial intelligence (AI). However, challenges such as escalating electricity costs and the need to reduce carbon emissions constrain computing resources globally. The intermittent nature of renewable energy sources further complicates matters, introducing variability on the scale of hours. Consequently, HPC providers may need to dynamically adjust system capacity based on cost, emission, and demand trade-offs, or temporarily restrict resources to meet energy constraints.

This adaptability introduces a new dimension to HPC systems, making them malleable, a characteristic previously associated with specific job classes. Malleability allows jobs to dynamically adjust their resource usage in response to scheduler requests, even under fixed-size capacity. While traditional HPC workloads have been slow to adopt malleability due to limited support, AI training workloads, for which malleability can be achieved easily, provide an opportunity for widespread exploitation of this feature. In addition, AI training jobs may alter their resource requirements as training progresses, which classifies them as evolving. For instance, in computer vision tasks, the ideal batch size may grow as learning progresses, suggesting the redistribution of resources in favor of jobs beyond their early training stages. The concept of adaptive jobs, encompassing both malleable and evolving jobs, is seen as conducive to the efficient operation of systems with dynamic capacity.

This project focuses on developing scheduling algorithms for adaptive workloads on systems with variable capacity. The initial step involves a comprehensive formalization of the problem, including system modeling and the definition of objective functions. A multi-criteria approach will be pursued, combining system-oriented metrics such as energy efficiency with user-oriented metrics such as quality of service. The algorithm design will be grounded in theoretical analysis, encompassing complexity analysis, approximation or inapproximability results, and lower or upper bounds. Empirical evaluation will be conducted through simulation using ElastiSim, a simulator designed for adaptive workloads, which will be extended to account for systems with variable capacity.

To exploit our findings in a realistic scenario, the project will implement a simple overlay resource manager for distributed deep learning. This manager will operate atop standard resource managers, orchestrating single-node jobs and allocating resources to multi-node learning tasks on demand. The primary use case of this project is a scheduling algorithm that optimizes the speed and efficiency of distributed deep learning on systems with variable capacity by adjusting the resource sets of individual learning tasks.
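As an illustration of the multi-criteria formulation mentioned above, one possible objective (the notation, terms, and weighting here are our own assumptions and are not fixed by the project) scores a schedule S by combining a system-oriented term with a user-oriented term:

    minimize f(S) = α · E(S)/E_ref + (1 − α) · (1/n) · Σ_j BSLD_j(S)

where E(S) is the energy consumed under the variable capacity profile, E_ref is a reference energy used for normalization, BSLD_j(S) is the bounded slowdown of job j, and α ∈ [0, 1] trades energy efficiency against quality of service.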
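The redistribution of resources among adaptive learning tasks under changing capacity can be sketched as follows. This is a minimal illustration only: the Task fields, the redistribute function, and the progress-based greedy rule are hypothetical assumptions for this sketch and do not reflect the project's algorithm or the interface of ElastiSim or any production resource manager.

from dataclasses import dataclass

@dataclass
class Task:
    name: str
    min_nodes: int    # smallest allocation the task can still run with
    max_nodes: int    # largest allocation it can exploit
    progress: float   # training progress in [0, 1]; evolving jobs raise it over time

def redistribute(tasks, capacity):
    # Give every task its minimum allocation, then hand out the remaining
    # nodes greedily, favoring tasks beyond their early training stages.
    alloc = {t.name: t.min_nodes for t in tasks}
    free = capacity - sum(alloc.values())
    if free < 0:
        raise ValueError("capacity below the sum of minimum allocations")
    for t in sorted(tasks, key=lambda t: t.progress, reverse=True):
        grant = min(free, t.max_nodes - alloc[t.name])
        alloc[t.name] += grant
        free -= grant
    return alloc

# Example: when capacity drops from 10 to 4 nodes, the mature task keeps priority.
tasks = [Task("resnet", 1, 8, 0.8), Task("vit", 1, 8, 0.1)]
print(redistribute(tasks, capacity=10))  # {'resnet': 8, 'vit': 2}
print(redistribute(tasks, capacity=4))   # {'resnet': 3, 'vit': 1}

In such a setting, an overlay manager would re-invoke this routine whenever the available capacity changes and then ask each malleable task to shrink or grow to its new allocation.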
DFG Programme: Research Grants
International Connection: France
Co-Investigator: Dr.-Ing. Arya Mazaheri
 
 
