Project Details

Validated Models of MapReduce Scaling

Subject Area: Security and Dependability, Operating-, Communication- and Distributed Systems
Term: since 2017
Project identifier: Deutsche Forschungsgemeinschaft (DFG) - Project number 389207087
 
The goal of the VaMoS project is to bridge the gap between systems-oriented research and queueing-theoretic work on parallel systems, in order to create models that reflect the performance of real systems and their scaling behavior. This document reports on the first phase of the project and proposes an extension with a work program that builds on the successes and developments in the field over the last few years.
During the first phase of the VaMoS project we performed wide-ranging, experimentally inspired work on parallel systems. We investigated the effects of job locality, analyzed traces from real clusters, investigated the performance benefits and trade-offs of finer task granularity both theoretically and experimentally, and conducted experiments and developed models for parallel systems with barriers, as are often needed when parallelizing machine learning workloads. This work involved the implementation or extension of several software packages, which we have publicly released.
Our proposed project extension focuses mainly on parallel systems with barriers. Typically this means that jobs are divided into tasks that are serviced in parallel by a cluster of workers, but the tasks are constrained to start, and possibly complete, simultaneously. There may also be intermediate synchronization points. This type of constraint is common in machine learning workloads, and support for barrier execution mode has very recently been added to some map-reduce engines in order to support these workloads. Barrier constraints have major performance implications, because they can force workers to sit idle while waiting for a single long-running task to finish. In the first phase of the project we developed analytical performance bounds for basic configurations of these types of systems.
To connect our results with real systems we need to answer a number of questions about how parallel systems scale under multi-barrier workloads, how they handle a heterogeneous stream of barrier-containing jobs, and how multi-barrier workloads can be modeled. We also need to address certain implementation problems involving dynamic manipulation of a job's degree of parallelism, and whether scheduling optimizations required to support extremely fine task granularity can be useful in parallel processing of streaming data.
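
The idle-time effect of barriers described above can be illustrated with a small simulation. This is only a sketch under assumptions that are not taken from the project itself: each job has k tasks, task service times are i.i.d. exponential with mean 1, all k workers start together, and no worker is released until the slowest task finishes. The function names harmonic and simulate_barrier_jobs are hypothetical. Under these assumptions the mean job duration equals the harmonic number H_k, so the fraction of worker time spent idle grows as k increases.

```python
import random


def harmonic(k):
    """H_k = 1 + 1/2 + ... + 1/k: mean of the max of k i.i.d. Exp(1) times."""
    return sum(1.0 / i for i in range(1, k + 1))


def simulate_barrier_jobs(k, n_jobs=20000, seed=0):
    """Simulate n_jobs barrier jobs of k tasks each (assumed Exp(1) task times).

    Returns (mean job duration, fraction of worker time spent idle).
    """
    rng = random.Random(seed)
    total_makespan = 0.0  # time from common start to the finishing barrier
    total_busy = 0.0      # time workers actually spend servicing tasks
    for _ in range(n_jobs):
        times = [rng.expovariate(1.0) for _ in range(k)]
        total_makespan += max(times)  # all workers wait for the slowest task
        total_busy += sum(times)
    mean_makespan = total_makespan / n_jobs
    # k workers are held for the whole makespan of every job.
    idle_fraction = 1.0 - total_busy / (k * total_makespan)
    return mean_makespan, idle_fraction


if __name__ == "__main__":
    for k in (1, 4, 16, 64):
        mean_makespan, idle = simulate_barrier_jobs(k)
        print(f"k={k:3d}  mean duration={mean_makespan:.3f}  "
              f"H_k={harmonic(k):.3f}  idle fraction={idle:.3f}")
```

Exponential task times are chosen here only because the expected maximum has a simple closed form to compare against; service times measured on a real cluster would need to be substituted to say anything about an actual system.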
DFG Programme: Research Grants
Co-Investigator: Brenton Walker, Ph.D.
 
 
