IEEE ITW 2020, Riva del Garda, Italy
Distributed computing enables parallel execution of tasks that make up a large computing job. In large scale systems, even small random fluctuations in service times (inherent to computing environments) often cause a non-negligible number of straggling tasks with long completion times. Redundancy, in the form of simple task replication and erasure coding, has emerged as a potentially powerful way to curtail the variability in service time, as it provides diversity that allows a job to be completed when only a subset of redundant tasks gets executed. Thus both redundancy and parallelism reduce the execution time, but compete for resources of the system. In situations of constrained resources (e.g., fixed number of parallel servers), increasing redundancy reduces the available level of parallelism. This talk will present the diversity vs. parallelism trade off for some common models of task size dependent execution times, and show that different models operate optimally at different levels of redundancy, and thus require very different code rates.