|
Søren ASMUSSENLimit theorems for failure recovery in computing and data transmissionA task such as the execution of a computer program or the transmission of a file has an ideal execution time T with distribution F. However, failures may cause the actual execution time X to be different. Various schemes for failure recovery have been considered, among with the most notable are REPLACE, RESUME, RESTART and checkpointing. The two first are fairly easy to study, while the RESTART has long resisted detailed analysis. Here the task has to start over if a failure occurs before completion, say the failure time is U with distribution G, and multiple failures may occur. Based upon Cramér-Lundberg theory, we show that the RESTART total time X has an exponential tail if F has finite support. Otherwise, X is always heavy-tailed, and the tail behaviour is quantified under various assumptions on F and G. Two related settings are also studied. In parallel computing, the total task time becomes the maximum of independent copies of X. Classical extreme value theory combined with the RESTART results immediately give the order of the total task time in a simple i.i.d. setting, but we also look into some non-classical triangular array extreme value problems. In checkpointing, the task is split into subtasks, each of which behave as in the RESTART setting. The tail of the total task time is given for a variety of specific models for the checkpointing. |