Journées MAS de la SMAI
Modélisation et Statistiques des Réseaux
Rennes, 27-29 août 2008

Informations générales
Accueil
Présentation
Comité local
Programme scientifique
Comité scientifique
Conférences plénières
Sessions parallèles
Les exposés en pdf
Informations pratiques
Dates
Planning
Inscription
Venir à Rennes
 Transportation and Hotels
Contact

Søren ASMUSSEN


Limit theorems for failure recovery in computing and data transmission


A task such as the execution of a computer program or the transmission of a file has an ideal execution time T with distribution F. However, failures may cause the actual execution time X to be different.

Various schemes for failure recovery have been considered, among with the most notable are REPLACE, RESUME, RESTART and checkpointing. The two first are fairly easy to study, while the RESTART has long resisted detailed analysis. Here the task has to start over if a failure occurs before completion, say the failure time is U with distribution G, and multiple failures may occur.

Based upon Cramér-Lundberg theory, we show that the RESTART total time X has an exponential tail if F has finite support. Otherwise, X is always heavy-tailed, and the tail behaviour is quantified under various assumptions on F and G.

Two related settings are also studied. In parallel computing, the total task time becomes the maximum of independent copies of X. Classical extreme value theory combined with the RESTART results immediately give the order of the total task time in a simple i.i.d. setting, but we also look into some non-classical triangular array extreme value problems. In checkpointing, the task is split into subtasks, each of which behave as in the RESTART setting. The tail of the total task time is given for a variety of specific models for the checkpointing.
region bretagne rennes rennes CNRS inria irmar rennes