Brandon Barker is a Computational Scientist working at the Cornell Center for Advanced Computing with research interests in safety-critical programming, parallel computing and linear modelling of metabolic systems.
It is not uncommon to have computational analyses running for many days or weeks. Software, hardware, and power failures all present the possibility that a significant amount of work could be lost. In some cases, the programmer can incrementally save data at specific intervals, but this process is error-prone and time-consuming to implement for each application. If a failure occurs, all unsaved data must be regenerated, and the researcher can only hope that another failure won't occur.
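As a rough illustration of the manual approach described above, the following Python sketch saves a loop's state at a fixed interval and resumes from the last saved state after a failure. It is only a minimal sketch: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the `fail_at` parameter used to simulate a crash, and the toy summation workload are all hypothetical and are not taken from the talk.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temporary file, then rename it over the old checkpoint.

    The rename is atomic, so a crash during the write leaves the previous
    checkpoint intact instead of a half-written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the saved state, or `default` if no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, n_steps, interval, fail_at=None):
    """Sum the integers 0..n_steps-1, checkpointing every `interval` steps.

    `fail_at` is a hypothetical hook that simulates a crash at a given step.
    """
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]  # stand-in for the real computation
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

Even in this toy case the programmer has to decide what counts as state, how often to save it, and how to avoid corrupting the checkpoint file, and work done since the last save is still repeated after a failure. Those decisions must be revisited for every application, which is the burden C/R frameworks aim to remove.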
There is a general solution known as Checkpoint/Restart (C/R) that can work for serial applications and a wide variety of parallel applications. The primary advantage of all C/R implementations is that they automate saving program state at specified intervals and allow the program to be resumed. There are numerous C/R frameworks and implementations, each offering various advantages and disadvantages; despite some implementations being very mature, C/R remains an area of open and active research, as no single solution covers every application type. By knowing the capabilities and drawbacks of each C/R solution, as well as the requirements and specifications of your application, it should be straightforward to choose a C/R framework that is right for you.
In this talk, we discuss in more detail what checkpointing is, along with several scenarios where one would and would not want to use it. Next, we discuss several popular checkpointing solutions, giving examples, where pertinent, of software that would not work well with each solution. Finally, we give some simple examples of how checkpointing can be implemented on your own system (Linux currently required).
Date: Dec 8, 2014
Time: 11:00 AM
Location: Weill Hall, Room 121
Webex (webcast)
Slides