Time Machines and Black Box Recorders for Embedded Systems Software
by Henrik Thane
Black-box recorders and virtual Time Machines make it possible to deterministically re-execute embedded real-time software. It is possible to jump back and forth in time while debugging the system offline.
The concepts of time machines and time travel have captured people's imaginations for hundreds of years. Since the publication of H.G. Wells' 'The Time Machine' in 1895, hundreds of books and movies have followed. Even in theoretical physics, research on the subject has been taken seriously. Esoteric ideas such as using black holes for time travel (Kip Thorne at Caltech) or exploiting consequences of Einstein's theory of relativity have been considered. However, time travel has never been proven to work in practice or in theory. It has been speculated that time travel might even be prohibited by the laws of the universe. Nonetheless, if we consider man-made constructs such as computers and computer programs, it is possible to make 'time machines' that allow us to achieve time travel for a specific purpose: debugging.
The cost of verifying and debugging embedded software typically exceeds half of the development budget. Debugging embedded systems software is difficult, and debugging multitasking real-time software especially so. Embedded systems have few interfaces for diagnostic observation, precisely because they are embedded. To make matters worse, the act of observation may itself change the behaviour of the system, especially if the observation is performed by software other than the application code (causing a probe-effect). Another major problem, inherent in the concurrency of multitasking real-time software, is that executions and observations are very difficult to reproduce.
In solving these problems, our research (at Malardalen Real-Time Research Centre) has led us to solutions based on black-box recorders (similar to those in aeroplanes) and 'time machines'. By recording significant events online, such as task-switches and interrupt hits, together with data from the external process and the internal state, we can deterministically re-execute the embedded system software offline, as dictated by the recording. From a user's point of view, this deterministic replay behaves exactly like a regular sequential program, mimicking the exact execution of the recorded multitasking real-time application. We can single-step, insert any number of breakpoints and inspect data without introducing the probe-effect. We can even jump back and forth in time using the debugger (hence the name Time Machine). Since we have eliminated the dependency on real time and replaced the temporal and functional context of the application with the recording, we can replay the system history repeatedly.
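The idea of jumping in time by re-executing from a recording can be illustrated with a minimal sketch in C. It is not the Time Machine implementation: the event kinds, the `app_state` fields and the function names are hypothetical, and the "application" is reduced to a toy state that depends only on the recorded events. Under those assumptions, moving to any point in the history amounts to replaying the recording from its start up to the chosen event.

```c
#include <stddef.h>

/* Hypothetical event kinds; the names are illustrative only. */
typedef enum { EV_TASK_SWITCH, EV_INTERRUPT, EV_INPUT } event_kind;

typedef struct {
    event_kind kind;
    int        id;     /* task or interrupt number */
    int        value;  /* sampled external data, used for EV_INPUT only */
} event;

/* Toy application state that the replay reconstructs. */
typedef struct {
    int  running_task;     /* -1 until the first task switch */
    long input_sum;        /* accumulated external input */
    int  interrupts_seen;  /* number of interrupt hits so far */
} app_state;

/* Apply one recorded event. The state depends only on the recording,
   never on real time, so every replay yields the same result. */
void apply_event(app_state *s, const event *e)
{
    switch (e->kind) {
    case EV_TASK_SWITCH: s->running_task = e->id;   break;
    case EV_INTERRUPT:   s->interrupts_seen += 1;   break;
    case EV_INPUT:       s->input_sum += e->value;  break;
    }
}

/* "Jump" to a point in the history: re-execute from the start of the
   recording up to, but not including, event index `target`. */
app_state replay_to(const event *log, size_t n, size_t target)
{
    app_state s = { -1, 0, 0 };
    if (target > n)
        target = n;
    for (size_t i = 0; i < target; i++)
        apply_event(&s, &log[i]);
    return s;
}
```

Calling `replay_to` with a smaller `target` than before corresponds to jumping backwards, and calling it twice with the same arguments returns identical states, which is the property that makes repeated inspection possible in this simplified model.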
Figure 1: A robotic assembly line with black-box recorders. Retrieval of the black-box contents allows remote, deterministic post-mortem replay debugging using virtual Time Machines.
Figure 2: A commercial IDE with an instruction-level simulator debugger, into which we have integrated our Time Machine technology (the lower left window). The time line illustrates the recorded control flow for six tasks, with task priorities on the vertical axis. Selecting any instance of a task re-executes the system from an idle point (the red, lowest-priority task) up to the selection, so it is possible to jump back and forth in time. The debugger window shows the current state. From here it is possible to single-step, watch variables, and set additional breakpoints.
We have applied our method to a number of systems, the most recent and most complex being an industrial robot control system from the largest industrial robot manufacturer in the world, ABB Robotics. Their system consists of several computing control systems, signal processing systems and I/O units. We applied our Time Machine to the motion-control part of the system, which consists of approximately 2.5 million lines of C code and runs on the VxWorks real-time operating system. The motion-control part is a hard real-time system, with about 70 tasks running (the most frequent task is activated every 4 ms) and multiple interrupts driving an assortment of device drivers.
The control flow of the system (the task-switches) was captured by a task-switch hook and recorded in a cyclic buffer of programmable length (the black-box). We also transparently instrumented the system calls that could change the system control flow by making use of an existing operating system abstraction layer. The only manual instrumentation that had to be inserted into the source code consisted of calls to data-flow monitors after blocking system calls, in order to capture messages and the state of the task (represented by specified local and global variables). Note that we needed to record only the start conditions, since we re-executed the code offline. In total, our black-box recorder introduced an overhead of less than 2% of the processor utilisation and a few hundred kB of data in order to capture the last few hundred events before major failures, which could subsequently be replayed in the Time Machine.
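A cyclic buffer of this kind can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the recorder used in the robot controller: the record layout, the fixed `BLACKBOX_CAPACITY` (the article describes a programmable length) and all function names are hypothetical, and the registration of the recording function as an operating system task-switch hook is omitted.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed capacity, kept small for illustration. */
#define BLACKBOX_CAPACITY 4

/* One recorded task switch; the fields are assumptions. */
typedef struct {
    uint32_t timestamp;
    int      from_task;
    int      to_task;
} switch_record;

typedef struct {
    switch_record slots[BLACKBOX_CAPACITY];
    size_t        next;   /* slot that the next record overwrites */
    size_t        count;  /* number of valid records, at most capacity */
} blackbox;

void blackbox_init(blackbox *bb)
{
    bb->next = 0;
    bb->count = 0;
}

/* Intended to be called from a task-switch hook: constant time, no
   allocation, and the oldest record is overwritten once the buffer is full. */
void blackbox_record(blackbox *bb, uint32_t timestamp,
                     int from_task, int to_task)
{
    switch_record *r = &bb->slots[bb->next];
    r->timestamp = timestamp;
    r->from_task = from_task;
    r->to_task   = to_task;
    bb->next = (bb->next + 1) % BLACKBOX_CAPACITY;
    if (bb->count < BLACKBOX_CAPACITY)
        bb->count++;
}

/* Copy the surviving records, oldest first, into `out` (for example after
   a failure). Returns the number of records copied, at most `max`. */
size_t blackbox_dump(const blackbox *bb, switch_record *out, size_t max)
{
    size_t start = (bb->next + BLACKBOX_CAPACITY - bb->count) % BLACKBOX_CAPACITY;
    size_t n = bb->count < max ? bb->count : max;
    for (size_t i = 0; i < n; i++)
        out[i] = bb->slots[(start + i) % BLACKBOX_CAPACITY];
    return n;
}
```

Because writes overwrite the oldest slot, memory use stays bounded regardless of how long the system runs, and the buffer always holds the most recent events leading up to a failure, which is the history a replay needs.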
Malardalen University, Sweden