# Susceptibility of Commodity Systems and Software to Memory Soft Errors

Alan Messer, *Member*, *IEEE*, Philippe Bernadat, Guangrui Fu, *Member*, *IEEE*, Deqing Chen, *Member*, *IEEE*, Zoran Dimitrijevic, *Member*, *IEEE*, David Lie, *Member*, *IEEE*, Durga Devi Mannaru, Alma Riska, *Member*, *IEEE*, and Dejan Milojicic, *Member*, *IEEE* 

Abstract—It is widely understood that most system downtime is accounted for by programming errors and administration time. However, a growing body of work has indicated that an increasing cause of downtime may stem from transient errors in computer system hardware due to external factors, such as cosmic rays. This work indicates that moving to denser semiconductor technologies at lower voltages has the potential to increase these transient errors. In this paper, we investigate the susceptibility of commodity operating systems and applications on commodity PC processors to these soft errors and we introduce ideas regarding improved recovery from these transient errors in software. Our results indicate that, for the Linux kernel and a Java virtual machine running sample workloads, many errors are not activated, mostly due to overwriting. In addition, given current and upcoming microprocessor support, our results indicate that those errors that are activated, which would normally lead to system reboot, need not be fatal to the system if software knowledge is used for simple software recovery. Together, they indicate the benefits of simple memory soft error recovery handling in commodity processors and software.

Index Terms—Soft errors, memory errors, commodity, operating systems, Java, recovery.

# 1 INTRODUCTION

COMMODITY systems such as PC systems based on the Intel IA-32 architecture running the Windows and Linux operating systems account for the bulk of computer system sales. As computers become more ubiquitous, demand for better performance and higher availability increases in cost-effective commodity systems. However, because of price pressures, current commodity systems have focused on price/performance issues, giving availability less attention. It is a common belief that software errors and administration time are, and will continue to be, the most probable cause of the loss of availability [13]. While such failures are clearly commonplace, especially in desktop environments, research has shown that certain transient hardware errors, particularly in memories, are also becoming increasingly probable as technology improves [5], [32]. Since such transient errors require system reboots that can take several tens of minutes or more on large systems, these errors can affect availability considerably.

- A. Messer is with Samsung Electronics Corporate Technology Operation, 75 W. Plumeria Dr., San Jose, CA 95134.
  E-mail: alan\_messer@yahoo.com.
- P. Bernadat is with Hewlett Packard, 5 av. Raymond Chanas, 38320 Eybens, France. E-mail: philippe\_bernadat@hp.com.
- G. Fu is with the Mobile Internet Laboratory, DoCoMo Communications Lab. USA Inc., 181 Metro Dr. Suite 300, San Jose, CA 95110.
  E-mail: fu@dcl.docomo-usa.com.
- D. Chen is with AskJeeves, 1551 S. Washington Ave., Suite 400, Piscataway, NJ 08854. E-mail: dchen@askjeeves.com.
- Z. Dimitrijevic is with Google at the University of California at Santa Barbara, Computer Science Department, Santa Barbara, CA 93106. E-mail: xoran@cs.ucsb.edu.
- D. Lie is with the University of Toronto, 10 King's College Rd., Toronto, ON M5S 3G4, Canada. E-mail: lie@eecg.toronto.edu.
- D.D. Mannaru is with IBM, Research Triangle Park, NC 27709.
  E-mail: durgavellanki@yahoo.com.
- A. Riska is with Seagate Research, 1251 Waterfront Place, Pittsburgh, PA 15222. E-mail: alma.riska@seagate.com.
- D. Milojicic is with Hewlett Packard Laboratories, 1501 Page Mill Rd., MS 1183, Palo Alto, CA 94304. E-mail: dejan\_milojicic@hp.com.

Manuscript received 13 Sept. 2001; revised 28 May 2003; accepted 27 May 2004.

For information on obtaining reprints of this article, please send e-mail to: tc@computer.org, and reference IEEECS Log Number 114973.

Hardware errors can be classified as hard errors (faults) or transient (soft) errors. Hard errors are those that require replacement (or otherwise relinquished use) of a component. These typically happen as a consequence of physical damage to a component, e.g., by damage to connectors. Transient (soft) errors are those that result in an invalid state in the hardware that is correctable. For example, data stored at a memory location may become corrupt, but overwriting it will remove the invalid state. Such errors may lie dormant for a significant time since they are only detected by the system when the processor directly uses the erroneous hardware or corrupt memory location. When an error is touched by the hardware, it is referred to as Error Activation, while activated errors which go undetected by the hardware are called Silent Data Corruption. Such hardware errors have been considered by mainframe technology for years using expensive proprietary hardware and software for detection and recovery [2], [26]. However, in the field of commodity systems, it has not been cost-effective to provide full hardware detection and redundancy for recovery support to mask all errors.

Results over the last 20 years have shown that soft errors due to cosmic rays and substrate alpha particles can cause semiconductor transient errors in memory hardware [32]. For example, it has been reported that a 1GB memory system based on today's 64Mbit DRAMs still has a potential combined error rate of 3435 FIT (Failures In Time, i.e., failures in $10^9$ hours) when using Single Error Correct-Double Error Detect (SEC-DED) ECC [10]. Of these errors, soft errors outnumber hard errors by about 30 to 1 (96.7 percent of the total). This is equivalent to around 300 reboots resulting from soft errors on 10,000 machines in one year, if all the errors are activated and cause reboots. Based on Moore's law, both cache and DRAM sizes will grow significantly over the next five years (to around 2Gbits per DRAM), indicating the possibility for a large increase in error rates due to shrinking cell sizes and reduced supply voltage.
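The reboot estimate above follows from the definition of FIT. The short calculation below is only an illustrative sketch: the function and variable names are ours, and a 365-day year of continuous operation is assumed.

```python
HOURS_PER_YEAR = 24 * 365  # assumes machines run continuously all year


def failures_per_year(fit: float, machines: int) -> float:
    """Expected failures per year for a fleet, given a per-machine FIT rate.

    One FIT is one failure per 10^9 device-hours.
    """
    return fit * 1e-9 * HOURS_PER_YEAR * machines


combined_fit = 3435        # combined error rate quoted in the text
soft_to_hard_ratio = 30    # soft errors are about 30 times as frequent as hard errors
soft_share = soft_to_hard_ratio / (soft_to_hard_ratio + 1)

total = failures_per_year(combined_fit, machines=10_000)
soft = total * soft_share

print(f"all failures per year on 10,000 machines: {total:.1f}")
print(f"of which soft errors:                     {soft:.1f}")
```

Both figures land close to the "around 300" quoted in the text, under the stated worst-case assumption that every error is activated and causes a reboot.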

The increasing prevalence of soft errors and their recovery has received some attention in current and next-generation commodity processor architectures. For example, Intel's IA-32 architecture has various levels of support across processor implementations for error detection and correction on certain buses and caches. In addition, the new Intel/HP IA-64 architecture contains increased support for the detection, correction, and reporting of software recovery of soft errors at the processor level.

At current error rates and memory sizes, processor-only recovery support may be sufficient. However, given the potential for a high soft error rate, we would like to understand the effect of those errors that are not masked by hardware support. In doing so, this paper aims to determine how frequently soft errors are activated by commodity PC software using commodity operating systems on commodity processors. In addition, given improved error support, we aim to determine what influence soft errors will have on commodity software workloads and what the possibilities for recovery from those soft errors are.

The rest of the paper is organized as follows: In Section 2, we present our approach to investigating error susceptibility and recovery on commodity processors. We then describe our investigation into the influence of these errors on a commodity operating system (Section 3) and a sample application platform (Section 4). In Section 5, based on our understanding, we analyze the probable effect of soft errors on commodity systems with and without simple, improved error handling. Section 6 presents work related to this paper. In Section 7, we present the lessons learned from this work and, in Section 8, we conclude the paper and propose potential future work.

# 2 APPROACH

In a commodity system based on the IA-32 or IA-64 architectures, memory soft errors predominantly occur in both the cache and main memories of a system. Depending on memory size, technology sensitivity to soft errors, and price pressures, PC systems usually support at least parity detection on main memory and ECC for larger caches. Single bit errors can be effectively masked with error correction support; however, as technology feature sizes shrink and voltage levels drop, the probability of multiple bit errors increases. Depending on the protection used, price, and technology types, the number of errors masked by this protection will vary. But, ultimately, a number of soft errors are not masked and may be activated by the software, resulting in either detected errors or silent data corruption.

Determining the effect of soft errors on a commodity system is a difficult task due to their relative infrequency and limited postmortem information. Past work has reported the effect of these errors on availability using data acquired from in-the-field failures; however, this information typically comes from mainframe machines where some level of postmortem information is available. Our approach to this problem is to investigate the activation rate of emulated soft errors (rather than total availability) for off-the-shelf commodity systems and software. Based on this approach, we can gain an understanding of how soft errors are activated and, with some minor kernel modifications, we can classify those errors by their usage and likely effect on commodity software and processors.

Based on this information, we would like to understand the processor and system status at which the memory error is activated. This is important because the error's severity on the software and, thus, the ability of the system software to recover from the error is directly affected by what software activates the error. We believe that the information derived from these experiments will increase the understanding of whether or not to make software execution platforms more robust to soft errors.

Soft errors from sources such as cosmic rays occur uniformly in memory, although the size of the error may vary due to multibit impacts. To mimic this occurrence, our approach is to insert emulated memory soft errors uniformly distributed throughout memory and then determine their activation rate. This approach does not stress-test particular memory regions or system processing for errors to fully understand the consequences of the error on availability; other work covers this approach well. Instead, we are able to determine the effect of the uniformly inserted errors on the *total* system software (OS, application software, etc.).

In this paper, we evaluate two common PC software platforms: the operating system and the Java virtual machine. Operating systems have the most scope for performing recovery without affecting application code because they are first to receive an error from the hardware and are in total control of the hardware. However, some errors cannot be handled by the operating system. In these cases, the Java virtual machine's abstraction may enable additional application recovery.

#### 2.1 Existing Commodity Error Handling

The effect of a soft error on software execution depends predominantly on the processor's support for error masking, handling, and reporting. In commodity processors such as IA-32 and IA-64 architecture processors, a detected memory soft error causes the system to raise a Machine Check Architecture (MCA) exception to notify the operating system of a serious error. However, because the hardware has seen an uncorrectable error and because of commodity price/performance pressures, this exception usually leaves the processor in an undefined state requiring a system reboot due to loss of containment of the error's effects at the hardware level.

A complete overview of the IA-32 or IA-64 MCA is beyond the scope of this paper. In general, for IA-32 processors, while the exception leaves the processor in an undefined state, the status of the processor concerning the error is reported in a set of processor registers [17]. The IA-64 architecture extends support for soft errors in two ways [16], [28]. First, additional hardware detection is supported for processor implementation, such as providing parity or ECC protection to the system bus and the three on-chip caches. These provide good coverage of most common errors while limiting cost. Second, the recoverability of machine-check exception handling has been improved by providing several types of well-defined error scenarios. This provides more information for potential software containment of the error.

These processors report many types of errors as part of the MCA. For memory errors, three pieces of information are of interest: where the error activation occurred (both the current instruction pointer and the erroneous memory location), what action caused the error (e.g., read or write access), and whether it occurred in main memory or in cache. Based on these error cases, several opportunities exist for simple error recovery. For example, a write access to a soft error in main memory need not cause a fatal exception since the error is being overwritten rather than read. However, many architectures cause a cache read on a write in order to read the data into cache for update. In doing so, an error is consumed and signaled needlessly. Alternatively, an error may occur in user memory during application processing. Since the error is outside the kernel, the kernel's integrity has not been affected, allowing an enhanced kernel to simply kill the user process and continue.

Based on these possibilities, our approach is to understand where errors occur and what operation was being performed so that knowledge can be used to determine whether some form of error recovery processing could contain the error. This may lead to enhanced error recovery in the kernel or an application platform to contain the data corruption due to the error.

# 3 INFLUENCE OF SOFT ERRORS ON A COMMODITY OS

To understand how activated errors affect the operating system on commodity processors, we need to measure and characterize soft errors with the following information:

- Was the processor reading or writing memory? Depending on the processor implementation, errors while writing versus reading memory may be ignored since the content is overwritten.
- Was the processor executing kernel or user code? If the error activation occurs while the processor is in user mode, there may be an opportunity for terminating the current thread or task and switching to another one, thus avoiding bringing down the system.
- Is the affected memory in user or kernel space? If the processor is executing kernel code, but the accessed memory is in user space, such as when reading user data in a write system call or when reading system call arguments, it may be possible to modify the operating system to send a signal to the corresponding user task and interrupt the system call. Most operating systems are already prepared to handle an invalid memory access for such transfers.
- What is the memory object type and what is the OS state if the processor is executing kernel code? Depending on the type of kernel memory and how it is accessed (read/write), the recoverability of the error can be determined from the state saved from the error insertion.

#### 3.1 Kernel Instrumentation and Methodology

Simply injecting errors at random memory locations is an easy task. Determining if the error is activated and what the effects are is more difficult. The error may be activated without any visible effect, even though its future consequences can be severe. For example, an activated error may cause a reboot, requiring a file system integrity check taking many minutes or hours for large systems. In the extreme case where the kernel panics or halts, analyzing error causality "a posteriori" is complex. Restarting the system during this analysis is often a long process that may require some human intervention.

For soft error investigation, where a number of samples are required, we feel the kernel panic analysis process is too slow and difficult for gathering enough samples. Instead, we chose to adopt a nonintrusive approach to observing error activations that would give us enough information to categorize the memory's usage, and to combine this with human analysis to determine the general effect on the underlying software. To enable this hand analysis, we must modify the kernel to capture the relevant state at memory error activation time in a nondestructive and nonintrusive fashion. For each activated error, our instrumentation records the activation delay and some error context, including the affected memory type at injection time, the affected memory type at activation time (since memory may be reallocated), the access mode (read or write), the execution mode (kernel or user), the interrupted task's ID, and the program counter. This information is used under human analysis to understand how potentially fatal each error would have been to the OS.
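The per-activation record described above can be pictured as a small immutable structure. The following is a minimal sketch for illustration only: the class and field names are hypothetical, the example values are made up, and the actual instrumentation lives inside the Linux kernel rather than in Python.

```python
from dataclasses import dataclass
from enum import Enum


class AccessMode(Enum):
    READ = "read"
    WRITE = "write"


class ExecMode(Enum):
    KERNEL = "kernel"
    USER = "user"


@dataclass(frozen=True)
class ErrorContext:
    """One record per activated error (field names are illustrative)."""
    activation_delay_s: float   # time between injection and activation
    type_at_injection: str      # memory object type when the error was injected
    type_at_activation: str     # memory object type when the error was touched
    access_mode: AccessMode     # read or write access
    exec_mode: ExecMode         # processor in kernel or user mode
    task_id: int                # ID of the interrupted task
    program_counter: int        # PC of the activating instruction

    @property
    def reallocated(self) -> bool:
        """True if the memory changed type between injection and activation."""
        return self.type_at_injection != self.type_at_activation


# Made-up example: a page-cache page reused as a user stack, then written.
example = ErrorContext(
    activation_delay_s=12.5,
    type_at_injection="page cache",
    type_at_activation="user private",
    access_mode=AccessMode.WRITE,
    exec_mode=ExecMode.USER,
    task_id=4242,
    program_counter=0x0804_8ABC,
)
print(example.reallocated)
```

Recording the memory type twice is what lets the later analysis tell apart an error that was consumed by its original owner from one that was wiped when the page was recycled.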

Note that, because we are using a software injection approach, we are only attempting to determine the software effect of activated errors on the system, rather than to measure the total availability of the system. In addition, using this approach, we are unable to measure activations caused by memory activity that does not originate from instruction execution, such as DMA transfers or virtual memory lookups.

#### 3.2 Error Injections

We performed our investigations on an IA-32 platform using watch points to simulate memory errors. Similar to break points for instructions, watch points are a means to detect any type of memory access to a given virtual memory location. A set of three debug registers in the processor allows the detection of data read/write accesses or instruction fetches at any given memory location. Given that the watch point mechanism is virtual address driven, one limitation is that physical memory that isn't accessed through a virtual mapping is ignored. In particular, simulated errors in page tables (PTE) cannot be detected during page translations nor can errors in I/O buffers during DMA operations. Most platforms use a TLB cache to minimize PTE lookups. I/O buffers usually contain user data and errors activated in such a memory area are not considered fatal. This limitation shouldn't significantly impact the OS susceptibility.

Fig. 1. Nonintrusive error injection to emulate soft errors and determine activation point/usage.

The fault injector is organized as two components: a user mode program and some newly written kernel code. The user mode program is executed concurrently with the workload and randomly selects the physical addresses at which to inject errors. It then interfaces to our kernel injection component through a /proc virtual file system entry, /proc/mfi, to set up a watch point. This interface is a convenient way to communicate with the kernel without adding new system calls. If the error is activated, the kernel component returns the error context through the same interface. The kernel component deletes the watch point once the error is activated or when a configurable time-out expires. Finally, the user mode program computes the various statistics required for our analysis. The watch point facility does not allow more than one virtual address to be monitored simultaneously. Therefore, we set up a time-out to detect that the error has not been activated and then inject a new error at another random location.

The kernel component searches for the virtual address (kernel or user) that maps to the physical address provided by the user program. (We never detected multiple memory mappings while running our experiments.) This virtual address is process space dependent and must be searched for each distinct task. This reverse PTE lookup is fairly expensive and cannot be performed systematically for all tasks or at each context switch. Instead, it is performed when a task is first scheduled after the error was injected; the matching virtual address is then cached in a task-specific data structure. Additionally, the kernel intercepts all virtual mapping requests in case the physical page where the error was injected is about to be mapped. Three watch points are initialized to detect read, write, and instruction fetch on this virtual address. If a watch point exception is raised, the kernel gathers the error context and returns it to the user program. See Fig. 1 for an overview of this nonintrusive soft error emulation process.

To collect the largest sample set, a new error is injected as soon as the previous one is activated or when the time-out expires; therefore, injections are not strictly periodic. Overall, we injected one error every 50 seconds, with minimal impact on the workload applications.
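The one-error-at-a-time injection loop with a time-out can be sketched as a small simulation. Everything here is an assumption made for illustration: the function names are ours, and `toy_delay` is an invented stand-in for the workload (it is not derived from the measurements in this paper). The real injector arms a hardware watch point through /proc/mfi instead of drawing a delay.

```python
import random


def run_injections(memory_bytes, timeout_s, n_injections, activation_delay, seed=0):
    """Inject one emulated error at a time; each either activates or times out."""
    rng = random.Random(seed)
    records = []
    for _ in range(n_injections):
        addr = rng.randrange(memory_bytes)    # uniform over physical memory
        delay = activation_delay(addr, rng)   # None means "never accessed"
        activated = delay is not None and delay <= timeout_s
        records.append((addr, delay if activated else None))
    return records


def toy_delay(addr, rng):
    """Hypothetical workload model (addr is ignored in this toy version).

    10% of memory is never touched; the rest is first accessed after an
    exponentially distributed delay with a 60-second mean.
    """
    if rng.random() < 0.10:
        return None
    return rng.expovariate(1 / 60.0)


def activation_rate(records):
    """Fraction of injected errors that were activated before the time-out."""
    return sum(1 for _, delay in records if delay is not None) / len(records)


rates = {}
for timeout in (10, 60, 1800):
    recs = run_injections(64 * 2**20, timeout, 2000, toy_delay, seed=1)
    rates[timeout] = activation_rate(recs)
    print(f"time-out {timeout:>4} s -> activation rate {rates[timeout]:.2f}")
```

Because the same seed reproduces the same injections, the set of activated errors for a short time-out is a subset of that for a longer one, so the simulated activation rate can only grow with the time-out and stays below 100 percent when some memory is never accessed.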

#### 3.3 Memory Objects Classification

To enable human analysis of the activation point and consequences of the recorded exceptions, we need a means of classifying the usage of that memory as well as determining where it is used. In order to classify memory usage, we chose to break it down into types of memory, based on the point at which the memory is allocated. This allows us to determine whether the memory is used for file system buffering, stack space, etc., and, thus, categorize it for further analysis. To obtain this memory type information, the Linux OS was instrumented such that every byte of main memory is classified. This is accomplished by modifying the memory allocators (the buddy and Slab memory systems [3]) so that they register the requestor's return PC within the memory object. Each distinct PC is mapped to a distinct memory type. Given any kernel virtual address, the operating system's memory type may be retrieved either from the page descriptor or the Slab header. The memory object type is determined both at injection and activation time since the physical memory may be reallocated in the meantime. This allows the program function allocating the memory (either the kernel or application) to be recorded, as well as the function activating the memory, and then stored for analysis on activation.
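The idea of tagging each allocation with the caller's return PC and mapping each distinct PC to a memory type can be illustrated with a toy registry. This is a sketch under our own assumptions: the class, its methods, and the addresses are hypothetical, and the real implementation stores the tag in the page descriptor or Slab header rather than in a Python list.

```python
class AllocationRegistry:
    """Toy model: tag each allocation with its return PC, one type ID per PC."""

    def __init__(self):
        self._type_of_pc = {}   # return PC -> memory type ID
        self._objects = []      # (start, end, return PC), end exclusive

    def allocate(self, start, size, return_pc):
        # The first time a call site is seen, it is assigned the next type ID.
        self._type_of_pc.setdefault(return_pc, len(self._type_of_pc))
        self._objects.append((start, start + size, return_pc))

    def free(self, start):
        self._objects = [obj for obj in self._objects if obj[0] != start]

    def type_at(self, addr):
        """Return the memory type ID covering addr, or None for free memory."""
        for start, end, pc in self._objects:
            if start <= addr < end:
                return self._type_of_pc[pc]
        return None


reg = AllocationRegistry()
reg.allocate(0x1000, 4096, return_pc=0xC0101234)   # made-up call site A
type_at_injection = reg.type_at(0x1800)

reg.free(0x1000)                                   # page is released...
reg.allocate(0x1000, 4096, return_pc=0xC0205678)   # ...and reused by call site B
type_at_activation = reg.type_at(0x1800)

print(type_at_injection, type_at_activation)
```

The two lookups return different type IDs for the same address, which is why the type is sampled both at injection and at activation.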

#### 3.4 Error Severity Classification

We categorize each error into one of three simple error severity classifications based on whether the error was in kernel or user memory and whether the access was a read or a write:

- **Overwritten**. The memory is accessed in write mode. On many platforms, write access to an erroneous location is not detected and can be ignored. However, on some platforms, a write access may be preceded by a read when the cache loads a line, causing the processor to detect the error before it can be overwritten.
- **User Signalable**. The memory is accessed in read mode, but it belongs to a user (as opposed to kernel) area. This applies whether or not the processor was running in kernel or user mode. In these cases, the state of a particular user program has become corrupt, but the processor may allow the kernel to continue operating. As a result, the kernel can signal the user task and proceed with another one or interrupt the system call. Depending on the processor, some memory error exceptions indicate that processor error containment has been lost; these are not considered to be user signalable.
- **Kernel Fatal**. The memory is accessed in read mode and the location belongs to the kernel space. In general, this is fatal because the kernel state is corrupt. There may be cases where the error could be ignored or surmounted, but this would require a more thorough kernel analysis.
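The three classes above amount to a small decision rule over the access mode and the memory space. The sketch below uses hypothetical names; treating a user-space read with lost containment as fatal is our assumption, since the text only states that such errors are not user signalable.

```python
from enum import Enum


class Severity(Enum):
    OVERWRITTEN = "overwritten"
    USER_SIGNALABLE = "user signalable"
    KERNEL_FATAL = "kernel fatal"


def classify(access: str, memory_space: str, containment_lost: bool = False) -> Severity:
    """Classify an activated error by access mode and owning memory space.

    access:           "read" or "write"
    memory_space:     "user" or "kernel" (who owns the memory, not the CPU mode)
    containment_lost: the processor reported loss of error containment
    """
    if access not in ("read", "write"):
        raise ValueError(f"unknown access mode: {access!r}")
    if memory_space not in ("user", "kernel"):
        raise ValueError(f"unknown memory space: {memory_space!r}")

    if access == "write":
        return Severity.OVERWRITTEN          # the bad value is replaced
    if memory_space == "user" and not containment_lost:
        return Severity.USER_SIGNALABLE      # signal the task, keep the kernel
    # Assumption: anything else read by the system is treated as fatal.
    return Severity.KERNEL_FATAL


print(classify("write", "kernel"))
print(classify("read", "user"))
print(classify("read", "kernel"))
```

Note that the processor's execution mode does not appear in the rule: a kernel-mode read of user memory, such as copying system call arguments, is still user signalable.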

#### 3.5 Experimental Setup

For our experimentation, we used a 500 MHz Pentium III PC with 192 MB of memory running the Linux kernel version 2.2. We used two workloads:

TABLE 1 OS Error Injection Experiment Sets

| Workload | Injection time-out | Elapsed time | Injections | Activations | Activation rate |
|----------|--------------------|--------------|------------|-------------|-----------------|
| 1        | 10 sec.            | 100 sec.     | 12         | 2           | 17%             |
| 1        | 30 sec.            | 5 min.       | 15         | 5           | 33%             |
| 1        | 1 min.             | 10 min.      | 17         | 9           | 53%             |
| 1        | 2 min.             | 20 min.      | 18         | 10          | 56%             |
| 1        | 5 min.             | 50 min.      | 18         | 12          | 66%             |
| 1        | 10 min.            | 100 min.     | 28         | 20          | 71%             |
| 1        | 30 min.            | 5 hours      | 47         | 40          | 85%             |
| 1        | 10 sec.            | 4 hours      | 1690       | 464         | 27%             |
| 1        | 30 sec.            | 4 hours      | 670        | 278         | 41%             |
| 1        | 60 sec.            | 4 hours      | 382        | 197         | 52%             |
| 1        | 120 sec.           | 4 hours      | 228        | 132         | 57%             |
| 2        | 60 sec.            | 90 hours     | 5499       | 599         | 11%             |
| Total    |                    | ~114 hours   | 8624       | 1768        | 20%             |

- Workload 1: The host runs an Apache Web server and repetitively recompiles the Linux kernel. A single client (600 MHz Pentium III Windows NT) connected over a 10 Mbit Ethernet link runs the WebStone benchmark against the Apache server, simulating 20 users. The network traffic is close to saturation. For this workload, the real memory was artificially reduced to 64 MB so that the memory working set can be slightly larger than real memory to induce some swap activity.
- Workload 2: The host runs MySQL server 4.0.12 and iteratively executes the associated CPU-bound benchmark suite. We did not limit memory as we did for workload 1, leaving the Linux system with 10 percent of memory reported free. Another difference is that there is no network traffic in this workload. One characteristic of this version of MySQL is its extremely efficient memory cache, which reduces the number of I/O operations.

To collect enough performance data, we injected errors at a much higher rate than found in a real system. Unlike a real environment, where an error persists in memory as long as it is not activated, our simulator cannot monitor more than one error at a time. Since the error may never be activated, we need to impose a time-out. So, the injection rate is not a fixed parameter, it is not uniform, and its value can only be measured a posteriori. The time-out value is the only configurable parameter.

We performed two sets of experiments. With the first one, performed with workload 1 (first seven rows of Table 1), we study how the activation and activation delay evolve as a function of the error injection time-out. In this experiment, the injection time-out varies from 10 seconds to 30 minutes. The second set of experiments, performed on both workloads (last five rows of Table 1), allows us to characterize the error severity and to observe the influence of the workload. Here, the errors were injected at higher rates and the experiments last longer.



Fig. 2. Fault activation rate versus injection time-out.

Overall, across 12 distinct experiments, 8,624 error injections were performed over a period of about 114 hours, resulting in 1,768 activations (the totals of Table 1). In the next sections, we analyze these experiments with respect to the activation rates, the activation delays, the influence of the time-out value and the workload, and the error severity.
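The Total row of Table 1 can be cross-checked with a few lines of arithmetic. The snippet below simply re-enters the table's rows (elapsed times converted to seconds); the variable names are ours.

```python
# (workload, time-out, elapsed seconds, injections, activations), from Table 1
ROWS = [
    (1, "10 sec.",  100,         12,   2),
    (1, "30 sec.",  5 * 60,      15,   5),
    (1, "1 min.",   10 * 60,     17,   9),
    (1, "2 min.",   20 * 60,     18,   10),
    (1, "5 min.",   50 * 60,     18,   12),
    (1, "10 min.",  100 * 60,    28,   20),
    (1, "30 min.",  5 * 3600,    47,   40),
    (1, "10 sec.",  4 * 3600,    1690, 464),
    (1, "30 sec.",  4 * 3600,    670,  278),
    (1, "60 sec.",  4 * 3600,    382,  197),
    (1, "120 sec.", 4 * 3600,    228,  132),
    (2, "60 sec.",  90 * 3600,   5499, 599),
]

elapsed_s = sum(row[2] for row in ROWS)
injections = sum(row[3] for row in ROWS)
activations = sum(row[4] for row in ROWS)

elapsed_h = elapsed_s / 3600
overall_rate = activations / injections
mean_interval_s = elapsed_s / injections   # average time between injections

print(f"experiments:       {len(ROWS)}")
print(f"elapsed:           {elapsed_h:.1f} hours")
print(f"injections:        {injections}")
print(f"activations:       {activations}")
print(f"activation rate:   {overall_rate:.1%}")
print(f"mean interval:     {mean_interval_s:.1f} s per injection")
```

The mean interval comes out slightly under 50 seconds, in line with the "one error every 50 seconds" figure quoted in Section 3.2.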

#### 3.6 Activation Rate and Activation Delay

Because of our need to use a time-out on memory injections, our first analysis was to determine how the activation rate and activation delay vary with the injection time-out. Absolute activation rates are of less interest, but logic indicates that the larger the time-out is, the greater the activation rate. With no time-out (i.e., an infinite time-out value), the activation rate should be close to 100 percent. Only close, since some memory areas may never be accessed, such as unused kernel text (initialization code, unused components). Fig. 2 shows that the activation rate reaches about 55 percent for a 120-second time-out and 85 percent for a 30-minute time-out. This high activation rate is the result of both a memory-intensive workload (as little as 4 percent free memory) and of minimal memory fragmentation resulting from the use of the slab allocator in the Linux kernel.

Fig. 3 reports the activation delay (elapsed time between the injection and the activation). With a two minute time-out value, 90 percent of the activated errors are activated within a minute. The average activation delay increases insignificantly for injection time-out values greater than five minutes.

#### 3.7 Memory Distribution and Activation Rate

Related to this analysis, we would also like to understand how activated errors are distributed by memory type to determine their recoverability. Fig. 4 depicts the average memory usage distribution while running workload 1. We have categorized memory into 200 distinct memory object types. Among the allocated memory (96 percent of total real memory), the top nine types account for 93 percent. Seventy-five percent of the memory (48 MB) is dedicated to user processes. For this workload, a total of 280 processes are allocated. Excluding the user private objects (private as opposed to shared with other tasks), the mapped files, and the free memory, 21 percent of the memory belongs to kernel objects.

Fig. 3. Activation delay.

Fig. 4. Linux memory objects classification by size.

In Fig. 5, we show three distinct injection and activation distributions for four injection time-out value experiments. The first distribution is the percentage of overall memory used by each memory type. The second is the injection distribution over the various memory types and the third one is the distribution at the activation time. The figures show that error injections are distributed across memory object types according to their memory usage. This is no surprise since the injector uses a uniform random generator to compute the physical addresses. The activation distribution is not such a close fit; in particular, the user private memory hit rate is unexpectedly high and the mapped file hit rate is unexpectedly low. Two factors contribute to this:

- The task/thread creation rate for this workload is high and the private data page lifetime is short. Every byte of a freshly allocated private page is cleared by the kernel whether or not it will be entirely used by a task.
- The text (here classified as a mapped file) locality is also high. The server tasks are repetitive. Only a small fraction of the text pages are referenced. This leads to a low text hit rate.

Another important observation is that the injection time-out value has little influence over the distributions or the activation rate per memory type at these time-out levels: while the absolute activation rate varies with the time-out, the activation rate per memory type is similar across time-out values. This supports the claim that our experimental nonintrusive time-out-based approach mimics the activations a real kernel would see from memory soft errors.

#### 3.8 Error Severity

Fig. 6 shows the overall result of our classification across the two workloads. Overall, only 10 percent of the activated errors are considered fatal to the system for our sample workloads. Most of this reduction arises because 74 percent of the error activations simply overwrite the erroneous location, leaving 16 percent of the errors which have potential for signaling the application before termination. This interesting result follows from the common operation of many software components such as stacks and virtual memory pages, both of which are generally written (with an intermediate result or a zeroed page, respectively) before they are read.

First, let us assume that write errors are silently ignored by the hardware or that, if signaled, the error can be continued and the OS may be restarted. Then, we could ignore 74 percent of the activated errors. Given this assumption, an unmodified Linux system would be affected by only 26 percent of the activated errors. Second, the kernel already has support for appropriately handling existing user data errors (e.g., segmentation violations) by signaling the relevant task. The same mechanism would allow us to signal applications when a read error activation occurs in user space. With this error handling support and restartable processor error exceptions, the system would only need to panic for 10 percent of the errors.

Fig. 6. Error severity classification.

Table 2 provides the activation rate and error severity distribution for the two workloads and on a per time-out basis. Our first observation is that time-out value has little influence on the severity. The activation rate is higher for the first workload since we artificially reduced the real memory so that the entire memory range is used. Despite a significant variation for the overwritten and user signal-able errors, the fatal error proportions are close: 8 percent for workload 1 and 13 percent for workload 2.
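As a rough cross-check of Table 2, the Overall column can be approximately reproduced from the per-experiment columns. The sketch below assumes (this is not stated in the text) that the overall severity figures are averages weighted by the number of activated errors in each experiment. The names `EXPERIMENTS` and `overall` are illustrative only.

```python
# Per-experiment values transcribed from Table 2.
# Tuple layout: (injections, activation rate, fatal, overwritten, user signalable)
EXPERIMENTS = {
    "workload 1, 10 sec.": (1690, 0.27, 0.09, 0.80, 0.11),
    "workload 1, 30 sec.": (670, 0.41, 0.06, 0.81, 0.13),
    "workload 1, 60 sec.": (382, 0.52, 0.10, 0.78, 0.13),
    "workload 1, 120 sec.": (228, 0.58, 0.08, 0.88, 0.05),
    "workload 2, 60 sec.": (5499, 0.11, 0.13, 0.60, 0.27),
}


def overall(experiments):
    """Combine per-experiment rows into overall figures.

    Assumption: severity shares are weighted by the estimated number of
    activated errors (injections * activation rate) in each experiment.
    """
    rows = list(experiments.values())
    injections = sum(row[0] for row in rows)
    activated = [row[0] * row[1] for row in rows]
    total_activated = sum(activated)

    def weighted(column):
        return sum(a * row[column] for a, row in zip(activated, rows)) / total_activated

    return {
        "injections": injections,
        "activation_rate": total_activated / injections,
        "fatal": weighted(2),
        "overwritten": weighted(3),
        "user_signalable": weighted(4),
    }


summary = overall(EXPERIMENTS)
for name, value in summary.items():
    print(name, value if name == "injections" else f"{value:.1%}")
```

Under this weighting assumption, the rounded results agree with the Overall column of Table 2, which suggests the per-experiment percentages are internally consistent.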

### 3.9 Potential OS Recovery and Containment

Our results show that up to 90 percent of memory errors can be considered as nonfatal to the operating system. This assumes that the operating system has been instrumented to capture relevant information at error activation time and is able to pinpoint the affected memory object type. This may allow it to discard write mode errors and signal user processes when errors occur in user memory space. While this does impose extra kernel development, we were able to apply this modification to the Linux kernel in about one man-month. Given the Linux kernel size, this seems a reasonably small implementation cost for the potential benefit.
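The decision logic described above can be sketched in a few lines. This is a simplified illustration, not the actual Linux kernel modification: the types `Access`, `Region`, and `Action` and the function `handle_memory_error` are hypothetical, and the sketch assumes the processor reports the access mode and allows execution to continue after a write-mode error.

```python
from enum import Enum


class Access(Enum):
    READ = "read"
    WRITE = "write"


class Region(Enum):
    USER = "user"      # user private memory or mapped files
    KERNEL = "kernel"  # any kernel-owned memory object


class Action(Enum):
    DISCARD = "discard the error and continue"
    SIGNAL_TASK = "signal the affected user task"
    PANIC = "panic"


def handle_memory_error(access, region):
    """Pick a recovery action for an activated memory error (hypothetical policy)."""
    if access is Access.WRITE:
        # The corrupted value is being overwritten, so it was never consumed.
        return Action.DISCARD
    if region is Region.USER:
        # Reuse the mechanism already used for faults such as segmentation violations.
        return Action.SIGNAL_TASK
    # A kernel-mode read of corrupted kernel data is treated as fatal here.
    return Action.PANIC


print(handle_memory_error(Access.READ, Region.USER).value)
```

Only the last branch corresponds to the roughly 10 percent of activated errors that remain fatal; refining it per kernel object type is the harder problem discussed next.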

The remaining 10 percent is much more difficult to handle. Looking more closely at the error distribution in Fig. 5, we observe that, apart from the nonkernel object types (user private and mapped files), a number of kernel objects may be altered without affecting the overall kernel availability:

- User page table entries—Some may be rebuilt; at worst, the task can be signaled.
- Buffer cache—Nondirty blocks can be recovered from disk or an I/O error may be raised.
- Kernel stacks—If locks can be unwound, in some circumstances, the task may be destroyed.

- Network buffers—The data may be retransmitted, or an I/O error can be raised.
- Kernel text—May be reloaded if the page-in code path is not altered.

More generally, corruptions within logs or statistical counters should not bring the system down.

However, this decrease in fatality will come at a higher cost due to more complex modifications to the operating system core. Fig. 7 outlines the major functions associated with the program counter when errors are activated in workload 1. Further to this information, Table 3 provides the list of kernel code routines affected by fatal read error activations with workload 1. The most frequently affected are:

- *ide\_output\_data*—Used while writing to disk. This is mostly a consequence of the compile tasks.
- *statm\_pgd\_range*—Collects the memory usage statistics available through the /proc virtual file system. We were running the top program simultaneously to observe the memory usage.
- *xirc2ps\_interrupt*—Processes network controller interrupts. The network is close to saturation with the Webstone benchmark.
- *filemap\_nopage*—When a user task maps a shared page as private, the page must be copied. This is usually the case for initialized data sections of a program. This memory is not considered as user private. One option would consist of signaling all tasks still mapping this page and discarding it for each. The page would then be reloaded from disk when the program is next scheduled.
- *do\_fork*—The kernel duplicates a significant amount of kernel data. Workload 1 induces a significant amount of task creations since it continuously compiles small files.

# 4 INFLUENCE OF SOFT ERRORS ON APPLICATION SOFTWARE

System recovery is a complex problem that involves participation from the hardware through to the application software. We have seen that the operating system could be extended with simple instrumentation to increase recoverability when it receives a memory error exception. However, if the operating system determines that the error occurred in application space, in order to avoid termination, the application must consider recovery as well. The operating system could be extended to signal the application that an error occurred, but recovery for the application is not necessarily straightforward. The data corruption that caused the exception may have affected an important data structure. In addition, on commodity processors, the execution activating the error is often not continuable after the exception. Therefore, the application will either need to consider recovery from such exceptions or the system will need to have mechanisms to preserve application state in order to provide recovery for the application.

TABLE 2 Influence of Workload on Error Activation and Severity

|                    | Workload 1 | Workload 1 | Workload 1 | Workload 1 | Workload 2 | Overall |
|--------------------|------------|------------|------------|------------|------------|---------|
| Injection time-out | 10 sec.    | 30 sec.    | 60 sec.    | 120 sec.   | 60 sec.    |         |
| Injections         | 1690       | 670        | 382        | 228        | 5499       | 8469    |
| Activation rate    | 27%        | 41%        | 52%        | 58%        | 11%        | 20%     |
| Fatal              | 9%         | 6%         | 10%        | 8%         | 13%        | 10%     |
| Overwritten        | 80%        | 81%        | 78%        | 88%        | 60%        | 73%     |
| User signalable    | 11%        | 13%        | 13%        | 5%         | 27%        | 17%     |

Fig. 7. Function associated with the program counter location when errors were activated in kernel mode.

In this section, we present initial investigations into application susceptibility to soft errors. At the application level, Java Virtual machines (JVM) and Java applications are of particular interest to us due to the large garbage collected heaps, the machine abstraction presented, and the integrated exception mechanism. By presenting an abstraction between the operating system and the applications, the virtual machine simplifies application-level recovery by using increased knowledge of the application's status and semantics, such as whether the error is in static or heap memory.

### 4.1 Influence on a Java VM

To determine how the JVM and its Java applications can respond to soft errors and potentially detect silent data corruption, we performed several investigations instrumenting and adapting the open-source Kaffe VM. This allowed us to examine its memory usage, to instrument it for fault injection experiments, and to extend it to detect silent data corruption. It is also a mature system, it has reasonable performance, and it is widely used. For our experiments, we used an IA-32 RedHat Linux 6.2 platform, running Kaffe 1.0.5 in the "interpreter mode."

TABLE 3 PC Locations for Read Access Memory Error Activations Occurring in Kernel Mode

| Kernel location              | %  | Kernel location           | %   |
|------------------------------|----|---------------------------|-----|
| ide_output_data (disk write) | 20 | csum_partial_copy_generic | 2   |
| statm_pgd_range (/proc FS)   | 9  | math_state_restore        | 1   |
| xirc2ps_interrupt (net I/O)  | 8  | tcp_send_skb              | 1   |
| filemap_nopage               | 8  | tcp_clear_xmit_timers     | 1   |
| do_fork                      | 7  | tcp_v4_rcv                | 1   |
| cp_new_stat (lstat)          | 4  | do_select                 | 1   |
| ext2_update_inode            | 4  | find_buffer               | 1   |
| tcp_timewait_kill            | 3  | d_lookup                  | 1   |
| brw_page                     | 3  | generic_copy_to_user      | 1   |
| ext2_open_file               | 3  | zap_page_range            | 1   |
| check_tty_count              | 3  | get_statm                 | 1   |
| clear_page_tables            | 2  | vsprintf                  | 1   |
| get_unmapped_area            | 2  | si_meminfo                | 1   |
| lookup_dentry                | 2  | si_swapinfo               | 1   |
| skb_clone                    | 2  | collect_sigign_sigcatch   | 1   |
| Total                        |    |                           | 100 |

Fig. 8. Error activation in the JVM's static data.

We instrumented the Kaffe virtual machine to inject memory errors into the data memory area and to record the memory status. In a manner similar to the OS fault injector described in Section 3.1, the interpreter loop is instrumented so that, after a certain number of byte codes have been executed, the loop calls our error injection procedure to inject a memory error.

In a Java VM, the data areas can be divided roughly into two partitions, those allocated statically for the VM and those allocated on the heap for Java objects. In each test set, errors are injected into one of these data areas. When the error is activated, we determine what data area the error has hit, what type of object it is in, and we also inspect the VM status to see whether it is activated by the garbage collector. Kaffe uses the mark and sweep algorithm, which makes this inspection fairly easy because, when the GC runs, all of the other user threads are stopped.
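The injection and activation tracking described above can be illustrated with a toy model. The sketch below is not the Kaffe instrumentation: `InjectedMemory`, `run`, the two-opcode "bytecode" format, and the outcome labels are all hypothetical. It shows only the principle of flipping a bit after a fixed number of executed bytecodes and then observing whether the corrupted location is next read, overwritten, or never touched.

```python
import random

READ_ACTIVATED = "activated by read"
OVERWRITTEN = "overwritten"
NOT_ACTIVATED = "not activated"


class InjectedMemory:
    """Byte-addressed memory that tracks one injected (latent) bit flip."""

    def __init__(self, size, seed=0):
        self.data = bytearray(size)
        self.rng = random.Random(seed)
        self.latent_addr = None  # address of the injected, not yet accessed error
        self.outcome = None

    def inject(self, addr=None):
        """Flip one random bit at addr (or at a uniformly random address)."""
        if addr is None:
            addr = self.rng.randrange(len(self.data))
        self.data[addr] ^= 1 << self.rng.randrange(8)
        self.latent_addr = addr
        self.outcome = None
        return addr

    def read(self, addr):
        if addr == self.latent_addr:
            self.outcome, self.latent_addr = READ_ACTIVATED, None
        return self.data[addr]

    def write(self, addr, value):
        if addr == self.latent_addr:
            self.outcome, self.latent_addr = OVERWRITTEN, None
        self.data[addr] = value & 0xFF


def run(bytecodes, memory, inject_after, inject_addr=None):
    """Interpret (op, addr) pairs, injecting one error after inject_after ops."""
    for count, (op, addr) in enumerate(bytecodes, start=1):
        if op == "store":
            memory.write(addr, count)
        elif op == "load":
            memory.read(addr)
        if count == inject_after:
            memory.inject(inject_addr)
    return memory.outcome or NOT_ACTIVATED


program = [("store", 0), ("load", 0), ("store", 1), ("load", 1)]
print(run(program, InjectedMemory(4), inject_after=1, inject_addr=1))
```

In this toy program, address 1 is written before it is read, so an error injected there early is masked by the store, which is the write-before-read effect observed for the kernel in Section 3.8.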

To investigate the effects on some sample applications on top of the JVM, we chose four benchmark applications extracted from the SPEC JVM98 benchmark suites using the medium data configuration—ten percent [29]. To represent a range of memory uses, we chose a Java expert system (SpecJVM98 name: \_202\_jess), a Java database (\_209\_db), a Java compiler (\_213\_javac), and a Java parser generator (\_228\_jack).

For our experiments, we injected 1,000 memory errors for the four benchmarks in both static and dynamic memory areas of the JVM. Figs. 8 and 9 show the results of our initial investigations for the static and dynamic memory areas, respectively. These results show that, for the static data region, around 5-6 percent of injected errors cause application errors (crashes or incorrect results) and around 2 percent of errors are activated but cause no adverse result. However, the Java object heap shows a much higher error activation rate between 16 percent and 63 percent when causing no error and between 7 percent and 13 percent when causing application errors.

The most interesting results show that there is a significant difference in the error susceptibility of the two data areas, especially that there is a large difference in the number of errors that are injected and not activated. As can be seen from Fig. 9, this seems to be because the Garbage Collector (GC) activates a large number of those normally latent errors. This stems from Kaffe's mark and sweep garbage collector strategy that touches most objects periodically, causing latent errors to be uncovered. This may cause an error on the GC thread. However, the GC is designed to be easily restarted to relieve load when memory is tight.

Fig. 9. Error activation in the JVM's heap region.

Interestingly, however, although most of the error activation takes place in the garbage collector, relatively few errors actually cause real problems (crashing the JVM, for example). We believe the main reason for this is that the garbage collector only uses certain data in the heap (e.g., object references) on its traversal, reducing its susceptibility to the number of actual errors. By comparison, 56 percent of static data error activation cause application errors, whereas only 7 percent of the error activation in the GC cause application errors.

### 4.2 Potential JVM Recovery and Containment

These results indicate that, similarly to the operating system, a large number of errors are latent and never detected while executing. However, it seems that applications that exhibit behavior similar to a mark and sweep garbage collector are much more susceptible to uncovering those normally latent errors. For example, an in-memory database application traverses a large amount of memory in order to produce each query result. However, it is unclear whether the different search patterns of other garbage collectors are similarly affected.

Results also indicate that, with a little extra application knowledge, a large number of those detected errors need not be fatal. For example, the garbage collector could be modified to tolerate machine-check abort exceptions that may occur during a heap sweep. Or, in the case of silent data corruption, when errors are not detected by hardware, the garbage collector could check the validity of object references before use. In fact, most garbage collectors already check object reference validity before proceeding as part of their sweep to determine which data is an object reference. This probably accounts for some of the garbage collector's existing tolerance to errors. Also, given the level of abstraction offered by the JVM, there may be opportunities for other forms of error handling, such as improved exception handling and object checksums to detect silent data corruption. Initial investigations into these ideas show several interesting approaches [7].
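As an illustration of the object checksum idea mentioned above, the following minimal sketch stores a CRC-32 alongside an object's payload and verifies it on every read. It is an assumption-laden toy, not the mechanism investigated in [7]: the class `ChecksummedObject`, the exception `CorruptionDetected`, and the choice of CRC-32 (via Python's standard `zlib.crc32`) are all illustrative.

```python
import zlib


class CorruptionDetected(Exception):
    """Raised when a payload no longer matches its stored checksum."""


class ChecksummedObject:
    """Hypothetical heap object guarded by a CRC-32 of its payload."""

    def __init__(self, payload: bytes):
        self._payload = bytearray(payload)
        self._crc = zlib.crc32(self._payload)

    def write(self, payload: bytes) -> None:
        # A legitimate write refreshes the checksum, so a latent error that
        # is overwritten is masked rather than reported.
        self._payload = bytearray(payload)
        self._crc = zlib.crc32(self._payload)

    def read(self) -> bytes:
        # Verify before use so that silent corruption becomes an exception
        # the application (or the VM) can handle.
        if zlib.crc32(self._payload) != self._crc:
            raise CorruptionDetected("payload checksum mismatch")
        return bytes(self._payload)


obj = ChecksummedObject(b"query result")
obj._payload[3] ^= 0x10  # simulate a single-bit soft error (test hook only)
try:
    obj.read()
except CorruptionDetected as exc:
    print("detected:", exc)
```

The cost is one checksum computation per access, so in practice such checks would likely be applied selectively, for example only to long-lived or critical objects.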

# 5 ANALYSIS FOR COMMODITY SYSTEMS

At the beginning of this paper, we noted that it has been reported that a 1GB memory system based on today's 64Mbit DRAMs still has a potential combined unmasked error rate of 3,435 FIT<sup>1</sup> when using ECC [10]. Given our investigation, it is interesting to consider: 1) Given our activation rate evidence, how many failures in time would lead to a reboot? 2) Given our results for the recoverability of errors, how can this error rate be improved?

For simplicity, let's assume that all activated errors are detected, which is quite common for an ECC-based system. Sections 4.2 and 5.1 report similar worst-case error activation rates in the range of 11-37 percent, an average of 20 percent. Taking the 20 percent activation rate and our 1GB memory system, our experimental result would indicate that the software would only need to reboot at an activated error rate of $3{,}435 \times 0.20 = 687$ FIT.

Our analysis of the activated error reports indicates that, given a small amount of modification to each piece of software, not all errors need be fatal to the OS (reboot) or application (restart). We would like to convert our understanding of the various software memory susceptibilities to errors into an approximate visible error rate. The approximate combined error rate for errors that are not masked by the hardware or modified OS can be determined using the following formula:

 $FatalErrorRate = (VHE \times AR) \times (KMS + AMS + OR)$ 

where

 $VHE = VisibleHardwareErrorRate$ 

 $AR = ActivationRate$ 

 $KMS = KernelErrorSusceptibility \times KernelErrorFatality$ 

 $AMS = SignaledApplicationErrorSusceptibility \times ApplicationFatality$ 

 $OR = OverwriteRate \times OverwriteFatality.$ 

In Section 3.9, we indicate that only 10 percent need be fatal to the operating system (KernelErrorSusceptibility), around 74 percent of errors are overwritten (Overwrite-Rate), and 16 percent of errors could be signaled to the application (SignaledApplicationErrorSusceptibility). Given our design goals to minimize the operating system modifications, let's assume KernelErrorFatality is 100 percent. For high-level IA-32 and IA-64 processors, most forms of data overwrite can be recovered from by the processor or exception handler, so let's assume OverwriteFatality is 0 percent. Our results from Section 4.2 show that 7-13 percent of JVM errors would be fatal to the JVM and its application when signaled (ApplicationFatality). This indicates that, with a little application knowledge, the reboot error rate could be reduced to

 $(3,435 \times 0.20) \times ((0.1 \times 1.0) + (0.16 \times 0.13) + (0.74 \times 0.0)),$ 

or 82.9 FIT, a considerably smaller rate. This drop comes predominantly from ignoring overwritten errors. However, if we assume an old IA-32 processor where overwrites are fatal, this error rate remains at 591 FIT. In these situations, when execution cannot be continued, our investigations indicate that this may be improved by focused kernel modifications. Since these numbers are error rates, we cannot directly calculate a machine's eventual availability without determining the downtime for each error.

<sup>1.</sup> Note 97 percent of the referenced source error rate is accounted for by soft errors. For simplicity, the remaining 3 percent is not taken into account in these calculations.
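The arithmetic in this section can be checked with a short script that evaluates the fatal error rate formula for both processor assumptions. The function and parameter names are illustrative; the input values are those quoted in the text, with ApplicationFatality set to the upper bound of 13 percent.

```python
def fatal_error_rate(vhe, activation_rate,
                     kernel_susceptibility, kernel_fatality,
                     app_susceptibility, app_fatality,
                     overwrite_rate, overwrite_fatality):
    """Evaluate FatalErrorRate = (VHE x AR) x (KMS + AMS + OR), in FIT."""
    kms = kernel_susceptibility * kernel_fatality
    ams = app_susceptibility * app_fatality
    overwrite_term = overwrite_rate * overwrite_fatality  # "OR" in the formula
    return (vhe * activation_rate) * (kms + ams + overwrite_term)


# Values quoted in the text.
common = dict(
    vhe=3435.0,                  # unmasked hardware error rate for 1GB, in FIT
    activation_rate=0.20,
    kernel_susceptibility=0.10,
    kernel_fatality=1.0,
    app_susceptibility=0.16,
    app_fatality=0.13,
    overwrite_rate=0.74,
)

activated = common["vhe"] * common["activation_rate"]
recoverable_overwrites = fatal_error_rate(overwrite_fatality=0.0, **common)
fatal_overwrites = fatal_error_rate(overwrite_fatality=1.0, **common)

print(f"activated error rate:       {activated:.1f} FIT")
print(f"overwrites recoverable:     {recoverable_overwrites:.2f} FIT")
print(f"overwrites fatal (old CPU): {fatal_overwrites:.2f} FIT")
```

With these inputs the first case evaluates to just under 83 FIT (reported above as 82.9 FIT) and the second to about 591 FIT, which makes explicit how strongly the result depends on whether overwritten errors can be ignored.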

# 6 RELATED WORK

In 1979, Ziegler et al. at IBM Research proposed that cosmic rays and alpha particles can cause semiconductor transient errors in memory hardware [32]. Since this seminal research, many others in the field of semiconductors have reported other experiments verifying the result. Some semiconductor research indicates that manufacturers are managing to limit increases in soft error rates through changes to their memory design and manufacturing processes [5], [33]. Other research indicates the need for consideration of soft errors more carefully in the longer term [1], [10]. However, the general consensus is that soft errors are likely to continue to play an important role in computer system availability.

Techniques such as parity bits, Error Correction Codes (ECC), and ChipKill [10] have been used in commodity main memories, storage media, and interconnects. These technologies allow different levels of error detection and correction on locations accessed by the processor. While ECC techniques reduce the number of errors, in this paper, we are interested in the effect on software when either only parity is used or errors are not masked by ECC. We believe that, as technology trends increase the probability of memory soft errors, software recovery techniques may become more important. Since these errors are not masked by hardware support, they cause the severe Machine Check Architecture (MCA) exception which typically results in a processor reset. These errors are a prime candidate for increasing availability through software recovery techniques.

Much research has been undertaken into the influence of failures on computer systems, as well as techniques to improve hardware and software reliability. Probably closest to the work of this paper has been the work on fault injection, propagation, and error handling. Systems such as FERRARI [18], React [9], and Fine [19] have greatly improved our understanding of hardware and software faults that are difficult to catch and repeat. Hsueh et al. give a good survey and comparison of different injection methods [15]. Some work has been undertaken into understanding errors in COTS systems, including Linux and the PowerPC CPU [11], [14], [21]. In the operating system work [11], [14], the focus is on inserting faults in focused areas to stress test the OS to evaluate potential corruption and effect on availability. This work is complementary to our work since it focuses on availability with faults rather than soft error activation rates, both of which are required to understand total system availability. The COTS CPU work [21] focused on inserting errors throughout the processor logic using a circuit fault injector, rather than focusing on the external memory system as our work does. Again, we feel this investigation is complementary to our soft error activation rate experiments.

One approach to the problem of soft errors is to use reliable hardware through the use of redundancy. Typically, this increased hardware reliability is only available in proprietary servers, with specialized redundantly configured hardware and critical software components, such as processor pairs [2]. Examples include the IBM S/390 Parallel Sysplex [26], the Tandem NonStop Himalaya [2], and the Stratus ftServer [22]. Cornell's Hypervisor-based fault tolerance system provides a software alternative using multiple virtual machines to provide an n-1 fault-tolerant system [4]. Another approach in multiple processor systems is fault containment and recovery at a "node" granularity, including cluster systems, and multicellular NUMA architectures, such as Hive [6].

Software reliability has been more difficult to achieve in commodity software, even with extensive testing and quality assurance [25]. Techniques such as recovery blocks [12], checkpoints [13], and techniques for failure transparency [20], to name but a few, have greatly improved recovery. In addition, a lot of work has been conducted in the context of distributed systems providing tolerance with support such as fail-over and distributed transactions [2], [13] rather than increasing single system availability, which is the focus of our work. Rio [8] takes a novel software-based approach to fault containment for a fault-tolerant file cache by using memory protection operations to protect against wild writes to shared data structures.

However, commodity software fault recovery has not evolved very far. Most popular operating systems support some form of memory protection between units of execution to detect and prevent wild read/writes. But, most commodity operating systems have not taken up software reliability research in general and have not tackled problems of memory errors. Instead, they typically rely on failover solutions, such as Microsoft's Wolfpack [27].

Part of the solution to the problem of soft errors is the more widespread use of existing availability techniques to more effectively mask errors throughout the system. However, our approach is complementary and attempts to improve the understanding of the susceptibility and recovery of commodity software to soft errors using simple fault injection and exception handling techniques. Our goal is to contribute to the existing work on the interaction of errors with software activation and to propose simple techniques that may help reduce the effect of soft errors.

# 7 LESSONS LEARNED

The following observations can be derived from our experimental data and analysis:

- The effect of soft errors on a modified operating system may be small. For our sample workload, we measured that 90 percent of memory errors need not be fatal to the operating system's execution.
- Large numbers of activations are overwritten. This stems from the write before read use of most memory locations. This is due to page (and object) clearing for security and semantic reasons.
- Kernel mode read accesses to user data only account for a small number of accesses, most of which are write accesses. In addition, kernel access to kernel data only accounts for a small number of memory accesses. This indicates that recovery is still possible when execution cannot be continued after an MCA exception if processors ignore overwritten errors, since user processes may be signaled or terminated in all other situations.

- For the Kaffe JVM and sample Java applications running on it, the memory errors in the object heap have a higher error activation rate and susceptibility rate than those in the static data area.
- A large portion of heap error activation is caused by the garbage collector (up to 75 percent). But, this activation causes fewer application errors than other sources of activation (7 percent versus 56 percent).

Adding a small amount of knowledge about the operating system and application can reduce the need for reboots by a significant fraction (down to 10 percent for an operating system and down to 15 percent for Kaffe in our initial experiments). While these are only initial results, they do indicate that simple forms of error handling and software recovery can noticeably benefit system availability.

# 8 CONCLUSIONS AND FUTURE WORK

In this paper, we have described how memory soft errors have become an increasing cause of failures in modern systems. However, commodity recovery support from these errors is limited because of price pressures on these systems. While semiconductor researchers try to limit the causes of soft errors on systems, the consensus is that these errors will continue to affect system availability.

Current and future commodity processor implementations are beginning to have increased support for soft error signaling. Assuming this improved support, we have investigated the effect of soft errors on commodity software. In doing so, we have gained an understanding of the correlation between soft errors and the reboots they can potentially cause.

Our investigation into the susceptibility of both the Linux kernel and the Kaffe Java virtual machine indicates that many errors are not necessarily activated by commodity software. In addition, despite the potential data corruption that can occur, with simple instrumentation of the Linux kernel, we believe only 10 percent of memory errors actually need to be fatal for our sample workload. For the virtual machine, a large number of errors are activated by the heap garbage collector that need not cause a fatal error to the Java application. Together, these results indicate that, with improved processor support and a little application knowledge, few of the activated soft errors need to be fatal to the system, especially due to overwritten errors. Recently, similar observations to those made here have led some high-end commodity chipsets to include memory scrubbing support, which takes advantage of overwriting to minimize the errors that are not masked by hardware.

Our results are only preliminary and the interaction between hardware soft errors and the software that they affect is a complex one. Therefore, future research on other commodity software and systems would greatly benefit our work. In addition, experiments run on real-world, possibly IA-64-based hardware, would help further validate our results and perhaps improve the possibility of running with real-world example workloads.

# ACKNOWLEDGMENTS

The authors are indebted to John Wilkes, Ira Greenberg, Duane Dutton, George Candea, and Armando Fox for reviewing early versions of this paper and Valentin Anders, Dan Osecky, Mike Traynor, Peter Markstein, Don Wiess, and Richard Adkisson for contributing to the project. Their contributions significantly improved the project and this paper's content and presentation.

# REFERENCES

- [1] L. Anghel, M. Nicolaidis, and D. Alexandrescu, "Evaluation of Soft Error Tolerance Technique Based on Time and/or Space Redundancy," Proc. 13th Symp. Integrated Circuits and Systems Design, Sept. 2000.
- [2] J. Bartlett, "A Nonstop Kernel," Proc. Eighth Symp. Operating Systems Principles, pp. 22-29, Dec. 1981.
- [3] J. Bonwick, "The Slab Allocator: An Object-Caching Kernel Memory Allocator," Proc. USENIX Technical Conf., 1994.
- [4] T. Bressoud and F. Schneider, "Hypervisor-Based Fault Tolerance," Proc. 15th ACM Symp. Operating Systems Principles, pp. 1-11, Dec. 1995.
- [5] R. Baumann, "Soft Error Characterization and Modeling Methodologies at TI: Past, Present and Future," Proc. Fourth Ann. Topical Research Conf. Reliability, Oct. 2000.
- [6] J. Chapin et al., "Hive: Fault Containment for Shared-Memory Multiprocessors," Proc. 15th Symp. Operating Systems Principles, pp. 12-25, Dec. 1995.
- [7] D. Chen et al., "JVM Susceptibility to Memory Errors," Proc. USENIX JVM Symp. '01, Apr. 2001.
- [8] P.M. Chen et al., "The Rio File Cache: Surviving Operating System Crashes," Proc. Seventh Conf. Architectural Support for Programming Languages and Operating Systems, pp. 74-83, Oct. 1996.
- [9] J. Clark and D. Pradhan, "Fault Injection: A Method for Validating Computer System Dependability," Computer, pp. 47-56, June 1995.
- [10] T.J. Dell, "A White Paper on the Benefits of Chipkill," IBM Microelectronics Division, Nov. 1997.
- [11] J.C. Fabre, F. Salles, M. Rodriguez-Moreno, and J. Arlat, "Assessment of COTS Microkernels by Fault Injection," Proc. IFIP Dependable Computing for Critical Applications, 1999.
- [12] J. Goodenough, "Exception Handling: Issues and a Proposed Notation," Comm. ACM, vol. 18, pp. 683-696, 1975.
- [13] J. Gray and A. Reuter, Transaction Processing: Concepts and Techniques. Morgan Kaufmann, 1993.
- [14] W. Gu et al., "Characterization of Linux Kernel Behavior under Errors," Proc. Int'l Conf. Dependable Systems and Networks '03, 2004.
- [15] M.-C. Hsueh et al., "Fault Injection Techniques and Tools," Computer, pp. 75-82, Apr. 1997.
- [16] Intel IA-64 Architecture Software Developer's Manual, Volume 2. Intel Corp., 1999.
- [17] Intel IA-32 Architecture Software Developer's Manual, Volume 3. Intel Corp., 2002.
- [18] G. Kanawati et al., "FERRARI: A Flexible Software-Based Fault and Error Injection System," IEEE Trans. Computers, vol. 44, no. 2, pp. 248-260, Feb. 1995.
- [19] W.-I. Kao et al., "FINE: A Fault Injection and Monitoring Environment for Tracing the UNIX System Behavior under Faults," IEEE Trans. Software Eng., vol. 19, no. 11, pp. 1105-1118, Nov. 1993.
- [20] D.E. Lowell and P. Chen, "Exploring Failure Transparency and the Limits of Generic Recovery," Proc. USENIX Operating System Design and Implementation, Oct. 2000.
- [21] H. Maderia et al., "Experimental Evaluation of a COTS System For Space Applications," Proc. Int'l Conf. Dependable Systems and Networks '02, 2002.
- [22] B. McLaughlin, "Evaluating Alternatives for Windows® 2000 Server Availability," white paper, Stratus, 2001.
- [23] L. McVoy and C. Staelin, "lmbench: Portable Tools for Performance Analysis," Proc. Usenix Technical Conf., 1996.
- [24] D. Milojicic et al., "Increasing Relevance of Memory Hardware Errors-A Case for Recoverable Programming Models," Proc. ACM SIGOPS European Workshop, Sept. 2000.
- [25] B. Murphy et al., "Windows 2000 Dependability," Proc. IEEE Int'l Conf. Dependable Systems and Networks, June 2000.
- [26] J.M. Nick et al., "S/390 Cluster Technology: Parallel Sysplex," IBM Systems J., vol. 36, no. 2, pp. 172-201, 1997.
- [27] G. Pfister, In Search of Clusters. Prentice Hall, 1998.
- [28] N. Quach, "High Availability and Reliability in the Itanium Processor," IEEE Micro, vol. 20, no. 5, pp. 61-69, 2000.
- [29] Standard Performance Evaluation Corp., "SPECjvm98 Specification," Aug. 1998.
- [30] Tandem, Compaq Corp., "Data Integrity for Compaq NonStop Himalaya Servers," white paper, 1999.
- [31] Y. Tosaka, "Soft Error Modeling and Simulation for SOI Circuits," Proc. Fourth Ann. Topical Research Conf. Reliability, Oct. 2000.
- [32] J.F. Ziegler et al., "IBM Experiments in Soft Fails in Computer Electronics (1978-1994)," IBM J. Research and Development, vol. 40, no. 1, pp. 3-18, Jan. 1996.
- [33] J.A. Zoutendyk et al., "Characterization of Multiple-Bit Errors from Single-Ion Tracks in Integrated Circuits," IEEE Trans. Nuclear Science, vol. 36, no. 6, Dec. 1989.



**Alan Messer** received the PhD degree from City University and the Bachelor's degree in computer science from Imperial College of Science and Technology, London, England. He is a senior manager and principal engineer with Samsung Electronics' Corporate Technology Operations Division. He leads projects researching the practical application of pervasive computing and distributed systems to consumer electronics. He is also currently engaged in representing Samsung Electronics on several standardization bodies, including the CEA and UPNP Forum. He has published in a variety of conferences/journals and participates on several technical program committees. Before joining Samsung, he worked at Hewlett Packard Laboratories on enterprise systems and pervasive computing, where this work was done. He is a member of the IEEE, IEEE Computer Society, ACM, and Usenix.



**Philippe Bernadat** received the PhD degree from the University Pierre et Marie Curie of Paris in 1983 while working in the field of relational databases at INRIA. He is a system architect within Hewlett Packard's high-performance technical computing division where he works on parallel global file systems. He has been a researcher at HP Labs Grenoble, France, for two years, where he contributed to two projects in the field of memory failure recovery and services on demand for pervasive infrastructures. Prior to this, he was part of the OSF research institute in Grenoble and Cambridge, Massachusetts, where he led research in the field of microkernels, distributed computing, real-time systems, Java virtual machines, and active networks.



**Guangrui Fu** received the BS degree from Peking University, China, in 1997 and the MS degree from the University of Hawaii in 1999, both in computer science. She worked at Hewlett Packard Labs and is currently with DoCOMo USA Labs. Her research interests include high availability systems, network security, and network mobility management. She is a member of the IEEE.



**Deqing Chen** received the PhD degree in computer science from the University of Rochester and the Bachelor's and Master's degrees in computer engineering from Shanghai Jiaotong University. He is a research engineer at Ask-Jeeves, Inc. His research interests include distributed and parallel computing systems. He is a member of the IEEE and the IEEE Computer Society.



**Zoran Dimitrijevic** received the Dipl.Ing. degree in electrical engineering from the School of Electrical Engineering, University of Belgrade, Serbia, in 1999 and the PhD degree in computer science from the University of California, Santa Barbara, in 2004. During 1999-2000, he was a Dean's Fellow at the University of California, Santa Barbara. He joined Google in May 2004. His research interests include operating systems, real-time storage systems, large-scale parallel computing, multimedia applications, and computer architecture. He is a member of the IEEE and the IEEE Computer Society.



**David Lie** graduated from Stanford University in 2004. Since 2003, he has been an assistant professor in the Electrical and Computer Engineering Department at the University of Toronto. While at Stanford, he founded and led the XOM (eXecute Only Memory) Processor Project, which supports the execution of tamper- and copy-resistant software. Currently, he works on computer security, virtual machine monitors, and hypervisors. He is a member of the IEEE and the IEEE Computer Society.



**Durga Devi Mannaru** received the Master's degree in computer science from the Georgia Institute of Technology. Her interests are in the areas of networking, distributed computing, and performance management of software applications. She is currently working at IBM, Research Triangle Park, North Carolina.



**Alma Riska** received the PhD degree in computer science from the College of William and Mary, Williamsburg, Virginia, in 2002 and the Diploma in computer science from the University of Tirana, Albania, in 1996. Currently, she is a research staff member of Seagate Research Center in Pittsburgh, Pennsylvania. Her research interests are in performance evaluation of computer systems, with emphasis on applying analytic techniques and detailed workload characterization in better-performing and self-managing storage systems. She is a member of the IEEE, IEEE Computer Society, and ACM.



**Dejan Milojicic** received the BSc and MSc degrees from the University of Belgrade and the PhD degree from the University of Kaiserslautern. He is a senior scientist and a project manager at Hewlett Packard Labs. He has worked in the area of operating systems and distributed systems for more than 20 years. He was the program chair of the IEEE Agent Systems and Applications Symposium (ASA/MA '99) and of the first ACM/IEEE/USENIX Workshop on Industrial Experiences with System Software (WIESS 2000). He has published in many journals and at various events. He is currently on the editorial board of the *IEEE Distributed Systems Online*. He has been engaged in various standardization bodies and helped standardize OMG MASIF and, more recently, worked on standardizing the SmartFrog configuration framework. He is a member of the ACM, IEEE, IEEE Computer Society, and USENIX.
