Application-Level Checks for Concurrent Error Detection

Computer Engineering Research Center
The University of Texas at Austin

This area deals with techniques for detecting errors in computations. Traditionally, hardware techniques (such as self-checking circuits or duplicated CPUs) have been used for this purpose. Our research deals with software techniques that check control flow and the results of computations in order to detect errors due to hardware faults.


K.-H. Huang and J. A. Abraham, "Algorithm-Based Fault Tolerance for Matrix Operations," IEEE Transactions on Computers (Special Issue on Reliable and Fault-Tolerant Computing), vol. C-33, June 1984, pp. 518-528.

Abstract

The rapid progress in VLSI technology has reduced the cost of hardware, allowing multiple copies of low-cost processors to provide a large amount of computational capability for a small cost. In addition to achieving high performance, high reliability is also important to ensure that the results of computations are valid. This paper proposes a novel system-level method of achieving high reliability, called algorithm-based fault tolerance. The technique encodes data at a high level, and algorithms are designed to operate on the encoded data and produce encoded output data. The computation tasks within an algorithm are appropriately distributed among multiple computation units for fault tolerance. The technique is applied to matrix computations, which form the heart of many computation-intensive tasks. Algorithm-based fault tolerance schemes are proposed to detect and correct errors when matrix operations such as addition, multiplication, scalar product, LU decomposition, and transposition are performed using multiple processor systems. The proposed method can detect and correct any failure within a single processor in a multiple processor system. The number of processors needed just to detect errors in matrix multiplication is also studied.
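
To make the encoding concrete, here is a minimal sketch (not from the paper; the function names and tolerance are illustrative) of the row/column checksum scheme for matrix multiplication described above: A is extended with a row of column sums, B with a column of row sums, and the product then carries checksums that locate a single erroneous element at the intersection of the failing row and column tests.

    import numpy as np

    def column_checksum(A):
        # Column-checksum matrix: append a row of column sums.
        return np.vstack([A, A.sum(axis=0)])

    def row_checksum(B):
        # Row-checksum matrix: append a column of row sums.
        return np.hstack([B, B.sum(axis=1, keepdims=True)])

    def locate_error(C_f, tol=1e-9):
        # The product of the encoded matrices is a full-checksum matrix:
        # its last row/column should equal the column/row sums of the data.
        data = C_f[:-1, :-1]
        bad_rows = np.nonzero(np.abs(data.sum(axis=1) - C_f[:-1, -1]) > tol)[0]
        bad_cols = np.nonzero(np.abs(data.sum(axis=0) - C_f[-1, :-1]) > tol)[0]
        if bad_rows.size == 1 and bad_cols.size == 1:
            return int(bad_rows[0]), int(bad_cols[0])  # single data error
        return None  # consistent (or the inconsistency is in a check element)

    A = np.random.rand(4, 4)
    B = np.random.rand(4, 4)
    C_f = column_checksum(A) @ row_checksum(B)   # checksums survive the multiply
    C_f[1, 2] += 0.5                             # error from one faulty processor
    r, c = locate_error(C_f)
    C_f[r, c] += C_f[r, -1] - C_f[r, :-1].sum()  # restore from the row checksum
    assert np.allclose(C_f[:-1, :-1], A @ B)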


J. A. Abraham, E. S. Davidson and J. H. Patel, "Memory System Design for Tolerating Single Event Upsets," IEEE Transactions on Nuclear Science, vol. NS-30, no. 6, December 1983, pp. 4339-4344.

Abstract

This paper presents a new memory system design which employs fault-tolerant design techniques for tolerating errors due to single event radiation upsets. The classical Triple Modular Redundancy (TMR) technique has extremely high insertion cost (a factor of 3 to 4); duplication techniques can detect, but not correct, errors and have a factor of 2 to 3 insertion cost. Data encoding techniques have low cost but can correct errors only in stored data, not in the control logic. What is needed, therefore, is an effective mixture of known and new techniques to achieve adequately high reliability at minimum cost. The proposed memory system design uses coding, control duplication, and scrubbing to tolerate soft errors from single event upsets. This memory permits the use of less costly conventional unhardened memory technology and has a very low insertion cost of approximately 25% for achieving fault tolerance. Furthermore, events which upset control logic are tolerated as well as those which affect stored data. This approach therefore offers a cost trade-off with respect to hardened-technology memory, and it tolerates combinational control logic upsets which some hardening techniques cannot tolerate.
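
The coding-plus-scrubbing part of the design can be illustrated in miniature. The sketch below is illustrative only (the paper's memory organization, word width, and duplicated control are more involved): words are stored under a Hamming(7,4) single-error-correcting code and periodically scrubbed, so a single event upset is corrected and rewritten before a second upset can accumulate into an uncorrectable double error.

    def hamming74_encode(d1, d2, d3, d4):
        # 7-bit codeword with parity bits at positions 1, 2 and 4.
        p1 = d1 ^ d2 ^ d4   # checks positions 1, 3, 5, 7
        p2 = d1 ^ d3 ^ d4   # checks positions 2, 3, 6, 7
        p4 = d2 ^ d3 ^ d4   # checks positions 4, 5, 6, 7
        return [p1, p2, d1, p4, d2, d3, d4]

    def hamming74_correct(c):
        # The syndrome is the position of a single erroneous bit (0 = clean).
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
        s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
        pos = s1 + 2 * s2 + 4 * s4
        if pos:
            c = c.copy()
            c[pos - 1] ^= 1
        return c, pos

    def scrub(memory):
        # Periodic scrub pass: correct and rewrite every word in place.
        for addr, word in enumerate(memory):
            corrected, pos = hamming74_correct(word)
            if pos:
                memory[addr] = corrected

    memory = [hamming74_encode(1, 0, 1, 1) for _ in range(8)]
    memory[3][5] ^= 1          # a single event upset flips one stored bit
    scrub(memory)
    assert memory[3] == hamming74_encode(1, 0, 1, 1)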


J.-Y. Jou and J. A. Abraham, "Fault-Tolerant Matrix Arithmetic and Signal Processing on Highly Concurrent Computing Structures," Proceedings of the IEEE, vol. 74, no. 5, May 1986, pp. 732-741.

Abstract

Hardware for executing matrix arithmetic and signal processing algorithms at high speeds is in great demand in many real-time and scientific applications. With the advent of VLSI technology, large numbers of processing elements which cooperate with each other at high speed have become economically feasible. Since any functional error in a high-performance system may seriously jeopardize the operation of the system and its data integrity, some level of fault tolerance must be incorporated in order to ensure that the results of long computations are valid. The major computational requirements for many important real-time signal processing tasks can be reduced to a common set of basic matrix operations. Earlier work proposed a low-cost checksum scheme for fault-tolerant matrix operations on multiple processor systems. However, that scheme can only correct errors in matrix multiplication; it can detect, but not correct, errors in matrix-vector multiplication, LU decomposition, matrix inversion, etc. In order to solve these problems with the checksum scheme, a very general matrix encoding scheme is proposed in this paper to achieve fault-tolerant matrix arithmetic and signal processing with linear arrays, which are believed to hold the most promise among VLSI computing structures for their flexibility, low cost, and applicability to most of the interesting algorithms. The proposed technique is, therefore, a very cost-effective encoding technique for achieving fault-tolerant matrix arithmetic and signal processing on highly concurrent VLSI computing structures.
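
The gain over the plain checksum can be seen in a small example. The sketch below is illustrative (weights 1..n are one possible choice; the paper's encoding is more general and is mapped onto linear arrays): adding a second, weighted checksum row lets the ratio of the two syndromes locate a single erroneous element of a matrix-vector product, which the unweighted scheme could only detect.

    import numpy as np

    def encode_weighted(A):
        # Append an unweighted and a weighted (weights 1..n) checksum row.
        w = np.arange(1, A.shape[0] + 1, dtype=float)
        return np.vstack([A, A.sum(axis=0), w @ A])

    def correct(y, tol=1e-9):
        # y = encode_weighted(A) @ x: n data elements plus two check elements.
        n = y.size - 2
        w = np.arange(1, n + 1, dtype=float)
        s1 = y[n] - y[:n].sum()      # plain syndrome   = -(error value)
        s2 = y[n + 1] - w @ y[:n]    # weighted syndrome = -(weight * error)
        if abs(s1) > tol and abs(s2) > tol:
            k = int(round(s2 / s1)) - 1   # syndrome ratio gives the location
            y = y.copy()
            y[k] += s1                    # add back the error magnitude
        return y

    A = np.random.rand(5, 5)
    x = np.random.rand(5)
    y = encode_weighted(A) @ x
    y[2] -= 0.7                  # error in one element of the result
    assert np.allclose(correct(y)[:5], A @ x)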


V. S. S. Nair and J. A. Abraham, "Real-Number Codes for Fault-Tolerant Matrix Operations on Processor Arrays," IEEE Transactions on Computers, vol. 39, April 1990, pp. 426-435.

Abstract

Various checksum codes have been suggested for fault-tolerant matrix computations on processor arrays. Use of these codes is limited by the inflexibility of the encoding schemes and by potential numerical problems: numerical errors may be misconstrued as errors due to physical faults in the system. In this paper, we develop a generalization of the existing schemes as a possible solution to these shortcomings. We prove that linearity is a necessary and sufficient condition for codes used for fault-tolerant matrix operations such as matrix addition, multiplication, transposition, and LU decomposition. We also prove that for every linear code defined over a finite field, there exists a corresponding linear real-number code with similar error detecting and correcting capabilities. Encoding schemes are given for some of the example codes which fall under the general set of real-number codes. With the help of experiments, we derive a rule of thumb for the selection of a particular code for a given application. The performance overhead of fault tolerance schemes using the generalized encoding schemes is shown, through simulation experiments, to be very low. Since the overall error in the code also depends on how the encoding scheme is implemented, we also suggest the use of specific algorithms and special hardware realizations for the check element computation.
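
The role of linearity, and the numerical-error question, can both be seen in a short sketch. This is illustrative only: the weight vector and the roundoff tolerance below are assumptions, not the paper's rule of thumb. Any real weight vector w defines a linear code whose check row survives multiplication, and the detection threshold must be scaled to the operand magnitudes so that floating-point roundoff is not misconstrued as a fault.

    import numpy as np

    def encode(A, w):
        # Linear real-number code: append the check row w @ A.
        return np.vstack([A, w @ A])

    def syndrome(C_enc, w):
        # Discrepancy between the check row and the data it should encode.
        return C_enc[-1] - w @ C_enc[:-1]

    n = 6
    rng = np.random.default_rng(0)
    A, B = rng.random((n, n)), rng.random((n, n))
    w = 1.0 + rng.random(n)      # any real weight vector gives a linear code

    # Linearity is what preserves the check: (w @ A) @ B == w @ (A @ B),
    # so encode(A, w) @ B is an encoding of A @ B under the same code.
    C_enc = encode(A, w) @ B

    # Illustrative tolerance: scale machine epsilon by the operand
    # magnitudes so ordinary roundoff is not flagged as a physical fault.
    tol = 8 * n * np.finfo(float).eps * (np.abs(w) @ np.abs(A) @ np.abs(B))
    assert np.all(np.abs(syndrome(C_enc, w)) <= tol)   # roundoff only

    C_enc[2, 4] += 1e-6          # a fault-induced error, above the threshold
    assert np.any(np.abs(syndrome(C_enc, w)) > tol)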


N. A. Kanawati, G. A. Kanawati and J. A. Abraham, "Control Flow Checking In Object-Based Distributed Systems," Dependable Computing for Critical Applications 3, C. E. Landwehr, B. Randell and L. Simoncini, editors, Springer-Verlag, 1993, pp. 213-232.

Abstract

Object-based distributed systems are becoming increasingly popular since objects provide a secure and easy means of using the abstraction of shared memory. In this paper we develop a new object-based control flow checking technique, called COTM, to detect errors due to hardware faults in such systems. The proposed technique monitors objects and thread flow across objects in two stages. The first stage applies control flow checking to every object invocation. In the second stage, the legality of a terminating thread is examined. Results of fault injection experiments on several applications, written in C++ and modified to incorporate the object-based checks, show that the proposed technique achieves high fault coverage with low performance overhead.
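
The paper gives the details of COTM itself; the sketch below is a generic illustration of the two-stage idea, with hypothetical labels and legality tables. Stage one checks every object invocation against a table of legal caller-to-callee transitions, and stage two checks that a terminating thread ends in a legal state.

    import threading

    # Hypothetical legality tables; a real checker derives these from the
    # application's object invocation structure.
    LEGAL_CALLS = {
        ("main", "Queue.put"), ("main", "Queue.get"),
        ("Queue.put", "Log.append"), ("Queue.get", "Log.append"),
    }
    LEGAL_EXITS = {"main", "Log.append"}

    _state = threading.local()

    def invoke(label):
        # Stage 1: every object invocation must be a legal transition.
        prev = getattr(_state, "label", "main")
        if (prev, label) not in LEGAL_CALLS:
            raise RuntimeError(f"control flow error: {prev} -> {label}")
        _state.label = label

    def thread_exit():
        # Stage 2: a terminating thread must end on a legal label.
        if getattr(_state, "label", "main") not in LEGAL_EXITS:
            raise RuntimeError("thread terminated in an illegal state")

    invoke("Queue.put")      # main -> Queue.put is legal
    invoke("Log.append")     # Queue.put -> Log.append is legal
    thread_exit()            # Log.append is a legal exit label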


G. A. Kanawati, N. Krishnamurthy, V. S. Nair and J. A. Abraham, "Evaluation of Integrated System-Level Checks for On-Line Error Detection," Proceedings of the IEEE International Performance and Dependability Symposium, September 1996.


