J. Software Engineering & Applications, 2010, 3: 446-454
doi:10.4236/jsea.2010.35050 Published Online May 2010 (http://www.SciRP.org/journal/jsea)
Copyright © 2010 SciRes. JSEA

Experiences Analyzing Faults in a Hybrid Distributed System with Access Only to Sanitized Data*

Ronald J. Leach
Department of Systems & Computer Science, Howard University, Washington DC, USA.
Email: rjl@scs.howard.edu
Received March 12th, 2010; revised March 26th, 2010; accepted March 28th, 2010.

ABSTRACT
In this paper we report on a work in progress assessing the faults observed and reported in a distributed, safety-critical, largely embedded system with both electrical and mechanical components. We illustrate why standard software testing techniques are not sufficient and indicate some of the technical and non-technical problems encountered in examining the faults and the initial results obtained. While the application domain is elevator operation, the techniques described here are general enough to apply to many other domains. Much of the data analyzed here would be considered imprecise in the software industry if it were used in software testing or to help increase fault tolerance. The paper includes a discussion of the use of multiple views of data, assessment of missing data, and analysis of informal information to produce its conclusions about fault avoidance and fault tolerance.

Keywords: Distributed System, Safety-Critical Systems, Fault Tolerance, Remote Monitoring

1. Introduction

It is difficult to obtain useful information about the nature and distribution of faults in an actual distributed system, especially one that is safety-critical. Most companies and government organizations do not allow such information to be made available to external entities, even in sanitized form. This lack of data poses a potentially enormous problem for researchers in fault-tolerance and distributed systems.
*This research was partially funded by the National Science Foundation under grant number 0324818.

It is very important to provide insights for researchers who might not have sufficient access to realistic data. Without such access, it is difficult to verify the practicality of research hypotheses. We hope that the process described here, together with a discussion of the analyses done, can provide insight and advance the research in this important field.

In this paper, we report on an evaluation of the root causes of faults in a safety-critical system and describe some of the partial solutions that were obtained. Our experience illustrates the difficulty in obtaining useful, realistic fault data from an operational safety-critical system. The system studied included several elevators in a high-rise building, with both internal and external monitoring and communications systems [1].

The situation examined in this paper is rather unusual as an example in the fault-tolerance community, because the fault and maintenance data analyzed was not reported in any form that would ordinarily be used for a complete fault analysis, including analysis of either fault-tolerance or fault-avoidance [2,3].

We also observe that the reliability of electro-mechanical systems such as elevators might exhibit some of the characteristics of a "bathtub curve" typical in mechanical systems [4-6], or of one more common in software [7]. The book [8] is devoted to systems with mechanical and electronic components, and the evolution of elevator control software systems is discussed in [9].

A 1996 version of a NASA standards document, Facility System Safety Guidebook NASA-STD-8719.7, states the following about software faults in hybrid systems [10].

Software faults may take three forms:
- The so-called honest errors made by the programmer in coding the software specification. These are simple mistakes in the coding process that result in the software behaving in a manner other than that which the programmer intended.
- Faults due to incorrect software specifications or the programmer's interpretation of these specifications. These errors may result from system designer's lack of full understanding of system function or from the programmer's failure to fully comprehend the manner in which the software will be implemented or the instructions executed. In this type of fault the software statements are written as intended by the programmer.
- Faults due to hardware failure. Hardware failures may change software coding. Thus such software faults are secondary in that they originate outside the software.

All these types of faults, as well as a considerable amount of human error, are present in this system. We note that a new draft standard STD-8719.7A is currently under NASA review. Other relevant research on the reliability of fault-tolerant, safety-critical systems can be found in [11,12].

As will be discussed later in this paper, an informal verbal description of a problem by an on-site building manager and a conversation with a service company representative helped identify a set of faults that could be removed easily, leaving the system with a greater degree of resilience when other faults were encountered.

We note that some of the fault data was sanitized before it was made available to the author for the analysis that is described in this paper. Even so, some conclusions can be drawn about the major causes of faults, even with incomplete data.

We have removed all references to the particular companies that performed the initial installation and service of the set of elevators described here.
The distributed card and password security system that the elevator access controls must interface with is also described only at the highest level. We have also sanitized the nature of any company database design in order to protect proprietary information.

Of course, simulation of elevator behavior in terms of picking up and letting off passengers is often used as a teaching tool. One of the earliest readily available discussions of this kind is provided in Knuth [13]. A recent search on Google for the terms "elevator simulation" and "assignment" provided 517 matches.

2. The System Evaluated

The system evaluated in this work is a set of user-operated elevators that have multiple sets of controls, multiple alarms, and the capability to communicate with a remote monitoring device. All elevators are in the same high-rise building complex. The system is integrated with an access control system and electronic cards. The system currently complies with all existing safety codes in the geographical area.

The elevator system is over twenty years old and has some problems of age, wear and tear, and unavailability of parts.

Of course, it is not reasonable to expect that the programmers who wrote the original code for the microprocessors and related subsystems will still be with the company. In fact, there is no reason to expect that the company that originally designed and installed the elevator system is responsible for its maintenance. This is, of course, a typical situation in the software maintenance industry.

The entire system may be viewed as having several distinct features, most of which are illustrated in Figure 1.
- The system contains a set of seven elevator cars that are positioned in three banks of two elevators each, with the remaining elevator essentially by itself, although another nearby elevator could be used in an emergency. The banks of elevators are several hundred feet apart.
- The alarm system in the elevators is audible to a local human monitoring system, with monitoring at all times of day and night. The on-site human monitor enters all problems into a log book and can call the elevator company's service center. There are also phones inside each elevator to enable a stranded user to contact the proper service personnel, or the fire department.
- In the late evening, the elevators automatically revert to restricted access controlled by electronic access cards. These electronic access cards are integrated into a building-wide security system, with monitoring by the aforementioned human monitors and with each access entered into a database system.
- Microprocessors in each of the seven elevator cars can interact with communications devices that are able to transmit problem information to an off-site remote monitoring system.
- The microprocessors use a custom design and should be thought of as ASICs (Application-Specific Integrated Circuits). The lack of a standardized design makes the error rates of the processors difficult to compare with those of other microprocessors of the same vintage. Hence it is impossible to use microprocessor fault data, even if it were made available, to determine whether the reliability was typical of long-lived systems with high degrees of reuse.
- It appears that the microprocessors are not readily available for replacement in all of the elevator company's installed locations.
- Every call for elevator service is entered into a service database at the elevator company's central location. The elevator company's service supervisors can see this database monitoring system. This system can be viewed, in certain circumstances, by non-company personnel.

Figure 1. An OV-1, high-level view of the interaction between several of the elevator's microprocessors and some of the other relevant computer-controlled systems (components shown: remote monitor system, alarm system, hall buttons, door open/close controls, security system, elevator (one of several))

It is natural to ask why this system is an appropriate example to serve as the basis for a paper on software failures. Most modern elevators do not require a special operator and are operated by individuals who are, almost certainly, unaware of the safety, design, and control issues involved in their safe operation. Hence, there are multiple control and monitoring features, nearly all of which are computer-based for the system described in this paper.

There are microprocessors in several subsystems of this set of elevators. The microprocessors are custom designed and cannot be replaced easily by off-the-shelf components. Each elevator has the following computer components or computer system interfaces:
- Each elevator contains a microprocessor that selects options, based on the buttons that have been pressed. The microprocessor controls the operation of the doors (open, closed), as well as floor selection, based on the buttons pressed. Since there are separate controls on each side of the elevator cab, each side must have its own microprocessor.
- For six of the seven elevators, the buttons are rendered inoperable late at night by a security code set by a human operator at an in-building control center, until a person uses their personal pre-assigned security code, which is entered using the in-car buttons on the keypad. Unless the code is entered correctly, the elevator car returns to the ground floor. For some of the higher floors, access also requires the swiping of an electronic security card.
- There are control units in sets of buttons, one for each floor, that allow the elevator to be called. Each of the control units contains a microprocessor for communication.
- There are sensors in each set of door panels. There are both interior and exterior doors in each elevator. These sensors make the doors stop closing if they encounter an obstacle, usually a human, but perhaps luggage or a grocery cart. These are controlled by microprocessors.
- Some doors have microprocessors to control smooth opening and closing of doors in the event of severe wind conditions affecting air flow within the elevator shafts. The elevator shafts have external air access, due to elevator safety regulations.
- All programming of the microprocessors is done off-site and, after testing, the microprocessors are deployed. There is only a minimal amount of on-site programming performed.
- Each elevator contains a microprocessor and a communications path that sends a service code to the elevator company's central service location in the event of a malfunction.
- The company's central service location monitors all service calls, whether called in by an authorized human monitor or by the electronic call system described above.
- There is a company proprietary database of service calls. In certain circumstances, the database may be made available for read-only access to selected customer representatives.

Figure 2. An OV-2 view of the system, showing need lines

3. Modeling the System

To help understand and model the system's organization, we used the Department of Defense Architecture Framework (DoDAF) and created the models using the System Architect for DoDAF tool from Telelogic. Representations of system operation were shown in what in DoDAF terminology is called "Operational Views." There are several types of standardized operational views:
- OV-1 consists of an informal, graphical representation of operations as well as explanatory text. It is informal in the sense that information provided in it is not included in any database or CASE tool. An OV-1 diagram of the system is provided in Figure 1.
- OV-2 is intended to track the need to exchange information from specific operational nodes that play a key role in the architecture to others. OV-2 does not depict the connectivity between the nodes.
- OV-3 (Operational Information Interchange Matrix). This view expresses the relationship between the three basic architecture data elements of an OV (operational activities, operational nodes, and information flow) in the form of an Excel spreadsheet, with a focus on the specific aspects of the information flow and the information content. This view is not provided in this paper, since it is somewhat redundant with the information included in the OV-2 and OV-5 diagrams.
- OV-4 (Organizational Relationships Chart). This view clarifies the various relationships that can exist between organizations and sub-organizations within the architecture and between internal and external organizations. Relevant organizations are the elevator service company, the company that built and installed the elevators, the elevator inspector, the building management company, tenants, and, although informal, the organization of elevator users. This view is not provided in this paper, since it has been superseded by a new, somewhat confidential, contractual relationship that was developed as part of the analysis performed as a result of this study.
- OV-5 (Operational Activity Diagrams). The diagrams provided in this view represent the various activities that are performed by major components of the elevator management system. They are intended to do the following:
  - Clearly delineate the lines of responsibility for activities when coupled with OV-2
  - Uncover unnecessary operational control activity redundancy
  - Make decisions about streamlining, combining, or omitting activities
  - Define or flag issues, opportunities, or operational activities and their interactions (information flows among the activities) that need to be scrutinized further
  - Provide a necessary foundation for depicting activity sequencing and timing in OV-6a, OV-6b, and OV-6c

In Telelogic's implementation of System Architect for DoDAF, three distinct OV-5 diagrams are created: an "Operational Activity Model Node Tree," a top-level "Node Activity Diagram," and a child-level "Node Activity Diagram." Each of these diagrams is discussed in detail. The methodology used in this diagram in System Architect is known as IDEF0, which is used to reflect data flows. The acronym IDEF stands for Integrated Computer-Aided Manufacturing (ICAM) DEFinition.

The Operational Activity Model Node Tree Diagram indicates the major components of the elevator management system: human operation, elevator car operation, remote monitoring operation, security system operation, alarm system operation, and the phone system. The tree structure indicates the major operational activity dependencies and their relation to the primary operational activity, management of the elevator's operation. For simplicity, only a few of the child nodes are shown in Figure 3.

For each of the nodes in an operational activity diagram, a set of operations is allowed. We show a few of these in Figure 4, where we have presented an ICOM diagram. The acronym ICOM stands for Input, Control, Output, Mechanism. Arrows for a few of each of these four types of interactions are shown in clockwise order, beginning at the left-hand side of the highest-level operational activity named "Manage elevator" in Figure 4.

- OV-6 (Operational Activity Sequence and Timing Descriptions). OV products discussed previously model the static structure of the architecture elements and their relationships. Many of the critical characteristics of a software architecture are only discovered when the dynamic behavior of these elements is modeled to incorporate sequencing and timing aspects of the architecture. Three standard types of sequence diagrams are in common use: Operational Rules Model (OV-6a), Operational State Transition Description (OV-6b), and Operational Event-Trace Description (OV-6c). Since our analysis of the failure data indicated that timing considerations did not appear to be a problem, these views are not discussed in this paper.

Figure 3. An OV-5 Operational Activity Diagram, showing parent and some of the child nodes

Figure 4. An OV-5 diagram showing an operational activity with ICOM arrows

4. Relevant Non-Technical Issues

Elevators such as the ones described here are complex, far more so than one that might be found in, say, an expensive city townhouse. Therefore, the companies that can handle this type of installation are limited to relatively large ones, with service staffs large enough to provide service at any time of the day or night.

It is common, though by no means universal, that the company that performed the initial installation is not given the service maintenance contracts once the initial warranty period has expired. In order to protect confidentiality further, we will always refer to two separate companies in this paper, although that may or may not be accurate in this particular situation; it is possible that all service work was performed by a single company.

To ensure income streams, elevator service companies strongly prefer long-term service contracts.
On the other hand, once the service contract is in hand, there is an incentive to not provide service beyond what is needed to maintain minimal operational service. Fortunately, safety is never ignored by any reputable elevator manufacturing or service company. Elevator safety systems are highly redundant; their designs resemble a multi-version programming scheme [2] with constant rollback states [5].

Of course, there are political issues about who pays for repairs beyond what is covered by these maintenance contracts, and who monitors the availability of the repairs of items not covered by these maintenance contracts. These issues suggest a somewhat adversarial relationship between the customer and the elevator service company, especially if major repairs are anticipated. Independent analysis of faults by consultants is often of use. However, the dearth of companies with sufficient expertise to maintain elevator systems of this complexity encourages all parties to work together.

There are several sources of information that extend beyond the database discussed later. The building's manager, its engineer, or both have been present during most of the elevator service calls during the period being examined. They have indicated verbally that some faults requiring service calls may have been caused by environmental conditions affecting microprocessors. It is conceivable that some other problems may have been caused by cell phone interference with control microprocessors in individual elevator cars or near the hall buttons. The elevators are over twenty years old and the design of the original shielding may not have considered the potential for cell phone interference.

There is one other non-technical issue that affects the analysis of the problem. It is conceivable that in certain instances, data in the aforementioned company's proprietary database of service calls may provide some confidential information about failures of certain components. That might give some competitors an unfair advantage when bidding for maintenance or major upgrade contracts. This information must be kept within the security standards of the company. Hence, such data is sanitized considerably before release to anyone not employed by the company.

5. Current System Status

In Figure 5, we illustrate the availability of the individual elevators for service during a period of one year. The period shown ended before the analysis described in this paper was undertaken. Of course, these percentages, while high, are never high enough for the elevator user who might be stuck in an elevator. The low availability of the first elevator is clearly a cause for concern. The graph shows real data, but information on specific elevators has been deleted to preserve sensitive proprietary information. The diagrams are screen dumps taken directly from the elevator company's website.

While it is difficult to appreciate the differences between the percentages indicated, simple arithmetic shows that an elevator with an availability of 98.49% causes difficulty for its users 5½ days per year on average. Even the elevator with the highest availability was out of commission over ¾ of a day per year, on average.

Data for individual elevators was available for further analysis during the same reporting period. The results by month for the first elevator (the one most troublesome in Figure 5) are shown in Figure 6. Note that there was a wide range in the availability of this particular elevator. Also, some of the other elevators had the desired 100% availability for multiple months. Data for the other elevators has been omitted to save space.
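The "simple arithmetic" behind the downtime figures can be written out as a minimal sketch. It assumes a 365-day year and that availability is measured over every hour of that year; the function names are illustrative and do not come from the elevator company's tools. The second function runs the conversion in reverse: the paper does not state the best elevator's availability, so the value printed for ¾ of a day is only the upper bound implied by "over ¾ of a day per year."

```python
def downtime_days_per_year(availability_percent: float,
                           days_in_year: float = 365.0) -> float:
    """Expected unavailable days per year for a given availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    return (100.0 - availability_percent) / 100.0 * days_in_year


def availability_for_downtime(days_down: float,
                              days_in_year: float = 365.0) -> float:
    """Availability percentage implied by a number of unavailable days per year."""
    return 100.0 * (1.0 - days_down / days_in_year)


# 98.49% availability, the figure quoted in the text
print(round(downtime_days_per_year(98.49), 2))    # 5.51 days, i.e. about 5½ days

# Three quarters of a day of downtime per year (assumed 365-day year)
print(round(availability_for_downtime(0.75), 2))  # 99.79 percent
```

Seen this way, a gap of little more than one percentage point in availability separates under one day from more than five days of disruption per year.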
It is important to understand the meaning of the data illustrated in Figures 5 and 6. A lack of availability might mean that a unit could not stop on a particular floor, that a hall button might not call the elevator unless it was pushed several times, or that a security code needed to be entered from a central location in the building. It did not mean that the elevator car was in any danger of falling. This does not happen on modern fail-safe elevators.

Figure 5. Percentage of availability of operation of the elevators during a recent one-year period (axis label: Percentage Up Time)

Figure 6. Percentage of availability of operation by month for the most troublesome elevator during a recent one-year period

6. Analysis

In addition to the overall data on availability of the elevators during a one-year period illustrated in Figure 5 and the monthly report for the same year illustrated in Figure 6, data on this complex system were collected by the elevator maintenance company over an approximately nine-month period. There were a total of 74 service visits during that nine-month period. The results of each visit were entered into the company's service database, which is in the form of a Microsoft Excel spreadsheet. Since a spreadsheet normally contains less information than a database, and is less easily queried, data analysis is somewhat limited.

Initially, there was little concern about the discrepancy between the nine-month period of the service visits and the yearly data reported in Figures 5 and 6. This omission slowed down the analysis considerably, because the discrepancy could have pointed out one of the most serious problems immediately, had it been fully understood.
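A mismatch between reporting windows of this kind can be flagged mechanically before any deeper analysis begins. The following is a minimal sketch, not the procedure used in the study: the source gives only the lengths of the two windows (one year and roughly nine months), so the particular months assigned to the service visit log below are a hypothetical alignment, and the `DataView` type and `missing_months` function are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataView:
    """One view of the maintenance data and the months it covers."""
    name: str
    months: frozenset  # month numbers within the reporting year, 1..12


def missing_months(reference: DataView, other: DataView) -> list:
    """Months covered by `reference` but absent from `other`, in order."""
    return sorted(reference.months - other.months)


# Hypothetical alignment: which nine months the service log covers is an
# assumption made only for illustration.
availability_report = DataView("availability report", frozenset(range(1, 13)))
service_visit_log = DataView("service visit log", frozenset(range(4, 13)))

gap = missing_months(availability_report, service_visit_log)
print(gap)  # [1, 2, 3]
```

A non-empty result does not explain anything by itself; it only marks the months for which one view has no data, which is a prompt to ask why.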
The entries in the database that apparently triggered the technician's maintenance service call are not very illuminating from the perspective of providing insight into computer faults. The categories indicated are limited to the following:
- Door_performance
- Checked/adjusted elevator operation and phone
- Maintenance on controller/mr_equipment
- Ropes
- Motor_generator
- General_maintenance_procedure
- Brake_elevator
- Hall_buttons
- Door operation/car doors
- Maintnance_on_car_door/operator/car_top/emg_light

There were other views of this data that were somewhat more informative. One was a listing of 43 of the 74 service calls on which specific items that needed to be repaired or replaced were identified in more detail. These specific items could be classified as follows in this listing:
- There were 28 issues that required mechanical repairs.
- There were 12 issues that required the replacement of one or more specific mechanical parts.
- There were 5 issues that required computer hardware repairs.
- There were 2 issues that required computer software repairs.

In this listing, a few of the 43 service calls in which specific items that needed to be repaired or replaced were identified had multiple items, accounting for the 47 items described in the above list.

It is now obvious that there are discrepancies between the entries in the database of actions (repairs, replacements, hardware-specific repairs, software-specific repairs), the number of service calls, and, to some degree, the periods of unavailability of the elevators. It is natural to ask why there are such discrepancies.

One possibility that could be eliminated readily in the analysis of this data is that the elevator service company was cutting corners. The elevators were under a long-term maintenance contract and, under the terms of the service contract, any unresolved issues would result in an additional service call to the elevator service company. Since the service calls required transportation of service personnel, it was in the elevator service company's best interest to minimize unnecessary extra trips. Hence, this possibility was rejected.

The elevator company's central dispatch office assigned technicians when faults were either detected or called in. Because of the redundancy in each of the elevator banks, service calls for this building received lower priority in the dispatch office than calls from locations with a single elevator. Occasionally, junior technicians were dispatched. For these reasons, it was felt that a statistical distribution of the time to fix problems would not produce more meaningful data than simply reporting aggregated outage times.

It is clear that the entries in the technician's database (door_performance, hall buttons, checked/adjusted elevator_operation and phone, ropes, etc.) were restricted to match certain pre-defined categories. Thus, it is reasonable to assume that they might not provide much information on specific failures, especially for hardware and software failures.

When examining the discrepancies, it was noted that the time periods were different. One set of data was for a nine-month period, while the other was for one year. It was important to know if the discrepancy was due to the way the elevator service company sanitized the data, or to the way data was collected. In particular, if the discrepancy was due to a problem in the data collection process, what caused this failure, and did the result of this failure cause a cascade of related faults?

The explanation for this discrepancy was quite simple. Both the company's database and what we have called the secondary listing of specific items that needed to be repaired or replaced were accurate, but did not show the failures at the times they were noted by human users and monitors. The data from the technician's service call database was accurate and reflected what was actually done (even though the codes were not always very helpful).

What happened is that the remote monitoring of what is called the "health and safety" of the elevators via the communications path between the elevator microprocessors had not been activated during the entire period. Re-initializing this communication allows microprocessors to be reset automatically if there are failures, providing much higher tolerance of hardware and software faults and thereby increasing availability.

How was it determined that the remote monitoring of elevator status was not working? (It was not clear from the documentation provided to the building, that is, to the customer, that there even was remote monitoring.) The information was obtained from the elevator company's newly appointed service manager, who graciously provided access to the data.

A follow-up interview with the manager of the building complex indicated another potential explanation for what had seemed to be an overly large number of microprocessor errors that required either resets or hardware replacement. The cleaning fluid used to clean the surfaces of both the in-elevator control panels and the much simpler hall buttons had, in several cases, seeped behind the decorative plates and caused electrical shorts. A simple change in the cleaning procedures reduced the number of observed faults.

The two actions, enabling the remote monitoring of microprocessor status and enacting new procedures for cleaning, caused a great reduction in faults, with almost no down time when failures did occur as a result of the remaining faults.

7. Conclusions and Suggestions for Future Work

Obviously, this was an unusual situation when compared to what is typically studied in the fault tolerance research community.
However, it may be more relevant to the practitioners of fault tolerance who are faced with solving a real-world problem. The following techniques were especially useful in helping to determine the root causes of faults that led to system failures:
- While nearly all the reports in the maintenance service databases used pre-defined categories that, at first glance, had little useful information, more detailed analysis indicated certain commonalities of faults.
- Interviews with knowledgeable people, such as the building's manager and the elevator service company's service manager, led to information that resulted in new policies (for keeping cleaning fluids and gels away from the microprocessors) and the proper use of the remote monitoring system.
- Unwritten information was useful, such as the existence of the remote monitoring database and the possibility of viewing this database by persons who are not employees of the elevator service company.
- Reasoning about missing things, such as the missing months in two different views of the maintenance database, led to an understanding of a major lapse in the use of the remote monitoring system.

It is likely that many of the lessons learned in this analysis can be useful to practitioners of fault tolerance who are faced with similar problems with the data available to them.

REFERENCES
[1] Unnamed elevator company, Unnamed Service Database, 2008.
[2] A. Avizienis and J. P. Kelly, "Fault Tolerance by Design Diversity: Concepts and Experiments," IEEE Computer, Vol. 17, No. 8, August 1984, pp. 67-80.
[3] B. Randell, "System Structure for Software Fault Tolerance," IEEE Transactions on Software Engineering, Vol. 11, No. 2, June 1975, pp. 220-232.
[4] R. Amuthakkannan, S. M. Kannan, K. Vijayalakshmi and N. Ramaraj, "Reliability Analysis of Programmable Mechatronics System Using Bayesian Approach," International Journal of Industrial and Systems Engineering, Vol. 4, No. 3, 2009, pp. 303-325.
[5] V. Dhudsia, "Guidelines for Equipment Reliability," Technical Publication, Sematech, Inc, 1997. http://www.sematech.org/docubase/document/1014agen.pdf
[6] G. K. Fourlas, "An Approach towards Fault Tolerant Hybrid Control Systems," Control & Automation Mediterranean Conference on MED, Corsica, 27-29 June 2007, pp. 1-6.
[7] J. D. Musa, A. Iannino and K. Okumoto, "Software Reliability: Measurement, Prediction, Application," McGraw-Hill, Inc., New York, 1987.
[8] R. Isermann, "Mechatronic Systems Fundamentals," Springer, London, 2003.
[9] K. Lee, K. C. Kang, E. Koh, W. Chae, B. Kim and B. W. Choi, "Domain-Oriented Engineering of Elevator Control Software: A Product Line Practice," Proceedings of the First Software Product Line Conference, Denver, August 2000, pp. 3-22.
[10] "Facility System Safety Guidebook," NASA-STD-8719.7, National Aeronautics and Space Administration, 1996.
[11] "The use of Computers in Safety Critical Operations," Final Report of the Study Group on the Safety of Operational Computer Operations, Health and Safety Commission, UK. http://www.hse.gov.uk/nuclear/computers.pdf
[12] N. Leveson, "Software Safety: Why, What, and How," ACM Computing Surveys, Vol. 18, No. 2, June 1986, pp. 125-163.
[13] D. E. Knuth, "Fundamental Algorithms," The Art of Computer Programming, 3rd Edition, Addison-Wesley, Reading, Massachusetts, Vol. 1, 1973.