November 1, 2009
Effective Risk Management and Quality Improvement by Application of FMEA and Complementary Techniques
An in-depth discussion about Failure Modes and Effects Analysis (FMEA) as used in the aerospace and airline industries and why FMEA, extended with appropriate top-down, probabilistic, and feedback methods, is an excellent framework for risk management and
This paper provides my expert opinion of the use and effectiveness of Failure Modes and Effects Analysis (FMEA) for managing risks and improving quality in several industrial domains. I also consider and evaluate several other analytical techniques as complementary extensions of FMEA.
The opinions that I express in this paper are based on a thorough review that I conducted of industry standards and procedures for risk management, FMEA techniques, and FMEA applications in aviation and other industries. I also base these opinions on my 25 years of experience in transportation management and analysis, airline flight operations, safety investigation management, safety research, and airline accident investigation. I have ten years of experience on the staff of the U.S. National Transportation Safety Board (NTSB), concluding my service there as the Chief of the Major Investigations Division. In that position, I managed the overall investigative effort for U.S. air carrier accidents from the field investigation to the public board meeting and final accident report. I also managed the U.S. Government’s participation in foreign aviation accidents. My previous NTSB experience included management of flight operations, air traffic control, and meteorological aspects of air carrier accident investigations; on-scene and follow-up investigations of flight operations for several major accident investigations including the USAir flight 427 Boeing 737 accident near Pittsburgh and ValuJet flight 592 DC-9 accident in the Everglades; and management of research programs on flight crew human factors and regional air safety issues, both of which were adopted and published by the NTSB. I am a pilot for a major U.S. air carrier, qualified in the Boeing 737 and two other transport category aircraft types. I have consulted with the National Aeronautics and Space Administration (NASA), the World Bank, the European Bank for Reconstruction and Development, the U.S. President’s Aviation Safety Commission, and several airlines, financial institutions, airport authorities, and other private entities on safety and analytical matters. I received the A.B. degree summa cum laude in Economics from Harvard College and am a member of the Phi Beta Kappa Society.
FMEA—Summary and Definition
According to the Society of Automotive Engineers (SAE) International Aerospace Recommended Practice (ARP) 5580, Recommended Failure Modes and Effects Analysis (FMEA) Practices for Non-Automobile Applications, FMEA is “a formal and systematic approach to identifying potential system failure modes, their causes, and the effects of the failure mode occurrence on the system operation…FMEA provides a basis for identifying potential system failures and unacceptable failure effects that prevent achieving design requirements from postulated failure modes…FMEA is used in many system design analyses including assessing system safety, planning system maintenance activities, defining provisions for fault recovery, fault tolerance, and failure detection and isolation, and identifying design modifications and corrective actions needed to mitigate the effects of a failure on the system.”
The basic FMEA process involves examining each basic hardware, software, personnel, or functional element of a system, identifying all the ways in which that element can fail (failure modes), assessing the effects of each failure mode upon the function of other elements of the system and the entire system (failure effects), and then assessing the criticality of the failure effects. Integral to the FMEA process is the specification of corrective actions that will prevent critical failures or restore critical functions.
FMEA typically uses a worksheet for analyzing data and documenting the results. The worksheet proceeds, left to right, from the component identification, to the associated failure modes, to the failures’ effects at various levels of the system (including detectability of the failure modes/effects), to their risk, reliability, or quality consequences. The following is an example of an FMEA worksheet that was prepared by the SAE for analysis of a fictitious aerospace application:
Source: SAE ARP926B, p. 32.
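The left-to-right structure of such a worksheet can be illustrated with a minimal sketch. The field names, the severity labels, and the hydraulic pump entries below are hypothetical and are not taken from the SAE example; they only show how one row ties a component to its failure mode, effects, detectability, and consequence.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WorksheetRow:
    """One row of a simplified FMEA worksheet (field names are illustrative)."""
    component: str            # element being analyzed
    failure_mode: str         # one way the element can fail
    local_effect: str         # effect on the element or adjacent elements
    system_effect: str        # effect on the entire system
    detection_method: str     # how the failure would be detected, if at all
    severity: str             # qualitative category, e.g. "minor", "catastrophic"
    corrective_action: str = ""                           # empty until one is specified
    assumptions: List[str] = field(default_factory=list)  # rationale behind the entries


def rows_needing_action(rows, critical=("major", "catastrophic")):
    """Return rows with a critical severity and no corrective action yet."""
    return [r for r in rows if r.severity in critical and not r.corrective_action]


# Hypothetical entries for a fictitious hydraulic pump.
worksheet = [
    WorksheetRow("hydraulic pump", "loss of output pressure", "actuator starved",
                 "loss of primary flight control power", "low-pressure warning light",
                 "catastrophic", assumptions=["single hydraulic system assumed"]),
    WorksheetRow("hydraulic pump", "external leak", "fluid loss",
                 "reduced reservoir level", "preflight inspection",
                 "minor", corrective_action="inspect seals at each scheduled check"),
]

open_items = rows_needing_action(worksheet)
print([r.failure_mode for r in open_items])
```

Keeping the assumptions alongside each row is one possible way to retain the rationale behind the entries; an actual worksheet would follow the column layout of the governing standard.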
The criticality, or level of risk, of a failure is a combination of the severity of the effect and the probability of its occurrence. Under FMEA the severity is estimated qualitatively, with each effect assigned to one of several categories ranging from none to catastrophic, and the probability is assessed either qualitatively or quantitatively (the latter if failure rate data are available from previous experience or from laboratory or field experimentation). The severity and probability assessments are combined into an overall assessment of the risk level of the failure effect as being acceptable or unacceptable, along the lines of the following graphic from Federal Aviation Administration (FAA) guidance material:
Source: FAA Advisory Circular 25.1309-1A, System Design and Analysis, p. 7
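The combination of severity and probability into an acceptable/unacceptable judgment can be sketched as a simple lookup. The severity categories and the probability ceilings below are illustrative assumptions, not values reproduced from the FAA graphic; only the general idea (the more severe the effect, the lower the probability that can be tolerated) comes from the discussion above.

```python
from enum import IntEnum


class Severity(IntEnum):
    NONE = 0
    MINOR = 1
    MAJOR = 2
    CATASTROPHIC = 3


# Hypothetical ceilings: the highest probability (per exposure) that is
# still treated as acceptable for each severity category.
MAX_ACCEPTABLE_PROBABILITY = {
    Severity.NONE: 1.0,
    Severity.MINOR: 1e-3,
    Severity.MAJOR: 1e-5,
    Severity.CATASTROPHIC: 1e-9,
}


def is_acceptable(severity: Severity, probability: float) -> bool:
    """Return True if the failure effect's risk falls in the acceptable region."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return probability <= MAX_ACCEPTABLE_PROBABILITY[severity]


print(is_acceptable(Severity.MINOR, 1e-4))         # below the minor ceiling
print(is_acceptable(Severity.CATASTROPHIC, 1e-7))  # above the catastrophic ceiling
```

An organization applying this idea would substitute the categories and thresholds defined by its own standard or regulator.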
One aspect of the FMEA process that is often ignored in discussions of the methodology (perhaps because it is not represented on the FMEA worksheet) is the importance of documenting and retaining all assumptions, including rationales for failure rates and effects categorization that underlie the FMEA worksheet entries. This is specifically cited by the SAE in its recommended standard ARP4761, appendix G, section 3.2.1.
My review of FMEA utilization in aerospace and several other fields suggests that the most common applications of FMEA are in product design and manufacturing processes. FMEA has not typically been applied to the post-manufacturing environment (such as product distribution and field usage by providers, operators, maintainers, and customers); however, post-manufacturing applications are not specifically excluded in FMEA standards. In fact, in SAE ARP5580 section 6.1.1 (5), “failure conditions caused by the operational and maintenance environment” are specifically cited among the failure modes to be considered.
Cross-industry acceptance and use of FMEA
FMEA is firmly established as a risk analysis and risk management methodology. Originating in the U.S. military during the 1940s and supported by military specification beginning in 1949 (MIL-P-1629, Procedures for Performing a Failure Mode, Effects, and Criticality Analysis), FMEA methods and applications were officially accepted as a recommended practice for aerospace engineering by the SAE beginning in 1967 under ARP926, Fault/Failure Analysis Procedure. FMEA had become a standard part of the design process in the aerospace industry by the 1980s and has been in continuous use through the present. For example, the Boeing Commercial Airplane Group relied upon FMEA to substantiate the safety and reliability of design changes for two generations of the Boeing 737 commercial airliner: the 737-300/400/500 series, first produced in the mid-1980s, and the “next generation” 737-600/700/800/900 series, first produced in the late 1990s and early 2000s. I have personally examined numerous FMEA documents and FMEA-based safety analyses prepared by aircraft manufacturers for original and modified transport-category aircraft designs (these FMEA applications are proprietary to the manufacturers). In addition to these aviation applications of FMEA, the late 1980s saw the application of FMEA to design and manufacturing processes by a major U.S. automobile manufacturer, and these practices were recognized by the automotive industry under the auspices of the Automotive Industry Action Group (AIAG) and the SAE (Surface Vehicle Recommended Practice J-1739, first issued in 1994). Currently, FMEA is recognized by the SAE (ARP5580, Recommended Failure Modes and Effects Analysis (FMEA) Practices for Non-Automobile Applications), the FAA (Advisory Circular 25.1309-1A, System Design and Analysis), and the National Aeronautics and Space Administration (NPG 8715.3, NASA Safety Manual, and NSTS 22206, Instructions for Preparation of FMEA and CIL).
In a subsequent section of this paper, I will provide an example of a successful government-sponsored (and therefore non-proprietary) aviation industry application of FMEA that resulted in a significant improvement in commercial air carrier flight safety.
FMEA has also been applied successfully in a wide range of other domains. For example, FMEA is being used to analyze design and maintenance issues in building structures (Anker Nielson, Ph.D., “Use of FMEA, Failure Modes Effects Analysis on Moisture Problems in Buildings,” Building Physics 2002—6th Nordic Symposium). Also, engineers have applied FMEA to design and manufacturing processes in the semiconductor industry (Steven Martin and Bedwyr Humphreys, “FMEA Speeds Time to Market in Photonic IC Manufacturing”, Compound Semiconductor, November 2002). The authors concluded, “The FMEA technique has been successfully implemented at MetroPhotonics, aiding in the rapid development and the successful launch of the SurePath product suite…Time to market and development costs were greatly reduced through the selection of optimum system alternatives (through FMEA), resulting in a successful product launch within four months of concept” (Martin and Humphreys, p. 69).
FMEA has become established as a standard methodology for risk management in the healthcare industry. Under Joint Commission on Accreditation of Healthcare Organizations (JCAHO) Standard LD.5.2, adopted July 1, 2000, healthcare organizations are required to proactively identify and manage potential risks to patient safety, using FMEA and root cause analysis to analyze at least one high-risk process annually. The U.S. Veterans Administration has developed and begun implementation of an application of FMEA that the agency customized for healthcare delivery (Joseph DeRosier, Erik Stalhandske, James P. Bagian, and Tina Nudell, “Using Health Care Failure Mode and Effect Analysis™: The VA National Center for Patient Safety’s Prospective Risk Analysis System,” The Joint Commission Journal on Quality Improvement, Vol. 28, No. 5, May 2002). Private health care organizations (for example, Kaiser Permanente) have begun to implement FMEA-based processes (Kaiser Permanente, Failure Modes and Effects Analysis Team Instruction Guide, March 2002). Although healthcare-related applications of FMEA have considered some aspects of pharmaceutical delivery (for example, Institute for Healthcare Improvement, “Sample FMEA: Comparison of Five Medication Dispensing Scenarios,” 2003), I am not aware that a comprehensive analysis of pharmaceutical distribution, delivery, and use, treating all post-manufacture activities as an integrated system, has been performed to date using FMEA or any alternative, formal risk-management methodology.
Advantages of FMEA
I suggest that FMEA has several general advantages for organizations seeking to improve quality and safety:
First, FMEA is a structured process that promotes disciplined elicitation of ideas about the kinds of failures that may occur, careful analysis of specific risk/hazard areas, proper documentation of sources and assumptions, and identification of interventions that manage risks to an acceptable level. Regarding the ultimate goal of risk management, in most applications the FMEA process requires intervention in each identified adverse outcome until the residual level of risk is acceptable.
Further, as a “bottom-up process” proceeding from the failure of an individual component of a system to the effects on the entire system, FMEA helps organizations identify unforeseen, undesired outcomes. Its best applications are prospective, facilitating the control or mitigation of adverse outcomes before they occur.
Also, FMEA explicitly considers the detectability of failure modes, and thus it promotes consideration of failures that can remain latent; that is, failures that have no immediate effect and (if they remain undetected) are capable of resulting in adverse effects when combined with subsequent failure modes or events (however, as is discussed below, the basic FMEA methodology may need to be modified to fully address latent failures).
Limitations of FMEA
SAE ARP5580 provides the following “cautions” for the application of FMEA:
- First, a FMEA traditionally considers only non-simultaneous failure modes. Each failure mode is considered individually, assuming that all other system components are performing as designed. Hence, a typical FMEA provides limited insight into the following anomalous behaviors:
- the effects of multiple component failures on system functions, and
- latent manifestations of defects such as timing, sequencing, etc.
- Second, the prioritization of the failure modes for corrective actions is substantially subjective. Thus, care should be taken in decision making when using any quantitative aspects of the numbers presented in the analysis (SAE ARP5580, Section 3.3).
I concur that the basic approach of FMEA is to consider single failures and that a typical FMEA application handles multiple (simultaneous/sequential) failures with difficulty (later in this paper, I will suggest several extensions to FMEA that are capable of addressing these issues).
Further, I suggest that the following additional general limitations exist for FMEA:
First, as FMEA has typically been applied in aerospace engineering, designers are permitted to rely upon human performance (such as interventions by pilots and mechanics) to mitigate the adverse effects of hardware and software component or system failures. However, in doing so, no consideration is given to imperfect human performance. For example, FAA guidance for aircraft certification states, “If…a potential failure condition can be alleviated or overcome…without requiring exceptional pilot skill or strength, credit can be taken for correct and appropriate action” (FAA AC25.1309-1A, paragraph 11). The assessment of “exceptional” skill or strength is subjective, and once a specific human response to a failure mode is determined to require unexceptional skill or strength, FMEA typically assumes that the human will intervene reliably every time that the failure mode occurs. I believe that this is an unrealistic assumption for human performance, and as a common treatment of human performance in FMEAs it constitutes a limitation of the typical FMEA methodology.
Also, as FMEA typically has been applied in design/process applications, there is no inherent feedback to the FMEA process from the actual failure modes and outcomes experienced in field use. However, this feedback is not excluded by the FMEA process and the continuing refinement of an FMEA through feedback has been explicitly recognized as an important aspect of system safety analysis in some applications.
Keys to successful application of FMEA
I believe that several additional issues are important for obtaining satisfactory results from an FMEA.
First, while FMEA is a structured technique that provides a comprehensive analysis, it is difficult (or impossible) to prospectively identify all possible failure modes/adverse outcomes from a complex component or functional element of a system. Because even the best FMEA effort may leave some failure modes and effects undiscovered, after completing an FMEA it is essential to avoid concluding that all risks have been compensated for or controlled. This suggests that FMEA analysts need to maintain an open and creative attitude about identifying failure modes and assessing their effects and consequences. It also establishes the rationale for obtaining, analyzing, and reacting to feedback from field use and operations, and for treating the FMEA as a “living document” that will be revisited and revised on a continuing basis.
Further, while planning and performing an FMEA, it is essential to understand the scope of the analysis and to choose a proper scope that will allow the evaluation of all critical risks that can result from failure modes. For example, many FMEAs are limited to design issues and do not necessarily consider manufacturing variations or errors. An FMEA of an aircraft part that includes several linkages may not consider the cumulative effect (stack-up) of the manufacturing tolerances that are allowed for each individual linkage as a possible contributor to failure modes and effects. Even if the scope of the FMEA for this part is enlarged to include manufacturing processes and therefore considers tolerance stack-up, the analysis still may not consider the effects of failure modes that remain downstream from the processes that have been included within the analytical scope, such as improper maintenance or use. When considering all of a product’s failure modes and effects in all environments, a still broader scope of analysis might reveal additional factors that significantly affect safety and quality. For example, consider a pharmaceutical product with an adverse side effect that poses a risk to some users. One option for controlling the risks of these side effects would be for the Food and Drug Administration (FDA) to withdraw approval for the product. However, because the product also has therapeutic value, withdrawal of the product may actually result in a net reduction of patient health and safety, even considering the adverse consequences of the side effects. The net therapeutic benefit of the product relative to its side effects will not be identified by an FMEA of its design, manufacturing, and use unless the withdrawal of the product is considered as a failure mode and the scope of analysis is broadened to consider the net consequences of non-use.
In addition to considering downstream effects in scoping the analysis, it is essential to recognize that the interventions selected in an FMEA to mitigate an identified risk can also introduce their own failure modes and effects having critical risks. Interventions should be designed to “first, do no harm;” that is, they should introduce no new uncorrected failure modes. This suggests that FMEA should be performed on each intervention, as well. In some cases controlling the hazard from one failure mode can increase the hazard from another, and this may require consideration of multiple simultaneous or sequential failures as an extension of FMEA.
Also, while interpreting the results of an FMEA, it is essential to understand the derivation and limitations of the probability analysis that is incorporated in the evaluation of the risks associated with failure effects. The probability that a failure mode will occur can be obtained from engineering, field, or registry data such as historic component failure rates; the probability that a functional element or complex component will fail can be estimated by combining the failure rates of sub-assemblies or sub-systems. Failure rates may be obtained from laboratory research if actual field data are unavailable. Lacking both field and laboratory data, failure mode probabilities may be estimated. The FMEA analyst’s confidence in the results should depend on the derivation of these probabilities. An additional probabilistic element in some FMEA applications is the likelihood that an effect of stated severity will follow from a failure mode. This element needs to be estimated in a similar manner, with confidence in the results of the analysis once again depending on the source of the probability estimates. Another probabilistic element can enter FMEA when considering interventions to control or mitigate an identified risk; here, the probability that the intervention will successfully address the risk needs to be estimated.
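One common way of combining sub-assembly failure rates can be sketched as follows. The sketch assumes a series arrangement (the element fails if any sub-assembly fails), independent sub-assemblies, and constant failure rates per operating hour; none of these assumptions comes from the standards cited in this paper, and the numerical rates are hypothetical.

```python
import math


def series_failure_rate(sub_rates):
    """Failure rate of an element that fails when any one sub-assembly fails.

    Assumes independent sub-assemblies with constant failure rates.
    """
    return math.fsum(sub_rates)


def failure_probability(rate_per_hour, hours):
    """Probability of at least one failure during the exposure period."""
    return 1.0 - math.exp(-rate_per_hour * hours)


# Hypothetical sub-assembly failure rates, in failures per operating hour.
sub_assembly_rates = [1e-6, 2e-6, 5e-7]

element_rate = series_failure_rate(sub_assembly_rates)
p_fail_1000h = failure_probability(element_rate, 1000.0)
print(element_rate, p_fail_1000h)
```

The result is only as credible as the input rates, which is why the source of each rate (field, laboratory, or estimate) should be recorded with the calculation.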
Failure and reliability rates are particularly difficult to estimate when human performance is involved. The FAA states in its design guidance material that “quantitative assessments of the probabilities of crew error are not considered feasible” (FAA AC25.1309-1A, paragraph 11); as I have already discussed, the FAA then turns at times to the unrealistic assumption that humans perform with perfect reliability. In other domains, performance by trained professionals has been estimated as being satisfactory in 30-60 percent of exposures to a demanding task. Although the reliability level of human performance is highly variable depending on the nature of the task, environment, and individual, it is probably best to assume that human performance in systems often may be much less reliable than what is demanded of hardware and software systems, and accordingly to plan compensations when humans may be responsible for detecting primary failure modes or for intervening to mitigate failure effects.
Review of FMEA applications in various industries suggests that there is no standard definition for an acceptable level of risk. Based on the high volume of operations with consequent risk exposure and the public’s low tolerance for mishaps, commercial aviation design and manufacturing is held to a stringent reliability criterion: certification guidance requires that every failure having catastrophic consequences must be demonstrated to be extremely improbable; the FAA defines “extremely improbable failure conditions” as “those having a probability on the order of 1 × 10⁻⁹ or less” (FAA AC 25.1309-1A, paragraph 10). In contrast, FMEA applications in other industrial domains accept catastrophic outcomes with probabilities that may be orders of magnitude more likely. An interesting criterion for aviation design that incorporates both probability and severity factors establishes that “in general, a failure condition resulting from a single failure mode of a device cannot be accepted as being extremely improbable” (FAA AC 25.1309-1A, paragraph 2-g). Thus, every failure mode having catastrophic consequences, regardless of its estimated likelihood, must be mitigated by a redundant system or a means of reliably detecting the failure before it occurs (the FAA guidance does suggest that “…in very unusual cases, however, experienced engineering judgment may enable an assessment that such a failure mode is not a practical possibility.”).
When considering the effectiveness of interventions in mitigating the risks of failure effects, a significant implication of probability analysis is the assumption of independent events. Normally, the probability of two events both occurring is the probability of one event multiplied by the probability of the other event. For example, consider an aircraft component that FMEA determines to have an unacceptable failure rate. To control this risk, designers require the mechanic to check the component before each flight and also require the pilot to recheck the component during the taxi-out checklist. If there is a 10 percent chance of the mechanic forgetting to check the component and also a 10 percent chance of the pilot skipping the same item on the checklist, the probability of the check being omitted by both persons is only 1 in 100. In this manner, adequate reliability can be obtained from two somewhat unreliable human performances by imposing multiple, redundant interventions. However, this analysis assumes that the pilot and mechanic events are independent, while in reality these events may interact: a pilot who knows that the mechanic is supposed to be checking the component may grow to rely on the mechanic and become less likely to perform the re-check. As another example, consider a pharmaceutical product that requires patients to receive periodic lab tests to detect possible adverse side effects. Multiple, redundant interventions are designed to ensure that patients receive the lab tests: doctors and pharmacists are both instructed to track the due dates for the tests and notify patients. However, if doctors become aware that pharmacists are tracking the due dates, the doctors may become less likely to perform this effort themselves; the multiple intervention then collapses to a single intervention and the redundancy is lost.
Whenever the assumption of independent events is violated and the likelihood of one event becomes a function of another event, it is impossible to conclude that the desired reliability will result from multiple interventions. Therefore, interventions must be designed and implemented so as to provide and preserve the independence of the events.
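The mechanic and pilot example can be worked through numerically. The 10 percent omission rates come from the example above; the 50 percent conditional rate for a pilot who has grown to rely on the mechanic is a hypothetical figure chosen only to show how quickly the benefit of redundancy erodes.

```python
def prob_both_omit(p_first_omits, p_second_omits_given_first_omits):
    """Probability that both checks are omitted.

    P(A and B) = P(A) * P(B | A). Under independence, P(B | A) equals P(B).
    """
    return p_first_omits * p_second_omits_given_first_omits


# Independent checks: the pilot's omission rate is 10 percent regardless
# of what the mechanic did.
independent = prob_both_omit(0.10, 0.10)

# Dependent checks (hypothetical): a pilot who relies on the mechanic skips
# the item half of the time in exactly the cases where the mechanic forgot.
dependent = prob_both_omit(0.10, 0.50)

print(independent, dependent)
```

Under these assumed numbers the combined omission probability rises from about 1 in 100 to about 1 in 20, even though each person's individual instructions are unchanged.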
Complementary analytical techniques
In its Safety Manual, NASA states that “risk assessment should use the simplest methods that adequately characterize the probability and severity of undesired events.” The NASA manual further states, “Qualitative methods that characterize hazards and failure modes and effects should be used first…quantitative methods are to be used when qualitative methods do not provide an adequate understanding of failures, consequences, and events” (NASA NPG 8715.3).
A variety of analytical methods are available to apply to risk management, in addition to FMEA. I will briefly define and discuss several of these methods and indicate how they can be used to complement FMEA and extend its applications into areas in which FMEA is otherwise inherently limited.
I have described the FMEA method as a “bottom-up” approach that attempts to identify failure effects (some of which may not yet have occurred in actual use of the product) by starting with individual component failures, imagining the ways the component can fail, and then proceeding up the chain of the system to subsequent failures and consequences. Further, I identified the bottom-up orientation of FMEA as advantageous for a prospective, accident-prevention program.
Some alternative analytical methods are “top-down” in that they begin with the ultimate system consequence or failure event and then proceed down into the system to identify why the failure occurred. These methods perform well as retrospective analyses; for example, investigations of accidents or incidents that have already occurred. However, top-down methods can also be useful in prospective analysis; for example, when concerned about a severe consequence, recognizing that the primary FMEA method may miss some failure effects, it may also be helpful to analyze beginning with the consequence itself and to search creatively for other sub-system functions or component failures that might bring about the undesired result.
The SAE’s recommended standard for the general evaluation of aircraft safety (ARP4761, Guidelines and Methods for Conducting the Safety Assessment Process on Civil Airborne Systems and Equipment) describes an over-arching “System Safety Assessment” (SSA) process. SSA integrates FMEA and some of the following approaches, as required, to thoroughly evaluate all of the failure modes, failure effects, and risks of a system and show that the entire system (the aircraft) operates at the required level of safety/reliability despite all anticipated failure modes.
Functional Hazard Analysis (FHA) is a top-down approach that is most often performed at the beginning of a design effort, when the final specifications for a product have not yet been settled but its basic functions are already established. Using engineering judgment and knowledge from similar efforts, analysts review the basic functions of a product or process and suggest system-level hazardous outcomes for further analysis. This method allows the safety/quality improvement process to begin early in product development, at least at a level of broad generality.
Methods similar to FHA also can be applied retrospectively, after a product is fielded. One successful application is Hazard Analysis and Critical Control Points (HACCP), which is used in the food services industries to evaluate the entire chain of food production and distribution, identifying and controlling sources of food contamination. This application seems amenable to the simpler FHA methodology rather than a formal FMEA.
Fault Tree Analysis (FTA) is a more formal top-down approach to identifying the causal links between functional breakdowns and their antecedents in events or failures of lower-level components. The FTA begins with the system-level failure or consequence that the analysts want to understand. Proceeding down through the system from the top-end level to the underlying processes and components, the analysis results in a graphical representation of the combinations of subsystem and component failures that can result in the system event. The fault tree (so-named because it resembles the root structure of a tree) uses standard notations of Boolean logic to denote precursor or lower-level events that must occur individually (“or-gate”) or in combination (“and-gate”) to bring about the higher level event. In this manner, FTA directly incorporates multiple causation (simultaneous/sequential) events. Further, when failure rates are added to each component of the tree diagram, the probabilities of each of the lower-level events can be added or multiplied to estimate the probability of the ultimate system-level event.
The following is an example of FTA provided by the SAE:
Source: SAE ARP926B, p. 46.
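The gate arithmetic used in a fault tree can be sketched in a few lines. The sketch assumes that the input events of each gate are independent; the tree structure and the probabilities are hypothetical and are not taken from the SAE example. For an or-gate, simply adding the input probabilities is a close approximation when the events are rare; the function below computes the exact value under independence.

```python
from math import prod


def and_gate(probabilities):
    """All input events must occur: multiply the probabilities."""
    return prod(probabilities)


def or_gate(probabilities):
    """Any input event suffices: one minus the chance that none occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical tree: the top event occurs if both redundant channels fail
# (and-gate) or if a single shared power supply fails (or-gate at the top).
p_channel_a = 1e-3
p_channel_b = 1e-3
p_power_supply = 1e-5

p_both_channels = and_gate([p_channel_a, p_channel_b])
p_top_event = or_gate([p_both_channels, p_power_supply])
print(p_both_channels, p_top_event)
```

In this assumed example the shared power supply dominates the top-level probability, which illustrates how a fault tree can expose a single-point contributor that redundancy elsewhere does not remove.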
As a top-down approach, FTA may identify one or more underlying causes of the top-level event but omit others that might be identified in the bottom-up FMEA. An additional limitation of FTA is that the methodology (unlike FMEA) does not represent the severity of consequences; hence, it is difficult to assess the risks of failure and evaluate them with respect to the available countermeasures, without also undertaking an FMEA.
Because it handles multiple failures, various multiple causations as expressed through Boolean logic, and the associated probabilities rather naturally, FTA also complements FMEA where the latter is limited. I suggest that FTA notation and techniques should be applied selectively to explore multiple failures and associated probabilities once these factors have been identified in the basic FMEA. Another advantage of FTA when used in combination with FMEA is the top-down check of the bottom-up process that I have already described. FTA might be applied selectively, once again, to confirm that FMEA has not omitted catastrophic outcomes. I would consider selective application of FTA as a complementary extension to the basic FMEA methodology. This is explicitly recognized by the SAE in ARP926B.
Probabilistic Risk Assessment (PRA) has been adopted by NASA as a formal methodology for analyzing “the probability (or frequency) of occurrence of a consequence of interest, and the magnitude of that consequence, including assessment and display of uncertainties.” (Michael A. Greenfield, “Risk Management Tools,” NASA Langley Research Center presentation, May 2, 2000). A key contribution of PRA is that it considers, tracks, and documents the current state of knowledge and certainty of the probabilities that are employed in basic FMEA and other analyses. One significant limitation of PRA, as defined by NASA, is that the methodology requires specific experience-based failure rate data for the components and functions that are being analyzed. As a result, I suggest that it may be difficult to apply formal PRA to “softer” areas such as human performance in FMEA interventions.
Markov Analysis (MA) is a specialized probabilistic analysis especially well suited to evaluating the failure effects and consequences of high-technology systems that include self-monitoring, self-repairing and self-reconfiguring functionalities. MA is capable of handling these complex relationships between failure mode, effect, and consequence by representing the relationship as a chain, each element in the chain in an operational or non-operational state, and the movement between states as a system of differential equations. I would suggest that MA is a good methodology to employ as a complement to basic FMEA and FTA when the nature of the components, environment, or operators requires it; otherwise, in accordance with the principle of minimizing the complexity of risk analysis, MA does not appear warranted in most applications.
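The simplest possible Markov model, a single self-repairing element that is either operational or non-operational, gives a feel for the technique. The sketch below assumes constant failure and repair rates (hypothetical values) and an element that starts in the operational state; it integrates the state equation dP_up/dt = -failure_rate * P_up + repair_rate * P_down numerically and compares the result with the known closed-form solution for this two-state case.

```python
import math


def p_operational_exact(t, failure_rate, repair_rate):
    """Closed-form probability of being operational at time t (starts operational)."""
    total = failure_rate + repair_rate
    return repair_rate / total + (failure_rate / total) * math.exp(-total * t)


def p_operational_euler(t_end, failure_rate, repair_rate, steps=10000):
    """Integrate the two-state equations with simple forward-Euler steps."""
    dt = t_end / steps
    p_up = 1.0
    for _ in range(steps):
        p_down = 1.0 - p_up
        p_up += dt * (-failure_rate * p_up + repair_rate * p_down)
    return p_up


# Hypothetical rates per hour: one failure per 100 hours, repair in about 10 hours.
failure_rate, repair_rate = 0.01, 0.1

exact = p_operational_exact(50.0, failure_rate, repair_rate)
numeric = p_operational_euler(50.0, failure_rate, repair_rate)
print(exact, numeric)
```

Real applications involve many more states (for example, degraded or reconfigured modes), which is where the added complexity of MA is incurred and where a dedicated solver would normally replace the hand-written integration shown here.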
To summarize these alternative methodologies, it is quite possible to extend a basic FMEA into areas in which the FMEA method is limited, including multiply caused events, simultaneous or sequential events, and the estimation of probabilities of failure modes, effects, and consequences (and our confidence in the estimated probabilities), by applying selected aspects of FTA and PRA to the FMEA. I do not suggest that complete, formal FTA and PRA need to be undertaken in every FMEA application; rather, these methodologies should be drawn from as required.
Complementary field reporting and data analysis systems from aviation
In a previous section, I mentioned the importance of feeding information from the post-manufacturing user communities and processes back into the FMEA to ensure that the consequences of failure modes that arise only in product use (perhaps because they were rare events and did not occur during design and testing) are recognized and compensated for once they have been discovered. There are several fairly recent developments in aviation industry reporting and analysis systems, potentially useful for refining and refreshing an FMEA on a continuing basis, that may also have applications in other industries.
Aviation Safety Action Programs (ASAP) are cooperative reporting systems for persons active in commercial aviation operations, including pilots, mechanics, and aircraft dispatchers, to report the events that happen in daily line operations. ASAP reports are non-jeopardy; in fact, if a person reports an event to ASAP independently of enforcement action by the regulatory authority (FAA) then the FAA will typically waive sanctions for any regulatory violation related to the event. This waiver of sanctions motivates personnel to report the information. ASAP reflects the aviation system’s recognition that for human failings, obtaining the information is often more important than punishing the transgressions, most of which are inadvertent in any case. A key feature of the ASAP program is the Event Review Team, comprising representatives from the airline, the pilots’ association, and the FAA, which meets periodically to review all submitted ASAP reports and act on the information in the reports. ASAP is considered to be successful in revealing, disseminating, and promoting resolution of adverse events in daily flight operations that would otherwise remain unknown. ASAP applications are increasingly popular in commercial aviation. These programs are described in official FAA guidance (Advisory Circular 120-66B, Aviation Safety Action Program).
Whereas ASAP obtains information from the personnel in the aviation system, Flight Operations Quality Assurance (FOQA) programs tap into the volumes of parametric data generated during regular flight operations and recorded continuously by on-board solid state recording equipment (similar to, but usually distinct from, the crash-hardened Digital Flight Data Recorders that are used in accident investigations). In FOQA, the greatest challenges are handling mass data and then interpreting the information. Initial applications of FOQA concentrated on identifying events in which normal flight parameters (such as airspeed limitations, g-loading, touchdown relative to target) were exceeded. The programs are beginning to delve beyond exceedance monitoring to the consideration of within-specification performance statistics, including both the means and the distributions about them, which can then define the norms of the industry. There is also a growing trend in FOQA programs to link the information obtained from FOQA with information derived from ASAP about the same events. This facilitates the combined analysis of “what” happened (from FOQA) and “why” it happened (from ASAP, to the extent that the personnel involved in the event were aware of why they performed the way that they did). A long-term NASA research program, the Aviation Performance Measuring System, is encouraging the establishment of FOQA programs at various U.S. airlines and enhancing data analysis along these lines. Most of the major U.S. air carriers are generating and collecting FOQA data on at least their more modern fleet types (these aircraft are equipped with the required data busses). FOQA programs are described in the Flight Safety Foundation’s Flight Safety Digest, July-September 1998, “Aviation Safety: U.S. Efforts to Implement Flight Operational Quality Assurance Programs.” Although analogous data may not be available in other applications, FOQA demonstrates the value of routine monitoring of the use of products in the field, including the identification of product misuse (exceedances in FOQA) and the characterization of norms for product use.
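The two styles of FOQA analysis described above, exceedance monitoring and the characterization of norms, can be sketched in a few lines of Python. The parameter, the sample values, and the limit below are hypothetical, and actual FOQA programs work with far larger data sets and more elaborate event definitions.

```python
from statistics import mean, stdev

def summarize_parameter(samples, limit):
    """Flag exceedances of a limit and describe the normal distribution of use.

    Returns the (index, value) pairs that exceed the limit, plus the mean
    and sample standard deviation of all recorded values.
    """
    exceedances = [(i, value) for i, value in enumerate(samples) if value > limit]
    return {
        "exceedances": exceedances,
        "mean": mean(samples),
        "stdev": stdev(samples),
    }

# Hypothetical touchdown load factors (g) from five flights.
touchdown_g = [1.2, 1.3, 1.1, 1.9, 1.25]
summary = summarize_parameter(touchdown_g, limit=1.8)
print(summary)
```

The exceedance list corresponds to product misuse in the field, while the mean and standard deviation describe the norm of product use against which an FMEA's assumptions can be compared.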
The Continuing Analysis and Surveillance System (CASS) is an aviation reporting and analysis system that concentrates on tracking product failure modes, effects, and consequences in actual line maintenance operations. CASS is one of the oldest data-driven quality assurance programs, beginning in 1964 and tracing its history to industry concerns about several maintenance-related air carrier accidents during the 1950s. Air carriers are required to implement CASS by Federal aviation regulations (14 CFR Part 121.373); interestingly, CASS is the only safety management/quality assurance system that has been specifically mandated by the FAA. CASS is defined by the FAA as a “structured process to identify factors that could lead to an accident or incident through collection and evaluation of information that can be used as indicators of the degree of maintenance program effectiveness and performance…accomplished through a closed-loop, continuous cycle of surveillance, investigations, data collection and analysis, corrective action, corrective action monitoring, and back to surveillance.” (FAA AC 120-16D, Air Carrier Maintenance Programs, and AC 120-79, Developing and Implementing a Continuing Analysis and Surveillance System).
Event reporting systems with many similarities to these aviation systems are being developed and used in other industries, including healthcare. I think that review of the characteristics and implementation of ASAP, FOQA, and CASS may enhance similar systems in alternative industries, particularly as these aviation systems are applied in combination to obtain information that only the personnel in the system can report, additional mass data about regular operations, and specific product and personnel failures in the post-manufacturing environment. Also, I suggest that information systems with these characteristics can be effective feedback mechanisms for the ongoing analysis of failure modes, effects, and consequences through FMEA.
The Boeing 737 Flight Controls Engineering Test and Evaluation Board: a successful application of extended FMEA
On September 8, 1994, USAir flight 427, a Boeing 737-300 airplane, crashed while maneuvering to land at Pittsburgh International Airport, Pittsburgh, Pennsylvania. All of the 132 persons aboard were killed, and the airplane was destroyed. The accident occurred in clear weather with light winds, during the hours of daylight. After a three-year investigation, the National Transportation Safety Board (NTSB) determined that the probable cause of this accident was “loss of control of the airplane resulting from the movement of the rudder surface to its blowdown limit…The rudder surface most likely deflected in a direction opposite to that commanded by the pilots as a result of a jam of the main rudder power control unit servo valve secondary slide to the servo valve housing offset from its neutral position and overtravel of the primary slide.” (National Transportation Safety Board, Uncontrolled Descent and Collision With Terrain, USAir Flight 427, Boeing 737-300, N513AU, Near Aliquippa, Pennsylvania, September 8, 1994, NTSB AAR-99/01, adopted on 3/24/99).
Before this accident the rudder system of the 737 had been evaluated by Boeing and the FAA, in full compliance with existing certification requirements, using failure analysis (a less rigorous version of FMEA) for the original design reviews performed during the 1960s and FMEA for new-model reviews performed during the 1980s and 90s. Because the rudder systems had not been completely redesigned in the new model 737s, the FAA required only a very limited scope for the FMEAs conducted in the 80s and 90s. Despite these analyses and consistent with their limited scope, the NTSB investigation determined that the airplane’s rudder system was subject to several previously unidentified single-point failures that could have catastrophic results. One or more of these failure modes was most likely involved in the rudder system jam and reversal, which led to the fatal accidents.
The NTSB issued numerous safety recommendations related to its findings regarding the Boeing 737 rudder system and unusual attitude recovery procedures for flight crews. In Safety Recommendation A-99-21, the NTSB recommended to the FAA:
Convene an engineering test and evaluation board to conduct a failure analysis to identify potential failure modes, a component and subsystem test to isolate particular failure modes found during the failure analysis, and a full-scale integrated systems test of the Boeing 737 rudder actuation and control system to identify potential latent failures and validate operation of the system without regard to minimum certification standards and requirements in 14 Code of Federal Regulations Part 25. Participants in the engineering test and evaluation board should include the Federal Aviation Administration (FAA); National Transportation Safety Board technical advisors; the Boeing Company; other appropriate manufacturers; and experts from other government agencies, the aviation industry, and academia. A test plan should be prepared that includes installation of original and redesigned Boeing 737 main rudder power control units and related equipment and exercises all potential factors that could initiate anomalous behavior (such as thermal effects, fluid contamination, maintenance errors, mechanical failure, system compliance, and structural flexure). The engineering board’s work should be completed by March 31, 2000 and published by the FAA.
In response to this recommendation, the Engineering Test and Evaluation Board (ETEB) was convened in May 1999 and completed its work in July 2000 with the issuance of a final report. (Federal Aviation Administration, 737 Flight Controls Engineering Test and Evaluation Board Final Report, July 20, 2000.) The staff of the ETEB was detailed from the FAA, Boeing (Commercial, Space, and Military Airplane divisions), Air Line Pilots Association, Ford Motor Company, Air Transport Association, Interstate Aviation Commission (Russia), NASA, and U.S. Navy.
According to the ETEB’s report, the group conducted:
- A failure analysis of the flight control system to identify potential failure modes;
- Component and subsystem tests to isolate particular failure modes found during the failure analysis; and
- Full-scale integrated systems tests, including ground and flight testing, of the … 737 rudder actuation and control system to identify potential latent failures and to validate the operation of the system (ETEB Final Report, p. 2-3).
The ETEB noted that normal certification procedures for aircraft and components require consideration of the probabilities of a failure mode or adverse effect. However, the ETEB chose to evaluate the severity of failure mode consequences without regard to their probability of occurrence. The ETEB’s rationale for this approach was that the Boeing 737 had experienced approximately four serious failures of its rudder system in 100 million flight hours, two of which had resulted in fatal accidents. Therefore, the failures under investigation were extremely rare but of extremely adverse outcome. Consequently, it was considered appropriate to treat any failure mode with the potential for catastrophic consequences as of the highest risk level, regardless of how unlikely the failure mode or effect. A related goal of this new analysis was to “focus…on rare failures that may not have been considered in the original certification requirements” because those failures were considered extremely improbable (ETEB Final Report, p. 2-8). The ETEB described its analytical approach as follows:
The ETEB conducted a comprehensive and detailed failure modes and effects analysis (FMEA) for the complete rudder control system…Preliminary hazard classifications were assigned to each failure, based on the predicted severity and the ability of the flight crew to maintain control of the airplane and conduct a safe landing. For all failures classified as “catastrophic (Class I)” or “hazardous (Class II),” the ETEB conducted failure simulations using a detailed high-fidelity simulation of the rudder control system. In addition, the ETEB conducted pilot-in-the-loop failure simulations using a motion-base flight simulator. The purpose was to identify the impact of the failures on the operation of the airplane following flight crew actions. The hazard classifications of the failures were updated, based on the combined results from these two simulation activities (ETEB Final Report, p. 2-7).
These tests and simulations were used to verify and validate the hazard levels that had preliminarily been assigned to the failure modes. Because some failures and interventions had unexpected consequences in the testing, the feedback from these verifications was extremely important and influential in the final conclusions and recommendations of the ETEB. This demonstrates how an FMEA that is open to feedback and change, either from testing or field experience, can provide much better results than a one-time evaluation.
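The severity-first, feedback-driven process described above can be sketched as a small data structure. This is a simplified illustration rather than the ETEB's actual worksheet logic: hazard classes are represented as integers (1 for Class I, 2 for Class II, and so on), the failure mode names are hypothetical, and a simulation result is assumed simply to replace the preliminary classification.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FailureMode:
    name: str
    preliminary_class: int                  # 1 = Class I (catastrophic), 2 = Class II (hazardous), ...
    simulated_class: Optional[int] = None   # set once simulation results are available

    def needs_simulation(self) -> bool:
        """Class I and Class II failures are carried forward to simulation."""
        return self.preliminary_class <= 2

    def final_class(self) -> int:
        """Simulation feedback, when present, overrides the preliminary class."""
        if self.simulated_class is not None:
            return self.simulated_class
        return self.preliminary_class

def rank_by_severity(modes):
    """Order failure modes by final severity only; probability is ignored."""
    return sorted(modes, key=lambda mode: mode.final_class())

# Hypothetical failure modes (illustrative only).
modes = [
    FailureMode("mode A", preliminary_class=3),
    FailureMode("mode B", preliminary_class=1, simulated_class=2),
    FailureMode("mode C", preliminary_class=2, simulated_class=1),
]
ranked = [mode.name for mode in rank_by_severity(modes)]
print(ranked)
```

In this example, mode C is upgraded to the most severe class by simulation feedback and therefore moves to the top of the ranking, which mirrors the way unexpected test results changed the ETEB's final classifications.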
The ETEB illustrated the verification and feedback built into the FMEA in the following figure from its final report:
Source: ETEB Final Report, p. 2-6
The full range of hazard classifications followed standard FAA practice and was defined as follows by the ETEB:
Source: ETEB Final Report p. 3-3
The ETEB used a standard adaptation of the FMEA analysis form (see table). It is interesting to note how the form explicitly recognized the mitigating effects of flight crew actions in response to equipment malfunctions (columns 5, 7, and 8).
Source: ETEB Final Report, p. 3-2
Although the possibility of imperfect flight crew performance (a realistic expectation for human intervention in a complex or stressful situation) was not explicitly modeled on the FMEA worksheet, the ETEB accomplished this important extension to the basic FMEA by validating and revising assumptions about the reliability of flight crew performance through its testing process. The ETEB found that flight crews were not able to reliably intervene and mitigate the consequences of rudder component failures in some operational circumstances, and these revised expectations were entered into the final versions of the FMEA worksheets.
The following figure provides an excerpt of an actual FMEA worksheet. This worksheet includes a finding of catastrophic severity for a failure effect that could not be mitigated:
Source: ETEB Final Report, appendix A, p. 95
Another useful extension that the ETEB added to the basic FMEA was the explicit consideration of latent (preexisting, undetected) failures combined with active failures. Although FMEA is not considered to be well-suited to the analysis of multiple failure modes, the ETEB was able to readily analyze these sequential failure combinations by treating the latent and active failures as a single combined failure mode for subsequent evaluation of the failure effects and consequences. This manual extension of the FMEA method was effective for linked pairs of failures; I think that it would have been very complicated to use this method to track and display triple or more complex failure combinations, but analysis of such combinations was not required.
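The pairing technique described above can be sketched as follows, with each latent/active pair becoming one combined failure mode that is then carried through the usual effects and consequences columns. The failure names are hypothetical placeholders, not entries from the ETEB worksheets.

```python
from itertools import product

def combined_failure_modes(latent_failures, active_failures):
    """Treat every latent/active pair as a single combined failure mode."""
    return [
        f"{latent} + {active}"
        for latent, active in product(latent_failures, active_failures)
    ]

# Hypothetical failure lists (illustrative only).
latent = ["latent fault 1", "latent fault 2"]
active = ["active jam 1", "active jam 2", "active jam 3"]

pairs = combined_failure_modes(latent, active)
print(len(pairs), "combined failure modes")
for mode in pairs:
    print(mode)
```

The number of combined modes is the product of the list lengths, so extending the same approach to triple combinations multiplies the worksheet size again, which illustrates why the manual method becomes unwieldy beyond pairs.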
The table that follows (from ETEB Final Report, p. 3-40) provides a sample of the new latent/active failure combinations that the ETEB was able to identify and analyze using FMEA:
The FMEA undertaken by the ETEB was successful in identifying a large number of previously unknown or unevaluated failure modes, several of which had the potential to result in catastrophic consequences. The following are excerpted from the results presented by the ETEB in its final report:
The [Boeing] 737 rudder control system is susceptible to a number of:
- Failures and jams that can cause uncommanded rudder motion;
- Failures and jams that affect the operation of both the rudder main and standby power control units (PCU), thereby defeating the independence of the two systems; and
- Latent failures.
These failure modes are single failures, single jams, or latent failures in combination with a detectable failure or jam.
The rudder control system of the Initial and Classic Model 737s with the modifications required by the applicable FAA [Airworthiness Directives]…have:
- 14 single failures and jams, and 12 latent failure combinations, that have Class I failure effects in the takeoff and landing regimes. These same failure modes have 4 Class I effects and 22 Class III (major) effects in the rest of the flight envelope.
- 8 single failures and jams, and 11 latent failure combinations, that have Class II failure effects. (ETEB Final Report, p. 1-3)
The ETEB drew strong conclusions about factors influencing the efficacy of human interventions to mitigate rudder system failures:
The ETEB conducted 40 hours of pilot-in-the-loop rudder failure simulations with 10 pilot and co-pilot flight crews from four airlines.
- In general, the flight crews found the existing Jammed or Restricted Rudder Emergency Procedure difficult to use.
- The flight crews appeared to have received little training in the use of the Jammed or Restricted Rudder Emergency Procedure or the Uncommanded Yaw or Roll Emergency Procedure.
- The lack of a clear and unambiguous display of rudder position made it difficult for the crews to diagnose uncommanded rudder deflections and take prompt corrective actions.
- Uncommanded rudder hardover deflections during takeoff and landing resulted in Class I failure effects [i.e., human intervention was not reliably effective] (ETEB Final Report, p. 1-4).
The ETEB’s investigation of latent failure effects using extended FMEA methods resulted in a conclusion that “there are several latent failures that, when combined with one additional single failure or jam, result in Class I or Class II failure effects. There are insufficient inspections for these latent failures” (ETEB Final Report, p. 1-5).
As I have indicated throughout, no FMEA can be considered complete unless it leads to the mitigation of the unacceptable risks that the analysis identifies. The ETEB’s application of FMEA resulted in the following recommendations for redesign of the rudder system:
Modify the Boeing Model 737 rudder control system to ensure that:
- No single failure or single jam of the rudder control system will cause uncommanded motion of the rudder surface that results in a Class I failure effect;
- No combination of failures or jams will result in a Class I failure effect, except for those combinations that are shown to be extremely improbable; and
- No probable single failure or jam will have an effect worse than Class IV.
In addition, The Boeing Company should consider providing a fail-safe rudder control system design that provides protection from latent failures that contribute to a Class I failure effect (ETEB Final Report, p. 1-6).
As a result of these recommendations (and the preceding accident investigation causal findings and recommendations of the NTSB), the Boeing 737 rudder system has been redesigned to provide reliable redundancy, and a major hardware retrofit program is underway for the entire fleet.
To mitigate risks pending completion of this fleet retrofit, the ETEB also provided the following recommendations to improve the risk mitigation value of human (pilot and mechanic) interventions following a rudder system failure:
- Revise and simplify the current “Jammed or Restricted Rudder” emergency procedure.
- Provide additional training to flight crews in the use of the “Jammed or Restricted Rudder” emergency procedure and the related “Uncommanded Yaw or Roll” emergency procedure.
- Display rudder position to the flight crew.
- Alert flight crews and maintenance crews to the signs of rudder malfunctions, such as uncommanded pedal motion (ETEB Final Report, p. 1-6).
These recommendations targeted at improving human performance have been partially implemented by the aircraft manufacturer and FAA, from 2000 to present. Despite the limitations that remain in human interventions, it is most significant, I believe, that the result of the FMEA performed by the ETEB was to render the designers’ expectations for human performance, and the design’s reliance on human intervention, much more consistent with realistic human capabilities and limitations. This was a strong contributor to the accuracy and applicability of the FMEA’s results and its ability to improve system safety.
In all, I believe that the ETEB process was a very successful example of the application of FMEA extended with (1) top-down analysis (the program began with foreknowledge that the end-level adverse event to eliminate or mitigate was flight control malfunction leading to loss of aircraft control), (2) consideration of multiple (latent) failures, (3) realistic consideration of human performance during interventions, and (4) feedback from external data sources to FMEA revision. In the ETEB application, FMEA was not supplemented by data-driven analysis of conditional probabilities; this was an appropriate, conservative response to the extremely rare/extremely hazardous nature of the environment and threats.
The ETEB’s work shows how the basic FMEA combined with complementary extensions can form a comprehensive safety analysis that results in real safety improvement. The excellent results of the ETEB program are equally a testament, I think, to a strong effort to creatively re-think the failure modes and effects for a system that had been thought to be completely well-understood and thoroughly time-tested by 100 million hours of field use. This creativity and openness are necessary ingredients for any successful analysis.
Conclusions about FMEA
Based on the foregoing review, I conclude the following about the Failure Modes and Effects Analysis methodology:
- FMEA is a sound methodology for basic, structured risk management and quality improvement analysis.
- The ideal approach can be to use FMEA as the backbone for analysis that also includes the integration of complementary methods, as required; for example, it may be appropriate to apply elements of FTA or PRA to understand and explore the proper scope of analysis, the significance of failure effects, and the effectiveness of risk management interventions.
- Thoughtful application of FMEA can identify when these extensions are required and can integrate and document the results of an extended analysis.
- The limited reliability of humans in complex systems argues for multiple, redundant, independent interventions when relying on humans to detect failure modes or actively intervene to mitigate failure effects.
- FMEA, as extended with appropriate top-down, probabilistic, and feedback methods, is an excellent framework for risk management and quality improvement in the post-design/post-manufacture (field distribution, application, or user) environment, including the human performance aspects of this environment.
I acknowledge and thank ParagonRx, LLC for its support of my review of risk-management methodologies and the writing of this paper. All opinions expressed herein are my own and do not necessarily represent the opinions, policies, and products of ParagonRx, LLC.