White Paper
Reliability Testing: Verifying Reliability for Embedded Systems
By: Mike Willey, Vice President
Abstract
Many companies guarantee their products, but few work through the process of determining their real risk and estimated failing point. This paper discusses some general procedures companies can apply to their own reliability testing analysis, allowing them to check their own risk analysis and improve their testing. Design verification testing (DVT) uses the engineering specification as its road map. The DVT establishes conformity of the completed design and the manufactured product with the original engineering specification. The testing covers the functions of the product and its interaction with other devices. Also tested is the ability of the product to work over the specified range of environmental conditions. Lastly, the product must conform to the appropriate safety standards and the requirements of regulatory agencies. Lifetime testing is used to verify whether the product meets the product life design objectives. Automated round-the-clock operation of the device permits a manufacturer to simulate accelerated usage with the goal of statistically estimating wear and tear over the lifetime of the product.
Scope
This paper will cover four elements of reliability testing. Design verification testing (DVT) ensures that the design of the product meets all requirements. Reliability analysis predicts how physical failures (based on fatigue and wear) occur over time. System burn-in testing helps accelerate the occurrence of early life failures. Reliability tracking and analysis helps validate and refine the other elements of reliability testing.
These elements comprise the continuum of reliability analysis. We will describe each element and how they interrelate. We will also cover practical methods of implementing a reliability testing program for embedded systems development and production.
Embedded systems typically comprise mechanical, electronic, electromechanical, and firmware components. Reliability testing relates to all of these areas. This paper will pay special attention to the electronic and electromechanical aspects of reliability testing and discuss the other areas only briefly.
Introduction
When we talk about reliability we are discussing the most fundamental measure of quality. Every product is designed and built to perform some function. As long as it continues to perform that function the product is deemed to be reliable. Reliability always influences the market reception and perception of a product. For embedded systems reliability may also be a matter of life and death.
It has long been established that reliability can be computed mathematically for certain electronic devices. Often misunderstood, however, is the reason that we can compute reliability based only on the number of components, their complexity, and the operating environment. The United States Military and old Bell System recognized that standardized processes led to predictable reliability on purchased systems. Both have established rigorous standards for procurement and performance tracking on systems and devices. The results of this tracking are accumulated in the Military Handbook MIL-HDBK-217 and Bellcore TR-NWT-000332. These documents allow anyone building electronics products to learn from their experience.
The common thread behind the Military and Bellcore reliability analysis systems is the documentation of repeatable, predictable processes in manufacturing. This is also the underlying tenet of ISO 9000. By documenting the processes used to produce a product and following those processes, products will be produced with a consistent level of quality. The reliability of products produced by these processes can then be recorded and analyzed over time and used to predict the reliability of future products.
Any process to produce any product can be documented and serve as the basis for reliability prediction. Some of the processes that are involved in embedded systems include:
- system design
- electronics design
- PCB design
- mechanical design
- PCB fabrication
- PCB assembly
- mechanical fabrication
- mechanical assembly
- software design
This paper will discuss the computation of reliability for electronic products, setting up processes that can ensure consistent reliability, and tracking and analyzing reliability data.
When we think of high quality design, we might call to mind the DC-3, an aircraft designed by Douglas in 1935. There are still DC-3s in active commercial service today. In fact, it is not uncommon for aircraft to remain in service for many decades. We might well ask, what makes aircraft so reliable? A substantial part of that reliability is based on solid design, accurate failure analysis, preventive maintenance, and continuing market viability. These elements begin before the product announcement and continue far beyond the last actual sale.
Typical Product Life Cycle
It will be helpful to understand a typical product life cycle, to determine when and how reliability issues must be addressed. Figure 1 shows a very high level view of Product Life Cycle for any new product, be it an airplane or an embedded system. The job functions that typically participate in the development and sale of a product are shown with the activities generally associated with those functions above. In many companies the lines between these job functions may be blurred, but the activities and responsibilities are universal.
The primary function of marketing is to determine or generate a market need for a product. The marketing job function is primarily to provide the product concept and to help validate the commercial specification.
The role of the sales job function is to take the final product to the market place identified by or created by marketing. Sales teams are generally in very close contact with the client base and can serve to validate the product concept and assist in the creation and validation of the commercial specification.
Engineering's role in a product development is to translate the product concept into an actual product. Engineering can have a valuable role in developing the commercial specification and validating the technical assumptions made by the marketing and sales functions.
The role of product management is to represent the customer's interest in the product and to be a champion for the product within the company. Product management provides important information concerning the validation of the design and feedback of information from the product users and potential users.
The role of production is, obviously, to build and ship the product. Production also plays a valuable role in verifying the ability to economically produce the product and determining how to improve the quality of the product.
The diagram below shows the activities as flowing from one to the next without any backtracking. If this were used as an actual implementation model, the resulting product would not be likely to meet the customer needs, would not be likely to be reliable, and would probably be short lived in the market place.

Figure 1: Simplified Product Life Cycle
The remainder of this paper will discuss a more detailed view of the activities shown in Figure 1 and how these activities can be structured to ensure high reliability products.
Design Verification Testing
Design verification testing is the process of ensuring that a product matches a commercial need, can be produced at a reasonable cost, is reliable in operation, and is consistent with the strategic goals of the company. There is a lot left unsaid in this statement. Specifically, design verification testing is not the process of ensuring that a device is the most reliable, the least expensive, or a perfect match for consumer needs. These three areas must be tempered with the strategic goals of the company. For example, there is a place in the market for the most reliable automobile, but the most reliable product may not be the least expensive to produce.
The foundation of reliable and successful products is the strategic plan for the company. If a company or division of a larger company does not have a mission and a plan to accomplish that mission, it is very unlikely that anything produced by the company will be a success. Developing missions and business plans is beyond the scope of this paper. It should be recognized, however, that before beginning any project everyone involved must be familiar with the company mission and their part in the plan to accomplish that mission.
The process of Design Verification Testing comprises the activities that take place in the boxes labeled "Product Concept", "Commercial Specification", and "Detailed Design" in Figure 1. A detailed look at these three activities and how they interrelate is shown in Figure 2 below.

Figure 2: Design Verification Testing Process
Developing a Reliable Specification
The making of a high quality product begins with a high quality design. A solid design does not begin with an engineer; it begins with the identification of a significant market need and a concept of a product to meet that need.
A product concept is usually developed by the marketing job function within a company. The product concept will take the form of a white paper or other document that describes the market need, market size, current solutions, and competing products. We will refer to this as the product requirements document. Prior to moving on to the next stage of development, the Commercial Specification, the product requirements document must be validated.
There are a number of ways to validate the product requirements for a product. Some of these include focus groups, surveys, and interviews with potential customers. The important issue is that the product requirements must be validated and that validation must not amount to the fox guarding the hen house. The group responsible for developing the requirements should include other objective parties in the evaluation process. Focus groups conducted by a consultant or a different department can help bring some perspective to a requirements document and will find areas where the definition is incomplete. The negative consequences of skipping validation of the product requirements can be catastrophic. To understand how costly the omission of this step in product development is, consider the following example.
The XYZ Company marketing department has a great new idea. They are going to design a standalone controller card for the chrome plated muffler bearing (CPMB) industry. The marketing guys thought they knew everything about CPMBs, so they put together a commercial specification and gave it to the engineering department. The engineering department spent 30 man months getting the product designed and the prototypes built. When they went to install it on an actual muffler bearing, they discovered that the latest technology in muffler bearings was to control them with an infrared serial port rather than the ±30 volt parallel ports of the older technology. When trying to determine how this happened, the marketing department said, "We assumed that the engineers were on top of the latest technology, so we didn't bother putting it in the specification." The engineering department said, "We knew about the new technology, but we thought that the old technology performed better. If the guys from marketing wanted us to use the other technology, they should have told us."
The bottom line of this story is that 30 man months of engineering is now substantially down the drain. Had this discrepancy been caught in the early phases of product development, the cost to change direction for the engineers would have been minimal.
Once a commercial specification is ready, it should be reviewed by engineering and marketing together. The purpose of this review is to determine feasibility. In many companies, marketing will involve engineering from the beginning to help prevent too much blue sky thinking. Although this is certainly helpful, it does not mean that a review of the commercial specification can be skipped.
Pausing a moment at the completion of each major phase of a product development to ensure that the phase is as complete and accurate as possible is always worth the investment. As the example above shows, skipping the verification stage can allow major problems with a product to slip through the entire design and development cycle, at a cost many times greater than the cost of a design review.
Developing a Reliable Design
When engineering has a commercial specification that has been tested by marketing and identified as feasible by engineering, it is time to begin the design phase. There are two areas of design that we will address in this paper, software and electronic hardware design. One can describe an embedded system as a device that performs a dedicated function through the seamless integration of software and electronics. Since most engineers specialize in either hardware or software development this implies a large amount of communication will be needed to make the integration seamless.
What does it take to develop a reliable design?
- Clear, unambiguous commercial specifications
- Complete, well maintained, design documentation
- Testing the documentation against the commercial specifications
- Implementing the design from the design documentation
- Testing the implementation against the design documentation
The most common area where problems creep into the design process is the documentation of the design. Supervisors want projects completed yesterday, and it is often hard for them to understand that writing the documentation is progress towards the goal. Engineers don't like to write; we like to get our hands dirty and skip to the end result as soon as possible. Writing dry technical documents is not why we got into engineering; it's not fun. So engineers and supervisors agree:
"We'll write the documentation after we get the project done. Besides, if we write it now, we'll just have to change it when the project is over anyway."
"We'll just do a draft of the documentation, I know pretty well what I intend to do so I just need to provide an overview. Anything you don't understand or you think is missing just ask."
The sad truth is that both are making their lives more difficult. Recalling the XYZ Company and their chrome plated muffler bearings, a little documentation up front would have saved them from disastrous schedule slips and over expenditures. The time saved by putting off design documentation is an illusion. When the documentation is put off until the end of the project, design requirements will be missed. Even though the project may be delivered on time, there will be a seemingly endless stream of engineering changes and bugs. In the end, even if the product gets to the market a little sooner, reliability problems may be able to accomplish what the competition cannot. This is not much fun for the engineer, staying at the office all night long to fix a problem that might never have shown up if the design documentation were complete.
"It's really a very simple design. I don't need to write a theory of operation, because I can see it all in my head."
If it is that simple, it won't take long to write the theory of operation. Documentation doesn't have to be heavy, just complete. Below are listed some of the documents that might be included in a software or hardware documentation project. Although the lists may vary some and additional documentation may be needed, these lists provide a good start.
Short cutting the documentation or the verification of the documentation is the same as skipping the design. It is important for the engineering group to sit down at the beginning of the project and identify which set of documents will be needed to provide a complete design description, then create and verify those documents. The results of this process can be seen in the DC-3, still flying after over 60 years of continuous service.
Testing a Design
Design reviews are the test processes for a design. Many engineers find design reviews to be intimidating. This is because the purpose of the design review and the roles of the participants are often misunderstood.
The participants in the design review should be the engineers who created the design and design documentation, the marketing group who created the commercial specification, representatives from manufacturing (if possible), and the project manager. The purpose of the design review is to uncover possible errors or omissions in the design before they are implemented. The participants are there to ensure that as many possible points of view are used to verify the design. Too often design reviews turn into bashing sessions when the participants feel that a flaw discovered in the review process reflects poorly on the engineers.
The most effective design reviews focus on asking questions. Many engineers feel that if the design review concludes with design questions which were not considered prior to the design review, there is a problem. The truth is that raising such questions is the purpose of a design review. The important result of a design review is the action item list. Once that list has been fully addressed, the design can move on to implementation.
Here are some hints for conducting a successful design review:
- Provide copies of all materials to the participants with enough time for them to review them.
- Provide an outline or copies of presentation slides with the other materials.
- When preparing your outline, make it detailed enough that you expect each item to require between 1 and 5 minutes of discussion.
- When answering questions, don't rush off to get something off of your desk if the material you need is not in the room. Make it an action item for later.
- Keep an action item list.
- Make assignments for resolution of each action item.
- Don't make commitments you can't keep when scheduling action items. It's O.K. to defer the schedule commitment for action items. Just make sure that a commitment is made as soon as possible after the review.
- Have frequent breaks. Design reviews can be very intense and it's easy to lose your concentration.
- Follow the agenda. Defer discussions that are covered later on the agenda, and move lengthy discussions of unplanned items to the end of the meeting.
Believing Your DVT Results
There are three possible results of a successful design review:
- A list of action items is generated. When completed, the implementation of the design can proceed.
- A list of action items is generated and some significant design issues are identified. Another design review will be required before implementation can proceed.
- Virtually insurmountable implementation issues are encountered. The design must be sent back to generate a new commercial specification before a design can be completed.
A fourth possible result is that no action items are generated. If this happens, one should be very skeptical of the results of the design review. No design can be made perfect; if no action items were identified, the odds are that the review was incomplete or the participants were not properly prepared.
Reliability Analysis
After the design and test procedures have been developed and the implementation of the design has been substantially completed, there is some analysis that can be done to predict failures due to stress and normal wear and tear. This analysis is based on statistical methods developed by AT&T Bell Labs (now Bellcore) and the US military. These methods can be used to predict failures in electronic devices and are discussed below.
Bellcore vs. Milstd
There are two widely used standards for predicting reliability in electronic devices: Bellcore TR-NWT-000332, Reliability Prediction Procedure for Electronic Equipment, and US DoD MIL-HDBK-217, Reliability Prediction of Electronic Equipment. Both methods provide good prediction results for the environments they were intended for.
MIL-HDBK-217 is written around military equipment standards. This standard makes allowances for devices that will be used at all extremes of temperature, vibration, shock, etc. This procedure can predict the reliability of electronic equipment in almost any environment. It also requires a huge number of calculations and very specific information on the operating conditions of devices, such as operating junction temperatures in all semiconductor devices. The problem with this method of reliability prediction is that if your product is not operating in an extreme environment, it can take longer to predict the reliability than it takes to design the product.
TR-NWT-000332 takes a subset of the analysis for MIL-HDBK-217 and makes assumptions about the operating environment which are more applicable to commercial and automotive products. If the design is not for airborne, shipboard, or military field electronics, the Bellcore prediction will match the DoD predictions very closely and is usually preferred because it is significantly less costly to implement.
Types of Failures
There are some things reliability predictions can deal with and some they cannot. It is very important to understand the distinctions, especially when the reliability prediction will be used to compute warranty cost. All electronic components degrade in operating characteristics over their useful life. Generally, failures caused by this degradation are related to operating temperature and other stresses applied to the device. Failures due to time, temperature, and stress can be predicted readily. The other type of failure is sudden stress. This type of failure is caused by lightning strikes, dropped components, improper handling, etc. Mechanical and electronic countermeasures can be taken to reduce this type of failure, but the frequency of these types of stresses is very unpredictable and may vary from customer to customer. This makes the prediction of sudden stress failure virtually impossible, except in retrospect.
Parts Count Reliability Prediction
The parts count method of reliability prediction is supported by both the DoD and Bellcore standards. This method, in essence, uses the predicted failure rate of each individual component and uses that information to predict the failure rate for the entire unit. Since the Bellcore standard is the one most commonly used, we will discuss the features of that standard. The primary difference between the Bellcore and DoD standards is that the DoD standard requires much more rigorous calculation of thermal operating conditions.
There are three cases for the parts count method. These are a simple black box method, black box method with burn in, and the general case. The black box method assumes that the components and the device being analyzed are being burned in for less than one hour. The black box method with burn in is similar, but allows for a correction based on burn in of the device. The general case allows for burn in of both the components and the device.
Failure rates are based on the idea that if a device is made up of many individual components, the failure rate of the device can be predicted by adding the failure rates of the individual components:

$$\lambda_D = \sum_{i=1}^{n} N_i \lambda_i$$

Where: $\lambda_D$ is the device failure rate, $N_i$ is the quantity of component $i$, $\lambda_i$ is the failure rate for component $i$, and $n$ is the number of distinct component types. Failure rates are usually expressed in units of Failures in Time (FITs). Bellcore uses failures in $10^9$ hours and the DoD generally uses failures in $10^6$ hours as standard units. Both sets of units were chosen to express electronic component failures in easy to manipulate numbers (this may give you an idea as to how much more rigorous the DoD standard is). For further discussion on the units of measure used in failure analysis, see the section titled Measuring Up at the end of this paper.
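As a minimal sketch of the summation described above, the following Python adds up component failure rates for a small bill of materials and converts the result between the two unit conventions. The component names, quantities, and failure rates are made-up placeholders chosen only to show the arithmetic; they are not values from either standard, and the sketch leaves out any adjustment factors the standards may apply.

```python
# Parts count sketch: device failure rate = sum of (quantity * component failure rate).
# All component data below is hypothetical and for illustration only.


def device_failure_rate(parts):
    """Sum quantity * failure rate over all component types.

    parts: iterable of (quantity, failure_rate) pairs, with every failure
    rate expressed in the same unit (here, failures per 10^9 hours).
    """
    return sum(quantity * rate for quantity, rate in parts)


def per_1e9_to_per_1e6(rate_per_1e9_hours):
    """Convert failures per 10^9 hours to failures per 10^6 hours."""
    return rate_per_1e9_hours / 1000.0


# Hypothetical bill of materials: (quantity, failures per 10^9 hours per part)
example_parts = [
    (1, 50.0),   # microcontroller
    (20, 2.0),   # resistors
    (10, 5.0),   # capacitors
    (2, 25.0),   # connectors
]

total = device_failure_rate(example_parts)
print(f"Device failure rate: {total:.1f} failures per 10^9 hours")
print(f"Equivalent: {per_1e9_to_per_1e6(total):.3f} failures per 10^6 hours")
```

Keeping every rate in one unit before summing avoids the most common mistake in this kind of calculation, which is mixing the two conventions in a single bill of materials.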
The addition of device and component burn in information changes the failure rates we use for predicting the total number of failures. The advantages of device and component burn in are discussed in the section titled Understanding MTBF and Infant Mortality.
Improving Prediction Using Laboratory Data
The parts count method of reliability prediction is useful, but it is not as accurate as a prediction that combines laboratory data with the parts count method. Because the sample size is usually small and time to market limits the time that can be devoted to laboratory testing, laboratory results should never be used by themselves to predict field reliability.
The Bellcore method for integrating laboratory data allows individual components or the assembled device to be tested for reliability. The method then specifies a way to compute a weighted average of the parts count method and the laboratory test results. If this method can be used properly it will provide more accurate results than the parts count method alone.
The laboratory tests may use high operating temperatures to accelerate the failures and shorten the required observation time, but each unit must be tested for a minimum of 500 hours of operation and the effective test time (after allowing for temperature acceleration) must be at least 3000 hours for each unit. When testing individual components, at least 500 must be used and when testing assembled devices at least 50 must be used. Also, the test must be long enough to ensure at least 2 failures for the results to be valid.
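The qualification criteria in the paragraph above lend themselves to a simple checklist. The sketch below is one hypothetical way to encode them; the `LabTest` record and its field names are invented for illustration, and computing effective test time as actual hours multiplied by a single acceleration factor is an assumption, since the text does not give the acceleration formula.

```python
from dataclasses import dataclass

# Thresholds taken from the paragraph above.
MIN_ACTUAL_HOURS_PER_UNIT = 500
MIN_EFFECTIVE_HOURS_PER_UNIT = 3000
MIN_UNITS = {"component": 500, "device": 50}
MIN_FAILURES = 2


@dataclass
class LabTest:
    """Hypothetical summary of one laboratory reliability test."""
    unit_kind: str                # "component" or "device"
    units_tested: int
    actual_hours_per_unit: float
    acceleration_factor: float    # assumed known from the test conditions
    failures_observed: int


def lab_test_problems(test):
    """Return the reasons the test data does not qualify (empty list if it does)."""
    problems = []
    if test.units_tested < MIN_UNITS[test.unit_kind]:
        problems.append("too few units tested")
    if test.actual_hours_per_unit < MIN_ACTUAL_HOURS_PER_UNIT:
        problems.append("actual test time per unit is too short")
    # Assumption: effective time = actual time * temperature acceleration factor.
    effective_hours = test.actual_hours_per_unit * test.acceleration_factor
    if effective_hours < MIN_EFFECTIVE_HOURS_PER_UNIT:
        problems.append("effective test time per unit is too short")
    if test.failures_observed < MIN_FAILURES:
        problems.append("too few failures observed")
    return problems


print(lab_test_problems(LabTest("device", 50, 500, 6.0, 2)))
print(lab_test_problems(LabTest("component", 100, 400, 2.0, 1)))
```

Returning a list of reasons rather than a single pass/fail flag makes it easier to see which part of a test plan needs to be extended.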
Improving Prediction Using Field Tracking
Using field data to supplement the parts count method will further increase the accuracy of reliability predictions. In particular, field data can account for sudden stress failures in normal use.
The devices used in the field study must all be in normal service for at least 3000 hours, and the number of devices tested must be large enough to allow for at least two failures using the parts count method. Because it takes so long to gather field data, this method of reliability prediction is usually used to refine the reliability predictions computed using the parts count or laboratory testing methods.
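One way to read the sample size requirement above is that the parts count prediction, applied to the whole field population over the observation period, should yield at least two expected failures. The sketch below estimates the smallest population that satisfies that reading. The function name and the example failure rate are hypothetical, and the interpretation itself is an assumption rather than a quotation of the standard.

```python
import math

MIN_FIELD_HOURS = 3000        # minimum time in normal service, from the text
MIN_EXPECTED_FAILURES = 2     # minimum predicted failures, from the text


def min_field_units(predicted_fits, hours_in_service):
    """Smallest number of devices whose predicted failures reach the minimum.

    predicted_fits: parts count prediction in failures per 10^9 hours.
    hours_in_service: hours each device has spent in normal service.
    """
    if hours_in_service < MIN_FIELD_HOURS:
        raise ValueError("devices must be in service for at least 3000 hours")
    if predicted_fits <= 0:
        raise ValueError("predicted failure rate must be positive")
    expected_failures_per_unit = predicted_fits * hours_in_service / 1e9
    return math.ceil(MIN_EXPECTED_FAILURES / expected_failures_per_unit)


# Hypothetical device predicted at 2000 failures per 10^9 hours,
# observed for the minimum 3000 hours.
print(min_field_units(2000, 3000))
```

The result shows why field tracking is slow to pay off: the more reliable the device, the larger the population or the longer the observation period needed before the data can be used.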
Understanding MTBF and Infant Mortality
The result of a reliability prediction is a predicted number of failures per unit time. This can also be expressed as a Mean Time Between Failures (MTBF). The failure rate or MTBF allows one to predict how many devices will fail during a given period of time. For electronic devices, the first year is the most critical. In fact, the first year failure rate for electronic components can be as much as 4 times the total failure rate averaged over many years.
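The relationship between failure rate, MTBF, and expected failures can be sketched in a few lines of Python. This assumes a constant failure rate, so that MTBF is simply the reciprocal of the failure rate; the fleet size and failure rate used here are invented for illustration.

```python
HOURS_PER_YEAR = 8760


def mtbf_hours_from_fits(fits):
    """Convert a failure rate in failures per 10^9 hours to MTBF in hours."""
    if fits <= 0:
        raise ValueError("failure rate must be positive")
    return 1e9 / fits


def expected_failures(units, hours, mtbf_hours):
    """Expected number of failures for a population over a period.

    Assumes a constant failure rate, so failures = unit-hours / MTBF.
    """
    return units * hours / mtbf_hours


# Hypothetical example: 1000 devices, each predicted at 2000 failures per 10^9 hours.
mtbf = mtbf_hours_from_fits(2000)
failures = expected_failures(1000, HOURS_PER_YEAR, mtbf)
print(f"MTBF: {mtbf:,.0f} hours")
print(f"Expected failures in one year of continuous use: {failures:.2f}")
```

Note that an MTBF of several hundred thousand hours does not mean an individual device lasts that long; it describes how often failures occur across a population.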
Manufacturers often use burn in to help ensure that these early failures occur before the product leaves the factory. This concept is important enough to restate for clarity. Burning in a device does not improve its reliability. Burning in a device helps ensure that the device will fail in the factory where it is easy to fix and the customer's perception of the product is not affected.
When computing failure rates a first year multiplier can show the effects on a product due to burn in. If a device is not burned in, the first year multiplier will be 4, implying that the failure rate for the device will be 4 times the steady state failure rate during the first year of use. All prediction methods make allowances for adjusting the first year multiplier based on the length and temperature of the burn in. The longer and more stressful the burn in, the smaller the first year multiplier.
If burn in is used to reduce first year field failures, it pays to remember that every failure that does not occur in the field will occur in burn in. This will have the effect of increasing production costs (more units that need to be repaired or scrapped) and reducing warranty costs. Since the cost of repairing or scrapping a unit that is still on the factory floor is usually less than the cost of repairing or scrapping a unit after it has been shipped to the customer, burn in almost always pays off in the end.
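The trade-off described above can be illustrated with a rough cost comparison. In the sketch below, the first year multiplier of 4 without burn in comes from the text, while the reduced multiplier, the unit count, the failure rate, and the repair costs are all hypothetical. The sketch also ignores the cost of running the burn in itself, which a real analysis would need to include.

```python
HOURS_PER_YEAR = 8760
NO_BURN_IN_MULTIPLIER = 4.0   # first year multiplier without burn in, from the text


def first_year_failures(units, steady_state_fits, first_year_multiplier):
    """Expected first year field failures for units in continuous use.

    steady_state_fits is in failures per 10^9 hours.
    """
    return units * steady_state_fits * 1e-9 * HOURS_PER_YEAR * first_year_multiplier


def burn_in_savings(units, steady_state_fits, multiplier_with_burn_in,
                    factory_repair_cost, field_repair_cost):
    """Estimated repair cost saved by moving early failures into the factory."""
    without = first_year_failures(units, steady_state_fits, NO_BURN_IN_MULTIPLIER)
    with_burn_in = first_year_failures(units, steady_state_fits, multiplier_with_burn_in)
    # Every failure that no longer occurs in the field occurs in burn in instead.
    caught_in_factory = without - with_burn_in
    cost_without = without * field_repair_cost
    cost_with = with_burn_in * field_repair_cost + caught_in_factory * factory_repair_cost
    return cost_without - cost_with


# Hypothetical: 10,000 units, 2000 FITs steady state, burn in halves the multiplier,
# $50 to repair in the factory versus $300 in the field.
print(f"Estimated savings: ${burn_in_savings(10_000, 2000, 2.0, 50.0, 300.0):,.0f}")
```

The savings are positive whenever a field repair costs more than a factory repair, which is the condition the paragraph above relies on.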
Reliability Testing in Today's Engineering Environment
Trends in Engineering Labor
The profession of engineering has been evolving with the times. Engineers seldom work for the same company from college to retirement. As companies downsize and products come and go, engineers move from one job to another. Every time an engineer moves to a new job, a project is left behind that must be completed or maintained by someone who was not involved in the original design. The resulting gap in knowledge can contribute to products that decrease in reliability with each new change. The only way a company can address this issue is to ensure that hard won knowledge is kept within the company and within easy reach of new engineers.
Managing Reliability in a Volatile Labor Market
All of the documentation that was described in this paper as important to the production of a high quality design helps the next generation of engineers in a company understand existing products and their design. Having the benefit of a permanent documentation trail that is detailed and complete will allow companies to survive even though talented engineers tend to be a transient group. In today's evolving work place, companies cannot afford to put documentation off until the product is released. Time and time again we see examples of companies that have products for which the documentation package consists of schematics, Gerber files, and software listings (if that much). These companies are often surprised to learn that projects on which they spent many dollars and man hours have to be designed again because the design documentation was not captured in the rush to market.
Documentation is the Key
Managing reliability means managing documentation and the validation of that documentation. It is not enough to simply weigh the documents either. All documentation must be thoroughly reviewed by those who have a stake in the resulting product.
Companies that establish sound documentation and validation procedures have consistently better performance to schedule and budget. The resulting products are more easily maintained and more reliable in the consumer's application. Sound documentation practices also give companies more flexibility to outsource. If the documentation for a project is complete and validated, then it is easier to make the choice to use a services company or independent contractor because all of the valuable intellectual property is left in the company in a usable form.
Conclusions
Reliability is a process, not an end in itself. One does not set about to create reliable products and one day say "I have finished making this product reliable." Reliability is best achieved by documenting and validating at every step in the development cycle.
This paper diagrammed one method of breaking down the steps in a project and validating through documentation and review. There are many other procedures and groups of documents that can achieve the same purpose. The key to developing reliable products is to follow these three steps through every phase of development and to understand that the road is not always straight. The reason for validation is that we all make mistakes. When validation uncovers a problem, we must be willing to go back to the drawing board and try again.
Research - Document - Validate