Reliability Testing
Verifying Reliability for Embedded Systems
By: Mike Willey
Vice President
Abstract
Many companies guarantee their products,
but few work the process to determine their real risk and estimated
failing point. Some general procedures companies can apply to
their own reliability testing analysis will be discussed, allowing
companies to check their own risk analysis and improve their
testing. Design verification testing (DVT) uses the engineering
specification as its road map. The DVT establishes conformity
of the completed design and the manufactured product with the
original engineering specification. The testing covers the functions
of the product and its interaction with other devices. Also tested
is the ability of the product to work over the specified range
of environmental conditions. And last, the product must conform
to the appropriate safety standards and regulatory agencies.
Lifetime testing is used to verify whether the product meets the product
life design objectives. Automated round the clock operation of
the device permits a manufacturer to simulate accelerated usage
with the goal of statistically estimating wear and tear over
the lifetime of the product.
Scope
This paper will cover four elements of
reliability testing. Design verification testing (DVT) ensures
that the design of the product meets all requirements. Reliability
analysis determines how physical failures (based on fatigue and
wear) occur over time. System burn-in testing helps accelerate the
occurrence of early life failures. Reliability tracking and analysis
helps validate and refine the other elements of reliability testing.
These elements comprise the continuum of
reliability analysis. We will describe each element and how they
interrelate. We will also cover practical methods of implementing
a reliability testing program for embedded systems development
and production.
Embedded systems are typically comprised
of mechanical, electronic, electromechanical, and firmware components. Although
reliability testing relates to all of these areas, this paper will pay special
attention to the electronic and electromechanical aspects of
reliability testing, and discuss the other areas only briefly.
Introduction
When we talk about reliability we are discussing
the most fundamental measure of quality. Every product is designed
and built to perform some function. As long as it continues to
perform that function the product is deemed to be reliable. Reliability
always influences the market reception and perception of a product.
For embedded systems reliability may also be a matter of life
and death.
It has long been established that reliability
can be computed mathematically for certain electronic devices.
Often misunderstood, however, is the reason that we can compute
reliability based only on the number of components, their complexity,
and the operating environment. The United States Military and
the old Bell System recognized that standardized processes led to
predictable reliability on purchased systems. Both have established
rigorous standards for procurement and performance tracking on
systems and devices. The results of this tracking are accumulated
in the Military Handbook MIL-HDBK-217 and Bellcore TR-NWT-000332.
These documents allow anyone building electronics products to
learn from their experience.
The common thread behind the Military and
Bellcore reliability analysis systems is the documentation of repeatable,
predictable processes in manufacturing. This is also the underlying
tenet of ISO 9000. By documenting the processes used to produce
a product and following the process, products will be produced
with a consistent level of quality. The reliability of products
produced by these processes can then be recorded and analyzed
over time and used to predict the reliability of future products.
Any process to produce any product can
be documented and serve as the basis for reliability prediction.
Some of the processes that are involved in embedded systems include:
- system design
- electronics design
- PCB design
- mechanical design
- PCB fabrication
- PCB assembly
- mechanical fabrication
- mechanical assembly
- software design
This paper will discuss the computation
of reliability for electronic products, setting up processes
that can ensure consistent reliability, and tracking and analyzing
reliability data.
When we think of high quality design we
might call to mind the DC-3, an aircraft designed by Douglas in
1935. There are still DC-3s in active commercial service today.
In fact, it is not uncommon for aircraft to remain in service
for many decades. We might well ask, what makes aircraft so reliable?
A substantial part of that reliability is based on solid design,
accurate failure analysis, preventive maintenance, and continuing
market viability. These elements begin before the product announcement
and continue far beyond the last actual sale.
Typical Product Life Cycle
It will be helpful to understand a typical
product life cycle, to determine when and how reliability issues
must be addressed. Figure 1 shows a very high level view of Product
Life Cycle for any new product, be it an airplane or an embedded
system. The job functions that typically participate in the development
and sale of a product are shown with the activities generally
associated with those functions above. In many companies the
lines between these job functions may be blurred, but the activities
and responsibilities are universal.
The primary function of marketing is to
determine or generate a market need for a product. The marketing
job function is primarily to provide the product concept and
to help validate the commercial specification.
The role of the sales job function is to
take the final product to the market place identified by or created
by marketing. Sales teams are generally in very close contact
with the client base and can serve to validate the product concept
and assist in the creation and validation of the commercial specification.
Engineering's role in product development
is to translate the product concept into an actual product. Engineering
can have a valuable role in developing the commercial specification
and validating the technical assumptions made by the marketing
and sales functions.
The role of product management is to represent
the customer's interest in the product and to be a champion
for the product within the company. Product management provides
important information concerning the validation of the design
and feedback of information from the product users and potential
users.
The role of production is, obviously, to
build and ship the product. Production also plays a valuable
role in verifying the ability to economically produce the product
and determining how to improve the quality of the product.
The diagram below shows the activities
as flowing from one to the next without any backtracking. If
this were used as an actual implementation model, the resulting
product would be unlikely to meet the customer's needs, unlikely
to be reliable, and probably short lived in the market
place.

Figure 1: Simplified Product Life Cycle
The remainder of this paper will discuss
a more detailed view of the activities shown in Figure 1 and
how these activities can be structured to ensure high reliability
products.
Design Verification Testing
Design verification testing is the process
of ensuring that a product matches a commercial need, can be
produced at a reasonable cost, is reliable in operation and is
consistent with the strategic goals of the company. There is
a lot in this statement that is unsaid. Specifically, design
verification testing is not the process of ensuring that
a device is the most reliable, the least expensive,
or a perfect match for the consumer's needs. These three areas
must be tempered with the strategic goals of the company. For
example, there is a place in the market for the most reliable
automobile, but the most reliable product may not be the least
expensive to produce.
The foundation of reliable and successful
products is the strategic plan for the company. If a company
or division of a larger company does not have a mission and a
plan to accomplish that mission it is very unlikely that anything
produced by the company will be a success. Developing missions
and business plans is beyond the scope of this paper. It should
be recognized, however, that before beginning any project everyone
involved must be familiar with the company mission and their
part in the plan to accomplish that mission.
The process of Design Verification Testing
comprises the activities that take place in the boxes labeled
"Product Concept", "Commercial Specification",
and "Detailed Design" in Figure 1. A detailed look
at these three activities and how they interrelate is shown in
Figure 2 below.
Figure 2: Design Verification Testing Process
Developing a Reliable Specification
The making of a high quality product begins
with a high quality design. A solid design does not begin with
an engineer, it begins with the identification of a significant
market need and a concept of a product to meet that need.
A product concept is usually developed
by the marketing job function within a company. The product
concept usually takes the form of a white paper or other document
describing the market need, market size, current solutions,
and competing products. We will refer to this as the product
requirements document. Prior to moving on to the next stage of
development, the Commercial Specification, the product requirements
document must be validated.
There are a number of ways to validate
the product requirements for a product. Some of these include
focus groups, surveys, and interviews with potential customers.
The important issue is that the product requirements must be validated
and that validation must not amount to the fox guarding the hen
house. The group responsible for developing the requirements
should include other objective parties in the evaluation process.
Focus groups conducted by a consultant or different department
can help bring some perspective to a requirements document and
will find areas where the definition is incomplete. The negative
consequences of skipping validation of the product requirement
can be catastrophic. To understand how costly the omission of
this step in product development is, consider the following example.
The XYZ Company marketing department has
a great new idea. They are going to design a standalone controller
card for the chrome plated muffler bearing (CPMB) industry. The
marketing guys thought they knew everything about CPMBs and put
together a commercial specification and gave it to the engineering
department. The engineering department spent 30 man months getting
the product designed and the prototypes built. When they went
to install it on an actual muffler bearing they discovered that
the latest technology in muffler bearings was to control them
with an infrared serial port rather than the ± 30 volt
parallel ports of the older technology. When trying to determine
how this happened the marketing department said "We assumed
that the engineers were on top of the latest technology, so we
didn't bother putting it in the specification." The
engineering department said "We knew about the new technology,
but we thought that the old technology performed better. If the
guys from marketing wanted us to use the other technology they
should have told us."
The bottom line on this story is that 30
man months of engineering is now substantially down the drain.
If this discrepancy had been caught in the early phases of product
development the cost to change direction for the engineers would
have been minimal.
Once a commercial specification is ready,
it should be reviewed by engineering and marketing together.
The purpose of this review is to determine feasibility. In many
companies, marketing will involve engineering from the beginning
to help prevent too much blue sky thinking. Although this is
certainly helpful, it does not mean that a review of the commercial
specification can be skipped.
Pausing a moment at the completion of each
major phase of a product development to ensure that phase is
as complete and accurate as possible is always worth the investment.
As you can see from the example above, skipping the verification
stage can allow major problems with a product to slip through
the entire design and development cycle, at a cost many times
greater than the cost of a design review.
Developing a Reliable Design
When engineering has a commercial specification
that has been tested by marketing and identified as feasible
by engineering, it is time to begin the design phase. There are
two areas of design that we will address in this paper, software
and electronic hardware design. One can describe an embedded
system as a device that performs a dedicated function through
the seamless integration of software and electronics. Since most
engineers specialize in either hardware or software development,
this implies that a large amount of communication will be needed to
make the integration seamless.
What does it take to develop a reliable
design?
- Clear, unambiguous commercial specifications
- Complete, well maintained, design documentation
- Testing the documentation against the
commercial specifications
- Implementing the design from the design
documentation
- Testing the implementation against the
design documentation
The most common area where problems creep
into the design process is the documentation of the design. Supervisors
want projects completed yesterday, and it is often hard for them
to understand that writing the documentation is progress towards
the goal. Engineers don't like to write; we like to get
our hands dirty and skip to the end result as soon as possible.
Writing dry technical documents is not why we got into engineering;
it's not fun. So engineers and supervisors agree:
"We'll write the documentation
after we get the project done. Besides, if we write it now, we'll
just have to change it when the project is over anyway."
"We'll just do a draft of the
documentation. I know pretty well what I intend to do, so I just
need to provide an overview. Anything you don't understand
or you think is missing, just ask."
The sad truth is that both are making their
lives more difficult. Recalling the XYZ company and their chrome
plated muffler bearings, a little documentation up front would
have saved them from disastrous schedule slips and over expenditures.
The time saved by putting off design documentation is an illusion.
When the documentation is put off till the end of the project,
design requirements will be missed. Even though the project may
be delivered on time, there will be a seemingly endless stream
of engineering changes and bugs. In the end, even if the product
gets to the market a little sooner, reliability problems may
be able to accomplish what the competition cannot. This is not
much fun for the engineer, staying at the office all night long
to fix a problem that might never have shown up if the design
documentation had been complete.
"It's really a very simple design.
I don't need to write a theory of operation, because I can
see it all in my head."
If it is that simple, it won't take
long to write the theory of operation. Documentation doesn't
have to be heavy, just complete. Below are listed some of the
documents that might be included in a software or hardware documentation
project. Although the lists may vary some and additional documentation
may be needed, these lists provide a good start.
Short cutting the documentation or the
verification of the documentation is the same as skipping the
design. It is important for the engineering group to sit down
at the beginning of the project and identify which set of documents
will be needed to provide a complete design description, then create
and verify those documents. The results of this process can be
seen in the DC-3, still flying after over 60 years of continuous
service.
Testing a Design
Design reviews are the test processes for
a design. Many engineers find design reviews to be intimidating.
This is because the purpose of the design review and the roles
of the participants are often misunderstood.
The participants in the design review should
be the engineers who created the design and design documentation,
the marketing group who created the commercial specification,
representatives from manufacturing (if possible), and the project
manager. The purpose of the design review is to uncover possible
errors or omissions in the design before they are implemented.
The participants are there to ensure that as many possible points
of view are used to verify the design. Too often design reviews
turn into bashing sessions when the participants feel that a
flaw discovered in the review process reflects poorly on the
engineers.
The most effective design reviews focus
on asking questions. Many engineers feel that if the design review
concludes with design questions which were not considered prior to
the design review there is a problem. The truth is that this
is the purpose of a design review. The important result of a
design review is the action item list; once that has been fully
addressed the design can move on to implementation.
Here are some hints for conducting a successful
design review:
- Provide copies of all materials to the
participants with enough time for them to review them.
- Provide an outline or copies of presentation
slides with the other materials.
- When preparing your outline make it detailed
enough that you expect each item to require between 1 and 5 minutes
of discussion.
- When answering questions, don't rush
off to get something off of your desk if the material you need
is not in the room. Make it an action item for later.
- Keep an action item list.
- Make assignments for resolution of each
action item.
- Don't make commitments you can't
keep when scheduling action items. It's O.K. to defer the
schedule commitment for action items; just make sure that a commitment
is made as soon as possible after the review.
- Have frequent breaks. Design reviews can
be very intense and it's easy to lose your concentration.
- Follow the agenda. Defer discussions that
are covered later on the agenda, and move lengthy discussions of
items not planned for to the end of the meeting.
Believing Your DVT Results
There are three possible results of a successful
design review:
- A list of action items is generated. When
completed, the implementation of the design can proceed.
- A list of action items is generated and
some significant design issues are identified. Another design
review will be required before implementation can proceed.
- Virtually insurmountable implementation
issues are encountered. The design must be sent back to generate
a new commercial specification before a design can be completed.
A fourth possible result is that no action
items were generated. If this happens, one should be very skeptical
of the results of the design review. No design can be made perfect;
if no action items were identified, the odds are that the review
was incomplete or the participants were not properly prepared.
Reliability Analysis
After the design and test procedures have
been developed and the implementation of the design has been
substantially completed there is some analysis that can be done
to predict failures due to stress and normal wear and tear. This
analysis is based on statistical methods developed by AT&T
Bell Labs (now Bellcore) and the US military. These methods can
be used to predict failures in electronic devices and are discussed
below.
Bellcore vs. Milstd
There are two widely used standards for
predicting reliability in electronic devices: the Bellcore TR-NWT-000332,
Reliability Prediction Procedure for Electronic Equipment,
and the US DoD MIL-HDBK-217, Reliability Prediction of Electronic
Equipment. Both methods provide good prediction results for
the environment they were intended for.
MIL-HDBK-217 is written around military
equipment standards. This standard makes allowances for devices
that will be used at all extremes of temperature, vibration,
shock, etc. This procedure can predict the reliability of electronic
equipment in almost any environment. It also requires a huge
number of calculations and very specific information on the operating
conditions of devices, such as operating junction temperatures
in all semiconductor devices. The problem with this method of
reliability prediction is that if your product is not operating
in an extreme environment, it can take longer to predict the
reliability than it takes to design the product.
TR-NWT-000332 takes a subset of the analysis
for MIL-HDBK-217 and makes assumptions about the operating environment
which are more applicable to commercial and automotive products.
If the design is not for airborne, shipboard, or military field
electronics, the Bellcore prediction will match the DoD predictions
very closely and is usually preferred because it is significantly
less costly to implement.
Types of Failures
There are some things reliability predictions
can deal with and some they cannot. It is very important to
understand the distinctions, especially when the reliability prediction
will be used to compute warranty cost. All electronic components
degrade in operating characteristics over their useful life.
Generally failures caused by this degradation are related to
operating temperature and other stresses applied to the device.
Failures due to time, temperature and stress can be predicted
readily. The other type of failure is sudden stress. This type
of failure is caused by lightning strikes, dropped components,
improper handling, etc. Mechanical and electronic countermeasures
can be taken to reduce this type of failure, but the frequency
of these types of stresses is very unpredictable and may vary
from customer to customer. This makes the prediction of sudden
stress failure virtually impossible, except in retrospect.
Parts Count Reliability Prediction
The parts count method of reliability prediction
is supported by both the DoD and Bellcore standards. This method,
in essence, uses the predicted failure rate of each individual
component and uses that information to predict the failure rate
for the entire unit. Since the Bellcore standard is the one most
commonly used, we will discuss the features of that standard.
The primary difference between the Bellcore and DoD standards
is that the DoD standard requires much more rigorous calculation
of thermal operating conditions.
There are three cases for the parts count
method. These are a simple black box method, black box method
with burn in, and the general case. The black box method assumes
that the components and the device being analyzed are being burned
in for less than one hour. The black box method with burn in
is similar, but allows for a correction based on burn in of the
device. The general case allows for burn in of both the components
and the device.
Failure rates are based on the idea that
if a device is made up of many individual components the failure
of the device can be predicted by adding the failure rates of
the individual components:
$$\lambda_D = \sum_{i} N_i \lambda_i$$
Where: $\lambda_D$ is the device
failure rate, $N_i$ is the quantity of component
$i$, and $\lambda_i$ is the failure rate for component
$i$. Failure rates are usually expressed in units of Failures
in Time (FITs). Bellcore uses failures in $10^9$
hours and the DoD generally uses failures in $10^6$ hours
as standard units. Both sets of units were chosen to express
electronic component failures in easy to manipulate numbers (this
may give you an idea as to how much more rigorous the DoD standard
is). For further discussion on the units of measure used in failure
analysis, see the section titled Measuring Up at the end
of this paper.
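The parts count sum described above can be sketched in a few lines of Python. This is only an illustration: the `Part` class, the function names, and the bill of materials are hypothetical, and the failure rates are placeholders rather than values taken from either handbook. The sketch also leaves out the environmental, quality, and burn in corrections that the standards apply.

```python
from dataclasses import dataclass


@dataclass
class Part:
    name: str
    quantity: int              # N_i, how many of this component are used
    failure_rate_fits: float   # lambda_i, in failures per 10^9 hours


def device_failure_rate_fits(parts):
    """Parts count sum: add quantity times failure rate over all components."""
    return sum(p.quantity * p.failure_rate_fits for p in parts)


def fits_to_failures_per_million_hours(fits):
    """Convert failures per 10^9 hours to failures per 10^6 hours."""
    return fits / 1000.0


# Hypothetical bill of materials; the rates are placeholders for illustration.
bom = [
    Part("resistor", 40, 1.0),
    Part("ceramic capacitor", 25, 2.0),
    Part("microcontroller", 1, 50.0),
]

total_fits = device_failure_rate_fits(bom)
print(total_fits)                                      # 140.0
print(fits_to_failures_per_million_hours(total_fits))
```

The second function shows the only difference between the two sets of units mentioned above: a factor of 1000.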
The addition of device and component burn
in information changes the failure rates we use for predicting
the total number of failures. The advantages of device and component
burn in are discussed in the section titled Understanding
MTBF and Infant Mortality.
Improving Prediction Using Laboratory
Data
The parts count method of reliability prediction
is useful, but is not as accurate as a prediction that combines laboratory
data with the parts count method. Since the laboratory sample size
is usually small and time to market limits the time that can
be devoted to laboratory testing, the laboratory results
should never be used by themselves to predict field reliability.
The Bellcore method for integrating laboratory
data allows individual components or the assembled device to
be tested for reliability. The method then specifies a way to
compute a weighted average of the parts count method and the
laboratory test results. If this method can be used properly
it will provide more accurate results than the parts count method
alone.
The laboratory tests may use high operating
temperatures to accelerate the failures and shorten the required
observation time, but each unit must be tested for a minimum
of 500 hours of operation and the effective test time (after
allowing for temperature acceleration) must be at least 3000
hours for each unit. When testing individual components, at least
500 must be used and when testing assembled devices at least
50 must be used. Also, the test must be long enough to ensure
at least 2 failures for the results to be valid.
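The minimums quoted above lend themselves to a simple checklist. The sketch below is one way to encode them. The function and parameter names are hypothetical, and it assumes that effective test time is the actual test time multiplied by a temperature acceleration factor supplied by the caller. How that factor is derived is left to the prediction procedure and is not shown here.

```python
def lab_test_is_valid(units_tested, hours_per_unit, acceleration_factor,
                      failures, testing_components):
    """Return True when a laboratory test meets the minimums quoted above."""
    # At least 500 individual components, or at least 50 assembled devices.
    min_units = 500 if testing_components else 50
    # Assumption: effective time = actual time x temperature acceleration.
    effective_hours = hours_per_unit * acceleration_factor
    return (
        units_tested >= min_units
        and hours_per_unit >= 500       # minimum actual operating time
        and effective_hours >= 3000     # minimum effective test time
        and failures >= 2               # enough failures for a valid result
    )


# 50 assembled devices, 600 hours each, hypothetical acceleration factor of 5
print(lab_test_is_valid(50, 600, 5.0, 2, testing_components=False))
```

A test plan that fails any one of the four conditions should not be combined with the parts count result.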
Improving Prediction Using Field Tracking
Using field data to supplement the parts
count method will further increase the accuracy of reliability
predictions. In particular, field data can account for
sudden stress failures in normal use.
The devices used in the field study must
all be in normal service for at least 3000 hours, and the number
of devices tested must be large enough to allow for at least
two failures using the parts count method. Because it takes so
long to gather field data, this method of reliability
prediction is usually used to refine the reliability predictions
computed using the parts count or laboratory testing methods.
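To get a feel for how large a field study has to be, the two-failure requirement can be turned into a population size. The sketch below reads the requirement as "the parts count prediction must expect at least two failures across the tracked population"; that reading, the function name, and the example failure rates are assumptions made for illustration.

```python
import math


def min_field_units(predicted_fits, hours_in_service=3000):
    """Smallest population whose predicted failure count reaches two."""
    if predicted_fits <= 0:
        raise ValueError("predicted failure rate must be positive")
    # Expected failures per unit over the study period (FITs are per 10^9 hours).
    failures_per_unit = predicted_fits * 1e-9 * hours_in_service
    return math.ceil(2 / failures_per_unit)


print(min_field_units(5000))  # 134
print(min_field_units(140))   # 4762
```

The second line shows why field tracking is slow for reliable products: the lower the predicted failure rate, the more units (or hours) are needed before the data means anything.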
Understanding MTBF and Infant Mortality
The result of a reliability prediction
is a predicted number of failures per unit time. This can also be
expressed as its reciprocal, the Mean Time Between Failures (MTBF). The failure rate
or MTBF allows one to predict how many devices will fail during
a given period of time. For electronic devices, the first year
is the most critical. In fact, the first year failure rate for
electronic components can be as much as 4 times the total failure
rate averaged over many years.
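As a rough illustration of how a failure rate turns into an MTBF and a failure count, the sketch below assumes a constant (steady state) failure rate expressed in FITs. The function names and the example numbers are hypothetical.

```python
def mtbf_hours(failure_rate_fits):
    """MTBF in hours for a constant failure rate given in FITs."""
    return 1e9 / failure_rate_fits


def expected_failures(failure_rate_fits, units, hours):
    """Expected number of failures in a population over a period of time."""
    return failure_rate_fits * 1e-9 * units * hours


print(mtbf_hours(5000))                     # 200000.0
print(expected_failures(5000, 1000, 8760))  # roughly 44 failures in one year
```

Note that an MTBF of 200,000 hours does not mean each unit lasts 200,000 hours. It means that a population of 1000 such units running around the clock should be expected to produce about 44 failures a year at the steady state rate.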
Manufacturers often use burn in to help
ensure that these early failures occur before the product leaves
the factory. This concept is important enough to restate for
clarity. Burning in a device does not improve its reliability.
Burning in a device helps ensure that the device will fail
in the factory where it is easy to fix and the customer's
perception of the product is not affected.
When computing failure rates a first year
multiplier can show the effects on a product due to burn in.
If a device is not burned in, the first year multiplier will
be 4, implying that the failure rate for the device will be 4
times the steady state failure rate during the first year of
use. All prediction methods make allowances for adjusting the
first year multiplier based on the length and temperature of
the burn in. The longer and more stressful the burn in, the smaller
the first year multiplier.
If burn in is used to reduce first year
field failures, it pays to remember that every failure that does
not occur in the field will occur in burn in. This will have
the effect of increasing production costs (more units that need
to be repaired or scrapped) and reducing warranty costs. Since
the cost of repairing or scrapping a unit that is still on the
factory floor is usually less than the cost of repairing or scrapping
a unit after it has been shipped to the customer, burn in almost
always pays off in the end.
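The trade-off between burn in and warranty cost can be sketched with the first year multiplier. In the example below, the multiplier of 4 for a device with no burn in comes from the discussion above; the reduced multiplier of 1.5, the repair costs, and the function names are hypothetical, and the cost of running the burn in itself is deliberately left out.

```python
HOURS_PER_YEAR = 8760


def first_year_failures(steady_state_fits, units, multiplier):
    """Expected first year field failures for a shipped population."""
    return steady_state_fits * 1e-9 * HOURS_PER_YEAR * units * multiplier


def burn_in_savings(steady_state_fits, units, multiplier_after_burn_in,
                    factory_cost, field_cost, multiplier_without_burn_in=4.0):
    """Saving when failures move from the field to the factory floor."""
    without = first_year_failures(steady_state_fits, units,
                                  multiplier_without_burn_in)
    with_burn_in = first_year_failures(steady_state_fits, units,
                                       multiplier_after_burn_in)
    moved = without - with_burn_in  # failures now caught during burn in
    return moved * (field_cost - factory_cost)


# 1000 units at 5000 FITs; hypothetical costs of 50 (factory) and 300 (field)
print(first_year_failures(5000, 1000, 4.0))        # about 175 failures
print(first_year_failures(5000, 1000, 1.5))        # about 66 failures
print(burn_in_savings(5000, 1000, 1.5, 50, 300))   # about 27,000
```

The saving is positive whenever a field repair costs more than a factory repair, which is the usual case; whether it also covers the cost of the burn in equipment and time has to be checked separately.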
Reliability Testing in Today's
Engineering Environment
Trends in Engineering Labor
The profession of engineering has been
evolving with the times. Engineers seldom work for the same company
from college to retirement. As companies downsize and products
come and go, engineers move from one job to another. Every time
an engineer moves to a new job a project is left behind that
must be completed or maintained by someone who was not involved
in the original design. The resulting gap in knowledge can contribute
to products that decrease in reliability with each new change.
The only way a company can address this issue is to ensure that
hard won knowledge is kept within the company and within easy
reach of new engineers.
Managing Reliability in a Volatile Labor
Market
All of the documentation that was described
in this paper as important to the production of a high quality
design helps the next generation of engineers in a company understand
existing products and their design. Having the benefit of a permanent
documentation trail that is detailed and complete will allow
companies to survive even though talented engineers tend to be
a transient group. In today's evolving work place companies
cannot afford to put documentation off until the product is
released. Time and time again we see examples of companies that
have products for which the documentation package consists of
schematics, Gerber files, and software listings (if that much).
These companies are often surprised to learn that projects on
which they spent many dollars and man hours have to be designed
again because the design documentation was not captured in the
rush to market.
Documentation is the Key
Managing reliability means managing documentation
and the validation of that documentation. It is not enough to
simply weigh the documents either. All documentation must be
thoroughly reviewed by those who have a stake in the resulting
product.
Companies that establish sound documentation
and validation procedures have consistently better performance
to schedule and budget. Resulting products are more easily maintained
and more reliable in the consumer's application. Companies
are also given more flexibility to outsource if there are sound
documentation practices. If the documentation for a project is
complete and validated, then it is easier to make the choice
to use a services company or independent contractor because all
of the valuable intellectual property is left in the company
in a useable form.
Conclusions
Reliability is a process, not an end in
itself. One does not set about to create reliable products and
one day say "I have finished making this product reliable."
Reliability is best achieved by documenting and validating at
every step in the development cycle.
This paper diagrammed one method of breaking
down the steps in a project and validating through documentation
and review. There are many other procedures and groups of documents
that can achieve the same purpose. The key to developing reliable
products is to follow these three steps through every
phase of development and understand that the road is not always
straight. The reason for validation is that we all make mistakes.
When validation uncovers a problem we must be willing to go back
to the drawing board and try again.
Research - Document - Validate