A Test/ Repair loop can become a bottleneck for the whole Value Stream when the test First Pass Yield is lower than planned.

A drop in test FPY is normally caused by problems upstream in the Value Stream. This Test/ Repair loop is often absent in Value Stream Maps in spite of the potential to become the bottleneck for the total process. 

You can download this example file:  TestRepair.xlsm

This Excel file simulates a Test/ Repair loop such as:

Ideal Situation

The ideal case is when FPY = 100%. In this case no repair is necessary and the required test capacity is just 100, which corresponds to the Value Stream throughput.

You must close all open Excel files before you open this one and you should enable Macros.

To simulate 1 hour operation just press F9. Keep pressing to see the evolution.

To reset (put all Work-In-Process to zero) press Ctrl and r keys.

You can only write in the yellow cells.

You need to work out the minimum required Test and Repair capacities in order to deliver 100 units each hour (balanced loop).

Test FPY Drop

Let's assume 1% of the units are failing test: Test FPY = 99%:

The immediate result is that the output of this loop drops to 99: It has become the bottleneck of the Value Stream.

Since we are not repairing the faulty units they are accumulating in front of the Repair station. 

How much repair capacity do we need in this case?

All repaired items need to be retested so they will add to the new items entering the loop. What test capacity do we need then?

We can calculate this without the need for simulation:

In this case when we test 100 units only 99 come out OK (we have multiplied 100 by FPY which is 0.99). 

If we need an output of 100 items OK how many do we need to test? 

Answer: We divide by the FPY:  100/ 0.99 = 101.1 units per hour

And the repair capacity will be: 101.1 - 100 = 1.1 units per hour

 Repair Time Variation

Test time typically doesn't have much variation, specially if it is automatic. Repair time, on the other hand, tends to have high variability. Some automatic testers may provide specific repair instructions but very often repair requires an investigation which might take a long time. 

Let us assume that in our example an average repair capacity of 1.1 items/ hour has a standard deviation of 0.5:

The frequency distribution of this repair capacity will be: 

And the result will be:

We see that occasionally output will drop and a queue will develop waiting for repair. This means we will need additional average repair capacity to compensate for this variation:

Now the "waiting for repair" queue has moved to "waiting for test" but output is maintained in 100. 


The Test/ Repair loops should normally be included in the Value Stream Map because they are critical steps in the process which could become the bottleneck of the Value Stream. See an example in:


A drop in Test FPY should trigger immediate actions to analyse and correct the source of the defects.

In the mean time we will need additional Test and Repair capacity to maintain the VSM throughput.

The Repair operation may require high skills in short supply so enough training should be provided to insure it doesn't become the bottleneck for the total process. 

By developing a robust Repair process we can reduce the average repair time and its standard deviation reducing this risk. 

The Standard Work Combination of machine and operator times tries to meet the throughput requirements of the line with the available machines and operators.

You can download this example file:     Standard-Work-Combination-Table.xlsx

In this example 4 autonomous machines are handled by one operator who is performing the following operations on each machine in sequence:

  1. Unload finished part from the machine
  2. Load the new part
  3. Start the machine
  4. Take the removed part to the next machine

The following table contains the operator times to perform these operations as well as the machine times involved:

The manual times include the loading and unloading of the parts in the machine therefore these times are included in the machine times. Walk times involve the operator but not the machines.

Each start time is calculated from the previous start time plus manual and walk times.

The Manual Gantt chart above is built from the start times and manual. You can see that the delay from a manual time and the next is the walk time.

The quality control operation is performed by the operator and no machine is involved. 

The Machine Gantt Chart is built from start times and machine times. 

If we want to maximize the throughput of the line with one single operator the cycle time would be 60. This is the time between two finished parts leaving the line. If we are talking about seconds a cycle of 60 seconds corresponds to a throughput of one unit per minute. 

This cycle was calculated with operator times but did not take into account machine times so we need to insure that no machine times are above the cycle time. 

Lead Time Calculation

How long will it take for one item from start to finish? This is what we call lead time:

You can see it takes 5 cycles for one item to go through the total process so lead time will be 300. 

Increase Throughput

If a cycle of 60 is what the market demands we can handle this line with one single operator. We realize that the machines are kept idle most of the time because it does not make sense to produce more than market demand. 

If we need to increase throughput (decrease cycle) we can add another operator to do this:

If we assign work in this way the cycle would be 25 for operator 1 and 35 for operator 2. This means we can have an overall cycle of 35 and all machine times are below this value. 

The corresponding Gantt charts are:

So with 2 operators we can almost double throughput: cycle is reduced from 60 to 35. 


Standard work combination charts allows planning with machine and operator times in order to achieve the required throughput with the existing equipment and using the minimum human resources. 

Statistical Process Control is used to detect when a significant change has taken place in a process. 

All processes have variation as observed in any of their metrics but not all variation is significant from the statistical point of view. 

When we want to control a process by adjusting some process parameter we can make two mistakes:

  • Over-react by adjusting when we shouldn't
  • Under-react by failing to adjust when we should

Control over-reaction

A case of over-reaction is illustrated by the following example:

Someone is shooting at a target and based on the deviation of the impacts he adjusts the gun site after each shot.

The end result will be an increase of the dispersion of the impacts, therefore the adjustments will make the process worse. 

The correct way is, of course, to fire 5 or 6 shots without adjustments and then decide if adjustment is required based on the center of the impacts.

Control under-reaction

This is an example of under-reaction:

If you drop a live frog in boiling water it will immediately jump out to save its life.

But if you put it in a pot of cold water and heat it the frog will eventually pass out without any reaction.

Many companies have fallen into this trap: 

  • A big disaster generates a quick and effective reaction and the company recovers
  • A slow degradation of their KPIs such as customer sat pass undetected until it is too late 

Statistical Process Control

To avoid these two mistakes we can use SPC to control a metric to find out if the observed changes are SIGNIFICANT from the statistical point of view.

We will illustrate this with the use of an example: control your own weight to check if your diet is leading to your weight target or not.

You can download this example Excel file and replace the data in columns A and B with your own data:


 Data collection

You can decide the frequency of your data collection. In this example I have collected it daily. 

It is important to collect it at the same moment (more or less) each day.

You can either have your Excel file in your smartphone or in the cloud (Google Drive or Microsoft's One Drive)

Results interpretation

I am assuming our target is to reduce weight, therefore a downward trend will put the numbers in green and an upward trend will put them in red.

SPC uses some rules, developed by the Western Electric company, to detect symptoms of SIGNIFICANT trends . They are shown below with a screen of Minitab SPC for individuals:

The center line is the average of all values and the standard  deviation is also estimated by all the values.

In our Excel file upward trend symptoms are in red and downward in green.

The Excel file applies rules 1, 2, 3, 5, and 6 of this list.

Weight data interpretation

On 2/4/2018 weight 78.8 is in red and we see the explanation with a number 2 in column N which means that two values (1/4/2018 and 2/4/2018) with 78.8 Kg are 2 σ above the average. This is passed data. What really interests us is the data we have just collected.

The numbers start getting green on 8/4 with 78 kg and the reason is that out of the 5 last numbers the last 4 are 1 σ below the average. This is interpreted as a SIGNIFICANT downward trend. This trend continues the following days.

 Statistically significant trend doesn't mean that this reduction is acceptable but at least we are moving in the right direction. 

Results analysis with Minitab

In our Excel file there is no graphic: we just want to alert of significant trends 

Looking at this same data in Minitab we can see an SPC chart:

The red dots indicate trends. looking at the graph we notice the first two are upward and the last three downward so the overall trend is downward.

Stable Process

There is a common misconception about the meaning of process stability

For instance some might call the process characterized by the following data unstable:

But if we look at our SPC:

It doesn't give us any alert of unstability

This is a STABLE process consisting simply of throwing dice. Since we have always thrown dice the same way SPC is telling us that this is a stable process: no SIGNIFICANT change has been detected.

Stability is not always a good thing. In the case of our weight control stability would mean that there is no improvement. 

We want stability when the process is OK. If we improve this will show with SPC alerts in the right direction.

 Use for other processes

This Excel file can be used to control a variable other than weight at work. For instance:


In this case we are doing an hourly control of our process yield. In this case Yield increase is good so Red is Good.

At 20:00 we have enough evidence of a Yield increase although the change seems to have taken place at 16:00. 


SPC enables the right detection of process change.

High process intrinsic variation can lead to over-reaction (adjust when you shouldn't) or under-reaction (not reacting when you should)

Part 2: Real-Time SPC Analysis

Real-Time SPC Analysis


Real Time Statistical Process Control enables the operators running the process to know when action is required, as described in 


In this second article we will analyse some of the Out-Of-Control situations we can detect and the corresponding corrective actions required.

You can download an Excel file with these examples:


Individuals and Moving Range SPC charts

In sheet A we have collected process yield data every hour during several shifts.

Since we collect a single value each hour we are using the pair:

  • chart for individuals
  • moving range chart

The individuals chart has the actual values and the MR chart the absolute values of the change between two consecutive values.

In both charts we automatically apply the Western Electric rules for Out-Of-Control symptoms and the alert will appear in the corresponding column: N - W for the individuals chart and Y - AH for the MR chart. 

When any of these alerts appear the corresponding point in the chart will be colored: 

  • In red: upward symptom
  • In green: downward symptom

 Process Change

Looking at the charts above, the individuals chart shows green dots from 10:00 to 16:00 and red dots between 1:00 and 3:00 during the night. We can notice a jump from 16:00 to 17:00 which might indicate the moment when the process has changed but this jump is not significant enough to declare it as a process change. 

It is not until 1:00 that we have enough evidence to declare that the process has changed.

In this case, since we are measuring yield, an increase in yield is an improvement. 

So we can say that process yield has improved sometime between 16:00 and 1:00. With this information we should now investigate what happened during this period of time that caused this improvement. 

How much have we improved?  If we compute the average between 6:00 and 16:00 we get an average yield of 93. And from 17:00 to 11:00 the next day 118. So we can estimate an average yield increase of 25.

Process Improvement Symptoms

Let's now analyse the alerts in our sheet A:

The -4 values in the individuals chart indicate 4 out of 5 consecutive values 1 sigma below the center line: values significantly below the average. The value -9 at 14:00 means that the last 9 values were below the center line: same interpretation.

The values 9 starting at 1:00 mean that from 17:00 to 3:00 all values have been above the center line: yield has significantly increased. 

Finally, value 1 in moving range indicates that a significant change has occurred in the yield at 5:00. Looking at the individuals chart we see it corresponds to an increase. 

Slow Trend

Let's interpret the charts in sheet B:

 We notice that there is an upward trend in yield as shown by the green dots up to 6:00 in day 11 followed by red dots starting at 12:00 until the end. 

The alert in the MR chart on 4:00 corresponds to a significant one time decrease in yield so we can interpret it as a false alarm. 

The interpretation of this situation is that we have a continuous process improvement starting around 11:00 on day 11 maybe due to some ongoing improvement actions. 

Let's now look at the alerts:

They confirm a significant yield increase starting around 12:00 of day 11.

Variation Increase

We now analyse sheet C:

An increase in yield variation is visible in the individuals chart but no alarms are visible. 

The MR chart, on the other hand, is meant to detect this variation, and so it does.

Variation is significantly below the center line, as shown by the green dots before 20:00 and a significant increase is shown by the red dots starting at 3:00 on day 11. 

If we only measure averages, as we often do, this problem would remain undetected.

Variation is always a bad thing for our process, so we can say our process has got worse starting around the end of day 10. 

Yield variation will eventually cause inefficiencies, late deliveries and accumulation of WIP so this problem needs to be addressed before these side effects start to appear.

 Day of the Week Effect

Let's now analyse the example in sheet D:

Here we are collecting daily production data and we notice that our average production is 400 and our charts are not detecting any significant changes in the process.

On the other hand we can see a repetitive cycle of 7 days which, of course, corresponds to a week. 

If we look in our calendar the days of lowest production we realize they correspond to Sundays and the next lowest to Saturdays.

This is telling us that maybe there is significant seasonal trend within each week. 

In order to check that we have added column AT to our spreadsheet where we calculate the day of the week corresponding to each date with this formula:


Where A3 holds the date and "2" is to start counting the week on Monday. 

We can now do an ANOVA of our Production Vs Day of the Week with Minitab:

We can now confirm that there are significant differences among the different days of the week: Wednesdays and Fridays we have the highest productions. On Thursdays, for some reason, it is significantly lower and, of course, it is lowest on Saturdays and Sundays.

 Shift Change Effect

We have been suspecting that we have a drop in production at the start of a shift, so we have collected data of hourly production in sheet E to check this effect. These are the resulting charts:


In this case, just like in the previous one, there is no indication of Out-Of-Control but we can see peaks in the MR chart which correspond to the start of each shift.

To see how significant are these differences we add a column with the hour of the shift: 6:00, 14:00 and 22:00 are hour 1 of the shift.

Now we do an Anova in Minitab of Production Vs Shift Time:

We notice that production in the last hour of the shift is significantly higher than the average and in the first hour it is significantly lower. 

We would need to analyse the root cause of this effect which represents an inefficiency in our process. 

One possible cause may be to try to ship as much product as possible at the end of the shift in order to meet the  target. And, to do that, operators are moved to the end of the line. The side effect is that we leave an empty line for the next shift so some operators will have no work until the line fills up.


Real Time SPC is a useful tool for operators in order to control their process by making the required adjustments as soon as needed but without over-reacting.

Control charts are useless if analysis and corrective actions are not done in real time.

Operators need to be trained to do this analysis and to know what corrective action needs to be done and when.

In spite of the high-tech of our shop floors you can still see pencil and paper data collection being used.

Data collection in the shop floor is a non-value-add operation which is still required to control the process.

This collection is often automatically done by the manufacturing equipment with no operator intervention.

Unfortunately, detecting aspect defects is often difficult to automate even with AOI (Automatic Optical Inspection).

When an operator is required to inspect we need a method to enter the results into the system.

Line incidents, change of critical process parameters, etc. also need to be reported.

The method still widely used in this 21st century is:

  1.  The operator hand writes the defect data on a form
  2.  Forms are collected at the end of the shift
  3.  A clerk types the forms contents into the system at a later stage (next day or later)
  4.  The engineer analyses the results and takes corrective actions

This reporting process creates a number of problems:

  •  It is time consuming for the operator
  •  It adds a non-value-add operation: transcribing the data
  • The operator’s handwriting is not necessarily good (this criteria is not used for operator selection)
  •  Reporting errors may be introduced both by the operator and the clerk doing the transcription
  •  Some of these errors may be impossible to recover when they are detected on transcription
  •  Real time process control is impossible: the engineer detects a problem at least 24h after it happened.
  •  Maybe they have been producing bad product all this time.

On-line reporting

The alternative to the pencil & paper data collection is, of course, directly reporting defects in a terminal the moment they happen.

You can provide terminals in the inspection and test stations to do that.

Sometimes it is difficult to provide individual terminals to all the operators who need to report.  Sharing terminals and having to walk to the nearest terminal is time consuming so it may not be acceptable.

The reporting operation should not be time consuming and error prone. If we ask operators to type 12 digit numbers or time stamps the chances of error are very high.

So we need a reporting system which is both simple and error free.

Large number typing can be replaced by a bar code reader whenever possible. We can also use the data validation alternatives provided by Excel:

 In this way the reporting errors may be detected the moment they are entered and corrected on the spot.

Pull down menus may be used for each field to avoid typing and insure all use the same wording.

The current time stamp can be automatically provided with a formula:  =now() 

Reporting on the cloud

Having a terminal for each operator who needs reporting may not be possible, in which case reporting on the cloud may be an alternative. 

The data base can be created in a Google Drive spreadsheet:


  Reporting may be done with any tablet or smart phone via WIFI


 BYOD (Bring Your Own Device) alternative

Operators may have difficulties with readable hand writing or reporting in a terminal but they are normally experts in typing into their own smart phones. This existing skill can prove to be very useful for real-time on-line reporting.   

 The database may be in Internet or the company Intranet.

The smart phones used may be the operator’s own device or one supplied by the company.

It, obviously, requires appropriate firewalls to avoid entering virus.

Even a smart phone with no telephone line may be used to access via WIFI (an old out-of-use smart phone might do the job).

To avoid corrupting the database each workplace may have a dedicated input worksheet in the cloud and then data is automatically transferred to a central worksheet. Operators need write access to their input worksheet and only read access to the central worksheet and reports.


Real-time feedback to the workplace

Real-time data collection enables real-time feedback to the operators in order to take corrective actions while it is still possible.

This Andon display provides real-time feedback:


  • PTH component insertion DPMO (red means above 1500 target)
  • PTH component solder DPMO (green: below 190 target)
  • Processed cards/hour.

 SPC (Statistical Process Control) Charts is another alternative to provide feedback to the operators in real-time.

 When this feedback is in real-time it is possible to relate changes in the output to what just happened in the line: change in process parameters, line incidents, etc.

This makes it possible to get to the root cause of problems and often solve them on the spot.

Pencil and paper data collection makes it impossible to provide this real-time feedback: all we have is after-the-fact compound data. In this case it will be difficult to find the root cause of the problem and therefore the problem will often remain unresolved. All we can do in this situation is guess what happened and try to find a justification for management.