The Most Useful Data Plot You’ve Never Used

Those of us working in industry with Excel are familiar with scatter plots, line graphs, bar charts, pie charts and maybe a couple of other graph types. Some of us have occasionally used the Analysis Pack to create histograms that don’t update when our data changes (though there is a way to make dynamic histograms in Excel; perhaps I’ll cover this in another blog post).

One of the most important steps in data analysis is to just look at the data. What does the data look like? When we have time-dependent data, we can lay it out as a time-series or, better still, as a control chart (a.k.a. “natural process behavior chart”). Sometimes we just want to see how the data looks as a group. Maybe we want to look at the product weight or the cycle time across production shifts.

Unless you have Minitab, R or another good data analysis tool at your disposal, you have probably never used—maybe never heard of—boxplots. That’s unfortunate, because boxplots should be one of the “go-to” tools in your data analysis tool belt. It’s a real oversight that Excel doesn’t provide a good way to create them.

For the purpose of demonstration, let’s start with creating some randomly generated data:

head(df)

##   variable   value
## 1   group1 -1.5609
## 2   group1 -0.3708
## 3   group1  1.4242
## 4   group1  1.3375
## 5   group1  0.3007
## 6   group1  1.9717

tail(df)

##     variable   value
## 395   group1  1.4591
## 396   group1 -1.5895
## 397   group1 -0.4692
## 398   group1  0.1450
## 399   group1 -0.3332
## 400   group1 -2.3644

If we don’t have much data, we can just plot the points:

library(ggplot2)

ggplot(data = df[1:10,]) +
  geom_point(aes(x = variable, y = value)) +
  coord_flip() +
  theme_bw()

But if we have lots of data, it becomes hard to see the distribution due to overplotting:

ggplot(data = df) +
  geom_point(aes(x = variable, y = value)) +
  coord_flip() +
  theme_bw()

We can try to fix this by changing some parameters, like adding semi-transparency (alpha blending) and using an open plot symbol, but for the most part this just makes the data points harder to see; the distribution is largely lost:

ggplot(data = df) +
  geom_point(aes(x = variable, y = value), alpha = 0.3, shape = 1) +
  coord_flip() +
  theme_bw()

The natural solution is to use histograms, another “go-to” data analysis tool that Excel doesn’t provide in a convenient way:

ggplot(data = df) +
  geom_histogram(aes(x = value), binwidth = 1) +
  theme_bw()

But histograms don’t scale well when you want to compare multiple groups; the histograms get too short (or too narrow) to really provide useful information. Here I’ve broken the data into eight groups:

head(df)

##   variable   value
## 1   group1 -1.5609
## 2   group1 -0.3708
## 3   group1  1.4242
## 4   group1  1.3375
## 5   group1  0.3007
## 6   group1  1.9717

tail(df)

##     variable   value
## 395   group8 -0.6384
## 396   group8 -3.0245
## 397   group8  1.5866
## 398   group8  1.9747
## 399   group8  0.2377
## 400   group8 -0.3468

ggplot(data = df) +
  geom_histogram(aes(x = value), binwidth = 1) +
  facet_grid(variable ~ .) +
  theme_bw()

Either the histograms need to be taller, making the stack too tall to fit on a page, or we need a better solution.

The solution is the box plot:

ggplot() +
  geom_boxplot(data = df, aes(y = value, x = variable)) +
  coord_flip() +
  theme_bw()

The boxplot provides a nice, compact representation of the distribution of a set of data, and makes it easy to compare across a large number of groups.

There’s a lot of information packed into that graph, so let’s unpack it:

Median

A measure of the central tendency of the data that is a little more robust than the mean (or arithmetic average). Half (50%) of the data falls below this mark. The other half falls above it.

First quartile (25th percentile) hinge

Twenty-five percent (25%) of the data falls below this mark.

Third quartile (75th percentile) hinge

Seventy-five percent (75%) of the data falls below this mark.

Inter-Quartile Range (IQR)

The middle half (50%) of the data falls within this band, drawn between the 25th percentile and 75th percentile hinges.

Lower whisker

The lower whisker connects the first quartile hinge to the lowest data point within 1.5 * IQR of the hinge.

Upper whisker

The upper whisker connects the third quartile hinge to the highest data point within 1.5 * IQR of the hinge.

Outliers

Any data points below 1.5 * IQR of the first quartile hinge, or above 1.5 * IQR of the third quartile hinge, are marked individually as outliers.

We can add additional values to these plots. For instance, it’s sometimes useful to add the mean (average) when the distributions are heavily skewed:

ggplot(data = df, aes(y = value, x = variable)) +
  geom_boxplot() +
  stat_summary(fun.y = mean, geom="point", shape = 10, size = 3, colour = "blue") +
  coord_flip() +
  theme_bw()

Graphs created in the R programming language using the ggplot2 and gridExtra packages.

References

R Core Team (2014). R: A language and environment for statistical computing. R Foundation for Statistical
Computing, Vienna, Austria. URL http://www.R-project.org/.
H. Wickham. ggplot2: elegant graphics for data analysis. Springer New York, 2009.
Baptiste Auguie (2012). gridExtra: functions in Grid graphics. R package version 0.9.1.
http://CRAN.R-project.org/package=gridExtra

Flowing Requirements from the VoC or VoP

In a previous post, I talked about the voice of the customer (VoC), voice of the process (VoP) and the necessity of combining the two when specifying a product. Here, I’d like to offer a general method for applying this in the real world, which can be implemented as a template in Excel.

Recap

I showed that there was a cost function associated with any specification that derived from both the VoC (expressed as tolerances or specification limits) and from the process capability. An example cost function for a two-sided tolerance is reproduced below.

Percent of target production costs given an average production weight and four different process capabilities.

I argued that, given this cost function, specifying a product requires specifying both the product specification limits (or tolerances) and the minimally acceptable process capability, Cpk. Ideally, both of these should flow down from a customer needs analysis to the finished product, and from the finished product to the components, and so on to materials.

Requirements flow down and up

To flow all requirements down like this, we would need to know the transfer functions, $Y = f(X)$ , for each requirement Y and each subcomponent characteristic X. There are methods for doing this, like Design for X or QFD, but they can be difficult to implement. In the real world, we don’t always know these transfer functions, and determining them can require non-trivial research projects that are best left to academia.

As an illustration, we will use the design of a battery (somewhat simplified), where we have to meet a minimum requirement that is the sum of component parts. The illustration below shows the component parts of a battery, or cell. It includes a container (or “cell wall”), positive and negative electrodes (or positive and negative “plates”), electrolyte and terminals that provide electrical connection to the outside world. Usually, we prefer lighter batteries to heavier ones, but for this example, we’ll suppose that a customer requires a minimum weight. This requirement naturally places limits on the weight of all components.

In the absence of transfer functions, we often make our best guess, build a few prototypes, and then adjust the design. This may take several iterations. A better approach is to estimate the weight specification limits and minimum Cpk by calculation before any cells are actually built.

General drawing of the structure of aircraft battery’s vented type NiCd cell. Ransu. Wikipedia, [http://en.wikipedia.org/wiki/ File:Aircraft_battery_cell.gif]. Accessed 2014-04-04.

Suppose the customer specifies a cell minimum weight of 100 kg. From similar designs, we know the components that contribute to the cell mass and have an idea of the percentage of total weight that each component contributes.

$m_{cell}=m_{container}+m_{terminals}+m_{electrolyte}+m_{poselect}+m_{negelect}$

Each individual component is therefore a fraction f_m of the total cell mass, e.g.

$m_{container}=f_{m,container}m_{cell}$

More generally, for a measurable characteristic c, component i has an expected mean or target value of $T_{i,c}=f_{i,c}\mu_{parent,c}$ or $T_{i,c}=f_{i,c}T_{parent,c}$ .

In our example, we may know from similar products or from design considerations that we want to target the following percents for each fraction f_m:

5% for container
19% for terminals
24% for electrolyte
26% for positive electrodes
26% for negative electrodes

Specification Limits

Upper Specification Limit (USL): The maximum allowed value of the characteristic. Also referred to as the upper tolerance.
Lower Specification Limit (LSL): The minimum allowed value of the characteristic. Also referred to as the lower tolerance.

Since the customer will always want to pay as little as possible, a specified lower weight of 100 kg is equivalent to saying that they are only willing to pay for 100 kg of material; any extra material is added cost that reduces our profit margin. If we tried to charge them for 150 kg of material, they would go buy from our competitors. The lower specification limit, or lower tolerance, of the cell weight is then 100 kg.

If the customer does not specify a maximum weight, or upper specification limit, then we determine the upper limit by the maximum extra material cost that we are willing to bear. In this example, we decide that we are willing to absorb up to 5% additional cost per part. Assuming that material and construction contributes 50% to the total cell cost, the USL is then 110 kg. To allow for some variation, we can set a target weight in the middle: 105 kg. From data on previous designs and the design goals, we can apportion the target weight to each component of the design, as shown in the table below.

We can apply the same fractions to the cell USL and LSL to obtain a USL and LSL of each component. As long as parts are built within these limits, the cell will be within specification. The resulting specification for cell and major subcomponents is illustrated in table [tblSpecification]. Further refinement of the allocation of USL and LSL to the components is possible and may be needed if the limits do not make sense from a production or cost perspective.

Part	Percent	Target	LSL	USL
		/kg	/kg	/kg
Cell	100%	105	100	110
Container	5%	5.2	5	5.5
Terminals	19%	19.9	19	20.9
Electrolyte	24%	25.2	24	26.4
Positive electrodes	26%	27.3	26	28.6
Negative electrodes	26%	27.3	26	28.6

Variance of components and Cpk

When a characteristic is due to the sum of the part’s components, as with cell mass, the part-to-part variation in the characteristic is likewise due to the variation in the components. However, where the characteristic adds as the sum of the components,

$m_{cell}=m_{container}+m_{terminals}+m_{electrolyte}+m_{poselect}+m_{negelect}$

the variance, $\sigma^{2}$ adds as the sum of squares

$\sigma_{cell}^{2}=\sigma_{container}^{2}+\sigma_{terminal}^{2}+\sigma_{electrolyte}^{2}+\sigma_{poselect}^{2}+\sigma_{negelect}^{2}$

The variance of any individual component is therefore a function of the total parent part variance

$\sigma_{container}^{2}=\sigma_{cell}^{2}-\sigma_{terminal}^{2}-\sigma_{electrolyte}^{2}-\sigma_{poselect}^{2}-\sigma_{negelect}^{2}$

$\displaystyle \sigma_{container,mass}^{2}=f_{\sigma,container}\sigma_{cell,mass}^{2}$

Since this is true for all components, the two fractions $f_{m}$ and $f_{\sigma}$ will be approximately equal. Therefore if we don’t know the fractions $f_{\sigma}$ , we can use the fraction $f_{m}$ , which usually easier to work out, to allocate the variance to each component:

$\displaystyle \sigma_{container,mass}^{2}=f_{m,container}\times\sigma_{cell,mass}^{2}$

More generally, for measurable characteristic $c$ of a subcomponent $i$ of a parent component,

$\displaystyle \sigma_{i,c}=\sqrt{f_{c,i}}\:\sigma_{c,parent}$

Since the given $\sigma$ is the maximum allowed for the parent to meet the desired Cpk, this means that $\sigma_{i}^{2}$ is an estimate for the maximum allowed component variance. Manufacturing can produce parts better than this specification, but any greater variance will drive the parent part out of specification.

Calculating Specification Limits

In general, there are two conflicting goals in setting specifications:

Make them as wide as possible to allow for manufacturing variation while still meeting the VoC.
Make them as narrow as possible to stay near the minimum of the cost function.

For this, Crystall Ball or iGrafx are very useful tools during development, as we can simulate a set of arts or processes, analyze the allowed variation in the product and easily flow that variation down to each component. In the absence of these tools, Minitab or Excel can be used to derive slightly less robust solutions.

Calculating from Customer Requirements

Identify any customer requirements and set specification limits (USL and LSL) accordingly. If the customer requirements are one-sided, determine the maximum additional cost we are willing to accept, and set the other specification limit accordingly. Some approximation of costs may be needed.
If no target is given, set the target specification for each requirement as the average of USL and LSL.
Set the minimum acceptable Cpk for each specification. Cpk = 1.67 is a good starting value. Use customer requirements for Cpk, where appropriate, and consider, also, whether the application requires a higher Cpk (weakest link in the chain….
Calculate the maximum allowed standard deviation to meet the Cpk requirement as $\sigma_{parent}=\left(USL-LSL\right)/\left(6\times Cpk\right)$ .
For each subcomponent (e.g. the cell has subcomponents of container, electrodes, electrolyte, and so on), apportion the target specification to each of the subcomponents based on engineering considerations and judgement. If the fractions $f$ are known, $T_{i}=f_{i}\times T_{parent}$ .
Calculate the fraction $f_{i}$ (or percent) of the parent total for each subcomponent if not already established in step (5).
Calculate the USL and LSL for each subcomponent by multiplying the parent USL and LSL by the component’s fraction of parent (from step 6). $USL_{i}=f_{i}\times USL_{parent}$ and $LSL_{i}=f_{i}\times LSL_{parent}$ .
Estimate the allowed standard deviation $\sigma_{i}$ for each subcomponent as
$\displaystyle \sigma_{i}=\mathtt{SQRT}\left(f_{i}\right)\times\sigma_{parent}.$
Calculate the minimum allowed Cpk for each subcomponent from the results of (5), (7) and (8), using the target, $T$ , for the mean, $\mu$ .
$\displaystyle Cpk_{i}=minimum\begin{cases}\frac{USL_{i}-T_{i}}{3\sigma_{i}}\\\frac{T_{i}-LSL_{i}}{3\sigma_{i}}\end{cases}$
Repeat steps (5) through (9) until all components have been specified.
For each component, report the specified USL, LSL, target T and maximum Cpk.

Calculating from Process Data

When there is no clear customer-driven requirement or clear requirement from parent parts (e.g. dimensional specifications that can be driven by the fit of parts), but specification limits are still reasonably needed, we can start from existing process data.

This is undesirable because any change to the process can force a change to the product specification, without any clear understanding of the impact on customer needs or requirements; the VoC is lost.

The calculation of USL and LSL from process data is also somewhat more complicated, as we have to use the population mean and standard deviation to determine where to set the USL and LSL, without really knowing what that mean and standard deviation are.

In the real world, we have to live with such constraints. To deal with these limitations, we will use as much data as is available and calculate the confidence intervals on both the mean and the standard deviation. The calculation for USL and LSL becomes

$\setlength\arraycolsep{2pt}\begin{array}{rl}\displaystyle USL &=\textrm{upper 95\% confidence on the mean}\smallskip\\ \displaystyle &\quad +k\times\textrm{upper 95\% confidence on the standard deviation}\end{array}$
$\setlength\arraycolsep{2pt}\begin{array}{rl}\displaystyle LSL &=\textrm{lower 95\% confidence on the mean}\smallskip\\ &\quad -k\times\textrm{upper 95\% confidence on the standard deviation}\end{array}$

where $k$ is the number of process Sigmas desired, based on the tolerance cost function. Most of the time, we will use $k=5$ , to achieve a Cpk of 1.67.

We always use the upper 95% confidence interval on the standard deviation. We don’t care about the lower confidence interval, since a small $\sigma$ will not help us in setting specification limits.

Calculate the mean ( $\mu_{parent}$ ) from recent production data. In Excel, use the AVERAGE() function on the data range.
Calculate the standard deviation () from recent production data. In Excel, you can use the STDEV() function on the data range.
1. If the order of production data is known, or SPC is in use, a better method is to use the range-based estimate from the control charts. This will be discussed in subsequent training on control charts.
Count the number of data points, n, that were used for the calculations (1) and (2). You can use the COUNT() function on the data range.
Calculate the 95% confidence level on the mean. In Excel, this is accomplished with
$CL=\mathtt{TINV}\left(\left(1-0.95\right);n-1\right)\times\sigma_{parent}/\mathtt{SQRT}\left(n\right)$

In Excel 2010 and later, TINV() should be replaced with T.INV.2T().
Calculate the 95% confidence interval on the mean as $CI_{upper}=\mu+CL$ and $CI_{lower}=\mu-CL$ .
Calculate the upper and lower 95% confidence limits on the standard deviation. In Excel, this is accomplished with
$\sigma_{upper}=\sigma_{parent}\times\mathtt{SQRT}\left(\left(n-1\right)/\mathtt{CHIINV}\left(\left(1-0.95\right)/2;n-1\right)\right)$

and

$\sigma_{lower}=\sigma_{parent}\times\mathtt{SQRT}\left(\left(n-1\right)/\mathtt{CHIINV}\left(1-\left(1-0.95\right)/2;n-1\right)\right)$

In Excel 2010 and later, CHIINV() can be replaced with CHISQ.INV.RT() for improved accuracy.
Calculate the LSL as $LSL_{parent}=CI_{lower}-k\sigma_{upper}$ . You might use a value other than 5 if the customer requirements or application require a higher process Sigma.
Calculate the USL as $USL_{parent}=CI_{upper}+k\sigma_{upper}$ .
For each subcomponent (e.g. the cell has subcomponents of positive electrode, negative electrode, electrolyte, and so on), apportion the parent part mean to each of the subcomponents based on engineering considerations and judgement. If the fractions $f$ are known, $T_{i}=f_{i}\times\mu_{parent}$ .
If the the fraction (or percent) $f_{i}$ of the parent total for each subcomponent is not known, calculate it using the results of step (9).
Calculate the USL and LSL for each subcomponent by multiplying the parent USL and LSL by the component’s fraction of parent (from step 6). $USL_{i}=f_{i}\times USL_{parent}$ and $LSL_{i}=f_{i}\times LSL_{parent}$ .
Estimate the allowed standard deviation $\sigma_{i}$ for each subcomponent as $\sigma_{i}=\mathtt{SQRT}\left(f_{i}\right)\times\sigma_{lower}$
Calculate the minimum allowed Cpk for each subcomponent from the results of (5), (7) and (8), using the target $T_{i}$ for the mean, $\mu_{i}$ .
$\displaystyle Cpk_{i}=minimum\begin{cases}\frac{USL_{i}-T_{i}}{3\sigma_{i}}\\\frac{T_{i}-LSL_{i}}{3\sigma_{i}}\end{cases}$
Repeat steps (9) through (13) until all components have been specified.
For each component, report the specified USL, LSL, target T and maximum Cpk.

Can We do Better than R-squared?

If you're anything like me, you've used Excel to plot data, then used the built-in “add fitted line” feature to overlay a fitted line to show the trend, and displayed the “goodness of fit,” the r-squared (R²) value, on the chart by checking the provided box in the chart dialog.

The R² calculated in Excel is often used as a measure of how well a model explains a response variable, so that “R² = 0.8” is interpreted as “80% of the variation in the 'y' variable is explained by my model.” I think that the ease with which the R² value can be calculated and added to a plot is one of the reasons for its popularity.

There's a hidden trap, though. R² will increase as you add terms to a model, even if those terms offer no real explanatory power. By using the R² that Excel so helpfully provides, we can fool ourselves into believing that a model is better than it is.

Below I'll demonstrate this and show an alternative that can be implemented easily in R.

Some data to work with

First, let's create a simple, random data set, with factors a, b, c and response variable y.

head(my.df)

##       y a       b      c
## 1 2.189 1 -1.2935 -0.126
## 2 3.912 2 -0.4662  1.623
## 3 4.886 3  0.1338  2.865
## 4 5.121 4  1.2945  4.692
## 5 4.917 5  0.1178  5.102
## 6 4.745 6  0.4045  5.936

Here is what this data looks like:

Calculating R-squared

What Excel does when it displays the R² is create a linear least-squares model, which in R looks something like:

my.lm <- lm(y ~ a + b + c, data = my.df)

Excel also does this when we call RSQ() in a worksheet. In fact, we can do this explicitly in Excel using the Regression analysis option in the Analysis Pack add-on, but I don't know many people who use this, and Excel isn't known for its reliability in producing good output from the Analysis Pack.

In R, we can obtain R² via the summary() function on a linear model.

summary(my.lm)

## 
## Call:
## lm(formula = y ~ a + b + c, data = my.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2790 -0.6006  0.0473  0.5177  1.5299 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(&gt;|t|)  
## (Intercept)    2.080      0.763    2.72    0.034 *
## a             -0.337      0.776   -0.43    0.679  
## b             -0.489      0.707   -0.69    0.515  
## c              1.038      0.817    1.27    0.250  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.1 on 6 degrees of freedom
## Multiple R-squared:  0.833,  Adjusted R-squared:  0.75 
## F-statistic:   10 on 3 and 6 DF,  p-value: 0.00948

Since summary() produces a list object as output, we can grab just the R² value.

summary(my.lm)$r.squared

## [1] 0.8333

Normally, we would (somewhat loosely) interpret this as telling us that about 83% of the variation in the response y is explained by the model.

Notice that there is also an "adjusted r-squared” value given by summary(). This tells us that only 75% of the variation is explained by the model. Which is right?

The problem with R-squared

Models that have many terms will always give higher R² values, just because more terms will slightly improve the model fit to the given data. The unadjusted R² is wrong. The calculation for adjusted R² is intended to partially compensate for that “overfit,” so it's better.

It's nice that R shows us both values, and a pity that Excel won't show the adjusted value. The only way to get an adjusted R² in Excel is to run the Regression analysis; otherwise, we have to calculate adjusted R² manually.

Both R² and adjusted R² are measures of how well the model explains the given data. However, in industry we usually want to know something a little different. We don't build regression models to explain only the data we have; we build them to think about future results. We want R² to tell us how well the model predicts the future. That is, we want a predictive R². Minitab has added the ability to calculate predictive R² in Minitab 17, and has a nice blog post explaining this statistic.

Calcuting predictive R-squared

Neither R nor Excel provide a means of calculating the predictive R² within the default functions. While some free R add-on packages provide this ability (DAAG, at least), we can easily do it ourselves. We'll need a linear model, created with lm(), for the residuals so we can calculate the “PRESS” statistic, and then we need the sum of squares of the terms so we can calculate a predictive R².

Since the predictive R² depends entirely on the PRESS statistic, we could skip the added work of calculating predictive R² and just use PRESS, as some authors advocate. The lower the PRESS, the better the model is at fitting future data from the same process, so we can use PRESS to compare different models. Personally, I'm used to thinking in terms of R², and I like having the ability to compare to the old R² statistic that I'm familiar with.

To calculate PRESS, first we calculate the predictive residuals, then take the sum of squares (thanks to (Walker’s helpful blog post) for this). This is pretty easy if we already have a linear model. It would take a little more work in Excel.

pr <- residuals(my.lm)/(1 - lm.influence(my.lm)$hat)
PRESS <- sum(pr^2)
PRESS

## [1] 19.9

The predictive R² is then (from a helpful comment by Ibanescu on LikedIn) the PRESS divided by the total sum of squares, subtracted from one. The total sum of squares can be calculated directly as the sum of the squared residuals, or obtained by summing over Sum Sq from an anova() on our linear model. I prefer using the anova function, as any statistical subtleties are more likely to be properly accounted for there than in my simple code.

# anova to calculate residual sum of squares
my.anova <- anova(my.lm)
tss <- sum(my.anova$"Sum Sq")
# predictive R^2
pred.r.squared <- 1 - PRESS/(tss)
pred.r.squared

## [1] 0.5401

You'll notice that this is smaller than the residual R², which is itself smaller than the basic R². This is the point of the exercise. We don't want to fool ourselves into thinking we have a better model than we actually do. One way to think of this is that 29% (83% – 54%) of the model is explained by too many factors and random correlations, which we would have attributed to our model if we were just using Excel's built-in function.

When the model is good and has few terms, the differences are small. For example, working through the examples in Mitsa's two posts, we see that for her model 3, R² = 0.96 and the predictive R² = 0.94, so calculating the predictive R² wasn't really worth the extra effort for that model. Unfortunately, we can't know, in advance, which models are “good.” For Mitsa's model 1 we have R² = 0.95 and predictive R² = 0.32. Even the adjusted R² looks pretty good for model 1, at 0.94, but we see from the predictive R² that our model is not very useful. This is the sort of thing we need to know to make correct decisions.

Automating

In R, we can easily wrap these in functions that we can source() and call directly, reducing the typing. Just create a linear model with lm() (or an equivalent) and pass that to either function. Note that pred_r_squared() calls PRESS(), so both functions have to be sourced.

pred_r_squared <- function(linear.model) {
    lm.anova <- anova(linear.model)
    tss <- sum(lm.anova$"Sum Sq")
    # predictive R^2
    pred.r.squared <- 1 - PRESS(linear.model)/(tss)
    return(pred.r.squared)
}

PRESS <- function(linear.model) {
    pr <- residuals(linear.model)/(1 - lm.influence(linear.model)$hat)
    PRESS <- sum(pr^2)
    return(PRESS)
}

Then we just call the function to get the result:

pred.r.squared <- pred_r_squared(my.lm)
pred.r.squared

## [1] 0.5401

I've posted these as Gists on GitHub, with extra comments, so you can copy and paste from here or go branch or copy them there.

References and further reading

Mitsa, T. Use PRESS, not R squared to judge predictive power of regression. 12 May 2013. Analytical Bridge. Accessed 14 May 2014. Shows how r-squared can lead to a misleading interpretation of model fit and provides an explanation of the PRESS statistic, with examples comparing three linear models in R.
Mitsa, T. Cross-validation in R: a do-it-yourself and a black box approach. 22 May 2013. Accessed 14 May 2014. Shows how to use the package DAAG to calculate PRESS, or to calculate it manually.
Walker, S. Calculating the PRESS statistic in R. 18 June 2013. ecology & stats. Accessed 14 May 2014. Provides a simple function for calculating PRESS in R.
Multiple Regression Analysis: Use Adjusted R-Squared and Predicted R-Squared to Include the Correct Number of Variables
Adjusted R-Square or Predicted R-Square. LinkedIn. Accessed 14 May 2014. Forum dscussion thread discusing the relative merits of adjusted and predicted R², in which the equation for calculating predicted R² is given.
Why is adjusted R-squared less than R-squared if adjusted R-squared predicts the model better?. StackExchange. Accessed 10 May 2014. Q&A thread discussing the relative merits of R² and adjusted R².
R Core Team (2014). R: A language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.
H. Wickham. ggplot2: elegant graphics for data analysis. Springer New York, 2009.

Combining the Voice of the Customer with the Voice of the Process

Setting product specifications is an iterative and challenging process, combining lab test data, historical data and educated guesses. All too often, the result is a set of product and process specifications that must be changed to meet manufacturing needs and that do not meet all customer requirements. To achieve specifications that are correct and defendable, the engineer must understand how the voice of the customer and the voice of the process interact, and must fully specify product through both specification limits (also called tolerance limits) and production capability, or Cpk.

Voice of the Customer

What the customer expects a product to do—what they are willing to pay for—is known as the “voice of the customer” (VoC). A customer’s expectations may not all be written or explicitly stated, and unwritten or unrecognized needs or wants can be even more important than the written ones. As we design a product, we first translate the VoC to engineering requirements and then flow the requirements down to subcomponents.

For simplicity, I will adopt the convention of referring to the specifications or tolerances as the Target Specification, the Upper Specification Limit (USL) and the Lower Specification Limit (LSL). The target, USL and LSL are the engineering translation of the VoC. I will refer to the measure being specified—length, mass, voltage, etc.—as a characteristic.

Voice of the Process

What we know about the parts actually produced—maximum and minimum, average, standard deviation, outliers, etc.—is known as the “voice of the process” (VoP). The VoP tells us the limits of our manufacturing abilities.

Suppose that we know from the production plant that the typical weight of our product is between 99.2 and 104.2 kg. This is the VoP; it may or may not be acceptable to the customer or fit within the USL and LSL. When we engage in statistical process control (SPC), we are listening to the VoP, but not the VoC.

The engineer must design to the VoC while considering the VoP.

Specification Limits

First, some definitions:

Target Specification: The desired value of the characteristic.
Mean Specification: The average of the upper and lower specification values—the value in the middle of the specification or tolerance range.
Nominal Specification: Used in different ways. May mean the target value of the specification. Sometimes the value printed in marketing literature or on the product, which may be the minimum or some other value.
Upper Specification Limit: The maximum allowed value of the characteristic. Sometimes referred to as the upper tolerance.
Lower Specification Limit: The minimum allowed value of the characteristic. Also referred to the lower tolerance.

Suppose that the customer has said that they want our product to weigh at least 100 kg. Since the customer will always want to pay as little as possible, a customer-specified lower specification limit of 100 kg is equivalent to saying that they are only willing to pay for 100 kg worth of costs; any extra material is added cost that reduces our profit margin.

If the customer does not specify a maximum weight, or upper specification limit, then we can choose the upper limit by the maximum extra material cost we want to bear. For this example, we decide that we are willing to absorb up to 5% additional cost. For our product, material and construction contributes 50% to the total assembled part cost, so the USL on weight is 110 kg.

Defect Rates and Cpk: Where VoC and VoP Intersect

Defect rates are expressed variously as Sigma, Cpk (“process capability”), Ppk (“process performance”), parts per million (ppm), yield (usually as a percent) or defects per million opportunities (DPMO). “Sigma” refers to the number of standard deviations that fit between the tolerance limits and the process mean. A 1-Sigma process has 1 standard deviation ( $\sigma$ ) between the mean ( $\mu$ ) and the nearest specification limit. Cpk is a measure of the number of times that three standard deviations ( $3\sigma$ ) fit between the mean and the nearest specification limit. Cpk is often used by customers, especially automotive OEMs. We can easily convert between Cpk, Sigma yield, ppm and DPMO.

Production tells us that historically, they usually see a range of about 5 kg in assembled part weight, with weights between 99.2 kg and 104.2 kg. With enough data, the range of observed data will cover roughly the range from [mean – 3 standard devations] to [mean + 3 standard devations], so this gives us $\mu \approx 101.7$ kg and $\sigma \approx 0.83$ kg. From $(\mu-LSL)/\sigma=Sigma$ , $\left(101.7-100\right)/0.83=2.0$ , we have a 2-Sigma process. With this data, we could estimate the percent of parts that will be below the LSL.

Defect Rate Calculation

We can use this data to estimate the percent of product that will be out of the customer’s specification. We’ll assume that our process produces parts where the weight is normally distributed (having a Gaussian or bell-shaped curve).

Using Minitab, you can do this by opening the “Calc” menu, the “Probability Distributions” submenu, and choosing “Normal….” Then choose “Cumulative probability,” enter “101.7” for the mean and “0.83” for the standard deviation. Select “Input constant” and enter “100.” Click “OK.” The result in the Session Window looks like:

Cumulative Distribution Function

Normal with mean = 101.7 and standard deviation = 0.83

  x  P( X <= x )
100     0.02027

This is read as: the probability of seeing a weight X less than or equal to the given value x = 100 is 0.02, so we can expect about 2% of parts to be out of specification. This can also be done in Excel using NORMDIST(100, 101.7, 0.83, TRUE) or, in Excel 2010 and later, NORM.DIST(100, 101.7, 0.83, TRUE). In R we would use pnorm(100, 101.7, 0.83).

Minitab can also display this graphically. Open the “Graph” menu, select “Probability Distribution Plot…” Select “View Probability” (the right-most option) and click “OK,” On the “Distribution” tab, select “Normal” from the distribution menu and enter “101.7” for the mean and “0.83” for the standard deviation. Then switch to the “Shaded Area” tab, select “X Value,” “Left Tail” and enter “100” for the “X value.” Click on “OK” to obtain a plot like that below.

Probability density distribution showing cumulative probability below a target value of 100 shaded in red.

Of course, not all processes produce parts that are normally distributed, and you can use a distribution that fits your data. However, a normal distribution will give you a good approximation in most circumstances.

Cpk

The calculation for Cpk is:

$Cpk=minimum\begin{cases} Cpu\\ Cpl \end{cases}$

where Cpu and Cpl are defined as:

$Cpu=\frac{USL-\mu}{3\sigma}$

$Cpl=\frac{\mu-LSL}{3\sigma}$

If $\mu$ is centered between the USL and the LSL, then by simple algebra we have

$Cp=\frac{USL-LSL}{6\sigma}$

which is often a convenient form to use to estimate the “best case” process capability, even when we do not know what $\mu$ is.

The combination of VoC-derived specification limits (USL, LSL and T) with process data ( $\mu$ and $\sigma$ ) ties together the VoC and the VoP, and allows us to predict future performance against the requirements.

Multiplying Cp or Cpk by $\sqrt{1+\left(\left(\mu-T\right)/\sigma\right)^{2}}$ produces a measure of the process capability that includes centering on the target value, T, referred to as Cpm or Cpkm, respectively. These versions are more informative but much less commonly used than Cp and Cpk.

In the example, $\mu-LSL=101.7-100=1.7=2\times0.83=2\sigma$ . We can then calculate the Cpk:

$Cpk=\frac{\mu-LSL}{3\sigma}=\frac{2\sigma}{3\sigma}=0.67$

Acceptable values of Cpk are usually 1.33, 1.67 or 2.0. Automotive OEMs typically require Cpk = 1.33 for non-critical or new production processes, and Cpk = 1.67 or 2.0 for regular production. In safety-critical systems, a Cpk should be 6 or higher. The “Six Sigma” improvement methodology and Design For Six Sigma refers to reducing process variation until six standard deviations of variation fit between the mean and the nearest tolerance (i.e. Cpk = 2), achieving a defect rate of less than 3.4 per million opportunities. Some typical Cpk, and corresponding process sigma and process yield are provided in table [tblCpkSigmaYield].

Cpk	Sigma	Yield (max)	Yield (likely)
0.33	1	85.%	30.%
1.00	3	99.9%	90.%
1.33	4	99.997%	99.%
1.67	5	99.99997%	99.98%
2.00	6	99.9999999%	99.9997%

In the table, “Yield (max)” assumes that the process is perfectly stable, such that parts produced today and parts produced weeks from now all exhibit the same mean and variance. Since no manufacturing process is perfectly stable, “Yield (likely)” assumes additional sources of variation that shift the process by $1.5\sigma$ (e.g. seasonal effects, or differences in setups across days or shifts). This $1.5\sigma$ shift is a standard assumption in such calculations when we do not have real data about the long-term process stability of our processes.

Specification Limits and Cost

When parts are out of specification—that is, they may be identified prior to shipment as defective or nonconforming—they they can have four possible impacts:

they must be reworked;
they must be scrapped;
they come back as warranty claims; or
they result in lost customers (reduced revenue without reducing “fixed” costs).

For example, underweight or damaged injection-molded plastics might be melted and reprocessed, but this adds cost in capital for the added equipment to proces the parts and cost for extra electricity and labor to move, sort and remelt. Later in production, defective parts will have to be scrapped.

We can see, then, that a cost function can be associated with each end of a specification range. The specification limits must be derived from the VoC, but the VoP imposes the cost function. The figures below illustrate this for both one-sided and two-sided specifications.

Percent of target production costs given an average production weight and four different process capabilities.

The minimum of the cost function is a good place to set our target specification, unless we have some strong need to set a different target.

Variance of Components and Cpk

When a part characteristic is the sum of the part’s components, as with weight, then the variation in the part characteristic is likewise due to the variation in the individual components. However, while the weight adds as the sum of the components,

$m_{part}=m_{c1}+m_{c2}+m_{c3}+m_{c4}+m_{c5}$

the variance in weight, $\sigma^{2}$ , adds as the sum of squares

$\sigma_{part}^{2}=\sigma_{c1}^{2}+\sigma_{c2}^{2}+\sigma_{c3}^{2}+\sigma_{c4}^{2}+\sigma_{c5}^{2}$

Since the given $\sigma$ is the maximum allowed for the assembled part to meet the desired Cpk, this means that the component variances, $\sigma_{i}^{2}$ , are an estimate for the maximum allowed component variance. Manufacturing can produce parts better than this specification, but any greater variance will drive the parent part out of specification.

Specifying Characteristics

Very often, only specification ranges (or tolerances)—the USL and LSL—are provided by engineering when designing parts. Almost as often, these ranges are based on a target value (derived from the VoC) with some “allowed” tolerance based on what manufacturing says they can achieve on the existing equipment and processes (the VoP). It should be clear from the above that a minimally-adequate part specification includes USL and LSL derived from the VoC and the minimum acceptable Cpk derived from both the VoC and the VoP. The inclusion of the minimum process capability is the only way to ensure that the parts are made within the target costs and at the target quality level.

The next challenge is to flow these requirements down to the components. Having laid the groundwork for the basis of requirements flow-down, I will look at a way to do this in a future post.

References

All graphs created in R using ggplot2.

R Core Team (2014). R: A language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.
H. Wickham. ggplot2: elegant graphics for data analysis. Springer New York, 2009.

Combining Expert Judgements

Both product development and project planning often require making educated guesses. There are two models that I’ve seen for doing this. The first is to have a designated subject matter expert (SME) who provides The Answer. The second is to get a group of SMEs together to discuss and arrive at a consensus answer.

Under most circumstances, I’m not a fan of the first approach. Individuals are simply too fallible, too easily swayed by anecdotes rather than real data or too busy to consider a problem fully and in depth. The exception is when you are estimating work; then the SME is the person who will actually have to do the work, and their estimate is better than anyone else’s. The second solution often suffers from several other problems. Too many people working a problem can take a long time, answers can be driven by the most vocal or the most risk-averse members of the group and groups sometimes deliver bad answers due to diffusion of responsibility or other “group think” effects.

One solution to the problems of group decisions is to use Affinity Diagramming and the Analytical Hierarchy Process (AHP) to structure the problem, gather individual judgements of the alternatives, and then determine their relative values or importance. Affinity diagramming followed by AHP is incredibly powerful, and does a great job taking the individual bias out of the equation. I’ve used it with teams to rank the importance of product requirements, as a training exercise to accurately estimate the relative area of geometric shapes and personally to decide which car to buy. It works.

Unfortunately, AHP takes time, and for any but the simplest assessments, you really need custom software to support the process. Affinity diagramming can be done very effectively with Post-It notes, but the calculations of AHP cannot be easily set up in a program like Excel. It also requires discrete alternatives to choose from. For estimates of a single variable, such as lifetime, or other performance characteristics, I have had to develop a different approach.

The technique below works when you want to create a point estimate of a continuous variable. For instance, you might need to estimate product lifetime and establish a warrant period, or you might need to estimate a performance level that can be communicated to customers (implying a performance guarantee), or you might need to estimate the duration of a set of project activities. We can easily implement the calculations in Excel, R, or a similar tool.

Example

What we’re going to do is get the SMEs to individually estimate a range of possible results (low, most likely, high). Then we’ll generate triangular probability distributions from each estimate and finally combine those estimates. To get a robust result, we will account for the SME’s estimation of their own accuracy, and treat their different estimates discrete distributions when combining.

The below is the data entered by the SMEs.

Using the above data, we can calculate triangular probability distributions for each, then combine them by treating them as discrete distributions. This produces the sequence of distributions, below.

We can then summarize the combined distribution with some useful values:

most likely:	7087
likely low:	4360
likely high:	10711
50% between	5162	and	8370

The process

Bring a few SMEs together. While the technique will work with any number, usually five to six is more than sufficient, and ten should be considered the maximum for useful input. We’ll call this number N so we can refer to it later.
Define the problem. Everyone should agree precisely what you are estimating.
Agree on a basis for the evaluation. This might be an agreed conceptual model of how the product behaves or a set of criteria or goals to evaluate against.
Collect three estimates—low, most likely, high—from each SMEs for the variable of interest (call this variable X).

Collecting a range is important; we need to be honest with ourselves that we don’t know what the value will actually be: there is uncertainty in the SME judgement—if we had the data to be more precise then we’d use that—and there’s variation to consider. Ranges allow us to derive probabilistic estimates that represent both the limits of our SME’s knowledge and the natural variation.

Ask each SME for their assessment of three probabilities on a scale of 0% to 100%:
1. How likely are we to see the real data fall within this range if your reasoning or model is correct? This will usually be very high, like 95% or higher.
2. How likely are we to see the real data fall within this range even if your reasoning has some flaw? This may also be high, but you can use 50% if the answer is “I don’t know.”
3. How likely is it that the reasoning is correct? Again, this will usually be pretty high. That’s why they’re SMEs.

At this point, you have all the information that you need from the SMEs and can proceed to the calculations.

Using the three probabilities of the argument accuracy, calculate the values of X at end points for a triangular probability distribution.
For each SME’s guesstimates, use the triangle distributions to generate a large number of probabilities across the total range of possible Xs.
For each value of X used in the generation of the probability distributions, randomly choose a probability from one of the SMEs. For each value of X, we now have N probabilities that we treat as part of a discrete distribution.
Use the resulting combined distribution to determine values of interest, e.g. the median, fifth percentile, ninety-fifth percentile and so on.

From a diverse set of estimates, you now have a single, robust estimation for a variable. The resulting probability distribution will not be smooth, but you will be able to pull out single values that are meaningful and robust, such as the ninety-fifth percentile for duration estimates.

A few tips

How you ask for the estimates matters. Just asking for “low,” “middle” and “high” estimates will get you very inconsistent results. Likewise, asking for “worst-case” or “best-case” will often get you some pretty wild estimates. You want to ask questions like “how long would this take if many things went right,” “how long would the product last in more severe operating conditions” or “what is the most likely power available?” You don’t want the “middle” estimate, but a “most likely” estimate; you’re not shooting for half-way between the low and high estimates, but for a high-probability point estimate.

When some SMEs know more about the variable being estimated than others, you could also weight their judgements. This weighting is used in the second to last step by adjusting the probability of randomly selecting a value from each SME’s probability distribution.

Issue Logs and Risk Registers

Every product development project includes uncertainty over what will happen. The uncertainty—each assumption or best guess—reduces our chances of project success. The job of the project manager and team members is to ensure success by managing risk.

When something goes wrong—deviates from the plan—it stops being a risk and becomes an issue that must be addressed to ensure success. Issues are those conditions that are having a negative impact on your ability to execute the project plan. You can easily identify them because they directly cause schedule slippage and extra work.

There are two simple tools that can—and should—be used on every project to manage risks and issues to prevent disaster. One is the risk register; the other is the issue log. In my experience, these two documents are often conflated, but they are distinct documents that should contain different information and drive different actions.

The risk register is a means of capturing risks that we want to monitor over the life of the project so that we can take action before they have a negative impact on the project. These are conditions that you have decided not to explicitly work into the plan, but don’t want to let “slip under the radar” to create big issues for you later.

The issue log is where you record any problems that were not accounted for in the plan and that threaten to delay the project, push it off budget or reduce the scope (e.g. reduce product performance).

Issue Log	Risk Register
Description of the issue	Description of the risk
Underlying problem or cause of the issue	Risk profile—sources of uncertainty and the potential impact
Action plan	Potential actions
Priority or scheduling	Monitoring plan
Who is responsible for assuring this issue is resolved	Who is responsible for monitoring
Date opened and date resolved, sometimes a tracking number or other ID	Date last updated, tracking ID

Issue Log

The issue log is fundamentally about corrective actions. The project has deviated from the plan, and now we need to get back on course to complete the project on time, on budget and with the agreed goals. The issue log is used to capture this information.

While the cause of the problem is often obvious, it is always a good idea to probe for deeper, systemic causes that could lead to further delays. Asking “why?” five times in order to permanently and irrevocably fix a problem doesn’t take very long compared to the total delays that a project can experience.

Risk Register

The hard part of a risk register is the risk profile. Different people respond differently to risk, and some are more comfortable with thinking about uncertain outcomes than others. These differences between people lead to a lot of variation and debate in identifying risks; a good strategy for making risk registers easy is to standardize. The best practices are to focus on the causes of the risk and the probable impacts and to standardize the process.

There has been a lot written about risk management. Some of the best, in my opinion, is the work by De Meyer, Loch and Pich, which was first brought to my attention by Glenn Alleman over at the Herding Cats blog. In their excellent book, Managing the Unknown: A New Approach to Managing High Uncertainty and Risk in Projects, they break down risk into two major components: relationship complexity and task complexity.

When the relationships of stakeholders or partners are complex—groups aren’t aligned—then you can expect disagreements and conflict. Successful strategies for dealing with relationship complexity include increased communication and more rigidly defined relationships.

When tasks are complex—there are many links between tasks, so that changing one task can affect many, or there is a high degree of uncertainty in what needs to be done—then the successful strategies range from critical path management to an entrepreneurial approach of working multiple solutions in parallel (see also De Meyer 2001).

By implementing these pairings of source of risk with management strategy in a risk register template, we can greatly simplify the process and drive more consistent risk management results. Adding in a simple analysis of the impact can help us with prioritization (where do we spend our resources monitoring) and monitoring frequency.

Monitoring is all about how you will know when to do something about the risk. i.e. You want to decide in advance what condition will trigger you to transition this risk to the issue log. Measures should be relevant to the risk, quantitative where possible and the method of measurement should be clearly defined (you don’t want people disagreeing over the project plan just because they measure something differently). Set up measurement intervals that make sense by asking yourself how long you can go without knowing that you have a problem. Plot the results as a time series or on a control chart to allow you to distinguish between normal variation in the measurement and a condition that requires action.

References

Loch, Christoph H, Arnoud De Meyer, and Michael T Pich. Managing the Unknown. Hoboken, New Jersey: John Wiley & Sons, 2006. Print.
De Meyer, Arnoud, Christoph H Loch, and Michael T Pich. “Uncertainty and Project Management: Beyond the Critical Path Mentality.” 2001 : 1–23. Print.

Apple-Google Wage Fixing and Systems Thinking

It seems that some of the most successful people in the world, at some of the world’s largest and most respected companies, engaged in an apparently wide-spread and illegal effort to limit employees’ job opportunities and wages.

This was stupid. It was stupid for the obvious reason: it was illegal. It was stupid for a more insidious reason, though: it will backfire on the company. We can explore why if we use a couple of system archetypes to think about the situation. These archetypes will come in handy in a wide range of situations.

The problem was retaining talent, who could be easily enticed away by more attractive compensation packages or to work at more exciting companies. Either it was easier to get an attractive compensation packages at a competitor, or the work did not stay sufficiently interesting or engaging. Simply put: employees were not happy enough.

The fix was to limit recruitment among competing companies.

In systems thinking, this is a classic “shifting the burden” dynamic. In shifting the burden, pictured below, you have two types of solutions to the symptoms of a problem: the fundamental solution—a corrective action for the root cause—and the symptomatic solution. The symptomatic solution reduces the symptom, but also creates a side effect that has a negative impact on the fundamental problem. Symptomatic solutions only have a temporary benefit before things get worse.

The classic shifting the burden system archetype describes wage-fixing practices as solutions to employee turnover.

The green links, labelled “+,” indicate that the two conditions at either end of the tail increase and decrease together. The application of more symptomatic solution causes an increase in the side effect; reducing the use of symptomatic solutions causes a decrease in the side effect. The red links, labelled “–,” indicate that the two conditions at either end work in opposite directions. An increase in the side effect causes a decrease in the effectiveness of the fundamental solution.

There are three cycles, or loops, in this diagram. Two of them are “balancing loops;” over time, one factor tends to balance out the other, and the situation stabilizes. The third loop is a “self-reinforcing loop;” such loops will “snow-ball” or continue increasing over time:

Applying a symptomatic solution
increases the side effect
which reduces the effectiveness of the fundamental solution
which increases the symptom
which drives more application of the symptomatic solution
and increases the side effect.

The solution is to focus on fundamental solutions—get at the root cause—and avoid or limit reliance on symptomatic solutions. Symptomatic solutions are always temporary and usually make things worse in the long term; fundamental solutions are permanent and don’t have negative side effects.

Why employees leave is also explained, in broad strokes, by another system archetype: the limits to growth.

The classic limits to growth system archetype describes why it’s hard to keep employees engaged and hold down turnover.

Here we have a self-reinforcing dynamic created by successes and interesting work with good compensation. Over time, employees should generate more success and have more interesting work and better compensation. However, this is coupled to a balancing loop that becomes stronger over time, and slows down the reinforcing loop, like the brakes in a car. Employees burn out, or generally stop producing as much. This balancing loop is driven by some limiting condition, which makes the slowing action—burnout or disengagement—stronger over time. While this simple version looks like it will lead to a steady state, more realistic versions often result in a crash, where the results not only level off, but actually decrease.

The solution to a limits to growth system is to attack the limiting condition. If employees get bored doing the same thing over time, then you have to find a way for them to be engaged with enough new, interesting work. If they can earn substantially better compensation packages at competitors, then you have to (approximately) match those packages.

If you don’t fix the limiting condition, you might see a temporary improvement via the dynamics of shifting the burden (or the closely-related archetype, fixes that fail), but in the long term the problem will only get worse.

A Simple Introduction to the Graphing Philosophy of ggplot2

“The emphasis in ggplot2 is reducing the amount of thinking time by making it easier to go from the plot in your brain to the plot on the page.” (Wickham, 2012)

“Base graphics are good for drawing pictures; ggplot2 graphics are good for understanding the data.” (Wickham, 2012)

I’m not ggplot2’s creator, Hadley Wickham, but I do find myself in discussions trying to explain how to build graphs in ggplot2. It’s a very elegant system, but also very different from other graphing systems. Once you understand the organizing philosophy, ggplot2 becomes very easy to work with.

The grammar of ggplot2 graphics

There is a basic grammar to all graphics production. In R‘s base graphics or in Excel, you feed ranges of data to a plot as x and y elements, then manipulate colors, scale dimensions and other parts of the graph as graphical elements or options.

ggplot2’s grammar makes a clear distinction between your data and what gets displayed on the screen or page. You feed ggplot2 your data, then apply a series of mappings and transformations to create a visual representation of that data. Even with base graphics or Excel we never really plot the data itself, we only create a representation; ggplot2 makes this distinction explicit. In addition, ggplot2’s structure makes it very easy to tweak a graph to look the way you want by adding mappings.

A ggplot2 graph is built up from a few basic elements:

1.	Data		The raw data that you want to plot
2.	Geometries	`geom_`	The geometric shapes that will represent the data.
3.	Aethetics	`aes()`	Aesthetics of the geometric and statistical objects, such as color, size, shape and position.
4.	Scales	`scale_`	Maps between the data and the aesthetic dimensions, such as data range to plot width or factor values to colors

Putting it together, the code to build a ggplot2 graph looks something like:

data
+ geometry to represent the data,
+ aesthetic mappings of data to plot coordinates like position, color and size
+ scaling of ranges of the data to ranges of the aesthetics

A real example shows off how this all fits together.

library(ggplot2)
# Create some data for our example
some.data <- data.frame(timer = 1:12, 
                        countdown = 12:1, 
                        category = factor(letters[1:3]))
# Generate the plot
some.plot <- ggplot(data = some.data, aes(x = timer, y = countdown)) +
  geom_point(aes(colour = category)) +
  scale_x_continuous(limits = c(0, 15)) +
  scale_colour_brewer(palette = "Dark2") +
  coord_fixed(ratio=1)
# Display the plot
some.plot

Demonstration of the key concepts in the grammar of graphics: data, geometries, aesthetic mappings and scale mappings.

Here you can see that the data is passed to ggplot(), aesthetic mappings between the data and the plot coordinates, a geometry to represent the data and a couple of scales to map between the data range and the plot ranges.

More advanced parts of the ggplot2 grammar

The above will get you a basic graph, but ggplot2 includes a few more parts of the grammar that you’ll want to be aware of as you try to visualize more complex data:

5.	Statistical transformations	`stat_`	Statistical summaries of the data that can be plotted, such as quantiles, fitted curves (loess, linear models, etc.), sums and so o.
6.	Coordinate systems	`coord_`	The transformation used for mapping data coordinates into the plane of the data rectangle.
7.	Facets	`facet_`	The arrangement of the data into a grid of plots (also known as latticing, trellising or creating small multiples).
8.	Visual Themes	`theme`	The overall visual defaults of a plot: background, grids, axe, default typeface, sizes, colors, etc.

Hadley Wickham describes various pieces of this grammar in recorded presentations on Vimeo and YouTube and the online documentation to ggplot2. The most complete explanation is in his book ggplot2: Elegant Graphics for Data Analysis (Use R!) (Wickham, 2009).

References

Wickham, Hadley. ggplot2: Elegant Graphics for Data Analysis. Dordrecht, Heibelberg, London, New York: Springer, 2009. Print.
Wickham, Hadley. A Backstage Tour of ggplot2 with Hadley Wickham. 2012. Video. YouTube. Web. 21 Mar 2014. . Contributed by REvolutionAnalytics.

Normality and Testing for Normality

Many of our statistical tests make assumptions about the distribution of the underlying population. Many of the most common—ImR (XmR) and XbarR control charts, ANOVA, t-tests—assume normal distributions in the underlying population (or normal distributions in the residuals, in the case of ANOVA), and we’re often told that we must carefully check the assumptions.

At the same time, there’s a lot of conflicting advice about how to test for normality. There are the statistical tests for normality, such as Shapiro-Wilk or Anderson-Darling. There’s the “fat pencil” test, where we just eye-ball the distribution and use our best judgement. We could even use control charts, as they’re designed to detect deviations from the expected distribution. We are discouraged from using the “fat pencil” because it will result in a lot of variation from person to person. We’re often told not to rely too heavily on the statistical tests because they are not sensitive with small sample sizes and too sensitive to the tails. In industrial settings, our data is often messy, and the tails are likely to be the least reliable portion of our data.

I’d like to explore what the above objections really look like. I’ll use R to generate some fake data based on the normal distribution and the t distribution, and compare the frequency of p-values obtained from the Shapiro-Wilk test for normality.

A Function to test normality many times

First, we need to load our libraries

library(ggplot2)
library(reshape2)

To make this easy to run, I’ll create a function to perform a large number of normality tests (Shapiro-Wilk) for sample sizes n = 5, 10 and 1000, all drawn from the same data:

#' @name assign_vector
#' @param data A vector of data to perform the t-test on.
#' @param n An integer indicating the number of t-tests to perform. Default is 1000
#' @return A data frame in "tall" format
assign_vector <- function(data, n = 1000) {
  # replicate the call to shapiro.test n times to build up a vector of p-values
  p.5 <- replicate(n=n, expr=shapiro.test(sample(my.data, 5, replace=TRUE))$p.value)
  p.10 <- replicate(n=n, expr=shapiro.test(sample(my.data, 10, replace=TRUE))$p.value)
  p.1000 <- replicate(n=n, expr=shapiro.test(sample(my.data, 1000, replace=TRUE))$p.value)
  #' Combine the data into a data frame, 
  #' one column for each number of samples tested.
  p.df <- cbind(p.5, p.10, p.1000)
  p.df <- as.data.frame(p.df)
  colnames(p.df) <- c("5 samples","10 samples","1000 samples")
  #' Put the data in "tall" format, one column for number of samples
  #' and one column for the p-value.
  p.df.m <- melt(p.df)
  #' Make sure the levels are sorted correctly.
  p.df.m <- transform(p.df.m, variable = factor(variable, levels = c("5 samples","10 samples","1000 samples")))
  return(p.df.m)  
}

Clean, random data

I want to simulate real-word conditions, where we have an underlying population from which we sample a limited number of times. To start, I’ll generate 100000 values from a normal distribution. To keep runtimes low I’ll have assign_vector() sample from that distribution when performing the test for normality.

n.rand <- 100000
n.test <- 10000
my.data <- rnorm(n.rand)
p.df.m <- assign_vector(my.data, n = n.test)

We would expect that normally distributed random data will have an equal probability of any given p-value. i.e. 5% of the time we’ll see p-value ≤ 0.05, 5% of the time we’ll see p-value > 0.05 and ≤ 0.10, and so on through > 0.95 and ≤ 1.00. Let’s graph that and see what we get for each sample size:

ggplot(p.df.m, aes(x = value)) + 
  geom_histogram(binwidth = 1/10) + 
  facet_grid(facets=variable ~ ., scales="free_y") + 
  xlim(0,1) +
  ylab("Count of p-values") +
  xlab("p-values") +
  theme(text = element_text(size = 16))

Histogram of p-values for the normal distribution, for sample sizes 5, 10 and 1000.

This is, indeed, what we expected.

Now let’s compare the normal distribution to a t distribution. The t distribution would pass the “fat pencil” test—it looks normal to the eye:

ggplot(NULL, aes(x=x, colour = distribution)) + 
  stat_function(fun=dnorm, data = data.frame(x = c(-6,6), distribution = factor(1)), size = 1) + 
  stat_function(fun=dt, args = list( df = 20), data = data.frame(x = c(-6,6), distribution = factor(2)), linetype = "dashed", size = 1) + 
  scale_colour_manual(values = c("blue","red"), labels = c("Normal","T-Distribution")) +
  theme(text = element_text(size = 12),
        legend.position = c(0.85, 0.75)) +
  xlim(-4, 4) +
  xlab(NULL) +
  ylab(NULL)

Starting with random data generated from the t-distribution:

my.data <- rt(n.rand, df = 20)

Histogram of p-values for the t distribution, for sample sizes 5, 10 and 1000.

The tests for normality are not very sensitive for small sample sizes, and are much more sensitive for large sample sizes. Even with a sample size of 1000, the data from a t distribution only fails the test for normality about 50% of the time (add up the frequencies for p-value > 0.05 to see this).

Testing the tails

Since the t distribution is narrower in the middle range and has longer tails than the normal distribution, the normality test might be failing because the entire distribution doesn’t look quite normal; we haven’t learned anything specifically about the tails.

To test the tails, we can construct a data set that uses the t distribution for the middle 99% of the data, and the normal distribution for the tails.

my.data <- rt(n.rand, df = 20)
my.data.2 <- rnorm(n.rand)
# Trim off the tails
my.data <- my.data[which(my.data < 3 & my.data > -3)]
# Add in tails from the other distribution
my.data <- c(my.data, my.data.2[which(my.data.2 < -3 | my.data.2 > 3)])

Histogram of p-values for sample sizes 5, 10 and 1000, from a data set constructed from the t distribution in the range -3 to +3 sigmas, with tails from the normal distribution below -3 and above +3.

Despite 99% of the data being from the t distribution, this is almost identical to our test with data from just the normal distribution. It looks like the tails may be having a larger impact on the normality test than rest of the data

Now let’s flip this around: data that is 99% normally-distributed, but using the t distribution in the extreme tails.

my.data <- rnorm(n.rand)
my.data.2 <- rt(n.rand, df = 20)
# Trim off the tails
my.data <- my.data[which(my.data < 3 & my.data > -3)]
# Add in tails from the other distribution
my.data <- c(my.data, my.data.2[which(my.data.2 < -3 | my.data.2 > 3)])

Histogram of p-values for sample sizes 5, 10 and 1000, from a data set constructed from the normal distribution in the range -3 to +3 sigmas, with tails from the t-distribution below -3 and above +3.

Here, 99% of the data is from the normal distribution, yet the normality test looks almost the same as the normality test for just the t-distribution. If you check the y-axis scales carefully, you’ll see that the chance of getting p-value ≤ 0.05 is a bit lower here than for the t distribution.

To make the point further, suppose we have highly skewed data:

my.data <- rlnorm(n.rand, 0, 0.4)

This looks like:

For small sample sizes, even this is likely to pass a test for normality:

What have we learned?

With small sample sizes, everything looks normal.
The normality tests are, indeed, very sensitive to what goes on in the extreme tails.

In other words, if we have enough data to fail a normality test, we always will because our real-world data won’t be clean enough. If we don’t have enough data to reliably fail a normality test, then there’s no point in performing the test, and we have to rely on the fat pencil test or our own understanding of the underlying processes.

Don’t get too hung up on whether your data is normally distributed or not. When evaluating and summarizing data, rely mainly on your brain and use the statistics only to catch really big errors in judgement. When attempting to make predictions about future performance, e.g. calculating Cpk or simulating a process, recognize the opportunities for errors in judgment and explicitly state you assumptions.

Rewriting plot.qcc using ggplot2 and grid

The free and open-source R statistics package is a great tool for data analysis. The free add-on package qcc provides a wide array of statistical process control charts and other quality tools, which can be used for monitoring and controlling industrial processes, business processes or data collection processes. It’s a great package and highly customizable, but the one feature I wanted was the ability to manipulate the control charts within the grid graphics system, and that turned out to be not so easy.

I went all-in and completely rewrote qcc’s plot.qcc() function to use Hadley Wickham’s ggplot2 package, which itself is built on top of grid graphics. I have tested the new code against all the examples provided on the qcc help page, and the new ggplot2 version works for all the plots, including X-bar and R, p- and u- and c-charts.

In qcc, an individuals and moving range (XmR or ImR) chart can be created simply:

library(qcc)
my.xmr.raw <- c(5045,4350,4350,3975,4290,4430,4485,4285,3980,3925,3645,3760,3300,3685,3463,5200)
x <- qcc(my.xmr.raw, type = "xbar.one", title = "Individuals Chart\nfor Wheeler sample data")
x <- qcc(matrix(cbind(my.xmr.raw[1:length(my.xmr.raw)-1], my.xmr.raw[2:length(my.xmr.raw)]), ncol = 2), type = "R", title = "Moving Range Chart\nfor Wheeler sample data")

This both generates the plot and creates a qcc object, assigning it to the variable x. You can generate another copy of the plot with plot(x).

To use my new plot function, you will need to have the packages ggplot2, gtable, qcc and grid installed. Download my code from the qcc_ggplot project on Github, load qcc in R and then run source("qcc.plot.R"). The ggplot2-based version of the plotting function will be used whenever a qcc object is plotted.

library(qcc)
source("qcc.plot.R")
my.xmr.raw <- c(5045,4350,4350,3975,4290,4430,4485,4285,3980,3925,3645,3760,3300,3685,3463,5200)
x <- qcc(my.xmr.raw, type = "xbar.one", title = "Individuals Chart\nfor Wheeler sample data")
x <- qcc(matrix(cbind(my.xmr.raw[1:length(my.xmr.raw)-1], my.xmr.raw[2:length(my.xmr.raw)]), ncol = 2), type = "R", title = "Moving Range Chart\nfor Wheeler sample data")

Below, you can compare the individuals and moving range charts generated by qcc and by my new implementation of plot.qcc():

The qcc individuals chart as implemented in the qcc package.

The qcc individuals chart as implemented using ggplot2 and grid graphics.

The qcc moving range chart as implemented in the qcc package.

The qcc moving range chart as implemented using ggplot2 and grid graphics.

New features

In addition to the standard features in qcc plots, I’ve added a few new options.

size or cex: Set the size of the points used in the plot. This is passed directly to geom_point().
font.size: Sets the size of text elements. Passed directly to ggplot() and grid’s viewport().
title = element_blank(): Eliminate the main graph title completely, and expand the data region to fill the empty space. As with qcc, with the default title = NULL a title will be created, or a user-defined text string may be passed to title.
new.plot: If TRUE, creates a new graph (grid.newpage()). Otherwise, will write into the existing device and viewport. Intended to simplify the creation of multi-panel or composite charts.
digits: The argument digits is provided by the qcc package to control the number of digits printed on the graph, where it either uses the default option set for R or a user-supplied value. I have tried to add some intelligence to calculating a default value under the assumption that we can tell something about the measurement from the data supplied. You can see the results in the sample graphs above.

Lessons Learned

This little project turned out to be somewhat more difficult than I had envisioned, and there are several lessons-learned, particularly in the use of ggplot2.

First, ggplot2 really needs data frames when plotting. Passing discrete values or variables not connected to a data frame will often result in errors or just incorrect results. This is different than either base graphics or grid graphics, and while Hadley Wickham has mentioned this before, I hadn’t fully appreciated it. For instance, this doesn’t work very well:

my.test.data <- data.frame(x = seq(1:10), y = round(runif(10, 100, 300)))
my.test.gplot <- ggplot(my.test.data, aes(x = x, y = y)) + 
  geom_point(shape = 20)
index.1 <- c(5, 6, 7)
my.test.gplot <- my.test.gplot +
  geom_point(aes(x = x[index.1], y = y[index.1]), col = "red")
my.test.gplot

Different variations of this sometimes worked, or sometimes only plotted some of the points that are supposed to be colored red.

However, if I wrap that index.1 into a data frame, it works perfectly:

my.test.data <- data.frame(x = seq(1:10), y = round(runif(10, 100, 300)))
my.test.gplot <- ggplot(my.test.data, aes(x = x, y = y)) + 
  geom_point(shape = 20)
index.1 <- c(5, 6, 7)
my.test.subdata <- my.test.data[index.1,]
my.test.gplot <- my.test.gplot +
  geom_point(data = my.test.subdata, aes(x = x, y = y), col = "red")
my.test.gplot

Another nice lesson was that aes() doesn’t always work properly when ggplot2 is called from within a function. In this case, aes_string() usually works. There’s less documentation than I would like on this, but you can search the ggplot2 Google Group or Stack Overflow for more information.

One of the bigger surprises was discovering that aes() searches for data frames in the global environment. When ggplot() is used from within a function, though, any variables created within that function are not accessible in the global environment. The work-around is to tell ggplot which environment to search in, and a simple addition of environment = environment() within the ggplot() call seems to do the trick. This is captured in a stack overflow post and the ggplot2 issue log.

my.test.data <- data.frame(x = seq(1:10), y = round(runif(10, 100, 300)))
my.test.gplot <- ggplot(my.test.data, environment = environment(), aes(x = x, y = y)) + 
  geom_point(shape = 20)
index.1 <- c(5, 6, 7)
my.test.subdata <- my.test.data[index.1,]
my.test.gplot <- my.test.gplot +
  geom_point(data = my.test.subdata, aes(x = x, y = y), col = "blue")
my.test.gplot

Finally, it is possible to completely and seamlessly replace a function created in a package and loaded in that package’s namespace. When I set out, I wanted to end up with a complete replacement for qcc’s internal plot.qcc() function, but wasn’t quite sure this would be possible. Luckily, the below code, called after the function declaration, worked. One thing I found was that I needed to name my function the same as the one in the qcc package in order for the replacement to work in all cases. If I used a different name for my function, it would work when I called plot() with a qcc object, but qcc’s base graphics version would be used when calling qcc() with the parameter plot = TRUE.

unlockBinding(sym="plot.qcc", env=getNamespace("qcc"));
assignInNamespace(x="plot.qcc", value=plot.qcc, ns=asNamespace("qcc"), envir=getNamespace("qcc"));
assign("plot.qcc", plot.qcc, envir=getNamespace("qcc"));
lockBinding(sym="plot.qcc", env=getNamespace("qcc"));

Outlook

For now, the code suits my immediate needs, and I hope that you will find it useful. I have some ideas for additional features that I may implement in the future. There are some parts of the code that can and should be further cleaned up, and I’ll tweak the code as needed. I am certainly interested in any bug reports and in seeing any forks; good ideas are always welcome.

References

R Core Team (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.
Scrucca, L. (2004). qcc: an R package for quality control charting and statistical process control. R News 4/1, 11-17.
H. Wickham. ggplot2: elegant graphics for data analysis. Springer New York, 2009.
Wheeler, Donald. “Individual Charts Done Right and Wrong.” Quality Digest. 2 Feb 20102 Feb 2010. Print. <http://www.spcpress.com/pdf/DJW206.pdf>.