Understanding Data

When analyzing data, I have often found it useful to think of the data as being one of four main types, according to the typology proposed by Stevens.^[1] Different types of data have certain characteristics; understanding what type of data you have helps with selecting the analysis to perform while preventing basic mistakes.

The types, or “scales of measurement,” are:

Nominal: Data identifying unique classifications or objects where the order of values is not meaningful. Examples include zip codes, gender, nationality, sports teams and multiple choice answers on a test.
Ordinal: Data where the order is important but the difference or distance between items is not important or not measured. Examples include team rankings in sport (team A is better than team B, but how much better is open to debate), scales such as health (e.g. “healthy” to “sick”), ranges of opinion (e.g. “strongly agree” to “strongly disagree” or “on a scale of 1 to 10”) and Intelligence Quotient.
Interval: Numeric data identified by values where the degree of difference between items is significant and meaningful, but their ratio is not. Common examples are dates (we can say 2000 CE is 1000 CE + 1000 years, but 1000 CE is not half of 2000 CE in any meaningful way) and temperatures on the Celsius and Fahrenheit scales, where a difference of 10° is meaningful, but 10° is not twice as hot as 5°.
Ratio: Numeric data where the ratio between numbers is meaningful. Usually, such scales have a meaningful “0.” Examples include length, mass, velocity, acceleration, voltage, power, duration, energy and Kelvin-scale temperature.
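
The difference between interval and ratio scales can be checked numerically. The following is a minimal sketch in R, using the Celsius example above and the standard conversion to Kelvin; the variable names are only illustrative.

```r
# Two temperatures on the Celsius (interval) scale
celsius <- c(5, 10)

# The same temperatures on the Kelvin (ratio) scale
kelvin <- celsius + 273.15

# Differences are meaningful on both scales, and they agree
diff_celsius <- diff(celsius)
diff_kelvin  <- diff(kelvin)
isTRUE(all.equal(diff_celsius, diff_kelvin))

# Ratios only make physical sense on the Kelvin scale
ratio_celsius <- celsius[2] / celsius[1]  # 2, but 10 °C is not "twice as hot" as 5 °C
ratio_kelvin  <- kelvin[2] / kelvin[1]    # roughly 1.018
```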

The generally-appropriate statistics and mathematical operations for each type are summarized in Table 1.

Table 1: Scales of measurement and allowed statistical and mathematical operations.
Scale Type	Statistics	Operations
Nominal	mode, frequency, chi-squared, cluster analysis	=, ≠
Ordinal	above, plus: median, non-parametric tests, Kruskal-Wallis, rank-correlation	=, ≠, >, <
Interval	plus: arithmetic mean, some parametric tests, correlation, regression, ANOVA (sometimes), factor analysis	=, ≠, >, <, +, –
Ratio	plus: geometric and harmonic mean, ANOVA, regression, correlation coefficient	=, ≠, >, <, +, -, ×, ÷

While this is a useful typology for most uses, and certainly for initial consideration, there are valid criticisms of Stevens’ typology. For example, percentages and count data have some characteristics of ratio-scale data, but with additional constraints; the average of the counts $\overline{(2, 2, 1)} = 1.66\ldots$, for instance, may not be meaningful. This typology is a useful thinking tool, but it is essential to understand the statistical methods being applied and their sensitivity to departures from underlying assumptions.

Types of data in R

R^[2] recognizes at least fifteen different types of data. Several of these are related to identifying functions and other objects—most users don’t need to worry about most of them. The main types that industrial engineers and scientists will need to use are:

numeric

Real numbers. Also known as double, real and single (note that R stores all real numbers in double-precision). May be used for all scales of measurement, but is particularly suited to ratio scale measurements.

complex

Complex numbers, with real and imaginary parts, can be manipulated directly as a data type using

x <- 1 + 2i

or

x <- complex(real=1, imaginary=2)

Like type numeric, may be used for all scales of measurement.

integer

Stores integers only, without any decimal point. Used mainly for ordinal or interval data, but may be used as ratio data, such as counts, with some caution.

logical

Stores Boolean values of TRUE or FALSE, typically used as nominal data.

character

Stores text strings and can be used as nominal or ordinal data.

Types of variables in R

The above types of data can be stored in several types, or structures, of variables. The equivalent to a variable in Excel would be rows, columns or tables of data. The main ones that we will use are:

vector

Contains one or many elements, and behaves like a column or row of data. Vectors can contain any of the above types of data but each vector is stored, or encoded, as a single type. The vector

c(1, 2, 1, 3, 4)
## [1] 1 2 1 3 4

is, by default, a numeric vector of type double, but

c(1, 2, 1, 3, 4, "name")
## [1] "1" "2" "1" "3" "4" "name"

will be a character vector, or a vector where all data is stored as type character, and the numbers will be stored as characters rather than numbers. It will not be possible to perform mathematical operations on these numbers-stored-as-characters without first converting them to type numeric.
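
A short sketch of that conversion, using the same vector as above. The names x_num and total are only illustrative.

```r
# Mixing numbers and text coerces the whole vector to character
x <- c(1, 2, 1, 3, 4, "name")
class(x)  # "character"

# Convert back to numeric; entries that are not numbers become NA
# (as.numeric() warns about this, so the warning is suppressed here)
x_num <- suppressWarnings(as.numeric(x))

# Arithmetic now works, provided the NA is handled explicitly
total <- sum(x_num, na.rm = TRUE)  # 1 + 2 + 1 + 3 + 4 = 11
```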

factor

A special type of vector for categorical data, where the text strings signify factor levels and each element is encoded internally as an integer code that indexes into the set of levels. Factors can be treated as nominal data when the order does not matter, or as ordinal data when the order does matter.

factor(c("a", "b", "c", "a"), levels=c("a","b","c","d"))
## [1] a b c a  
## Levels: a b c d
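
To see how a factor is stored, it can help to look at the pieces separately. The sketch below reuses the factor above; the ratings vector is a made-up example of ordinal data.

```r
f <- factor(c("a", "b", "c", "a"), levels = c("a", "b", "c", "d"))

typeof(f)                # "integer": the internal storage type
codes  <- as.integer(f)  # 1 2 3 1: position of each element in the levels
lvls   <- levels(f)      # "a" "b" "c" "d"
counts <- table(f)       # occurrences per level, including the unused "d"

# An ordered factor behaves as ordinal data: comparisons are allowed
ratings <- factor(c("low", "high", "medium"),
                  levels = c("low", "medium", "high"),
                  ordered = TRUE)
ratings[1] < ratings[2]  # TRUE, because "low" comes before "high"
```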

array

A generalization of vectors from one dimension to two or more dimensions. Array dimensions must be defined up front, and an array can have any number of dimensions. Like vectors, all elements of an array must be of the same data type. (Note that the letters object used in the example below is a variable supplied by R that contains the letters a through z.)

# letters a - c in 2x4 array 
array(data=letters[1:3], dim=c(2,4))
##      [,1] [,2] [,3] [,4]  
## [1,] "a"  "c"  "b"  "a"  
## [2,] "b"  "a"  "c"  "b"

# numbers 1 - 3 in 2x4 array 
array(data=1:3, dim=c(2,4))
##      [,1] [,2] [,3] [,4]  
## [1,]    1    3    2    1  
## [2,]    2    1    3    2

matrix

A special type of array with the properties of a mathematical matrix. It may only be two-dimensional, having rows and columns, where all columns must have the same type of data and every column must have the same number of rows. R provides several functions specific to manipulating matrices, such as taking the transpose, performing matrix multiplication and calculating eigenvectors and eigenvalues.

matrix(data = rep(1:3, times=2), nrow=2, ncol=3)
##      [,1] [,2] [,3]  
## [1,]    1    3    2  
## [2,]    2    1    3
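
The matrix operations mentioned above can be sketched with base R functions, applied here to the same matrix. The intermediate names are only illustrative.

```r
m <- matrix(data = rep(1:3, times = 2), nrow = 2, ncol = 3)

# Transpose: a 3 x 2 matrix
m_t <- t(m)

# Matrix multiplication uses %*% (plain * is element-wise)
p <- m %*% m_t   # 2 x 2 symmetric matrix: rows (14, 11) and (11, 14)

# Eigenvalues and eigenvectors of the square matrix
e <- eigen(p)
e$values         # 25 and 3
e$vectors
```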

list

Vectors whose elements are other R objects, where each object of the list can be of a different data type, and each object can be of different length and dimension than the other objects. Lists can therefore store all other data types, including other lists.

list("text", "more", 2, c(1,2,3,2))
## [[1]]  
## [1] "text"  
##  
## [[2]]  
## [1] "more"  
##  
## [[3]]  
## [1] 2  
##  
## [[4]]  
## [1] 1 2 3 2

data.frame

For most industrial and data scientists, data frames are the most widely useful type of variable. A data.frame is the list analog to the matrix: it is an $m \times n$ list where all columns must be vectors of the same number of rows (determined with NROW()). However, unlike matrices, different columns can contain different types of data and each row and column must have a name. If not named explicitly, R names rows by their row number and columns according to the data assigned to the column. Data frames are typically used to store the sort of data that industrial engineers and scientists most often work with, and are the closest analog in R to an Excel spreadsheet. Usually data frames are made up of one or more columns of factors and one or more columns of numeric data.

data.frame(rnorm(5), rnorm(5), rnorm(5))
##     rnorm.5.  rnorm.5..1  rnorm.5..2  
## 1  0.2939566  1.28985202 -0.01669957  
## 2  0.3672161 -0.01663912 -1.02064116  
## 3  1.0871615  1.13855476  0.78573775  
## 4 -0.8501263 -0.17928722  1.03848796  
## 5 -1.6409403 -0.34025455 -0.62113545
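
In practice it is usually clearer to name the columns explicitly and to mix a factor column with numeric data. A minimal sketch follows; the column names, the seed and the distribution parameters are made up for illustration.

```r
set.seed(42)  # arbitrary seed, only so the example is repeatable

parts <- data.frame(
  shift  = factor(rep(c("day", "night"), each = 3)),  # nominal data
  weight = rnorm(6, mean = 108, sd = 2.7)             # ratio data
)

col_names   <- names(parts)          # "shift" "weight"
col_classes <- sapply(parts, class)  # "factor" and "numeric"
n_rows      <- NROW(parts)           # 6

# Columns are accessed by name with $
mean_weight <- mean(parts$weight)
```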

More generally, in R all variables are objects, and R distinguishes between objects by their internal storage type and by their class declaration, which are accessible via the typeof() and class() functions. Functions in R are also objects, and the users can define new objects to control the output from functions like summary() and print(). For more on objects, types and classes, see section 2 of the R Language Definition.

Table 2 summarizes the internal storage and R classes of the main data and variable types.

Table 2: Table of R data and variable types.
Variable type	Storage type	Class	Measurement Scale
vector of decimals	double	numeric	ratio
vector of integers	integer	integer	ratio or interval
vector of complex	complex	complex	ratio
vector of characters	character	character	nominal
factor vector	integer	factor	nominal or ordinal
matrix of decimals	double	matrix	ratio
data frame	list	data.frame	mixed
list	list	list	mixed
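
The storage type and class columns of Table 2 can be checked directly with typeof() and class(). The sketch below builds one small example object per row; the object names are only illustrative. In recent versions of R, class() on a matrix returns both "matrix" and "array", so only the first element is kept.

```r
objs <- list(
  decimals   = c(1.5, 2.5),
  integers   = 1:3,
  complex    = 1 + 2i,
  characters = c("a", "b"),
  factor     = factor(c("a", "b")),
  matrix     = matrix(c(1.5, 2.5, 3.5, 4.5), nrow = 2),
  data.frame = data.frame(x = 1:2),
  list       = list(1, "a")
)

# One row per object: internal storage type and (first) class
types <- data.frame(
  storage = sapply(objs, typeof),
  class   = sapply(objs, function(o) class(o)[1])
)
types
```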

References

Stevens, S. S. “On the Theory of Scales of Measurement.” Science. 103.2684 (1946): 677-680. Print.
R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

Introduction to R for Excel Users

As the saying goes, when all you have is a hammer, everything looks like a nail. Excel was designed to do simple financial analyses and to craft financial statements. Though its capabilities have been expanded over the years, it was never designed to perform the sort of data analysis that industry scientists, engineers and Six Sigma belts need to perform on a daily basis.

Most data analyses performed in Excel look more like simple financial spreadsheets than actual data analysis, and this quality of work translates into bad, or at least sub-optimal, business decisions. There are alternatives to Excel, and the free, open-source data analysis platform R is one of them.

Unfortunately, R has a steep learning curve. I’m offering, for free, a short primer on R [PDF] where I’ve sought to make that learning curve a little less painful for engineers and scientists who normally work in Excel.

Background

A couple of years ago, I was developing a short course to teach R to scientists and engineers in industry who normally used Excel. The goal was to help them transition to a more capable tool. My course design notes morphed into a handout, and when plans for the course fell through, that handout grew into a self-study guide, which I later adapted into this seventy-page, stand-alone introduction for Excel users.

Organization

The primer walks the reader through the basics of R, starting with a brief overview of capabilities, then diving into installation, basic operations, graphical analysis and basic statistics. I believe that a picture is worth a thousand words, so it’s light on text and heavy on examples and visuals.

The end of the book rounds out with a look at some of the most useful add-ons, the briefest of introductions to writing your own, custom functions in R, and a cross-reference of common Excel functions with their equivalents in R.

The text is broken up into chapters and fully indexed so that it can be used either as a walk-through tutorial or as a quick reference.

Update to plot.qcc using ggplot2 and grid

Two years ago, I blogged about my experience rewriting the plot.qcc() function in the qcc package to use ggplot2 and grid. My goal was to allow manipulation of qcc’s quality control plots using grid graphics, especially to combine range charts with their associated individuals or moving range charts, as these two diagnostic tools should be used together. At the time, I posted the code on my GitHub.

I recently discovered that the update to ggplot2 v2.0 broke my code, so that attempting to generate a qcc plot would throw an obscure error from someplace deep in ggplot2. The fix turned out to be pretty easy. The original code used aes_string() instead of aes() because of a barely-documented problem of calling aes() inside a function. It looks like this has been quietly corrected with ggplot2 2.0, and aes_string() is no longer needed for this.

The updated code is up on GitHub. As before, load the qcc library, then source() qcc.plot.R. For the rest of the current session, calls to qcc() will automatically use the new plot.qcc() function.

Sample Size Matters: Design and Cost

We’ve seen in the previous posts that in designing products we need to know characteristics like the mean and standard deviation of the population, but are limited to only being able to measure sample means and standard deviations. This leaves us with uncertainty in our knowledge of population characteristics, and that uncertainty directly impacts our ability to make better products. In this post, we’ll see how business financial requirements and estimation uncertainties due to sample size interact both to further limit our available design options and to drive up our sample size requirements.

Impact on specifications

Looking back at our graph of Cpk, Cpk values below the target value (blue line) increase production and sales costs through increased rework, scrap and warranty. Above the blue line, we’ve added product or production costs by over-designing the product or process. Since the price of a product is determined by the market, any increase in cost decreases our gross profit margin:

$\text{Gross Profit} = f\left(\text{Price}, \text{Cost} \right) = \text{Price} - \text{Cost}$

As outlined in the first post of this series, we are going to cut material costs by 10% on a part that had to weigh at least 100 kg. That was a $6 reduction in costs on a $120 part.

Our first issue is that we have to be sure that we have a good baseline for improvement. If the existing parts are very different than our expectations, we may be creating more trouble by making changes. We also don’t know how much variation there is in part weight.

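We collect production data over a week and determine that the current mean part weight is, as expected, 120 kg with a standard deviation of 6.7 kg. With 120 kg of material and the customer's lower limit of 100 kg, we calculate a Cpk of

$Cpk = \frac{\mu - LSL}{3\sigma} = \frac{120 - 100}{3 \times 6.7} \approx 1.0$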

We calculate that we have to remove $6/0.5 = 12$ kg, reducing mean weight from 120 kg to 108 kg.

Any single product below 100 kg runs the risk of being rejected by the customer, possibly at great cost (e.g. they may require special field service on older parts, or decide to buy from a competitor in the future, or both), so we don’t want to have a higher defect rate with the new product and process than with the old, because this will increase labor and overhead costs. With the new product and process, we want to target a standard deviation of at most

$\sigma = \frac{\mu - LSL}{3 \times Cpk} = \frac{108 - 100}{3 \times 1.0} \approx 2.7$ kg

We might stop there, and say that when we have a design for 108 kg and prototypes that weigh on average 108 kg with a standard deviation of up to 2.7 kg, we’re done. Our specification now looks like this:

	Minimum	Maximum	Target
Part Weight	100	108	?
Standard Deviation	0	2.7	?
Cpk	1.0	?	?
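
The arithmetic behind these naive targets can be sketched in R. The numbers come from the text; the variable names are only illustrative, and the Cpk uses only the lower specification limit because no upper limit is given.

```r
lsl           <- 100   # customer minimum weight, kg
current_mean  <- 120   # measured mean weight, kg
current_sd    <- 6.7   # measured standard deviation, kg
material_cost <- 60    # material cost per part
cost_cut      <- 6     # required material saving per part (10%)

# Current capability against the lower limit
cpk_current <- (current_mean - lsl) / (3 * current_sd)  # about 1.0

# Material cost per kg, and the mass that must be removed to save 6
cost_per_kg  <- material_cost / current_mean            # 0.5
mass_removed <- cost_cut / cost_per_kg                  # 12 kg
new_mean     <- current_mean - mass_removed             # 108 kg

# Largest standard deviation that keeps Cpk at 1.0 with the new mean
sigma_max <- (new_mean - lsl) / (3 * 1.0)               # about 2.7 kg
```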

However, there would be substantial risk that we would not achieve our goals of both meeting the customer requirement of 100 kg and reducing material costs by 10%. Using these numbers as our target, we have a 50% chance that we will be over the cost target, and a 50% chance that our defect rate will be higher than target.

In order to meet customer requirements, we want to be confident that all parts weigh at least 100 kg. In order to meet business needs, we have to be 95% confident that at least half of our product weighs at most 108 kg.

For the customer requirement, we need to calculate the Cpk. In the past, “all” product really meant a Cpk of 1.0, or 93% of product. To calculate this we need our 95% confidence estimate of the mean, $\overline{X}_{\text{lower 95\%}}$ , and our 95% confidence estimate of the standard deviation, $S_{\text{upper 95\%}}$ .

For the business requirement, we need the confidence bound on our estimate of the mean, $\overline{X}_{\text{upper 95\%}}$ .

Now we need to design and build our prototypes. How many parts do we build and weigh? Recognizing that there will be uncertainty in our estimate of $\mu$ and $\sigma$ from such trials, we cannot simply calculate $\overline{X}$ and $S$ and then calculate the estimated Cpk based on the sample, since there is a 50% chance that our products will be worse than we measure from our study. We have to be more careful with our customer base than that.

We have to use the confidence bounds on $\overline{X}$ and $S$ :

[Equations not reproduced: calculation of $S_{\text{upper 95\%}}$ , $\overline{X}_{\text{upper 95\%}}$ and $\overline{X}_{\text{lower 95\%}}$ ]
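
The exact equations used here are not recoverable from the text. As a rough sketch only, the standard one-sided normal-theory bounds (a t-based bound on the mean and a chi-squared bound on the standard deviation) could be computed as follows. The function confidence_bounds() and the example values are assumptions made for illustration, not necessarily the calculation behind the graphs.

```r
# One-sided confidence bounds for a sample mean and standard deviation,
# assuming normally distributed data (textbook formulas, hypothetical helper)
confidence_bounds <- function(xbar, s, n, conf = 0.95) {
  half_width <- qt(conf, df = n - 1) * s / sqrt(n)
  s_upper    <- s * sqrt((n - 1) / qchisq(1 - conf, df = n - 1))
  c(xbar_lower = xbar - half_width,
    xbar_upper = xbar + half_width,
    s_upper    = s_upper)
}

# Illustrative values: 10 prototypes, mean 104 kg, standard deviation 1 kg
b <- confidence_bounds(xbar = 104, s = 1, n = 10)
b

# Pessimistic Cpk against the 100 kg lower limit, using the bounds
cpk_lower <- (b[["xbar_lower"]] - 100) / (3 * b[["s_upper"]])
```

The pessimistic Cpk is noticeably lower than the value computed directly from the sample mean and standard deviation, which is the effect the text describes.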

These equations are easier to understand if we graph them for several values of $S$ .

This graph shows the maximum and minimum possible $\overline{X}$ to assure compliance with customer requirements on both weight (solid green line) and the cost targets (dashed blue line), for four different values of $S$ . Red regions indicate that both sets of requirements cannot be met; green shaded regions indicate possible $\overline{X}$ that meet both sets of requirements.

As can be seen, while we calculated a naive target for standard deviation, $\sigma$ , of 2.7 kg, the measured sample standard deviation, $S$ , must be much smaller to assure that we meet requirements. Likewise, small sample sizes can make it impossible to assure that we meet requirements.

We can now amend our requirements:

	Minimum	Maximum	Target
Part Weight	100	108	105.5
Standard Deviation	0	2.7	(from Cpk)
Cpk	1.0	f(n, S)	1.0
$\overline{X}$	f(n, S)	f(n, S)	104
$S$	0	1.2	1.0
n (for sampling)	10	100	10

Not only do we have to design our product and process to be more stringent than the naive requirements, we have to test more than we might otherwise wish to.

Importantly, our specification now contains the tolerance ranges on the weight, the standard deviation of the weight, and the Cpk. This is the minimum set of information that we need to fully specify a part. For the purposes of testing and checking short-term process performance, we also need to specify the number of samples to collect and sample mean and standard deviation.

UPDATE 2016-08-25: Equations were no longer rendering correctly; this was fixed.

Sample Size Matters: Design and Experiments

Previously, I introduced the idea that samples do not look exactly like the populations that they are drawn from, and had a closer look at what impact sample size has on our ability to estimate population statistics like mean, proportion or Cpk from samples. Here, I will have a closer look at how this uncertainty impacts our engineering process. In the next post, I will tie in the engineering impacts and decisions to the business value and costs.

Difference to detect

When we are testing, we’re either testing to determine that the new product or process performs better than the old or, for cost reduction projects, that the cheaper product or process is at least as good as the existing one.

This means that we need to detect a difference between old and new values, such as a difference in the mean weight between the new and old parts. The larger the sample size, $n$ , the smaller the difference, $\Delta$ , that we can detect. The error, $\epsilon$ , in our estimate of the differences gets smaller as sample size increases:

$\epsilon_{\Delta} \propto \frac{\sigma}{\sqrt{n}}$

Given the uncertainties in our estimate of $\mu$ and $\sigma$ , illustrated above, it should be clear now that with small sample sizes we can only detect large differences of many multiples of the sample standard deviation, $S$ .

Mean

When trying to determine if a new product or process is better than an old one, we are usually interested in shifting the mean. We want a product to be lighter, provide more power, or a process to work faster. In such cases, we need to estimate the difference of the means, $\Delta = \mu_2 - \mu_1$ and ensure that it is different than 0 (or some other pre-determined value). The minimum difference that we can reliably detect is plotted below for different sample sizes.
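
One way to estimate the minimum detectable difference is power.t.test() from R's stats package, which solves for the difference when it is left unspecified. The sketch below assumes a two-sample t-test, a 5% significance level and 80% power; these settings are assumptions for illustration and may differ from those behind the plot. With sd = 1, the result is expressed in multiples of the standard deviation.

```r
n <- c(3, 5, 10, 30, 100)  # sample size per group

# Smallest difference in means detectable with 80% power,
# in units of the standard deviation
delta <- sapply(n, function(k) {
  power.t.test(n = k, sd = 1, sig.level = 0.05, power = 0.8)$delta
})

data.frame(n = n, detectable_delta = round(delta, 2))
```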

Standard deviation

In many Six Sigma projects, and any time we want to shift the mean closer to a specification limit, we need to compare the new population standard deviation with the old. The simplest way of making this comparison is by taking the ratio $F = \sigma_{2}^{2} / \sigma_{1}^{2}$ , where $\sigma_{2}^{2}$ is the larger of the two variances. The dependence on sample size is illustrated below.

You can see from the inset plot, which includes sample sizes of 2 and 3, that small sample sizes really hurt comparisons of variance, and that interesting differences in variance can’t be detected until we have more than 10 samples.
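
As a rough sketch of the same idea, the critical value of the F distribution gives the smallest variance ratio that a one-sided test at the 5% level would flag as significant. Equal sample sizes in both groups and the 5% level are assumptions made for illustration.

```r
n <- c(2, 3, 5, 10, 30, 100)  # sample size in each group

# Smallest ratio of sample variances that is significant at the 5% level
f_crit <- qf(0.95, df1 = n - 1, df2 = n - 1)

data.frame(n = n, min_variance_ratio = round(f_crit, 2))
```

With two samples per group the variances must differ by a factor of more than 100 before the difference is detectable; with ten per group the factor drops to roughly 3.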

Proportions

Proportions, such as fraction of defective parts between a new and old design, can be compared by looking at the difference between the two proportions, $\Delta = \left| p_1 - p_0 \right|$ .

You can see from this that proportions data provides much less information than variable data; we need much larger sample sizes to achieve usefully small $\Delta$ .
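
The sample sizes involved can be estimated with power.prop.test() from R's stats package. The defect rates below, and the 5% significance level and 80% power, are made-up values for illustration.

```r
# Parts needed per group to detect a drop in defect rate from 10% to 5%
n_halved <- power.prop.test(p1 = 0.05, p2 = 0.10,
                            sig.level = 0.05, power = 0.8)$n

# A smaller improvement, from 10% to 8%, needs far more parts
n_small <- power.prop.test(p1 = 0.08, p2 = 0.10,
                           sig.level = 0.05, power = 0.8)$n

ceiling(c(halved = n_halved, small_change = n_small))
```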

Summary and look forward

When designing experiments, the goal is to detect some difference between two populations. The uncertainty in our measurements and the variation in the parts have a big impact on how many parts we need to test, and can greatly limit what we can learn from an experiment.

Next time, I’ll show how these calculations of sample size and uncertainty impact the business.

Sample Size Matters: Uncertainty in Measurement

In my previous post, I gave a brief introduction to populations and samples, and stated that sample size impacts our ability to know what a population really looks like. In this post, I want to show this relationship in more detail. In future posts, I will look at how sample size considerations impact our engineering process and what impacts this has on the business.

Mean and sample size

The error in our estimate of the mean, $E$ , is proportional to the standard deviation of the sample, $S$ , and inversely proportional to the square root of the sample size, $n$ .

$E \propto \frac{S}{\sqrt{n}}$

We can visualize this easily enough by plotting the 95% confidence interval. When we sample and calculate the sample mean ( $\overline{X}$ ), the true population mean, $\mu$ , (what we really want to know) is likely to be anywhere in the shaded region of the graph below.

This graph shows the 95% confidence region for the true population mean, $\mu$ ; there’s a 95% chance that the true population mean is within this band. The “0” line on the y axis is our estimate of the mean, $\overline{X}$ . We can’t know what the true population mean is, but it’s clear that if we use more samples, we can be sure that our estimate is closer to the true mean.
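
A band like this can be approximated from the t distribution. The sketch below computes the half-width of a two-sided 95% confidence interval for the mean in units of $S$ ; it assumes normally distributed data and may not match the plotted band exactly.

```r
n <- c(2, 3, 5, 10, 30, 100)

# Half-width of the 95% confidence interval for the mean, in units of S:
# the true mean is expected within xbar +/- half_width * S
half_width <- qt(0.975, df = n - 1) / sqrt(n)

data.frame(n = n, half_width_in_S = round(half_width, 2))
```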

Standard deviation and sample size

Likewise, when we calculate the sample standard deviation, $S$ , the true standard deviation, $\sigma$ , has a 95% chance of being within the confidence band below. For small sample sizes (roughly less than 10), the measured standard deviation can be off from the true standard deviation by several times. Even for ten samples, the potential error is nearly $\pm 1$ standard deviation.
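
The corresponding bounds for the standard deviation come from the chi-squared distribution. As a sketch under the usual assumption of normally distributed data, the limits below are expressed as multiples of the measured $S$ .

```r
n <- c(3, 5, 10, 30, 100)

# Two-sided 95% confidence limits for sigma, as multiples of S
lower <- sqrt((n - 1) / qchisq(0.975, df = n - 1))
upper <- sqrt((n - 1) / qchisq(0.025, df = n - 1))

data.frame(n = n, lower = round(lower, 2), upper = round(upper, 2))
```

For ten samples the true standard deviation could plausibly be anywhere from about 0.69 to about 1.83 times the measured value.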

Proportion and sample size

For proportions, the situation is similar: there is a 95% chance that the true population proportion, $p$ , is within the shaded band based on the measured sample proportion $\hat{p}$ . Since this confidence interval depends on $\hat{p}$ and cannot be standardized the way $\mu$ and $\sigma$ can be, confidence intervals for two different proportions are plotted.

For small $n$ , proportions data tells us very little.
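
To put numbers on this, binom.test() in R returns an exact (Clopper-Pearson) confidence interval for a proportion. The counts below are made up for illustration, and the interval method used for the plot may differ.

```r
# 1 defect in 10 parts versus 10 defects in 100 parts: both give p-hat = 0.1
ci_small <- binom.test(x = 1,  n = 10)$conf.int
ci_large <- binom.test(x = 10, n = 100)$conf.int

rbind(n_10 = ci_small, n_100 = ci_large)
```

With ten parts, the interval runs from well under 1% to over 40%, so the sample says almost nothing about the true defect rate.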

Process capability and production costs

The cost of poor quality in product or process design can be characterized by the Cpk:

$Cpk = \mathrm{minimum} \begin{cases}\frac{USL - \mu}{3\sigma} \\\frac{\mu - LSL}{3\sigma}\end{cases}$

where USL is the upper specification limit (also called the upper tolerance) and LSL is the lower specification limit (or lower tolerance).

We can estimate the defect rate (defects per opportunity, or DPO) from the Cpk:

$DPO = 1 - \Pr\left(X < 3 \times Cpk - 1.5\right)$

That probability function is calculated in R with pnorm(3 * Cpk - 1.5) and in Excel with NORMSDIST(3 * Cpk - 1.5). The 1.5 is a typical value used to account for uncorrected or undetected process drift.
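
Both formulas are easy to wrap in small helper functions. The names cpk() and dpo() below are hypothetical, chosen for this sketch; a missing specification limit is treated as infinite so that only the other side counts.

```r
# Process capability index from the mean, standard deviation and limits
cpk <- function(mu, sigma, lsl = -Inf, usl = Inf) {
  min(usl - mu, mu - lsl) / (3 * sigma)
}

# Defects per opportunity, with the conventional 1.5 sigma drift allowance
dpo <- function(cpk, shift = 1.5) {
  1 - pnorm(3 * cpk - shift)
}

cpk(mu = 120, sigma = 6.7, lsl = 100)  # about 1.0
dpo(1.0)                               # about 0.067, i.e. roughly 93% good product
dpo(1.67) * 1e6                        # defect rate in parts per million
```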

Since we don’t know $\mu$ and $\sigma$ , we have to substitute $\overline{X}$ and $S$ . The uncertainty in these estimates of the population $\mu$ and $\sigma$ means that we have uncertainty in what the true process Cpk (or defect rates) will be once we’re in production. When our sample testing tells us that the Cpk should be 1.67 (the blue line), the true process Cpk will actually turn out to be somewhere in the shaded band:

Below the blue line, our product or process is failing to meet customer expectations, and will result in lost customers or higher warranty costs. Above the blue line, we’ve added more cost to the production of the product than we need to, reducing our gross profit margin. Since that gray band doesn’t completely disappear, even at 100 samples, we can never eliminate these risks; we have to find a way to manage them effectively.

The impact of this may be more evident when we convert from Cpk to defect rates (ppm):

Summary and a look forward

With a fair sampling process, samples will look similar to—and statistically indistinguishable from—the population that they were drawn from. How much they look like the population depends critically on how many samples are tested. The uncertainties, or errors in our estimates, resulting from sample size decisions have impacts all through our design analysis and production planning.

In the next post, I will explore in more detail how these uncertainties impact our experiment designs.

Sample Size Matters

I find that Six Sigma and Design for Six Sigma courses are often eye-opening experiences for participants. There is an experience of discovering that there are tools available to answer problems that have vexed them, and learning that good engineering and science decisions can lead directly to good business outcomes through logical steps.

One of the most remarkable such moments is when students realize the importance of sample size. In the best cases, there is a forehead-slapping moment where the student realizes that much of the testing they’ve done in the past has probably been a complete waste of time; that while they thought they were seeing interesting differences and making good decisions, they were in fact only fooling themselves by comparing too-small data sets.

I want to show in the next few blog posts why sample size matters, both from a technical perspective and from a business perspective.

Design example

Throughout the next few posts, I’ll use the example of a manufactured product that the customer requires to weigh at least 100 kg, that sells for about $140 and that costs $120 to manufacture and convert to a sale (the cost of goods sold, or COGS, is $120).

		Amount
Sales		140
COGS		120
Material	60
Labor and Overhead	60
—	—	—
Gross Profit		20

We want to develop a new version of the product, using a modified design and a new process that, by design, will reduce the cost of material by 10%. The old cost of material was 50% of COGS, or $60. To achieve the material cost reduction of 10%, we have to remove $6 in material costs, improving gross profit to $26.

We believe that the current design masses 120 kg, so we estimate that our new part mass should be $120 - 0.1 \times 120 = 108$ kg.

	Current Design	New Design Target
Part Weight	120	108
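
The cost and weight targets follow from simple arithmetic, sketched here in R with the numbers from the example. The variable names are only illustrative.

```r
price          <- 140
material       <- 60
labor_overhead <- 60

# Current position
cogs         <- material + labor_overhead  # 120
gross_profit <- price - cogs               # 20

# A 10% cut in material cost
material_saving  <- 0.10 * material             # 6
new_gross_profit <- gross_profit + material_saving  # 26

# Naive weight target: remove 10% of the current 120 kg
new_mass <- 120 - 0.1 * 120                # 108 kg
```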

Seems like we might be done at this point, and I’ve seen plenty of engineering projects that stop here. Unfortunately, this isn’t the whole story. Manufacturing will be unable to produce parts of exactly 108 kg, so they’ll need a tolerance range to check parts against. We have that customer requirement for at least 100 kg, so any variation has to stay above that. We also want to save money relative to the current design, so we don’t want many parts to weigh much more than this, especially since the customer isn’t really willing to pay us for the “extra” material beyond 100 kg.

Population versus sample statistics

Most of process or product improvement is concerned with reducing the standard deviation, $\sigma$ , shifting the mean (a.k.a. average), $\mu$ , or reducing a proportion, $p$ , of a process or product characteristic. These summary statistics refer to the population characteristics—the mean, standard deviation or proportion of all parts of a certain design that will ever be produced, or all times that a production step will ever be completed in the intended manner.

Since we can’t measure the whole population up front—we will be producing parts for a long time—we have to draw a sample from the population, and use the statistics of that sample to gain insight into the total population. We can visualize this, somewhat crudely, with the following:

We can imagine that the blue circles are conforming parts, and the orange octagons are non-conforming parts. If the sampling process is fair, then the sample proportion $\hat{p}$ will be close to, and statistically indistinguishable from, the true population proportion $p$ . In the population we have 44 parts total, 8 defective parts and 36 conforming parts. In the sample that we drew, we have 10 parts total, 9 conforming and 1 defective. While $p = 8/44 \approx 0.18 \ne \hat{p} = 1/10$ , statistically we have

library(magrittr)  # provides the %>% pipe

matrix(c(1, 8, 10-1, 44-8), ncol=2) %>% 
  chisq.test(simulate.p.value = TRUE)

## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  matrix(c(1, 8, 10 - 1, 44 - 8), ncol = 2)
## X-squared = 0.3927, df = NA, p-value = 0.6692

With such a high p-value (0.67), we fail to reject the null hypothesis that $\hat{p} = p$ ; in more colloquial terms, we conclude that the apparent difference between 8/44 and 1/10 is only due to random errors in sampling. (For larger counts of successes and failures, prop.test() would also work and would be more informative.)

From our perspective, of course, we don’t know what the population looks like. We don’t have any way of knowing with certainty—or accessing data about—future performance, so there is no way for us to know what the total population looks like. In lieu of population data, we develop a sampling process that allows us to fairly draw a sample from that population.

While we want to know the true population mean, $\mu$ , the true population standard deviation, $\sigma$ , or the true population proportion $p$ , we can only calculate the sample mean, $\overline{X}$ , the sample standard deviation, $S$ , or the sample proportion $\hat{p}$ .

From the known sample, we then reason backward to what the true population looks like. This is where statistics comes into play; statistics allows us to place rigorous boundaries on what the population may look like, without fooling ourselves. Sample size is critical to controlling the uncertainty in these boundaries.

Summary and a look forward

Testing in product development—and usually in production—involves sampling a product or process. Samples never look exactly like the population that we are concerned about, but if the sampling process is fair then the samples will be statistically indistinguishable from the population. With due awareness of the statistical uncertainties, we can use samples to make decisions about the population.

In the next post, I will look at how sample size impacts the uncertainty in our estimation of population statistics like the mean and standard deviation. In a later post, I will look at how this uncertainty impacts the business.

A short aside on statistical tests for proportions

The usual way to compare two proportions would be a proportions test (prop.test() in R), but because we have so few samples to compare, the results may be unreliable and prop.test() generates an appropriate warning. fisher.test() provides an exact estimate of the p-value, but the assumptions are violated with data like this, where we are sampling a fixed number of parts (i.e. row sums are fixed, but column sums are not controlled). This leaves us with using a chi-squared test (chisq.test() in R) which is less informative but does the job. Either the Barnard test or Bayesian estimation based on Monte Carlo simulation would be more informative and possibly more robust.
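
For comparison, the alternatives mentioned here can be run on the same table of counts. This is only a sketch: the warning from prop.test() about the small counts is suppressed so that the p-values can be compared side by side, and the caveats above about each test's assumptions still apply.

```r
# Rows: sample and population; columns: defective and conforming
tab <- matrix(c(1, 8, 10 - 1, 44 - 8), ncol = 2)

p_fisher <- fisher.test(tab)$p.value
p_prop   <- suppressWarnings(prop.test(tab)$p.value)

c(fisher = p_fisher, prop = p_prop)
```

None of these tests gives a p-value anywhere near a conventional 5% threshold, so the conclusion does not depend on which test is used.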

The Most Useful Data Plot You’ve Never Used

Those of us working in industry with Excel are familiar with scatter plots, line graphs, bar charts, pie charts and maybe a couple of other graph types. Some of us have occasionally used the Analysis ToolPak to create histograms that don’t update when our data changes (though there is a way to make dynamic histograms in Excel; perhaps I’ll cover this in another blog post).

One of the most important steps in data analysis is to just look at the data. What does the data look like? When we have time-dependent data, we can lay it out as a time-series or, better still, as a control chart (a.k.a. “natural process behavior chart”). Sometimes we just want to see how the data looks as a group. Maybe we want to look at the product weight or the cycle time across production shifts.

Unless you have Minitab, R or another good data analysis tool at your disposal, you have probably never used—maybe never heard of—boxplots. That’s unfortunate, because boxplots should be one of the “go-to” tools in your data analysis tool belt. It’s a real oversight that Excel doesn’t provide a good way to create them.

For the purpose of demonstration, let’s start with creating some randomly generated data:
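The code that generated df is not shown in this post. A minimal sketch that produces a data frame of the same shape (400 rows in a single group) might look like the following; the seed and the choice of a standard normal distribution are assumptions, so the values will not match the output shown.

```r
set.seed(1234)  # hypothetical seed; not the one used for the post

# 400 draws from a standard normal distribution, all labelled "group1"
df <- data.frame(variable = factor(rep("group1", 400)),
                 value    = rnorm(400))
```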

head(df)

##   variable   value
## 1   group1 -1.5609
## 2   group1 -0.3708
## 3   group1  1.4242
## 4   group1  1.3375
## 5   group1  0.3007
## 6   group1  1.9717

tail(df)

##     variable   value
## 395   group1  1.4591
## 396   group1 -1.5895
## 397   group1 -0.4692
## 398   group1  0.1450
## 399   group1 -0.3332
## 400   group1 -2.3644

If we don’t have much data, we can just plot the points:

library(ggplot2)

ggplot(data = df[1:10,]) +
  geom_point(aes(x = variable, y = value)) +
  coord_flip() +
  theme_bw()

But if we have lots of data, it becomes hard to see the distribution due to overplotting:

ggplot(data = df) +
  geom_point(aes(x = variable, y = value)) +
  coord_flip() +
  theme_bw()

We can try to fix this by changing some parameters, like adding semi-transparency (alpha blending) and using an open plot symbol, but for the most part this just makes the data points harder to see; the distribution is largely lost:

ggplot(data = df) +
  geom_point(aes(x = variable, y = value), alpha = 0.3, shape = 1) +
  coord_flip() +
  theme_bw()

The natural solution is to use histograms, another “go-to” data analysis tool that Excel doesn’t provide in a convenient way:

ggplot(data = df) +
  geom_histogram(aes(x = value), binwidth = 1) +
  theme_bw()

But histograms don’t scale well when you want to compare multiple groups; the histograms get too short (or too narrow) to really provide useful information. Here I’ve broken the data into eight groups:
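The regrouping step is likewise not shown. One way to relabel 400 values into eight equal groups of 50, consistent with row 1 falling in group1 and row 400 in group8, is sketched below. The data frame is regenerated here with an assumed seed so that the block runs on its own.

```r
set.seed(1234)  # hypothetical seed; values will differ from the post
df <- data.frame(variable = "group1", value = rnorm(400))

# Relabel consecutive blocks of 50 rows as group1 ... group8
df$variable <- factor(rep(paste0("group", 1:8), each = 50))

table(df$variable)
```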

head(df)

##   variable   value
## 1   group1 -1.5609
## 2   group1 -0.3708
## 3   group1  1.4242
## 4   group1  1.3375
## 5   group1  0.3007
## 6   group1  1.9717

tail(df)

##     variable   value
## 395   group8 -0.6384
## 396   group8 -3.0245
## 397   group8  1.5866
## 398   group8  1.9747
## 399   group8  0.2377
## 400   group8 -0.3468

ggplot(data = df) +
  geom_histogram(aes(x = value), binwidth = 1) +
  facet_grid(variable ~ .) +
  theme_bw()

Either the histograms need to be taller, making the stack too tall to fit on a page, or we need a better solution.

The solution is the boxplot:

ggplot() +
  geom_boxplot(data = df, aes(y = value, x = variable)) +
  coord_flip() +
  theme_bw()

The boxplot provides a nice, compact representation of the distribution of a set of data, and makes it easy to compare across a large number of groups.

There’s a lot of information packed into that graph, so let’s unpack it:

Median

A measure of the central tendency of the data that is a little more robust than the mean (or arithmetic average). Half (50%) of the data falls below this mark. The other half falls above it.

First quartile (25th percentile) hinge

Twenty-five percent (25%) of the data falls below this mark.

Third quartile (75th percentile) hinge

Seventy-five percent (75%) of the data falls below this mark.

Inter-Quartile Range (IQR)

The middle half (50%) of the data falls within this band, drawn between the 25th percentile and 75th percentile hinges.

Lower whisker

The lower whisker connects the first quartile hinge to the lowest data point within 1.5 * IQR of the hinge.

Upper whisker

The upper whisker connects the third quartile hinge to the highest data point within 1.5 * IQR of the hinge.

Outliers

Any data points more than 1.5 * IQR below the first quartile hinge, or more than 1.5 * IQR above the third quartile hinge, are marked individually as outliers.
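These quantities can also be computed directly, which is a handy check on what the plot is drawing. The sketch below uses base R's boxplot.stats() on simulated values (the data and seed are assumptions). Note that boxplot.stats() uses Tukey's hinges while geom_boxplot() uses ordinary quantiles, so the two can differ slightly on small samples.

```r
set.seed(1)                 # arbitrary seed for a repeatable sketch
x <- rnorm(50)              # simulated measurements

bs <- boxplot.stats(x, coef = 1.5)

# bs$stats holds: lower whisker, lower hinge, median, upper hinge, upper whisker
lower_hinge <- bs$stats[2]
upper_hinge <- bs$stats[4]
iqr         <- upper_hinge - lower_hinge

# Fences: points beyond these are drawn individually as outliers
lower_fence <- lower_hinge - 1.5 * iqr
upper_fence <- upper_hinge + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences
lower_whisker <- min(x[x >= lower_fence])
upper_whisker <- max(x[x <= upper_fence])

outliers <- x[x < lower_fence | x > upper_fence]
```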

We can add additional values to these plots. For instance, it’s sometimes useful to add the mean (average) when the distributions are heavily skewed:

ggplot(data = df, aes(y = value, x = variable)) +
  geom_boxplot() +
  # `fun` replaces `fun.y`, which is deprecated since ggplot2 3.3.0
  stat_summary(fun = mean, geom = "point", shape = 10, size = 3, colour = "blue") +
  coord_flip() +
  theme_bw()

Graphs created in the R programming language using the ggplot2 and gridExtra packages.

References

R Core Team (2014). R: A language and environment for statistical computing. R Foundation for Statistical
Computing, Vienna, Austria. URL http://www.R-project.org/.
H. Wickham. ggplot2: elegant graphics for data analysis. Springer New York, 2009.
Baptiste Auguie (2012). gridExtra: functions in Grid graphics. R package version 0.9.1.
http://CRAN.R-project.org/package=gridExtra

Flowing Requirements from the VoC or VoP

In a previous post, I talked about the voice of the customer (VoC), voice of the process (VoP) and the necessity of combining the two when specifying a product. Here, I’d like to offer a general method for applying this in the real world, which can be implemented as a template in Excel.

Recap

I showed that there was a cost function associated with any specification that derived from both the VoC (expressed as tolerances or specification limits) and from the process capability. An example cost function for a two-sided tolerance is reproduced below.

Percent of target production costs given an average production weight and four different process capabilities.

I argued that, given this cost function, specifying a product requires specifying both the product specification limits (or tolerances) and the minimally acceptable process capability, Cpk. Ideally, both of these should flow down from a customer needs analysis to the finished product, and from the finished product to the components, and so on to materials.

Requirements flow down and up

To flow all requirements down like this, we would need to know the transfer functions, $Y = f(X)$ , for each requirement Y and each subcomponent characteristic X. There are methods for doing this, like Design for X or QFD, but they can be difficult to implement. In the real world, we don’t always know these transfer functions, and determining them can require non-trivial research projects that are best left to academia.

As an illustration, we will use the design of a battery (somewhat simplified), where we have to meet a minimum requirement that is the sum of component parts. The illustration below shows the component parts of a battery, or cell. It includes a container (or “cell wall”), positive and negative electrodes (or positive and negative “plates”), electrolyte and terminals that provide electrical connection to the outside world. Usually, we prefer lighter batteries to heavier ones, but for this example, we’ll suppose that a customer requires a minimum weight. This requirement naturally places limits on the weight of all components.

In the absence of transfer functions, we often make our best guess, build a few prototypes, and then adjust the design. This may take several iterations. A better approach is to estimate the weight specification limits and minimum Cpk by calculation before any cells are actually built.

General drawing of the structure of an aircraft battery’s vented-type NiCd cell. Ransu. Wikipedia, [http://en.wikipedia.org/wiki/File:Aircraft_battery_cell.gif]. Accessed 2014-04-04.

Suppose the customer specifies a cell minimum weight of 100 kg. From similar designs, we know the components that contribute to the cell mass and have an idea of the percentage of total weight that each component contributes.

$m_{cell}=m_{container}+m_{terminals}+m_{electrolyte}+m_{poselect}+m_{negelect}$

Each individual component is therefore a fraction $f_{m}$ of the total cell mass, e.g.

$m_{container}=f_{m,container}m_{cell}$

More generally, for a measurable characteristic c, component i has an expected mean or target value of $T_{i,c}=f_{i,c}\mu_{parent,c}$ or $T_{i,c}=f_{i,c}T_{parent,c}$ .

In our example, we may know from similar products or from design considerations that we want to target the following percentages for each fraction $f_{m}$:

5% for container
19% for terminals
24% for electrolyte
26% for positive electrodes
26% for negative electrodes

Specification Limits

Upper Specification Limit (USL): The maximum allowed value of the characteristic. Also referred to as the upper tolerance.
Lower Specification Limit (LSL): The minimum allowed value of the characteristic. Also referred to as the lower tolerance.

Since the customer will always want to pay as little as possible, a specified lower weight of 100 kg is equivalent to saying that they are only willing to pay for 100 kg of material; any extra material is added cost that reduces our profit margin. If we tried to charge them for 150 kg of material, they would go buy from our competitors. The lower specification limit, or lower tolerance, of the cell weight is then 100 kg.

If the customer does not specify a maximum weight, or upper specification limit, then we determine the upper limit by the maximum extra material cost that we are willing to bear. In this example, we decide that we are willing to absorb up to 5% additional cost per part. Assuming that material and construction contribute 50% of the total cell cost, a 5% increase in total cost corresponds to 10% more material (5% / 50%), so the USL is 110 kg. To allow for some variation, we can set a target weight in the middle: 105 kg. From data on previous designs and the design goals, we can apportion the target weight to each component of the design, as shown in the table below.

We can apply the same fractions to the cell USL and LSL to obtain a USL and LSL of each component. As long as parts are built within these limits, the cell will be within specification. The resulting specification for cell and major subcomponents is shown in the table below. Further refinement of the allocation of USL and LSL to the components is possible and may be needed if the limits do not make sense from a production or cost perspective.

Part	Percent	Target /kg	LSL /kg	USL /kg
Cell	100%	105	100	110
Container	5%	5.2	5	5.5
Terminals	19%	19.9	19	20.9
Electrolyte	24%	25.2	24	26.4
Positive electrodes	26%	27.3	26	28.6
Negative electrodes	26%	27.3	26	28.6

Variance of components and Cpk

When a characteristic is due to the sum of the part’s components, as with cell mass, the part-to-part variation in the characteristic is likewise due to the variation in the components. However, where the characteristic adds as the sum of the components,

$m_{cell}=m_{container}+m_{terminals}+m_{electrolyte}+m_{poselect}+m_{negelect}$

the variance, $\sigma^{2}$, adds as the sum of the component variances (assuming the component weights vary independently of one another)

$\sigma_{cell}^{2}=\sigma_{container}^{2}+\sigma_{terminals}^{2}+\sigma_{electrolyte}^{2}+\sigma_{poselect}^{2}+\sigma_{negelect}^{2}$

The variance of any individual component is therefore a function of the total parent part variance

$\sigma_{container}^{2}=\sigma_{cell}^{2}-\sigma_{terminals}^{2}-\sigma_{electrolyte}^{2}-\sigma_{poselect}^{2}-\sigma_{negelect}^{2}$

$\displaystyle \sigma_{container,mass}^{2}=f_{\sigma,container}\sigma_{cell,mass}^{2}$

Since this is true for all components, the two fractions $f_{m}$ and $f_{\sigma}$ will be approximately equal. Therefore if we don’t know the fractions $f_{\sigma}$ , we can use the fraction $f_{m}$ , which is usually easier to work out, to allocate the variance to each component:

$\displaystyle \sigma_{container,mass}^{2}=f_{m,container}\times\sigma_{cell,mass}^{2}$

More generally, for measurable characteristic $c$ of a subcomponent $i$ of a parent component,

$\displaystyle \sigma_{i,c}=\sqrt{f_{i,c}}\:\sigma_{parent,c}$

Since the given $\sigma$ is the maximum allowed for the parent to meet the desired Cpk, this means that $\sigma_{i}^{2}$ is an estimate for the maximum allowed component variance. Manufacturing can produce parts better than this specification, but any greater variance will drive the parent part out of specification.
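As a quick numerical check of this allocation, the sketch below applies the square-root rule to the battery example. The parent standard deviation is derived from the 100 kg and 110 kg limits with an assumed minimum Cpk of 1.67; the variable names are illustrative only.

```r
# Mass fractions from the example
f <- c(container = 0.05, terminals = 0.19, electrolyte = 0.24,
       poselect = 0.26, negelect = 0.26)

# Maximum parent standard deviation for LSL = 100 kg, USL = 110 kg, Cpk = 1.67
sigma_cell <- (110 - 100) / (6 * 1.67)

# Allocate variance in proportion to mass fraction:
# sigma_i = sqrt(f_i) * sigma_parent
sigma_i <- sqrt(f) * sigma_cell

# The component variances add back up to the parent variance
sum(sigma_i^2)
sigma_cell^2
```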

Calculating Specification Limits

In general, there are two conflicting goals in setting specifications:

Make them as wide as possible to allow for manufacturing variation while still meeting the VoC.
Make them as narrow as possible to stay near the minimum of the cost function.

For this, Crystal Ball or iGrafx are very useful tools during development, as we can simulate a set of parts or processes, analyze the allowed variation in the product and easily flow that variation down to each component. In the absence of these tools, Minitab or Excel can be used to derive slightly less robust solutions.

Calculating from Customer Requirements

1. Identify any customer requirements and set specification limits (USL and LSL) accordingly. If the customer requirements are one-sided, determine the maximum additional cost we are willing to accept, and set the other specification limit accordingly. Some approximation of costs may be needed.
2. If no target is given, set the target specification for each requirement as the average of USL and LSL.
3. Set the minimum acceptable Cpk for each specification. Cpk = 1.67 is a good starting value. Use customer requirements for Cpk, where appropriate, and consider also whether the application requires a higher Cpk (weakest link in the chain…).
4. Calculate the maximum allowed standard deviation to meet the Cpk requirement as $\sigma_{parent}=\left(USL-LSL\right)/\left(6\times Cpk\right)$ .
5. For each subcomponent (e.g. the cell has subcomponents of container, electrodes, electrolyte, and so on), apportion the target specification to each of the subcomponents based on engineering considerations and judgement. If the fractions $f$ are known, $T_{i}=f_{i}\times T_{parent}$ .
6. Calculate the fraction $f_{i}$ (or percent) of the parent total for each subcomponent if not already established in step (5).
7. Calculate the USL and LSL for each subcomponent by multiplying the parent USL and LSL by the component’s fraction of parent (from step 6). $USL_{i}=f_{i}\times USL_{parent}$ and $LSL_{i}=f_{i}\times LSL_{parent}$ .
8. Estimate the allowed standard deviation $\sigma_{i}$ for each subcomponent as
$\displaystyle \sigma_{i}=\mathtt{SQRT}\left(f_{i}\right)\times\sigma_{parent}.$
9. Calculate the minimum allowed Cpk for each subcomponent from the results of (5), (7) and (8), using the target, $T$ , for the mean, $\mu$ .
$\displaystyle Cpk_{i}=minimum\begin{cases}\frac{USL_{i}-T_{i}}{3\sigma_{i}}\\\frac{T_{i}-LSL_{i}}{3\sigma_{i}}\end{cases}$
10. Repeat steps (5) through (9) until all components have been specified.
11. For each component, report the specified USL, LSL, target T and minimum Cpk.
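The steps above can be collected into a small R function. This is a sketch under the assumptions of the battery example (fractions known in advance, target midway between the limits); flow_down() and its argument names are hypothetical and not part of any package.

```r
flow_down <- function(lsl, usl, cpk, f, target = (lsl + usl) / 2) {
  # Step 4: maximum parent standard deviation that still meets Cpk
  sigma_parent <- (usl - lsl) / (6 * cpk)

  # Steps 5 to 8: apportion target, limits and sigma to each subcomponent
  target_i <- f * target
  lsl_i    <- f * lsl
  usl_i    <- f * usl
  sigma_i  <- sqrt(f) * sigma_parent

  # Step 9: Cpk of each subcomponent, with the mean at its target
  cpk_i <- pmin(usl_i - target_i, target_i - lsl_i) / (3 * sigma_i)

  data.frame(part = names(f), fraction = f, target = target_i,
             lsl = lsl_i, usl = usl_i, sigma = sigma_i, cpk = cpk_i,
             row.names = NULL)
}

f <- c(Container = 0.05, Terminals = 0.19, Electrolyte = 0.24,
       "Positive electrodes" = 0.26, "Negative electrodes" = 0.26)

spec <- flow_down(lsl = 100, usl = 110, cpk = 1.67, f = f)
print(spec, digits = 3)
```

With a centered target, these formulas make each component Cpk equal to the square root of its fraction times the parent Cpk, so small components receive proportionally lower values. That is a consequence of the allocation rule rather than a separate requirement, and is worth keeping in mind when reviewing the results.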

Calculating from Process Data

When there is no clear customer-driven requirement or clear requirement from parent parts (e.g. dimensional specifications that can be driven by the fit of parts), but specification limits are still reasonably needed, we can start from existing process data.

This is undesirable because any change to the process can force a change to the product specification, without any clear understanding of the impact on customer needs or requirements; the VoC is lost.

The calculation of USL and LSL from process data is also somewhat more complicated, as we have to use the population mean and standard deviation to determine where to set the USL and LSL, without really knowing what that mean and standard deviation are.

In the real world, we have to live with such constraints. To deal with these limitations, we will use as much data as is available and calculate the confidence intervals on both the mean and the standard deviation. The calculation for USL and LSL becomes

$\setlength\arraycolsep{2pt}\begin{array}{rl}\displaystyle USL &=\textrm{upper 95\% confidence on the mean}\smallskip\\ \displaystyle &\quad +k\times\textrm{upper 95\% confidence on the standard deviation}\end{array}$
$\setlength\arraycolsep{2pt}\begin{array}{rl}\displaystyle LSL &=\textrm{lower 95\% confidence on the mean}\smallskip\\ &\quad -k\times\textrm{upper 95\% confidence on the standard deviation}\end{array}$

where $k$ is the number of process Sigmas desired, based on the tolerance cost function. Most of the time, we will use $k=5$ , to achieve a Cpk of 1.67.

We always use the upper 95% confidence limit on the standard deviation. We don’t care about the lower confidence limit, since a small $\sigma$ will not help us in setting specification limits.

1. Calculate the mean ( $\mu_{parent}$ ) from recent production data. In Excel, use the AVERAGE() function on the data range.
2. Calculate the standard deviation ( $\sigma_{parent}$ ) from recent production data. In Excel, you can use the STDEV() function on the data range.
   If the order of production data is known, or SPC is in use, a better method is to use the range-based estimate from the control charts. This will be discussed in subsequent training on control charts.
3. Count the number of data points, $n$, that were used for the calculations (1) and (2). You can use the COUNT() function on the data range.
4. Calculate the 95% confidence level on the mean. In Excel, this is accomplished with
$CL=\mathtt{TINV}\left(\left(1-0.95\right);n-1\right)\times\sigma_{parent}/\mathtt{SQRT}\left(n\right)$
In Excel 2010 and later, TINV() should be replaced with T.INV.2T().
5. Calculate the 95% confidence interval on the mean as $CI_{upper}=\mu+CL$ and $CI_{lower}=\mu-CL$ .
6. Calculate the upper and lower 95% confidence limits on the standard deviation. CHIINV() takes a right-tail probability, so the upper limit on $\sigma$ uses the probability $1-\left(1-0.95\right)/2$ . In Excel, this is accomplished with
$\sigma_{upper}=\sigma_{parent}\times\mathtt{SQRT}\left(\left(n-1\right)/\mathtt{CHIINV}\left(1-\left(1-0.95\right)/2;n-1\right)\right)$
and
$\sigma_{lower}=\sigma_{parent}\times\mathtt{SQRT}\left(\left(n-1\right)/\mathtt{CHIINV}\left(\left(1-0.95\right)/2;n-1\right)\right)$
In Excel 2010 and later, CHIINV() can be replaced with CHISQ.INV.RT() for improved accuracy.
7. Calculate the LSL as $LSL_{parent}=CI_{lower}-k\sigma_{upper}$ . You might use a value of $k$ other than 5 if the customer requirements or application require a higher process Sigma.
8. Calculate the USL as $USL_{parent}=CI_{upper}+k\sigma_{upper}$ .
9. For each subcomponent (e.g. the cell has subcomponents of positive electrode, negative electrode, electrolyte, and so on), apportion the parent part mean to each of the subcomponents based on engineering considerations and judgement. If the fractions $f$ are known, $T_{i}=f_{i}\times\mu_{parent}$ .
10. If the fraction (or percent) $f_{i}$ of the parent total for each subcomponent is not known, calculate it using the results of step (9).
11. Calculate the USL and LSL for each subcomponent by multiplying the parent USL and LSL by the component’s fraction of parent (from step 10). $USL_{i}=f_{i}\times USL_{parent}$ and $LSL_{i}=f_{i}\times LSL_{parent}$ .
12. Estimate the allowed standard deviation $\sigma_{i}$ for each subcomponent as $\sigma_{i}=\mathtt{SQRT}\left(f_{i}\right)\times\sigma_{lower}$
13. Calculate the minimum allowed Cpk for each subcomponent from the results of (9), (11) and (12), using the target $T_{i}$ for the mean, $\mu_{i}$ .
$\displaystyle Cpk_{i}=minimum\begin{cases}\frac{USL_{i}-T_{i}}{3\sigma_{i}}\\\frac{T_{i}-LSL_{i}}{3\sigma_{i}}\end{cases}$
14. Repeat steps (9) through (13) until all components have been specified.
15. For each component, report the specified USL, LSL, target T and minimum Cpk.
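For those working in R rather than Excel, the parent-level part of this procedure (mean, standard deviation, confidence limits and specification limits) can be sketched as follows. The production weights are simulated, and the seed and k = 5 are assumptions. qt() and qchisq() play the roles of TINV() and CHIINV(); note that qchisq() takes a left-tail probability, whereas CHIINV() takes a right-tail probability.

```r
set.seed(2014)                        # arbitrary seed
w <- rnorm(50, mean = 105, sd = 1)    # simulated production weights, kg

k     <- 5                            # process sigmas desired
conf  <- 0.95
alpha <- 1 - conf

m <- mean(w)                          # mean of the production data
s <- sd(w)                            # standard deviation
n <- length(w)                        # number of data points

# Confidence interval on the mean
cl       <- qt(1 - alpha / 2, df = n - 1) * s / sqrt(n)
ci_lower <- m - cl
ci_upper <- m + cl

# Confidence limits on the standard deviation.
# The upper limit on sigma uses the lower chi-squared quantile.
sigma_upper <- s * sqrt((n - 1) / qchisq(alpha / 2, df = n - 1))
sigma_lower <- s * sqrt((n - 1) / qchisq(1 - alpha / 2, df = n - 1))

# Specification limits
lsl <- ci_lower - k * sigma_upper
usl <- ci_upper + k * sigma_upper

c(LSL = lsl, USL = usl)
```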

Can We do Better than R-squared?

If you're anything like me, you've used Excel to plot data, then used the built-in “add fitted line” feature to overlay a fitted line to show the trend, and displayed the “goodness of fit,” the r-squared (R²) value, on the chart by checking the provided box in the chart dialog.

The R² calculated in Excel is often used as a measure of how well a model explains a response variable, so that “R² = 0.8” is interpreted as “80% of the variation in the 'y' variable is explained by my model.” I think that the ease with which the R² value can be calculated and added to a plot is one of the reasons for its popularity.

There's a hidden trap, though. R² will increase as you add terms to a model, even if those terms offer no real explanatory power. By using the R² that Excel so helpfully provides, we can fool ourselves into believing that a model is better than it is.

Below I'll demonstrate this and show an alternative that can be implemented easily in R.

Some data to work with

First, let's create a simple, random data set, with factors a, b, c and response variable y.
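The code that created my.df is not shown. As a stand-in, here is a hypothetical data set with the same structure: ten rows, a sequential a, a random b, and a c that tracks a with some noise. The generating formulas and the seed are assumptions, so the numbers will differ from the output in this post.

```r
set.seed(10)  # hypothetical seed

n <- 10
my.df <- data.frame(a = 1:n,
                    b = rnorm(n))                  # pure noise
my.df$c <- my.df$a + rnorm(n, sd = 0.5)            # nearly collinear with a
my.df$y <- 2 + 0.3 * my.df$a + rnorm(n)            # assumed relationship

# Put the response first, matching the column order shown in the post
my.df <- my.df[, c("y", "a", "b", "c")]
```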

head(my.df)

##       y a       b      c
## 1 2.189 1 -1.2935 -0.126
## 2 3.912 2 -0.4662  1.623
## 3 4.886 3  0.1338  2.865
## 4 5.121 4  1.2945  4.692
## 5 4.917 5  0.1178  5.102
## 6 4.745 6  0.4045  5.936

Here is what this data looks like:

Calculating R-squared

What Excel does when it displays the R² is create a linear least-squares model, which in R looks something like:

my.lm <- lm(y ~ a + b + c, data = my.df)

Excel also does this when we call RSQ() in a worksheet. In fact, we can do this explicitly in Excel using the Regression analysis option in the Analysis ToolPak add-in, but I don't know many people who use this, and Excel isn't known for its reliability in producing good output from the Analysis ToolPak.

In R, we can obtain R² via the summary() function on a linear model.

summary(my.lm)

## 
## Call:
## lm(formula = y ~ a + b + c, data = my.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2790 -0.6006  0.0473  0.5177  1.5299 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    2.080      0.763    2.72    0.034 *
## a             -0.337      0.776   -0.43    0.679  
## b             -0.489      0.707   -0.69    0.515  
## c              1.038      0.817    1.27    0.250  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.1 on 6 degrees of freedom
## Multiple R-squared:  0.833,  Adjusted R-squared:  0.75 
## F-statistic:   10 on 3 and 6 DF,  p-value: 0.00948

Since summary() produces a list object as output, we can grab just the R² value.

summary(my.lm)$r.squared

## [1] 0.8333

Normally, we would (somewhat loosely) interpret this as telling us that about 83% of the variation in the response y is explained by the model.

Notice that there is also an “adjusted R-squared” value given by summary(). This tells us that only 75% of the variation is explained by the model. Which is right?

The problem with R-squared

Models that have many terms will always give higher R² values, just because more terms will slightly improve the model fit to the given data. The unadjusted R² therefore overstates how well the model explains the response. The calculation for adjusted R² is intended to partially compensate for that “overfit,” so it's better.

It's nice that R shows us both values, and a pity that Excel won't show the adjusted value. The only way to get an adjusted R² in Excel is to run the Regression analysis; otherwise, we have to calculate adjusted R² manually.
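For reference, the manual calculation is short. The standard formula is adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p is the number of predictors, not counting the intercept. The helper name adj_r_squared() below is hypothetical. Plugging in the values from the summary above (R² = 0.8333, n = 10, p = 3) gives approximately the reported 0.75.

```r
adj_r_squared <- function(r2, n, p) {
  # r2: ordinary R-squared; n: observations; p: predictors (excluding intercept)
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

adj_r_squared(r2 = 0.8333, n = 10, p = 3)
```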

Both R² and adjusted R² are measures of how well the model explains the given data. However, in industry we usually want to know something a little different. We don't build regression models to explain only the data we have; we build them to think about future results. We want R² to tell us how well the model predicts the future. That is, we want a predictive R². Minitab has added the ability to calculate predictive R² in Minitab 17, and has a nice blog post explaining this statistic.

Calculating predictive R-squared

Neither R nor Excel provides a means of calculating the predictive R² within the default functions. While some free R add-on packages provide this ability (DAAG, at least), we can easily do it ourselves. We'll need a linear model, created with lm(), for the residuals so we can calculate the “PRESS” statistic (the predicted residual error sum of squares), and then we need the sum of squares of the terms so we can calculate a predictive R².

Since the predictive R² depends entirely on the PRESS statistic, we could skip the added work of calculating predictive R² and just use PRESS, as some authors advocate. The lower the PRESS, the better the model is at fitting future data from the same process, so we can use PRESS to compare different models. Personally, I'm used to thinking in terms of R², and I like having the ability to compare to the old R² statistic that I'm familiar with.

To calculate PRESS, first we calculate the predictive residuals, then take the sum of squares (thanks to Walker’s helpful blog post for this). This is pretty easy if we already have a linear model. It would take a little more work in Excel.

pr <- residuals(my.lm)/(1 - lm.influence(my.lm)$hat)
PRESS <- sum(pr^2)
PRESS

## [1] 19.9
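The shortcut above works because, for a linear model, dividing each residual by one minus its hat value gives the error we would get from predicting that point with a model fitted to all the other points. The sketch below confirms this by brute force on R's built-in mtcars data (chosen only because my.df is not reproduced here); the model formula is an arbitrary example.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Shortcut: predictive residuals from the hat values
press_hat <- sum((residuals(fit) / (1 - hatvalues(fit)))^2)

# Brute force: refit without each row and predict the held-out point
loo_resid <- sapply(seq_len(nrow(mtcars)), function(i) {
  fit_i <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
  mtcars$mpg[i] - predict(fit_i, newdata = mtcars[i, ])
})
press_loo <- sum(loo_resid^2)

all.equal(press_hat, press_loo)
```

The brute-force version refits the model once per data point, so the hat-value shortcut is the practical choice for anything but small data sets.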

The predictive R² is then (from a helpful comment by Ibanescu on LinkedIn) the PRESS divided by the total sum of squares, subtracted from one. The total sum of squares can be calculated directly as the sum of the squared deviations of y from its mean, or obtained by summing over Sum Sq from an anova() on our linear model. I prefer using the anova function, as any statistical subtleties are more likely to be properly accounted for there than in my simple code.

# anova to calculate total sum of squares (model terms plus residuals)
my.anova <- anova(my.lm)
tss <- sum(my.anova$"Sum Sq")
# predictive R^2
pred.r.squared <- 1 - PRESS/(tss)
pred.r.squared

## [1] 0.5401

You'll notice that this is smaller than the adjusted R², which is itself smaller than the basic R². This is the point of the exercise. We don't want to fool ourselves into thinking we have a better model than we actually do. One way to think of this is that 29 percentage points (83% – 54%) of the apparent fit comes from too many factors and random correlations, which we would have attributed to our model if we were just using Excel's built-in function.

When the model is good and has few terms, the differences are small. For example, working through the examples in Mitsa's two posts, we see that for her model 3, R² = 0.96 and the predictive R² = 0.94, so calculating the predictive R² wasn't really worth the extra effort for that model. Unfortunately, we can't know, in advance, which models are “good.” For Mitsa's model 1 we have R² = 0.95 and predictive R² = 0.32. Even the adjusted R² looks pretty good for model 1, at 0.94, but we see from the predictive R² that our model is not very useful. This is the sort of thing we need to know to make correct decisions.

Automating

In R, we can easily wrap these in functions that we can source() and call directly, reducing the typing. Just create a linear model with lm() (or an equivalent) and pass that to either function. Note that pred_r_squared() calls PRESS(), so both functions have to be sourced.

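# Predictive R-squared: 1 - PRESS / (total sum of squares)
pred_r_squared <- function(linear.model) {
    # total sum of squares = sum of all "Sum Sq" rows in the anova table
    lm.anova <- anova(linear.model)
    tss <- sum(lm.anova$"Sum Sq")
    # predictive R^2
    pred.r.squared <- 1 - PRESS(linear.model)/(tss)
    return(pred.r.squared)
}

# PRESS: sum of squared predictive (leave-one-out) residuals
PRESS <- function(linear.model) {
    # each residual divided by (1 - hat value) gives the predictive residual
    pr <- residuals(linear.model)/(1 - lm.influence(linear.model)$hat)
    press <- sum(pr^2)
    return(press)
}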

Then we just call the function to get the result:

pred.r.squared <- pred_r_squared(my.lm)
pred.r.squared

## [1] 0.5401

I've posted these as Gists on GitHub, with extra comments, so you can copy and paste from here or go fork or copy them there.

References and further reading

Mitsa, T. Use PRESS, not R squared to judge predictive power of regression. 12 May 2013. Analytical Bridge. Accessed 14 May 2014. Shows how r-squared can lead to a misleading interpretation of model fit and provides an explanation of the PRESS statistic, with examples comparing three linear models in R.
Mitsa, T. Cross-validation in R: a do-it-yourself and a black box approach. 22 May 2013. Accessed 14 May 2014. Shows how to use the package DAAG to calculate PRESS, or to calculate it manually.
Walker, S. Calculating the PRESS statistic in R. 18 June 2013. ecology & stats. Accessed 14 May 2014. Provides a simple function for calculating PRESS in R.
Multiple Regression Analysis: Use Adjusted R-Squared and Predicted R-Squared to Include the Correct Number of Variables
Adjusted R-Square or Predicted R-Square. LinkedIn. Accessed 14 May 2014. Forum discussion thread discussing the relative merits of adjusted and predicted R², in which the equation for calculating predicted R² is given.
Why is adjusted R-squared less than R-squared if adjusted R-squared predicts the model better?. StackExchange. Accessed 10 May 2014. Q&A thread discussing the relative merits of R² and adjusted R².
R Core Team (2014). R: A language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.
H. Wickham. ggplot2: elegant graphics for data analysis. Springer New York, 2009.