Introduction to R for Excel Users


As the saying goes, when all you have is a hammer, everything looks like a nail. Excel was designed to do simple financial analyses and to craft financial statements. Though its capabilities have been expanded over the years, it was never designed to perform the sort of data analysis that industry scientists, engineers and Six Sigma belts need to perform on a daily basis.

Most data analyses performed in Excel look more like simple financial spreadsheets rather than actual data analysis, and this quality of work translates into bad—or at least sub-optimal—business decisions. There are alternatives to Excel, and the free, open-source data analysis platform R is one of them.

Unfortunately, R has a steep learning curve. I’m offering, for free, a short primer on R [PDF] where I’ve sought to make that learning curve a little less painful for engineers and scientists who normally work in Excel.

Background

A couple of years ago, I was developing a short course to teach R to scientists and engineers in industry who normally used Excel. The goal was to help them transition to a more capable tool. My course design notes morphed into a handout, and when plans for the course fell through, that handout grew into a self-study guide, which I later adapted into this seventy-page, stand-alone introduction for Excel users.

Organization

The primer walks the reader through the basics of R, starting with a brief overview of capabilities, then diving into installation, basic operations, graphical analysis and basic statistics. I believe that a picture is worth a thousand words, so it’s light on text and heavy on examples and visuals.

The end of the book rounds out with a look at some of the most useful add-ons, the briefest of introductions to writing your own, custom functions in R, and a cross-reference of common Excel functions with their equivalents in R.

The text is broken up into chapters and fully indexed so that it can be used either as a walk-through tutorial or as a quick reference.

Can We do Better than R-squared?

If you're anything like me, you've used Excel to plot data, then used the built-in “Add Trendline” feature to overlay a fitted line showing the trend, and displayed the “goodness of fit,” the r-squared (R2) value, on the chart by checking the box provided in the trendline dialog.

The R2 calculated in Excel is often used as a measure of how well a model explains a response variable, so that “R2 = 0.8” is interpreted as “80% of the variation in the 'y' variable is explained by my model.” I think that the ease with which the R2 value can be calculated and added to a plot is one of the reasons for its popularity.

There's a hidden trap, though. R2 will increase as you add terms to a model, even if those terms offer no real explanatory power. By using the R2 that Excel so helpfully provides, we can fool ourselves into believing that a model is better than it is.

Below I'll demonstrate this and show an alternative that can be implemented easily in R.

Some data to work with

First, let's create a simple, random data set, with factors a, b, c and response variable y.
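The code that generated this data frame is not shown in the original; here is a minimal sketch that produces a data set of the same shape (the seed, coefficients and noise level are my own assumptions, so the values will not match the output below exactly):

# Assumed generating code: an index factor, two noisy factors and a noisy response
set.seed(1)
a <- 1:10
b <- rnorm(10)
c <- a + rnorm(10)
y <- 2 + 0.5 * a + rnorm(10)
my.df <- data.frame(y = y, a = a, b = b, c = c)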

head(my.df)
##       y a       b      c
## 1 2.189 1 -1.2935 -0.126
## 2 3.912 2 -0.4662  1.623
## 3 4.886 3  0.1338  2.865
## 4 5.121 4  1.2945  4.692
## 5 4.917 5  0.1178  5.102
## 6 4.745 6  0.4045  5.936

Here is what this data looks like:

Plot of the response and factors in a linear model.

Calculating R-squared

When Excel displays the R2, what it is actually doing is fitting a linear least-squares model, which in R looks something like:

my.lm <- lm(y ~ a + b + c, data = my.df)

Excel does something equivalent when we call RSQ() in a worksheet. In fact, we can run the regression explicitly in Excel using the Regression option in the Analysis ToolPak add-in, but I don't know many people who use this, and Excel isn't known for its reliability in producing good output from the Analysis ToolPak.
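As an aside, for a single predictor Excel's RSQ() simply returns the squared Pearson correlation between the two ranges, which is the same number lm() reports as R2 for a one-factor model. A quick check in R, using factor a alone as an illustration:

# RSQ(y, a) in Excel is the squared correlation, which matches lm()'s R-squared
cor(my.df$y, my.df$a)^2
summary(lm(y ~ a, data = my.df))$r.squared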

In R, we can obtain R2 via the summary() function on a linear model.

summary(my.lm)
## 
## Call:
## lm(formula = y ~ a + b + c, data = my.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2790 -0.6006  0.0473  0.5177  1.5299 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    2.080      0.763    2.72    0.034 *
## a             -0.337      0.776   -0.43    0.679  
## b             -0.489      0.707   -0.69    0.515  
## c              1.038      0.817    1.27    0.250  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.1 on 6 degrees of freedom
## Multiple R-squared:  0.833,  Adjusted R-squared:  0.75 
## F-statistic:   10 on 3 and 6 DF,  p-value: 0.00948

Since summary() produces a list object as output, we can grab just the R2 value.

summary(my.lm)$r.squared
## [1] 0.8333

Normally, we would (somewhat loosely) interpret this as telling us that about 83% of the variation in the response y is explained by the model.

Notice that there is also an "adjusted R-squared" value given by summary(). This tells us that only 75% of the variation is explained by the model. Which is right?

The problem with R-squared

Adding more terms to a model will always raise R2, just because each additional term lets the model fit the given data a little more closely, even when the term has no real explanatory power. The unadjusted R2 is therefore misleadingly optimistic. The calculation for adjusted R2 is intended to partially compensate for that "overfit," so it's better.

It's nice that R shows us both values, and a pity that Excel won't show the adjusted value. The only way to get an adjusted R2 in Excel is to run the Regression analysis; otherwise, we have to calculate adjusted R2 manually.
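In R, by contrast, the adjusted value is just another element of the summary() output; this returns the 0.75 reported above:

summary(my.lm)$adj.r.squared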

Both R2 and adjusted R2 are measures of how well the model explains the given data. However, in industry we usually want to know something a little different. We don't build regression models to explain only the data we have; we build them to think about future results. We want R2 to tell us how well the model predicts the future. That is, we want a predictive R2. Minitab has added the ability to calculate predictive R2 in Minitab 17, and has a nice blog post explaining this statistic.

Calculating predictive R-squared

Neither R nor Excel provides a means of calculating the predictive R2 among their built-in functions. While some free R add-on packages provide this ability (DAAG, at least), we can easily do it ourselves. We'll need a linear model, created with lm(), so that we can calculate the predictive residuals and, from them, the "PRESS" statistic; we then need the total sum of squares so we can calculate a predictive R2.

Since the predictive R2 depends entirely on the PRESS statistic, we could skip the added work of calculating predictive R2 and just use PRESS, as some authors advocate. The lower the PRESS, the better the model is at fitting future data from the same process, so we can use PRESS to compare different models. Personally, I'm used to thinking in terms of R2, and I like having the ability to compare to the old R2 statistic that I'm familiar with.

To calculate PRESS, first we calculate the predictive residuals, then take the sum of squares (thanks to Walker's helpful blog post for this). This is pretty easy if we already have a linear model; it would take a little more work in Excel.

pr <- residuals(my.lm)/(1 - lm.influence(my.lm)$hat)
PRESS <- sum(pr^2)
PRESS
## [1] 19.9

The predictive R2 is then (from a helpful comment by Ibanescu on LinkedIn) one minus the PRESS divided by the total sum of squares. The total sum of squares can be calculated directly as the sum of the squared deviations of y from its mean, or obtained by summing over Sum Sq from an anova() on our linear model. I prefer using the anova function, as any statistical subtleties are more likely to be properly accounted for there than in my simple code.

# anova to calculate residual sum of squares
my.anova <- anova(my.lm)
tss <- sum(my.anova$"Sum Sq")
# predictive R^2
pred.r.squared <- 1 - PRESS/(tss)
pred.r.squared
## [1] 0.5401

You'll notice that this is smaller than the adjusted R2, which is itself smaller than the basic R2. This is the point of the exercise. We don't want to fool ourselves into thinking we have a better model than we actually do. One way to think of this is that about 29% (83% – 54%) of the apparent explanatory power comes from having too many terms and from random correlations, and we would have attributed it to our model if we were just using Excel's built-in function.

When the model is good and has few terms, the differences are small. For example, working through the examples in Mitsa's two posts, we see that for her model 3, R2 = 0.96 and the predictive R2 = 0.94, so calculating the predictive R2 wasn't really worth the extra effort for that model. Unfortunately, we can't know, in advance, which models are “good.” For Mitsa's model 1 we have R2 = 0.95 and predictive R2 = 0.32. Even the adjusted R2 looks pretty good for model 1, at 0.94, but we see from the predictive R2 that our model is not very useful. This is the sort of thing we need to know to make correct decisions.

Automating

In R, we can easily wrap these in functions that we can source() and call directly, reducing the typing. Just create a linear model with lm() (or an equivalent) and pass that to either function. Note that pred_r_squared() calls PRESS(), so both functions have to be sourced.

pred_r_squared <- function(linear.model) {
    lm.anova <- anova(linear.model)
    tss <- sum(lm.anova$"Sum Sq")
    # predictive R^2
    pred.r.squared <- 1 - PRESS(linear.model)/(tss)
    return(pred.r.squared)
}
PRESS <- function(linear.model) {
    pr <- residuals(linear.model)/(1 - lm.influence(linear.model)$hat)
    PRESS <- sum(pr^2)
    return(PRESS)
}

Then we just call the function to get the result:

pred.r.squared <- pred_r_squared(my.lm)
pred.r.squared
## [1] 0.5401
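As a side note (my own addition, not part of the original functions or the Gists mentioned below): if you find yourself comparing several candidate models, it can be handy to collect all three statistics in one data frame, for example:

model_fit_stats <- function(linear.model) {
    data.frame(r.squared = summary(linear.model)$r.squared,
               adj.r.squared = summary(linear.model)$adj.r.squared,
               pred.r.squared = pred_r_squared(linear.model))
}
model_fit_stats(my.lm)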

I've posted these as Gists on GitHub, with extra comments, so you can copy and paste from here, or fork or copy them there.


Normality and Testing for Normality

Many of our statistical tests make assumptions about the distribution of the underlying population. Many of the most common—ImR (XmR) and XbarR control charts, ANOVA, t-tests—assume normal distributions in the underlying population (or normal distributions in the residuals, in the case of ANOVA), and we’re often told that we must carefully check the assumptions.

At the same time, there’s a lot of conflicting advice about how to test for normality. There are the statistical tests for normality, such as Shapiro-Wilk or Anderson-Darling. There’s the “fat pencil” test, where we just eye-ball the distribution and use our best judgement. We could even use control charts, as they’re designed to detect deviations from the expected distribution. We are discouraged from using the “fat pencil” because it will result in a lot of variation from person to person. We’re often told not to rely too heavily on the statistical tests because they are not sensitive with small sample sizes and too sensitive to the tails. In industrial settings, our data is often messy, and the tails are likely to be the least reliable portion of our data.

I’d like to explore what the above objections really look like. I’ll use R to generate some fake data based on the normal distribution and the t distribution, and compare the frequency of p-values obtained from the Shapiro-Wilk test for normality.

A Function to test normality many times

First, we need to load our libraries:

library(ggplot2)
library(reshape2)

To make this easy to run, I’ll create a function to perform a large number of normality tests (Shapiro-Wilk) for sample sizes n = 5, 10 and 1000, all drawn from the same data:

#' @name assign_vector
#' @param data A vector of data to perform the Shapiro-Wilk test on.
#' @param n An integer indicating the number of Shapiro-Wilk tests to perform. Default is 1000.
#' @return A data frame in "tall" format
assign_vector <- function(data, n = 1000) {
  # replicate the call to shapiro.test n times to build up a vector of p-values
  p.5 <- replicate(n=n, expr=shapiro.test(sample(data, 5, replace=TRUE))$p.value)
  p.10 <- replicate(n=n, expr=shapiro.test(sample(data, 10, replace=TRUE))$p.value)
  p.1000 <- replicate(n=n, expr=shapiro.test(sample(data, 1000, replace=TRUE))$p.value)
  #' Combine the data into a data frame, 
  #' one column for each number of samples tested.
  p.df <- cbind(p.5, p.10, p.1000)
  p.df <- as.data.frame(p.df)
  colnames(p.df) <- c("5 samples","10 samples","1000 samples")
  #' Put the data in "tall" format, one column for number of samples
  #' and one column for the p-value.
  p.df.m <- melt(p.df)
  #' Make sure the levels are sorted correctly.
  p.df.m <- transform(p.df.m, variable = factor(variable, levels = c("5 samples","10 samples","1000 samples")))
  return(p.df.m)  
}

Clean, random data

I want to simulate real-world conditions, where we have an underlying population from which we sample a limited number of times. To start, I’ll generate 100000 values from a normal distribution. To keep runtimes low, I’ll have assign_vector() sample from that distribution when performing the tests for normality.

n.rand <- 100000
n.test <- 10000
my.data <- rnorm(n.rand)
p.df.m <- assign_vector(my.data, n = n.test)

We would expect the p-values from normally distributed random data to be uniformly distributed; i.e. 5% of the time we’ll see a p-value ≤ 0.05, 5% of the time we’ll see a p-value > 0.05 and ≤ 0.10, and so on up through > 0.95 and ≤ 1.00. Let’s graph that and see what we get for each sample size:

ggplot(p.df.m, aes(x = value)) + 
  geom_histogram(binwidth = 1/10) + 
  facet_grid(facets=variable ~ ., scales="free_y") + 
  xlim(0,1) +
  ylab("Count of p-values") +
  xlab("p-values") +
  theme(text = element_text(size = 16))

Histogram of p-values for the normal distribution, for sample sizes 5, 10 and 1000.

This is, indeed, what we expected.

Now let’s compare the normal distribution to a t distribution. The t distribution would pass the “fat pencil” test—it looks normal to the eye:

ggplot(NULL, aes(x=x, colour = distribution)) + 
  stat_function(fun=dnorm, data = data.frame(x = c(-6,6), distribution = factor(1)), size = 1) + 
  stat_function(fun=dt, args = list( df = 20), data = data.frame(x = c(-6,6), distribution = factor(2)), linetype = "dashed", size = 1) + 
  scale_colour_manual(values = c("blue","red"), labels = c("Normal","T-Distribution")) +
  theme(text = element_text(size = 12),
        legend.position = c(0.85, 0.75)) +
  xlim(-4, 4) +
  xlab(NULL) +
  ylab(NULL)

Density plot of normal and t distributions

Starting with random data generated from the t-distribution:

my.data <- rt(n.rand, df = 20)
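The original does not repeat the plotting code here; the histogram below presumably comes from the same two steps used above, applied to the new data, and the same pattern applies to each constructed data set that follows:

p.df.m <- assign_vector(my.data, n = n.test)
# ...followed by the same ggplot() call used for the normal data above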

Histogram of p-values for the t distribution, for sample sizes 5, 10 and 1000.

The tests for normality are not very sensitive for small sample sizes, and are much more sensitive for large sample sizes. Even with a sample size of 1000, the data from a t distribution only fails the test for normality about 50% of the time (add up the frequencies for p-value > 0.05 to see this).

Testing the tails

Since the t distribution is narrower in the middle range and has longer tails than the normal distribution, the normality test might be failing because the entire distribution doesn’t look quite normal; we haven’t learned anything specifically about the tails.

To test the tails, we can construct a data set that uses the t distribution for the middle 99% of the data, and the normal distribution for the tails.

my.data <- rt(n.rand, df = 20)
my.data.2 <- rnorm(n.rand)
# Trim off the tails
my.data <- my.data[which(my.data < 3 & my.data > -3)]
# Add in tails from the other distribution
my.data <- c(my.data, my.data.2[which(my.data.2 < -3 | my.data.2 > 3)])

Histogram of p-values for sample sizes 5, 10 and 1000, from a data set constructed from the t distribution in the range -3 to +3 sigmas, with tails from the normal distribution below -3 and above +3.

Despite 99% of the data being from the t distribution, this is almost identical to our test with data from just the normal distribution. It looks like the tails may be having a larger impact on the normality test than the rest of the data.

Now let’s flip this around: data that is 99% normally-distributed, but using the t distribution in the extreme tails.

my.data <- rnorm(n.rand)
my.data.2 <- rt(n.rand, df = 20)
# Trim off the tails
my.data <- my.data[which(my.data < 3 & my.data > -3)]
# Add in tails from the other distribution
my.data <- c(my.data, my.data.2[which(my.data.2 < -3 | my.data.2 > 3)])

Histogram of p-values for sample sizes 5, 10 and 1000, from a data set constructed from the normal distribution in the range -3 to +3 sigmas, with tails from the t-distribution below -3 and above +3.

Here, 99% of the data is from the normal distribution, yet the normality test looks almost the same as the normality test for just the t-distribution. If you check the y-axis scales carefully, you’ll see that the chance of getting p-value ≤ 0.05 is a bit lower here than for the t distribution.

To make the point further, suppose we have highly skewed data:

my.data <- rlnorm(n.rand, 0, 0.4)

This looks like:
Histogram of log-normal data

For small sample sizes, even this is likely to pass a test for normality:
Histogram of p-values for a log-normal distribution

What have we learned?

  • With small sample sizes, everything looks normal.
  • The normality tests are, indeed, very sensitive to what goes on in the extreme tails.

In other words, if we have enough data to fail a normality test, we always will because our real-world data won’t be clean enough. If we don’t have enough data to reliably fail a normality test, then there’s no point in performing the test, and we have to rely on the fat pencil test or our own understanding of the underlying processes.

Don’t get too hung up on whether your data is normally distributed or not. When evaluating and summarizing data, rely mainly on your brain and use the statistics only to catch really big errors in judgement. When attempting to make predictions about future performance, e.g. calculating Cpk or simulating a process, recognize the opportunities for errors in judgement and explicitly state your assumptions.

Scientific Consensus

I have seen a number of comments lately that “consensus” has no place in science, and that claims that there is a “scientific consensus” are just thinly veiled political double-speak. I have to take issue with such criticisms. In fact, consensus is one of the cornerstones of science.

This is not quite the same consensus used in politics or everyday life. Consensus means “general agreement” and is usually achieved through some form of discussion and negotiation. Consensus is therefore agreement over opinions, and is often agreement over a course of action despite differing opinions.

In science, consensus is derived from data and independent replication of experiments. It is the consensus that an idea—specifically, a testable hypothesis—is correct. It is the expression by scientists that a hypothesis (a) is scientifically testable and falsifiable, (b) has not been falsified and (c) explains the universe better than competing hypotheses. It is a consensus derived from the replication of observations or tests by other researchers.

Part of the problem, here, is that science has a habit of taking everyday words and developing very specific meanings around them. This happens because scientists need to communicate clearly and exactly at times, and language is messy and full of fuzzy concepts. The same thing happens in a lot of occupations. For example, accountants also develop specific meanings for everyday words.

Scientists no longer argue over the validity of Newton’s hypothesis on the gravitational force because there is broad consensus that the hypothesis is correct (as far as it goes). Objects attract each other according to their mass and the distance between them, and there have been plenty of independent experiments confirming the specific relationship, F = -G m_{1} m_{2} / r^{2}. The consensus is so strong that it’s referred to as Newton’s law of gravity. Likewise, scientists no longer argue over the geocentric model of the universe because there is broad consensus—derived from data collected over centuries by many independent researchers—that the Earth is not at the center of the solar system, let alone the universe.

Conversely, there is very little consensus when it comes to the accelerating expansion of the universe. Cosmologists agree that the universe is expanding faster than can be explained by our current understanding, but there are many conflicting hypotheses about the causes. There is consensus over the fact of the accelerating expansion, but the data does not yet support consensus on the underlying physical process or mechanism causing it.

In one sense, scientific consensus is stronger, or more robust, than we are used to thinking about in politics and everyday life, precisely because it is based on observation and careful analysis by independent groups. It’s not just consensus based on what we think might be true, or what we want to be true, but based on what careful observation tells us must be true. Scientific consensus is not subject to whim.

From another perspective, though, scientific consensus is much weaker than we are used to. In science, there is no downside to abandoning or overturning a consensus when the data points in a different direction. In fact, there is significant benefit to being the person who can overturn a previous consensus; we remember Galileo, Newton, Darwin, Einstein and others precisely because their work, collecting and analyzing data, was so pivotal in altering the scientific consensus. In everyday life, if you back out of a consensus agreement, it’s likely that others party to the agreement will feel betrayed. There can be a significant social cost to pay for backing out of a consensus, even when you are convinced that you are right. Scientists may sometimes feel this same social pressure, but the scientific method provides clear guidance for adopting or abandoning consensus, and it doesn’t focus on the people involved but rather on external, objective observations of how the universe works. Scientific consensus is, perhaps, more readily changed than conventional consensus.

So consensus does exist in science, and it plays an important role. We should be careful to distinguish, though, between consensus based on independent replication of results and consensus based on preconceptions and social negotiation.

And You Thought Physics Was *YAWN*

Part of my day job involves monitoring the renewable energy market, and particularly keeping abreast of storage technologies. It’s like combining my hobby with my job.

A new wind turbine design was recently announced, the SeaTwirl. It’s an off-shore turbine design using vertical blades. The key technological advance here is that it includes a method to store energy, so that it can continue to produce electricity when the wind stops blowing. This ability to deliver a constant output is important, because, as you may have heard, wind energy is intermittent; you only get electricity when the wind blows, and only to the degree that it’s blowing. Demand, unfortunately, doesn’t follow wind’s intermittency—nobody stops to check that the wind is blowing before they turn on their lights—and the utilities, transmission system operators and distribution system operators all have to supply electricity to meet demand.

The SeaTwirl stores energy by integrating an unusual wind turbine design with a pumped hydro system. It has the turbine blades on a large circular ring, which rotates parallel to the water’s surface kind of like a hula-hoop. This ring is hollow (more or less), and is filled up with water when the wind blows. When the wind stops blowing, the momentum of the water keeps the tube spinning, generating electricity, and the water is allowed to drain back out, spinning a hydro turbine to generate electricity.

The nice thing about this is that the storage can always be “recharged” and it’s “free.” Or at least it seems to be. If you’ve ever held a bucket while spinning around, you know that spinning with an empty bucket takes a lot less effort than spinning around with a full bucket. In part, this is because of a property known as the moment of inertia. The heavier or larger a spinning object gets, the more it resists changes to its rate of rotation (or rpm).

If the SeaTwirl is filling up this horizontal “hula-hoop” with water, then the weight of the tube is increasing and so is the moment of inertia. As the moment of inertia increases, the energy needed to reach a given rpm increases. Wind turbines normally generate electricity in proportion to the wind speed, because the rpm of the blades is proportional to the wind speed. Increase the moment of inertia and you decrease the rpm, which means you generate less electricity for a given wind speed.

Now comes the physics. For something shaped like a hula-hoop, the moment of inertia, I, is calculated from the mass, M, and the radius of the hoop, R, according to:

I = M R^{2}

The energy, E, in a spinning object is one-half the moment of inertia times the square of the speed of rotation, ω, according to

E=\frac{1}{2} I \omega ^{2}

If we know the energy (because we know the wind speed), then we can calculate the speed of rotation, ω, by rearranging that equation to get ω on the left-hand side:

\omega = \sqrt{\frac{2E}{I}}

We can then replace I with mass and radius from the first equation to get

\omega = \sqrt{\frac{2E}{MR^{2}}}

So we can see that, if we don’t change the energy E (or don’t change the wind speed), and don’t change the radius R of the spinning hoop, then increasing the mass M results in a slower rate of rotation.

From SeaTwirl’s website and press releases, we can estimate how big the SeaTwirl is, which will let us estimate how much slower a full SeaTwirl will spin than an empty one, and therefore how much less electricity it will generate. We can calculate this by taking the ratio of ω when full to ω when empty, so that the terms we don’t know (E and R) cancel out.

\frac{\omega_{full}}{\omega_{empty}} = \frac{\sqrt{\frac{2E}{M_{full}R^{2}}}}{\sqrt{\frac{2E}{M_{empty}R^{2}}}} = \sqrt{\frac{\frac{2E}{R^{2}}}{\frac{2E}{R^{2}}}}\sqrt{\frac{M_{empty}}{M_{full}}}=\sqrt{\frac{M_{empty}}{M_{full}}}

For the SeaTwirl, we now have to find out what R is, and estimate M for both the filled and the empty states.

The whole turbine assembly is made of composite materials, which probably have a density, \rho_{c}, of around 2500 kilograms per cubic meter (similar to fiberglass). Water has a density, \rho_{w}, of near 1000 kilograms per cubic meter (depending on temperature). The diameter of the turbine will be near 180 meters, so the radius, R, of our “hula-hoop” is half that, or 90 meters. From the pictures, it looks like the thickness of that hula-hoop is a few percent of the total diameter of the turbine, so we can figure an outside diameter of the “hula-hoop” of about 2 meters, for a radius, r, of 1 meter. Figure that at least ten percent of this is composite, and the rest is the hollow, water-filled portion.

To estimate the weight of the water in the “hula-hoop,” we can approximate the water as being a cylinder of radius r_{w} = 0.9r and length equal to the circumference of the “hula-hoop,” l = 2\pi R. The volume of such a cylinder is equal to the cross-sectional area of the water column, A_{w}=\pi r_{w}^{2} times the length of the column, l. The total mass of the water, m_{w} is the density times this volume.

m_{w} = \rho_{w}\pi r_{w}^{2}2\pi R = \rho_{w}3\pi (0.9r)^{2} R

Plugging in our estimates for the above values gives us

m_{w} = 1000 \cdot 3\pi (0.9)^{2} 90 = 687000 kg

That’s a lot of water.

Now for the empty “hula-hoop.” We can treat it in the same way: a cylinder of material of radius r, length l = 2\pi R. However, we don’t want to calculate for a solid cylinder of composite; we have to subtract out the hollow part with radius r_{w}. So the mass of the composite is

m_{c} = \rho_{c} 2 \pi R ( \pi r^{2} - \pi r^{2}_{w} )

m_{c} = 2500 \cdot 3 \pi 90 (1^{2} - 0.9^{2} ) = 403000 kg

So the water more than doubles the weight of the hoop.

From the picture, you can see that there’s another hoop at the top, and the two hoops are connected by the turbines, which combined are probably worth at least another hoop in weight, so we can further assume that the mass of this bottom hoop, empty, is roughly one-third of the total mass of the movable parts of turbine.

The mass of the turbine, empty, is therefore about 1200000 kg, or 1200 tons. Filled with water, this goes up to about 1900000 kg, or 1900 tons. Empty, that’s maybe twice the largest off-shore turbine currently in existence, but this thing is easily twice as big as any current turbine, so our estimate appears to be in the right neighborhood.

Now we go back to our equation for the ratio of the rotational velocities, ω, and plug in these weights:

\frac{\omega_{full}}{\omega_{empty}} = \sqrt{\frac{M_{empty}}{M_{full}}} =\sqrt{\frac{1200}{1900}} = 0.8

So we get about 80% as much electricity from a water-filled turbine as from an empty one, when the wind blows. This is a direct efficiency loss due to the storage of energy in the spinning-water-hoop.
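As a quick check of that arithmetic, here is the final ratio calculation in R, using the rounded mass estimates from above:

m.empty <- 1200   # tons, estimated mass of the moving assembly, empty
m.full  <- 1900   # tons, estimated mass with the lower hoop full of water
sqrt(m.empty / m.full)
## [1] 0.7947194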

In addition, there are the efficiency losses in loading water into the hoop, or “charging” the hoop, and the efficiency losses of “discharging” the hoop, running the water back out through a hydro turbine. Pumped hydro is usually about 72% efficient, or less, in each direction, so the total round-trip efficiency of storage + discharge is about 50%. There are a lot of other storage technologies that do at least this well, if not better.

These two figures, the 20% efficiency loss of simply operating the water-filled turbine and the 50% storage-discharge efficiency, can be used to directly compare the SeaTwirl with other wind turbine + storage technology solutions. Any storage technology that has at least a 50% round-trip efficiency and increases the total system cost by less than 20% over the system’s operating lifetime will outperform the SeaTwirl in terms of return on investment.

R Function Reference

Updated below

The R Function Reference is a mind map that I created as a guide for novice and intermediate users of the R statistics language. When you first open it, I suggest that you collapse all the nodes by clicking on the “Expand/Collapse all nodes” button in the bottom left of the screen to make the map easier to navigate. You can also adjust the zoom level with the slider next to that button.

R Function Reference screenshot

The top-level nodes of the R Function Reference

The mind map is organized into eight sections, or main branches, arranged by task. What do you want to do? Each branch covers a general set of tasks, such as learning to use R, running R, working with data, statistical analysis or plotting data. The end of each string of nodes is generally a function and example. The Reference provides code fragments, rather than details of the function or complete reproducible code blocks. Once you’ve followed the Reference and have an idea of how to accomplish something, you can look up the details in R’s help system (e.g. “?read.csv” to learn more about using the read.csv() function), or search Google or the online R-Help mailing list archives for answers using the function name.

There are a lot of useful nodes and examples, especially in the “Graphs” section, but the mind map is not complete; some trails end before you get to a useful function reference. I am sorry for that, but it’s a work in progress, and will be slowly updated over time.

Comments and suggestions are welcome.

Update 1

In comments, several users reported problems opening the mind map. With a little investigation, it appears that the size of the mind map is the problem. To try to fix the problems, I have split the mind map out into several smaller mind maps, all linked together.

The new main mind map is the R Function Reference, Main. The larger branches on this main map no longer expand to their own content, but contain a link to a “child” mind map. The link looks like a sheet of paper with an arrow pointing to the right; click on it and a little cartoon speech bubble will pop up with a link that you have to click on to go to the child mind map. Likewise, the central nodes on the child mind maps contain a link back to the main mind map.

Due to load times and the required extra clicks, this may slightly reduce usability for users who didn’t have a problem with the all-in-one version, but will hopefully make the mind map accessible to a broader audience.

I have to offer praise to the developers of Mind42. Though I couldn’t directly split branches off into their own mind maps or duplicate the mind map, it was very easy to export the mind map as a native Mind42 file and then import it multiple times, editing the copies without any loss of data or links. The ability to link directly between mind maps within Mind42 was also a key enabling feature. Considering that this is a free web app, its capabilities are most impressive. They were also quick to respond when I posted a call for help on the Mind42 forum.

Please let me know how the new, “improved” version works.

The old mind map, containing everything, is still available, but I will not update it.

Graphing Highly Skewed Data

Recently Chandoo.org posted a question about how to graph data when you have a lot of small values and a few larger values. It’s not the first time that I’ve come across this question, and I’ve seen a lot of answers, many of them really bad. While all solutions involve trade-offs for understanding and interpreting graphs, some solutions are better than others.

Data graphs tell stories by revealing patterns in complex data. Good data graphs let the data tell the story by revealing the patterns, rather than trying to impose patterns on the data.

As William Cleveland discusses in The Elements of Graphing Data and his 1993 paper A Model for Studying Display Methods of Statistical Graphics, there are two basic visual operations that people employ when looking at and interpreting graphs: pattern perception and table look-up. Pattern perception is where we see the geometric patterns in a graph: groupings; relative differences (larger/smaller); or trends (straight/curved or increasing/decreasing). Table look-up is where we explore details of values and names on a graph. These two operations are distinct and complementary, and it is through these two operations that the data’s story is told.

      month  sales
1    Feb 09    200
2    Mar 09    300
3    Apr 09    200
4    May 09    300
5    Jun 09    200
6    Jul 09    300
7    Aug 09    350
8    Sep 09    400
9    Oct 09    450
10   Nov 09   1200
11   Dec 09 100000
12   Jan 10  85000
13   Feb 10    450
So suppose that we have some data like the table above, where we are interested in the patterns of the smaller, individual values, but there are also a few extremely large values, or outliers. We describe such data as being skewed. How do we plot this data? First, for such a small data set, a simple table is the best approach: people can see the numbers and interpret them, there aren’t too many numbers to make sense of, and the table is very compact. For more complicated data sets, though, a graph is needed. There are a few basic options:

  • Graph as-is;
  • Graph with a second axis;
  • Graph the logarithm of the data;
  • Use a scale break; or
  • Plot the data multiple times.

Graph As-Is


A bar chart with all data, including outliers, plotted on the same scale.

This is the simplest solution, and if you’re only interested in knowing about the outliers (Dec ’09 and Jan ’10) then it will do. However, it completely hides whatever is happening in the rest of the months. Pattern recognition tells us that two months near the end of the series have the big numbers. Table-lookup tells us the approximate values and that these months are around December ’09 and February ’10, but the way the labels string together and overlap the tick marks, it’s not clear exactly what the labels are, let alone which label applies to which bar (which months are those, precisely? Is that “09 Dec” and “09 Feb?” Do the numbers even go with the text, or are they separate labels?).

For all but the simplest of messages, this rendition defeats both pattern recognition and table look-up. We definitely need a better solution.

Use a Secondary Axis

Excel gives us an easy solution: break the data into two columns (“small” numbers in one and “large” numbers in the other) and plot them on separate axes. Now we can see all the data, including the patterns in all the months.


Bar chart, with outliers plotted using a secondary axis.

Unfortunately, pattern recognition tells us that the big-sales months are about the same as all the other months. It’s only the table look-up that tells us how big of a difference there is between the two blue columns and the rest of the data. This is why I’ve added data labels to the two columns: to aid table look-up.

Even if we tweaked around with the axes to set the outliers off from the rest of the data, we’d still have the same basic problem: pattern recognition would tell us that there is a much smaller difference than there actually is. By using a secondary axis, we’ve set up a basic conflict between pattern recognition and table look-up. Worse, it’s easy to confuse the axes; which bars go with which axis? Reproduction in black and white or grayscale would make it impossible to correctly connect bars to the correct axis. Some types of color blindness would similarly make it difficult to interpret the graph. Table look-up is easily defeated with secondary axes.

The secondary axis presents so many problems that I always advise against using it. Stephen Few, author of Show Me The Numbers and Information Dashboard Design, calls graphs with secondary axes “dual-scaled graphs.” In his 2008 article Dual-Scaled Axes in Graphs, he concludes that there is always a better way to display data than by using secondary axes. Excel makes it easy to create graphs like this, but it’s always a bad idea.

Take the Logarithm

In scientific applications, skewed data is common, and the usual solution is to plot the logarithm of the values.


Bar chart plotting the skewed data with a logarithmic axis.

With the logarithm, it is easy to plot, and see, all of the data. Trends in small values are not hidden. Pattern perception immediately tells us the overall story of the data. Table look-up is easier than with secondary axes, and immediately tells us the scale of the differences. Plotting the logarithm allows pattern perception and table look-up to complement each other.

Below, I’ve created the same graph using a dot plot instead of a bar chart. Dot plots have many advantages over bar charts: most obviously, dot plots provide a better arrangement for category labels (e.g. the months); also, dot plots provide a clearer view of the data by plotting the data points rather than filling in the space between the axis and the data point. There are some nice introductions to dot plots, including William Cleveland’s works and a short introduction by Naomi Robbins. The message is clear: any data that you might present with a bar chart (or pie chart) will be better presented using dot plots.


Skewed data plotted on a dot plot using a logarithmic scale.
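If you want to reproduce this kind of plot in R, here is a minimal sketch using ggplot2 and the data from the table above (the data frame name and the choice of ggplot2 are mine; the original figures may have been produced differently):

library(ggplot2)
sales.df <- data.frame(
  month = c("Feb 09", "Mar 09", "Apr 09", "May 09", "Jun 09", "Jul 09", "Aug 09",
            "Sep 09", "Oct 09", "Nov 09", "Dec 09", "Jan 10", "Feb 10"),
  sales = c(200, 300, 200, 300, 200, 300, 350, 400, 450, 1200, 100000, 85000, 450)
)
# keep the months in calendar order (top to bottom) rather than alphabetical
sales.df$month <- factor(sales.df$month, levels = rev(sales.df$month))
ggplot(sales.df, aes(x = sales, y = month)) +
  geom_point() +
  scale_x_log10() +
  xlab("Sales (log scale)") +
  ylab(NULL)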

Use a Scale Break

Another approach, which might be better for audiences unfamiliar with logarithmic scales, is to use a scale break, or broken axis. With some work, we can create a scale break in Excel or OpenOffice.org.


Bar chart with outliers plotted by introducing a subtle scale break on the y-axis.

There are plenty of tutorials for how to accomplish this in Excel. For this example, I created the graph in OpenOffice.org Spreadsheet, using the same graph with the secondary axis, above. I adjusted the two scales, turned off the labels for both y-axes and turned off the tick marks for the secondary y-axis. Then I copied the graph over to the OpenOffice.org Draw application and added y-axis labels and the break marks as drawing objects.

That pretty much highlights the first problem with this approach: it takes a lot of work. The second problem is that those break marks are just too subtle; people will miss them.

The bigger problem is with interpretation. As with the secondary axis, this subtle scale break sets up a basic conflict between the two basic operations of graph interpretation. Pattern recognition tells us that the numbers are comparable; it’s only table look-up that tells us what a large difference there is.

Cleveland’s recommendation, when the logarithm won’t work, is to use a full-panel scale break. In this way, pattern recognition tells us that there are two distinct groups of data, and table look-up tells us what they are.


Dot plot with a full scale break to show outliers.

The potential disadvantage of this approach is that pattern perception might be fooled. While the scale break visually groups the “large” values from the “small” ones, the scale also changes, so that the broader panel on the left actually represents a much narrower range of values (about 1100 dollars range) than the narrower panel on the right (about 17000 dollars range). Our audience might have difficulties interpreting this correctly.

Small Multiples

Edward Tufte has popularized the idea of small multiples, the emphasis of differences by repeating a graph or image with small changes from one frame to the next. In this case, we could show the full data set, losing fidelity in the smaller values, and then repeat the graph while progressively zooming in on a narrower and narrower slice with each repetition.
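A rough sketch of that side-by-side arrangement in R, reusing the sales.df data frame from the earlier sketch (gridExtra is just one of several packages that can arrange the two panels):

library(ggplot2)
library(gridExtra)
p.full <- ggplot(sales.df, aes(x = sales, y = month)) +
  geom_point() +
  xlab("Sales (full range)") +
  ylab(NULL)
# the same plot, zoomed in on the 0 - 1200 range to show detail in the smaller values
p.zoom <- p.full + coord_cartesian(xlim = c(0, 1200)) + xlab("Sales (0 - 1200)")
grid.arrange(p.full, p.zoom, ncol = 2)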


The full data, with outliers, is plotted on the left. On the right, a zoomed view showing detail in the smaller values.

This shares many similarities to Cleveland’s full scale break, but provides greater flexibility. With this data, there are two natural ranges: 0 – 100000 and 0 – 1200. If there were more data between 1200 and 85000, we might repeat the graph several times, zooming in more with each repetition to show lower levels of detail.

I think there are two potential pitfalls. As with the full scale break, the audience might fail to appreciate the effect of the changes to scale. Worse, the audience might be fooled into thinking that each graph represented a different set of data, rather than just a different slice of the same data. Some care  in preparing such graphs will be needed for successful communication.

Summary

When presenting data that is, like the data above, arranged by category, use a dot plot instead of bar charts. When your data is heavily skewed, the best solution is to graph the logarithm of the data. However, if your audience will be unable to correctly interpret the logarithm, try a full scale break or small multiples.

Definitions

I was recently asked a question that raised some good design issues. The question went “why should changing this cause a change in that characteristic?”

The immediate and obvious answer was that it wouldn’t and couldn’t. Theoretically, a large decrease in this (X) might cause an increase of a few percent in that (Y); nothing more. Only someone was claiming that decreasing X decreased Y, too.

They were right. No, the theoretical relationship isn’t wrong. It’s right.

The theoretical calculation is fairly straightforward. You put so much of X in, and, after some calculation, you get so much of Y out. The less X you have, the more Y you get. The hard part is figuring out just how much of X you’re putting in.

The measurement of Y introduces a bunch of variation based on other factors. You measure by changing certain conditions A, B and C. These, in turn, affect some other factors, M and N. X, A, M and N together determine what value you measure for Y.

So decreasing X affects the other factors in such a way that the net effect is a decrease in the measured value of Y.

“Oh, sure,” you respond. “But the theoretical calculation should account for that.”

Not really. The theoretical calculation should tell us what the best case is…what our target should be. The actual measurement is going to produce different results based on various factors, some of which we control and some we can’t. A calculation based on the measurement process would require uncertainty ranges and return a probability distribution; not a singular value. Messy.

Engineers and researchers need to consider both of these as definitions. If you’re designing for some characteristic, as a researcher or engineer you’re usually going to be concerned with the theoretical calculations. This is how you were taught in school, and you’ll naturally be interested in getting as close to the best case as possible. However, not everyone is going to be interested in the theoretical calculation. The folks in Quality who are checking the product for conformance will be more interested in how it’s measured, the operational definition, than in the theoretical definition. The folks at the manufacturing plant only want to hear about the operational definition; for them, the world would be a better place without the theoretical definition.

As a design engineer, you need to be more concerned about the operational definition. You’ll be arguing that you designed a part for Y performance (or to “do Y“). The next question that management and your customers should (and probably will) ask is, how do you know you designed it to do that? The answer is always by data analysis. How do you get the data? Via the operational definition. What you know is determined by how you measure, and that’s the operational definition.

This has applicability well outside of engineering design. Physicists have been arguing this very point ever since Bohr and Heisenberg developed the Copenhagen interpretation of quantum physics. Management by objectives depends on the ability to close the loop by measuring outcomes. This means that management by objectives requires operational definitions of every objective (though few organizations actually get this far, and management by objectives becomes management by manager gut feeling). Even more enlightened management techniques, such as those advocated by Deming and Scholtes, require operational definitions to enable an organization’s performance improvement (e.g. through the use of control charts, which are only possible with operational definitions).

Use the theoretical definition to tell you the best possible case, but be sure to design according to the operational definition.

Successful Labs

I’ve worked in and managed a few R&D and test labs in my career. Lately, I’ve been thinking about success factors for those labs. At a high level, I believe that a successful Lab has four key attributes: good data; the right data; the ability to communicate the data clearly and effectively; and a relentless pursuit of perfection.

Good Data

Good data is accurate, has a known precision, and is reproducible by your lab and by third parties. This requires good calibration practices, careful evaluation of measurement uncertainty and documented test methods. In short, you need:

  • good calibration procedures and a calibration schedule that keeps equipment in calibration;
  • good procedures for measuring and documenting the measurement uncertainty and sources of error;
  • good procedures for how to set up tests, collect data and then utilize the information from calibration and measurement system analysis in your data analysis.

Yes, this all boils down to standard work. I’ve said it before, and I’ll say it again. Despite the common opinion that standard work is an impediment to creative R&D-type work, I’ve found just the opposite to be true.

This is harder to accomplish than it sounds, but there’s plenty of resources available to help. There’s even an international standard that a lab can be accredited against: ISO 17025.

The Right Data

Ensuring that you’re collecting the right data makes collecting good data look easy. Getting the right data means performing a test that provides useful information. There are two components to this: testing the expected conditions; and testing the boundary conditions.

If you’re doing your own testing, then testing the expected conditions is easy. You know what you’re thinking and what you expect, and you go test it. If I want to test if ice freezes at zero degrees Celsius, then I test it. However, if the testing is outsourced, then things get complicated. Suppose I live in Denver, Colorado, and I want to test if ice freezes at zero degrees Celsius in the winter. To test it myself, I might stick a thermometer in a glass of water, put it outside and wait. Suppose, though, that the testing is outsourced to a lab in a place like Bangladesh that’s hotter, lower in altitude and more humid. They can provide an answer, but will they address my intent? As the test requester, I may not ask the right question; as the test group, they may leap to test without fully understanding why I’m asking. This sort of confusion actually happens quite frequently, even when the test group and the requester are in the same building. It can be months before people realize that their question was only partially answered.

Paradoxically, testing the boundary conditions is difficult when you’re doing your own testing, while the communication errors described above make it easier for an outside lab to test the boundary conditions. It’s almost inevitable that they’ll test some boundary conditions. The reason for this is that boundary conditions are defined by one’s assumptions, and people are generally pretty poor at identifying and thinking through their own assumptions. Mentally, we tear right through the assumptions to the interesting bits. The outside lab, though, isn’t going to be quite testing the expected conditions; they’ll always be nearer at least one set of boundary conditions.

One common solution used by labs is to develop a detailed questionnaire to try to force their customers to detail their request in the lab’s terms. I’ve done this myself. It doesn’t work. A questionnaire, like a checklist, can help capture the things you know you’d otherwise overlook, but neither a questionnaire nor a checklist can bridge a communication gap.

The solution that works is to send the lab personnel out into the gemba; to go and see the customer’s world and understand what they’re doing and why they’re making their request. This is difficult. The lab may be geographically distant from the gemba. People working in a lab often got there by being independent thinkers and workers, and not by being very gregarious. Labs are also paid or evaluated for the testing they do; not for the customer visits they make. A lab manager needs to be able to overcome these obstacles.

Communication

Once the lab has the right, good data, there’s still one big challenge left. All that data goes to waste if it isn’t communicated effectively. This means understanding that the data tells a story, knowing what that story is, and then telling that story honestly and with clarity. One could write several books on this subject. I direct your attention to the exceptional works of Edward Tufte, especially The Visual Display of Quantitative Information, an excellent NASA report by S. Katzoff titled Clarity in Technical Reporting, and the deep works of William S. Cleveland, including The Elements of Graphing Data and Visualizing Data. If you don’t have these in your library, then you’re probably not communicating as clearly and effectively as you should.

Effective communication is critical even when you’re doing your own testing. It provides a record of your work so that others can follow in your footsteps. Without effective communication, whatever you learn stays with you, and you lose the ability to leverage the ideas and experience of others.

Pursuing Perfection

While we have to get the job done today, we probably haven’t delivered everything our customers needed. There’s always opportunity for improvement. A good lab recognizes this, constantly engages in self-reflection and finds ways to improve. This is a people-based activity. A good manager enables this critical self-reflection and supports and encourages the needed changes.

Such critical self-reflection is not always easy. In many business environments, an admission of imperfection is an invitation to be attacked, demoted, or fired. Encouraging the self-reflection that is crucial to improvement requires building trust with your employees; providing a safe environment. Employees have to be comfortable talking about their professional faults, having others talk about those faults, and they have to believe that they can improve.

Being genuine and honest can help a manager move their group in this direction. Ensuring that there are no negative consequences to the pursuit of perfection will also help. Unfortunately, it’s not entirely up to the manager; it’s a question of politics, policies and corporate culture. Effective managers and leaders need to navigate these waters for the good of their team. They’ll do this better with the support and involvement of their team.

Innumeracy

A number of blogs that I follow are talking about a recent article in the Wall Street Journal, We’re Number One, Alas. The author argues that the U.S.’s corporate tax rate is too high, claiming that countries with somewhat lower corporate tax rates generate more revenue from those taxes as a fraction of GDP. He uses the graph below to make his point.

Corporate Taxes and Revenue, 2004

The Laffer Curve on this graph is claimed to show the relationship between revenue and tax rate. A Laffer curve is based on the hypothesis that a government generates zero revenue when the tax rate is at either zero percent or one hundred percent, and that the maximum revenue from taxes falls somewhere in between these two extremes. The author is claiming that the optimum is below the current U.S. rate, and illustrates this by placing the U.S. on the far side of a big cliff.

This is an egregious case of innumeracy. I told myself when I started blogging that I would steer clear of the blog echo chamber as much as possible, but it is not all that uncommon to see similar presentations in the corporate world. Some data points are plotted, and then some chartjunk is added to tell a story…a story that may not be supported by the data at all. There are a few things that managers and engineers can do to combat this. For instance, if there’s supposed to be a correlation between values, we can ask for the correlation coefficient.

In this case, the squared correlation coefficient is about 0.1. This is equivalent to saying that just ten percent of the variation in revenues from taxes is explained by the tax rate. If you were working on process improvement, you would not want to focus on a factor that only accounted for ten percent of the variation; you would be looking for a factor that explained greater than fifty percent.

Another approach would be to ask for a hypothesis test. The null hypothesis would be that there is no correlation; the alternate hypothesis is that there is some correlation. As a business manager, you want to select the level of risk that you’re willing to accept. This is an economic decision, as risk analysis usually is. For the sake of argument, we will accept a risk of five percent. This is our “alpha” value (α-value), which we’ll express as a fraction: 0.05. We now need to perform the appropriate hypothesis test and compare the resulting p-value against our α-value. If the p-value is less than the α-value, then we conclude that there is a correlation; if the p-value is greater than the α-value, we cannot conclude that any correlation exists.

There are plenty of statistics packages out there that can perform these analyses for us. Some are easier to use than others; some are more powerful than others. We use Minitab at work, and I find it indispensable. I also drop into R occasionally. R is much more powerful and free, but it’s all command-line programming, so it also has a much larger learning curve.
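In R, the whole hypothesis test is a single function call. A sketch, using a hypothetical data frame name since we don't have the article's underlying data points:

# laffer.df is assumed to hold the plotted points: tax.rate and revenue (as a fraction of GDP)
cor.test(laffer.df$tax.rate, laffer.df$revenue)
# the output includes the estimated correlation and the p-value to compare against our alpha of 0.05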

The p-value on this data is greater than 0.05, which means we cannot conclude that there is any linear correlation between the revenue and the tax rate.

Linear, though, means you have a straight line, and the Laffer curve is not linear by definition. The data fails our first tests, but the assumptions in our tests may have driven us to a false failure.

Let’s go back and start by plotting just the data.

Corporate Taxes and Revenue, 2004, data only

The exaggerated Laffer curve in the original presentation is not evident in this data. Excluding the outlier where revenue is 0.1 of GDP (looking at the Wall Street Journal’s graph, we see that this is for Norway), the data is roughly linear: zero revenue at a zero percent tax rate, and slightly increasing revenue with increasing tax rate. There may be a slight rounding-off or flattening in the tax rate range 0.20 to 0.35.

Since we do not know what shape the Laffer curve should take—where the maximum should be—and we don’t have enough data to find it, we can use the Lowess algorithm to create a smoothed curve.
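The smoothed curve and the rug plot in the figure below can be produced in base R with lowess() and rug(); again, the data frame name is an assumption:

plot(revenue ~ tax.rate, data = laffer.df,
     xlab = "Corporate tax rate", ylab = "Revenue (fraction of GDP)")
lines(lowess(laffer.df$tax.rate, laffer.df$revenue), col = "blue")
rug(laffer.df$tax.rate, side = 1)   # a tick for every data point on the x-axis
rug(laffer.df$revenue, side = 2)    # and on the y-axis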

Corporate Taxes and Revenue, 2004, with Lowess curve

This confirms our observation that the relationship is essentially linear, with a possible rounding off above 0.20. I’ve added a rug plot to the axes, which gives a tick for every data point. This is useful because it helps us to focus on the distribution of the data, much as a separate histogram would.

Where does all this get us? It tells us that the author’s curve was most likely drawn in just to make his point and does not fit any data or data analysis. It also tells us that the author’s story had nothing to do with the data.

I have seen this many times in the corporate world. Graphs of neat lines, where all the data points have been averaged out and left out. Graphs where a preconceived curve is fitted to data without regard for how well (or poorly) the curve fits the data. Data may be removed completely and fitted curves smoothed until they “look right.”

Combating this is not hard. It just takes some thought and a few questions.

First, make sure you actually see the data, and not just some prettified lines. Graphs should contain data points, and real data usually is not terribly clean or smooth.

Second, when a curve is fit to the data, ask what the correlation is. This should be a number, and less than 0.5 (or 50%) means there is no useful correlation. The closer to 1 (or 100%), the better. Ask, too, what the basis of the fitted line is: is this just some high-order polynomial or spline that was selected because of the high correlation, or is there a solid physical—or theoretical—basis for the selected line? If there is no physical basis, straight lines are preferable to curves.

Third, ask for a numerical analysis. Graphs are powerful; they allow us to see all the data, to see the trends, and to determine what sorts of numerical analyses are possible. However, graphs are not a substitute for numerical, or statistical, analysis. The two complement each other. So ask for the hypothesis statement and the alternate hypothesis. Ask for the p-value and the α-value (which should have been selected before the experiment was conducted).

I realize that this is unfamiliar territory for a lot of managers, who have little mathematical background and often no training in statistics. It’s not hard to ask the questions, and it’s not hard to understand the answers—they’re pretty straight-forward—and I don’t think that you need to have a deep understanding of the mathematics. Let the experts be the experts, but ask questions to make sure that you are covering the risk.