Rewriting plot.qcc using ggplot2 and grid

The free and open-source R statistics package is a great tool for data analysis. The free add-on package qcc provides a wide array of statistical process control charts and other quality tools, which can be used for monitoring and controlling industrial processes, business processes or data collection processes. It’s a great package and highly customizable, but the one feature I wanted was the ability to manipulate the control charts within the grid graphics system, and that turned out to be not so easy.

I went all-in and completely rewrote qcc’s plot.qcc() function to use Hadley Wickham’s ggplot2 package, which itself is built on top of grid graphics. I have tested the new code against all the examples provided on the qcc help page, and the new ggplot2 version works for all the plots, including X-bar and R, p- and u- and c-charts.

In qcc, an individuals and moving range (XmR or ImR) chart can be created simply:

library(qcc)
my.xmr.raw <- c(5045,4350,4350,3975,4290,4430,4485,4285,3980,3925,3645,3760,3300,3685,3463,5200)
x <- qcc(my.xmr.raw, type = "xbar.one", title = "Individuals Chart\nfor Wheeler sample data")
x <- qcc(matrix(cbind(my.xmr.raw[1:length(my.xmr.raw)-1], my.xmr.raw[2:length(my.xmr.raw)]), ncol = 2), type = "R", title = "Moving Range Chart\nfor Wheeler sample data")

This both generates the plot and creates a qcc object, assigning it to the variable x. You can generate another copy of the plot with plot(x).

To use my new plot function, you will need to have the packages ggplot2, gtable, qcc and grid installed. Download my code from the qcc_ggplot project on Github, load qcc in R and then run source("qcc.plot.R"). The ggplot2-based version of the plotting function will be used whenever a qcc object is plotted.

library(qcc)
source("qcc.plot.R")
my.xmr.raw <- c(5045,4350,4350,3975,4290,4430,4485,4285,3980,3925,3645,3760,3300,3685,3463,5200)
x <- qcc(my.xmr.raw, type = "xbar.one", title = "Individuals Chart\nfor Wheeler sample data")
x <- qcc(matrix(cbind(my.xmr.raw[1:length(my.xmr.raw)-1], my.xmr.raw[2:length(my.xmr.raw)]), ncol = 2), type = "R", title = "Moving Range Chart\nfor Wheeler sample data")

Below, you can compare the individuals and moving range charts generated by qcc and by my new implementation of plot.qcc():

The qcc individuals chart as implemented in the qcc package.

The qcc individuals chart as implemented in the qcc package.

The qcc individuals chart as implemented using ggplot2 and grid graphics.

The qcc individuals chart as implemented using ggplot2 and grid graphics.

The qcc moving range chart as implemented in the qcc package.

The qcc moving range chart as implemented in the qcc package.

The qcc moving range chart as implemented using ggplot2 and grid graphics.

The qcc moving range chart as implemented using ggplot2 and grid graphics.

New features

In addition to the standard features in qcc plots, I’ve added a few new options.

size or cex
Set the size of the points used in the plot. This is passed directly to geom_point().
font.size
Sets the size of text elements. Passed directly to ggplot() and grid’s viewport().
title = element_blank()
Eliminate the main graph title completely, and expand the data region to fill the empty space. As with qcc, with the default title = NULL a title will be created, or a user-defined text string may be passed to title.
new.plot
If TRUE, creates a new graph (grid.newpage()). Otherwise, will write into the existing device and viewport. Intended to simplify the creation of multi-panel or composite charts.
digits
The argument digits is provided by the qcc package to control the number of digits printed on the graph, where it either uses the default option set for R or a user-supplied value. I have tried to add some intelligence to calculating a default value under the assumption that we can tell something about the measurement from the data supplied. You can see the results in the sample graphs above.

Lessons Learned

This little project turned out to be somewhat more difficult than I had envisioned, and there are several lessons-learned, particularly in the use of ggplot2.

First, ggplot2 really needs data frames when plotting. Passing discrete values or variables not connected to a data frame will often result in errors or just incorrect results. This is different than either base graphics or grid graphics, and while Hadley Wickham has mentioned this before, I hadn’t fully appreciated it. For instance, this doesn’t work very well:

my.test.data <- data.frame(x = seq(1:10), y = round(runif(10, 100, 300)))
my.test.gplot <- ggplot(my.test.data, aes(x = x, y = y)) + 
  geom_point(shape = 20)
index.1 <- c(5, 6, 7)
my.test.gplot <- my.test.gplot +
  geom_point(aes(x = x[index.1], y = y[index.1]), col = "red")
my.test.gplot

Different variations of this sometimes worked, or sometimes only plotted some of the points that are supposed to be colored red.

However, if I wrap that index.1 into a data frame, it works perfectly:

my.test.data <- data.frame(x = seq(1:10), y = round(runif(10, 100, 300)))
my.test.gplot <- ggplot(my.test.data, aes(x = x, y = y)) + 
  geom_point(shape = 20)
index.1 <- c(5, 6, 7)
my.test.subdata <- my.test.data[index.1,]
my.test.gplot <- my.test.gplot +
  geom_point(data = my.test.subdata, aes(x = x, y = y), col = "red")
my.test.gplot

Another nice lesson was that aes() doesn’t always work properly when ggplot2 is called from within a function. In this case, aes_string() usually works. There’s less documentation than I would like on this, but you can search the ggplot2 Google Group or Stack Overflow for more information.

One of the bigger surprises was discovering that aes() searches for data frames in the global environment. When ggplot() is used from within a function, though, any variables created within that function are not accessible in the global environment. The work-around is to tell ggplot which environment to search in, and a simple addition of environment = environment() within the ggplot() call seems to do the trick. This is captured in a stack overflow post and the ggplot2 issue log.

my.test.data <- data.frame(x = seq(1:10), y = round(runif(10, 100, 300)))
my.test.gplot <- ggplot(my.test.data, environment = environment(), aes(x = x, y = y)) + 
  geom_point(shape = 20)
index.1 <- c(5, 6, 7)
my.test.subdata <- my.test.data[index.1,]
my.test.gplot <- my.test.gplot +
  geom_point(data = my.test.subdata, aes(x = x, y = y), col = "blue")
my.test.gplot

Finally, it is possible to completely and seamlessly replace a function created in a package and loaded in that package’s namespace. When I set out, I wanted to end up with a complete replacement for qcc’s internal plot.qcc() function, but wasn’t quite sure this would be possible. Luckily, the below code, called after the function declaration, worked. One thing I found was that I needed to name my function the same as the one in the qcc package in order for the replacement to work in all cases. If I used a different name for my function, it would work when I called plot() with a qcc object, but qcc’s base graphics version would be used when calling qcc() with the parameter plot = TRUE.

unlockBinding(sym="plot.qcc", env=getNamespace("qcc"));
assignInNamespace(x="plot.qcc", value=plot.qcc, ns=asNamespace("qcc"), envir=getNamespace("qcc"));
assign("plot.qcc", plot.qcc, envir=getNamespace("qcc"));
lockBinding(sym="plot.qcc", env=getNamespace("qcc"));

Outlook

For now, the code suits my immediate needs, and I hope that you will find it useful. I have some ideas for additional features that I may implement in the future. There are some parts of the code that can and should be further cleaned up, and I’ll tweak the code as needed. I am certainly interested in any bug reports and in seeing any forks; good ideas are always welcome.

References

  • R Core Team (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.
  • Scrucca, L. (2004). qcc: an R package for quality control charting and statistical process control. R News 4/1, 11-17.
  • H. Wickham. ggplot2: elegant graphics for data analysis. Springer New York, 2009.
  • Wheeler, Donald. “Individual Charts Done Right and Wrong.” Quality Digest. 2 Feb 20102 Feb 2010. Print. <http://www.spcpress.com/pdf/DJW206.pdf>.

Individuals and Moving Range Charts in R

Individuals and moving range charts, abbreviated as ImR or XmR charts, are an important tool for keeping a wide range of business and industrial processes in the zone of economic production, where a process produces the maximum value at the minimum costs.

While there are many commercial applications that will produce such charts, one of my favorites is the free and open-source software package R. The freely available add-on package qcc will do all the heavy-lifting. There is little documentation on how to create a moving range chart, but the code is actually quite simple, as shown below.

The individuals chart requires a simple vector of data. The moving range chart needs a two-column matrix arranged so that qcc() can calculate the moving range from each row.

library(qcc)
#' The data, from sample published by Donald Wheeler
my.xmr.raw <- c(5045,4350,4350,3975,4290,4430,4485,4285,3980,3925,3645,3760,3300,3685,3463,5200)
#' Create the individuals chart and qcc object
my.xmr.x <- qcc(my.xmr.raw, type = "xbar.one", plot = TRUE)
#' Create the moving range chart and qcc object. qcc takes a two-column matrix
#' that is used to calculate the moving range.
my.xmr.raw.r <- matrix(cbind(my.xmr.raw[1:length(my.xmr.raw)-1], my.xmr.raw[2:length(my.xmr.raw)]), ncol=2)
my.xmr.mr <- qcc(my.xmr.raw.r, type="R", plot = TRUE)

This produces the individuals chart:

The qcc individuals chart as implemented in the qcc package.

The qcc individuals chart.

and the moving range chart:

The qcc moving range chart as implemented in the qcc package.

The qcc moving range chart.

The code is also available as a gist.

References

  • R Core Team (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.
  • Scrucca, L. (2004). qcc: an R package for quality control charting and statistical process control. R News 4/1, 11-17.
  • Wheeler, Donald. “Individual Charts Done Right and Wrong.” Quality Digest. 2 Feb 20102 Feb 2010. Print. <http://www.spcpress.com/pdf/DJW206.pdf>.

Is there a Market for Premium R Packages?

Nathan Yau, of the excellent FlowingData blog, recently asked on his Twitter stream:

I wonder if there’s a market for premium R packages, like there is for say, @wordpress themes and plugins

There are some great packages available for R, all of which are currently free. I think it would be great if authors like Hadley Wickham and Ian Fellows received remuneration for their efforts. However, I see a trap here.

From my perspective, R has two main barriers to adoption: the learning curve and IT support.

The learning curve is steep enough that casual users will not get very far, and infrequent users tend to slide backwards and have to relearn (I’ve had to develop a mind map of common functions to help mitigate this problem for myself). R packages generally address the learning curve. There aren’t many packages that provide functions that the user couldn’t have created for themselves with base R, but the packages make using their functions much easier. ggplot2 and its support packages plyr and reshape make a perfect example. The default R graphical output is pretty good, but ggplot2 offers better aesthetic defaults and provides an easier path to advanced functions, like transforming data and adding fitted curves and “ribbons.”

IT departments will not have any readily available, professional support should problems arise with any R installation. I’ve seen a couple of IT departments balk at supporting open source software for this very reason, and one of them balked at supporting R for this reason. IT departments must evaluate software through the lenses of incident response and down time. However good the community, open source software leaves a big uncertainty when planning for support budgets. The only solution that I’ve found for IT is to convince them that they don’t have to support R; I can do it myself. I’m sure some of you are luckier that way, and it seems that Revolution is slowly addressing this issue, but it’s not an issue that has been generally addressed in the community. In addition, IT departments are usually responsible for ensuring that all software installed on their organization’s computers is legally licensed to the organization. With everything currently free, that’s a problem that is easily overcome.

Make ggplot2, or any other package, available only to those who can pay, and you exacerbate the two main problems with R: great functionality that flattens the learning curve will be lost to a large segment of users (i.e. casual or infrequent users and cash-strapped users like students), and IT departments will have to choose between actively supporting R and simply banning it. Providing user-installable R packages where some are freely licensed and others are not would create an environment where some IT departments would simply ban R rather than have to sort out the licensing issues. I suspect that many would ban R.

We need a way to repay package authors for their time, without losing the benefits of freely available packages. Donationware seems like a good first step, even though the response rate is typically very low.

R Function Reference

Updated below

The R Function Reference is a mind map that I created as a guide for novice and intermediate users of the R statistics language. When you first open it, I suggest that you collapse all the nodes by clicking on the “Expand/Collapse all nodes” button in the bottom left of the screen to make the map easier to navigate. You can also adjust the zoom level with the slider next to that button.

R Function Reference screenshot

The top-level nodes of the R Function Reference

The mind map is arranged in eight sections, or main branches, arranged by task. What do you want to do? Each branch covers a general set tasks, such as learning to use R, running R, working with data, statistical analysis or plotting data. The end of each string of nodes is generally a function and example. The Reference provides code fragments, rather than details of the function or complete reproducible code blocks. Once you’ve followed the Reference and have an idea of how to accomplish something, you can look up the details in R’s help system (e.g. “?read.csv” to learn more about using the read.csv() function), or search Google or the online R-Help mailing list archives for answers using the function name.

There are a lot of useful nodes and examples, especially in the “Graphs” section, but the mind map is not complete; some trails end before you get to a useful function reference. I am sorry for that, but it’s a work in progress, and will be slowly updated over time.

Comments and suggestions are welcome.

Update 1

In comments, several users reported problems opening the mind map. With a little investigation, it appears that the size of the mind map is the problem. To try to fix the problems , I have split the mind map out into several small mind maps, all linked together.

The new main mind map is the R Function Reference, Main. The larger branches on this main map no longer expand to their own content, but contain a link to a “child” mind map. The link looks like a sheet of paper with an arrow pointing to the right, click on it and little cartoon speech bubble will pop up with a link that you have to click on to go to the child mind map. Likewise, the central nodes on the child mind maps contain a link back to the main mind map.

Due to load times and the required extra clicks, this may slightly reduce usability for users who didn’t have a problem with the all-in-one version, but will hopefully make the mind map accessible to a broader audience.

I have to offer praise to the developers of Mind42. Though I couldn’t directly split branches off into their own mind maps or duplicate the mind map, it was very easy to export the mind map as a native Mind42 file and then import it multiple times, editing the copies without any loss of data or links. The ability to link directly between mind maps within Mind42 was also a key enabling feature. Considering that this is a free web app, its capabilities are most impressive. They were also quick to respond when I posted a call for help on the Mind42 forum.

Please let me know how the new, “improved” version works.

The old mind map, containing everything, is still available, but I will not update it.

Graphing Highly Skewed Data

Recently Chandoo.org posted a question about how to graph data when you have a lot of small values and a few larger values. It’s not the first time that I’ve come across this question, and I’ve seen a lot of answers, many of them really bad. While all solutions involve trade-offs for understanding and interpreting graphs, some solutions are better than others.

Data graphs tell stories by revealing patterns in complex data. Good data graphs let the data tell the story by revealing the patterns, rather than trying to impose patterns on the data.

As William Cleveland discusses in The Elements of Graphing Data and his 1993 paper A Model for Studying Display Methods of Statistical Graphics, there are two basic visual operations that people employ when looking at and interpreting graphs: pattern perception and table look-up. Pattern perception is where we see the geometric patterns in a graph: groupings; relative differences (larger/smaller); or trends (straight/curved or increasing/decreasing). Table look-up is where we explore details of values and names on a graph. These two operations are distinct and complimentary, and it is through these two operations that the data’s story is told.

month sales
1 Feb 09 200
2 Mar 09 300
3 Apr 09 200
4 May 09 300
5 Jun 09 200
6 Jul 09 300
7 Aug 09 350
8 Sep 09 400
9 Oct 09 450
10 Nov 09 1200
11 Dec 09 100000
12 Jan 10 85000
13 Feb 10 450

So suppose that we have some data like that at right, where we are interested in the patterns of smaller, individual values, but there are also a few extremely large values, or outliers. We describe such data as being skewed. How do we plot this data? First, for such a small data set, a simple table is the best approach. People can see the numbers and interpret them, there aren’t too many numbers to make sense of and the table is very compact. For more complicated data sets, though, a graph is needed. There’s a few basic options:

  • Graph as-is;
  • Graph with a second axis;
  • Graph the logarithm of the data;
  • Use a scale break.
  • Plot the data multiple times.

Graph As-Is

Bar chart with all data plotted

A bar chart with all data, including outliers, plotted on the same scale.

This is the simplest solution, and if you’re only interested in knowing about the outliers (Dec ’09 and Jan ’10) then it will do. However, it completely hides whatever is happening in the rest of the months. Pattern recognition tells us that two months near the end of the series have the big numbers. Table-lookup tells us the approximate values and that these months are around December ’09 and February ’10, but the way the labels string together and overlap the tick marks, it’s not clear exactly what the labels are, let alone which label applies to which bar (which months are those, precisely? Is that “09 Dec” and “09 Feb?” Do the numbers even go with the text, or are they separate labels?).

For all but the simplest of messages, this rendition defeats both pattern recognition and table look-up. We definitely need a better solution.

Use a Secondary Axis

Excel gives us an easy solution: break the data into two columns (“small” numbers in one and “large” numbers in the other) and plot them on separate axes. Now we can see all the data, including the patterns in all the months.

Bar Chart with Outliers on Secondary Axis

Bar chart, with outliers plotted using a secondary axis.

Unfortunately, pattern recognition tells us that the big-sales months are about the same as all the other months. It’s only the table look-up that tells us how big of a difference there is between the two blue columns and the rest of the data. This is why I’ve added data labels to the two columns: to aid table look-up.

Even if we tweaked around with the axes to set the outliers off from the rest of the data, we’d still have the same basic problem: pattern recognition would tell us that there is a much smaller difference than there actually is. By using a secondary axis, we’ve set up a basic conflict between pattern recognition and table look-up. Worse, it’s easy to confuse the axes; which bars go with which axis? Reproduction in black and white or grayscale would make it impossible to correctly connect bars to the correct axis. Some types of color blindness would similarly make it difficult to interpret the graph. Table look-up is easily defeated with secondary axes.

The secondary axis presents so many problems that I always advise against using it. Stephen Few, author of Show Me The Numbers and Information Dashboard Design, calls graphs with secondary axes “dual-scaled graphs.” In his 2008 article Dual-Scaled Axes in Graphs, he concludes that there is always a better way to display data than by using secondary axes. Excel makes it easy to create graphs like this, but it’s always a bad idea.

Take the Logarithm

In scientific applications, skewed data is common, and the usual solution is to plot the logarithm of the values.

Bar Chart with Logarithmic Axis

Bar chart plotting skewed with logarithmic axis.

With the logarithm, it is easy to plot, and see, all of the data. Trends in small values are not hidden. Pattern perception immediately tells us the overall story of the data. Table look-up is easier than with secondary axes, and immediately tells us the scale of the differences. Plotting the logarithm allows pattern perception and table look-up to compliment each other.

Below, I’ve created the same graph using a dot plot instead of a bar chart. Dot plots have many advantages over bar charts: most obviously, dot plots provide a better arrangement for category labels (e.g. the months); also, dot plots provide a clearer view of the data by plotting the data points rather than filling in the space between the axis and the data point. There are some nice introductions to dot plots, including William Cleveland’s works and a short introduction by Naomi Robbins. The message is clear: any data that you might present with a bar chart (or pie chart) will be better presented using dot plots.

Dot plot with logarithmic scale

Skewed data plotted on a dot plot using a logarithmic scale.

Use a Scale Break

Another approach, which might be better for audiences unfamiliar with logarithmic scales, is to use a scale break, or broken axis. With some work, we can create a scale break in Excel or OpenOffice.org.

Bar chart with a subtle scale break on the Y axis.

Bar chart with outliers plotted by introducing a subtle scale break on the y-axis.

There are plenty of tutorials for how to accomplish this in Excel. For this example, I created the graph in OpenOffice.org Spreadsheet, using the same graph with the secondary axis, above. I adjusted the two scales, turned off the labels for both y-axes and turned off the tick marks for the secondary y-axis. Then I copied the graph over to the OpenOffice.org Draw application and added y-axis labels and the break marks as drawing objects.

That pretty much highlights the first problem with this approach: it takes a lot of work. The second problem is that those break marks are just too subtle; people will miss them.

The bigger problem is with interpretation. As with the secondary axis, this subtle scale break sets up a basic conflict between the two basic operations of graph interpretation. Pattern recognition tells us that the numbers are comparable; it’s only table look-up that tells us what a large difference there is.

Cleveland’s recommendation, when the logarithm won’t work, is to use a full-panel scale break. In this way, pattern recognition tells that there are two distinct groups of data, and table look-up tells us what they are.

Dot plot with full scale break

Dot plot with a full scale break to show outliers.

The potential disadvantage of this approach is that pattern perception might be fooled. While the scale break visually groups the “large” values from the “small” ones, the scale also changes, so that the broader panel on the left actually represents a much narrower range of values (about 1100 dollars range) than the narrower panel on the right (about 17000 dollars range). Our audience might have difficulties interpreting this correctly.

Small Multiples

Edward Tufte has popularized the idea of small multiples, the emphasis of differences by repeating a graph or image with small changes from one frame to the next. In this case, we could show the full data set, losing fidelity in the smaller values, and then repeat the graph while progressively zooming in on a narrower and narrower slice with each repetition.

Dot Plot showing full data (including outliers) side-by-side with zoomed view.

The full data, with outliers, is plotted on the left. On the right, a zoomed view showing detail in the smaller values.

This shares many similarities to Cleveland’s full scale break, but provides greater flexibility. With this data, there are two natural ranges: 0 – 100000 and 0 – 1200. If there were more data between 1200 and 85000, we might repeat the graph several times, zooming in more with each repetition to show lower levels of detail.

I think there are two potential pitfalls. As with the full scale break, the audience might fail to appreciate the effect of the changes to scale. Worse, the audience might be fooled into thinking that each graph represented a different set of data, rather than just a different slice of the same data. Some care  in preparing such graphs will be needed for successful communication.

Summary

When presenting data that is, like the data above, arranged by category, use a dot plot instead of bar charts. When your data is heavily skewed, the best solution is to graph the logarithm of the data. However, if your audience will be unable to correctly interpret the logarithm, try a full scale break or small multiples.