# Understanding Data

When analyzing data, I have often found it useful to think of the data as being one of four main types, according to the typology proposed by Stevens.[1] Different types of data have certain characteristics; understanding what type of data you have helps with selecting the analysis to perform perform while preventing basic mistakes.

The types, or “scales of measurement,” are:

Nominal
Data identifying unique classifications or objects where the order of values is not meaningful. Examples include zip codes, gender, nationality, sports teams and multiple choice answers on a test.
Ordinal
Data where the order is important but the difference or distance between items is not important or not measured. Examples include team rankings in sport (team A is better than team B, but how much better is open to debate), scales such as health (e.g. “healthy” to “sick”), ranges of opinion (e.g. “strongly agree” to “strongly disagree” or “on a scale of 1 to 10”) and Intelligence Quotient.
Interval
Numeric data identified by values where the degree of difference between items is significant and meaningful, but their ratio is not. Common examples are dates—we can say 2000 CE is 1000CE + 1000 years, but 1000 CE is not half of 2000 CE in any meaningful way—and temperatures on the Celsius and Fahrenheit scales, where a difference of 10° is meaningful, but 10° is not twice as hot as 5°.
Ratio
Numeric data where the ratio between numbers is meaningful. Usually, such scales have a meaningful “0.” Examples include length, mass, velocity, acceleration, voltage, power, duration, energy and Kelvin-scale temperature.

The generally-appropriate statistics and mathematical operations for each type are summarized in table 1.

Table 1: Scales of measurement and allowed statistical and mathematical operations.
Scale Type Statistics Operations
Nominal mode, frequency, chi-squared, cluster analysis =, ≠
Ordinal above, plus: median, non-parametric tests, Kruskal-Wallis, rank-correlation =, ≠, >, <
Interval plus: arithmetic mean, some parametric tests, correlation, regression, ANOVA (sometimes), factor analysis =, ≠, >, <, +, –
Ratio plus: geometric and harmonic mean, ANOVA, regression, correlation coefficient =, ≠, >, <, +, -, ×, ÷

While this is a useful typology for most use, and certainly for initial consideration, there are valid criticisms of Stevens’ typology. For example, percentages and count data have some characteristics of ratio-scale data, but with additional constraints. e.g. the average of the counts $\overline{(2, 2, 1)} = 1.66\ldots$ may not be meaningful. This typology is a useful thinking tool, but it is essential to understand the statistical methods being applied and their sensitivity to departures from underlying assumptions.

### Types of data in R

R[2] recognizes at least fifteen different types of data. Several of these are related to identifying functions and other objects—most users don’t need to worry about most of them. The main types that industrial engineers and scientists will need to use are:

numeric
Real numbers. Also known as double, real and single (note that R stores all real numbers in double-precision). May be used for all scales of measurement, but is particularly suited to ratio scale measurements.
complex
Imaginary real numbers can be manipulated directly as a data type using

x <- 1 + i2

or

x <- complex(real=1, imaginary=2)

Like type numeric, may be used for all scales of measurement.

integer
Stores integers only, without any decimal point. Can be used mainly for ordinal or interval data, but may be used as ratio data—such as counts—with some caution.
logical
Stores Boolean values of TRUE or FALSE, typically used as nominal data.
character
Stores text strings and can be used as nominal or ordinal data.

### Types of variables in R

The above types of data can be stored in several types, or structures, of variables. The equivalent to a variable in Excel would be rows, columns or tables of data. The main ones that we will use are:

vector
Contains one or many elements, and behaves like a column or row of data. Vectors can contain any of the above types of data but each vector is stored, or encoded, as a single type. The vector

c(1, 2, 1, 3, 4)
## [1] 1 2 1 3 4

is, by default, a numeric vector of type double, but

c(1, 2, 1, 3, 4, "name")
## [1] "1" "2" "1" "3" "4" "name"

will be a character vector, or a vector where all data is stored as type character, and the numbers will be stored as characters rather than numbers. It will not be possible to perform mathematical operations on these numbers-stored-as-characters without first converting them to type numeric.

factor
A special type of character vector, where the text strings signify factor levels and are encoded internally as integer counts of the occurrence of each factor. Factors can be treated as nominal data when the order does not matter, or as ordinal data when the order does matter.
factor(c("a", "b", "c", "a"), levels=c("a","b","c","d"))
## [1] a b c a
## Levels: a b c d
array
A generalization of vectors from one dimension to two or more dimensions. Array dimensions must be pre-defined and can have any number of dimensions. Like vectors, all elements of an array must be of the same data type. (Note that the letters object used in the example below is a variable supplied by R that contains the letters a through z.)

# letters a - c in 2x4 array
array(data=letters[1:3], dim=c(2,4))
##      [,1] [,2] [,3] [,4]
## [1,] "a"  "c"  "b"  "a"
## [2,] "b"  "a"  "c"  "b"

# numbers 1 - 3 in 2x4 array
array(data=1:3, dim=c(2,4))
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    2    1
## [2,]    2    1    3    2
matrix
A special type of array with the properties of a mathematical matrix. It may only be two-dimensional, having rows and columns, where all columns must have the same type of data and every column must have the same number of rows. R provides several functions specific to manipulating matrices, such as taking the transpose, performing matrix multiplication and calculation eigenvectors and eigenvalues.

matrix(data = rep(1:3, times=2), nrow=2, ncol=3)
##      [,1] [,2] [,3]
## [1,]    1    3    2
## [2,]    2    1    3
list
Vectors whose elements are other R objects, where each object of the list can be of a different data type, and each object can be of different length and dimension than the other objects. Lists can therefore store all other data types, including other lists.

list("text", "more", 2, c(1,2,3,2))
## [[1]]
## [1] "text"
##
## [[2]]
## [1] "more"
##
## [[3]]
## [1] 2
##
## [[4]]
## [1] 1 2 3 2
data.frame
For most industrial and data scientists, data frames are the most widely useful type of variable. A data.frame is the list analog to the matrix: it is an $m \times n$ list where all columns must be vectors of the same number of rows (determined with NROW()). However, unlike matrices, different columns can contain different types of data and each row and column must have a name. If not named explicitly, R names rows by their row number and columns according to the data assigned assigned to the column. Data frames are typically used to store the sort of data that industrial engineers and scientists most often work with, and is the closest analog in R to an Excel spreadsheet. Usually data frames are made up of one or more columns of factors and one or more columns of numeric data.

data.frame(rnorm(5), rnorm(5), rnorm(5))
##     rnorm.5.  rnorm.5..1  rnorm.5..2
## 1  0.2939566  1.28985202 -0.01669957
## 2  0.3672161 -0.01663912 -1.02064116
## 3  1.0871615  1.13855476  0.78573775
## 4 -0.8501263 -0.17928722  1.03848796
## 5 -1.6409403 -0.34025455 -0.62113545

More generally, in R all variables are objects, and R distinguishes between objects by their internal storage type and by their class declaration, which are accessible via the typeof() and class() functions. Functions in R are also objects, and the users can define new objects to control the output from functions like summary() and print(). For more on objects, types and classes, see section 2 of the R Language Definition.

Table 2 summarizes the internal storage and R classes of the main data and variable types.

Table 2: Table of R data and variable types.
Variable type Storage type Class Measurement Scale
vector of decimals double numeric ratio
vector of integers integer integer ratio or interval
vector of complex complex complex ratio
vector of characters character character nominal
factor vector integer factor nominal or ordinal
matrix of decimals double matrix ratio
data frame list data.frame mixed
list list list mixed

### References

1. Stevens, S. S. “On the Theory of Scales of Measurement.” Science. 103.2684 (1946): 677-680. Print.
2. R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

# Introduction to R for Excel Users

As the saying goes, when all you have is a hammer, everything looks like a nail. Excel was designed to do simple financial analyses and to craft financial statements. Though its capabilities have been expanded over the years, it was never designed to perform the sort of data analysis that industry scientists, engineers and Six Sigma belts need to perform on a daily basis.

Most data analyses performed in Excel look more like simple financial spreadsheets rather than actual data analysis, and this quality of work translates into bad—or at least sub-optimal—business decisions. There are alternatives to Excel, and the free, open-source data analysis platform R is one of them.

Unfortunately, R has a steep learning curve. I’m offering, for free, a short primer on R [PDF] where I’ve sought to make that learning curve a little less painful for engineers and scientists who normally work in Excel.

### Background

A couple of years ago, I was developing a short course to teach R to scientists and engineers in industry who normally used Excel. The goal was to help them transition to a more capable tool. My course design notes morphed into a handout, and when plans for the course fell through, that handout grew into a self-study guide, which I later adapted into this seventy-page, stand-alone introduction for Excel users.

### Organization

The primer walks the reader through the basics of R, starting with a brief overview of capabilities, then diving into installation, basic operations, graphical analysis and basic statistics. I believe that a picture is worth a thousand words, so it’s light on text and heavy on examples and visuals.

The end of the book rounds out with a look at some of the most useful add-ons, the briefest of introductions to writing your own, custom functions in R, and a cross-reference of common Excel functions with their equivalents in R.

The text is broken up into chapters and fully indexed so that it can be used either as a walk-through tutorial or as a quick reference.

Aside

# Update to plot.qcc using ggplot2 and grid

Two years ago, I blogged about my experience rewriting the plot.qcc() function in the qcc package to use ggplot2 and grid. My goal was to allow manipulation of qcc’s quality control plots using grid graphics, especially to combine range charts with their associated individuals or moving range charts, as these two diagnostic tools should be used together. At the time, I posted the code on my GitHub.

I recently discovered that the update to ggplot2 v2.0 broke my code, so that attempting to generate a qcc plot would throw an obscure error from someplace deep in ggplot2. The fix turned out to be pretty easy. The original code used aes_string() instead of aes() because of a barely-documented problem of calling aes() inside a function. It looks like this has been quietly corrected with ggplot2 2.0, and aes_string() is no longer needed for this.

The updated code is up on GitHub. As before, load the qcc library, then source() qcc.plot.R. For the rest of the current session, calls to qcc() will automatically use the new plot.qcc() function.

# Explorable, multi-tabbed reports in R and Shiny

Matt Parker recently showed us how to create multi-tab reports with R and jQuery UI. His example was absurdly easy to reproduce; it was a great blog post.

I have been teaching myself Shiny in fits and starts, and I decided to attempt to reproduce Matt’s jQuery UI example in Shiny. You can play with the app on shinyapps.io, and the complete project is up on Github. The rest of this post walks through how I built the Shiny app.

### It’s a demo

The result demonstrates a few Shiny and ggplot2 techniques that will be useful in other projects, including:

• Creating tabbed reports in Shiny, with different interactive controls or widgets associated with each tab;
• Combining different ggplot2 scale changes in a single legend;
• Sorting a data frame so that categorical labels in a legend are ordered to match the position of numerical data on a plot;
• Borrowing from Matt’s work,
• Summarizing and plotting data using dplyr and ggplot2;
• Limiting display of categories in a graph legend to the top n (selectable by the user), with remaining values listed as “other;”
• Coloring only the top n categories on a graph, and making all other categories gray;
• Changing line weight for the top n categories on a graph, and making;

### Obtaining the data

As with Matt’s original report, the data can be downloaded from the CDC WONDER database by selecting “Data Request” under “current cases.”

To get the same data that I’ve used, group results by “state” and by “year,” check “incidence rate per 100,000” and, near the bottom, “export results.” Uncheck “show totals,” then submit the request. This will download a .txt tab-delimited data file, which in this app I read in using read_tsv() from the readr package.

Looking at Matt’s example, his “top 5” states look suspiciously like the most populous states. He’s used total count of cases, which will be biased toward more populous states and doesn’t tell us anything interesting. When examining occurrences—whether disease, crime or defects—we have to look at the rates rather than total counts; we can only make meaningful comparisons and make useful decisions from an examination of the rates.

### Setup

As always, we need to load our libraries into R. For this example, I use readr, dplyr, ggplot2 and RColorBrewer.

### The UI

The app generates three graphs: a national total, that calculates national rates from the state values; a combined state graph that highlights the top $n$ states, where the user chooses $n$; and a graph that displays individual state data, where the user can select the state to view. Each goes on its own tab.

ui.R contains the code to create a tabset panel with three tab panels.

tabsetPanel(
tabPanel("National", fluidRow(plotOutput("nationPlot"))),
tabPanel("By State",
fluidRow(plotOutput("statePlot"),
wellPanel(
sliderInput(inputId = "nlabels",
label = "Top n States:",
min = 1,
max = 10,
value = 6,
step = 1)
)
)
),
tabPanel("State Lookup",
fluidRow(plotOutput("iStatePlot"),
wellPanel(
htmlOutput("selectState"))
)
)
)


Each panel contains a fluidRow element to ensure consistent alignment of graphs across tabs, and on tabs where I want both a graph and controls, fluidRow() is used to add the controls below the graph. The controls are placed inside a wellPanel() so that they are visually distinct from the graph.

Because I wanted to populate a selection menu (selectInput()) from the data frame, I created the selection menu in server.R and then displayed it in the third tab panel set using the htmlOutput() function.

### The graphs

The first two graphs are very similar to Matt’s example. For the national rates, the only change is the use of rates rather than counts.

df_tb <- read_tsv("../data/OTIS 2013 TB Data.txt", n_max = 1069, col_types = "-ciiii?di")

df_tb %>%
group_by(Year) %>%
summarise(n_cases = sum(Count), pop = sum(Population), us_rate = (n_cases / pop * 100000)) %>%
ggplot(aes(x = Year, y = us_rate)) +
geom_line() +
labs(x = "Year Reported",
y = "TB Cases per 100,000 residents",
title = "Reported Active Tuberculosis Cases in the U.S.") +
theme_minimal()


The main trick, here, is the use of dplyr to summarize the data across states. Since we can’t just sum or average rates to get the combined rate, we have to sum all of the state counts and populations for each year, and add another column for the calculated national rate.

To create a graph that highlights the top $n$ states, we generate a data frame with one variable, State, that contains the top $n$ states. This is, again, almost a direct copy of Matt’s code with changes to make the graph interactive within Shiny. This code goes inside of the shinyServer() block so that it will update when the user selects a different value for $n$. Instead of hard-coding $n$, there’s a Shiny input slider named nlabels. With a list of the top $n$ states ordered by rate of TB cases, df_tb is updated with a new field containing the top $n$ state names and “Other” for all other states.

top_states <- df_tb %>%
filter(Year == 2013) %>%
arrange(desc(Rate)) %>%
slice(1:input$nlabels) %>% select(State) df_tb$top_state <- factor(df_tb$State, levels = c(top_states$State, "Other"))
df_tb$top_state[is.na(df_tb$top_state)] <- "Other"


The plot is generated from the newly-organized data frame. Where Matt’s example has separate legends for line weight (size) and color, I’ve had ggplot2 combine these into a single legend by passing the same value to the “guide =” argument in the scale_XXX_manual() calls. The colors and line sizes also have to be updated dynamically for the selected $n$.

    df_tb %>%
ggplot() +
labs(x = "Year reported",
y = "TB Cases per 100,000 residents",
title = "Reported Active Tuberculosis Cases in the U.S.") +
theme_minimal() +
geom_line(aes(x = Year, y = Rate, group = State, colour = top_state, size = top_state)) +
scale_colour_manual(values = c(brewer.pal(n = input$nlabels, "Paired"), "grey"), guide = guide_legend(title = "State")) + scale_size_manual(values = c(rep(1,input$nlabels), 0.5), guide = guide_legend(title = "State"))
})

})


The last graph is nearly a copy of the national totals graph, except that it is filtered for the state selected in the drop-down menu control. The menu is as selectInput() control.

renderUI({
selectInput(inputId = "state", label = "Which state?", choices = unique(df_tb$State), selected = "Alabama", multiple = FALSE) })  With a state selected, the data is filtered by the selected state and TB rates are plotted. df_tb %>% filter(State == input$state) %>%
ggplot() +
labs(x = "Year reported",
y = "TB Cases per 100,000 residents",
title = "Reported Active Tuberculosis Cases in the U.S.") +
theme_minimal() +
geom_line(aes(x = Year, y = Rate))


### Wrap up

I want to thank Matt Parker for his original example. It was well-written, clear and easy to reproduce.

# Sample Size Matters: Design and Cost

## References

• R Core Team (2014). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.
• H. Wickham. ggplot2: elegant graphics for data analysis. Springer New York, 2009.

# Sample Size Matters: Design and Experiments

Previously, I introduced the idea that samples do not look exactly like the populations that they are drawn from, and had a closer look at what impact sample size has on our ability to estimate population statistics like mean, proportion or Cpk from samples. Here, I will have a closer look at how this uncertainty impacts our engineering process. In the next post, I will tie in the engineering impacts and decisions to the business value and costs.

### Difference to detect

When we are testing, we’re either testing to determine that the new product or process performs better than the old or, for cost reduction projects, that the cheaper product or process is at least as good as the existing one.

This means that we need to detect a difference between old and new values, such as a difference in the mean weight between the new and old parts. The larger the sample size, $n$ the smaller the difference, $\Delta$ that we can detect. The error, $\epsilon$, in our estimate of the differences gets smaller as sample size increases:

Given the uncertainties in our estimate of $\mu$ and $\sigma$, illustrated above, it should be clear now that with small sample sizes we can only detect large differences of many multiples of the sample standard deviation, $S$.

### Mean

When trying to determine if a new product or process is better than an old one, we are usually interested in shifting the mean. We want a product to be lighter, provide more power, or a process to work faster. In such cases, we need to estimate the difference of the means, $\Delta = \mu_2 - \mu_1$ and ensure that it is different than 0 (or some other pre-determined value). The minimum difference that we can reliable detect is plotted below for different sample sizes.

### Standard deviation

In many Six Sigma projects, and any time we want to shift the mean closer to a specification limit, we need to compare the new population standard deviation with the old. The simplest way of making this comparison is by taking the ratio $F = \sigma_{2}^{2} / \sigma_{1}^{2}$, where $\sigma_{2}^{2}$ is the larger of the two variances. The dependence on sample size is illustrated below.

You can see from the inset plot, which includes sample sizes of 2 and 3, that small sample sizes really hurt comparisons of variance, and that interesting differences in variance can’t be detected until we have more than 10 samples.

### Proportions

Proportions, such as fraction of defective parts between a new and old design, can be compared by looking at the difference between the two proportions, $\Delta = \left| p_1 - p_0 \right|$.

You can see from this that proportions data provides much less information than variable data; we need much larger sample sizes to achieve usefully small $\Delta$.

### Summary and look forward

When designing experiments, the goal is to detect some difference between two populations. The uncertainty in our measurements and the variation in the parts has a big impact on how many parts we need to test, or greatly limits what we can learn from an experiment.

Next time, I’ll show how these calculations of sample size and uncertainty impact the busines.

# Sample Size Matters: Uncertainty in Measurement

In my previous post, I gave a brief introduction to populations and samples, and stated that sample size impacts our ability to know what a population really looks like. In this post, I want to show this relationship in more detail. In future posts, I will look at how sample size considerations impact our engineering process and what impacts this has on the business.

### Mean and sample size

The error in our estimate of the mean, $E$, is proportional to the standard deviation of the sample, $S$, and the sample size, $n$.

We can visualize this easily enough by plotting the 95% confidence interval. When we sample and calculate the sample mean ($\overline{X}$), the true population mean, $\mu$, (what we really want to know) is likely to be anywhere in the shaded region of the graph below.

This graph shows the 95% confidence region for the true population mean, $\mu$; there’s a 95% chance that the true population mean is within this band. The “0” line on the y axis is our estimate of the mean, $\overline{X}$. We can’t know what the true population mean is, but it’s clear that if we use more samples, we can be sure that our estimate is closer to the true mean.

### Standard deviation and sample size

Likewise, when we calculate the sample standard deviation, $S$, the true standard deviation, $\sigma$ has a 95% chance of being within the confidence band below. For small sample sizes (roughly less than 10), the measured standard deviation can be off from the true standard deviation by several times. Even for ten samples, the potential error is nearly $\pm 1$ standard deviation.

### Proportion and sample size

For proportions, the situation is similar: there is a 95% chance that the true sample proportion, $p$, is within the shaded band based on the measured sample proportion $\hat{p}$. Since this confidence interval depends on $\hat{p}$ and cannot be standardized the way $\mu$ and $\sigma$ can be, confidence intervals for two different proportions are plotted.

For small $n$, proportions data tells us very little.

### Process capability and production costs

The cost of poor quality in product or process design can be characterized by the Cpk:

Where USL is the upper specification limit (also called the upper tolerance) and LSL is the lower specification limit (or lower tolerance).

We can estimate the defect rate (defects per opportunity, or DPO) from the Cpk:

$DPO = 1 - \Pr\left(X < 3 \times Cpk - 1.5\right)$

That probability function is calculated in R with pnorm(3 * Cpk - 1.5) and in Excel with NORMSDIST(3 * Cpk - 1.5). The 1.5 is a typical value used to account for uncorrected or undetected process drift.

Since we don’t know $\mu$ and $\sigma$, we have to substitute $\overline{X}$ and $S$. The uncertainty in these estimates of the population $\mu$ and $\sigma$ mean that we have uncertainty in what the true process Cpk (or defect rates) will be once we’re in production. When our sample testing tells us that the Cpk should be 1.67 (the blue line), the true process Cpk will actually turn out to be somewhere in the shaded band:

Below the blue line, our product or process is failing to meet customer expectations, and will result in lost customers or higher warranty costs. Above the blue line, we’ve added more cost to the production of the product than we need to, reducing our gross profit margin. Since that gray band doesn’t completely disappear, even at 100 samples, we can never eliminate these risks; we have to find a way to manage them effectively.

The impact of this may be more evident when we convert from Cpk to defect rates (ppm):

### Summary and a look forward

With a fair sampling process, samples will look similar to—and statistically indistinguishable from—the population that they were drawn from. How much they look like the population depends critically on how many samples are tested. The uncertainties, or errors in our estimates, resulting from sample size decisions have impacts all through our design analysis and production planning.

In the next post, I will explore in more detail how these uncertainties impact our experiment designs.