data

When analyzing data, I have often found it useful to think of the data as being one of four main types, according to the typology proposed by Stevens.^[1] Different types of data have different characteristics; understanding what type of data you have helps with selecting the analysis to perform while preventing basic mistakes.

The types, or “scales of measurement,” are:

Nominal: Data identifying unique classifications or objects where the order of values is not meaningful. Examples include zip codes, gender, nationality, sports teams and multiple choice answers on a test.
Ordinal: Data where the order is important but the difference or distance between items is not important or not measured. Examples include team rankings in sport (team A is better than team B, but how much better is open to debate), scales such as health (e.g. “healthy” to “sick”), ranges of opinion (e.g. “strongly agree” to “strongly disagree” or “on a scale of 1 to 10”) and Intelligence Quotient.
Interval: Numeric data identified by values where the degree of difference between items is significant and meaningful, but their ratio is not. Common examples are dates (we can say 2000 CE is 1000 CE + 1000 years, but 1000 CE is not half of 2000 CE in any meaningful way) and temperatures on the Celsius and Fahrenheit scales, where a difference of 10° is meaningful, but 10° is not twice as hot as 5°.
Ratio: Numeric data where the ratio between numbers is meaningful. Usually, such scales have a meaningful “0.” Examples include length, mass, velocity, acceleration, voltage, power, duration, energy and Kelvin-scale temperature.

The generally appropriate statistics and mathematical operations for each type are summarized in Table 1.

Table 1: Scales of measurement and allowed statistical and mathematical operations.
Scale Type	Statistics	Operations
Nominal	mode, frequency, chi-squared, cluster analysis	=, ≠
Ordinal	above, plus: median, non-parametric tests, Kruskal-Wallis, rank-correlation	=, ≠, >, <
Interval	above, plus: arithmetic mean, some parametric tests, correlation, regression, ANOVA (sometimes), factor analysis	=, ≠, >, <, +, -
Ratio	above, plus: geometric and harmonic mean, ANOVA, regression, correlation coefficient	=, ≠, >, <, +, -, ×, ÷

While this typology is useful for most purposes, and certainly for initial consideration, there are valid criticisms of it. For example, percentages and count data have some characteristics of ratio-scale data, but with additional constraints: the average of the counts $\overline{(2, 2, 1)} = 1.66\ldots$ may not be meaningful. Stevens’ typology is a useful thinking tool, but it is essential to understand the statistical methods being applied and their sensitivity to departures from underlying assumptions.
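As a small illustration of the count-data caveat, here is a sketch assuming three hypothetical counts of 2, 2 and 1 (say, defects found on three parts). The arithmetic mean can be computed, but it is not a value that a count can take, while the median is.

```r
# Hypothetical counts, e.g. defects found on three parts
counts <- c(2L, 2L, 1L)

mean(counts)    # 1.666667: not a value a count can take
median(counts)  # 2
table(counts)   # frequency of each observed count
```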

Types of data in R

R^[2] recognizes at least fifteen different types of data. Several of these are related to identifying functions and other objects—most users don’t need to worry about most of them. The main types that industrial engineers and scientists will need to use are:

numeric

Real numbers. Also known as double, real and single (note that R stores all real numbers in double-precision). May be used for all scales of measurement, but is particularly suited to ratio scale measurements.

complex

Complex numbers, with real and imaginary parts, can be manipulated directly as a data type using

x <- 1 + 2i

or

x <- complex(real=1, imaginary=2)

Like type numeric, may be used for all scales of measurement.
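A brief sketch of working with the complex type, using base R functions only. The specific values are arbitrary.

```r
# Two equivalent ways to build the complex number 1 + 2i
x <- 1 + 2i
y <- complex(real = 1, imaginary = 2)

x == y       # TRUE
Re(x)        # real part: 1
Im(x)        # imaginary part: 2
Mod(3 + 4i)  # modulus: 5
typeof(x)    # "complex"
```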

integer

Stores integers only, without any decimal point. Used mainly for ordinal or interval data, but may be used for ratio data, such as counts, with some caution.
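The distinction between integer and double storage is easy to check at the console. A minimal sketch:

```r
typeof(2)        # "double": plain numbers are stored as double
typeof(2L)       # "integer": the L suffix requests an integer
typeof(1:3)      # "integer"
typeof(5L / 2L)  # "double": ordinary division returns 2.5
5L %/% 2L        # 2: integer division keeps the integer type
is.numeric(2L)   # TRUE: integers also count as numeric
```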

logical

Stores Boolean values of TRUE or FALSE, typically used as nominal data.
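Logical vectors usually arise from comparisons. The sketch below assumes a hypothetical vector of weights and a hypothetical limit of 4.5; in arithmetic, TRUE is treated as 1 and FALSE as 0.

```r
# Hypothetical measurements
weights <- c(3.2, 4.8, 5.1, 4.1)

over_limit <- weights > 4.5  # FALSE TRUE TRUE FALSE
typeof(over_limit)           # "logical"
sum(over_limit)              # 2: number of TRUE values
mean(over_limit)             # 0.5: proportion of TRUE values
```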

character

Stores text strings and can be used as nominal or ordinal data.

Types of variables in R

The above types of data can be stored in several types, or structures, of variables. The equivalent to a variable in Excel would be rows, columns or tables of data. The main ones that we will use are:

vector

Contains one or many elements, and behaves like a column or row of data. Vectors can contain any of the above types of data but each vector is stored, or encoded, as a single type. The vector

c(1, 2, 1, 3, 4)
## [1] 1 2 1 3 4

is, by default, a numeric vector of type double, but

c(1, 2, 1, 3, 4, "name")
## [1] "1" "2" "1" "3" "4" "name"

will be a character vector, or a vector where all data is stored as type character, and the numbers will be stored as characters rather than numbers. It will not be possible to perform mathematical operations on these numbers-stored-as-characters without first converting them to type numeric.
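A short sketch of that conversion, using the same vector. Anything that cannot be read as a number becomes NA, and R issues a warning (suppressed here).

```r
mixed <- c(1, 2, 1, 3, 4, "name")
class(mixed)  # "character"

# "name" cannot be parsed as a number, so it becomes NA
nums <- suppressWarnings(as.numeric(mixed))
nums                     # 1 2 1 3 4 NA
sum(nums, na.rm = TRUE)  # 11
```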

factor

A special type of vector for categorical data, where the text strings signify factor levels and each element is encoded internally as an integer index into the set of levels. Factors can be treated as nominal data when the order does not matter, or as ordinal data when the order does matter.

factor(c("a", "b", "c", "a"), levels=c("a","b","c","d"))
## [1] a b c a  
## Levels: a b c d
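To see how a factor is stored, and how an ordered factor behaves as ordinal data, consider the sketch below. The rating vector and its level names are hypothetical.

```r
f <- factor(c("a", "b", "c", "a"), levels = c("a", "b", "c", "d"))
as.integer(f)  # 1 2 3 1: position of each value in the levels
levels(f)      # "a" "b" "c" "d"
table(f)       # counts per level, including zero for the unused "d"

# An ordered factor (hypothetical ratings) supports comparisons
rating <- factor(c("low", "high", "medium"),
                 levels = c("low", "medium", "high"), ordered = TRUE)
rating < "high"  # TRUE FALSE TRUE
```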

array

A generalization of vectors from one dimension to two or more dimensions. An array can have any number of dimensions, which are set when the array is created. Like vectors, all elements of an array must be of the same data type. (Note that the letters object used in the example below is a variable supplied by R that contains the letters a through z.)

# letters a - c in 2x4 array 
array(data=letters[1:3], dim=c(2,4))
##      [,1] [,2] [,3] [,4]  
## [1,] "a"  "c"  "b"  "a"  
## [2,] "b"  "a"  "c"  "b"

# numbers 1 - 3 in 2x4 array 
array(data=1:3, dim=c(2,4))
##      [,1] [,2] [,3] [,4]  
## [1,]    1    3    2    1  
## [2,]    2    1    3    2
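In both examples above, the three supplied values are recycled to fill all eight cells. A sketch of a three-dimensional array, with arbitrary dimensions chosen for illustration, shows how elements are indexed:

```r
# 24 values arranged as 2 rows x 3 columns x 4 layers
a <- array(data = 1:24, dim = c(2, 3, 4))

dim(a)      # 2 3 4
a[1, 1, 2]  # 7: first cell of the second layer
a[2, 3, 4]  # 24: last cell of the last layer
a[, , 1]    # the first layer, a 2x3 matrix
```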

matrix

A special type of array with the properties of a mathematical matrix. It may only be two-dimensional, having rows and columns, where all columns must have the same type of data and every column must have the same number of rows. R provides several functions specific to manipulating matrices, such as taking the transpose, performing matrix multiplication and calculating eigenvectors and eigenvalues.

matrix(data = rep(1:3, times=2), nrow=2, ncol=3)
##      [,1] [,2] [,3]  
## [1,]    1    3    2  
## [2,]    2    1    3
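The matrix functions mentioned above can be sketched with base R as follows. The matrices are small hypothetical examples chosen so the results are easy to check by hand.

```r
# Hypothetical symmetric 2x2 matrix
m <- matrix(c(2, 1, 1, 2), nrow = 2, ncol = 2)

m %*% m           # matrix product: rows (5, 4) and (4, 5)
eigen(m)$values   # eigenvalues: 3 1
eigen(m)$vectors  # one eigenvector per column

# Transposing swaps rows and columns
a <- matrix(1:6, nrow = 2, ncol = 3)
dim(t(a))         # 3 2
```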

list

Vectors whose elements are other R objects, where each object of the list can be of a different data type, and each object can be of different length and dimension than the other objects. Lists can therefore store all other data types, including other lists.

list("text", "more", 2, c(1,2,3,2))
## [[1]]  
## [1] "text"  
##  
## [[2]]  
## [1] "more"  
##  
## [[3]]  
## [1] 2  
##  
## [[4]]  
## [1] 1 2 3 2
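List elements can also be named, which makes them easier to retrieve. A minimal sketch, with hypothetical element names:

```r
items <- list(label = "text", count = 2, values = c(1, 2, 3, 2))

items[["label"]]      # "text": [[ ]] extracts the element itself
items$values          # 1 2 3 2: $ works for named elements
items["count"]        # a list of length 1 holding count
length(items)         # 3
sapply(items, class)  # class of each element
```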

data.frame

For most industrial and data scientists, data frames are the most widely useful type of variable. A data.frame is the list analog to the matrix: it is an $m \times n$ list where all columns must be vectors with the same number of rows (determined with NROW()). However, unlike matrices, different columns can contain different types of data, and each row and column must have a name. If not named explicitly, R names rows by their row number and columns according to the data assigned to the column. Data frames are typically used to store the sort of data that industrial engineers and scientists most often work with, and are the closest analog in R to an Excel spreadsheet. Usually data frames are made up of one or more columns of factors and one or more columns of numeric data.

data.frame(rnorm(5), rnorm(5), rnorm(5))
##     rnorm.5.  rnorm.5..1  rnorm.5..2  
## 1  0.2939566  1.28985202 -0.01669957  
## 2  0.3672161 -0.01663912 -1.02064116  
## 3  1.0871615  1.13855476  0.78573775  
## 4 -0.8501263 -0.17928722  1.03848796  
## 5 -1.6409403 -0.34025455 -0.62113545
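In practice, columns are usually named explicitly. The sketch below builds a data frame with one factor column and one numeric column; the column names, the seed and the simulated weights are all hypothetical.

```r
set.seed(1)  # arbitrary seed so the example is repeatable
shifts <- data.frame(
  shift  = factor(c("day", "day", "night", "night")),
  weight = rnorm(4, mean = 100, sd = 2)
)

names(shifts)          # "shift" "weight"
NROW(shifts)           # 4
sapply(shifts, class)  # shift: "factor", weight: "numeric"
shifts$weight          # a column is an ordinary numeric vector
```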

More generally, in R all variables are objects, and R distinguishes between objects by their internal storage type and by their class declaration, which are accessible via the typeof() and class() functions. Functions in R are also objects, and users can define new classes of objects to control the output from functions like summary() and print(). For more on objects, types and classes, see section 2 of the R Language Definition.
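The difference between storage type and class, and the way a class can control printed output, can be sketched as follows. The class name "reading" and its print method are hypothetical, made up for this illustration.

```r
x <- c(1.5, 2.5)
typeof(x)  # "double"
class(x)   # "numeric"

f <- factor(c("a", "b"))
typeof(f)  # "integer"
class(f)   # "factor"

typeof(mean)  # "closure": functions are objects too

# A hypothetical class "reading" with its own print method
print.reading <- function(x, ...) {
  cat("Reading:", format(unclass(x)), "mm\n")
  invisible(x)
}
r <- structure(12.7, class = "reading")
print(r)  # Reading: 12.7 mm
```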

Table 2 summarizes the internal storage and R classes of the main data and variable types.

Table 2: Table of R data and variable types.
Variable type	Storage type	Class	Measurement Scale
vector of decimals	double	numeric	ratio
vector of integers	integer	integer	ratio or interval
vector of complex	complex	complex	ratio
vector of characters	character	character	nominal
factor vector	integer	factor	nominal or ordinal
matrix of decimals	double	matrix	ratio
data frame	list	data.frame	mixed
list	list	list	mixed

References

[1] Stevens, S. S. “On the Theory of Scales of Measurement.” Science. 103.2684 (1946): 677-680. Print.
[2] R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

Those of us working in industry with Excel are familiar with scatter plots, line graphs, bar charts, pie charts and maybe a couple of other graph types. Some of us have occasionally used the Analysis ToolPak to create histograms that don’t update when our data changes (though there is a way to make dynamic histograms in Excel; perhaps I’ll cover this in another blog post).

One of the most important steps in data analysis is to just look at the data. What does the data look like? When we have time-dependent data, we can lay it out as a time-series or, better still, as a control chart (a.k.a. “natural process behavior chart”). Sometimes we just want to see how the data looks as a group. Maybe we want to look at the product weight or the cycle time across production shifts.

Unless you have Minitab, R or another good data analysis tool at your disposal, you have probably never used—maybe never heard of—boxplots. That’s unfortunate, because boxplots should be one of the “go-to” tools in your data analysis tool belt. It’s a real oversight that Excel doesn’t provide a good way to create them.

For the purpose of demonstration, let’s start with creating some randomly generated data:

head(df)

##   variable   value
## 1   group1 -1.5609
## 2   group1 -0.3708
## 3   group1  1.4242
## 4   group1  1.3375
## 5   group1  0.3007
## 6   group1  1.9717

tail(df)

##     variable   value
## 395   group1  1.4591
## 396   group1 -1.5895
## 397   group1 -0.4692
## 398   group1  0.1450
## 399   group1 -0.3332
## 400   group1 -2.3644
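The code that generated df is not shown. A minimal sketch of one way to build a similar data frame follows, assuming 400 draws from a standard normal distribution in a single group. The seed and the distribution are assumptions, so the values will not match the output printed above.

```r
set.seed(42)  # arbitrary seed; an assumption, not the original
df <- data.frame(
  variable = factor(rep("group1", 400)),
  value    = round(rnorm(400), 4)  # assumed standard normal data
)

head(df)
tail(df)
```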

If we don’t have much data, we can just plot the points:

library(ggplot2)

ggplot(data = df[1:10,]) +
  geom_point(aes(x = variable, y = value)) +
  coord_flip() +
  theme_bw()

But if we have lots of data, it becomes hard to see the distribution due to overplotting:

ggplot(data = df) +
  geom_point(aes(x = variable, y = value)) +
  coord_flip() +
  theme_bw()

We can try to fix this by changing some parameters, like adding semi-transparency (alpha blending) and using an open plot symbol, but for the most part this just makes the data points harder to see; the distribution is largely lost:

ggplot(data = df) +
  geom_point(aes(x = variable, y = value), alpha = 0.3, shape = 1) +
  coord_flip() +
  theme_bw()

The natural solution is to use histograms, another “go-to” data analysis tool that Excel doesn’t provide in a convenient way:

ggplot(data = df) +
  geom_histogram(aes(x = value), binwidth = 1) +
  theme_bw()

But histograms don’t scale well when you want to compare multiple groups; the histograms get too short (or too narrow) to really provide useful information. Here I’ve broken the data into eight groups:

head(df)

##   variable   value
## 1   group1 -1.5609
## 2   group1 -0.3708
## 3   group1  1.4242
## 4   group1  1.3375
## 5   group1  0.3007
## 6   group1  1.9717

tail(df)

##     variable   value
## 395   group8 -0.6384
## 396   group8 -3.0245
## 397   group8  1.5866
## 398   group8  1.9747
## 399   group8  0.2377
## 400   group8 -0.3468
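As before, the code behind this grouped version of df is not shown. One possible sketch, assuming eight equally sized groups of 50 standard normal values each (the group sizes, seed and distribution are all assumptions):

```r
set.seed(42)  # arbitrary seed; an assumption, not the original
df <- data.frame(
  variable = factor(rep(paste0("group", 1:8), each = 50)),
  value    = round(rnorm(400), 4)  # assumed standard normal data
)

table(df$variable)  # 50 rows in each of the eight groups
```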

ggplot(data = df) +
  geom_histogram(aes(x = value), binwidth = 1) +
  facet_grid(variable ~ .) +
  theme_bw()

Either the histograms need to be taller, making the stack too tall to fit on a page, or we need a better solution.

The solution is the boxplot:

ggplot() +
  geom_boxplot(data = df, aes(y = value, x = variable)) +
  coord_flip() +
  theme_bw()

The boxplot provides a nice, compact representation of the distribution of a set of data, and makes it easy to compare across a large number of groups.

There’s a lot of information packed into that graph, so let’s unpack it:

Median

A measure of the central tendency of the data that is a little more robust than the mean (or arithmetic average). Half (50%) of the data falls below this mark. The other half falls above it.

First quartile (25th percentile) hinge

Twenty-five percent (25%) of the data falls below this mark.

Third quartile (75th percentile) hinge

Seventy-five percent (75%) of the data falls below this mark.

Inter-Quartile Range (IQR)

The middle half (50%) of the data falls within this band, drawn between the 25th percentile and 75th percentile hinges.

Lower whisker

The lower whisker connects the first quartile hinge to the lowest data point within 1.5 * IQR of the hinge.

Upper whisker

The upper whisker connects the third quartile hinge to the highest data point within 1.5 * IQR of the hinge.

Outliers

Any data points more than 1.5 * IQR below the first quartile hinge, or more than 1.5 * IQR above the third quartile hinge, are marked individually as outliers.
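These definitions can be checked by hand. The sketch below uses a small hypothetical data set and base R's quantile() with its default settings; a given plotting function may compute the hinges slightly differently for small samples.

```r
# Hypothetical data with one extreme value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

q1  <- unname(quantile(x, 0.25))  # first quartile hinge: 3.25
q3  <- unname(quantile(x, 0.75))  # third quartile hinge: 7.75
iqr <- q3 - q1                    # inter-quartile range: 4.5

lower_fence <- q1 - 1.5 * iqr     # -3.5
upper_fence <- q3 + 1.5 * iqr     # 14.5

# Whiskers end at the most extreme points inside the fences
lower_whisker <- min(x[x >= lower_fence])  # 1
upper_whisker <- max(x[x <= upper_fence])  # 9

# Points outside the fences are drawn individually
outliers <- x[x < lower_fence | x > upper_fence]  # 30

median(x)  # 5.5
```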

We can add additional values to these plots. For instance, it’s sometimes useful to add the mean (average) when the distributions are heavily skewed:

# fun replaces the deprecated fun.y argument (ggplot2 3.3.0 and later)
ggplot(data = df, aes(y = value, x = variable)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 10, size = 3, colour = "blue") +
  coord_flip() +
  theme_bw()

Graphs created in the R programming language using the ggplot2 and gridExtra packages.

References

R Core Team (2014). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.
H. Wickham. ggplot2: elegant graphics for data analysis. Springer New York, 2009.
Baptiste Auguie (2012). gridExtra: functions in Grid graphics. R package version 0.9.1. http://CRAN.R-project.org/package=gridExtra

Tom Hopper

Competitive organizations through high-performance learning
