He divided the universe in forty categories or classes, these being further subdivided into differences, which was then subdivided into species. He assigned to each class a monosyllable of two letters; to each difference, a consonant; to each species, a vowel. For example: de, which means an element; deb, the first of the elements, fire; deba, a part of the element fire, a flame. … The words of the analytical language created by John Wilkins are not mere arbitrary symbols; each letter in them has a meaning, like those from the Holy Writ had for the Cabbalists.
      –Jorge Luis Borges, The Analytical Language of John Wilkins

99.1 Variables revisited

In the last chapter I talked a lot about variables, how they’re assigned and some of the things you can do with them, but there are a lot of additional complexities. That’s not a surprise, of course. However, some of those issues are worth drawing your attention to now, so the goal of this section is to cover a few extra topics. As a consequence, this section is basically a bunch of things that I want to briefly mention but that don’t really fit in anywhere else. In short, I’ll talk about several different issues in this section, which are only loosely connected to one another.

99.1.1 Special values

The first thing I want to mention is the set of “special” values that you might see R produce. Most likely you’ll see them in situations where you were expecting a number, but there are quite a few other ways you can encounter them. These values are Inf, NaN, NA and NULL. They can crop up in various different places, and so it’s important to understand what they mean.

Infinity (Inf). The easiest of the special values to explain is Inf, since it corresponds to a value that is infinitely large. You can also have -Inf. The easiest way to get Inf is to divide a positive number by 0:

1/0
## [1] Inf

In most real world data analysis situations, if you’re ending up with infinite numbers in your data, then something has gone awry. Hopefully you’ll never have to see them.

Not a Number (NaN). The special value of NaN is short for “not a number”, and it’s basically a reserved keyword that means “there isn’t a mathematically defined number for this”. If you can remember your high school maths, remember that it is conventional to say that 0/0 doesn’t have a proper answer: mathematicians would say that 0/0 is undefined. R says that it’s not a number:

0/0
## [1] NaN

Nevertheless, it’s still treated as a “numeric” value. To oversimplify, NaN corresponds to cases where you asked a proper numerical question that genuinely has no meaningful answer.

Not available (NA). NA indicates that the value that is “supposed” to be stored here is missing. To understand what this means, it helps to recognise that the NA value is something that you’re most likely to see when analysing data from real world experiments. Sometimes you get equipment failures, or you lose some of the data, or whatever. The point is that some of the information that you were “expecting” to get from your study is just plain missing. Note the difference between NA and NaN. For NaN, we really do know what’s supposed to be stored; it’s just that it happens to correspond to something like 0/0 that doesn’t make any sense at all. In contrast, NA indicates that we actually don’t know what was supposed to be there. The information is missing.

No value (NULL). The NULL value takes this “absence” concept even further. It asserts that the variable genuinely has no value whatsoever, or does not even exist. This is quite different to both NaN and NA. For NaN we actually know what the value is, because it’s something insane like 0/0. For NA, we believe that there is supposed to be a value “out there” in some sense, but a dog ate our homework and so we don’t quite know what it is. But for NULL we strongly believe that there is no value at all.
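
One way to tell these special values apart in practice is to use the checking functions that base R provides for each of them. The following is only a minimal sketch, and the vector x is made up purely for illustration:

```r
x <- c(1, NA, NaN, Inf, -Inf)   # a made-up vector, for illustration only
is.na(x)         # TRUE for NA, and also for NaN
## [1] FALSE  TRUE  TRUE FALSE FALSE
is.nan(x)        # TRUE only for NaN
## [1] FALSE FALSE  TRUE FALSE FALSE
is.infinite(x)   # TRUE for Inf and -Inf
## [1] FALSE FALSE FALSE  TRUE  TRUE
is.null(NULL)    # NULL can't sit inside a numeric vector, so test it directly
## [1] TRUE
length(NULL)     # and it contains no elements at all
## [1] 0
```

Notice that is.na treats NaN as missing too, so if you need to distinguish between the two, is.nan is the function to reach for.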

99.1.2 Variable classes

As we’ve seen, R allows you to store different kinds of data. In particular, the variables we’ve defined so far have either been character data (text), numeric data, or logical data. It’s important that we remember what kind of information each variable stores (and even more important that R remembers) since different kinds of variables allow you to do different things to them. For instance, if your variables have numerical information in them, then it’s okay to multiply them together. But if they contain character data, multiplication makes no sense whatsoever, and R will complain if you try to do it:

x <- 5 # x is numeric
y <- 4 # y is numeric
x * y
## [1] 20
x <- "apples" # x is character
y <- "oranges" # y is character
x * y
## Error in x * y: non-numeric argument to binary operator

Even R is smart enough to know you can’t multiply "apples" by "oranges". It knows this because the quote marks are indicators that the variable is supposed to be treated as text, not as a number.

This is quite useful, but notice that it means that R makes a big distinction between 5 and "5". Without quote marks, R treats 5 as the number five, and will allow you to do calculations with it. With the quote marks, R treats "5" as the textual character five, and doesn’t recognise it as a number any more than it recognises "p" or "five" as numbers. As a consequence, there’s a big difference between typing x <- 5 and typing x <- "5". In the former, we’re storing the number 5; in the latter, we’re storing the character "5". Thus, if we try to do multiplication with the character versions, R gets stroppy:

x <- "5" # x is character
y <- "4" # y is character
x * y
## Error in x * y: non-numeric argument to binary operator

Okay, let’s suppose that I’ve forgotten what kind of data I stored in the variable x (which happens depressingly often). R provides a function that will let us find out. Actually, it provides several different functions that are used for different purposes. For now I only want to discuss the class function. The class of a variable is a “high level” classification, and it captures psychologically (or statistically) meaningful distinctions. For instance "2011-09-12" and "my birthday" are both text strings, but there’s an important difference between the two: one of them is a date. So it would be nice if we could get R to recognise that "2011-09-12" is a date, and allow us to do things like add or subtract from it. The class of a variable is what R uses to keep track of things like that. Because the class of a variable is critical for determining what R can or can’t do with it, the class function is very handy.

Later on, I’ll talk a bit about how you can convince R to coerce a variable to change from one class to another. That’s a useful skill for real world data analysis, but it’s not something that we need right now. In the meantime, the following examples illustrate the use of the class function:

x <- "hello world"
class(x)
## [1] "character"
x <- TRUE
class(x)
## [1] "logical"
x <- 100
class(x)
## [1] "numeric"
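
To make the date example a little more concrete, here is a small sketch using the as.Date function from base R. It jumps ahead slightly, because it involves converting a variable from one class to another, so treat it as a preview and not as something you need to follow in detail yet:

```r
d <- "2011-09-12"   # just a text string as far as R is concerned
class(d)
## [1] "character"
d <- as.Date(d)     # ask R to treat the string as a date
class(d)
## [1] "Date"
d + 1               # adding 1 moves the date forward by one day
## [1] "2011-09-13"
```

Once R knows that the variable is a date, arithmetic on it does something sensible, which is exactly the kind of thing the class is there to keep track of.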

99.1.3 Coercion

Sometimes you want to change the variable class. This can happen for all sorts of reasons. Sometimes when you import data from files, it can come to you in the wrong format: numbers sometimes get imported as text, dates usually get imported as text, and many other possibilities besides. Regardless of how you’ve ended up in this situation, there’s a very good chance that sometimes you’ll want to convert a variable from one class into another one. Or, to use the correct term, you want to coerce the variable from one class into another. Coercion is a little tricky, and so I’ll only discuss the very basics here, using a few simple examples. However, you’ll see more examples of coercion as you go through this tutorial.

Firstly, let’s suppose we have a variable x that is supposed to be representing a number, but the data file that you’ve been given has encoded it as text (this happens quite a lot). Let’s imagine that the variable is something like this:

x <- c("15","19")  # the variable
class(x)           # what class is it?
## [1] "character"

Obviously, if I want to do mathematical calculations using x in its current state, R is going to get very annoyed at me. It thinks that x is text, so it’s not going to allow me to try to do mathematics using it! Obviously, we need to coerce x from “character” to “numeric”. We can do that in a straightforward way by using the as.numeric function:

x <- as.numeric(x)  # coerce the variable
class(x)            # what class is it?
## [1] "numeric"
x + 1               # hey, addition works!
## [1] 16 20

Not surprisingly, we can also convert it back again if we need to. The function that we use to do this is the as.character function:

x <- as.character(x)   # coerce back to text
class(x)               # check the class
## [1] "character"

However, there are some fairly obvious limitations: you can’t coerce the string "hello world" into a number because, well, there isn’t a number that corresponds to it. If you try, R metaphorically shrugs its shoulders and declares it to be missing:

as.numeric( "hello world" )  # this isn’t going to work.
## Warning: NAs introduced by coercion
## [1] NA
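
When only some of the entries fail to convert, the same thing happens element by element, and is.na gives you one way to track down the culprits. Here’s a small sketch using a made-up vector (the names raw and num are just for this example):

```r
raw <- c("15", "19", "n/a")                # made-up data; the last entry isn't a number
num <- suppressWarnings(as.numeric(raw))   # hide the coercion warning for this demo
num
## [1] 15 19 NA
raw[is.na(num)]                            # which of the original entries failed?
## [1] "n/a"
```

I’ve wrapped the call in suppressWarnings only to keep the output tidy. In a real analysis the warning is useful, so it’s usually better to leave it switched on.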

That gives you a feel for how to change between numeric and character data. What about logical data? To cover this briefly, coercing text to logical data is pretty intuitive: you use the as.logical function, and the character strings "T", "TRUE", "True" and "true" all convert to the logical value of TRUE. Similarly "F", "FALSE", "False", and "false" all become FALSE. All other strings convert to NA. When you go back the other way using as.character, TRUE converts to "TRUE" and FALSE converts to "FALSE". Converting numbers to logicals (again using as.logical) is straightforward. Following the standard convention in the study of Boolean logic, the number 0 converts to FALSE. Everything else is TRUE. Going back using as.numeric, FALSE converts to 0 and TRUE converts to 1.
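
The rules in the previous paragraph are easy to check for yourself. The inputs below are made up for illustration, and "yes" is included only to show what happens to a string that R doesn’t recognise:

```r
as.logical(c("T", "true", "False", "yes"))   # "yes" isn't one of the recognised strings
## [1]  TRUE  TRUE FALSE    NA
as.character(c(TRUE, FALSE))
## [1] "TRUE"  "FALSE"
as.logical(c(0, 1, -2.5))                    # only 0 becomes FALSE
## [1] FALSE  TRUE  TRUE
as.numeric(c(FALSE, TRUE))
## [1] 0 1
```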

99.2 Vectors re-re-visited

99.2.1 Naming vector elements

One thing that is sometimes a little unsatisfying about the way that R prints out a vector is that the elements come out unlabelled. Here’s what I mean. Suppose I’ve got data reporting the quarterly profits for some company. If I just create a no-frills vector, I have to rely on memory to know which element corresponds to which event. That is:

profit <- c( 3.1, 0.1, -1.4, 1.1 )
profit
## [1]  3.1  0.1 -1.4  1.1

You can probably guess that the first element corresponds to the first quarter, the second element to the second quarter, and so on, but that’s only because I’ve told you the back story and because this happens to be a very simple example. In general, it can be quite difficult. This is where it can be helpful to assign names to each of the elements. Here’s one way to do it:

names(profit) <- c("Q1","Q2","Q3","Q4")
profit
##   Q1   Q2   Q3   Q4
##  3.1  0.1 -1.4  1.1

This is a slightly odd looking command, admittedly, but it’s not too difficult to follow. All we’re doing is assigning a vector of labels (character strings) to names(profit). You can always delete the names again by using this command,

names(profit) <- NULL
profit
## [1]  3.1  0.1 -1.4  1.1

It’s also worth noting that you don’t have to do this as a two stage process. You can get the same result with this command:

profit <- c( "Q1" = 3.1, "Q2" = 0.1, "Q3" = -1.4, "Q4" = 1.1 )
profit
##   Q1   Q2   Q3   Q4
##  3.1  0.1 -1.4  1.1

The important things to notice are that (a) this does make things much easier to read, but (b) the names at the top aren’t the “real” data. The value of profit[1] is still 3.1; all I’ve done is added a name to profit[1] as well. Nevertheless, names aren’t purely cosmetic, since R allows you to pull out particular elements of the vector by referring to their names:

profit["Q1"]
##  Q1
## 3.1

And if I ever need to pull out the names themselves, then I just type

names(profit)
## [1] "Q1" "Q2" "Q3" "Q4"
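
Names can be combined with the other indexing tricks too. As a small sketch, here is how you might pull out more than one element by name, and how you might use the names to answer a question about the data:

```r
profit <- c(Q1 = 3.1, Q2 = 0.1, Q3 = -1.4, Q4 = 1.1)
profit[c("Q1", "Q4")]          # pull out two elements by name
##  Q1  Q4
## 3.1 1.1
names(profit)[profit < 0]      # which quarters made a loss?
## [1] "Q3"
```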

99.2.2 Refresher: Logical indexing

Let’s start with a simple example. When my children were little I naturally spent a lot of time watching TV shows like In the Night Garden. In the nightgarden.Rdata file, I’ve transcribed a short section of the dialogue from the show. The file contains two vectors, speaker and utterance, and when we take a look at the data,

load("./data/nightgarden.Rdata")
print(speaker)
##  [1] "upsy-daisy"  "upsy-daisy"  "upsy-daisy"  "upsy-daisy"  "tombliboo"
##  [6] "tombliboo"   "makka-pakka" "makka-pakka" "makka-pakka" "makka-pakka"
print(utterance)
##  [1] "pip" "pip" "onk" "onk" "ee"  "oo"  "pip" "pip" "onk" "onk"

it becomes very clear what happened to my sanity. Suppose that what I want to do is pull out only those utterances that were made by Makka-Pakka. To that end, I could first use the equality operator == to have R tell me which cases correspond to Makka-Pakka speaking:

speaker == "makka-pakka"
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE

So this allows a very neat way to subset the utterance vector. If I want to select only those utterances for which Makka-Pakka was the speaker, I can do so as follows:

utterance[ speaker == "makka-pakka" ]
## [1] "pip" "pip" "onk" "onk"
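
If you don’t have the nightgarden.Rdata file to hand, the two vectors can be typed in directly from the printout. The sketch below does that, and then shows a couple of other things you can do with a logical vector once you have it. The name is.mp is a hypothetical one that I’ve chosen for this example:

```r
speaker <- c("upsy-daisy", "upsy-daisy", "upsy-daisy", "upsy-daisy",
             "tombliboo", "tombliboo",
             "makka-pakka", "makka-pakka", "makka-pakka", "makka-pakka")
utterance <- c("pip", "pip", "onk", "onk", "ee", "oo", "pip", "pip", "onk", "onk")

is.mp <- speaker == "makka-pakka"   # store the logical vector under its own name
which(is.mp)                        # the positions of the TRUE values
## [1]  7  8  9 10
sum(is.mp)                          # TRUE counts as 1, so this counts the cases
## [1] 4
utterance[is.mp]                    # same result as indexing with the expression itself
## [1] "pip" "pip" "onk" "onk"
```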

99.2.3 Matching cases with %in%

A second useful trick for extracting a subset of a vector is to use the %in% operator. It’s actually very similar to the == operator, except that you can supply a collection of acceptable values. For instance, suppose I wanted to find those cases when the utterance is either “pip” or “oo”. We can do that like this:

utterance %in% c("pip","oo")
##  [1]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE

This in turn allows us to find the speaker for all such cases,

speaker[ utterance %in% c("pip","oo") ]
## [1] "upsy-daisy"  "upsy-daisy"  "tombliboo"   "makka-pakka" "makka-pakka"
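
To see why %in% is “very similar” to ==, it may help to write the same question out the long way, with one == test per acceptable value joined together by the “or” operator. This is just a sketch, with the utterance vector typed in by hand and two hypothetical variable names:

```r
utterance <- c("pip", "pip", "onk", "onk", "ee", "oo", "pip", "pip", "onk", "onk")

long.way  <- utterance == "pip" | utterance == "oo"   # one == test per value
short.way <- utterance %in% c("pip", "oo")            # the same question using %in%
identical(long.way, short.way)
## [1] TRUE
```

The two versions agree here, but the %in% version stays short no matter how many acceptable values you list.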

99.2.4 Removing cases with negative indices

Before moving on to data frames, there are a couple of other tricks worth mentioning. The first of these is to use negative values as indices. Recall from earlier that we can use a vector of numbers to extract a set of elements that we would like to keep. For instance, suppose I want to keep only elements 2 and 3 from utterance. I could do so like this:

utterance[2:3]
## [1] "pip" "onk"

But suppose, on the other hand, that I have discovered that observations 2 and 3 are untrustworthy, and I want to keep everything except those two elements. To that end, R lets you use “negative indices” to remove specific values, like so:

utterance [ -(2:3) ]
## [1] "pip" "onk" "ee"  "oo"  "pip" "pip" "onk" "onk"
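
The elements that you drop don’t have to be next to each other. As a sketch, with the utterance vector typed in by hand, here is how you might remove the first and last elements:

```r
utterance <- c("pip", "pip", "onk", "onk", "ee", "oo", "pip", "pip", "onk", "onk")

utterance[-c(1, 10)]      # drop the first and the last element
## [1] "pip" "onk" "onk" "ee"  "oo"  "pip" "pip" "onk"
length(utterance)         # the original vector itself is left untouched
## [1] 10
```

One thing to be aware of is that R won’t let you mix positive and negative indices in the same command, so you have to decide whether you’re listing the elements to keep or the elements to drop.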

99.3 Factors

Okay, it’s time to start introducing some of the data types that are somewhat more specific to statistics. As every single research methodology class is at pains to point out, when we assign numbers to possible outcomes, these numbers can mean quite different things depending on what kind of quantity in the world we are attempting to measure. In particular, psychologists commonly make the distinction between nominal, ordinal, interval and ratio scale data. How do we capture this distinction in R? Currently, we only seem to have a single numeric data type. That’s probably not going to be enough, is it?

A little thought suggests that the numeric variable class in R is perfectly suited for capturing ratio scale data. For instance, if I were to measure response time (RT) for nine different events, I could store the data in R like this:

RT <- c(1342, 1401, 1590, 1391, 1554, 1422, 1612, 1230, 998)

where the data here are measured in milliseconds, as is conventional in the psychological literature. It’s perfectly sensible to talk about “twice the response time”, 2 * RT, or the “response time plus 1 second”, RT + 1000, and so both of the following are perfectly reasonable things for R to do:

2 * RT
## [1] 2684 2802 3180 2782 3108 2844 3224 2460 1996
RT + 1000
## [1] 2342 2401 2590 2391 2554 2422 2612 2230 1998

And to a lesser extent, the “numeric” class is okay for interval scale data, as long as we remember that multiplication and division aren’t terribly interesting for these sorts of variables. That is, if my IQ score is 110 and yours is 120, it’s perfectly okay to say that you’re 10 IQ points smarter than me,1 but it’s not okay to say that I’m only 92% as smart as you are, because intelligence doesn’t have a “natural” zero.2 Similarly, we might be willing to tolerate the use of numeric variables to represent ordinal scale variables, such as those that you typically get when you ask people to rank order items (e.g., like we do in Australian elections), though as we will see R actually has a built in tool for representing ordinal data (ordered factors: discussed later). However, when it comes to nominal scale data, it becomes completely unacceptable, because almost all of the “usual” rules for what you’re allowed to do with numbers don’t apply to nominal scale data. It is for this reason that R has factors.

99.3.1 Introducing factors

Suppose I was doing a study in which people could belong to one of two different treatment conditions. Each group of people was asked to complete the same task, but each group received different instructions. Not surprisingly, I might want to have a variable that keeps track of what group people were in. So I could type in something like this:

group <- c(1,1,1,2,2,2,1,2,1)

so that group[i] contains the group membership of the i-th person in my study. Clearly, this is numeric data, but equally obviously this is a nominal scale variable. There’s no sense in which “group 1” plus “group 2” equals “group 3”, but nevertheless if I try to do that, R won’t stop me because it doesn’t know any better:

group + 2
## [1] 3 3 3 4 4 4 3 4 3

This doesn’t make a lot of sense, but R is too stupid to know any better: it thinks that 1 is an ordinary number in this context, so it sees no problem in calculating 1 + 2. But since we’re not that stupid, we’d like to stop R from doing this. We can do so by instructing R to treat group as a factor. This is easy to do using the as.factor function:3

group <- as.factor(group)
group
## [1] 1 1 1 2 2 2 1 2 1
## Levels: 1 2

It looks more or less the same as before (though it’s not immediately obvious what all that Levels rubbish is about), but if we ask R to tell us what the class of the group variable is now, it’s clear that it has done what we asked:

class(group)
## [1] "factor"

Neat. Better yet, now that I’ve converted group to a factor, look what happens when I try to add 2 to it:

group + 2
## Warning in Ops.factor(group, 2): '+' not meaningful for factors
## [1] NA NA NA NA NA NA NA NA NA

This time even R is smart enough to know that I’m being an idiot, so it tells me off and then produces a vector of missing values.

99.3.2 Labelling the factor levels

I have a confession to make. My memory is not infinite in capacity; and it seems to be getting worse as I get older. So it kind of annoys me when I get data sets where there’s a nominal scale variable called gender which is supposed to have two levels corresponding to males and females. Even setting aside the question of whether gender should be considered a binary variable in the first place (seems a little unlikely no matter how you define the concept of “gender”), it drives me crazy when I load the file, print out the variable, and get something like this:

load("./data/badgender.Rdata")
gender
## [1] 1 1 1 1 1 2 2 2 2
## Levels: 1 2

Okaaaay. That’s not helpful at all, and it makes me very sad. Which number corresponds to the males and which one corresponds to the females? Wouldn’t it be nice if R could actually keep track of this? It’s way too hard to remember which number corresponds to which gender. And besides, the problem that this causes is much more serious than a single sad nerd… because R has no way of knowing that the 1s in the group factor are a very different kind of thing to the 1s in the gender factor. So if I try to ask which elements of the group variable are equal to the corresponding elements in gender, R thinks this is totally kosher, and gives me this:

group == gender
## [1]  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE

Well, that’s… especially stupid.4 The problem here is that R is very literal minded. Even though you’ve declared both group and gender to be factors, it still assumes that a 1 is a 1 no matter which variable it appears in.

To fix both of these problems (my memory problem, and R’s infuriating literal interpretations), what we need to do is assign meaningful labels to the different levels of each factor. We can do that like this:

levels(group) <- c("control", "treatment")
levels(gender) <- c("male", "female")

Having done this R produces output that seems a lot more readable…

group
## [1] control   control   control   treatment treatment treatment control
## [8] treatment control
## Levels: control treatment
gender
## [1] male   male   male   male   male   female female female female
## Levels: male female

And when I try to make insane comparisons it tells me to feel bad about myself…

group==gender
## Error in Ops.factor(group, gender): level sets of factors are different

Factors are very useful things, and we’ll use them a lot in this book: they’re the main way to represent a nominal scale variable. And there are lots of nominal scale variables out there. I’ll talk more about factors later, but for now you know enough to be able to get started.
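
As an aside, you don’t have to create the factor and label its levels in two separate steps. The factor function in base R accepts levels and labels arguments, so one possible way to build the labelled group variable in a single command looks like the sketch below:

```r
group <- factor(
  c(1, 1, 1, 2, 2, 2, 1, 2, 1),          # the raw codes
  levels = c(1, 2),                      # the values that can occur
  labels = c("control", "treatment")     # the label to attach to each one
)
levels(group)
## [1] "control"   "treatment"
as.character(group[1:4])                 # the labels, as plain text
## [1] "control"   "control"   "control"   "treatment"
as.numeric(group[1:4])                   # the underlying level numbers
## [1] 1 1 1 2
```

The last two commands show that a factor keeps track of both things at once: the label that you see when it’s printed, and the number of the level that the label belongs to.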

99.4 Loading a CSV

99.4.1 Importing CSV data (commands)

One quite commonly used data format is the humble “comma separated value” file, also called a CSV file, and usually bearing the file extension .csv. CSV files are just plain old-fashioned text files, and what they store is basically just a table of data. This is illustrated in the screenshot below, which shows a file called booksales.csv that I’ve created. As you can see, each column corresponds to a variable, and each row represents the book sales data for one month. The first row doesn’t contain actual data though: it has the names of the variables.

If Rstudio were not available to you, the easiest way to open this file would be to use the read.csv function. This function is pretty flexible, and I’ll talk a lot more about its capabilities in a later chapter, but for now there are only two arguments to the function that I’ll mention:

  • file. This should be a character string that specifies a path to the file that needs to be loaded. You can use an absolute path or a relative path to do so.
  • header. This is a logical value indicating whether or not the first row of the file contains variable names. The default value is TRUE.

So, if the current working directory is the Rbook folder, and the booksales.csv file is in the Rbook\data folder, the command I would need to use to load it is:

books <- read.csv(file = "./data/booksales.csv")

There are two important points to notice here. Firstly, notice that I didn’t try to use the load function, because that function is only meant to be used for .Rdata files. If you try to use load on other types of data, you get an error. Secondly, notice that when I imported the CSV file I assigned the result to a variable, which I imaginatively called books.5 Let’s have a look at what we’ve got:

print(books)
##        Month Days Sales Stock.Levels
## 1    January   31     0         high
## 2   February   28   100         high
## 3      March   31   200          low
## 4      April   30    50          out
## 5        May   31     0          out
## 6       June   30     0         high
## 7       July   31     0         high
## 8     August   31     0         high
## 9  September   30     0         high
## 10   October   31     0         high
## 11  November   30     0         high
## 12  December   31     0         high

Clearly, it’s worked, but the format of this output is a bit unfamiliar. We haven’t seen anything like this before. What you’re looking at is a data frame, which is a very important kind of variable in R, and one I’ll discuss later in this chapter. For now, let’s just be happy that we imported the data and that it looks about right.
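
If you’d like to try read.csv without having the booksales.csv file to hand, one option is to write a tiny CSV file to a temporary location and read it back in. This is only a sketch: the two rows of data are copied from the printout, and the header wording “Stock Levels” is an assumption about what the original file contains.

```r
csv.file <- tempfile(fileext = ".csv")          # a throwaway file, just for this demo
writeLines(c("Month,Days,Sales,Stock Levels",   # header row (assumed wording)
             "January,31,0,high",
             "February,28,100,high"),
           csv.file)

books <- read.csv(file = csv.file)   # header = TRUE is the default
class(books)
## [1] "data.frame"
names(books)                         # the space in "Stock Levels" becomes a dot
## [1] "Month"        "Days"         "Sales"        "Stock.Levels"
nrow(books)                          # two rows of data; the header isn't counted
## [1] 2
```

By default read.csv tidies up the variable names so that they are valid R names, which is why a header containing a space would come out as Stock.Levels.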

99.4.2 Importing CSV data (Rstudio)

Yet again, it’s easier in Rstudio. In the environment panel in Rstudio you should see a button called “Import Dataset”. Click on that, and it will give you a couple of options:

Select the “From Text File (base)…” option, and it will open up a very familiar dialog box asking you to select a file: if you’re on a Mac, it’ll look like the usual Finder window that you use to choose a file; on Windows it looks like an Explorer window. I’m assuming that you’re familiar with your own computer, so you should have no problem finding the CSV file that you want to import! Find the one you want, then open it. When you do this, you’ll see a window that looks like the one below:

The import data set window is relatively straightforward to understand. In the top left corner, you need to type the name of the variable you want R to create. By default, that will be the same as the file name: our file is called booksales.csv, so Rstudio suggests the name booksales. If you’re happy with that, leave it alone. If not, type something else. Immediately below this are a few things that you can tweak to make sure that the data gets imported correctly:

  • Heading. Does the first row of the file contain raw data, or does it contain headings for each variable? The booksales.csv file has a header at the top, so I selected “yes”.
  • Separator. What character is used to separate different entries? In most CSV files this will be a comma (it is “comma separated” after all). But you can change this if your file is different.
  • Decimal. What character is used to specify the decimal point? In English speaking countries, this is almost always a period (i.e., .). That’s not universally true: many European countries use a comma. So you can change that if you need to.
  • Quote. What character is used to denote a block of text? That’s usually going to be a double quote mark. It is for the booksales.csv file, so that’s what I selected.

The nice thing about the Rstudio window is that it shows you the raw data file at the top of the window, and it shows you a preview of the data at the bottom. If the data at the bottom doesn’t look right, try changing some of the settings on the left hand side. Once you’re happy, click “Import”. When you do, you’ll see the read.csv command appear in the console, along with a View(booksales) command that will open up a visual display of the booksales data set.
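
The settings in that window have counterparts among the arguments of read.csv, namely header, sep, dec and quote. The sketch below is my own illustration of how they line up, using a made-up file in the “European” layout, with semicolons between entries and commas as decimal points. The file contents and the variable names are hypothetical:

```r
euro.file <- tempfile(fileext = ".csv")   # throwaway file in a made-up layout
writeLines(c("item;price",
             "\"tea\";1,5",
             "\"coffee\";2,25"),
           euro.file)

prices <- read.csv(
  file   = euro.file,
  header = TRUE,    # "Heading": the first row holds the variable names
  sep    = ";",     # "Separator": entries are separated by semicolons here
  dec    = ",",     # "Decimal": a comma marks the decimal point
  quote  = "\""     # "Quote": double quote marks surround text
)
prices$price
## [1] 1.50 2.25
class(prices$price)
## [1] "numeric"
```

If the dec argument were left at its default, the prices would not be recognised as numbers, which is the command line version of the preview at the bottom of the window not looking right.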

99.5 Data frames

It’s now time to go back and deal with the somewhat confusing thing that happened earlier when we tried to open up a CSV file. Apparently we succeeded in loading the data, but it came to us in a very odd looking format. At the time, I told you that this was a data frame. Now I’d better explain what that means.

99.5.1 Introducing data frames

In order to understand why R has created this funny thing called a data frame, it helps to try to see what problem it solves. So let’s go back to the little scenario that I used when introducing factors in the last section. In that section I recorded the group and gender for all 9 participants in my study, as well as their average RT on some task.

So there are these three variables in the workspace, gender, group and RT. And it just so happens that all three are the same size (i.e., they’re all vectors with 9 elements). Aaaand it just so happens that gender[1] corresponds to the gender of the first person, and RT[1] is the response time of that very same person, etc. In other words, you and I both know that all these variables correspond to the same data set, and they are organised in exactly the same way.

However, R doesn’t know this! As far as it’s concerned, there’s no reason why the RT variable has to be the same length as the gender variable; and there’s no particular reason to think that RT[1] has any special relationship to gender[1] any more than it has a special relationship to gender[4]. In other words, when we store everything in separate variables like this, R doesn’t know anything about the relationships between things. It doesn’t even really know that these variables actually refer to a proper data set. The data frame fixes this: if we store our variables inside a data frame, we’re telling R to treat these variables as a single, fairly coherent data set.

To see how they do this, let’s create one. So how do we create a data frame? One way we’ve already seen: if we import our data from a CSV file, R will store it as a data frame. A second way is to create it directly from some existing variables using the data.frame function. All you have to do is type a list of variables that you want to include in the data frame. The output of a data.frame command is, well, a data frame. So, if I want to store the variables from my experiment in a data frame called exp I can do so like this:

exp <- data.frame(group, gender, RT)
exp
##       group gender   RT
## 1   control   male 1342
## 2   control   male 1401
## 3   control   male 1590
## 4 treatment   male 1391
## 5 treatment   male 1554
## 6 treatment female 1422
## 7   control female 1612
## 8 treatment female 1230
## 9   control female  998

Note that exp is a completely self-contained variable. Once you’ve created it, it no longer depends on the original variables from which it was constructed. That is, if we make changes to the original RT variable, it will not lead to any changes to the RT data stored in exp. So, for the sake of my sanity, let’s remove all variables from the workspace except for the exp data frame:

library(lsr)  # the who() function used below comes from the lsr package
rm(list = objects()[objects() != "exp"])  # ... this is ugly!
who() # but what matters is that it worked...
##    -- Name --   -- Class --   -- Size --
##    exp          data.frame    9 x 3

We can verify that the original RT variable is gone, because if we try to print it out R gives an error message:

RT
## Error in eval(expr, envir, enclos): object 'RT' not found
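
The claim that a data frame is self-contained is easy to check for yourself. Here’s a minimal sketch using a shortened version of the RT data, stored in a data frame that I’ve given the hypothetical name dat:

```r
RT <- c(1342, 1401, 1590)   # a shortened version of the RT data
dat <- data.frame(RT)       # "dat" is a hypothetical name for this sketch
RT[1] <- 0                  # change the original variable...
RT
## [1]    0 1401 1590
dat                         # ...and the copy inside the data frame is unaffected
##     RT
## 1 1342
## 2 1401
## 3 1590
```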

99.5.2 Indexing data frames with $

At this point, our workspace contains only the one variable, a data frame called exp. But as we can see when we told R to print the variable out, this data frame contains three variables, each of which has nine observations. So how do we get this information out again? After all, there’s no point in storing information if you don’t use it, and there’s no way to use information if you can’t access it. So let’s talk a bit about how to pull information out of a data frame. As is always the case with R there are several ways to do this. The simplest is to use the $ operator to point to the variable you’re interested in, like this:

exp$RT
## [1] 1342 1401 1590 1391 1554 1422 1612 1230  998
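
Because the $ operator hands back an ordinary vector, everything you already know about vectors still applies. The sketch below uses a cut-down, made-up version of the data frame so that it can be run on its own:

```r
exp <- data.frame(
  group  = c("control", "control", "treatment"),
  gender = c("male", "female", "female"),
  RT     = c(1342, 1401, 1590)
)   # a cut-down, made-up version of the data frame, for illustration only

names(exp)                # the variables stored inside the data frame
## [1] "group"  "gender" "RT"
exp$RT[2]                 # $ gives back a vector, so [ ] works as usual
## [1] 1401
exp$RT <- exp$RT / 1000   # $ also works on the left of <-, here converting to seconds
exp$RT
## [1] 1.342 1.401 1.590
```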

There’s a lot more that can be said about data frames: they’re fairly complicated beasts, and the longer you use R the more important it is to make sure you really understand them. We’ll talk a lot more about them later.

99.6 Lists

The next kind of data I want to mention are lists. Lists are an extremely fundamental data structure in R, and as you start making the transition from a novice to a savvy R user you will use lists all the time. Most of the advanced data structures in R are built from lists (e.g., data frames are actually a specific type of list), so it’s useful to have a basic understanding of them.

Okay, so what is a list, exactly? Like data frames, lists are just “collections of variables.” However, unlike data frames (which are basically supposed to look like a nice “rectangular” table of data) there are no constraints on what kinds of variables we include, and no requirement that the variables have any particular relationship to one another. In order to understand what this actually means, the best thing to do is create a list, which we can do using the list function. If I type this as my command:

starks <- list(
  parents = c("Eddard", "Catelyn"),
  children = c("Robb", "Jon", "Sansa", "Arya", "Brandon", "Rickon"),
  alive = 8
)

I create a list starks that contains a list of the various characters that belong to House Stark in George R. R. Martin’s A Song of Ice and Fire novels. Because Martin does seem to enjoy killing off characters, the list starts out by indicating that all eight are currently alive (at the start of the books obviously!) and we can update it if need be. When a character dies, I might do this:

starks$alive <- starks$alive - 1
starks
## $parents
## [1] "Eddard"  "Catelyn"
##
## $children
## [1] "Robb"    "Jon"     "Sansa"   "Arya"    "Brandon" "Rickon"
##
## $alive
## [1] 7

I can delete whole variables from the list if I want. For instance, I might just give up on the parents entirely:

starks$parents <- NULL
starks
## $children
## [1] "Robb"    "Jon"     "Sansa"   "Arya"    "Brandon" "Rickon"
##
## $alive
## [1] 7

You get the idea, I hope.
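To make the “no constraints” point a little more concrete, here is a minimal sketch that rebuilds the starks list and attaches two more elements of quite different kinds. The element names words and seat are my own additions for illustration and are not part of the original example.

```r
# Rebuild the list from the text
starks <- list(
  parents = c("Eddard", "Catelyn"),
  children = c("Robb", "Jon", "Sansa", "Arya", "Brandon", "Rickon"),
  alive = 8
)

# Elements of a list can differ in type and in length
starks$words <- "Winter is Coming"   # hypothetical extra element: a single string
starks$seat <- list(                 # hypothetical extra element: a list inside a list
  castle = "Winterfell",
  region = "The North"
)

names(starks)             # the names of all five elements
length(starks$children)   # the children vector has six entries
starks$seat$castle        # the $ operator can be chained for nested lists
```

Nothing forces these elements to line up with each other in any way, which is exactly what distinguishes a plain list from a data frame.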

99.7 Data frames revisited

Working with data frames can sometimes be complicated, so I’m going to revisit them again now that we’ve talked about lists. At its core, a data frame genuinely is a list, one that just happens to have a special “rectangular” structure. However, this rectangular structure means that in addition to working with it using the $ operator (as one would do for a list), there are some other possibilities too, based on the fact that a data frame has rows and columns. Let’s return to the In the Night Garden data set. First, let’s construct a data frame, itng:

load("./data/nightgarden.Rdata")
itng <- data.frame( speaker, utterance )
itng
##        speaker utterance
## 1   upsy-daisy       pip
## 2   upsy-daisy       pip
## 3   upsy-daisy       onk
## 4   upsy-daisy       onk
## 5    tombliboo        ee
## 6    tombliboo        oo
## 7  makka-pakka       pip
## 8  makka-pakka       pip
## 9  makka-pakka       onk
## 10 makka-pakka       onk

… and we’ll assume the goal is to be able to select different subsets of this data set.

99.7.1 The subset function

There are several different ways to subset a data frame in R, some easier than others. I’ll start by discussing the subset function, which is probably the conceptually simplest way to do it. For our purposes there are three different arguments that you’ll be most interested in:

  • x. The data frame that you want to subset.
  • subset. A vector of logical values indicating which cases (rows) of the data frame you want to keep. By default, all cases will be retained.
  • select. This argument indicates which variables (columns) in the data frame you want to keep. This can either be a list of variable names, or a logical vector indicating which ones to keep, or even just a numeric vector containing the relevant column numbers. By default, all variables will be retained.

Let’s start with an example in which I use all three of these arguments. Suppose that I want to subset the itng data frame, keeping only the utterances made by Makka-Pakka. What that means is that I need to use the select argument to pick out the utterance variable, and I also need to use the subset argument to pick out the cases when Makka-Pakka is speaking (i.e., speaker == "makka-pakka"). Therefore, the command I need to use is this:

df <- subset(
  x = itng,
  subset = speaker == "makka-pakka",
  select = utterance
  )
print( df )
##    utterance
## 7        pip
## 8        pip
## 9        onk
## 10       onk

The variable df here is still a data frame, but it only contains one variable (called utterance) and four cases. Notice that the row numbers are the same ones from the original data frame. It’s worth taking a moment to briefly explain this. The reason that this happens is that these “row numbers” are actually row names. When you create a new data frame from scratch R will assign each row a fairly boring row name, identical to the row number. However, when you subset the data frame, each row keeps its original row name. This can be quite useful, since – as in the current example – it provides you a visual reminder of what each row in the new data frame corresponds to in the original data frame. However, if it annoys you, you can change the row names using the rownames function, or remove them entirely with the rownames(df) <- NULL command.
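As a quick sketch of the row name behaviour just described, the snippet below rebuilds the itng data by hand (typed in from the printout above, since the .Rdata file is not assumed to be available) and then resets the row names of the subset.

```r
# Rebuild the In the Night Garden data from the printout in the text
speaker <- c(rep("upsy-daisy", 4), rep("tombliboo", 2), rep("makka-pakka", 4))
utterance <- c("pip", "pip", "onk", "onk", "ee", "oo", "pip", "pip", "onk", "onk")
itng <- data.frame(speaker, utterance)

df <- subset(x = itng, subset = speaker == "makka-pakka", select = utterance)
rownames(df)            # the original row names "7" "8" "9" "10" are kept

rownames(df) <- NULL    # discard them; R falls back to plain row numbers
rownames(df)            # now "1" "2" "3" "4"
```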

In any case, let’s return to the subset function, and look at what happens when we don’t use all three of the arguments. Firstly, suppose that I didn’t bother to specify the select argument. Let’s see what happens:

subset(
  x = itng,
  subset = speaker == "makka-pakka"
)
##        speaker utterance
## 7  makka-pakka       pip
## 8  makka-pakka       pip
## 9  makka-pakka       onk
## 10 makka-pakka       onk

Not surprisingly, R has kept the same cases from the original data set (i.e., rows 7 through 10), but this time it has retained all of the variables from the data frame. Equally unsurprisingly, if I don’t specify the subset argument, what we find is that R keeps all of the cases:

subset(
  x = itng,
  select = utterance
)
##    utterance
## 1        pip
## 2        pip
## 3        onk
## 4        onk
## 5         ee
## 6         oo
## 7        pip
## 8        pip
## 9        onk
## 10       onk

Again, it’s important to note that this output is still a data frame: it’s just a data frame with only a single variable.
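The description of the select argument mentioned that it accepts variable names, column numbers or a logical vector. Here is a small sketch of those alternatives, again using a hand-typed copy of the itng data; the object names by_name, by_number and so on are just labels I made up for the example.

```r
# Rebuild the data from the printout in the text
speaker <- c(rep("upsy-daisy", 4), rep("tombliboo", 2), rep("makka-pakka", 4))
utterance <- c("pip", "pip", "onk", "onk", "ee", "oo", "pip", "pip", "onk", "onk")
itng <- data.frame(speaker, utterance)

by_name    <- subset(itng, select = utterance)        # a variable name
by_number  <- subset(itng, select = 2)                # a column number
by_logical <- subset(itng, select = c(FALSE, TRUE))   # a logical vector
dropped    <- subset(itng, select = -speaker)         # everything except speaker

# The subset argument can combine several conditions with &
onk_only <- subset(itng, subset = speaker == "makka-pakka" & utterance == "onk")
onk_only
```

All four of the select variants keep only the utterance column, and in every case the result is still a data frame.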

99.7.2 Brackets I: Rows & columns

Throughout the book so far, whenever I’ve been subsetting a vector I’ve tended to use the square brackets [ ] to do so. But in the previous section when I started talking about subsetting a data frame I used the subset function. As a consequence, you might be wondering whether it is possible to use the square brackets to subset a data frame. The answer, of course, is yes. Not only can you use square brackets for this purpose, as you become more familiar with R you’ll find that this is actually much more convenient than using subset. Unfortunately, the use of square brackets for this purpose is somewhat complicated, and there are a few cases that cause some confusion. So be warned: this section is more complicated than it feels like it “should” be. With that warning in place, I’ll try to walk you through it slowly. For this section, I’ll use a slightly different In the Night Garden data set, namely the garden data frame that is stored in the nightgarden2.Rdata file:

load( "./data/nightgarden2.Rdata" )
garden
##            speaker utterance line
## case.1  upsy-daisy       pip    1
## case.2  upsy-daisy       pip    2
## case.3   tombliboo        ee    5
## case.4 makka-pakka       pip    7
## case.5 makka-pakka       onk    9

As you can see, the garden data frame contains three variables and five cases, and this time around I’ve used the rownames function to attach slightly verbose labels to each of the cases. Moreover, let’s assume that what we want to do is to pick out rows 4 and 5 (the two cases when Makka-Pakka is speaking), and columns 1 and 2 (variables speaker and utterance).

How shall we do this? As usual, there’s more than one way. The first way is based on the observation that, since a data frame is rectangular, every element in the data frame has a row number and a column number. So, if we want to pick out a single element, we have to specify the row number and a column number within the square brackets. By convention, the row number comes first. So, for the data frame above, which has five rows and three columns, the numerical indexing scheme looks like this:

        col 1   col 2   col 3
row 1   [1,1]   [1,2]   [1,3]
row 2   [2,1]   [2,2]   [2,3]
row 3   [3,1]   [3,2]   [3,3]
row 4   [4,1]   [4,2]   [4,3]
row 5   [5,1]   [5,2]   [5,3]

If I want the 3rd case of the 2nd variable, what I would type is garden[3,2], and R would print out some output showing that this element corresponds to the utterance "ee". However, let’s hold off from actually doing that for a moment, because there’s something slightly counterintuitive about the specifics of what R does under those circumstances. Instead, let’s aim to solve our original problem, which is to pull out two rows (4 and 5) and two columns (1 and 2). This is fairly simple to do, since R allows us to specify multiple rows and multiple columns. So let’s try that:

garden[ 4:5, 1:2 ]
##            speaker utterance
## case.4 makka-pakka       pip
## case.5 makka-pakka       onk

Clearly, that’s exactly what we asked for: the output here is a data frame containing two variables and two cases. Note that I could have gotten the same answer if I’d used the c function to produce my vectors rather than the : operator. That is, the following command is equivalent to the last one:

garden[ c(4,5), c(1,2) ]
##            speaker utterance
## case.4 makka-pakka       pip
## case.5 makka-pakka       onk

It’s just not as pretty. However, if the columns and rows that you want to keep don’t happen to be next to each other in the original data frame, then you might find that you have to resort to using commands like garden[ c(2,4,5), c(1,3) ] to extract them.
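To see what that last command produces, here is a sketch that types the garden data frame in by hand from the printout above (the real one lives in nightgarden2.Rdata, which I am not assuming you have) and then picks out rows 2, 4 and 5 together with columns 1 and 3.

```r
# Hand-typed copy of the garden data frame shown in the text
garden <- data.frame(
  speaker = c("upsy-daisy", "upsy-daisy", "tombliboo", "makka-pakka", "makka-pakka"),
  utterance = c("pip", "pip", "ee", "pip", "onk"),
  line = c(1, 2, 5, 7, 9),
  row.names = paste0("case.", 1:5)
)

# Rows and columns that are not adjacent need the c function
out <- garden[ c(2,4,5), c(1,3) ]
out
##            speaker line
## case.2  upsy-daisy    2
## case.4 makka-pakka    7
## case.5 makka-pakka    9
```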

A second way to do the same thing is to use the names of the rows and columns. That is, instead of using the row numbers and column numbers, you use the character strings that are used as the labels for the rows and columns. To apply this idea to our garden data frame, we would use a command like this:

garden[ c("case.4", "case.5"), c("speaker", "utterance") ]
##            speaker utterance
## case.4 makka-pakka       pip
## case.5 makka-pakka       onk

Once again, this produces exactly the same output. Note that, although this version is more annoying to type than the previous version, it’s a bit easier to read, because it’s often more meaningful to refer to the elements by their names rather than their numbers. Also note that you don’t have to use the same convention for the rows and columns. For instance, I often find that the variable names are meaningful and so I sometimes refer to them by name, whereas the row names are pretty arbitrary so it’s easier to refer to them by number. In fact, that’s more or less exactly what’s happening with the garden data frame, so it probably makes more sense to use this as the command:

garden[ 4:5, c("speaker", "utterance") ]
##            speaker utterance
## case.4 makka-pakka       pip
## case.5 makka-pakka       onk

Again, the output is identical.

Finally, both the rows and columns can be indexed using logical vectors as well. For example, although I claimed earlier that my goal was to extract cases 4 and 5, it’s pretty obvious that what I really wanted to do was select the cases where Makka-Pakka is speaking. So what I could have done is create a logical vector that indicates which cases correspond to Makka-Pakka speaking:

garden$speaker == "makka-pakka"
## [1] FALSE FALSE FALSE  TRUE  TRUE

As you can see, the 4th and 5th elements of this vector are TRUE while the others are FALSE. So I can use this vector to select the rows that I want to keep:

garden[ garden$speaker == "makka-pakka", c("speaker", "utterance") ]
##            speaker utterance
## case.4 makka-pakka       pip
## case.5 makka-pakka       onk

And of course the output is, yet again, the same.

99.7.3 Brackets II: Some elaborations

There are two fairly useful elaborations on this “rows and columns” approach that I should point out. Firstly, what if you want to keep all of the rows, or all of the columns? To do this, all we have to do is leave the corresponding entry blank, but it is crucial to remember to keep the comma! For instance, suppose I want to keep all the rows in the garden data, but I only want to retain the first two columns. The easiest way to do this is to use a command like this:

garden[ , 1:2 ]
##            speaker utterance
## case.1  upsy-daisy       pip
## case.2  upsy-daisy       pip
## case.3   tombliboo        ee
## case.4 makka-pakka       pip
## case.5 makka-pakka       onk

Alternatively, if I want to keep all the columns but only want the last two rows, I use the same trick, but this time I leave the second index blank. So my command becomes:

garden[ 4:5, ]
##            speaker utterance line
## case.4 makka-pakka       pip    7
## case.5 makka-pakka       onk    9

The second elaboration I should note is that it’s still okay to use negative indexes as a way of telling R to delete certain rows or columns. For instance, if I want to delete the 3rd column, then I use this command:

garden[ , -3 ]
##            speaker utterance
## case.1  upsy-daisy       pip
## case.2  upsy-daisy       pip
## case.3   tombliboo        ee
## case.4 makka-pakka       pip
## case.5 makka-pakka       onk

whereas if I want to delete the 3rd row, then I’d use this one:

garden[ -3,  ]
##            speaker utterance line
## case.1  upsy-daisy       pip    1
## case.2  upsy-daisy       pip    2
## case.4 makka-pakka       pip    7
## case.5 makka-pakka       onk    9

So that’s nice.

99.7.4 Brackets III: “dropping”

At this point some of you might be wondering why I’ve been so terribly careful to choose my examples in such a way as to ensure that the output always has multiple rows and multiple columns. The reason for this is that I’ve been trying to hide the somewhat curious “dropping” behaviour that R produces when the output only has a single column. I’ll start by showing you what happens, and then I’ll try to explain it. Firstly, let’s have a look at what happens when the output contains only a single row:

garden[ 5, ]
##            speaker utterance line
## case.5 makka-pakka       onk    9

This is exactly what you’d expect to see: a data frame containing three variables, and only one case per variable. Okay, no problems so far. What happens when you ask for a single column? Suppose, for instance, I try this as a command:

garden[ , 3 ]

Based on everything that I’ve shown you so far, you would be well within your rights to expect to see R produce a data frame containing a single variable and five cases. After all, that is what the subset command does in this situation, and it’s pretty consistent with everything else that I’ve shown you so far about how square brackets work. In other words, you should expect to see this:

##        line
## case.1    1
## case.2    2
## case.3    5
## case.4    7
## case.5    9

However, that is emphatically not what happens. What you actually get is this:

garden[, 3]
## [1] 1 2 5 7 9

That output is not a data frame at all! That’s just an ordinary numeric vector containing 5 elements. What’s going on here is that R has “noticed” that the output that we’ve asked for doesn’t really “need” to be wrapped up in a data frame at all, because it only corresponds to a single variable. So what it does is “drop” the output from a data frame containing a single variable, “down” to a simpler output that corresponds to that variable. This behaviour is convenient for day to day usage once you’ve become familiar with it – and I suppose that’s the real reason why R does this – but there’s no escaping the fact that it is deeply confusing to novices. It’s especially confusing because the behaviour appears only for a very specific case: (a) it only works for columns and not for rows, because the columns correspond to variables and the rows do not, and (b) it only applies to the “rows and columns” version of the square brackets, and not to the subset function, or to the “just columns” use of the square brackets (next section). As I say, it’s very confusing when you’re just starting out. For what it’s worth, you can suppress this behaviour if you want, by setting drop = FALSE when you construct your bracketed expression. That is, you could do something like this:

garden[, 3, drop=FALSE]
##        line
## case.1    1
## case.2    2
## case.3    5
## case.4    7
## case.5    9

I suppose that helps a little bit, in that it gives you some control over the dropping behaviour, but I’m not sure it helps to make things any easier to understand. Anyway, that’s the “dropping” special case. Fun, isn’t it?

99.7.5 Brackets IV: Columns only

As if the weird “dropping” behaviour wasn’t annoying enough, R actually provides a completely different way of using square brackets to index a data frame. Specifically, if you only give a single index, R will assume you want the corresponding columns, not the rows. Do not be fooled by the fact that this second method also uses square brackets: it behaves differently to the “rows and columns” method that I’ve discussed in the last few sections. Let’s start with the following command:

garden[ 1:2 ]
##            speaker utterance
## case.1  upsy-daisy       pip
## case.2  upsy-daisy       pip
## case.3   tombliboo        ee
## case.4 makka-pakka       pip
## case.5 makka-pakka       onk

As you can see, the output gives me the first two columns, much as if I’d typed garden[,1:2]. It doesn’t give me the first two rows, which is what I’d have gotten if I’d used a command like garden[1:2,]. Not only that, if I ask for a single column, R does not drop the output:

garden[3]
##        line
## case.1    1
## case.2    2
## case.3    5
## case.4    7
## case.5    9

As I said earlier, the only case where dropping occurs by default is when you use the “row and columns” version of the square brackets, and the output happens to correspond to a single column. However, if you really want to force R to drop the output, you can do so using the “double brackets” notation:

garden[[3]]
## [1] 1 2 5 7 9

Note that R will only allow you to ask for one column at a time using the double brackets. If you try to ask for multiple columns in this way, you get completely different behaviour, which may or may not produce an error, but definitely won’t give you the output you’re expecting. The only reason I’m mentioning it at all is that you might run into double brackets when doing further reading, and a lot of books don’t explicitly point out the difference between [ ] and [[ ]].
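If you would like to check the difference for yourself, the following sketch compares the two kinds of brackets on a hand-typed copy of the garden data frame. The last line shows the “completely different behaviour” for one particular case: as far as I understand it, a vector of indices inside double brackets is applied one step at a time, first choosing a column and then an element within it.

```r
# Hand-typed copy of the garden data frame shown in the text
garden <- data.frame(
  speaker = c("upsy-daisy", "upsy-daisy", "tombliboo", "makka-pakka", "makka-pakka"),
  utterance = c("pip", "pip", "ee", "pip", "onk"),
  line = c(1, 2, 5, 7, 9),
  row.names = paste0("case.", 1:5)
)

is.data.frame( garden[3] )      # TRUE: single brackets keep the data frame
is.data.frame( garden[[3]] )    # FALSE: double brackets return the column itself
identical( garden[[3]], garden$line )        # TRUE
identical( garden[["line"]], garden$line )   # TRUE: column names work too

# Two indices inside double brackets: column 3, then its 2nd element
garden[[ c(3,2) ]]
## [1] 2
```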

99.7.6 Whyyyyy?

Okay, for those few readers that have persevered with this section long enough to get here without having set fire to something, I should explain why R has these two different systems for subsetting a data frame (i.e., “row and column” versus “just columns”), and why they behave so differently to each other. I’m not 100% sure about the motivation since I never did manage to read through very many of the references that describe the early development of R, but I think the answer relates to the fact that data frames are actually a very strange hybrid of two different kinds of thing. At a low level, a data frame is a list. I can demonstrate this to you by overriding the normal print function and forcing R to print out the garden data frame using the default print method (see later!) rather than the special one that is defined only for data frames. Here’s what we get:

print.default( garden )
## $speaker
## [1] upsy-daisy  upsy-daisy  tombliboo   makka-pakka makka-pakka
## Levels: makka-pakka tombliboo upsy-daisy
##
## $utterance
## [1] pip pip ee  pip onk
## Levels: ee onk oo pip
##
## $line
## [1] 1 2 5 7 9
##
## attr(,"class")
## [1] "data.frame"

Apart from the weird part of the output right at the bottom, this is identical to the print out that you get when you print out a list. In other words, a data frame is a list. Viewed from this “list based” perspective, it’s clear what garden[1] is: it’s the first variable stored in the list, namely speaker. In other words, when you use the “just columns” way of indexing a data frame, using only a single index, R assumes that you’re thinking about the data frame as if it were a list of variables. In fact, when you use the $ operator you’re taking advantage of the fact that the data frame is secretly a list.
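A few one-line checks make the same point without the messy print out. This is only a sketch, using a hand-typed copy of garden rather than the one loaded from file.

```r
# Hand-typed copy of the garden data frame shown in the text
garden <- data.frame(
  speaker = c("upsy-daisy", "upsy-daisy", "tombliboo", "makka-pakka", "makka-pakka"),
  utterance = c("pip", "pip", "ee", "pip", "onk"),
  line = c(1, 2, 5, 7, 9),
  row.names = paste0("case.", 1:5)
)

is.list( garden )           # TRUE: a data frame passes the test for being a list
length( garden )            # 3: the number of variables, not the number of cases
names( garden )             # the variable names are the names of the list elements
sapply( garden, length )    # every element of the list has the same length, 5
```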

However, a data frame is more than just a list. It’s a very special kind of list where all the variables are of the same length, and the first element in each variable happens to correspond to the first “case” in the data set. That’s why no-one ever wants to see a data frame printed out in the default “list-like” way that I’ve shown in the extract above. In terms of the deeper meaning behind what a data frame is used for, a data frame really does have this rectangular shape to it:

print( garden )
##            speaker utterance line
## case.1  upsy-daisy       pip    1
## case.2  upsy-daisy       pip    2
## case.3   tombliboo        ee    5
## case.4 makka-pakka       pip    7
## case.5 makka-pakka       onk    9

Because of the fact that a data frame is basically a table of data, R provides a second “row and column” method for interacting with the data frame. This method makes much more sense in terms of the “table of data” interpretation of what a data frame is, and so for the most part it’s this method that people tend to prefer. Throughout this tutorial I’ll aim to stick to the “row and column” approach (though I will use $ a lot), and avoid referring to the “just columns” approach. However, it does get used a lot in practice, so I think it’s important that this book explain what’s going on.

And now let us never speak of this again.

99.8 Matrices

Up to this point we have encountered several different kinds of variables. At the simplest level, we’ve seen numeric data, logical data and character data. However, we’ve also encountered some more complicated kinds of variables, namely factors, formulas, data frames and lists. There are many more specialised data structures out there, but there are a few more generic ones that I want to talk about in passing.

The first of these is a matrix. Much like a data frame, a matrix is basically a big rectangular table of data, and in fact there are quite a few similarities between the two. However, there are also some key differences, so it’s important to talk about matrices in a little detail. To start with, let’s create a matrix using the “row bind” function, rbind, which you can use to combine multiple vectors together in a row-wise fashion (I’m sure you can guess what the column bind function cbind does)…

row.1 <- c( 2,3,1 )         # create data for row 1
row.2 <- c( 5,6,7 )         # create data for row 2
M <- rbind( row.1, row.2 )  # row bind them into a matrix
print(M)                    # and print it out...
##       [,1] [,2] [,3]
## row.1    2    3    1
## row.2    5    6    7

The variable M is a matrix, which we can confirm by using the class function. Notice that, when we bound the two vectors together, R retained the names of the original variables as row names. We could delete these if we wanted by typing rownames(M) <- NULL, but I generally prefer having meaningful names attached to my variables, so I’ll keep them. In fact, let’s also add some highly unimaginative column names as well:

colnames(M) <- c( "col.1", "col.2", "col.3" )
print(M)
##       col.1 col.2 col.3
## row.1     2     3     1
## row.2     5     6     7
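For completeness, here is a short sketch of the two things mentioned in passing above: checking that M really is a matrix, and what the column bind function does with the same two vectors. The name N is just something I picked for the second matrix.

```r
row.1 <- c( 2,3,1 )
row.2 <- c( 5,6,7 )
M <- rbind( row.1, row.2 )

is.matrix( M )   # TRUE
class( M )       # "matrix" (R 4.0.0 and later report "matrix" "array")
dim( M )         # 2 rows and 3 columns

# cbind treats each vector as a column instead of a row
N <- cbind( row.1, row.2 )
dim( N )         # 3 rows and 2 columns
print( N )
```

Notice that the original variable names now end up as column names rather than row names.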

99.8.1 Matrix indexing

You can use square brackets to subset a matrix in much the same way that you can for data frames, again specifying a row index and then a column index. For instance, M[2,3] pulls out the entry in the 2nd row and 3rd column of the matrix (i.e., 7), whereas M[2,] pulls out the entire 2nd row, and M[,3] pulls out the entire 3rd column. However, it’s worth noting that when you pull out a column, R will print the results horizontally, not vertically.6 There is also a way of referring to the elements of a matrix using a single index:

M <- matrix(
  data = 1:12, # the values to include in the matrix
  nrow = 3,    # number of rows
  ncol = 4     # number of columns
)
print(M)
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12

This is a \(3\times 4\) matrix M that contains the numbers 1 to 12 in order, listed columnwise. You can index the elements of M by referring to them in that same columnwise order. To see this, notice that these two commands both extract the same element of the matrix:

print(M[2,2]) 
## [1] 5
print(M[5])
## [1] 5
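The link between the two kinds of index is simple arithmetic: because the values are stored column by column, the element in row i and column j sits at position (j - 1) × (number of rows) + i. The sketch below checks this for the example above, and also shows the row and column extraction described at the start of this section. The variable names i, j and k are mine.

```r
M <- matrix( data = 1:12, nrow = 3, ncol = 4 )

i <- 2                         # row
j <- 2                         # column
k <- (j - 1) * nrow(M) + i     # position when counting down the columns
k                              # 5
M[k] == M[i, j]                # TRUE

M[2, ]                  # the whole 2nd row, as a plain vector: 2 5 8 11
M[, 3]                  # the whole 3rd column, also a plain vector: 7 8 9
M[, 3, drop = FALSE]    # the same column, kept as a 3 x 1 matrix
```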

99.8.2 Matrices vs data frames

The critical difference between a data frame and a matrix is that, in a data frame, we have this notion that each of the columns corresponds to a different variable: as a consequence, the columns in a data frame can be of different data types. The first column could be numeric, and the second column could contain character strings, and the third column could be logical data. In that sense, there is a fundamental asymmetry built into a data frame, because of the fact that columns represent variables (which can be qualitatively different to each other) and rows represent cases (which cannot). Matrices are different. At a fundamental level, a matrix really is just one variable: it just happens that this one variable is formatted into rows and columns. If you want a matrix of numeric data, every single element in the matrix must be a number. If you want a matrix of character strings, every single element in the matrix must be a character string. If you try to mix data of different types together, then R will either complain or try to coerce the matrix into something unexpected. To give you a sense of this, let’s do something silly and convert one element of M from the number 5 to the character string "five":

M[2,2] <- "five"
print(M)
##      [,1] [,2]   [,3] [,4]
## [1,] "1"  "4"    "7"  "10"
## [2,] "2"  "five" "8"  "11"
## [3,] "3"  "6"    "9"  "12"

As you can see, the entire matrix M has been coerced into text.
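The contrast with data frames can be checked directly. In this sketch the data frame mixed and its three columns are made up purely for illustration: each column keeps its own type inside the data frame, but converting the whole thing to a matrix forces everything into a single type.

```r
# A hypothetical data frame with three columns of different types
mixed <- data.frame(
  id = 1:3,
  label = c("a", "b", "c"),
  flag = c(TRUE, FALSE, TRUE)
)

is.integer( mixed$id )      # TRUE: this column is still numeric
is.logical( mixed$flag )    # TRUE: and this one is still logical

# A matrix can only hold one type, so everything becomes text
as.text <- as.matrix( mixed )
typeof( as.text )           # "character"
```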

99.9 Arrays

When doing data analysis, we often have reasons to want to use higher dimensional tables (e.g., sometimes you need to cross-tabulate three variables against each other). You can’t do this with matrices, but you can do it with arrays. An array is just like a matrix, except it can have more than two dimensions if you need it to. In fact, as far as R is concerned a matrix is just a special kind of array, in much the same way that a data frame is a special kind of list. I don’t want to talk about arrays too much, but I will very briefly show you an example of what a three dimensional array looks like.

A <- array(
  data = 1:24,
  dim = c(3,4,2)
  )
print(A)
## , , 1
##
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
##
## , , 2
##
##      [,1] [,2] [,3] [,4]
## [1,]   13   16   19   22
## [2,]   14   17   20   23
## [3,]   15   18   21   24

99.9.1 Array indexing

Not surprisingly, you can refer to the elements of an array using the same kind of logic that we used for matrices. In this case, since A is a three-dimensional \(3 \times 4 \times 2\) array, we need three indices to specify an element,

A[2,3,1]
## [1] 8

but you probably won’t be surprised to note that this also works…

A[8]
## [1] 8
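The reason both commands agree is the same columnwise counting as for matrices, extended by one dimension: each step along the second index skips 3 elements, and each step along the third skips 3 × 4 = 12. Here is a sketch of that arithmetic (the names i, j, k and pos are mine), along with the base R function arrayInd, which goes in the opposite direction.

```r
A <- array( data = 1:24, dim = c(3,4,2) )

i <- 2; j <- 3; k <- 1
pos <- i + (j - 1) * 3 + (k - 1) * 3 * 4   # position in storage order
pos                                        # 8
A[pos] == A[i, j, k]                       # TRUE

# arrayInd converts a single position back into one index per dimension
arrayInd( 8, dim(A) )
```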

99.9.2 Array names

As with other data structures, arrays can have names for specific elements. In fact, we can assign names to each of the dimensions too. For instance, suppose that the array A has the shape that it does because it represents a (3 genders) \(\times\) (4 seasons) \(\times\) (2 times) structure. We could specify the dimension names for A like this:

A <- array(
  data = 1:24,
  dim = c(3,4,2),
  dimnames = list(
    "genders" = c("male", "female", "nonbinary"),
    "seasons" = c("summer", "autumn", "winter", "spring"),
    "times" = c("day", "night")
    )
  )
print(A)
## , , times = day
##
##            seasons
## genders     summer autumn winter spring
##   male           1      4      7     10
##   female         2      5      8     11
##   nonbinary      3      6      9     12
##
## , , times = night
##
##            seasons
## genders     summer autumn winter spring
##   male          13     16     19     22
##   female        14     17     20     23
##   nonbinary     15     18     21     24

I find this a lot easier to read - it’s usually a good idea to label your arrays! Plus, it makes it a little easier to index them too, since (as usual) you can refer to elements by names. So if I just wanted to take a slice through the array corresponding to the "nonbinary" values, I could do this:

A["nonbinary",,]
##         times
## seasons  day night
##   summer   3    15
##   autumn   6    18
##   winter   9    21
##   spring  12    24
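Names can be used in every position, and they can be mixed with blank entries in the usual way. As a brief sketch using the same labelled array:

```r
A <- array(
  data = 1:24,
  dim = c(3,4,2),
  dimnames = list(
    "genders" = c("male", "female", "nonbinary"),
    "seasons" = c("summer", "autumn", "winter", "spring"),
    "times" = c("day", "night")
    )
  )

A["female", "winter", "night"]        # a single element: 20
A[ , c("summer", "winter"), "day"]    # all genders, two seasons, daytime only
dimnames(A)$seasons                   # look up the labels for one dimension
```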

99.10 Ordered factors

One topic that I neglected to mention when discussing factors earlier in this chapter is that there are actually two different types of factor in R, unordered factors and ordered factors. An unordered factor corresponds to a nominal scale variable, and all of the factors we’ve discussed so far in this book have been unordered. However, it’s sometimes useful to explicitly tell R that your variable is ordinal scale, and if so you need to declare it to be an ordered factor. For instance, suppose we have a variable consisting of Likert scale data:

likert <- c(1, 7, 3, 4, 4, 4, 2, 6, 5, 5)

We can declare this to be an ordered factor by using the factor function, and setting ordered = TRUE. To illustrate how this works, let’s create an ordered factor called likert.ordinal and have a look at it:

likert.ordinal <- factor(
  x = likert,                 # the raw data
  levels = c(7,6,5,4,3,2,1),  # strongest agreement is 1, weakest is 7
  ordered = TRUE              # and it’s ordered
)
print(likert.ordinal)
##  [1] 1 7 3 4 4 4 2 6 5 5
## Levels: 7 < 6 < 5 < 4 < 3 < 2 < 1

Notice that when we print out the ordered factor, R explicitly tells us what order the levels come in. Because I wanted to order my levels in terms of increasing strength of endorsement, and because a response of 1 corresponded to the strongest agreement and 7 to the strongest disagreement, it was important that I tell R to encode 7 as the lowest value and 1 as the largest. Always check this when creating an ordered factor: it’s very easy to accidentally encode your data “upside down” if you’re not paying attention. In any case, note that we can (and should) attach meaningful names to these factor levels by using the levels function, like this:

levels( likert.ordinal ) <- c(
  "strong.disagree", "disagree", "weak.disagree",
  "neutral", "weak.agree", "agree", "strong.agree"
)
print( likert.ordinal )
##  [1] strong.agree    strong.disagree weak.agree      neutral
##  [5] neutral         neutral         agree           disagree
##  [9] weak.disagree   weak.disagree
## 7 Levels: strong.disagree < disagree < weak.disagree < ... < strong.agree

One nice thing about using ordered factors is that there are analyses for which R automatically treats ordered factors differently from unordered factors, and generally in a way that is more appropriate for ordinal data. However, I won’t go into details here. Like so many things in this chapter, my main goal here is to make you aware that R has this capability built into it; so if you ever need to start thinking about ordinal scale variables in more detail, you have at least some idea where to start looking!
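One difference you can see straight away, without running any analyses, is that comparisons like “greater than” are meaningful for ordered factors. The sketch below rebuilds likert.ordinal in a single step, using the labels argument of the factor function instead of a separate call to levels; that shortcut is my own choice and not how the text above does it.

```r
likert <- c(1, 7, 3, 4, 4, 4, 2, 6, 5, 5)
likert.ordinal <- factor(
  x = likert,
  levels = c(7,6,5,4,3,2,1),     # from weakest to strongest endorsement
  labels = c(
    "strong.disagree", "disagree", "weak.disagree",
    "neutral", "weak.agree", "agree", "strong.agree"
  ),
  ordered = TRUE
)

is.ordered( likert.ordinal )         # TRUE
likert.ordinal > "neutral"           # TRUE for responses on the "agree" side
sum( likert.ordinal > "neutral" )    # 3 responses are above neutral
max( likert.ordinal )                # the strongest endorsement in the data
```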

99.11 Dates and times

Times and dates are very annoying types of data. To a first approximation we can say that there are 365 days in a year, 24 hours in a day, 60 minutes in an hour and 60 seconds in a minute, but that’s not quite correct. The length of the solar day is not exactly 24 hours, and the length of the solar year is not exactly 365 days, so we have a complicated system of corrections that have to be made to keep the time and date system working. On top of that, the measurement of time is usually taken relative to a local time zone, and most (but not all) time zones have both a standard time and a daylight savings time, though the date at which the switch occurs is not at all standardised. So, as a form of data, times and dates are just awful to work with. Unfortunately, they’re also important. Sometimes it’s possible to avoid having to use any complicated system for dealing with times and dates. Often you just want to know what year something happened in, so you can just use numeric data: in quite a lot of situations, something as simple as declaring that this.year is 2018 works just fine. If you can get away with that for your application, this is probably the best thing to do. However, sometimes you really do need to know the actual date. Or, even worse, the actual time. In this section, I’ll very briefly introduce you to the basics of how R deals with date and time data. As with a lot of things in this chapter, I won’t go into details: the goal here is to show you the basics of what you need to do if you ever encounter this kind of data in real life. And then we’ll all agree never to speak of it again.

To start with, let’s talk about the date. As it happens, modern operating systems are very good at keeping track of the time and date, and can even handle all those annoying timezone issues and daylight savings pretty well. So R takes the quite sensible view that it can just ask the operating system what the date is. We can pull the date using the Sys.Date function:

today <- Sys.Date()  # ask the operating system for the date
print(today)         # display the date
## [1] "2018-07-24"

Okay, that seems straightforward. But it does rather look like today is just a character string, doesn’t it? That would be a problem, because dates really do have a quasi-numeric character to them, and it would be nice to be able to do basic addition and subtraction with them. Well, fear not. If you type in class(today), R will tell you that the today variable is a "Date" object. What this means is that, hidden underneath this text string, R has a numeric representation.7 As a result, you can in fact add and subtract days. For instance, if we add 1 to today, R will print out the date for tomorrow:

today + 1
## [1] "2018-07-25"

Let’s see what happens when we add 365 days:

today + 365
## [1] "2019-07-24"
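To make the hidden numeric representation a little more concrete, here is a small sketch. It uses a fixed date (built with as.Date, which converts a string into a date) rather than Sys.Date(), so that the results don’t depend on when you run it. The variable names a.date and later.date are just ones I’ve made up for illustration.

```r
a.date <- as.Date("2018-07-24")   # a fixed date, so the output is reproducible
class(a.date)                     # the class that R attaches to the variable
## [1] "Date"
as.numeric(a.date)                # days elapsed since January 1, 1970
## [1] 17736
later.date <- a.date + 365        # adding a number shifts the date by that many days
later.date - a.date               # subtracting two dates gives a time difference
## Time difference of 365 days
```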

R provides a number of functions for working with dates, but I don’t want to talk about them in any detail. I will, however, make passing mention of the weekdays function, which will tell you what day of the week a particular date corresponds to, which is extremely convenient in some situations:

weekdays(today)
## [1] "Tuesday"

I’ll also point out that you can use the as.Date function to coerce various different kinds of data into dates. If the data happen to be strings formatted exactly according to the international standard notation (i.e., yyyy-mm-dd) then the conversion is straightforward, because that’s the format that R expects to see by default. You can convert dates from other formats too, but it’s slightly trickier, and something I won’t go into here.
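For what it’s worth, a minimal sketch of both cases might look like the following. The strings and the variable names d1 and d2 are made up for illustration, and the format argument shown is only one of many possible format specifications; see the as.Date help documentation for the rest.

```r
d1 <- as.Date("2018-07-24")                       # default yyyy-mm-dd format
d2 <- as.Date("24/07/2018", format = "%d/%m/%Y")  # day/month/year needs a format
d1 == d2                                          # both refer to the same day
## [1] TRUE
```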

What about times? Well, times are even more annoying than dates, so much so that I don’t intend to talk about them at all right now, other than to point you in the direction of some vaguely useful things. R itself does provide you with tools for handling time data, and in fact there are two separate classes of data that are used to represent times, known by the odd names POSIXct and POSIXlt. You can use these to work with times if you want to, but for most applications you would probably be better off downloading the chron package, which provides some much more user friendly tools for working with times and dates.
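If you’re curious about what the two built-in classes look like, here is a very rough sketch using only base R. The particular time, the choice of the UTC time zone and the variable names t.ct and t.lt are all my own assumptions for the sake of the example.

```r
t.ct <- as.POSIXct("2018-07-24 14:30:00", tz = "UTC")  # stored as a single number
as.numeric(t.ct)           # seconds elapsed since the start of 1970 (UTC)
t.lt <- as.POSIXlt(t.ct)   # the same instant, stored as a list of components
t.lt$hour                  # pull out the hour component
## [1] 14
t.lt$min                   # pull out the minute component
## [1] 30
t.ct + 60                  # adding a number shifts the time by that many seconds
## [1] "2018-07-24 14:31:00 UTC"
```

Roughly speaking, POSIXct is the more convenient form for arithmetic, whereas POSIXlt is handy when you want to extract the individual pieces of a time.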

99.12 Formulas

The last kind of variable that I want to introduce before finally being able to start talking about something a little more practical is the formula. Formulas were originally introduced into R as a convenient way to specify a particular type of statistical model (linear regression) but they’re such handy things that they’ve spread. Formulas are now used in a lot of different contexts, so it makes sense to introduce them early.

Stated simply, a formula object is a variable, but it’s a special type of variable that specifies a relationship between other variables. A formula is specified using the “tilde operator” ~. A very simple example of a formula is shown below:8

formula1 <- out ~ pred
formula1
## out ~ pred

The precise meaning of this formula depends on exactly what you want to do with it, but in broad terms it means “the out (outcome) variable, analysed in terms of the pred (predictor) variable”. That said, although the simplest and most common form of a formula uses the “one variable on the left, one variable on the right” format, there are others. For instance, the following examples are all reasonably common:

formula2 <-  out ~ pred1 + pred2  # more than one variable on the right
formula3 <-  out ~ pred1 * pred2  # different relationship between predictors
formula4 <-  ~ var1 + var2        # a 'one-sided' formula

and there are many more variants besides. Formulas are pretty flexible things, and so different functions will make use of different formats, depending on what the function is intended to do. At this point you don’t need to know much about formulas - I only mention them now so you don’t get surprised by them later!
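If you want to poke at a formula to see what R has stored, a minimal sketch is shown below. It assumes nothing beyond base R; the variable name f is mine, and the names out, pred1 and pred2 don’t need to refer to anything that exists in the workspace.

```r
f <- out ~ pred1 * pred2   # out, pred1 and pred2 need not exist yet
class(f)                   # R records that this variable is a formula
## [1] "formula"
all.vars(f)                # the variable names mentioned in the formula
## [1] "out"   "pred1" "pred2"
all.vars(~ var1 + var2)    # a one-sided formula has no outcome variable
## [1] "var1" "var2"
```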

99.13 Generic functions

There’s one other important thing that I omitted when I discussed functions earlier on, and that’s the concept of a generic function. The two most notable examples that you’ll see in the next few chapters are summary and plot, although you’ve already seen an example of one working behind the scenes, and that’s the print function. The thing that makes generics different from the other functions is that their behaviour changes, often quite dramatically, depending on the class of the input you give them. The easiest way to explain the concept is with an example. With that in mind, let’s take a closer look at what the print function actually does. I’ll do this by creating a formula, and printing it out in a few different ways. First, let’s stick with what we know:

my.formula <- blah ~ blah.blah  # create a variable of class "formula"
print( my.formula )             # print it the normal way
## blah ~ blah.blah

So far, there’s nothing very surprising here. But there’s actually a lot going on behind the scenes. When I type print(my.formula), what actually happens is that the print function checks the class of the my.formula variable. When the function discovers that the variable it’s been given is a formula, it goes looking for a function called print.formula, and then delegates the whole business of printing out the variable to the print.formula function.9 For what it’s worth, a “dedicated” function like print.formula that exists only to be a special case of a generic function like print is called a method, and the process in which the generic function passes off all the hard work onto a method is called method dispatch. You won’t need to understand the details at all for this book, but you do need to know the gist of it, if only because a lot of the functions we’ll use are actually generics.

Just to give you a sense of this, let’s do something silly and try to bypass the normal workings of the print function:

print.default( my.formula ) # do something silly by using the wrong method
## blah ~ blah.blah
## attr(,"class")
## [1] "formula"
## attr(,".Environment")
## <environment: R_GlobalEnv>

Hm. You can kind of see that it is trying to print out the same formula, but there’s a bunch of ugly low-level details that have also turned up on screen. This is because the print.default method doesn’t know anything about formulas, and doesn’t know that it’s supposed to be hiding the obnoxious internal gibberish that R produces sometimes.
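To see method dispatch from the other side, here is a minimal sketch in which we write a method of our own. The class name "greeting" and the print.greeting function are entirely made up for this example; nothing in R defines them by default.

```r
# a hypothetical method: print() will use it for any object of class "greeting"
print.greeting <- function(x, ...) {
  cat("Hello, ", unclass(x), "!\n", sep = "")
  invisible(x)   # print methods conventionally return their input invisibly
}

g <- structure("world", class = "greeting")  # a string tagged with our class
print(g)            # the generic finds print.greeting and hands the job over
## Hello, world!
print(unclass(g))   # with the class stripped off, R prints a plain string
## [1] "world"
```

The only thing linking print.greeting to the print function is its name: the generic simply looks for a function called print followed by a dot and the class of its input.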

At this stage, this is about as much as we need to know about generic functions and their methods. In fact, you can get through the entire book without learning any more about them than this, so it’s probably a good idea to end this discussion here.

99.14 Summary

summary, summary, summary


  1. Taking the usual caveats about IQ measurement as given, of course.

  2. Or, more precisely, we don’t know how to measure it. Arguably, a rock has zero intelligence. But it doesn’t make sense to say that the IQ of a rock is 0 in the same way that we can say that the average human has an IQ of 100. And without knowing what the IQ value is that corresponds to a literal absence of any capacity to think, reason or learn, then we really can’t multiply or divide IQ scores and expect a meaningful answer.

  3. This is an example of coercing a variable from one class to another. I’ll talk about coercion in more detail later.

  4. Some users might wonder why R even allows the == operator for factors. The reason is that sometimes you really do have different factors that have the same levels. For instance, if I was analysing data associated with football games, I might have a factor called home.team, and another factor called winning.team. In that situation I really should be able to ask if home.team == winning.team.

  5. Note that I didn’t do this in my earlier example when loading the .Rdata file. There’s a reason for this. The idea behind an .Rdata file is that it stores a whole workspace. So, if you had the ability to look inside the file yourself you’d see that the data file keeps track of all the variables and their names. So when you load the file, R restores all those original names. CSV files are treated differently: as far as R is concerned, the CSV only stores one variable, but that variable is a big table. So when you import that table into the workspace, R expects you to give it a name.

  6. The reason for this relates to how matrices are implemented. The original matrix M is treated as a two-dimensional object, containing two rows and three columns. However, whenever you pull out a single row or a single column, the result is considered to be a vector, which has a length but doesn’t have dimensions. Unless you explicitly coerce the vector into a matrix, R doesn’t really distinguish between row vectors and column vectors. This has implications for how matrix algebra is implemented in R (which I’ll admit I initially found odd). When multiplying a matrix by a vector using the %*% operator, R will attempt to interpret the vector as either a row vector or column vector, depending on whichever one makes the multiplication work. That is, suppose \(\mathbf{M}\) is a \(2\times 3\) matrix, and \(v\) is a \(1\times 3\) row vector. Mathematically the matrix multiplication \(\mathbf{M}v\) doesn’t make sense since the dimensions don’t conform, but you can multiply by the corresponding column vector, \(\mathbf{M}v^T\). So, if I set v <- M[2,] and then try to calculate M %*% v, which you’d think would fail, it actually works because R treats the one dimensional array as if it were a column vector for the purposes of matrix multiplication. Note that if both objects are one dimensional arrays/vectors, this leads to ambiguity since \(vv^T\) (inner product) and \(v^Tv\) (outer product) yield different answers. In this situation, the %*% operator returns the inner product, not the outer product. To understand all the details, check out the help documentation.

  7. Date objects are coded internally as the number of days that have passed since January 1, 1970.

  8. Note that, when I write out the formula, R doesn’t check to see if the out and pred variables actually exist: it’s only later on when you try to use the formula for something that this happens.

  9. For readers with a programming background: R has three separate systems for object oriented programming. The earliest system was S3, and it was very informal: generic functions as described here are part of the S3 system. Later on S4 was introduced as a more formal way of doing things. I confess I never learned S4 because it looked tedious. More recently R introduced Reference Classes, which look kind of neat and I should probably learn about them. Discussed here if you’re interested.