He divided the universe in forty categories or classes, these being further subdivided into differences, which was then subdivided into species. He assigned to each class a monosyllable of two letters; to each difference, a consonant; to each species, a vowel. For example:
de, which means an element; deb, the first of the elements, fire; deba, a part of the element fire, a flame. … The words of the analytical language created by John Wilkins are not mere arbitrary symbols; each letter in them has a meaning, like those from the Holy Writ had for the Cabbalists.
–Jorge Luis Borges, The Analytical Language of John Wilkins
In the last chapter I talked a lot about variables, how they’re assigned, and some of the things you can do with them, but there are a lot of additional complexities. That’s not a surprise, of course, and some of those issues are worth drawing your attention to now. So the goal of this section is to cover a few extra topics: a bunch of things that I want to mention briefly but that don’t really fit in anywhere else, and which are only loosely connected to one another.
The first thing I want to mention is the set of “special” values that you might see R produce. Most likely you’ll see them in situations where you were expecting a number, but there are quite a few other ways you can encounter them. These values are Inf, NaN, NA and NULL. They can crop up in various places, so it’s important to understand what they mean.
Infinity (Inf). The easiest of the special values to explain is Inf, since it corresponds to a value that is infinitely large. You can also have -Inf. The easiest way to get Inf is to divide a positive number by 0:
1/0
## [1] Inf
In most real world data analysis situations, if you’re ending up with infinite numbers in your data, then something has gone awry. Hopefully you’ll never have to see them.
Not a Number (NaN). The special value of NaN is short for “not a number”, and it’s basically a reserved keyword that means “there isn’t a mathematically defined number for this”. If you can remember your high school maths, you may recall that it is conventional to say that 0/0 doesn’t have a proper answer: mathematicians would say that 0/0 is undefined. R says that it’s not a number:
0/0
## [1] NaN
Nevertheless, it’s still treated as a “numeric” value. To oversimplify, NaN corresponds to cases where you asked a proper numerical question that genuinely has no meaningful answer.
Not available (NA). NA indicates that the value that is “supposed” to be stored here is missing. To understand what this means, it helps to recognise that the NA value is something that you’re most likely to see when analysing data from real world experiments. Sometimes you get equipment failures, or you lose some of the data, or whatever. The point is that some of the information that you were “expecting” to get from your study is just plain missing. Note the difference between NA and NaN. For NaN, we really do know what’s supposed to be stored; it’s just that it happens to correspond to something like 0/0 that doesn’t make any sense at all. In contrast, NA indicates that we actually don’t know what was supposed to be there. The information is missing.
No value (NULL). The NULL value takes this “absence” concept even further. It asserts that the variable genuinely has no value whatsoever, or does not even exist. This is quite different to both NaN and NA. For NaN we actually know what the value is, because it’s something insane like 0/0. For NA, we believe that there is supposed to be a value “out there” in some sense, but a dog ate our homework and so we don’t quite know what it is. But for NULL we strongly believe that there is no value at all.
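If you ever need to check for these special values in your own data, base R has a small family of testing functions. Here is a minimal sketch; the vector `special` is made up purely for illustration.

```r
# A made-up vector holding an ordinary number and each special numeric value
special <- c(1, NA, NaN, Inf, -Inf)

is.na(special)        # TRUE for NA, and also for NaN
is.nan(special)       # TRUE only for NaN
is.infinite(special)  # TRUE for Inf and for -Inf

# NULL is different in kind: it is not an element of anything, and has length zero
is.null(NULL)
length(NULL)
```

Notice that is.na treats NaN as missing too, so if the distinction between the two matters for your analysis, is.nan is the one to reach for.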
As we’ve seen, R allows you to store different kinds of data. In particular, the variables we’ve defined so far have either been character data (text), numeric data, or logical data. It’s important that we remember what kind of information each variable stores (and even more important that R remembers) since different kinds of variables allow you to do different things to them. For instance, if your variables have numerical information in them, then it’s okay to multiply them together. But if they contain character data, multiplication makes no sense whatsoever, and R will complain if you try to do it:
x <- 5 # x is numeric
y <- 4 # y is numeric
x * y
## [1] 20
x <- "apples" # x is character
y <- "oranges" # y is character
x * y
## Error in x * y: non-numeric argument to binary operator
Even R is smart enough to know you can’t multiply "apples"
by "oranges"
. It knows this because the quote marks are indicators that the variable is supposed to be treated as text, not as a number.
This is quite useful, but notice that it means that R makes a big distinction between 5 and "5". Without quote marks, R treats 5 as the number five, and will allow you to do calculations with it. With the quote marks, R treats "5" as the textual character five, and doesn’t recognise it as a number any more than it recognises "p" or "five" as numbers. As a consequence, there’s a big difference between typing x <- 5 and typing x <- "5". In the former, we’re storing the number 5; in the latter, we’re storing the character "5". Thus, if we try to do multiplication with the character versions, R gets stroppy:
x <- "5" # x is character
y <- "4" # y is character
x * y
## Error in x * y: non-numeric argument to binary operator
Okay, let’s suppose that I’ve forgotten what kind of data I stored in the variable x
(which happens depressingly often). R provides a function that will let us find out. Actually, it provides several different functions that are used for different purposes. For now I only want to discuss the class
function. The class of a variable is a “high level” classification, and it captures psychologically (or statistically) meaningful distinctions. For instance "2011-09-12"
and "my birthday"
are both text strings, but there’s an important difference between the two: one of them is a date. So it would be nice if we could get R to recognise that "2011-09-12"
is a date, and allow us to do things like add or subtract from it. The class of a variable is what R uses to keep track of things like that. Because the class of a variable is critical for determining what R can or can’t do with it, the class
function is very handy.
Later on, I’ll talk a bit about how you can convince R to coerce a variable to change from one class to another. That’s a useful skill for real world data analysis, but it’s not something that we need right now. In the meantime, the following examples illustrate the use of the class
function:
x <- "hello world"
class(x)
## [1] "character"
x <- TRUE
class(x)
## [1] "logical"
x <- 100
class(x)
## [1] "numeric"
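To make the date example a little more concrete, here is a small sketch using as.Date from base R, which is one way of telling R that a string should be treated as a date. The variable names are my own, and this is only meant to illustrate how the class changes what R will let you do.

```r
txt <- "2011-09-12"   # as far as R is concerned, this is just text
class(txt)

d <- as.Date(txt)     # ask R to treat it as a date instead
class(d)

later <- d + 30       # arithmetic now makes sense: the date 30 days later
format(later)
```

Trying txt + 30 with the original text string would produce the same "non-numeric argument" error we saw earlier, because R decides what is allowed based on the class.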
Sometimes you want to change the variable class. This can happen for all sorts of reasons. Sometimes when you import data from files, it can come to you in the wrong format: numbers sometimes get imported as text, dates usually get imported as text, and many other possibilities besides. Regardless of how you’ve ended up in this situation, there’s a very good chance that sometimes you’ll want to convert a variable from one class into another one. Or, to use the correct term, you want to coerce the variable from one class into another. Coercion is a little tricky, and so I’ll only discuss the very basics here, using a few simple examples. However, you’ll see more examples of coercion as you go through this tutorial.
Firstly, let’s suppose we have a variable x
that is supposed to be representing a number, but the data file that you’ve been given has encoded it as text (this happens quite a lot). Let’s imagine that the variable is something like this:
x <- c("15","19") # the variable
class(x) # what class is it?
## [1] "character"
Obviously, if I want to do mathematical calculations using x
in its current state, R is going to get very annoyed at me. It thinks that x
is text, so it’s not going to allow me to try to do mathematics using it! Obviously, we need to coerce x
from “character” to “numeric”. We can do that in a straightforward way by using the as.numeric
function:
x <- as.numeric(x) # coerce the variable
class(x) # what class is it?
## [1] "numeric"
x + 1 # hey, addition works!
## [1] 16 20
Not surprisingly, we can also convert it back again if we need to. The function that we use to do this is the as.character
function:
x <- as.character(x) # coerce back to text
class(x) # check the class
## [1] "character"
However, there are some fairly obvious limitations: you can’t coerce the string "hello world" into a number because, well, there isn’t a number that corresponds to it. If you try, R metaphorically shrugs its shoulders and declares it to be missing:
as.numeric( "hello world" ) # this isn’t going to work.
## Warning: NAs introduced by coercion
## [1] NA
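In practice, the NA values that coercion introduces are a handy way of finding the entries that didn’t convert. The sketch below assumes a made-up vector `raw` of imported text, and uses suppressWarnings only to keep the output tidy.

```r
raw <- c("15", "19", "n/a", "22")          # hypothetical text from a data file

num <- suppressWarnings(as.numeric(raw))   # "n/a" cannot be converted, so it becomes NA
num

failed <- is.na(num) & !is.na(raw)         # entries that were present but failed to convert
raw[failed]                                # the offending strings
```

Checking `raw[failed]` before moving on is a cheap way to catch things like stray text or typos in a column that was supposed to be numeric.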
That gives you a feel for how to change between numeric and character data. What about logical data? To cover this briefly, coercing text to logical data is pretty intuitive: you use the as.logical function, and the character strings "T", "TRUE", "True" and "true" all convert to the logical value of TRUE. Similarly "F", "FALSE", "False", and "false" all become FALSE. All other strings convert to NA. When you go back the other way using as.character, TRUE converts to "TRUE" and FALSE converts to "FALSE". Converting numbers to logicals – again using as.logical – is straightforward. Following the standard convention in the study of Boolean logic, the number 0 converts to FALSE. Everything else is TRUE. Going back using as.numeric, FALSE converts to 0 and TRUE converts to 1.
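These rules are easier to remember once you’ve seen them in action. The following is a short sketch with made-up inputs; the variable names are just for illustration.

```r
from_text <- as.logical(c("T", "true", "False", "yes"))
from_text                     # "yes" is not a recognised string, so it becomes NA

from_num <- as.logical(c(0, 1, -3.5))
from_num                      # only 0 becomes FALSE

as.numeric(c(TRUE, FALSE))    # back to 1 and 0
as.character(c(TRUE, FALSE))  # back to "TRUE" and "FALSE"

n_true <- sum(c(TRUE, FALSE, TRUE))  # sum() coerces logicals to 0/1, so this counts the TRUEs
n_true
```

That last line is worth remembering: because TRUE counts as 1, summing a logical vector is a quick way to count how many cases satisfy some condition.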
One thing that is sometimes a little unsatisfying about the way that R prints out a vector is that the elements come out unlabelled. Here’s what I mean. Suppose I’ve got data reporting the quarterly profits for some company. If I just create a no-frills vector, I have to rely on memory to know which element corresponds to which event. That is:
profit <- c( 3.1, 0.1, -1.4, 1.1 )
profit
## [1] 3.1 0.1 -1.4 1.1
You can probably guess that the first element corresponds to the first quarter, the second element to the second quarter, and so on, but that’s only because I’ve told you the back story and because this happens to be a very simple example. In general, it can be quite difficult. This is where it can be helpful to assign names
to each of the elements. Here’s one way to do it:
names(profit) <- c("Q1","Q2","Q3","Q4")
profit
## Q1 Q2 Q3 Q4
## 3.1 0.1 -1.4 1.1
This is a slightly odd looking command, admittedly, but it’s not too difficult to follow. All we’re doing is assigning a vector of labels (character strings) to names(profit)
. You can always delete the names again by using this command,
names(profit) <- NULL
profit
## [1] 3.1 0.1 -1.4 1.1
It’s also worth noting that you don’t have to do this as a two stage process. You can get the same result with this command:
profit <- c( "Q1" = 3.1, "Q2" = 0.1, "Q3" = -1.4, "Q4" = 1.1 )
profit
## Q1 Q2 Q3 Q4
## 3.1 0.1 -1.4 1.1
The important things to notice are that (a) this does make things much easier to read, but (b) the names at the top aren’t the “real” data. The value of profit[1]
is still 3.1; all I’ve done is added a name to profit[1]
as well. Nevertheless, names aren’t purely cosmetic, since R allows you to pull out particular elements of the vector by referring to their names:
profit["Q1"]
## Q1
## 3.1
And if I ever need to pull out the names themselves, then I just type
names(profit)
## [1] "Q1" "Q2" "Q3" "Q4"
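Names can be combined with the other things we’ve seen so far. As a rough sketch, using the same profit figures as above:

```r
profit <- c(Q1 = 3.1, Q2 = 0.1, Q3 = -1.4, Q4 = 1.1)

first_last <- profit[c("Q1", "Q4")]    # pull out several elements by name
first_last

loss_quarters <- names(profit)[profit < 0]   # which quarters made a loss?
loss_quarters

bare <- unname(profit)   # a copy with the names stripped; profit itself is unchanged
bare
```

The unname function is from base R; it’s a convenient alternative to assigning NULL when you want the bare numbers without altering the original variable.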
Let’s start with a simple example. When my children were little I naturally spent a lot of time watching TV shows like In the Night Garden. In the nightgarden.Rdata
file, I’ve transcribed a short section of the dialogue from the show. The file contains two vectors, speaker
and utterance
, and when we take a look at the data,
load("./data/nightgarden.Rdata")
print(speaker)
## [1] "upsy-daisy" "upsy-daisy" "upsy-daisy" "upsy-daisy" "tombliboo"
## [6] "tombliboo" "makka-pakka" "makka-pakka" "makka-pakka" "makka-pakka"
print(utterance)
## [1] "pip" "pip" "onk" "onk" "ee" "oo" "pip" "pip" "onk" "onk"
it becomes very clear what happened to my sanity. Suppose that what I want to do is pull out only those utterances that were made by Makka-Pakka. To that end, I could first use the equality operator ==
to have R tell me which cases correspond to Makka-Pakka speaking:
speaker == "makka-pakka"
## [1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
So this allows a very neat way to subset the utterance
vector. If I want to select only those utterances for which Makka-Pakka was the speaker, I can do so as follows:
utterance[ speaker == "makka-pakka" ]
## [1] "pip" "pip" "onk" "onk"
%in%
A second useful trick for extracting a subset of a vector is to use the %in%
operator. It’s actually very similar to the ==
operator, except that you can supply a collection of acceptable values. For instance, suppose I wanted to find those cases when the utterance
is either “pip”
or “oo”
. We can do that like this:
utterance %in% c("pip","oo")
## [1] TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE
This in turn allows us to find the speaker
for all such cases,
speaker[ utterance %in% c("pip","oo") ]
## [1] "upsy-daisy" "upsy-daisy" "tombliboo" "makka-pakka" "makka-pakka"
Before moving onto data frames, there are a couple of other tricks worth mentioning. The first of these is to use negative values as indices. Recall from earlier that we can use a vector of numbers to extract a set of elements that we would like to keep. For instance, suppose I want to keep only elements 2 and 3 from utterance. I could do so like this:
utterance[2:3]
## [1] "pip" "onk"
But suppose, on the other hand, that I have discovered that observations 2 and 3 are untrustworthy, and I want to keep everything except those two elements. To that end, R lets you use “negative indices” to remove specific values, like so:
utterance [ -(2:3) ]
## [1] "pip" "onk" "ee" "oo" "pip" "pip" "onk" "onk"
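Logical and negative indexing can be mixed and matched. The sketch below types the In the Night Garden data in by hand so that it runs on its own, and uses the base R function which to turn a logical vector into positions; the variable `idx` is my own invention.

```r
# The same data as in the nightgarden.Rdata file, entered manually
speaker <- c("upsy-daisy", "upsy-daisy", "upsy-daisy", "upsy-daisy", "tombliboo",
             "tombliboo", "makka-pakka", "makka-pakka", "makka-pakka", "makka-pakka")
utterance <- c("pip", "pip", "onk", "onk", "ee", "oo", "pip", "pip", "onk", "onk")

not_mp <- utterance[speaker != "makka-pakka"]   # everything NOT said by Makka-Pakka
not_mp

idx <- which(speaker == "tombliboo")   # positions of the Tombliboo's lines
idx

no_tombliboo <- utterance[-idx]        # drop those positions using negative indices
no_tombliboo
```

One caveat: if which finds no matches at all, `utterance[-idx]` returns an empty vector rather than the whole thing, so the purely logical version (using != or !) is usually the safer habit.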
Okay, it’s time to start introducing some of the data types that are somewhat more specific to statistics. As every single research methodology class is at pains to point out, when we assign numbers to possible outcomes, these numbers can mean quite different things depending on what kind of quantity in the world we are attempting to measure. In particular, psychologists commonly make the distinction between nominal, ordinal, interval and ratio scale data. How do we capture this distinction in R? Currently, we only seem to have a single numeric data type. That’s probably not going to be enough, is it?
A little thought suggests that the numeric variable class in R is perfectly suited for capturing ratio scale data. For instance, if I were to measure response time (RT) for nine different events, I could store the data in R like this:
RT <- c(1342, 1401, 1590, 1391, 1554, 1422, 1612, 1230, 998)
where the data here are measured in milliseconds, as is conventional in the psychological literature. It’s perfectly sensible to talk about “twice the response time”, 2 * RT
, or the “response time plus 1 second”, RT + 1000
, and so both of the following are perfectly reasonable things for R to do:
2 * RT
## [1] 2684 2802 3180 2782 3108 2844 3224 2460 1996
RT + 1000
## [1] 2342 2401 2590 2391 2554 2422 2612 2230 1998
And to a lesser extent, the “numeric” class is okay for interval scale data, as long as we remember that multiplication and division aren’t terribly interesting for these sorts of variables. That is, if my IQ score is 110 and yours is 120, it’s perfectly okay to say that you’re 10 IQ points smarter than me,1 but it’s not okay to say that I’m only 92% as smart as you are, because intelligence doesn’t have a “natural” zero.2 Similarly, we might be willing to tolerate the use of numeric variables to represent ordinal scale variables, such as those that you typically get when you ask people to rank order items (as we do in Australian elections), though as we will see R actually has a built-in tool for representing ordinal data (ordered factors, discussed later). However, when it comes to nominal scale data, it becomes completely unacceptable, because almost all of the “usual” rules for what you’re allowed to do with numbers don’t apply to nominal scale data. It is for this reason that R has factors.
Suppose I was doing a study in which people could belong to one of two different treatment conditions. Each group of people was asked to complete the same task, but each group received different instructions. Not surprisingly, I might want to have a variable that keeps track of what group people were in. So I could type in something like this:
group <- c(1,1,1,2,2,2,1,2,1)
so that group[i]
contains the group membership of the i-th person in my study. Clearly, this is numeric data, but equally obviously this is a nominal scale variable. There’s no sense in which “group 1” plus “group 2” equals “group 3”, but nevertheless if I try to do that, R won’t stop me because it doesn’t know any better:
group + 2
## [1] 3 3 3 4 4 4 3 4 3
This doesn’t make a lot of sense, but R is too stupid to know any better: it thinks that 1
is an ordinary number in this context, so it sees no problem in calculating 1 + 2
. But since we’re not that stupid, we’d like to stop R from doing this. We can do so by instructing R to treat group as a factor. This is easy to do using the as.factor
function:3
group <- as.factor(group)
group
## [1] 1 1 1 2 2 2 1 2 1
## Levels: 1 2
It looks more or less the same as before (though it’s not immediately obvious what all that Levels
rubbish is about), but if we ask R to tell us what the class
of the group variable is now, it’s clear that it has done what we asked:
class(group)
## [1] "factor"
Neat. Better yet, now that I’ve converted group
to a factor, look what happens when I try to add 2
to it:
group + 2
## Warning in Ops.factor(group, 2): '+' not meaningful for factors
## [1] NA NA NA NA NA NA NA NA NA
This time even R is smart enough to know that I’m being an idiot, so it tells me off and then produces a vector of missing values.
I have a confession to make. My memory is not infinite in capacity, and it seems to be getting worse as I get older. So it kind of annoys me when I get data sets where there’s a nominal scale variable called gender which is supposed to have two levels corresponding to males and females. Even setting aside the question of whether gender should be considered a binary variable in the first place (seems a little unlikely no matter how you define the concept of “gender”), it drives me crazy when I load the file, print out the variable, and get something like this:
load("./data/badgender.Rdata")
gender
## [1] 1 1 1 1 1 2 2 2 2
## Levels: 1 2
Okaaaay. That’s not helpful at all, and it makes me very sad. Which number corresponds to the males and which one corresponds to the females? Wouldn’t it be nice if R could actually keep track of this? It’s way too hard to remember which number corresponds to which gender. And besides, the problem that this causes is much more serious than a single sad nerd… because R has no way of knowing that the 1s in the group factor are a very different kind of thing to the 1s in the gender factor. So if I try to ask which elements of the group variable are equal to the corresponding elements in gender, R thinks this is totally kosher, and gives me this:
group == gender
## [1] TRUE TRUE TRUE FALSE FALSE TRUE FALSE TRUE FALSE
Well, that’s… especially stupid.4 The problem here is that R is very literal minded. Even though you’ve declared both group and gender to be factors, it still assumes that a 1 is a 1 no matter which variable it appears in.
To fix both of these problems (my memory problem, and R’s infuriating literal interpretations), what we need to do is assign meaningful labels to the different levels of each factor. We can do that like this:
levels(group) <- c("control", "treatment")
levels(gender) <- c("male", "female")
Once we’ve done this, R produces output that seems a lot more readable…
group
## [1] control control control treatment treatment treatment control
## [8] treatment control
## Levels: control treatment
gender
## [1] male male male male male female female female female
## Levels: male female
And when I try to make insane comparisons it tells me to feel bad about myself…
group==gender
## Error in Ops.factor(group, gender): level sets of factors are different
Factors are very useful things, and we’ll use them a lot in this book: they’re the main way to represent a nominal scale variable. And there are lots of nominal scale variables out there. I’ll talk more about factors later, but for now you know enough to be able to get started.
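As an aside, the two steps used above (converting to a factor, then relabelling the levels) can be done in a single command with the factor function from base R, via its levels and labels arguments. A minimal sketch, with `group_raw` standing in for the original numeric codes:

```r
group_raw <- c(1, 1, 1, 2, 2, 2, 1, 2, 1)   # the numeric codes from earlier

# levels says which codes exist; labels says what to call each one
group <- factor(group_raw, levels = c(1, 2), labels = c("control", "treatment"))
group

levels(group)           # the labels are now the levels
counts <- table(group)  # how many people ended up in each condition
counts
```

Spelling out levels and labels together makes the mapping from code to label explicit, which is exactly the thing my memory can’t be trusted with.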
One quite commonly used data format is the humble “comma separated value” file, also called a CSV file, and usually bearing the file extension .csv. CSV files are just plain old-fashioned text files, and what they store is basically just a table of data. This is illustrated in the screenshot below, which shows a file called booksales.csv that I’ve created. As you can see, each column corresponds to a variable, and each row represents the book sales data for one month. The first row doesn’t contain actual data though: it has the names of the variables.
If Rstudio were not available to you, the easiest way to open this file would be to use the read.csv function. This function is pretty flexible, and I’ll talk a lot more about its capabilities in a later chapter, but for now there are only two arguments to the function that I’ll mention:
file. This should be a character string that specifies a path to the file that needs to be loaded. You can use an absolute path or a relative path to do so.
header. This is a logical value indicating whether or not the first row of the file contains variable names. The default value is TRUE.
So, if the current working directory is the Rbook folder, and the booksales.csv file is in the Rbook\data folder, the command I would need to use to load it is:
books <- read.csv(file = "./data/booksales.csv")
There are two important points to notice here. Firstly, notice that I didn’t try to use the load
function, because that function is only meant to be used for .Rdata
files. If you try to use load
on other types of data, you get an error. Secondly, notice that when I imported the CSV file I assigned the result to a variable, which I imaginatively called books
.5 Let’s have a look at what we’ve got:
print(books)
## Month Days Sales Stock.Levels
## 1 January 31 0 high
## 2 February 28 100 high
## 3 March 31 200 low
## 4 April 30 50 out
## 5 May 31 0 out
## 6 June 30 0 high
## 7 July 31 0 high
## 8 August 31 0 high
## 9 September 30 0 high
## 10 October 31 0 high
## 11 November 30 0 high
## 12 December 31 0 high
Clearly, it’s worked, but the format of this output is a bit unfamiliar. We haven’t seen anything like this before. What you’re looking at is a data frame, which is a very important kind of variable in R, and one I’ll discuss later in this chapter. For now, let’s just be happy that we imported the data and that it looks about right.
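If you don’t have the booksales.csv file to hand, you can still try read.csv out. The sketch below writes a tiny, made-up CSV file to a temporary location (using the base R functions tempfile and writeLines) and then reads it back in; the contents and the variable names `tmp` and `small` are purely illustrative.

```r
# Create a small CSV file in a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("Month,Days,Sales",
             "January,31,0",
             "February,28,100"), tmp)

# Read it back in; header = TRUE is the default, but I've spelled it out
small <- read.csv(file = tmp, header = TRUE)
small

class(small)   # the result is a data frame
names(small)   # variable names taken from the first row of the file
str(small)     # a compact summary of what each column contains
```

The str function is a useful habit after any import: it shows at a glance whether numbers came in as numbers or as text.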
Yet again, it’s easier in Rstudio. In the environment panel in Rstudio you should see a button called “Import Dataset”. Click on that, and it will give you a couple of options:
Select the “From Text File (base)…” option, and it will open up a very familiar dialog box asking you to select a file: if you’re on a Mac, it’ll look like the usual Finder window that you use to choose a file; on Windows it looks like an Explorer window. I’m assuming that you’re familiar with your own computer, so you should have no problem finding the CSV file that you want to import! Find the one you want, then open it. When you do this, you’ll see a window that looks like the one below:
The import data set window is relatively straightforward to understand. In the top left corner, you need to type the name of the variable you want R to create. By default, that will be the same as the file name: our file is called booksales.csv, so Rstudio suggests the name booksales. If you’re happy with that, leave it alone. If not, type something else. Immediately below this are a few things that you can tweak to make sure that the data gets imported correctly:
Heading. Does the first row of the file contain raw data, or does it contain headings for each variable? The booksales.csv file has a header at the top, so I selected “yes”.
Separator. What character is used to separate different entries? In most CSV files this will be a comma (it is “comma separated” after all). But you can change this if your file is different.
Decimal. What character is used to specify the decimal point? In English speaking countries, this is almost always a period (i.e., .). That’s not universally true: many European countries use a comma. So you can change that if you need to.
Quote. What character is used to denote a block of text? That’s usually going to be a double quote mark. It is for the booksales.csv file, so that’s what I selected.
The nice thing about the Rstudio window is that it shows you the raw data file at the top of the window, and it shows you a preview of the data at the bottom. If the data at the bottom doesn’t look right, try changing some of the settings on the left hand side. Once you’re happy, click “Import”. When you do, you’ll see the read.csv command appear in the console, along with a View(booksales) command that will open up a visual display of the booksales data set.
It’s now time to go back and deal with the somewhat confusing thing that happened earlier when we tried to open up a CSV file. Apparently we succeeded in loading the data, but it came to us in a very odd looking format. At the time, I told you that this was a data frame. Now I’d better explain what that means.
In order to understand why R has created this funny thing called a data frame, it helps to try to see what problem it solves. So let’s go back to the little scenario that I used when introducing factors in the last section. In that section I recorded the group and gender for all 9 participants in my study, as well as their average RT on some task.
So there are these three variables in the workspace, gender
, group
and RT
. And it just so happens that all three are the same size (i.e., they’re all vectors with 9 elements). Aaaand it just so happens that gender[1]
corresponds to the gender of the first person, and RT[1]
is the response time of that very same person, etc. In other words, you and I both know that all these variables correspond to the same data set, and they are organised in exactly the same way.
However, R doesn’t know this! As far as it’s concerned, there’s no reason why the RT
variable has to be the same length as the gender
variable; and there’s no particular reason to think that RT[1]
has any special relationship to gender[1]
any more than it has a special relationship to gender[4]
. In other words, when we store everything in separate variables like this, R doesn’t know anything about the relationships between things. It doesn’t even really know that these variables actually refer to a proper data set. The data frame fixes this: if we store our variables inside a data frame, we’re telling R to treat these variables as a single, fairly coherent data set.
To see how they do this, let’s create one. So how do we create a data frame? One way we’ve already seen: if we import our data from a CSV file, R will store it as a data frame. A second way is to create it directly from some existing variables using the data.frame
function. All you have to do is type a list of variables that you want to include in the data frame. The output of a data.frame
command is, well, a data frame. So, if I want to store the variables from my experiment in a data frame called exp
I can do so like this:
exp <- data.frame(group, gender, RT)
exp
## group gender RT
## 1 control male 1342
## 2 control male 1401
## 3 control male 1590
## 4 treatment male 1391
## 5 treatment male 1554
## 6 treatment female 1422
## 7 control female 1612
## 8 treatment female 1230
## 9 control female 998
Note that exp
is a completely self-contained variable. Once you’ve created it, it no longer depends on the original variables from which it was constructed. That is, if we make changes to the original RT
variable, it will not lead to any changes to the RT
data stored in exp
. So, for the sake of my sanity, let’s remove all variables from the workspace except for the exp
data frame:
rm(list = objects()[objects() != "exp"]) # ... this is ugly!
who() # but what matters is that it worked...
## -- Name -- -- Class -- -- Size --
## exp data.frame 9 x 3
We can verify that the original RT
variable is gone, because if we try to print it out R gives an error message:
RT
## Error in eval(expr, envir, enclos): object 'RT' not found
$
At this point, our workspace contains only the one variable, a data frame called exp. But as we can see when we told R to print the variable out, this data frame contains three variables, each of which has nine observations. So how do we get this information out again? After all, there’s no point in storing information if you don’t use it, and there’s no way to use information if you can’t access it. So let’s talk a bit about how to pull information out of a data frame. As is always the case with R there are several ways to do this. The simplest is to use the $ operator to point to the variable you’re interested in, like this:
exp$RT
## [1] 1342 1401 1590 1391 1554 1422 1612 1230 998
There’s a lot more that can be said about data frames: they’re fairly complicated beasts, and the longer you use R the more important it is to make sure you really understand them. We’ll talk a lot more about them later.
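To give a flavour of what the $ operator lets you do, here is a small self-contained sketch. The data frame `mini` and the vector `rt_raw` are made up for the purpose; they are not part of the data set used in the text.

```r
rt_raw <- c(1342, 1401, 1391)
mini <- data.frame(group = c("control", "control", "treatment"),
                   RT    = rt_raw)

mini$RT                        # pull out one variable...
avg <- mean(mini$RT)           # ...and use it like any other vector
avg

mini$RT_sec <- mini$RT / 1000  # $ can also be used to add a new variable
names(mini)

rt_raw[1] <- 0                 # changing the original vector...
mini$RT[1]                     # ...leaves the copy inside the data frame untouched
```

The last two lines illustrate the point made earlier: once the data frame has been created, it no longer depends on the variables it was built from.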
The next kind of data I want to mention are lists. Lists are an extremely fundamental data structure in R, and as you start making the transition from a novice to a savvy R user you will use lists all the time. Most of the advanced data structures in R are built from lists (e.g., data frames are actually a specific type of list), so it’s useful to have a basic understanding of them.
Okay, so what is a list, exactly? Like data frames, lists are just “collections of variables.” However, unlike data frames – which are basically supposed to look like a nice “rectangular” table of data – there are no constraints on what kinds of variables we include, and no requirement that the variables have any particular relationship to one another. In order to understand what this actually means, the best thing to do is create a list, which we can do using the list
function. If I type this as my command:
starks <- list(
parents = c("Eddard", "Catelyn"),
children = c("Robb", "Jon", "Sansa", "Arya", "Brandon", "Rickon"),
alive = 8
)
I create a list starks
that contains a list of the various characters that belong to House Stark in George R. R. Martin’s A Song of Ice and Fire novels. Because Martin does seem to enjoy killing off characters, the list starts out by indicating that all eight are currently alive (at the start of the books obviously!) and we can update it if need be. When a character dies, I might do this:
starks$alive <- starks$alive - 1
starks
## $parents
## [1] "Eddard" "Catelyn"
##
## $children
## [1] "Robb" "Jon" "Sansa" "Arya" "Brandon" "Rickon"
##
## $alive
## [1] 7
I can delete whole variables from the list if I want. For instance, I might just give up on the parents entirely:
starks$parents <- NULL
starks
## $children
## [1] "Robb" "Jon" "Sansa" "Arya" "Brandon" "Rickon"
##
## $alive
## [1] 7
You get the idea, I hope.
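There are a few other basic operations on lists that are worth seeing once. The sketch below rebuilds a cut-down version of the starks list so that it runs on its own; the `words` element is my own addition for illustration.

```r
starks <- list(
  children = c("Robb", "Jon", "Sansa", "Arya", "Brandon", "Rickon"),
  alive = 7
)

names(starks)          # what's stored in the list?
starks$children[2]     # the second element of the children vector
starks[["alive"]]      # double square brackets do the same job as $

starks$words <- "Winter is coming"   # add a new element, of a different type
length(starks)                       # the list now has three elements
```

The double bracket form is handy when the name of the element you want is itself stored in a variable, which is something the $ operator can’t handle.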
Working with data frames can sometimes be complicated, so I’m going to revisit them again now that we’ve talked about lists. At its core, a data frame genuinely is a list, one that just happens to have a special “rectangular” structure. However, this rectangular structure means that in addition to working with it using the $ operator (as one would do for a list), there are some other possibilities too, based on the fact that a data frame has rows and columns. Let’s return to the In the Night Garden data set. First, let’s construct a data frame, itng…
load("./data/nightgarden.Rdata")
itng <- data.frame( speaker, utterance )
itng
## speaker utterance
## 1 upsy-daisy pip
## 2 upsy-daisy pip
## 3 upsy-daisy onk
## 4 upsy-daisy onk
## 5 tombliboo ee
## 6 tombliboo oo
## 7 makka-pakka pip
## 8 makka-pakka pip
## 9 makka-pakka onk
## 10 makka-pakka onk
… and we’ll assume the goal is to be able to select different subsets of this data set.
The `subset` function
There are several different ways to subset a data frame in R, some easier than others. I’ll start by discussing the `subset` function, which is probably the conceptually simplest way to do it. For our purposes there are three different arguments that you’ll be most interested in:
- `x`. The data frame that you want to subset.
- `subset`. A vector of logical values indicating which cases (rows) of the data frame you want to keep. By default, all cases will be retained.
- `select`. This argument indicates which variables (columns) in the data frame you want to keep. This can either be a list of variable names, or a logical vector indicating which ones to keep, or even just a numeric vector containing the relevant column numbers. By default, all variables will be retained.

Let’s start with an example in which I use all three of these arguments. Suppose that I want to subset the `itng` data frame, keeping only the utterances made by Makka-Pakka. What that means is that I need to use the `select` argument to pick out the `utterance` variable, and I also need to use the `subset` argument to pick out the cases when Makka-Pakka is speaking (i.e., `speaker == "makka-pakka"`). Therefore, the command I need to use is this:
df <- subset(
x = itng,
subset = speaker == "makka-pakka",
select = utterance
)
print( df )
## utterance
## 7 pip
## 8 pip
## 9 onk
## 10 onk
The variable df
here is still a data frame, but it only contains one variable (called utterance
) and four cases. Notice that the row numbers are the same ones from the original data frame. It’s worth taking a moment to explain this. These “row numbers” are actually row names. When you create a new data frame from scratch, R assigns each row a fairly boring row name, identical to the row number. However, when you subset the data frame, each row keeps its original row name. This can be quite useful, since – as in the current example – it gives you a visual reminder of what each row in the new data frame corresponds to in the original data frame. However, if it annoys you, you can change the row names using the `rownames` function, or reset them to the default 1, 2, 3, ... numbering with the `rownames(df) <- NULL` command.
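Here is a small sketch of what happens to the row names. Since the `nightgarden.Rdata` file isn't available in a standalone snippet, the `itng` data frame is rebuilt by hand from the printout above, which is an assumption about its contents.

```r
# Reconstruct itng by hand (values copied from the printed data frame;
# the original file may store these columns as factors)
itng <- data.frame(
  speaker   = rep(c("upsy-daisy", "tombliboo", "makka-pakka"), times = c(4, 2, 4)),
  utterance = c("pip", "pip", "onk", "onk", "ee", "oo", "pip", "pip", "onk", "onk")
)

df <- subset(itng, subset = speaker == "makka-pakka", select = utterance)
rownames(df)           # the original row names "7" "8" "9" "10" are kept

rownames(df) <- NULL   # ask R to go back to automatic row names
rownames(df)           # now "1" "2" "3" "4"
```

A data frame always has row names of some kind, so assigning `NULL` swaps the inherited ones for the automatic numbering rather than leaving the rows unlabelled.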
In any case, let’s return to the subset
function, and look at what happens when we don’t use all three of the arguments. Firstly, suppose that I didn’t bother to specify the select
argument. Let’s see what happens:
subset(
x = itng,
subset = speaker == "makka-pakka"
)
## speaker utterance
## 7 makka-pakka pip
## 8 makka-pakka pip
## 9 makka-pakka onk
## 10 makka-pakka onk
Not surprisingly, R has kept the same cases from the original data set (i.e., rows 7 through 10), but this time it has retained all of the variables from the data frame. Equally unsurprisingly, if I don’t specify the subset
argument, what we find is that R keeps all of the cases:
subset(
x = itng,
select = utterance
)
## utterance
## 1 pip
## 2 pip
## 3 onk
## 4 onk
## 5 ee
## 6 oo
## 7 pip
## 8 pip
## 9 onk
## 10 onk
Again, it’s important to note that this output is still a data frame: it’s just a data frame with only a single variable.
Throughout the book so far, whenever I’ve been subsetting a vector I’ve tended to use the square brackets [ ]
to do so. But in the previous section when I started talking about subsetting a data frame I used the subset
function. As a consequence, you might be wondering whether it is possible to use the square brackets to subset a data frame. The answer, of course, is yes. Not only can you use square brackets for this purpose, as you become more familiar with R you’ll find that this is actually much more convenient than using subset
. Unfortunately, the use of square brackets for this purpose is somewhat complicated, and there are a few cases that cause some confusion. So be warned: this section is more complicated than it feels like it “should” be. With that warning in place, I’ll try to walk you through it slowly. For this section, I’ll use a slightly different In the Night Garden data set, namely the garden
data frame that is stored in the nightgarden2.Rdata
file:
load( "./data/nightgarden2.Rdata" )
garden
## speaker utterance line
## case.1 upsy-daisy pip 1
## case.2 upsy-daisy pip 2
## case.3 tombliboo ee 5
## case.4 makka-pakka pip 7
## case.5 makka-pakka onk 9
As you can see, the garden
data frame contains three variables and five cases, and this time around I’ve used the rownames
function to attach slightly verbose labels to each of the cases. Moreover, let’s assume that what we want to do is to pick out rows 4 and 5 (the two cases when Makka-Pakka is speaking), and columns 1 and 2 (variables speaker
and utterance
).
How shall we do this? As usual, there’s more than one way. The first way is based on the observation that, since a data frame is rectangular, every element in the data frame has a row number and a column number. So, if we want to pick out a single element, we have to specify the row number and a column number within the square brackets. By convention, the row number comes first. So, for the data frame above, which has five rows and three columns, the numerical indexing scheme looks like this:
| | col 1 | col 2 | col 3 |
|---|---|---|---|
| row 1 | [1,1] | [1,2] | [1,3] |
| row 2 | [2,1] | [2,2] | [2,3] |
| row 3 | [3,1] | [3,2] | [3,3] |
| row 4 | [4,1] | [4,2] | [4,3] |
| row 5 | [5,1] | [5,2] | [5,3] |
If I want the 3rd case of the 2nd variable, what I would type is garden[3,2]
, and R would print out some output showing that this element corresponds to the utterance "ee"
. However, let’s hold off from actually doing that for a moment, because there’s something slightly counterintuitive about the specifics of what R does under those circumstances. Instead, let’s aim to solve our original problem, which is to pull out two rows (4 and 5) and two columns (1 and 2). This is fairly simple to do, since R allows us to specify multiple rows and multiple columns. So let’s try that:
garden[ 4:5, 1:2 ]
## speaker utterance
## case.4 makka-pakka pip
## case.5 makka-pakka onk
Clearly, that’s exactly what we asked for: the output here is a data frame containing two variables and two cases. Note that I could have gotten the same answer if I’d used the c
function to produce my vectors rather than the :
operator. That is, the following command is equivalent to the last one:
garden[ c(4,5), c(1,2) ]
## speaker utterance
## case.4 makka-pakka pip
## case.5 makka-pakka onk
It’s just not as pretty. However, if the columns and rows that you want to keep don’t happen to be next to each other in the original data frame, then you might find that you have to resort to using commands like garden[ c(2,4,5), c(1,3) ]
to extract them.
A second way to do the same thing is to use the names of the rows and columns. That is, instead of using the row numbers and column numbers, you use the character strings that are used as the labels for the rows and columns. To apply this idea to our garden
data frame, we would use a command like this:
garden[ c("case.4", "case.5"), c("speaker", "utterance") ]
## speaker utterance
## case.4 makka-pakka pip
## case.5 makka-pakka onk
Once again, this produces exactly the same output. Note that, although this version is more annoying to type than the previous version, it’s a bit easier to read, because it’s often more meaningful to refer to the elements by their names rather than their numbers. Also note that you don’t have to use the same convention for the rows and columns. For instance, I often find that the variable names are meaningful and so I sometimes refer to them by name, whereas the row names are pretty arbitrary so it’s easier to refer to them by number. In fact, that’s more or less exactly what’s happening with the garden data frame, so it probably makes more sense to use this as the command:
garden[ 4:5, c("speaker", "utterance") ]
## speaker utterance
## case.4 makka-pakka pip
## case.5 makka-pakka onk
Again, the output is identical.
Finally, both the rows and columns can be indexed using logical vectors as well. For example, although I claimed earlier that my goal was to extract cases 4 and 5, it’s pretty obvious that what I really wanted to do was select the cases where Makka-Pakka is speaking. So what I could have done is create a logical vector that indicates which cases correspond to Makka-Pakka speaking:
garden$speaker == "makka-pakka"
## [1] FALSE FALSE FALSE TRUE TRUE
As you can see, the 4th and 5th elements of this vector are TRUE
while the others are FALSE
. So I can use this vector to select the rows that I want to keep:
garden[ garden$speaker == "makka-pakka", c("speaker", "utterance") ]
## speaker utterance
## case.4 makka-pakka pip
## case.5 makka-pakka onk
And of course the output is, yet again, the same.
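Logical vectors become more useful once you start combining conditions. The sketch below assumes a hand-built copy of the `garden` data frame (reconstructed from the printout, with plain character columns rather than factors), and the helper names `is.makka` and `is.pip` are my own.

```r
# Hand-built copy of the garden data frame (an assumption: the real one
# lives in nightgarden2.Rdata, which this snippet does not load)
garden <- data.frame(
  speaker   = c("upsy-daisy", "upsy-daisy", "tombliboo", "makka-pakka", "makka-pakka"),
  utterance = c("pip", "pip", "ee", "pip", "onk"),
  line      = c(1, 2, 5, 7, 9),
  row.names = paste0("case.", 1:5)
)

is.makka <- garden$speaker == "makka-pakka"   # TRUE for cases 4 and 5
is.pip   <- garden$utterance == "pip"         # TRUE for cases 1, 2 and 4

garden[ is.makka & is.pip, ]                         # both conditions: case.4 only
garden[ is.makka | is.pip, c("speaker", "utterance") ] # either condition: cases 1, 2, 4, 5
which(is.makka)                                      # the matching row numbers
```

The `which` function is handy when you want to go from a logical vector back to the row numbers it selects.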
There are two fairly useful elaborations on this “rows and columns” approach that I should point out. Firstly, what if you want to keep all of the rows, or all of the columns? To do this, all we have to do is leave the corresponding entry blank, but it is crucial to remember to keep the comma! For instance, suppose I want to keep all the rows in the garden
data, but I only want to retain the first two columns. The easiest way to do this is to use a command like this:
garden[ , 1:2 ]
## speaker utterance
## case.1 upsy-daisy pip
## case.2 upsy-daisy pip
## case.3 tombliboo ee
## case.4 makka-pakka pip
## case.5 makka-pakka onk
Alternatively, if I want to keep all the columns but only want the last two rows, I use the same trick, but this time I leave the second index blank. So my command becomes:
garden[ 4:5, ]
## speaker utterance line
## case.4 makka-pakka pip 7
## case.5 makka-pakka onk 9
The second elaboration I should note is that it’s still okay to use negative indexes as a way of telling R to delete certain rows or columns. For instance, if I want to delete the 3rd column, then I use this command:
garden[ , -3 ]
## speaker utterance
## case.1 upsy-daisy pip
## case.2 upsy-daisy pip
## case.3 tombliboo ee
## case.4 makka-pakka pip
## case.5 makka-pakka onk
whereas if I want to delete the 3rd row, then I’d use this one:
garden[ -3, ]
## speaker utterance line
## case.1 upsy-daisy pip 1
## case.2 upsy-daisy pip 2
## case.4 makka-pakka pip 7
## case.5 makka-pakka onk 9
So that’s nice.
At this point some of you might be wondering why I’ve been so terribly careful to choose my examples in such a way as to ensure that the output always has multiple rows and multiple columns. The reason for this is that I’ve been trying to hide the somewhat curious “dropping” behaviour that R produces when the output only has a single column. I’ll start by showing you what happens, and then I’ll try to explain it. Firstly, let’s have a look at what happens when the output contains only a single row:
garden[ 5, ]
## speaker utterance line
## case.5 makka-pakka onk 9
This is exactly what you’d expect to see: a data frame containing three variables, and only one case per variable. Okay, no problems so far. What happens when you ask for a single column? Suppose, for instance, I try this as a command:
garden[ , 3 ]
Based on everything that I’ve shown you so far, you would be well within your rights to expect to see R produce a data frame containing a single variable and five cases. After all, that is what the subset
command does in this situation, and it’s pretty consistent with everything else that I’ve shown you so far about how square brackets work. In other words, you should expect to see this:
## line
## case.1 1
## case.2 2
## case.3 5
## case.4 7
## case.5 9
However, that is emphatically not what happens. What you actually get is this:
garden[, 3]
## [1] 1 2 5 7 9
That output is not a data frame at all! That’s just an ordinary numeric vector containing 5 elements. What’s going on here is that R has “noticed” that the output that we’ve asked for doesn’t really “need” to be wrapped up in a data frame at all, because it only corresponds to a single variable. So what it does is “drop” the output from a data frame containing a single variable, “down” to a simpler output that corresponds to that variable. This behaviour is convenient for day to day usage once you’ve become familiar with it – and I suppose that’s the real reason why R does this – but there’s no escaping the fact that it is deeply confusing to novices. It’s especially confusing because the behaviour appears only for a very specific case: (a) it only works for columns and not for rows, because the columns correspond to variables and the rows do not, and (b) it only applies to the “rows and columns” version of the square brackets, and not to the subset
function, or to the “just columns” use of the square brackets (next section). As I say, it’s very confusing when you’re just starting out. For what it’s worth, you can suppress this behaviour if you want, by setting drop = FALSE
when you construct your bracketed expression. That is, you could do something like this:
garden[, 3, drop=FALSE]
## line
## case.1 1
## case.2 2
## case.3 5
## case.4 7
## case.5 9
I suppose that helps a little bit, in that it gives you some control over the dropping behaviour, but I’m not sure it helps to make things any easier to understand. Anyway, that’s the “dropping” special case. Fun, isn’t it?
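If you're ever unsure whether dropping has happened, you can simply ask R what kind of object you got back. This is a minimal sketch using a hand-built copy of `garden` (an assumption, reconstructed from the printout rather than loaded from the file).

```r
# Hand-built copy of the garden data frame, reconstructed from the printout
garden <- data.frame(
  speaker   = c("upsy-daisy", "upsy-daisy", "tombliboo", "makka-pakka", "makka-pakka"),
  utterance = c("pip", "pip", "ee", "pip", "onk"),
  line      = c(1, 2, 5, 7, 9),
  row.names = paste0("case.", 1:5)
)

is.data.frame( garden[ 5, ] )                 # TRUE: a single row stays a data frame
is.data.frame( garden[ , 3 ] )                # FALSE: a single column is dropped
is.data.frame( garden[ , 3, drop = FALSE ] )  # TRUE: dropping has been suppressed
class( garden[ , 3 ] )                        # "numeric", i.e. an ordinary vector
```

When you write code that must always receive a data frame, adding `drop = FALSE` is a cheap way to avoid surprises if the selection happens to contain only one column.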
As if the weird “dropping” behaviour wasn’t annoying enough, R actually provides a completely different way of using square brackets to index a data frame. Specifically, if you only give a single index, R will assume you want the corresponding columns, not the rows. Do not be fooled by the fact that this second method also uses square brackets: it behaves differently to the “rows and columns” method that I’ve discussed in the last few sections. Let’s start with the following command:
garden[ 1:2 ]
## speaker utterance
## case.1 upsy-daisy pip
## case.2 upsy-daisy pip
## case.3 tombliboo ee
## case.4 makka-pakka pip
## case.5 makka-pakka onk
As you can see, the output gives me the first two columns, much as if I’d typed garden[,1:2]
. It doesn’t give me the first two rows, which is what I’d have gotten if I’d used a command like garden[1:2,]
. Not only that, if I ask for a single column, R does not drop the output:
garden[3]
## line
## case.1 1
## case.2 2
## case.3 5
## case.4 7
## case.5 9
As I said earlier, the only case where dropping occurs by default is when you use the “row and columns” version of the square brackets, and the output happens to correspond to a single column. However, if you really want to force R to drop the output, you can do so using the “double brackets” notation:
garden[[3]]
## [1] 1 2 5 7 9
Note that R will only allow you to ask for one column at a time using the double brackets. If you try to ask for multiple columns in this way, you get completely different behaviour, which may or may not produce an error, but definitely won’t give you the output you’re expecting. The only reason I’m mentioning it at all is that you might run into double brackets when doing further reading, and a lot of books don’t explicitly point out the difference between [ ]
and [[ ]]
.
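To see the difference between single and double brackets side by side, here is a short sketch. It again uses a hand-built copy of `garden`, which is an assumption about the contents of the original file.

```r
# Hand-built copy of the garden data frame, reconstructed from the printout
garden <- data.frame(
  speaker   = c("upsy-daisy", "upsy-daisy", "tombliboo", "makka-pakka", "makka-pakka"),
  utterance = c("pip", "pip", "ee", "pip", "onk"),
  line      = c(1, 2, 5, 7, 9),
  row.names = paste0("case.", 1:5)
)

class( garden[3] )      # "data.frame": single brackets keep the wrapper
class( garden[[3]] )    # "numeric": double brackets return the variable itself

# Double brackets, the $ operator and "rows and columns" indexing
# all reach the same underlying vector
identical( garden[[3]], garden$line )              # TRUE
identical( garden[["line"]], garden[ , "line" ] )  # TRUE
```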
Okay, for those few readers that have persevered with this section long enough to get here without having set fire to something, I should explain why R has these two different systems for subsetting a data frame (i.e., “row and column” versus “just columns”), and why they behave so differently to each other. I’m not 100% sure about the motivation since I never did manage to read through many of the references that describe the early development of R, but I think the answer relates to the fact that data frames are actually a very strange hybrid of two different kinds of thing. At a low level, a data frame is a list. I can demonstrate this to you by overriding the normal print
function and forcing R to print out the garden data frame using the default print method (see later!) rather than the special one that is defined only for data frames. Here’s what we get:
print.default( garden )
## $speaker
## [1] upsy-daisy upsy-daisy tombliboo makka-pakka makka-pakka
## Levels: makka-pakka tombliboo upsy-daisy
##
## $utterance
## [1] pip pip ee pip onk
## Levels: ee onk oo pip
##
## $line
## [1] 1 2 5 7 9
##
## attr(,"class")
## [1] "data.frame"
Apart from the weird part of the output right at the bottom, this is identical to the printout that you get when you print out a list. In other words, a data frame is a list. Viewed from this “list based” perspective, it’s clear what garden[1]
is: it’s the first variable stored in the list, namely speaker
. In other words, when you use the “just columns” way of indexing a data frame, using only a single index, R assumes that you’re thinking about the data frame as if it were a list of variables. In fact, when you use the $
operator you’re taking advantage of the fact that the data frame is secretly a list.
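You don't have to take the printout on faith: R will tell you directly that a data frame is a list. The sketch below uses a hand-built copy of `garden` (an assumption; in the original file `speaker` and `utterance` are factors, whereas here they are plain character vectors).

```r
# Hand-built copy of the garden data frame, reconstructed from the printout
garden <- data.frame(
  speaker   = c("upsy-daisy", "upsy-daisy", "tombliboo", "makka-pakka", "makka-pakka"),
  utterance = c("pip", "pip", "ee", "pip", "onk"),
  line      = c(1, 2, 5, 7, 9),
  row.names = paste0("case.", 1:5)
)

is.list(garden)         # TRUE: underneath, it is a list
length(garden)          # 3: the list view counts variables, not cases
names(garden)           # the names of the variables in the list
sapply(garden, class)   # each variable keeps its own type
dim(garden)             # 5 3: the table view counts rows and columns
```

The contrast between `length` and `dim` is the two perspectives in miniature: one treats the data frame as a list of variables, the other as a table.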
However, a data frame is more than just a list. It’s a very special kind of list where all the variables are of the same length, and the first element in each variable happens to correspond to the first “case” in the data set. That’s why no-one ever wants to see a data frame printed out in the default “list-like” way that I’ve shown in the extract above. In terms of the deeper meaning behind what a data frame is used for, a data frame really does have this rectangular shape to it:
print( garden )
## speaker utterance line
## case.1 upsy-daisy pip 1
## case.2 upsy-daisy pip 2
## case.3 tombliboo ee 5
## case.4 makka-pakka pip 7
## case.5 makka-pakka onk 9
Because of the fact that a data frame is basically a table of data, R provides a second “row and column” method for interacting with the data frame. This method makes much more sense in terms of the “table of data” interpretation of what a data frame is, and so for the most part it’s this method that people tend to prefer. Throughout this tutorial I’ll aim to stick to the “row and column” approach (though I will use $
a lot), and avoid referring to the “just columns” approach. However, it does get used a lot in practice, so I think it’s important that this book explain what’s going on.
And now let us never speak of this again.
Up to this point we have encountered several different kinds of variables. At the simplest level, we’ve seen numeric data, logical data and character data. However, we’ve also encountered some more complicated kinds of variables, namely factors, formulas, data frames and lists. There are many more specialised data structures out there, but there are a few more generic ones that I want to talk about in passing.
The first of these is a matrix. Much like a data frame, a matrix is basically a big rectangular table of data, and in fact there are quite a few similarities between the two. However, there are also some key differences, so it’s important to talk about matrices in a little detail. To start with, let’s create a matrix using the “row bind” function, rbind
, which you can use to combine multiple vectors together in a row-wise fashion (I’m sure you can guess what the column bind function cbind
does)…
row.1 <- c( 2,3,1 ) # create data for row 1
row.2 <- c( 5,6,7 ) # create data for row 2
M <- rbind( row.1, row.2 ) # row bind them into a matrix
print(M) # and print it out...
## [,1] [,2] [,3]
## row.1 2 3 1
## row.2 5 6 7
The variable M
is a matrix, which we can confirm by using the class
function. Notice that, when we bound the two vectors together, R retained the names of the original variables as row names. We could delete these if we wanted by typing rownames(M)<-NULL
, but I generally prefer having meaningful names attached to my variables, so I’ll keep them. In fact, let’s also add some highly unimaginative column names as well:
colnames(M) <- c( "col.1", "col.2", "col.3" )
print(M)
## col.1 col.2 col.3
## row.1 2 3 1
## row.2 5 6 7
You can use square brackets to subset a matrix in much the same way that you can for data frames, again specifying a row index and then a column index. For instance, M[2,3]
pulls out the entry in the 2nd row and 3rd column of the matrix (i.e., 7
), whereas M[2,]
pulls out the entire 2nd row, and M[,3]
pulls out the entire 3rd column. However, it’s worth noting that when you pull out a column, R will print the results horizontally, not vertically.6 There is also a way of referring to the elements of a matrix using a single index:
M <- matrix(
data = 1:12, # the values to include in the matrix
nrow = 3, # number of rows
ncol = 4 # number of columns
)
print(M)
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
This is a \(3\times 4\) matrix M
that contains the numbers 1 to 12 in order, listed columnwise. You can index the elements of M
by referring to them in that same columnwise order. To see this, notice that these two commands both extract the same element of the matrix:
print(M[2,2])
## [1] 5
print(M[5])
## [1] 5
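The link between the two kinds of index is simple arithmetic. Because the elements are stored column by column, the element in row i and column j sits at position i + (j - 1) × (number of rows). Here is a small sketch of that calculation; the variable names `i`, `j` and `k` are just placeholders of my own.

```r
M <- matrix(data = 1:12, nrow = 3, ncol = 4)

i <- 2                        # row
j <- 2                        # column
k <- i + (j - 1) * nrow(M)    # columnwise position: 2 + 1 * 3 = 5

M[k] == M[i, j]               # TRUE: both refer to the same element

# arrayInd() goes the other way, from a single index back to row and column
arrayInd(5, dim(M))           # row 2, column 2
```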
The critical difference between a data frame and a matrix is that, in a data frame, we have this notion that each of the columns corresponds to a different variable: as a consequence, the columns in a data frame can be of different data types. The first column could be numeric, and the second column could contain character strings, and the third column could be logical data. In that sense, there is a fundamental asymmetry built into a data frame, because of the fact that columns represent variables (which can be qualitatively different to each other) and rows represent cases (which cannot). Matrices are different. At a fundamental level, a matrix really is just one variable: it just happens that this one variable is formatted into rows and columns. If you want a matrix of numeric data, every single element in the matrix must be a number. If you want a matrix of character strings, every single element in the matrix must be a character string. If you try to mix data of different types together, then R will either complain or try to coerce the matrix into something unexpected. To give you a sense of this, let’s do something silly and convert one element of M
from the number 5
to the character string "five"
…
M[2,2] <- "five"
print(M)
## [,1] [,2] [,3] [,4]
## [1,] "1" "4" "7" "10"
## [2,] "2" "five" "8" "11"
## [3,] "3" "6" "9" "12"
As you can see, the entire matrix M
has been coerced into text.
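You can confirm the coercion by asking R what type of data the matrix holds before and after the assignment. The sketch below also builds a tiny data frame, `d`, with made-up column names, to show the contrast with the "one type per column" rule.

```r
M <- matrix(data = 1:12, nrow = 3, ncol = 4)
typeof(M)            # "integer": every element is a whole number

M[2,2] <- "five"
typeof(M)            # "character": the whole matrix was converted
M[1,1]               # "1" is now a string, not a number

# In a data frame only the affected column would need to be text.
# 'd', 'num' and 'txt' are hypothetical names used for illustration.
d <- data.frame(num = 1:3, txt = c("a", "b", "c"), stringsAsFactors = FALSE)
sapply(d, class)     # num stays "integer", txt is "character"
```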
When doing data analysis, we often have reasons to want to use higher dimensional tables (e.g., sometimes you need to cross-tabulate three variables against each other). You can’t do this with matrices, but you can do it with arrays. An array is just like a matrix, except it can have more than two dimensions if you need it to. In fact, as far as R is concerned a matrix is just a special kind of array, in much the same way that a data frame is a special kind of list. I don’t want to talk about arrays too much, but I will very briefly show you an example of what a three dimensional array looks like.
A <- array(
data = 1:24,
dim = c(3,4,2)
)
print(A)
## , , 1
##
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
##
## , , 2
##
## [,1] [,2] [,3] [,4]
## [1,] 13 16 19 22
## [2,] 14 17 20 23
## [3,] 15 18 21 24
Not surprisingly, you can refer to the elements of an array using the same kind of logic that we used for matrices. In this case, since A
is a three-dimensional \(3 \times 4 \times 2\) array, we need three indices to specify an element,
A[2,3,1]
## [1] 8
but you probably won’t be surprised to note that this also works…
A[8]
## [1] 8
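The single-index version works for the same reason it did with matrices: the values are stored with the first index changing fastest, then the second, then the third. A rough sketch of the arithmetic, using placeholder names `i`, `j`, `k` and `pos` of my own:

```r
A <- array(data = 1:24, dim = c(3, 4, 2))
d <- dim(A)          # 3 rows, 4 columns, 2 layers

i <- 2; j <- 3; k <- 1
pos <- i + (j - 1) * d[1] + (k - 1) * d[1] * d[2]   # 2 + 2*3 + 0*12 = 8
A[pos] == A[i, j, k]                                # TRUE

# The same formula for the second layer
k <- 2
pos <- i + (j - 1) * d[1] + (k - 1) * d[1] * d[2]   # 2 + 6 + 12 = 20
A[pos] == A[i, j, k]                                # TRUE
```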
As with other data structures, arrays can have names for specific elements. In fact, we can assign names to each of the dimensions too. For instance, suppose that the array A
has the shape that it does because it represents a (3 genders) \(\times\) (4 seasons) \(\times\) (2 times) structure. We could specify the dimension names for A
like this:
A <- array(
data = 1:24,
dim = c(3,4,2),
dimnames = list(
"genders" = c("male", "female", "nonbinary"),
"seasons" = c("summer", "autumn", "winter", "spring"),
"times" = c("day", "night")
)
)
print(A)
## , , times = day
##
## seasons
## genders summer autumn winter spring
## male 1 4 7 10
## female 2 5 8 11
## nonbinary 3 6 9 12
##
## , , times = night
##
## seasons
## genders summer autumn winter spring
## male 13 16 19 22
## female 14 17 20 23
## nonbinary 15 18 21 24
I find this a lot easier to read - it’s usually a good idea to label your arrays! Plus, it makes it a little easier to index them too, since (as usual) you can refer to elements by names. So if I just wanted to take a slice through the array corresponding to the "nonbinary"
values, I could do this:
A["nonbinary",,]
## times
## seasons day night
## summer 3 15
## autumn 6 18
## winter 9 21
## spring 12 24
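Names work for any combination of dimensions, not just whole slices. Here is a short sketch that rebuilds the labelled array and pulls out a few pieces of it.

```r
A <- array(
  data = 1:24,
  dim = c(3, 4, 2),
  dimnames = list(
    "genders" = c("male", "female", "nonbinary"),
    "seasons" = c("summer", "autumn", "winter", "spring"),
    "times"   = c("day", "night")
  )
)

A["female", "winter", "night"]   # a single cell: 20
A[ , "summer", "day"]            # one value per gender: 1 2 3
dim( A["nonbinary", , ] )        # 4 2: the slice is a seasons-by-times matrix

apply(A, 3, sum)                 # totals over the third dimension: day 78, night 222
```

The `apply` call is included only as a taste of what labelled dimensions make easy; the `3` tells it to summarise over everything except the third dimension.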
One topic that I neglected to mention when discussing factors earlier in this chapter is that there are actually two different types of factor in R, unordered factors and ordered factors. An unordered factor corresponds to a nominal scale variable, and all of the factors we’ve discussed so far in this book have been unordered. However, it’s sometimes useful to explicitly tell R that your variable is ordinal scale, and if so you need to declare it to be an ordered factor. For instance, suppose we have a variable consisting of Likert scale data:
likert <- c(1, 7, 3, 4, 4, 4, 2, 6, 5, 5)
We can declare this to be an ordered factor by using the factor
function, and setting ordered = TRUE
. To illustrate how this works, let’s create an ordered factor called likert.ordinal
and have a look at it:
likert.ordinal <- factor(
x = likert, # the raw data
levels = c(7,6,5,4,3,2,1), # strongest agreement is 1, weakest is 7
ordered = TRUE # and it’s ordered
)
print(likert.ordinal)
## [1] 1 7 3 4 4 4 2 6 5 5
## Levels: 7 < 6 < 5 < 4 < 3 < 2 < 1
Notice that when we print out the ordered factor, R explicitly tells us what order the levels come in. Because I wanted to order my levels in terms of increasing strength of endorsement, and because a response of 1 corresponded to the strongest agreement and 7 to the strongest disagreement, it was important that I tell R to encode 7 as the lowest value and 1 as the largest. Always check this when creating an ordered factor: it’s very easy to accidentally encode your data “upside down” if you’re not paying attention. In any case, note that we can (and should) attach meaningful names to these factor levels by using the levels
function, like this:
levels( likert.ordinal ) <- c(
"strong.disagree", "disagree", "weak.disagree",
"neutral", "weak.agree", "agree", "strong.agree"
)
print( likert.ordinal )
## [1] strong.agree strong.disagree weak.agree neutral
## [5] neutral neutral agree disagree
## [9] weak.disagree weak.disagree
## 7 Levels: strong.disagree < disagree < weak.disagree < ... < strong.agree
One nice thing about using ordered factors is that there are analyses for which R automatically treats ordered factors differently from unordered factors, and generally in a way that is more appropriate for ordinal data. However, I won’t go into details here. Like so many things in this chapter, my main goal here is to make you aware that R has this capability built into it; so if you ever need to start thinking about ordinal scale variables in more detail, you have at least some idea where to start looking!
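One simple consequence of declaring the order is that comparisons like "greater than" start to make sense, which they don't for an unordered factor. A minimal sketch, rebuilding `likert.ordinal` so that the block runs on its own:

```r
likert <- c(1, 7, 3, 4, 4, 4, 2, 6, 5, 5)
likert.ordinal <- factor(
  x = likert,
  levels = c(7,6,5,4,3,2,1),   # weakest endorsement first
  ordered = TRUE
)
levels( likert.ordinal ) <- c(
  "strong.disagree", "disagree", "weak.disagree",
  "neutral", "weak.agree", "agree", "strong.agree"
)

is.ordered(likert.ordinal)          # TRUE
likert.ordinal > "neutral"          # TRUE for responses on the "agree" side
sum(likert.ordinal > "neutral")     # 3 responses show some agreement
max(likert.ordinal)                 # the strongest endorsement: strong.agree
```

This is also a quick way to check that you haven't encoded the scale "upside down": if `max` returns the wrong end of the scale, the levels are in the wrong order.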
Times and dates are very annoying types of data. To a first approximation we can say that there are 365 days in a year, 24 hours in a day, 60 minutes in an hour and 60 seconds in a minute, but that’s not quite correct. The length of the solar day is not exactly 24 hours, and the length of the solar year is not exactly 365 days, so we have a complicated system of corrections that have to be made to keep the time and date system working. On top of that, the measurement of time is usually taken relative to a local time zone, and most (but not all) time zones have both a standard time and a daylight savings time, though the date at which the switch occurs is not at all standardised. So, as a form of data, times and dates are just awful to work with. Unfortunately, they’re also important. Sometimes it’s possible to avoid having to use any complicated system for dealing with times and dates. Often you just want to know what year something happened in, so you can just use numeric data: in quite a lot of situations you can do something as simple as declaring that `this.year` is 2018, and it works just fine. If you can get away with that for your application, this is probably the best thing to do. However, sometimes you really do need to know the actual date. Or, even worse, the actual time. In this section, I’ll very briefly introduce you to the basics of how R deals with date and time data. As with a lot of things in this chapter, I won’t go into details: the goal here is to show you the basics of what you need to do if you ever encounter this kind of data in real life. And then we’ll all agree never to speak of it again.
To start with, let’s talk about the date. As it happens, modern operating systems are very good at keeping track of the time and date, and can even handle all those annoying timezone issues and daylight savings pretty well. So R takes the quite sensible view that it can just ask the operating system what the date is. We can pull the date using the Sys.Date
function:
today <- Sys.Date() # ask the operating system for the date
print(today) # display the date
## [1] "2018-07-24"
Okay, that seems straightforward. But, it does rather look like today is just a character string, doesn’t it? That would be a problem, because dates really do have a quasi-numeric character to them, and it would be nice to be able to do basic addition and subtraction with them. Well, fear not. If you type in class(today)
, R will tell you that the today
variable is a "Date"
object. What this means is that, hidden underneath this text string, R has a numeric representation.7 What that means is that you can in fact add and subtract days. For instance, if we add 1
to today
, R will print out the date for tomorrow:
today + 1
## [1] "2018-07-25"
Let’s see what happens when we add 365 days:
today + 365
## [1] "2019-07-24"
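Adding 365 days only lands on the same calendar date when no leap day falls in between. The sketch below uses fixed dates rather than `Sys.Date()` so that the results don't depend on when you run it; the variable name `day` is my own.

```r
day <- as.Date("2018-07-24")

as.numeric(as.Date("1970-01-02"))   # 1: dates are stored as days since 1970-01-01

# Subtracting two dates gives the number of days between them
as.Date("2019-07-24") - day         # Time difference of 365 days

# 2020 is a leap year, so 365 days after 1 January is still in 2020
as.Date("2020-01-01") + 365         # "2020-12-31"

# To step by calendar years rather than days, seq() can do the counting
seq(day, by = "year", length.out = 2)[2]   # "2019-07-24"
```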
R provides a number of functions for working with dates, but I don’t want to talk about them in any detail. I will, however, make passing mention of the weekdays
function which will tell you what day of the week a particular date corresponded to, which is extremely convenient in some situations:
weekdays(today)
## [1] "Tuesday"
I’ll also point out that you can use the `as.Date` function to coerce various different kinds of data into dates. If the data happen to be strings formatted exactly according to the international standard notation (i.e., `yyyy-mm-dd`) then the conversion is straightforward, because that’s the format that R expects to see by default. You can convert dates from other formats too, but it’s slightly trickier, and something I won’t go into here.
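For readers who do run into another format, the usual approach is to describe the layout through the `format` argument of `as.Date`. This is only a brief sketch with made-up example strings; it sticks to purely numeric formats, because month and weekday names depend on your computer's language settings.

```r
as.Date("2018-07-24")                        # standard notation, no extra work

# %d = day, %m = month, %Y = four-digit year, %y = two-digit year
as.Date("24/07/2018", format = "%d/%m/%Y")   # day/month/year
as.Date("07-24-18",   format = "%m-%d-%y")   # month-day-year, two-digit year

# format() goes the other way, turning a date back into text
format(as.Date("2018-07-24"), "%d/%m/%Y")    # "24/07/2018"
```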
What about times? Well, times are even more annoying than dates, so much so that I don’t intend to talk about them at all right now, other than to point you in the direction of some vaguely useful things. R itself does provide you with tools for handling time data, and in fact there are two separate classes of data that are used to represent times, known by the odd names POSIXct
and POSIXlt
. You can use these to work with times if you want to, but for most applications you would probably be better off downloading the chron
package, which provides some much more user friendly tools for working with times and dates.
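If you'd like a quick look at the two built-in classes anyway, the sketch below shows the basic difference between them. The particular time and the UTC time zone are arbitrary choices of mine, made only so that the example behaves the same on every machine.

```r
# POSIXct stores a time as a single number: seconds since the start of 1970
t.ct <- as.POSIXct("2018-07-24 14:30:00", tz = "UTC")
class(t.ct)          # "POSIXct" "POSIXt"
as.numeric(t.ct)     # the underlying count of seconds

# POSIXlt stores the same moment as a list of separate components
t.lt <- as.POSIXlt(t.ct, tz = "UTC")
t.lt$hour            # the hour component
t.lt$min             # the minute component

# arithmetic on a POSIXct value works in seconds
t.ct + 3600          # one hour later
```

Roughly speaking, POSIXct is the convenient one for arithmetic, and POSIXlt is the convenient one for picking a time apart.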
The last kind of variable that I want to introduce before finally being able to start talking about something a little more practical is the formula. Formulas were originally introduced into R as a convenient way to specify a particular type of statistical model (linear regression) but they’re such handy things that they’ve spread. Formulas are now used in a lot of different contexts, so it makes sense to introduce them early.
Stated simply, a formula object is a variable, but it’s a special type of variable that specifies a relationship between other variables. A formula is specified using the “tilde operator” ~
. A very simple example of a formula is shown below:8
formula1 <- out ~ pred
formula1
## out ~ pred
The precise meaning of this formula depends on exactly what you want to do with it, but in broad terms it means “the out
(outcome) variable, analysed in terms of the pred
(predictor) variable”. That said, although the simplest and most common form of a formula uses the “one variable on the left, one variable on the right” format, there are others. For instance, the following examples are all reasonably common
formula2 <- out ~ pred1 + pred2 # more than one variable on the right
formula3 <- out ~ pred1 * pred2 # different relationship between predictors
formula4 <- ~ var1 + var2 # a ’one-sided’ formula
and there are many more variants besides. Formulas are pretty flexible things, and so different functions will make use of different formats, depending on what the function is intended to do. At this point you don’t need to know much about formulas - I only mention them now so you don’t get surprised by them later!
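To give a rough feel for how a formula gets used, here is a small sketch. The data frame toy contains made-up numbers, constructed so that out is exactly 1 + 2 * pred. Don't worry about the details of lm, which is R's built-in function for fitting linear models. All that matters here is that it accepts a formula as its description of the model.

```r
f <- out ~ pred1 + pred2
class(f)        # "formula"
all.vars(f)     # the variable names the formula mentions

# made-up data, purely for illustration: out is exactly 1 + 2 * pred
toy <- data.frame(pred = c(1, 2, 3, 4), out = c(3, 5, 7, 9))

# lm() reads the formula as "model out as a linear function of pred"
fit <- lm(out ~ pred, data = toy)
coef(fit)       # the fitted intercept and slope
```

A different function handed the same formula could interpret it quite differently, which is what I mean when I say that the precise meaning depends on what you do with it.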
There’s one other important thing that I omitted when I discussed functions earlier on, and that’s the concept of a generic function. The two most notable examples that you’ll see in the next few chapters are summary
and plot
, although you’ve already seen an example of one working behind the scenes, and that’s the print
function. The thing that makes generics different from the other functions is that their behaviour changes, often quite dramatically, depending on the class
of the input you give them. The easiest way to explain the concept is with an example. With that in mind, let’s take a closer look at what the print
function actually does. I’ll do this by creating a formula, and printing it out in a few different ways. First, let’s stick with what we know:
my.formula <- blah ~ blah.blah # create a variable of class "formula"
print( my.formula ) # print it the normal way
## blah ~ blah.blah
So far, there’s nothing very surprising. But there’s actually a lot going on behind the scenes. When I type print(my.formula)
, what actually happens is the print
function checks the class of the my.formula
variable. When the function discovers that the variable it’s been given is a formula, it goes looking for a function called print.formula
, and then delegates the whole business of printing out the variable to the print.formula
function.9 For what it’s worth, the name for a “dedicated” function like print.formula
that exists only to be a special case of a generic function like print
is a method, and the process in which the generic function passes off all the hard work onto a method is called method dispatch. You won’t need to understand the details at all for this book, but you do need to know the gist of it, if only because a lot of the functions we’ll use are actually generics.
Just to give you a sense of this, let’s do something silly and try to bypass the normal workings of the print
function:
print.default( my.formula ) # do something silly by using the wrong method
## blah ~ blah.blah
## attr(,"class")
## [1] "formula"
## attr(,".Environment")
## <environment: R_GlobalEnv>
Hm. You can kind of see that it is trying to print out the same formula, but there’s a bunch of ugly low-level details that have also turned up on screen. This is because the print.default
method doesn’t know anything about formulas, and doesn’t know that it’s supposed to be hiding the obnoxious internal gibberish that R produces sometimes.
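You can also see dispatch from the other side, by writing a method of your own. The sketch below is only an illustration: the class name "greeting" and the function print.greeting are hypothetical, invented for this example and not part of R.

```r
# "greeting" is a hypothetical class, invented purely for this illustration
x <- structure(list(name = "world"), class = "greeting")

# a method is just a function whose name has the form generic.class
print.greeting <- function(x, ...) {
  cat("Hello, ", x$name, "!\n", sep = "")
  invisible(x)    # print methods conventionally return their input invisibly
}

print(x)            # print() sees the class and hands over to print.greeting
print(unclass(x))   # with the class stripped off, the default method is used
```

The first call produces the friendly message, while the second shows the plain list that was sitting underneath all along.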
At this stage, this is about as much as we need to know about generic functions and their methods. In fact, you can get through the entire book without learning any more about them than this, so it’s probably a good idea to end this discussion here.
Taking the usual caveats about IQ measurement as given, of course.↩
Or, more precisely, we don’t know how to measure it. Arguably, a rock has zero intelligence. But it doesn’t make sense to say that the IQ of a rock is 0 in the same way that we can say that the average human has an IQ of 100. And without knowing what the IQ value is that corresponds to a literal absence of any capacity to think, reason or learn, then we really can’t multiply or divide IQ scores and expect a meaningful answer.↩
This is an example of coercing a variable from one class to another. I’ll talk about coercion in more detail later.↩
Some users might wonder why R even allows the ==
operator for factors. The reason is that sometimes you really do have different factors that have the same levels. For instance, if I was analysing data associated with football games, I might have a factor called home.team
, and another factor called winning.team
. In that situation I really should be able to ask if home.team == winning.team
.↩
Note that I didn’t do this in my earlier example when loading the .Rdata
file. There’s a reason for this. The idea behind an .Rdata
file is that it stores a whole workspace. So, if you had the ability to look inside the file yourself you’d see that the data file keeps track of all the variables and their names. So when you load
the file, R restores all those original names. CSV files are treated differently: as far as R is concerned, the CSV only stores one variable, but that variable is a big table. So when you import that table into the workspace, R expects you to give it a name.↩
The reason for this relates to how matrices are implemented. The original matrix M
is treated as a two-dimensional object, containing two rows and three columns. However, whenever you pull out a single row or a single column, the result is considered to be a vector, which has a length but doesn’t have dimensions. Unless you explicitly coerce the vector into a matrix, R doesn’t really distinguish between row vectors and column vectors. This has implications for how matrix algebra is implemented in R (which I’ll admit I initially found odd). When multiplying a matrix by a vector using the %*%
operator, R will attempt to interpret the vector as either a row vector or column vector, depending on whichever one makes the multiplication work. That is, suppose \(\mathbf{M}\) is a \(2\times 3\) matrix, and \(v\) is a \(1\times 3\) row vector. Mathematically the matrix multiplication \(\mathbf{M}v\) doesn’t make sense since the dimensions don’t conform, but you can multiply by the corresponding column vector, \(\mathbf{M}v^T\). So, if I set v <- M[2,]
and then try to calculate M %*% v
, which you’d think would fail, it actually works because R treats the one dimensional array as if it were a column vector for the purposes of matrix multiplication. Note that if both objects are one dimensional arrays/vectors, this leads to ambiguity since \(vv^T\) (inner product) and \(v^Tv\) (outer product) yield different answers. In this situation, the %*%
operator returns the inner product not the outer product. To understand all the details, check out the help documentation.↩
Date objects are coded internally as the number of days that have passed since January 1, 1970.↩
Note that, when I write out the formula, R doesn’t check to see if the out
and pred
variables actually exist: it’s only later on when you try to use the formula for something that this happens.↩
For readers with a programming background: R has three separate systems for object oriented programming. The earliest system was S3, and it was very informal: generic functions as described here are part of the S3 system. Later on S4 was introduced as a more formal way of doing things. I confess I never learned S4 because it looked tedious. More recently R introduced Reference Classes, which look kind of neat and I should probably learn about them. Discussed here if you’re interested.↩