Exercises for 2.1: Descriptive statistics

Preliminaries

  • load the driving.Rdata file.
  • type head( driving ) to look at the first few observations
  • load the following packages: lsr, psych
load( "~/Work/Research/Rbook/workshop_dsto/datasets/driving.Rdata")
head( driving )
##    id gender age distractor peak.hour errors_time1 errors_time2 rt_time1
## 1 s.1   male  19      radio       yes            7            7      346
## 2 s.2 female  42    toddler        no           15           16      424
## 3 s.3   male  27       none        no           10            7      415
## 4 s.4 female  22      radio       yes            5            1      266
## 5 s.5 female  33       none        no            4            9      302
## 6 s.6 female  35    toddler       yes           15           12      423
##   rt_time2
## 1      636
## 2      787
## 3      580
## 4      459
## 5      513
## 6      767
library(lsr)
library(psych)

Exercise 2.1.1: Central tendency

  • compute the mean age of the participants
  • compute the median age of the participants
  • compute the modal age of the participants (note what happens when there are multiple modal frequencies)

Solution 2.1.1:

mean( driving$age )
## [1] 29.41176
median( driving$age )
## [1] 31
modeOf( driving$age )
## [1] 22 35 31

Exercise 2.1.2: Spread

  • compute the standard deviation age of the participants
  • compute the age range of the participants
  • compute the interquartile range on the age of the participants
  • compute the 10th, 25th, 75th and 90th percentiles for age

Solution 2.1.2:

sd( driving$age )
## [1] 7.106853
range( driving$age )
## [1] 15 42
IQR( driving$age )
## [1] 11
quantile( driving$age, c(.1,.25,.75,.9))
##  10%  25%  75%  90% 
## 20.8 24.0 35.0 36.4

Exercise 2.1.3: Tabulation part 1

  • use table() to construct a tabulation of the distractor varible
  • now use table() to cross-tabulate distractor by peak.hour

Solution 2.1.3:

table( driving$distractor )
## 
##    none   radio toddler 
##       5       6       6
table( driving$distractor, driving$peak.hour )
##          
##           yes no
##   none      2  3
##   radio     3  3
##   toddler   3  3

Exercise 2.1.4: Tabulation part 2

  • use xtabs() to cross-tabulate distractor by peak.hour

Solution 2.1.4:

xtabs( ~distractor + peak.hour, driving )
##           peak.hour
## distractor yes no
##    none      2  3
##    radio     3  3
##    toddler   3  3

Exercise 2.1.5: Getting descriptives en masse

  • use describe to get descriptives for all variables in driving
  • use summary to get descriptives for all variables in driving

Solution 2.1.5:

Here’s the summary() of the driving data:

summary( driving )
##        id        gender        age          distractor peak.hour
##  s.1    : 1   female:12   Min.   :15.00   none   :5    yes:8    
##  s.10   : 1   male  : 5   1st Qu.:24.00   radio  :6    no :9    
##  s.11   : 1               Median :31.00   toddler:6             
##  s.12   : 1               Mean   :29.41                         
##  s.13   : 1               3rd Qu.:35.00                         
##  s.14   : 1               Max.   :42.00                         
##  (Other):11                                                     
##   errors_time1    errors_time2       rt_time1        rt_time2    
##  Min.   : 3.00   Min.   : 0.000   Min.   :241.0   Min.   :281.0  
##  1st Qu.: 6.00   1st Qu.: 4.000   1st Qu.:322.0   1st Qu.:482.0  
##  Median :10.00   Median : 7.000   Median :374.0   Median :580.0  
##  Mean   :10.29   Mean   : 7.765   Mean   :375.2   Mean   :569.6  
##  3rd Qu.:14.00   3rd Qu.:12.000   3rd Qu.:424.0   3rd Qu.:651.0  
##  Max.   :19.00   Max.   :19.000   Max.   :463.0   Max.   :787.0  
## 

And here’s the result of using the describe() function:

describe( driving )
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
## Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
## Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
## Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
## Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
## -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
## -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
## -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
## -Inf
##              vars  n   mean     sd median trimmed    mad min  max range
## id*             1 17    NaN     NA     NA     NaN     NA Inf -Inf  -Inf
## gender*         2 17    NaN     NA     NA     NaN     NA Inf -Inf  -Inf
## age             3 17  29.41   7.11     31   29.53   5.93  15   42    27
## distractor*     4 17    NaN     NA     NA     NaN     NA Inf -Inf  -Inf
## peak.hour*      5 17    NaN     NA     NA     NaN     NA Inf -Inf  -Inf
## errors_time1    6 17  10.29   5.12     10   10.20   5.93   3   19    16
## errors_time2    7 17   7.76   5.86      7    7.53   7.41   0   19    19
## rt_time1        8 17 375.24  70.69    374  378.33  77.10 241  463   222
## rt_time2        9 17 569.59 137.94    580  574.33 145.29 281  787   506
##               skew kurtosis    se
## id*             NA       NA    NA
## gender*         NA       NA    NA
## age          -0.32    -0.83  1.72
## distractor*     NA       NA    NA
## peak.hour*      NA       NA    NA
## errors_time1  0.12    -1.46  1.24
## errors_time2  0.19    -1.17  1.42
## rt_time1     -0.32    -1.25 17.14
## rt_time2     -0.23    -0.86 33.46

Something worth noting here is that the describe() function throws a whole bunch of warnings. That’s new – it wasn’t doing that when I first wrote these exercises! I’m sure it’s something that future versions of the package will probably have to fix. It’s very useful to notice how they are happening though. Notice that they’re all happening with respect to the nominal scale variables (id, gender etc?). That’s a consequence of the fact that those variables are factors, and so most of the descriptive statistics in question aren’t well-defined for factors.

Bonus 2.1.5

Okay, because I didn’t know that the psych package had introduced this little annoyance, I should make it up to you by showing you an advance trick to make the problem go away.

There’s a function in R called is.numeric() that tests to see if a variable is numeric! And there’s another one called sapply() (short for “simple apply”) that applies another function to every element in a list. Because data frames are secretly lists, you can do this to get R to tell you which variables are actually numeric:

sapply( driving, is.numeric )
##           id       gender          age   distractor    peak.hour 
##        FALSE        FALSE         TRUE        FALSE        FALSE 
## errors_time1 errors_time2     rt_time1     rt_time2 
##         TRUE         TRUE         TRUE         TRUE

So if you want to apply the describe() function to a truncated data frame that only contains the numeric variables, you’d do this to create the truncated data frame…

new_driving <- driving[,sapply( driving, is.numeric )]
describe(new_driving)
##              vars  n   mean     sd median trimmed    mad min max range
## age             1 17  29.41   7.11     31   29.53   5.93  15  42    27
## errors_time1    2 17  10.29   5.12     10   10.20   5.93   3  19    16
## errors_time2    3 17   7.76   5.86      7    7.53   7.41   0  19    19
## rt_time1        4 17 375.24  70.69    374  378.33  77.10 241 463   222
## rt_time2        5 17 569.59 137.94    580  574.33 145.29 281 787   506
##               skew kurtosis    se
## age          -0.32    -0.83  1.72
## errors_time1  0.12    -1.46  1.24
## errors_time2  0.19    -1.17  1.42
## rt_time1     -0.32    -1.25 17.14
## rt_time2     -0.23    -0.86 33.46

Finally, if you really can’t be bothered typing that in over and over again, you can define your own function that does it automatically for you. Let’s call it tidyDescribe(), shall we? Here’s how to define it:

tidyDescribe <- function(x) {
  describe(x[,sapply( x, is.numeric )])
}

Now, all we have to do is this:

tidyDescribe( driving )
##              vars  n   mean     sd median trimmed    mad min max range
## age             1 17  29.41   7.11     31   29.53   5.93  15  42    27
## errors_time1    2 17  10.29   5.12     10   10.20   5.93   3  19    16
## errors_time2    3 17   7.76   5.86      7    7.53   7.41   0  19    19
## rt_time1        4 17 375.24  70.69    374  378.33  77.10 241 463   222
## rt_time2        5 17 569.59 137.94    580  574.33 145.29 281 787   506
##               skew kurtosis    se
## age          -0.32    -0.83  1.72
## errors_time1  0.12    -1.46  1.24
## errors_time2  0.19    -1.17  1.42
## rt_time1     -0.32    -1.25 17.14
## rt_time2     -0.23    -0.86 33.46

Much nicer. As much as I hate having to include this lengthy digression, it does highlight a really important aspect of working with R. As you start getting more proficient in R, you’ll find yourself defining your own functions all the time. It’s one of the biggest differences between using a statistical *package*'' (like SPSS) anddoing statistical programming’’ (like in R). The programming approach takes more time and effort to learn but it’s way more powerful and eventually you find that you can do all sorts of really interesting things that were never possible before.

Exercise 2.1.6: Descriptives by group

  • use describeBy to get descriptives separately depending on whether the tests were conducted in peak hour or not
  • use aggregate to calculate the mean number of errors in the first test (i.e. errors_time1) broken down by peak.hour and distractor

Solution 2.1.6:

describeBy( driving, driving$peak.hour )
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
## Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
## Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
## Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
## Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
## -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
## -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
## -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
## -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
## Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
## Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
## Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
## Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
## -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
## -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
## -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
## -Inf
## group: yes
##              vars n   mean     sd median trimmed    mad min  max range
## id*             1 8    NaN     NA     NA     NaN     NA Inf -Inf  -Inf
## gender*         2 8    NaN     NA     NA     NaN     NA Inf -Inf  -Inf
## age             3 8  29.62   6.23   30.5   29.62   6.67  19   36    17
## distractor*     4 8    NaN     NA     NA     NaN     NA Inf -Inf  -Inf
## peak.hour*      5 8    NaN     NA     NA     NaN     NA Inf -Inf  -Inf
## errors_time1    6 8   9.00   5.76    6.5    9.00   4.45   3   18    15
## errors_time2    7 8   5.00   5.68    3.5    5.00   5.19   0   14    14
## rt_time1        8 8 364.25  83.54  375.0  364.25  97.11 241  459   218
## rt_time2        9 8 542.50 165.55  539.0  542.50 174.21 281  767   486
##               skew kurtosis    se
## id*             NA       NA    NA
## gender*         NA       NA    NA
## age          -0.55    -1.39  2.20
## distractor*     NA       NA    NA
## peak.hour*      NA       NA    NA
## errors_time1  0.41    -1.78  2.04
## errors_time2  0.44    -1.67  2.01
## rt_time1     -0.23    -1.76 29.53
## rt_time2     -0.11    -1.55 58.53
## -------------------------------------------------------- 
## group: no
##              vars n   mean     sd median trimmed    mad min  max range
## id*             1 9    NaN     NA     NA     NaN     NA Inf -Inf  -Inf
## gender*         2 9    NaN     NA     NA     NaN     NA Inf -Inf  -Inf
## age             3 9  29.22   8.18     31   29.22   8.90  15   42    27
## distractor*     4 9    NaN     NA     NA     NaN     NA Inf -Inf  -Inf
## peak.hour*      5 9    NaN     NA     NA     NaN     NA Inf -Inf  -Inf
## errors_time1    6 9  11.44   4.50     12   11.44   4.45   4   19    15
## errors_time2    7 9  10.22   5.09     10   10.22   4.45   4   19    15
## rt_time1        8 9 385.00  60.49    374  385.00  74.13 302  463   161
## rt_time2        9 9 593.67 112.66    580  593.67 105.26 414  787   373
##               skew kurtosis    se
## id*             NA       NA    NA
## gender*         NA       NA    NA
## age          -0.16    -1.14  2.73
## distractor*     NA       NA    NA
## peak.hour*      NA       NA    NA
## errors_time1 -0.03    -1.12  1.50
## errors_time2  0.31    -1.31  1.70
## rt_time1     -0.03    -1.65 20.16
## rt_time2      0.13    -1.15 37.55
aggregate( errors_time1 ~ peak.hour + distractor, driving, mean )
##   peak.hour distractor errors_time1
## 1       yes       none      5.00000
## 2        no       none      7.00000
## 3       yes      radio      5.00000
## 4        no      radio     11.33333
## 5       yes    toddler     15.66667
## 6        no    toddler     16.00000

Exercise 2.1.7: Correlations

  • calculate the correlation of the number of errors at time 1 with the number of errors at time 2 using cor()
  • calculate the Spearman correlation for the same variables, again using cor()
  • use correlate() to compute all pairwise correlations.

Solution 2.1.7:

cor( driving$errors_time1, driving$errors_time2 )
## [1] 0.6962821
cor( driving$errors_time1, driving$errors_time2, method="spearman" )
## [1] 0.671599
correlate( driving )
## 
## CORRELATIONS
## ============
## - correlation type:  pearson 
## - correlations shown only when both variables are numeric
## 
##              id gender   age distractor peak.hour errors_time1
## id            .      .     .          .         .            .
## gender        .      .     .          .         .            .
## age           .      .     .          .         .        0.407
## distractor    .      .     .          .         .            .
## peak.hour     .      .     .          .         .            .
## errors_time1  .      . 0.407          .         .            .
## errors_time2  .      . 0.282          .         .        0.696
## rt_time1      .      . 0.521          .         .        0.708
## rt_time2      .      . 0.428          .         .        0.605
##              errors_time2 rt_time1 rt_time2
## id                      .        .        .
## gender                  .        .        .
## age                 0.282    0.521    0.428
## distractor              .        .        .
## peak.hour               .        .        .
## errors_time1        0.696    0.708    0.605
## errors_time2            .    0.404    0.466
## rt_time1            0.404        .    0.789
## rt_time2            0.466    0.789        .