As many of you know, how we analyze our data depends a lot on the format of the data (numbers, factors, etc.), the design of our experiment, and whether or not the data are normally distributed.

We’ll use some simulated data and some of the example datasets built into R to work through this today.

So what if we have data that are normally distributed? We can simulate normally distributed data with rnorm(), specifying the mean and standard deviation we want.

We will set the seed so that we all get the same results.

set.seed(1)
y <- rnorm(100, mean=10, sd=2)

If we print the data, what do we see?

y
##   [1]  8.747092 10.367287  8.328743 13.190562 10.659016  8.359063 10.974858
##   [8] 11.476649 11.151563  9.389223 13.023562 10.779686  8.757519  5.570600
##  [15] 12.249862  9.910133  9.967619 11.887672 11.642442 11.187803 11.837955
##  [22] 11.564273 10.149130  6.021297 11.239651  9.887743  9.688409  7.058495
##  [29]  9.043700 10.835883 12.717359  9.794425 10.775343  9.892390  7.245881
##  [36]  9.170011  9.211420  9.881373 12.200051 11.526351  9.670953  9.493277
##  [43] 11.393927 11.113326  8.622489  8.585010 10.729164 11.537066  9.775308
##  [50] 11.762215 10.796212  8.775947 10.682239  7.741274 12.866047 13.960800
##  [57]  9.265557  7.911731 11.139439  9.729891 14.803236  9.921520 11.379479
##  [64] 10.056004  8.513454 10.377585  6.390083 12.931110 10.306507 14.345223
##  [71] 10.951019  8.580107 11.221453  8.131805  7.492733 10.582892  9.113416
##  [78] 10.002211 10.148683  8.820958  8.862663  9.729643 12.356174  6.952866
##  [85] 11.187892 10.665901 12.126200  9.391632 10.740038 10.534198  8.914960
##  [92] 12.415736 12.320805 11.400427 13.173667 11.116973  7.446816  8.853469
##  [99]  7.550775  9.053199

Does it look normal to you? It should, given that we drew it from a normal distribution.

hist(y)

How can we test for normality? One option is a Shapiro-Wilk test.

shapiro.test(y)
## 
##  Shapiro-Wilk normality test
## 
## data:  y
## W = 0.9956, p-value = 0.9876

The p-value is not significant, so we have no evidence against normality (a non-significant Shapiro-Wilk test doesn’t prove the data are normal, but it is consistent with it). So, if we have two numeric samples that are both roughly normal, how can we test whether they’re different?

set.seed(5)
y1 <- rnorm(100, mean=11, sd=2)
y1
##   [1]  9.318289 13.768719  8.489016 11.140286 14.422882  9.794184 10.055667
##   [8]  9.729257 10.428453 11.276216 13.455261  9.396441  8.839215 10.684931
##  [15]  8.856480 10.722028  9.805374  6.632066 11.481635 10.481289 12.801024
##  [22] 12.883739 13.935924 12.413522 12.638018 10.413036 13.837178 13.997548
##  [29]  9.685836  9.294409 11.631830 13.219388 15.430921 13.434207 13.958444
##  [36] 12.903148  8.980935  6.999055  7.475628 10.714784 14.100121  9.395154
##  [43] 10.850842 14.791336 10.086862 12.124447  9.225983 10.079511  9.551343
##  [50] 10.861578 13.926497 11.375452 13.044046  9.816330 10.775599  9.150094
##  [57] 12.506610 10.774782 10.871818 11.466551  8.726834 12.709661  9.843259
##  [64] 11.992723  9.479884 10.317227  6.795342 10.396595  8.455233 10.440668
##  [71] 10.591805 10.548772 11.694057 11.064736 11.827063 10.689303 12.946971
##  [78] 11.242180 11.378347  9.874230 11.996832  7.515395 12.951058 10.951834
##  [85] 12.351369  9.579381 15.774465 10.053136 10.848455  9.956320 12.852094
##  [92]  8.875178 12.114068 12.801461 12.979891 11.767216 10.306832  9.919621
##  [99] 10.634889 10.881401

How do we visualize them together?

hist(y1, col="blue", breaks=20)
hist(y, col="red", breaks=20, add=T)
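
To make the overlaid histograms easier to read, we can add a legend (a small optional addition, not part of the original plot code):

legend("topright", legend=c("y (mean 10)", "y1 (mean 11)"), fill=c("red", "blue"))  # colors match the two histograms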

What statistical test can we apply here?

t.test(y, y1)
## 
##  Welch Two Sample t-test
## 
## data:  y and y1
## t = -3.242, df = 197.49, p-value = 0.001393
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.3597919 -0.3311987
## sample estimates:
## mean of x mean of y 
##  10.21777  11.06327

We get a fairly basic output telling us that these two samples are significantly different.

What test would we use if the data weren’t normally distributed, though? One option is a Wilcoxon test.

wilcox.test(y, y1)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  y and y1
## W = 3814, p-value = 0.003772
## alternative hypothesis: true location shift is not equal to 0

We can see that the difference is still significant, but the p-value isn’t as small, something that is common when comparing parametric tests to their less powerful non-parametric counterparts.

What if they are actually paired though?

t.test(y, y1, paired=TRUE)
## 
##  Paired t-test
## 
## data:  y and y1
## t = -3.2742, df = 99, p-value = 0.00146
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.3578724 -0.3331182
## sample estimates:
## mean of the differences 
##              -0.8454953

We can see that we get different p-values. What else is different? Our degrees of freedom. Instead of 200 independent observations, we have 2 measurements on each of 100 things.
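
One way to see the connection: a paired t-test is the same as a one-sample t-test on the differences. This quick sketch should reproduce the same t, df, and p-value as the paired test above:

t.test(y - y1)  # one-sample test of the 100 differences against zero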

What about the non-parametric alternative?

wilcox.test(y, y1, paired=TRUE)
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  y and y1
## V = 1612, p-value = 0.001704
## alternative hypothesis: true location shift is not equal to 0

How do we simulate character/group data? R has a built-in vector called letters, and rep() repeats values for us. If we take letters in the range 1:2, we get “a” and “b”.

rep(letters[1:2])
## [1] "a" "b"

And if we use a different range, we get different letters.

rep(letters[6:16])
##  [1] "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p"

So we can repeat these letters and assign the result to an object. The length.out argument sets the total length of the output, so “a” and “b” alternate until we have 200 labels.

group <- rep(letters[1:2], length.out=200)
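
As a quick sanity check (an extra step, not required), we can tabulate the labels; we should see 100 of each letter:

table(group)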

Let’s simulate our response data with two different peaks, alternating between means of 10 and 11.

set.seed(6)
response <- rnorm(n=100, mean=c(10, 11), sd=2)

Let’s put this into a dataframe called test.

test <- data.frame(group,response)

We can see that the a’s tend to have lower values. (Note that response has only 100 values while group has 200, so R recycles the response values when building the data frame; that is why rows 101-200 repeat rows 1-100.)

test
##     group  response
## 1       a 10.539212
## 2       b  9.740029
## 3       a 11.737320
## 4       b 14.454391
## 5       a 10.048375
## 6       b 11.736050
## 7       a  7.381591
## 8       b 12.477244
## 9       a 10.089746
## 10      b  8.903206
## 11      a 13.455702
## 12      b  8.642801
## 13      a 11.306413
## 14      b 10.262867
## 15      a  8.800891
## 16      b 11.109210
## 17      a 13.415355
## 18      b  8.811254
## 19      a  9.421436
## 20      b 15.414826
## 21      a 11.037498
## 22      b  8.190164
## 23      a 14.029729
## 24      b  8.623683
## 25      a 10.380762
## 26      b  8.660528
## 27      a  9.923837
## 28      b 15.708409
## 29      a 12.786853
## 30      b  9.879335
## 31      a  8.657081
## 32      b 11.984877
## 33      a  7.641219
## 34      b  8.882565
## 35      a 12.275805
## 36      b 10.679469
## 37      a 11.260986
## 38      b 14.233919
## 39      a  9.613000
## 40      b  7.784416
## 41      a  8.229672
## 42      b 10.135331
## 43      a  9.156752
## 44      b 10.659012
## 45      a 10.491622
## 46      b  9.508504
## 47      a  9.452112
## 48      b 14.649158
## 49      a 10.028467
## 50      b 11.376085
## 51      a  9.891733
## 52      b 11.923223
## 53      a  8.806459
## 54      b 13.526516
## 55      a  7.709342
## 56      b 13.169247
## 57      a  6.942009
## 58      b  7.852516
## 59      a  9.771815
## 60      b 11.222418
## 61      a 10.428275
## 62      b 12.107552
## 63      a  7.880961
## 64      b  7.779901
## 65      a  9.323698
## 66      b 11.409876
## 67      a  9.551310
## 68      b  9.180781
## 69      a  8.383780
## 70      b 12.106167
## 71      a  9.221644
## 72      b 10.105509
## 73      a  9.957722
## 74      b  9.801165
## 75      a  9.378268
## 76      b  9.636735
## 77      a  9.595889
## 78      b 13.233601
## 79      a 11.651998
## 80      b 13.501838
## 81      a 15.216196
## 82      b 10.897920
## 83      a 14.454388
## 84      b 10.972256
## 85      a  6.905201
## 86      b  8.240222
## 87      a 12.959741
## 88      b 10.490156
## 89      a  9.347439
## 90      b  9.546723
## 91      a  6.095303
## 92      b 11.845880
## 93      a 12.363370
## 94      b 12.835919
## 95      a 10.190935
## 96      b  7.631136
## 97      a 11.980659
## 98      b  9.584336
## 99      a  8.811942
## 100     b  8.868206
## 101     a 10.539212
## 102     b  9.740029
## 103     a 11.737320
## 104     b 14.454391
## 105     a 10.048375
## 106     b 11.736050
## 107     a  7.381591
## 108     b 12.477244
## 109     a 10.089746
## 110     b  8.903206
## 111     a 13.455702
## 112     b  8.642801
## 113     a 11.306413
## 114     b 10.262867
## 115     a  8.800891
## 116     b 11.109210
## 117     a 13.415355
## 118     b  8.811254
## 119     a  9.421436
## 120     b 15.414826
## 121     a 11.037498
## 122     b  8.190164
## 123     a 14.029729
## 124     b  8.623683
## 125     a 10.380762
## 126     b  8.660528
## 127     a  9.923837
## 128     b 15.708409
## 129     a 12.786853
## 130     b  9.879335
## 131     a  8.657081
## 132     b 11.984877
## 133     a  7.641219
## 134     b  8.882565
## 135     a 12.275805
## 136     b 10.679469
## 137     a 11.260986
## 138     b 14.233919
## 139     a  9.613000
## 140     b  7.784416
## 141     a  8.229672
## 142     b 10.135331
## 143     a  9.156752
## 144     b 10.659012
## 145     a 10.491622
## 146     b  9.508504
## 147     a  9.452112
## 148     b 14.649158
## 149     a 10.028467
## 150     b 11.376085
## 151     a  9.891733
## 152     b 11.923223
## 153     a  8.806459
## 154     b 13.526516
## 155     a  7.709342
## 156     b 13.169247
## 157     a  6.942009
## 158     b  7.852516
## 159     a  9.771815
## 160     b 11.222418
## 161     a 10.428275
## 162     b 12.107552
## 163     a  7.880961
## 164     b  7.779901
## 165     a  9.323698
## 166     b 11.409876
## 167     a  9.551310
## 168     b  9.180781
## 169     a  8.383780
## 170     b 12.106167
## 171     a  9.221644
## 172     b 10.105509
## 173     a  9.957722
## 174     b  9.801165
## 175     a  9.378268
## 176     b  9.636735
## 177     a  9.595889
## 178     b 13.233601
## 179     a 11.651998
## 180     b 13.501838
## 181     a 15.216196
## 182     b 10.897920
## 183     a 14.454388
## 184     b 10.972256
## 185     a  6.905201
## 186     b  8.240222
## 187     a 12.959741
## 188     b 10.490156
## 189     a  9.347439
## 190     b  9.546723
## 191     a  6.095303
## 192     b 11.845880
## 193     a 12.363370
## 194     b 12.835919
## 195     a 10.190935
## 196     b  7.631136
## 197     a 11.980659
## 198     b  9.584336
## 199     a  8.811942
## 200     b  8.868206

Did the data come out the way we wanted? We can see how R classified each column by using the str() function.

str(test)
## 'data.frame':    200 obs. of  2 variables:
##  $ group   : Factor w/ 2 levels "a","b": 1 2 1 2 1 2 1 2 1 2 ...
##  $ response: num  10.54 9.74 11.74 14.45 10.05 ...
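
One note if you are running a newer version of R: since R 4.0, data.frame() no longer converts character columns to factors by default, so group may show up as chr rather than Factor. If you want the factor behaviour shown above, you can convert it explicitly:

test$group <- factor(test$group)  # only needed on R >= 4.0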

What if we do a basic t-test by group? We use R’s standard formula notation: the response (dependent) variable comes first, then “~”, then the independent variable. This formula syntax is pretty standard throughout R.

t.test(response~group, data=test)
## 
##  Welch Two Sample t-test
## 
## data:  response by group
## t = -2.1976, df = 197.55, p-value = 0.02914
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.21415080 -0.06567404
## sample estimates:
## mean in group a mean in group b 
##        10.15963        10.79954

What about visualizing this? A boxplot takes the same formula as the t-test.

boxplot(response~group, data=test)

We see a modest but significant difference. Any guesses what the non-parametric version would be?

wilcox.test(response~group, data=test)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  response by group
## W = 4176, p-value = 0.0442
## alternative hypothesis: true location shift is not equal to 0

What if we had more than 2 groups? What about an ANOVA? Simulate groups and responses again, but with three groups.

group.aov <- rep(letters[1:3], length.out=300)
response.aov <- rnorm(n=300, mean=c(10, 20, 30), sd=2)

Visualize it.

hist(response.aov)

Hard to see, so let’s break it up some more.

hist(response.aov, breaks=20)
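
Another way to see the three groups separately is a grouped boxplot, using the vectors we just created (an optional extra view):

boxplot(response.aov ~ group.aov)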

Put it into a data frame.

test.aov <- data.frame(group.aov, response.aov)

What are the groups?

str(test.aov)
## 'data.frame':    300 obs. of  2 variables:
##  $ group.aov   : Factor w/ 3 levels "a","b","c": 1 2 3 1 2 3 1 2 3 1 ...
##  $ response.aov: num  8.12 20.19 30.13 8.02 19.86 ...
test.aov
##     group.aov response.aov
## 1           a     8.118466
## 2           b    20.187821
## 3           c    30.132507
## 4           a     8.018688
## 5           b    19.855770
## 6           c    29.553486
## 7           a     9.714447
## 8           b    21.070813
## 9           c    29.250886
## 10          a     8.163675
## 11          b    18.717555
## 12          c    28.294174
## 13          a     6.769873
## 14          b    22.212981
## 15          c    29.134208
## 16          a    11.206208
## 17          b    20.083927
## 18          c    31.172260
## 19          a     7.929797
## 20          b    21.656384
## 21          c    29.285933
## 22          a    10.374932
## 23          b    19.576971
## 24          c    29.163406
## 25          a     8.502122
## 26          b    19.667918
## 27          c    25.205109
## 28          a    11.766518
## 29          b    19.491791
## 30          c    26.701269
## 31          a    10.332678
## 32          b    21.848179
## 33          c    28.079871
## 34          a    10.960212
## 35          b    22.255098
## 36          c    31.336251
## 37          a    12.713954
## 38          b    22.045840
## 39          c    29.956591
## 40          a     7.012731
## 41          b    18.010948
## 42          c    31.175779
## 43          a    11.578186
## 44          b    22.647095
## 45          c    31.434442
## 46          a     7.238572
## 47          b    20.205753
## 48          c    26.401913
## 49          a    11.946031
## 50          b    20.432972
## 51          c    31.050230
## 52          a     8.913349
## 53          b    16.565962
## 54          c    32.060233
## 55          a     7.705762
## 56          b    19.467741
## 57          c    27.665517
## 58          a     9.575835
## 59          b    16.556371
## 60          c    27.123451
## 61          a    11.669120
## 62          b    20.945517
## 63          c    30.862615
## 64          a    10.942922
## 65          b    23.473781
## 66          c    31.809653
## 67          a    10.580663
## 68          b    18.252207
## 69          c    28.565027
## 70          a     9.644661
## 71          b    22.068034
## 72          c    26.184120
## 73          a    11.594654
## 74          b    19.810298
## 75          c    30.411230
## 76          a    11.767786
## 77          b    16.593736
## 78          c    30.271560
## 79          a    12.358004
## 80          b    20.768582
## 81          c    30.071476
## 82          a    11.868671
## 83          b    19.761274
## 84          c    27.744633
## 85          a    10.835056
## 86          b    20.102938
## 87          c    29.824379
## 88          a    10.209204
## 89          b    17.980738
## 90          c    29.432667
## 91          a    12.259920
## 92          b    19.918231
## 93          c    32.673664
## 94          a    10.803370
## 95          b    18.604896
## 96          c    25.475005
## 97          a    11.575135
## 98          b    19.097051
## 99          c    28.877900
## 100         a    11.246222
## 101         b    22.844228
## 102         c    28.483109
## 103         a     9.889314
## 104         b    19.039083
## 105         c    28.158967
## 106         a     8.704743
## 107         b    21.035866
## 108         c    32.956722
## 109         a     5.086632
## 110         b    17.101096
## 111         c    29.200391
## 112         a    11.033446
## 113         b    19.414927
## 114         c    32.688073
## 115         a    10.506259
## 116         b    18.092459
## 117         c    27.992211
## 118         a    11.101995
## 119         b    20.664737
## 120         c    30.291406
## 121         a     9.156598
## 122         b    20.008371
## 123         c    28.593217
## 124         a    12.897580
## 125         b    17.353879
## 126         c    29.078724
## 127         a     9.014654
## 128         b    24.055756
## 129         c    27.819007
## 130         a    11.451340
## 131         b    20.103265
## 132         c    29.089373
## 133         a    10.973001
## 134         b    20.212591
## 135         c    30.071565
## 136         a     8.816613
## 137         b    21.797356
## 138         c    30.078801
## 139         a     9.784824
## 140         b    22.289945
## 141         c    23.937590
## 142         a    10.368557
## 143         b    19.567477
## 144         c    32.329824
## 145         a     9.098170
## 146         b    19.492188
## 147         c    29.261902
## 148         a    10.757738
## 149         b    17.042097
## 150         c    28.657405
## 151         a     8.872500
## 152         b    16.628596
## 153         c    31.024545
## 154         a     9.564735
## 155         b    23.267770
## 156         c    29.141044
## 157         a    13.446941
## 158         b    24.136333
## 159         c    25.070148
## 160         a     9.675994
## 161         b    19.521651
## 162         c    32.419014
## 163         a    10.365747
## 164         b    21.931635
## 165         c    32.233545
## 166         a     6.749229
## 167         b    18.452126
## 168         c    29.544907
## 169         a    10.433502
## 170         b    20.128300
## 171         c    26.891588
## 172         a    12.923991
## 173         b    19.992731
## 174         c    27.899365
## 175         a    12.836904
## 176         b    18.729112
## 177         c    33.389173
## 178         a    11.526005
## 179         b    19.546884
## 180         c    26.927401
## 181         a    11.496323
## 182         b    17.574170
## 183         c    32.195674
## 184         a    10.983540
## 185         b    22.345474
## 186         c    27.814428
## 187         a     8.630065
## 188         b    19.084652
## 189         c    28.822231
## 190         a    11.429790
## 191         b    19.683849
## 192         c    27.981929
## 193         a     9.121487
## 194         b    18.193941
## 195         c    29.865519
## 196         a     8.833208
## 197         b    19.966695
## 198         c    31.017309
## 199         a     9.175271
## 200         b    18.462114
## 201         c    30.832344
## 202         a     9.919914
## 203         b    17.433766
## 204         c    31.441823
## 205         a     8.317028
## 206         b    18.859323
## 207         c    29.054782
## 208         a     7.877000
## 209         b    20.199965
## 210         c    28.430550
## 211         a     6.420961
## 212         b    17.191754
## 213         c    31.600222
## 214         a    10.633671
## 215         b    17.677639
## 216         c    30.842895
## 217         a     8.836619
## 218         b    21.530380
## 219         c    28.975268
## 220         a     9.463324
## 221         b    18.827826
## 222         c    32.379487
## 223         a     8.670159
## 224         b    20.033358
## 225         c    29.785597
## 226         a     8.721637
## 227         b    23.072799
## 228         c    32.629565
## 229         a    10.687742
## 230         b    21.633836
## 231         c    32.190657
## 232         a    12.237506
## 233         b    16.933606
## 234         c    30.954731
## 235         a    12.092983
## 236         b    22.062440
## 237         c    32.713357
## 238         a    12.155726
## 239         b    18.712618
## 240         c    32.116235
## 241         a    12.635230
## 242         b    19.972564
## 243         c    28.932394
## 244         a     8.768916
## 245         b    21.403168
## 246         c    28.618618
## 247         a     7.951723
## 248         b    20.074731
## 249         c    30.245030
## 250         a     8.496898
## 251         b    19.795980
## 252         c    29.108413
## 253         a    10.762664
## 254         b    20.455936
## 255         c    31.459465
## 256         a    14.121016
## 257         b    17.170203
## 258         c    28.633009
## 259         a    11.127867
## 260         b    19.596413
## 261         c    26.682396
## 262         a     9.561859
## 263         b    21.875482
## 264         c    31.175675
## 265         a     9.820167
## 266         b    13.920519
## 267         c    28.883521
## 268         a     7.341616
## 269         b    21.872522
## 270         c    29.879298
## 271         a    14.206914
## 272         b    18.598988
## 273         c    28.070195
## 274         a    10.472865
## 275         b    22.115792
## 276         c    28.528353
## 277         a    11.724736
## 278         b    19.036243
## 279         c    32.358042
## 280         a     9.579015
## 281         b    19.918456
## 282         c    26.925382
## 283         a     7.433000
## 284         b    20.127299
## 285         c    28.221981
## 286         a     9.965087
## 287         b    23.805990
## 288         c    25.586475
## 289         a    10.603511
## 290         b    20.377359
## 291         c    26.874220
## 292         a    12.677286
## 293         b    18.629817
## 294         c    29.607756
## 295         a     7.025535
## 296         b    16.211755
## 297         c    32.782658
## 298         a    13.216722
## 299         b    22.124407
## 300         c    30.293919

Set up the ANOVA model.

test.aov.fit <- aov(response.aov~group.aov, data=test.aov)

If we use the plot() function on the fitted model, we get the standard diagnostic plots (residuals vs. fitted, normal Q-Q, and so on).

plot(test.aov.fit)
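
By default R shows the diagnostic plots one at a time. If you would rather see them all together, one option is to change the plotting layout first:

par(mfrow=c(2, 2))   # 2 x 2 grid of plots
plot(test.aov.fit)   # residuals vs fitted, normal Q-Q, etc.
par(mfrow=c(1, 1))   # reset the layout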

Look at the output. Printing the fit by itself gives us no p-value or other test information.

test.aov.fit
## Call:
##    aov(formula = response.aov ~ group.aov, data = test.aov)
## 
## Terms:
##                 group.aov Residuals
## Sum of Squares   18817.98   1114.08
## Deg. of Freedom         2       297
## 
## Residual standard error: 1.936779
## Estimated effects may be unbalanced

Let’s use summary() to get a summary of the fit. We see the group effect is highly significant, but that shouldn’t be surprising given the means we told rnorm() to use.

summary(test.aov.fit)
##              Df Sum Sq Mean Sq F value Pr(>F)    
## group.aov     2  18818    9409    2508 <2e-16 ***
## Residuals   297   1114       4                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the ANOVA tells us there is a significant difference somewhere, but not necessarily between every pair of groups, we can do a post-hoc test, such as Tukey’s HSD.

TukeyHSD(test.aov.fit)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = response.aov ~ group.aov, data = test.aov)
## 
## $group.aov
##          diff       lwr      upr p adj
## b-a  9.748184  9.103002 10.39337     0
## c-a 19.399908 18.754726 20.04509     0
## c-b  9.651724  9.006541 10.29691     0

The output shows all of the pairwise comparisons, and we see that every group differs from every other. What if we use different numbers?
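
Before we change the numbers, note that TukeyHSD() results also have a plot method, if you would like to see the confidence interval for each comparison (an optional extra):

plot(TukeyHSD(test.aov.fit))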

Simulate groups and responses again, still with three groups, but now two of the means are very close and the groups have different standard deviations.

group.aov2 <- rep(letters[1:3], length.out=300)
response.aov2 <- rnorm(n=300, mean=c(10, 10.5, 30), sd=c(2,6,1.5))
hist(response.aov2, breaks=20)

test.aov2 <- data.frame(group.aov2, response.aov2)

ANOVA time.

test.aov.fit2 <- aov(response.aov2~group.aov2, data=test.aov2)
summary(test.aov.fit2)
##              Df Sum Sq Mean Sq F value Pr(>F)    
## group.aov2    2  25880   12940   989.6 <2e-16 ***
## Residuals   297   3883      13                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Still very significant overall, but what about the post-hoc test?

TukeyHSD(test.aov.fit2)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = response.aov2 ~ group.aov2, data = test.aov2)
## 
## $group.aov2
##          diff        lwr       upr     p adj
## b-a  0.664413 -0.5401587  1.868985 0.3967147
## c-a 20.026745 18.8221736 21.231317 0.0000000
## c-b 19.362332 18.1577606 20.566904 0.0000000

We can see there is not a significant difference between a and b, which is not surprising, given the means we gave them.

How about using a nonparametric test?

nonpara.aov <- kruskal.test(response.aov2~group.aov2, data=test.aov2)
nonpara.aov
## 
##  Kruskal-Wallis rank sum test
## 
## data:  response.aov2 by group.aov2
## Kruskal-Wallis chi-squared = 199.36, df = 2, p-value < 2.2e-16

There are post-hoc options here too, which we won’t explore in depth right now (some involve other packages).
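
One option that stays in base R (a sketch, not the only approach) is pairwise Wilcoxon tests with a correction for multiple testing:

pairwise.wilcox.test(response.aov2, group.aov2, p.adjust.method="holm")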

#What if we are interested in more than one response (dependent) variable?

Let’s pull in the iris flower dataset

data("iris")

Check out the data

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

We can see that there are a few different numerical traits (width, length, etc) and also species.

So we can ask whether several dependent traits are impacted by one independent variable.

For example, are sepal length and petal length significantly different between species?

We can use a MANOVA to answer this question. In order to do this, we need to bind together the two dependent variables (sepal length and petal length).

results.manova <- manova(cbind(iris$Sepal.Length, iris$Petal.Length)~iris$Species)
summary(results.manova)
##               Df Pillai approx F num Df den Df    Pr(>F)    
## iris$Species   2 0.9885   71.829      4    294 < 2.2e-16 ***
## Residuals    147                                            
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

But what if we want to look at each trait separately?

summary.aov(results.manova)
##  Response 1 :
##               Df Sum Sq Mean Sq F value    Pr(>F)    
## iris$Species   2 63.212  31.606  119.26 < 2.2e-16 ***
## Residuals    147 38.956   0.265                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 2 :
##               Df Sum Sq Mean Sq F value    Pr(>F)    
## iris$Species   2 437.10 218.551  1180.2 < 2.2e-16 ***
## Residuals    147  27.22   0.185                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see that both traits are significantly different by species.

#Now that we’ve covered basic response by group, what about continuous by continuous? Correlation and regression?

Let’s use other datasets that are already in R, since they can be useful. Let’s call in the Motor Trend car data, mtcars.

data("mtcars")

Let’s check it out

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

There’s a lot to look through here, but let’s keep it simple.

What is directly impacting the fuel efficiency (mpg) of these cars? We’re going to set up a linear model with lm() and use the same formula notation we’ve used before (dependent~independent).

Let’s just start with one variable, weight (wt).

carfit <- lm(mpg~wt, data=mtcars)

You can check out the plots of residuals if you like.

plot(carfit)

What does the output look like?

summary(carfit)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

Highly significant, with a strong R-squared value (the proportion of variation in mpg explained by weight!)

What does it look like?

plot(mpg~wt, data=mtcars)
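
If you want to see the fitted line on that scatterplot, you can overlay it (an optional extra step):

abline(carfit)  # adds the regression line from the model above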

What if we include another variable, maybe the number of cylinders (cyl)?

carfit2 <- lm(mpg~wt+cyl, data=mtcars)

We see that this model is also significant and has a higher R-squared value, but does that matter?

summary(carfit2)
## 
## Call:
## lm(formula = mpg ~ wt + cyl, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2893 -1.5512 -0.4684  1.5743  6.1004 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.6863     1.7150  23.141  < 2e-16 ***
## wt           -3.1910     0.7569  -4.216 0.000222 ***
## cyl          -1.5078     0.4147  -3.636 0.001064 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.568 on 29 degrees of freedom
## Multiple R-squared:  0.8302, Adjusted R-squared:  0.8185 
## F-statistic: 70.91 on 2 and 29 DF,  p-value: 6.809e-12

Which model is better?

anova(carfit, carfit2)
## Analysis of Variance Table
## 
## Model 1: mpg ~ wt
## Model 2: mpg ~ wt + cyl
##   Res.Df    RSS Df Sum of Sq     F   Pr(>F)   
## 1     30 278.32                               
## 2     29 191.17  1     87.15 13.22 0.001064 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This output shows that there is a significantly better fit with the more complex model (p=0.001064).
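
Another way to compare the two models (an optional alternative to the anova() comparison) is AIC, where lower values indicate a better trade-off between fit and complexity:

AIC(carfit, carfit2)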

But what about an interaction effect of something like how many cylinders it has?

If the car is heavier because of the engine and not something else, maybe there’s an effect?

carfit3 <- lm(mpg~wt+cyl+wt*cyl, data=mtcars)
summary(carfit3)
## 
## Call:
## lm(formula = mpg ~ wt + cyl + wt * cyl, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2288 -1.3495 -0.5042  1.4647  5.2344 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  54.3068     6.1275   8.863 1.29e-09 ***
## wt           -8.6556     2.3201  -3.731 0.000861 ***
## cyl          -3.8032     1.0050  -3.784 0.000747 ***
## wt:cyl        0.8084     0.3273   2.470 0.019882 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.368 on 28 degrees of freedom
## Multiple R-squared:  0.8606, Adjusted R-squared:  0.8457 
## F-statistic: 57.62 on 3 and 28 DF,  p-value: 4.231e-12

We see that weight is significant to the model, as is cylinder, BUT we see a significant interaction between cylinder and weight.

What’s going on here? We know that as weight goes up, the mpg goes down.

plot(mpg~wt, data=mtcars)

But we know as the cylinders go up, the mpg goes down too.

plot(mpg~cyl, data=mtcars)

What happens to wt as cylinders go up?

plot(wt~cyl, data=mtcars)

If we look at our summary again, we can see that the estimate for the interaction is positive, whereas the estimates for wt and cyl individually are negative. This means that while wt and cyl each negatively impact mpg, as the number of cylinders increases, the negative effect of wt on mpg weakens.
There is an interaction between these variables, and if we compare the simpler wt + cyl model to our more complex model, the more complex model is significantly better.

anova(carfit2, carfit3)
## Analysis of Variance Table
## 
## Model 1: mpg ~ wt + cyl
## Model 2: mpg ~ wt + cyl + wt * cyl
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     29 191.17                              
## 2     28 156.98  1    34.196 6.0995 0.01988 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

You can make prettier plots of these models, but that is easier outside of base R (ggplot2, etc.).
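
Even in base R, though, a quick sketch coloring the points by cylinder count can give a feel for the interaction (the object name cyl.f is just for this example):

cyl.f <- factor(mtcars$cyl)
plot(mpg ~ wt, data=mtcars, col=as.integer(cyl.f), pch=19)
legend("topright", legend=levels(cyl.f), col=1:nlevels(cyl.f), pch=19, title="cyl")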

So what if we have categorical/character data? Maybe something from a survey or phenotypes from genetic crosses?

What if we want to know if there is a more common combination of hair and eye color? Let’s load the available dataset in R called HairEyeColor.

data("HairEyeColor")

Let’s look at it

str(HairEyeColor)
##  'table' num [1:4, 1:4, 1:2] 32 53 10 3 11 50 10 30 10 25 ...
##  - attr(*, "dimnames")=List of 3
##   ..$ Hair: chr [1:4] "Black" "Brown" "Red" "Blond"
##   ..$ Eye : chr [1:4] "Brown" "Blue" "Hazel" "Green"
##   ..$ Sex : chr [1:2] "Male" "Female"
HairEyeColor
## , , Sex = Male
## 
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    32   11    10     3
##   Brown    53   50    25    15
##   Red      10   10     7     7
##   Blond     3   30     5     8
## 
## , , Sex = Female
## 
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    36    9     5     2
##   Brown    66   34    29    14
##   Red      16    7     7     7
##   Blond     4   64     5     8

It’s in kind of a weird format, and for our question we are not worried about sex. We can use the margin.table() function to collapse over sex and give us a marginal table of hair color by eye color.

HairEyeNew <- margin.table(HairEyeColor, margin=c(1,2))

Let’s see what it looks like.

HairEyeNew
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    68   20    15     5
##   Brown   119   84    54    29
##   Red      26   17    14    14
##   Blond     7   94    10    16

We can see it is all in a usable table now, but raw numbers are not what we are interested in.

What kind of test would we use to see whether some combinations occur more or less often than expected? A chi-squared test is a common choice.

We can use this function with what we saved as a marginal frequency table.

chisq.test(HairEyeNew)
## 
##  Pearson's Chi-squared test
## 
## data:  HairEyeNew
## X-squared = 138.29, df = 9, p-value < 2.2e-16

We see that something is significant, but we also want to see the frequencies we have. To calculate relative frequencies, we divide the counts by the grand total.

HairEyeNew/sum(HairEyeNew)
##        Eye
## Hair          Brown        Blue       Hazel       Green
##   Black 0.114864865 0.033783784 0.025337838 0.008445946
##   Brown 0.201013514 0.141891892 0.091216216 0.048986486
##   Red   0.043918919 0.028716216 0.023648649 0.023648649
##   Blond 0.011824324 0.158783784 0.016891892 0.027027027
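
As a shortcut, prop.table() does the same division for us, and the standardized residuals from the chi-squared test show which cells are over- or under-represented (both optional extras):

prop.table(HairEyeNew)          # same as HairEyeNew/sum(HairEyeNew)
chisq.test(HairEyeNew)$stdres   # large positive/negative values drive the significance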