As many of you know, how we analyze our data depends a lot on the format of our data (numbers, factors, etc.), the design of our experiment, and whether or not our data are normally distributed.
We’ll use some simulated data, along with some of the datasets that come built into R, to work through this today.
So what if we have data that are normally distributed? We can simulate normally distributed data with rnorm(), specifying the mean and standard deviation.
We will set the seed so that we all get the same results.
set.seed(1)
y <- rnorm(100, mean=10, sd=2)
If we tell it to list the data, what do we see?
y
## [1] 8.747092 10.367287 8.328743 13.190562 10.659016 8.359063 10.974858
## [8] 11.476649 11.151563 9.389223 13.023562 10.779686 8.757519 5.570600
## [15] 12.249862 9.910133 9.967619 11.887672 11.642442 11.187803 11.837955
## [22] 11.564273 10.149130 6.021297 11.239651 9.887743 9.688409 7.058495
## [29] 9.043700 10.835883 12.717359 9.794425 10.775343 9.892390 7.245881
## [36] 9.170011 9.211420 9.881373 12.200051 11.526351 9.670953 9.493277
## [43] 11.393927 11.113326 8.622489 8.585010 10.729164 11.537066 9.775308
## [50] 11.762215 10.796212 8.775947 10.682239 7.741274 12.866047 13.960800
## [57] 9.265557 7.911731 11.139439 9.729891 14.803236 9.921520 11.379479
## [64] 10.056004 8.513454 10.377585 6.390083 12.931110 10.306507 14.345223
## [71] 10.951019 8.580107 11.221453 8.131805 7.492733 10.582892 9.113416
## [78] 10.002211 10.148683 8.820958 8.862663 9.729643 12.356174 6.952866
## [85] 11.187892 10.665901 12.126200 9.391632 10.740038 10.534198 8.914960
## [92] 12.415736 12.320805 11.400427 13.173667 11.116973 7.446816 8.853469
## [99] 7.550775 9.053199
Does it look normal to you? It should, given that we pulled it from a normal distribution.
hist(y)
How can we test for it? Maybe a Shapiro-Wilk test (shapiro.test())?
shapiro.test(y)
##
## Shapiro-Wilk normality test
##
## data: y
## W = 0.9956, p-value = 0.9876
The p-value is not significant, so we fail to reject the null hypothesis of normality; the data are consistent with a normal distribution. So, if we have two samples that are both normal and numeric, how can we test whether they’re different?
set.seed(5)
y1 <- rnorm(100, mean=11, sd=2)
y1
## [1] 9.318289 13.768719 8.489016 11.140286 14.422882 9.794184 10.055667
## [8] 9.729257 10.428453 11.276216 13.455261 9.396441 8.839215 10.684931
## [15] 8.856480 10.722028 9.805374 6.632066 11.481635 10.481289 12.801024
## [22] 12.883739 13.935924 12.413522 12.638018 10.413036 13.837178 13.997548
## [29] 9.685836 9.294409 11.631830 13.219388 15.430921 13.434207 13.958444
## [36] 12.903148 8.980935 6.999055 7.475628 10.714784 14.100121 9.395154
## [43] 10.850842 14.791336 10.086862 12.124447 9.225983 10.079511 9.551343
## [50] 10.861578 13.926497 11.375452 13.044046 9.816330 10.775599 9.150094
## [57] 12.506610 10.774782 10.871818 11.466551 8.726834 12.709661 9.843259
## [64] 11.992723 9.479884 10.317227 6.795342 10.396595 8.455233 10.440668
## [71] 10.591805 10.548772 11.694057 11.064736 11.827063 10.689303 12.946971
## [78] 11.242180 11.378347 9.874230 11.996832 7.515395 12.951058 10.951834
## [85] 12.351369 9.579381 15.774465 10.053136 10.848455 9.956320 12.852094
## [92] 8.875178 12.114068 12.801461 12.979891 11.767216 10.306832 9.919621
## [99] 10.634889 10.881401
How do we visualize them together?
hist(y1, col="blue", breaks=20)
hist(y, col="red", breaks=20, add=T)
What statistical test can we apply here?
t.test(y, y1)
##
## Welch Two Sample t-test
##
## data: y and y1
## t = -3.242, df = 197.49, p-value = 0.001393
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.3597919 -0.3311987
## sample estimates:
## mean of x mean of y
## 10.21777 11.06327
We see a fairly basic output telling us that these two samples are significantly different.
What test would we use if they weren’t normally distributed, though? Maybe a Wilcoxon rank-sum test (wilcox.test()).
wilcox.test(y, y1)
##
## Wilcoxon rank sum test with continuity correction
##
## data: y and y1
## W = 3814, p-value = 0.003772
## alternative hypothesis: true location shift is not equal to 0
We can see that it is still significant, but the p-value isn’t as small. That is something that comes up often when comparing the power of parametric vs. non-parametric tests: the non-parametric test is usually a bit less powerful when the data really are normal.
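To make that power comparison concrete, here is a rough simulation sketch (not part of the original lesson; the sample sizes, means, and seed are arbitrary choices). It estimates how often each test rejects at the 0.05 level when the true means really do differ by 1.
set.seed(42)
reps <- 1000
t_rej <- w_rej <- logical(reps)
for (i in seq_len(reps)) {
  a <- rnorm(50, mean=10, sd=2)
  b <- rnorm(50, mean=11, sd=2)
  t_rej[i] <- t.test(a, b)$p.value < 0.05
  w_rej[i] <- wilcox.test(a, b)$p.value < 0.05
}
mean(t_rej)  # estimated power of the t-test
mean(w_rej)  # typically slightly lower for the Wilcoxon test on normal data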
What if they are actually paired though?
t.test(y, y1, paired=TRUE)
##
## Paired t-test
##
## data: y and y1
## t = -3.2742, df = 99, p-value = 0.00146
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.3578724 -0.3331182
## sample estimates:
## mean of the differences
## -0.8454953
We can see that we get different p-values. What else is different? Our degrees of freedom: instead of treating these as 200 independent observations, we now have 2 measurements on each of 100 things, so df = 99.
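As a side note, a paired t-test is equivalent to a one-sample t-test on the pairwise differences; a quick check with the same data gives the same t, df, and p-value:
t.test(y - y1)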
What about the non-parametric alternative?
wilcox.test(y, y1, paired=TRUE)
##
## Wilcoxon signed rank test with continuity correction
##
## data: y and y1
## V = 1612, p-value = 0.001704
## alternative hypothesis: true location shift is not equal to 0
How do we simulate character/group data? We can use rep() with the built-in letters vector to generate group labels. If we use letters from the range 1:2, they come up as “a” and “b”.
rep(letters[1:2])
## [1] "a" "b"
And if we use a different range, we get different letters.
rep(letters[6:16])
## [1] "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p"
So we can set these and assign them to a variable. The length.out argument sets the total length of the vector, alternating “a” and “b” until it reaches that length.
group <- rep(letters[1:2], length.out=200)
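As an aside, here is a quick sketch (with toy lengths, just for illustration) of the difference between length.out and the each argument to rep():
rep(letters[1:2], length.out=6)  # "a" "b" "a" "b" "a" "b"  (alternating)
rep(letters[1:2], each=3)        # "a" "a" "a" "b" "b" "b"  (blocked)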
Let’s set our response data with two different peaks, at means of 10 and 11 (rnorm() recycles the mean vector, so the simulated values alternate between the two means just like the group labels do).
set.seed(6)
response <- rnorm(n=100, mean=c(10, 11), sd=2)
Let’s put this into a dataframe called test.
test <- data.frame(group,response)
We can see that the a rows tend to have lower values. (Note that response has only 100 values while group has 200, so data.frame() recycles response and rows 101–200 repeat rows 1–100.)
test
## group response
## 1 a 10.539212
## 2 b 9.740029
## 3 a 11.737320
## 4 b 14.454391
## 5 a 10.048375
## 6 b 11.736050
## 7 a 7.381591
## 8 b 12.477244
## 9 a 10.089746
## 10 b 8.903206
## 11 a 13.455702
## 12 b 8.642801
## 13 a 11.306413
## 14 b 10.262867
## 15 a 8.800891
## 16 b 11.109210
## 17 a 13.415355
## 18 b 8.811254
## 19 a 9.421436
## 20 b 15.414826
## 21 a 11.037498
## 22 b 8.190164
## 23 a 14.029729
## 24 b 8.623683
## 25 a 10.380762
## 26 b 8.660528
## 27 a 9.923837
## 28 b 15.708409
## 29 a 12.786853
## 30 b 9.879335
## 31 a 8.657081
## 32 b 11.984877
## 33 a 7.641219
## 34 b 8.882565
## 35 a 12.275805
## 36 b 10.679469
## 37 a 11.260986
## 38 b 14.233919
## 39 a 9.613000
## 40 b 7.784416
## 41 a 8.229672
## 42 b 10.135331
## 43 a 9.156752
## 44 b 10.659012
## 45 a 10.491622
## 46 b 9.508504
## 47 a 9.452112
## 48 b 14.649158
## 49 a 10.028467
## 50 b 11.376085
## 51 a 9.891733
## 52 b 11.923223
## 53 a 8.806459
## 54 b 13.526516
## 55 a 7.709342
## 56 b 13.169247
## 57 a 6.942009
## 58 b 7.852516
## 59 a 9.771815
## 60 b 11.222418
## 61 a 10.428275
## 62 b 12.107552
## 63 a 7.880961
## 64 b 7.779901
## 65 a 9.323698
## 66 b 11.409876
## 67 a 9.551310
## 68 b 9.180781
## 69 a 8.383780
## 70 b 12.106167
## 71 a 9.221644
## 72 b 10.105509
## 73 a 9.957722
## 74 b 9.801165
## 75 a 9.378268
## 76 b 9.636735
## 77 a 9.595889
## 78 b 13.233601
## 79 a 11.651998
## 80 b 13.501838
## 81 a 15.216196
## 82 b 10.897920
## 83 a 14.454388
## 84 b 10.972256
## 85 a 6.905201
## 86 b 8.240222
## 87 a 12.959741
## 88 b 10.490156
## 89 a 9.347439
## 90 b 9.546723
## 91 a 6.095303
## 92 b 11.845880
## 93 a 12.363370
## 94 b 12.835919
## 95 a 10.190935
## 96 b 7.631136
## 97 a 11.980659
## 98 b 9.584336
## 99 a 8.811942
## 100 b 8.868206
## 101 a 10.539212
## 102 b 9.740029
## 103 a 11.737320
## 104 b 14.454391
## 105 a 10.048375
## 106 b 11.736050
## 107 a 7.381591
## 108 b 12.477244
## 109 a 10.089746
## 110 b 8.903206
## 111 a 13.455702
## 112 b 8.642801
## 113 a 11.306413
## 114 b 10.262867
## 115 a 8.800891
## 116 b 11.109210
## 117 a 13.415355
## 118 b 8.811254
## 119 a 9.421436
## 120 b 15.414826
## 121 a 11.037498
## 122 b 8.190164
## 123 a 14.029729
## 124 b 8.623683
## 125 a 10.380762
## 126 b 8.660528
## 127 a 9.923837
## 128 b 15.708409
## 129 a 12.786853
## 130 b 9.879335
## 131 a 8.657081
## 132 b 11.984877
## 133 a 7.641219
## 134 b 8.882565
## 135 a 12.275805
## 136 b 10.679469
## 137 a 11.260986
## 138 b 14.233919
## 139 a 9.613000
## 140 b 7.784416
## 141 a 8.229672
## 142 b 10.135331
## 143 a 9.156752
## 144 b 10.659012
## 145 a 10.491622
## 146 b 9.508504
## 147 a 9.452112
## 148 b 14.649158
## 149 a 10.028467
## 150 b 11.376085
## 151 a 9.891733
## 152 b 11.923223
## 153 a 8.806459
## 154 b 13.526516
## 155 a 7.709342
## 156 b 13.169247
## 157 a 6.942009
## 158 b 7.852516
## 159 a 9.771815
## 160 b 11.222418
## 161 a 10.428275
## 162 b 12.107552
## 163 a 7.880961
## 164 b 7.779901
## 165 a 9.323698
## 166 b 11.409876
## 167 a 9.551310
## 168 b 9.180781
## 169 a 8.383780
## 170 b 12.106167
## 171 a 9.221644
## 172 b 10.105509
## 173 a 9.957722
## 174 b 9.801165
## 175 a 9.378268
## 176 b 9.636735
## 177 a 9.595889
## 178 b 13.233601
## 179 a 11.651998
## 180 b 13.501838
## 181 a 15.216196
## 182 b 10.897920
## 183 a 14.454388
## 184 b 10.972256
## 185 a 6.905201
## 186 b 8.240222
## 187 a 12.959741
## 188 b 10.490156
## 189 a 9.347439
## 190 b 9.546723
## 191 a 6.095303
## 192 b 11.845880
## 193 a 12.363370
## 194 b 12.835919
## 195 a 10.190935
## 196 b 7.631136
## 197 a 11.980659
## 198 b 9.584336
## 199 a 8.811942
## 200 b 8.868206
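Rather than eyeballing all 200 rows, a quick way to check the group means (a small aside, not in the original code):
aggregate(response ~ group, data=test, FUN=mean)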
Did this data come out the way we wanted? We can see how R classified each column by using the str() function.
str(test)
## 'data.frame': 200 obs. of 2 variables:
## $ group : Factor w/ 2 levels "a","b": 1 2 1 2 1 2 1 2 1 2 ...
## $ response: num 10.54 9.74 11.74 14.45 10.05 ...
What if we do a basic t-test by group? We use the standard formula notation: the response (dependent) variable first, then “~”, then the independent (grouping) variable. This formula format is pretty standard throughout R.
t.test(response~group, data=test)
##
## Welch Two Sample t-test
##
## data: response by group
## t = -2.1976, df = 197.55, p-value = 0.02914
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.21415080 -0.06567404
## sample estimates:
## mean in group a mean in group b
## 10.15963 10.79954
What about visualizing this? boxplot() takes the same formula format as the t-test.
boxplot(response~group, data=test)
We see a significant difference. Any guesses what the non-parametric version would be?
wilcox.test(response~group, data=test)
##
## Wilcoxon rank sum test with continuity correction
##
## data: response by group
## W = 4176, p-value = 0.0442
## alternative hypothesis: true location shift is not equal to 0
What if we had more than 2 groups? What about an ANOVA? Let’s simulate groups and responses again, but with three groups.
group.aov <- rep(letters[1:3], length.out=300)
response.aov <- rnorm(n=300, mean=c(10, 20, 30), sd=2)
Visualize it.
hist(response.aov)
Hard to see the separate groups, so let’s break it up into more bins.
hist(response.aov, breaks=20)
Put it into a data frame.
test.aov <- data.frame(group.aov, response.aov)
What are the groups?
str(test.aov)
## 'data.frame': 300 obs. of 2 variables:
## $ group.aov : Factor w/ 3 levels "a","b","c": 1 2 3 1 2 3 1 2 3 1 ...
## $ response.aov: num 8.12 20.19 30.13 8.02 19.86 ...
test.aov
## group.aov response.aov
## 1 a 8.118466
## 2 b 20.187821
## 3 c 30.132507
## 4 a 8.018688
## 5 b 19.855770
## 6 c 29.553486
## 7 a 9.714447
## 8 b 21.070813
## 9 c 29.250886
## 10 a 8.163675
## 11 b 18.717555
## 12 c 28.294174
## 13 a 6.769873
## 14 b 22.212981
## 15 c 29.134208
## 16 a 11.206208
## 17 b 20.083927
## 18 c 31.172260
## 19 a 7.929797
## 20 b 21.656384
## 21 c 29.285933
## 22 a 10.374932
## 23 b 19.576971
## 24 c 29.163406
## 25 a 8.502122
## 26 b 19.667918
## 27 c 25.205109
## 28 a 11.766518
## 29 b 19.491791
## 30 c 26.701269
## 31 a 10.332678
## 32 b 21.848179
## 33 c 28.079871
## 34 a 10.960212
## 35 b 22.255098
## 36 c 31.336251
## 37 a 12.713954
## 38 b 22.045840
## 39 c 29.956591
## 40 a 7.012731
## 41 b 18.010948
## 42 c 31.175779
## 43 a 11.578186
## 44 b 22.647095
## 45 c 31.434442
## 46 a 7.238572
## 47 b 20.205753
## 48 c 26.401913
## 49 a 11.946031
## 50 b 20.432972
## 51 c 31.050230
## 52 a 8.913349
## 53 b 16.565962
## 54 c 32.060233
## 55 a 7.705762
## 56 b 19.467741
## 57 c 27.665517
## 58 a 9.575835
## 59 b 16.556371
## 60 c 27.123451
## 61 a 11.669120
## 62 b 20.945517
## 63 c 30.862615
## 64 a 10.942922
## 65 b 23.473781
## 66 c 31.809653
## 67 a 10.580663
## 68 b 18.252207
## 69 c 28.565027
## 70 a 9.644661
## 71 b 22.068034
## 72 c 26.184120
## 73 a 11.594654
## 74 b 19.810298
## 75 c 30.411230
## 76 a 11.767786
## 77 b 16.593736
## 78 c 30.271560
## 79 a 12.358004
## 80 b 20.768582
## 81 c 30.071476
## 82 a 11.868671
## 83 b 19.761274
## 84 c 27.744633
## 85 a 10.835056
## 86 b 20.102938
## 87 c 29.824379
## 88 a 10.209204
## 89 b 17.980738
## 90 c 29.432667
## 91 a 12.259920
## 92 b 19.918231
## 93 c 32.673664
## 94 a 10.803370
## 95 b 18.604896
## 96 c 25.475005
## 97 a 11.575135
## 98 b 19.097051
## 99 c 28.877900
## 100 a 11.246222
## 101 b 22.844228
## 102 c 28.483109
## 103 a 9.889314
## 104 b 19.039083
## 105 c 28.158967
## 106 a 8.704743
## 107 b 21.035866
## 108 c 32.956722
## 109 a 5.086632
## 110 b 17.101096
## 111 c 29.200391
## 112 a 11.033446
## 113 b 19.414927
## 114 c 32.688073
## 115 a 10.506259
## 116 b 18.092459
## 117 c 27.992211
## 118 a 11.101995
## 119 b 20.664737
## 120 c 30.291406
## 121 a 9.156598
## 122 b 20.008371
## 123 c 28.593217
## 124 a 12.897580
## 125 b 17.353879
## 126 c 29.078724
## 127 a 9.014654
## 128 b 24.055756
## 129 c 27.819007
## 130 a 11.451340
## 131 b 20.103265
## 132 c 29.089373
## 133 a 10.973001
## 134 b 20.212591
## 135 c 30.071565
## 136 a 8.816613
## 137 b 21.797356
## 138 c 30.078801
## 139 a 9.784824
## 140 b 22.289945
## 141 c 23.937590
## 142 a 10.368557
## 143 b 19.567477
## 144 c 32.329824
## 145 a 9.098170
## 146 b 19.492188
## 147 c 29.261902
## 148 a 10.757738
## 149 b 17.042097
## 150 c 28.657405
## 151 a 8.872500
## 152 b 16.628596
## 153 c 31.024545
## 154 a 9.564735
## 155 b 23.267770
## 156 c 29.141044
## 157 a 13.446941
## 158 b 24.136333
## 159 c 25.070148
## 160 a 9.675994
## 161 b 19.521651
## 162 c 32.419014
## 163 a 10.365747
## 164 b 21.931635
## 165 c 32.233545
## 166 a 6.749229
## 167 b 18.452126
## 168 c 29.544907
## 169 a 10.433502
## 170 b 20.128300
## 171 c 26.891588
## 172 a 12.923991
## 173 b 19.992731
## 174 c 27.899365
## 175 a 12.836904
## 176 b 18.729112
## 177 c 33.389173
## 178 a 11.526005
## 179 b 19.546884
## 180 c 26.927401
## 181 a 11.496323
## 182 b 17.574170
## 183 c 32.195674
## 184 a 10.983540
## 185 b 22.345474
## 186 c 27.814428
## 187 a 8.630065
## 188 b 19.084652
## 189 c 28.822231
## 190 a 11.429790
## 191 b 19.683849
## 192 c 27.981929
## 193 a 9.121487
## 194 b 18.193941
## 195 c 29.865519
## 196 a 8.833208
## 197 b 19.966695
## 198 c 31.017309
## 199 a 9.175271
## 200 b 18.462114
## 201 c 30.832344
## 202 a 9.919914
## 203 b 17.433766
## 204 c 31.441823
## 205 a 8.317028
## 206 b 18.859323
## 207 c 29.054782
## 208 a 7.877000
## 209 b 20.199965
## 210 c 28.430550
## 211 a 6.420961
## 212 b 17.191754
## 213 c 31.600222
## 214 a 10.633671
## 215 b 17.677639
## 216 c 30.842895
## 217 a 8.836619
## 218 b 21.530380
## 219 c 28.975268
## 220 a 9.463324
## 221 b 18.827826
## 222 c 32.379487
## 223 a 8.670159
## 224 b 20.033358
## 225 c 29.785597
## 226 a 8.721637
## 227 b 23.072799
## 228 c 32.629565
## 229 a 10.687742
## 230 b 21.633836
## 231 c 32.190657
## 232 a 12.237506
## 233 b 16.933606
## 234 c 30.954731
## 235 a 12.092983
## 236 b 22.062440
## 237 c 32.713357
## 238 a 12.155726
## 239 b 18.712618
## 240 c 32.116235
## 241 a 12.635230
## 242 b 19.972564
## 243 c 28.932394
## 244 a 8.768916
## 245 b 21.403168
## 246 c 28.618618
## 247 a 7.951723
## 248 b 20.074731
## 249 c 30.245030
## 250 a 8.496898
## 251 b 19.795980
## 252 c 29.108413
## 253 a 10.762664
## 254 b 20.455936
## 255 c 31.459465
## 256 a 14.121016
## 257 b 17.170203
## 258 c 28.633009
## 259 a 11.127867
## 260 b 19.596413
## 261 c 26.682396
## 262 a 9.561859
## 263 b 21.875482
## 264 c 31.175675
## 265 a 9.820167
## 266 b 13.920519
## 267 c 28.883521
## 268 a 7.341616
## 269 b 21.872522
## 270 c 29.879298
## 271 a 14.206914
## 272 b 18.598988
## 273 c 28.070195
## 274 a 10.472865
## 275 b 22.115792
## 276 c 28.528353
## 277 a 11.724736
## 278 b 19.036243
## 279 c 32.358042
## 280 a 9.579015
## 281 b 19.918456
## 282 c 26.925382
## 283 a 7.433000
## 284 b 20.127299
## 285 c 28.221981
## 286 a 9.965087
## 287 b 23.805990
## 288 c 25.586475
## 289 a 10.603511
## 290 b 20.377359
## 291 c 26.874220
## 292 a 12.677286
## 293 b 18.629817
## 294 c 29.607756
## 295 a 7.025535
## 296 b 16.211755
## 297 c 32.782658
## 298 a 13.216722
## 299 b 22.124407
## 300 c 30.293919
Set up the ANOVA model with aov().
test.aov.fit <- aov(response.aov~group.aov, data=test.aov)
If we use the plot() function on the fitted model, we can see the standard diagnostic plots (residuals vs. fitted, normal Q-Q, etc.).
plot(test.aov.fit)
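By default plot() steps through the diagnostic plots one at a time; a common convenience (not in the original) is to show all four in one window:
op <- par(mfrow=c(2, 2))  # 2 x 2 grid of plots
plot(test.aov.fit)
par(op)  # restore the previous plotting layout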
Look at the output. But we see no p-value or other test information?
test.aov.fit
## Call:
## aov(formula = response.aov ~ group.aov, data = test.aov)
##
## Terms:
## group.aov Residuals
## Sum of Squares 18817.98 1114.08
## Deg. of Freedom 2 297
##
## Residual standard error: 1.936779
## Estimated effects may be unbalanced
Let’s use summary() to get a summary table. We see it’s highly significant, but that shouldn’t be surprising given the means we told it to use.
summary(test.aov.fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## group.aov 2 18818 9409 2508 <2e-16 ***
## Residuals 297 1114 4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the ANOVA tells us there is a significant difference somewhere, but not necessarily between all groups, we can do a post-hoc test, such as a Tukey HSD.
TukeyHSD(test.aov.fit)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = response.aov ~ group.aov, data = test.aov)
##
## $group.aov
## diff lwr upr p adj
## b-a 9.748184 9.103002 10.39337 0
## c-a 19.399908 18.754726 20.04509 0
## c-b 9.651724 9.006541 10.29691 0
The output shows us all of the pairwise comparisons, and we see that all of the groups differ from one another. What if we use different numbers?
Let’s simulate groups and responses again with three groups, but this time two of the means are very close together and the groups have different standard deviations.
group.aov2 <- rep(letters[1:3], length.out=300)
response.aov2 <- rnorm(n=300, mean=c(10, 10.5, 30), sd=c(2,6,1.5))
hist(response.aov2, breaks=20)
test.aov2 <- data.frame(group.aov2, response.aov2)
ANOVA time.
test.aov.fit2 <- aov(response.aov2~group.aov2, data=test.aov2)
summary(test.aov.fit2)
## Df Sum Sq Mean Sq F value Pr(>F)
## group.aov2 2 25880 12940 989.6 <2e-16 ***
## Residuals 297 3883 13
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Still very significant overall, but what about the post-hoc comparisons?
TukeyHSD(test.aov.fit2)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = response.aov2 ~ group.aov2, data = test.aov2)
##
## $group.aov2
## diff lwr upr p adj
## b-a 0.664413 -0.5401587 1.868985 0.3967147
## c-a 20.026745 18.8221736 21.231317 0.0000000
## c-b 19.362332 18.1577606 20.566904 0.0000000
We can see there is not a significant difference between a and b, which is not surprising, given the means we gave them.
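As an aside, the object returned by TukeyHSD() also has a plot method that draws the pairwise confidence intervals; intervals that cross zero (here b-a) are not significant:
plot(TukeyHSD(test.aov.fit2))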
How about using a nonparametric test?
nonpara.aov <- kruskal.test(response.aov2~group.aov2, data=test.aov2)
nonpara.aov
##
## Kruskal-Wallis rank sum test
##
## data: response.aov2 by group.aov2
## Kruskal-Wallis chi-squared = 199.36, df = 2, p-value < 2.2e-16
There are options for post-hoc tests here too, but we will not explore them in detail right now (they may involve other packages).
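One option that stays in base R (a sketch, not covered in this lesson) is pairwise Wilcoxon tests with a correction for multiple comparisons:
pairwise.wilcox.test(test.aov2$response.aov2, test.aov2$group.aov2, p.adjust.method="holm")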
What if we are interested in more than one response (dependent) variable?
Let’s pull in the iris flower dataset
data("iris")
Check out the data
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
We can see that there are a few different numerical traits (width, length, etc) and also species.
So we can ask if these dependent traits are impacted by one independent trait.
We can ask whether sepal length and petal length are significantly different between species.
We can use a MANOVA to answer this question. In order to do this, we need to bind together the two dependent variables (sepal length and petal length) with cbind().
results.manova <- manova(cbind(iris$Sepal.Length, iris$Petal.Length)~iris$Species)
summary(results.manova)
## Df Pillai approx F num Df den Df Pr(>F)
## iris$Species 2 0.9885 71.829 4 294 < 2.2e-16 ***
## Residuals 147
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
But what if we want to look at each trait separately?
summary.aov(results.manova)
## Response 1 :
## Df Sum Sq Mean Sq F value Pr(>F)
## iris$Species 2 63.212 31.606 119.26 < 2.2e-16 ***
## Residuals 147 38.956 0.265
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 2 :
## Df Sum Sq Mean Sq F value Pr(>F)
## iris$Species 2 437.10 218.551 1180.2 < 2.2e-16 ***
## Residuals 147 27.22 0.185
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see that both traits are significantly different by species.
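If we wanted to follow up one trait at a time, here is a sketch (not part of the original code; sepal.fit is just a name chosen here) of the univariate ANOVA plus a Tukey post-hoc test for sepal length:
sepal.fit <- aov(Sepal.Length ~ Species, data=iris)
TukeyHSD(sepal.fit)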
Now that we’ve covered basic response-by-group tests, what about continuous by continuous? Correlation data?
Let’s use other datasets that are already in R, since they can be useful. Let’s call in the Motor Trend car data (mtcars).
data("mtcars")
Let’s check it out
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
There’s a lot to look through here, but let’s keep it simple.
What is directly impacting the fuel efficiency (mpg) of these cars? So we’re going to set up a linear model (lm()) and use the same formula we’ve used before (dependent~independent).
Let’s just start with one variable, weight (wt).
carfit <- lm(mpg~wt, data=mtcars)
You can check out the plots of residuals if you like.
plot(carfit)
What does the output look like?
summary(carfit)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
Highly significant, with a strong R-squared value (the proportion of variation in mpg explained by weight!).
What does it look like?
plot(mpg~wt, data=mtcars)
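We can also draw the fitted regression line from carfit on top of that scatterplot (a small addition, not in the original; the color is arbitrary):
abline(carfit, col="red")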
What if we include another variable? Maybe the number of cylinders (cyl)?
carfit2 <- lm(mpg~wt+cyl, data=mtcars)
We see that it is significant and has a higher R-squared value, but does that matter?
summary(carfit2)
##
## Call:
## lm(formula = mpg ~ wt + cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2893 -1.5512 -0.4684 1.5743 6.1004
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.6863 1.7150 23.141 < 2e-16 ***
## wt -3.1910 0.7569 -4.216 0.000222 ***
## cyl -1.5078 0.4147 -3.636 0.001064 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.568 on 29 degrees of freedom
## Multiple R-squared: 0.8302, Adjusted R-squared: 0.8185
## F-statistic: 70.91 on 2 and 29 DF, p-value: 6.809e-12
Which model is better?
anova(carfit, carfit2)
## Analysis of Variance Table
##
## Model 1: mpg ~ wt
## Model 2: mpg ~ wt + cyl
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 278.32
## 2 29 191.17 1 87.15 13.22 0.001064 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This output shows that there is a significantly better fit with the more complex model (p=0.001064).
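As an aside, another quick way to compare the two models is with an information criterion; lower AIC is better:
AIC(carfit, carfit2)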
But what about an interaction effect of something like how many cylinders it has?
If the car is heavier because of the engine and not something else, maybe there’s an effect?
carfit3 <- lm(mpg~wt+cyl+wt*cyl, data=mtcars)
summary(carfit3)
##
## Call:
## lm(formula = mpg ~ wt + cyl + wt * cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2288 -1.3495 -0.5042 1.4647 5.2344
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.3068 6.1275 8.863 1.29e-09 ***
## wt -8.6556 2.3201 -3.731 0.000861 ***
## cyl -3.8032 1.0050 -3.784 0.000747 ***
## wt:cyl 0.8084 0.3273 2.470 0.019882 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.368 on 28 degrees of freedom
## Multiple R-squared: 0.8606, Adjusted R-squared: 0.8457
## F-statistic: 57.62 on 3 and 28 DF, p-value: 4.231e-12
We see that weight is significant to the model, as is cylinder, BUT we see a significant interaction between cylinder and weight.
What’s going on here? We know that as weight goes up, the mpg goes down.
plot(mpg~wt, data=mtcars)
But we also know that as the number of cylinders goes up, the mpg goes down too.
plot(mpg~cyl, data=mtcars)
What happens to wt as cylinders go up?
plot(wt~cyl, data=mtcars)
If we look at our summary again, we can see that the estimate for the interaction is positive, whereas the estimates for wt and cyl individually are negative. This means that while wt and cyl each negatively impact mpg, as the number of cylinders increases, the effect of wt on mpg gets weaker.
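One way to see this interaction (a sketch, not part of the original lesson; the colors are arbitrary choices) is to fit mpg ~ wt separately within each cylinder group and compare the slopes:
cols <- c("4"="blue", "6"="darkgreen", "8"="red")
plot(mpg ~ wt, data=mtcars, col=cols[as.character(mtcars$cyl)], pch=19)
for (k in c(4, 6, 8)) {
  abline(lm(mpg ~ wt, data=mtcars, subset = cyl == k), col=cols[as.character(k)])
}
legend("topright", legend=paste(c(4, 6, 8), "cyl"), col=cols, lty=1)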
There is an interaction between these variables, and if we compare the simpler wt + cyl model to our more complex model, the more complex model is significantly better.
anova(carfit2, carfit3)
## Analysis of Variance Table
##
## Model 1: mpg ~ wt + cyl
## Model 2: mpg ~ wt + cyl + wt * cyl
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 29 191.17
## 2 28 156.98 1 34.196 6.0995 0.01988 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
You can make fancier plots of these models, but that is something you can do more easily outside of base R (ggplot2, etc.).
So what if we have categorical/character data? Maybe something from a survey or phenotypes from genetic crosses?
What if we want to know if there is a more common combination of hair and eye color? Let’s load the available dataset in R called HairEyeColor.
data("HairEyeColor")
Let’s look at it
str(HairEyeColor)
## 'table' num [1:4, 1:4, 1:2] 32 53 10 3 11 50 10 30 10 25 ...
## - attr(*, "dimnames")=List of 3
## ..$ Hair: chr [1:4] "Black" "Brown" "Red" "Blond"
## ..$ Eye : chr [1:4] "Brown" "Blue" "Hazel" "Green"
## ..$ Sex : chr [1:2] "Male" "Female"
HairEyeColor
## , , Sex = Male
##
## Eye
## Hair Brown Blue Hazel Green
## Black 32 11 10 3
## Brown 53 50 25 15
## Red 10 10 7 7
## Blond 3 30 5 8
##
## , , Sex = Female
##
## Eye
## Hair Brown Blue Hazel Green
## Black 36 9 5 2
## Brown 66 34 29 14
## Red 16 7 7 7
## Blond 4 64 5 8
It’s in kind of a weird format, and for our question we are not worried about sex. We can use the margin.table() function to collapse over sex and give us a marginal frequency table of hair by eye color.
HairEyeNew <- margin.table(HairEyeColor, margin=c(1,2))
Let’s see what it looks like.
HairEyeNew
## Eye
## Hair Brown Blue Hazel Green
## Black 68 20 15 5
## Brown 119 84 54 29
## Red 26 17 14 14
## Blond 7 94 10 16
We can see it is all in a usable table now, but raw numbers are not what we are interested in.
What kind of test would we use to see if something is more or less common than expected? A chi-square test is common.
We can use chisq.test() on the marginal frequency table we saved.
chisq.test(HairEyeNew)
##
## Pearson's Chi-squared test
##
## data: HairEyeNew
## X-squared = 138.29, df = 9, p-value < 2.2e-16
We see that something is significant, but we want to see what the actual frequencies are. To calculate the frequencies, we divide each count by the grand total.
HairEyeNew/sum(HairEyeNew)
## Eye
## Hair Brown Blue Hazel Green
## Black 0.114864865 0.033783784 0.025337838 0.008445946
## Brown 0.201013514 0.141891892 0.091216216 0.048986486
## Red 0.043918919 0.028716216 0.023648649 0.023648649
## Blond 0.011824324 0.158783784 0.016891892 0.027027027
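To see which hair/eye combinations are driving that significance (one step beyond the original lesson), we can look at the Pearson residuals stored in the chisq.test() result; positive values mark cells with more observations than expected, negative values fewer:
chisq.test(HairEyeNew)$residuals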