Mamba Mamba - 3 months ago 5
R Question

How to iterate over groups and combinations of factors to t-test the differences in means?

I have the following data structure:

date <- as.Date(as.character( c("2015-02-13",
"2015-02-13",
"2015-02-13",
"2015-02-13",
"2015-02-13",
"2015-02-13",
"2015-02-13",
"2015-02-13",
"2015-02-13",
"2015-02-14",
"2015-02-14",
"2015-02-14",
"2015-02-14",
"2015-02-14",
"2015-02-14",
"2015-02-14",
"2015-02-14",
"2015-02-14",
"2015-02-15",
"2015-02-15",
"2015-02-15",
"2015-02-15",
"2015-02-15",
"2015-02-15",
"2015-02-15",
"2015-02-15",
"2015-02-15")))

name <- c("John","Michael","Thomas",
"John","Michael","Thomas",
"John","Michael","Thomas",
"John","Michael","Thomas",
"John","Michael","Thomas",
"John","Michael","Thomas",
"John","Michael","Thomas",
"John","Michael","Thomas",
"John","Michael","Thomas")

drinks <-c("Beer","Coffee","Tee",
"Tee","Beer", "Coffee",
"Coffee","Tee","Beer",
"Beer","Coffee","Tee",
"Tee","Beer", "Coffee",
"Coffee","Tee","Beer",
"Beer","Coffee","Tee",
"Tee","Beer", "Coffee",
"Coffee","Tee","Beer")



consumed <- c(3,2,5,3,6,2,9,4,5,
1,3,5,8,0,1,2,3,5,
1,24,4,5,7,9,9,1,2)

version_1 <- data.frame(date,name,drinks,consumed)


My second data frame is almost identical except for consumption:

consumed <- c(10,9,1,20,30,1,50,40,20,
10,2,10,2,1,1,2,3,5,
20,24,1,40,2,8,4,0,0)

version_2 <- data.frame(date,name,drinks,consumed)


version_1$version <- rep("one", nrow(version_1))
version_2$version <- rep("two", nrow(version_2))
all <- rbind(version_1, version_2)

all$version <- as.factor(all$version)

date name drinks consumed version
1 2015-02-13 John Beer 3 one
2 2015-02-13 Michael Coffee 2 one
3 2015-02-13 Thomas Tee 5 one
4 2015-02-13 John Tee 3 one
5 2015-02-13 Michael Beer 6 one
6 2015-02-13 Thomas Coffee 2 one
7 2015-02-13 John Coffee 9 one
8 2015-02-13 Michael Tee 4 one
9 2015-02-13 Thomas Beer 5 one
10 2015-02-14 John Beer 1 one
11 2015-02-14 Michael Coffee 3 one
12 2015-02-14 Thomas Tee 5 one
13 2015-02-14 John Tee 8 one
14 2015-02-14 Michael Beer 0 one
15 2015-02-14 Thomas Coffee 1 one
16 2015-02-14 John Coffee 2 one
17 2015-02-14 Michael Tee 3 one
18 2015-02-14 Thomas Beer 5 one
19 2015-02-15 John Beer 1 one
20 2015-02-15 Michael Coffee 24 one
21 2015-02-15 Thomas Tee 4 one
22 2015-02-15 John Tee 5 one
23 2015-02-15 Michael Beer 7 one
24 2015-02-15 Thomas Coffee 9 one
25 2015-02-15 John Coffee 9 one
26 2015-02-15 Michael Tee 1 one
27 2015-02-15 Thomas Beer 2 one
28 2015-02-13 John Beer 10 two
29 2015-02-13 Michael Coffee 9 two
30 2015-02-13 Thomas Tee 1 two
31 2015-02-13 John Tee 20 two
32 2015-02-13 Michael Beer 30 two
33 2015-02-13 Thomas Coffee 1 two
34 2015-02-13 John Coffee 50 two
35 2015-02-13 Michael Tee 40 two
36 2015-02-13 Thomas Beer 20 two
37 2015-02-14 John Beer 10 two
38 2015-02-14 Michael Coffee 2 two
39 2015-02-14 Thomas Tee 10 two
40 2015-02-14 John Tee 2 two
41 2015-02-14 Michael Beer 1 two
42 2015-02-14 Thomas Coffee 1 two
43 2015-02-14 John Coffee 2 two
44 2015-02-14 Michael Tee 3 two
45 2015-02-14 Thomas Beer 5 two
46 2015-02-15 John Beer 20 two
47 2015-02-15 Michael Coffee 24 two
48 2015-02-15 Thomas Tee 1 two
49 2015-02-15 John Tee 40 two
50 2015-02-15 Michael Beer 2 two
51 2015-02-15 Thomas Coffee 8 two
52 2015-02-15 John Coffee 4 two
53 2015-02-15 Michael Tee 0 two
54 2015-02-15 Thomas Beer 0 two


I would like to loop over the data frame and t-test the difference in means between the two versions (one vs. two). Each day has exactly one observation for every combination of name and drink. Thus I would like to test:

2015-02-13 John Beer 3 one
2015-02-14 John Beer 1 one
2015-02-15 John Beer 1 one

versus

2015-02-13 John Beer 10 two
2015-02-14 John Beer 10 two
2015-02-15 John Beer 20 two

and so on for each name and drink pair, with the dates serving as the observations.

I just can't figure out how to achieve that:

for (i in 1:length(date)){
temp <- all[all$date==date[i],]

}

Answer

Using data.table:

library(data.table)
setDT(all)

all[, t.test(consumed[version == "one"], consumed[version == "two"]), by = .(name,drinks)]
      name drinks  statistic parameter    p.value   conf.int  estimate null.value alternative                  method                                                 data.name
 1:    John   Beer -3.4320324  2.159744 0.06761534 -25.303554  1.666667          0   two.sided Welch Two Sample t-test consumed[version == "one"] and consumed[version == "two"]
 2:    John   Beer -3.4320324  2.159744 0.06761534   1.970221 13.333333          0   two.sided Welch Two Sample t-test consumed[version == "one"] and consumed[version == "two"]
 3: Michael Coffee -0.2067737  3.960582 0.84638132 -28.960658  9.666667          0   two.sided Welch Two Sample t-test consumed[version == "one"] and consumed[version == "two"]
 4: Michael Coffee -0.2067737  3.960582 0.84638132  24.960658 11.666667          0   two.sided Welch Two Sample t-test consumed[version == "one"] and consumed[version == "two"]
 5:  Thomas    Tee  0.2208631  2.049375 0.84525800 -12.025434  4.666667          0   two.sided Welch Two Sample t-test consumed[version == "one"] and consumed[version == "two"]
 6:  Thomas    Tee  0.2208631  2.049375 0.84525800  13.358768  4.000000          0   two.sided Welch Two Sample t-test consumed[version == "one"] and consumed[version == "two"]
 7:    John    Tee -1.3850647  2.070089 0.29640280 -61.453187  5.333333          0   two.sided Welch Two Sample t-test consumed[version == "one"] and consumed[version == "two"]
 8:    John    Tee -1.3850647  2.070089 0.29640280  30.786521 20.666667          0   two.sided Welch Two Sample t-test consumed[version == "one"] and consumed[version == "two"]
 9: Michael   Beer -0.6835859  2.210972 0.55885626 -45.015433  4.333333          0   two.sided Welch Two Sample t-test consumed[version == "one"] and consumed[version == "two"]
10: Michael   Beer -0.6835859  2.210972 0.55885626  31.682100 11.000000          0   two.sided Welch Two Sample t-test consumed[version == "one"] and consumed[version == "two"]
11:  Thomas Coffee  0.1942572  3.977345 0.85549254  -8.883193  4.000000          0   two.sided Welch Two Sample t-test consumed[version == "one"] and consumed[version == "two"]
12:  Thomas Coffee  0.1942572  3.977345 0.85549254  10.216527  3.333333          0   two.sided Welch Two Sample t-test consumed[version == "one"] and consumed[version == "two"]
13:    John Coffee -0.7570982  2.088564 0.52510317 -77.499374  6.666667          0   two.sided Welch Two Sample t-test consumed[version == "one"] and consumed[version == "two"]
14:    John Coffee -0.7570982  2.088564 0.52510317  53.499374 18.666667          0   two.sided Welch Two Sample t-test consumed[version == "one"] and consumed[version == "two"]
15: Michael    Tee -0.9049035  2.018804 0.46026242 -66.647341  2.666667          0   two.sided Welch Two Sample t-test consumed[version == "one"] and consumed[version == "two"]
16: Michael    Tee -0.9049035  2.018804 0.46026242  43.314008 14.333333          0   two.sided Welch Two Sample t-test consumed[version == "one"] and consumed[version == "two"]
17:  Thomas   Beer -0.7113284  2.110684 0.54726281 -29.270500  4.000000          0   two.sided Welch Two Sample t-test consumed[version == "one"] and consumed[version == "two"]
18:  Thomas   Beer -0.7113284  2.110684 0.54726281  20.603833  8.333333          0   two.sided Welch Two Sample t-test consumed[version == "one"] and consumed[version == "two"]

This runs t.test on the two groups (consumed[version == "one"], consumed[version == "two"]) separately for each name and drink pair (by = .(name,drinks)).

Each pair produces two rows because the confidence interval and the estimate each contain two values (lower/upper bound and mean of x/mean of y). All other columns hold a single value, which is repeated on both rows.
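The two-row effect can be checked on a single pair in base R. This is a minimal sketch using the John/Beer values from the question; the variable names x, y and tt are only illustrative.

```r
# John / Beer observations taken from the question's data
x <- c(3, 1, 1)     # version "one"
y <- c(10, 10, 20)  # version "two"

tt <- t.test(x, y)  # Welch two-sample t-test (the default)

# Length of each component: conf.int and estimate hold two values each,
# which is why data.table expands every group to two rows
n <- lengths(unclass(tt)[c("statistic", "p.value", "conf.int", "estimate")])
print(n)
# statistic   p.value  conf.int  estimate
#         1         1         2         2
```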

We can avoid this by storing the result in our data.table as a list, by wrapping in list(...):

result <- all[, .(ttest = list(t.test(consumed[version == "one"], consumed[version == "two"]))), by = .(name,drinks)]
result
      name drinks   ttest
1:    John   Beer <htest>
2: Michael Coffee <htest>
3:  Thomas    Tee <htest>
4:    John    Tee <htest>
5: Michael   Beer <htest>
6:  Thomas Coffee <htest>
7:    John Coffee <htest>
8: Michael    Tee <htest>
9:  Thomas   Beer <htest>

We can then retrieve a single result with:

result[name == "John" & drinks == "Beer", ttest]
[[1]]

    Welch Two Sample t-test

data:  consumed[version == "one"] and consumed[version == "two"]
t = -3.432, df = 2.1597, p-value = 0.06762
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -25.303554   1.970221
sample estimates:
mean of x mean of y 
 1.666667 13.333333 
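If a flat table with one row per pair is more convenient than a list column, the scalar pieces can be pulled out inside j. The following is only a sketch under assumptions: it rebuilds just two of the nine pairs from the question so that it runs on its own, and the names drinks_dt, summary_dt and the output column names are made up for illustration.

```r
library(data.table)

# Small slice of the question's data: John/Beer and Michael/Coffee only
drinks_dt <- data.table(
  name     = rep(c("John", "Michael"), each = 6),
  drinks   = rep(c("Beer", "Coffee"), each = 6),
  version  = rep(rep(c("one", "two"), each = 3), times = 2),
  consumed = c(3, 1, 1, 10, 10, 20,   # John / Beer: one, then two
               2, 3, 24, 9, 2, 24)    # Michael / Coffee: one, then two
)

# Run the test once per pair and keep only length-one values,
# so every pair yields exactly one row
summary_dt <- drinks_dt[, {
  tt <- t.test(consumed[version == "one"], consumed[version == "two"])
  .(t.stat   = unname(tt$statistic),
    df       = unname(tt$parameter),
    p.value  = tt$p.value,
    ci.lower = tt$conf.int[1],
    ci.upper = tt$conf.int[2],
    mean.one = unname(tt$estimate[1]),
    mean.two = unname(tt$estimate[2]))
}, by = .(name, drinks)]

print(summary_dt)
```

With the full data the same call should return nine rows, one per name and drink pair. Note that with only three observations per version, each test has very little power.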