Shaxi Liver - 3 months ago
R Question

T-test between vectors subsetted from a data frame

I would like to perform a t.test to get the p-value between specified vectors. Let's use the data below as an example:

df <- structure(list(mpg = c(21, 21, 22.8, 21.4, 18.7, 18.1, 14.3,
24.4, 22.8, 19.2, 17.8, 16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4,
30.4, 33.9, 21.5, 15.5, 15.2, 13.3, 19.2, 27.3, 26, 30.4, 15.8,
19.7, 15, 21.4), cyl = c(6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8,
8, 8, 8, 8, 8, 4, 4, 4, 4, 8, 8, 8, 8, 4, 4, 4, 8, 6, 8, 4),
disp = c(160, 160, 108, 258, 360, 225, 360, 146.7, 140.8,
167.6, 167.6, 275.8, 275.8, 275.8, 472, 460, 440, 78.7, 75.7,
71.1, 120.1, 318, 304, 350, 400, 79, 120.3, 95.1, 351, 145,
301, 121), hp = c(110, 110, 93, 110, 175, 105, 245, 62, 95,
123, 123, 180, 180, 180, 205, 215, 230, 66, 52, 65, 97, 150,
150, 245, 175, 66, 91, 113, 264, 175, 335, 109), drat = c(3.9,
3.9, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,
3.07, 3.07, 3.07, 2.93, 3, 3.23, 4.08, 4.93, 4.22, 3.7, 2.76,
3.15, 3.73, 3.08, 4.08, 4.43, 3.77, 4.22, 3.62, 3.54, 4.11
), wt = c(2.62, 2.875, 2.32, 3.215, 3.44, 3.46, 3.57, 3.19,
3.15, 3.44, 3.44, 4.07, 3.73, 3.78, 5.25, 5.424, 5.345, 2.2,
1.615, 1.835, 2.465, 3.52, 3.435, 3.84, 3.845, 1.935, 2.14,
1.513, 3.17, 2.77, 3.57, 2.78), qsec = c(16.46, 17.02, 18.61,
19.44, 17.02, 20.22, 15.84, 20, 22.9, 18.3, 18.9, 17.4, 17.6,
18, 17.98, 17.82, 17.42, 19.47, 18.52, 19.9, 20.01, 16.87,
17.3, 15.41, 17.05, 18.9, 16.7, 16.9, 14.5, 15.5, 14.6, 18.6
), vs = c(0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0,
0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1), am = c(1,
1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1), gear = c(4, 4, 4, 3,
3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3,
3, 3, 4, 5, 5, 5, 5, 5, 4), carb = c("M_PP", "O_PP", "C_PP", "K_MM",
"T_MM", "C_MM", "R_PP", "E_PP", "W_PP", "Q_PP", "R_MM", "T_MM",
"V_MM", "Q_MM", "F_PP", "D_PP", "S_PP", "Z_PP", "K_PP", "G_PP", "F_MM",
"D_MM", "S_MM", "Z_MM", "K_MM", "F_MM", "A_PP", "D_PP", "T_PP",
"R_MM", "D_MM", "T_MM"), Name = c("Mark", "Mark", "Mark", "Mark",
"Mark", "Mark", "Tom", "Tom", "Tom", "Tom", "Tom", "Tom",
"Tom", "Tom", "Tim", "Tim", "Tim", "Tim", "Tim", "Tim", "Tim",
"Tim", "Tim", "Tim", "Tim", "Tim", "Greg", "Greg", "Greg",
"Greg", "Greg", "Greg")), .Names = c("mpg", "cyl", "disp",
"hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb", "Name"
), row.names = c(NA, -32L), class = "data.frame")


Below is one of the groups that can be distinguished in this data frame:

mpg cyl disp hp drat wt qsec vs am gear carb Name
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 M_PP Mark
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 O_PP Mark
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 C_PP Mark
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 K_MM Mark
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 T_MM Mark
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 C_MM Mark


So, I would like to perform a t.test between the PP and MM subgroups of Mark (taken from the suffix in the carb column). The column I am interested in is gear. I would like to know whether the difference in the number of gears is statistically significant between those subgroups.

The same analysis should be performed for every group like Mark in this data.

The results (p-values) can be stored in the same data frame in an additional column. This means each p-value will be repeated in all the rows belonging to the same group.

Answer

It is quite straightforward with dplyr:

library(dplyr)
df %>% 
  group_by(Name) %>% 
  # carb1 keeps only the suffix after the underscore ("PP" or "MM")
  mutate(carb1 = gsub('.*_', '', carb),
         p_values = t.test(cyl[carb1 == 'PP'], cyl[carb1 == 'MM'])$p.value) %>% 
  select(-carb1)

#Source: local data frame [32 x 13]
#Groups: Name [4]

#     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb  Name  p_values
#   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>     <dbl>
#1   21.0     6 160.0   110  3.90 2.620 16.46     0     1     4  M_PP  Mark 0.2301996
#2   21.0     6 160.0   110  3.90 2.875 17.02     0     1     4  O_PP  Mark 0.2301996
#3   22.8     4 108.0    93  3.85 2.320 18.61     1     1     4  C_PP  Mark 0.2301996
#4   21.4     6 258.0   110  3.08 3.215 19.44     1     0     3  K_MM  Mark 0.2301996
#5   18.7     8 360.0   175  3.15 3.440 17.02     0     0     3  T_MM  Mark 0.2301996
#6   18.1     6 225.0   105  2.76 3.460 20.22     1     0     3  C_MM  Mark 0.2301996
#7   14.3     8 360.0   245  3.21 3.570 15.84     0     0     3  R_PP   Tom 0.1294094
#8   24.4     4 146.7    62  3.69 3.190 20.00     1     0     4  E_PP   Tom 0.1294094
#9   22.8     4 140.8    95  3.92 3.150 22.90     1     0     4  W_PP   Tom 0.1294094
#10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4  Q_PP   Tom 0.1294094
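
As a quick sanity check, the p-value for a single group can be reproduced by hand. The sketch below hard-codes Mark's cyl values from the data above (the vector names pp and mm are only for illustration) and relies on t.test's default Welch two-sample test, which is the same call the dplyr pipeline makes.

```r
# cyl values for Mark, split by carb suffix (copied from the data above)
pp <- c(6, 6, 4)  # rows with carb ending in "_PP"
mm <- c(6, 8, 6)  # rows with carb ending in "_MM"

# Welch two-sample t-test (t.test's default); extract only the p-value
p_mark <- t.test(pp, mm)$p.value
print(p_mark)  # should agree with the p_values column for Mark (0.2301996)
```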

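NOTE: I used cyl because gear throws the error

Error: data are essentially constant

In Mark's group, for example, gear is 4 in every PP row and 3 in every MM row, so both samples have zero variance and t.test cannot compute a statistic.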
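
If gear is still the column of interest, one possible workaround is to catch the error and store NA for groups where the test cannot be computed. This is only a sketch under assumptions: safe_p is a hypothetical helper (not part of dplyr or base R), small is a hand-copied subset of the question's data containing just the Mark and Tom groups, and returning NA is one choice among several for handling constant data.

```r
library(dplyr)

# Hypothetical helper: Welch t-test p-value, or NA if t.test() raises an error
safe_p <- function(x, y) {
  tryCatch(t.test(x, y)$p.value, error = function(e) NA_real_)
}

# Subset of the question's data: gear, carb and Name for Mark and Tom only
small <- data.frame(
  gear = c(4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3),
  carb = c("M_PP", "O_PP", "C_PP", "K_MM", "T_MM", "C_MM",
           "R_PP", "E_PP", "W_PP", "Q_PP", "R_MM", "T_MM", "V_MM", "Q_MM"),
  Name = rep(c("Mark", "Tom"), times = c(6, 8)),
  stringsAsFactors = FALSE
)

res <- small %>%
  group_by(Name) %>%
  # carb1 holds the suffix after the underscore ("PP" or "MM")
  mutate(carb1 = sub(".*_", "", carb),
         p_values = safe_p(gear[carb1 == "PP"], gear[carb1 == "MM"])) %>%
  select(-carb1) %>%
  ungroup()

print(res)
```

With this subset, Mark's rows get NA (both gear samples are constant), while Tom's rows get an ordinary p-value because gear varies within each of his subgroups.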