user2117897 - 6 months ago 21

R Question

so here is what i am trying to do. i have a data set that has all outcomes listed in one column, but the step they were observed at and the method used to observe them are in separate columns. there are multiple sites which i am treating as a unique identifier. not every site has the same number of steps or methods and not every method is done at every step. for example, site a1 may have steps s1-s5 and method m1-m25 at each step, while site a9 may have steps s1-s15, but only methods m3-m9. basically, there can be missing data for a given site/step/method. not every site has every step, and not every step has every method. the raw data set looks a little like this:

`site step method outcome`

a1 S 1 m1 5

a1 S 1 m2 1

a1 S 2 m6 4

a2 S 1 m6 1a

a2 S 1 m4 3

a2 S 3 m7 2

a2 S 4 m2 7

a3 S 1 m1 2a

a3 S 1 m2 c11

a4 S 1 m4 2

a4 S 2 m2 5

a5 S 3 m3 6

a6 S 2 m1 7

a6 S 3 m4 8

outcome has some numeric and character values, depending on method

step is the only real "time" variable, but i feel that i need to have r treat method as one as well. the way the data is now there are lots of rows and just these few columns, and the way the data is set up now i am having trouble running any analysis on it.

i used reshape as follows( i have tried other reshape statements but this is one for example), wont let me use 2 time variables unfortuneately

`mydata<-reshape(rawdata,idvar="site",timevar="step",direction="wide")`

site method.S 1 outcome.S 1 method.S 2 outcome.S 2 method.S 3 outcome.S 3

1 a1 m1 5 m6 4 <NA> NA

4 a2 m6 1 <NA> NA m7 2

8 a3 m1 2 <NA> NA <NA> NA

10 a4 m4 2 m2 5 <NA> NA

12 a5 <NA> NA <NA> NA m3 6

13 a6 <NA> NA m1 7 m4 8

method.S 4 outcome.S 4

1 <NA> NA

4 m2 7

8 <NA> NA

10 <NA> NA

12 <NA> NA

13 <NA> NA

this is the output from r

it is correct that i only want to end up with 1 row per site, and many columns (even if a site had nothing done at a particular step). i am trying to get 1 row per site and then the outcome column will go away with all its values beneath an appropriate column like so

`site S1.m1.outcome S1.m2.outcome S1.m3.outcome ................ S9.m10.outcome`

a1 1 c4.5 NA 3.6

so basically one column per step and method combination, and i know that is alot of columns, but it will make it much easier to compare between steps which is one of my goals

my main point of doing this is to be able to, for a given method, test changes in outcomes between steps using t-tests and such for differences in means.

i imagine there is an easier way to go about doing the tests but i am still new to r and haven't found one yet.

thanks for any advice

cheers

Answer

Here are two options, if I understand your desired output correctly. In the examples below, for the last steps, I've sorted the columns so they would match from using each output, and I've only shown the first few and the last few columns from the resulting `data.frame`

s so you can see that they are giving you the same results. In other words, you should be able to stop at "T2" and "T3" for your actual needs--the rest is just for demonstration.

`reshape`

twice```
T1 <- reshape(rawdata, idvar = c("site", "method"),
timevar = "step", direction = "wide")
T2 <- reshape(T1, direction = "wide", idvar = "site", timevar = "method")
dim(T2)
# [1] 6 25
names(T2)
# [1] "site" "outcome.S1.m1" "outcome.S2.m1" "outcome.S3.m1" "outcome.S4.m1"
# [6] "outcome.S1.m2" "outcome.S2.m2" "outcome.S3.m2" "outcome.S4.m2" "outcome.S1.m6"
# [11] "outcome.S2.m6" "outcome.S3.m6" "outcome.S4.m6" "outcome.S1.m4" "outcome.S2.m4"
# [16] "outcome.S3.m4" "outcome.S4.m4" "outcome.S1.m7" "outcome.S2.m7" "outcome.S3.m7"
# [21] "outcome.S4.m7" "outcome.S1.m3" "outcome.S2.m3" "outcome.S3.m3" "outcome.S4.m3"
T2a <- T2[, order(names(T2))]
T2a[, c(1:3, 23:25)]
# outcome.S1.m1 outcome.S1.m2 outcome.S1.m3 outcome.S4.m6 outcome.S4.m7 site
# 1 5 1 <NA> <NA> <NA> a1
# 4 <NA> <NA> <NA> <NA> <NA> a2
# 8 2a c11 <NA> <NA> <NA> a3
# 10 <NA> <NA> <NA> <NA> <NA> a4
# 12 <NA> <NA> <NA> <NA> <NA> a5
# 13 <NA> <NA> <NA> <NA> <NA> a6
```

`dcast`

from "reshape2"```
library(reshape2)
T3 <- dcast(rawdata, site ~ step + method, value.var = "outcome", drop = FALSE)
dim(T3)
# [1] 6 25
names(T3)
# [1] "site" "S1_m1" "S1_m2" "S1_m3" "S1_m4" "S1_m6" "S1_m7" "S2_m1" "S2_m2" "S2_m3"
# [11] "S2_m4" "S2_m6" "S2_m7" "S3_m1" "S3_m2" "S3_m3" "S3_m4" "S3_m6" "S3_m7" "S4_m1"
# [21] "S4_m2" "S4_m3" "S4_m4" "S4_m6" "S4_m7"
T3a <- T3[, order(names(T3))]
T3a[, c(1:3, 23:25)]
# S1_m1 S1_m2 S1_m3 S4_m6 S4_m7 site
# 1 5 1 <NA> <NA> <NA> a1
# 2 <NA> <NA> <NA> <NA> <NA> a2
# 3 2a c11 <NA> <NA> <NA> a3
# 4 <NA> <NA> <NA> <NA> <NA> a4
# 5 <NA> <NA> <NA> <NA> <NA> a5
# 6 <NA> <NA> <NA> <NA> <NA> a6
```

Both use the following as the input for "rawdata"

```
rawdata <- structure(list(site = c("a1", "a1", "a1", "a2", "a2", "a2", "a2",
"a3", "a3", "a4", "a4", "a5", "a6", "a6"), step = c("S1", "S1",
"S2", "S1", "S1", "S3", "S4", "S1", "S1", "S1", "S2", "S3", "S2",
"S3"), method = c("m1", "m2", "m6", "m6", "m4", "m7", "m2", "m1",
"m2", "m4", "m2", "m3", "m1", "m4"), outcome = c("5", "1", "4",
"1a", "3", "2", "7", "2a", "c11", "2", "5", "6", "7", "8")), .Names = c("site",
"step", "method", "outcome"), row.names = c(NA, -14L), class = "data.frame")
```