Ben Ben - 14 days ago 5
R Question

How to merge several columns of the same dataframe?

I have one big data frame containing different measurements performed by several probes.

The measurement times are not exactly the same from one probe to the next. As I want to compare the measurements at a given time and plot them in an animation, I need my data to be "synchronized".

Here is an example of the dataframe I get (in real life I have way more columns that I read directly from a text file):

time1.in.s <- seq(0.010, 100, length.out = 100)
time2.in.s <- seq(0.022, 100, length.out = 100)
data1 <- seq(-10, 100, length.out = 100)
data2 <- seq(-25, 80, length.out = 100)

my.df <- data.frame(time1.in.s, data1, time2.in.s, data2)


Which gives:

time1.in.s data1 time2.in.s data2
1 0.01 -10.000000 0.022000 -25.0000000
2 1.02 -8.888889 1.031879 -23.9393939
3 2.03 -7.777778 2.041758 -22.8787879
4 3.04 -6.666667 3.051636 -21.8181818
5 4.05 -5.555556 4.061515 -20.7575758
6 5.06 -4.444444 5.071394 -19.6969697


What I want to do is merge the two timeX.in.s columns into a single "time" column. Where data is not available, I would have NAs that I could fill in with something like na.approx(my.df$data1, x = my.df$time).
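To illustrate why the x argument matters here, a minimal sketch follows. It assumes the zoo package (which provides na.approx) is installed; the vectors time and data are made-up toy values, not part of the real data set.

```r
library(zoo)  # assumption: zoo is installed; it provides na.approx()

# Toy values with unevenly spaced time stamps
time <- c(0, 1, 4)
data <- c(0, NA, 40)

# Interpolate the missing value along the time axis rather than by position
filled <- na.approx(data, x = time)
filled
```

With x supplied, the gap at time 1 is filled according to its distance from the neighbouring time stamps, not as the midpoint of the neighbouring values.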

This code is given so that you can reproduce the problem, but in real life time1.in.s, time2.in.s, data1 and data2 are not available separately. What I actually do is my.df <- read.table(my.file, header = TRUE), and I get the same result. I therefore cannot build the separate data frames directly; I need to split the one big data frame into several manually:

df.list <- list()
for (i in seq(1, ncol(my.df), 2)) {
  df.list[[ceiling(i/2)]] <- data.frame(time = my.df[, i], data = my.df[, i+1])
}


Then merge the dataframes one by one:

merged.df <- data.frame(time = as.numeric(NA), data = as.numeric(NA))
for (i in 1:length(df.list)) {
merged.df <- merge(merged.df, df.list[[i]], by = "time", all = TRUE)
}


And finally fill in the gaps:

merged.df$data.y <- na.approx(merged.df$data.y, x = merged.df$time, na.rm = FALSE)


That definitely works (except the names of the columns are a big mess). But it is cumbersome and doesn't look very R to me. Is there a simpler way to do this?

Here is the result obtained with the above commands:

> head(merged.df)
time data.x data.y data
1 0.010000 NA -10.000000 NA
2 0.022000 NA -9.986799 -25.00000
3 1.020000 NA -8.888889 NA
4 1.031879 NA -8.875821 -23.93939
5 2.030000 NA -7.777778 NA
6 2.041758 NA -7.764843 -22.87879


Column data.x comes from the initial empty merged.df. It can be dumped.
Column data.y is the my.df$data1 column.
In the above data frame, I did not apply na.approx to column data (which corresponds to the my.df$data2 column).

Additional note on OmaymaS' proposed solution:

To make this work in the general case (i.e. with any number of columns), I did the following. First, I defined a six-column data frame:

time1.in.s <- seq(0.010, 100, length.out = 100)
time2.in.s <- seq(0.022, 100, length.out = 100)
time3.in.s <- seq(0.017, 99.8, length.out = 100)
data1 <- seq(-10, 100, length.out = 100)
data2 <- seq(-25, 80, length.out = 100)
data3 <- seq(-15, 70, length.out = 100)

my.df <- data.frame(time1.in.s, data1, time2.in.s, data2, time3.in.s, data3)


This leads to:

head(my.df)
time1.in.s data1 time2.in.s data2 time3.in.s data3
1 0.01 -10.000000 0.022000 -25.00000 0.017000 -15.00000
2 1.02 -8.888889 1.031879 -23.93939 1.024909 -14.14141
3 2.03 -7.777778 2.041758 -22.87879 2.032818 -13.28283
4 3.04 -6.666667 3.051636 -21.81818 3.040727 -12.42424
5 4.05 -5.555556 4.061515 -20.75758 4.048636 -11.56566
6 5.06 -4.444444 5.071394 -19.69697 5.056545 -10.70707


I changed the name of all columns containing the time to the same name (this way I don't have to tell the merge function which column to merge by):

colnames(my.df)[seq(1, ncol(my.df), 2)] <- "Time"


Then I loop on a slightly modified Reduce function:

df.merged <- my.df[, 1:2]

for (i in seq(3, ncol(my.df), 2)) {
  df.merged <- Reduce(function(x, y) merge(x, y, all = TRUE),
                      list(df.merged,
                           my.df[, i:(i+1)]))
}


This gives:

> head(df.merged)
Time data1 data2 data3
1 0.010000 -10.000000 NA NA
2 0.017000 NA NA -15.00000
3 0.022000 NA -25.00000 NA
4 1.020000 -8.888889 NA NA
5 1.024909 NA NA -14.14141
6 1.031879 NA -23.93939 NA


Finally, I apply the na.approx function:

df.interp <- df.merged
df.interp[, 2:ncol(df.interp)] <- na.approx(df.interp[, 2:ncol(df.interp)],
                                            x = df.interp$Time,
                                            na.rm = FALSE)


Here is the final result:

> head(df.interp)
Time data1 data2 data3
1 0.010000 -10.000000 NA NA
2 0.017000 -9.992299 NA -15.00000
3 0.022000 -9.986799 -25.00000 -14.99574
4 1.020000 -8.888889 -23.95187 -14.14560
5 1.024909 -8.883488 -23.94671 -14.14141
6 1.031879 -8.875821 -23.93939 -14.13548


I still have NAs at the beginning of some data columns, but I can get rid of them with the na.omit function.
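The steps in this note (pair up the columns, merge on a common time column, interpolate, drop incomplete rows) can be sketched in base R for any number of time/data pairs. This is only a sketch under stated assumptions: the columns must alternate time, data; base R's approx() is used here in place of zoo's na.approx(); and the helper name interp.col is made up for illustration.

```r
# Example data: three probes with slightly different time stamps
my.df <- data.frame(
  time1.in.s = seq(0.010, 100,  length.out = 100),
  data1      = seq(-10,   100,  length.out = 100),
  time2.in.s = seq(0.022, 100,  length.out = 100),
  data2      = seq(-25,   80,   length.out = 100),
  time3.in.s = seq(0.017, 99.8, length.out = 100),
  data3      = seq(-15,   70,   length.out = 100)
)

# Assumption: columns alternate (time, data), so odd positions hold the times
pairs <- lapply(seq(1, ncol(my.df), by = 2), function(i) {
  pair <- my.df[, c(i, i + 1)]
  names(pair)[1] <- "Time"  # common key name, so merge() needs no 'by'
  pair
})

# Full outer join of all pairs on the shared Time column
df.merged <- Reduce(function(x, y) merge(x, y, all = TRUE), pairs)

# Hypothetical helper: linear interpolation of one column along Time.
# approx() ignores NA pairs and, with its default rule = 1, returns NA
# for times outside the observed range of that column.
interp.col <- function(y, x) approx(x, y, xout = x)$y

df.interp <- df.merged
df.interp[-1] <- lapply(df.merged[-1], interp.col, x = df.merged$Time)

# Drop the rows that still contain NA at either end of the time range
df.complete <- na.omit(df.interp)
head(df.complete)
```

Building the list with lapply() avoids growing an object inside a for loop, and a single Reduce() call then handles however many pairs the file contains.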

Answer

Try merge, it should help you accomplish what you need:

First: create two data frames, each holding one data column and its corresponding time column:

df1 <- data.frame(time1.in.s, data1)
df2 <- data.frame(time2.in.s, data2)

Second: merge the two dataframes, specifying the column to use using by.x and by.y, and include all values:

df.merged <- merge(df1,df2,
      by.x = "time1.in.s",
      by.y = "time2.in.s",
      all.x = TRUE,
      all.y = TRUE)

Note: to clarify, as per Sotos' recommendation:

all.x = TRUE,
all.y = TRUE

is equivalent to

all = TRUE

So if you want to exclude rows from either data frame that do not exist in the other, you can set all.x or all.y to FALSE.
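The effect of these flags is easiest to see on a tiny example. The sketch below uses two made-up data frames (the values are illustrative only) and counts the rows each variant keeps.

```r
# Two toy data frames that share only the times 2 and 3
df1 <- data.frame(time = c(1, 2, 3), data1 = c(10, 20, 30))
df2 <- data.frame(time = c(2, 3, 4), data2 = c(200, 300, 400))

n.all   <- nrow(merge(df1, df2, by = "time", all = TRUE))    # keeps times 1, 2, 3, 4
n.left  <- nrow(merge(df1, df2, by = "time", all.x = TRUE))  # keeps every row of df1
n.right <- nrow(merge(df1, df2, by = "time", all.y = TRUE))  # keeps every row of df2
n.inner <- nrow(merge(df1, df2, by = "time"))                # keeps only shared times

c(all = n.all, left = n.left, right = n.right, inner = n.inner)
```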

Now you will have time in one column, and you can rename the columns as you like.

> head(df.merged)
  time1.in.s      data1     data2
1   0.010000 -10.000000        NA
2   0.022000         NA -25.00000
3   1.020000  -8.888889        NA
4   1.031879         NA -23.93939
5   2.030000  -7.777778        NA
6   2.041758         NA -22.87879
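As a small illustration of the renaming step, the sketch below rebuilds the merged data frame from the question's example vectors and gives the key column a neutral name. The new name "time" is just an example choice.

```r
# Example vectors from the question
time1.in.s <- seq(0.010, 100, length.out = 100)
time2.in.s <- seq(0.022, 100, length.out = 100)
data1 <- seq(-10, 100, length.out = 100)
data2 <- seq(-25, 80, length.out = 100)

df.merged <- merge(data.frame(time1.in.s, data1),
                   data.frame(time2.in.s, data2),
                   by.x = "time1.in.s",
                   by.y = "time2.in.s",
                   all = TRUE)

# The key column inherits the name given in by.x; rename it by name
names(df.merged)[names(df.merged) == "time1.in.s"] <- "time"
head(df.merged)
```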

EDIT: If you want to apply this to multiple pairs of columns (timeN.in.s, dataN), you can try Reduce as follows. You can add more selections to the list, and all of them will be merged on the time column, assuming it is always the first column in each select (select comes from the dplyr package).

df.merged <- Reduce(function(x,y) merge(x,y,
                   by.x = names(x)[1],
                   by.y = names(y)[1],
                   all = TRUE),
   list(select(my.df,time1.in.s, data1),
        select(my.df,time2.in.s, data2))
   )

> head(df.merged)
  time1.in.s      data1     data2
1   0.010000 -10.000000        NA
2   0.022000         NA -25.00000
3   1.020000  -8.888889        NA
4   1.031879         NA -23.93939
5   2.030000  -7.777778        NA
6   2.041758         NA -22.87879

Additional NOTE:

If you want to use column indices, you can use:

df.merged <- Reduce(function(x,y) merge(x,y,
                                        by.x = names(x)[1],
                                        by.y = names(y)[1],
                                        all = TRUE),
                    list(select(my.df,1,2),
                         select(my.df,3,4))
)

Also, if your column names are consistent and you want to build the list automatically, you can create a function which takes an integer and returns the column names you want to select:

getDF <- function(x)
{
        c1 <- paste0("time",x,".in.s")
        c2 <- paste0("data",x)
        return(c(c1,c2))
}

For example:

> getDF(1)
[1] "time1.in.s" "data1"

Then you can use this in Reduce:

df.merged <- Reduce(function(x,y) merge(x,y,
                                        by.x = names(x)[1],
                                        by.y = names(y)[1],
                                        all = TRUE),
                    list(my.df[,getDF(1)],
                         my.df[,getDF(2)])
)
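To avoid typing one list entry per pair, the list itself can be generated with lapply. This is a sketch, assuming the columns follow the timeN.in.s / dataN naming pattern and come in complete pairs; n.pairs and df.list are names introduced here for illustration.

```r
# Example data with three time/data pairs
my.df <- data.frame(
  time1.in.s = seq(0.010, 100,  length.out = 100),
  data1      = seq(-10,   100,  length.out = 100),
  time2.in.s = seq(0.022, 100,  length.out = 100),
  data2      = seq(-25,   80,   length.out = 100),
  time3.in.s = seq(0.017, 99.8, length.out = 100),
  data3      = seq(-15,   70,   length.out = 100)
)

# Same helper as above: column names for pair number x
getDF <- function(x) c(paste0("time", x, ".in.s"), paste0("data", x))

# Assumption: every pair is complete, so there are ncol / 2 pairs
n.pairs <- ncol(my.df) / 2
df.list <- lapply(seq_len(n.pairs), function(i) my.df[, getDF(i)])

# Merge all pairs on their first (time) column
df.merged <- Reduce(function(x, y) merge(x, y,
                                         by.x = names(x)[1],
                                         by.y = names(y)[1],
                                         all = TRUE),
                    df.list)
head(df.merged)
```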