Kevin Kevin - 3 years ago 182
R Question

Transposing rows into columns, then split them

I have a data set containing data sorted in rows like this:

*VarName1* - *VarValue1*
*VarName2* - *VarValue2*
*Etc.*


I want it to be that the VarNames become individual columns. I have achieved this by using the following code:

DFP1 <- as.data.frame(t(DFP)) #DFP contains the data


Now, this is a very big data set. It contains multiple years (millions of rows) of data. Above code creates a dataframe which has > 1E6 columns. I need to split these columns by each entry. I saw that in the first piece of data, a new entry recurs at every 86th column. So, I tried this:

tmp <- data.frame(
X = DFP$noFloat,
ind = rep(1:86, nrow(DFP)/86)
)

y <- rbind(DFP$nmlVar[1:86], unstack(tmp, X~ind))


This works for a few rows. The problem is that the number of variables increased over the years and that I cannot simply assume that the number of variables per entry are the same. This results in variable values mismatching it's names. I am looking for a way to match variables and values based on their variable names.

I am new to advanced data-analysis, so please let me know if you need anything more.

EDIT: I created some sample data of how DFP looks like, to hopefully make you better understand my question:

DFP <- data.frame(
nmlVar = c("Batch", "Mass", "Length", "Product","Batch", "Mass",
"Length", "Product", "Batch", "Mass", "Length", "Width", "Product"),

noFloat = c(254578, 20, 24, 24547, 254579, 23, 24, 24547, 254580, 20,
24, 19, 24547)
)


Important to note here is the apperance of new variable width in the third recurrence. This is typical for my dataset, introduction of new variables. The key indicator here is batch and it should be split at each time the variable batch appears.

dput output of sample data:

structure(list(nmlVar = structure(c(1L, 3L, 2L, 4L, 1L, 3L, 2L,
4L, 1L, 3L, 2L, 5L, 4L), .Label = c("Batch", "Length", "Mass",
"Product", "Width"), class = "factor"), noFloat = c(254578, 20,
24, 24547, 254579, 23, 24, 24547, 254580, 20, 24, 19, 24547)), .Names = c("nmlVar",
"noFloat"), row.names = c(NA, -13L), class = "data.frame")

Answer Source

Is this what you are after?:

library(dplyr)
library(tidyr)
DFP %>% 
  mutate(sample = cumsum(nmlVar == 'Batch')) %>% 
  spread(nmlVar, noFloat)

Gives:

  sample  Batch Length Mass Product Width
1      1 254578     24   20   24547    NA
2      2 254579     24   23   24547    NA
3      3 254580     24   20   24547    19
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download