homer3018 homer3018 - 12 days ago 6
R Question

Split a single column into a multiple columns dataframe

So I've solved my initial problem using a suggestion from G. Grothendieck, thanks again, exactly the clean way of doing that I was after. Initial post is here.
Now reality is that my file is just a little more subtle .

It actually looks like this (see also the data section at end of post for reproducible format):

A1
100
200
txt
A2
STRING
300
400
txt txt
txt
txt txt txt
A3
STRING
STRING
150
250
A2
.
.
.


Preliminary data wrangling looks like this :

type <- cumsum(raw_data[[1]] %in% c("A1","A2","A3"))
v <- tapply(raw_data[[1]], type, c, simplify = FALSE)
m <- t(do.call(cbind, lapply(v, ts)))

raw_data<- as.data.frame(m, stringsAsFactors = FALSE)
raw_data[] <- lapply(raw_data, type.convert, as.is = TRUE)

raw_data$Occurences <- 0


giving :

V1 V2 V3 V4 V5 V6 V7
1 A1 100 200 txt <NA> <NA> <NA>
2 A2 String 300 400 txt txt txt txt txt txt
3 A3 String String 150 250 <NA> <NA>
4 A2 <NA> <NA> <NA> <NA> <NA> <NA>


The issue is that df[3,4] should be in df[3,2], and I should state "2" in a new column. Same goes on line 2, where df[2,3] should be in df[2,2] and state "1" in that same addiotional column.
In other words, I'm chasing this :

V1 V2 V3 V4 V5 V6 Occurences
1 A1 100 200 txt <NA> <NA> 0
2 A2 300 400 txt txt txt txt txt txt 1
3 A3 150 250 <NA> <NA> <NA> 2
4 A2 NA NA <NA> <NA> <NA> 0


The STRING is right after A something, sometimes it does not occur and sometimes just one time or several occurences. This is what I did to handle the problem :

#Count "STRING" occurences and readjust values in expected columns
formatString <- function(df) {
z <<- which(df[,2] %in% "STRING")
if (length(z) > 0){
for (i in z){
df$Occurences = df$Occurences + 1
for (j in 2:ncol(df)-1){
if (is.na(df[i,j]) | is.na(df[i,j+1])){
df[i,j] = NA
} else {
df[i,j] = df[i,j+1]
}
}
}
}
z <<- which(df[,2] %in% "STRING")
if(length(z) > 0){formatString(df)}
}


This function is supposed to process exclusively the lines where STRING is found in column 2, it would increment the last column (Occurences) and then shift all the values one column to the left so that they all go back to where they are expected to be. The IS.NA thing is just to try and stop the loop when we're starting to see NAs.
Once we've processed these lines, we look again if STRING is in the column 2, and if so call again the function.

Now my problem is that the function looks like it is processing (takes like 20 sec on almost 19k observations and 261 columns, not really sure it's the best in terms of processing time), expect my dataframe was not updated at the end of the loop. However z got updated so it seems to be working the way it should.

What am I missing ?

The data



Data in reproductible form :

DF <- structure(list(V1 = c("A1", "100", "200", "txt ", "A2", "String",
"300","400", "txt txt", "txt", "txt txt txt", "A3",
"String", "String", "150", "250", "A2")), .Names = "V1",
row.names = c(NA, -14L), class = "data.frame")

Answer

So, because my example was not as accurate as I thought, the data.table approach does not suit my need. Actually the STRING is not necessarily between integer.

I came up with this function :

#Count "STRING" occurences and readjust values in expected columns
formatSearching <- function(df) {
  z <<- which(df[,2] %in% "STRING")
  if (length(z) > 0){
    for (i in z){
      df[i,"String_occurences"] = df[i,"String_occurences"] + 1
      for (j in 2:(ncol(raw_data)-1)){
        if (is.na(df[i,j]) | is.na(df[i,j+1])){
          df[i,j] = NA
        } else {
          df[i,j] = df[i,j+1]
        }
      }
    }
  }
  z <<- which(df[,2] %in% "STRING")
  #if(length(z) > 0){formatSearching(df)} This somehow does work, but does not update df...
  return(df)
}

Because of the comment at the end, I'm calling it like this :

raw_data <- formatSearching(raw_data)
while(length(z) > 0){raw_data <- formatSearching(raw_data)}

There are several issues with all this. First my intent was not to put a while in the middle of the processing of my data, but have a fully functional recursive function. I am probably missing an assignment somewhere so that my RAW_DATA dataframe gets updated.

Secondly, this process is taking time, especially the while step. It is possible that on some lines, I'll have up to 10 occurrences, but on others just 1. I am convinced that we could do better and by better i mean faster, more efficient.

Now this is doing the job the way I want to, I'm just trying to gain some processing speed.

Thank you all.

Comments