treeof - 3 years ago
R Question

Break up each dataframe row text into five even chunks of text

I was hoping for some assistance with this thorny string problem.

Current dataframe

ID Text
1 This is a very long piece of string. This contains many lines.


I would like to transform it to:

ID Text1 Text2 Text3 Text4 Text5
1 This is a very long piece of string. This contains many lines.


The string should be split so that each piece holds an even share of the words. In the example above I have tried to show the text split evenly into five parts, so each column should contain about 20% of the words.

The goal is to arrange the words so that they can be treated as time series data, as if a conversation had been split into consecutive segments.

Any answers are appreciated.

Answer Source

There is probably a better way to do it, but this works with no additional packages:

First thing, we create a reproducible example:

df <- data.frame(ID=1:2,
                 Text=c("Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.",
                        "Lorem ipsum dolor sit amet, consectetur adipiscing elit"),
                 stringsAsFactors = FALSE)

Then comes chunkize, a wrapper around split() and cut(), which is the tricky part. It takes a character string, splits it on spaces into words, groups the words into n chunks, and returns a data.frame with n columns. (We remove the column names so that the later rbind of the rows works without name conflicts.)

chunkize <- function(chr, n=5){
  x <- strsplit(chr, " ")[[1]]
  df <- as.data.frame(
    lapply(
      split(x, 
            cut(seq_along(x), 
                breaks=n)), 
      paste, collapse=" "), 
    stringsAsFactors = FALSE, col.names=NULL)
  names(df) <- NULL
  df
}
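To see what the split() and cut() combination does, here is a minimal sketch on the twelve-word sentence from the question. It only uses base R; the variable names words, groups and chunks are chosen for illustration and are not part of the answer's code.

```r
# Twelve words, taken from the question's example sentence.
words <- strsplit("This is a very long piece of string. This contains many lines.", " ")[[1]]

# cut() assigns each word position to one of 5 equal-width intervals;
# labels = FALSE returns the interval number instead of a factor label.
groups <- cut(seq_along(words), breaks = 5, labels = FALSE)
groups
# 1 1 1 2 2 3 3 4 4 5 5 5

# Splitting the words by group and pasting each group back together
# gives the five chunks.
chunks <- vapply(split(words, groups), paste, character(1), collapse = " ")
unname(chunks)
# "This is a" "very long" "piece of" "string. This" "contains many lines."
```

Because 12 is not divisible by 5, the first and last chunks get three words and the middle ones get two.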

Then we simply apply it to every row. We also add the ID column:

df_chunked <- do.call("rbind", 
                      apply(df, 1, 
                         function(x) cbind(x[1], chunkize(x[-1], 5))))

Finally, we rename columns:

colnames(df_chunked) <- c("ID", paste0("Text", 1:5))

The same thing wrapped in a handy function:

chunkize_this <- function(df, n=5){
  chunkize <- function(chr, n){
    x <- strsplit(chr, " ")[[1]]
    df <- as.data.frame(
      lapply(
        split(x, 
              cut(seq_along(x), 
                  breaks=n)), 
        paste, collapse=" "), 
      stringsAsFactors = FALSE, col.names=NULL)
    names(df) <- NULL
    df
  }

  df_chunked <- do.call("rbind", 
                        apply(df, 1, function(x) cbind(x[1], chunkize(x[-1], n))))
  colnames(df_chunked) <- c(colnames(df)[1], paste0("Text", 1:n))
  rownames(df_chunked) <- NULL
  df_chunked
}

You can try it with:

View(chunkize_this(df, 3))
View(chunkize_this(df, 5))
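One caveat about the apply(df, 1, ...) step: apply() first converts the data frame to a matrix, and with a text column present that is a character matrix, so the ID column comes back as text (and numeric IDs of different widths may be padded with spaces). If that matters, a possible variant is to loop over the text column directly. The sketch below is not the answer's code: chunk_words and chunk_df are hypothetical names, and it assumes the columns are called ID and Text as in the example.

```r
# Hypothetical helper: split one string into n space-separated chunks
# and return them as a character vector of length n.
chunk_words <- function(chr, n = 5) {
  x <- strsplit(chr, " ")[[1]]
  g <- cut(seq_along(x), breaks = n, labels = FALSE)
  # Explicit levels keep empty chunks, so the result always has length n.
  pieces <- split(x, factor(g, levels = seq_len(n)))
  vapply(pieces, paste, character(1), collapse = " ", USE.NAMES = FALSE)
}

# Hypothetical wrapper: assumes df has columns named ID and Text.
chunk_df <- function(df, n = 5) {
  # One character vector per row, stacked into an nrow x n matrix.
  m <- do.call(rbind, lapply(as.character(df$Text), chunk_words, n = n))
  out <- data.frame(df["ID"], m, stringsAsFactors = FALSE)
  names(out) <- c("ID", paste0("Text", seq_len(n)))
  out
}
```

Calling str(chunk_df(df, 5)) should show ID with its original type, with the same chunk boundaries as chunkize_this since both rely on cut().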

Another example:

df <- read.table(header = TRUE, text =
  'ID   Text
  1    "This is a very long piece of string. This contains many lines."
  2    "This is a very long piece of string. It contains one or two more word."
  3    "Short"'
)

> chunkize_this(df, 5)
  ID     Text1           Text2         Text3           Text4                Text5
1  1 This is a       very long      piece of    string. This contains many lines.
2  2 This is a very long piece of string. It contains one or       two more word.
3  3                                   Short                                     
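The output above shows two properties of grouping with cut(): chunk sizes can differ by one word when the word count is not divisible by n, and a one-word text ends up in the middle column. A small sketch makes the counts visible; words_per_chunk is a hypothetical helper introduced here for illustration only.

```r
# Hypothetical helper: count how many words cut() places in each of
# n chunks for a text with n_words words.
words_per_chunk <- function(n_words, n = 5) {
  tabulate(cut(seq_len(n_words), breaks = n, labels = FALSE), nbins = n)
}

words_per_chunk(12)  # 3 2 2 2 3  (row 1 above)
words_per_chunk(15)  # 3 3 3 3 3  (row 2 above)
words_per_chunk(1)   # 0 0 1 0 0  (row 3 above: "Short" lands in Text3)
```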