ruser9575ba6f - 3 months ago
R Question

Duplicate data.table rows from formatted string column

Question:



What is the best way in R to transform a data.table which looks like this:

> input
   id value        node
1:  1   foo       node3
2:  2   bar   node[2,4]
3:  3   qux   node[2-4]
4:  4   foo node[1-2,4]


into something like this:

> output
   id value  node
1:  1   foo node3
2:  2   bar node2
3:  2   bar node4
4:  3   qux node2
5:  3   qux node3
6:  3   qux node4
7:  4   foo node1
8:  4   foo node2
9:  4   foo node4


Sample input and output:

library(data.table)

input <- data.table(id = c(1,2,3,4), value = c("foo", "bar", "qux", "foo"), node = c("node3","node[2,4]","node[2-4]","node[1-2,4]"))


output <- data.table(id = c(1,2,2,3,3,3,4,4,4), value = c("foo","bar","bar","qux","qux","qux","foo","foo","foo"), node = c("node3", "node2", "node4", "node2", "node3", "node4", "node1", "node2", "node4"))


Background:



I am extracting job logs from a cluster of machines and the logs are similar to the input above. The id corresponds to a job id, the value to a particular executable, and the node to the machines in the cluster that actually executed the job. The logs use a compressed formatting for the node column to represent which machines the job ran on.

Using library(stringr), I wrote some ugly code which will partially parse the node column. Perhaps this can be a useful starting point:

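library(stringr)

expand_node <- function(nodes)
{
  # Capture the bracketed index list, e.g. "1-2,4" from "node[1-2,4]"
  tokens <- str_match(nodes, "\\[([0-9,\\-]+)\\]")[ ,2]
  # Turn ranges such as "1-2" into R's "1:2"
  tokens <- str_replace_all(tokens, "\\-", ":")
  tokens <- paste0("c(",tokens,")")
  # Evaluate each "c(...)" expression; unbracketed names such as "node3" yield NA
  result <- lapply(tokens, function(expr) eval(parse(text = expr)))
  return(result)
}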
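For comparison, here is a minimal sketch of a parser for a single node string that also covers the unbracketed case and does not rely on eval(parse(...)). The name expand_node_spec and its prefix argument are hypothetical, introduced only for illustration, and the sketch assumes every entry has the form prefix plus either one index or a bracketed list of indices and lo-hi ranges.

```r
# Hypothetical helper (not part of the original post): expand one compressed
# node string such as "node[1-2,4]" into c("node1", "node2", "node4").
expand_node_spec <- function(spec, prefix = "node") {
  # Drop the prefix and any square brackets, leaving e.g. "1-2,4" or "3".
  body <- gsub("\\[|\\]", "", sub(prefix, "", spec, fixed = TRUE))
  # Split on commas; each piece is a single index or a "lo-hi" range.
  pieces <- strsplit(body, ",", fixed = TRUE)[[1]]
  idx <- unlist(lapply(strsplit(pieces, "-", fixed = TRUE), function(p) {
    p <- as.integer(p)
    # One number is a single index; two numbers are the ends of a range.
    if (length(p) == 1L) p else seq(p[1], p[2])
  }))
  paste0(prefix, idx)
}

expand_node_spec("node[1-2,4]")
```

The function works on one string at a time, so it would need to be applied per row (for example with lapply) to cover a whole column.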

Answer

Here is a data.table option you can try; it also needs one fewer regular-expression step:

input[, .(node = unlist(lapply(sub("node\\[?([0-9,:]+)\\]?", "c(\\1)", gsub("-", ":", node)), 
          function(expr) paste("node", eval(parse(text = expr)), sep = "")))), .(id, value)]

#   id value  node
#1:  1   foo node3
#2:  2   bar node2
#3:  2   bar node4
#4:  3   qux node2
#5:  3   qux node3
#6:  3   qux node4
#7:  4   foo node1
#8:  4   foo node2
#9:  4   foo node4
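If evaluating strings built from log data with eval(parse(...)) is a concern, the same expansion can be done by splitting the node column in stages. This is only a sketch under the assumption that every entry is "node" followed by a single index or a bracketed, comma-separated list of indices and lo-hi ranges; the intermediate names pieces, piece, lo, hi and result are made up for illustration.

```r
library(data.table)

input <- data.table(
  id    = c(1, 2, 3, 4),
  value = c("foo", "bar", "qux", "foo"),
  node  = c("node3", "node[2,4]", "node[2-4]", "node[1-2,4]")
)

# Step 1: strip "node" and the brackets, then split on commas so each row
# holds one piece: a single index ("4") or a range ("1-2").
pieces <- input[, .(piece = unlist(strsplit(gsub("node|\\[|\\]", "", node),
                                            ",", fixed = TRUE))),
                by = .(id, value)]

# Step 2: take the low and high end of each piece. For a piece without "-",
# both substitutions leave it unchanged, so lo equals hi.
pieces[, lo := as.integer(sub("-.*", "", piece))]
pieces[, hi := as.integer(sub(".*-", "", piece))]

# Step 3: expand every lo..hi range and rebuild the node names per job.
result <- pieces[, .(node = paste0("node", unlist(Map(seq, lo, hi)))),
                 by = .(id, value)]

print(result)
```

On the sample input this should give the same nine rows as the output above. It trades the one-liner for three explicit steps, which makes malformed entries easier to spot because as.integer() turns them into NA with a warning rather than running them as code.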