Francis Loko Francis Loko - 1 year ago 57
R Question

Aggregate duplicates in multiple columns of a data table in such a way that every value is only listed once

I have a table of alphanumeric IDs and associated values. There are instances of one ID associated with multiple values. On the other hand, there are also instances of a given value corresponding to several IDs. I am trying to collapse the table in such a way that each ID and each value is only listed once. I found a few questions that are close, but not quite the same, e.g. R finding duplicates in one column and collapsing in a second column, or aggregate values from several fields into one

This is an example of what my data look like:

ID = c("c067","c155","l413","l413","k456","z025","z026","z025","z026")
value = c(47, 47, 57, 58, "vwd", 85, 85, "ca", "ca")
x.raw = data.table(ID, value)

I can get a partial solution by doing this:

x.tidy1 = x.raw[,list(value = list(value)), by = ID]

But, notice that the values "47" and "c("85", "ca")" each occur twice. What I really would like is to get something similar to this at the end:

ID = list(c("c067","c155"),"l413","k456",c("z025","z026"))
value = list("47",c("57","58"),"vwd",c("85","ca"))
x.tidy2 = data.table(ID, value)

# ID value
# 1: c067,c155 47
# 2: l413 57,58
# 3: k456 vwd
# 4: z025,z026 85,ca

I have tried this, without success:

x.tidy2 = x.tidy1[,list(ID = list(ID)), by = value]

How could I do this?

Answer Source

To find connected components and label edges with them...

g = graph_from_edgelist(as.matrix(x.raw))

mem = components(g)$membership
x.raw[, grp := mem[ match(ID, names(mem))] ]

Then you can summarize as the OP had wanted:

x.raw[, lapply(.SD, function(x) .(unique(x))), by=grp]

#    grp        ID value
# 1:   1 c067,c155    47
# 2:   2      l413 57,58
# 3:   3      k456   vwd
# 4:   4 z025,z026 85,ca

List columns like this are a very difficult format for further analysis. If you just want to visualize the components...

V(g)$type = V(g)$name %like% "^[a-z][0-9]{3}$"
plot(g, layout = layout.bipartite(g))

enter image description here

Or, you know...

com = make_clusters(g, clusters(g)$mem)
plot(com, g)

enter image description here