mpalanco - 1 year ago 121
R Question

# Set cover approximation in R

I'm trying to solve or implement an approximation to the set cover problem in R. Given a data frame like this.

``````  sets n
1   s1 1
2   s1 2
3   s1 3
4   s2 2
5   s2 4
6   s3 3
7   s3 4
8   s4 4
9   s4 5
``````

The unique number of elements in column
`n`
are:

``````unique(d\$n)
[1] 1 2 3 4 5
``````

I'd like to calculate the smaller number of sets (column
`sets`
) that cover all the unique elements in n (universe). In this example two sets: s1 {1, 2, 3} and s4 {4, 5}. I've read about it in wikipedia and on the Internet, and I know a greedy algorithm could be applied to find an approximation. I've checked too this link in which they mention two packages to solve such problems,
`LPsolve`
and
`Rsymphony`
, but I don't know even how to start. In my real life example, I have more than 40,000 sets and each set between 1,000 and 10,000 elements and a unviverse or unique elements of 80,000.
Any help or guide about how to start or proceed will be very much appreciated.

data

``````d <- structure(list(sets = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L,
4L, 4L), .Label = c("s1", "s2", "s3", "s4"), class = "factor"),
n = c(1, 2, 3, 2, 4, 3, 4, 4, 5)), .Names = c("sets", "n"
), row.names = c(NA, -9L), class = "data.frame")
``````

The `lpSolve` package is available on CRAN for linear programing problems. Using your link which had a response from the very reputable Hans Borchers, as well as a slightly more complex example (starting on pg 4/5) in http://math.mit.edu/~goemans/18434S06/setcover-tamara.pdf as templates to understand teh proper structure of the setup, and then following along with modifications to the first example in `?lp`:

``````library( lpSolve)
?lp
# In Details: "Note that every variable is assumed to be >= 0!"
# go from your long-form rep of the sets to a wide form for a matrix representation
( items.mat<- t(table(d\$sets,d\$n))  )  # could have reversed order of args to skip t()
#---------
> dimnames(items.mat) = list( items=1:5, sets=paste0("s", 1:4) )
> items.mat
sets
items s1 s2 s3 s4
1  1  0  0  0
2  1  1  0  0
3  1  0  1  0
4  0  1  1  1
5  0  0  0  1
#---------
f.obj <-  rep(1,4)  # starting values of objective parameters by column (to be solved)
f.dir <- rep(">=",5) # the constraint "directions" by row
f.rhs <- rep(1,5)    # the inequality values by row (require all items to be present)

lp ("min", f.obj, items.mat, f.dir, f.rhs)\$solution
#[1] 1 0 0 1
``````

So sets `s1` and `s4` are a minimal cover. The "column coefficients" determine choice of "sets".

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download