BigData BigData - 2 months ago 17x
R Question

Association rule in R - removing redundant rule (arules)

Assume we have 3 rules:

[1] {A,B,D} -> {C}

[2] {A,B} -> {C}

[3] Whatever it is

Rule [2] is a subset of rule [1] (because rule [1] contains all the items in rule [2]), so rule [1] should be eliminated (because rule [1] is too specific and its information is included in rule [2]).

I searched the internet, and everyone seems to use this code to remove redundant rules:

subset.matrix <- is.subset(rules.sorted, rules.sorted)
subset.matrix[lower.tri(subset.matrix, diag=T)] <- NA
redundant <- colSums(subset.matrix, na.rm=T) >= 1
rules.pruned <- rules.sorted[!redundant]

I don't understand how the code works.

After line 2 of the code, the subset.matrix will become:

     [,1] [,2] [,3]
[1,]   NA    1    0
[2,]   NA   NA    0
[3,]   NA   NA   NA

The cells in the lower triangle are set to NA, and since rule [2] is a subset of rule [1], the corresponding cell (row 1, column 2) is set to 1. So I have 2 questions:

  1. Why do we have to set the lower triangle to NA? If we do, how can we still check the subset relation for a pair of rules whose cell has been set to NA?

  2. In our case, rule [1] should be the one to be eliminated, but this code eliminates rule [2] instead of rule [1]. (The first cell in column 2 is 1, so according to line 3 of the code the column sum of column 2 is >= 1, and rule [2] is therefore treated as redundant.)

Any help would be appreciated!


For your code to work you need an interest measure (confidence or lift), and rules.sorted needs to be sorted by that measure. Anyway, the code is horribly inefficient, since is.subset() creates a matrix of size n^2, where n is the number of rules. Also, is.subset() for rules merges the RHS and LHS of each rule before comparing, which is not correct for this purpose. So don't worry too much about the implementation details.
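As a rough illustration of why the sort order matters, here is a minimal base-R sketch that rebuilds the subset matrix by hand for three made-up rules. It assumes the rules are already sorted by decreasing confidence, that cell [i, j] is TRUE when every item of rule i also appears in rule j, and that LHS and RHS are merged into one item set (as described above). The rule names and confidence values are invented for the example and do not come from arules.

```r
# Hypothetical rules with LHS and RHS merged, assumed sorted by decreasing confidence.
items <- list(
  r1 = c("A", "B", "C"),       # {A,B}   -> {C}, assumed confidence 0.9
  r2 = c("A", "B", "C", "D"),  # {A,B,D} -> {C}, assumed confidence 0.8
  r3 = c("E", "F")             # {E}     -> {F}, assumed confidence 0.7
)

n <- length(items)

# Assumed convention: subset.matrix[i, j] is TRUE when rule i's items are all in rule j.
subset.matrix <- matrix(FALSE, n, n, dimnames = list(names(items), names(items)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    subset.matrix[i, j] <- all(items[[i]] %in% items[[j]])
  }
}

# Blank out the diagonal and the lower triangle, so that column j only keeps
# comparisons against rules ranked above rule j.
subset.matrix[lower.tri(subset.matrix, diag = TRUE)] <- NA

# A rule is flagged when some higher-ranked rule is contained in it.
redundant <- colSums(subset.matrix, na.rm = TRUE) >= 1
kept <- names(items)[!redundant]
print(kept)
```

Under those assumptions, masking the lower triangle means each rule is only compared against rules ranked above it, so a column sum >= 1 flags a more specific rule that a better-ranked, more general rule already covers. In this toy input that is r2. If the rules are not sorted by an interest measure first, the same code can flag the wrong rule.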

A more efficient way to do this is now implemented as function is.redundant() in package arules (available in version 1.4-2). This explanation comes from the manual page:

A rule is redundant if a more general rule with the same or a higher confidence exists. That is, a more specific rule is redundant if it is only equally or even less predictive than a more general rule. A rule is more general if it has the same RHS but one or more items removed from the LHS. Formally, a rule X -> Y is redundant if

for some X' subset X, conf(X' -> Y) >= conf(X -> Y).

This is equivalent to a negative or zero improvement as defined by Bayardo et al. (2000). In this implementation other measures than confidence, e.g. improvement of lift, can be used as well.
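To make the definition concrete, the following base-R sketch checks it by brute force on a handful of invented transactions. It is not how arules implements the check; the helper names (support, confidence, is_redundant) and the data are hypothetical and exist only for this illustration.

```r
# Hypothetical toy transactions, invented for illustration only.
toy_transactions <- list(
  c("A", "B", "C", "D"),
  c("A", "B", "D"),
  c("A", "B", "C"),
  c("A", "B", "C"),
  c("A", "B", "C"),
  c("A", "B"),
  c("C", "E"),
  c("E")
)

# Fraction of transactions that contain every item in `items`.
support <- function(items) {
  mean(vapply(toy_transactions, function(tr) all(items %in% tr), logical(1)))
}

# conf(X -> Y) = supp(X and Y together) / supp(X)
confidence <- function(lhs, rhs) support(c(lhs, rhs)) / support(lhs)

# X -> Y is redundant if some proper subset X' of X has conf(X' -> Y) >= conf(X -> Y).
is_redundant <- function(lhs, rhs) {
  if (length(lhs) == 0) return(FALSE)   # an empty LHS has no proper subset
  subsets <- list(character(0))         # start with the empty LHS
  for (k in seq_len(length(lhs) - 1)) {
    subsets <- c(subsets, combn(lhs, k, simplify = FALSE))
  }
  target <- confidence(lhs, rhs)
  any(vapply(subsets, function(s) confidence(s, rhs) >= target, logical(1)))
}

print(is_redundant(c("A", "B", "D"), "C"))  # {A,B,D} -> {C}
print(is_redundant("A", "B"))               # {A} -> {B}
```

With this data, {A,B,D} -> {C} has confidence 0.5 while the more general {A,B} -> {C} has about 0.67, so the longer rule is redundant. {A} -> {B} has confidence 1, which beats the empty-LHS rule (0.75), so it is kept. The brute-force subset enumeration is exponential in the LHS size and is only meant to mirror the formula.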

Check out the examples in ?is.redundant.
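For orientation, a minimal usage sketch might look like the following. It assumes arules 1.4-2 or later is installed; the toy transactions and the support and confidence thresholds are made up for the example and are not taken from the manual page.

```r
library(arules)

# Hypothetical toy transactions, invented for illustration only.
trans_list <- list(
  c("A", "B", "C", "D"),
  c("A", "B", "C", "D"),
  c("A", "B", "C"),
  c("A", "B", "C"),
  c("A", "B"),
  c("D", "E")
)
trans <- as(trans_list, "transactions")

# Mine rules; the thresholds are arbitrary choices for this sketch.
rules <- apriori(
  trans,
  parameter = list(support = 0.3, confidence = 0.5, target = "rules"),
  control = list(verbose = FALSE)
)

# Logical vector with one entry per rule: TRUE marks a redundant rule.
redundant <- is.redundant(rules, measure = "confidence")

# Keep only the non-redundant rules.
rules.pruned <- rules[!redundant]
inspect(rules.pruned)
```

Unlike the is.subset() snippet from the question, this does not depend on the rules being sorted beforehand.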