BigData - 9 months ago 79

R Question

Assume we have 3 rules:

    [1] {A,B,D} -> {C}
    [2] {A,B} -> {C}
    [3] Whatever it is

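Rule `[2]` is a subset of rule `[1]` (rule `[1]` contains all the items in rule `[2]`), so rule `[1]` should be eliminated: it is too specific, and its information is already included in rule `[2]`.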

I searched the internet, and everyone uses this code to remove redundant rules:

    subset.matrix <- is.subset(rules.sorted, rules.sorted)
    subset.matrix[lower.tri(subset.matrix, diag=T)] <- NA
    redundant <- colSums(subset.matrix, na.rm=T) >= 1
    which(redundant)
    rules.pruned <- rules.sorted[!redundant]

I don't understand how the code works.

After line 2 of the code, the subset.matrix will become:

         [,1] [,2] [,3]
    [1,]   NA    1    0
    [2,]   NA   NA    0
    [3,]   NA   NA   NA

The cells in the lower triangle are set to NA, and since rule `[2]` is a subset of rule `[1]`, the corresponding cell is set to 1. So I have two questions:

- Why do we have to set the lower triangle to NA? If we do so, how can we check whether rule `[2]` is a subset of rule `[3]` or not? (The cell has been set to NA.)
- In our case, rule `[1]` should be the one to be eliminated, but this code eliminates rule `[2]` instead of rule `[1]`. (The first cell in column 2 is 1, so according to line 3 of the code the column sum of column 2 is >= 1, and rule `[2]` is therefore treated as redundant.)

Any help would be appreciated!

Answer

For your code to work you need an interest measure (confidence or lift), and `rules.sorted` needs to be sorted by either confidence or lift. Anyway, the code is horribly inefficient since `is.subset()` creates a matrix of size n^2, where n is the number of rules. Also, `is.subset` for rules merges rhs and lhs of the rule, which is not correct. So don't worry too much about the implementation details.
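To see why the sort order matters, here is a minimal base-R sketch of the upper-triangle logic. It does not call arules: the three rules are hypothetical item vectors (LHS and RHS merged, as described above), the order stands in for a sort by confidence that puts the more general rule first, and the subset matrix is built by hand under the assumption that entry `[i, j]` is true when rule i is contained in rule j.

```r
# Hypothetical rules, each stored as the union of its LHS and RHS items.
# The order is assumed to come from sorting by confidence, which here puts
# the general rule {A,B} -> {C} ahead of the specific {A,B,D} -> {C}.
rules.sorted <- list(
  general  = c("A", "B", "C"),
  specific = c("A", "B", "C", "D"),
  other    = c("E", "F")
)

n <- length(rules.sorted)
rule.names <- names(rules.sorted)

# subset.matrix[i, j] is TRUE when every item of rule i also occurs in rule j
# (assumed orientation, filled in by hand for illustration).
subset.matrix <- matrix(FALSE, n, n, dimnames = list(rule.names, rule.names))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    subset.matrix[i, j] <- all(rules.sorted[[i]] %in% rules.sorted[[j]])
  }
}

# Blank the diagonal (every rule contains itself) and the lower triangle,
# so a rule can only be flagged by a rule that is ranked above it.
subset.matrix[lower.tri(subset.matrix, diag = TRUE)] <- NA

# A column with at least one TRUE has a higher-ranked rule contained in it.
redundant <- colSums(subset.matrix, na.rm = TRUE) >= 1
rules.pruned <- rules.sorted[!redundant]
```

With this ordering the flagged column belongs to the specific rule, which is the one that should go. If the specific rule were ranked first, the only true entry would fall in the lower triangle and be blanked, so nothing would be removed.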

A more efficient way to do this is now implemented as function `is.redundant()` in package arules (available in version 1.4-2). This explanation comes from the manual page:

A rule is redundant if a more general rule with the same or a higher confidence exists. That is, a more specific rule is redundant if it is only equally or even less predictive than a more general rule. A rule is more general if it has the same RHS but one or more items removed from the LHS. Formally, a rule X -> Y is redundant if

for some X' subset X, conf(X' -> Y) >= conf(X -> Y).

This is equivalent to a negative or zero improvement as defined by Bayardo et al. (2000). In this implementation other measures than confidence, e.g. improvement of lift, can be used as well.
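The definition can be checked by hand on a toy example. The sketch below is plain base R, not the arules implementation; the rules, their confidence values, and the helper `is_more_general()` are all made up for illustration.

```r
# Hypothetical rules with made-up confidence values, for illustration only.
rules <- list(
  list(lhs = c("A", "B", "D"), rhs = "C", conf = 0.80),
  list(lhs = c("A", "B"),      rhs = "C", conf = 0.85),
  list(lhs = c("A"),           rhs = "C", conf = 0.60),
  list(lhs = c("B", "D"),      rhs = "E", conf = 0.90)
)

# TRUE when `general` has the same RHS as `specific` and its LHS is a
# proper subset of the LHS of `specific` (LHS items are assumed unique).
is_more_general <- function(general, specific) {
  setequal(general$rhs, specific$rhs) &&
    all(general$lhs %in% specific$lhs) &&
    length(general$lhs) < length(specific$lhs)
}

# A rule is redundant if some more general rule is at least as confident.
redundant <- vapply(rules, function(r) {
  any(vapply(rules, function(g) {
    is_more_general(g, r) && g$conf >= r$conf
  }, logical(1)))
}, logical(1))
```

Here only the first rule is flagged: `{A,B} -> {C}` is more general and has higher confidence. `{A,B} -> {C}` itself is kept, because the only more general rule, `{A} -> {C}`, is less confident.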

Check out the examples in `?is.redundant`.
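As a rough usage sketch, assuming arules 1.4-2 or later is installed: the `Adult` data set ships with the package, and the support and confidence thresholds below are arbitrary values chosen only so that some rules are mined.

```r
library(arules)

# Example transactions bundled with arules.
data("Adult")

# Mine rules; the thresholds are arbitrary and only serve the illustration.
rules <- apriori(Adult,
                 parameter = list(support = 0.5, confidence = 0.9),
                 control = list(verbose = FALSE))

# Logical vector with one entry per rule; TRUE marks a redundant rule.
redundant <- is.redundant(rules)

# Keep only the non-redundant rules.
rules.pruned <- rules[!redundant]
```

No sorting or subset matrix is needed here, since `is.redundant()` compares each rule with its more general rules directly.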