Abhishek Kulshrestha - 1 month ago
R Question

Remove columns based on a column-sum condition in R

I am trying to build a spam predictor, and I already have the words for it.

I have a matrix whose columns are the words occurring in the mails, plus a final column saying whether each mail is spam or non-spam.

Example:

mail thing money dollar spam
0 1 1 1 1
1 0 0 1 0
0 0 1 1 1


and so on.
The "spam" column indicates whether the mail is spam or not.

How can I remove the columns whose sum, computed only over the rows where spam is 0, is less than a specific value (say x)?

This way I can remove the terms that are not needed to detect spam.

Thanks for the help!

Answer
set.seed(1)

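# Toy data: four binary word columns plus the spam label
# (replace = TRUE is spelled out, since T can be reassigned)
mydata <- data.frame(A = sample(0:1, 10, replace = TRUE),
                     B = sample(0:1, 10, replace = TRUE),
                     C = sample(0:1, 10, replace = TRUE),
                     D = sample(0:1, 10, replace = TRUE),
                     spam = sample(0:1, 10, replace = TRUE))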

##    A B C D spam
## 1  0 0 1 0    1
## 2  0 0 0 1    1
## 3  1 1 1 0    1
## 4  1 0 0 0    1
## 5  0 1 0 1    1
## 6  1 0 0 1    1
## 7  1 1 0 1    0
## 8  1 1 0 0    0
## 9  1 0 1 1    1
## 10 0 1 0 0    1


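# Keep only the rows where spam is 0
mydata_ham <- mydata[mydata$spam == 0, ]

colSums(mydata_ham)

## A    B    C    D spam 
## 2    2    0    1    0 

# Keep the word columns whose sum over those rows is at least x,
# then put the spam column back at the end
x <- 2
words <- setdiff(names(mydata), "spam")
keep <- words[colSums(mydata_ham[, words, drop = FALSE]) >= x]
mydata[, c(keep, "spam"), drop = FALSE]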

##    A B spam
## 1  0 0    1
## 2  0 0    1
## 3  1 1    1
## 4  1 0    1
## 5  0 1    1
## 6  1 0    1
## 7  1 1    0
## 8  1 1    0
## 9  1 0    1
## 10 0 1    1
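
If you need this for different thresholds, the same steps could be wrapped in a small function. This is only a sketch: the name drop_rare_words and its arguments are made up for illustration, and it assumes the label column is numeric 0/1 and that every other column is a numeric word count. It is applied here to the three example rows from the question.

```r
# Hypothetical helper: drop the word columns whose sum over the rows
# where the label column is 0 is less than x.
drop_rare_words <- function(df, x, label = "spam") {
  words <- setdiff(names(df), label)
  # Rows where the label is 0; which() skips any NA labels
  ham <- df[which(df[[label]] == 0), words, drop = FALSE]
  keep <- words[colSums(ham) >= x]
  # drop = FALSE keeps a data frame even if a single column survives
  df[, c(keep, label), drop = FALSE]
}

# The example rows from the question
mails <- data.frame(mail   = c(0, 1, 0),
                    thing  = c(1, 0, 0),
                    money  = c(1, 0, 1),
                    dollar = c(1, 1, 1),
                    spam   = c(1, 0, 1))

result <- drop_rare_words(mails, x = 1)
names(result)
## [1] "mail"   "dollar" "spam"
```

Only the second mail has spam equal to 0, so with x = 1 the columns "thing" and "money" (both 0 in that row) are dropped.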