Kanishka - 2 years ago 63

R Question

Below is the training dataset that I am using for a Naive Bayes implementation in R (using the e1071 package), where X, Y, Z are the classes and V1, V2, V3, V4, V5 are the attributes:

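| Class | V1 | V2 | V3 | V4 | V5 |
|---|---|---|---|---|---|
| X | Yes | Yes | No | Yes | Yes |
| X | Yes | Yes | No | No | Yes |
| X | Yes | Yes | No | No | Yes |
| X | Yes | Yes | No | No | Yes |
| X | No | Yes | No | No | Yes |
| X | No | Yes | No | No | Yes |
| X | No | Yes | No | No | Yes |
| X | No | No | No | No | No |
| X | No | No | No | No | No |
| X | No | No | No | No | No |
| X | No | No | No | No | No |
| X | No | No | No | No | No |
| X | No | No | No | No | No |
| X | No | No | No | No | No |
| X | No | No | No | No | No |
| X | No | No | No | No | No |
| Y | Yes | Yes | Yes | No | Yes |
| Y | No | No | No | No | Yes |
| Y | No | No | No | No | Yes |
| Y | No | No | No | No | No |
| Y | No | No | No | No | No |
| Y | No | No | No | No | No |
| Y | No | No | No | No | No |
| Z | No | Yes | Yes | No | Yes |
| Z | No | No | No | No | Yes |
| Z | No | No | No | No | Yes |
| Z | No | No | No | No | No |
| Z | No | No | No | No | No |
| Z | No | No | No | No | No |
| Z | No | No | No | No | No |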

The prior probabilities for the above dataset are X = 0.5333333, Y = 0.2333333, Z = 0.2333333,

and the conditional probabilities are:

| Attribute | Class | No | Yes |
|---|---|---|---|
| V1 | X | 0.7500000 | 0.2500000 |
| V1 | Y | 0.8571429 | 0.1428571 |
| V1 | Z | 1.0000000 | 0.0000000 |
| V2 | X | 0.5625000 | 0.4375000 |
| V2 | Y | 0.8571429 | 0.1428571 |
| V2 | Z | 0.8571429 | 0.1428571 |
| V3 | X | 1.0000000 | 0.0000000 |
| V3 | Y | 0.8571429 | 0.1428571 |
| V3 | Z | 0.8571429 | 0.1428571 |
| V4 | X | 0.9375 | 0.0625 |
| V4 | Y | 1.0000 | 0.0000 |
| V4 | Z | 1.0000 | 0.0000 |
| V5 | X | 0.5625000 | 0.4375000 |
| V5 | Y | 0.5714286 | 0.4285714 |
| V5 | Z | 0.5714286 | 0.4285714 |

I want to find out which class a test sample belongs to, given V3 = Yes. So my test data is:

`V3`

Yes

So I have to find the probability of each class, i.e. Probability(X | V3=Yes), Probability(Y | V3=Yes) and Probability(Z | V3=Yes), and take the maximum of the three. Now,

Probability(X| V3=Yes)= Probability(X) * Probability(V3=Yes|X)/ P(V3)

From the conditional probability mentioned above, we know that Probability(V3=Yes|X)=0

So, Probability(X| V3=Yes) should be 0 and Probability(Y| V3=Yes),Probability(Z| V3=Yes) should be 0.5 each.

But the output in R is different. I used the `naiveBayes` function from the e1071 package. Below is the code and its output:

`model_nb <- naiveBayes(Class ~ ., data = train, laplace = 0)`

`results <- predict(model_nb, test, type = "raw")`

`print(results)`

`#              X         Y         Z`

`# [1,] 0.5714286 0.2142857 0.2142857`

Can someone please explain why R produces this output?

Case 2: the same scenario as Case 1 with respect to the test data; the only difference is that laplace = 1 is used. So again I have to find the probability of each class, i.e. Probability(X | V3=Yes), Probability(Y | V3=Yes) and Probability(Z | V3=Yes), and take the maximum of the three.

Below are the conditional probabilities after Laplace smoothing (k=1):

| Attribute | Class | No | Yes |
|---|---|---|---|
| V1 | X | 0.7222222 | 0.2777778 |
| V1 | Y | 0.7777778 | 0.2222222 |
| V1 | Z | 0.8888889 | 0.1111111 |
| V2 | X | 0.5555556 | 0.4444444 |
| V2 | Y | 0.7777778 | 0.2222222 |
| V2 | Z | 0.7777778 | 0.2222222 |
| V3 | X | 0.94444444 | 0.05555556 |
| V3 | Y | 0.77777778 | 0.22222222 |
| V3 | Z | 0.77777778 | 0.22222222 |
| V4 | X | 0.8888889 | 0.1111111 |
| V4 | Y | 0.8888889 | 0.1111111 |
| V4 | Z | 0.8888889 | 0.1111111 |
| V5 | X | 0.5555556 | 0.4444444 |
| V5 | Y | 0.5555556 | 0.4444444 |
| V5 | Z | 0.5555556 | 0.4444444 |

From the Naive Bayes definition,

Probability(X | V3=Yes) = Probability(X) * Probability(V3=Yes | X) / P(V3)

Probability(Y | V3=Yes) = Probability(Y) * Probability(V3=Yes | Y) / P(V3)

Probability(Z | V3=Yes) = Probability(Z) * Probability(V3=Yes | Z) / P(V3)

After calculation I have,

Probability(X | V3=Yes) = 0.53 * 0.05555556 / P(V3) = 0.029 / P(V3)

Probability(Y | V3=Yes) = 0.23 * 0.22222222 / P(V3) = 0.051 / P(V3)

Probability(Z | V3=Yes) = 0.23 * 0.22222222 / P(V3) = 0.051 / P(V3)

From the above calculation, there should be a tie between classes Y and Z. But the output in R is different: class X is shown as the predicted class. Below is the code and its output:

`model_nb <- naiveBayes(Class ~ ., data = train, laplace = 1)`

`results <- predict(model_nb, test, type = "raw")`

`print(results)`

`#              X         Y         Z`

`# [1,] 0.5811966 0.2094017 0.2094017`

Again, can someone please explain why R produces this output? Am I going wrong anywhere in my calculation?

I also need some explanation of how P(V3) is calculated when Laplace smoothing is applied.

Thanks in advance!


Answer Source

The problem is that you are using just one sample for the test dataset, with only one value of `V3`. If you give it a bit more test data you get sensible/expected results (focusing only on your **case 1**):

```
test <- data.frame(V3=c("Yes", "No"))
predict(model_nb, test, type="raw")
#                X         Y         Z
# [1,] 0.007936508 0.4960317 0.4960317
# [2,] 0.571428571 0.2142857 0.2142857
```

Note that you don't get exactly 0, 0.5, 0.5 for V3="Yes", since the function uses a threshold, which you can adjust; see `?predict.naiveBayes` for more info.
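The near-zero value for X can be reproduced by hand. The sketch below assumes the documented default `threshold = 0.001` of `predict.naiveBayes` and simply substitutes it for the zero conditional probability; the variable names are illustrative, and the priors and conditionals are the ones from the question.

```r
# Priors and P(V3 = "Yes" | class) from the question (laplace = 0).
prior <- c(X = 16, Y = 7, Z = 7) / 30
p_yes <- c(X = 0, Y = 1 / 7, Z = 1 / 7)

# Assumption: zero probabilities are replaced by the threshold (default 0.001).
threshold <- 0.001
p_yes[p_yes == 0] <- threshold

# Bayes' rule: normalise prior * likelihood so the posteriors sum to 1.
joint <- prior * p_yes
posterior <- joint / sum(joint)
print(round(posterior, 7))
```

The printed values should agree with the first row of the output above, which suggests the threshold is the only reason X is not exactly 0.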

The problem is actually due to the internal implementation of `predict.naiveBayes` (the source code is in the CRAN repository). I'm not going to go into all the details, but basically I've debugged the function, and at a certain step there is this line,

```
newdata <- data.matrix(newdata)
```

which will later decide which column of the conditional probabilities to use. With your original data the data.matrix looks like this:

```
data.matrix(data.frame(V3="Yes"))
#      V3
# [1,]  1
```

Thus it later assumes that the conditional probabilities are to be taken from column 1, i.e. the values 1.0000000, 0.8571429 and 0.8571429 for V3="No", and that's why you were getting results as if V3 were actually "No".
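As a rough check of this explanation, the following sketch indexes the V3 conditional table by the integer code and applies Bayes' rule by hand. It is not the package's code, only an illustration with made-up variable names, using the priors and conditionals from the question.

```r
prior <- c(X = 16, Y = 7, Z = 7) / 30

# Conditional table for V3 from the question: rows are classes, columns are levels.
v3 <- rbind(X = c(No = 1,     Yes = 0),
            Y = c(No = 6 / 7, Yes = 1 / 7),
            Z = c(No = 6 / 7, Yes = 1 / 7))

code <- 1                      # what data.matrix() returned for the lone "Yes"
joint <- prior * v3[, code]    # column 1 is "No", not "Yes"
posterior <- joint / sum(joint)
print(round(posterior, 7))
```

The result should match the 0.5714286, 0.2142857, 0.2142857 reported in the question, which is consistent with the "No" column having been used.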

However,

```
data.matrix(data.frame(V3=c("Yes", "No")))
#      V3
# [1,]  2
# [2,]  1
```

gives column 2 of the conditional probabilities when V3 is "Yes", and thus you get the right result.

I'm pretty sure your **case 2** is just analogous.
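For case 2, a hand calculation along the same lines seems to confirm this. The sketch below assumes that smoothing is applied only to the conditional tables (the priors stay at 16/30, 7/30, 7/30, as in the question) and that P(V3) is simply the normalising sum of prior times likelihood over the classes; the names are illustrative.

```r
# Class sizes and counts of V3 == "Yes" per class, read off the training data.
n_class  <- c(X = 16, Y = 7, Z = 7)
n_yes    <- c(X = 0,  Y = 1, Z = 1)
laplace  <- 1
n_levels <- 2                  # V3 takes the values "No" and "Yes"

prior <- n_class / sum(n_class)
p_yes <- (n_yes + laplace) / (n_class + laplace * n_levels)
p_no  <- 1 - p_yes

# P(V3) is just the normalising constant: the sum of prior * likelihood.
normalise <- function(joint) joint / sum(joint)
post_yes <- normalise(prior * p_yes)   # what V3 = "Yes" should give
post_no  <- normalise(prior * p_no)    # what the mis-coded test row gives
print(round(rbind(Yes = post_yes, No = post_no), 7))
```

The "No" row should reproduce the 0.5811966, 0.2094017, 0.2094017 from the question, while the "Yes" row gives the expected tie between Y and Z.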

Hope it helps.

**EDIT after comments:** I guess the easiest way to solve it would be to put all the data in one data.frame and select the indexes you use for training/testing your model. Many functions accept `subset` to select the data used for training, and `naiveBayes` is no exception. However, for `predict.naiveBayes` you have to select the rows by index yourself. Something like this:

```
all_data <- rbind(train, c(NA, NA, NA, "Yes", NA, NA))
trainIndex <- 1:30
model_nb <- naiveBayes(Class~., data=all_data, laplace=0, subset=trainIndex)
predict(model_nb, all_data[-trainIndex,], type="raw")
```

gives the expected result.

```
X Y Z
[1,] 0.007936508 0.4960317 0.4960317
```

Note that this works because in this case the `data.matrix` operation gives the right result.

```
data.matrix(all_data[-trainIndex,])
#    Class V1 V2 V3 V4 V5
# 31    NA NA NA  2 NA NA
```

**EDIT2 after comments:** Some more details on why this is happening.

When you define your `test` dataframe with only one value, equal to "Yes", the conversion performed by `data.matrix` has no way to know that your variable `V3` has 2 possible values, "Yes" and "No". `test$V3` is actually a factor:

```
test <- data.frame(V3="Yes")
class(test$V3)
# [1] "factor"
```

and, as said, it has only one level (there is no way for the data.frame to know there are actually 2):

```
levels(test$V3)
# [1] "Yes"
```

The implementation of `data.matrix`, as you can see in the docs, uses the levels of the factor: "Factors and ordered factors are replaced by their internal codes."

Thus when converting `test` with `data.matrix`, it sees only one possible value of the factor and encodes it as 1,

```
data.matrix(test)
#      V3
# [1,]  1
```
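One caveat, assuming a newer R: since R 4.0.0 `data.frame()` no longer converts strings to factors by default, so `class(test$V3)` reports "character" there. As far as I can tell the outcome is unchanged, because the `data.matrix` documentation says character columns are first converted to factors and then to integers. A quick check that should behave the same on either version:

```r
test <- data.frame(V3 = "Yes")
class(test$V3)                 # "factor" before R 4.0.0, "character" from 4.0.0 on

# Converting to a factor only sees the values present, so "Yes" gets code 1.
as.integer(factor(test$V3))

# data.matrix() ends up with the same code 1 in both cases.
data.matrix(test)
```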

However, when you do the trick of putting training and test into the same dataframe, the factor levels are properly defined.

```
levels(all_data$V3)
# [1] "No"  "Yes"
```

The result would be the same if you did this:

```
test <- data.frame(V3=factor("Yes", levels=levels(all_data$V3)))
test
#    V3
# 1 Yes
levels(test$V3)
# [1] "No"  "Yes"
data.matrix(test)
#      V3
# [1,]  2
```
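If the test set has several columns, the same idea can be wrapped in a small helper. `align_levels` below is a hypothetical function written for this answer, not part of e1071, and the tiny `train` data frame is a stand-in for the real training data; it is a sketch that assumes the training columns are factors.

```r
# Hypothetical helper: re-code every column of `test` that is a factor in
# `train` with the training levels, so the integer codes mean the same thing.
align_levels <- function(test, train) {
  for (col in intersect(names(test), names(train))) {
    if (is.factor(train[[col]])) {
      test[[col]] <- factor(as.character(test[[col]]),
                            levels = levels(train[[col]]))
    }
  }
  test
}

train <- data.frame(V3 = factor(c("No", "Yes", "No")))  # stand-in training data
test  <- data.frame(V3 = "Yes")

data.matrix(test)                       # "Yes" is the only value seen: code 1
data.matrix(align_levels(test, train))  # training levels applied: code 2
```

Values in the test data that never occur in the training data become `NA` with this approach, which is probably what you want to notice anyway.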
