Bas - 5 months ago 51

C# Question

I'm developing an item-based collaborative filter that uses an adjusted cosine similarity between restaurants to generate recommendations. Everything is set up and works well, but when I simulated some test scenarios, I got some interesting results.

I'll start with my test data. I have 2 restaurants that I want to calculate a similarity between, and 3 users who have each given both restaurants the same rating. The following matrix shows the ratings:

                 | User 1 | User 2 | User 3
    Restaurant 1 | 1      | 2      | 1
    Restaurant 2 | 1      | 2      | 1

I'm trying to calculate the similarity using the following function (restaurants are called `Subject` in the code):

    public double ComputeSimilarity(Guid subject1, Guid subject2, IEnumerable<Review> allReviews)
    {
        // This will create an IEnumerable of reviews from the same user on the 2 restaurants.
        var matches = (from R1 in allReviews.Where(x => x.SubjectId == subject1)
                       from R2 in allReviews.Where(x => x.SubjectId == subject2)
                       where R1.UserId == R2.UserId
                       select new { R1, R2 });

        double num = 0.0f;
        double dem1 = 0.0f;
        double dem2 = 0.0f;

        // For the similarity between subjects, we use an adjusted cosine similarity.
        // More information on this can be found here: http://www10.org/cdrom/papers/519/node14.html
        foreach (var item in matches)
        {
            // First get the average of all reviews the user has given. This is used in the
            // adjusted cosine similarity; read the article from the link for further explanation.
            double avg = allReviews.Where(x => x.UserId == item.R1.UserId)
                                   .Average(x => x.rating);

            num += ((item.R1.rating - avg) * (item.R2.rating - avg));
            dem1 += Math.Pow((item.R1.rating - avg), 2);
            dem2 += Math.Pow((item.R2.rating - avg), 2);
        }

        return (num / (Math.Sqrt(dem1) * Math.Sqrt(dem2)));
    }
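For reference, the adjusted cosine similarity from the linked page, which the function above follows, can be written as below. The notation is chosen here for readability and is an assumption, not taken from the post: $U$ is the set of users who rated both subjects $i$ and $j$, $R_{u,i}$ is the rating user $u$ gave subject $i$, and $\bar{R}_u$ is the average of all ratings given by user $u$.

```latex
\operatorname{sim}(i, j) =
  \frac{\sum_{u \in U} (R_{u,i} - \bar{R}_u)(R_{u,j} - \bar{R}_u)}
       {\sqrt{\sum_{u \in U} (R_{u,i} - \bar{R}_u)^2}\;
        \sqrt{\sum_{u \in U} (R_{u,j} - \bar{R}_u)^2}}
```

In the code, `num` accumulates the numerator, and `dem1` and `dem2` accumulate the two sums under the square roots.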

My review looks like this:

    public class Review
    {
        public Guid Id { get; set; }
        public int rating { get; set; }      // This can be an integer between 1-5
        public Guid SubjectId { get; set; }  // This is the guid of the subject the review has been left on
        public Guid UserId { get; set; }     // This is the guid of the user who left the review
    }

In all other scenarios the function calculates a correct similarity between subjects. But when I use the test data above (where I expected a perfect similarity), it returns NaN.

Is this an error in my code or is this an error in the adjusted cosine similarity? And if it results in NaN, is it good to catch this and insert a `1`?

Edit: I have tried with other matrices too, and I got even more interesting results.

                 | User 1 | User 2 | User 3 | User 4 | User 5
    Restaurant 1 | 1      | 2      | 1      | 1      | 2
    Restaurant 2 | 1      | 2      | 1      | 1      | 2

This still results in NaN.

                 | User 1 | User 2 | User 3 | User 4 | User 5
    Restaurant 1 | 2      | 2      | 1      | 1      | 2
    Restaurant 2 | 1      | 2      | 1      | 1      | 2

This results in `-1`.

Answer

Your algorithm appears to be implemented correctly. The formula itself can be undefined for perfectly reasonable inputs. In your test data every user gives both restaurants the same rating, so (assuming these are each user's only reviews) every rating equals that user's average, all the deviations are zero, and the result is 0/0, which is NaN. You can read this case as "adjusted cosine similarity has nothing to say about these sets", so it is not correct to assign an arbitrary value (0, 1 or -1). Instead, use a different measure in this case. For example, plain (non-adjusted) cosine similarity gives 1 here, which is what you might expect.
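One way to apply this is sketched below: compute the adjusted cosine similarity, and fall back to plain cosine similarity only when a denominator term is zero. This is a minimal sketch, not a definitive implementation. The class `SimilaritySketch`, the method `ComputeSimilarityWithFallback`, and the choice to return `null` when the two subjects share no reviewers are all assumptions made for illustration. The `Review` class is repeated from the question so the block is self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Review
{
    public Guid Id { get; set; }
    public int rating { get; set; }      // integer between 1 and 5
    public Guid SubjectId { get; set; }
    public Guid UserId { get; set; }
}

public static class SimilaritySketch
{
    // Hypothetical helper: adjusted cosine similarity, falling back to plain
    // cosine similarity when the adjusted form is undefined (zero denominator).
    // Returns null when no user has reviewed both subjects (an assumption).
    public static double? ComputeSimilarityWithFallback(
        Guid subject1, Guid subject2, IEnumerable<Review> allReviews)
    {
        var reviews = allReviews.ToList();

        // Average rating per user, computed once instead of once per match.
        var averages = reviews
            .GroupBy(r => r.UserId)
            .ToDictionary(g => g.Key, g => g.Average(r => r.rating));

        // Pairs of reviews left by the same user on the two subjects.
        var matches = (from r1 in reviews.Where(r => r.SubjectId == subject1)
                       join r2 in reviews.Where(r => r.SubjectId == subject2)
                           on r1.UserId equals r2.UserId
                       select new { R1 = r1, R2 = r2 }).ToList();

        if (matches.Count == 0)
            return null;

        double num = 0.0, dem1 = 0.0, dem2 = 0.0;    // adjusted cosine terms
        double dot = 0.0, norm1 = 0.0, norm2 = 0.0;  // plain cosine terms

        foreach (var m in matches)
        {
            double avg = averages[m.R1.UserId];
            double d1 = m.R1.rating - avg;
            double d2 = m.R2.rating - avg;

            num += d1 * d2;
            dem1 += d1 * d1;
            dem2 += d2 * d2;

            dot += (double)m.R1.rating * m.R2.rating;
            norm1 += (double)m.R1.rating * m.R1.rating;
            norm2 += (double)m.R2.rating * m.R2.rating;
        }

        // Adjusted cosine similarity is defined only when both sums are non-zero.
        if (dem1 > 0.0 && dem2 > 0.0)
            return num / Math.Sqrt(dem1 * dem2);

        // Fallback: plain (non-adjusted) cosine similarity on the raw ratings.
        if (norm1 > 0.0 && norm2 > 0.0)
            return dot / Math.Sqrt(norm1 * norm2);

        return null;
    }
}
```

Assuming the reviews in each matrix are the users' only reviews, the first test matrix from the question takes the fallback branch and gives 1, while the last matrix still gives -1 because the adjusted formula is defined there. Values from the two measures are not strictly comparable, so it may be worth recording which branch produced a given score.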