F. Junkert - 1 year ago 53

Python Question

The heading might be confusing - i will try to explain what I want to do:

I'm studying computer-science and trying to implement a small movie-recommender as a project for my lecture "Data-Warehousing & Data Mining". Right now I'm trying to compute the similarity of 2 users according to their movie-ratings.

`class Rating(Model):`

def __init__(self, userID, movieID, rating):

...

I overrode

`__eq__, __ne__ and __hash__`

`def similarity(userA, userB):`

ratingsA = userA.ratings

ratingsB = userB.ratings

common_ratings = set((ratingsA, ratingsB))

What i want now is something like the following:

2 Lists sorted in the same order to make it possible to calculate the cosin-distance of the users/their ratings.

`[Rating(userID=1, movieID=4, rating=4.7), Rating(user=1, movie=7, rating=9.8)]`

[Rating(userID=2, movieID=4, rating=2.0), Rating(user=2, movie=7, rating=6.6)]

I really don't like my approach, but I couldn't find a better one the last couple hours.

Another, less efficient way (I think?) would be like this:

`lA = []`

lB = []

for rA in ratingsA:

for rB in ratingsB:

if rA.movieID == rB.movieID:

lA.append(rA)

lB.append(rB)

sim = cos_dist(lA, lB)

This approach would probably work but i guess the runtime would be horrible, since there are around 40000 movies and the probability that 2 users have rated same movies is pretty low...

Maybe someone has an efficient approach?

Thank you in advance!

Answer Source

Your approach is O(N^2) worst case. You can reduce the complexity to O(N log N) sorting the ratings lists:

```
sorted_ratingsA = sorted(ratingsA, lambda x: x.movieID)
sorted_ratingsB = sorted(ratingsB, lambda x: x.movieID)
```

And now we can pop the items from these lists from the last one (for efficiency reasons) and use the order over the `movieID`

to check whether a certain id was rated by the user. Something along the lines of:

```
lA = []
lB = []
maxA = sorted_ratingsA.pop()
maxB = sorted_ratingsB.pop()
while sorted_ratingsA and sorted_ratingsB:
if maxA.movieID == maxB.movieID:
lA.append(maxA)
lb.append(maxB)
# instead of the following two pop calls you could simply
# change the elif into a new if statement.
maxA = sorted_ratingsA.pop()
maxB = sorted_ratingsB.pop()
elif maxA < maxB:
maxB = sorted_ratingsB.pop()
else:
maxA = sorted_ratingsA.pop()
```

As you can see the list that contain the maximum value is popped until either an equal id is found or until the id goes below, in that case you start popping from the other list. The fact the the lists are in ascending order means you find all matches in O(N log N).

It is essential to use `pop()`

because popping the end of a `list`

takes *amortized* O(1) time, while using something like `pop(0)`

would cost O(N) for each pop on average and would reintroduce an O(N^2) factor.

An alternative approach is to simply use hashing, and this should get you on average time O(N). You first create two maps from `movieID`

to ratings, and then intersect the maps:

```
mapA = {x.movieId: x for x in ratingsA}
mapB = {x.movieId: x for x in ratingsB}
common_keys = mapA.keys() & mapB.keys()
lA = [mapA[k] for k in common_keys]
lB = [mapB[k] for k in common_keys]
```

If you are using python<3.x replace `keys()`

with `viewkeys()`

.

Note: even if this solution uses hashing, the order of `lA`

and `lB`

matches because iteration order over a `set`

changes only if the set is modified, so the two iterations above retrieve the corresponding ratings. However the order of the ratings itself is not defined (so you don't know the order in which the `movieID`

will appear, but you know that it will match between `lA`

and `lB`

).

You didn't mention SQL in your question, in any case if these objects are in an SQL database it's better to just let the database do the search for you.
You probably have a `rankings`

table with various fields and you want to do:

```
SELECT * FROM rankings
JOIN rankings AS rankings2
ON rankings.movieID = rankings2.movieID
```