DerwenAI / rllib_tutorials

RLlib tutorials

Home Page: https://ray.io

Recsys scaled distance calculation hard to understand

vivecalindahl opened this issue · comments

I joined the Anyscale tutorial yesterday, thanks a lot for a very interesting walkthrough @ceteri!

I have trouble understanding some of the details of the Recsys notebook. Specifically, in calculating the observation vector:

scaled_diff = abs(c[item] - rating) / 2.0

Here `c` is a list whose length equals the number of users, while `item` is an integer < 100.
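To make the mismatch concrete, here is a minimal sketch (the shapes and names are illustrative, not taken from the notebook): indexing a per-user vector with an item index happens to stay in bounds only because there are fewer items than users.

```python
# Hypothetical shapes matching the report: ~25k users, 100 items (jokes).
n_users, n_items = 24983, 100

c = [0.0] * n_users   # a cluster center with one entry *per user*
item = 42             # an item index, item < n_items

# This indexes c by an item index. It doesn't raise an IndexError,
# because item < len(c), but it silently reads the entry for user 42
# rather than anything about item 42.
value = c[item]
```

So the expression type-checks and runs, which is why the dimension mix-up is easy to miss.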

The same issue has been raised in the Anyscale Academy repo; see #35.

Even if it's not incorrect, it's hard to understand. Some extra comments would be worth a lot!

Ah! Good point :) Thank you @vivecalindahl for the catch (I also hadn't seen a notification about the issue in anyscale/academy).
Will rework that -

Thanks for pointing out that code @vivecalindahl, it needs more comments describing what happens at that point.
Had to dig through to find an answer for that one :)

In the JokeRec.load_data() method, where the data gets loaded, these ratings are scaled in advance. The raw data ranges over [-10, 10], while the scaled data ranges over [-1.0, 1.0].

Then these scaled rating values get used as the sample data for the clustering.
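A minimal sketch of that scaling step (this is an assumption about the preprocessing, not the actual `JokeRec.load_data()` code): dividing each rating by 10 maps the raw Jester range [-10, 10] linearly onto [-1.0, 1.0], and those scaled rows are what the clustering then consumes.

```python
# Raw ratings, one row per user, one column per joke; values in [-10, 10].
raw_ratings = [
    [-10.0, 3.5, 7.0],
    [0.0, -4.2, 10.0],
]

# Linear rescale to [-1.0, 1.0] by dividing by the max absolute rating.
scaled = [[r / 10.0 for r in row] for row in raw_ratings]
# scaled[0] == [-1.0, 0.35, 0.7]
```

The clustering (e.g. k-means) then runs on these scaled rows, so the cluster centers live in the same [-1.0, 1.0] range as the ratings.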

The c[item] value is the cluster center for ratings of a particular item (an individual joke), not a number of users who've rated an item. This is scaled the same way as the rating values.

Does that help?

Thanks for the quick reply!

That helps with the scaling part, but my main confusion (and I think that of @YeziPeter) was the mismatch between the range of `item` and the dimension of `c`: `item` is an index in [0, 99], while the mean vector c = [m_0, m_1, ..., m_24982] is indexed by user indices in [0, 24982]. Apologies for repeating myself...

My guess was that the intent is to compute the active user's item rating relative to that user's average rating for each cluster. Something like:

scaled_diff = abs(c[active_user] - rating) / 2.0

But having read your explanation, quoted below, that seems to have been a poor guess:

The c[item] value is the cluster center for ratings of a particular item (an individual joke), not a number of users who've rated an item.

How can an individual item have a cluster center? An item is represented by a vector of user ratings; one can reduce that to a single value either by taking a mean or by projecting out the active user's rating.

Yes, definitely that was a bug: there was an incorrect dimension being used for the clustering data.
This recent patch #18 reworks the overall approach to resolve this problem:

  • removed the unnecessary/incorrect matrix transpose
  • changed the impute strategy to "median"
  • the CLUSTERS data structure is now a partition of N/K "nearest" items per cluster
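To illustrate the third point, here is a small sketch of what a balanced "N/K nearest items per cluster" partition could look like. This is not the code from patch #18; the function name, shapes, and the greedy assignment strategy are all assumptions for illustration.

```python
import math

def partition_items(centers, item_vectors):
    """Assign each of the N items to its nearest of K centers, capping
    each cluster at ceil(N / K) items so the result is a balanced
    partition of all items (a sketch, not the actual patch #18 code)."""
    n, k = len(item_vectors), len(centers)
    cap = math.ceil(n / k)
    # Rank every (item, center) pair by squared Euclidean distance.
    pairs = sorted(
        (sum((a - b) ** 2 for a, b in zip(vec, ctr)), i, j)
        for i, vec in enumerate(item_vectors)
        for j, ctr in enumerate(centers)
    )
    clusters = {j: [] for j in range(k)}
    assigned = set()
    # Greedily take the closest remaining pair, skipping full clusters.
    for _, i, j in pairs:
        if i not in assigned and len(clusters[j]) < cap:
            clusters[j].append(i)
            assigned.add(i)
    return clusters
```

For example, with two centers at 0.0 and 1.0 and four one-dimensional items at 0.1, 0.9, 0.2, 0.8, the partition puts items 0 and 2 in the first cluster and items 1 and 3 in the second, with every item assigned exactly once.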

Simulated results tend to score better now, and ostensibly this bug fix should result in better lift.

Many thanks for all your help @vivecalindahl – how does this look?

It makes more sense now and I think I understand the idea of the env implementation. Thanks for the fixes!