linkedin / photon-ml

A scalable machine learning library on Apache Spark

Help needed on using the GAME API

nishamuktewar opened this issue

Hello,

I am trying to understand how to use the GAME API, especially how to include both kinds of mixed effects, random intercepts and random slopes, and thought it might be okay to ask here. Let's say I am using the MovieLens dataset and want to build a mixed model with just a random intercept per userId. How can that be achieved? I tried specifying it in the following way:

spark2-submit \
  --class com.linkedin.photon.ml.cli.game.training.Driver \
  --master yarn \
  --deploy-mode client \
  --num-executors 4 \
  --driver-memory 10g \
  --executor-memory 10g \
  photon-all_2.11-1.0.0.jar \
  --train-input-dirs "hdfs:///user/nisha/Data/photon-ml/movieLens/train/" \
  --output-dir "hdfs:///user/nisha/Data/photon-ml/movieLens/output" \
  --task-type "LINEAR_REGRESSION" \
  --feature-name-and-term-set-path "hdfs:///user/nisha/Data/photon-ml/movieLens/featuresets/" \
  --feature-shard-id-to-feature-section-keys-map "globalShard:|userShard:" \
  --updating-sequence global,per-user \
  --application-name "GAME model testing" \
  --validate-input-dirs "hdfs:///user/nisha/Data/photon-ml/movieLens/test" \
  --fixed-effect-optimization-configurations "global:10,1e-5,1,1.0,TRON,L2" \
  --random-effect-optimization-configurations "per-user:10,1e-5,1,1.0,TRON,L2" \
  --fixed-effect-data-configurations "global:globalShard,1" \
  --random-effect-data-configurations "per-user:userId,userShard,1,10,5,0.5,index_map" \
  --input-column-names "response:response|uid:userId|offset:offset|weight:weight|metadataMap:metadataMap" \
  --delete-output-dir-if-exists "true" \
  --num-iterations 5 \
  --evaluator-type RMSE \
  --summarization-output-dir "hdfs:///user/nisha/Data/photon-ml/movieLens/training-smry" \
  --normalization-type NONE \
  --compute-variance false

This does produce coefficients: a fixed-effect intercept and one intercept per userId.
Fixed-effect intercept (probably the mean of the training-set response variable, rating):

{u'variances': None, u'means': [{u'term': u'', u'name': u'(INTERCEPT)', u'value': 3.5454240769000385}], u'modelClass': u'com.linkedin.photon.ml.supervised.regression.LinearRegressionModel', u'lossFunction': u'', u'modelId': u'fixed-effect'}

random effects for each userId:

{u'variances': None, u'means': [{u'term': u'', u'name': u'(INTERCEPT)', u'value': 0.7932438678450324}], u'modelClass': u'com.linkedin.photon.ml.supervised.regression.LinearRegressionModel', u'lossFunction': u'', u'modelId': u'273'}
{u'variances': None, u'means': [{u'term': u'', u'name': u'(INTERCEPT)', u'value': 0.10382895222067612}], u'modelClass': u'com.linkedin.photon.ml.supervised.regression.LinearRegressionModel', u'lossFunction': u'', u'modelId': u'253'}
......

So does that mean the overall intercept for userId = 273 is actually 3.545 + 0.793, or about 4.34?
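
The sum in question can be checked with a short Python sketch. It assumes that GAME adds the fixed-effect and per-user contributions when scoring, so a user's overall intercept is the fixed-effect intercept plus that user's random-effect intercept. The record layout mirrors the output above; the helper names `intercept` and `combined_intercepts` are hypothetical and not part of Photon ML.

```python
# Coefficient records copied from the model output shown above.
fixed_effect = {
    "modelId": "fixed-effect",
    "means": [{"name": "(INTERCEPT)", "term": "", "value": 3.5454240769000385}],
}
random_effects = [
    {"modelId": "273",
     "means": [{"name": "(INTERCEPT)", "term": "", "value": 0.7932438678450324}]},
    {"modelId": "253",
     "means": [{"name": "(INTERCEPT)", "term": "", "value": 0.10382895222067612}]},
]


def intercept(record):
    """Return the (INTERCEPT) coefficient of one record, or 0.0 if absent."""
    for coef in record["means"]:
        if coef["name"] == "(INTERCEPT)":
            return coef["value"]
    return 0.0


def combined_intercepts(fixed, per_entity):
    """Map each modelId to fixed intercept + that entity's random intercept.

    Assumption: fixed and random effects are additive at scoring time.
    """
    base = intercept(fixed)
    return {rec["modelId"]: base + intercept(rec) for rec in per_entity}


totals = combined_intercepts(fixed_effect, random_effects)
for user_id, value in sorted(totals.items()):
    print(f"userId {user_id}: {value:.4f}")
```

Under that assumption, the combined values are the ones to compare with what `coef()` reports in R, since `coef()` also returns the fixed and random parts added together.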

If I were using lmer from R's lme4 package, I would use something like:

userModel <- lmer(rating ~ (1|userId), data=movieLensTrain)

and it would produce results of the form:

Fixed effects:
            Estimate Std. Error t value
(Intercept)  3.66559    0.01824     201
> coef(userModel)
$userId
    (Intercept)
1      2.706271
2      3.453963
.     ...
273    4.254384

where the intercept for userId = 273 is 4.254.

I understand that the numbers won't match exactly because of the different hyperparameters and underlying implementation, but I wanted to know if this is how it is done. Also, how can I add random slopes based on the userId?

Thank you for your time. Once I can figure this out I can help add some documentation on how to use this API.

Hi @nishamuktewar

Looks like you're on the right track. To add random slopes, you need to assign feature bags to the random-effect shard, e.g. userShard:genreFeatures,movieLatentFactorFeatures. This adds the features from the "genreFeatures" and "movieLatentFactorFeatures" bags to the feature vectors for the per-user problem, so the model learns a per-user slope on each of those features.
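
As a rough illustration, the Python sketch below builds the value of `--feature-shard-id-to-feature-section-keys-map` for both cases. The format (`|` between shards, `:` after the shard id, `,` between bags) is inferred from the command in the question and the example in this reply, not from the Photon ML documentation. The helper `shard_map` is hypothetical, and the bag names are just the examples above, so they would need to match the feature bags actually present in the input data.

```python
def shard_map(shards):
    """Format {shard_id: [feature_bag, ...]} as 'shardA:bag1,bag2|shardB:'.

    An empty bag list leaves the shard with no features beyond the
    intercept, as in the command from the question.
    """
    return "|".join(
        f"{shard}:{','.join(bags)}" for shard, bags in shards.items()
    )


# Random intercept only: no feature bags on the per-user shard.
intercept_only = shard_map({"globalShard": [], "userShard": []})

# Random slopes: assign feature bags to the per-user shard.
with_slopes = shard_map({
    "globalShard": [],
    "userShard": ["genreFeatures", "movieLatentFactorFeatures"],
})

print(intercept_only)
print(with_slopes)
```

The second string would replace the current value of `--feature-shard-id-to-feature-section-keys-map`; the rest of the command can presumably stay as it is.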

If you haven't seen it already, we have an interactive tutorial here: https://github.com/linkedin/photon-ml/wiki/Photon-ML-Tutorial

This tutorial shows how to use our new API, which should be a lot more user friendly than the command-line interface.

Thanks @joshvfleming.

Appreciate your response. So it seems my understanding of the random intercept coefficient for a userId is correct? I will try the random slope logic like you suggested and go through the tutorial. Thanks again.