Yelp / MOE

A global, black-box optimization engine for real-world metric optimization.

For Beta distribution points, sample or use the variance?

orasis opened this issue

Hi,

I have an existing Bayesian bandit system, and I might like to use MOE to discover new points to test.

For a beta distribution, is there some way to feed it in directly, using value_var to provide the variance? Or do I have to sample the beta distribution and insert maybe 10 or so points per beta, rather than providing the full information found in the beta distribution?

Also, the documentation said it likes the points to be zero mean. A beta distribution is between 0 and 1, so what would be the proper way to correct to a zero mean?

Thank you so much, I look forward to trying out MOE.

Sorry I've been gone for so long! See: #448 for details.

@orasis, I'm not sure I'm understanding your question. What are you trying to do? Pick the best one of several beta distributed choices (that would be bandits)? Or tune something that is at every point a beta distribution (that'd be the optimal learning bit)?

Are you trying to use MOE's bandit framework? Our beta-based (BLA, aka Thompson sampling) bandit does not support a variance input. I'm not sure this makes sense... like, are you uncertain of the number of trials & successes?

Or do you have input data that follows a beta distribution?

MOE works by modeling every point in the 'world' as a Gaussian (roughly) with some mean & variance. Nearby points influence each other through the covariance kernel. Modeling things as Gaussian makes the math work out nicer (indeed it makes the math even possible without resorting to massive monte carlo).

That assumption is not as limiting as it sounds, as MOE is guaranteed to converge (asymptotically) for any optimization problem, even if the input distribution is not Gaussian.

If your data is beta distributed everywhere, then you can compute the first & second moments, i.e., mean & variance. That is what you'd feed into MOE at each location for objective value & variance. Well, I think so, if I'm imagining your scenario right.
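
A minimal sketch of that moment computation, assuming each arm's posterior is Beta(a, b); `mean` would go in as the objective value (after centering, see below) and `var` as value_var:

```python
def beta_moments(a, b):
    """Mean and variance of a Beta(a, b) distribution."""
    a, b = float(a), float(b)
    mean = a / (a + b)
    var = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, var

# e.g. an arm with 40 successes and 60 failures, under a uniform Beta(1, 1)
# prior (an assumption; use whatever prior your bandit actually uses):
mean, var = beta_moments(1 + 40, 1 + 60)
```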

Finally, the 0-mean thing just means you would want to subtract the GLOBAL mean. So if I observed function values [1, -1, 3], the mean is 1 and I would want to input [0, -2, 2]. You could update the mean after every sample (which means retuning hyperparameters every time), or you can space it out.
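
Concretely, with the numbers above:

```python
values = [1.0, -1.0, 3.0]
global_mean = sum(values) / len(values)       # 1.0
centered = [v - global_mean for v in values]  # [0.0, -2.0, 2.0]
```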

The issue is that MOE assumes a prior that is 0 mean and variance alpha (the hyperparameter). Fire up the demo (plotter) and look at what happens when you have no data but change the signal variance hyperparam. So if your objective values are say [10, 12, 8], and the prior has mean 0, then MOE will want to sample far away (minimizing) b/c the prior is way better. Similarly if you had [-10, -12, -8], MOE may over-favor points near your existing samples b/c the prior (with value 0) is way worse.

Without exactly understanding what you're trying to do, though, I can't offer precise advice on what to do with the mean.

Oh, maybe this is what you're doing... at Yelp, one of the things we used MOE to do was optimize parameters of the advertising system to increase click-through rate. So we'd fire up a few experimental cohorts.

Multi-armed bandits would spin down an arm that sucks. Then we'd take clicks, attempts and compute mean & variance to feed into MOE. And ask MOE for a new experimental cohort to replace the one we just stopped.

Is that like what you're doing?

Yes. So I have an MAB system using Thompson sampling where I track the trial count and success count. I want to provide these arms to the GP so that it can suggest new arms to try. In this way a fixed number of arms is used as points in a continuous space, which should provide good performance and scaling.

Gotcha. So that's basically the same as the Yelp example I described.

For each data point, you'd want to provide MOE with the mean & variance. For us, we were observing user clicks, so we computed mean/var as in a binomial distribution. I'm not sure what you're measuring, but you'd want either that or mean/var from beta. I believe that since you're using Thompson sampling, that would mean binomial b/c beta is your prior... but I could be confused.
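
A sketch of the binomial version, assuming you track clicks and attempts per arm; the variance here is that of the sample mean, p(1 - p)/n:

```python
def binomial_ctr_moments(clicks, attempts):
    """Sample CTR and the variance of that sample mean."""
    p = float(clicks) / attempts
    return p, p * (1.0 - p) / attempts

# e.g. 38 clicks out of 100 attempts (made-up counts):
mean, var = binomial_ctr_moments(38, 100)
```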

As for adjusting the points to be 0 mean, you'll want to select a system-wide mean value (keep reading) and subtract that from the per cohort/arm mean you computed above.

For Yelp ads we set this to be some representative click-through rate, let's say 40% (made-up number). 40% made sense b/c we saw CTRs in the [30%, 50%] range. So maybe you know what the 'average' is a priori, or maybe you select a representative average, use the average across cohorts/arms, (min + max)/2, many options.
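
For illustration, those options side by side (the per-arm CTRs are made up):

```python
arm_means = [0.32, 0.41, 0.47]  # made-up per-arm CTRs

global_mean_prior = 0.40                                   # known a priori
global_mean_avg = sum(arm_means) / len(arm_means)          # average across arms
global_mean_mid = (min(arm_means) + max(arm_means)) / 2.0  # (min + max) / 2
```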

As I noted above (again, say we're minimizing), a global mean value that is very low (i.e., your other observations are mostly above it = worse than it) will push MOE to be more exploratory. This is b/c the sampled points will look generally bad, and the unknown (the prior) will look comparatively good. And a mean value that is very high (other values below it = better than it) will push MOE to be more exploitative, for the opposite reasons.

Remember that changing the 'global' mean value will require retraining hyperparameters. While the parameters tested don't change, the objective function values will.

P.S. I don't have the reference handy, but there are more scientific ways of picking this value (based on maximum likelihood). But implementing these isn't trivial b/c it introduces a dependency on the hyperparameters and you'd need to optimize everything together. Plus, these estimates seem to do pretty badly early on when you don't have much data. So I haven't done it here.

P.P.S. Remember also to put reasonable bounds on the hyperparameter alpha. This is the variance of the prior. Setting a huge value (relative to observed CTRs) will cause lots of exploration & a tiny value will cause mostly exploitation, b/c a tiny value would make it impossible for the prior to perform better than a point near the current best.

You may find it easier to reason about bounds on alpha if you rescale your objective function as: (value - global_mean) / global_mean, i.e., make it relative. There you can say things like "less than a 1% relative change is uninteresting" and "more than a 50% relative change is unlikely" and set your bounds accordingly, then transform back (or pass MOE transformed values).
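
A sketch of that rescaling; note the mapping from the 1%/50% guidance to concrete bounds is my assumption, on the grounds that alpha is a variance, i.e., a squared scale, in the rescaled units:

```python
def to_relative(value, global_mean):
    # Rescale so objective values are relative changes from the global mean.
    return (value - global_mean) / global_mean

# "Less than a 1% relative change is uninteresting" and "more than a 50%
# relative change is unlikely" would then suggest bounds of roughly
# (0.01**2, 0.5**2) on alpha (my assumption).
alpha_bounds = (0.01 ** 2, 0.5 ** 2)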

This is excellent information. In particular I hadn't thought about data points mostly above zero causing more exploration. Thank you!

Sure, glad I could help.

Keep in mind that you should not apply the logic: "I want MOE to explore more, let me change the mean so that more points look bad." The global mean should be your best estimate or a representative estimate.

Motivating example: Imagine a GP with a single point. That point carries most of its influence in a hyperellipsoid with principal axes 2-3 times MOE's length scales (at 3x away, the 'effect' is 1% of the center). Outside of this, the behavior is governed by the prior. This is invariant of the actual function value: it could be 1, 10, 100, doesn't matter; you're back to the prior over the same length scales.
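
For intuition, that 1% figure falls out of the squared exponential kernel (assuming that kernel and unit signal variance):

```python
import math

length_scale = 1.0
d = 3.0 * length_scale  # three length scales from the sampled point
# Squared exponential covariance: exp(-d^2 / (2 * l^2))
print(math.exp(-d ** 2 / (2.0 * length_scale ** 2)))  # ~0.011, about 1%
```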

If you imagine an extreme case where my objective varies in [10,20] and I set global_mean = 0, then MOE's search behavior can be loosely described as checking points most distant from all previous points. And you hardly need MOE if that's the search strategy you want, ha.