fidelity / mabwiser

[IJAIT 2021] MABWiser: Contextual Multi-Armed Bandits Library

Home Page: https://fidelity.github.io/mabwiser/


Exploration is done on all arms simultaneously with LinGreedy policy

bjarlestam opened this issue · comments

When using the LinGreedy policy, exploration is done on all arms simultaneously. For example, if we set epsilon to 0.1, then with 10% probability the user gets a random recommendation on every arm, so when recommending a list of 10 items the user gets a completely random list. I think the user experience would be better if exploration were done on each arm individually, so that at least most of the recommendations stay relevant.
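To illustrate the difference, here is a simplified sketch (not MABWiser code; the scoring and sampling are placeholders) of list-level exploration versus per-item exploration:

import random

def recommend_list_level(arms, scores, k=10, epsilon=0.1):
    # Current behavior: one epsilon draw for the whole list, so with
    # probability epsilon all k recommendations are random at once.
    if random.random() < epsilon:
        return random.sample(arms, k)
    return sorted(arms, key=scores.get, reverse=True)[:k]

def recommend_per_item(arms, scores, k=10, epsilon=0.1):
    # Suggested behavior: an independent epsilon draw per slot, so most
    # of the list stays relevant even while exploring.
    ranked = sorted(arms, key=scores.get, reverse=True)
    recommendations = []
    for _ in range(k):
        remaining = [a for a in arms if a not in recommendations]
        if random.random() < epsilon:
            recommendations.append(random.choice(remaining))
        else:
            recommendations.append(next(a for a in ranked if a not in recommendations))
    return recommendations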

Hi @bjarlestam, that's an interesting case, thanks for sharing.

It is relatively easy to create customized bandits in MABWiser.

See an example here: https://github.com/fidelity/mabwiser/blob/master/examples/customized_mab.py

The idea is to let MABWiser run the engine under the hood and then customize/override one or more of its methods to fit your desired behavior.

You can create a custom bandit inheriting from _EpsilonGreedy and simply override its predict method.

An example of this is the customized max_regret algorithm implemented here, in place of the default max over expectations: https://github.com/fidelity/mabwiser/blob/master/examples/customized_mab.py#L42

In your case, you can get the expectation of each arm, but rather than taking the argmax() over all arms (as in the default epsilon greedy: https://github.com/fidelity/mabwiser/blob/master/mabwiser/greedy.py#L49), you can return the argmax over a restricted subset of arms (say, the most relevant ones, which remains application specific), the top-k best, or whatever fits your use case.

Something like this:

from typing import List
import numpy as np

# Note: internal import paths may differ slightly between MABWiser versions
from mabwiser.greedy import _EpsilonGreedy
from mabwiser.utils import Arm, _BaseRNG, create_rng


class MyCustomGreedy(_EpsilonGreedy):

    def __init__(self, rng: _BaseRNG, arms: List[Arm], n_jobs: int):
        super().__init__(rng, arms, n_jobs)

    def predict(self, contexts: np.ndarray = None):
        # Get the expectation of each arm from the parent policy
        arm_to_exp = super().predict_expectations()

        # TODO: change the selection mechanism here
        return max(arm_to_exp, key=arm_to_exp.get)

Then, run it as usual.

# arms, decisions, and rewards come from your own data
my_greedy = MyCustomGreedy(create_rng(42), arms, n_jobs=1)
my_greedy.fit(decisions, rewards)
prediction = my_greedy.predict()
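
And if you wanted the top-k flavor mentioned above, the same override could return several arms instead of a single argmax. A rough sketch, assuming predict_expectations() returns a dictionary from arm to expectation as in the default epsilon greedy (the k parameter is illustrative, not part of MABWiser):

# (uses the same imports as the MyCustomGreedy example above)
class MyTopKGreedy(_EpsilonGreedy):

    def __init__(self, rng: _BaseRNG, arms: List[Arm], n_jobs: int, k: int = 10):
        super().__init__(rng, arms, n_jobs)
        self.k = k  # number of arms to return (illustrative parameter)

    def predict(self, contexts: np.ndarray = None):
        # Rank arms by expectation and return the k best rather than one arm
        arm_to_exp = super().predict_expectations()
        ranked = sorted(arm_to_exp, key=arm_to_exp.get, reverse=True)
        return ranked[:self.k]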

Hope this helps, and if you are using MABWiser, would appreciate a GitHub star ⭐

Thanks, I might have a look at implementing a custom bandit then. Is it possible to use such a custom learning policy in MABWiser?

Yes, you can certainly add your new policy to the public API of MABWiser by exposing it as a new learning policy parameter. This is explained in more detail in the step-by-step guide: https://fidelity.github.io/mabwiser/new_bandit.html
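
For reference, once a policy is exposed that way it is used through the MAB class just like the built-in ones. Below is the existing EpsilonGreedy as an example; a contributed policy would appear as another LearningPolicy option:

from mabwiser.mab import MAB, LearningPolicy

# Built-in usage today; a contributed policy would be exposed the same way
mab = MAB(arms=['item1', 'item2', 'item3'],
          learning_policy=LearningPolicy.EpsilonGreedy(epsilon=0.1))
mab.fit(decisions=['item1', 'item2', 'item1'], rewards=[1, 0, 1])
prediction = mab.predict()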

That said, the custom implementation above already has all the methods that one could call from the public API of MABWiser. So, functionality-wise, using the custom implementation directly vs. calling it through the public MAB API largely overlaps.

On the other hand, exposing it via the public MAB API would be quite interesting if/when you would like to contribute your policy as a new publicly available method.

Hope this helps!