airsplay / R2R-EnvDrop

PyTorch Code of NAACL 2019 paper "Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout"


reward quantification

chijames opened this issue · comments

Hi,

Thanks for the code. Can you please explain a bit more about why we need to quantize the reward in agent.py? I did not see this in the paper. Thanks.

Hi,

The discrete reward is a standard technique in reinforcement learning with hand-crafted rewards. One example is Visual Attention for multiple object detection.
Using raw distances as rewards would give roughly similar results but is not robust: the accuracy fluctuates during training and has a large variance w.r.t. different random seeds.

In general, most RL methods are over-sensitive to the scale and distribution of the reward. Hence it is common to see reward normalization techniques in code, e.g., reward * alpha (to scale the reward distribution) or ln(reward) / exp(reward) / 1/reward (to change the family of distributions).

The discrete reward is yet another technique to normalize the distribution of rewards, one that is insensitive to both the scale and the distribution family of the raw reward.
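For concreteness, here is a minimal Python sketch of the two kinds of step reward being compared; the function names and the exact thresholds are illustrative assumptions, not the precise code in agent.py:

```python
def raw_reward(last_dist, dist):
    """Raw shaped reward: how much closer the agent got to the goal (in meters)."""
    return last_dist - dist

def quantized_reward(last_dist, dist):
    """Discretized version: only the sign of the progress is kept."""
    progress = last_dist - dist
    if progress > 0.0:
        return 1.0       # moved closer to the goal
    elif progress < 0.0:
        return -1.0      # moved away from the goal
    else:
        return 0.0       # no change in distance to the goal

# The quantized reward is insensitive to the scale of the environment:
print(raw_reward(10.0, 4.0), quantized_reward(10.0, 4.0))    # 6.0  vs 1.0
print(raw_reward(10.0, 9.75), quantized_reward(10.0, 9.75))  # 0.25 vs 1.0
```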

However, there is no free lunch! The discrete reward introduces a "positive weight cycle": a loop in the graph along which the accumulated reward is positive, so the agent can keep collecting reward by circling it. This issue is addressed by our RL + IL method.
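To see how such a cycle can arise under quantization, consider a hypothetical loop in the navigation graph where the agent creeps toward the goal in several small steps and then jumps back to its start node in one large step (the distances below are made up for illustration):

```python
# Hypothetical distances-to-goal along a loop that returns to its start node:
# three small steps toward the goal, then one large step back to the start.
dists = [10.0, 9.5, 9.0, 8.5, 10.0]   # made-up numbers, back to 10.0 at the end

raw = sum(prev - cur for prev, cur in zip(dists, dists[1:]))
quantized = sum(1.0 if prev > cur else -1.0 if prev < cur else 0.0
                for prev, cur in zip(dists, dists[1:]))

print(raw)        # 0.0 -> the raw shaped reward telescopes to zero over a cycle
print(quantized)  # 2.0 -> the agent earns +2 per lap just by looping
```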

Hope these help!

Thanks for your thorough explanation!

@airsplay

Using raw distances as rewards would give roughly similar results but is not robust.

I don't understand how this phenomenon happens. Could you point me to some work about it?

@ZhuFengdaaa
Hmm, there might not be a paper that mentions it explicitly, but I can try to explain it.

The stability of RL depends (to some extent) on the distribution of rewards. With discretization, the distribution is reduced to a well-understood Bernoulli distribution. In contrast, the distribution of distances on a graph depends on the characteristics of that graph, so a few normalization techniques need to be applied, e.g., taking the log of the reward if its distribution belongs to some specific exponential family.

However, you do not always know the characteristics of the graph, so a biased reward normalization can make RL training unstable.
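As a rough illustration of this last point, the sketch below compares the raw distance-delta reward in two hypothetical environments with very different step-size distributions against the sign-quantized reward; the distributions are made-up assumptions, not measurements from R2R:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical environments whose per-step changes in distance-to-goal
# follow very different distributions (e.g., small rooms vs. long corridors).
deltas_small = rng.normal(loc=0.2, scale=0.3, size=10_000)        # made-up statistics
deltas_large = rng.lognormal(mean=1.0, sigma=0.8, size=10_000) - 2.0

for name, deltas in [("small-scale graph", deltas_small),
                     ("large-scale graph", deltas_large)]:
    raw = deltas                 # raw shaped reward = progress in meters
    quantized = np.sign(deltas)  # discretized reward in {-1, 0, +1}
    print(f"{name}: raw mean/std = {raw.mean():.2f}/{raw.std():.2f}, "
          f"quantized mean/std = {quantized.mean():.2f}/{quantized.std():.2f}")

# The raw reward statistics shift with the graph, so a fixed normalization
# (scaling, log, etc.) tuned on one environment can be biased on another;
# the quantized reward keeps the same bounded support in both.
```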