avisingh599 / reward-learning-rl

[RSS 2019] End-to-End Robotic Reinforcement Learning without Reward Engineering

Home Page:https://sites.google.com/view/reward-learning-rl/

VICE vs. SACClassifier

jgkim2020 opened this issue · comments

This is not an issue about the code implementation per se, but rather a question about the difference between the two algorithms.

It seems that the VICE class implementation follows the equation from the original VICE paper as well as the RSS paper: it trains the "logit" f(s) via the softmax discriminator D(s, a) with a cross-entropy loss.
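
To make this concrete, here is roughly the objective I am referring to (my own NumPy paraphrase, not the repo's code; all names are placeholders):

```python
import numpy as np

def vice_discriminator_loss(f_pos, log_pi_pos, f_neg, log_pi_neg):
    """Cross-entropy loss for the softmax discriminator
    D(s, a) = exp(f(s)) / (exp(f(s)) + pi(a|s)).

    f_*: classifier logits f(s); log_pi_*: log pi(a|s) from the current policy.
    Positives are goal examples, negatives are policy samples.
    """
    def log_d(f, log_pi):
        # log D = f - logsumexp([f, log_pi]); log(1 - D) = log_pi - logsumexp([f, log_pi])
        m = np.maximum(f, log_pi)
        log_z = m + np.log(np.exp(f - m) + np.exp(log_pi - m))
        return f - log_z, log_pi - log_z

    log_d_pos, _ = log_d(f_pos, log_pi_pos)
    _, log_one_minus_d_neg = log_d(f_neg, log_pi_neg)
    # Maximize log D on positives and log(1 - D) on negatives.
    return -(np.mean(log_d_pos) + np.mean(log_one_minus_d_neg))
```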

However, the SACClassifier class implementation does not use log_pi(a|s); instead it trains the "logit" via a sigmoid discriminator D(s) with a cross-entropy loss. Since SACClassifier uses negative samples (drawn from the replay buffer) when training the "logit" (or, equivalently, the event probability), it doesn't seem to be the "Naive Classifier" case mentioned in the RSS paper.
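
As I understand it, the SACClassifier-style objective would look roughly like the following (again an illustrative paraphrase, not the repo's code):

```python
import numpy as np

def sigmoid_xent(logits, labels):
    """Numerically stable binary cross-entropy on logits."""
    return np.mean(np.maximum(logits, 0) - logits * labels
                   + np.log1p(np.exp(-np.abs(logits))))

def naive_classifier_loss(f_goal, f_replay):
    """Train f(s) so that sigmoid(f(s)) approximates p(success | s).

    Positives: user-provided goal/success states.
    Negatives: states sampled from the replay buffer.
    """
    logits = np.concatenate([f_goal, f_replay])
    labels = np.concatenate([np.ones_like(f_goal), np.zeros_like(f_replay)])
    return sigmoid_xent(logits, labels)
```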

What is the reasoning/theory behind SACClassifier? Any references (relevant paper, etc.) would be much appreciated :)

Never mind, I realized that SACClassifier only trains the classifier during the first epoch (self._epoch == 0), so it is indeed the "Naive Classifier" case from the paper.
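
(In other words, the distinction boils down to a guard like the one below; `self._epoch` is the attribute referenced above, while `_train_classifier` is just a placeholder name for the classifier update:)

```python
# Naive Classifier: fit the classifier once, on data from the first epoch,
# then keep the learned reward fixed for the rest of training.
if self._epoch == 0:
    self._train_classifier(batch)  # hypothetical helper, for illustration only
```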

Glad you figured it out! The reason SACClassifier was implemented in this non-intuitive way is that it made it extremely simple to implement VICE and VICE-RAQ on top of it.