Some questions on CQL
dbsxdbsx opened this issue · comments
1. For behavior cloning, the update formula is `policy_loss = (alpha*log_pi - log_probs).mean()`. Why does it use `log_probs` rather than the Q-value here?
2. When using Lagrange, do `alpha_prime` and `cql_min_q_weight` refer to the same thing? And shouldn't `alpha_prime` be updated before updating Q_loss, according to formula 30 from the CQL paper?
3. Is the twin Q function still essential? In my opinion, since the learned Q-value can be guaranteed to be a lower bound of the true Q-value, the twin Q function outputs are unnecessary. Am I right?
4. What is `cql_temp` in the code? Its value is always `1`. What is it used for, and what happens if it takes a different value?
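To make question 4 concrete, here is a minimal sketch of how I assume a temperature enters the log-sum-exp term of the conservative penalty, namely as `temp * logsumexp(q / temp)`. The function name `soft_max_q` and the sample values are hypothetical and not taken from the repository. If this reading is right, `temp = 1` gives the plain log-sum-exp, and smaller values push the result toward `max(q)`.

```python
import math

def soft_max_q(q_values, temp=1.0):
    """Temperature-scaled log-sum-exp: temp * log(sum(exp(q / temp))).

    Hypothetical helper for illustration; assumes this is how a
    temperature such as cql_temp would be applied.
    """
    scaled = [q / temp for q in q_values]
    m = max(scaled)  # subtract the max for numerical stability
    return temp * (m + math.log(sum(math.exp(s - m) for s in scaled)))

q = [1.0, 2.0, 3.0]  # made-up Q-values for sampled actions
for temp in (0.1, 1.0, 10.0):
    print(temp, soft_max_q(q, temp))
```

Under this assumption, a lower temperature makes the penalty focus on the largest sampled Q-value, while a higher one spreads it over all sampled actions.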
(I know some of the code refers to CQL, but since the author is no longer active, I am asking here.)
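To make question 1 concrete, this is how I read the two candidate policy losses. It is only a sketch: the function names are hypothetical, and I am assuming that `log_probs` is the policy's log-likelihood of the dataset actions, while the Q-based form is the usual SAC-style objective.

```python
def mean(xs):
    return sum(xs) / len(xs)

def bc_policy_loss(alpha, log_pi, log_probs):
    # Behavior-cloning form from the question:
    # (alpha * log_pi - log_probs).mean()
    # Assumption: log_probs = log pi(a_data | s) for dataset actions.
    return mean([alpha * lp - lb for lp, lb in zip(log_pi, log_probs)])

def q_policy_loss(alpha, log_pi, q_values):
    # SAC-style form: (alpha * log_pi - Q(s, a_new)).mean()
    return mean([alpha * lp - q for lp, q in zip(log_pi, q_values)])

# Made-up numbers, only to show the two forms side by side.
log_pi = [-1.0, -2.0]
print(bc_policy_loss(0.5, log_pi, [-0.5, -1.5]))
print(q_policy_loss(0.5, log_pi, [3.0, 4.0]))
```

The only difference between the two is the term being maximized: the likelihood of dataset actions in the first, the critic's value of freshly sampled actions in the second.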