young-geng / CQL

Conservative Q Learning on top of SAC

Some questions on CQL

dbsxdbsx opened this issue

1. For behavior cloning, the policy update is policy_loss = (alpha*log_pi - log_probs).mean(). Why does it use log_probs rather than the Q-value here?
2. When using the Lagrange version, do alpha_prime and cql_min_q_weight refer to the same thing? And shouldn't alpha_prime be updated before Q_loss is updated, according to formula 30 in the CQL paper?
3. Is the twin Q-function still essential? In my opinion, since the learned Q-value can be guaranteed to be a lower bound of the true Q-value, the twin Q-function outputs are unnecessary. Am I right?
4. What is cql_temp in the code? Its value is always 1. What is it used for, and what changes if it takes a different value?
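
To make question 4 concrete, here is a minimal sketch of what a temperature does inside a log-sum-exp term. It assumes cql_temp divides the sampled Q-values before the log-sum-exp in the CQL regularizer and multiplies the result afterwards; that is my reading, not something confirmed here. The function name soft_max_q and the sample values are made up for illustration, and this is plain Python rather than the repository's code.

```python
import math

def soft_max_q(q_values, temp=1.0):
    """Temperature-scaled log-sum-exp: temp * log(sum(exp(q / temp))).

    Hypothetical helper for illustration only; `temp` stands in for cql_temp.
    """
    scaled = [q / temp for q in q_values]
    m = max(scaled)  # subtract the max before exponentiating for numerical stability
    return temp * (m + math.log(sum(math.exp(s - m) for s in scaled)))

# Made-up Q-values for a handful of sampled actions at one state.
q_samples = [1.0, 2.0, 3.0]

for temp in (0.01, 1.0, 10.0):
    print(f"temp={temp}: {soft_max_q(q_samples, temp):.4f}")
```

Under this assumption, a small temperature makes the term approach the hard maximum over the sampled Q-values, while a large temperature makes it approach their mean plus temp * log(number of samples). With temp = 1 it is the ordinary log-sum-exp.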

(I know some of this code is based on the original CQL implementation, but since its author is no longer active, I am asking here.)
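
For question 2, a rough sketch of how a Lagrange multiplier of this kind typically behaves. It assumes alpha_prime is parametrized as exp(log_alpha_prime), that it multiplies cql_min_q_weight in the penalty (so the two are separate quantities in this sketch), and that it is trained by gradient descent on the negated penalty. The function alpha_prime_step, the argument names q_diff and target_gap, and all numbers are hypothetical; the sketch does not settle which update should come first.

```python
import math

def alpha_prime_step(log_alpha_prime, q_diff, target_gap, min_q_weight=1.0, lr=0.1):
    """One gradient-descent step on
    alpha_prime_loss = -alpha_prime * min_q_weight * (q_diff - target_gap),
    where alpha_prime = exp(log_alpha_prime).

    Hypothetical helper; scalar stand-in for what an optimizer would do.
    """
    alpha_prime = math.exp(log_alpha_prime)
    # d(loss)/d(log_alpha_prime) = -alpha_prime * min_q_weight * (q_diff - target_gap)
    grad = -alpha_prime * min_q_weight * (q_diff - target_gap)
    return log_alpha_prime - lr * grad

log_a = 0.0  # alpha_prime starts at exp(0) = 1

# Q-value gap above the target: the multiplier grows, strengthening the penalty.
up = alpha_prime_step(log_a, q_diff=5.0, target_gap=1.0)
# Q-value gap below the target: the multiplier shrinks, relaxing the penalty.
down = alpha_prime_step(log_a, q_diff=0.2, target_gap=1.0)

print(math.exp(up), math.exp(down))
```

In this sketch cql_min_q_weight acts as a fixed scale, while alpha_prime adapts until the gap between out-of-distribution and dataset Q-values sits near the target.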