facebookresearch / Pearl

A Production-ready Reinforcement Learning AI Agent Library brought by the Applied Reinforcement Learning team at Meta.

LSTM History Summarization Poor Performance

GreatArcStudios opened this issue · comments

I've been trying to play around with LSTM history summarization and various algorithms on toy environments. I've found that, with fully observable environments in particular, the LSTM history summarization module performs poorly. For example, on Pendulum with DDPG it has reached episode 240 and the return is still very negative:

episode 10 return: -1397.8437638282776
episode 20 return: -1456.5716090202332
episode 30 return: -1346.7031090259552
episode 40 return: -1062.9707473516464
episode 50 return: -1553.775797367096
episode 60 return: -1183.1240811608732
episode 70 return: -1204.8374280929565
episode 80 return: -1765.7173137664795
episode 90 return: -1457.595247745514
episode 100 return: -1073.8434460163116
episode 110 return: -1259.576441526413
episode 120 return: -1023.4422654509544
episode 130 return: -1172.4052398204803
episode 140 return: -706.3417955114273
episode 150 return: -1440.3148374557495
episode 160 return: -1305.943922996521
episode 170 return: -932.2708021588624
episode 180 return: -1180.3497375249863
episode 190 return: -1183.8436343669891
episode 200 return: -1431.2645144462585
episode 210 return: -920.1698980480433
episode 220 return: -1273.0213116556406
episode 230 return: -1060.2317422628403
episode 240 return: -1481.3649444580078

For reference, with DDPG but without the LSTM history summarization module, we usually reach a moving average better than -250 around episode 100. I extended the DDPG test as follows:

    def test_ddpg_lstm_summarization(self) -> None:
        """
        This test checks whether DDPG will eventually learn on Pendulum-v1.
        If it learns well, the return will converge to above -250.
        Due to randomness in the environment, we check the moving average of episode returns.
        """
        env = GymEnvironment("Pendulum-v1")
        agent = PearlAgent(
            policy_learner=DeepDeterministicPolicyGradient(
                state_dim=512,
                action_space=env.action_space,
                actor_hidden_dims=[400, 300],
                critic_hidden_dims=[400, 300],
                critic_learning_rate=1e-2,
                actor_learning_rate=1e-3,
                training_rounds=5,
                actor_soft_update_tau=0.05,
                critic_soft_update_tau=0.05,
                exploration_module=NormalDistributionExploration(
                    mean=0,
                    std_dev=0.2,
                ),
            ),
            history_summarization_module=LSTMHistorySummarizationModule(
                observation_dim=env.observation_space.shape[0],
                action_dim=env.action_space.shape[0],
                hidden_dim=512,
                num_layers=5,
                history_length=200,
            ),
            replay_buffer=FIFOOffPolicyReplayBuffer(50000),
        )
        self.assertTrue(
            target_return_is_reached(
                agent=agent,
                env=env,
                target_return=-250,
                max_episodes=1000,
                learn=True,
                learn_after_episode=True,
                exploit=False,
                check_moving_average=True,
            )
        )

Is this expected or did I miss something obvious?

I did modify the policy learner's preprocess method to detach the tensor being assigned to batch.state (refer to line 186).

This appears to be related to #30: detaching the batch.state tensor means the history summarization module gets no gradients for backprop, but not detaching it results in an error. Hopefully this gets fixed soon.
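
For anyone else hitting this, here is a minimal plain-PyTorch sketch of the trade-off (made-up dimensions, not Pearl code): backpropagating two losses through a shared LSTM output triggers the double-backward error, while detaching that output avoids the error but blocks gradients from ever reaching the LSTM.

    import torch
    import torch.nn as nn

    # Toy stand-in for the history summarization module (not Pearl's implementation).
    lstm = nn.LSTM(input_size=3, hidden_size=8, batch_first=True)
    history = torch.randn(1, 10, 3)

    out, _ = lstm(history)
    state = out[:, -1, :]  # shared "subjective state" used by both actor and critic

    critic_loss = state.sum() ** 2
    actor_loss = -state.mean()

    critic_loss.backward()   # first backward pass frees the LSTM graph
    # actor_loss.backward()  # RuntimeError: Trying to backward through the graph a second time

    # Workaround discussed above: detach the state. No error anymore, but the
    # LSTM receives no gradients, so it cannot learn which observations matter.
    detached_state = state.detach()
    print(detached_state.requires_grad)  # False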

Hi there! A few things I noticed:

  1. First of all, you should not detach batch.state, as that makes the state input essentially random: the LSTM receives no gradients, so it cannot summarize the history and realize that only the last observation is useful. I believe DDPG should work if you don't detach the tensor assigned to batch.state. #30 is talking about a different issue.
  2. #30 is about PPO + LSTM, where FIFOOnPolicyReplayBuffer is used; it is not the same issue (you are using FIFOOffPolicyReplayBuffer). We will have a separate fix for that soon.
  3. Additionally, Pendulum is a fully observable environment, so adding an LSTM with a history length of 200 will significantly slow down learning: the agent now faces a state representation in which 99.5% of the entries are noise. An LSTM only helps in partially observable environments, so I would suggest avoiding it for fully observable ones (see the sketch below).
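
For reference, here is what point 3 suggests in code, a sketch of the same test configuration without the LSTM module (hyperparameters copied from the test above; treat them as illustrative values rather than tuned settings):

    env = GymEnvironment("Pendulum-v1")
    agent = PearlAgent(
        policy_learner=DeepDeterministicPolicyGradient(
            # On a fully observable task, the raw observation is the state,
            # so state_dim is just the observation dimension (3 for Pendulum-v1).
            state_dim=env.observation_space.shape[0],
            action_space=env.action_space,
            actor_hidden_dims=[400, 300],
            critic_hidden_dims=[400, 300],
            critic_learning_rate=1e-2,
            actor_learning_rate=1e-3,
            training_rounds=5,
            actor_soft_update_tau=0.05,
            critic_soft_update_tau=0.05,
            exploration_module=NormalDistributionExploration(mean=0, std_dev=0.2),
        ),
        # No history_summarization_module here: the LSTM is only needed for
        # partially observable environments.
        replay_buffer=FIFOOffPolicyReplayBuffer(50000),
    )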

Hope this helps. Let us know if you see errors when you don't detach the state tensor while using DDPG + LSTM.

Hi! Thanks for the response!

I referenced that PR because of the reason its author gave for detaching the batch.state tensor: adding the LSTM history module to algorithms like DDPG or CSAC results in the double backward pass error. So it's only tangentially related, but there does appear to be a bug somewhere. You can actually see this if you run the test case in this issue (provided you don't detach the tensor).

Yeah, I expected this test to be a bit silly, but I wanted to use it as a sanity check since I'm currently building my own history summarization modules.

Thanks!

Edit:

Could you also clarify why 99.5% of the state representation entries would be noise? I was under the impression that they would just be learned representations of the history (possibly compressed or expanded depending on state_dim). They would be noise with the detached tensor, though, due to the lack of gradient computation.

Edit 2:
#29 contains a better traceback actually. I am on WSL, so my tracebacks aren't always super informative.

@GreatArcStudios Ah, we just identified another issue with LSTM and actor-critic methods involving double gradient passes. Someone on the team will submit a fix soon. Thanks!

On a side note, regarding the 99.5% of representation entries being noise: I should clarify that I meant 99.5% of the history input is noise. An LSTM with a 200-step lookback takes 200 observations and tries to summarize them into a latent representation, but the first 199 observations are not useful since the last observation already is the full state. Hence 199/200 = 99.5% of the history input is noise, which can easily mislead the LSTM. And since you also detached the gradient flow, there is no way for the LSTM to learn that only the last observation matters.

Stay tuned for the fix; it will be out soon.

Hi @GreatArcStudios, thanks for spotting this double backward pass error. We have just fixed it. You should be able to run DDPG/CSAC with LSTM now.

Thanks for the quick fix! I see that the fix has removed the buffers for the hidden and cell representations. Does this mean that we should actually increase the history length, since we no longer maintain references to the "cached" latent states?

In the current design, the LSTM only accepts a fixed-length history. For example, if an episode has a length of 100 but the chosen history length for the LSTM is 10, then only the last 10 observation-action pairs are fed to the LSTM to compute the current subjective state. The initial hidden state of the LSTM is always a zero vector, so the chosen history length needs to be long enough to cover a sufficient amount of information for making decisions.
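
To make that concrete, here is a small standalone sketch of the fixed-length windowing described above (plain PyTorch with made-up dimensions, not the actual LSTMHistorySummarizationModule):

    import torch
    import torch.nn as nn

    # Toy illustration of a fixed-length history summarizer: only the last
    # `history_length` observation-action pairs are kept, shorter histories
    # are zero-padded, and the LSTM hidden state always starts at zero.
    history_length, obs_dim, act_dim, hidden_dim = 10, 3, 1, 16
    lstm = nn.LSTM(input_size=obs_dim + act_dim, hidden_size=hidden_dim, batch_first=True)

    def summarize(history: torch.Tensor) -> torch.Tensor:
        """history: (T, obs_dim + act_dim) observation-action pairs so far."""
        window = history[-history_length:]                    # keep the last 10 steps only
        if window.shape[0] < history_length:                  # left-pad short histories with zeros
            pad = torch.zeros(history_length - window.shape[0], window.shape[1])
            window = torch.cat([pad, window], dim=0)
        out, _ = lstm(window.unsqueeze(0))                    # hidden state defaults to zeros
        return out[0, -1]                                     # subjective state for the agent

    episode_so_far = torch.randn(100, obs_dim + act_dim)      # e.g. an episode of length 100
    subjective_state = summarize(episode_so_far)              # uses only the last 10 pairs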

Closed as solved by the fix in #47.