vFf0621 / FastCarRacing-v0

Modified Gym CarRacing Environment for Easy Training

FastCarRacing-v0 Gym Environment

Click to play (demo video)

Introduction

The original CarRacing-v2 environment in Gymnasium takes a long time to train because of the way its reward model is defined, especially for off-policy continuous-control agents such as Soft Actor-Critic or TD3. An episode only terminates when the car is far from the track, so while the agent is exploring, the car often spins around on the grass or does not move at all, which is not time efficient. I therefore propose a modified environment that fixes these problems, enabling faster, easier training and better performance for SAC/DDPG agents.

Features

- Immediate termination with a -100 reward penalty when the nose of the car is off the road.

- The action space has been changed to 2 dimensions so that throttle and brake are mutually exclusive. This way the car will not get jammed by the exploring policy applying throttle and brake at the same time (see the sketch after this list).

- Braking is only available when the speed exceeds 70, which keeps the car moving as fast as possible.

- Throttle is incentivized more than brake.

- The state has been changed to a 96 x 96 Torch Tensor representing a grayscale image of the car's surroundings.
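
As a rough illustration of the idea behind the 2-dimensional action space, a mapping back to CarRacing's native (steer, gas, brake) controls could look like the sketch below. The function name, thresholds, and scaling are illustrative only and may differ from the actual wrapper code in this repository.

import numpy as np

def split_throttle_brake(action, speed):
    # action[0]: steering in [-1, 1]
    # action[1]: combined throttle/brake in [-1, 1]; positive -> gas, negative -> brake
    steer = float(np.clip(action[0], -1.0, 1.0))
    combined = float(np.clip(action[1], -1.0, 1.0))
    gas = max(combined, 0.0)
    # Braking only takes effect above the speed threshold mentioned above.
    brake = -combined if (combined < 0.0 and speed > 70.0) else 0.0
    return steer, gas, brake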

Installation

git clone https://github.com/vFf0621/FastCarRacing-v0
cd FastCarRacing-v0
pip install -e .

Usage

import gymnasium as gym
import gym_fast_car_racing

env = gym.make("FastCarRacing-v0", render_mode="human")

obs = env.reset()[0]
done = truncated = False
episode_reward = 0

while not (done or truncated):
  action = agent.policy(obs)  # `agent` is your trained policy
  obs, reward, done, truncated, info = env.step(action)
  episode_reward += reward

Training

Soft Actor-Critic is used to train the stochastic policy, and the policy is evaluated deterministically (the mean, mu, of the distribution is used for evaluation). Below is the code for the actor CNN policy, shown here inside a minimal nn.Module so the snippet is self-contained:

import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, env):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, stride=1)
        self.linear1 = nn.Linear(4096, 1024)
        self.linear2 = nn.Linear(1024, 1024)
        self.relu = nn.ReLU()  # For each layer except the last.
        self.mu = nn.Linear(1024, env.action_space.shape[0])
        self.sigma = nn.Linear(1024, env.action_space.shape[0])
        self.tanh = nn.Tanh()  # Applied to the output action.
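
The forward pass is not shown above, but given these layer shapes (a 96 x 96 grayscale input flattens to 4096 features after the three convolutions), it presumably looks roughly like the sketch below. The helper name actor_forward and the softplus parameterization of sigma are assumptions for illustration, not the repository's exact code.

import torch.nn.functional as F

def actor_forward(actor, obs):
    # obs: (batch, 1, 96, 96) grayscale observation tensor
    x = actor.relu(actor.conv1(obs))           # -> (batch, 32, 23, 23)
    x = actor.relu(actor.conv2(x))             # -> (batch, 64, 10, 10)
    x = actor.relu(actor.conv3(x))             # -> (batch, 64, 8, 8)
    x = x.flatten(start_dim=1)                 # -> (batch, 4096)
    x = actor.relu(actor.linear1(x))
    x = actor.relu(actor.linear2(x))
    mu = actor.mu(x)                           # mean of the Gaussian policy
    sigma = F.softplus(actor.sigma(x)) + 1e-6  # assumed positive-std parameterization
    return mu, sigma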

The two critic networks and the value network use the same hidden-layer size (1024), and each has 3 hidden layers.
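
The critic code itself is not reproduced here, so the following is only a plausible sketch under the assumption that each critic reuses the actor's convolutional trunk and concatenates the action with the flattened image features before the 3 hidden layers of size 1024; the class name Critic and the exact wiring are illustrative.

import torch
import torch.nn as nn

class Critic(nn.Module):
    def __init__(self, env):
        super().__init__()
        # Convolutional trunk, assumed identical to the actor's.
        self.conv1 = nn.Conv2d(1, 32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
        # 3 hidden layers of size 1024, then a scalar Q-value.
        self.linear1 = nn.Linear(4096 + env.action_space.shape[0], 1024)
        self.linear2 = nn.Linear(1024, 1024)
        self.linear3 = nn.Linear(1024, 1024)
        self.q = nn.Linear(1024, 1)
        self.relu = nn.ReLU()

    def forward(self, obs, action):
        x = self.relu(self.conv1(obs))
        x = self.relu(self.conv2(x))
        x = self.relu(self.conv3(x))
        x = torch.cat([x.flatten(start_dim=1), action], dim=1)  # Q(s, a)
        x = self.relu(self.linear1(x))
        x = self.relu(self.linear2(x))
        x = self.relu(self.linear3(x))
        return self.q(x)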

Evaluation

Here is the actor policy after 30 hours of training. Remember that it is evaluated on the mu output of the final layer, with a Tanh activation applied to the result. This policy gives reasonable performance and can reach a maximum score of 1200, but it is not perfect. After creating a CNN with the code above, use

actor.load_state_dict(torch.load("ACTOR"))

to load the trained parameters.
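
Putting it together, a deterministic evaluation loop might look like the sketch below, using the Actor module and the actor_forward helper sketched above; the reshaping of the observation to (batch, channel, height, width) is an assumption about the exact tensor layout.

import torch
import gymnasium as gym
import gym_fast_car_racing

env = gym.make("FastCarRacing-v0", render_mode="human")
actor = Actor(env)
actor.load_state_dict(torch.load("ACTOR"))
actor.eval()

obs = env.reset()[0]
done = truncated = False
episode_reward = 0

while not (done or truncated):
    with torch.no_grad():
        state = obs.float().view(1, 1, 96, 96)      # assumed (batch, channel, H, W) layout
        mu, _ = actor_forward(actor, state)         # deterministic: use the mean action
        action = torch.tanh(mu).squeeze(0).numpy()  # squash into the action bounds
    obs, reward, done, truncated, info = env.step(action)
    episode_reward += reward

print(episode_reward)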

License

MIT License

