Topic of my bachelor thesis from 2020
This project is an approach to designing a controller for a DC motor with the help of Reinforcement Learning. The controller (RL-agent) consists of an artificial neural network that is trained with the A2C algorithm (Advantage Actor-Critic). The environment is modelled by a simple discrete LTI system (linear time-invariant) of the motor. The aim is for the RL-agent to apply the right voltage to the motor in order to reach the desired angular velocity. A brief analysis of the stationary as well as the dynamic behaviour is performed.
The Advantage Actor-Critic algorithm combines the advantages of Temporal Difference Learning (TD-Learning) and the Policy Gradient method. The actor learns the strategy (policy) using the Policy Gradient method and accordingly executes new actions during training. The critic, in contrast, evaluates the state reached through the actor's action by means of the value function. Similar to the actor, the critic learns the value function through TD-Learning. As with TD-Learning, the goal is to apply the actor-critic method to continuous problems. For a continuous action range it is common to choose a Gaussian probability density function as the actor's strategy.
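To fix the notation used in the code below (the symbols chosen here are my own shorthand, not taken from the thesis figures): the actor with parameters $\theta$ outputs a Gaussian policy, the critic with parameters $w$ estimates the value function, and both are trained with the TD-target and the advantage:

$$\pi_\theta(a \mid s) = \mathcal{N}\big(a;\, \mu_\theta(s),\, \sigma_\theta(s)^2\big)$$

$$y_k = r_k + \gamma\, V_w(s_{k+1}), \qquad A_k = y_k - V_w(s_k)$$

$$L_\mathrm{actor} = -\log \pi_\theta(a_k \mid s_k)\, A_k, \qquad L_\mathrm{critic} = \big(y_k - V_w(s_k)\big)^2$$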
I decided to create a state-space representation of the DC motor. This kind of physical model can be implemented with SciPy. The following figure shows the simplified physical equivalent circuit diagram.
This can now be used to create the equation of the state-space representation:
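(Reconstructed here from the matrices used in the code below; naming the states as armature current $i$ and angular velocity $\omega$ and the inputs as voltage $U$ and load torque $M_L$ is my assumption.)

$$\begin{bmatrix} \dot{i} \\ \dot{\omega} \end{bmatrix} = \begin{bmatrix} -\frac{R}{L} & -\frac{K_E}{L} \\ \frac{K_M}{J} & -\frac{b}{J} \end{bmatrix} \begin{bmatrix} i \\ \omega \end{bmatrix} + \begin{bmatrix} \frac{1}{L} & 0 \\ 0 & -\frac{1}{J} \end{bmatrix} \begin{bmatrix} U \\ M_L \end{bmatrix}, \qquad y = \begin{bmatrix} 0 & 1 \end{bmatrix} \begin{bmatrix} i \\ \omega \end{bmatrix}$$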
The states taken into consideration and the actions taken by the RL-agent are valid for a specific moment in time. Therefore the state-space model is time-discretised with a fixed time step of 0.05 s. In combination with the selected motor parameters, no discontinuities are visible in the step responses (not included here). To implement these considerations we create a class for the DC motor:
class Dcmotor:

    def __init__(self):
        self.model_dlti = self.create_dlti()

    def create_dlti(self):
        R, L, K_M, b, J = 1.93, 0.0062, 0.05247, 0.00000597, 0.00000676  # DC motor parameters
        K_E = K_M
        # state space matrices of DC motor
        A = np.array([[-(R / L), -(K_E / L)], [(K_M / J), -(b / J)]])
        B = np.array([[(1 / L), 0], [0, -(1 / J)]])
        C = np.array([0, 1])
        D = np.array([0, 0])
        return signal.cont2discrete((A, B, C, D), delta_T)

    def next_state(self, omega, action):
        # create input vector with action
        U = np.empty(shape=(sample_steps, 2), dtype=float)
        for k in range(sample_steps):
            U[k] = np.array([action[0], 0.0])
        t_out, y_out, x_out = signal.dlsim(self.model_dlti, U, t=None, x0=np.array([0, omega]))
        return y_out[1]  # return next omega
Instantiating this class initialises the DC motor as a DLTI system.
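A minimal usage sketch (assuming the globals `delta_T` and `sample_steps` from the hyperparameter listing further below; the voltage value is just an example):

```python
import numpy as np
from scipy import signal

delta_T = 0.05        # discretisation step in s (as in the hyperparameter listing below)
sample_steps = 240    # length of the input vector built inside next_state

motor = Dcmotor()                              # discretise the motor model once
omega = 0.0                                    # current angular velocity
action = np.array([12.0])                      # example voltage in V
next_omega = motor.next_state(omega, action)   # angular velocity after one 0.05 s step
print(next_omega)
```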
The following figure shows the overall implementation idea as well as the control structure and the interfaces between the required parts. It is important to note that after successful training, the RL-agent consists only of the actor (including pre- and post-processing tasks).
The interfaces of the DC motor are the voltage $U$ (and the load torque $M_L$) as inputs and the angular velocity $\omega$ as output.
Furthermore, to train the RL-agent using the A2C-algorithm, a reward function must be defined. The function follows specific criteria:
- maximal reward when the control difference is $0\ \mathrm{s}^{-1}$,
- a maximum reward value of $0$,
- negative values for non-zero control differences, and
- increasingly negative rewards for higher control differences.
A linear equation with a negative factor $c$ and the absolute value of the control difference meets these requirements. The factor $c$ determines how strongly deviations from the desired angular velocity are penalised and is therefore also a hyperparameter relevant for training success.
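Written out (reconstructed from the reward calculation in the training loop below, where $c = -0.95$):

$$r_k = c \cdot \lvert \Delta\omega_k \rvert, \qquad c < 0$$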
Before heading to the code implementation, we take a look at the flow chart of the RL-agent's training process. After initialising the DC-motor model as well as the actor and critic neural networks along with the hyperparameters, the training starts with episode 0. At each step $k$ the actor selects an action, the DC-motor model returns the next state, and the reward, TD-target and advantage are computed; once all steps of an episode have been collected, the actor and critic networks are updated and the next episode begins.
For the higher-level code structure, besides the class for the motor model, I specify further classes for the actor, the critic and the agent. Moreover, the initialisation of the different hyperparameters takes place at the beginning. Calling `Agent().train()` starts the execution of the training process.
# import libraries
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Lambda
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
#float type default
tf.keras.backend.set_floatx('float64')
#hyperparameters
gamma = 0.5 #discount factor
actor_lr = 0.00006 #actor learning rate
critic_lr = 0.0002 #critic learning rate
sample_steps = 240 #steps per episode and batch size
max_episodes = 2000 #number of episodes
delta_T = 0.05 #for DLTI
c = -0.95 #factor reward function
class Actor:...
class Critic:...
class Dcmotor:...
class Agent:...
Agent().train()
By calling the Agent class, the actor, the critic and the DC motor get initialised. The dimensions of the state and action vectors are set. The action bound is fixed to 50 volts. For the random selection of an action, the standard deviation is also bounded. The train function starts by initialising empty lists. At the start of an episode the motor starts with an angular velocity of $0\ \mathrm{s}^{-1}$.
class Agent:

    def __init__(self):
        self.state_dim = 2
        self.action_dim = 1
        self.action_bound = 50.0
        self.std_bound = [1e-5, 1.0]
        self.actor = Actor(self.state_dim, self.action_dim, self.action_bound, self.std_bound)
        self.critic = Critic(self.state_dim)
        self.dcmotor = Dcmotor()

    def td_target(self, reward, next_state, k):
        if k == (sample_steps - 1):
            return reward
        v_value = self.critic.model.predict(np.reshape(next_state, [1, self.state_dim]))
        return np.reshape(reward + gamma * v_value[0], [1, 1])

    def advantage(self, td_target, baseline):
        return td_target - baseline

    def list_to_batch(self, list):
        batch = list[0]
        for elem in list[1:]:
            batch = np.append(batch, elem, axis=0)
        return batch

    def train(self):
        ep_batch = []
        ep_reward_batch = []
        actor_loss_batch = []
        critic_loss_batch = []
        for ep in range(max_episodes):
            state_batch = []
            action_batch = []
            td_target_batch = []
            advantage_batch = []
            ep_batch.append(ep)
            episode_reward = 0
            next_omega = 0.0  # initial state for every new episode
            for k in range(sample_steps):
                # change between target values in one episode
                if k == 0:
                    omega_target = 10.0 / 358.0
                    delta_omega = omega_target - next_omega / 358.0
                    state = (delta_omega, omega_target)
                elif k == 29:
                    omega_target = 50.0 / 358.0
                    delta_omega = omega_target - (next_omega[0] / 358.0)
                    state = (delta_omega, omega_target)
                elif k == 59:
                    ...
                # reversed normalisation for DC motor
                action = self.actor.get_action(state) * 50.0
                print(action)
                action = np.clip(action, -self.action_bound, self.action_bound)
                omega = (omega_target - state[0]) * 358.0
                next_omega = self.dcmotor.next_state(omega, action)
                # normalisation for RL-agent and reward calculation
                delta_omega_next = omega_target - (next_omega / 358.0)
                reward = c * (abs(delta_omega_next))
                action = action / 50.0
                omega_target = np.clip(omega_target, 0.0, 1.0)
                next_state = (delta_omega_next[0], omega_target)
                state = np.reshape(state, [1, self.state_dim])
                action = np.reshape(action, [1, self.action_dim])
                next_state = np.reshape(next_state, [1, self.state_dim])
                reward = np.reshape(reward, [1, 1])
                td_target = self.td_target(reward, next_state, k)
                advantage = self.advantage(td_target, self.critic.model.predict(state))
                state_batch.append(state)
                action_batch.append(action)
                td_target_batch.append(td_target)
                advantage_batch.append(advantage)
                if len(state_batch) >= sample_steps:
                    states = self.list_to_batch(state_batch)
                    actions = self.list_to_batch(action_batch)
                    td_targets = self.list_to_batch(td_target_batch)
                    advantages = self.list_to_batch(advantage_batch)
                    actor_loss = self.actor.train(states, actions, advantages)
                    critic_loss = self.critic.train(states, td_targets)
                episode_reward += reward[0][0]
                state = next_state[0]
            actor_loss_proto = tf.make_tensor_proto(actor_loss)
            critic_loss_proto = tf.make_tensor_proto(critic_loss)
            actor_loss_array = tf.make_ndarray(actor_loss_proto)
            critic_loss_array = tf.make_ndarray(critic_loss_proto)
            actor_loss_batch.append(actor_loss_array)
            critic_loss_batch.append(critic_loss_array)
            ep_reward_batch.append(episode_reward)
            print('EP{} EpisodeReward={} ActorLoss={} CriticLoss={}'.format(ep, episode_reward, actor_loss_array, critic_loss_array))
        # plot episode reward and losses over episodes
        fig, (ax1, ax2, ax3) = plt.subplots(3, 1)
        ax1.plot(ep_batch, ep_reward_batch)
        ax1.set_xlabel('Episode')
        ax1.set_ylabel('Episode reward')
        ax2.plot(ep_batch, actor_loss_batch)
        ax2.set_xlabel('Episode')
        ax2.set_ylabel('Actor loss')
        ax3.plot(ep_batch, critic_loss_batch)
        ax3.set_xlabel('Episode')
        ax3.set_ylabel('Critic loss')
        fig.suptitle('TotEpisodes={}, Gamma={}, SampleSteps={}, ActorLR={}, CriticLR={}'.format(ep, gamma, sample_steps, actor_lr, critic_lr))
        plt.show()
        self.actor.model.save_weights('weights/actor_weights.h5')
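After training, only the actor (plus the pre- and post-processing) is needed to act as a controller. The following is a minimal inference sketch, assuming the weight file saved above exists; it uses the deterministic mean $\mu$ of the policy instead of sampling, and the target value and number of steps are arbitrary examples:

```python
# Inference sketch (assumption: training has finished and the weight file exists).
agent = Agent()
agent.actor.model.load_weights('weights/actor_weights.h5')

omega_target = 50.0 / 358.0      # normalised target angular velocity (example value)
omega = 0.0                      # current angular velocity in 1/s
for k in range(100):
    state = (omega_target - omega / 358.0, omega_target)
    mu, std = agent.actor.model.predict(np.reshape(state, [1, agent.state_dim]))
    voltage = np.clip(mu[0] * 50.0, -agent.action_bound, agent.action_bound)  # undo normalisation
    omega = agent.dcmotor.next_state(omega, voltage)[0]
print(omega)  # reached angular velocity in 1/s
```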
In the beginning, the actor's neural network with two hidden layers (each with 100 neurons) is initialised. The output of the actor is a Gaussian distribution with a mean $\mu$ (TanH activation) and a standard deviation $\sigma$ (Softplus activation).
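The log-probability density used in the actor loss, written out to match `log_pdf` and `compute_loss` in the code below (the batch averaging is expressed here as an expectation):

$$\log \pi(a \mid s) = -\frac{(a - \mu)^2}{2\sigma^2} - \frac{1}{2}\log\!\big(2\pi\sigma^2\big), \qquad L_\mathrm{actor} = -\,\mathbb{E}\big[\log \pi(a_k \mid s_k)\, A_k\big]$$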
class Actor:

    def __init__(self, state_dim, action_dim, action_bound, std_bound):
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.action_bound = action_bound
        self.std_bound = std_bound
        self.model = self.create_model()
        self.opt = tf.keras.optimizers.Adam(actor_lr)

    def create_model(self):
        state_input = Input((self.state_dim,))
        dense_1 = Dense(100, activation='relu')(state_input)
        dense_2 = Dense(100, activation='relu')(dense_1)
        out_mu = Dense(self.action_dim, activation='tanh')(dense_2)
        std_output = Dense(self.action_dim, activation='softplus')(dense_2)
        mu_output = Lambda(lambda x: x * 1.0)(out_mu)
        return tf.keras.models.Model(state_input, [mu_output, std_output])

    def get_action(self, state):
        state = np.reshape(state, [1, self.state_dim])
        mu, std = self.model.predict(state)
        mu, std = mu[0], std[0]
        return np.random.normal(mu, std, size=self.action_dim)

    def log_pdf(self, mu, std, actions):
        std = tf.clip_by_value(std, self.std_bound[0], self.std_bound[1])
        var = std ** 2
        log_policy_pdf = -0.5 * (actions - mu) ** 2 / var - 0.5 * tf.math.log(var * 2 * np.pi)
        return tf.reduce_sum(log_policy_pdf, 1, keepdims=True)

    def compute_loss(self, mu, std, actions, advantages):
        log_policy_pdf = self.log_pdf(mu, std, actions)
        loss_policy = log_policy_pdf * advantages
        return tf.reduce_mean(-loss_policy)

    def train(self, states, actions, advantages):
        with tf.GradientTape() as tape:
            mu, std = self.model(states, training=True)
            loss = self.compute_loss(mu, std, actions, advantages)
        grads = tape.gradient(loss, self.model.trainable_variables)
        self.opt.apply_gradients(zip(grads, self.model.trainable_variables))
        return loss
In the beginning, the critic's neural network with two hidden layers (each with 100 neurons) is initialised. The output of the critic uses a linear activation function because the value function can take any negative or positive value. The loss function is the mean squared error between the value predicted by the critic's neural network and the TD-target. The network is thereby updated towards the current best guess (the TD-target).
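Written out (matching the agent's `td_target` method and the critic's `compute_loss`; at the last step of an episode the TD-target is just the reward):

$$L_\mathrm{critic} = \frac{1}{N}\sum_{k=1}^{N} \big(y_k - V(s_k)\big)^2, \qquad y_k = \begin{cases} r_k + \gamma\, V(s_{k+1}) & \text{if } k < N \\ r_k & \text{if } k = N \end{cases}$$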
class Critic:

    def __init__(self, state_dim):
        self.state_dim = state_dim
        self.model = self.create_model()
        self.opt = tf.keras.optimizers.Adam(critic_lr)

    def create_model(self):
        return tf.keras.Sequential([
            Input((self.state_dim,)),
            Dense(100, activation='relu'),
            Dense(100, activation='relu'),
            Dense(1, activation='linear')
        ])

    def compute_loss(self, v_pred, td_targets):
        mse = tf.keras.losses.MeanSquaredError()
        return mse(td_targets, v_pred)  # MSE of advantage function

    def train(self, states, td_targets):
        with tf.GradientTape() as tape:
            v_pred = self.model(states, training=True)
            assert v_pred.shape == td_targets.shape
            loss = self.compute_loss(v_pred, tf.stop_gradient(td_targets))
        grads = tape.gradient(loss, self.model.trainable_variables)
        self.opt.apply_gradients(zip(grads, self.model.trainable_variables))
        return loss
The right choice of hyperparameters and of the topologies of the neural networks (for the actor and the critic) is more or less a trial-and-error process. Nevertheless, we should reason about sensible choices instead of searching completely in the dark. Important to mention: we cannot monitor a validation loss as in supervised training, because we do not know the ground truth! A successful RL-agent depends a lot on the specified reward function!
The two candidate activation functions (linear and TanH) for the output neuron $\mu$ are compared in the following.
Neurons and hidden layers
In general, I would highly recommend an intensive investigation of different topologies for both networks as well as different combinations of the two networks to check their impact on each other. Here I have only run a couple of short trainings. The charts below show that using 100 neurons leads to faster and clearer convergence. The neural networks with the fewest neurons converge very slowly in comparison; in addition, a greater spread of values can be recognised in their progressions. Using 50 neurons shows a somewhat clearer tendency to converge, but requires more time (especially at the beginning of training). This tendency becomes visible after approx. 100 to 150 episodes. Similarly, a greater spread of the curves can be recognised at the beginning. For this reason, the architecture with 100 neurons per hidden layer is chosen. There may be a risk that the RL-agent will not be able to generalise the strategy to other angular velocities (missing exploration and/or overfitting). This needs to be checked after training. If this is the case, the number of neurons needs to be adjusted.
The learning rates leading to successful training must be determined by iterative adjustment. Furthermore, the ratio between them is important. With regard to the A2C algorithm, an optimal strategy is only found if the value function of the Markov Decision Process is approximated sufficiently accurately by the critic. Accordingly, the actor must not learn faster than the critic, as in the worst case the strategy found does not represent an optimum. This means that the learning rate of the actor should be lower than that of the critic. However, an approximate ratio can only be found after several iterations. This also depends on the actual problem, which significantly influences the size of the gradients per training episode. In addition, a low learning rate of the actor causes the standard deviation to remain at a high value, especially in the initial episodes, and ensures that the action space is explored. In this way, the value function can be approximated for the largest possible range of the analysed environment.
With regard to the TanH activation function we have to think about its value range. The values +1 and -1 (or the maximum allowed voltage for the motor when scaled up) are never actually reached by the activation function. Therefore we artificially expand the action range. A closer look at the TanH curve shows an approximately linear behaviour between the y values -0.9 and +0.9. If we take an action range of 50 V and consider the maximum possible voltage of 48.83 V for the motor, we remain within the linear region (43.83/50 = 0.88). Caution: we still have to bound the voltage to +/- 48.83 V after the output of the agent to prevent damage to the motor. Using a step input signal of 358 1/s (the maximum possible angular velocity), the control behaviour is investigated with both RL-agents characterised by the two action ranges. The responses to this input signal can be found in the diagrams. The control deviation
The next step is to choose a suitable reducing factor (discount factor) $\gamma$. The following table shows the average cumulative reward obtained for different values of $\gamma$:
| Reducing factor $\gamma$ | Average cumulative reward |
|---|---|
| 0.0 | -8.88 |
| 0.25 | -8.64 |
| 0.5 | -7.52 |
| 0.8 | -8.48 |
| 0.9 | -8.16 |
| 0.99 | -10.88 |
To hopefully get a well-performing RL-agent, we have to think about the inputs of the neural networks and the remaining hyperparameters. The chosen values are summarised in the following tables:
Hyperparameter | Number / Value |
---|---|
Episodes | 2,000 |
Steps per episode | 240 |
Reducing factor | 0.5 |
Learning rate actor | 0.00006 |
Learning rate critic | 0.00025 |
Factor reward function | -0.95 |
Neural network parameter | Actor | Critic |
---|---|---|
Input dimension | 2 | 2 |
Hidden layers | 2 | 2 |
Neurons per hidden layer | 100 | 100 |
Activation function hidden layer | ReLU | ReLU |
Output dimension | 2 | 1 |
Activation function output | TanH ($\mu$), Softplus ($\sigma$) | Linear |
In the following figure you can see the reward as well as the critic and actor loss for each episode. With a change of the target profile you can also see changes in the reward curve; the curve looks stepped. At the same time there are no abrupt changes in the actor and critic loss. Depending on the
Now we take a closer look at the output of the actor for different episodes and different target profiles. The first profile changed between higher and lower angular velocities. With 240 steps and a time discretisation of 0.05 s, the x-axis runs up to 12 s. You can see the
- Training with a completely randomised profile of target angular velocities (changing after a defined short period of time)
- Behaviour of the RL-agent under a disturbance (here the load torque $M_L$)
- Impact of the time discretisation step size
- Implementation of the RL-agent as a controller for a test bench
- Comparison to conventional control design
- Impact of sensor quality (sampling rate and accuracy) and actuator (PWM signal)
I would love to see the use of my code in other projects! Please just cite me! :)
Copyright © 2023 Alexander Lerch
A2C_DCMotor.py is licensed under the MIT License.