Philip Hale - 50907446 - p.hale.09@aberdeen.ac.uk
Set: March 11, 2013 Due: May 3, 2013
src/dungeon/ai/hale
├── Action.java
├── CreatureBehaviour.java
├── FactionBehaviour.java
├── State.java
├── pathfind
│   ├── AStar.java
│   ├── Grid.java
│   └── Square.java
└── qlearning
    ├── ActionState.java
    ├── QLearningHelper.java
    ├── QValueStore.java
    └── QValueStore.ser
2 directories, 11 files
Extract the included source archive into src/dungeon/ai. My package assumes that the other classes in dungeon.ai have not been modified.
The Faction Behaviour attribute in the map file should be set like so:
[...]
<Factions>
  <Vector>
    <Faction Behaviour="dungeon.ai.hale.FactionBehaviour" [...]/>
    [...]
  </Vector>
</Factions>
[...]
By default the behaviour will run using data precomputed in src/dungeon/ai/hale/qlearning/QValueStore.ser. In order to learn more, change the fTrain class variable in QLearningHelper from false to true. In order to reset the Q-Learning data, blank the file with `: > src/dungeon/ai/hale/qlearning/QValueStore.ser`.
- Learning rate: 0.2
- Discount rate: 0.35
A low learning rate is chosen because of the large number of training runs required for improved behaviour. This in turn is necessary because the set of action-state pairs is quite large.
The discount rate was chosen by trial and error: much higher values quickly caused the Q-values to diverge to ±Infinity.
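For reference, these two rates plug into the standard Q-Learning update as follows. This is a minimal sketch; the variable and method names are illustrative rather than taken from QValueStore.

```java
// Standard Q-Learning update, shown with the rates listed above.
// Names are illustrative, not the actual fields in QValueStore.
static final double LEARNING_RATE = 0.2;   // alpha
static final double DISCOUNT_RATE = 0.35;  // gamma

// Q(s,a) <- Q(s,a) + alpha * (reward + gamma * max_a' Q(s',a') - Q(s,a))
static double updatedQ(double oldQ, double reward, double bestNextQ) {
  return oldQ + LEARNING_RATE * (reward + DISCOUNT_RATE * bestNextQ - oldQ);
}
```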
The algorithm uses a HashMap for the data store. I wanted to avoid a primitive structure (such as a 2D array), since an object-oriented approach allows the data structure to be changed more easily at a later point.
Since the Q-Table associates action-state pairs with Q-values, I created an ActionState wrapper class to identify a unique action and state combination. Overriding equals() and hashCode() simplified Q-table updates.
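A minimal sketch of that arrangement is shown below; the field names are assumptions based on the description here rather than the exact contents of ActionState.java. Because the Q-Table is later serialised, the key class also implements Serializable.

```java
import java.io.Serializable;
import java.util.Objects;

// Sketch only: field names are assumptions, not the exact source.
public class ActionState implements Serializable {
  private final Action action;  // the action taken
  private final State state;    // the discretised game state (must also override equals/hashCode)

  public ActionState(Action action, State state) {
    this.action = action;
    this.state = state;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) return true;
    if (!(o instanceof ActionState)) return false;
    ActionState other = (ActionState) o;
    return Objects.equals(action, other.action) && Objects.equals(state, other.state);
  }

  @Override
  public int hashCode() {
    return Objects.hash(action, state);  // java.util.Objects, Java 7+
  }
}
```

The Q-Table itself is then just a HashMap<ActionState, Double>, so an update becomes a single put() keyed on the relevant pair.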
The Q-Learning implementation is based on pseudocode taken from Chapter 7 of Artificial Intelligence for Games (2009) by Ian Millington and John Funge.
- Creature energy level (integer 1 to 5).
- Creature health level (integer 1 to 5).
- Whether the creature is threatened (true or false).
- Distance in squares from the creature to its goal (from 1 up to the number of squares on the board; in practice values rarely exceed a quarter of that maximum).
I had initially hoped to achieve success by measuring only the first three of the above, but this resulted in an overvaluation of the WAIT action, where the creature would do nothing. Tracking distance to goal allowed a reward to be placed on movement that takes the creature closer to its destination. Using the path computed with A* gives a more accurate measure of distance than a simple Euclidean calculation, and has the added benefit of already being a small(ish) discrete value.
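Putting these pieces together, the state can be represented roughly as in the sketch below; the field names and the 1-to-5 bucketing helper are my illustration rather than the exact contents of State.java.

```java
// Illustrative sketch of the discretised state described above.
public class State implements java.io.Serializable {
  private final int energyLevel;     // 1 to 5
  private final int healthLevel;     // 1 to 5
  private final boolean threatened;  // enemy mob in an adjacent square
  private final int distanceToGoal;  // A* path length in squares

  public State(int energyLevel, int healthLevel, boolean threatened, int distanceToGoal) {
    this.energyLevel = energyLevel;
    this.healthLevel = healthLevel;
    this.threatened = threatened;
    this.distanceToGoal = distanceToGoal;
  }

  // Collapse a raw value (e.g. current health against maximum health) into a 1-5 level.
  static int bucket(double value, double max) {
    int level = (int) Math.ceil(5.0 * value / max);
    return Math.max(1, Math.min(5, level));
  }

  // equals() and hashCode() are overridden in the same way as ActionState (omitted).
}
```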
In order to determine whether the creature is under threat or not, the eight adjacent squares are inspected for the presence of enemy mobs. This was relatively simple to implement, since I could reuse methods from the pathfinding exercise.
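A rough sketch of that check is given below; Square, Grid and the accessor names are stand-ins for the real pathfind and game types rather than their actual API.

```java
import java.util.List;

// Hypothetical sketch of the adjacency check described above.
static boolean isThreatened(Square creatureSquare, Grid grid, List<Square> enemySquares) {
  for (int dx = -1; dx <= 1; dx++) {
    for (int dy = -1; dy <= 1; dy++) {
      if (dx == 0 && dy == 0) continue;  // skip the creature's own square
      Square neighbour = grid.squareAt(creatureSquare.getX() + dx,
                                       creatureSquare.getY() + dy);
      if (neighbour != null && enemySquares.contains(neighbour)) {
        return true;  // an enemy mob occupies one of the eight adjacent squares
      }
    }
  }
  return false;
}
```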
- Move towards goal.
- Stand still.
Given the wide range of states and rewards, I decided to limit the available actions to moving or not moving. Adding additional complex behaviour was beyond the scope of this exercise.
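In code the action set can therefore be as small as a two-value enum; the constant names below are illustrative rather than the actual contents of Action.java.

```java
// Illustrative only: the real Action class may use different names or structure.
public enum Action {
  MOVE_TO_GOAL,  // follow the A* path towards the current goal
  WAIT           // stand still for this tick
}
```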
The reward value is between -1.0 and +1.0, and is calculated from the difference between the creature's state at two consecutive game ticks. The reward modifiers below are cumulative.
Description | Reward modifier |
---|---|
No longer threatened | +0.5 |
Closer to goal | +0.1 |
In same place | -0.1 |
Newly threatened | -0.2 |
Creature dead | -0.4 |
Newly threatened and no energy | -0.3 |
'No longer threatened' implies an enemy mob has run away or been killed by the faction, and therefore represents the largest positive reward.
If the creature becomes newly threatened, a slight negative reward is given. This is based on the assumption that factions which avoid danger are likely to win more games.
A slightly larger negative reward is given if, in addition to being newly threatened, the creature has no energy. This follows from observing the creature's behaviour and noticing that the ogre often dies when attacking without sufficient energy.
In addition, a small reward is given for reducing the distance between the creature and the goal. Since I am not tracking which faction wins the game or the quantity of treasure, this is required to prevent waiting from being overly rewarded.
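Read as code, the table amounts to something like the sketch below. The State accessors are placeholders, and treating the two 'newly threatened' rows as alternatives (rather than both applying) is one reading of the table.

```java
// Sketch of the cumulative reward between two ticks; accessor names are placeholders.
static double reward(State previous, State current, boolean creatureDied) {
  double reward = 0.0;
  if (previous.isThreatened() && !current.isThreatened()) reward += 0.5;          // no longer threatened
  if (current.getDistanceToGoal() < previous.getDistanceToGoal()) reward += 0.1;  // closer to goal
  if (current.getDistanceToGoal() == previous.getDistanceToGoal()) reward -= 0.1; // in same place
  if (!previous.isThreatened() && current.isThreatened()) {
    // newly threatened: -0.2, or -0.3 if the creature also has no energy
    // (assuming level 1 is the empty bucket)
    reward += current.getEnergyLevel() <= 1 ? -0.3 : -0.2;
  }
  if (creatureDied) reward -= 0.4;
  return Math.max(-1.0, Math.min(1.0, reward));  // kept within the -1.0 to +1.0 range
}
```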
Initially, the algorithm would take random actions 20% of the time, and the rest of the time take the best action as determined by the Q-Table. However, this cornered the learning into particular (bad) behaviours, such as not moving at all or never waiting. In addition, the learning would take too long when only acting randomly 20% of the time. To combat this, a class variable in QLearningHelper.java toggles between saving learning data into the Q-Table while acting randomly, and only reading the Q-Table and taking the best learnt action.
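In outline the toggle works as sketched below; fTrain is the flag mentioned above, while getBestAction() is a placeholder for the actual QValueStore lookup.

```java
import java.util.Random;

// Sketch of the training/playing toggle described above.
private static final boolean fTrain = false;  // true while generating learning data
private static final Random random = new Random();

static Action chooseAction(State state, QValueStore qTable) {
  if (fTrain) {
    // Training: act randomly so a wide range of action-state pairs is visited;
    // the resulting Q-value updates are written back into the table elsewhere.
    Action[] actions = Action.values();
    return actions[random.nextInt(actions.length)];
  }
  // Playing: only read the table and take the best learnt action for this state.
  return qTable.getBestAction(state);
}
```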
The Q-Table is trained on mapQ1.xml against the Collinson behaviour, and against itself on pathfind3.xml. The included learning data is the result of approximately 100 tournaments, during which around 120 state-action pairs were encountered.
The Q-Table HashMap is serialised and saved in a flat file, and updated at the end of each game.
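Saving and loading use the standard java.io object streams; a minimal sketch is below, with the method names being placeholders rather than the actual QValueStore API.

```java
import java.io.*;
import java.util.HashMap;

// Minimal sketch of persisting the Q-Table; method names are placeholders.
static final String STORE_PATH = "src/dungeon/ai/hale/qlearning/QValueStore.ser";

static void save(HashMap<ActionState, Double> qTable) throws IOException {
  try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(STORE_PATH))) {
    out.writeObject(qTable);
  }
}

@SuppressWarnings("unchecked")
static HashMap<ActionState, Double> load() throws IOException, ClassNotFoundException {
  try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(STORE_PATH))) {
    return (HashMap<ActionState, Double>) in.readObject();
  }
}
```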
Generating significantly more training data was not possible due to the limitations of running tournaments. Sometimes the creatures would get stuck, requiring manual intervention to continue the training process. Further and better training would be possible if the dungeon layout and item placement could be programmatically randomised.
Q-Learning was employed in this implementation to overcome limitations of the previous behaviour. The faction could optimally pathfind to treasure and enemies, but did not react to changes in health or energy. The result was blind pursuit of the enemy factions, often resulting in lost games.
From observing the behaviour of my faction, I decided to reward and 'punish' the creatures according to whether they attack the enemy with low energy. By limiting the scope of learning to a known goal, tuning the algorithm was made far simpler than simply rewarding overall game wins.
The greatest challenge in programming this dungeon has been balancing complex behaviour with reliable and understandable code. Conceptual AI behaviour, when implemented as you might express it linguistically, quickly spirals into dozens of edge cases and large quantities of code. Reinforcement learning techniques allow for the autonomous selection of actions in a large number of different game states. The different situations which can occur in even a simple game are played out, and given predefined rewards the action taken can be measured and recorded.
The result matched my expectations. In general, the ogre will wait after the first longbow shot until its energy is replenished. In addition, when a faction creature is near the enemy it does not wait, since there is little benefit in doing so. Other behavioural differences can be observed, but it is more difficult to see a pattern in them.
Despite the perceptible improvement in behaviour, the overall number of games won has not increased significantly.
In addition, because of the large number of possible distance (pathLength) values, some suboptimal actions can be observed despite reasonably large amounts of training.
The other limitation resulted from the small range of available actions. Implementing some form of evasion would have widened the set of possible actions and, with appropriate rewards, led to more complex and effective behaviour.
I began with the hypothesis that merely rewarding game wins and losses would not result in better behaviour. Instead, Q-Learning could be employed to direct the creature's actions towards certain behavioural goals. In the case of my behaviour, when the energy level was below a certain threshold the creature should (normally) wait. Using Q-Learning to achieve this goal would result in more nuanced and reliable behaviour than if it had been implemented procedurally.
Sometimes, it is more important for the AI players to behave according to an understandable logic than to always take the best possible action. One possible negative consequence of learning algorithms is they can uncover exploits in the game world which might not be entertaining to human players. This effect can be mitigated by placing limitations on the action sequences of creatures. It raises the interesting possibility of rewarding entertaining behaviour and punishing what players might consider 'cheating'.