Title: Evolving Multimodal Behavior With Modular Neural Networks in Ms. Pac-Man
1. Evolving Multimodal Behavior With Modular Neural Networks in Ms. Pac-Man
- By Jacob Schrum and Risto Miikkulainen
2. Introduction
- Challenge: Discover behavior automatically
- Simulations
- Robotics
- Video games (focus)
- Why challenging?
- Complex domains
- Multiple agents
- Multiple objectives
- Multimodal behavior required (focus)
3. Multimodal Behavior
- Animals can perform many different tasks
- Imagine learning a monolithic policy as complex as a cardinal's behavior. How?
- Problem more tractable if broken into component behaviors:
  - Flying
  - Nesting
  - Foraging
4. Multimodal Behavior in Games
- How are complex software agents designed?
- Finite state machines
- Behavior trees/lists
- Modular design leads to multimodal behavior
FSM for NPC in Quake
Behavior list for our winning BotPrize bot
5. Modular Policy
- One policy consisting of several policies/modules
- Number preset, or learned
- Means of arbitration also needed
- Human-specified, or learned via preference neurons
- Separate behaviors easily represented
- Sub-policies/modules can share components
[Figure: Multitask (Caruana 1997) and Preference Neuron architectures, each mapping shared inputs to multiple output modules]
6. Constructive Neuroevolution
- Genetic algorithms + neural networks
- Good at generating control policies
- Three basic mutations: Perturb Weight, Add Connection, Add Node
- Crossover also used
- Other structural mutations possible
- (NEAT by Stanley 2004)
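The three basic mutations listed above can be sketched on a simple dictionary-based genome. This is a minimal illustration, not the MM-NEAT or NEAT implementation; the genome layout (`nodes` as a list of ids, `connections` as dicts) is a hypothetical simplification.

```python
import random

def perturb_weight(genome, sigma=0.5):
    """Perturb Weight: add Gaussian noise to one random connection."""
    conn = random.choice(genome["connections"])
    conn["weight"] += random.gauss(0, sigma)

def add_connection(genome):
    """Add Connection: link two nodes not yet connected, random weight."""
    source = random.choice(genome["nodes"])
    target = random.choice(genome["nodes"])
    existing = {(c["source"], c["target"]) for c in genome["connections"]}
    if (source, target) not in existing:
        genome["connections"].append(
            {"source": source, "target": target,
             "weight": random.uniform(-1, 1)})

def add_node(genome):
    """Add Node: split a random connection, inserting a new node."""
    conn = random.choice(genome["connections"])
    new_node = max(genome["nodes"]) + 1
    genome["nodes"].append(new_node)
    # Incoming link gets weight 1, outgoing keeps the old weight, so the
    # new structure initially computes (nearly) the same function.
    genome["connections"].append(
        {"source": conn["source"], "target": new_node, "weight": 1.0})
    genome["connections"].append(
        {"source": new_node, "target": conn["target"],
         "weight": conn["weight"]})
    genome["connections"].remove(conn)
```

Because each mutation preserves (or barely changes) existing behavior, networks can grow structure incrementally, starting from minimal topologies.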
7. Module Mutation
- A mutation that adds a module
- Can be done in many different ways
- Can happen more than once, for multiple modules
[Figure: MM(Random) and MM(Duplicate) network diagrams]
(cf. Calabretta et al. 2000)
(Schrum and Miikkulainen 2012)
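The two module mutations can be sketched as follows. The genome layout (a `modules` list whose entries hold `(source, weight)` input pairs) is a hypothetical simplification for illustration, not the MM-NEAT representation.

```python
import random

def module_mutation(genome, duplicate=True):
    """Add a new output module to the network.

    MM(Duplicate): the new module copies an existing module's input
    sources and weights, so it initially behaves identically to it.
    MM(Random): same sources, but fresh random weights."""
    old = random.choice(genome["modules"])  # module to base the new one on
    new_id = max(m["id"] for m in genome["modules"]) + 1
    new = {"id": new_id, "inputs": []}
    for (src, w) in old["inputs"]:
        new["inputs"].append((src, w if duplicate else random.uniform(-1, 1)))
    genome["modules"].append(new)
    return new_id
```

MM(Duplicate) is the safer variant: since the copy starts out behaviorally identical, later mutations can specialize it without an immediate fitness penalty, whereas an MM(Random) module starts out behaving randomly.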
8. Ms. Pac-Man
- Domain needs multimodal behavior to succeed
- Predator/prey variant
- Pac-Man takes on both roles
- Goal: Maximize score by
  - Eating all pills in each level
  - Avoiding threatening ghosts
  - Eating ghosts (after power pill)
- Non-deterministic
- Very noisy evaluations
- Four mazes
- Behavior must generalize
[Video: Human Play]
9. Task Overlap
- Distinct behavioral modes
- Eating edible ghosts
- Clearing levels of pills
- More?
- Are ghosts currently edible?
- Possible some are and some are not
- Task division is blended
10. Previous Work in Pac-Man
- Custom Simulators
  - Genetic Programming: Koza 1992
  - Neuroevolution: Gallagher & Ledwich 2007, Burrow & Lucas 2009, Tan et al. 2011
  - Reinforcement Learning: Burrow & Lucas 2009, Subramanian et al. 2011, Bom 2013
  - Alpha-Beta Tree Search: Robles & Lucas 2009
- Screen Capture Competition: Requires Image Processing
  - Evolution & Fuzzy Logic: Handa & Isozaki 2008
  - Influence Map: Wirth & Gallagher 2008
  - Ant Colony Optimization: Emilio et al. 2010
  - Monte-Carlo Tree Search: Ikehata & Ito 2011
  - Decision Trees: Foderaro et al. 2012
- Pac-Man vs. Ghosts Competition (Pac-Man)
  - Genetic Programming: Alhejali & Lucas 2010, 2011, 2013, Brandstetter & Ahmadi 2012
  - Monte-Carlo Tree Search: Samothrakis et al. 2010, Alhejali & Lucas 2013
  - Influence Map: Svensson & Johansson 2012
  - Ant Colony Optimization: Recio et al. 2012
- Pac-Man vs. Ghosts Competition (Ghosts)
  - Neuroevolution: Wittkamp et al. 2008
  - Evolved Rule Set: Gagne & Congdon 2012
11. Evolved Direction Evaluator
- Inspired by Brandstetter and Ahmadi (CIG 2012)
- Net with single output and direction-relative sensors
- Each time step, run net for each available direction
- Pick direction with highest net output
[Figure: Left and Right preferences fed into an argmax, which selects Left]
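The per-direction evaluation loop above can be sketched as follows. The `sense_relative_to` helper and `net` callable are hypothetical stand-ins for the direction-relative sensors and evolved network described on this slide.

```python
def choose_direction(net, state, available_directions):
    """Run the same network once per available direction, with sensors
    computed relative to that direction, and return the argmax."""
    preferences = {}
    for d in available_directions:
        sensors = state.sense_relative_to(d)  # hypothetical sensor helper
        preferences[d] = net(sensors)         # single scalar preference
    return max(preferences, key=preferences.get)
```

Because the same network weights score every direction, the evaluator is symmetric by construction: it never has to learn separate left/right/up/down behaviors.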
12. Module Setups
- Manually divide domain with Multitask
  - Two Modules: Threat/Any Edible
  - Three Modules: All Threat/All Edible/Mixed
- Discover new divisions with preference neurons
  - Two Modules, Three Modules, MM(R), MM(D)
[Figure: Two-Module Multitask, Two Modules, and MM(D) network diagrams]
13. Average Champion Scores Across 20 Runs
14. Most Used Champion Modules
Patterns of module usage correspond to different score ranges
15. Most Used Champion Modules
Low: One Module
Most low-scoring networks use only one module
16. Most Used Champion Modules
Low: One Module
Medium-scoring networks use their primary module 80% of the time
Edible/Threat Division
17. Most Used Champion Modules
Luring/Surrounded Module
Low: One Module
Surprisingly, the best networks use one module 95% of the time
Edible/Threat Division
18. Multimodal Behavior
- Different colors are for different modules
Learned Edible/Threat Division
Learned Luring/Surrounded Module
Three-Module Multitask
19. Comparison with Other Work

Authors                      | Method            | Eval Type   | AVG    | MAX
Alhejali and Lucas 2010      | GP                | Four Maze   | 16,014 | 44,560
Alhejali and Lucas 2011      | GP+Camps          | Four Maze   | 11,413 | 31,850
Best Multimodal Result       | Two Modules/MM(D) | Four Maze   | 32,959 | 44,520
Recio et al. 2012            | ACO               | Competition | 36,031 | 43,467
Brandstetter and Ahmadi 2012 | GP Direction      | Competition | 19,198 | 33,420
Alhejali and Lucas 2013      | MCTS              | Competition | 28,117 | 62,630
Alhejali and Lucas 2013      | MCTS+GP           | Competition | 32,641 | 69,010
Best Multimodal Result       | MM(D)             | Competition | 65,447 | 100,070

Based on 100 evaluations with 3 lives. Four Maze: visit each maze once. Competition: visit each maze four times, advancing after 3,000 time steps.
20. Discussion
- Obvious division is between edible and threat
  - But these tasks are blended
  - Strict Multitask divisions do not perform well
  - Preference neurons can learn when best to switch
- Better division: one module when surrounded
  - Very asymmetrical, which is surprising
  - Highest-scoring runs use one module only rarely
  - Module activates when Pac-Man is almost surrounded
  - Often leads to eating a power pill (luring)
  - Helps Pac-Man escape in other risky situations
21. Future Work
- Go beyond two modules
- Issue with domain or evolution?
- Multimodal behavior of teams
- Ghost team in Pac-Man
- Physical simulation
- Unreal Tournament, robotics
22. Conclusion
- Intelligent module divisions produce the best results
- Modular networks make learning separate modes easier
- Results are better than previous work
- Module division unexpected
  - Half of neural resources for a seldom-used module (< 5% usage)
  - Rare situations can be very important
  - Some modules handle multiple modes
  - Pills, threats, edible ghosts
23. Questions?
- E-mail: schrum2@cs.utexas.edu
- Movies: http://nn.cs.utexas.edu/?ol-pm
- Code: http://nn.cs.utexas.edu/?mm-neat
24. What is Multimodal Behavior?
- From Observing Agent Behavior
  - Agent performs distinct tasks
  - Behavior very different in different tasks
  - Single policy would have trouble generalizing
- Reinforcement Learning Perspective
  - Instance of Hierarchical Reinforcement Learning
  - A mode of behavior is like an option
    - A temporally extended action
    - A control policy that is only used in certain states
  - Policy for each mode must be learned as well
- Idea From Supervised Learning
  - Multitask Learning trains on multiple known tasks
25. Behavioral Modes vs. Network Modules
- Different behavioral modes
  - Determined via observation of behavior; subjective
  - Any net can exhibit multiple behavioral modes
- Different network modules
  - Determined by connectivity of network
  - Groups of policy outputs designated as modules (sub-policies)
  - Modules distinct even if behavior is same/unused
- Network modules should help build behavioral modes
[Figure: network with shared Sensors feeding Module 1 and Module 2]
26. Preference Neuron Arbitration
- How can network decide which module to use?
- Find preference neuron (grey) with maximum output
- Corresponding policy neurons (white) define behavior
Example: outputs are (0.6, 0.1, 0.5, 0.7). Since 0.7 > 0.1, use Module 2; its policy neuron output of 0.5 defines the agent's behavior.
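The arbitration rule above can be sketched as follows. It assumes outputs are grouped per module with each module's preference neuron last, matching the (policy, preference, policy, preference) reading of the (0.6, 0.1, 0.5, 0.7) example; the exact output ordering is an assumption of this sketch.

```python
def arbitrate(outputs, num_modules):
    """Pick the module whose preference neuron fires highest and
    return (module index, that module's policy outputs)."""
    size = len(outputs) // num_modules           # neurons per module
    best, best_pref = 0, float("-inf")
    for m in range(num_modules):
        pref = outputs[m * size + size - 1]      # last neuron = preference
        if pref > best_pref:
            best, best_pref = m, pref
    policy = outputs[best * size : best * size + size - 1]
    return best, policy
```

With the slide's numbers, (0.6, 0.1, 0.5, 0.7) selects module index 1 and policy output [0.5].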
27. Pareto-Based Multiobjective Optimization (Pareto 1890)
[Figure: tradeoff between objectives; one point has high health but dealt little damage, another dealt a lot of damage but lost lots of health (Deb et al. 2000)]
28. Non-Dominated Sorting Genetic Algorithm II (Deb et al. 2000)
- Population P with size N; evaluate P
- Use mutation (+ crossover) to get P' of size N; evaluate P'
- Calculate non-dominated fronts of P ∪ P' (size 2N)
- New population of size N from highest fronts of P ∪ P'
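The non-dominated sorting step above can be sketched in Python. This shows only Pareto dominance and front extraction over fitness vectors (maximizing all objectives); NSGA-II's crowding-distance tie-breaking and selection are omitted.

```python
def dominates(a, b):
    """a dominates b if a is no worse in every objective (maximizing)
    and strictly better in at least one."""
    return (all(x >= y for x, y in zip(a, b)) and
            any(x > y for x, y in zip(a, b)))

def non_dominated_fronts(population):
    """Sort fitness vectors into successive Pareto fronts: front 0 is
    non-dominated, front 1 is non-dominated once front 0 is removed, etc."""
    remaining = list(population)
    fronts = []
    while remaining:
        front = [p for p in remaining
                 if not any(dominates(q, p) for q in remaining if q != p)]
        fronts.append(front)
        remaining = [p for p in remaining if p not in front]
    return fronts
```

The new population is then filled front by front from the combined parent and child populations, which is what makes NSGA-II elitist.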
29. Direction Evaluator Modules
- Network is evaluated in each direction
- For each direction, a module is chosen
  - Human-specified (Multitask) or Preference Neurons
- Chosen module's policy neuron sets direction preference
Left outputs: (0.5, 0.1, 0.7, 0.6) → 0.6 > 0.1, so Left preference is 0.7
Right outputs: (0.3, 0.8, 0.9, 0.1) → 0.8 > 0.1, so Right preference is 0.3
Since 0.7 > 0.3, Ms. Pac-Man chooses to go left, based on Module 2
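Combining the per-direction evaluation with preference-neuron arbitration gives the following sketch, using this slide's two-module, (policy1, pref1, policy2, pref2) output layout; that layout is an assumption of the sketch.

```python
def direction_preference(outputs):
    """For one direction: the module with the higher preference neuron
    supplies that direction's policy (preference) value."""
    policy1, pref1, policy2, pref2 = outputs
    return policy2 if pref2 > pref1 else policy1

def pick_direction(outputs_by_direction):
    """Choose the direction whose selected-module policy value is highest."""
    return max(outputs_by_direction,
               key=lambda d: direction_preference(outputs_by_direction[d]))
```

Note that arbitration happens independently per direction, so different directions can be scored by different modules on the same time step, as in the slide's example.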
30. Discussion (2)
- Good divisions are harder to discover
- Some modular champions use only one module
- Particularly MM(R): new modules too random
- Are evaluations too harsh/noisy?
- Easy to lose one life
- Hard to eat all pills to progress
- Discourages exploration
- Hard to discover useful modules
- Make search more forgiving