Title: Automated Testing of Massively Multi-Player Games: Lessons Learned from The Sims Online

1. Automated Testing of Massively Multi-Player Games
Lessons Learned from The Sims Online
2. Context: What Is Automated Testing?
3. Classes Of Testing
- System Stress
- Feature Regression
- Load
- QA
- Developer
4. Automation Components
- Startup Control
- Repeatable, Synchronized Inputs
- Results Analysis

5. What Was Not Automated?
- Visual Effects
6. Talk Outline: Automated Testing
- Design & Initial Implementation (1/3)
  - Architecture, Scripting Tests, Test Client
  - Initial Results
- Fielding: Analysis & Adaptations (1/3)
- Wrap-up & Questions (1/3)
  - What worked best, what didn't
  - Tabula Rasa: MMP / SPG
Time: 60 minutes
7. Requirements
- Load Testing
- Regression Testing
- High Code Churn Rate
8. Design Constraints
- Load
- Regression
- Churn Rate
9. Single, Data-Driven Test Client
- Regression and Load both drive:
  - Reusable Scripts & Data
  - a Single API
  - one Test Client
10. Data-Driven Test Client
- Regression: testing feature correctness
- Load: testing system performance
- Both share reusable scripts & data, a single API, and one test client
- The single API exposes:
  - Key game states
  - Pass/fail & responsiveness
  - Configurable logs & metrics
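The single-API idea can be sketched as follows. This is a hypothetical illustration (the class and method names are invented, not from the TSO toolkit): one client surface serves both regression scripts, which check pass/fail on feature correctness, and load scripts, which only care about metrics.

```python
# Hypothetical sketch: one test-client API for both regression and load.
# All names (TestClient, run, game_state) are illustrative assumptions.
class TestClient:
    def __init__(self):
        self.state = "offline"
        self.metrics = {"commands": 0}   # configurable metrics, shared by load tests

    def run(self, command, *args):
        """Single entry point used by both regression and load scripts."""
        self.metrics["commands"] += 1
        if command == "login":
            self.state = "online"
        elif command == "enter_lot":
            self.state = "inlot"
        return self.state

    def game_state(self):
        """Key game state, probed by regression scripts for pass/fail."""
        return self.state

client = TestClient()
client.run("login")
client.run("enter_lot")
print(client.game_state(), client.metrics["commands"])  # inlot 2
```

Regression reads `game_state()` for correctness; load reads `metrics` for throughput. The point of the design is that both reuse the same scripts and the same client.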
11. Problem: Testing Accuracy
- Load & Regression inputs must be:
  - Accurate
  - Repeatable
- Churn rate: logic/data in constant motion
- How to keep the test client accurate?
- Solution: the game client becomes the test client
  - Exact mimicry
  - Lower maintenance costs
12. Test Client = Game Client
13. Game Client: How Much To Keep?
Game Client layers: View → Presentation Layer → Logic
14. What Level To Test At?
- Drive the client at the View layer (raw mouse clicks)
- Regression: too brittle (a pixel shift breaks scripts)
- Load: too bulky
15. What Level To Test At?
- Drive the client with internal events, below the View
- Regression: still too brittle (churn rate vs. logic & data)
16. Gameplay Semantic Abstractions
Basic gameplay changes less frequently than UI or protocol implementations.
NullView Client: drop the View; keep the Presentation Layer and Logic.
17. Scriptable User Play Sessions
- SimScript
  - A collection of Presentation Layer primitives
  - Synchronization: wait_until, remote_command
  - State probes: arbitrary game state (an avatar's body skill, a lamp's on/off state, ...)
- Test scripts: specific, ordered inputs
  - Single-user play sessions
  - Multi-user play sessions
18. Scriptable User Play Sessions
- Scriptable play sessions were a big win
  - Load: tunable based on actual play
  - Regression: constantly repeat hundreds of play sessions, validating correctness
- Gameplay semantics were very stable
  - UI / protocols shifted constantly
  - Gameplay remained (about) the same
19. SimScript: Abstract User Actions
- include_script setup_for_test.txt
- enter_lot alpha_chimp
- wait_until game_state inlot
- chat "I'm an Alpha Chimp, in a Lot."
- log_message "Testing object purchase."
- log_objects
- buy_object chair 10 10
- log_objects
20. SimScript: Control & Sync
- Have a remote client use the chair:
  - remote_cmd monkey_bot use_object chair sit
- set_data avatar reading_skill 80
- set_data book unlock
- use_object book read
- wait_until avatar reading_skill 100
- set_recording on
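As a rough illustration of what a synchronization primitive like `wait_until` must do under the hood, here is a minimal polling sketch in Python. The mechanism, names, and timeout policy are assumptions for illustration, not TSO's actual implementation:

```python
import time

class ScriptTimeout(Exception):
    """Raised when a scripted wait never reaches its target state."""

def wait_until(probe, expected, timeout_s=30.0, poll_s=0.5):
    """Block the script until probe() == expected, or raise on timeout.
    'probe' stands in for a SimScript state probe (e.g. a skill value)."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe() == expected:
            return
        time.sleep(poll_s)
    raise ScriptTimeout(f"state never reached {expected!r}")

# Usage with a fake probe that reaches the target after a couple of polls,
# mimicking 'wait_until avatar reading_skill 100':
state = {"reading_skill": 80}
def probe():
    state["reading_skill"] = min(100, state["reading_skill"] + 10)
    return state["reading_skill"]

wait_until(probe, 100, timeout_s=5.0, poll_s=0.01)
print(state["reading_skill"])  # 100
```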
21. Client Implementation
22. Composable Client
- Scripts / Cheat Console / GUI drive the Presentation Layer, which drives the Game Logic.
23. Composable Client
- One instance may load Console / Lurker / GUI; another Scripts / Console / GUI.
- Any / all components may be loaded per instance.
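A minimal sketch of per-instance composition, assuming a simple component registry. The registry mechanism is invented for illustration; only the component names (Scripts, Console, GUI, Lurker) come from the slide:

```python
# Hypothetical component registry: each client instance loads only the
# drivers its role needs. Values stand in for real component objects.
AVAILABLE = {
    "scripts": lambda: "script driver",
    "console": lambda: "cheat console",
    "gui":     lambda: "full GUI view",
    "lurker":  lambda: "read-only observer",
}

def build_client(component_names):
    """Instantiate only the requested components for this instance."""
    return {name: AVAILABLE[name]() for name in component_names}

load_client = build_client(["scripts", "console"])         # headless load client
dev_client  = build_client(["scripts", "console", "gui"])  # developer client
print(sorted(load_client))  # ['console', 'scripts']
```

The payoff is that load clients skip the GUI entirely, while developers run the same client with the view attached.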
24. Lesson: View & Logic Entangled
- In the original game client, View and Logic were tightly interwoven.
25. Few Clean Separation Points
- Boundaries between View, Presentation Layer, and Logic were blurred.
26. Solution: Refactored for Isolation
- View, Presentation Layer, and Logic split along clean interfaces.
27. Lesson: NullView Debugging
Without the (legacy) view system attached, tracing was difficult.
28. Solution: Embedded Diagnostics
- Timeout handlers
- Diagnostics embedded throughout the Presentation Layer and Logic
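One way such embedded diagnostics can work, sketched with invented names: each subsystem registers a state probe, and the timeout handler dumps them all instead of letting a headless client hang silently. This is an assumed design for illustration, not the TSO code:

```python
# Hypothetical embedded-diagnostics registry for a NullView client.
diagnostics = {}

def register_diagnostic(fn):
    """Subsystems register a callback that reports their internal state."""
    diagnostics[fn.__name__] = fn
    return fn

def on_timeout(step_name):
    """Timeout handler: collect a state report from every subsystem."""
    report = {name: fn() for name, fn in sorted(diagnostics.items())}
    return f"TIMEOUT in {step_name}: {report}"

@register_diagnostic
def logic_state():
    return "inlot"   # stub; a real client would probe game logic

@register_diagnostic
def presentation_queue_depth():
    return 3         # stub; depth of pending internal events

print(on_timeout("buy_object"))
```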
29. Talk Outline: Automated Testing
- Design & Initial Implementation (1/3)
  - Architecture & Design
  - Test Client
  - Initial Results
- Lessons Learned: Fielding (1/3)
- Wrap-up & Questions (1/3)
Time: 60 minutes
30. Mean Time Between Failure
- Random events: log & execute
- Record client lifetime / RAM
- Worked, just not relevant in early stages of development
- Most failures / leaks found were not high-priority at that time, when weighed against server crashes
31. Monkey Tests
- Constant repetition of simple, isolated actions against servers
- Very useful
  - Direct observation of servers while under constant, simple input
  - Server processes aged all day
- Examples:
  - Login / Logout
  - Enter House / Leave House
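A monkey test of the login/logout kind can be sketched as a simple aggregate-counting loop. The client API here is hypothetical; the structural point is that a failure never stops the run, since aggregate counts are the product:

```python
# Sketch of a monkey test: hammer one isolated action pair against a
# server all day and count outcomes. The client API is an assumption.
def monkey_login_logout(client, iterations):
    passes = failures = 0
    for _ in range(iterations):
        try:
            client.login("monkey_bot")
            client.logout()
            passes += 1
        except Exception:
            failures += 1   # keep going: only aggregate results matter
    return passes, failures

class FakeClient:
    """Stand-in for a real NullView client connection."""
    def login(self, who): pass
    def logout(self): pass

print(monkey_login_logout(FakeClient(), 400))  # (400, 0)
```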
32. QA Test Suite Regression
- High false-positive rate = high maintenance
  - New bugs / old bugs
  - Shifting game design
  - Unknown failures
- Not helping in day-to-day work.
33. Talk Outline: Automated Testing
- Design & Initial Implementation (1/4)
- Fielding: Analysis & Adaptations (1/2)
  - Non-Determinism
  - Maintenance Overhead
  - Solutions & Results
  - Monkey / Sniff / Load / Harness
- Wrap-up & Questions (1/4)
Time: 60 minutes
34. Analysis: Testing Isolated Features
35. Analysis: Critical Path
Test case: can an Avatar sit in a chair?
Failures on the Critical Path block access to much of the game:
login() → create_avatar() → buy_house() → enter_house() → buy_object() → use_object()
36. Solution: Monkey Tests
- Primitives placed in Monkey Tests
  - Isolate as much as possible, repeat 400x
  - Report only aggregate results
  - Create Avatar: 93% pass (375 of 400)
- Poor Man's Unit Test
  - Feature-based, not class-based
  - Limited isolation
  - Easy failure analysis / reporting
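The aggregate report line on this slide follows a simple format; a small helper reproduces it. Rounding the percentage down matches 375/400 being reported as 93%:

```python
# Formatting helper for the slide's aggregate monkey-test report line.
def summarize(feature, passed, total):
    pct = passed * 100 // total   # round down to a whole percent
    return f"{feature}: {pct}% pass ({passed} of {total})"

print(summarize("Create Avatar", 375, 400))
# Create Avatar: 93% pass (375 of 400)
```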
37. Talk Outline: Automated Testing
- Design & Initial Implementation (1/3)
- Lessons Learned: Fielding (1/3)
  - Non-Determinism
  - Maintenance Costs
  - Solution Approaches
  - Monkey / Sniff / Load / Harness
- Wrap-up & Questions (1/3)
Time: 60 minutes
38. Analysis: Maintenance Cost
- High defect rate in game code
  - Code coupling: side effects
  - Churn rate: frequent changes
  - Critical path: fatal dependencies
- High debugging cost
  - Non-deterministic, distributed logic
39. Turnaround Time
Tests were too far removed from the introduction of defects.
40. Critical Path Defects Were Very Costly
41. Solution: Sniff Test
42. Solution: Hourly Diagnostics
- SniffTest Stability Checker
  - Emulates a developer: every hour, sync / build / test
  - Critical Path monkeys ran non-stop
  - Constant baseline
- Traffic Generation
  - Keep the pipes full, servers aging
  - Keep the DB growing
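The hourly "emulated developer" cycle can be sketched as follows. The concrete commands (p4 sync, make, run_monkeys) are illustrative assumptions, not the actual TSO tool names:

```python
# Sketch of one sniff-test cycle: sync the tree, build, run the
# critical-path monkeys, report the first failing step.
import subprocess

STEPS = [["p4", "sync"], ["make", "all"], ["run_monkeys", "--critical-path"]]

def sniff_once(runner=subprocess.run):
    """One sync/build/test cycle; returns 'PASS' or the failing step."""
    for cmd in STEPS:
        if runner(cmd).returncode != 0:
            return "FAIL at " + " ".join(cmd)
    return "PASS"

# The real checker would loop: run sniff_once() every hour and publish
# the result as the team's stability baseline.
```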
43. Analysis: CONSTANT SHOUTING IS REALLY IRRITATING
- Bugs spawned many, many emails
- Solution: Report Managers
  - Aggregate / correlate across tests
  - Filter known defects
  - Translate common failure reports to their root causes
- Solution: Data Managers
- Information overload: automated workflow tools are mandatory
44. ToolKit Usability
- Workflow automation
- Information management
- Developer / tester push-button ease of use
- XP flavour: increasingly easy to run tests
  - Must be easier to run than to avoid running
  - Must solve problems on the ground, now
45. Sample Testing Harness Views
46. Load Testing Goals
- Expose issues that only occur at scale
- Establish hardware requirements
- Establish that response is playable at scale
- Emulate user behaviour
  - Use server-side metrics to tune test scripts against observed Beta behaviour
- Run full-scale load tests daily
47. Load Testing Data Flow
[Diagram] The Load Testing Team operates a Load Control Rig, which fans tests out across multiple Test Driver CPUs, each hosting many test clients. The clients send game traffic to the Server Cluster; internal system probes and monitors feed client metrics, resource data, and debugging data back to the team.
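The fan-out across test driver CPUs is at bottom a partitioning problem; here is a sketch. The driver count is illustrative, though the talk does report runs of up to 4,000 clients:

```python
# Sketch: a load control rig splitting a target client count as evenly
# as possible across test-driver machines.
def partition(total_clients, drivers):
    base, extra = divmod(total_clients, drivers)
    return [base + (1 if i < extra else 0) for i in range(drivers)]

print(partition(4000, 3))  # [1334, 1333, 1333]
```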
48. Load Testing: Lessons Learned
- Very successful
  - Scale-break testing up to 4,000 clients
- Some conflicting requirements w/ Regression
  - Continue-on-fail
  - Transaction tracking
- NullView client a little chunky
49. Current Work
- QA test suite automation
- Workflow tools
- Integrating testing into the new-feature design/development process
- Planned work
  - Extend Esper Toolkit for general use
  - Port to other Maxis projects
50. Talk Outline: Automated Testing
- Design & Initial Implementation (1/3)
- Lessons Learned: Fielding (1/3)
- Wrap-up & Questions (1/3)
  - Biggest Wins / Losses
  - Reuse
  - Tabula Rasa: MMP / SSP
Time: 60 minutes
51. Biggest Wins
- Presentation Layer abstraction
  - NullView client
  - Scripted play sessions: powerful for regression & load
- Pre-checkin SniffTest
- Load Testing
- Continual usability enhancements
- Team
  - Upper management commitment
  - Focused group, senior developers
52. Biggest Issues
- Order of testing
  - MTBF / QA test suites should have come last
  - Not relevant when the early game was too unstable
  - Find/fix lag too distant from development
- Changing TSO's development process
  - Tool adoption was slow, unless mandated
- Noise
  - Constant flood of test results
  - Number of game defects, testing defects
  - Non-determinism / false positives
53. Tabula Rasa
How would I start the next project?
54. Tabula Rasa
- Pre-checkin Sniff Test
  - There's just no reason to let code break.
55. Tabula Rasa
- Pre-checkin SniffTest: keep Mainline working
- Hourly Monkey Tests
  - Useful baseline; keeps servers aging.
56. Tabula Rasa
- Pre-checkin SniffTest: keep Mainline working
- Hourly Stability Checkers: baseline for developers
- Dedicated Tools Group
  - Continual usability enhancements adapted tools to meet on-the-ground conditions.
57. Tabula Rasa
- Pre-checkin SniffTest: keep Mainline working
- Hourly Stability Checkers: baseline for developers
- Dedicated Tools Group: easy to use = used
- Executive-Level Support
  - Mandates required to shift how entire teams operated.
58. Tabula Rasa
- Pre-checkin SniffTest: keep Mainline working
- Hourly Stability Checkers: baseline for developers
- Dedicated Tools Group: easy to use = used
- Executive Support: radical shifts in process
- Load Test Early & Often
59. Tabula Rasa
- Pre-checkin SniffTest: keep Mainline working
- Hourly Stability Checkers: baseline for developers
- Dedicated Tools Group: easy to use = used
- Executive Support: radical shifts in process
- Load Test Early & Often: break it before Live
- Distribute test development ownership across the full team
60. Next Project: Basic Infrastructure
- Control Harness for Clients & Components
- Reference Client
  - Self Test
  - Living Doc
- Reference Feature
- Regression Engine
61. Building Features: NullView First
- Control Harness
- Reference Client (Self Test / Living Doc)
- Reference Feature
- Regression Engine
- NullView Client
62. Build The Tests With The Code
- Control Harness
- Reference Client (Self Test)
- Reference Feature
- Regression Engine
- NullView Client
- Login Monkey Test
Nothing gets checked in without a working Monkey Test.
63. Conclusion
- Estimated impact on MMP: high
  - Sniff Test kept developers working
  - Load Test identified critical failures pre-launch
  - Presentation Layer: scriptable play sessions
- Cost to implement: medium
  - Much lower for SSP games
Repeatable, coordinated inputs at scale and pre-checkin regression were very significant schedule accelerators.
64. Conclusion
Go For It.
65. Talk Outline: Automated Testing
- Design & Initial Implementation (1/3)
- Lessons Learned: Fielding (1/3)
- Wrap-up & Questions (1/3)
Time: 60 minutes