Title: Scaling The Software Development Process: Lessons Learned from The Sims Online
1Scaling The Software Development Process
Lessons Learned from The Sims Online
- Greg Kearney, Larry Mellon, Darrin West
- Spring 2003, GDC
2Talk Overview
- Covers Software Engineering techniques to help when projects get big
- Code structure
- Work processes (for programmers)
- Testing
- Does Not Cover
- Game Design / Content Pipeline
- Operations / Project Management
3How to Apply it.
- We didn't do all of this right away
- Improve what you can
- Don't change too much at once
- Prove that it works, and others will take up the cause
- Iterate
4Match Process to Scale
[Figure: Team Efficiency plotted against Team Size]
5What You Should Leave With
- TSO Lessons Learned
- Where we were with our software process
- What we did about it
- How it helped
- Some Rules of Thumb
- General practices that tend to smooth software development at scale
- Not a blueprint for MMP development
- Useful frame of reference
6Classes of Lessons Learned & Rules
- Architecture / Design: Keep it Simple
- Minimizing dependencies, fatal couplings
- Minimizing complexity, brittleness
- Workspace Management: Keep it Clean
- Code and directory structure
- Check in and integration strategies
- Dev. Support Structure: Make it Easy, Prove it
- Testing
- Automation
- All of these had to change as we scaled up.
- They eventually exceeded the team's ability to deal with (using existing tools and processes).
7Non-Geek Analogy
- Sharpen your tools.
- Clean up your mess.
- Measure twice, cut once.
- Stay with your buddy.
Bad flashbacks found at http://www.easthamptonhigh.org/cernak/ and http://www.hancock.k12.mi.us/high/art/wood/index.html
8Key Factors Affecting Efficiency
- High Churn Rate: a large number of coders times tightly coupled code equaled frequent breaks
- Our code had a deep root system
- And we had a forest of changes to make
Big root ball found at http://www.on.ec.gc.ca/canwarn/norwich/norsummary-e.html
9Make It Smaller
10Key Factors Affecting Efficiency
- Key Logs: some issues were preventing other issues from even being worked on
11Key Factors Affecting Efficiency
- A chain of single points of failure took out the entire team
- Login -> Create an avatar -> Enter a city -> Buy a house -> Enter a house -> Buy the chair -> Sit on a chair
12So, What Did We Do That Worked?
- Switched to a logical architecture with less coupling
- Switched to a code structure with fewer dependencies
- Put in scaffolding to keep everyone working
- Developed sophisticated configuration management
- Instituted automated testing
- Metrics, Metrics, Metrics
13So, What Did We Do That Didn't?
- Long range milestone planning
- Network emulator(s)
- Over-engineered a few things (too general)
- Some tasks failed due to
- Not replanning, reviewing long tasks
- Not breaking up long tasks
- Coding standard changed part way through
14What we were faced with
- 750K lines of legacy Windows code
- Port it to Linux
- Change from multiplayer to Client/Server
- 18 months
- Developers must remain alive after shipping
- Continuous releases starting at Beta
15Go To Final Architecture ASAP
16Go to final architecture ASAP
[Diagram: Multiplayer, with several Client Sims -> Evolve -> Here be Sync Hell]
17Final Architecture ASAP: Refactoring
- Decomposed into multiple DLLs
- Found the Simulator
- Interfaces
- Reference Counting
- Client/Server subclassing
- How it helped
- Reduced coupling. Even reduced compile times!
- Developers in different modules broke each other less often.
- We went everywhere and learned the code base.
18Final Architecture ASAP: It Had to Always Run
- But, clients would not behave predictably
- We could not even play test
- Game design was demoralized
- We needed a bridge, now!
19Final Architecture ASAP: Incremental Sync
- A quick temporary solution
- Couldn't wait for final system to be finished
- High overhead, couldn't ship it
- We took partial state snapshots on the server and restored to them on the client
- How it helped
- Could finally see the game as it would be.
- Allowed parallel game design and coding
- Bought time to lay in the right stuff.
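The snapshot-and-restore bridge on this slide can be pictured with a small sketch. Everything in it is an assumption made for illustration: the class names `ServerSim` and `ClientSim`, the `snapshot` and `restore` methods, and the dict-based state layout do not come from the talk.

```python
import copy


class ServerSim:
    """Authoritative simulation state (hypothetical layout: object id -> attributes)."""

    def __init__(self):
        self.objects = {}

    def snapshot(self, object_ids):
        # Copy only the requested objects: a partial state snapshot.
        return {oid: copy.deepcopy(self.objects[oid])
                for oid in object_ids if oid in self.objects}


class ClientSim:
    """Client-side copy of the simulation that may have drifted out of sync."""

    def __init__(self):
        self.objects = {}

    def restore(self, snap):
        # Overwrite local state with the server's version, object by object.
        for oid, attrs in snap.items():
            self.objects[oid] = copy.deepcopy(attrs)


server = ServerSim()
server.objects = {"chair": {"occupied": True}, "lamp": {"on": False}}

client = ClientSim()
client.objects = {"chair": {"occupied": False}, "lamp": {"on": True}}

# Resync only the chair; the lamp keeps its (possibly stale) client value.
client.restore(server.snapshot(["chair"]))
```

Copying whole objects on every resync is simple but expensive, which is consistent with the slide's remark that the overhead was too high to ship.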
20Final Architecture ASAP: Null View
- Created Null View HouseSim on Windows
- Same interface
- Null (text output) implementation
- How it helped
- No #ifdefs!
- Done under Windows, we could test this first step.
- We knew it was working during the port.
- Allowed us to port to Linux only the needed parts.
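A minimal sketch of the "same interface, null implementation" idea follows. The names `View`, `NullView`, `show_avatar`, and `move_avatar` are hypothetical, and the real HouseSim interface is not shown in the talk.

```python
from abc import ABC, abstractmethod


class View(ABC):
    """Interface the simulator talks to (hypothetical method name)."""

    @abstractmethod
    def show_avatar(self, name, x, y):
        ...


class NullView(View):
    """Null implementation: no graphics, only text output."""

    def __init__(self):
        self.lines = []

    def show_avatar(self, name, x, y):
        line = f"avatar {name} at ({x}, {y})"
        self.lines.append(line)
        print(line)


class HouseSim:
    """Simulator code depends only on the View interface, never on a platform."""

    def __init__(self, view):
        self.view = view

    def move_avatar(self, name, x, y):
        self.view.show_avatar(name, x, y)


view = NullView()
HouseSim(view).move_avatar("Bob", 1, 2)
```

Because the simulator only sees the interface, swapping in a graphical view or the text-only one needs no conditional compilation in the simulator itself.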
21Final Architecture ASAP: More Bridges
- HSBs proxy on Linux, pass-through to a Windows Sim.
- Disabled authentication, etc.
- How it helped
- Could exercise Linux components before finishing HouseSim port.
- Allowed us to debug server scale, performance and stability issues early.
- Made best use of Windows developers.
- Allowed single platform development. Faster compiles.
- How it helped
- Could keep working even when some of the system wasn't available.
22Mainline Must Work!
23If Mainline Doesn't Work, Nobody Works
- The Mainline source control branch must run
- Never go dark: Demo/Play Test every day
- If you hit a bug, do you sync to mainline, hoping someone else fixed it? Or did you just add it?
- If mainline breaks for only an hour, the project loses a man-week.
- If each developer breaks the mainline only once a month, it is broken every day.
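The two rules of thumb above are simple arithmetic. The sketch below works them through under stated assumptions: a 40-hour man-week, 20 working days per month, and a hypothetical team of 40 developers. None of these numbers come from the talk.

```python
HOURS_PER_MAN_WEEK = 40       # assumption
WORKING_DAYS_PER_MONTH = 20   # assumption


def hours_lost(team_size, outage_hours):
    """Developer-hours lost when the whole team is blocked for outage_hours."""
    return team_size * outage_hours


def breaks_per_working_day(team_size, breaks_per_dev_per_month=1):
    """Expected mainline breaks per working day across the team."""
    return team_size * breaks_per_dev_per_month / WORKING_DAYS_PER_MONTH


team = 40  # hypothetical team size
man_weeks_lost = hours_lost(team, 1) / HOURS_PER_MAN_WEEK
daily_breaks = breaks_per_working_day(team)
print(man_weeks_lost, daily_breaks)
```

With these assumptions a one-hour break costs one man-week, and one break per developer per month averages two breaks every working day.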
24Mainline must work: Sniff Test
- Mainline was breaking for simple things.
- Features you didn't touch (and didn't test).
- Created an auto-test to exercise all core functions.
- Quick to run. Fun to watch. Checked results.
- Mandated that it pass before submitting code changes.
- Break the build: feed the pig.
- How it helped
- Very simple test. Amazing difference.
- Sometimes we got lazy and trusted it too much. Doh!
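A pre-submit gate of this kind can be sketched as follows. This is not the TSO sniff test: the check names and the `run_sniff_test` and `may_submit` helpers are hypothetical, and the lambdas stand in for real scripted steps.

```python
def run_sniff_test(checks):
    """Run every (name, callable) check and return the names that failed."""
    failures = []
    for name, check in checks:
        try:
            ok = bool(check())
        except Exception:
            # A crash in a core function counts as a failed check.
            ok = False
        if not ok:
            failures.append(name)
    return failures


def may_submit(checks):
    """Submission is allowed only when every core check passes."""
    return not run_sniff_test(checks)


core_checks = [
    ("login", lambda: True),
    ("create_avatar", lambda: True),
    ("sit_on_chair", lambda: 1 / 0),  # simulates a crash in a core function
]

failed = run_sniff_test(core_checks)
print("sniff test failures:", failed)
```

The gate reports every failing check rather than stopping at the first one, so a developer sees the full damage in a single run.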
25Mainline must work: Stages to Sandboxing
- Got it to build reliably.
- Instituted Auto-Builds: email all on failure.
- Used a Pumpkin to avoid duplicate merge-test cycles, pulling partial submissions, ...
- Used a Pumpkin Queue when we really got rolling
- How it helped
- Far fewer thumbs twiddled.
- The extra process got on some people's nerves.
26Mainline must work: Sandboxing
- Finally, went to per-developer branching.
- Develop on your own branch.
- Submit changes to an integration engineer.
- Full Smoke test run per submission/feature.
- If it worked, integrated to mainline in priority
order, or else it is bounced.
- How it helped
- Mainline always runs. Pull any time.
- Releases are not delayed by partial features.
- No more code freezes going to release.
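The integration step described above (smoke test each submission, integrate passing ones in priority order, bounce the rest) might look roughly like this. The `integrate` function, the branch names, and the convention that a lower number means higher priority are all assumptions made for illustration.

```python
def integrate(submissions, smoke_test):
    """Process (priority, branch) pairs in priority order.

    Lower priority numbers are handled first (an assumption). Branches that
    pass the smoke test are integrated; the others are bounced back.
    """
    integrated, bounced = [], []
    for _priority, branch in sorted(submissions):
        if smoke_test(branch):
            integrated.append(branch)
        else:
            bounced.append(branch)
    return integrated, bounced


# Hypothetical submissions and a stand-in smoke test.
submissions = [(2, "ui-polish"), (1, "login-fix"), (3, "half-done-feature")]
integrated, bounced = integrate(
    submissions, smoke_test=lambda branch: branch != "half-done-feature")
print(integrated, bounced)
```

Because a bounced branch never reaches mainline, a partial feature delays only its own author, which matches the slide's point that releases are not held up.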
27Support Structure
28Background: Support Structure
- Team size placed design constraints on supporting tools
- Automation: big win in big teams
- Churn rate: tool accuracy / support cost
- Types of tools
- Data management: collection / correlation
- Testing: controlled, synced, repeatable inputs
- Baselines: my bug, your bug, or our bug?
29Overview: Support Structure
- Automated testing: designs to minimize impact of churn rate
- Automated data collection / correlation
- Distributed system: distributed data
- Dashboard / Esper / MonkeyWatcher
- Use case: load testing
- Controlled (tunable) inputs, observable results
- ScaleBreak
30Problem: Testing Accuracy
- Load and regression inputs must be
- Accurate
- Repeatable
- Churn rate: logic/data in constant motion
- How to keep testing client accurate?
- Solution: game client becomes test client
- Exact mimicry
- Lower maintenance costs
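One way to picture "game client becomes test client" is a presentation layer driven by either the GUI or a test harness. The sketch below assumes that shape. All class, command, and state names in it are hypothetical.

```python
class GameLogic:
    """Client-side game state (hypothetical fields)."""

    def __init__(self):
        self.state = {"seated": False, "lamp_on": False}


class PresentationLayer:
    """Semantic commands shared by the GUI and by the test driver."""

    def __init__(self, logic):
        self._logic = logic
        self._commands = {
            "sit": lambda: logic.state.update(seated=True),
            "toggle_lamp": lambda: logic.state.update(
                lamp_on=not logic.state["lamp_on"]),
        }

    def command(self, name):
        self._commands[name]()

    def state(self):
        return dict(self._logic.state)


class GameGUI:
    """Human-driven front end: translates clicks into commands."""

    def __init__(self, layer):
        self.layer = layer

    def on_chair_clicked(self):
        self.layer.command("sit")


class ScriptedControl:
    """Test-driven front end: issues the same commands from a script."""

    def __init__(self, layer):
        self.layer = layer

    def run(self, commands):
        for name in commands:
            self.layer.command(name)
        return self.layer.state()


gui_layer = PresentationLayer(GameLogic())
GameGUI(gui_layer).on_chair_clicked()

scripted_result = ScriptedControl(PresentationLayer(GameLogic())).run(["sit"])
```

Both drivers exercise the same code path below the presentation layer, so the test client cannot drift away from what the shipped client does.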
31Test Client == Game Client
[Diagram: Game GUI and Test Control both sit above the Presentation Layer, sending Commands and reading State; the Presentation Layer sits above the Client-Side Game Logic]
32Game Client How Much To Keep?
[Diagram: Game Client layers: View, Presentation Layer, Logic]
33What Level To Test At?
[Diagram: Game Client layers (View, Presentation Layer, Logic), with test input injected as Mouse Clicks]
- Regression: too brittle (pixel shift)
- Load: too bulky
34What Level To Test At?
[Diagram: Game Client layers (View, Presentation Layer, Logic), with test input injected as Internal Events]
- Regression: too brittle (churn rate vs. logic and data)
35Semantic Abstractions
Basic gameplay changes less frequently than UI or
protocol implementations.
[Diagram: NullView Client layers: View, Presentation Layer, Logic]
36Scriptable User Play Sessions
- Test Scripts: specific / ordered inputs
- Single user play session
- Multiple user play session
- SimScript
- Collection: Presentation Layer primitives
- Synchronization: wait_until, remote_command
- State probes: arbitrary game state
- Avatar's body skill, lamp on/off, ...
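The slides do not show SimScript syntax, so the following is only a Python sketch of the two ideas named above: a state probe and a `wait_until` synchronization step. The `FakeSession` class, its tick-based delay, and the `lamp_on` key are hypothetical stand-ins for a real client session.

```python
class FakeSession:
    """Stand-in for a scripted client; commands take effect after a delay."""

    def __init__(self):
        self.state = {"lamp_on": False}
        self._pending = []  # (ticks_left, key, value)

    def command(self, key, value, delay=2):
        self._pending.append((delay, key, value))

    def tick(self):
        # Advance simulated time by one step and apply any commands now due.
        still_pending = []
        for ticks_left, key, value in self._pending:
            if ticks_left <= 1:
                self.state[key] = value
            else:
                still_pending.append((ticks_left - 1, key, value))
        self._pending = still_pending

    def probe(self, key):
        """State probe: read one piece of game state."""
        return self.state[key]


def wait_until(session, predicate, max_ticks=10):
    """Advance the session until predicate holds or max_ticks is used up."""
    for _ in range(max_ticks):
        if predicate(session):
            return True
        session.tick()
    return predicate(session)


session = FakeSession()
session.command("lamp_on", True)
lamp_ok = wait_until(session, lambda s: s.probe("lamp_on"))
```

Waiting on observed state, rather than sleeping for a fixed time, keeps a script valid when server timing changes.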
37Scriptable User Play Sessions
- Scriptable play sessions: big win
- Load: tunable based on actual play
- Regression: walk a set of avatars through various play sessions, validating correctness per step
- Gameplay semantics: very stable
- UI / protocols shifted constantly
- Game play remained (about) the same
38Automated Test Team Baselines
- Hourly critical path stability tests
- Sync / clean / build / test
- Validate Mainline / Servers
- Snifftest weather report
- Hourly testing
- Constant reporting
39How Automated Testing Helped
- Current, accurate baseline for developers
- Scalebreak found many bugs
- Greatly increased stability
- Code base was safe
- Server health was known (and better)
40Tools & Large Teams
- High tool ROI
- team_size × automation_savings
- Faster triage
- Quickly narrow down problem
- across any system component
- Monitoring tools became a focal point
- Wiki: central doc repository
41Monitoring / Diagnostics
"When you can measure what you are speaking about and can express it in numbers, you know something about it. But when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind." - Lord Kelvin
- DeMarco: You cannot control what you cannot measure.
- Maxwell: To measure is to know.
- Pasteur: A science is as mature as its measurement tools.
42Dashboard
- System resource health tool
- CPU / Memory / Disk /
- Central point to access
- Status
- Test Results
- Errors
- Logs
- Cores
43Test Central / Monkey Watcher
- Test Central UI
- Control rig for developers and testers
- Monkey Watcher
- Collects and stores (distributed) test results
- Produces summarized reports across tests
- Filters known defects
- Provides baseline of correctness
- Web frontend, unique IDs per test
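The summarizing and defect-filtering role described for Monkey Watcher can be sketched roughly as below. The record layout (test id, outcome, defect id), the `summarize` function, and the sample defect IDs are assumptions, not the tool's real data model.

```python
from collections import Counter


def summarize(results, known_defects):
    """Tally results and separate new failures from already-known defects.

    results: iterable of (test_id, outcome, defect_id or None), where outcome
    is "pass" or "fail" (a hypothetical record format).
    """
    summary = Counter()
    new_failures = []
    for test_id, outcome, defect_id in results:
        if outcome == "pass":
            summary["pass"] += 1
        elif defect_id in known_defects:
            summary["known"] += 1
        else:
            summary["new"] += 1
            new_failures.append(test_id)
    return dict(summary), new_failures


results = [
    ("t-001", "pass", None),
    ("t-002", "fail", "BUG-17"),   # already tracked
    ("t-003", "fail", None),       # unexplained failure
]
summary, new_failures = summarize(results, known_defects={"BUG-17"})
print(summary, new_failures)
```

Filtering known defects leaves only the failures nobody has explained yet, which is what makes the report usable as a baseline.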
44Esper
- In-game profiler for a distributed system
- Internal probes may be viewed
- Per process / machine / cluster
- Time view or summary view
- Automated data management
- Coders add one line probe
- Esper data shows up on web site
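A one-line probe with automated aggregation might look like the sketch below. It is not Esper's API: the `probe` and `summarize_probes` functions and the metric name are hypothetical, and the sketch leaves out the per-process, per-machine, and per-cluster tagging the slide mentions.

```python
from collections import defaultdict

# In-memory store of samples, keyed by probe name (hypothetical design).
_samples = defaultdict(list)


def probe(name, value):
    """The one line a coder adds at the point of interest."""
    _samples[name].append(value)


def summarize_probes():
    """Summary view: count, mean and max for each probe."""
    return {
        name: {
            "count": len(values),
            "mean": sum(values) / len(values),
            "max": max(values),
        }
        for name, values in _samples.items()
    }


# Usage: record two samples of a hypothetical timing metric.
probe("sim.tick_ms", 2.0)
probe("sim.tick_ms", 4.0)
print(summarize_probes())
```

Keeping the call site to a single line lowers the cost of adding a measurement, which is presumably why the slide stresses it.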
45Use Case: Scale Break
- Never too early to begin scaling
- Idle: keep doubling server processes
- Busy: double users, dataset size
- Fix what broke, start again
- Tune input scripts using Beta data
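The "keep doubling until it breaks" loop can be sketched as follows. The `scale_break` function and the `run_load` callback are hypothetical. In practice `run_load` would launch that many server processes or scripted users and report whether the system stayed healthy.

```python
def scale_break(run_load, start=1, limit=1 << 20):
    """Double the load until run_load(n) reports failure.

    Returns (last_ok, first_bad). first_bad is None when nothing broke
    before the safety limit was reached.
    """
    last_ok = None
    n = start
    while n <= limit:
        if not run_load(n):
            return last_ok, n
        last_ok = n
        n *= 2
    return last_ok, None


# Stand-in system that copes with up to 100 simulated users.
last_ok, first_bad = scale_break(lambda users: users <= 100)
print(last_ok, first_bad)
```

After fixing whatever broke at `first_bad`, the same loop is run again, matching the slide's "fix what broke, start again".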
46Load Testing Data Flow
[Diagram: Load Testing Team; Load Control Rig; Test Clients running on Test Driver CPUs; Game Traffic to the Server Cluster (Internal Probes, System Monitors); Client Metrics, Resource Metrics and Debugging Data returned to the Load Testing Team]
47Outline: Wrapup
- Wins / Losses
- Rules: Analysis and Discussion
- Recommended reading
- Questions
48Process Wins / Losses
- Wins
- Module decomposition
- Logical client / server architecture
- Physical code structure
- Scaffolding for parallel development
- Tools to improve workflow
- Automated Regression / Load
49Process Wins / Losses
- Losses
- Early lack of tools
- #ifdef as a cross-platform port
- Single points of failure blocked entire
development team
50Not Done Yet: More Challenges
- How to ship, and ship, and ship
- How to balance infrastructure cleanup against new
feature development
51Rules of Thumb (1)
- KISS: software and processes
- Incremental changes
- <Inhale> <Hold It> <Exhale>
- <Say> Baby-Steps
- Continual tool/process improvement
52Rules of Thumb (2)
- Mainline has got to work
- Get something on the ground. Quickly.
53Rules of Thumb (3)
- Key Logs: break up quickly, ruthlessly
- Scaffolding: keep others working
- Do important things, not urgent things
- Module separation (logically, physically)
- If you can't measure it, you don't understand it
54Final Rule: Sharpen The Saw
- Efficiency impacted by
- Component coupling / team size
- Compile / load / test / analyze cycle
- Tool justification in large teams
- Large ROI at large scale
- 5% gain across 30 programmers
- Fred Brooks: 31st programmer
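The ROI claim is a one-line calculation, sketched here on the assumption that the "5" on the slide means a 5% efficiency gain for each programmer. The function name is hypothetical.

```python
def programmer_equivalents(team_size, percent_gain):
    """Extra capacity, in programmers, from a per-person efficiency gain."""
    return team_size * percent_gain / 100


extra = programmer_equivalents(30, 5)
print(extra)  # 1.5
```

On that reading, a tool that saves each of 30 programmers 5% of their time adds more capacity than hiring a 31st programmer would.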
55Recommended Reading
- Influences
- Extreme Programming
- Scott Meyers: large-scale software engineering
- Gamma et al: Design Patterns
- Caveat Emptor: slavish following not encouraged
- Consider ground conditions for your project
56Questions & Answers