Scaling The Software Development Process: Lessons Learned from The Sims Online

Greg Kearney, Larry Mellon, Darrin West. Spring 2003, GDC.

1
Scaling The Software Development Process
Lessons Learned from The Sims Online

Greg Kearney,
Larry Mellon, Darrin West
Spring 2003, GDC

2
Talk Overview

Covers Software Engineering techniques to help
when projects get big
Code structure
Work processes (for programmers)
Testing
Does Not Cover
Game Design / Content Pipeline
Operations / Project Management

3
How to Apply it.

We didn't do all of this right away
Improve what you can
Don't change too much at once
Prove that it works, and others will take up the
cause
Iterate

4
Match Process to Scale
[Chart: Team Efficiency versus Team Size]
5
What You Should Leave With

TSO Lessons Learned
Where we were with our software process
What we did about it
How it helped
Some Rules of Thumb
General practices that tend to smooth software
development at scale
Not a blueprint for MMP development
Useful frame of reference

6
Classes of Lessons Learned Rules

Architecture / Design: Keep it Simple
Minimizing dependencies, fatal couplings
Minimizing complexity, brittleness
Workspace Management: Keep it Clean
Code and directory structure
Check-in and integration strategies
Dev. Support Structure: Make it Easy, Prove it
Testing
Automation

All of these had to change as we scaled up.
They eventually exceeded the team's ability to
deal with them (using existing tools and processes).

7
Non-Geek Analogy

Sharpen your tools.
Clean up your mess.
Measure twice, cut once.
Stay with your buddy.

Bad flashbacks found at http://www.easthamptonhigh.org/cernak/
http://www.hancock.k12.mi.us/high/art/wood/index.html
8
Key Factors Affecting Efficiency

High Churn Rate: a large number of coders times
tightly coupled code equaled frequent breaks
Our code had a deep root system
And we had a forest of changes to make

Big root ball found at http://www.on.ec.gc.ca/canwarn/norwich/norsummary-e.html
9
Make It Smaller
10
Key Factors Affecting Efficiency

Key Logs: some issues were preventing other
issues from even being worked on

11
Key Factors Affecting Efficiency

A chain of single points of failure took out the
entire team

Login
Create an avatar
Enter a city
Buy a house
Enter a house
Buy the chair
Sit on a chair
12
So, What Did We Do That Worked?

Switched to a logical architecture with less
coupling
Switched to a code structure with fewer
dependencies
Put in scaffolding to keep everyone working
Developed sophisticated configuration management
Instituted automated testing
Metrics, Metrics, Metrics

13
So, What Did We Do That Didn't?

Long-range milestone planning
Network emulator(s)
Over-engineered a few things (too general)
Some tasks failed due to:
Not replanning, reviewing long tasks
Not breaking up long tasks
Coding standard changed partway through

14
What we were faced with

750K lines of legacy Windows code
Port it to Linux
Change from multiplayer to Client/Server
18 months
Developers must remain alive after shipping
Continuous releases starting at Beta

15
Go To Final Architecture ASAP
16
Go to final architecture ASAP
[Diagram labels: Multiplayer; Client Sim (repeated four times); "Here be Sync Hell"; Evolve]
17
Final Architecture ASAP: Refactoring

Decomposed into multiple DLLs
Found the Simulator
Interfaces
Reference Counting
Client/Server subclassing

How it helped
Reduced coupling. Even reduced compile times!
Developers in different modules broke each other
less often.
We went everywhere and learned the code base.

18
Final Architecture ASAP: It Had to Always Run

But, clients would not behave predictably
We could not even play test
Game design was demoralized
We needed a bridge, now!

19
Final Architecture ASAP: Incremental Sync

A quick temporary solution
Couldn't wait for final system to be finished
High overhead, couldn't ship it
We took partial state snapshots on the server and
restored to them on the client

How it helped
Could finally see the game as it would be.
Allowed parallel game design and coding
Bought time to lay in the right stuff.
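
The snapshot-and-restore bridge described above can be sketched roughly as follows. This is a minimal illustration, not the TSO implementation: game state is assumed to be a plain dictionary, and the names take_partial_snapshot and restore_snapshot are hypothetical.

```python
import copy
from typing import Any, Dict, Iterable

# Assumption: game state is modelled as a flat dictionary of named entries.
State = Dict[str, Any]


def take_partial_snapshot(server_state: State, keys: Iterable[str]) -> State:
    """Copy only the selected parts of the authoritative server state."""
    return {k: copy.deepcopy(server_state[k]) for k in keys if k in server_state}


def restore_snapshot(client_state: State, snapshot: State) -> State:
    """Overwrite the client's copy of each snapshotted entry; keep the rest."""
    client_state.update(copy.deepcopy(snapshot))
    return client_state
```

Copying whole entries on every sync is the kind of overhead that makes such a bridge unshippable, but it lets a drifting client be forced back to the server's view of the entries that matter.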

20
Final Architecture ASAP: Null View

Created Null View HouseSim on Windows
Same interface
Null (text output) implementation

How it helped
No #ifdefs!
Done under Windows, we could test this first
step.
We knew it was working during the port.
Allowed us to port to Linux only the needed
parts.
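
The "same interface, null implementation" idea can be sketched as below. The talk's code base was Windows code split into DLLs; Python is used here only for brevity, and the View, NullView, and step_simulation names and the show_avatar method are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import List


class View(ABC):
    """Interface the simulator talks to (hypothetical method names)."""

    @abstractmethod
    def show_avatar(self, name: str, x: int, y: int) -> None:
        ...


class NullView(View):
    """Same interface, but records text instead of drawing anything."""

    def __init__(self) -> None:
        self.lines: List[str] = []

    def show_avatar(self, name: str, x: int, y: int) -> None:
        self.lines.append(f"avatar {name} at ({x}, {y})")


def step_simulation(view: View) -> None:
    """Simulator code is identical whichever View it is handed."""
    view.show_avatar("bob", 3, 4)
```

Because the simulator only sees the interface, swapping in the null implementation needs no conditional compilation in the simulator itself.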

21
Final Architecture ASAP: More Bridges

HSBs: proxy on Linux, pass-through to a Windows
Sim.
Disabled authentication, etc.

How it helped
Could exercise Linux components before finishing
HouseSim port.
Allowed us to debug server scale, performance and
stability issues early.
Make best use of Windows developers.
Allowed single platform development. Faster
compiles.

How it helped
Could keep working even when some of the system
wasn't available.

22
Mainline Must Work!
23
If Mainline Doesn't Work, Nobody Works

The Mainline source control branch must run
Never go dark: Demo/Play Test every day
If you hit a bug, do you sync to mainline, hoping
someone else fixed it? Or did you just add it?

If mainline breaks for only an hour, the
project loses a man-week.
If each developer breaks the mainline only once
a month, it is broken every day.
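
Both rules of thumb above are simple arithmetic. A minimal sketch, assuming a hypothetical team of 40 developers, a 40-hour week, and 20 working days a month (the talk does not state these figures):

```python
# Hypothetical figures for illustration; the talk does not state them.
TEAM_SIZE = 40            # developers blocked while mainline is down
HOURS_PER_WEEK = 40       # one person-week
WORKDAYS_PER_MONTH = 20


def person_weeks_lost(outage_hours: float, team_size: int = TEAM_SIZE) -> float:
    """Person-weeks lost when the whole team is blocked for outage_hours."""
    return outage_hours * team_size / HOURS_PER_WEEK


def breaks_per_workday(breaks_per_dev_per_month: float,
                       team_size: int = TEAM_SIZE) -> float:
    """Expected mainline breaks per working day across the team."""
    return breaks_per_dev_per_month * team_size / WORKDAYS_PER_MONTH
```

Under these assumptions, person_weeks_lost(1) is one person-week, and breaks_per_workday(1) is two breaks per working day.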

24
Mainline must work: Sniff Test

Mainline was breaking for simple things.
Features you didn't touch (and didn't test).
Created an auto-test to exercise all core
functions.
Quick to run. Fun to watch. Checked results.
Mandated that it pass before submitting code
changes.
Break the build: feed the pig.

How it helped
Very simple test. Amazing difference.
Sometimes we got lazy and trusted it too much.
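
A pre-submit gate of this kind might look like the sketch below. It is an illustration only: the check names are hypothetical, and each check is assumed to be a function that returns True when one core feature works.

```python
from typing import Callable, Dict, List

# Each check exercises one core function and returns True on success.
Check = Callable[[], bool]


def run_sniff_test(checks: Dict[str, Check]) -> List[str]:
    """Run every check and return the names of those that failed."""
    failures = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a crash counts as a failure
        if not ok:
            failures.append(name)
    return failures


def may_submit(checks: Dict[str, Check]) -> bool:
    """Gate: a change may be submitted only if the sniff test is clean."""
    return not run_sniff_test(checks)
```

The gate only covers what the checks exercise, which is why trusting it too much can still let a break through.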

Doh!
25
Mainline must work: Stages to Sandboxing

Got it to build reliably.
Instituted Auto-Builds: email all on failure.
Used a Pumpkin to avoid duplicate merge-test
cycles, pulling partial submissions,...
Used a Pumpkin Queue when we really got rolling

How it helped
Far fewer thumbs twiddled.
The extra process got on some people's nerves.
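
One way to picture the Pumpkin Queue is the sketch below. It assumes the Pumpkin is a single token granting its holder the exclusive right to merge into mainline, with other developers waiting in first-come, first-served order; the talk does not spell out the mechanics, and the class and method names are hypothetical.

```python
from collections import deque
from typing import Deque, Optional


class PumpkinQueue:
    """A single merge token plus a first-come, first-served waiting list."""

    def __init__(self) -> None:
        self.holder: Optional[str] = None
        self.waiting: Deque[str] = deque()

    def request(self, developer: str) -> bool:
        """Ask for the pumpkin; return True if the developer now holds it."""
        if self.holder in (None, developer):
            self.holder = developer
            return True
        if developer not in self.waiting:
            self.waiting.append(developer)
        return False

    def release(self, developer: str) -> Optional[str]:
        """Give up the pumpkin and hand it to the next developer in line."""
        if developer != self.holder:
            raise ValueError(f"{developer} does not hold the pumpkin")
        self.holder = self.waiting.popleft() if self.waiting else None
        return self.holder
```

Since only the holder merges and tests against mainline, nobody repeats a merge-test cycle because someone else submitted in the middle of theirs.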

26
Mainline must work: Sandboxing

Finally, went to per-developer branching.
Develop on your own branch.
Submit changes to an integration engineer.
Full Smoke test run per submission/feature.
If it worked, it was integrated to mainline in
priority order; otherwise it was bounced.

How it helped
Mainline always runs. Pull any time.
Releases are not delayed by partial features.
No more code freezes going to release.
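
The integration step above can be sketched as a priority queue in front of mainline. This is a rough model under stated assumptions: the Submission type, its fields, and the smoke_test callback are hypothetical, and a lower priority number is assumed to integrate first.

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass(order=True)
class Submission:
    priority: int                       # lower number integrates first
    branch: str = field(compare=False)  # developer branch being submitted


def integrate(submissions: List[Submission],
              smoke_test: Callable[[str], bool]) -> Tuple[List[str], List[str]]:
    """Return (integrated, bounced) branch names, handled in priority order."""
    heap = list(submissions)
    heapq.heapify(heap)
    integrated: List[str] = []
    bounced: List[str] = []
    while heap:
        sub = heapq.heappop(heap)
        # Only branches that pass the full smoke test reach mainline.
        if smoke_test(sub.branch):
            integrated.append(sub.branch)
        else:
            bounced.append(sub.branch)
    return integrated, bounced
```

A bounced branch never touches mainline, so a half-finished feature delays only its own author.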

27
Support Structure
28
Background: Support Structure

Team size placed design constraints on supporting
tools
Automation: big win in big teams
Churn rate: tool accuracy / support cost
Types of tools
Data management: collection / correlation
Testing: controlled, synced, repeatable inputs
Baselines: my bug, your bug, or our bug?

29
Overview: Support Structure

Automated testing: designs to minimize impact of
churn rate
Automated data collection / correlation
Distributed system: distributed data
Dashboard / Esper / MonkeyWatcher
Use case: load testing
Controlled (tunable) inputs, observable results
ScaleBreak

30
Problem: Testing Accuracy

Load and regression inputs must be
Accurate
Repeatable
Churn rate: logic/data in constant motion
How to keep testing client accurate?
Solution: game client becomes test client
Exact mimicry
Lower maintenance costs

31
Test Client Game Client
[Diagram labels: Game GUI; Test Control; Commands; State; Presentation Layer; Client-Side Game Logic]
32
Game Client: How Much To Keep?
[Diagram: Game Client layers: View, Presentation Layer, Logic]
33
What Level To Test At?
[Diagram: Game Client layers (View, Presentation Layer, Logic), tested at the level of Mouse Clicks]
Regression: Too Brittle (pixel shift). Load: Too Bulky.
34
What Level To Test At?
[Diagram: Game Client layers (View, Presentation Layer, Logic), tested at the level of Internal Events]
Regression: Too Brittle (Churn Rate vs Logic and Data)
35
Semantic Abstractions
Basic gameplay changes less frequently than UI or
protocol implementations.
[Diagram: NullView Client layers: View, Presentation Layer, Logic]
36
Scriptable User Play Sessions

Test Scripts: Specific / ordered inputs
Single user play session
Multiple user play session
SimScript
Collection: Presentation Layer primitives
Synchronization: wait_until, remote_command
State probes: arbitrary game state
Avatar's body skill, lamp on/off, ...
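
A scripted play session built from these primitives might look roughly like the sketch below. It is not SimScript: only the name wait_until comes from the slide, while FakeClient, command, probe, play_session, and the command and state names are hypothetical stand-ins for a NullView client's presentation layer.

```python
from typing import Callable, Dict, List


class FakeClient:
    """Hypothetical stand-in for a NullView client's presentation layer."""

    def __init__(self) -> None:
        self.state: Dict[str, object] = {"lamp_on": False, "in_house": False}
        self.log: List[str] = []

    def command(self, name: str) -> None:
        """Presentation-layer primitive: issue one semantic game command."""
        self.log.append(name)
        if name == "enter_house":
            self.state["in_house"] = True
        elif name == "toggle_lamp":
            self.state["lamp_on"] = not self.state["lamp_on"]

    def probe(self, key: str) -> object:
        """State probe: read an arbitrary piece of game state."""
        return self.state[key]


def wait_until(condition: Callable[[], bool], max_polls: int = 100) -> bool:
    """Synchronization: poll until the condition holds or polls run out."""
    for _ in range(max_polls):
        if condition():
            return True
    return False


def play_session(client: FakeClient) -> bool:
    """A scripted single-user session that validates state after each step."""
    client.command("enter_house")
    if not wait_until(lambda: client.probe("in_house") is True):
        return False
    client.command("toggle_lamp")
    return wait_until(lambda: client.probe("lamp_on") is True)
```

The script names gameplay actions and state rather than mouse clicks or protocol messages, so it should survive UI and protocol churn. A real wait_until would also sleep between polls.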

37
Scriptable User Play Sessions

Scriptable play sessions: big win
Load: tunable based on actual play
Regression: walk a set of avatars through various
play sessions, validating correctness per step
Gameplay semantics: very stable
UI / protocols shifted constantly
Game play remained (about) the same

38
Automated Test: Team Baselines

Hourly critical path stability tests
Sync / clean / build / test
Validate Mainline / Servers
Sniff test weather report
Hourly testing
Constant reporting

39
How Automated Testing Helped

Current, accurate baseline for developers
Scalebreak found many bugs
Greatly increased stability
Code base was safe
Server health was known (and better)

40
Tools: Large Teams

High tool ROI
team_size * automation_savings
Faster triage
Quickly narrow down problem
across any system component
Monitoring tools became a focal point
Wiki: central doc repository

41
Monitoring / Diagnostics
"When you can measure what you are speaking about
and can express it in numbers, you know something
about it. But when you cannot measure it, when
you cannot express it in numbers, your knowledge
is of a meager and unsatisfactory kind." - Lord
Kelvin

DeMarco: "You cannot control what you cannot
measure."
Maxwell: "To measure is to know."
Pasteur: "A science is as mature as its
measurement tools."

42
Dashboard

System resource health tool
CPU / Memory / Disk / ...
Central point to access
Status
Test Results
Errors
Logs
Cores

43
Test Central / Monkey Watcher

Test Central UI
Control rig for developers and testers
Monkey Watcher
Collects and stores (distributed) test results
Produces summarized reports across tests
Filters known defects
Provides baseline of correctness
Web frontend, unique IDs per test
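
The "summarize across tests, filter known defects" step can be sketched as below. This is an assumption-laden illustration, not Monkey Watcher itself: the Result record, its fields, and the three outcome labels are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple, Set


class Result(NamedTuple):
    test_id: str      # unique ID per test run
    test_name: str
    error: str        # empty string means the test passed


def summarize(results: Iterable[Result],
              known_defects: Set[str]) -> Dict[str, Counter]:
    """Count passed / known / new outcomes per test name."""
    summary: Dict[str, Counter] = {}
    for r in results:
        if not r.error:
            outcome = "passed"
        elif r.error in known_defects:
            outcome = "known"   # already tracked, so not a fresh failure
        else:
            outcome = "new"
        summary.setdefault(r.test_name, Counter())[outcome] += 1
    return summary
```

Keeping known defects apart from new ones is what lets the report serve as a baseline: a non-zero "new" count means something changed.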

44
Esper

In-game profiler for a distributed system
Internal probes may be viewed
Per process / machine / cluster
Time view or summary view
Automated data management
Coders add one line probe
Esper data shows up on web site
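
A one-line probe API could be as small as the sketch below. The names probe and summary and the in-memory store are hypothetical; the talk only says that coders add a one-line probe and that the data appears on a web site.

```python
import time
from collections import defaultdict
from typing import DefaultDict, Dict, List, Tuple

# probe name -> list of (timestamp, value). A real system would ship these
# to a central collector rather than hold them in process memory.
_samples: DefaultDict[str, List[Tuple[float, float]]] = defaultdict(list)


def probe(name: str, value: float) -> None:
    """The one line a coder adds: record a named measurement."""
    _samples[name].append((time.time(), value))


def summary() -> Dict[str, Dict[str, float]]:
    """Summary view: count, mean, and max for every probe seen so far."""
    out: Dict[str, Dict[str, float]] = {}
    for name, points in _samples.items():
        values = [v for _, v in points]
        out[name] = {
            "count": len(values),
            "mean": sum(values) / len(values),
            "max": max(values),
        }
    return out
```

A call site would then be a single line such as probe("sim.tick_ms", elapsed_ms), where the probe name and variable are made up for this example.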

45
Use Case: Scale Break

Never too early to begin scaling
Idle: keep doubling server processes
Busy: double users, dataset size
Fix what broke, start again
Tune input scripts using Beta data
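
The doubling loop above can be sketched as follows, assuming a hypothetical runs_ok callback that starts the system at a given size (server processes, users, or dataset size) and reports whether it stayed healthy.

```python
from typing import Callable, Optional


def scale_break(runs_ok: Callable[[int], bool],
                start: int = 1,
                limit: int = 1_000_000) -> Optional[int]:
    """Keep doubling the load; return the first size that breaks, if any."""
    size = start
    while size <= limit:
        if not runs_ok(size):
            return size      # fix what broke, then start again
        size *= 2
    return None              # nothing broke up to the limit
```

Doubling reaches large sizes in few runs, so each fix-and-retry cycle stays short.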

46
Load Testing Data Flow
[Diagram labels: Load Testing Team; Load Control Rig; Test Driver CPUs, each running several Test Clients; Game Traffic; Server Cluster; Internal Probes; System Monitors; Resource Metrics; Client Metrics; Debugging Data]
47
Outline: Wrapup