Scaling The Software Development Process: Lessons Learned from The Sims Online (PowerPoint transcript)

1
Scaling The Software Development Process
Lessons Learned fromThe Sims Online
  • Greg Kearney, Larry Mellon, Darrin West
  • Spring 2003, GDC

2
Talk Overview
  • Covers Software Engineering techniques to help
    when projects get big
  • Code structure
  • Work processes (for programmers)
  • Testing
  • Does Not Cover
  • Game Design / Content Pipeline
  • Operations / Project Management

3
How to Apply It
  • We didn't do all of this right away
  • Improve what you can
  • Don't change too much at once
  • Prove that it works, and others will take up the
    cause
  • Iterate

4
Match Process to Scale
[Graph: Team Efficiency vs. Team Size]
5
What You Should Leave With
  • TSO Lessons Learned
  • Where we were with our software process
  • What we did about it
  • How it helped
  • Some Rules of Thumb
  • General practices that tend to smooth software
    development at scale
  • Not a blueprint for MMP development
  • Useful frame of reference

6
Classes of Lessons Learned & Rules
  • Architecture / Design: Keep It Simple
  • Minimizing dependencies, fatal couplings
  • Minimizing complexity, brittleness
  • Workspace Management: Keep It Clean
  • Code and directory structure
  • Check-in and integration strategies
  • Dev. Support Structure: Make It Easy, Prove It
  • Testing
  • Automation
  • All of these had to change as we scaled up.
  • They eventually exceeded the team's ability to
    cope (using existing tools and processes).

7
Non-Geek Analogy
  • Sharpen your tools.
  • Clean up your mess.
  • Measure twice, cut once.
  • Stay with your buddy.

Bad flashbacks found at http://www.easthamptonhigh.org/cernak/ and
http://www.hancock.k12.mi.us/high/art/wood/index.html
8
Key Factors Affecting Efficiency
  • High Churn Rate: a large number of coders times
    tightly coupled code equaled frequent breaks
  • Our code had a deep root system
  • And we had a forest of changes to make

Big root ball found at
http://www.on.ec.gc.ca/canwarn/norwich/norsummary-e.html
9
Make It Smaller
10
Key Factors Affecting Efficiency
  • Key Logs: some issues were preventing other
    issues from even being worked on

11
Key Factors Affecting Efficiency
  • A chain of single points of failure took out the
    entire team:
    Login → Create an avatar → Enter a city → Buy a
    house → Enter a house → Buy the chair → Sit on a
    chair
12
So, What Did We Do That Worked?
  • Switched to a logical architecture with less
    coupling
  • Switched to a code structure with fewer
    dependencies
  • Put in scaffolding to keep everyone working
  • Developed sophisticated configuration management
  • Instituted automated testing
  • Metrics, Metrics, Metrics

13
So, What Did We Do That Didn't?
  • Long-range milestone planning
  • Network emulator(s)
  • Over-engineered a few things (too general)
  • Some tasks failed due to:
  • Not replanning / reviewing long tasks
  • Not breaking up long tasks
  • Coding standard changed partway through

14
What We Were Faced With
  • 750K lines of legacy Windows code
  • Port it to Linux
  • Change from multiplayer to Client/Server
  • 18 months
  • Developers must remain alive after shipping
  • Continuous releases starting at Beta

15
Go To Final Architecture ASAP
16
Go to final architecture ASAP
[Diagram: evolving from a multiplayer mesh of Client Sims ('Here be
Sync Hell') to the final client/server architecture]
17
Final Architecture ASAP: Refactoring
  • Decomposed into multiple DLLs
  • Found the Simulator
  • Interfaces
  • Reference Counting
  • Client/Server subclassing
  • How it helped
  • Reduced coupling. Even reduced compile times!
  • Developers in different modules broke each other
    less often.
  • We went everywhere and learned the code base.
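
A minimal sketch of the pattern these bullets describe, with invented
names (not TSO's actual classes): a shared interface with reference
counting, subclassed for client and server.

    // Invented names; illustrates interfaces + ref counting + subclassing.
    #include <atomic>

    class ISimObject {                      // shared interface, own DLL
    public:
        void AddRef()  { ++refs_; }
        void Release() { if (--refs_ == 0) delete this; }
        virtual void Tick(float dt) = 0;
    protected:
        virtual ~ISimObject() = default;
    private:
        std::atomic<int> refs_{1};
    };

    class ClientSimObject : public ISimObject {    // client-side subclass
    public:
        void Tick(float) override { /* predictive, view-facing state */ }
    };

    class ServerSimObject : public ISimObject {    // server-side subclass
    public:
        void Tick(float) override { /* authoritative simulation */ }
    };

    int main() {
        ISimObject* sim = new ServerSimObject;  // each build picks its subclass
        sim->Tick(0.016f);
        sim->Release();                         // ref count frees it
    }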

18
Final Architecture ASAP: It Had to Always Run
  • But, clients would not behave predictably
  • We could not even play test
  • Game design was demoralized
  • We needed a bridge, now!

19
Final Architecture ASAP: Incremental Sync
  • A quick, temporary solution
  • Couldn't wait for the final system to be finished
  • High overhead; couldn't ship it
  • We took partial state snapshots on the server and
    restored to them on the client
  • How it helped
  • Could finally see the game as it would be.
  • Allowed parallel game design and coding
  • Bought time to lay in the right stuff.
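
A rough sketch of the partial-snapshot idea; the fields and the
dirty-bit scheme here are assumptions, not TSO's wire format.

    // Assumed scheme, not TSO's actual format: the server serializes only
    // fields marked dirty; the client overwrites its copy with what arrives.
    #include <cstdint>
    #include <cstring>
    #include <vector>

    template <class T> void Put(std::vector<uint8_t>& b, T v) {
        const uint8_t* p = reinterpret_cast<const uint8_t*>(&v);
        b.insert(b.end(), p, p + sizeof v);
    }
    template <class T> void Get(const uint8_t*& in, T& v) {
        std::memcpy(&v, in, sizeof v);
        in += sizeof v;
    }

    struct AvatarState {
        enum : uint8_t { POS = 1, MOTIVE = 2 };
        uint8_t dirty = 0;
        float x = 0, y = 0;
        int16_t comfort = 0;

        void Snapshot(std::vector<uint8_t>& out) {   // server side
            out.push_back(dirty);
            if (dirty & POS)    { Put(out, x); Put(out, y); }
            if (dirty & MOTIVE) { Put(out, comfort); }
            dirty = 0;
        }
        void Restore(const uint8_t*& in) {           // client side
            uint8_t d; Get(in, d);
            if (d & POS)    { Get(in, x); Get(in, y); }
            if (d & MOTIVE) { Get(in, comfort); }
        }
    };

    int main() {
        AvatarState server, client;
        server.x = 3; server.y = 4; server.dirty = AvatarState::POS;
        std::vector<uint8_t> snap;
        server.Snapshot(snap);            // only the changed fields are sent
        const uint8_t* in = snap.data();
        client.Restore(in);               // client state now matches
    }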

20
Final Architecture ASAP: Null View
  • Created Null View HouseSim on Windows
  • Same interface
  • Null (text output) implementation
  • How it helped
  • No #ifdefs!
  • Done under Windows, we could test this first
    step.
  • We knew it was working during the port.
  • Allowed us to port to Linux only the needed
    parts.
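
A minimal sketch of the null-view idea, assuming a hypothetical view
interface: the same interface on every platform, with a text-only
implementation standing in for graphics instead of #ifdefs.

    // Hypothetical interface names; illustrates the same-interface,
    // text-output implementation described above, with no #ifdefs.
    #include <cstdio>

    class IHouseView {                     // same interface on all platforms
    public:
        virtual ~IHouseView() = default;
        virtual void ShowAvatar(int id, float x, float y) = 0;
    };

    class NullHouseView : public IHouseView {  // text output, links anywhere
    public:
        void ShowAvatar(int id, float x, float y) override {
            std::printf("avatar %d at (%.1f, %.1f)\n", id, x, y);
        }
    };
    // A Windows build would also provide a graphical IHouseView; the
    // Linux port links only what it needs.

    int main() {
        NullHouseView view;
        view.ShowAvatar(7, 1.0f, 2.5f);
    }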

21
Final Architecture ASAP: More Bridges
  • HSBs proxy on Linux, passing through to a Windows
    Sim.
  • Disabled authentication, etc.
  • How it helped
  • Could exercise Linux components before finishing
    the HouseSim port.
  • Allowed us to debug server scale, performance and
    stability issues early.
  • Made best use of Windows developers.
  • Allowed single-platform development. Faster
    compiles.
  • Could keep working even when some of the system
    wasn't available.

22
Mainline Must Work!
23
If Mainline Doesn't Work, Nobody Works
  • The Mainline source control branch must run
  • Never go dark: demo / play test every day
  • If you hit a bug, do you sync to mainline, hoping
    someone else fixed it? Or did you just add it?
  • If mainline breaks for only an hour, the
    project loses a man-week.
  • If each developer breaks the mainline only once
    a month, it is broken every day.

24
Mainline Must Work: Sniff Test
  • Mainline was breaking for simple things.
  • Features you didn't touch (and didn't test).
  • Created an auto-test to exercise all core
    functions.
  • Quick to run. Fun to watch. Checked results.
  • Mandated that it pass before submitting code
    changes.
  • Break the build, feed the pig.
  • How it helped
  • Very simple test. Amazing difference.
  • Sometimes we got lazy and trusted it too much.

Doh!
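
A toy harness in the spirit of the sniff test (the steps and checks
are invented stand-ins): exercise the core functions in order; any
failure blocks the submit.

    // Toy harness; steps are invented. Runs the critical path in order.
    #include <cstdio>

    struct Step { const char* name; bool (*run)(); };

    static bool Login()        { return true; }  // would drive the client
    static bool CreateAvatar() { return true; }  // and check the results
    static bool BuyChair()     { return true; }

    int main() {
        const Step steps[] = {{"login", Login},
                              {"create avatar", CreateAvatar},
                              {"buy chair", BuyChair}};
        for (const Step& s : steps) {
            std::printf("sniff: %s ...", s.name);
            if (!s.run()) { std::printf(" FAIL\n"); return 1; }  // feed the pig
            std::printf(" ok\n");
        }
        return 0;  // green: safe to submit
    }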
25
Mainline Must Work: Stages to Sandboxing
  1. Got it to build reliably.
  2. Instituted auto-builds: email everyone on failure.
  3. Used a Pumpkin to avoid duplicate merge-test
    cycles, pulling partial submissions,...
  4. Used a Pumpkin Queue when we really got rolling
  • How it helped
  • Far fewer thumbs twiddled.
  • The extra process got on some people's nerves.

26
Mainline Must Work: Sandboxing
  • Finally, went to per-developer branching.
  • Develop on your own branch.
  • Submit changes to an integration engineer.
  • Full smoke test run per submission/feature.
  • If it works, it is integrated to mainline in
    priority order; otherwise it is bounced.
  • How it helped
  • Mainline always runs. Pull any time.
  • Releases are not delayed by partial features.
  • No more code freezes going to release.

27
Support Structure
28
Background: Support Structure
  • Team size placed design constraints on supporting
    tools
  • Automation: big win in big teams
  • Churn rate: tool accuracy / support cost
  • Types of tools
  • Data management: collection / correlation
  • Testing: controlled, synced, repeatable inputs
  • Baselines: my bug, your bug, or our bug?

29
Overview: Support Structure
  • Automated testing: designs to minimize the impact
    of churn rate
  • Automated data collection / correlation
  • Distributed system: distributed data
  • Dashboard / Esper / MonkeyWatcher
  • Use-case load testing
  • Controlled (tunable) inputs, observable results
  • ScaleBreak

30
Problem: Testing Accuracy
  • Load / Regression inputs must be:
  • Accurate
  • Repeatable
  • Churn rate: logic/data in constant motion
  • How to keep the testing client accurate?
  • Solution: game client becomes test client
  • Exact mimicry
  • Lower maintenance costs
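
A sketch of that idea with invented names: the GUI and the test
controller issue identical commands through the presentation layer,
so test clients exercise the real client-side game logic.

    // Invented names: the GUI and the test controller issue identical
    // presentation-layer commands, so tests exercise real client logic.
    #include <cstdio>
    #include <string>

    class PresentationLayer {                  // boundary into game logic
    public:
        void Do(const std::string& cmd) {
            std::printf("command: %s\n", cmd.c_str());  // route into logic
        }
    };

    void FromGui(PresentationLayer& p)    { p.Do("buy_chair"); }  // mouse path
    void FromScript(PresentationLayer& p) { p.Do("buy_chair"); }  // test path

    int main() {
        PresentationLayer p;
        FromGui(p);      // the shipping client...
        FromScript(p);   // ...and the test client share one code path
    }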

31
Test Client = Game Client
[Diagram: the Game GUI and the Test Control both exchange state and
commands with the Presentation Layer, which drives the Client-Side
Game Logic]
32
Game Client: How Much To Keep?
[Diagram: the Game Client split into View / Presentation Layer / Logic]
33
What Level To Test At?
Game Client
[Diagram: mouse clicks drive the Game Client at the View layer]
Regression: too brittle (pixel shift). Load: too bulky.
34
What Level To Test At?
Game Client
[Diagram: internal events drive the Game Client between the View and
the Presentation Layer]
Regression: too brittle (churn rate vs. logic data).
35
Semantic Abstractions
Basic gameplay changes less frequently than UI or
protocol implementations.
[Diagram: the NullView Client drops the View, keeping the Presentation
Layer and Logic; the ¾ / ¼ labels mark the split]
36
Scriptable User Play Sessions
  • Test Scripts: specific / ordered inputs
  • Single-user play session
  • Multiple-user play session
  • SimScript
  • A collection of Presentation Layer primitives
  • Synchronization: wait_until, remote_command
  • State probes: arbitrary game state
  • Avatar's body skill, lamp on/off, ... (see the
    sketch below)
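
A hedged sketch of how a synchronization primitive like wait_until
might be built on polled state probes (the signature is an assumption,
not TSO's actual API).

    // Assumed signatures, not TSO's actual API: wait_until polls a game
    // state probe until it is satisfied or the script step times out.
    #include <chrono>
    #include <functional>
    #include <thread>

    bool wait_until(const std::function<bool()>& probe,
                    std::chrono::seconds timeout) {
        auto deadline = std::chrono::steady_clock::now() + timeout;
        while (std::chrono::steady_clock::now() < deadline) {
            if (probe()) return true;          // probe satisfied, step passes
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
        }
        return false;                          // step fails; report it
    }
    // remote_command("avatar2", "sit_on_chair") would similarly forward a
    // command to another scripted client to keep multi-user runs in step.

    int main() {
        bool lampOn = true;                    // stands in for real game state
        return wait_until([&] { return lampOn; },
                          std::chrono::seconds(5)) ? 0 : 1;
    }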

37
Scriptable User Play Sessions
  • Scriptable play sessions: big win
  • Load: tunable based on actual play
  • Regression: walk a set of avatars through various
    play sessions, validating correctness per step
  • Gameplay semantics: very stable
  • UI / protocols shifted constantly
  • Game play remained (about) the same

38
Automated Test Team Baselines
  • Hourly critical path stability tests
  • Sync / clean / build / test
  • Validate Mainline / Servers
  • Snifftest 'weather report'
  • Hourly testing
  • Constant reporting

39
How Automated Testing Helped
  • Current, accurate baseline for developers
  • ScaleBreak found many bugs
  • Greatly increased stability
  • Code base was safe
  • Server health was known (and better)

40
Tools & Large Teams
  • High tool ROI
  • team_size × automation_savings
  • Faster triage
  • Quickly narrow down a problem
  • across any system component
  • Monitoring tools became a focal point
  • Wiki: central doc repository

41
Monitoring / Diagnostics
"When you can measure what you are speaking about
and can express it in numbers, you know something
about it. But when you cannot measure it, when
you cannot express it in numbers, your knowledge
is of a meager and unsatisfactory kind." - Lord
Kelvin
  • DeMarco: "You cannot control what you cannot
    measure."
  • Maxwell: "To measure is to know."
  • Pasteur: "A science is as mature as its
    measurement tools."

42
Dashboard
  • System resource health tool
  • CPU / Memory / Disk / ...
  • Central point to access:
  • Status
  • Test Results
  • Errors
  • Logs
  • Cores

43
Test Central / Monkey Watcher
  • Test Central UI
  • Control rig for developers and testers
  • Monkey Watcher
  • Collects and stores (distributed) test results
  • Produces summarized reports across tests
  • Filters known defects
  • Provides baseline of correctness
  • Web frontend, unique IDs per test

44
Esper
  • In-game profiler for a distributed system
  • Internal probes may be viewed
  • Per process / machine / cluster
  • Time view or summary view
  • Automated data management
  • Coders add a one-line probe
  • Esper data shows up on a web site
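
One plausible shape for that one-line probe (an assumption, not
Esper's real API): a named counter registered in game code and flushed
to the reporting site by a collector.

    // An assumed shape, not Esper's real API: one line in game code
    // registers a named counter that a collector ships to the web site.
    #include <cstdint>
    #include <map>
    #include <string>

    namespace esper {
        inline std::map<std::string, int64_t>& Counters() {
            static std::map<std::string, int64_t> c;
            return c;
        }
        inline void Probe(const std::string& name, int64_t v) {
            Counters()[name] += v;   // flushed periodically to the dashboard
        }
    }

    void OnChairBought() {
        esper::Probe("sim.chairs_bought", 1);   // the one line coders add
    }

    int main() { OnChairBought(); }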

45
Use Case: ScaleBreak
  • Never too early to begin scaling
  • Idle: keep doubling server processes
  • Busy: double users and dataset size
  • Fix what broke, start again
  • Tune input scripts using Beta data
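
The doubling strategy as a bare driver loop (a pure sketch;
SpawnClients, ClusterHealthy and the cutoff are invented stand-ins for
the real rig).

    // Pure sketch; SpawnClients, ClusterHealthy and the 6400 cap are
    // invented stand-ins for the real load rig.
    #include <cstdio>

    static bool SpawnClients(int n) {
        std::printf("spawning %d scripted clients\n", n);
        return n < 6400;                       // pretend it breaks here
    }
    static bool ClusterHealthy() { return true; }  // watch probes, errors

    int main() {
        for (int users = 100; ; users *= 2) {  // keep doubling
            if (!SpawnClients(users) || !ClusterHealthy())
                break;                         // found the break point
        }
        // fix what broke, tune scripts from Beta data, start again
        return 0;
    }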

46
Load Testing Data Flow
[Diagram: the Load Testing Team's Load Control Rig drives banks of Test
Clients spread across Test Driver CPUs; game traffic flows into the
Server Cluster, whose internal probes and monitors return resource,
debugging and metrics data to the team]
47
Outline: Wrap-Up
  • Wins / Losses
  • Rules: analysis & discussion
  • Recommended reading
  • Questions

48
Process Wins / Losses
  • Wins
  • Module decomposition
  • Logical client / server architecture
  • Physical code structure
  • Scaffolding for parallel development
  • Tools to improve workflow
  • Automated Regression / Load

49
Process Wins / Losses
  • Losses
  • Early lack of tools
  • #ifdef as a cross-platform port
  • Single points of failure blocked entire
    development team

50
Not Done Yet: More Challenges
  • How to ship, and ship, and ship
  • How to balance infrastructure cleanup against new
    feature development

51
Rules of Thumb (1)
  • KISS: software and processes
  • Incremental changes
  • <Inhale> <Hold It> <Exhale>
  • <Say> Baby-Steps
  • Continual tool/process improvement

52
Rules of Thumb (2)
  • Mainline has got to work
  • Get something on the ground. Quickly.

53
Rules of Thumb (3)
  • Key Logs: break up quickly, ruthlessly
  • Scaffolding: keep others working
  • Do important things, not urgent things
  • Module separation (logically, physically)
  • If you can't measure it, you don't understand it

54
Final Rule: Sharpen The Saw
  • Efficiency impacted by:
  • Component coupling / team size
  • Compile / load / test / analyze cycle
  • Tool justification in large teams
  • Large ROI at large scale
  • A 5% gain across 30 programmers
  • Fred Brooks' '31st programmer': a tooling gain
    adds capacity without adding communication
    overhead

55
Recommended Reading
  • Influences
  • Extreme Programming
  • Scott Meyers: large-scale software engineering
  • Gamma et al.: Design Patterns
  • Caveat emptor: slavish following not encouraged
  • Consider ground conditions for your project

56
Questions & Answers