Simics Accelerator: Virtualizing Large Systems
Transcript and Presenter's Notes

1
Simics Accelerator: Virtualizing Large Systems
  • Dr. Mikael Bergqvist, Senior Application Engineer
  • 2008-05-30

2
Topic
  • Speeding up the simulation of large target
    systems
  • Bring virtualized software development to the big
    stuff
  • Outline
  • Virtualized software development
  • Apologies if you attended the morning presentation
    on multicore debug; some parts will be repeated, but
    with two tracks we cannot be sure that everyone saw it.
  • Target and host system trends
  • Multithreading virtual hardware models
  • Leveraging redundant information with Page
    Sharing
  • Results

3
  • Virtualization for Software Developers

4
What is Virtual Hardware?
  • A piece of software
  • Running on a regular PC, server, or workstation
  • Functionally identical to a particular hardware system
  • Runs the same software as the physical hardware
    system

5
Virtutech Core Technology
  • Model any electronic system on a PC or
    workstation
  • Simics is a software program, no hardware
    required
  • Run the exact same software as the physical
    target (complete binary)
  • Run it fast (100s of MIPS)
  • Model any target system
  • Networks, SoCs, boards, ASICs, ... no limits
  • Here is where accelerator comes in
  • For the benefit of software developers and
    hardware providers
  • Enables process change in software development

[Diagram: software stack: user application code, middleware and libraries, target operating system(s), running on virtual target hardware.]
6
Why do we use Virtual Hardware?
  • Business Reasons
  • It hits the bottom line
  • Develop software before hardware becomes
    available
  • Shorten time-to-market
  • Decouple hardware and software development
  • Reduce software risk
  • Increase quality
  • Availability & Flexibility
  • Engineering Reasons
  • It is cool
  • Checkpoint restore
  • Virtual time
  • Precisely synchronized
  • Stopped at any point
  • Repeatability
  • Reverse execution
  • Configurable
  • Control
  • Change anything
  • Inspection power
  • See anything
  • No debug bandwidth limit

7
Value Proposition
[Diagram: value proposition: Replace, Accelerate, Optimize, and Enhance System Debug, tied to capital expenditure reduction, time to market, early software development, test and configuration, and cost of recall and system maintenance.]
8
Replace
  • Availability
  • Virtual system is software
  • Trivial to copy
  • Trivial to distribute
  • Cheaper than custom HW
  • Each engineer can have a custom hardware system
    at their desk
  • Scalability
  • No physical supply limit
  • Any number of any board
  • Any type of system in infinite supply at no
    cost
  • Old systems or new
  • A virtual system can be big or small by simple
    software (re)configuration

9
Accelerate
  • Virtual hardware created from the system
    specification
  • Model available much earlier than prototype
    hardware
  • Software development starts much earlier
  • Software available when hardware starts shipping
  • Shorter sales cycles, less product risk, shorter
    time-to-market

[Timeline: board design, board prototype production, virtual model production, hardware-dependent software development, application software development, and hardware/software integration and test.]
10
Optimize
  • Take advantage of the full power of virtualized
    software development and virtual hardware
  • Factor it into the project plan for a system
  • Observed effects
  • Software not blocked by hardware availability
  • Development schedules that start earlier and end
    earlier
  • Shorter development time for equivalent
    functionality
  • Shorter time to find and fix the really hard bugs
  • Fewer show-stoppers
  • More tested software
  • Improved hardware and hardware documentation
    quality
  • Very short time before software runs on first
    hardware

11
Optimized Debugging Power
  • Virtual hardware has very nice debugging and
    testing abilities

...
con0.wait-for-string ">"        # wait for the boot loader prompt
con0.input "bootm\n"
con0.wait-for-string "login"
con0.input "root\n"
...
break -x 0x0000 0x1F00          # execution breakpoint on an address range
break-io uart0                  # break on accesses to the uart0 device
break-exception int13           # break when exception int13 is raised
12
The Disk Corruption Example Bug
  • Distributed fault-tolerant file system got
    corrupted
  • Rack-based system with many boards
  • Intermittent error
  • Error seen as a composite state across multiple
    disks: they suddenly and intermittently became
    inconsistent
  • Months spent chasing it on physical hardware
  • Simics solution
  • Reproduce corruption in Simics model of target
  • Pinpoint the time when it happens by interval
    halving (sketched below)
  • Around the critical time, take periodic snapshots
    of disks
  • Check consistency of disk states in offline
    scripts
  • Result
  • Found the precise instruction causing the problem
  • Captured the network traffic pattern causing the
    issue
  • Communicated the complete setup and reproduction
    instructions to development, greatly facilitating
    fixing the bug
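A minimal sketch of the interval-halving idea in Python. The helpers restore_initial_checkpoint, run_until, dump_disks, and disks_consistent are hypothetical stand-ins for checkpoint handling and the offline consistency scripts, not actual Simics API calls; the approach leans on the deterministic re-execution described earlier.

def find_corruption_window(t_bad, restore_initial_checkpoint, run_until,
                           dump_disks, disks_consistent,
                           resolution=1_000_000):
    """Bisect virtual time in [0, t_bad] for the point where the
    distributed disk images first become inconsistent. Determinism
    makes the corruption reappear on every re-run."""
    t_good = 0
    while t_bad - t_good > resolution:
        t_mid = (t_good + t_bad) // 2
        restore_initial_checkpoint()        # rewind to the starting state
        run_until(t_mid)                    # simulate forward to the midpoint
        if disks_consistent(dump_disks()):  # offline check of the disk snapshots
            t_good = t_mid                  # corruption happens later
        else:
            t_bad = t_mid                   # corruption already present
    return t_good, t_bad                    # window containing the bug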

13
What Types of Systems Can Be Virtualized?
Complete Systems / Networks
  • Satellite constellation, telecom network
Racks of Boards / Backplanes
  • Telecom rack, avionics bay, blade server
  • This is where performance becomes an issue
Complete Boards
  • MPC8572DS board, ebony board, custom
SoC Devices
  • MPC8572E, PPC440GX, or CSSP ASIC
Devices / Buses
  • PCIe, RapidIO, I2C, Custom FPGA
Processor / Memory
  • e300, e500, 440, 970, 7450, Power6, ...

14
  • Technology Trends and Simics Accelerator

15
Trends
  • Target systems are getting more complex
  • Multiple boards
  • Multiple processors
  • Multicore SoCs
  • More and larger memories
  • Reduces perceived simulation performance as more
    work is needed per target time unit
  • Host hardware is parallel
  • Multicore processors
  • Multiple processors
  • Clusters of PCs
  • Multicore standard for desktop
  • 600 EUR for a 2-core PC
  • 3000 EUR gets 8-core server
  • Increases processing power for software which is
    parallel
  • NB: memory size is not increasing as quickly as
    the number of cores

16
Simics Accelerator
  • Launched with Simics 4.0 in April 2008
  • Contains a set of technologies for speeding up
    execution of large target systems in Simics
  • Tackle more complex target systems
  • Using multiple host processor cores
  • Taking advantage of redundancy in target system
  • Without impacting Simics determinism, control,
    synchronization, insight, and reverse execution

17
The Target Systems
  • Large, complex targets
  • Multiple boards
  • Multiple networks
  • 20-100 processors
  • Heterogeneous processors
  • Many gigabytes of memory
  • Almost overwhelming but not with Accelerator!
  • Brings a whole new level of systems into the
    bracket of conveniently fast
  • Typical target markets
  • Telecom network equipment (racks and clusters)
  • Military/aerospace racks
  • Datacenter blade enclosures
  • Distributed systems
  • Networked systems

18
  • Multithreading Simics

19
Not Trivial to do Right
20
Multithreading Simics Overview
[Diagram: single-threaded vs. multithreaded Simics on host workstations, comparing target simulation speed with total simulator work.]
21
Multithreading Simics Details
  • Simics 4.0 can utilize multiple host processors
    for simulation
  • The simulation is divided into cells (a toy sketch
    follows this list)
  • The cells can run concurrently in different
    threads
  • Objects in different cells can only communicate
    with each other through message passing (Simics
    links)
  • Processors that share memory or devices have to
    be in the same cell (currently)
  • Boards or machines that communicate over Ethernet
    and other networks can be in separate cells
  • Typically, one or a few boards/machines in a cell
  • Links connecting machines require some smarts
  • Orthogonal to other Simics features
  • Reuses target structure from earlier Simics
    versions
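A toy Python sketch of the cell partitioning described above: each cell owns its boards and runs in its own thread, and cells exchange data only through queued link messages. The classes and names are illustrative, not the Simics API.

import queue
import threading

class Cell:
    """One simulation cell: owns its boards, with an inbox for link traffic."""
    def __init__(self, name, boards):
        self.name = name
        self.boards = boards          # objects private to this cell
        self.inbox = queue.Queue()    # the only way in from other cells

    def send(self, other, packet):
        other.inbox.put((self.name, packet))   # message passing over a "link"

    def run_quantum(self, cycles):
        while not self.inbox.empty():          # deliver pending link traffic
            sender, packet = self.inbox.get()
            print(f"{self.name}: got {packet!r} from {sender}")
        # ... simulate `cycles` cycles for every board in this cell ...

# Boards that share memory must live in the same cell; boards that only
# talk over Ethernet can go in separate cells and run concurrently.
cell_a = Cell("cell_a", ["board0", "board1"])
cell_b = Cell("cell_b", ["board2"])
cell_a.send(cell_b, "ethernet frame")

threads = [threading.Thread(target=c.run_quantum, args=(10_000,))
           for c in (cell_a, cell_b)]
for t in threads:
    t.start()
for t in threads:
    t.join()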

22
Hierarchical Synchronization
Synchronize shared-memory machines tightly
  • Deterministic semantics
  • Regardless of host cores
  • Periodic synchronization between different cells
    and target machines
  • Puts a minimum latency on communication
    propagation
  • Synch interval determines simulation results, not
    number of execution threads in Simics
  • Latency within a cell
  • 1000-10000 cycles
  • Works well for SMP OS
  • Latency between cells
  • 10 to 1000 ms
  • Works well for latency-tolerant networks
  • Builds on current Simics experience in temporally
    decoupled simulation
  • This works well in practice (a minimal synchronization
    sketch follows the diagram)

[Diagram: short latency between machines with tight network coupling inside a single cell; longer latency on the link network between cells.]
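A minimal sketch, assuming ordinary thread barriers, of the synchronization scheme above: every cell simulates one sync interval of virtual time and then waits for all other cells before starting the next interval, so the interval length rather than the thread count determines the results. Names and numbers are illustrative only.

import threading

SYNC_INTERVAL_CYCLES = 100_000   # inter-cell sync interval (illustrative)
NUM_CELLS = 4

barrier = threading.Barrier(NUM_CELLS)

def cell_thread(cell_id, intervals):
    virtual_time = 0
    for _ in range(intervals):
        # ... run this cell's processors for SYNC_INTERVAL_CYCLES,
        #     synchronizing more tightly inside the cell ...
        virtual_time += SYNC_INTERVAL_CYCLES
        barrier.wait()           # no cell starts the next interval early

threads = [threading.Thread(target=cell_thread, args=(i, 10))
           for i in range(NUM_CELLS)]
for t in threads:
    t.start()
for t in threads:
    t.join()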
23
Scaling Out
  • Multithreading and distribution of the simulation
    can be combined to simulate extremely large
    systems
  • Make more cores and more host memory available
  • Takes Simics into the hundreds of nodes domain
  • Distribute at network links, just like cell
    boundaries

[Diagram: three Simics processes on separate host workstations, connected through link objects and a network switch.]
24
  • Leveraging Target Redundancy

25
Redundancy in Target Systems
  • Large systems are not built from all-unique
    components
  • Software repeats
  • Machines use the same OS, middleware,
    applications
  • Data repeats
  • Redundant databases
  • Data packets passed around in a cluster
  • Copies within machine
  • Code and data copied from disk to memory to be
    used
  • The simulator sees the whole system and can leverage
    repetition to reduce the memory footprint

[Diagram: the same RTOS, Linux, App A, DB, datasets, and packets recur across the machines in the system.]
26
Data Page Sharing Implementation
  • Simics memory images used for all data stores
    (flash, ram, rom, disks, etc.)
  • Standard Simics feature
  • Identical pages in different memory images stored
    in a single copy
  • Within machines
  • Between machines
  • Regardless of type of memory in the target
  • Copy-on-write semantics for safety (obviously)
  • Reduces memory footprint, increases data locality, and
    helps maintain performance (a sketch of the idea follows
    the diagram)

[Diagram: several simulated machines, each with CPUs, RAM, flash, and devices, share identical memory pages inside a single Simics process.]
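A small Python sketch of content-based page sharing with copy-on-write, as described above. The hashing scheme, page size, and class names are illustrative, not the actual Simics image implementation.

import hashlib

PAGE_SIZE = 4096
shared_pages = {}                 # content hash -> page data, stored once

class Image:
    """One target data store (RAM, flash, ROM, disk, ...)."""
    def __init__(self):
        self.pages = {}           # page number -> content hash

    def write_page(self, pageno, data):
        key = hashlib.sha256(data).hexdigest()
        shared_pages.setdefault(key, data)     # identical pages kept as one copy
        self.pages[pageno] = key

    def read_page(self, pageno):
        return shared_pages[self.pages[pageno]]

    def modify_byte(self, pageno, offset, value):
        page = bytearray(self.read_page(pageno))   # copy-on-write: take a
        page[offset] = value                       # private copy before writing
        self.write_page(pageno, bytes(page))

# Two machines booted from the same software share every page until one
# of them writes something different.
ram_a, ram_b = Image(), Image()
ram_a.write_page(0, bytes(PAGE_SIZE))
ram_b.write_page(0, bytes(PAGE_SIZE))
assert ram_a.pages[0] == ram_b.pages[0]            # a single stored copy
ram_b.modify_byte(0, 42, 0xFF)
assert ram_a.pages[0] != ram_b.pages[0]            # diverged after the write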
27
  • Simics Accelerator Results

28
Accelerator Scaling
  • Many times better scalability for virtual
    hardware
  • Brings virtualization to larger system setups
  • More boards and larger memories handled with same
    host
  • (No real effect on single-machine setups)
  • Better use of host hardware
  • Use all the cores in a workstation
  • Do not waste workstation memory
  • Same semantics everywhere: start on a small
    machine, move to a larger one for large
    simulations if needed
  • Overall, removes target system size as an
    obstacle for using virtualized software
    development

29
Single Point of Control
Eight machines simulated by two threads; inspect
any part of any machine from a single interface
30
Multithreading Performance Results
  • Performance effect of multithreading depends on
  • Target system characteristics
  • Software latency requirements
  • Target system load balance
  • Target system communications pattern
  • Synthetic experiments and lab experience
  • Single-thread performance not affected
  • Simics works just as well as before on a single
    core
  • No impact on idle loop simulation
  • Up to 10x Simics 3.2 performance
  • 8-core host, 64 target machines, no communication
  • Up to 6x scaling on 8-core host
  • Pretty respectable

31
Page Sharing Results
All results are for networks of machines booted
to prompt, with no applications loaded
Local unique data 4
Data repeated within the machine 20
Shared data across machines 96
Total data savings 65
Total data savings 20
Local unique data 1
Shared data across and within machines 98
Zero pages 90
Other shared 1
Total data savings 89
Total data savings 91
32
  • Questions?

33
Munich
34
  • Spares

35
Simulation Speed
  • Detail level determines speed
  • The more detail, the slower the simulation
  • You can run lots of software with low detail
    level
  • or not very much software with high detail level
  • But not lots of software with high detail level

36
Workload Sizes
37
Temporal Decoupling Speed Impact
  • Experimental data
  • 4 virtual PPC440 boards
  • Booting Linux
  • Which is a particularly hard workload, lots of
    device accesses
  • Execution quanta of 1, 10, 100, ..., 1,000,000
    cycles
  • Notable points
  • 10x performance increase from a 10-cycle to a
    1,000-cycle quantum
  • 30 from a 1,000-cycle to a 1,000,000-cycle quantum

38
Simics 4.0 Accelerator Performance
  • Running a single machine in a single thread is
    equal in performance to Simics 3.2
  • Setups with many machines are often faster than
    with 3.2
  • Multithreading makes it much easier to utilize
    multicore and multiprocessor host machines
  • Linear scaling seen for simple cases such as
    compute-intense workloads or boot with little
    communication
  • Variability of the workload limits performance
    (see next slide)
  • Performance reduced if low-latency communication
    is required
  • Page sharing is not yet optimized for performance
  • Current implementation saves memory without
    affecting performance (neither better nor worse
    than without page sharing)
  • Has the potential to improve performance

39
Multithreading Performance in Practice
  • Multiple boards in a single target system
  • Virtual time progress, with time quanta

[Diagram: one target system with three boards A, B, and C; virtual time advances in quanta A1, A2, A3, B1, B2, B3, C1, C2, C3.]
40
Execution on Single-Threaded Simics
[Diagram: virtual time progress with quanta A1..A3, B1..B3, C1..C3; serialized execution on single-threaded Simics lays the quanta out one after another in real time.]
The simulation of the three target machines is
interleaved on a single processor
The real time it takes to execute each time
quantum tends to vary with target hardware and
software characteristics
41
Execution on Multi-threaded Simics
[Diagram: virtual time progress with quanta; serialized execution on single-threaded Simics vs. parallel execution on multi-threaded Simics, where threads stall until all machines finish the current quantum.]
Each time quantum has to be finished on all
machines before progressing to the next quantum
Best case: all target time quanta take the
same time to simulate.
42
Execution on Multi-threaded Simics
[Diagram: with unequal per-quantum execution times, parallel threads stall waiting for the slowest machine in each quantum.]
Speed-up over single-threaded Simics will vary
over time, and is limited by load balance (a small
worked example follows)
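The load-balance limit can be made concrete with a small calculation: in each quantum the parallel run takes as long as the slowest machine, so the speed-up is the serialized total divided by the sum of the per-quantum maxima. The host-time numbers below are made up purely for illustration.

# Hypothetical host seconds needed to simulate each target time quantum.
quantum_cost = {
    "A": [1.0, 1.0, 1.0],
    "B": [0.5, 2.0, 0.5],
    "C": [0.5, 0.5, 2.0],
}

num_quanta = len(quantum_cost["A"])
serial_time = sum(sum(costs) for costs in quantum_cost.values())
parallel_time = sum(max(quantum_cost[m][q] for m in quantum_cost)
                    for q in range(num_quanta))   # wait for the slowest machine

print(f"serialized: {serial_time:.1f} s")                 # 9.0 s
print(f"parallel:   {parallel_time:.1f} s")               # 1.0 + 2.0 + 2.0 = 5.0 s
print(f"speed-up:   {serial_time / parallel_time:.2f}x")  # 1.80x, not 3.00x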
43
Simics Accelerator vs Simics Central
Simics Central coordinates a set of separate
Simics processes; Accelerator uses multiple threads
inside a single Simics instance.
[Diagram: several Simics processes coordinated by Simics Central on one host workstation vs. a single multithreaded Simics instance on another.]
  • Accelerator advantages
  • Easier to set up, control, and coordinate the
    simulation
  • Potentially more efficient use of host machine
    resources

44
Are Cores for Free?