Other Topics in Experiment Design: CS 239 Experimental Methodologies for System Software, Peter Reiher

Slides: 56

Provided by: PeterR92

Learn more at: https://lasr.cs.ucla.edu

Transcript and Presenter's Notes

1
Other Topics in Experiment Design
CS 239 Experimental Methodologies for System Software
Peter Reiher
May 17, 2007

2
Outline

Experiment order and randomization
Important traces
Useful models for experimentation

3
Randomization of Experimental Order

Uncontrollable parameters may vary during
experimentation
In non-random ways
Plotting error vs. experiment number detects this
But doesn't control it
Randomization controls the problem
The order effect becomes part of the error term
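
A minimal sketch of the error vs. experiment number check, in Python. The function name and the sample data are hypothetical. It assumes a one-factor experiment in which each run's error is its observation minus the mean of all runs at the same level, listed in the order the runs were performed.

```python
from collections import defaultdict

def errors_in_run_order(runs):
    """runs: list of (level, observation) pairs in the order they were run.
    Returns each run's residual from its level mean, in run order."""
    by_level = defaultdict(list)
    for level, obs in runs:
        by_level[level].append(obs)
    means = {level: sum(vals) / len(vals) for level, vals in by_level.items()}
    return [obs - means[level] for level, obs in runs]

# Hypothetical data: all A runs first, then B, then C.
runs = [("A", 10.0), ("A", 14.0), ("B", 20.0), ("B", 21.0),
        ("C", 30.0), ("C", 30.5)]
errors = errors_in_run_order(runs)
for number, error in enumerate(errors, start=1):
    print(number, error)
```

If the magnitude of the printed errors shrinks or grows steadily with the experiment number, something other than the factor is changing over time.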

4
An Example

Data from a sample one-factor experiment with
replications
Assumed order is all A levels first, then B
levels, etc.

5
What Does This Chart Tell Us?

Bigger errors for early replications of the
experiment
Eventually settling down to a narrow range
So maybe our A experiments observed some
different conditions than later experiments
Might get different results if A experiments were
run last, instead of first

6
Why Might This Kind of Thing Happen?

Consider measuring disk performance
Benchmark creates 1000 small files, 10 large
ones, writes them, then deletes them
File size is varied as experimental parameter
One run takes several hours
Other people use system daily
Disk fragmentation may increase over time,
changing results

7
Another Possible Reason

Cyclic effects
Something is happening on the computer every
hour/day/week
Experiments run while this thing is happening
behave differently
Ideally, should get rid of cyclic effect
But that's not always possible
There are many other similar reasons for this
kind of behavior

8
Another Reason

These kinds of effects are very common when you
run live tests
Also when you run raw traces
Of sufficient length and complexity to capture
them
Not a problem if all tests get same trace
But potentially a problem if you divide the trace
into pieces for different runs
That includes dividing traces for training
purposes

9
Complete Randomization

Plan experiment first
Levels of each parameter
Number of replications
List experiments by levels and replication number
Choose experiments from list randomly
Selection without replacement
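
The planning steps above can be sketched as follows, assuming Python. The factor names and levels are made up for illustration. Shuffling the full list is equivalent to repeatedly choosing the next experiment at random without replacement.

```python
import itertools
import random

def complete_randomization(factors, replications, seed=None):
    """factors: dict mapping factor name -> list of levels.
    Returns every (level combination, replication number) in random order."""
    level_lists = [factors[name] for name in factors]
    plan = [combo + (rep,)
            for combo in itertools.product(*level_lists)
            for rep in range(1, replications + 1)]
    rng = random.Random(seed)   # a fixed seed makes the order reproducible
    rng.shuffle(plan)           # random selection without replacement
    return plan

# Hypothetical experiment: two factors, three replications each.
factors = {"file_size_kb": [4, 64, 1024], "cache": ["on", "off"]}
for experiment in complete_randomization(factors, replications=3, seed=42):
    print(experiment)
```

Recording the seed along with the results lets you reconstruct the exact run order later.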

10
More Advanced Techniques

Complete randomization sometimes impossible
E.g., might need to install different hardware
for each level
Too much intervention to potentially change HW
after each run
Divide experiments into blocks
Randomize within each block
Not that helpful if only one factor
Block effect confounded with true effect

11
An Example

Testing DDoS defense boxes
Your experiment has three factors
Which of three boxes
Varying number of attack sites (3 levels)
Makeup of DDoS traffic (3 levels)
The boxes are hardware appliances

12
Why Is This Problematic?

Boxes need to be put in-line in testing
framework
Requiring someone to switch cables (at least)
With complete randomization, need to switch
cables on roughly two-thirds of experimental runs
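
A rough check of the two-thirds figure, sketched in Python. It assumes each box gets the same number of runs and that a cable switch is needed whenever two consecutive runs use different boxes. The function names are hypothetical.

```python
import random
from fractions import Fraction

def expected_switch_fraction(boxes, runs_per_box):
    """Exact expected fraction of consecutive run pairs that use different
    boxes when the whole run list is shuffled uniformly at random."""
    total = boxes * runs_per_box
    same_box = Fraction(runs_per_box - 1, total - 1)
    return 1 - same_box

def simulated_switch_fraction(boxes, runs_per_box, trials=2000, seed=0):
    """Monte Carlo estimate of the same quantity."""
    rng = random.Random(seed)
    order = [box for box in range(boxes) for _ in range(runs_per_box)]
    switches = 0
    for _ in range(trials):
        rng.shuffle(order)
        switches += sum(1 for a, b in zip(order, order[1:]) if a != b)
    return switches / (trials * (len(order) - 1))

# Three boxes, 3 x 3 = 9 other factor combinations per box, one replication.
print(expected_switch_fraction(3, 9))          # 9/13, a little over 2/3
print(simulated_switch_fraction(3, 9))
```

With more replications per box the exact value moves closer to 2/3.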

13
A Block Design for This Case

Set up blocks of experiments with single box
tested in each block
But multiple blocks for each box
E.g., all tests for box A with maximum number of
attack sites are in one block
Randomize order of block testing
Randomize within the block
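
One way to sketch this block design in Python. It assumes each block holds all traffic mixes for one (box, number of attack sites) pair. The level values and function names are illustrative only.

```python
import random

def blocked_order(boxes, site_levels, traffic_levels, seed=None):
    """Each block = one box at one number of attack sites.
    Block order is randomized, and so is run order inside each block."""
    rng = random.Random(seed)
    blocks = [(box, sites) for box in boxes for sites in site_levels]
    rng.shuffle(blocks)                      # randomize order of blocks
    order = []
    for box, sites in blocks:
        runs = [(box, sites, traffic) for traffic in traffic_levels]
        rng.shuffle(runs)                    # randomize within the block
        order.extend(runs)
    return order

def cable_switches(order):
    """Number of times the box changes between consecutive runs."""
    return sum(1 for a, b in zip(order, order[1:]) if a[0] != b[0])

# Hypothetical levels for the three-factor DDoS example.
order = blocked_order(["A", "B", "C"], [10, 100, 1000],
                      ["mix1", "mix2", "mix3"], seed=1)
print(cable_switches(order), "cable switches for", len(order), "runs")
```

With nine blocks there can be at most eight cable switches, compared with roughly two-thirds of the 27 runs under complete randomization.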

14
What Have We Gained?

Many fewer cable changes
But still less danger that unforeseen effects depending
on experiment order will cause problems than with no randomization
Haven't removed the problem, but have decreased it

15
Something To Keep In Mind

Experimenters tend to think of periodic or
startup effects as a nuisance
They are actually real phenomena
Possibly important phenomena
When designing experiments, think seriously about
whether you want to avoid these effects
Or, alternately, capture them
The latter requires careful thought

16
Traces

Traces are often an important part of a workload
Many kinds of traces are hard to gather for
yourself
In some cases, traces are publicly available
Sometimes you can use those

17
Some Useful Traces

NLANR packet header traces
CAIDA traces and data sets
U. of Oregon Routeviews traces
File system traces
Web traces
Crawdad wireless traces

18
NLANR Network Traces

Traces of Internet packet activities
Just packet headers
Variety of traces gathered at different places in
Internet
Of varying length
Useful if you want to generate realistic
internal Internet traffic
http://pma.nlanr.net//
NLANR is out of business; the archive is now run by CAIDA

19
CAIDA Traces and Data Sets

CAIDA is organization dedicated to measuring
Internet phenomena
They've gathered a bunch of useful data
Some of which they've made publicly available
Likely to be adding more over course of time
http://www.caida.org

20
Some CAIDA Datasets

Skitter topology data
Denial-of-service backscatter data
Internet worm activity data
Packet traces from OC12 and OC48 ISP points
DNS root server traffic activity

21
Skitter Data Sets

Skitter is CAIDA project to gather Internet
topology data
Skitter sends probe packets from many sites
around globe to Internet addresses
Gathers data based on responses
Data can be used to build map of current
topology/routing state of Internet

22
Denial of Service Backscatter Data

Typical DoS attacks result in victims sending
lots of response packets
If attack spoofed addresses, they go to random
sites
This is called backscatter
CAIDA watches backscatter and has made some
backscatter data available
Provides insight into DoS attack numbers, sizes,
targets, etc.

23
Internet Worm Activity

Worms spread to randomly chosen addresses
CAIDA has data on worm probe attempts to their
addresses
For Code Red and Witty
Some parts of data available to all
Others available on a restricted access basis
Useful for modeling worm activity

24
Routeviews Data

Gathered at University of Oregon
BGP updates and routing tables from several
participating ASes
From 2001 to date
Gathered every two hours, mostly
http://www.routeviews.org/

25
What Does Routeviews Data Show?

Full picture of routing from perspective of
particular points on Internet
Partial view of overall Internet topology and
routing
Data can be used to deduce lots of useful things

26
What Could Experimenters Use Routeviews Data For?

Generating Internet topology maps
Generating realistic BGP update traffic
Generating models of path lengths in Internet

27
File System Traces

Surprisingly few traces of significant amounts of
file system activity
But some are available
Many are old
More might become available in near future
Best place to start looking is SNIA IOTTA
repository
http://iotta.snia.org/

28
Some File System Traces

Seer trace
Gathered in my research group (1996/1997)
Real activity by real users
575 Mbytes
LASR trace
Also gathered in my group (2000/2001)
Real activity by real users
3.2 Gbytes
TraceFS data
16 minutes' worth of activity (2007)
Based on running benchmarks
58 Mbytes

29
Typical File System Trace Contents

Records of file system related system calls
Recorded every time file system was invoked
Indicates file accessed, type of access, time,
size, perhaps user and process
With significant anonymization

30
What Can You Do With File System Traces?

Replay them when testing file systems
Use them to build models of file system activity
Use them to generate profiles of what files in a
file system are actually used
One big weakness in most traces is they show only what
was accessed
No info about the rest of the file system's
contents
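
A minimal Python sketch of building a usage profile from a trace. The record layout below is hypothetical (real traces each have their own formats) and simply mirrors the fields listed earlier: file accessed, type of access, time, and size.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class TraceRecord:
    """Hypothetical record layout, one per traced file system call."""
    timestamp: float   # seconds since start of trace
    file_id: str       # anonymized file identifier
    operation: str     # e.g. "open", "read", "write"
    size: int          # bytes transferred, 0 if not applicable

def access_profile(records):
    """Returns (file_id, access count) pairs, most-used files first."""
    counts = Counter(record.file_id for record in records)
    return counts.most_common()

# Made-up trace fragment.
trace = [
    TraceRecord(0.0, "f1", "open", 0),
    TraceRecord(0.1, "f1", "read", 4096),
    TraceRecord(0.4, "f2", "write", 512),
    TraceRecord(0.9, "f1", "read", 4096),
]
for file_id, count in access_profile(trace):
    print(file_id, count)
```

Files that were never accessed during tracing do not appear in the profile at all, which is the weakness noted above.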

31
Other Interesting File System Traces

Cello traces
block level access to disk
Plan 9 traces
Possibly deceptive, due to unusual system model
of Plan 9
Seem to have disappeared from web
Werner Vogels' Windows traces
Also seem to have disappeared

32
Web Server Traces

Usually traces of HTTP requests made to some web
server
Suitably anonymized
Many available
But many are old
Web moves fast enough that it's not clear how
representative they are

33
Lawrence Berkeley Web Trace Repository

Various web traces kept at LBL
http://ita.ee.lbl.gov/
Some are quite extensive
E.g., 1.3 billion web requests for 1998 World Cup
site
None from after 2000

34
IRCache Traces

Weekly traces of a proxy cache
Latest currently available from January 2007
ftp://ftp.ircache.net/Traces/
Free for academic users
Commercial users have to pay

35
Web Caching Trace Site

Run by Brian D. Davison
http://www.web-caching.com/
Contains pointers to several web caches
Except IRCache, none newer than 1999
Many are pointers to same traces as LBL
But not all

36
Crawdad Wireless Traces

Crawdad is project to gather useful data on
wireless networks
Based at Dartmouth
http://crawdad.cs.dartmouth.edu/
Contains large quantity of data on various
wireless phenomena

37
The Dartmouth Wireless Traces

Maybe the best stuff in the Crawdad data archives
Dartmouth's campus has had complete wireless
coverage for several years
And all students have wireless-enabled computers
They've kept complete data on associations to
wireless access points for five full years
Still gathering and making data available

38
What Can You Do With Dartmouth's Data?

Lots of stuff
Traces of activity at wireless access points
Models of user mobility
Analysis of malware propagation via user movement
Models of typical patterns of user network access

39
Other Neat Stuff in Crawdad Repository

Other records of user mobility through wireless
networks
Data on Bluetooth activity in various
environments
Placelab data on use of wireless for localization
Link quality information for mesh networks
Ongoing data gathering project, so more will be
added

40
Useful Experimental Models

In many cases, we can't test in real conditions
Typically try to mimic real conditions by using
models
Workload models
Network topology models
Models of other experimental conditions
There are already useful models for many things
Often widely accepted as valid within certain
research communities
Might be better using them than trying to create
your own

41
Some Important Model Categories

Network topology models
Network traffic models

42
Network Topology Models

Many experiments nowadays investigate
network/distributed systems behavior
They need a realistic network to test the system
Usually embedded in testbed hardware
Where do you get that from?
In some cases, it's obvious or you have a map of
a suitable network
In other cases, more challenging

43
Some Challenging Cases

You need the Internet in the middle
You are investigating a large enterprise network
You are doing scalability testing that requires
networks of several sizes

44
Network Generation Models

The typical response to this problem
Run a program that generates a suitable network
Map the resulting network onto your available
hardware
Could be challenging, if you don't have enough
machines
Some generators create networks of specified size
But theoretically similar to whatever they're modeling

45
Network Topologies and Power Law Behavior

Much debate on whether the Internet (and other
computer networks) follow power law behavior
P(k) ∝ k^(-γ), for some constant γ > 0
Where P(k) is probability a node connects to k
other nodes
Generally some agreement that power law topology
generators do a better job than hierarchical models
Less agreement on how power law properties arise
in networks like Internet
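
A rough way to check a generated topology for power law behavior, sketched in Python. It computes the empirical P(k) from an edge list and then the least-squares slope of log P(k) against log k. A roughly straight log-log line suggests a power law, and its slope estimates minus the exponent. The function names are hypothetical, and a log-log fit is only a crude estimate.

```python
import math
from collections import Counter

def degree_distribution(edges):
    """edges: iterable of (u, v) pairs of an undirected graph.
    Returns {k: fraction of nodes with degree k}; isolated nodes are not seen."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    counts = Counter(degree.values())
    nodes = len(degree)
    return {k: c / nodes for k, c in sorted(counts.items())}

def loglog_slope(dist):
    """Least-squares slope of log P(k) vs. log k (needs >= 2 distinct k)."""
    xs = [math.log(k) for k in dist]
    ys = [math.log(p) for p in dist.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Tiny made-up graph: a star with three leaves.
star = [(0, 1), (0, 2), (0, 3)]
print(degree_distribution(star))
```

For a real check you would run this on a graph with thousands of nodes; the star graph only shows the input and output shapes.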

46
Some Popular Topology Generators

GT-ITM
BRITE
INET

47
GT-ITM

Supports various ways to randomly generate
network graphs
Including transit-stub model
Which doesn't produce power law graphs
Still, very widely used
http://www.cc.gatech.edu/projects/gtitm/

48
BRITE

Parameterizable network generation tool
Outputs its networks in NS-2 syntax
Places nodes randomly in a plane
Randomly selects some number of nodes to connect
to each new node
From a limited set of candidates
Some experiments suggest it produces graphs
matching power law behavior
Topology generator of choice for Emulab
http://www.cs.bu.edu/brite/
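
The general idea of incremental growth described above can be sketched in Python. This is not BRITE's code or its exact algorithm, only an illustration under simple assumptions: nodes are placed uniformly in a unit square, and each new node links to a few nodes picked uniformly from a limited random candidate set. All names and parameters are hypothetical.

```python
import random

def grow_topology(num_nodes, links_per_node=2, candidate_limit=5, seed=None):
    """Returns (positions, edges) for an incrementally grown random graph."""
    rng = random.Random(seed)
    positions = {}
    edges = []
    for node in range(num_nodes):
        # Place the new node randomly in the plane.
        positions[node] = (rng.random(), rng.random())
        existing = list(range(node))
        # Limit the set of nodes the new node may connect to.
        candidates = rng.sample(existing, min(candidate_limit, len(existing)))
        # Randomly select some of the candidates as neighbors.
        targets = rng.sample(candidates, min(links_per_node, len(candidates)))
        edges.extend((node, target) for target in targets)
    return positions, edges

positions, edges = grow_topology(20, seed=3)
print(len(positions), "nodes,", len(edges), "links")
```

Because every node after the first links to at least one earlier node, the result is always connected. Choosing targets uniformly, as here, is a simplification; the selection rule is what determines the resulting degree distribution.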

49
INET

Topology generator specifically intended to
produce Internet-like graphs
Much effort to match various network
characteristics
http://topology.eecs.umich.edu/inet/

50
A Different Approach

Map the real Internet accurately
Use that map for your topology
Rocketfuel project is one approach to this
mapping
http://www.cs.washington.edu/research/networking/rocketfuel/
Issue of producing small representative topology
you can actually test with remains

51
Network Traffic Models