Title: More on Data Presentation CS 239 Experimental Methodologies for System Software Peter Reiher May 24,
1 More on Data Presentation
CS 239: Experimental Methodologies for System Software
Peter Reiher
May 24, 2007
2 Outline
- Common graphics mistakes and games
- Special purpose graphs
3 Common Mistakes in Graphics
- Excess information
- Multiple scales
- Using symbols in place of text
- Poor scales
- Using lines incorrectly
4 Excess Information
- Sneaky trick to meet length limits
- Rules of thumb
- At most 6 curves on line chart
- At most 10 bars on bar chart
- At most 8 slices on pie chart
- But note that Tufte hates pie charts
- Extract essence, don't cram things in
5 Way Too Much Information
6 What's Important About That Chart?
- Times for cp and rcp rise with number of replicas
- Most other benchmarks are near constant
- Exactly constant for rm
7 The Right Amount of Information
8 Multiple Scales
- Another way to meet length limits
- Basically, two graphs overlaid on each other
- Confuses reader (which line goes with which scale?)
- Misstates relationships
- Implies equality of magnitude that doesn't exist
9 Some Especially Bad Multiple Scales
10 Using Symbols in Place of Text
- Graphics should be self-explanatory
- Remember that the graphs often draw the reader in
- So use explanatory text, not symbols
- This means no Greek letters!
- Unless your conference is in Athens...
11 It's All Greek To Me...
12 Explanation is Easy
13 Poor Scales
- Plotting programs love non-zero origins
- But people are used to zero
- Fiddle with axis ranges (and logarithms) to get your message across
- But don't lie or cheat
- Sometimes trimming off high ends makes things clearer
- Brings out low-end detail
14 Nonzero Origins (Chosen by Microsoft)
15 Proper Origins
16 A Poor Axis Range
17 A Logarithmic Range
- Shows all data on chart
- Minimizes differences among non-outliers
18 A Truncated Range
- Clarifies non-outlier distinctions
- Makes understanding outliers harder
19 Using Lines Incorrectly
- Don't connect points unless interpolation is meaningful
- Don't smooth lines that are based on samples
- Exception: fitted non-linear curves
20 Incorrect Line Usage
21 Pictorial Games
- Usually intentional attempts to use graphics to deceive
- Non-zero origins and broken scales
- Double-whammy graphs
- Omitting variation indices
- Scaling by height, not area
22 Non-Zero Origins and Broken Scales
- People expect (0,0) origins
- Subconsciously
- So non-zero origins are a great way to lie
- Common in popular press
- Also very common to cheat by omitting part of the scale
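A minimal sketch of why a non-zero origin misleads: the reader compares drawn bar heights, which are measured from the axis origin rather than from zero. The function name `apparent_ratio` and the sample values are hypothetical, chosen only for illustration.

```python
def apparent_ratio(a, b, axis_origin=0.0):
    """Ratio of the drawn heights of two bars when the axis starts at axis_origin."""
    if min(a, b) <= axis_origin:
        raise ValueError("axis origin must lie below both values")
    return (a - axis_origin) / (b - axis_origin)


# Hypothetical measurements: 105 vs. 100 units.
true_ratio = apparent_ratio(105, 100)        # axis starts at zero
drawn_ratio = apparent_ratio(105, 100, 95)   # axis starts at 95

print(f"ratio in the data:  {true_ratio:.2f}")
print(f"ratio on the chart: {drawn_ratio:.2f}")
```

With the axis starting at 95, a 5% difference in the data is drawn as one bar twice as tall as the other.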
23 Non-Zero Origins
24 The Three-Quarters Rule
- Highest point should be 3/4 of scale or more
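The rule can be checked mechanically before a chart is drawn. This is a rough sketch; the function name and the assumption that "scale" means the plotted axis range from `axis_min` to `axis_max` are mine, not the slide's.

```python
def satisfies_three_quarters(values, axis_min, axis_max):
    """True if the highest data point reaches at least 3/4 of the axis range."""
    if axis_max <= axis_min:
        raise ValueError("axis_max must exceed axis_min")
    highest = max(values)
    return (highest - axis_min) >= 0.75 * (axis_max - axis_min)


# Hypothetical data plotted on a 0..100 axis.
print(satisfies_three_quarters([10, 40, 80], 0, 100))  # highest point at 80% of the scale
print(satisfies_three_quarters([10, 40, 60], 0, 100))  # highest point at 60% of the scale
```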
25 Double-Whammy Graphs
- Put two related measures on same graph
- One is (almost) function of other
- Hits reader twice with same information
- And thus overstates impact
26 Omitting Variation Description
- Statistical data is inherently fuzzy
- But means/medians/modes appear precise
- Giving index of variation can make it clear there's no real difference
- So liars and fools leave them out
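One common index of variation is a confidence interval around the mean. The sketch below uses a normal approximation from the Python standard library; for small samples a Student-t quantile would be the more usual choice, so treat this as an approximation. The two sample lists are hypothetical measurements, and overlapping intervals are only a hint, not a formal test, that the difference may not be real.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def confidence_interval(samples, confidence=0.95):
    """Approximate confidence interval for the mean (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev(samples) / sqrt(len(samples))
    center = mean(samples)
    return (center - half_width, center + half_width)


def intervals_overlap(a, b):
    """True if two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical run times for two systems (same units).
system_a = [10, 12, 11, 13, 9, 12, 10, 11]
system_b = [11, 13, 12, 10, 12, 14, 11, 12]

ci_a = confidence_interval(system_a)
ci_b = confidence_interval(system_b)
print(f"A: mean {mean(system_a):.2f}, CI ({ci_a[0]:.2f}, {ci_a[1]:.2f})")
print(f"B: mean {mean(system_b):.2f}, CI ({ci_b[0]:.2f}, {ci_b[1]:.2f})")
print("intervals overlap:", intervals_overlap(ci_a, ci_b))
```

Plotting only the two means would suggest B is slower; showing the intervals makes it visible that the samples do not clearly separate the two systems.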
27 Graph Without Confidence Intervals
28 Graph With Confidence Intervals
29 Another Graph With Different Confidence Intervals
30 Scaling by Height Instead of Area
- Clip art is popular with illustrators
Women in the Workforce
31 The Trouble with Height Scaling
- Previous graph had heights in a 2:1 ratio
- But people perceive areas, not heights
- So areas should be what's proportional to data
- Tufte defines a lie factor: size of effect in graphic divided by size of effect in data
- Lie factor of 1.0 is the truth
- Anything far from 1.0 is that degree of a lie
- Not limited to area scaling
- But especially insidious there (quadratic effect)
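The lie factor can be sketched in a few lines. Here "size of effect" is taken to be the relative change between two values, which is the usual reading of Tufte's definition but is an assumption as far as this slide goes; the function names are hypothetical.

```python
def effect_size(first, second):
    """Relative change from first to second."""
    return abs(second - first) / abs(first)


def lie_factor(graphic_first, graphic_second, data_first, data_second):
    """Size of effect shown in the graphic divided by size of effect in the data."""
    return effect_size(graphic_first, graphic_second) / effect_size(data_first, data_second)


# Data doubles (1 -> 2). Clip art scaled by height doubles in width too,
# so the perceived area goes from 1 to 4.
height_scaled = lie_factor(1.0, 4.0, 1.0, 2.0)

# Clip art scaled so that the area itself doubles.
area_scaled = lie_factor(1.0, 2.0, 1.0, 2.0)

print(f"height-scaled lie factor: {height_scaled}")
print(f"area-scaled lie factor:   {area_scaled}")
```

Under this definition the height-scaled picture overstates the effect threefold, while the area-scaled picture has a lie factor of 1.0.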
32 Scaling by Area
- Here's the same graph with 2:1 area
Women in the Workforce
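To make area proportional to the data, each linear dimension of the picture is scaled by the square root of the data ratio. A small sketch, with a hypothetical function name:

```python
from math import sqrt


def linear_scale_for_area(data_ratio):
    """Factor to apply to both width and height so that area grows by data_ratio."""
    if data_ratio <= 0:
        raise ValueError("data_ratio must be positive")
    return sqrt(data_ratio)


scale = linear_scale_for_area(2.0)
print(f"scale width and height by {scale:.3f}")
print(f"resulting area ratio: {scale * scale:.3f}")
```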
33 Poor Histogram Cell Size
- Picking bucket size is always a problem
- Prefer 5 or more observations per bucket
- Choice of bucket size can affect results
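A quick way to see the effect of bucket size is to count observations per bucket at several widths and flag buckets with fewer than 5 observations. The data and helper names below are hypothetical; empty buckets are simply not listed.

```python
from collections import Counter


def bucket_counts(values, width, start=0.0):
    """Map each non-empty bucket's lower edge to its number of observations."""
    counts = Counter(int((v - start) // width) for v in values)
    return {start + i * width: counts[i] for i in sorted(counts)}


def sparse_buckets(counts, minimum=5):
    """Lower edges of buckets holding fewer than `minimum` observations."""
    return [edge for edge, count in counts.items() if count < minimum]


# Hypothetical observations.
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 7, 7, 8, 9, 11]

for width in (2, 6):
    counts = bucket_counts(values, width)
    print(f"width {width}: {counts}, sparse buckets: {sparse_buckets(counts)}")
```

With width 2 most buckets fall below the 5-observation guideline; with width 6 none do, but the histogram then has only two bars, so the shape of the distribution is largely hidden.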
34 Principles of Graphics Integrity (Tufte)
- Proportional representation of numbers
- Clear, detailed, thorough labeling
- Show data variation, not design variation
- Use deflated money units
- Don't have more dimensions than data has
- Don't quote data out of context
35 Proportional Representation of Numbers
- Maintain a lie factor of 1.0
- Use areas, not heights, with clip art
- Avoiding decorative graphs will do wonders
- This isn't too hard for most engineers
36 Clear, Detailed, Thorough Labeling
- Goal is to defeat distortion and ambiguity
- Write explanations on graphic itself
- Label important events in the data
37 Show Data Variation, Not Design Variation
- Use one design for the entire graphic
- In papers, try to use one design for all graphs
- Again, artistic license is the big culprit
38 Use Deflated Money Units
- Often necessary to show money over time
- Even in computer science
- E.g., price/performance over time
- Or expected future cost of a disk
- Nominal dollars are meaningless
- Derate by some standard inflation measure
- That's what the WWW is for!
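Derating nominal prices amounts to rescaling by a price index. The sketch below assumes some standard index is available; the index values and prices are hypothetical placeholders, not real figures, and the function name is mine.

```python
def deflate(nominal, index_then, index_base):
    """Express a nominal amount from one year in the money of the base year."""
    return nominal * index_base / index_then


# Hypothetical price index (NOT real data): year -> index value.
price_index = {2000: 100.0, 2007: 120.0}

# Hypothetical nominal disk prices in each year's own dollars.
nominal_price = {2000: 300.0, 2007: 150.0}

base_year = 2007
for year, price in nominal_price.items():
    real = deflate(price, price_index[year], price_index[base_year])
    print(f"{year}: nominal {price:.2f}, in {base_year} dollars {real:.2f}")
```

Plotting the deflated series rather than the nominal one keeps the chart from mixing a real price change with inflation.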
39 Might Need to Deflate Other Units
- Depending on what you're doing, might need to deflate other units
- E.g., transactions per second
- Don't deflate if the point of the differences is the change in that rate over time
- Must deflate if you're comparing other differences, like parallel vs. sequential
40 Don't Have More Dimensions Than Data Has
- This gets back to the Lie Factor
- 1-D data (e.g., money) should occupy one dimension on the graph, not more
- Clip art is prohibited by this rule
- But if you have to, use an area measure
41 Don't Quote Data Out of Context
42 The Same Data in Context
43 Special-Purpose Charts
- Tukey's box plot
- Histograms
- Scatter plots
- Gantt charts
- Kiviat graphs
44 Tukey's Box Plot
- Shows range, median, quartiles all in one
- Tufte can't resist improvements: [figure] or [figure] or even [figure]
- Not entirely clear to me if these really are better
[Figure: box plot labeled with minimum, lower quartile, median, upper quartile, maximum]
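The five values a box plot displays can be computed with the Python standard library. This is a sketch only: quartile conventions differ between tools, and the "inclusive" method used here is one choice among several, not necessarily the one Tukey's original hinges correspond to. The sample data is hypothetical.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


# Hypothetical observations.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
labels = ("minimum", "lower quartile", "median", "upper quartile", "maximum")
for label, value in zip(labels, five_number_summary(data)):
    print(f"{label}: {value}")
```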
45 Histograms
- Tufte suggests various improvements
- No y axis
- No grid
- Internal marker lines on bars
46 Scatter Plots
- Useful in statistical analysis
- Also excellent for huge quantities of data
- Can show patterns otherwise invisible
47 Better Scatter Plots
- Again, Tufte suggests improvements
- But it can be a pain with automated tools
- Better data-to-ink ratio
- Can use modified Tukey box plot for axes
48 Gantt Charts
- Shows relative duration of Boolean conditions
- Arranged to make lines continuous
- Each level after first follows FTTF pattern
49 Kiviat Graphs
- Also called star charts or radar plots
- Useful for looking at balance between HB (higher-is-better) and LB (lower-is-better) metrics
50 A Couple of Examples
51 A Very Bad Graph
52 A Good Graph: Sunspots
53 A Superb Graph: DEC Traces