1
Improving Database Performance on Simultaneous Multithreading Processors
  • Jingren Zhou, Microsoft Research, jrzhou@microsoft.com
  • John Cieslewicz, Columbia University, johnc@cs.columbia.edu
  • Kenneth A. Ross, Columbia University, kar@cs.columbia.edu
  • Mihir Shah, Columbia University, ms2604@columbia.edu
2
Simultaneous Multithreading (SMT)
  • Available on modern CPUs
  • Hyperthreading on Pentium 4 and Xeon.
  • IBM POWER5
  • Sun UltraSparc IV
  • Challenge: Design software to efficiently utilize SMT.
  • This talk: Database software

Intel Pentium 4 with Hyperthreading
3
Superscalar Processor (no SMT)
[Diagram: one instruction stream issuing over time; CPI = 3/4]
Superscalar pipeline (up to 2 instructions/cycle)
  • Improved instruction level parallelism

4
SMT Processor
[Diagram: two interleaved instruction streams over time; CPI = 5/8?]
  • Improved thread level parallelism
  • More opportunities to keep the processor busy
  • But sometimes SMT does not work so well.

5
Stalls
[Diagram: two instruction streams over time; one stalls while the other continues. CPI = 3/4? Progress despite stalled thread.]
Stalls due to cache misses (200-300 cycles for L2
cache), branch mispredictions (20-30 cycles), etc.
6
Memory Consistency
[Diagram: two instruction streams over time; a conflicting access to a common cache line is detected, forcing a pipeline flush and a cache sync with RAM.]
MOMC (Memory Order Machine Clear) event on the Pentium 4 (300-350 cycles).
7
SMT Processor
  • Exposes multiple logical CPUs (one per
    instruction stream)
  • One physical CPU (5% extra silicon to duplicate thread state information)
  • Better than single threading
  • Increased thread-level parallelism
  • Improved processor utilization when one thread
    blocks
  • Not as good as two physical CPUs
  • CPU resources are shared, not replicated

8
SMT Challenges
  • Resource Competition
  • Shared Execution Units
  • Shared Cache
  • Thread Coordination
  • Locking, etc. has high overhead
  • False Sharing
  • MOMC Events

9
Approaches to using SMT
  • Ignore it, and write single threaded code.
  • Naïve parallelism
  • Pretend the logical CPUs are physical CPUs
  • SMT-aware parallelism
  • Parallel threads designed to avoid SMT-related
    interference
  • Use one thread for the algorithm, and another to
    manage resources
  • E.g., to avoid stalls for cache misses

10
Naïve Parallelism
  • Treat SMT processor as if it is multi-core
  • Databases already designed to utilize multiple
    processors - no code modification
  • Uses shared processor resources inefficiently
  • Cache Pollution / Interference
  • Competition for execution units

11
SMT-Aware Parallelism
  • Exploit intra-operator parallelism
  • Divide input and use a separate thread to process
    each part
  • E.g., one thread for even tuples, one for odd
    tuples.
  • Explicit partitioning step not required.
  • Sharing input involves multiple readers
  • No MOMC events, because two reads don't conflict

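A minimal sketch of this even/odd split in C++; the Tuple layout, the process() placeholder, and the use of std::thread are illustrative assumptions, not the paper's code:

```cpp
#include <cstddef>
#include <cstdio>
#include <thread>
#include <vector>

struct Tuple { int key; int payload; };        // hypothetical tuple layout

// Stand-in for the operator's per-tuple work.
static long process(const Tuple& t) { return (t.key % 3 == 0) ? t.payload : 0; }

// Thread 0 handles even input positions, thread 1 handles odd positions.
// No partitioning step is needed, and both threads only read the shared
// input, so its cache lines see no conflicting writes.
int main() {
    std::vector<Tuple> input(1000);
    for (std::size_t i = 0; i < input.size(); ++i)
        input[i] = { static_cast<int>(i), 1 };

    long results[2] = { 0, 0 };
    auto worker = [&](std::size_t start) {
        long local = 0;                        // accumulate privately, write once
        for (std::size_t i = start; i < input.size(); i += 2)
            local += process(input[i]);
        results[start] = local;
    };
    std::thread even(worker, 0), odd(worker, 1);
    even.join();
    odd.join();
    std::printf("matched payload total: %ld\n", results[0] + results[1]);
    return 0;
}
```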
12
SMT-Aware Parallelism (cont.)
  • Sharing output is challenging
  • Thread coordination for output
  • read/write and write/write conflicts on common
    cache lines (MOMC Events)
  • Solution: Partition the output
  • Each thread writes to separate memory buffer to
    avoid memory conflicts
  • Need an extra merge step in the consumer of the
    output stream
  • Difficult to maintain input order in the output

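A sketch of the partitioned-output variant under the same illustrative assumptions (hypothetical Tuple type and a stand-in selection predicate); each thread fills a private buffer and the consumer pays for one merge step:

```cpp
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

struct Tuple { int key; int payload; };            // hypothetical layout

// Each thread appends matches to its own buffer, so concurrent writes never
// touch a common cache line; the consumer performs one extra merge step,
// and the original input order is not preserved.
std::vector<Tuple> probe_partitioned_output(const std::vector<Tuple>& input) {
    std::vector<Tuple> out[2];
    auto worker = [&](std::size_t start) {
        for (std::size_t i = start; i < input.size(); i += 2)
            if (input[i].key % 5 == 0)             // stand-in selection predicate
                out[start].push_back(input[i]);
    };
    std::thread t0(worker, 0), t1(worker, 1);
    t0.join();
    t1.join();

    // Merge step pushed into the consumer of the output stream.
    std::vector<Tuple> merged = std::move(out[0]);
    merged.insert(merged.end(), out[1].begin(), out[1].end());
    return merged;
}
```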
13
Managing Resources for SMT
  • Cache misses are a well-known performance
    bottleneck for modern database systems
  • Mainly L2 data cache misses, but also L1 instruction cache misses [Ailamaki et al. 98].
  • Goal: Use a helper thread to avoid cache misses in the main thread
  • load future memory references into the cache
  • explicit load, not a prefetch

14
Data Dependency
  • Memory references that depend upon a previous
    memory access exhibit a data dependency
  • E.g., hash table lookup

[Diagram: hash table lookup; the tuple is reached through a chain of dependent pointer accesses.]
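To make the dependency concrete, here is a hypothetical chained hash lookup in C++; each pointer must be loaded before the next access can even be issued, so the misses cannot overlap (all names are illustrative):

```cpp
#include <cstddef>

// Hypothetical chained hash bucket: each overflow cell is reached through a
// pointer loaded from the previous cell, so every memory reference depends
// on the one before it.
struct Cell {
    int   key;
    long  payload;
    Cell* next;                                    // overflow chain
};

long lookup(Cell* const* buckets, std::size_t nbuckets, int key) {
    Cell* c = buckets[static_cast<unsigned>(key) % nbuckets];  // first dependent load
    while (c != nullptr) {
        if (c->key == key)
            return c->payload;
        c = c->next;                               // next load waits on this one
    }
    return -1;                                     // hypothetical "not found" value
}
```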
15
Data Dependency (cont.)
  • Data dependencies make instruction level
    parallelism harder
  • Modern architectures provide prefetch
    instructions.
  • Request that data be brought into the cache
  • Non-blocking
  • Pitfalls
  • Prefetch instructions are frequently dropped
  • Difficult to tune
  • Too much prefetching can pollute the cache

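For illustration only, a sketch of prefetching in an index-driven loop using the GCC/Clang __builtin_prefetch builtin; the lookahead distance is an assumed value that would have to be tuned, and the hardware may still drop the request:

```cpp
#include <cstddef>
#include <vector>

// Issue a non-binding prefetch a fixed distance ahead of the current access.
long sum_with_prefetch(const std::vector<long>& data,
                       const std::vector<std::size_t>& index) {
    const std::size_t kAhead = 8;                  // assumed prefetch distance
    long sum = 0;
    for (std::size_t i = 0; i < index.size(); ++i) {
        if (i + kAhead < index.size())
            __builtin_prefetch(&data[index[i + kAhead]], /*rw=*/0, /*locality=*/1);
        sum += data[index[i]];                     // the actual (indirect) access
    }
    return sum;
}
```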
16
Staging Computation
  • Preload A.
  • (other work)
  • Process A.
  • Preload B.
  • (other work)
  • Process B.
  • Preload C.
  • (other work)
  • Process C.
  • Preload Tuple.
  • (other work)
  • Process Tuple.

[Diagram: hash buckets, overflow cells, and the tuple reached from them.]
(Assumes each element is a cache line.)
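A two-stage sketch of this preload/process interleaving, assuming a simplified single-cache-line Bucket and using a compiler prefetch builtin as the preload; the staged operators in the papers cited on the next slide stage every pointer hop, while this sketch shows only the first:

```cpp
#include <cstddef>
#include <vector>

struct Bucket { int key; long payload; };          // hypothetical one-line bucket

// While the bucket for probe i is processed, the bucket for probe i+1 is
// preloaded, so its miss latency overlaps with useful work.
long staged_probe(const std::vector<Bucket>& table, const std::vector<int>& keys) {
    if (keys.empty() || table.empty())
        return 0;
    long matches = 0;
    std::size_t next = static_cast<unsigned>(keys[0]) % table.size();
    __builtin_prefetch(&table[next]);              // Preload A
    for (std::size_t i = 0; i < keys.size(); ++i) {
        std::size_t cur = next;                    // A becomes the current element
        if (i + 1 < keys.size()) {
            next = static_cast<unsigned>(keys[i + 1]) % table.size();
            __builtin_prefetch(&table[next]);      // Preload B while A is processed
        }
        if (table[cur].key == keys[i])             // Process A
            matches += table[cur].payload;
    }
    return matches;
}
```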
17
Staging Computation (cont.)
  • By overlapping memory latency with other work,
    some cache miss latency can be hidden.
  • Many probes in flight at the same time.
  • Algorithms need to be rewritten.
  • E.g., [Chen et al. 2004], [Harizopoulos et al. 2004].

18
Work-Ahead Set: Main Thread
  • Writes memory address computation state to the
    work-ahead set
  • Retrieves a previous address state
  • Hope that helper thread can preload data before
    retrieval by the main thread
  • Correct whether or not helper thread succeeds at
    preloading data
  • helper thread is read-only

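A rough sketch of how such a main-thread step might look in C++; the WorkAheadSet layout, field names, and relaxed atomics are assumptions for illustration, not the paper's implementation:

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Hypothetical work-ahead set: a fixed-size ring of (state, address) slots
// shared with the helper thread.
struct WorkAheadSet {
    struct Slot {
        int                state = 0;             // where to resume this probe
        std::atomic<void*> address{nullptr};      // next cache line the probe needs
    };
    std::vector<Slot> slots;
    explicit WorkAheadSet(std::size_t n) : slots(n) {}
};

// Main thread: publish the new (state, address) pair into slot i, take back
// whatever was stored there before, and resume work on it.  The result is
// correct whether or not the helper thread managed to preload the address.
void main_thread_step(WorkAheadSet& ws, std::size_t& i,
                      int new_state, void* new_address,
                      int& resumed_state, void*& resumed_address) {
    WorkAheadSet::Slot& slot = ws.slots[i];
    resumed_state   = slot.state;
    resumed_address = slot.address.load(std::memory_order_relaxed);
    slot.state = new_state;
    slot.address.store(new_address, std::memory_order_relaxed);
    i = (i + 1) % ws.slots.size();                 // main thread moves forward
}
```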
19
Work-Ahead Set Data Structure
[Diagram: a ring of (state, address) slots; the main thread is at the current slot.]
20
Work-Ahead Set Data Structure
[Diagram: the main thread stores a new (state, address) entry and advances to the next slot.]
21
Work-Ahead Set: Helper Thread
  • Reads memory addresses from the work-ahead set,
    and loads their contents
  • Data becomes cache resident
  • Tries to preload data before main thread cycles
    around
  • If successful, main thread experiences cache hits

22
Work-Ahead Set Data Structure
[Diagram: the ring holds (state, address) entries (addresses E through J, states 1 or 2); the helper thread reads temp = slot[i].]
23
Iterate Backwards!
[Diagram: the helper thread walks the same ring in the opposite direction from the main thread, updating i = (i - 1) mod size.]
Why? See Paper.
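Continuing the same illustrative WorkAheadSet from the earlier sketch, a possible helper-thread loop; the explicit dereference (rather than a prefetch) and the backward walk follow the slides, while everything else is assumed:

```cpp
#include <atomic>
#include <cstddef>

// The helper is read-only with respect to the algorithm: it loads each
// posted address so the data becomes cache resident (an explicit load, so
// it cannot be dropped), and it walks the ring backwards so it tends to
// meet the main thread rather than chase it.
void helper_thread_loop(WorkAheadSet& ws, const std::atomic<bool>& done) {
    std::size_t i = 0;
    while (!done.load(std::memory_order_relaxed)) {
        void* p = ws.slots[i].address.load(std::memory_order_relaxed);
        if (p != nullptr) {
            volatile char sink = *static_cast<volatile char*>(p);  // touch the line
            (void)sink;
        }
        i = (i == 0) ? ws.slots.size() - 1 : i - 1;   // i = (i - 1) mod size
    }
}
```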
24
Helper Thread Speed
  • If helper thread faster than main thread
  • More computation than memory latency
  • Helper thread should not preload twice (wasted
    CPU cycles)
  • See paper for how to stop redundant loads
  • If helper thread is slower
  • No special tuning necessary
  • Main thread will absorb some cache misses

25
Work-Ahead Set Size
  • Too Large: Cache Pollution
  • Preloaded data evicts other preloaded data before
    it can be used
  • Too Small: Thread Contention
  • Many MOMC events because work-ahead set spans few
    cache lines
  • Just Right: Experimentally determined
  • But use the smallest size within the acceptable
    range (performance plateaus), so that cache space
    is available for other purposes (for us, 128
    entries)
  • Data structure itself much smaller than L2 cache

26
Experimental Workload
  • Two Operators
  • Probe phase of Hash Join
  • CSB+-Tree Index Join
  • Operators run in isolation and in parallel
  • Intel VTune used to measure hardware events

27
Experimental Outline
  • Hash join
  • Index lookup
  • Mixed Hash join and index lookup

28
Hash Join: Comparative Performance
29
Hash Join: L2 Cache Misses Per Tuple
30
CSB+-Tree Index Join: Comparative Performance
31
CSB+-Tree Index Join: L2 Cache Misses Per Tuple
32
Parallel Operator Performance
33
Parallel Operator Performance
34
Conclusion