1
Parallel Database Systems A SNAP Application
Jim Gray, 310 Filbert, SF CA 94133, Gray@Microsoft.com
Gordon Bell, 450 Old Oak Court, Los Altos, CA 94022, GBell@Microsoft.com
2
Outline
  • Cyberspace Pep Talk
  • Databases are the dirt of Cyberspace
  • Billions of clients mean millions of servers
  • Parallel Imperative
  • Hardware trend: many little devices
  • Consequence: servers are arrays of commodity
    components
  • PCs are the bricks of Cyberspace
  • Must automate parallel design / operation /
    use
  • Software parallelism via dataflow and data
    partitioning
  • Parallel database techniques
  • Parallel execution of many little jobs (OLTP)
  • Data partitioning
  • Pipeline execution
  • Automation techniques
  • Summary

3
Kinds Of Information Processing
                  Point-to-Point          Broadcast
  Immediate       conversation, money     lecture, concert
  Time-Shifted    mail                    book, newspaper
  (the Network carries the immediate; the Database holds the time-shifted)

It's ALL going electronic. The immediate is being
stored for analysis (so it ALL ends up in a
database), and analysis and automatic processing are
being added.
4
Why Put Everything in Cyberspace?
Low rent: min $/byte
Shrinks time: now or later
Shrinks space: here or there
Automate processing: knowbots

[Diagram: Point-to-Point OR Broadcast, over the
Network, Immediate OR Time-Delayed, into a Database
that can Locate, Process, Analyze, and Summarize.]
5
Databases: Information At Your Fingertips
Information Network, Knowledge Navigator
  • All information will be in an online database
    (somewhere)
  • You might record everything you
  • read: 10 MB/day, 400 GB/lifetime (two tapes)
  • hear: 400 MB/day, 16 TB/lifetime (a tape per
    decade)
  • see: 1 MB/s, 40 GB/day, 1.6 PB/lifetime (maybe
    someday)
  • Data storage, organization, and analysis are a
    challenge.
  • That is what databases are about
  • DBs do a good job on records
  • Now working on text, spatial, image, and sound.
  • This talk is about automatic parallel search (the
    outer loop)
  • Techniques work for ALL kinds of data

6
Databases Store ALL Data Types
  • The Old World
  • Millions of objects
  • 100-byte objects
  • The New World
  • Billions of objects
  • Big objects (1MB)
  • Objects have behavior (methods)

Old world table, People(Name, Address):
  David   NY
  Mike    Berk
  Won     Austin

New world table, People(Name, Address, Papers,
Picture, Voice): the same rows, now with big
multimedia columns.

Paperless office, Library of Congress online, all
information online: entertainment, publishing,
business. Information Network, Knowledge Navigator,
Information at your fingertips.
7
Magnetic Storage Cheaper than Paper
  • File Cabinet: cabinet (4 drawer) $250, paper
    (24,000 sheets) $250, space (2x3 ft @
    $10/ft2) $180, total $700, about 3 cents/sheet
  • Disk: disk (8 GB) $4,000; as ASCII that is
    4 M pages, 0.1 cents/sheet (30x cheaper)
  • As image: 200 K pages, 2 cents/sheet (similar
    to paper)
  • Store everything on disk

8
Cyberspace Demographics
Computer history:
  1950: National Computer
  1960: Corporate Computer
  1970: Site Computer
  1980: Departmental Computer
  1990: Personal Computer
  2000: ?
  • most computers are small; NEXT: 1 billion X, for
    some X (phone?)
  • most of the money is in clients and wiring:
    1990: 50% desktop, 1995: 75% desktop
9
Billions of Clients
  • Every device will be intelligent
  • Doors, rooms, cars, ...
  • Computing will be ubiquitous

10
Billions of Clients Need Millions of Servers
All clients are networked to servers; they may be
nomadic or on-demand. Fast clients want faster
servers. Servers provide data, control, coordination,
and communication.
[Diagram: mobile and fixed clients connect to
servers; super servers host large databases and
high-traffic shared data.]
11
If Hardware is Free, Where Will The Money Go?
  • All clients and servers will be based on PC
    technology; economies of scale give the lowest
    price.
  • Traditional budget: 40% vendor, 60% staff
  • If hardware_price + software_price goes to 0,
    then what?
  • Money will go to
  • CONTENT (databases)
  • NEW APPLICATIONS
  • AUTOMATION (analogy to 1920s telephone operators)
  • Systems programmer per MIPS
  • DBA per 10 GB

12
The New Computer Industry
  • Horizontal integration is the new structure
  • Each layer picks the best from the lower layer.
  • Desktop market share:
  • 1991: 50%
  • 1995: 75%
  • Compaq is the biggest computer company

Function            Example
Operation           AT&T
Integration         EDS
Applications        SAP
Middleware          Oracle
Baseware            Microsoft
Systems             Compaq
Silicon & Oxide     Intel, Seagate
13
Constant Dollars vs Constant Work
  • Constant Work:
  • One SuperServer can do all the world's
    computations.
  • Constant Dollars:
  • The world spends 10% on information processing
  • Computers are moving from 5% penetration to 50%
  • $300 B to $3 T
  • We have the patent on the byte and algorithm

14
The Seven Price Tiers
  • $10: wrist watch computers
  • $100: pocket/palm computers
  • $1,000: portable computers
  • $10,000: personal computers (desktop)
  • $100,000: departmental computers
    (closet)
  • $1,000,000: site computers (glass house)
  • $10,000,000: regional computers (glass
    castle)

SuperServer: costs more than $100,000
Mainframe: costs more than $1M
Must be an array of processors, disks, tapes, and
comm ports
15
Software Economics: Bill's Law
  • Bill Joy's law (Sun): Don't write software for
    fewer than 100,000 platforms. @ $10M engineering
    expense, $1,000 price
  • Bill Gates' law: Don't write software for fewer
    than 1,000,000 platforms. @ $10M engineering
    expense, $100 price
  • Examples
  • UNIX vs NT: $3,500 vs $500
  • Oracle vs SQL-Server: $100,000 vs $6,000
  • No spreadsheet or presentation pack on
    Unix/VMS/...
  • Commoditization of base software and hardware

16
What Comes Next?
  • MANY new clients
  • Applications to enable clients and servers
  • Super-servers

17
Outline
  • Cyberspace Pep Talk
  • Databases are the dirt of Cyberspace
  • Billions of clients mean millions of servers
  • Parallel Imperative
  • Hardware trend: many little devices
  • Consequence: server arrays of commodity parts
  • PCs are the bricks of Cyberspace
  • Must automate parallel design / operation /
    use
  • Software parallelism via dataflow and data
    partitioning
  • Parallel database techniques
  • Parallel execution of many little jobs (OLTP)
  • Data partitioning
  • Pipeline execution
  • Automation techniques
  • Summary

18
Moore's Law Restated: Many Little Won over Few Big
Hardware trends, a few generic parts:
  CPU and RAM
  Disk and tape arrays
  ATM for LAN/WAN, ?? for CAN, ?? for OS
These parts will be inexpensive (commodity
components). Systems will be arrays of these parts.
The software challenge: how to program arrays.
[Chart: price classes (10 K, 100 K, 1 M) for
mainframe, mini, micro, and nano systems, and disk
form factors shrinking from 9" through 5.25", 3.5",
and 2.5" to 1.8".]
19
Future SuperServer
An array of processors, disks, tapes, and comm
lines. The challenge: how to program it. Must use
parallelism:
  Pipeline: hide latency
  Partition: bandwidth and scaleup
20
Great Debate Shared What?
Shared Memory (SMP): easy to program, difficult to
  build, difficult to scale up (Sequent, SGI, Sun)
Shared Disk: in between (VMScluster, Sysplex)
Shared Nothing (network): hard to program, easy to
  build, easy to scale up (Tandem, Teradata, SP2)

The winner will be a synthesis of these ideas.
Distributed shared memory (DASH, Encore) blurs the
distinction between network and bus (locality is
still important), but gives shared memory at message
cost.
21
The Hardware is in Place ... and Then a Miracle
Occurs?
SNAP: Scaleable Network And Platforms. A commodity
distributed OS built on commodity platforms and a
commodity network interconnect.
22
Why Parallel Access To Data?
At 10 MB/s it takes 1.2 days to scan a 1 TB table;
with 1,000-way parallelism it becomes a 1.3 minute
SCAN. Parallelism buys BANDWIDTH.
Parallelism: divide a big problem into many smaller
ones to be solved in parallel.
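A quick sanity check of that arithmetic, as a minimal Python sketch (the 1 TB table size is assumed from the scaleup slide later in the deck; with these round numbers the parallel figure lands in the same ballpark as the slide's):

```python
# Back-of-the-envelope scan times (assumes a 1 TB table).
TABLE_BYTES = 1e12          # 1 TB
DISK_RATE = 10e6            # 10 MB/s per scan stream
DEGREE = 1000               # 1,000-way parallelism

serial_seconds = TABLE_BYTES / DISK_RATE
parallel_seconds = serial_seconds / DEGREE

print(f"serial scan:   {serial_seconds / 86400:.1f} days")    # ~1.2 days
print(f"parallel scan: {parallel_seconds:.0f} seconds")       # ~100 s, i.e. a minute or two
```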
23
DataFlow Programming: Prefetch and Postwrite Hide
Latency
  • Can't wait for the data to arrive
  • Need a memory system that gets the data in
    advance (~100 MB/s)
  • Solution:
  • Pipeline from the source (tape, disc, RAM, ...)
    to the CPU cache
  • Pipeline results to the destination

Prefetch and postwrite hide LATENCY.
24
The New Law of Computing
Grosch's Law is replaced by the Parallel Law, which
needs linear speedup and linear scaleup. Not always
possible.
25
Parallelism: Performance is the Goal
The goal is to get 'good' performance.
Law 1: a parallel system should be faster than the
serial system.
Law 2: a parallel system should give near-linear
scaleup or near-linear speedup or both.
Parallelism is faster, not cheaper: it trades money
for time.
26
Parallelism: Speedup and Scaleup
Speedup: same job (100 GB), more hardware, less
time.
Batch Scaleup: bigger job (100 GB to 1 TB), more
hardware, same time.
Transaction Scaleup: more clients and servers (1 K
to 10 K clients, 100 GB to 1 TB server), same
response time.
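A minimal sketch of how these two metrics are usually computed; the formulas are the standard definitions rather than anything spelled out on the slide, and the timing numbers are invented for illustration:

```python
def speedup(small_system_elapsed: float, big_system_elapsed: float) -> float:
    """Same job, more hardware: N on N nodes would be linear speedup."""
    return small_system_elapsed / big_system_elapsed

def scaleup(small_job_small_system: float, big_job_big_system: float) -> float:
    """N-times bigger job on N-times bigger hardware: 1.0 is linear scaleup."""
    return small_job_small_system / big_job_big_system

# Example: a 100 GB scan takes 1000 s on 1 node and 110 s on 10 nodes,
# while a 1 TB scan on the 10 nodes takes 1050 s.
print(speedup(1000, 110))    # ~9.1x: near-linear speedup
print(scaleup(1000, 1050))   # ~0.95: near-linear scaleup
```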
27
The Perils of Parallelism
[Figure: a bad speedup curve: benefit vs. processors
and discs, compared with linearity and with no
parallelism.]
Startup: creating processes, opening files,
optimization.
Interference: device (CPU, disc, bus) and logical
(lock, hotspot, server, log, ...).
Skew: if tasks get very small, variance > service
time.
28
Outline
  • Cyberspace Pep Talk
  • Databases are the dirt of Cyberspace
  • Billions of clients mean millions of servers
  • Parallel Imperative
  • Hardware trend: many little devices
  • Consequence: server arrays of commodity parts
  • PCs are the bricks of Cyberspace
  • Must automate parallel design / operation /
    use
  • Software parallelism via dataflow and data
    partitioning
  • Parallel database techniques
  • Parallel execution of many little jobs (OLTP)
  • Data partitioning
  • Pipeline execution
  • Automation techniques
  • Summary

29
Kinds of Parallel Execution
Pipeline: any sequential program feeds any other
sequential program.
Partition: outputs split N ways, inputs merge M
ways; each partition runs an ordinary sequential
program.
30
Data Rivers: Split and Merge Streams
N producers, M consumers, N x M data streams through
the river.
Producers add records to the river; consumers
consume records from the river. Purely sequential
programming. The river does flow control and
buffering, and does the partition and merge of data
records. The river is the Exchange operator in
Volcano.
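A toy Python sketch of the idea (the threads, queues, and EOF convention are my own scaffolding, not Volcano's API): N producers hash-partition records into M consumer queues, so every producer and consumer stays purely sequential while the "river" does the split and merge.

```python
import queue, threading

N_PRODUCERS, M_CONSUMERS = 3, 2

# One queue per consumer: the "river" partitions records by hash.
rivers = [queue.Queue() for _ in range(M_CONSUMERS)]

def producer(pid: int, records):
    for rec in records:
        rivers[hash(rec) % M_CONSUMERS].put(rec)   # split N ways into M streams
    for r in rivers:
        r.put(("EOF", pid))                        # tell every consumer this producer is done

def consumer(cid: int):
    done, out = 0, []
    while done < N_PRODUCERS:
        rec = rivers[cid].get()
        if isinstance(rec, tuple) and rec[0] == "EOF":
            done += 1
        else:
            out.append(rec)                        # purely sequential consumption
    print(f"consumer {cid} got {len(out)} records")

threads = [threading.Thread(target=producer, args=(p, range(p * 100, (p + 1) * 100)))
           for p in range(N_PRODUCERS)]
threads += [threading.Thread(target=consumer, args=(c,)) for c in range(M_CONSUMERS)]
for t in threads: t.start()
for t in threads: t.join()
```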
31
Partitioned Data and Execution
Spreads computation and I/O among processors.
Partitioned data gives NATURAL execution
parallelism.
32
Partitioned Merge Pipeline Execution

Pure dataflow programming gives linear speedup and
scaleup, but the top node may be a bottleneck.
So....
33
N x M Way Parallelism
N inputs, M outputs, no bottlenecks.
34
Why Are Relational Operators Successful for
Parallelism?
The relational data model gives uniform operators on
uniform data streams, closed under composition. Each
operator consumes 1 or 2 input streams; each stream
is a uniform collection of data; sequential data in
and out: pure dataflow. Partitioning some operators
(e.g. aggregates, non-equi-join, sort, ...) requires
innovation. The result: AUTOMATIC PARALLELISM.
35
SQL: a Non-Procedural Programming Language
  • SQL is a functional programming language: it
    describes the answer set.
  • The optimizer picks the best execution plan:
  • the data flow web (pipeline),
  • the degree of parallelism (partitioning),
  • and other execution parameters (process
    placement, memory, ...)

[Diagram: the GUI and schema feed the optimizer,
which produces a plan; execution planning and
monitoring drive the rivers.]
36
Database Systems Hide Parallelism
  • Automate system management via tools
  • data placement
  • data organization (indexing)
  • periodic tasks (dump / recover / reorganize)
  • Automatic fault tolerance
  • duplex failover
  • transactions
  • Automatic parallelism
  • among transactions (locking)
  • within a transaction (parallel execution)

37
Success Stories
  • Online Transaction Processing
  • many little jobs
  • SQL systems support 3,700 tps-A (24 CPU, 240
    disk)
  • SQL systems support 21,000 tpm-C
  • (110 CPU, 800 disk)
  • Batch (decision support and utility)
  • few big jobs, parallelism inside
  • Scan data at 100 MB/s
  • Linear scaleup to 50 processors

[Charts: transactions/sec vs. hardware, and
records/sec vs. hardware.]
38
Kinds of Partitioned Data
Split a SQL table across a subset of nodes and
disks. Partition within the set by:
  Range: good for equijoins, range queries, and
    group-by
  Hash: good for equijoins
  Round Robin: good to spread load
Shared disk and shared memory are less sensitive to
partitioning; shared nothing benefits from
"good" partitioning.
39
Index Partitioning
Hash indices partition by hash. B-tree indices
partition as a forest of trees, one tree per range.
The primary index clusters the data.
40
Secondary Index Partitioning
In shared nothing, secondary indices are
problematic.
Partition by base table key ranges:
  insert is completely local (but what about
  uniqueness?);
  lookup examines ALL trees (see figure);
  a unique index requires a lookup on insert.
Partition by secondary key ranges:
  insert touches two nodes (base and index);
  lookup touches two nodes (index -> base);
  uniqueness is easy.
[Figure: each node holds an A..Z index partition
over its slice of the base table.]
Teradata solution: partition non-unique indices by
the base table, unique indices by the secondary key.
41
Picking Data Ranges
Disk partitioning:
  For range partitioning, sample the load on the
  disks; cool hot disks by making their ranges
  smaller.
  For hash partitioning, cool hot disks by mapping
  some buckets to other disks.
River partitioning:
  Use hashing and assume uniformity. If range
  partitioning, sample the data and use a histogram
  to level the bulk.
Teradata, Tandem, and Oracle use these tricks.
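A sketch of the histogram trick for river range-partitioning: sample the data, then choose splitters at equal-depth quantiles so each range gets roughly the same bulk. Equal-depth splitting is the standard technique; the function name and data are mine.

```python
import bisect, random

def pick_range_splitters(sample, n_partitions):
    """Equal-depth splitters: each partition gets ~len(sample)/n_partitions keys."""
    s = sorted(sample)
    step = len(s) / n_partitions
    return [s[int(i * step)] for i in range(1, n_partitions)]

# Skewed data: most keys are small, a few are huge.
data = [random.expovariate(1.0) for _ in range(100_000)]
splitters = pick_range_splitters(random.sample(data, 1_000), n_partitions=4)
print("splitters:", [round(x, 2) for x in splitters])

# Level check: how much of the full data lands in each range.
counts = [0] * 4
for x in data:
    counts[bisect.bisect_left(splitters, x)] += 1
print("bulk per partition:", counts)   # roughly equal despite the skew
```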
42
Parallel Data Scan
  SELECT image
  FROM   landsat
  WHERE  date BETWEEN 1970 AND 1990
    AND  overlaps(location, Rockies)
    AND  snow_cover(image) > .7

Assign one process per processor/disk: find the
images with the right date and location, analyze
each image, and if it is 70% snow, return it.
[Figure: the landsat table (date, location, image)
is partitioned across disks; each scan applies the
temporal, spatial, and image tests and the answers
merge.]
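A toy sketch of the "one process per partition" scan using a process pool; the table contents, predicates, and snow fraction column are invented stand-ins for the slide's date, location, and image tests, which a real system would push into each partition's scan operator.

```python
from multiprocessing import Pool

# Hypothetical partitioned landsat table: (date, location, snow_fraction, image_id)
PARTITIONS = [
    [(1975, "Rockies", 0.90, "img-001"), (1985, "Alps", 0.80, "img-002")],
    [(1980, "Rockies", 0.50, "img-003"), (1988, "Rockies", 0.95, "img-004")],
    [(1965, "Rockies", 0.99, "img-005")],
]

def scan_partition(rows):
    """Apply the temporal, spatial, and image tests locally; return survivors."""
    return [img for (date, loc, snow, img) in rows
            if 1970 <= date <= 1990 and loc == "Rockies" and snow > 0.7]

if __name__ == "__main__":
    with Pool(len(PARTITIONS)) as pool:
        answer = [img for part in pool.map(scan_partition, PARTITIONS) for img in part]
    print(answer)   # ['img-001', 'img-004']
```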
43
Simple Aggregates (sort or hash?)
Simple aggregates (count, min, max, ...) can use
indices: more compact, and the index sometimes
carries the aggregate info.
GROUP BY aggregates: scan in category order if
possible (use indices); else, if the categories fit
in RAM, use an in-RAM category hash table; else make
a temp file of <category, item> pairs, sort it by
category, and do the math in the merge step.
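A minimal sketch of the "categories fit in RAM" branch: a single-pass hash table keyed by category (the sort-based branch would instead sort the <category, item> pairs and aggregate during the merge). The table and column names are invented.

```python
from collections import defaultdict

def group_by_sum(rows):
    """One-pass GROUP BY ... SUM when the categories fit in RAM."""
    totals = defaultdict(float)          # category -> running sum
    for category, value in rows:
        totals[category] += value
    return dict(totals)

sales = [("WA", 10.0), ("CA", 5.0), ("WA", 2.5), ("OR", 1.0), ("CA", 4.0)]
print(group_by_sum(sales))   # {'WA': 12.5, 'CA': 9.0, 'OR': 1.0}
```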
44
Parallel Aggregates
For each aggregate function we need a decomposition
strategy:
  count(S) = Σ count(s(i)), and ditto for sum()
  avg(S) = (Σ sum(s(i))) / (Σ count(s(i)))
  and so on...
For groups, sub-aggregate the groups close to the
source and drop the sub-aggregates into a hash
river.
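A sketch of that decomposition: each partition computes a (sum, count) sub-aggregate near the source, and a final step combines them, so avg is reconstructed exactly. The function names and data are mine.

```python
def local_aggregate(partition):
    """Sub-aggregate one partition near the source: (sum, count)."""
    return sum(partition), len(partition)

def combine(sub_aggregates):
    """count(S) = sum of counts, sum(S) = sum of sums, avg(S) = sum/count."""
    total = sum(s for s, _ in sub_aggregates)
    count = sum(c for _, c in sub_aggregates)
    return {"count": count, "sum": total, "avg": total / count}

partitions = [[1, 2, 3], [10, 20], [4]]
print(combine([local_aggregate(p) for p in partitions]))
# {'count': 6, 'sum': 40, 'avg': 6.666...}
```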
45
Sort
Sort is used for:
  loading and reorganization (sort makes them
  sequential),
  building B-trees,
  reports,
  non-equi joins.
It is rarely used for aggregates or equi-joins (if
hash is available). It should run at 10 MB/s or
better, which is faster than a disk, so it needs
striped scratch files. An in-memory sort runs at
about 250 K records/s.
[Figure: input data, sort into runs, merge runs into
sorted data.]
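A compact sketch of the run/merge design in the figure: sort memory-sized runs, then merge them with a heap; in-memory lists stand in for the striped scratch files a real sorter would use.

```python
import heapq

def external_sort(records, run_size):
    """Phase 1: sort runs that fit in 'memory'.  Phase 2: k-way merge of the runs."""
    runs = [sorted(records[i:i + run_size])
            for i in range(0, len(records), run_size)]
    return list(heapq.merge(*runs))       # streaming k-way merge

data = [5, 1, 9, 3, 7, 2, 8, 6, 4, 0]
print(external_sort(data, run_size=4))    # [0, 1, 2, ..., 9]
```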
46
Parallel Sort
An M-input, N-output sort design. The disk and merge
phases are not needed if the sort fits in memory.
It scales nearly linearly, since each node sorts and
merges only its share of the data.
Sort is the benchmark from hell for shared-nothing
machines: network traffic rivals disk bandwidth, and
there is no data filtering at the source.
47
Blocking Operators = Short Pipelines
An operator is blocking if it does not produce any
output until it has consumed all of its input.
Examples: sort, aggregates, hash-join (which reads
all of one operand).
Blocking operators kill pipeline parallelism and
make partition parallelism all the more important.
The database load template has three blocked phases.
48
Nested Loops Join
If the inner table is indexed on the join columns
(B-tree or hash), then sequentially scan the outer
(from the start key) and, for each outer record,
probe the inner table for matching records.
Works best if the inner fits in RAM (i.e., a small
inner); works great if the inner is a B-tree or hash
in RAM.
Partitions well: replicate the inner at each outer
partition (if the outer is partitioned on the join
column, don't replicate the inner, partition it).
Works for all joins (outer, non-equijoins,
cartesian, exclusion, ...).
[Figure: outer table scanned, inner table probed.]
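A sketch of the indexed nested-loops idea: build (or assume) an in-RAM hash index on the inner's join column, scan the outer, and probe. The tables and column names are invented.

```python
from collections import defaultdict

def nested_loops_join(outer, inner, outer_key, inner_key):
    """Scan the outer; probe an in-RAM hash index on the inner's join column."""
    index = defaultdict(list)                 # join-column value -> inner rows
    for row in inner:
        index[row[inner_key]].append(row)
    return [(o, i) for o in outer for i in index.get(o[outer_key], [])]

orders = [{"cust": 1, "item": "disk"}, {"cust": 2, "item": "tape"}]
customers = [{"id": 1, "name": "Gray"}, {"id": 2, "name": "Bell"}]
print(nested_loops_join(orders, customers, outer_key="cust", inner_key="id"))
```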
49
Merge Join (and sort-merge join)
If the tables are sorted on the join columns (B-tree
or hash), then sequentially scan each (from the
start key):
  left < right : advance left
  left = right : match
  left > right : advance right
A nice sequential scan of the data (disk speed); the
M x N duplicate case may cause a backwards rescan
(a cartesian product of the duplicates).
Sort-merge join sorts before doing the merge.
Partitions well: partition the smaller table to
match the larger table's partitioning.
Works for all joins (outer, non-equijoins,
cartesian, exclusion, ...).
[Figure: left table and right table scanned in step.]
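A sketch of the advance-left / match / advance-right loop for two inputs already sorted on the join key (equi-join only, on invented data); the M x N duplicate case is handled here by re-walking the right-hand duplicate group for each matching left row.

```python
def merge_join(left, right, key):
    """Equi-join of two inputs sorted on 'key' (an index into each tuple)."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][key] < right[j][key]:
            i += 1                                  # left < right: advance left
        elif left[i][key] > right[j][key]:
            j += 1                                  # left > right: advance right
        else:                                       # match: pair the duplicate group
            k = j
            while k < len(right) and right[k][key] == left[i][key]:
                out.append((left[i], right[k]))
                k += 1
            i += 1                                  # keep j so later lefts re-walk the group
    return out

L = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
R = [(2, "x"), (2, "y"), (3, "z")]
print(merge_join(L, R, key=0))   # four (2, ...) pairs
```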
50
Hash Join
Hash the smaller table into N buckets (hope N=1).
If N=1: read the larger table and hash-probe into
the smaller.
Else: hash the outer to disk, then do a
bucket-by-bucket hash join.
Purely sequential data behavior. Always beats
sort-merge and nested loops unless the data is
clustered.
Good for equi, outer, and exclusion joins.
Lots of papers; products are just appearing (what
went wrong?). Hashing reduces skew.
[Figure: the right table hashed into buckets, probed
by the left table.]
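A sketch of the in-memory (N=1) case: hash the smaller table once, then stream the larger table and probe. The grace/bucket-to-disk case would first partition both inputs by the same hash; the tables and keys below are invented.

```python
from collections import defaultdict

def hash_join(small, large, small_key, large_key):
    """Build a hash table on the smaller input, probe it with the larger one."""
    buckets = defaultdict(list)               # join value -> rows of the small table
    for row in small:
        buckets[row[small_key]].append(row)
    return [(s, l) for l in large for s in buckets.get(l[large_key], [])]

parts = [{"pid": 1, "name": "disk"}, {"pid": 2, "name": "tape"}]
lineitems = [{"part": 2, "qty": 7}, {"part": 1, "qty": 3}, {"part": 9, "qty": 1}]
print(hash_join(parts, lineitems, small_key="pid", large_key="part"))
```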
51
Observation: Execution is Easy, Automation is Hard
It is easy to build a fast parallel execution
environment (no one has done it, but it is just
programming). It is hard to write a robust,
world-class query optimizer: there are many tricks,
and one quickly hits the complexity barrier.
Common approach:
  Pick the best sequential plan
  Pick the degree of parallelism based on bottleneck
  analysis
  Bind operators to processes
  Place processes at nodes
  Place scratch files near the processes
  Use memory as a constraint
52
Systems That Work This Way
Shared Nothing:
  Teradata: 400 nodes
  Tandem: 110 nodes
  IBM SP2 / DB2: 48 nodes
  ATT & Sybase: 112 nodes
  Informix / SP2: 48 nodes
Shared Disk:
  Oracle: 170 nodes
  Rdb: 24 nodes
Shared Memory:
  Informix: 9 nodes
  RedBrick: ? nodes

53
Research Problems
Automatic data placement (partitioning: random or
organized)
Automatic parallel programming (process placement)
Parallel concepts, algorithms, and tools
Parallel query optimization
Execution techniques: load balancing,
checkpoint/restart, pacing, ...
54
Summary
  • Cyberspace is growing
  • Databases are the dirt of cyberspace; PCs are
    the bricks, networks are the mortar. Many
    little devices; performance via arrays of CPUs,
    disks, and tapes.
  • Then a miracle occurs: a scaleable distributed
    OS and net
  • SNAP: Scaleable Networks and Platforms
  • Then parallel database systems give software
    parallelism
  • OLTP: lots of little jobs run in parallel
  • Batch TP: data flow and data partitioning
  • Automate processor and storage array
    administration
  • Automate processor and storage array programming
  • 2,000 platforms as easy as 1 platform.