CS597A: Managing and Exploring Large Datasets

1
CS597A: Managing and Exploring Large Datasets
  • Kai Li

2
About This Seminar
  • Goal
  • Identify research directions and issues in
    managing and exploring large datasets
  • Plan
  • Overview of a few state-of-the-art storage
    systems
  • Reading some papers on a few research systems in
    storage systems, data management, and data
    exploration
  • Discussions on wild ideas
  • Define, work on, and present course projects

3
Why Is This Area Interesting? (Where Are The Bottlenecks?)
[Diagram: Create, Transform, Transmit (Network), Store and Retrieve]
4
Computer Food Chains
  • Computer systems in 1980s: Supercomputer (Cray,
    etc), Mini-super (Convex, etc), Mainframe (IBM
    370), Minicomputer (VAX), WS (SUN), PC
  • Computer systems in 1990s and 2000s:
    Supercomputer (Cray, etc), Servers (IBM, SUN),
    PC, Laptop, PDA
5
Storage Arrays of Food Chains?
  • Direct Attached Storage (DAS): USB, Microdrive,
    Flash; ATA disks; Super SCSI RAID; ATA RAID
  • Storage Area Network (SAN): Super SAN storage
    (EMC, Hitachi, IBM); MiniSuper SAN storage (HPQ,
    Startups); iSCSI (Startups)
  • Network Attached Storage (NAS): PC storage
    (Dell, Snap!, MSFT SAK boxes); Super NAS
    (NetApp, SUN); MiniSuper NAS (Startups)
6
Typical General Infrastructures
  • File servers w/o disks: Clients, Network, File
    servers, Storage Area Network, Mirrored storage
    (e.g. EMC), BCV or 3rd copy (e.g. EMC), Backup
    tape library
  • File servers w/ disks: Clients, Network, File
    servers, Storage Area Network, Backup tape
    library
7
Exponential Growth (Courtesy Jim Gray, Turing Lecture 99)
  • Performance/Price doubles every 18 months
  • 100x per decade
  • Progress in next 18 months = ALL previous
    progress
  • New storage = sum of all old storage (ever)
  • New processing = sum of all old processing
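
The arithmetic behind these bullets can be checked with a short sketch. It assumes an idealized, perfectly steady doubling every 18 months, which is a simplification of the slide's claim; the function names are illustrative only.

```python
# Assumption from the slide: performance/price doubles every 18 months.
DOUBLING_MONTHS = 18


def growth_factor(months: float) -> float:
    """Improvement factor after `months` of steady doubling."""
    return 2.0 ** (months / DOUBLING_MONTHS)


def cumulative_units(periods: int) -> int:
    """Total shipped over `periods` doubling periods, if the first period
    ships 1 unit and every later period ships twice the previous one."""
    return sum(2 ** k for k in range(periods))


decade = growth_factor(120)            # 2 ** (120 / 18), close to 100x
shipped_so_far = cumulative_units(10)  # 1 + 2 + ... + 2**9 = 1023
next_period = 2 ** 10                  # 1024 units in the following period

print(f"growth per decade: {decade:.1f}x")
print(f"all previous periods: {shipped_so_far}, next period: {next_period}")
```

Under this model the next period always ships one unit more than everything shipped before it, which is the sense in which "new storage = sum of all old storage".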

8
Disk Density vs. Moore's Law
9
Storage Capacity Grows Fast
10
Raw Storage Is Cheap
  • Disk drives beat tapes in 2002 in $/TB (IDC)
  • Disk $/TB declines 50% / year
  • Tape $/TB declines 29% / year
  • But, ATA arrays ($/TB) beat tape libraries in
    2006 (Gartner)
  • Disk system $/TB declines 40% / year
  • Tape library $/TB declines 29% / year
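
How long a crossover takes follows directly from the two decline rates. The sketch below assumes constant annual declines and a hypothetical starting price gap; the 4x gap in the usage lines is an illustration, not a figure from IDC or Gartner.

```python
import math


def years_to_crossover(price_a: float, decline_a: float,
                       price_b: float, decline_b: float) -> float:
    """Years until A's price per TB falls to B's, assuming constant
    annual declines given as fractions (0.50 means 50% per year)."""
    if price_a <= price_b:
        return 0.0
    if decline_a <= decline_b:
        raise ValueError("A only catches up if it declines faster than B")
    # Solve price_a * (1 - decline_a)**t == price_b * (1 - decline_b)**t.
    return math.log(price_a / price_b) / math.log((1 - decline_b) / (1 - decline_a))


# Hypothetical: disk starts at 4x the tape price per TB.
raw_media = years_to_crossover(4.0, 0.50, 1.0, 0.29)   # drives vs. tapes
systems = years_to_crossover(4.0, 0.40, 1.0, 0.29)     # arrays vs. libraries
print(f"raw media: {raw_media:.1f} years, systems: {systems:.1f} years")
```

With the same starting gap, the slower 40% decline of disk systems pushes their crossover well past that of raw drives, which is consistent with the later date quoted for arrays versus libraries.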

[Figure: $/TB by year, with 2002 and 2006 marked. Source: Gartner and IDC]
11
Summary of Storage Trends
  • Disk density beats Moore's Law
  • Data growth rate follows Moore's Law
  • Raw disks are cheap while storage systems are
    very expensive
  • Crossover from tapes to disks

12
How Much Information Is There? (Courtesy Jim Gray, Turing Lecture 99)
  • Soon everything can be recorded and indexed
  • Most data will never be seen by humans
  • Precious resource: human attention
  • Auto-summarization and auto-search are key
    technologies
  • www.lesk.com/mlesk/ksg97/ksg.html
[Figure: scale of Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta, marking A Book, A Photo, A Movie, All LoC books (words), All Books MultiMedia, and Everything Recorded]
Small prefixes (negative powers of ten): 24 yocto, 21 zepto, 18 atto, 15 femto, 12 pico, 9
nano, 6 micro, 3 milli
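
To place a dataset on the Kilo-to-Yotta scale, a small helper is enough. This is a minimal sketch using decimal (power-of-ten) prefixes; the function name `describe` and the output wording are illustrative choices.

```python
# Decimal (SI) prefixes from the slide, largest first, as powers of ten.
PREFIXES = [("Yotta", 24), ("Zetta", 21), ("Exa", 18), ("Peta", 15),
            ("Tera", 12), ("Giga", 9), ("Mega", 6), ("Kilo", 3)]


def describe(num_bytes: float) -> str:
    """Express a byte count with the largest prefix that does not exceed it."""
    for name, exponent in PREFIXES:
        if num_bytes >= 10 ** exponent:
            return f"{num_bytes / 10 ** exponent:.1f} {name}bytes"
    return f"{num_bytes:.0f} bytes"


print(describe(2.5e12))  # 2.5 Terabytes
print(describe(500))     # 500 bytes
```
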
13
How Much Information Is There? (Hal Varian, Peter Lyman et al. 2001)
  • Web has a lot of documents
  • Surface web had 2.5B docs, adding 7.5M
    pages/day
  • Deep web had 550B docs, 95% publicly accessible
  • Most websites are in English
  • 78% of all websites and 96% of e-commerce sites
  • E-mail generates a large amount of information
  • A white-collar worker receives 40 messages/day
  • E-mail generates 500x as much information as
    the web every year
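
The surface-web figures imply a rough doubling time. The sketch below assumes the daily rate stays constant and treats "docs" and "pages" as the same unit; both are simplifying assumptions, not claims from the study.

```python
SURFACE_WEB_DOCS = 2.5e9    # documents, from the slide
NEW_PAGES_PER_DAY = 7.5e6   # pages added per day, from the slide

# Pages added in one year at a constant daily rate.
pages_per_year = NEW_PAGES_PER_DAY * 365

# Days needed to add as many pages as already exist.
days_to_double = SURFACE_WEB_DOCS / NEW_PAGES_PER_DAY

print(f"added per year: {pages_per_year / 1e9:.2f}B pages")
print(f"days to double: {days_to_double:.0f}")
```

At these rates the surface web adds more than its own size in a year, so it doubles in under twelve months.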

14
How Much Information Is There? (Hal Varian, Peter Lyman et al. 2001)
Storage media   TB/year (upper est.)   TB/year (lower est.)   Growth rate
Paper                            240                      23            2%
Film                         427,216                  58,216            4%
Optical                           83                      31           70%
Magnetic                   1,693,000                 577,210           55%
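
Summing the table makes the dominance of magnetic storage explicit. This is a small sketch over the table's own numbers; only the variable names are introduced here.

```python
# TB/year estimates from the table: (upper, lower).
ESTIMATES = {
    "Paper": (240, 23),
    "Film": (427_216, 58_216),
    "Optical": (83, 31),
    "Magnetic": (1_693_000, 577_210),
}

upper_total = sum(upper for upper, _ in ESTIMATES.values())
lower_total = sum(lower for _, lower in ESTIMATES.values())

# Fraction of each total that is stored on magnetic media.
magnetic_share_upper = ESTIMATES["Magnetic"][0] / upper_total
magnetic_share_lower = ESTIMATES["Magnetic"][1] / lower_total

print(f"upper total: {upper_total:,} TB/year, magnetic {magnetic_share_upper:.0%}")
print(f"lower total: {lower_total:,} TB/year, magnetic {magnetic_share_lower:.0%}")
```
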
15
Challenges In Managing and Exploring Datasets
  • Disk behavior is like a big tape
  • Storage is indeed infinitely large
  • Ability to get information is slow
  • Reliability is far from what we need
  • Disks do fail
  • Software and humans corrupt data
  • Managing storage is difficult
  • Storage and data are both growing
  • Retrieving data is difficult
  • Get what you want
  • See what you get
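
The "big tape" point can be illustrated by comparing a sequential scan of a whole disk with reading the same data page by page in random order. All drive parameters below (100 GB capacity, 50 MB/s bandwidth, 10 ms access time, 8 KB pages) are hypothetical values chosen for illustration, not figures from the slides.

```python
def sequential_scan_seconds(capacity_bytes: float, bandwidth: float) -> float:
    """Time to stream the whole disk once at `bandwidth` bytes/second."""
    return capacity_bytes / bandwidth


def random_read_seconds(capacity_bytes: float, page_bytes: float,
                        access_s: float, bandwidth: float) -> float:
    """Time to read every page once in random order: each page pays one
    positioning delay plus its own transfer time."""
    pages = capacity_bytes / page_bytes
    return pages * (access_s + page_bytes / bandwidth)


# Hypothetical drive parameters.
CAPACITY = 100e9       # bytes
BANDWIDTH = 50e6       # bytes/second
ACCESS_TIME = 0.010    # seconds per random access
PAGE = 8 * 1024        # bytes

seq = sequential_scan_seconds(CAPACITY, BANDWIDTH)
rnd = random_read_seconds(CAPACITY, PAGE, ACCESS_TIME, BANDWIDTH)
print(f"sequential: {seq / 60:.0f} min, random: {rnd / 3600:.0f} h")
```

With these assumed numbers the sequential scan takes about half an hour while the random pass takes more than a day, so capacity is abundant but access to it is the bottleneck.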

16
Properties of A Research Goal (Jim Gray, 1999)
  • Simple to state
  • Not obvious how to do it
  • Clear benefit
  • Progress and solution are testable
  • Can be broken into smaller steps
  • So that you can see intermediate progress

17
Systems Challenges (Lampson, SOSP Keynote 99)
  • Systems that work
  • Meeting their specs
  • Always available
  • Adapting to changing environment
  • Evolving while they run
  • Made from unreliable components
  • Growing without practical limit
  • Credible simulations or analysis
  • Writing good specs
  • Testing
  • Performance
  • Understanding when it doesn't matter

18
What Should the New World Focus Be? (Hennessy, FCRC keynote 99)
  • Availability
  • Both appliance and service
  • Maintainability
  • Two functions
  • Enhancing availability by preventing failure
  • Ease of SW and HW upgrades
  • Scalability
  • Especially of service
  • Cost
  • per device and per service transaction
  • Performance
  • Remains important, but it's not SPECint

19
Tentative Syllabus
  • Today: About the Course
  • Week 2: Read several vision papers
  • Week 3: Guest lecture on archival storage
  • Week 4: Commercial storage systems (EMC, Veritas,
    NetApp)
  • Week 5: Global-scale storage (OceanStore and the
    like)
  • Week 6: Managing personal data (Coda, Bayou,
    Personal RAID)
  • Week 7: Managing geographical data (TerraServer)
  • Week 8: Guest lecture on managing astrophysical
    data (SkyServer)
  • Week 9: Managing and exploring large scientific
    data
  • Week 10: Managing medical data
  • Week 11: Managing genomic data
  • Week 12: Project reports and presentations
  • A detailed, tentative reading list will be
    available this weekend