Title: A Petabyte in Your Pocket
1. A Petabyte in Your Pocket
David Maier, Oregon Graduate Institute
with help from D. DeWitt, J. Naughton, L. Delcambre, K. Tufte, V. Papadimos, P. Tucker
2. Your PetDB
- It's 2015.
- For $300 a year, you can have a personal petabyte database (PetDB).
- You can talk to it from anywhere.
- Organizes any kind of digital data.
- Doesn't lose structure, can restructure
- Queryable
- Handles streams
- Organized by type, content, associations, multiple categorizations and groupings
- Locate items by
- How or where you encountered them
- What you've done with them
- Where you were when you accessed them
3. What Would I Put in a Petabyte?
- A lot.
- Fill my office floor to ceiling with books ≈ 100 GB
- What do I do with 10,000× as much?
- Many possibilities
- Contents of every book and magazine I read
- Every web page I visit
- All email I send or receive
- Every TV program I watch
- Every version of every piece of software I use
- Maps of everywhere I go
- Notes from every class or seminar I attend
- All the telephone calls I make
- My Lifestream (Freeman and Gelernter)
4. Streams and Restructuring
- Can incorporate streamed data on the fly.
- MD: Vital signs from patients in ICU
- Factory supervisor: status, output rate of all machines, finished products, rejects
- Can restructure data if desired.
- Combined list of conferences in my area
- Info sheets on autos I'm considering buying
- Comparable salaries of faculty at my rank in similar departments
5. Anything I Might Want to Refer Back to
- Personally indexed for me.
- Can be located in a thousand different ways.
- What is the company in Massachusetts I read about
in the article on factory tours when I was on the
plane to the sales meeting in Atlanta last
spring?
6. Or Things I Might Want in the Future
- Histories of news groups and mailing lists
- Parts of the web I might want to browse, including past snapshots
- Descriptions and prices for any item I might want to buy
- Papers I've been meaning to read
- Historical data on stocks I'm interested in
- Functions as a personal web portal
7. Database Not Completely Apt
- Didn't have to define a scheme for it
- Doesn't need to know the datatypes I want to store in advance
- Doesn't chop data into rows and columns
- Unless I ask
- Can query over information streams
- Don't need to write and run applications to add data
- Anything I've touched is there
- Or expressed an interest in
- Not on a particular computer
- Doesn't have an outside
8. My PetDB is Good to Me
- I don't move data between environments
- I'm never on the wrong machine
- Never go back to my office to grab a paper, never have the wrong folder at a meeting
- Don't worry a lot about filing systems: PetDB organizes itself by the ways I like to look for information
- Anticipates what data I'll be using
9. How to Do This?
- On $300/year
- Plan A: Pack my office floor to ceiling with disk drives.
- About $1 million.
- Plan B: Be clever.
- Share
- Stage
- Reconstitute
10. Share
- Most of the information in my PetDB isn't unique to me: magazine article, web page, stock quote.
- Store one copy.
- Information Paradox: What's too expensive for one may be affordable for all.
[Figure labels: Others' PetDBs, My PetDB]
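One way to picture "store one copy" is a content-addressed store: identical items are kept once, and each PetDB holds only a reference. This is a minimal sketch, not something the slides specify. The names `SharedStore` and `PetDB` (as a class) and the use of SHA-256 digests as keys are assumptions made for illustration.

```python
import hashlib


class SharedStore:
    """Hypothetical shared store: one copy per distinct piece of content."""

    def __init__(self):
        self._blobs = {}  # digest -> bytes

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        # Keep the content only if it has not been stored before.
        self._blobs.setdefault(digest, data)
        return digest

    def get(self, digest: str) -> bytes:
        return self._blobs[digest]

    def copies_stored(self) -> int:
        return len(self._blobs)


class PetDB:
    """Hypothetical personal view: my own names, pointing into the shared store."""

    def __init__(self, store: SharedStore):
        self._store = store
        self._refs = {}  # personal name -> digest

    def add(self, name: str, data: bytes) -> None:
        self._refs[name] = self._store.put(data)

    def read(self, name: str) -> bytes:
        return self._store.get(self._refs[name])


store = SharedStore()
mine, yours = PetDB(store), PetDB(store)
article = b"Article on factory tours"
mine.add("plane reading", article)   # filed under my own label
yours.add("tour ideas", article)     # same content, someone else's label
```

Both users see the article under their own names, while the shared store holds a single copy of the bytes.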
11. Stage
- Not all data has to be at my current point of connection.
- Mainly resides in shared and private servers on the Internet.
- Staged to me on a series of data managers.
- Access time depends on context, likely use
- Current itinerary: 1 second
- Upcoming trips: 5 seconds
- Past trips: 30 seconds
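The context-dependent access times can be read as a lookup from an item's context to a latency target. The sketch below uses the three targets from the slide. The rule for classifying a trip by its dates, and the names `trip_context` and `access_target`, are assumptions for illustration only.

```python
from datetime import date

# Latency targets (seconds) from the slide.
TARGET_SECONDS = {"current": 1, "upcoming": 5, "past": 30}


def trip_context(start: date, end: date, today: date) -> str:
    """Assumed rule: classify a trip by where today falls relative to its dates."""
    if start <= today <= end:
        return "current"
    if today < start:
        return "upcoming"
    return "past"


def access_target(start: date, end: date, today: date) -> int:
    """Return the access-time target for itinerary data about this trip."""
    return TARGET_SECONDS[trip_context(start, end, today)]


today = date(2015, 3, 10)
current = access_target(date(2015, 3, 9), date(2015, 3, 12), today)
upcoming = access_target(date(2015, 4, 1), date(2015, 4, 3), today)
past = access_target(date(2014, 5, 1), date(2014, 5, 4), today)
```

A stager could use such targets to decide how close to the user each item needs to be kept.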
12. Reconstitute
- If I found it once, PetDB can find it again
- Remember what procedure or search constructed or located data originally.
- Use the same method to get it again.
- Need to ensure base data is archived.
- Plus a small amount of unique content
- Stuff I've created
- Foreground information that superimposes my personal perspective: selections, annotations, responses, manipulations, groupings
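Reconstitution amounts to storing the recipe rather than the result. In the sketch below, the PetDB keeps a record of which procedure located an item, plus a small personal overlay, and re-runs the procedure against archived base data on demand. Everything named here (`Recipe`, `ARCHIVE`, `search_by_keyword`, `reconstitute`) is hypothetical, and the archive is a stand-in list of titles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recipe:
    """Hypothetical record of how an item was originally located."""
    procedure: str  # name of the search or procedure that found it
    argument: str   # argument that procedure was called with


# Stand-in for archived base data held outside my PetDB.
ARCHIVE = [
    "Factory Tours of New England",
    "Stock Quotes Weekly",
    "Visiting a Chocolate Factory",
]


def search_by_keyword(keyword: str) -> list:
    """Return archived titles containing the keyword, case-insensitively."""
    return sorted(t for t in ARCHIVE if keyword.lower() in t.lower())


PROCEDURES = {"search_by_keyword": search_by_keyword}


def reconstitute(recipe: Recipe) -> list:
    """Re-run the remembered procedure to obtain the same data again."""
    return PROCEDURES[recipe.procedure](recipe.argument)


# What my PetDB actually stores: the recipe and my own annotations.
recipe = Recipe("search_by_keyword", "factory")
annotations = {"Factory Tours of New England": "read on the plane to Atlanta"}

found = reconstitute(recipe)
```

This only works while the base data stays archived, which is why the slide calls that out as a requirement.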
13. What Infrastructure Do I Need?
- Net Data Managers
- Network-centric vs. disk-centric
- Data movement vs. data storage
- Work on live streams as well as stored data
- Deal with data of arbitrary types
- Run queries over thousands of sites
- Locate data by external contexts as well as internal content
- Large-scale monitoring
14. Data Management Space
15. Why Net Data Managers?
- File systems won't work
- No queries, disk-centric
- Web servers won't work
- No structural query, no combining of data
- No support for optimization and execution of high-level queries spanning 1000s of sites
- No support for triggers
- In reality, nothing more than page servers
16. Limitations of Current DBMSs
- Schema-first
- Load then query
- Data in the box
- Scale
- Search by content, not by context
17. Key Elements of NDM
- Self-describing data (e.g., XML)
- NetQueries
- Algebraic basis
- Stream-processing components
- Oil refinery vs. book-order warehouse
- Want to do for net-centric, data-intensive applications what relational DBs did for business data processing
- Reduce the coding effort to produce such applications, while improving performance, scalability and reliability.
18. Codd's Contribution
- What's the most important aspect of the relational model?
- Calculus?
- Algebra?
- Equivalence?
- My opinion: Observing that business data processing (BDP) programs only do about 6-7 different things
- scan files
- remove fields
- select records
- remove duplicates
- combine records
- aggregate records
- concatenate files
- What are the building blocks of net data management?
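The seven things listed above can each be written as a small, composable operator. The sketch below does so over records represented as Python dictionaries. The function names follow the slide's wording, while the signatures, the sample data, and the choice of summation for "aggregate" are assumptions for illustration.

```python
from collections import defaultdict


def scan(file):
    """Scan a file: yield each record in turn."""
    yield from file


def remove_fields(records, keep):
    """Keep only the named fields of each record."""
    for r in records:
        yield {k: r[k] for k in keep}


def select_records(records, predicate):
    """Keep only the records that satisfy the predicate."""
    for r in records:
        if predicate(r):
            yield r


def remove_duplicates(records):
    """Drop records identical to one already seen (values must be hashable)."""
    seen = set()
    for r in records:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            yield r


def combine_records(left, right, field):
    """Pair up records from two inputs that agree on one field."""
    index = defaultdict(list)
    for r in right:
        index[r[field]].append(r)
    for l in left:
        for r in index.get(l[field], ()):
            yield {**l, **r}


def aggregate_records(records, group_field, value_field):
    """Sum value_field within each group (summation is an assumed choice)."""
    totals = defaultdict(int)
    for r in records:
        totals[r[group_field]] += r[value_field]
    for group, total in totals.items():
        yield {group_field: group, "total": total}


def concatenate_files(*files):
    """Yield the records of each file, one file after another."""
    for f in files:
        yield from f


# Hypothetical sample data.
orders_q1 = [{"cust": "a", "amount": 10}, {"cust": "b", "amount": 5}]
orders_q2 = [{"cust": "a", "amount": 7}, {"cust": "a", "amount": 7}]
customers = [{"cust": "a", "city": "Portland"}, {"cust": "b", "city": "Madison"}]

orders = remove_duplicates(concatenate_files(scan(orders_q1), scan(orders_q2)))
large = select_records(orders, lambda r: r["amount"] >= 7)
joined = combine_records(large, scan(customers), "cust")
trimmed = remove_fields(joined, ["city", "amount"])
totals = list(aggregate_records(trimmed, "city", "amount"))
```

Because every operator consumes and produces a stream of records, they chain freely, which is the property the slide asks about for net data management.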
19. Without NDMs
[Figure labels: Data Sources, Users]
20. With NDMs
21. Kinds of Components
- Stream-based query processors
- Alerters
- Accumulators
- Remote monitoring/indexing
- Semantic Routers
- Replicators: lazy, eager, just-in-time
- Semantic caches
- Splitters
- Access-mode adapters
- Partial evaluators
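A few of these components can be sketched as small functions over a stream of items. The slides name the components but do not define them, so the behaviour given here for an alerter, an accumulator, and a splitter is one assumed interpretation, and the ICU readings are made-up sample data.

```python
def alerter(stream, condition, notify):
    """Pass every item through, calling notify on items that meet the condition."""
    for item in stream:
        if condition(item):
            notify(item)
        yield item


def accumulator(stream, size):
    """Collect a stream into batches of up to `size` items."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter batch


def splitter(stream, key):
    """Route each item to an output chosen by key(item).

    For simplicity this sketch materializes the outputs as lists
    rather than producing separate live streams.
    """
    outputs = {}
    for item in stream:
        outputs.setdefault(key(item), []).append(item)
    return outputs


# Made-up (bed, heart rate) readings standing in for an ICU vital-signs stream.
readings = [("bed1", 72), ("bed2", 140), ("bed1", 75), ("bed2", 150), ("bed1", 71)]

alerts = []
monitored = alerter(iter(readings), lambda r: r[1] > 120, alerts.append)
batches = list(accumulator(monitored, 2))
by_bed = splitter(readings, key=lambda r: r[0])
```

The alerter and accumulator are generators, so they process items as they arrive instead of waiting for the stream to end.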
22. Alerting vs. Querying
23. Access Modes: Who Decides
[Figure: grid of four access modes (Post, Push, Poll, Pull), arranged by who decides when data moves (producer or consumer) and who decides what data moves (producer or consumer)]
24. Assembling Applications from Components
- Akamai FreeFlow (see NASDAQ site)
- Splitting, replication, merge, adapters
[Figure labels: Browser, Merge, Base Server, Field Server (three), Web Content, Text, Graphics, Split, Replicate, Push, Pull]
25. NIAGARA Project
- Initial investigation of NDM based on XML
- University of Wisconsin and OGI
- Stream-oriented XML-QL evaluator
- Text-in-context search
- NiagaraCQ
- Merge operator (and rest of algebra)
- XML Firehose
26. Use of NDM for PetDB
- NetQueries encode procedures for reconstituting data
- Monitoring sources of interest
- Replication, splitting, push, accumulators, semantic routing for staging data
- NetQuery to inform an archive server what to save
- Archives, semantic caches express what they already hold with a NetQuery
27. Building the PetDB System
[Figure labels: Context Mgr., Task Analyzer, Profiler, Petster, Stager (three), Private Archive, Pet DB, Replicate Server, IP Server, Secure Local Cache, Back Quote, Data Kennel, WebSnap, Indexer, Stream Processor, Internet Monitor, Public Archives]
28. What Else is Needed?
- Superimposed Information
- Much of my unique content is an organizational overlay on base data
- Small-footprint data managers
- Presentation model of stream data
- Authorization and Authentication
- QoS control, content scaling
- Intelligent prediction, learning
- Secure staging areas