Zachary G. Ives

1
Introduction
  • Zachary G. Ives
  • University of Pennsylvania
  • CIS 650 Implementing Data Management Systems
  • January 10, 2005

2
Welcome to CIS 650, Database and Information
Systems!
  • Instructor: Zachary Ives, zives_at_cis
  • 576 Levine Hall North
  • Office hours: Tuesday, 2:30-3:30 PM (before
    colloquium)
  • Home page: www.seas.upenn.edu/zives/cis650/
  • Texts and readings:
  • Hellerstein and Stonebraker, Readings in Database
    Systems, 4th ed.
  • (Should be available soon)
  • Supplementary papers (will be linked via schedule)

3
Course Format and Grading
  • Very discussion-oriented: about one topic area
    per week or two
  • Readings in the text and other research papers,
    with summaries/commentary on papers (20%)
  • Midterm report (25%)
  • You'll take one of the topics we've discussed and
    write a summary and synthesis paper
  • Graded for organization, clarity, grammar, etc.
    as well as content
  • Project (50%); may choose to work in teams
  • Implementation
  • Experimentation / validation
  • Project report (should be in the style of a
    research paper)
  • Brief (15-minute) presentation
  • Participation, discussion, intangibles (5%)
  • At the end, you should be equipped to do research
    in this field, or to take ideas from databases
    and apply them to your field

4
So What Is This Course About?
  • Not how to build an Oracle-driven Web site
  • nor even how to build Oracle

5
What Is Unique about Data Management?
  • It's been said that databases and data management
    focus on scalability to huge volumes of data
  • What is it that makes this possible, and what
    makes the work interesting if NOT at huge scale?
  • Why are data management techniques useful in
    situations where scale isn't the bottleneck?

6
The Key Principle: Data Independence
  • Most methods of programming don't separate the
    logical and physical representations of data
  • The data structures, access methods, etc. are all
    given via interfaces!
  • The relational data model was the first model for
    data that is independent of its data structures
    and implementation

7
What Is Data Independence?
  • Codd points out that previous methods had:
  • Order dependence
  • Index dependence
  • Access path dependence
  • Still true in today's Java/C: what is the
    drawback?
  • What might you be able to do in removing those?
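
One way to see what removing these dependences buys: a minimal sketch using Python's built-in sqlite3 module, where the same declarative query returns the same answer whether or not an index exists. The table, column names, and sample rows are assumptions made up for illustration.

```python
import sqlite3

def run_query(with_index: bool) -> list:
    """Run the same declarative query with or without an index on the table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE student (sid INTEGER, name TEXT, year INTEGER)")
    # Rows are inserted in arbitrary order; the query never relies on that order.
    conn.executemany(
        "INSERT INTO student VALUES (?, ?, ?)",
        [(3, "Carol", 2), (1, "Alice", 1), (2, "Bob", 2)],
    )
    if with_index:
        # A purely physical change: the query text below is unaffected.
        conn.execute("CREATE INDEX student_year ON student (year)")
    rows = conn.execute(
        "SELECT name FROM student WHERE year = 2 ORDER BY name"
    ).fetchall()
    conn.close()
    return rows

without_index = run_query(with_index=False)
with_index = run_query(with_index=True)
print(without_index == with_index)
```

The application code never names the index or the storage order, so the system is free to add, drop, or change access paths without breaking the program.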

8
The Relational Data Model
  • More than just tables!
  • True relations: sets of tuples
  • The only data representation a user/programmer
    sees
  • Explicit encoding of everything in values
  • Additional integrity constraints
  • Key constraints, functional dependencies, ...
  • General and universal means of encoding
    everything!
  • (Semantics are pushed to queries)
  • A secondary concept: views
  • Define virtual, derived relations that are always
    live
  • A way of encapsulating, abstracting data
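
The idea of a view as a virtual, always-live derived relation can be sketched with sqlite3. This is only an illustration; the `takes` table, the `course_count` view, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE takes (sid INTEGER, cid TEXT)")
conn.executemany(
    "INSERT INTO takes VALUES (?, ?)",
    [(1, "CIS650"), (2, "CIS650")],
)

# The view stores no data of its own; it is a named query over the base table.
conn.execute(
    "CREATE VIEW course_count AS "
    "SELECT cid, COUNT(*) AS n FROM takes GROUP BY cid"
)

before = conn.execute(
    "SELECT n FROM course_count WHERE cid = 'CIS650'"
).fetchone()[0]

# Updating the base relation is immediately visible through the view.
conn.execute("INSERT INTO takes VALUES (3, 'CIS650')")
after = conn.execute(
    "SELECT n FROM course_count WHERE cid = 'CIS650'"
).fetchone()[0]

print(before, after)
```

A client that reads only `course_count` is insulated from how `takes` is laid out, which is the encapsulation role the slide describes.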

9
Constraints and Normalization
  • Fundamental idea: we don't want to build
    semantics into the data model, but we want to be
    able to encode certain constraints
  • Functional dependencies, key constraints,
    foreign-key constraints, multivalued
    dependencies, join dependencies, etc.
  • Allows limited data validation, plus
    opportunities for optimization
  • The theory of normalization (see CSE 330, CIS
    550) makes use of known constraints
  • Idea: eliminate redundancy, in order to maintain
    consistency in the presence of updates
  • (Note that there's no reason for normalization of
    data in views!)
  • Ergo, XML???
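
A small sketch of why redundancy threatens consistency under updates. The helper `holds` and the sample relations are hypothetical; the functional dependency cid -> instructor is assumed purely for illustration.

```python
def holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        # The first value seen for a key must match every later one.
        if seen.setdefault(key, val) != val:
            return False
    return True

# Unnormalized: the instructor is repeated once per enrolled student.
enrollment = [
    {"sid": 1, "cid": "CIS650", "instructor": "Ives"},
    {"sid": 2, "cid": "CIS650", "instructor": "Ives"},
]
fd_before = holds(enrollment, ["cid"], ["instructor"])

# An update that touches only one copy breaks the dependency.
enrollment[0]["instructor"] = "Someone Else"
fd_after = holds(enrollment, ["cid"], ["instructor"])

# Normalized: the fact is stored once, so one update changes it everywhere.
takes = {(1, "CIS650"), (2, "CIS650")}
course = {"CIS650": "Ives"}
course["CIS650"] = "Someone Else"
instructors_seen = {course[cid] for (_sid, cid) in takes}

print(fd_before, fd_after, instructors_seen)
```

In the normalized design there is no second copy to forget, so the constraint cannot be violated by a partial update.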

10
Relational Completeness (Plus Extensions) and
Declarativity
  • What is special about relational query languages
    that makes them amenable to scalability?
  • Limited expressiveness, particularly when we
    consider conjunctive queries (even with
    recursion)
  • Guaranteed polytime execution in size of data
  • Can reason about containment, invert them, etc.
  • Magic sets
  • (What about XQuery's Turing-completeness???)
  • Equivalence between relational calculus and
    algebra
  • Calculus → fully declarative, basis of query
    languages
  • Algebra → imperative but polytime, basis of
    runtime systems
  • Predictability of operations → cost models
  • Ability to supplement data with auxiliary
    structures for performance
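
The calculus/algebra equivalence can be sketched in a few lines. Below, a calculus-style comprehension states what is wanted, while a composition of hypothetical `select`, `project`, and `join` operators states how to compute it; both are assumptions written for illustration, not any system's actual operators. The join builds a hash table, a small example of an auxiliary structure added for performance.

```python
def select(rel, pred):
    """Selection: keep the tuples satisfying pred."""
    return [t for t in rel if pred(t)]

def project(rel, attrs):
    """Projection with set semantics (duplicates removed)."""
    out = []
    for t in rel:
        p = {a: t[a] for a in attrs}
        if p not in out:
            out.append(p)
    return out

def join(r, s, r_attr, s_attr):
    """Equijoin using a hash table built over s."""
    index = {}
    for t in s:
        index.setdefault(t[s_attr], []).append(t)
    return [{**a, **b} for a in r for b in index.get(a[r_attr], [])]

student = [
    {"sid": 1, "name": "Alice"},
    {"sid": 2, "name": "Bob"},
    {"sid": 3, "name": "Carol"},
]
takes = [
    {"sid": 1, "cid": "CIS650"},
    {"sid": 2, "cid": "CIS550"},
    {"sid": 3, "cid": "CIS650"},
]

# Calculus style: describe the answer, not the steps.
declarative = sorted(
    {s["name"] for s in student for t in takes
     if s["sid"] == t["sid"] and t["cid"] == "CIS650"}
)

# Algebra style: an explicit plan of operators.
plan = project(
    select(join(student, takes, "sid", "sid"),
           lambda t: t["cid"] == "CIS650"),
    ["name"],
)
algebraic = sorted(t["name"] for t in plan)

print(declarative == algebraic)
```

Because each operator has a predictable cost in the size of its inputs, an optimizer can compare alternative plans for the same declarative query.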

11
Concurrency and Reliability (Generally requires
full control)
  • Another key element of databases: ACID
    properties
  • Atomicity, Consistency, Isolation, Durability
  • Transaction: an atomic sequence of database
    actions (read/write) on data items (e.g. calendar
    entry)
  • Recoverability via a log keeping track of all
    actions carried out by the database
  • How do distributed systems, Web services,
    service-oriented architectures, and the like
    affect these properties?
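
Atomicity can be sketched with sqlite3, borrowing the calendar-entry example. The `calendar` table, the `schedule` helper, and the sample entries are hypothetical; the sketch relies on the sqlite3 connection acting as a context manager that commits on success and rolls back on an exception.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE calendar (person TEXT, slot TEXT, entry TEXT, "
    "PRIMARY KEY (person, slot))"
)
conn.execute("INSERT INTO calendar VALUES ('bob', 'Tue 14:30', 'office hours')")
conn.commit()

def schedule(conn, people, slot, entry):
    """Book the same slot for everyone, or for no one."""
    try:
        with conn:  # commit if the block succeeds, roll back if it raises
            for person in people:
                conn.execute(
                    "INSERT INTO calendar VALUES (?, ?, ?)",
                    (person, slot, entry),
                )
        return True
    except sqlite3.IntegrityError:
        # Someone already has an entry in that slot; nothing was kept.
        return False

# Alice's insert succeeds first, then Bob's conflicts: both are undone.
clash = schedule(conn, ["alice", "bob"], "Tue 14:30", "project meeting")
alice_entries = conn.execute(
    "SELECT COUNT(*) FROM calendar WHERE person = 'alice'"
).fetchone()[0]

booked = schedule(conn, ["alice", "bob"], "Wed 10:00", "project meeting")
total = conn.execute("SELECT COUNT(*) FROM calendar").fetchone()[0]

print(clash, alice_entries, booked, total)
```

This works because one system controls all of the data; the question on the slide is what happens when the two calendars live in separate services.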

12
Other Data Models
  • Concepts from the relational data model have been
    adapted to form object-oriented data models (with
    classes and subclasses), XML models, etc.
  • But doesn't this result in some loss of
    logical-physical independence?
  • GMAP and answering queries using views?

13
What Is a Data Management System?
  • Of course, there are traditional databases
  • The focus of most work in the past 25 years
  • Tight loops due to locally controlled data
  • Indexing, transactions, concurrency, recovery,
    optimization
  • But ...

14
80% of the World's Data Is Not in Databases!
  • Examples:
  • Scientific data (large images, complex programs
    that analyze the data)
  • Personal data
  • WWW and email (some of it is stored in something
    resembling a DBMS)
  • Network traffic logs
  • Sensor data
  • Are there benefits to declarative techniques and
    data independence in tackling these issues?
  • XML is a great way to make this data available
  • Also need to deal with data we don't control and
    can't guarantee consistency over

15
An Example of Data Management with Heterogeneity:
Data Integration
[Figure labels: Mediated Schema, XML]
  • A layer above heterogeneous sources, to combine
    them under a unified logical abstraction
  • Some of these are databases over which we have no
    control
  • Some must be accessed in special ways
  • Data integration system translates queries over
    the mediated schema to the languages of the
    sources, and converts answers to the mediated
    schema
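
A minimal sketch of the mediator idea, assuming two made-up sources with different layouts and one wrapper per source. The mediated schema `course(cid, title)`, the source contents, and all function names are hypothetical.

```python
# Source A exposes tuples of (course_no, name).
SOURCE_A = [("X1", "Course One"), ("X2", "Course Two")]
# Source B exposes records with different field names.
SOURCE_B = [{"code": "X3", "label": "Course Three"}]

def wrapper_a(cid=None):
    """Translate a mediated-schema query into a scan of source A."""
    return [
        {"cid": no, "title": name}
        for (no, name) in SOURCE_A
        if cid is None or no == cid
    ]

def wrapper_b(cid=None):
    """Translate a mediated-schema query into a scan of source B."""
    return [
        {"cid": rec["code"], "title": rec["label"]}
        for rec in SOURCE_B
        if cid is None or rec["code"] == cid
    ]

def query_mediated(cid=None):
    """Answer course(cid, title) by asking every source and merging answers."""
    results = []
    for wrapper in (wrapper_a, wrapper_b):
        results.extend(wrapper(cid))
    return results

one_course = query_mediated("X2")
all_courses = query_mediated()
print(one_course, len(all_courses))
```

The caller sees only the mediated schema; adding a third source means writing one more wrapper, not changing any query.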

16
Other Interesting Points
  • Data streams and sensor data
  • How do we process infinite amounts of data?
  • Peer-to-peer architectures
  • What's the best way of finding data here?
  • Personal information management
  • Can we use integration-style concepts and a bit
    of AI to manage associations between our data?
  • Web search
  • What's the back-end behind Google?
  • Semantic Web
  • How do we semantically interrelate data to build
    a better Web?

17
Layers of a Typical Data Management System
[Figure (simplified): layers of a data management system. Labels: API/GUI,
Query, Optimizer, Stats, Physical plan, Exec. Engine, Logging/recovery,
Schemas, Catalog, Access Methods, Buffer Mgr, Physical retrieval, Data
Source; arrows carry Requests, Data/etc, and Pages between layers. Color
key: red = logical, blue = physical.]

18
Query Answering in a Data Management System
  • Based on declarative query languages
  • Based on restricted first-order logic expressions
    over relations
  • Not procedural: defines constraints on the
    output
  • Converted by the query optimizer into a query
    plan that exploits properties, which the query
    execution engine runs over the data
  • Data may be local or remote
  • Data may be heterogeneous or homogeneous
  • Data sources may have different interfaces,
    access methods, etc.
  • Most common query languages:
  • SQL (based on tuple relational calculus)
  • Datalog (based on domain relational calculus,
    plus fixpoint)
  • XQuery (functional language; has an XML calculus
    core)
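
To make "plus fixpoint" concrete, here is a sketch of naive bottom-up evaluation for a two-rule recursive Datalog program. The `edge` facts and the function name are hypothetical, and real engines typically use more efficient strategies than this naive loop.

```python
def transitive_closure(edges):
    """Naive bottom-up evaluation of the Datalog program:
         path(x, y) :- edge(x, y).
         path(x, z) :- path(x, y), edge(y, z).
    """
    path = set(edges)
    while True:
        # Apply the recursive rule once to everything derived so far.
        derived = {
            (x, z)
            for (x, y) in path
            for (y2, z) in edges
            if y == y2
        }
        new_facts = derived - path
        if not new_facts:
            # Fixpoint: another round would add nothing.
            return path
        path |= new_facts

edges = {("a", "b"), ("b", "c"), ("c", "d")}
paths = transitive_closure(edges)
print(sorted(paths))
```

Each round can only add facts drawn from a finite set of constant pairs, so the loop is guaranteed to terminate, which is one reason recursion here stays tractable.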

19
Processing the Query
[Figure labels: Web Server / UI / etc; Execution Engine; Optimizer;
Storage Subsystem]
SELECT * FROM STUDENT, Takes, COURSE
WHERE STUDENT.sid = Takes.sID AND
Takes.cID = cid
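
A sketch of this three-way join running end to end in SQLite, assuming equality comparisons in the WHERE clause. The table contents and the `name` and `title` columns are made up for illustration. The last join column is qualified as COURSE.cid because SQLite treats column names case-insensitively, so a bare `cid` would be ambiguous between Takes and COURSE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE STUDENT (sid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE COURSE  (cid TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE Takes   (sID INTEGER, cID TEXT);
    INSERT INTO STUDENT VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO COURSE  VALUES ('CIS650', 'Implementing Data Management Systems');
    INSERT INTO Takes   VALUES (1, 'CIS650');
""")

QUERY = """
    SELECT STUDENT.name, COURSE.title
    FROM STUDENT, Takes, COURSE
    WHERE STUDENT.sid = Takes.sID AND Takes.cID = COURSE.cid
"""

# Ask the optimizer which plan it chose; the text is implementation-specific.
plan = conn.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
for step in plan:
    print(step[-1])

# The execution engine then runs that plan over the stored data.
rows = conn.execute(QUERY).fetchall()
print(rows)
```

The query text names no join order or access method; the plan printed by EXPLAIN QUERY PLAN is where those physical choices appear.
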
20
DBMSs in the Real World
  • Big, mature relational databases
  • IBM, Oracle, Microsoft
  • Middleware above these
  • SAP, PeopleSoft, dozens of special-purpose apps
  • Application servers
  • Integration and warehousing systems
  • Current trends
  • Web services, XML everywhere
  • Smarter, self-tuning systems
  • Stream systems

21
Our Agenda this Semester
  • Reading the canonical papers in the data
    management literature
  • Some are very systems-y
  • Some are very experimental
  • Some are highly algorithmic, complexity-oriented
  • Gaining an understanding of the principles of
    building systems to handle declarative queries
    over large volumes of data

22
For Next Time
  • Skim Codd if you haven't already
  • Read the overview papers of the first two
    database systems
  • Astrahan et al., pp. 117-
  • Wong et al. (skip Section 2; focus on pp. 200-)
  • Write a summary of your assigned paper and email
    it to me at zives_at_cis
  • Key question: how well did this system mesh with
    Codd's relational model? (You may need to skim
    through other aspects of your assigned paper to
    help answer that question)