Zachary G. Ives - PowerPoint PPT Presentation

About This Presentation
Title:

Zachary G. Ives

Description:

Sounds like operating systems all over again! 19. Database Architecture: The Traditional View ... a query plan that exploits properties; run over the data by ... – PowerPoint PPT presentation

Number of Views:110
Avg rating:3.0/5.0
Slides: 36
Provided by: zack4
Category:
Tags: ives | zachary

less

Transcript and Presenter's Notes

Title: Zachary G. Ives


1
Introduction
  • Zachary G. Ives
  • University of Pennsylvania
  • CIS 550 Database Information Systems
  • September 4, 2003

Some slide content courtesy of Susan Davidson
Raghu Ramakrishnan
2
Welcome!
  • To CIS 550, officially Database Information
    Systems in the course catalog
  • A tour of the data management field
  • A question for you what does this really
    mean? What is this course (and the field) about?

3
What The Course Is Not(A Few Warnings)
  • This is not a course on Oracle or SQL
  • It may not directly impact your marketability
  • Its an investigation into the principles of data
    management which will improve your
    understanding
  • This course will not be a cakewalk!
  • The data management field is broad, and well
    touch on many subjects at a rapid pace
  • 8 homeworks, paper summaries, term project,
    midterm, final,
  • This course is not suitable for people with a
    limited programming background need skills in
  • Algorithms data structures
  • Logic
  • Programming languages
  • Perhaps even a little complexity theory!

4
What The Course Will Do
  • Most CS courses concentrate on code now youll
    understand data management and representation
  • In the end, its all about the data!
  • Background in most of the important areas
  • Data design, modeling
  • Understanding of DB system internals, performance
  • Understanding of data-driven systems (e.g., web
    sites)
  • An understanding of the complexities of
    integrating data perhaps the biggest CS problem
    today
  • Understanding of what research topics in data
    management are

5
Administrivia
  • Instructor Zachary Ives, zives_at_cis
  • Levine 611 (until end of Sept. then 5th floor
    GRW)
  • Office hours Tuesday, 300-400PM (after class)
  • TA Dinkar Gupta, dinkar_at_grasp.cis
  • Office hours on Monday time 300-400PM (office
    TBA)
  • Newsgroup upenn.cis.cis550
  • Home page www.seas.upenn.edu/zives/cis550/
  • Text(s)
  • Ramakrishnan Gerke, Database Systems, 3rd ed.
  • Supplementary papers (to be handed out in class)
  • Other books may be useful (see web page)

6
Course Format and Grading
  • Well cover roughly one major topic area per week
    to two weeks
  • Readings in the text research papers
  • Occasionally, summaries/commentary on papers (5)
  • Homework assignment for each topic area (30)
  • One midterm (10), one final exam (20)
  • Project (30) groups of 3-4
  • Build a blogging system on top of a database, or
  • Build a P2P data sharing system for XML data
  • (Or propose your own idea)
  • General participation, discussion, intangibles
    (5)

7
Diving In
  • What is a database and a DBMS?
  • Why do we need a DBMS?
  • Database and data management architectures
  • Process of building a DB
  • DBMS components

8
Whats Data Management?
  • In the 1960s and early 70s
  • file formats, traversals, indexes
  • In the 1980s (mid- to late-70s in research) and
    90s
  • Separation of logical physical data
    representations
  • Well-defined general purpose, declarative data
    manipulation language (DML) and data definition
    language (DDL)
  • Reliable, consistent storage concurrency
    control
  • Sophisticated system that takes DDL statements
    and knowledge of physical data representations
    and produces answers in optimized way
  • Today
  • All that plus managing and manipulating data in
    many models and representations

9
What is a DBMS?
  • A database (DB) is a large, integrated collection
    of data
  • Generally is cohesive in some way
  • A DB models a real-world organization or unit
  • A database management system (DBMS) is a software
    package designed to store and manage databases
  • Reliable storage recovery of 100s of GB
  • Querying/updating interface and API
  • Support for many concurrent users

10
Connection to OtherAreas of CS
  • Programming languages and software engineering
    (obviously)
  • Algorithms (obviously)
  • Logic, discrete math, and theory of computation
  • Systems concurrency, operating systems, file
    organization and networks, peer-to-peer,
  • Web (and Semantic Web), information retrieval,
    digital libraries, software agents,
  • AI planning and machine learning

11
But 80 of the Worlds Data is Not in a DB!
  • Examples
  • scientific data (large images, complex programs
    that analyze the data)
  • personal data
  • WWW
  • Data management is expanding to tackle these
    problems
  • Flexibility data management imposes many
    constraints to make problems solvable
  • Must deal with entities outside our control
  • In this course, well start by focusing on
    databases, but eventually look outside the box

12
Why Not Program up Databases As Needed?
  • For simple (single-concept) and small databases
    this is often the best solution
  • Flat files and grep get us a long way
  • But there are limits
  • The structure is complicated (more than a simple
    table)
  • The database gets large (e.g., bigger than RAM)
  • Many people want to use it simultaneously
  • Need for reliable recovery from crashes
  • Updates generally require complete rewrite of file

13
Example Palm-Style Calendar
  • We might start by building a file with the
    following structure
  • This text file is easy to deal with. So there's
    no need for a DBMS! Right?

14
Problem 1 Data Organization
  • Consider the all-important who field. Do we also
    want to keep e-mail addresses, telephone numbers
    etc.?
  • Expand our file to look like
  • Now we are keeping our address book in our
    calendar and doing so redundantly

15
Link Calendar with Address Book?
  • Two conceptual entities contact information
    and calendar with a relationship between them,
    linking people in the calendar to their contact
    information
  • This link could be based on something as simple
    as the person's name

16
Problem 2 Efficiency
  • Size of personal address book is probably less
    than one hundred entries, but there are things
    we'd like to do quickly and efficiently
  • Give me all appointments on 10/28
  • When am I next meeting Jim?
  • Program these as quickly as possible (and make
    them resilient to data format changes)
  • Have these programs executed efficiently
  • What would happen if you were using a corporate
    calendar with hundreds of thousands of entries?

17
Problem 3 Concurrency and Reliability
  • Suppose other people are allowed access to your
    calendar and are allowed to modify it? How do we
    stop two people changing the file at the same
    time and leaving it in a physical (or logical)
    mess?
  • Suppose the system crashes while we are changing
    the calendar. How do we recover our work?

18
Transactions
  • Key concept for concurrency is that of a
    transaction an atomic sequence of database
    actions (read/write) on data items (e.g. calendar
    entry).
  • Key concept for recoverability is that of a log
    keeping track of all actions carried out by the
    db.
  • Sounds like operating systems all over again!

19
Database Architecture The Traditional View
  • It is common to describe databases in two ways
  • Logical structure
  • What users see. The program or query language
    interface.
  • Physical structure
  • How files are organized. What indexing
    mechanisms are used.
  • The logical level is further split into two
    components
  • Overall database design (conceptual seen by the
    DB designer)
  • Views that various users get to see

20
Three-level Architecture
View 1 View 2 View N
Schema
Conceptual Level
21
Data Independence
  • A user of a relational database system should be
    able to use query the database without knowing
    about how the precisely how data is stored, e.g.
  • After all, you don't worry much how numbers are
    stored when you program some arithmetic or use a
    computer-based calculator

SELECT When, Where FROM Calendar WHERE Who
"Bill"
22
More on Data Independence
  • Logical data independence
  • Protects the user from changes in the logical
    structure of the data could reorganize the
    calendar schema without changing how I query it
  • Physical data independence
  • Protects the user from changes in the physical
    structure of data could add an index on who (or
    sort by when) without changing how the user would
    write the query, but the query would execute
    faster (query optimization)

23
That's the Traditional View, But ...
  • Three-level architecture is not always
    achievable when databases get big, queries must
    be carefully written to achieve efficiency
  • Also, may need a 4th tier Sometimes this is
    called middleware

24
Combining Databases with Mediators(a kind of
middleware)
Mediated Schema
XML
  • May need to add further layers to combine
    multiple databases/sources on the Web
  • Some of these are databases over which we have
    no control
  • Some must be accessed in special ways

25
Data-Driven Web Sites Consumers of Database
Output
HTML
view
Processing
  • Data driven web sites also add an HTML
    presentation layer on top of what weve seen

26
The Process of Building a Database
  • Start with a conceptual model
  • Design implement schema
  • Write applications using DBMS and other tools
  • Many ways of doing this where the hard problems
    are taken care of by other people (DBMS, API
    writers, library authors, web server, etc.)

27
Conceptual Design
PROFESSOR
fid
name
Teaches
Takes
STUDENT
COURSE
cid
name
semester
sid
name
28
Designing a Schema (Set of Relations)
STUDENT
COURSE
Takes
  • Convert to tables constraints
  • Then need to do physical design the layout on
    disk, indices, etc.

cid name sem
550-0103 DB F03
700-1003 AI S03
501-0103 Arch F03
sid name
1 Jill
2 Qun
3 Nitin
sid cid
1 550-0103
1 700-1003
3 500-0103
PROFESSOR
Teaches
fid name
1 Ives
2 Saul
8 Roth
fid cid
1 550-0103
2 700-1003
8 501-0103
29
Applications Use Queries in SQL
  • Structured Query Language
  • Based on restricted first-order logic expressions
    over relations
  • Not procedural defines constraints on the
    output
  • Converted into a query plan that exploits
    properties run over the data by the query
    optimizer and query execution engine
  • lthtmlgt
  • ltbodygt
  • lt!-- hypotheticalEmbeddedSQL
  • SELECT FROM STUDENT, Takes, COURSE
  • WHERE STUDENT.sid Takes.sID
  • AND Takes.cID cid
  • --gt
  • lt/bodygt
  • lt/htmlgt

30
Processing the Query
Web Server / UI / etc
Execution Engine
Optimizer
Storage Subsystem
SELECT FROM STUDENT, Takes, COURSE
WHERE STUDENT.sid Takes.sID AND
Takes.cID cid
31
DBMS in a Bit More Detail
API/GUI
(Simplification!)
Query
Optimizer
Stats
Physical plan
Exec. Engine
Logging, recovery
Schemas
Catalog
Requests
Data/etc
Index/file/rec Mgr
Data/etc
Requests
Buffer Mgr
Pages
Pages
Storage Mgr
Requests
Data
Storage
32
DBMSs in the Wild
  • A huge industry for 20 of the worlds data!
  • Big, mature relational databases
  • IBM, Oracle, Microsoft
  • Middleware above these
  • SAP, PeopleSoft, dozens of special-purpose apps
  • Application servers
  • Integration and warehousing systems
  • Trends
  • More integration web services XML everywhere
  • Smarter, self-tuning systems

33
The Research World
  • Conventional databases arent interesting!
  • Understanding whats possible to do with XML
  • Better query processing
  • Better languages for meta-info (e.g.,
    constraints)
  • Data streams
  • Peer-to-peer
  • Integrating data from different formats
  • Lots of theory and systems-building

34
In this Course...
  • Study relational databases, their design, how to
    query, what forms of indices to use.
  • Beyond relational algebra a logical model of
    data (Datalog), recursion
  • XML and semi-structured data models
  • Understanding DB internals
  • How DBs are built
  • Performance implications
  • Integrating and mediating between databases (a
    huge problem today)

35
Questions?
Dilbert, 8/9/2003 (via online archive)
Write a Comment
User Comments (0)
About PowerShow.com