CSE490i Advanced Internet Systems - PowerPoint PPT Presentation

About This Presentation
Title:

CSE490i Advanced Internet Systems

Description:

CSE494 - Information Retrieval, Mining and Integration on the Internet ... 1981 President Reagan shot & wounded by John W Hinckley Jr ... – PowerPoint PPT presentation

Number of Views:29
Avg rating:3.0/5.0
Slides: 52
Provided by: supp176
Category:

less

Transcript and Presenter's Notes

Title: CSE490i Advanced Internet Systems


1
CSE494 - Information Retrieval, Mining and
Integration on the InternetDatabase Concepts -
A Refresher30th March 2004
2
This Day in History
  • 1867 US purchases Alaska from Russia for
    7.2 million (2 cents/acre)
  • 1953 Einstein announces revised unified field
    theory
  • 1954 Test Cricket debut of Sir Garry Sobers
    vs. England
  • 1981 President Reagan shot wounded by John
    W Hinckley Jr
  • 2004 The first ever regular class of Rao
    taught by someone other than Rao

3
Concepts covered so far
  • Information Retrieval
  • Text retrieval
  • Hyper-linked text retrieval
  • Improvements
  • Information Mining
  • Clustering techniques to improve result
    presentation
  • Classification and filtering techniques

4
Structured data..
  • Focus on text data till date.
  • However, a lot of the data available on the web
    is actually from (semi-)structured databases !!!!
  • They do their best to look like they are text
    sources
  • What are the issues and opportunities brought up
    by the presence of such sources on the web?

5
Databases !!!??? you may have used
6
Is the a DBMS?
Skeptics corner
  • Fairly sophisticated search available
  • crawler indexes pages on the web
  • Keyword-based search for pages
  • But, currently
  • data is mostly unstructured and untyped
  • search only
  • cant modify the data
  • cant get summaries, complex combinations of data
  • Web sites typically have a DBMS in the background
    to provide these functions.
  • They dynamically convert (wrap) the structured
    data into readable English
  • ltIndia, New Delhigt gt The capital of India is
    New Delhi.
  • So, if we can unwrap the text, we have
    structured data!
  • Note also that such dynamic pages cannot be
    crawled...
  • The (coming) Semi-structured web
  • Most pages are at least semi-structured
  • XML standard is expected to ease the
    presentation/on-the-wire transfer of such pages.
    (BUT..)
  • The Services
  • Travel services, mapping services
  • The Sensors

7
Structure
An employee record
A generic web page containing text
A movie review
  • How will search and querying on these three types
    of data differ?

Semi-Structured
8
Search vs. Query
  • What if you wanted to find out which actors
    donated to Al Gores presidential campaign?
  • Try actors donated to gore in your favorite
    search engine.

9
Structure helps querying
  • Expressive queries
  • Give me all pages that have key words Get Rich
    Quick
  • Give me the social security numbers of all the
    employees who have stayed with the company for
    more than 5 years, and whose yearly salaries are
    three standard deviations away from the average
    salary
  • Give me all mails from people from ASU written
    this year, which are relevant to get rich quick
  • Efficient searching
  • equality vs. similarity
  • range-limited search

10
Why use a DBMS in your website?
  • Suppose we are building web-based music
    distribution site.
  • Several questions arise
  • How do we store the data? (file organization,
    etc.)
  • How do we query the data? (write programs)
  • Make sure that updates dont mess things up?
  • Provide different views on the data? (registrar
    versus students)
  • How do we deal with crashes?
  • Way too complicated!
  • Buy a database system!

11
What Is a Database System?
  • Database
    a very
    large, integrated collection of data.
  • Models a real-world enterprise
  • Entities (e.g., teams, games)
  • Relationships
  • (e.g., The Patriots are playing in The
    Superbowl)
  • More recently, also includes active components ,
    often called business logic. (e.g., the BCS
    ranking system)
  • A Database Management System (DBMS) is a software
    system designed to store, manage, and facilitate
    access to databases.

12
Functionality of a DBMS
  • Data Dictionary Management
  • Storage management
  • Data storage Definition Language (DDL)
  • High level query and data manipulation language
  • SQL/XQuery etc.
  • May tell us what we are missing in text-based
    search
  • Efficient query processing
  • May change in the internet scenario
  • Transaction processing
  • Resiliency recovery from crashes,
  • Different views of the data, security
  • May be useful to model a collection of databases
    together
  • Interface with programming languages

13
Traditional Database Architecture
14
Building an Application with a Database System
  • Requirements modeling (conceptual, pictures)
  • Decide what entities should be part of the
    application and how they should be linked.
  • Schema design and implementation
  • Decide on a set of tables, attributes.
  • Define the tables in the database system.
  • Populate database (insert tuples).
  • Write application programs using the DBMS
  • Now much easier, with data management API

15

Conceptual Modeling
ssn
16
Data Models
  • A data model is a collection of concepts for
    describing data.
  • A schema is a description of a particular
    collection of data, using a given data model.
  • The relational model of data is the most widely
    used model today.
  • Main concept relation, basically a table with
    rows and columns.
  • Every relation has a schema, which describes the
    columns, or fields.

17
Levels of Abstraction
  • Views describe how users see the data.
  • Conceptual schema defines logical structure
  • Physical schema describes the files and indexes
    used.

18
Example University Database
  • Conceptual schema
  • Students(sid string, name string,
    login string, age integer, gpareal)
  • Courses(cid string, cnamestring,
    creditsinteger)
  • External Schema (View)
  • Course_info(cidstring,enrollmentinteger)
  • Physical schema
  • Relations stored as unordered files.
  • Index on first column of Students.

19
Data Independence
  • Applications insulated from
  • how data is structured and stored.
  • Logical data independence Protection from
    changes in logical structure of data.
  • Physical data independence Protection from
    changes in physical structure of data.
  • Q Why are these particularly important for DBMS?

20
Schema Design Implementation
  • Table Students
  • Separates the logical view from the physical view
    of the data.

21
Terminology
Attribute names
Students
tuples
(Arity3)
22
Querying a Database
  • Find all the students taking CSE594 in Q1, 2004
  • S(tructured) Q(uery) L(anguage)
  • select E.name
  • from Enroll E
  • where E.courseCS490i and
  • E.quarterWinter, 2000
  • Query processor figures out how to answer the
    query efficiently.

23
Relational Algebra
  • Operators
  • tuple sets as input, new set as output
  • Basic Binary Set Operators
  • Result is table (set) with same attributes
  • Sets must be compatible!
  • R1(A1,A2,A3) ? R2(B1,B2,B3)
  • ? Domain(Ai) Domain(Bi)
  • Union
  • All tuples in either R1 or in R2
  • Intersection
  • All tuples in both R1 and R2
  • Difference
  • All tuples in R1 but not in R2
  • Complement
  • All tuples not in R1
  • Selection, Projection, Cartesian Product, Join

whats the universe?
24
Selection s
  • Grab a subset of the tuples in a relation that
    satisfy a given condition
  • Use and, or, not, gt, lt to build condition
  • Unary operation returns set with same
    attributes, but selects rows

25
Selection Example
Employee
SSN
Name
DepartmentID
Salary
999999999
John
1
30,000
777777777
Tony
1
32,000
888888888
Alice
2
45,000
26
Projection p
  • Unary operation, selects columns
  • Returned schema is different,
  • So returned tuples are not subset of original set
  • Contrast with selection
  • Eliminates duplicate tuples

27
(No Transcript)
28
Cartesian Product X
  • Binary Operation
  • Result is set of tuples combining all elements of
    R1 with all elements of R2, for R1 ? R2
  • Schema is union of Schema(R1) Schema(R2)
  • Notice we could do selection on result to get
    meaningful info!

29
Cartesian Product Example
30
Join
  • Most common (and exciting!) operator
  • Combines 2 relations
  • Selecting only related tuples
  • Result has all attributes of the two relations
  • Equivalent to
  • Cross product followed by selection followed by
    Projection
  • Equijoin
  • Join condition is equality between two attributes
  • Natural join
  • Equijoin on attributes of same name
  • result has only one copy of join condition
    attribute

31
Example Natural Join
32
Complex Queries
Product ( pname, price, category,
maker) Purchase (buyer, seller, store,
prodname) Company (cname, stock price,
country) Person( per-name, phone number, city)
Find phone numbers of people who bought gizmos
from Fred. Find telephony products that
somebody bought
33
Exercises
Product ( pname, price, category,
maker) Purchase (buyer, seller, store,
prodname) Company (cname, stock price,
country) Person( per-name, phone number,
city) Ex 1 Find people who bought telephony
products. Ex 2 Find names of people who bought
American products Ex 3 Find names of people who
bought American products and did not
buy French products Ex 4 Find names of people
who bought American products and they
live in Seattle. Ex 5 Find people who bought
stuff from Joe or bought products
from a company whose stock prices is more than
50.
34
SQL Introduction
Standard language for querying and manipulating
data Structured Query
Language
Many standards out there SQL92, SQL2, SQL3,
SQL99 Vendors support various subsets of
these (but well only discuss a subset of what
they support) Basic form syntax on relational
algebra (but many other features too) Select
attributes From relations (possibly
multiple, joined) Where conditions
(selections)
35
Selections s
SELECT FROM
Company WHERE countryUSA AND
stockPrice gt 50 You can use
Attribute names of the relation(s) used in the
FROM. Comparison operators , ltgt,
lt, gt, lt, gt Apply arithmetic
operations stockprice2 Operations
on strings (e.g., for concatenation).
Lexicographic order on strings.
Pattern matching s LIKE p Special
stuff for comparing dates and times.
36
Projection p
Select only a subset of the attributes
SELECT name, stock price
FROM Company WHERE
countryUSA AND stockPrice gt 50
Rename the attributes in the resulting table
SELECT name AS company,
stockprice AS price FROM
Company WHERE countryUSA AND
stockPrice gt 50
37
Ordering the Results
SELECT name, stock price
FROM Company WHERE
countryUSA AND stockPrice gt 50
ORDERBY country, name
Ordering is ascending, unless you specify the
DESC keyword. Ties are broken by the second
attribute on the ORDERBY list, etc.
38
Join
SELECT name, store
FROM Person, Purchase WHERE
per-namebuyer AND citySeattle
AND
productgizmo Product ( pname, price,
category, maker) Purchase (buyer, seller,
store, product) Company (cname, stock price,
country) Person( per-name, phone number, city)

39
Disambiguating Attributes
Find names of people buying telephony products
SELECT Person.name FROM
Person, Purchase, Product WHERE
Person.namebuyer
AND productProduct.name
AND Product.categorytelephony Product (
name, price, category, maker) Purchase (buyer,
seller, store, product) Person( name, phone
number, city)
40
Tuple Variables
Find pairs of companies making products in the
same category
SELECT product1.maker, product2.maker
FROM Product AS product1, Product AS
product2 WHERE
product1.category product2.category
AND product1.maker ltgt
product2.maker
Product ( name, price, category, maker)
41
Exercises
Product ( pname, price, category,
maker) Purchase (buyer, seller, store,
product) Company (cname, stock-price,
country) Person( per-name, phone number,
city) Ex 1 Find people who bought telephony
products. Ex 2 Find names of people who bought
American products Ex 3 Find names of people who
bought American products and did not
buy French products Ex 4 Find names of people
who live in Seattle and who bought American
products. Ex 5 Find people who bought stuff
from Joe or bought products from a
company whose stock prices is more than 50.
42
Views
43
Defining Views
(Virtual) Views are relations, except that they
are not physically stored. They are used
mostly in order to simplify complex queries
and to define conceptually different views of the
database to different classes of users. View
purchases of telephony products CREATE VIEW
telephony-purchases AS SELECT product, buyer,
seller, store FROM Purchase, Product WHERE
Purchase.product Product.name
AND Product.category telephony
44
A Different View
CREATE VIEW Seattle-view AS SELECT
buyer, seller, product, store FROM
Person, Purchase WHERE Person.city
Seattle AND
Person.name Purchase.buyer
We can later use the views SELECT
name, store FROM Seattle-view,
Product WHERE Seattle-view.product
Product.name AND
Product.category shoes
Whats really happening when we query a view??
45
Updating Views
How can I insert a tuple into a table that
doesnt exist? CREATE VIEW bon-purchase AS
SELECT store, seller, product FROM
Purchase WHERE store The Bon
Marche If we make the following insertion
INSERT INTO bon-purchase VALUES
(the Bon Marche, Joe, Denby Mug) We can
simply add a tuple (the Bon Marche,
Joe, NULL, Denby Mug) to relation Purchase.
46
Materialized Views
  • Views whose corresponding queries have been
    executed and the data is stored in a separate
    database
  • Uses Caching
  • Issues
  • Using views in answering queries
  • Normally, the views are available in addition to
    database
  • (so, views are local caches)
  • In information integration, views may be the only
    things we have access to.
  • An internet source that specializes in woody
    allen movies can be seen as a view on a database
    of all movies. Except, there is no database out
    there which contains all movies..
  • Maintaining consistency of materialized views

47
Non-Updatable Views
Given Purchase (buyer, seller, store,
product) Person( name, phone-num, city)
CREATE VIEW Seattle-view AS SELECT
seller, product, store FROM Person,
Purchase WHERE Person.city Seattle
AND Person.name
Purchase.buyer
Why non-updatable?
How can we add the following tuple to the view?
(Joe, Shoe Model 12345, Nine West)
48
Issues w.r.t. Databases on the Web
  • Information Extraction (invert the tuple to text
    transformation)
  • Support lay user queries
  • More flexible queries
  • Exact (SQL) vs Approximate/Similar (Text search?)
  • On semi-structured databases
  • Joins over text attributes?
  • Exact (SQL) vs Approximate/Similar !!!!!
  • Support integration/aggregation of multiple
    databases
  • Take a query from the user and send it to all
    relevant databases
  • TONS of challenges

49
Imprecise Queries
  • Increasing number of Web accessible databases
  • E.g. bibliographies, reservation systems,
    department catalogs etc
  • Support for precise queries only exactly
    matching tuples
  • Difficulty in extracting desired information
  • Limited query capabilities provided by form based
    query interface
  • Lack of schema/domain information
  • Increasing complexity of types of data e.g.
    hyptertext, images etc
  • Often times user wants about the same instead
    of exact
  • Bibliography search find similar publications

Solution Provide answers closely matching query
constraints
50
Query Optimization
51
Query Optimization
Goal
Imperative query execution plan
Declarative SQL query
SELECT S.buyer FROM Purchase P, Person Q WHERE
P.buyerQ.name AND Q.cityseattle AND
Q.phone gt 5430000
  • Inputs
  • the query
  • statistics about the data (indexes,
    cardinalities, selectivity factors)
  • available memory

Ideally Want to find best plan. Practically
Avoid worst plans!
52
(On-the-fly)
sname
SELECT S.sname FROM Reserves R, Sailors S WHERE
R.sidS.sid AND R.bid100 AND S.ratinggt5
  • Goal of optimization To find more efficient
    plans that compute the same answer.

(On-the-fly)
rating gt 5
with pipelining )
sidsid
(Use hash
Sailors
bid100
index do
not write
result to
temp)
Reserves
53
Optimizing Joins
  • Q(u,x) - R(u,v), S(v,w), T(w,x)
  • R S T
  • Many ways of doing a single join R S
  • Symmetric vs. asymmetric join operations
  • Nested join, hash join, double pipe-lined hash
    join etc.
  • Processing costs alone vs. processing transfer
    costs
  • Get R and S together vs, get R, get just the
    tuples of S that will join with R (semi-join)
  • Many orders in which to do the join
  • (R join S) join T
  • (S join R) join T
  • (T join S) join R etc.
  • All with different costs

54
Determining Join Order
  • In principle, we need to consider all possible
    join orderings
  • As the number of joins increases, the number of
    alternative plans grows rapidly we need to
    restrict the search space.
  • System-R consider only left-deep join trees.
  • Left-deep trees allow us to generate all fully
    pipelined plansIntermediate results not written
    to temporary files.
  • Not all left-deep trees are fully pipelined
    (e.g., SM join).

55
Query Optimization Process(simplified a bit)
  • Parse the SQL query into a logical tree
  • identify distinct blocks (corresponding to nested
    sub-queries or views).
  • Query rewrite phase
  • apply algebraic transformations to yield a
    cheaper plan.
  • Merge blocks and move predicates between blocks.
  • Optimize each block join ordering.
  • Complete the optimization select scheduling
    (pipelining strategy).

56
Cost Estimation
  • For each plan considered, must estimate cost
  • Must estimate cost of each operation in plan
    tree.
  • Depends on input cardinalities.
  • Must estimate size of result for each operation
    in tree!
  • Use information about the input relations.
  • For selections and joins, assume independence of
    predicates.
  • System R cost estimation approach.
  • Very inexact, but works ok in practice.
  • More sophisticated techniques known now.

57
Key Lessons in Optimization
  • There are many approaches and many details to
    consider in query optimization
  • Classic search/optimization problem!
  • Not completely solved yet!
  • Main points to take away are
  • Algebraic rules and their use in transformations
    of queries.
  • Deciding on join ordering System-R style
    (Selinger style) optimization.
  • Estimating cost of plans and sizes of
    intermediate results.

58
Concurrency Control
  • Concurrent execution of user programs key to
    good DBMS performance.
  • Disk accesses frequent, pretty slow
  • Keep the CPU working on several programs
    concurrently.
  • Interleaving actions of different programs
    trouble!
  • e.g., account-transfer print statement at same
    time
  • DBMS ensures such problems dont arise.
  • Users/programmers can pretend they are using a
    single-user system. (called Isolation)
  • Thank goodness! Dont have to program very,
    very carefully.

59
Transactions ACID Properties
  • Key concept is a transaction a sequence of
    database actions (reads/writes).
  • DBMS ensures atomicity (all-or-nothing property)
    even if system crashes in the middle of a Xact.
  • Each transaction, executed completely, must take
    the DB between consistent states or must not run
    at all.
  • DBMS ensures that concurrent transactions appear
    to run in isolation.
  • DBMS ensures durability of committed Xacts even
    if system crashes.
  • Note can specify simple integrity constraints on
    the data. The DBMS enforces these.
  • Beyond this, the DBMS does not understand the
    semantics of the data.
  • Ensuring that a single transaction (run alone)
    preserves consistency is largely the users
    responsibility!

60
Scheduling Concurrent Transactions
  • DBMS ensures that execution of T1, ... , Tn is
    equivalent to some serial execution T1 ... Tn.
  • Before reading/writing an object, a transaction
    requests a lock on the object, and waits till the
    DBMS gives it the lock. All locks are held
    until the end of the transaction. (Strict 2PL
    locking protocol.)
  • Idea If an action of Ti (say, writing X) affects
    Tj (which perhaps reads X), say Ti obtains the
    lock on X first so Tj is forced to wait until
    Ti completes.This effectively orders the
    transactions.
  • What if Tj already has a lock on Y and Ti
    later requests a lock on Y? (Deadlock!) Ti or Tj
    is aborted and restarted!

61
Ensuring Transaction Properites
  • DBMS ensures atomicity (all-or-nothing property)
    even if system crashes in the middle of a Xact.
  • DBMS ensures durability of committed Xacts even
    if system crashes.
  • Idea Keep a log (history) of all actions carried
    out by the DBMS while executing a set of Xacts
  • Before a change is made to the database, the
    corresponding log entry is forced to a safe
    location. (WAL protocol OS support for this is
    often inadequate.)
  • After a crash, the effects of partially executed
    transactions are undone using the log. Effects of
    committed transactions are redone using the log.
  • trickier than it sounds!
Write a Comment
User Comments (0)
About PowerShow.com