CS245A - PowerPoint PPT Presentation

Provided by: wesle4
Learn more at: http://web.cs.ucla.edu

Transcript and Presenter's Notes

Title: CS245A


1
CS245A Syllabus (2005)
  • Knowledge Discovery in Databases
  • Query Processing With Domain Semantics
  • Capture Database Semantics by Rule Induction
  • Intensional Query Answering
  • Fault Tolerant DDBMS Via Data Inference
  • Intelligent Dictionary Directory
  • Uncertainty Management Using Rough Sets
  • Data Mining Techniques (Ch. 4-7, Han & Kamber)
  • Active Databases
  • Mediators in Information Systems
  • KQML: A Language and Protocol for Knowledge and
    Information Exchange

2
CS 245A - Syllabus (contd)
  • CoBase
  • CoSent
  • Relaxation for XML Documents
  • Query Formation From High-level Concepts
  • Knowledge Acquisition for Query Relaxation
  • Principles of Case-based Reasoning
  • A Case-based Reasoning Approach to AQA
  • CoXML
  • Data Mining for Sequence Data
  • Extracting key features from Free Text
  • Knowledge based Approach for Free Text Retrieval
  • Content-based Information Retrieval
  • Digital Library

3
References
  • Course notes: Intelligent Information Systems,
    CS245A, Course Reader Material, 1141 Westwood
    Blvd, 310-443-3303
  • Jiawei Han and Micheline Kamber, Data Mining:
    Concepts and Techniques, Morgan Kaufmann, August
    2000.
  • Wesley Chu and T.Y. Lin (eds.), Foundations and
    Advances in Data Mining, Springer, 2005.

4
CS 245A Intelligent Information Systems
  • Wesley W. Chu
  • Computer Science Department
  • U. of California
  • Los Angeles, CA

5
Knowledge Discovery In Databases
  • Information Explosion
  • Information doubles every 20 months
  • Increase in the number and size of DBs
  • NASA - Earth observation satellites, 1
    picture/sec
  • Human genome - several billion genetic bases
  • US census data - lifestyle and subculture of the
    US
  • How to analyze these databases (raw data)?
  • There is a gap between data generation and data
    understanding
  • Intelligent data analysis will be useful and
    valuable
  • AA uses frequent flyer DB to find its better
    customers for specific market promotions

6
Knowledge Discovery In Databases (Contd)
  • Banks use customers' loan and credit information
    to derive better loan approval and bankruptcy
    protection
  • Package-goods manufacturers use the scanned
    supermarket data to measure the effect of their
    promotions and to look for shopping patterns
  • Techniques
  • Machine Learning
  • Statistics
  • Information Theory
  • Fuzzy Set

7
Knowledge Discovery
  • Extraction of implicit, previously unknown and
    potentially useful information from Data
  • Given a set of facts (data) F, a language L, and
    a measure of certainty C,
  • Pattern: a statement S in L that describes the
    relationship among a subset Fs of F with
    certainty C, such that S is a simpler
    representation than the enumeration of all facts
    in Fs
  • Discovered Knowledge
  • The output of a program that monitors the set
    of facts in a DB and produces patterns.

8
Patterns
  • Expressed in a high-level language
  • Understood and used directly by people
  • Can be input to another program (e.g., an expert
    system)
  • e.g.
  • If Age < 25 and Driver-Education-Course = No
  • Then At-Fault-Accident = Yes
  • with likelihood 0.3
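One common way to read the likelihood attached to such a pattern is as the fraction of records matching the "if" part that also match the "then" part. The sketch below assumes that reading; the record layout, field names, and sample drivers are hypothetical and are not taken from the slides.

```python
def rule_likelihood(records, condition, conclusion):
    """Fraction of records satisfying `condition` that also satisfy `conclusion`."""
    matching = [r for r in records if condition(r)]
    if not matching:
        return None  # the rule never fires, so no likelihood can be estimated
    return sum(1 for r in matching if conclusion(r)) / len(matching)

# Made-up driver records, for illustration only.
drivers = [
    {"age": 19, "driver_ed": False, "at_fault": True},
    {"age": 22, "driver_ed": False, "at_fault": False},
    {"age": 24, "driver_ed": False, "at_fault": False},
    {"age": 23, "driver_ed": True, "at_fault": False},
    {"age": 40, "driver_ed": False, "at_fault": True},
]

likelihood = rule_likelihood(
    drivers,
    condition=lambda r: r["age"] < 25 and not r["driver_ed"],
    conclusion=lambda r: r["at_fault"],
)
```

Three of the sample drivers satisfy the condition and one of them had an at-fault accident, so the estimate here is one third.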

9
Patterns (Contd)
  • Patterns that are completely unrelated to current
    goals are not considered as knowledge.
  • e.g.
  • Patterns that relate at-fault accidents to a
    driver's age are not useful for auto sales
    figures.
  • Pattern + interesting results = knowledge
  • Age > 16 is not an interesting pattern for
    drivers, since all drivers must have age > 16.

10
Knowledge Discovery in DB Exhibits Four Main
Characteristics
  • High-Level Language
  • Understood by human users
  • Accuracy
  • Expressed by measure of uncertainty
  • Interesting Results
  • Patterns are novel and potentially useful
  • Efficiency
  • Running times for large-sized DB are predictable
    and acceptable

11
Efficiency
  • The discovery process should be efficiently
    implemented on a computer.
  • An algorithm is considered efficient if the run
    time and space used are low-degree polynomial
    functions of the input length.
  • e.g.
  • efficient algorithms for restricted concept
    classes
  • Conjunctive concepts, (A ∧ B ∧ C)
  • Conjunctions of clauses, each a disjunction of no
    more than k literals
  • (A ∨ B) ∧ (C ∨ D) ∧ (E ∨ F), k = 2.
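As a small illustration of the second concept class, a concept can be stored as a list of clauses and checked against an example in time linear in the number of literals. This is only a sketch of evaluating such a concept, not of learning one; the representation and the literal names are assumptions.

```python
def satisfies_cnf(assignment, clauses):
    """True if every clause contains at least one literal that is True in `assignment`."""
    return all(any(assignment[literal] for literal in clause) for clause in clauses)

def max_clause_width(clauses):
    """Largest number of literals in any clause (the k of the concept)."""
    return max(len(clause) for clause in clauses)

# (A or B) and (C or D) and (E or F), so k = 2
concept = [("A", "B"), ("C", "D"), ("E", "F")]
example = {"A": True, "B": False, "C": False, "D": True, "E": True, "F": False}

is_member = satisfies_cnf(example, concept)
```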

12
Machine Learning
  • A learning algorithm takes the data set and its
    accompanying information as input and returns a
    statement (e.g., a concept) representing the
    results of the learning as output
  • Data sets can be a file of records in DB
  • Problems in learning DB
  • DB are
  • Dynamic
  • Incomplete
  • Noisy
  • Much larger than typical machine learning data
    sets
  • Much of work in learning DB focuses on overcoming
    these complications!

13
Related Approaches
  • DB Management
  • Integrity
  • Querying in DB
  • Deduction in DB
  • OODBM
  • Expert Systems
  • Expert-generated knowledge is usually of higher
    quality than the data in a DB
  • Only cover the important cases
  • Experts are available to confirm the validity and
    usefulness of discovered patterns
  • Autonomy of discovery is lacking in expert systems

14
Related Approaches (Contd)
  • Statistics
  • Ill suited for the nominal and structured data
    types
  • Precluding the use of domain knowledge
  • Difficult to interpret
  • Require the guidance of the user to specify when
    and how to analyze the data

15
Scientific Discovery
  • DBKD is less purposeful and less controlled than
    scientific discovery (SD)
  • Scientists can reformulate and rerun their
    experiments should they find the initial design
    was inadequate
  • Database managers rarely have the luxury of
    redesigning their data fields and recollecting
    the data

16
A Framework for Knowledge Discovery
  • Input
  • Raw data from DB
  • Information from data dictionary
  • Additional domain knowledge
  • User defined biases that provide high level focus
  • Output
  • New Domain Knowledge
  • Feedback of the discovered knowledge to generate
    new knowledge
  • DB issues
  • Dynamic data (time-sensitive, e.g., weight,
    height, pulse rate)
  • Irrelevant fields (zip codes, pulse rate, sex)
  • Missing data
  • Noise and uncertainty
  • Missing field

17
Translation Between Database Management and
Machine Learning Terms
18
Conflicting Viewpoints Between Database
Management and Machine Learning
19
A Framework for Knowledge Discovery in Databases
20
Database and Knowledge
  • Domain knowledge assists discovery by narrowing
    the search scope
  • Data Dictionary
  • Inter-field Knowledge
  • e.g., weight and height
  • Inter-instance knowledge
  • e.g., age height seniority
  • age weight seniority
  • Contradictory - can rule out valuable discovery
  • "Trucks don't drive over water"
  • eliminates a potentially interesting solution:
  • Trucks drive over frozen lakes in winter.

21
Discovered Knowledge
  • Form
  • Inter-field patterns - related values of field in
    the same record
  • e.g. (procedure = surgery implies days in
    hospital > 5)
  • Inter-record patterns - aggregated over groups of
    records or identify useful clusters (e.g., profit
    making companies)
  • Rules X > Y1, A > B
  • form causal chains or networks

22
Discovered Knowledge (contd)
  • Representation
  • Discovery must be represented in a form
    appropriate for the intended user.
  • Human natural language, formal logic, visual
    depictions of information
  • Computer program (expert system shells)
    Programming language, declarative formalisms
  • Discovery System Feedback as domain knowledge
  • Need common representation
  • Uncertainty
  • Patterns are often probabilistic rather than
    deterministic
  • missing and erroneous data
  • inherent indeterminism of the underlying real
    world causes (50% chance of rain tomorrow)
  • sampling

23
Discovered Knowledge (contd)
  • Measures
  • Proof of success
  • Standard deviation
  • Belief measures
  • Linguistic uncertainty - fuzzy sets
  • Visual presentations by density, size, and
    shading
  • Sampling technique for large DB: accuracy of
    results depends on sample size
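To make the dependence on sample size concrete: when a pattern's likelihood p is estimated from a simple random sample of n records, the standard error of the estimate is sqrt(p(1 - p)/n). The sketch below assumes simple random sampling; the slides do not say which sampling scheme is intended, and the function name is illustrative.

```python
import math

def proportion_standard_error(p, n):
    """Standard error of a proportion p estimated from a simple random sample of size n."""
    return math.sqrt(p * (1 - p) / n)

se_100 = proportion_standard_error(0.3, 100)
se_400 = proportion_standard_error(0.3, 400)
```

Quadrupling the sample size halves the standard error, so accuracy improves only with the square root of the sample size.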

24
Discovery Algorithms
  • Machine Learning
  • Unsupervised Learning
  • Supervised Learning
  • Unsupervised Learning
  • Pattern identification identifying interesting
    patterns and describing them in a concise and
    meaningful manner
  • Examples
  • customers with income > 25,000/yr
  • questionable insurance claims

25
Discovery Algorithms (Contd)
  • Methods
  • Traditional Clustering
  • Minimize similarity between classes
  • Maximize similarity within classes
  • Drawbacks
  • Based on Euclidean Distance, work well only on
    numerical data
  • Inability to use background information such as
    likely cluster shape
  • Conceptual clustering
  • Based on attributes similarity, conceptual
    cohesiveness (defined by background information)
  • Interactive clustering
  • Combines the human user's knowledge with the
    computation power of the computer

26
Discovery Algorithms (Contd)
  • Supervised Learning
  • Description process
  • Summarizes relevant qualities of the identified
    class
  • In discovery systems, user supervision can occur
    in either the identification or description
    process.

27
Concept Description (Supervised Concept Learning)
  • Discovery in large, complex database requires
    both empirical methods to detect the statistical
    regularity of patterns and knowledge-based
    approaches to incorporate available domain
    knowledge.
  • Discovery tasks
  • Summarization - Summarize class records by
    describing their common or characteristic
    features
  • Discrimination - Describe qualities sufficient to
    discriminate records of one class from another
  • Comparison - Describe the class in a way that
    facilitates comparison and analysis with other
    records

28
Future Directions
  • Domain Knowledge - how to effectively use domain
    knowledge to discover knowledge
  • Efficient Algorithms
  • Restrict rule type
  • Heuristic and approximate algorithms
  • Sampling
  • Parallel computing
  • OODBM
  • Deductive DB
  • Incremental methods
  • Efficiently keep pace with changes in Data
  • Incremental discovery system, reuse their
    discoveries and make more complex discoveries

29
Future Directions (contd)
  • Interactive systems
  • Knowledge analyst included in the discovery loop
  • Use human judgement, machine computation power
  • Need information to be presented on a human
    oriented form (text, sound, visuals)
  • Integration

30
Applications of Discovery in DB
  • Medicine
  • Finance
  • Agriculture
  • Social
  • Marketing & Sales
  • Insurance
  • Engineering
  • Physics & Chemistry
  • Military
  • Law Enforcement
  • Space Science
  • Publishing

31
Applications of Discovery in DB (Contd)
  • Discovery of Quantitative Laws
  • Data Driven Discovery of Quantitative Laws
  • Using Knowledge in Discovery
  • Data Summarization
  • Domain Specific Discovery Methods
  • Integrated Multi-Paradigm Systems
  • Methodology and Application Issues

32
Query Processing With Domain Semantics
  • Wesley W. Chu

33
Query Optimization Problem
  • To find a sequence of operations, which has the
    minimal processing cost.

34
Conventional Query Optimization (CQO)
  • For a given query
  • Generate a set of queries that are equivalent to
    the given query
  • Determine the processing cost of each such query
  • Select the lowest cost query processing strategy
    among these equivalent queries

35
Limitations of CQO
  • There are certain queries that cannot be
    optimized by Conventional Query Optimization.
  • For example, given the query
  • Which ships have deadweight greater than 200
    thousand tons?
  • A search of the entire database may be required
    to answer this query.

36
The Use of Knowledge
  • ASSUMING EXPERT KNOWS THAT
  • 1. SHIP relation is indexed on ShipType. There
    are about 10 different ship types, and
  • 2. the ship must be a SuperTanker (one of the
    ShipTypes) if the deadweight is greater than 150K
    tons.
  • AUGMENTED QUERY
  • Which SuperTankers have deadweight greater than
    200K tons?
  • RESULT
  • About 90% of the time is saved in searching for
    the answers.
  • The technique of improving queries with semantic
    knowledge is called Semantic Query Optimization.
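The effect of the augmented query can be sketched by counting how many rows each strategy examines. Everything in the sketch is an assumption made for illustration: the in-memory "index", the function names, and the synthetic fleet of ten ship types with ten ships each, built so that only SuperTankers exceed 150K tons.

```python
from collections import defaultdict

def build_index(rows, key):
    """Group rows by the value of `key`, mimicking an index on that column."""
    index = defaultdict(list)
    for row in rows:
        index[row[key]].append(row)
    return index

def full_scan(rows, predicate):
    """Answer the original query by examining every row; also report rows examined."""
    return [r for r in rows if predicate(r)], len(rows)

def indexed_scan(index, key_value, predicate):
    """Answer the augmented query by examining only the rows under one index entry."""
    candidates = index.get(key_value, [])
    return [r for r in candidates if predicate(r)], len(candidates)

def heavy(ship):
    return ship["deadweight"] > 200  # deadweight in thousands of tons

# Synthetic data: 10 ship types x 10 ships; only SuperTankers exceed 150K tons.
SHIP_TYPES = ["SuperTanker"] + [f"Type{i}" for i in range(1, 10)]
ships = []
for ship_type in SHIP_TYPES:
    for i in range(10):
        deadweight = 160 + 10 * i if ship_type == "SuperTanker" else 10 + 10 * i
        ships.append({"name": f"{ship_type}-{i}", "ship_type": ship_type, "deadweight": deadweight})

original_answer, original_examined = full_scan(ships, heavy)
augmented_answer, augmented_examined = indexed_scan(
    build_index(ships, "ship_type"), "SuperTanker", heavy
)
```

Both strategies return the same ships, but the augmented query examines 10 rows instead of 100, which matches the roughly 90% saving quoted above for about ten ship types.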

37
Semantic Query Optimization (SQO)
  • Uses domain knowledge to transform the original
    query into a more efficient query that still
    yields the same answer.
  • Assuming a set of integrity constraints is
    available as the domain knowledge,
  • Represent each integrity constraint as Pi → Ci,
    where 1 ≤ i ≤ n.
  • Translate (augment) the original query Q into Q'
    subject to C1, C2, ..., Cn, such that Q' yields a
    lower processing cost than Q.
  • Query Optimization Problem: find C1, C2, ..., Cm
    that yield minimal query processing cost, that
    is,
  • $C(Q') = \min_{C_i} C(Q \wedge C_1 \wedge \dots \wedge C_m)$

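The minimization can be sketched as a search over subsets of the constraint consequents that apply to the query, keeping the subset with the lowest estimated cost. The exhaustive search, the toy cost model, and the second consequent ("Draft > 20") are assumptions made for illustration; the slides do not describe how the search or the cost estimate is actually performed.

```python
from itertools import combinations

def best_augmentation(consequents, cost):
    """Return the subset of consequents (and its cost) that minimizes `cost`."""
    best, best_cost = (), cost(())  # start from the unaugmented query Q
    for m in range(1, len(consequents) + 1):
        for subset in combinations(consequents, m):
            subset_cost = cost(subset)
            if subset_cost < best_cost:
                best, best_cost = subset, subset_cost
    return best, best_cost

# Toy cost model (hypothetical): a predicate on an indexed column cuts the scan
# from 1000 to 100 units, and every added predicate costs 5 units to evaluate.
INDEXED = {"ShipType = SuperTanker"}

def toy_cost(added):
    scan = 100 if any(c in INDEXED for c in added) else 1000
    return scan + 5 * len(added)

chosen, chosen_cost = best_augmentation(
    ["ShipType = SuperTanker", "Draft > 20"], toy_cost
)
```

Enumerating every subset is exponential in the number of applicable constraints, so a practical optimizer would need heuristics; the sketch only shows what is being minimized.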
38
Semantic Equivalence
  • Domain knowledge of the database application
    may be used to transform the original query into
    semantically equivalent queries.
  • Semantic Equivalence
  • Two queries are considered to be semantically
    equivalent if they result in the same answer in
    any state of the database that conforms to the
    Integrity Constraints.
  • Integrity Constraints
  • A set of if-then rules that ensure the database
    is an accurate instance of the real-world
    database application. Examples of constraints
    include
  • state snapshot constraints
  • e.g., if deadweight > 150K then ShipType =
    SuperTanker.
  • state transition constraints
  • e.g., salary can only be increased,
  • i.e., salary (new) > salary (old)
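The two kinds of constraint differ in what they inspect: a snapshot constraint looks at one database state, while a transition constraint compares the state before and after an update. A minimal sketch, with hypothetical record layouts and function names:

```python
def snapshot_ok(ship):
    """State snapshot constraint: deadweight > 150K implies ShipType = SuperTanker."""
    return ship["deadweight"] <= 150_000 or ship["ship_type"] == "SuperTanker"

def transition_ok(old, new):
    """State transition constraint as stated above: salary(new) > salary(old)."""
    return new["salary"] > old["salary"]

valid_ship = {"deadweight": 210_000, "ship_type": "SuperTanker"}
invalid_ship = {"deadweight": 210_000, "ship_type": "Frigate"}
raise_ok = transition_ok({"salary": 50_000}, {"salary": 55_000})
cut_ok = transition_ok({"salary": 50_000}, {"salary": 45_000})
```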

39
Limitations of Current Approach
  • The current approach to SQO uses
  • Integrity constraints as knowledge
  • Conventional data models

40
Limitations of Integrity Constraints
  • Integrity constraints are often too general to be
    useful in SQO, because
  • Integrity constraints describe every possible
    database state
  • User is only concerned with the current database
    content.
  • Most databases do not provide integrity checking
    due to
  • Unavailability of integrity constraints
  • Overhead of checking the integrity
  • Thus, the usefulness of integrity constraints in
    SQO is quite limited.

41
Limitations Of Conventional Data Models
  • Conventional data models lack expressive
    capability for modeling conveniences. Many
    useful semantics are ignored. Therefore, only
    limited knowledge is collected.
  • FOR EXAMPLE
  • Which employees earn more than 70K a year?
  • The integrity constraint
  • The salary range of employees is between 20K and
    90K.
  • is useless in improving this query.

42
Augmentation Of SQO With Semantic Data Models
  • If the employees are divided into three
    categories: MANAGERS, ENGINEERS, STAFF
  • and each category is associated with some
    constraints
  • The salary range of MANAGERS is from 35K to 90K.
  • The salary range of ENGINEERS is from 25K to 60K.
  • The salary range of STAFF is from 20K to 35K.
  • A better query can be obtained
  • Which managers earn more than 70K a year?
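The pruning step can be sketched directly from the three salary ranges: a category needs to be searched only if its upper bound exceeds the threshold in the query. The dictionary layout and function name are illustrative assumptions.

```python
# Salary ranges per category, taken from the constraints listed above.
SALARY_RANGES = {
    "MANAGERS": (35_000, 90_000),
    "ENGINEERS": (25_000, 60_000),
    "STAFF": (20_000, 35_000),
}

def categories_to_search(threshold):
    """Categories whose salary range can contain a salary above `threshold`."""
    return [name for name, (_low, high) in SALARY_RANGES.items() if high > threshold]

over_70k = categories_to_search(70_000)
over_30k = categories_to_search(30_000)
```

For the 70K query only MANAGERS survives, which is why the query can be rewritten to ask about managers alone; for a 30K threshold no category can be excluded and the rewrite gives no benefit.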

43
(No Transcript)
44
CLASS (Type, Class, Name, Displacement, Draft,
Enlist)
45
Rule Statistics
46
SQP Performance for Selected Database Structure
47
Performance Improvement for Selected Attributes
  • Columns (in order): CQP, SQP. Attributes
    measured: Class, Enlist
  • cpu (ms): 505, 432
  • dio: 11, 11
  • dio: 3, 4
  • cpu (ms): 129, 130
48
(No Transcript)
49
Summary
  • Contributions
  • Providing a model-based methodology for acquiring
    knowledge from the database by rule induction.
  • Applications
  • 1. Semantic Query Processing - use semantic
    knowledge to improve query processing
    performance.
  • 2. Deductive Database Systems - use induced rules
    to provide intensional answers.
  • 3. Data Inference Applications - use rules to
    improve data availability by inferring
    inaccessible data from accessible data.

50
Capture Database Semantics By Rule Induction
  • Wesley W. Chu
  • Rei-Chi Lee

51
Database Semantics
  • Database semantics can be classified into
  • Database Structure - the description of the
    interrelationships between database objects.
  • Database Characteristics - defines the
    characteristics and properties of each object
    type.
  • However, only tools for modeling database
    structure are available. Very few tools exist in
    gathering and maintaining the database
    characteristics.

52
An Example of Database Characteristics
  • The following table illustrates the US Navy
    battleship characteristics that classify ships
    into ship types with different displacement
    ranges.

53
Knowledge Acquisition
  • A major problem in the development of a
    knowledge-based data processing system.
  • Knowledge Engineers - persons skilled in the use
    of expert system tools
  • Domain Experts - persons with the expertise of
    the application domain
  • The Process
  • Studying literature to obtain fundamental
    background.
  • Interacting with domain experts to get their
    expertise.
  • Translating the expertise into knowledge
    representation.
  • Refining knowledge base through testing and
    further interacting with domain experts.
  • A VERY TIME-CONSUMING TASK!

54
Knowledge Acquisition from Database
  • Database schema is defined according to database
    semantics, and
  • Database instances are constrained by the
    database characteristics.
  • Thus,
  • Database characteristics can be induced as the
    semantic knowledge from the database.
  • Database schema can be a useful tool to guide the
    knowledge acquisition.

55
Knowledge Acquisition By Rule Induction
  • Given an object hierarchy and a set of database
    instances contained in the object hierarchy, a
    set of classification rules can be induced by
    inductive learning techniques.
  • Given
  • H - an object type hierarchy H1, ..., Hn
  • S - object schema
  • I - database instances representing H
  • Find
  • D - a set of descriptions, D1, ..., Dn such
    that
  • for all x, x in I,
  • if Di (x) is true, then x ISA Hi
  • Example
  • SUBMARINES contains SSN, SSBN
  • D_SSN: 2145 ≤ Displacement ≤ 6955
  • D_SSBN: 7250 ≤ Displacement ≤ 30000
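Read as code, each description Di is a predicate over an instance, and classification means finding the subtype whose description holds. The sketch uses the two displacement ranges from the example and assumes the bounds are inclusive; the function name and the handling of values outside both ranges are illustrative choices.

```python
# Descriptions for the SUBMARINE subtypes, from the example above.
DESCRIPTIONS = {
    "SSN": (2145, 6955),
    "SSBN": (7250, 30000),
}

def classify(displacement):
    """Return the subtype whose description holds for `displacement`, else None."""
    for subtype, (low, high) in DESCRIPTIONS.items():
        if low <= displacement <= high:  # inclusive bounds are an assumption
            return subtype
    return None  # no description applies, so nothing can be inferred
```

A displacement of 7000 falls in the gap between the two ranges, so no rule fires and the subtype stays unknown.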

56
Model-Based Knowledge Acquisition Methodology
  • The methodology consists of
  • a Knowledge-based ER (KER) Model,
  • a knowledge acquisition methodology, and
  • a rule induction algorithm.
  • KER is used as a knowledge acquisition tool when
  • no knowledge specification is provided, or
  • the database already exists.

57
Knowledge-Based ER (KER) Model
  • To capture the database characteristics, a
    Knowledge-based Entity Relationship (KER) model
    is proposed to extend the basic ER model to
    provide knowledge specification capability.
  • A KER schema is defined by the following
    constructs
  • 1. has-attribute/with (aggregation)
  • This construct links an object with other
    objects and specifies certain properties of the
    object.
  • 2. isa/with (generalization)
  • This construct specifies a type/subtype
    relationship between object types.
  • 3. has-instance (classification)
  • This construct links a type to an object that is
    an instance of that type.
  • The knowledge specification is represented by the
    with-constraint specification.

58
Components of the KER Diagram
59
A KER Diagram Example
60
Classification of Semantic Knowledge
  • Domain Knowledge
  • Specifying the static properties of entities and
    relationships.
  • e.g., displacement in the range of (0 - 30,000).
  • Intra-Structure Knowledge
  • Specifying the relationships between attributes
    within an object (an entity or a relationship).
  • e.g., if the displacement is less than 7000, then
    it is a nuclear submarine.
  • Inter-Structure Knowledge
  • Specifying the relationship that is related to
    attributes of several entities of the aggregation
    relationship.
  • e.g., the instructor's department must be the
    same as the department of the class offered.

61
Knowledge Acquisition Methodology
  • To provide a systematic way of collecting
    domain knowledge guided by the database schema.
    It consists of three steps
  • 1. Schema Generation - using KER
  • a. Identify entities and associated attributes.
  • b. Identify type hierarchies by determining the
    class attributes of each type hierarchy.
  • c. Identify aggregation relationships. Define
    each referential key as a class attribute.
  • 2. Rule Induction
  • 3. Knowledge Base Refinement

62
Rule Induction Algorithm
  • Semantic rules for pair-wise attributes (X → Y)
    are induced using the relational operations.
  • Sketch of the Algorithm
  • 1. Retrieving (X,Y) value pairs.
  • Retrieve the instances of the (X,Y) pair from the
    database.
  • Let S be the result.
  • 2. Removing inconsistent (X,Y) value pairs.
  • Retrieve all the (X,Y) pairs for which the same
    value of X has multiple values of Y. Let T be
    the result.
  • Let S = S - T.
  • 3. Constructing Rules.
  • For each distinct value of Y in S, say y,
    determine the value range x of X and create a
    rule in the form of
  • if x1 ≤ X ≤ x2 then Y = y.
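The three steps can be sketched over an in-memory list of rows. This is a minimal reading of the sketch above, not the prototype's implementation: it emits a single range per Y value and does not check whether the ranges of different Y values overlap, which the slides leave unspecified. The sample rows are made up, though they reuse the displacement bounds quoted elsewhere in the deck.

```python
from collections import defaultdict

def induce_rules(rows, x_attr, y_attr):
    """Return sorted (x1, x2, y) triples, read as: if x1 <= X <= x2 then Y = y."""
    # Step 1: retrieve the distinct (X, Y) value pairs; call the result S.
    s = {(row[x_attr], row[y_attr]) for row in rows}

    # Step 2: T holds pairs whose X value maps to several Y values; S = S - T.
    ys_for_x = defaultdict(set)
    for x, y in s:
        ys_for_x[x].add(y)
    t = {(x, y) for x, y in s if len(ys_for_x[x]) > 1}
    s -= t

    # Step 3: for each distinct y, take the range of X values observed with it.
    xs_for_y = defaultdict(list)
    for x, y in s:
        xs_for_y[y].append(x)
    return sorted((min(xs), max(xs), y) for y, xs in xs_for_y.items())

# Illustrative rows; 7000 appears with both types and is removed as inconsistent.
submarines = [
    {"Displacement": 2145, "Type": "SSN"},
    {"Displacement": 4000, "Type": "SSN"},
    {"Displacement": 6955, "Type": "SSN"},
    {"Displacement": 7250, "Type": "SSBN"},
    {"Displacement": 30000, "Type": "SSBN"},
    {"Displacement": 7000, "Type": "SSN"},
    {"Displacement": 7000, "Type": "SSBN"},
]

rules = induce_rules(submarines, "Displacement", "Type")
```

Because the range is built from the smallest and largest observed values, both endpoints belong to the rule.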

63
Examples Of Induced Rules
  • A prototype system was implemented at UCLA using
    a naval ship database as a test bed. Examples of
    rules induced are
  • Entity SUBMARINE
  • x isa SUBMARINE
  • R1: if 0101 ≤ x.Class ≤ 0103 then x isa SSBN
  • R2: if 0201 ≤ x.Class ≤ 0215 then x isa SSN
  • R3: if Skate ≤ x.ClassName ≤ Thresher then x
    isa SSN
  • R4: if 2145 ≤ x.Displacement ≤ 6955 then x isa
    SSN
  • R5: if 7250 ≤ x.Displacement ≤ 30000 then x
    isa SSBN

64
Examples of Induced Rules (Contd)
  • Relationship INSTALL
  • x isa SUBMARINE and y isa SONAR
  • R1: if SSN582 ≤ x.Id ≤ SSN601 then y isa BQS
  • R2: if SSN604 ≤ x.Id ≤ SSN671 then y isa BQQ
  • R3: if x.Class = 0203 then y isa BQQ
  • R4: if 0205 ≤ x.Class ≤ 0207 then y isa BQQ
  • R5: if 0208 ≤ x.Class ≤ 0215 then y isa BQS
  • R6: if y.Sonar = BQS-04 then x isa SSN

65
Pruning the Rule Set
  • When the number of rules generated becomes too
    large, the system must reduce the size of the
    knowledge base.
  • Two Criteria for Rule Pruning
  • 1. Coverage
  • Keep the rules that are satisfied by more than
    Nc instances and drop those rules that are
    satisfied by fewer than Nc instances.
  • 2. Completeness
  • Keep a rule schema (X → Y) if the total number
    of instances satisfied by the rules of that
    schema is greater than a coverage threshold
    Cc.
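Both criteria can be sketched as filters over a list of rules annotated with the number of instances each one satisfies. Several details are assumptions because the slides do not fix them: a rule satisfied by exactly Nc instances is kept, completeness is computed over the rules that survive the coverage filter, and the rule representation and sample counts are hypothetical.

```python
from collections import defaultdict

def prune_rules(rules, n_c, c_c):
    """Apply the coverage filter, then the completeness filter per rule schema."""
    # Coverage: drop rules satisfied by fewer than n_c instances.
    covered = [r for r in rules if r["support"] >= n_c]

    # Completeness: keep a schema only if its rules together satisfy more
    # than c_c instances.
    totals = defaultdict(int)
    for r in covered:
        totals[r["schema"]] += r["support"]
    return [r for r in covered if totals[r["schema"]] > c_c]

# Hypothetical rules: schema is the (X, Y) attribute pair, support the
# number of instances satisfying the rule.
candidate_rules = [
    {"schema": ("Class", "Type"), "support": 12},
    {"schema": ("Class", "Type"), "support": 8},
    {"schema": ("Class", "Type"), "support": 1},
    {"schema": ("Id", "Sonar"), "support": 3},
    {"schema": ("Id", "Sonar"), "support": 2},
]

kept = prune_rules(candidate_rules, n_c=2, c_c=10)
```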

66
Induced Rules from Relation PORT
67
Summary
  • Contributions
  • Providing a model-based methodology for
    acquiring knowledge from the database by rule
    induction.
  • Applications
  • Semantic Query Processing - use semantic
    knowledge to improve query processing
    performance.
  • Deductive Database Systems - use induced rules to
    provide intensional answers.
  • Data Inference Applications - use rules to
    improve data availability by inferring
    inaccessible data from accessible data.

68
Rule Induction
69
(No Transcript)
70
Generate the Rules
  • Select targets
  • Targets are the RHS attributes of rules.
  • Method of selection
  • Use indices as targets
  • Use selectivity
  • selectivity = # of tuples with distinct
    value / total # of tuples
  • Targets are chosen based on database schema
    (e.g., type hierarchy).
  • Generate rules for each target
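One reading of the selectivity measure is the number of distinct values of an attribute divided by the total number of tuples. The sketch below computes that ratio per attribute so candidate targets can be compared; the reading itself, the function names, and the sample rows are assumptions, and the slides do not say which selectivity values make an attribute a good target.

```python
def selectivity(rows, attr):
    """Distinct values of `attr` divided by the total number of tuples."""
    return len({row[attr] for row in rows}) / len(rows)

def selectivity_by_attribute(rows, attrs):
    """Selectivity of each candidate target attribute."""
    return {attr: selectivity(rows, attr) for attr in attrs}

# Made-up tuples for illustration.
sample = [
    {"Id": "S1", "Type": "SSN"},
    {"Id": "S2", "Type": "SSN"},
    {"Id": "S3", "Type": "SSBN"},
    {"Id": "S4", "Type": "SSBN"},
]

scores = selectivity_by_attribute(sample, ["Id", "Type"])
```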