Chapter 22: Advanced Querying and Information Retrieval - PowerPoint PPT Presentation

About This Presentation
Title:

Chapter 22: Advanced Querying and Information Retrieval

Description:

Online Analytical ... greater than $50,000 are most likely to buy sports cars' ... periodically downloaded form online transaction processing (OLTP) systems. ... – PowerPoint PPT presentation

Number of Views:117
Avg rating:3.0/5.0
Slides: 78
Provided by: profd76
Category:

less

Transcript and Presenter's Notes

Title: Chapter 22: Advanced Querying and Information Retrieval


1
Chapter 22 Advanced Querying and Information
Retrieval
  • Decision-Support Systems
  • Data Analysis and OLAP
  • Data Mining
  • Data Warehousing
  • Information-Retrieval Systems

2
Decision Support Systems
  • Decision-Support systems are used to make
    business decisions often based on data collected
    by on-line transaction-processing systems.
  • Examples of business decisions
  • what items to stock?
  • What insurance premium to change?
  • Who to send advertisements to?
  • Examples of data used for making decisions
  • Retail sales transaction details
  • Customer profiles (income, age, sex, etc.)

3
Decision-Support Systems Overview
  • Data analysis tasks are simplified by specialized
    tools and SQL extensions
  • Example tasks
  • For each product category and each region, what
    were the total sales in the last quarter and how
    do they compare with the same quarter last year
  • As above, for each product category and each
    customer category
  • Statistical analysis packages (e.g., S) can
    be interfaced with databases
  • Statistical analysis is a large field will not
    study it here
  • Data mining seeks to discover knowledge
    automatically in the form of statistical rules
    and patterns from Large databases.
  • A data warehouse archives information gathered
    from multiple sources, and stores it under a
    unified schema, at a single site.
  • Important for large businesses which generate
    data from multiple divisions, possibly at
    multiple sites
  • Data may also be purchased externally

4
Data Analysis and OLAP
  • Aggregate functions summarize large volumes of
    data
  • Online Analytical Processing (OLAP)
  • Tools to support interactive analysis of data,
    allowing data to be summarized and viewed in
    different ways
  • A histogram partitions the values taken by an
    attribute into ranges, and computes an aggregate
    over the values in each range cumbersome to use
    standard SQL to construct a histogram. Extension
    proposed by Red Brick
  • select percentile, avg (balance)
  • from account
  • group by N_tile (balance, 10) as percentile

5
Cross Tabulation of sales by item-name and color
  • The table above is an example of a
    cross-tabulation(or cross-tab) also referred to
    as a pivot-table. In general, a cross-table is a
    table where values for one attribute form the row
    headers, values for another attribute form the
    column headers, and the values in an individual
    cell are derived as follows.
  • A cross tab with summary rows/columns can be
    represented by introducing a special value all to
    represent subtotals.

6
Relational Representation of the Data in Figure
22.1
7
Three-Dimensional Data Cube
8
Online Analytical Processing
  • The operation of changing the dimensions used in
    a cross-tab is called pivoting.
  • An OLAP system provides other functionality as
    well. For instance, the analyst may wish to see a
    cross-tab on item-name and color for a fixed
    value of size, for example, large, instead of the
    sum across all sizes. Such an operation is
    referred to as slicing. The operation is
    sometimes called dicing, particularly when values
    for multiple dimensions are fixed.
  • The operation of moving from finer-granularity
    data to a coarser granularity is called a rollup.
  • The opposite operation - that of moving from
    coarser-granularity data to finer-granularity
    data is called a drill down.

9
OLAP Implementation
  • The earliest OLAP systems used multidimensional
    arrays in memory to store data cubes, and are
    referred to as mutidimensional OLAP (MOLAP)
    systems.
  • Hybrid systems, which store some summaries in
    memory and store the base data and other
    summaries in a relational database, are called
    hybrid OLAP (HOLAP) systems.

10
Hierarchies on Dimensions
11
Cross Tabulation of sales With Hierarchy on
item-name
12
Data Analysis (Cont.)
Total
Medium
Small
Large
8 20
35 10
310 5
Light Dark
53 35
28
45
15
88
Total
  • Cross-tabulation of number by size and color of
    sample relation sales with the schema
    Sales(color, size, number).

13
Data Analysis (Cont.)
Color
Size
Number
Light Light Light Light Dark Dark Dark Dark all al
l all all
8 35 10 53 20 10 5 35 28 45 15 88
Small Medium Large all Small Medium Large all Smal
l Medium Large all
  • Can represent subtotals in relational form by
    using the value all
  • E.g. obtain (Light, all, 53) and (Dark, all,
    35) by aggregating individual tuples with
    different values for size for each color.

14
Data Analysis (Cont.)
  • Rollup Moving from finer-granularity data to a
    coarser granularity by means of aggregation.
  • Drill down Moving from coarser-granularity data
    finer-granularity data.
  • Proposed extensions to SQL, such as the cube
    operation help to support generation of summary
    data
  • The following query generates the previous table.
  • select color, size, sum (number)
  • from sales
  • groupby color, size with cube

15
Data Analysis (Cont.)
  • Figure shows the combinations of dimensions size,
    color, price
  • In general computing cube operation with n
    groupby columns gives 2nd different groupby
    combinations.

16
Data Mining
  • Like knowledge discovery in artificial
    intelligence data mining discovers statistical
    rules and patterns it differs from machine
    learning in that it deals with large volumes of
    data stored primarily on disk.
  • Knowledge discovered from a database can be
    represented by a set of rules.
  • e.g., Young women with annual incomes greater
    than 50,000 are most likely to buy sports cars
  • Discover rules using one of two models
  • 1. The user is involved directly in the process
    of knowledge discovery.
  • 2. The system is responsible for automatically
    discovering knowledge from the database by
    detecting patterns and correlation's in the data.

17
Knowledge Representation Using Rules
  • General form of rules ? X antecedent ? consequent
  • X is a list of one or more variables with
    associated ranges.
  • The rule ? transactions T, buys (T, bread) ?
    buys(T, milk) states if there is a tuple (t1,
    bread) in the relation buys, there must also be a
    tuple (t1, milk) in the relation buys.
  • Population Cross-product of the ranges of the
    variables in the rule.
  • In the above example, the set of all
    transactions.
  • Support Measure of what fraction of the
    population satisfies both the antecedent and the
    consequent of the rule.
  • e.g., 10 of transactions buy bread and milk.
  • Confidence Measure of how often the consequent
    is true when the antecedent is true.
  • e.g., 80 of transactions that buy bread also buy
    milk

18
Some Classes of Data-Mining Problems
  • Classification Finding rules that partition the
    given data into disjoint groups (classes) that
    are relevant for making a decision
  • (e.g., which of several factors help classify a
    persons credit worthiness).
  • Associations Useful to determine associations
    between different items
  • (e.g., someone who buys bread is quite likely
    also to buy milk).
  • Sequence correlations determine correlations
    between related sequence data.
  • (e.g., when bond rates go up stock prices go
    down within two days.)

19
User-Guided Data Mining
  • In user-guided data mining, primary
    responsibility for discovering rules is with the
    user.
  • User may runs tests on the database to verify or
    refute a hypothesis. Confidence and support for
    rules expressing a hypothesis are derived from
    the database.
  • An iterative process of forming and refining
    rules is used. Example Test the hypothesis
    People who hold masters degrees are the most
    likely to have an excellent credit rating. If
    confidence of rule is low, may refine it into the
    rule
  • ? people P, P.degree Masters and C.income ?
    75, 000
  • ? C.credit excellent
  • Data-visualization though graphical
    representations like maps, charts, and
    color-coding, helps detect patterns in data

20
Classification Rules
  • Classification rules help assign new objects to a
    set of classes. E.g., given a new automobile
    insurance applicant, should he or she be
    classified as low risk, medium risk or high risk?
  • Classification rules for above example could use
    a variety of knowledge, such as educational level
    of applicant, salary of applicant, age of
    applicant, etc.
  • Classification rules can be compactly shown as a
    Classification tree.

21
Part of Credit Risk Classification Tree
22
Discovery of Classification Rules
  • Training set a data sample in which the grouping
    for each tuple is already known.
  • Top down generation of classification tree.
  • Each internal node of the tree partitions the
    data into groups based on the attribute.
  • The data at a node is not partitioned further if
    either all (or most) of the items at the node
    belong to the same class, or all attributes have
    been considered. Such a node is a leaf node.
  • Otherwise the data at the node is partitioned
    further by picking an attribute for partitioning
    data at the node.

23
Discovery of Classification Rules (Cont.)
  • Consider credit risk example Suppose degree is
    chosen to partition the data at the root. Since
    degree has a small number of possible values,
    one child is created for each value.
  • At each child node of the root, further
    classification is done tuple if required. Here,
    partitions are defined by income. Since income is
    a continuous attribute, some number of intervals
    are chosen, and one child created for each
    interval.
  • Different classification algorithms use different
    ways of choosing which attribute to partition on
    at each node, and what the intervals, if any,
    are.
  • In general, different branches of the tree could
    grow to different levels. Different nodes at the
    same level may use difficult partitioning
    attributes.

24
Discovery of Association Rules
  • Example ? transactions T, buys (T, bread) ? buys
    (T, milk)
  • In general notion of transaction , and its
    intemset, the set of items contained in the
    transaction
  • General form of rule
  • ? transactions T, c(T, i1) and . . . and c(T,
    io) ? c(T, i0)
  • where c(T, ik) denotes that transaction T
    contains item ik.
  • Above can be represented as A ? b where A i1,
    i2, . . . , in
  • and b io.
  • Support of rule number of transactions whose
    itemsets contain A ? b
  • Usually desire rules with strong support, which
    will involve only items purchased in a
    significant percentage of the transactions.

25
Discovery of Association Rules (Cont.)
  • Consider all possible sets of relevant items.
  • For each set find its support (i.e. , how many
    transactions purchase all items in the set).
  • Use sets with sufficiently high support to
    generate association rules.
  • From set A generate the rules A - b ?b for
    each b ? A.
  • Support of each of the rules is support of A.
  • Confidence of a rule is support of A divided by
    support of A - b.

26
Finding Support
  • Few sets Determine level of support via a single
    pass.
  • A count is maintained for each set, initially set
    to 0.
  • When a transaction is fetched, the count is
    incremented for each set of items which contained
    in the itemset of the transaction.
  • Sets with a high count at the end of the pass
    correspond to items with a high degree of
    association.
  • Many sets If memory not enough to hold all
    counts for all sets Use multiple passes,
    considering only some sets in each pass.
  • Optimization Once a set is eliminated because it
    occurs in too small a fraction of the
    transactions, none of its supersets needs to be
    considered.

27
Data Warehousing
  • A data warehouse is a repository of information
    gathered from multiple sources.

28
Data Warehousing (Cont.)
  • Provides a single consolidated interface to data
  • Data stored for an extended period, providing
    access to historical data
  • Data/updates are periodically downloaded form
    online transaction processing (OLTP) systems.
  • Typically, download happens each night.
  • Data may not be completely up-to-date, but is
    recent enough for analysis.
  • Running large queries at the warehouse ensures
    that OLTP systems are not affected by the
    decision-support workload.

29
Issues in Building a Warehouse
  • When and how to gather data.
  • Source driven data source initiates data
    transfer
  • Destination driven warehouse initiates data
    transfer
  • What schema to use.
  • Schema integration
  • Cleaning and conversion of incoming data
  • What data to summarize.
  • Raw data may be too large to store on-line
  • Aggregate values (totals/subtotals) often suffice
  • Queries on raw data can often be transformed by
    query optimizer to use aggregate values
  • How to propagate updates.
  • Date at warehouse is a view on source data
  • Efficient view maintenance techniques required

30
Information Retrieval Systems
  • Information retrieval (IR) systems use a simpler
    data model than database systems, but provide
    more powerful querying capabilities within the
    restricted model.
  • Queries attempt to locate documents that are of
    interest by specifying, for example, sets of
    keywords.
  • e.g., find documents containing the words
    database systems
  • Information retrieval systems order answers based
    on their estimated relevance.
  • e.g., user may really only want documents about
    database systems, but the system may retrieve all
    documents that mention the phrase database
    systems.
  • Documents may be ordered by, for example, how
    many times the phrase appears in the document.

31
Queries
  • Combinations of keywords
  • motorcycle and maintenance
  • computer or micro-processor
  • computer but not database
  • Closeness of keyword s in the and case affects
    ranking. Some systems allow user to specify that
    the keywords must occur close to each other.
  • Synonyms
  • To retrieve document title motorcycle repair for
    the query motorcycle and maintenance, need to
    realize that maintenance and repair are synonyms
  • Similarity based retrieval - retrieve documents
    similar to a given document. Similarity may be
    defined based on metrics such as number of common
    keywords.

32
Differences From Database Systems
  • Information retrieval systems, unlike traditional
    database systems, handle
  • Unstructured documents
  • Searching using keywords and relevance ranking
  • Most information retrieval systems do not handle
  • High update rates
  • Concurrency control
  • Data structured using more complex data models
    (e.g., relational or object oriented data models)
  • Complex queries written in, e.g., SQL

33
Indexing of Documents
  • Documents that contain a specified keyword can be
    located using an inverted index which maps each
    keyword Ki to the set Si of identifiers of
    documents that contain Ki .
  • IR systems save space by using index structures
    that support only approximate retrieval. May
    result in
  • false drop - some relevant documents may not be
    retrieved.
  • false positive - some irrelevant documents may be
    retrieved.
  • For many applications a good index should not
    permit any false drops, but may permit a few
    false positives.
  • Relevant performance metrics
  • Precision - what percentage of the retrieved
    documents are relevant to the query.
  • Recall - what percentage of the documents
    relevant to the query were retrieved.

34
Indexing of Documents (Cont.)
  • and operation Finds documents that contain all
    of a set of keywords K1, K2, ..., Kn.
  • Retrieve the corresponding sets of identifiers of
    documents S1, S2, ... Sn.
  • The intersection, S1? S2 ?..... ? Sn, gives the
    identifiers of the desired set of documents.
  • or operation Gives the set of all documents that
    contain at least one of the keywords K1, K2, ,
    Kn
  • Found by computing the union, S1? S2 ?..... ? Sn,
    of the sets.

35
Indexing of Documents (Cont.)
  • not operation Finds documents that do not
    contain a specified keyword Ki
  • Let Si be the set of identifiers of documents
    that contain the keyword Ki.
  • Given a set of document identifies S, eliminate
    documents that contain the specified keyword Ki
    by taking the difference S-Si
  • A full-text index uses every work in the document
    as a keyword.
  • Stop words are very commonly occurring words that
    are useless as key words, e.g, a, an, the, it
    etc. These are eliminated from the index.

36
Browsing
  • Storing related documents together facilitates
    browsing, where a user can see not only requested
    document but also related ones.
  • Browsing in a library facilitated by
    classification system that organizes logically
    related books together.

37
A Classification Tree
38
Classification DAG
  • Documents can reside in multiple places in a
    hierarchy in an information retrieval system,
    since physical location is not important.
  • Classification hierarchy is thus Directed Acyclic
    Graph (DAG).

39
A Classification DAG for a library information
retrieval system
40
Locating of Information on the Web
  • The Archie system automatically follows Gopher
    links to locate information, and creates a
    centralized index of information found various
    sites.
  • Web indexing systems (Web crawlers) follow the
    hypertext links in documents to find other
    documents, and build an index on the documents.
  • The indices are often full-text indices, and are
    stored locally at the indexing system.
  • These systems run a background process to
  • find new sites.
  • obtain updated information from known sites.
  • discard defunct sites.

41
Locating of Information on the Web (Cont.)
  • Web indexing systems permit documents to be
    located even though they are not registered with
    any central authority.
  • Drawback poor precision of recall
  • Full text indexing retrieves unrelated documents
    that just happen to mention the requested keyword
  • HTML extensions now allow documents to be tagged
    with keywords to be used by search engines
    unfortunately, many documents do not provide such
    keywords.
  • The extremely large number of documents on the
    Web often leads to far more results than a human
    can handle.
  • Alternative approach a cataloging system for the
    Web, such as that provide by Yahoo
  • Combinations of catalogs and indexing are quite
    useful Provided by, e.g., Yahoo (www.yahoo.com).

42
Classification Tree
43
Data-Warehouse Architecture
44
Star Schema For A Data Warehouse
45
A Classification Hierarchy For A Library System
46
A Classification DAG For A Library Information
Retrieval System
47
Data Analysis and OLAP
  • Online Analytical Processing
  • Data that can be modeled as dimension attributes
    and measure attributes are called
    multidimensional data.
  • Given a relation used for data analysis, we can
    identify some of its attributes as measure
    attributes, since they measure some value, and
    can be aggregated upon. For instance, the
    attribute number of the sales relation is a
    measure attribute, since it measures the number
    of units sold.
  • Some of the other attributes of the relation are
    identified as dimension attributes, since they
    define the dimensions on which measure
    attributes, and summaries of measure attributes,
    are viewed.

48
Extended Aggregation
  • SQL1999 also supports generalizations of the
    group by constructs, using the cube and rollup
    constructs. A representative use of the cube
    construct is
  • select item-name, color, size,
    sum(number) from sales group by cube(item-name,
    color, size)
  • This query computes the union of eight different
    groupings of the
  • sales relation
  • (item-name, color, size), (item-name, color),
    (item-name, size), (color, size), (item-name),
    (color), (size), ( )
  • Where ( ) denotes an empty group by list.
  • For each grouping, the result contains the null
    value for attributes not present in the grouping.
    For instance, with occurrences of all replaced by
    null, can be computed by the query
  • select item-name, color, sum(number) from
    sales group by cube(item-name, color)

49
Extended Aggregation (Cont.)
  • A representative rollup construct is
  • select item-name, color, size,
    sum(number) from sales group by
    rollup(item-name, color, size)
  • Here only four grouping are generated
  • (item-name, color, size), (item-name, color),
    (item-name), ( )
  • Rollup can be used to generate aggregates at
    multiple levels of ahierarchy on a column. For
    instance, we have a table itemcategory(item-name,
    category) giving the category of each item. Then
    the query
  • select category, item-name, sum(number)from
    sales, categorywhere sales.item-name
    itemcategory.item-namegroup by rollup(category,
    item-name)
  • would give a hierarchical summary by item-name
    and by category.

50
Extended Aggregation (Cont.)
  • Multiple rollups and cubes can be used in a
    single group by clause.For instances, the
    following query
  • select item-name, color, size, sum(number)from
    salesgroup by rollup(item-name), rollup(color,
    size)
  • generates the groupings
  • (item-name, color, size), (item-name, color),
    (item-name), (color, size), (color), ( )
  • The function grouping can be applied on an
    attribute it returns 1 if the value is a null
    value representing all, and returns 0 in all
    other cases. Consider the following query
  • select item-name, color, size,
    sum(number), grouping(item-name) as
    item-name-flag, grouping(color) as
    color-flag, grouping(size) as size-flag,from
    salesgroup by cube(item-name, color, size)

51
Ranking
  • Ranking is done in conjunction with an order by
    specification. Suppose we are given a relation
    student-marks(student-id, marks) which stores the
    marks obtained by each student. The following
    query gives the rank of each student.
  • select student-id, rank( ) over (order by
    (marks) desc as s-rankfrom student-marks
  • An extra order by clause is needed to get them in
    sorted order, as shown below.
  • select student-id, rank ( ) (order by (marks)
    desc) as s-rankfrom student-marks order by
    s-rank
  • Ranking can be done within partition of the data.
    The following query then gives the rank of
    students within each section.
  • select student-id, section, rank ( ) over
    (partition by section order by marks desc) as
    sec-rankfrom student-marks, student-sectionwhere
    student-marks.student-id student-section.studen
    t-idorder by section, sec-rank

52
Ranking (Cont.)
  • For a given constant n, the ranking the function
    ntile(n) takes the tuples in each partition in
    the specified order, and divides them into n
    buckets with qual numbers of tuples. For
    instance, we an sort employees by salary, and use
    ntile(3) to find which range (bottom third,
    middle third, or top third) each employee is in,
    and compute the total salary earned by employees
    in each range
  • select threetile, sum(salary)from ( select
    salary, ntile(3) over (order by (salary) as
    threetile from employee) as sgroup by threetile
  • SQL1999 permits the user to specify where they
    should occur by using nulls first or nulls last,
    for instance
  • select student-id, rank ( ) over (order by marks
    desc nulls last) as s-rankfrom student-marks

53
Windowing
  • An example of window query is that, given sales
    values for each date, calculates for each date
    the average of the sales on that day, the
    previous day, and the next day such moving
    average queries are used to smooth out random
    variations.
  • In contrast to group by, the same tuple can exist
    in multiple windows. Suppose we are given a
    relation transaction(account-number, date-time,
    value), where value is positive for a deposit and
    negative for a withdrawal, consider a query
  • select account-number, date-time, sum(value)
    over (partition by account-number order by
    date-time rows unbounded preceding) as
    balancefrom transactionorder by account-number,
    date-time

54
Decision - Tree Construction
  • The main idea of decision tree construction is to
    evaluate different attributes and different
    partitioning conditions and pick the attribute
    and partitioning condition that results in the
    maximum information gain ratio.
  • Procedure Grow.Tree(S) Partition(S)Procedure
    Partition (S) if (purity(S) gt ?p or S lt ?s)
    then return for each attribute A
    evaluate splits on attribute A Use best split
    found (across all attributes) to partition
    S into S1, S2, ., Sr, for i 1, 2, .., r
    Partition(Si)
  • We can generate classification rules from a
    decision tree, if we so desire. For each leaf we
    generate a rule as follows The left hand side is
    the conjunction of all the split conditions on
    the path to the leaf, and the class is the class
    of the majority of the training instances at the
    leaf. An example of such a classification rule is
  • degree masters and income gt 75,000 ? excellent

55
Other Types of Classifiers
  • There are two types of classifiers
  • Neural net classifiers
  • Bayesian classifiers
  • Neural net classifiers use the training data to
    train artificial neural nets.
  • To find the probability p(cjd) of instance d
    being in class cj, Bayesian classifiers use
    Bayes theorem, which says
  • p(cjd) p(dcj)p(cj)
  • p(d)where p(dcj) is the
    probability of generating instance d given class
    cj, p(cj) is the probability of occurrence of
    class cj, and p(d) is the probability of
    instanced occuring.
  • To simplify the task, naïve Bayesian classifiers
    assume attributes have independent distributions,
    and thereby estimate
  • p(dcj) p(d1cj) p(d2cj) . (p(dncj)
  • That is, the probability of the instance d
    occuring is the product of the probability of
    occurrence of each of the attribute values di of
    d, given the class cj.

56
Regression
  • Regression deals with the prediction of a value,
    rather than a class. Given values for a set of
    variables, X1, X2, , Xn, we wish to predict the
    value of a variable Y.
  • One way is to infer coefficients a0, a1, a1, ,
    an such that Y a0 a1 X1 a2 X2 an
    Xn
  • Finding such a linear polynomial is called linear
    regression. The process to find a curve that fits
    the data is called curve fitting.
  • The fit may only be approximate, because of noise
    in the data or because the relationship is not
    exactly a polynomial, so regression aims to find
    coefficients that give the best possible fit.

57
Association Rules
  • Retail shops are often interested in associations
    between different items that people buy. Examples
    of such associations are
  • Someone who buys bread is quite likely also to
    buy milk
  • A person who bought the book Database System
    Concepts is quite likely also to buy the book
    Operating System Concepts.
  • Associations information can be used in several
    ways. When a customer buys a particular book, an
    online shop may suggest associated books.
  • Association Rules
  • An example of Association Rules is
  • bread ? milk
  • An association rule must have an associated
    population the population consists of a set of
    instances.

58
Association Rules (Cont.)
  • Association Rules An example of Association
    Rules is
  • bread ? milk
  • An association rule must have an associated
    population the population consists of a set of
    instances.
  • Rules have an associated support, as well as an
    associated confidence. These are defined in the
    context of population
  • Support is a measure of what fraction of the
    population satisfies both the antecedent and the
    consequent of the rule.For instance, suppose only
    0.001 percent of all purchases include milk and
    screwdrivers. The support for the rule is milk ?
    screwdrivers is low.
  • Confidence is a measure of how often the
    consequent is true when the antecedent is true.
    For instance, the rule bread ? milk has a
    confidence of 80 percent if 80 percent of the
    purchases that include bread also include milk.A
    rule with a low confidence is not meaningful.
  • Note that the confidence of bread ? milk may be
    very different from the confidence of milk ?
    bread, although both have the same supports.

59
Clustering
60
Other Types of Mining
61
Data Warehousing
62
Components of Data Warehouse
  • When and how to gather data.
  • What schema to use.
  • Data cleansing
  • How to propagate updates.
  • What data to summarize

63
Warehouse Schemas
64
Information-Retrieval Systems
65
Keyword Search
  • Information-retrieval systems typically allow
    query expressions formed using keywords and the
    logical connectives and, or, and not.
  • A query containing keywords without any of the
    above connectives is assumed to have ands
    implicitly connecting the keywords
  • In full text retrieval, all the words in each
    document are considered to be keywords. We shall
    use the word term to refer to the words in a
    document, since all words are keywords.

66
Relevance Ranking Using Terms
  • One way of measuring r(d, t), the relevance of a
    document d to a term t, is
  • Where n(d) denotes the number of terms in the
    document and n(d, t) denotes the number of
    occurrences of term t in the document d.

67
Relevance Using Hyperlinks
68
Similarity-Based Retrieval
69
Synonyms and Homonyms
70
Indexing of Documents
71
Measuring Retrieval Effectiveness
72
Web Search Engines
73
Directories
74
(No Transcript)
75
Applications of Data Mining
  • ? person P, P.degree masters and P.income gt
    75,000 ? P.credit excellent?person P,
    P.degree bachelors or (p.income ? 25,000 and
    P.income ? 75,000) ? P.credit good

76
Best Splits
  • The purity of a set S of training instances can
    be measured quantitatively in several ways.
    Suppose there are k classes, and of the instances
    in S the fraction of instances in class I is pi.
    One measure of purity, the Gini measure is
    defined as
  • Gini (S) 1 - ?
  • When all instances are in a single class, the
    Gini value is 0, while it reaches its maximum (of
    1 1 /k) if each class the same number of
    instances. Another measure of purity is the
    entropy measure, which is defined as
  • Entropy (S) - ?
  • When a set S is split into multiple sets Si, I1,
    2, , r, we can measure the purity of the
    resultant set of sets as
  • Purity(S1, S2, .., Sr) ?
  • The above formula can be used with both the Gini
    measure and the entropy measure of purity.

77
Best Splits (Cont.)
  • The information gain due to particular split of S
    into Si, I 1, 2, ., r is then
  • Information-gain (S,S1, S2, ., Sr) purity(S)
    purity (S1, S2, Sr)
  • The information content of a particular split can
    be defined in terms of entropy as
    Information-content(S, S1, S2, .., Sr)) - ?
  • All of this leads to a definition The best split
    for an attribute is the one that gives the
    maximum information gain ratio, defined as
  • Information-gain (SS1, S2, , Sr)
  • Information-content (S, S1, S2, .., Sr)
Write a Comment
User Comments (0)
About PowerShow.com