CS 405G: Introduction to Database Systems - PowerPoint PPT Presentation

About This Presentation

Slides: 39

Provided by: uky48

Learn more at: http://protocols.netlab.uky.edu

Transcript and Presenter's Notes

1
CS 405G: Introduction to Database Systems
2
Review: What Is Data Mining?

Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful)
patterns or knowledge from huge amounts of data
What is classification?
Predicting the value (class label) of unseen data
What is clustering?
Grouping similar objects together

3
Challenges of Data Mining

Scalability
Dimensionality
Complex and Heterogeneous Data
Data Quality
Data Ownership and Distribution
Privacy Preservation
Streaming Data

4
Knowing the Nature of Your Data

Data types: nominal, ordinal, interval, ratio
Data quality
Data preprocessing

5
What is Data?

Collection of data objects and their attributes
An attribute is a property or characteristic of
an object
Examples: eye color of a person, temperature,
etc.
An attribute is also known as a variable, field,
characteristic, or feature
A collection of attributes describes an object
An object is also known as a record, point, case,
sample, entity, or instance

6
Attribute Values

Attribute values are numbers or symbols assigned
to an attribute
Distinction between attributes and attribute
values
Same attribute can be mapped to different
attribute values
Example: height can be measured in feet or
meters
Different attributes can be mapped to the same
set of values
Example: attribute values for ID and age are
integers
But properties of attribute values can be
different
ID has no limit but age has a maximum and minimum
value

7
Types of Attributes

There are different types of attributes
Nominal
Examples: ID numbers, eye color, zip codes
Ordinal
Examples: rankings (e.g., taste of potato chips
on a scale from 1-10), grades, height in {tall,
medium, short}
Interval
Examples: calendar dates, temperatures in Celsius
or Fahrenheit
Ratio
Examples: temperature in Kelvin, length, time,
counts

8
Properties of Attribute Values

The type of an attribute depends on which of the
following properties it possesses
Distinctness: = ≠
Order: < >
Addition: + -
Multiplication: * /
Nominal attribute: distinctness
Ordinal attribute: distinctness and order
Interval attribute: distinctness, order, and
addition
Ratio attribute: all 4 properties

9
Properties of Attribute Values
10
Discrete and Continuous Attributes

Discrete Attribute
Has only a finite or countably infinite set of
values
Examples: zip codes, counts, or the set of words
in a collection of documents
Often represented as integer variables.
Note: binary attributes are a special case of
discrete attributes
Continuous Attribute
Has real numbers as attribute values
Examples: temperature, height, or weight.
Practically, real values can only be measured and
represented using a finite number of digits.
Continuous attributes are typically represented
as floating-point variables.

11
Structured vs Unstructured Data

Structured Data
Data in a relational database
Semi-structured data
Graphs, trees, sequences
Unstructured data
Images, text

12
Important Characteristics of Data

Dimensionality
Curse of Dimensionality
Sparsity
Only presence counts
Resolution
Patterns depend on the scale

13
Record Data

Data that consists of a collection of records,
each of which consists of a fixed set of
attributes

14
Data Matrix

If data objects have the same fixed set of
numeric attributes, then the data objects can be
thought of as points in a multi-dimensional
space, where each dimension represents a distinct
attribute
Such a data set can be represented by an m by n
matrix, where there are m rows, one for each
object, and n columns, one for each attribute
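
A minimal sketch of the m by n layout, assuming a small made-up data set (the attribute names and values below are hypothetical, not from the slides). Each row is one object and each column is one numeric attribute.

```python
# Hypothetical data: 3 objects (rows) described by 2 numeric attributes (columns).
attribute_names = ["height_m", "weight_kg"]
data_matrix = [
    [1.62, 58.0],
    [1.75, 72.5],
    [1.80, 81.0],
]

m = len(data_matrix)       # number of objects (rows)
n = len(data_matrix[0])    # number of attributes (columns)


def column(matrix, j):
    """Return the values of attribute j across all objects."""
    return [row[j] for row in matrix]


print(m, "objects x", n, "attributes")
print(attribute_names[1], column(data_matrix, 1))
```

Each row can then be read as a point in n-dimensional space, which is what distance-based methods such as clustering rely on.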

15
Document Data

Each document becomes a 'term' vector,
each term is a component (attribute) of the
vector,
the value of each component is the number of
times the corresponding term occurs in the
document.
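
One way to build such term vectors is sketched below. The three toy documents and the whitespace tokenizer are assumptions for illustration; real text usually needs more careful tokenization.

```python
from collections import Counter


def term_vectors(documents):
    """Turn each document into a vector of term counts over a shared vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]
    # One component per distinct term, in a fixed (sorted) order.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing terms count as 0
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


# Hypothetical documents.
docs = ["team play ball", "ball score ball", "team score"]
vocabulary, vectors = term_vectors(docs)
print(vocabulary)
for vector in vectors:
    print(vector)
```

Most components are zero for most documents, so document data is typically sparse.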

16
Transaction Data

A special type of record data, where
each record (transaction) involves a set of
items.
For example, consider a grocery store. The set
of products purchased by a customer during one
shopping trip constitutes a transaction, while the
individual products that were purchased are the
items.

17
Data Quality

What kinds of data quality problems?
How can we detect problems with the data?
What can we do about these problems?
Examples of data quality problems
Noise and outliers
Missing and duplicated data

18
Noise

Noise refers to modification of original values
Examples: distortion of a person's voice when
talking on a poor phone connection and "snow" on
a television screen

[Figure: two panels, "Two Sine Waves" and "Two Sine Waves + Noise"]
19
Mapping Data to a New Space

Fourier transform
Wavelet transform

[Figure: three panels, "Two Sine Waves", "Two Sine Waves + Noise", and "Frequency"]
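
The idea behind the frequency panel can be sketched with a naive discrete Fourier transform. Everything concrete here is an assumption for illustration: the two frequencies (7 and 17 cycles per window), the 128 samples, and the seeded Gaussian noise are not taken from the slides.

```python
import cmath
import math
import random


def dft_magnitudes(signal):
    """Naive O(n^2) discrete Fourier transform; returns |X[k]| for k < n/2."""
    n = len(signal)
    return [
        abs(sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2)
    ]


# Assumed signal: two sine waves plus noise, sampled 128 times.
n = 128
rng = random.Random(0)  # fixed seed so the run is repeatable
signal = [
    math.sin(2 * math.pi * 7 * t / n)
    + math.sin(2 * math.pi * 17 * t / n)
    + rng.gauss(0, 0.5)
    for t in range(n)
]

magnitudes = dft_magnitudes(signal)
# Indices of the two strongest frequency components.
ranked = sorted(range(len(magnitudes)), key=lambda k: magnitudes[k])
peaks = sorted(ranked[-2:])
print(peaks)
```

In the time domain the noise hides the structure, while in the frequency domain the two sine components stand out as isolated peaks.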
20
Outliers

Outliers are data objects with characteristics
that are considerably different than most of the
other data objects in the data set
One person's outlier can be another one's
treasure!

21
Missing Values

Reasons for missing values
Information is not collected (e.g., people
decline to give their age and weight)
Attributes may not be applicable to all cases
(e.g., annual income is not applicable to
children)
Handling missing values
Eliminate Data Objects
Estimate Missing Values
Ignore the Missing Value During Analysis
Replace with all possible values (weighted by
their probabilities)
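
The first three handling strategies can be sketched as follows. The records are hypothetical, and mean imputation is only one of several ways to estimate a missing value.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"age": 25, "weight": 70.0},
    {"age": None, "weight": 82.5},
    {"age": 41, "weight": None},
    {"age": 33, "weight": 64.0},
]


def eliminate_incomplete(rows):
    """Eliminate data objects: drop every object with a missing attribute."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute_mean(rows, attribute):
    """Estimate missing values: fill gaps with the mean of the observed values."""
    fill = mean(r[attribute] for r in rows if r[attribute] is not None)
    return [
        {**r, attribute: fill if r[attribute] is None else r[attribute]}
        for r in rows
    ]


def mean_ignoring_missing(rows, attribute):
    """Ignore missing values during analysis: use only the observed values."""
    return mean(r[attribute] for r in rows if r[attribute] is not None)


print(len(eliminate_incomplete(records)), "complete records")
print(impute_mean(records, "age"))
print(mean_ignoring_missing(records, "weight"))
```

Eliminating objects is the simplest option but discards half of this small data set, which is why estimation or ignoring the gap is often preferred.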

22
Duplicate Data

A data set may include data objects that are
duplicates, or near duplicates, of one another
Major issue when merging data from heterogeneous
sources
Examples:
Same person with multiple email addresses
Data cleaning
Process of dealing with duplicate data issues

23
EDA: Exploratory Data Analysis

Histogram
Box plot
Scatter plot
Correlation

24
Visualization Techniques: Histograms

Histogram
Usually shows the distribution of values of a
single variable
Divide the values into bins and show a bar plot
of the number of objects in each bin.
The height of each bar indicates the number of
objects
Shape of histogram depends on the number of bins
Example: petal width (10 and 20 bins,
respectively)
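
A minimal sketch of equal-width binning, showing how the same values give differently shaped histograms for different bin counts. The values are hypothetical, not the Iris petal widths from the slide, and the function assumes at least two distinct values.

```python
def histogram(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins  # assumes hi > lo
    counts = [0] * num_bins
    for v in values:
        # The maximum value goes into the last bin rather than a new one.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical measurements.
values = [0.0, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0]
coarse = histogram(values, 2)
fine = histogram(values, 4)
print(coarse)
print(fine)
```

With 2 bins the data looks like two blocks; with 4 bins a gradual increase becomes visible.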

25
Two-Dimensional Histograms

Show the joint distribution of the values of two
attributes
Example: petal width and petal length
What does this tell us?

26
Visualization Techniques: Box Plots

Box Plots
Invented by J. Tukey
Another way of displaying the distribution of
data
The following figure shows the basic parts of a
box plot

27
Example of Box Plots

Box plots can be used to compare attributes

28
Scatter Plot Array of Iris Attributes
29
Correlation

Correlation measures the linear relationship
between objects
To compute correlation, we standardize data
objects, p and q, and then take their dot product
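
The computation can be sketched as below. The slide leaves the normalization implicit; this sketch assumes the population standard deviation and divides the dot product by the number of components so that the result lies between -1 and 1. The vectors are hypothetical.

```python
import math


def standardize(x):
    """Subtract the mean and divide by the (population) standard deviation."""
    n = len(x)
    mu = sum(x) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / n)  # assumes sigma > 0
    return [(v - mu) / sigma for v in x]


def correlation(p, q):
    """Pearson correlation as a normalized dot product of standardized vectors."""
    ps, qs = standardize(p), standardize(q)
    return sum(a * b for a, b in zip(ps, qs)) / len(p)


# Hypothetical data objects.
p = [1.0, 2.0, 3.0, 4.0]
q_up = [2.0, 4.0, 6.0, 8.0]    # perfectly increasing with p
q_down = [8.0, 6.0, 4.0, 2.0]  # perfectly decreasing with p
print(correlation(p, q_up), correlation(p, q_down))
```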

30
Visually Evaluating Correlation
Scatter plots showing the similarity from -1 to 1.
31
Discover Association Rules

Apriori Algorithm

32
Association Rule Mining

Given a set of transactions, find rules that will
predict the occurrence of an item based on the
occurrences of other items in the transaction

Market-Basket transactions
Example of Association Rules
{Diaper} → {Beer}, {Milk, Bread} →
{Eggs, Coke}, {Beer, Bread} → {Milk}
Implication means co-occurrence, not causality!
33
Definition Frequent Itemset

Itemset
A collection of one or more items
Example: {Milk, Bread, Diaper}
k-itemset
An itemset that contains k items
Support count (σ)
Frequency of occurrence of an itemset
E.g. σ({Milk, Bread, Diaper}) = 2
Support
Fraction of transactions that contain an itemset
E.g. s({Milk, Bread, Diaper}) = 2/5
Frequent Itemset
An itemset whose support is greater than or equal
to a minsup threshold
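
A sketch of these definitions in code. The market-basket table itself is not part of this transcript, so the five transactions below are an assumption, chosen to be consistent with the support count of 2 and support of 2/5 stated above.

```python
# Assumed transactions (the slide's table is not in the transcript).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return support_count(itemset, transactions) / len(transactions)


def is_frequent(itemset, transactions, minsup):
    """An itemset is frequent if its support reaches the minsup threshold."""
    return support(itemset, transactions) >= minsup


itemset = {"Milk", "Bread", "Diaper"}
print(support_count(itemset, transactions), support(itemset, transactions))
```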

34
Definition Association Rule

Association Rule
An implication expression of the form X → Y,
where X and Y are itemsets
Example: {Milk, Diaper} → {Beer}
Rule Evaluation Metrics
Support (s)
Fraction of transactions that contain both X and
Y
Confidence (c)
Measures how often items in Y appear in
transactions that contain X
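
Written as formulas, the two metrics take the following form. The notation is assumed here: σ denotes the support count of an itemset and |T| the total number of transactions.

```latex
s(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{|T|},
\qquad
c(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)}
```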

35
Mining Association Rules
Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

Observations
All the above rules are binary partitions of the
same itemset {Milk, Diaper, Beer}
Rules originating from the same itemset have
identical support but can have different
confidence
Thus, we may decouple the support and confidence
requirements
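
The observations can be checked with a short sketch that enumerates every binary partition of one itemset. The transactions are an assumption (the slide's table is not in this transcript), chosen to be consistent with the support and confidence values listed for these rules.

```python
from itertools import combinations

# Assumed transactions (the slide's table is not in the transcript).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support_count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def rules_from(itemset):
    """Yield (X, Y, support, confidence) for every binary partition X -> Y.

    Assumes the itemset occurs in at least one transaction.
    """
    items = sorted(itemset)
    total = support_count(set(items))
    for size in range(1, len(items)):
        for lhs in combinations(items, size):
            x = set(lhs)
            y = set(items) - x
            yield x, y, total / len(transactions), total / support_count(x)


rules = list(rules_from({"Milk", "Diaper", "Beer"}))
for x, y, s, c in rules:
    print(sorted(x), "->", sorted(y), f"s={s:.1f}", f"c={c:.2f}")
```

The support is computed once per itemset, while the confidence depends on how the itemset is split, which is what allows the two requirements to be handled in separate steps.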

36
An Exercise

The support value of pattern acm is
Sup(acm) = 3
The support of pattern ac is
Sup(ac) = 3
Given min_sup = 3, acm is
Frequent
The confidence of the rule ac → m is
100%

Transaction database TDB
Transaction-id   Items bought
100              f, a, c, d, g, i, m, p
200              a, b, c, f, l, m, o
300              b, f, h, j, o
400              b, c, k, s, p
500              a, f, c, e, l, p, m, n
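
The exercise can be verified with a few lines of code. The representation is an assumption: each transaction is stored as a set of single-letter items keyed by its transaction id.

```python
# Transaction database TDB, one set of single-letter items per transaction id.
tdb = {
    100: set("facdgimp"),
    200: set("abcflmo"),
    300: set("bfhjo"),
    400: set("bcksp"),
    500: set("afcelpmn"),
}


def sup(pattern):
    """Number of transactions containing every item of the pattern."""
    return sum(1 for items in tdb.values() if set(pattern) <= items)


min_sup = 3
sup_acm = sup("acm")
sup_ac = sup("ac")
confidence = sup_acm / sup_ac  # confidence of the rule ac -> m

print(sup_acm, sup_ac, sup_acm >= min_sup, f"{confidence:.0%}")
```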
37
Mining Association Rules

Two-step approach
Frequent Itemset Generation
Generate all itemsets whose support ≥ minsup
Rule Generation
Generate high confidence rules from each frequent
itemset, where each rule is a binary partitioning
of a frequent itemset
Frequent itemset generation is still
computationally expensive

38
Frequent Itemset Generation
Given d items, there are 2^d possible candidate
itemsets
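
A sketch of why brute-force enumeration does not scale: every subset of the d items is a potential candidate, which gives 2^d subsets, or 2^d - 1 once the empty set is excluded. The five-item universe is a hypothetical example.

```python
from itertools import combinations


def candidate_itemsets(items):
    """Enumerate every non-empty subset of the item universe."""
    items = sorted(items)
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)


# Hypothetical universe of d = 5 items.
items = ["a", "b", "c", "d", "e"]
candidates = list(candidate_itemsets(items))
print(len(candidates), "non-empty candidates for", len(items), "items")
```

Each additional item doubles the count, so counting the support of every candidate quickly becomes infeasible.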
39
Apriori Algorithm

A level-wise, candidate-generation-and-test
approach (Agrawal & Srikant, 1994)

Database D (Min_sup = 2)
TID   Items
10    a, c, d
20    b, c, e
30    a, b, c, e
40    b, e

Scan D: 1-candidates
Itemset   Sup
a         2
b         3
c         3
d         1
e         3

Freq 1-itemsets
Itemset   Sup
a         2
b         3
c         3
e         3

2-candidates
Itemset
ab
ac
ae
bc
be
ce

Scan D: Counting
Itemset   Sup
ab        1
ac        2
ae        1
bc        2
be        3
ce        2

Freq 2-itemsets
Itemset   Sup
ac        2
bc        2
be        3
ce        2

3-candidates
Itemset
bce

Scan D: Freq 3-itemsets
Itemset   Sup
bce       2
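
The level-wise walk-through can be reproduced with a compact sketch. This is a simplified illustration under assumed names, not the original algorithm listing: candidates are formed by joining frequent itemsets of the previous level and are pruned when any of their subsets is infrequent.

```python
from itertools import combinations


def apriori(transactions, min_sup):
    """Return {itemset: support count} for all frequent itemsets."""
    def count(candidates):
        # One scan of the database per level.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    items = {item for t in transactions for item in t}
    candidates = {frozenset([item]) for item in items}
    frequent = {}
    k = 1
    while candidates:
        counts = count(candidates)
        level = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(level)
        k += 1
        # Join frequent (k-1)-itemsets into k-itemsets, then prune every
        # candidate that has an infrequent (k-1)-subset.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Database D, one set of single-letter items per transaction.
database = [set("acd"), set("bce"), set("abce"), set("be")]
frequent = apriori(database, min_sup=2)
for itemset, n in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), n)
```

On this database the pruning step discards abc and ace before the third scan, because ab and ae are not frequent, leaving bce as the only 3-candidate.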
40
Summary

Nature of the data
Data types
SSN: Nominal
Grade: Ordinal
Temperature (degree): Interval
Length: Ratio
Data Quality
Noise
Outlier
Missing/duplicated data

41
Summary

Common tools for exploratory data analysis
Histogram
Box plot
Scatter plot
Correlation
Association
Each rule L → R has two parts: L, the left-hand
itemset, and R, the right-hand itemset
Each rule is measured by two parameters
Support
Confidence