CS 405G Introduction to Database Systems

Review: What Is Data Mining?

- Data mining (knowledge discovery from data)
- Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
- What is classification?
- Predicting the value of unseen data
- What is clustering?
- Grouping similar objects together

Challenges of Data Mining

- Scalability
- Dimensionality
- Complex and Heterogeneous Data
- Data Quality
- Data Ownership and Distribution
- Privacy Preservation
- Streaming Data

Knowing the Nature of Your Data

- Data types: nominal, ordinal, interval, ratio
- Data quality
- Data preprocessing

What is Data?

- Collection of data objects and their attributes
- An attribute is a property or characteristic of an object
- Examples: eye color of a person, temperature, etc.
- An attribute is also known as a variable, field, characteristic, or feature
- A collection of attributes describes an object
- An object is also known as a record, point, case, sample, entity, or instance

Attribute Values

- Attribute values are numbers or symbols assigned to an attribute
- Distinction between attributes and attribute values
- The same attribute can be mapped to different attribute values
- Example: height can be measured in feet or meters
- Different attributes can be mapped to the same set of values
- Example: attribute values for ID and age are integers
- But the properties of attribute values can be different
- ID has no limit, but age has a maximum and minimum value

Types of Attributes

- There are different types of attributes
- Nominal
- Examples: ID numbers, eye color, zip codes
- Ordinal
- Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
- Interval
- Examples: calendar dates, temperatures in Celsius or Fahrenheit
- Ratio
- Examples: temperature in Kelvin, length, time, counts

Properties of Attribute Values

- The type of an attribute depends on which of the following properties it possesses
- Distinctness: = ≠
- Order: < >
- Addition: + -
- Multiplication: * /
- Nominal attribute: distinctness
- Ordinal attribute: distinctness and order
- Interval attribute: distinctness, order, and addition
- Ratio attribute: all 4 properties

Discrete and Continuous Attributes

- Discrete Attribute
- Has only a finite or countably infinite set of values
- Examples: zip codes, counts, or the set of words in a collection of documents
- Often represented as integer variables
- Note: binary attributes are a special case of discrete attributes
- Continuous Attribute
- Has real numbers as attribute values
- Examples: temperature, height, or weight
- Practically, real values can only be measured and represented using a finite number of digits
- Continuous attributes are typically represented as floating-point variables

Structured vs Unstructured Data

- Structured Data
- Data in a relational database
- Semi-structured data
- Graphs, trees, sequences
- Unstructured data
- Images, text

Important Characteristics of Data

- Dimensionality
- Curse of Dimensionality
- Sparsity
- Only presence counts
- Resolution
- Patterns depend on the scale

Record Data

- Data that consists of a collection of records, each of which consists of a fixed set of attributes

Data Matrix

- If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
- Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute
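
As a small illustration of the m by n layout, the sketch below stores three objects with two numeric attributes each as a list of rows. The attribute names and values are made up for illustration and do not come from the slides.

```python
# Hypothetical data: 3 objects (rows) x 2 numeric attributes (columns).
attribute_names = ["height_m", "weight_kg"]  # assumed names, for illustration
data_matrix = [
    [1.70, 65.0],
    [1.82, 80.5],
    [1.58, 52.3],
]

m = len(data_matrix)     # number of objects (rows)
n = len(data_matrix[0])  # number of attributes (columns)


def column(matrix, j):
    """Return attribute j across all objects, i.e. one dimension of the space."""
    return [row[j] for row in matrix]


print(m, n)
print(attribute_names[1], column(data_matrix, 1))
```

Each row is one point in a 2-dimensional space; reading down a column gives the values of a single attribute.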

Document Data

- Each document becomes a "term" vector
- Each term is a component (attribute) of the vector
- The value of each component is the number of times the corresponding term occurs in the document
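
The term-vector idea can be sketched in a few lines. The vocabulary, the two documents, and the whitespace tokenization are all assumptions made for illustration; real systems usually tokenize and normalize text more carefully.

```python
from collections import Counter


def term_vector(document, vocabulary):
    """Count how often each vocabulary term occurs in the document."""
    counts = Counter(document.lower().split())  # naive whitespace tokenizer
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and documents.
vocabulary = ["team", "coach", "play", "ball", "score"]
doc1 = "team play ball team score"
doc2 = "coach coach play"

print(term_vector(doc1, vocabulary))
print(term_vector(doc2, vocabulary))
```

Stacking the vectors of all documents gives a document-term matrix, which is one instance of the data matrix described earlier.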

Transaction Data

- A special type of record data, where each record (transaction) involves a set of items
- For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

Data Quality

- What kinds of data quality problems?
- How can we detect problems with the data?
- What can we do about these problems?
- Examples of data quality problems
- Noise and outliers
- Missing and duplicated data

Noise

- Noise refers to modification of original values
- Examples: distortion of a person's voice when talking on a poor phone, and "snow" on a television screen

Figure: Two sine waves; two sine waves + noise

Mapping Data to a New Space

- Fourier transform
- Wavelet transform

Figure: Two sine waves; two sine waves + noise; frequency

Outliers

- Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set
- One person's outlier can be another one's treasure!

Missing Values

- Reasons for missing values
- Information is not collected (e.g., people decline to give their age and weight)
- Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
- Handling missing values
- Eliminate data objects
- Estimate missing values
- Ignore the missing value during analysis
- Replace with all possible values (weighted by their probabilities)
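
The first three handling strategies can be sketched on a single attribute. The ages below are hypothetical, and using the mean of the observed values as the estimate is only one possible choice.

```python
from statistics import mean

# Hypothetical records; None marks a missing age.
ages = [23, None, 35, 41, None, 29]

# Strategy 1: eliminate data objects that have a missing value.
eliminated = [a for a in ages if a is not None]

# Strategy 2: estimate missing values (assumed here: mean of observed values).
observed_mean = mean(eliminated)
estimated = [a if a is not None else observed_mean for a in ages]

# Strategy 3: ignore the missing value during analysis, i.e. compute the
# statistic over observed values only and leave the data set unchanged.
mean_ignoring_missing = mean([a for a in ages if a is not None])

print(eliminated)
print(estimated)
print(mean_ignoring_missing)
```

Elimination shrinks the data set, while estimation keeps every object but introduces values that were never measured.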

Duplicate Data

- A data set may include data objects that are duplicates, or almost duplicates, of one another
- Major issue when merging data from heterogeneous sources
- Examples
- Same person with multiple email addresses
- Data cleaning
- Process of dealing with duplicate data issues
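
A minimal sketch of the "same person with multiple email addresses" case follows. The records are hypothetical, and matching on a normalized name is an assumed, deliberately simple rule; names alone are usually not enough to identify a person in real data cleaning.

```python
from collections import defaultdict

# Hypothetical records merged from two sources: (name, email).
records = [
    ("Alice Smith", "alice@example.com"),
    ("alice  smith ", "asmith@example.org"),
    ("Bob Jones", "bob@example.com"),
]


def normalize(name):
    """Assumed matching key: lower-cased name with collapsed whitespace."""
    return " ".join(name.lower().split())


# Group the email addresses of records that share the same key.
merged = defaultdict(set)
for name, email in records:
    merged[normalize(name)].add(email)

for person, emails in sorted(merged.items()):
    print(person, sorted(emails))
```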

EDA: Exploratory Data Analysis

- Histogram
- Box plot
- Scatter plot
- Correlation

Visualization Techniques: Histograms

- Histogram
- Usually shows the distribution of values of a single variable
- Divide the values into bins and show a bar plot of the number of objects in each bin
- The height of each bar indicates the number of objects
- The shape of the histogram depends on the number of bins
- Example: petal width (10 and 20 bins, respectively)
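
The binning step can be sketched as follows, assuming equal-width bins over the observed range. The values are hypothetical (they are not the Iris petal widths from the slide), and the function expects at least two distinct values.

```python
def histogram(values, num_bins):
    """Count values in num_bins equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value is placed in the last bin.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical measurements.
values = [1, 2, 2, 3, 5, 6, 7, 8, 9, 11]

print(histogram(values, 2))
print(histogram(values, 5))
```

Running it with different values of `num_bins` on the same data shows how the apparent shape of the distribution changes with the bin count.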

Two-Dimensional Histograms

- Show the joint distribution of the values of two attributes
- Example: petal width and petal length
- What does this tell us?

Visualization Techniques: Box Plots

- Box Plots
- Invented by J. Tukey
- Another way of displaying the distribution of data
- The following figure shows the basic parts of a box plot

Example of Box Plots

- Box plots can be used to compare attributes

Scatter Plot Array of Iris Attributes

Correlation

- Correlation measures the linear relationship between objects
- To compute correlation, we standardize data objects, p and q, and then take their dot product
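
The computation can be sketched as below. It assumes that "standardize" means subtracting the mean and scaling to unit length, so that the dot product equals the Pearson correlation directly; with other conventions (such as dividing by the standard deviation) an extra normalizing factor is needed. The vectors are hypothetical, and a constant vector would cause a division by zero.

```python
import math


def standardize(x):
    """Center x on its mean and scale it to unit length (assumed convention)."""
    mean = sum(x) / len(x)
    centered = [v - mean for v in x]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]


def correlation(p, q):
    """Dot product of the standardized vectors."""
    return sum(a * b for a, b in zip(standardize(p), standardize(q)))


# Hypothetical objects p and q.
p = [1.0, 2.0, 3.0, 4.0]
print(correlation(p, [2.0, 4.0, 6.0, 8.0]))    # perfectly linear, increasing
print(correlation(p, [8.0, 6.0, 4.0, 2.0]))    # perfectly linear, decreasing
print(correlation(p, [1.0, -1.0, -1.0, 1.0]))  # no linear relationship
```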

Visually Evaluating Correlation

Scatter plots showing the similarity from -1 to 1.

Discover Association Rules

- Apriori Algorithm

Association Rule Mining

- Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Market-Basket transactions

Example of Association Rules

{Diaper} -> {Beer}
{Milk, Bread} -> {Eggs, Coke}
{Beer, Bread} -> {Milk}

Implication means co-occurrence, not causality!

Definition: Frequent Itemset

- Itemset
- A collection of one or more items
- Example: {Milk, Bread, Diaper}
- k-itemset
- An itemset that contains k items
- Support count (σ)
- Frequency of occurrence of an itemset
- E.g., σ({Milk, Bread, Diaper}) = 2
- Support
- Fraction of transactions that contain an itemset
- E.g., s({Milk, Bread, Diaper}) = 2/5
- Frequent Itemset
- An itemset whose support is greater than or equal to a minsup threshold
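
These definitions translate directly into code. The market-basket table from the slide did not survive in the text, so the five transactions below are an assumed stand-in, chosen only so that the counts agree with the values quoted above; the function names are likewise illustrative.

```python
# Assumed stand-in for the slide's market-basket table (not preserved).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)


def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return support_count(itemset, transactions) / len(transactions)


def is_frequent(itemset, transactions, minsup):
    """An itemset is frequent if its support reaches the minsup threshold."""
    return support(itemset, transactions) >= minsup


itemset = {"Milk", "Bread", "Diaper"}
print(support_count(itemset, transactions))
print(support(itemset, transactions))
print(is_frequent(itemset, transactions, minsup=0.4))
```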

Definition: Association Rule

- Association Rule
- An implication expression of the form X -> Y, where X and Y are itemsets
- Example: {Milk, Diaper} -> {Beer}
- Rule Evaluation Metrics
- Support (s)
- Fraction of transactions that contain both X and Y
- Confidence (c)
- Measures how often items in Y appear in transactions that contain X

Mining Association Rules

Example of Rules:

{Milk, Diaper} -> {Beer} (s=0.4, c=0.67)
{Milk, Beer} -> {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} -> {Milk} (s=0.4, c=0.67)
{Beer} -> {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} -> {Milk, Beer} (s=0.4, c=0.5)
{Milk} -> {Diaper, Beer} (s=0.4, c=0.5)

- Observations
- All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
- Rules originating from the same itemset have identical support but can have different confidence
- Thus, we may decouple the support and confidence requirements
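
The observation that all binary partitions share one support value but differ in confidence can be checked with a short sketch. The transaction table from the slides was not preserved, so the five transactions here are an assumed stand-in that is consistent with the support and confidence values listed above.

```python
from itertools import combinations

# Assumed stand-in for the slide's market-basket table (not preserved).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def count(itemset):
    """Support count of an itemset."""
    return sum(1 for t in transactions if itemset <= t)


def rules_from(itemset):
    """Yield (X, Y, support, confidence) for every binary partition X -> Y."""
    whole = frozenset(itemset)
    items = sorted(whole)
    for size in range(1, len(items)):
        for lhs in combinations(items, size):
            x = frozenset(lhs)
            y = whole - x
            s = count(whole) / len(transactions)  # same for every partition
            c = count(whole) / count(x)           # depends on the left side
            yield x, y, s, c


for x, y, s, c in rules_from({"Milk", "Diaper", "Beer"}):
    print(sorted(x), "->", sorted(y), f"s={s:.1f}", f"c={c:.2f}")
```

Because the support depends only on the whole itemset, it can be computed once during frequent itemset generation, leaving only confidence to evaluate per rule.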

An Exercise

- The support value of pattern acm is
- Sup(acm) = 3
- The support of pattern ac is
- Sup(ac) = 3
- Given min_sup = 3, acm is
- Frequent
- The confidence of the rule ac -> m is
- 100%

Transaction database TDB

Transaction-id   Items bought
100              f, a, c, d, g, i, m, p
200              a, b, c, f, l, m, o
300              b, f, h, j, o
400              b, c, k, s, p
500              a, f, c, e, l, p, m, n
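
The exercise answers can be verified with a short sketch over the transaction database. Items are written as single lowercase letters, and the helper name `sup` is chosen here to mirror the notation of the exercise.

```python
# Transaction database TDB from the exercise (items as lowercase letters).
tdb = {
    100: set("facdgimp"),
    200: set("abcflmo"),
    300: set("bfhjo"),
    400: set("bcksp"),
    500: set("afcelpmn"),
}


def sup(pattern):
    """Support count: number of transactions containing every item in pattern."""
    return sum(1 for items in tdb.values() if set(pattern) <= items)


min_sup = 3
print(sup("acm"), sup("ac"))      # support counts of acm and ac
print(sup("acm") >= min_sup)      # is acm frequent?
print(sup("acm") / sup("ac"))     # confidence of ac -> m
```

Transactions 100, 200, and 500 contain a, c, and m, which is why both patterns have the same support and the rule has full confidence.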

Mining Association Rules

- Two-step approach
- Frequent Itemset Generation
- Generate all itemsets whose support >= minsup
- Rule Generation
- Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
- Frequent itemset generation is still computationally expensive

Frequent Itemset Generation

Given d items, there are 2^d possible candidate itemsets
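
A brute-force enumeration makes the growth concrete. The sketch below lists every subset of a hypothetical five-item set; the count of 2^d includes the empty itemset, so there are 2^d - 1 non-empty candidates.

```python
from itertools import combinations


def candidate_itemsets(items):
    """Enumerate every subset of items, including the empty set."""
    items = sorted(items)
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)


# Hypothetical item universe with d = 5.
items = ["a", "b", "c", "d", "e"]
candidates = list(candidate_itemsets(items))

print(len(candidates), 2 ** len(items))
```

Counting the support of each candidate separately quickly becomes infeasible as d grows, which motivates pruning the candidate space.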

Apriori Algorithm

- A level-wise, candidate-generation-and-test approach (Agrawal & Srikant 1994)

Min_sup = 2

Database D

TID   Items
10    a, c, d
20    b, c, e
30    a, b, c, e
40    b, e

Scan D: 1-candidates

Itemset   Sup
a         2
b         3
c         3
d         1
e         3

Freq 1-itemsets

Itemset   Sup
a         2
b         3
c         3
e         3

2-candidates

Itemset
ab
ac
ae
bc
be
ce

Scan D: Counting

Itemset   Sup
ab        1
ac        2
ae        1
bc        2
be        3
ce        2

Freq 2-itemsets

Itemset   Sup
ac        2
bc        2
be        3
ce        2

3-candidates

Itemset
bce

Scan D: Freq 3-itemsets

Itemset   Sup
bce       2
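
The level-wise walk-through can be reproduced with a compact sketch. It is a simplified rendering, not necessarily the exact procedure of the 1994 paper: candidates are formed by taking unions of any two frequent k-itemsets and are then pruned if any k-item subset is infrequent. The function name `apriori` and its signature are chosen here for illustration.

```python
from itertools import combinations


def apriori(transactions, min_sup):
    """Return {itemset: support count} for all frequent itemsets."""
    items = {item for t in transactions for item in t}
    candidates = {frozenset([item]) for item in items}  # 1-candidates
    frequent = {}
    k = 1
    while candidates:
        # Scan the database once per level to count every candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(level)
        k += 1
        # Join: unions of two frequent (k-1)-itemsets that have k items.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Database D from the example, with Min_sup = 2.
D = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
result = apriori(D, min_sup=2)

for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), n)
```

On database D the pruning step discards abc and ace before counting, because ab and ae are infrequent, which leaves bce as the only 3-candidate.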

Summary

- Nature of the data
- Data types
- SSN: Nominal
- Grade: Ordinal
- Temperature (degrees): Interval
- Length: Ratio
- Data Quality
- Noise
- Outlier
- Missing/duplicated data

Summary

- Common tools for exploratory data analysis
- Histogram
- Box plot
- Scatter plot
- Correlation
- Association
- Each rule L -> R has two parts: L, the left-hand item set, and R, the right-hand item set
- Each rule is measured by two parameters
- Support
- Confidence