ICS 278: Data Mining Lecture 1: Introduction to Data Mining - PowerPoint PPT Presentation

About This Presentation
Title:

ICS 278: Data Mining Lecture 1: Introduction to Data Mining

Description:

Develop an overall sense of how to extract information ... Office hours. Fridays, 8:30 to 10. Web page. www.ics.uci.edu/~smyth/courses/ics278. Prerequisites ... – PowerPoint PPT presentation

Slides: 59
Provided by: Informatio367
Learn more at: http://www.ics.uci.edu
Transcript and Presenter's Notes

Title: ICS 278: Data Mining Lecture 1: Introduction to Data Mining


1
ICS 278: Data Mining, Lecture 1: Introduction to
Data Mining
  • Padhraic Smyth
  • Department of Information and Computer Science
  • University of California, Irvine

2
Philosophy behind this class
  • Develop an overall sense of how to extract
    information from data in a systematic way
  • Emphasis on the process of data mining
  • Understanding specific algorithms and methods is
    important
  • But also emphasize the big picture of why, not
    just how
  • Less emphasis on theory and mathematics (than in
    274)
  • Builds on knowledge from ICS 273, 274, etc.

3
Logistics
  • Grading
  • 30% homeworks
  • Every 2 weeks
  • Guidelines for collaboration
  • Homework 1 due in 2 weeks (on the Web page)
  • 70% class project
  • Will discuss in next lecture
  • Office hours
  • Fridays, 8:30 to 10
  • Web page
  • www.ics.uci.edu/~smyth/courses/ics278
  • Prerequisites
  • Either ICS 273 or 274 or equivalent
  • Text

4
Quiz
  • 5 minute quiz to quickly assess your background
  • Will not be used in grading for the class, for
    information purposes only

5
Lecture 1: Introduction to Data Mining
  • What is data mining?
  • Data sets
  • The data matrix
  • Other data formats
  • Data mining tasks
  • Prediction and description
  • Data mining algorithms
  • Score functions, models, and optimization methods
  • The dark side of data mining
  • Required reading: Chapter 1 of PDM (Principles of
    Data Mining)

6
What is data mining?

7
What is data mining?
The magic phrase used to ...
  • put in your resume
  • use in a proposal to NSF, NIH, NASA, etc.
  • market database software
  • sell statistical analysis software
  • sell parallel computing hardware
  • sell consulting services

8
What is data mining?
Data-driven discovery of models and patterns
from massive observational data sets

9
What is data mining?
Data-driven discovery of models and patterns
from massive observational data sets

Statistics, Inference
10
What is data mining?
Data-driven discovery of models and patterns
from massive observational data sets

Languages and Representations
Statistics, Inference
11
What is data mining?
Data-driven discovery of models and patterns
from massive observational data sets

Languages and Representations
Engineering, Data Management
Statistics, Inference
12
What is data mining?
Data-driven discovery of models and patterns
from massive observational data sets

Languages and Representations
Engineering, Data Management
Statistics, Inference
Retrospective Analysis
13
Technological Driving Factors
  • Larger, cheaper memory
  • Moore's law for magnetic disk density:
    capacity doubles every 18 months
  • storage cost per byte falling rapidly
  • Faster, cheaper processors
  • the CRAY of 15 years ago is now on your desk
  • Success of Relational Database technology
  • everybody is a data owner
  • New ideas in machine learning/statistics
  • Boosting, SVMs, decision trees, etc
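As a rough illustration of the doubling claim above, the sketch below computes the growth factor implied by a fixed doubling period. The function name `growth_factor` is hypothetical, and the 18-month period is taken from the slide as an assumption rather than a measured value.

```python
def growth_factor(years, doubling_months=18):
    """Multiplicative growth after `years` if capacity doubles every `doubling_months`."""
    return 2 ** (years * 12 / doubling_months)

# 15 years at one doubling per 18 months is 10 doublings.
print(growth_factor(15))
```

Under this assumption, 15 years corresponds to 10 doublings, roughly three orders of magnitude, which is one way to read the remark about the CRAY of 15 years ago.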

14
Examples of data volumes
  • MEDLINE text database
  • 12 million published articles
  • Google
  • 4.2 billion Web pages indexed
  • 80 million site visitors per day
  • CALTRANS loop sensor data
  • Every 30 seconds, thousands of sensors, 2 Gbytes
    per day
  • NASA MODIS satellite
  • Coverage at 250m resolution, 37 bands, whole
    earth, every day
  • Walmart transaction data
  • Order of 100 million transactions per day

15
Two Types of Data
  • Experimental Data
  • Hypothesis H
  • design an experiment to test H
  • collect data, infer how likely it is that H is
    true
  • e.g., clinical trials in medicine
  • Observational or Retrospective or Secondary Data
  • massive non-experimental data sets
  • e.g., human genome, atmospheric simulations, etc
  • assumptions of experimental design no longer
    valid
  • how can we use such data to do science?
  • data must support model exploration, hypothesis
    testing

16
Data-Driven Discovery
  • Observational data
  • cheap relative to experimental data
  • Examples
  • Transaction data archives for retail stores,
    airlines, etc
  • Web logs for Amazon, Google, etc
  • The human/mouse/rat genome
  • Etc., etc
  • makes sense to leverage available data
  • useful (?) information may be hidden in vast
    archives of data
  • Contrast data mining with traditional statistics
  • traditional stats: first hypothesize, then
    collect data, then analyze
  • data mining:
  • few if any a priori hypotheses
  • data is usually already there
  • analysis is typically data-driven, not
    hypothesis-driven
  • Nonetheless, statistical ideas are very useful in
    data mining, e.g., in validating whether
    discovered knowledge is useful

17
Let the data speak
18
Let the data speak
The data may have quite a lot to say... but it
may just be noise!
19
Origins of Data Mining
(Figure: timeline running from pre-1960 through the 1960s, 1970s, 1980s,
and 1990s. Labels: Hardware (sensors, storage, computation), Relational
Databases, Data Mining, Machine Learning, AI, Pattern Recognition,
Flexible Models, EDA, Pencil and Paper, Data Dredging)
20
DM: Intersection of Many Fields
Machine Learning (ML)
Statistics (stats)
Computer Science (CS)
Data Mining
Visualization (viz)
Databases (DB)
Human Computer Interaction (HCI)
High-Performance Parallel Computing
21
Flat File or Vector Data
(Figure: data matrix with n rows and p columns)
  • Rows: objects
  • Columns: measurements on objects
  • Represent each row as a p-dimensional vector,
    where p is the dimensionality
  • In effect, embed our objects in a p-dimensional
    vector space
  • Often useful, but not always appropriate
  • Both n and p can be very large in certain data
    mining applications
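A minimal sketch of the data matrix idea, assuming a small made-up table: each row is treated as a point in p-dimensional space, so a distance between objects can be computed directly. The values and the helper names `euclidean` and `nearest_row` are hypothetical.

```python
import math

# Hypothetical data matrix: n = 4 objects (rows), p = 3 measurements (columns).
data = [
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.3, 3.3, 6.0],
    [5.8, 2.7, 5.1],
]
n, p = len(data), len(data[0])

def euclidean(u, v):
    """Euclidean distance between two p-dimensional row vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_row(i):
    """Index of the row closest to row i (excluding row i itself)."""
    return min((j for j in range(n) if j != i),
               key=lambda j: euclidean(data[i], data[j]))

print(n, p, nearest_row(0), nearest_row(2))
```

Euclidean distance is only one possible choice; whether it is meaningful depends on the measurement scales, which is part of why the vector-space view is not always appropriate.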

22
Sparse Matrix (Text) Data
Text Documents
Word IDs
23
Market Basket Data
24
Sequence (Web) Data
128.195.36.195, -, 3/22/00, 103511, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,
128.195.36.195, -, 3/22/00, 103516, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.195, -, 3/22/00, 103517, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.195.36.101, -, 3/22/00, 161850, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -,
128.195.36.101, -, 3/22/00, 161858, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.101, -, 3/22/00, 161859, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 205437, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 205455, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 205455, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 205507, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 205536, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 205536, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 205539, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 205603, W3SVC, SRVR1, 128.200.39.181, 1081, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 205604, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 205633, W3SVC, SRVR1, 128.200.39.181, 0, 262, 72, 304, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 205652, W3SVC, SRVR1, 128.200.39.181, 19598, 382, 414, 200, 0, POST, /spt/main.html, -,
(Figure: one sequence of numeric page codes per user, for User 1 through
User 5)
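One way raw log records like those on this slide could be turned into per-user sequences is sketched below. It assumes, without confirmation from the slides, that the first field identifies the client and that the fourteenth field is the requested page; the function name `page_sequences` is hypothetical, and treating one address as one user is a simplification.

```python
from collections import defaultdict

# Three records copied from the log on this slide (trailing "-," dropped).
LOG = """\
128.195.36.195, -, 3/22/00, 103511, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html
128.195.36.195, -, 3/22/00, 103516, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html
128.195.36.101, -, 3/22/00, 161850, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html
"""

def page_sequences(log_text):
    """Group requested pages by client address, preserving log order."""
    sequences = defaultdict(list)
    for line in log_text.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        client, page = fields[0], fields[13]  # assumed field positions
        sequences[client].append(page)
    return dict(sequences)

seqs = page_sequences(LOG)
for client, pages in seqs.items():
    print(client, pages)
```

Mapping each page to a small integer code would then give sequences of the kind the figure displays.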


25
Time Series Data
26
Image Data
27
(No Transcript)
28
(No Transcript)
29
Spatio-temporal data
30
Tucker Balch and Frank Dellaert, Computer
Science Department, Georgia Tech
31
Tucker Balch and Frank Dellaert, Computer
Science Department, Georgia Tech
32
Movement of Pigment in Skin Cells of Frogs
Steve Gross, Physics and Biology, UC Irvine
33
Movement of Pigment in Skin Cells of Frogs
Steve Gross, Physics and Biology, UC Irvine
34
Movement of Pigment in Skin Cells of Frogs
Steve Gross, Physics and Biology, UC Irvine
35
Movement of Pigment in Skin Cells of Frogs
Steve Gross, Physics and Biology, UC Irvine
36
Movement of Pigment in Skin Cells of Frogs
Steve Gross, Physics and Biology, UC Irvine
37
Relational Data
38
Different Data Mining Tasks
  • Exploratory Data Analysis
  • Descriptive Modeling
  • Predictive Modeling
  • Discovering Patterns and Rules
  • others.

39
Exploratory Data Analysis
  • Getting an overall sense of the data set
  • Computing summary statistics
  • Number of distinct values, max, min, mean,
    median, variance, skewness,..
  • Visualization is widely used
  • 1d histograms
  • 2d scatter plots
  • Higher-dimensional methods
  • Useful for data checking
  • E.g., finding that a variable is always integer
    valued or positive
  • Finding that some variables are highly skewed
  • Simple exploratory analysis can be extremely
    valuable
  • You should always look at your data before
    applying any data mining algorithms
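The summary statistics and data checks listed above can be sketched in a few lines. The variable `income` and the function `summarize` are hypothetical, and the skewness formula used here is the simple population moment estimate, one of several in common use.

```python
import statistics

def summarize(values):
    """Basic exploratory summary of one numeric variable."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    # Population skewness: third central moment divided by sd cubed.
    skew = sum((v - mean) ** 3 for v in values) / (n * sd ** 3) if sd > 0 else 0.0
    return {
        "distinct": len(set(values)),
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "median": statistics.median(values),
        "variance": statistics.pvariance(values),
        "skewness": skew,
        "all_integer": all(float(v).is_integer() for v in values),
        "all_positive": all(v > 0 for v in values),
    }

# Hypothetical variable with a long right tail.
income = [1, 2, 2, 3, 3, 3, 4, 50]
summary = summarize(income)
for name, value in summary.items():
    print(name, value)
```

A mean far above the median together with a large positive skewness is the kind of signal that suggests looking at a histogram before fitting anything.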

40
Example of Exploratory Data Analysis (Pima
Indians data, scatter plot matrix)
41
Descriptive Modeling
  • Goal is to build a generative or descriptive
    model,
  • E.g., a model that could simulate the data if
    needed
  • models the underlying process
  • Examples
  • Density estimation
  • estimate the joint distribution P(x1, ..., xp)
  • Cluster analysis
  • Find natural groups in the data
  • Dependency models among the p variables
  • Learning a Bayesian network for the data
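To make the idea of a model that could simulate the data concrete, here is a minimal sketch that fits a one-dimensional Gaussian and then samples from it. This is a deliberately simple stand-in for estimating a joint distribution over p variables; the data values are made up, and the Gaussian form is an assumption.

```python
import random
import statistics

# Hypothetical 1-D measurements.
observed = [4.8, 5.1, 5.3, 4.9, 5.0, 5.2, 4.7, 5.0]

# Descriptive model: a Gaussian with parameters estimated from the data.
mu = statistics.mean(observed)
sigma = statistics.stdev(observed)

# The fitted model is generative: it can produce new, similar data.
rng = random.Random(0)
simulated = [rng.gauss(mu, sigma) for _ in range(1000)]

print(round(mu, 3), round(sigma, 3), round(statistics.mean(simulated), 3))
```

Comparing simulated data against the real data is one informal check of whether the assumed model form is adequate.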

42
Example of Descriptive Modeling
Control Group
Anemia Group
43
Another Example of Descriptive Modeling
  • Learning Directed Graphical Models (aka Bayes
    Nets)
  • goal: learn directed relationships among p
    variables
  • techniques: directed (causal) graphs
  • challenge: distinguishing between correlation and
    causation
  • example: Do yellow fingers cause lung cancer?

hidden cause: smoking
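The yellow-fingers example can be illustrated with a small simulation in which smoking is the only cause of both effects. All probabilities below are invented for illustration; they are not taken from the lecture or from real data.

```python
import random

rng = random.Random(1)

def simulate_person():
    """Smoking drives both yellow fingers and cancer; they do not affect each other."""
    smoker = rng.random() < 0.3
    yellow = rng.random() < (0.8 if smoker else 0.05)
    cancer = rng.random() < (0.2 if smoker else 0.01)
    return smoker, yellow, cancer

people = [simulate_person() for _ in range(100_000)]

def cancer_rate(rows):
    rows = list(rows)
    return sum(cancer for _, _, cancer in rows) / len(rows)

# Marginally, yellow fingers look predictive of cancer...
rate_yellow = cancer_rate(p for p in people if p[1])
rate_clean = cancer_rate(p for p in people if not p[1])

# ...but once the hidden cause is held fixed, the association vanishes.
smokers_yellow = cancer_rate(p for p in people if p[0] and p[1])
smokers_clean = cancer_rate(p for p in people if p[0] and not p[1])

print(rate_yellow, rate_clean, smokers_yellow, smokers_clean)
```

A method that sees only the two observed variables would find a strong dependence, which is why an unobserved common cause makes it hard to read edges in a learned graph as causal.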
44
Predictive Modeling
  • Predict one variable Y given a set of other
    variables X
  • Here X could be a p-dimensional vector
  • Classification: Y is categorical
  • Regression: Y is real-valued
  • In effect this is function approximation,
    learning the relationship between Y and X
  • Many, many algorithms for predictive modeling in
    statistics and machine learning
  • Often the emphasis is on predictive accuracy,
    less emphasis on understanding the model
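As a small sketch of predictive modeling, the nearest-neighbor rule below handles both cases: the same function returns a class label when Y is categorical and a number when Y is real-valued. The data and the function name `nearest` are hypothetical, and nearest neighbor is just one of the many algorithms the slide alludes to.

```python
import math

def nearest(train, x):
    """Return the Y of the training example whose X is closest to x."""
    return min(train, key=lambda pair: math.dist(pair[0], x))[1]

# Hypothetical training sets of (X, Y) pairs with p = 2.
classification_data = [
    ((1.0, 1.0), "residential"), ((1.2, 0.8), "residential"),
    ((5.0, 4.5), "business"), ((4.8, 5.2), "business"),
]
regression_data = [((1.0, 1.0), 10.0), ((2.0, 2.0), 20.0), ((3.0, 3.0), 30.0)]

label = nearest(classification_data, (4.9, 4.9))  # categorical Y
value = nearest(regression_data, (2.1, 1.9))      # real-valued Y
print(label, value)
```

The rule predicts well when similar X values have similar Y values, but it offers little insight into why, which matches the slide's point about accuracy versus understanding.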

45
Example of Predictive Modeling
  • Background
  • AT&T has about 100 million customers
  • It logs 200 million calls per day, 40 attributes
    each
  • 250 million unique telephone numbers
  • Which are business and which are residential?
  • Solution (Pregibon and Cortes, AT&T, 1997)
  • Proprietary model, using a few attributes,
    trained on known business customers to adaptively
    track p(business | data)
  • Significant systems engineering: data are
    downloaded nightly, model updated (20 processors,
    6Gb RAM, terabyte disk farm)
  • Status
  • running daily at AT&T
  • HTML interface used by AT&T marketing

46
Pattern Discovery
  • Goal is to discover interesting local patterns
    in the data rather than to characterize the data
    globally
  • given market basket data we might discover that
  • If customers buy wine and bread then they buy
    cheese with probability 0.9
  • These are known as association rules
  • Given multivariate data on astronomical objects
  • We might find a small group of previously
    undiscovered objects that are very self-similar
    in our feature space, but are very far away in
    feature space from all other objects
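The probability attached to an association rule can be estimated by counting, as in the sketch below. The baskets are invented, and the helper names `support` and `confidence` follow common usage rather than anything defined in these slides.

```python
# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"wine", "bread", "cheese"},
    {"wine", "bread", "cheese", "olives"},
    {"wine", "bread"},
    {"bread", "milk"},
    {"wine", "bread", "cheese"},
]

def support(itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the baskets."""
    return support(antecedent | consequent) / support(antecedent)

conf = confidence({"wine", "bread"}, {"cheese"})
print(conf)
```

Here four baskets contain wine and bread and three of those also contain cheese, so the estimated rule probability is about 0.75.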

47
Example of Pattern Discovery
ADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADAADABDBBDAB
ABBCDDDCDDABDCBBDBDBCBBABBBCBBABCBBACBBDBAACCADDAD
BDBBCBBCCBBBDCABDDBBADDBBBBCCACDABBABDDCDDBBABDBDD
BDDBCACDBBCCBBACDCADCBACCADCCCACCDDADCBCADADBAACCD
DDCBDBDCCCCACACACCDABDDBCADADBCBDDADABCCABDAACABCA
BACBDDDCBADCBDADDDDCDDCADCCBBADABBAAADAAABCCBCABDB
AADCBCDACBCABABCCBACBDABDDDADAABADCDCCDBBCDBDADDCC
BBCDBAADADBCAAAADBDCADBDBBBCDCCBCCCDCCADAADACABDAB
AABBDDBCADDDDBCDDBCCBBCCDADADACCCDABAABBCBDBDBADBB
BBCDADABABBDACDCDDDBBCDBBCBBCCDABCADDADBACBBBCCDBA
AADDDBDDCABACBCADCDCBAAADCADDADAABBACCBB
48
Example of Pattern Discovery
ADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADAADABDBBDAB
ABBCDDDCDDABDCBBDBDBCBBABBBCBBABCBBACBBDBAACCADDAD
BDBBCBBCCBBBDCABDDBBADDBBBBCCACDABBABDDCDDBBABDBDD
BDDBCACDBBCCBBACDCADCBACCADCCCACCDDADCBCADADBAACCD
DDCBDBDCCCCACACACCDABDDBCADADBCBDDADABCCABDAACABCA
BACBDDDCBADCBDADDDDCDDCADCCBBADABBAAADAAABCCBCABDB
AADCBCDACBCABABCCBACBDABDDDADAABADCDCCDBBCDBDADDCC
BBCDBAADADBCAAAADBDCADBDBBBCDCCBCCCDCCADAADACABDAB
AABBDDBCADDDDBCDDBCCBBCCDADADACCCDABAABBCBDBDBADBB
BBCDADABABBDACDCDDDBBCDBBCBBCCDABCADDADBACBBBCCDBA
AADDDBDDCABACBCADCDCBAAADCADDADAABBACCBB
49
Example of Pattern Discovery
  • IBM Advanced Scout System
  • Bhandari et al. (1997)
  • Every NBA basketball game is annotated,
  • e.g., time: 6 mins, 32 seconds; event: 3-point
    basket; player: Michael Jordan
  • This creates a huge untapped database of
    information
  • IBM algorithms search for rules of the form:
    if player A is in the game, player B's scoring
    rate increases from 3.2 points per quarter to 8.7
    points per quarter
  • IBM claimed around 1998 that all NBA teams except
    1 were using this software; the other team was
    Chicago.

50
Structure: Models and Patterns
  • Model: abstract representation of a process
  • e.g., very simple linear model structure
  • Y = aX + b
  • a and b are parameters determined from the data
  • Y = aX + b is the model structure
  • Y = 0.9X + 0.3 is a particular model
  • "All models are wrong, some are useful" (G.E.
    Box)
  • Pattern: represents local structure in a data
    set
  • E.g., if X > x then Y > y with probability p
  • or a pattern might be a small cluster of
    outliers in multi-dimensional space

51
Components of Data Mining Algorithms
  • Representation
  • Determining the nature and structure of the
    representation to be used
  • Score function
  • quantifying and comparing how well different
    representations fit the data
  • Search/Optimization method
  • Choosing an algorithmic process to optimize the
    score function
  • Data Management
  • Deciding what principles of data management are
    required to implement the algorithms efficiently.
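The first three components can be kept visibly separate in code, as in this sketch. It assumes a line as the representation, squared error as the score, and a coarse exhaustive grid as the search method; the data are made up, and the grid search is chosen for transparency, not efficiency.

```python
# Hypothetical (x, y) data.
data = [(0, 1.1), (1, 2.9), (2, 5.2), (3, 6.8), (4, 9.1)]

# Representation: the family of lines y = a*x + b.
def predict(a, b, x):
    return a * x + b

# Score function: sum of squared errors on the data (lower is better).
def score(a, b):
    return sum((y - predict(a, b, x)) ** 2 for x, y in data)

# Search/optimization: try every (a, b) on a grid from -5.0 to 5.0 in steps of 0.1.
grid = [i / 10 for i in range(-50, 51)]
best_a, best_b = min(((a, b) for a in grid for b in grid),
                     key=lambda ab: score(*ab))

print(best_a, best_b, score(best_a, best_b))
```

Swapping any one piece (a different model family, a different score, a smarter optimizer) leaves the other two untouched, which is the point of the decomposition. Data management does not arise here because the data fit trivially in memory.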

52
What's in a Data Mining Algorithm?
Task
Representation
Score Function
Search/Optimization
Data Management
Models, Parameters
53
An Example: Multivariate Linear Regression
Task: Regression
Representation: Y = weighted linear sum of Xs
Score Function: Least-squares
Search/Optimization: Gaussian elimination
Data Management: None specified
Models, Parameters: Regression coefficients
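A minimal sketch of this combination: the least-squares score is minimized by solving the normal equations (X^T X) w = X^T y, and the linear system is solved by Gaussian elimination. The function names and data are hypothetical, and the code assumes X^T X is non-singular.

```python
def gaussian_solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back-substitution on the upper-triangular system.
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - tail) / M[i][i]
    return w

def least_squares(X, y):
    """Regression coefficients minimizing the sum of squared errors."""
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * t for row, t in zip(X, y)) for i in range(p)]
    return gaussian_solve(XtX, Xty)

# Hypothetical data generated exactly from y = 1 + 2*x1 + 3*x2
# (the leading 1 in each row is the intercept term).
X = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 2, 1]]
y = [1, 3, 4, 6, 8]
w = least_squares(X, y)
print(w)
```

Because the data here are noise-free, the recovered coefficients match the generating ones up to floating-point error.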
54
An Example: Decision Trees (C4.5 or CART)
Task: Classification
Representation: Hierarchy of axis-parallel linear class
boundaries
Score Function: Cross-validated accuracy
Search/Optimization: Greedy search
Data Management: None specified
Models, Parameters: Decision tree classifier
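The greedy search can be illustrated by its basic step: choosing the single axis-parallel split that looks best right now. This sketch scores splits by training errors, which is a simplification swapped in for brevity; it is not the splitting criterion of C4.5 or CART, nor the cross-validated score named on the slide. The data and the name `best_split` are hypothetical.

```python
def best_split(rows, labels):
    """Greedy step: the axis-parallel split with the fewest training errors."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({r[feature] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[feature] <= threshold]
            right = [l for r, l in zip(rows, labels) if r[feature] > threshold]
            # Each non-empty side predicts its majority class.
            errors = sum(len(side) - max(side.count(c) for c in set(side))
                         for side in (left, right) if side)
            if best is None or errors < best[0]:
                best = (errors, feature, threshold)
    return best

# Hypothetical 2-D points: the classes separate on the first feature only.
rows = [(1.0, 7.0), (2.0, 3.0), (3.0, 8.0), (6.0, 2.0), (7.0, 9.0), (8.0, 4.0)]
labels = ["a", "a", "a", "b", "b", "b"]
errors, feature, threshold = best_split(rows, labels)
print(errors, feature, threshold)
```

A full tree learner would apply the same step recursively to each side of the chosen split, never revisiting earlier choices, which is what makes the search greedy.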
55
An Example: Hierarchical Clustering
Task: Clustering
Representation: Tree of clusters
Score Function: Various
Search/Optimization: Greedy search
Data Management: None specified
Models, Parameters: Dendrogram
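A sketch of greedy agglomerative clustering on one-dimensional points, assuming single linkage (the distance between clusters is the smallest gap between their members). Single linkage is only one of the "various" score choices, and the function name and data are hypothetical.

```python
def agglomerate(points):
    """Greedy single-linkage clustering of 1-D points; returns the merge history."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Greedy step: find the pair of clusters with the smallest gap.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: min(abs(a - b)
                               for a in clusters[ij[0]]
                               for b in clusters[ij[1]]),
        )
        merges.append((sorted(clusters[i]), sorted(clusters[j])))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

merges = agglomerate([1.0, 1.5, 5.0, 5.2, 9.0])
for left, right in merges:
    print(left, "+", right)
```

The ordered list of merges is exactly the information a dendrogram displays: which groups were joined, and in what order.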
56
An Example: Association Rules
Task: Pattern Discovery
Representation: Rules: if A and B then C with prob p
Score Function: No explicit score
Search/Optimization: Systematic search
Data Management: Multiple linear scans
Models, Parameters: Set of Rules
57
Data Mining: the downside
  • Hype
  • Data dredging, snooping and fishing
  • Finding spurious structure in data that is not
    real
  • historically, data mining was a derogatory term
    in the statistics community
  • making inferences from small samples
  • The Super Bowl fallacy
  • Bangladesh butter prices and the US stock market
  • The challenges of being interdisciplinary
  • computer science, statistics, domain discipline
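The danger of dredging with small samples can be demonstrated with pure noise: search enough unrelated variables and one of them will appear to track the target. Everything below is randomly generated; the sample size and number of candidates are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(42)

def correlation(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    mu, mv = statistics.mean(u), statistics.mean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5

n_samples, n_candidates = 10, 1000

# A target series and 1000 candidate predictors, all independent noise.
target = [rng.gauss(0, 1) for _ in range(n_samples)]
candidates = [[rng.gauss(0, 1) for _ in range(n_samples)]
              for _ in range(n_candidates)]

# "Discovery": keep whichever candidate correlates best with the target.
best = max(abs(correlation(c, target)) for c in candidates)
print(best)
```

The winning correlation is large even though no real relationship exists, which is the mechanism behind examples such as Bangladesh butter prices and the US stock market. Checking a discovered relationship on data that played no part in the search is the usual safeguard.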

58
Next Lecture
  • Discussion of class projects
  • Chapter 2
  • Measurement and data
  • Distance measures
  • Data quality