1
CS 277 Data Mining
Lecture 1: Introduction to Data Mining
  • Dr. David Newman
  • Department of Computer Science
  • University of California, Irvine

2
Acknowledgement
  • The material in these lectures is based on the
    data mining class taught by Professor Padhraic
    Smyth, and is used with his permission.

3
Philosophy behind this class
  • Develop an overall sense of how to extract
    information from data in a systematic way
  • Emphasis on the process of data mining
  • understanding specific algorithms and methods is important
  • but also emphasize the big picture of why, not just how
  • less emphasis on mathematical theory
  • Builds on knowledge from CS 273, 274

4
Logistics
  • Grading
  • 30%: homeworks (3 assignments)
  • Review guidelines for collaboration
  • 70%: class project
  • Will discuss in next lecture
  • Office hours
  • Fridays, 9:30 to 10:30
  • Web page
  • www.ics.uci.edu/newman/cs277
  • Prerequisites
  • Either CS 273 or 274, or equivalent
  • Text
  • Mining the Web: Discovering Knowledge from Hypertext Data, Soumen Chakrabarti

5
Logistics (cont.)
  • Matlab
  • You will need access to Matlab to complete the assignments
  • Link on the class webpage to Matlab resources
  • Emailing
  • ALL EMAILS TO ME MUST START THE SUBJECT LINE WITH cs277

6
Data Mining the Internet
  • This year, more focus on data mining text and
    internet content
  • Information Retrieval
  • Information Extraction
  • Clustering
  • Classification
  • Prediction
  • For every algorithm we use, we should always know
  • Time complexity
  • Space complexity

7
Lecture 1: Introduction to Data Mining
  • What is data mining?
  • Data sets
  • The data matrix
  • Other data formats
  • Data mining tasks
  • Exploration
  • Description
  • Prediction
  • Pattern finding
  • Data mining algorithms
  • Score functions, models, and optimization methods
  • The dark side of data mining

8
What is data mining?

9
What is data mining?
Wikipedia: Data mining has been defined as "the
nontrivial extraction of implicit, previously
unknown, and potentially useful information from
data" and "the science of extracting useful
information from large data sets or databases."

10
What is data mining?
The magic phrase used to...
  - put in your resume
  - use in a proposal to NSF, NIH, NASA, etc.
  - market database software
  - sell statistical analysis software
  - sell parallel computing hardware
  - sell consulting services

11
What is data mining?
Data-driven discovery of models and patterns
from massive observational data sets

12-15
What is data mining?
Data-driven discovery of models and patterns
from massive observational data sets

(Build slides: successive annotations tie this definition to Statistics and Inference; Languages and Representations; Engineering and Data Management; and Retrospective Analysis.)
16
Technological Driving Factors
  • Exponential growth in storage
  • True or False: every 2 years, more data is stored than all data stored in all previous years
  • Google, NASA, ...
  • Faster, cheaper processors
  • Moore's law: processor speed doubles every 18 months
  • End of Moore's law, but beginning of multicore
  • How does multicore change data mining?
  • New ideas in machine learning/statistics
  • Boosting, SVMs, decision trees, non-parametric Bayes, text models, etc.

17
Examples of massive data sets
  • PubMed text database
  • 16 million published articles
  • Google
  • Order of 10 billion Web pages indexed
  • 100s of millions of site visitors per day
  • CALTRANS loop sensor data
  • Every 30 seconds, thousands of sensors, 2 GBytes per day
  • NASA MODIS satellite
  • Coverage at 250m resolution, 37 bands, whole earth, every day
  • Retail transaction data
  • eBay, Amazon, Walmart: order of 100 million transactions per day
  • Visa, Mastercard: similar or larger numbers

18
Examples of massive data sets: Web 2.0
  • Blogs
  • Social-networking sites: MySpace
  • Photo sharing: Flickr
  • Video sharing: YouTube

19
Two Types of Data
  • Experimental Data
  • Hypothesis H
  • design an experiment to test H
  • collect data, infer how likely it is that H is
    true
  • e.g., clinical trials in medicine
  • Observational or Retrospective or Secondary Data
  • massive non-experimental data sets
  • e.g., Web logs, human genome, atmospheric
    simulations, etc
  • assumptions of experimental design no longer
    valid
  • how can we use such data to do science?
  • use the data to support model exploration,
    hypothesis testing

20
Data-Driven Discovery
  • Observational data
  • cheap relative to experimental data
  • Examples
  • Transaction data archives for retail stores,
    airlines, etc
  • Web logs for Amazon, Google, etc
  • The human/mouse/rat genome
  • makes sense to leverage available data
  • useful (?) information may be hidden in vast
    archives of data

21
Data Mining v. Statistics
  • Traditional statistics
  • first hypothesize, then collect data, then
    analyze
  • often model-oriented (strong parametric models)
  • Data mining
  • few if any a priori hypotheses
  • data is usually already collected a priori
  • analysis is typically data-driven not
    hypothesis-driven
  • Often algorithm-oriented rather than
    model-oriented
  • Different?
  • Yes, in terms of culture and motivation; however...
  • statistical ideas are very useful in data mining, e.g., in validating whether discovered knowledge is useful
  • Increasing overlap at the boundary of statistics and DM, e.g., exploratory data analysis (based on pioneering work of John Tukey in the 1960s)

22
Data Mining v. Machine Learning
  • To first-order, very little difference.
  • Data mining relies heavily on ideas from machine
    learning (and from statistics)
  • Some differences between DM and ML
  • More emphasis in DM on scalability, e.g.,
  • algorithms that can work on huge amounts of data
  • analyzing data in a relational database (reflects
    database roots of DM)
  • analyzing data streams
  • DM is somewhat more applications-oriented
  • Higher visibility in industry
  • ML is somewhat more theoretical and research-oriented

23
Data Mining: Intersection of Many Fields
(Figure: Data Mining at the intersection of Machine Learning (ML), Statistics, Computer Science (CS), Visualization, Databases (DB), Human-Computer Interaction (HCI), and High-Performance Computing (HPC))
24
Flat File or Vector Data
(Figure: an N × P data matrix)
  • Rows: objects
  • Columns: measurements on the objects
  • Represent each row as a P-dimensional vector, where P is the dimensionality
  • In effect, embed our objects in a P-dimensional vector space
  • Often useful, but not always appropriate
  • Both N and P can be very large in data mining
  • The matrix can be quite sparse

25
Sparse Matrix (Text) Data
(Figure: a sparse document-word count matrix; rows are text documents, columns are word IDs running from "aardvark" to "zygote")
Q: How do we store it in memory? e.g., how many aardvarks? how many words in doc 301? (A storage sketch follows.)
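A minimal Matlab sketch of the sparse-storage answer: keep only the nonzero (doc, word, count) triples. The document and word IDs, and the sizes N and V, are made-up assumptions for illustration.

    % Hypothetical (docID, wordID, count) triples; only nonzeros are stored
    docIDs  = [1; 1; 2; 301];        % document indices (made up)
    wordIDs = [5; 9; 5; 2];          % word indices, 1 = aardvark, ..., V = zygote
    counts  = [2; 1; 4; 3];          % occurrence counts
    N = 1000;  V = 50000;            % assumed corpus and vocabulary sizes
    X = sparse(docIDs, wordIDs, counts, N, V);

    nnz(X)                           % memory cost scales with nonzeros, not N*V
    full(sum(X(301, :)))             % total word count in doc 301

The design point: for text, almost all of the N × V cells are zero, so sparse storage and sparse matrix operations are what make these data fit in memory at all.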
26
Market Basket Data
27
Sequence (Web) Data
(Figure: a raw web server log; each row is one request: client IP, date, time, server, bytes, status code, method, URI)
128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -
128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -
128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -
(... further requests from 128.195.36.101 and 128.200.39.17 ...)
(The log is mapped to per-user navigation sequences over page-category IDs for Users 1-5, e.g., one user's sequence is 3 3 3 3 1 3 1 1 1 3 3 3 2 2 3 2.)
28
Time Series Data
29
Image Data
30
31
Spatio-temporal data
32
Tucker Balch and Frank Dellaert, Computer
Science Department, Georgia Tech
33
Tucker Balch and Frank Dellaert, Computer
Science Department, Georgia Tech
34
Relational Data
35
Algorithms for estimating relative importance in
networks S. White and P. Smyth, ACM SIGKDD,
2003.
36
HP Labs email network: 500 people, 20k relationships.
How does this network evolve over time?
37
Different Data Mining Tasks
  • Exploratory Data Analysis
  • Descriptive Modeling
  • Predictive Modeling
  • Discovering Patterns and Rules
  • others.

38
Exploratory Data Analysis
  • Getting an overall sense of the data set
  • Computing summary statistics
  • Number of distinct values, max, min, mean,
    median, variance, skewness,..
  • Visualization is widely used
  • 1d histograms
  • 2d scatter plots
  • Higher-dimensional methods
  • Useful for data checking
  • E.g., finding that a variable is always integer-valued or positive
  • Finding that some variables are highly skewed
  • Simple exploratory analysis can be extremely valuable
  • You should always look at your data before applying any data mining algorithms (see the sketch below)
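A minimal Matlab sketch of this kind of first look. The data matrix here is a simulated stand-in (rows = objects, columns = variables), with one variable deliberately made skewed.

    X = randn(500, 4);               % stand-in data matrix: 500 objects, 4 variables
    X(:, 2) = exp(X(:, 2));          % make one variable highly skewed

    % Summary statistics, one column per variable
    summary = [min(X); max(X); mean(X); median(X); var(X)]

    figure; hist(X(:, 2), 50);       % 1-d histogram exposes the skew
    figure; plotmatrix(X);           % all pairwise 2-d scatter plots

plotmatrix produces exactly the kind of scatter plot matrix shown for the Pima Indians data two slides ahead.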

39
Exploratory Data Analysis
  • What are exploratory data analyses for
  • text?
  • networks?

40
Example of Exploratory Data Analysis (Pima Indians data, scatter plot matrix)
41
Different Data Mining Tasks
  • Exploratory Data Analysis
  • Descriptive Modeling
  • Predictive Modeling
  • Discovering Patterns and Rules
  • others.

42
Descriptive Modeling
  • Goal is to build a descriptive model
  • e.g., a model that could simulate the data if
    needed
  • models the underlying process
  • Examples
  • Density estimation
  • estimate the joint distribution P(x1, ..., xp)
  • Cluster analysis
  • Find natural groups in the data
  • Dependency models among the p variables
  • Learning a Bayesian network for the data
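As a sketch of density estimation and clustering in one model, one could fit a two-component Gaussian mixture. This uses fitgmdist from the current Statistics Toolbox (an assumption; it post-dates this course), and the data below are simulated stand-ins for two groups such as control vs. anemia.

    % Simulate two overlapping groups
    X = [randn(200, 2); randn(200, 2) + 3];

    gm   = fitgmdist(X, 2);          % 2-component Gaussian mixture density estimate
    idx  = cluster(gm, X);           % assign each point to a mixture component
    Xnew = random(gm, 10);           % a descriptive model can simulate new data

Note how this matches the definition above: the fitted model estimates the joint density, finds natural groups, and could simulate the data if needed.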

43
Example of Descriptive Modeling
(Figure: Control Group vs. Anemia Group)
44
Example of Descriptive Modeling
(Figure: Control Group vs. Anemia Group, with clusters indicated)
45
Learning User Navigation Patterns from Web Logs
(Same figure as slide 27: a raw web server log converted into per-user navigation sequences of page-category IDs for Users 1-5.)
46
Clusters of Probabilistic State Machines
(Figure: three clusters of probabilistic state machines over page categories A, B, C, E)
Motivation: capture heterogeneity of Web surfing behavior
47
WebCanvas algorithm and software - currently in the new SQL Server
48
Different Data Mining Tasks
  • Exploratory Data Analysis
  • Descriptive Modeling
  • Predictive Modeling
  • Discovering Patterns and Rules
  • others.

49
Predictive Modeling
  • Predict one variable Y given a set of other variables X
  • Here X could be a p-dimensional vector
  • Classification: Y is categorical
  • Regression: Y is real-valued
  • In effect this is function approximation, learning the relationship between Y and X
  • Many, many algorithms for predictive modeling in statistics and machine learning
  • Often the emphasis is on predictive accuracy, with less emphasis on understanding the model

50
Predictive Modeling: Fraud Detection
  • Credit card fraud detection
  • Credit card losses in the US are over $1 billion per year
  • Roughly 1 in 50k transactions is fraudulent
  • Approach
  • For each transaction, estimate p(fraudulent | transaction)
  • Model is built on historical data of known fraud/non-fraud
  • High-probability transactions are investigated by fraud police
  • Example
  • Fair-Isaac/HNC's fraud detection software, based on neural networks, led to reported fraud decreases of 30% to 50%
  • http://www.fairisaac.com/fairisaac
  • Issues
  • Significant feature engineering/preprocessing
  • false alarm rate vs. missed detection rate: what is the tradeoff?

51
Predictive Modeling: Customer Scoring
  • Example: a bank has a database of 1 million past customers, 10% of whom took out mortgages
  • Use machine learning to rank new customers as a function of p(mortgage | customer data)
  • Customer data
  • History of transactions with the bank
  • Other credit data (obtained from Experian, etc.)
  • Demographic data on the customer or where they live
  • Techniques
  • Binary classification: logistic regression, decision trees, etc. (see the sketch below)
  • Many, many applications of this nature
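A minimal Matlab sketch of such scoring with logistic regression. glmfit/glmval are in the Statistics Toolbox; the features, coefficients, and outcomes below are simulated stand-ins, with the intercept chosen to give roughly a 10% base rate.

    N = 10000;
    X = randn(N, 3);                                   % stand-in customer features
    p = 1 ./ (1 + exp(-(X * [1; -0.5; 0.3] - 2.5)));   % true P(mortgage), ~10% base rate
    y = double(rand(N, 1) < p);                        % simulated 0/1 outcomes

    b = glmfit(X, y, 'binomial', 'link', 'logit');     % fit p(mortgage | customer data)
    scores = glmval(b, randn(5, 3), 'logit')           % probabilities for 5 new customers

The scores can then be sorted to rank new customers, which is all many such applications need.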

52
Different Data Mining Tasks
  • Exploratory Data Analysis
  • Descriptive Modeling
  • Predictive Modeling
  • Discovering Patterns and Rules
  • others.

53
Pattern Discovery
  • Goal is to discover interesting local patterns
    in the data rather than to characterize the data
    globally
  • Given market basket data, we might discover that
  • if customers buy wine and bread, then they buy cheese with probability 0.9
  • These are known as association rules (a sketch of checking one appears after this list)
  • Given multivariate data on astronomical objects
  • We might find a small group of previously
    undiscovered objects that are very self-similar
    in our feature space, but are very far away in
    feature space from all other objects
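A sketch of evaluating one such rule on a binary basket matrix (rows = transactions, columns = items). The matrix and the item column indices are simulated assumptions.

    B = rand(1000, 50) > 0.8;            % simulated basket matrix: B(t,i) = item i in transaction t
    wine = 1; bread = 2; cheese = 3;     % hypothetical item columns

    hasBoth    = B(:, wine) & B(:, bread);
    support    = mean(hasBoth)                              % fraction of baskets with wine & bread
    confidence = sum(hasBoth & B(:, cheese)) / sum(hasBoth) % estimate of P(cheese | wine, bread)

A rule is usually reported only when both its support and its confidence exceed chosen thresholds; on these random data the confidence will be near the base rate, not 0.9.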

54
Example of Pattern Discovery
ADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADAADABDBBDAB
ABBCDDDCDDABDCBBDBDBCBBABBBCBBABCBBACBBDBAACCADDAD
BDBBCBBCCBBBDCABDDBBADDBBBBCCACDABBABDDCDDBBABDBDD
BDDBCACDBBCCBBACDCADCBACCADCCCACCDDADCBCADADBAACCD
DDCBDBDCCCCACACACCDABDDBCADADBCBDDADABCCABDAACABCA
BACBDDDCBADCBDADDDDCDDCADCCBBADABBAAADAAABCCBCABDB
AADCBCDACBCABABCCBACBDABDDDADAABADCDCCDBBCDBDADDCC
BBCDBAADADBCAAAADBDCADBDBBBCDCCBCCCDCCADAADACABDAB
AABBDDBCADDDDBCDDBCCBBCCDADADACCCDABAABBCBDBDBADBB
BBCDADABABBDACDCDDDBBCDBBCBBCCDABCADDADBACBBBCCDBA
AADDDBDDCABACBCADCDCBAAADCADDADAABBACCBB
55
Example of Pattern Discovery
(Same sequence as the previous slide; the slide highlights the discovered pattern: the substring ADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADA occurs twice.)
56
Example of Pattern Discovery
  • IBM Advanced Scout System
  • Bhandari et al. (1997)
  • Every NBA basketball game is annotated,
  • e.g., time: 6 mins, 32 seconds; event: 3-point basket; player: Michael Jordan
  • This creates a huge untapped database of information
  • IBM algorithms search for rules of the form: "If player A is in the game, player B's scoring rate increases from 3.2 points per quarter to 8.7 points per quarter"
  • IBM claimed around 1998 that all NBA teams except 1 were using this software; the other team was Chicago.

57
Data Mining: the downside
  • Hype
  • Data dredging, snooping and fishing
  • Finding spurious structure in data that is not
    real
  • Historically, data mining was a derogatory term
    in the statistics community
  • making inferences from small samples
  • Bangladesh butter prices and the US stock market
  • The challenges of being interdisciplinary
  • computer science, statistics, domain discipline

58
Example of data fishing
  • Example data set with
  • 50 data vectors
  • 100 variables
  • Even if the data are entirely random (no dependence), there is a very high probability that some variables will appear dependent just by chance.
  • → Matlab example (sketch below)
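A sketch of what such a demo might look like: with 100 variables there are 4,950 variable pairs, so even at the 1% significance level we expect around 50 pairs to look "significantly" correlated purely by chance.

    rng(0);                          % for reproducibility
    X = randn(50, 100);              % 50 data vectors, 100 independent variables

    [R, P] = corrcoef(X);            % pairwise correlations and p-values
    upper = triu(true(100), 1);      % count each of the 4,950 pairs once
    nSpurious = sum(P(upper) < 0.01) % roughly 0.01 * 4950 ~ 50 "discoveries"
    rMax = max(abs(R(upper)))        % strongest correlation, despite no real dependence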

59
(Repeat of the previous slide.)

60
Example: Bonferroni's Principle
  • This example illustrates a problem with intelligence gathering.
  • Suppose we believe that certain groups of evil-doers are meeting occasionally in hotels to plot doing evil.
  • We want to find people who at least twice have stayed at the same hotel on the same day.

Example from http://infolab.stanford.edu/ullman/mining/2006
61
The Details
  • 10^9 people being tracked.
  • 1000 days.
  • Each person stays in a hotel 1% of the time (10 days out of 1000).
  • Hotels hold 100 people (so 10^5 hotels).
  • If everyone behaves randomly (i.e., no evil-doers), will the data mining detect anything suspicious?

62
Calculations --- (1)
  • Probability that persons p and q will be at the same hotel on day d:
    1/100 × 1/100 × 10^-5 = 10^-9
  • Probability that p and q will be at the same hotel on two given days:
    10^-9 × 10^-9 = 10^-18
  • Pairs of days:
    ≈ 5 × 10^5

63
Calculations --- (2)
  • Probability that p and q will be at the same hotel on some two days:
    5 × 10^5 × 10^-18 = 5 × 10^-13
  • Pairs of people:
    ≈ 5 × 10^17
  • Expected number of suspicious pairs of people:
    5 × 10^17 × 5 × 10^-13 = 250,000
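The whole calculation fits in a few lines of Matlab, as a sanity check (variable names are ours; the numbers are from the slides above):

    nPeople = 1e9;  nDays = 1000;  nHotels = 1e5;  pInHotel = 0.01;

    pSameHotelDay = pInHotel^2 / nHotels;          % 1e-9: both in a hotel, and the same one
    pTwoGivenDays = pSameHotelDay^2;               % 1e-18: same hotel on two specific days
    nDayPairs     = nchoosek(nDays, 2);            % ~5e5 pairs of days
    pSuspicious   = nDayPairs * pTwoGivenDays;     % ~5e-13 per pair of people
    nPeoplePairs  = nPeople * (nPeople - 1) / 2;   % ~5e17 pairs of people
    expected      = nPeoplePairs * pSuspicious     % ~250,000 expected false alarms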

64
Conclusion
  • Suppose there are (say) 10 pairs of evil-doers
    who definitely stayed at the same hotel twice.
  • Analysts have to sift through 250,010 candidates
    to find the 10 real cases.
  • Not gonna happen.
  • But how can we improve the scheme?

65
Moral
  • When looking for a property (e.g., two people stayed at the same hotel twice), make sure that the property does not allow so many possibilities that random data will surely produce facts "of interest."

66
Data Mining Resources
  • Wikipedia
  • Online (free) KD Nuggets newsletter
  • www.kdnuggets.com
  • Tends to be more industry-oriented than research,
    but nonetheless interesting
  • ACM SIGKDD Conference
  • Leading annual conference on DM and knowledge
    discovery
  • Papers provide a snapshot of current DM research
  • Machine learning resources
  • Journal of Machine Learning Research,
    www.jmlr.org
  • Annual proceedings of NIPS and ICML conferences

67
Next Lecture
  • Discussion of class projects
  • Lecture 2
  • Measurement and data
  • Distance measures
  • Data quality

68
  • LAST SLIDE