NPACI Summer Institute 2003 TUTORIAL: Data Mining for Scientific Applications

1
NPACI Summer Institute 2003 TUTORIAL
Data Mining for Scientific Applications
  • Tony Fountain, Peter Shin
  • San Diego Supercomputer Center, UCSD

2
NPACI Data Mining
  • Resources
  • TeraGrid

3
NSF TeraGrid: Building Integrated National CyberInfrastructure
  • Prototype for CyberInfrastructure
  • Ubiquitous computational resources
  • Plug-in compatibility
  • National Reach
  • SDSC, NCSA, CIT, ANL, PSC
  • High Performance Network
  • 40 Gb/s backbone, 30 Gb/s to each site
  • Over 20 Teraflops compute power
  • Approx. 1 PB rotating Storage
  • Extending by 3-4 sites in '03

4
  • SDSC is a Data-Intensive Center

5
SDSC Machine Room Data Architecture
  • 0.5 PB disk
  • 6 PB archive
  • 1 GB/s disk-to-tape
  • Optimized support for DB2 /Oracle
  • Philosophy: enable the SDSC configuration to serve
    the grid as a Data Center

6
SDSC IBM Regatta - DataStar
  • 100 TB Disk
  • Numerous fast CPUs
  • 64 GB of RAM per node
  • DB2 v8.x ESE
  • IBM Intelligent Miner
  • SAS Enterprise Miner
  • Platform for high-performance database, data
    mining, comparative IT studies

7
Data Mining Definition
  • The search for interesting patterns and models.

8
Data Mining Definition
  • The search for interesting patterns and models,
  • in large databases,
  • that were collected for other applications,
  • using machine learning algorithms,
  • and high-performance computational
    infrastructure.

9
Broad Definition: Analysis and Infrastructure
  • Informal methods: graphs, plots, visualizations,
    exploratory data analysis (yes, Excel is a data
    mining tool)
  • Advanced query processing and OLAP: e.g.,
    National Virtual Observatory (NVO), BLAST
  • Machine learning (compute-intensive statistical
    methods)
  • Supervised: classification, prediction
  • Unsupervised: clustering
  • Computational infrastructure: collections
    management, information integration,
    high-performance database systems, web services,
    grid services, the global IT grid

10
The Case for Data Mining: Data Reality
  • Deluge from new sources
  • Remote sensing
  • Microarray processing
  • Wireless communication
  • Simulation models
  • Instrumentation
  • Digital publishing
  • Federation of collections
  • Legacy archives and independent collection
    activities
  • Many types of data, many uses, many types of
    queries
  • Growth of data collections vs. analysts
  • Paradigm shift from hypothesis-driven data
    collection to data mining
  • Virtual laboratories and digital science

11
KDD Process: Knowledge Discovery and Data Mining
[Layered diagram, from Knowledge at the top to Data at the bottom:]
  • Application/Decision Support
  • Presentation/Visualization
  • Analysis/Modeling
  • Management/Federation/Warehousing
  • Processing/Cleansing/Corrections
  • Collection
  • The bulk of the difficult work is below analysis; integrated infrastructure increases efficiency.
12
Characteristics of Data Mining Applications
  • Lots of data, numerous sources
  • Noisy: missing values, outliers, interference
  • Heterogeneous: mixed types, mixed media
  • Complex: scale, resolution, temporal, spatial
    dimensions
  • Relatively little domain theory, few quantitative
    causal models
  • No rigorous experimental design, limited control
    on data collection
  • Lack of valid ground truth: all data is not
    equal!
  • Finding needles in haystacks
  • Advice: don't choose problems that have all
    these characteristics

13
Scientific vs. Commercial Data Mining
  • Goals: profits vs. theories
  • Need for insight and the depth of science
  • The role of black boxes and theory-based models
  • Produce interpretable model structures, generate
    domain rules or causal structures, support for
    theory development
  • Scientists-in-the-loop architectures
  • Data characteristics
  • Transaction data vs. images, sensors, simulations
  • Spatial and temporal dimensions, heterogeneity
  • Trend: the IT differences are diminishing --
    this is good!
  • Databases, integration tools, web services
  • Industry is a big IT engine

14
SDSC Applications
  • Alliance for Cell Signaling (AFCS)
  • Joint Center for Structural Genomics (JCSG)
  • Cooperative Association for Internet Data
    Analysis (CAIDA)

15
Hyperspectral Example
  • Characteristics of the data
  • Over 200 bands
  • Small number of samples through labor-intensive
    collecting process
  • Task
  • Classify the vegetation (e.g. Kangaroo Mound,
    Juniper, Pinyon, etc.)

16
Cancer Example
  • Data Set
  • 88 prostate tissue samples
  • 37 labeled no tumor,
  • 51 labeled tumor
  • Each tissue with 10,600 gene expression
    measurements
  • Collected by the UCSD Cancer Center, analyzed at
    SDSC
  • Tasks
  • Build model to classify new, unseen tissues as
    either no tumor or tumor
  • Report back to the domain expert the key genes
    used by the model, to find out their biological
    significance in the process of cancer

17
Some genes are more useful than others for
building classification models
Example: genes 36569_at and 36495_at are useful
[Scatter plot of the two genes' expression values, with samples labeled No Tumor and Tumor]
18
Civil Infrastructure Health Monitoring Example
  • Goal
  • Provide a flexible and scalable infrastructure
  • Process sensor network data streams
  • to monitor and analyze real-time sensor data
  • to integrate various types of sensor data
  • Support classification and decision support tasks
  • to build a real-time decision support system

19
Detecting Damage Location in a Bridge
  • Task
  • Identify which pier is damaged based on the
    streams of acceleration data measured by the sensors
    at the span midpoints.
  • Compare the prediction accuracy between exact
    data and approximate data
  • Testbed
  • Humboldt Bay Bridge with 8 piers.
  • Assumption
  • Damage occurs only at the lower end of each
    pier (the location of the plastic hinge)
  • Only one pier is damaged at a time.

20
Introduction to Machine Learning
  • Concepts and inductive reasoning
  • Supervised and unsupervised learning
  • Model development -- training and testing
    methodology, cross validation
  • Measuring performance -- overfitting, confusion
    matrices
  • Survey of algorithms
  • Decision Trees classification
  • k-means clustering
  • Hierarchical clustering
  • Bayesian networks and probabilistic inference
  • Support vector machines

21
Basic Machine Learning Theory
  • Inductive learning hypothesis
  • Any hypothesis found to approximate the target
    function well over a sufficiently large set of
    training examples will also approximate the
    target function well over other unobserved
    examples.
  • No Free Lunch Theorem
  • In the absence of prior information about the
    problem, there is no reason to prefer one
    learning algorithm over another.
  • Conclusion
  • There is no problem-independent best learning
    system. Formal theory and algorithms are not
    enough.
  • Machine learning is an empirical subject.

22
Concepts and Feature Vectors
  • Concepts can be identified by features
  • Example: vehicles
  • Has wheels
  • Runs on gasoline
  • Carries people
  • Flies
  • Weighs less than 500 pounds
  • Boolean feature vectors for vehicles
  • Car: 1 1 1 0 0
  • Motorcycle: 1 1 1 0 1
  • Airplane: 1 1 1 1 0
  • Motorcycle: 1 1 1 0 0

23
Concepts and Feature Vectors 2
  • Easy to generalize to complex data types
  • (type, num_wheels, fuel_type, carrying_capacity,
    max_altitude, weight)
  • (Car, 4, gas, 600, 0.0, 2190)
  • Most machine learning algorithms expect this
    input format (see the sketch below)
  • Suggestions
  • Identify the target concept
  • Organize your data to fit feature vector
    representation
  • Design your database schemas to support
    generation of data in this format
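The following Python sketch (using pandas, which is not one of the tutorial's tools) illustrates the feature-vector format described above; the records, column names, and values are hypothetical, chosen only to echo the slide's example.

  import pandas as pd

  # Hypothetical records following the slide's schema
  records = pd.DataFrame([
      {"type": "Car", "num_wheels": 4, "fuel_type": "gas",
       "carrying_capacity": 600, "max_altitude": 0.0, "weight": 2190},
      {"type": "Motorcycle", "num_wheels": 2, "fuel_type": "gas",
       "carrying_capacity": 300, "max_altitude": 0.0, "weight": 450},
  ])

  # One-hot encode the categorical column so every feature is numeric;
  # the "type" column is set aside as the target concept (label).
  X = pd.get_dummies(records.drop(columns=["type"]), columns=["fuel_type"])
  y = records["type"]
  print(X)
  print(y)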

24
Dimensions of Data: These help determine
algorithm selection
  • Questions to consider (metadata); see the inspection sketch after this list
  • Number of features (columns)?
  • Number of independent/dependent features?
  • Type of features?
  • Missing features?
  • Mixed types?
  • Labels? Types?
  • Ratio of rows to columns?
  • Goals of analysis?
  • Best guess at type of target function? (e.g.,
    linear?)
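Most of the questions above can be answered by a quick inspection of the data; a minimal pandas sketch, assuming the data are loaded into a DataFrame (the file name samples.csv is hypothetical):

  import pandas as pd

  df = pd.read_csv("samples.csv")    # hypothetical input file

  print(df.shape)                    # number of rows and columns, and their ratio
  print(df.dtypes)                   # feature types, mixed types
  print(df.isna().sum())             # missing values per feature
  print(df.describe(include="all"))  # quick per-column summary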

25
Challenges to Quality Ground Truth (labeled
training data)
  • Labels enable supervised learning
    (classification, prediction)
  • Ranges from difficult, to expensive, to impossible
  • Remote sensing imagery, biological field surveys
  • Clinical cancer data, limited biological samples
  • Failure signatures of civil infrastructure,
    catastrophes, terrorism
  • Approaches
  • Opportunistic labeling (e.g., security logs,
    multiuse field surveys)
  • Learning from process-based models (e.g., bridge
    simulations)
  • Shared community resources (amortize costs, e.g.,
    museum federation)
  • High-throughput annotations

26
Overview of Classification
  • Definition
  • A new observation can be assigned to one of
    several known classes using a rule.
  • The rule is learned using a set of labeled
    examples, through a process called supervised
    learning.
  • Survey of Applications
  • Ecosystem classification, hyperspectral image
    pixel classification, cancer diagnosis and
    prognosis, structural damage detection,
    crystallization success prediction, spam
    detection, etc.
  • Survey of Methods
  • Neural Network
  • Decision Trees
  • Naïve Bayesian Networks
  • Support Vector Machines

27
Classification: Decision Tree
[Training data table: Ecosystem vs. Precipitation]
28
Classification: Decision Tree
[Tree with root split: Precipitation > 63]
29
Classification: Decision Tree
[Tree with splits: Precipitation > 63, then Precipitation > 5]
30
Classification: Decision Tree
Learned Model:
IF (Precip > 63) then Forest; Else IF (Precip > 5) then Prairie; Else Desert
[Confusion matrix on training data: rows = Predicted (D, F, P), columns = True (D, F, P)]
Classification accuracy on training data is 100%
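A minimal scikit-learn sketch of this step (fit a tree, then compute the training confusion matrix and accuracy); the precipitation values and labels below are invented for illustration and are not the tutorial's dataset:

  import numpy as np
  from sklearn.tree import DecisionTreeClassifier
  from sklearn.metrics import confusion_matrix, accuracy_score

  # Invented training data: precipitation -> ecosystem label
  X_train = np.array([[2], [4], [10], [40], [70], [90]])
  y_train = np.array(["Desert", "Desert", "Prairie", "Prairie", "Forest", "Forest"])

  tree = DecisionTreeClassifier().fit(X_train, y_train)
  pred = tree.predict(X_train)

  print(confusion_matrix(y_train, pred, labels=["Desert", "Forest", "Prairie"]))
  print("training accuracy:", accuracy_score(y_train, pred))  # an unpruned tree typically reaches 1.0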
31
Classification: Decision Tree, Testing Set Results
Learned Model:
IF (Precip > 63) then Forest; Else IF (Precip > 5) then Prairie; Else Desert
[Confusion matrix on test data: rows = Predicted (D, F, P), columns = True (D, F, P)]
Result: Accuracy 67%; the model shows overfitting and generalizes poorly
32
Pruning to Improve Generalization: Pruned Decision Tree
IF (Precip < 60) then Desert; Else P(Forest) = 0.75, P(Prairie) = 0.25
[Pruned tree with a single split: Precipitation < 60]
33
Decision Trees Summary
  • Simple to understand
  • Works with mixed data types
  • Heuristic search, so sensitive to local minima
  • Models non-linear functions
  • Handles classification and regression
  • Many successful applications
  • Readily available tools

34
Cross Validation: Getting the Most Mileage from Your Data
[Diagram: Train, Test, Apply]
  • Creating training and testing data sets
  • Hold-out validation (2/3, 1/3 splits)
  • Cross validation, simple and n-fold (reuse)
  • Bootstrap validation (sample with replacement)
  • Jackknife validation (leave one out)
  • When possible, hide a subset of the data until
    train-test is complete.
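A brief scikit-learn sketch of two of the strategies above (hold-out and n-fold cross validation); the dataset is randomly generated here purely for illustration:

  import numpy as np
  from sklearn.model_selection import train_test_split, cross_val_score, KFold
  from sklearn.tree import DecisionTreeClassifier

  # Invented data: 200 samples, 5 features, binary labels
  rng = np.random.default_rng(0)
  X = rng.normal(size=(200, 5))
  y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

  # Hold-out validation (2/3 train, 1/3 test)
  X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)
  print("hold-out accuracy:",
        DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te))

  # 5-fold cross validation (each sample is used for testing exactly once)
  scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                           cv=KFold(n_splits=5, shuffle=True, random_state=0))
  print("5-fold CV accuracy:", scores.mean())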

35
Learning Curves: Reality Checks and Optimization Decisions
[Plot of train and test accuracy vs. model complexity (tree depth); the gap between the curves indicates overfitting, and the peak of the test curve marks the optimal depth]
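A rough sketch of how such a curve can be produced: sweep the maximum tree depth and compare training vs. test accuracy (the data here are invented; in practice you would use your own training and testing sets):

  import numpy as np
  from sklearn.model_selection import train_test_split
  from sklearn.tree import DecisionTreeClassifier

  # Invented data for illustration
  rng = np.random.default_rng(1)
  X = rng.normal(size=(300, 5))
  y = (X[:, 0] * X[:, 1] + rng.normal(scale=0.3, size=300) > 0).astype(int)
  X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=1)

  # A widening gap between train and test accuracy signals overfitting;
  # the depth where test accuracy peaks is the "optimal depth".
  for depth in range(1, 11):
      tree = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(X_tr, y_tr)
      print(depth, round(tree.score(X_tr, y_tr), 2), round(tree.score(X_te, y_te), 2))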
36
Hands-on Analysis
  • Decision Tree with IBM Intelligent Miner

37
Sales Problem
  • Goal
  • Maximize the profits on sales of quality shoes
  • Problem Characteristics
  • 1. We make $50 profit on a sale of $200 shoes.
  • 2. People who make over $50k buy the shoes at a
    rate of 5% when they receive the brochure.
  • 3. People who make less than $50k buy the shoes
    at a rate of 1% when they receive the brochure.
  • 4. It costs $1 to send a brochure to a potential
    customer.
  • 5. In general, we do not know whether a person
    will make more than $50k or not. However, we
    have indirect information about them.

38
Data Description
  • Credit
  • Census Bureau (1994)
  • Data processed and donated by Ron Kohavi and
    Barry Becker (Data Mining and Visualization, SGI)
  • Variable Description
  • Please refer to the hand-out.

39
Two Datasets
  • Train Set
  • Total number: 32,561 (100%)
  • Less than or equal to $50k: 24,720 (76%)
  • Over $50k: 7,841 (24%)
  • Test Set (Validation Set)
  • Total number: 16,281 (100%)
  • Less than or equal to $50k: 12,435 (76%)
  • Over $50k: 3,846 (24%)

40
Marketing Plan
  • We will send out 30,000 brochures.
  • Remember!
  • We do not know whether a person makes over $50k or
    not,
  • We do know the other information (e.g., age,
    education level, etc.)
  • Plans
  • Plan A: randomly send them out (a.k.a. the ran-dumb
    plan)
  • Plan B: send them to a group of people who are
    likely to make over $50k (a.k.a. the InTelligent (IT)
    plan)

41
Plan A (ran-dumb plan)
  • Cost of sending one brochure: $1
  • Probability of Response
  • 1% for the 76% of the population who make < $50k.
  • 5% for the 24% of the population who make > $50k.
  • Probability of response
  • (0.01 × 0.76) + (0.05 × 0.24) = 0.0196 (1.96%)
  • Estimated Earnings
  • Expected profit from one brochure = (Probability
    of response × profit) - Cost of a brochure
  • (0.0196 × $50) - $1 = -$0.02
  • Expected earning = Expected profit from one
    brochure × number of brochures sent
  • -$0.02 × 30,000 = -$600

42
Plan B (InTelligent (IT) plan)
  • Strategy
  • Send out brochures only to the ones who are
    likely to make over $50k.
  • Cost of sending one flier: $1
  • Probability of Response
  • Use the test result of the decision tree
  • Send out the brochures only to the ones that are
    predicted to make over $50k
  • Total number of cases predicted > $50k:
    3,061 (100%)
  • Number of cases predicted > $50k that
    actually make > $50k: 2,121 (69%)
  • Number of cases predicted > $50k that
    actually make < $50k: 940 (31%)
  • Probability of response
  • (0.01 × 0.31) + (0.05 × 0.69) = 0.0376 (3.76%)

43
Plan B (InTelligent (IT) plan)
  • Estimated Earnings
  • Expected profit from one brochure = (Probability
    of response × profit) - Cost of a brochure
  • (0.0376 × $50) - $1 = $0.88
  • Expected earning = (Expected profit from one
    brochure) × number of fliers
  • $0.88 × 30,000 = $26,400

44
Comparison of Two Plans
  • Expected earning from the ran-dumb plan
  • -$600
  • Expected earning from the IT plan
  • $26,400
  • Net Difference
  • $26,400 - (-$600) = $27,000
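The expected-value arithmetic behind both plans can be checked with a few lines of Python; this is just a sketch that reproduces the slides' numbers, not part of the original tutorial:

  PROFIT_PER_SALE = 50.0     # dollars of profit per responding customer
  COST_PER_BROCHURE = 1.0    # dollars to mail one brochure
  N_BROCHURES = 30_000

  def expected_earnings(p_high, p_low, resp_high=0.05, resp_low=0.01):
      # p_high / p_low: fraction of recipients above / below $50k income
      p_response = resp_high * p_high + resp_low * p_low
      profit_per_brochure = p_response * PROFIT_PER_SALE - COST_PER_BROCHURE
      return profit_per_brochure * N_BROCHURES

  print("Plan A (random):  ", expected_earnings(p_high=0.24, p_low=0.76))  # about -600
  print("Plan B (targeted):", expected_earnings(p_high=0.69, p_low=0.31))  # about 26,400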

45
Overview of Clustering
  • Definition
  • Clustering is the discovery of classes that
    belong together.
  • The classes are discovered using a set of
    unlabeled examples, through a process called
    unsupervised learning.
  • Survey of Applications
  • Grouping of web-visit data, clustering of genes
    according to their expression values, grouping of
    customers into distinct profiles using
    transaction data,
  • Survey of Methods
  • k-means clustering
  • Hierarchical clustering
  • Expectation Maximization (EM) algorithm
  • Gaussian mixture modeling
  • Cluster analysis
  • Concept discovery
  • Bootstrapping knowledge
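As a concrete instance of one of the methods listed above, here is a brief hierarchical-clustering sketch using SciPy (not one of the tutorial's tools); the two-dimensional points are invented:

  import numpy as np
  from scipy.cluster.hierarchy import linkage, fcluster

  # Invented 2-D points, e.g., (temperature, precipitation) per site
  rng = np.random.default_rng(2)
  X = np.vstack([rng.normal(loc=[10, 5], size=(20, 2)),
                 rng.normal(loc=[25, 70], size=(20, 2))])

  # Agglomerative (hierarchical) clustering with average linkage;
  # cut the resulting dendrogram into two flat clusters.
  Z = linkage(X, method="average")
  labels = fcluster(Z, t=2, criterion="maxclust")
  print(labels)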

46-55
Clustering: K-Means
[Slides 46-55 animate k-means on a scatter plot of Temperature vs. Precipitation; the closing slides summarize the result in a table with Cluster, Temperature, and Precipitation columns, and the final slide adds an Ecosystem label for each cluster.]
56
Using k-means
  • Requires a priori knowledge of k
  • The final outcome depends on the initial choice
    of the k means -- inconsistent results
  • Sensitive to outliers, which can skew the
    means of their clusters
  • Favors spherical clusters; clusters may not
    match domain boundaries
  • Requires real-valued features
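A minimal scikit-learn k-means sketch on invented data, reflecting the caveats above (k must be chosen up front, and multiple restarts reduce sensitivity to the initial choice of means):

  import numpy as np
  from sklearn.cluster import KMeans

  # Invented (temperature, precipitation) measurements forming two rough groups
  rng = np.random.default_rng(3)
  X = np.vstack([rng.normal(loc=[30, 5], size=(25, 2)),
                 rng.normal(loc=[15, 70], size=(25, 2))])

  # k (n_clusters) must be chosen a priori; n_init random restarts
  # make the result less dependent on the initial means.
  km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
  print(km.cluster_centers_)
  print(km.labels_[:10])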

57
Clustering Demo
  • IBM Intelligent Miner

58
Mining Difficult Datasets
  • Text
  • Sensor Streams
  • Web Mining

59
SKIDLkit
  • Toolkit for feature selection and classification