Scalable Benchmarks and Kernels for Data Mining and Analytics

1
Scalable Benchmarks and Kernels for Data Mining
and Analytics

Vipin Kumar
University of Minnesota
kumar_at_cs.umn.edu
www.cs.umn.edu/kumar
Joint work with Alok Choudhary and Gokhan Memik
(Northwestern) and Michael Steinbach (University
of Minnesota)
Research funded by NSF

2
Need for High Performance Data Mining

Today's digital society has seen enormous data
growth in both commercial and scientific
databases
Data mining is becoming a commonly used tool to
extract information from large and complex
datasets
Advances in computing capabilities and
technological innovation are needed to harvest the
available wealth of data

Homeland Security
Biomedical Data
Internet
Geo-spatial data
Computational Simulations
Sensor Networks
3
Data Mining for Climate Data

NASA ESE questions
How is the global Earth system changing?
What are the primary forcings?
How does the Earth system respond to natural and
human-induced changes?
What are the consequences of changes in the Earth
system?
How well can we predict future changes?

Global snapshots of values for a number of
variables on land surfaces or water

NASA DATA MINING REVEALS A NEW HISTORY OF NATURAL
DISASTERS NASA is using satellite data to paint a
detailed global picture of the interplay among
natural disasters, human activities and the rise
of carbon dioxide in the Earth's atmosphere
during the past 20 years.
http://www.nasa.gov/centers/ames/news/releases/2003/03_51AR.html

High Resolution EOS Data
EOS satellites provide high resolution
measurements
Finer spatial grids
A 1 km × 1 km grid produces 694,315,008 data points
Going from 0.5° × 0.5° data to 1 km × 1 km
data results in a 2500-fold increase in the data
size
More frequent measurements
Multiple instruments
High resolution data allows us to answer more
detailed questions
Detecting patterns such as trajectories, fronts,
and movements of regions with uniform properties
Finding relationships between leaf area index
(LAI) and topography of a river drainage basin
Finding relationships between fire frequency and
elevation as well as topographic position
Leads to substantially higher computational and
memory requirements
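
The 2500-fold figure follows from squaring the ratio of the cell edge lengths. A minimal sketch of that arithmetic, assuming a 0.5° cell is roughly 50 km on a side (an approximation that is not stated on the slide; the true length varies with latitude):

```python
# Assumption (not from the slide): a 0.5-degree cell edge is taken as ~50 km.
COARSE_EDGE_KM = 50.0
FINE_EDGE_KM = 1.0


def growth_factor(coarse_edge_km: float, fine_edge_km: float) -> float:
    """Number of fine cells needed to tile one coarse cell on a 2-D grid."""
    return (coarse_edge_km / fine_edge_km) ** 2


factor = growth_factor(COARSE_EDGE_KM, FINE_EDGE_KM)
print(factor)
```

Because the growth is quadratic in the edge ratio, halving the cell edge again would multiply the data size by another factor of four.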

Detection of Ecosystem Disturbances
This interactive module displays the locations on
the Earth's surface where significant disturbance
events have been detected.
Disturbance Viewer
4
Data Mining for Cyber Security

Due to the proliferation of the Internet, more and more
organizations are becoming vulnerable to
sophisticated cyber attacks
Traditional Intrusion Detection Systems (IDS)
have well-known limitations
Too many false alarms
Unable to detect sophisticated and novel attacks
Unable to detect insider abuse / policy abuse
Data Mining is well suited to address these
challenges

MINDS: Minnesota Intrusion Detection System

Large Scale Data Analysis is needed for
Correlation of suspicious events across network
sites
Helps detect sophisticated attacks not
identifiable by single site analyses
Analysis of long term data (months/years)
Uncover suspicious stealth activities (e.g.
insiders leaking/modifying information)

Incorporated into Interrogator architecture at
ARL Center for Intrusion Monitoring and
Protection (CIMP)
Helps analyze data from multiple sensors at DoD
sites around the country
Routinely detects Insider Abuse / Policy
Violations / Worms / Scans

5
Data Mining for Biomedical Informatics

Recent technological advances are helping to
generate large amounts of both medical and
genomic data
High-throughput experiments/techniques
Gene and protein sequences
Gene-expression data
Biological networks and phylogenetic profiles
Electronic Medical Records
IBM-Mayo Clinic partnership has created a database of 5
million patients
NIH Roadmap
Data mining offers a potential solution for
analysis of large-scale data
Automated analysis of patient histories for
customized treatment
Design of drugs/chemicals
Prediction of the functions of anonymous genes

Protein Interaction Network
6
Role of Benchmarks in Architecture Design

Benchmarks guide the development of new processor
architectures in addition to measuring the
relative performance of different systems
SPEC: General-purpose architecture
("Advances in the microprocessor industry would
not have been possible without the SPEC
benchmarks" - David Patterson)
TPC: Database systems
SPLASH: Parallel machine architectures
MediaBench: Media and communication processors
NetBench: Network/embedded processors

7
Do We Need Benchmarks Specific to Data Mining?

Performance metrics of several benchmarks
gathered with VTune
Cache miss ratios, bus usage, page faults, etc.
Benchmark applications were grouped using Kohonen
clustering to spot trends

8
Recently funded NSF project: Scalable Benchmarks,
Software and Data for Data Mining, Analytics and
Scientific Discoveries
PIs: A. Choudhary and Gokhan Memik (NW), V. Kumar
and M. Steinbach (UM)

Goal: Establish a comprehensive benchmarking
suite for data mining applications.
Motivate the development of new processor
architectures and system designs for data mining
Motivate the implementation of more sophisticated
data mining algorithms that can work within the
constraints imposed by current architecture
designs
Improve the productivity of scientists and
engineers using data mining applications in a wide
variety of domains

9
Data Mining Tasks
[Figure: tasks applied to data]
Clustering
Predictive Modeling
Anomaly Detection
Association Rules
10
Key Data Mining Algorithms

Clustering
K-means, EM, SOM
Single link / Group Average hierarchical
clustering
DBSCAN, SNN
Classification
Bayes
SVM
Decision trees, Rule based systems
Association Rule Mining
Apriori, FP-Growth
Anomaly Detection
Statistical methods
Distance-based
Clustering-based
Preprocessing
SVD, PCA

11
Major Data Mining Kernels

Counting
Given a set of data records, count the occurrences
of different categories to build a contingency table
Count the occurrences of a set of items in a set
of transactions
Pairwise computations
Given a set of data records, perform pairwise
distance/similarity computations
Linear Algebra operations
SVD, PCA
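
The counting and pairwise kernels listed above can be sketched in a few lines. This is a minimal, sequential illustration only; the function names, attribute names, and sample records are hypothetical and are not taken from the presentation or from MineBench. The linear algebra kernels (SVD, PCA) are not covered here.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def contingency_table(records, row_attr, col_attr):
    """Counting kernel: co-occurrence counts of two categorical attributes."""
    return Counter((r[row_attr], r[col_attr]) for r in records)


def itemset_support(itemset, transactions):
    """Counting kernel: number of transactions containing every item."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= set(t))


def pairwise_distances(points):
    """Pairwise kernel: Euclidean distance for every unordered pair."""
    return {
        (i, j): sqrt(sum((a - b) ** 2 for a, b in zip(points[i], points[j])))
        for i, j in combinations(range(len(points)), 2)
    }


# Hypothetical sample data, for illustration only.
records = [
    {"employed": "yes", "class": "good"},
    {"employed": "yes", "class": "good"},
    {"employed": "no", "class": "bad"},
    {"employed": "no", "class": "good"},
]
table = contingency_table(records, "employed", "class")
support = itemset_support({"A", "B"}, [{"A", "B", "C"}, {"A", "C"}, {"A", "B"}])
dists = pairwise_distances([(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)])
```

The pairwise kernel does work quadratic in the number of records, which is one reason it tends to dominate run time on large data sets.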

12
General Characteristics of Data Mining Algorithms

Dense/Sparse data
Hash table / Hash tree
Linked Lists
Iterative nature
Data often too large to fit in main memory
Spatial locality is critical

13
Constructing a Decision Tree
[Figure: attribute Employed]
14
Constructing a Decision Tree
[Figure: split on Employed: Yes / No]
15
Constructing a Decision Tree in Parallel
m categorical attributes
n records

Partitioning of data only
global reduction per node is required
large number of classification tree nodes gives
high communication cost
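
The data-partitioning approach can be illustrated with a small simulation: each processor builds class-count histograms over its own records, and the histograms are summed before a split is chosen. This is a sketch under stated assumptions, not the authors' implementation; the processors are simulated with plain lists, `global_reduction` stands in for a collective all-reduce, and the records are hypothetical.

```python
from collections import Counter


def local_counts(partition, attribute):
    """Class-count histogram for one attribute over one processor's records."""
    return Counter((rec[attribute], rec["label"]) for rec in partition)


def global_reduction(histograms):
    """Sum the per-processor histograms (stand-in for an all-reduce)."""
    total = Counter()
    for hist in histograms:
        total.update(hist)
    return total


# Hypothetical records, split across two simulated processors.
p0 = [{"employed": "yes", "label": "+"}, {"employed": "no", "label": "-"}]
p1 = [{"employed": "yes", "label": "+"}, {"employed": "yes", "label": "-"}]

merged = global_reduction([local_counts(p, "employed") for p in (p0, p1)])
```

One such reduction is needed for every tree node being expanded, which is why a tree with many nodes leads to a high communication cost.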

16
Constructing a Decision Tree in Parallel

Partitioning of classification tree nodes
natural concurrency
load imbalance
the amount of work associated with each node
varies
limited concurrency on the upper portion of the
tree
child nodes use the same data as used by parent
node
loss of locality
high data movement cost

17
Speedup Comparison of the Three Parallel
Algorithms

Data set used in SLIQ paper (Ref: Mehta, Agrawal
and Rissanen, 1996)
IBM SP2 with 128 processors

Dynamic load balancing inspired by parallel
sparse Cholesky factorization and parallel tree
search

18
Speedup of the Hybrid Algorithm with Different
Size Data Sets
19
Hash Table Access

Some efficient decision tree algorithms require
random access to large data structures.
Example: SPRINT (Ref: Shafer, Agrawal, Mehta,
1996)

[Figure: hash table with Left/Right entries, accessed by processors P0, P1, P2]
20
ScalParC (Ref: Joshi, Karypis, Kumar, 1998)

ScalParC is a scalable parallel decision tree
construction algorithm
Scales to a large number of processors
Scales to large training sets
ScalParC is memory efficient
The hash table is distributed among the
processors
ScalParC performs a minimal amount of communication
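
The idea of distributing the hash table among processors can be sketched as follows. This is an illustrative simulation only, not the ScalParC implementation: it assumes the table maps a record id to the child (Left or Right) that the record goes to, assumes a simple block distribution of record ids, and simulates processors with Python lists. All function names are hypothetical.

```python
def owner(record_id, n_records, n_procs):
    """Block distribution: which simulated processor holds this table entry."""
    block = -(-n_records // n_procs)  # ceiling division
    return record_id // block


def route_updates(updates, n_records, n_procs):
    """Group (record_id, child) updates by the processor that owns the entry."""
    outboxes = [[] for _ in range(n_procs)]
    for record_id, child in updates:
        outboxes[owner(record_id, n_records, n_procs)].append((record_id, child))
    return outboxes


def apply_updates(outboxes):
    """Each processor stores only the entries it owns."""
    return [dict(box) for box in outboxes]


N_RECORDS, N_PROCS = 9, 3
updates = [(0, "Left"), (4, "Right"), (8, "Left"), (5, "Left"), (1, "Right")]
tables = apply_updates(route_updates(updates, N_RECORDS, N_PROCS))
```

Because each processor keeps only its own block of entries, the memory needed per processor shrinks as processors are added, at the price of routing each update to its owner.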

21
The ScalParC Design is Inspired by...

Communication Structure of Parallel Sparse
Matrix-Vector Algorithms

[Figure: hash table entries distributed across processors P0, P1, P2]
22
Parallel Runtime (Ref: Joshi, Karypis, Kumar,
1998)
128-processor Cray T3D
23
Computing Association Patterns
1. Market-basket transactions
2. Find item combinations (itemsets) that
occur frequently in data
3. Generate association rules
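
The three steps on this slide can be sketched end to end. The sketch below enumerates itemsets by brute force rather than using Apriori or FP-Growth, so it is only suitable for tiny inputs; the transactions, thresholds, and function names are hypothetical and chosen for illustration.

```python
from itertools import combinations


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, min_count):
    """Step 2: brute-force enumeration of itemsets meeting the support count."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            count = support_count(frozenset(combo), transactions)
            if count >= min_count:
                result[frozenset(combo)] = count
    return result


def generate_rules(frequent, min_conf):
    """Step 3: rules X -> Y with confidence = support(X and Y) / support(X)."""
    rules = []
    for itemset, count in frequent.items():
        for size in range(1, len(itemset)):
            for combo in combinations(sorted(itemset), size):
                antecedent = frozenset(combo)
                # Every subset of a frequent itemset is itself frequent,
                # so the antecedent is guaranteed to be in `frequent`.
                confidence = count / frequent[antecedent]
                if confidence >= min_conf:
                    rules.append((antecedent, itemset - antecedent, confidence))
    return rules


# Step 1: hypothetical market-basket transactions.
baskets = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "diaper", "beer"}),
    frozenset({"milk", "diaper", "beer"}),
    frozenset({"bread", "milk", "diaper"}),
]
frequent = frequent_itemsets(baskets, min_count=2)
rules = generate_rules(frequent, min_conf=0.9)
```

With these baskets, the only rule that reaches 90% confidence is beer -> diaper, since both baskets containing beer also contain diaper.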
24
Counting Candidates

Frequent Itemsets are found by counting
candidates
Simple way
Search for each candidate in each transaction

Candidates (initial count):
A B: 0
A C: 0
A D: 0
A E: 0
B C: 0
B D: 0
A B E: 0
B C D: 0
A B D E: 0
A B C D E: 0

Transactions:
A B C D
A C E
B C D
A B D E
B C E
B D

[Figure labels: M, N]
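
The simple counting approach, searching for each candidate in each transaction, can be sketched as follows. The transactions and candidates are read from the flattened figure on this slide, so their grouping is an assumption; the function name is hypothetical.

```python
# Transactions and candidates as they appear in the slide's figure (assumed grouping).
transactions = [set("ABCD"), set("ACE"), set("BCD"), set("ABDE"), set("BCE"), set("BD")]
candidates = ["AB", "AC", "AD", "AE", "BC", "BD", "ABE", "BCD", "ABDE", "ABCDE"]


def count_candidates(candidates, transactions):
    """Naive counting: test every candidate against every transaction."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for c in candidates:
            if set(c) <= t:  # candidate is a subset of the transaction
                counts[c] += 1
    return counts


counts = count_candidates(candidates, transactions)
for candidate, count in counts.items():
    print(" ".join(candidate), count)
```

Every transaction is compared with every candidate, so the number of subset tests is the product of the two list sizes. Structures such as hash trees are used to avoid comparing against candidates that cannot match.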
25
Parallel Association Rules: Scaleup Results
(100K, 0.25) (Ref: Han, Karypis, and Kumar, 2000)
Efficient implementation of collective
communication
Dynamic restructuring of computation
26
Candidates for MineBench
27
Analysis of Benchmark Algorithms

Explore the bottlenecks associated with the
current general purpose sequential and parallel
machines
Explore how different architectural features
impact the performance of data mining algorithms

28
Preliminary Evaluation of Some Sample Data Sets

Example small (S), medium (M), and large (L) data
sets
Execution time for some algorithms in the
MineBench suite.

Reference: Liu Y., Pisharath J., Liao W., Memik
G., Choudhary A., Dubey P., 2004
29
Designing Efficient Kernels for Data Mining

Understanding the bottlenecks in executing DM
algorithms on current architectures will help
design new, more efficient algorithms
Focus will be on designing frequently used kernels
that dominate the execution time of most DM
algorithms

Both sequential and parallel versions will be
developed

Frequency of Kernel Operations in Representative
Applications
Reference: Pisharath J., Zambreno J.,
Ozisikyilmaz B., Choudhary A., 2006
30
Conclusions

Data mining applications are becoming
increasingly important
Current system design approaches are not adequate for
DM applications
MineBench: a new benchmark suite that
encompasses many algorithms found in data mining
Initial findings
Data mining applications are unique in terms of
performance characteristics
There is much room for optimization with
regard to data mining workloads

31
Bibliography

Introduction to Data Mining, Pang-Ning Tan,
Michael Steinbach, Vipin Kumar, Addison-Wesley,
April 2005
Introduction to Parallel Computing, (Second
Edition) by Ananth Grama, Anshul Gupta, George
Karypis, and Vipin Kumar. Addison-Wesley, 2003
Data Mining for Scientific and Engineering
Applications, edited by R. Grossman, C. Kamath,
W. P. Kegelmeyer, V. Kumar, and R. Namburu,
Kluwer Academic Publishers, 2001
J. Han, R. B. Altman, V. Kumar, H. Mannila, and
D. Pregibon, "Emerging Scientific Applications in
Data Mining", Communications of the ACM, Volume
45, Number 8, pp 54-58, August 2002
C. Potter, P. Tan, M. Steinbach, S. Klooster, V.
Kumar, R. Myneni, V. Genovese, Major Disturbance
Events in Terrestrial Ecosystems Detected using
global Satellite Data Sets, Global Change Biology
9 (7), 1005-1021, 2003
Vipin Kumar, "Parallel and Distributed Computing
for Cyber Security". An article based on the
keynote talk by the author at 17th International
Conference on Parallel and Distributed Computing
Systems (PDCS-2004). DS Online Journal, Volume 6,
Number 10, October 2005
Ying Liu, Jayaprakash Pisharath, Wei-keng Liao,
Gokhan Memik, Alok Choudhary, and Pradeep Dubey.
Performance Evaluation and Characterization of
Scalable Data Mining Algorithms. In Proceedings
of the 16th International Conference on Parallel
and Distributed Computing and Systems (PDCS),
November 2004.
Joseph Zambreno, Berkin Ozisikyilmaz, Jayaprakash
Pisharath, Gokhan Memik, and Alok Choudhary.
Performance Characterization of Data Mining
Applications using MineBench. In Proceedings of
the 9th Workshop on Computer Architecture
Evaluation using Commercial Workloads (CAECW-9),
February 2006.
Jayaprakash Pisharath, Joseph Zambreno, Berkin
Ozisikyilmaz, and Alok Choudhary. Accelerating
Data Mining Workloads: Current Approaches and
Future Challenges in System Architecture Design.
In Proceedings of the 9th International Workshop
on High Performance and Distributed Mining
(HPDM), April 2006.