Title: Scalable Benchmarks and Kernels for Data Mining and Analytics
1Scalable Benchmarks and Kernels for Data Mining
and Analytics
- Vipin Kumar
- University of Minnesota
- kumar_at_cs.umn.edu
- www.cs.umn.edu/kumar
- Joint work with Alok Choudhary and Gokhan Memik (Northwestern) and Michael Steinbach (University of Minnesota)
- Research funded by NSF
2Need for High Performance Data Mining
- Today's digital society has seen enormous data growth in both commercial and scientific databases
- Data Mining is becoming a commonly used tool to extract information from large and complex datasets
- Advances in computing capabilities and technological innovation are needed to harvest the available wealth of data
Homeland Security
Biomedical Data
Internet
Geo-spatial data
Computational Simulations
Sensor Networks
3Data Mining for Climate Data
- NASA ESE questions
- How is the global Earth system changing?
- What are the primary forcings?
- How does the Earth system respond to natural and human-induced changes?
- What are the consequences of changes in the Earth system?
- How well can we predict future changes?
- Global snapshots of values for a number of variables on land surfaces or water
NASA DATA MINING REVEALS A NEW HISTORY OF NATURAL DISASTERS
NASA is using satellite data to paint a detailed global picture of the interplay among natural disasters, human activities and the rise of carbon dioxide in the Earth's atmosphere during the past 20 years.
http://www.nasa.gov/centers/ames/news/releases/2003/03_51AR.html
- High Resolution EOS Data
- EOS satellites provide high resolution measurements
- Finer spatial grids
- A 1 km × 1 km grid produces 694,315,008 data points
- Going from 0.5° × 0.5° data to 1 km × 1 km data results in a 2500-fold increase in the data size
- More frequent measurements
- Multiple instruments
- High resolution data allows us to answer more detailed questions
- Detecting patterns such as trajectories, fronts, and movements of regions with uniform properties
- Finding relationships between leaf area index (LAI) and topography of a river drainage basin
- Finding relationships between fire frequency and elevation as well as topographic position
- Leads to substantially higher computational and memory requirements
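The 2500-fold figure is consistent with the squared ratio of grid spacings. The short check below is only a sketch: it assumes (the slides do not say so) that a 0.5-degree cell is treated as roughly 50 km on a side, and the helper name `refinement_factor` is hypothetical.

```python
def refinement_factor(coarse_km: float, fine_km: float) -> float:
    """Ratio of cell counts when a 2-D grid is refined from coarse_km to fine_km spacing."""
    return (coarse_km / fine_km) ** 2


# Assumption: a 0.5-degree cell is approximated as 50 km x 50 km.
factor = refinement_factor(50.0, 1.0)
print(factor)
```

Because the cell count grows with the square of the linear refinement, memory and compute requirements grow at the same rate for any algorithm that touches every cell.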
Detection of Ecosystem Disturbances
This interactive module displays the locations on
the earth surface where significant disturbance
events have been detected.
Disturbance Viewer
4Data Mining for Cyber Security
- Due to the proliferation of the Internet, more and more organizations are becoming vulnerable to sophisticated cyber attacks
- Traditional Intrusion Detection Systems (IDS) have well-known limitations
- Too many false alarms
- Unable to detect sophisticated and novel attacks
- Unable to detect insider abuse / policy abuse
- Data Mining is well suited to address these challenges
MINDS: Minnesota Intrusion Detection System
- Large Scale Data Analysis is needed for
- Correlation of suspicious events across network sites
- Helps detect sophisticated attacks not identifiable by single site analyses
- Analysis of long term data (months/years)
- Uncover suspicious stealth activities (e.g. insiders leaking/modifying information)
- Incorporated into Interrogator architecture at ARL Center for Intrusion Monitoring and Protection (CIMP)
- Helps analyze data from multiple sensors at DoD sites around the country
- Routinely detects Insider Abuse / Policy Violations / Worms / Scans
5Data Mining for Biomedical Informatics
- Recent technological advances are helping to generate large amounts of both medical and genomic data
- High-throughput experiments/techniques
- Gene and protein sequences
- Gene-expression data
- Biological networks and phylogenetic profiles
- Electronic Medical Records
- IBM-Mayo Clinic partnership has created a DB of 5 million patients
- NIH Roadmap
- Data mining offers a potential solution for analysis of large-scale data
- Automated analysis of patients' histories for customized treatment
- Design of drugs/chemicals
- Prediction of the functions of anonymous genes
Protein Interaction Network
6Role of Benchmarks in Architecture Design
- Benchmarks guide the development of new processor architectures in addition to measuring the relative performance of different systems
- SPEC: General purpose architecture
- ("Advances in the microprocessor industry would not have been possible without the SPEC benchmarks" - David Patterson)
- TPC: Database Systems
- SPLASH: Parallel machine architectures
- Mediabench: Media and Communication Processors
- NetBench: Network/Embedded processors
7Do We Need Benchmarks Specific to Data Mining?
- Performance metrics of several benchmarks gathered from VTune
- Cache miss ratios, Bus usage, Page faults etc.
- Benchmark applications were grouped using Kohonen clustering to spot trends
8Recently funded NSF project: Scalable Benchmarks, Software and Data for Data Mining, Analytics and Scientific Discoveries
PIs: A. Choudhary and Gokhan Memik (NW), V. Kumar and M. Steinbach (UM)
- Goal: Establish a comprehensive benchmarking suite for data mining applications.
- Motivate the development of new processor architectures and system designs for data mining
- Motivate the implementation of more sophisticated data mining algorithms that can work within the constraints imposed by current architecture designs
- Improve the productivity of scientists and engineers using data mining applications in a wide variety of domains
9Data Mining Tasks
Data
Clustering
Predictive Modeling
Anomaly Detection
Association Rules
10Key Data Mining Algorithms
- Clustering
- K-means, EM, SOM
- Single link / Group Average hierarchical clustering
- DBSCAN, SNN
- Classification
- Bayes
- SVM
- Decision trees, Rule based systems
- Association Rule Mining
- Apriori, FP-Growth
- Anomaly Detection
- Statistical methods
- Distance-based
- Clustering-based
- Preprocessing
- SVD, PCA
11Major Data Mining Kernels
- Counting
- Given a set of data records, count types of different categories to build a contingency table
- Count the occurrence of a set of items in a set of transactions
- Pairwise computations
- Given a set of data records, perform pairwise distance/similarity computations
- Linear Algebra operations
- SVD, PCA
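The first two kernels can be illustrated with a few lines of Python. This is a minimal sketch rather than MineBench code: the function names, the record layout, and the toy data are all hypothetical, and Euclidean distance is just one possible choice of distance measure.

```python
from collections import Counter
from itertools import combinations
from math import dist


def contingency_table(records, attr, label):
    """Counting kernel: tally (attribute value, class label) pairs."""
    return Counter((r[attr], r[label]) for r in records)


def pairwise_distances(points):
    """Pairwise kernel: Euclidean distance for every unordered pair of points."""
    return {
        (i, j): dist(points[i], points[j])
        for i, j in combinations(range(len(points)), 2)
    }


# Hypothetical toy records, used only for illustration.
records = [
    {"employed": "yes", "label": "good"},
    {"employed": "yes", "label": "good"},
    {"employed": "no", "label": "bad"},
    {"employed": "no", "label": "good"},
]
table = contingency_table(records, "employed", "label")
distances = pairwise_distances([(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)])
```

The counting kernel makes one pass over the records, while the pairwise kernel does work quadratic in the number of records, which is why it tends to dominate when data sets grow.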
12General Characteristics of Data Mining Algorithms
- Dense/Sparse data
- Hash table / Hash tree
- Linked Lists
- Iterative nature
- Data often too large to fit in main memory
- Spatial locality is critical
13Constructing a Decision Tree
Employed
14Constructing a Decision Tree
Employed Yes
Employed No
15Constructing a Decision Tree in Parallel
m categorical attributes
n records
- Partitioning of data only
- global reduction per node is required
- large number of classification tree nodes gives
high communication cost
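The data-partitioning scheme can be sketched as follows. This is a sequential simulation under stated assumptions, not the implementation evaluated in the talk: the processors are plain Python lists, `global_reduce` stands in for a message-passing all-reduce, and the attribute and class names are hypothetical.

```python
from collections import Counter


def local_counts(partition, attr):
    """Each processor tallies (attribute value, class) pairs on its own records."""
    return Counter((rec[attr], rec["cls"]) for rec in partition)


def global_reduce(per_processor_counts):
    """Stand-in for an all-reduce: sum the per-processor count tables."""
    total = Counter()
    for counts in per_processor_counts:
        total.update(counts)
    return total


# Hypothetical records split across two simulated processors.
p0 = [{"employed": "yes", "cls": "+"}, {"employed": "no", "cls": "-"}]
p1 = [{"employed": "yes", "cls": "+"}, {"employed": "yes", "cls": "-"}]

# One reduction like this is needed for every tree node that is expanded.
counts = global_reduce([local_counts(p, "employed") for p in (p0, p1)])
```

Since the reduction is repeated for each node, a tree with many nodes pays the communication cost many times, which is the drawback noted on the slide.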
16Constructing a Decision Tree in Parallel
- Partitioning of classification tree nodes
- natural concurrency
- load imbalance
- the amount of work associated with each node varies
- limited concurrency on the upper portion of the tree
- child nodes use the same data as used by parent node
- loss of locality
- high data movement cost
17Speedup Comparison of the Three Parallel
Algorithms
- Data set used in SLIQ paper (Ref: Mehta, Agrawal and Rissanen, 1996)
- IBM SP2 with 128 processors
- Dynamic load balancing inspired by parallel sparse Cholesky factorization and parallel tree search
18Speedup of the Hybrid Algorithm with Different
Size Data Sets
19Hash Table Access
- Some efficient decision tree algorithms require random access to large data structures.
- Example: SPRINT (Ref: Shafer, Agrawal, Mehta, 1996)
Hash Table
Processor P0
Left
Right
Processor P1
Processor P2
20ScalParC (Ref: Joshi, Karypis, Kumar, 1998)
- ScalParC is a scalable parallel decision tree construction algorithm
- Scales to a large number of processors
- Scales to large training sets
- ScalParC is memory efficient
- The hash table is distributed among the processors
- ScalParC performs a minimal amount of communication
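The idea of distributing the hash table can be illustrated with a simplified sequential simulation. It is a sketch under assumptions, not the ScalParC implementation: the block distribution by record id, the function names, and the "L"/"R" child labels are all chosen here for illustration.

```python
def owner(record_id, num_records, num_procs):
    """Processor that owns record_id under a block distribution (an assumption)."""
    block = -(-num_records // num_procs)  # ceiling division
    return record_id // block


def bucket_updates(updates, num_records, num_procs):
    """Group (record_id, child) updates by owning processor,
    so each processor sends at most one message per destination."""
    buckets = [[] for _ in range(num_procs)]
    for record_id, child in updates:
        buckets[owner(record_id, num_records, num_procs)].append((record_id, child))
    return buckets


def apply_updates(local_table, incoming):
    """The owner stores the record-to-child assignments in its slice of the table."""
    for record_id, child in incoming:
        local_table[record_id] = child


# Hypothetical example: 9 records, 3 simulated processors.
updates = [(0, "L"), (4, "R"), (8, "L"), (5, "L")]
buckets = bucket_updates(updates, 9, 3)
tables = [{} for _ in range(3)]
for proc, incoming in enumerate(buckets):
    apply_updates(tables[proc], incoming)
```

Each processor holds only its own slice of the table, so memory per processor shrinks as processors are added, and batching updates by destination keeps the number of messages bounded by the number of processors.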
21This ScalParC Design is Inspired by..
- Communication Structure of Parallel Sparse
Matrix-Vector Algorithms
Processor P0
Processor P1
Processor P2
Hash Table Entries
22Parallel Runtime (Ref: Joshi, Karypis, Kumar, 1998)
128 Processor Cray T3D
23Computing Association Patterns
1. Market-basket transactions
2. Find item combinations (itemsets) that occur frequently in data
3. Generate association rules
24Counting Candidates
- Frequent Itemsets are found by counting candidates
- Simple way
- Search for each candidate in each transaction
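Figure: N transactions are checked against M candidates; every candidate count starts at 0.
Transactions (N):
A B C D
A C E
B C D
A B D E
B C E
B D
Candidates (M), with Count:
A B 0
A C 0
A D 0
A E 0
B C 0
B D 0
A B E 0
B C D 0
A B D E 0
A B C D E 0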
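The simple counting scheme on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation; it assumes that the six itemsets listed without counts are the transactions and the ten itemsets listed with counts are the candidates. The function name `count_candidates` is hypothetical.

```python
def count_candidates(transactions, candidates):
    """Search for each candidate in each transaction and tally the matches."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        items = set(t)
        for c in candidates:
            if set(c) <= items:  # candidate is contained in the transaction
                counts[c] += 1
    return counts


# Itemsets taken from the slide; each letter is one item.
transactions = [tuple(s) for s in ("ABCD", "ACE", "BCD", "ABDE", "BCE", "BD")]
candidates = [
    tuple(s)
    for s in ("AB", "AC", "AD", "AE", "BC", "BD", "ABE", "BCD", "ABDE", "ABCDE")
]

counts = count_candidates(transactions, candidates)
for candidate, count in counts.items():
    print(" ".join(candidate), count)
```

The cost is proportional to the number of transactions times the number of candidates, which is why practical implementations replace the inner loop with a hash tree or similar structure.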
25Parallel Association Rules Scaleup Results (100K,0.25) (Ref: Han, Karypis, and Kumar, 2000)
Efficient implementation of collective
communication
Dynamic restructuring of computation
26Candidates for MineBench
27Analysis of Benchmark Algorithms
- Explore the bottlenecks associated with the current general purpose sequential and parallel machines
- Explore how different architectural features impact the performance of data mining algorithms
28Preliminary Evaluation of Some Sample Data Sets
- Example small (S), medium (M), and large (L) data sets
- Execution time for some algorithms in the MineBench suite.
Reference: Liu Y., Pisharath J., Liao W., Memik G., Choudhary A., Dubey P., 2004
29Designing Efficient Kernels for Data Mining
- Understanding the bottlenecks in executing DM algorithms on current architectures will help design new, more efficient algorithms
- The focus will be on designing frequently used kernels that dominate the execution time of most DM algorithms
- Both sequential and parallel versions will be developed
Frequency of Kernel Operations in Representative
Applications
Reference: Pisharath J., Zambreno J., Ozisikyilmaz B., Choudhary A., 2006
30Conclusions
- Data mining applications are becoming increasingly important
- The current systems design approach is not adequate for DM applications
- MineBench: a new benchmark suite that encompasses many algorithms found in data mining
- Initial findings
- Data mining applications are unique in terms of performance characteristics
- There exists much room for optimization with regard to data mining workloads
31Bibliography
- Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Addison-Wesley, April 2005
- Introduction to Parallel Computing (Second Edition), by Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Addison-Wesley, 2003
- Data Mining for Scientific and Engineering Applications, edited by R. Grossman, C. Kamath, W. P. Kegelmeyer, V. Kumar, and R. Namburu, Kluwer Academic Publishers, 2001
- J. Han, R. B. Altman, V. Kumar, H. Mannila, and D. Pregibon, "Emerging Scientific Applications in Data Mining", Communications of the ACM, Volume 45, Number 8, pp 54-58, August 2002
- C. Potter, P. Tan, M. Steinbach, S. Klooster, V. Kumar, R. Myneni, V. Genovese, "Major Disturbance Events in Terrestrial Ecosystems Detected using Global Satellite Data Sets", Global Change Biology 9 (7), 1005-1021, 2003
- Vipin Kumar, "Parallel and Distributed Computing for Cyber Security". An article based on the keynote talk by the author at 17th International Conference on Parallel and Distributed Computing Systems (PDCS-2004). DS Online Journal, Volume 6, Number 10, October 2005
- Ying Liu, Jayaprakash Pisharath, Wei-keng Liao, Gokhan Memik, Alok Choudhary, and Pradeep Dubey. Performance Evaluation and Characterization of Scalable Data Mining Algorithms. In Proceedings of the 16th International Conference on Parallel and Distributed Computing and Systems (PDCS), November 2004.
- Joseph Zambreno, Berkin Ozisikyilmaz, Jayaprakash Pisharath, Gokhan Memik, and Alok Choudhary. Performance Characterization of Data Mining Applications using MineBench. In Proceedings of the 9th Workshop on Computer Architecture Evaluation using Commercial Workloads (CAECW-9), February 2006.
- Jayaprakash Pisharath, Joseph Zambreno, Berkin Ozisikyilmaz, and Alok Choudhary. Accelerating Data Mining Workloads: Current Approaches and Future Challenges in System Architecture Design. In Proceedings of the 9th International Workshop on High Performance and Distributed Mining (HPDM), April 2006