Scalable Benchmarks and Kernels for Data Mining and Analytics

1
Scalable Benchmarks and Kernels for Data Mining
and Analytics
  • Vipin Kumar
  • University of Minnesota
  • kumar_at_cs.umn.edu
  • www.cs.umn.edu/kumar
  • Joint work with Alok Choudhary and Gokhan Memik
    (Northwestern) and Michael Steinbach (University
    of Minnesota)
  • Research funded by NSF

2
Need for High Performance Data Mining
  • Today's digital society has seen enormous data
    growth in both commercial and scientific
    databases
  • Data mining is becoming a commonly used tool to
    extract information from large and complex
    datasets
  • Advances in computing capabilities and
    technological innovation are needed to harvest
    the available wealth of data

Homeland Security
Biomedical Data
Internet
Geo-spatial data
Computational Simulations
Sensor Networks
3
Data Mining for Climate Data
  • NASA ESE questions
  • How is the global Earth system changing?
  • What are the primary forcings?
  • How does the Earth system respond to natural and
    human-induced changes?
  • What are the consequences of changes in the Earth
    system?
  • How well can we predict future changes?
  • Global snapshots of values for a number of
    variables on land surfaces or water

NASA DATA MINING REVEALS A NEW HISTORY OF NATURAL
DISASTERS: NASA is using satellite data to paint a
detailed global picture of the interplay among
natural disasters, human activities and the rise
of carbon dioxide in the Earth's atmosphere
during the past 20 years.
http://www.nasa.gov/centers/ames/news/releases/2003/03_51AR.html
  • High Resolution EOS Data
  • EOS satellites provide high resolution
    measurements
  • Finer spatial grids
  • 1 km × 1 km grid produces 694,315,008 data points
  • Going from 0.5° × 0.5° data to 1 km × 1 km
    data results in a 2500-fold increase in the data
    size
  • More frequent measurements
  • Multiple instruments
  • High resolution data allows us to answer more
    detailed questions
  • Detecting patterns such as trajectories, fronts,
    and movements of regions with uniform properties
  • Finding relationships between leaf area index
    (LAI) and topography of a river drainage basin
  • Finding relationships between fire frequency and
    elevation as well as topographic position
  • Leads to substantially higher computational and
    memory requirements
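
The 2500-fold figure follows from simple grid arithmetic if 0.5° is taken to be roughly 50 km on the ground. That conversion is an assumption, not something stated on the slide, and the true length of a degree varies with latitude. A minimal check, with a hypothetical helper name:

```python
# Assumption (not from the slide): a 0.5 degree cell is ~50 km on a side.
COARSE_CELL_KM = 50.0
FINE_CELL_KM = 1.0


def cells_per_coarse_cell(coarse_km: float, fine_km: float) -> float:
    """Number of fine grid cells needed to tile one coarse grid cell."""
    return (coarse_km / fine_km) ** 2


increase = cells_per_coarse_cell(COARSE_CELL_KM, FINE_CELL_KM)
print(f"Data-size increase: {increase:.0f}x")
```

Each coarse cell is replaced by 50 × 50 fine cells, which is where the quadratic growth in data size comes from.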

Detection of Ecosystem Disturbances
This interactive module displays the locations on
the earth surface where significant disturbance
events have been detected.
Disturbance Viewer
4
Data Mining for Cyber Security
  • Due to the proliferation of the Internet, more
    and more organizations are becoming vulnerable
    to sophisticated cyber attacks
  • Traditional Intrusion Detection Systems (IDS)
    have well-known limitations
  • Too many false alarms
  • Unable to detect sophisticated and novel attacks
  • Unable to detect insider abuse / policy abuse
  • Data Mining is well suited to address these
    challenges

MINDS: Minnesota Intrusion Detection System
  • Large-scale data analysis is needed for:
  • Correlation of suspicious events across network
    sites
  • Helps detect sophisticated attacks not
    identifiable by single site analyses
  • Analysis of long term data (months/years)
  • Uncover suspicious stealth activities (e.g.
    insiders leaking/modifying information)
  • Incorporated into Interrogator architecture at
    ARL Center for Intrusion Monitoring and
    Protection (CIMP)
  • Helps analyze data from multiple sensors at DoD
    sites around the country
  • Routinely detects Insider Abuse / Policy
    Violations / Worms / Scans

5
Data Mining for Biomedical Informatics
  • Recent technological advances are helping to
    generate large amounts of both medical and
    genomic data
  • High-throughput experiments/techniques
  • Gene and protein sequences
  • Gene-expression data
  • Biological networks and phylogenetic profiles
  • Electronic Medical Records
  • The IBM-Mayo Clinic partnership has created a
    database of 5 million patients
  • NIH Roadmap
  • Data mining offers a potential solution for the
    analysis of large-scale data
  • Automated analysis of patient histories for
    customized treatment
  • Design of drugs/chemicals
  • Prediction of the functions of anonymous genes

Protein Interaction Network
6
Role of Benchmarks in Architecture Design
  • Benchmarks guide the development of new processor
    architectures in addition to measuring the
    relative performance of different systems
  • SPEC: general-purpose architectures
  • ("Advances in the microprocessor industry would
    not have been possible without the SPEC
    benchmarks" - David Patterson)
  • TPC: database systems
  • SPLASH: parallel machine architectures
  • MediaBench: media and communication processors
  • NetBench: network/embedded processors

7
Do We Need Benchmarks Specific to Data Mining?
  • Performance metrics of several benchmarks were
    gathered using VTune
  • Cache miss ratios, bus usage, page faults, etc.
  • Benchmark applications were grouped using Kohonen
    clustering to spot trends

8
Recently funded NSF project: Scalable Benchmarks,
Software and Data for Data Mining, Analytics and
Scientific Discoveries
PIs: A. Choudhary and Gokhan Memik (NW), V. Kumar
and M. Steinbach (UM)
  • Goal: Establish a comprehensive benchmarking
    suite for data mining applications.
  • Motivate the development of new processor
    architectures and system designs for data mining
  • Motivate the implementation of more sophisticated
    data mining algorithms that can work with the
    constraints imposed by current architecture
    designs
  • Improve the productivity of scientists and
    engineers using data mining applications in a
    wide variety of domains

9
Data Mining Tasks
[Figure: data mining tasks applied to data: clustering, predictive modeling, anomaly detection, association rules]
10
Key Data Mining Algorithms
  • Clustering
  • K-means, EM, SOM
  • Single link / Group Average hierarchical
    clustering
  • DBSCAN, SNN
  • Classification
  • Bayes
  • SVM
  • Decision trees, Rule based systems
  • Association Rule Mining
  • Apriori, FP-Growth
  • Anomaly Detection
  • Statistical methods
  • Distance-based
  • Clustering-based
  • Preprocessing
  • SVD, PCA

11
Major Data Mining Kernels
  • Counting
  • Given a set of data records, count the occurrences
    of different categories to build a contingency table
  • Count the occurrence of a set of items in a set
    of transactions
  • Pairwise computations
  • Given a set of data records, perform pairwise
    distance/similarity computations
  • Linear Algebra operations
  • SVD, PCA
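
As a rough illustration of the first two kernels, the sketch below builds a contingency table by counting and computes pairwise Euclidean distances. The function names and the toy records are hypothetical and chosen only for illustration; they are not taken from MineBench or from the talk.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def contingency_table(records, attr_index, class_index):
    """Counting kernel: count (attribute value, class label) co-occurrences."""
    return Counter((r[attr_index], r[class_index]) for r in records)


def pairwise_distances(points):
    """Pairwise kernel: Euclidean distance for every unordered pair of points."""
    return {
        (i, j): sqrt(sum((a - b) ** 2 for a, b in zip(points[i], points[j])))
        for i, j in combinations(range(len(points)), 2)
    }


# Hypothetical toy records: (employed, class label).
records = [("yes", "good"), ("yes", "good"), ("no", "bad"), ("yes", "bad")]
table = contingency_table(records, 0, 1)

# Hypothetical 2-D points.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
dists = pairwise_distances(points)
```

The counting kernel makes one pass over the data, while the pairwise kernel does work quadratic in the number of records, which is one reason the two stress a memory system differently.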

12
General Characteristics of Data Mining Algorithms
  • Dense/Sparse data
  • Hash table / Hash tree
  • Linked Lists
  • Iterative nature
  • Data often too large to fit in main memory
  • Spatial locality is critical

13
Constructing a Decision Tree
[Figure: decision tree; the root node tests the attribute Employed]
14
Constructing a Decision Tree
[Figure: decision tree; branches Employed = Yes and Employed = No]
15
Constructing a Decision Tree in Parallel
m categorical attributes
n records
  • Partitioning of data only
  • global reduction per node is required
  • large number of classification tree nodes gives
    high communication cost
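
The data-partitioning scheme can be sketched by simulating processors as lists of records: each one counts locally, and a global reduction combines the local tables. The sum over `Counter` objects is only a stand-in for a real all-reduce, and the names and toy data are hypothetical.

```python
from collections import Counter
from functools import reduce


def local_counts(partition, attr_index, class_index):
    """Each processor counts (attribute value, class) pairs in its own records."""
    return Counter((rec[attr_index], rec[class_index]) for rec in partition)


def global_reduction(per_processor_counts):
    """Stand-in for an all-reduce: sum the local tables into one global table."""
    return reduce(lambda a, b: a + b, per_processor_counts, Counter())


# Hypothetical records (employed, class) split across three simulated processors.
partitions = [
    [("yes", "good"), ("no", "bad")],
    [("yes", "good"), ("yes", "bad")],
    [("no", "bad")],
]

# One such reduction is needed for every tree node that is expanded.
global_table = global_reduction(local_counts(p, 0, 1) for p in partitions)
```

Because the reduction is repeated for every node, a tree with many nodes pays the communication cost many times over, which is the drawback the slide points out.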

16
Constructing a Decision Tree in Parallel
  • Partitioning of classification tree nodes
  • natural concurrency
  • load imbalance
  • the amount of work associated with each node
    varies
  • limited concurrency on the upper portion of the
    tree
  • child nodes use the same data as used by parent
    node
  • loss of locality
  • high data movement cost

17
Speedup Comparison of the Three Parallel
Algorithms
  • Data set used in the SLIQ paper (Ref: Mehta,
    Agrawal and Rissanen, 1996)
  • IBM SP2 with 128 processors
  • Dynamic load balancing inspired by parallel
    sparse Cholesky factorization and parallel tree
    search

18
Speedup of the Hybrid Algorithm with Different
Size Data Sets
19
Hash Table Access
  • Some efficient decision tree algorithms require
    random access to large data structures.
  • Example: SPRINT (Ref: Shafer, Agrawal, Mehta,
    1996)

[Figure: hash table accessed by processors P0, P1 and P2; entries marked Left or Right]
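
The random-access pattern can be sketched as follows, assuming SPRINT-style attribute lists of (value, record id) pairs. The hash table maps each record id to a child node, and every entry of every attribute list probes it. The function name, threshold split and data are hypothetical simplifications.

```python
def split_attribute_lists(attribute_lists, split_attr, threshold):
    """Split every attribute list using a record-id -> child hash table.

    attribute_lists maps an attribute name to a list of (value, record_id)
    entries, each list sorted by value.
    """
    # Build the hash table from the splitting attribute's own list.
    side_of = {
        rid: "left" if value < threshold else "right"
        for value, rid in attribute_lists[split_attr]
    }
    children = {"left": {}, "right": {}}
    for name, entries in attribute_lists.items():
        for side in children:
            children[side][name] = []
        # Every entry probes the hash table by record id; because the lists
        # are sorted by value, the record ids arrive in effectively random order.
        for value, rid in entries:
            children[side_of[rid]][name].append((value, rid))
    return side_of, children


# Hypothetical data: two attributes over four records.
lists = {
    "age": [(23, 2), (31, 0), (45, 3), (52, 1)],
    "salary": [(20, 0), (35, 2), (60, 1), (90, 3)],
}
side_of, children = split_attribute_lists(lists, "age", 40)
```

The probes into `side_of` are what make the hash table a bottleneck once it no longer fits in one processor's memory.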
20
ScalParC (Ref: Joshi, Karypis, Kumar, 1998)
  • ScalParC is a scalable parallel decision tree
    construction algorithm
  • Scales to large number of processors
  • Scales to large training sets
  • ScalParC is memory efficient
  • The hash-table is distributed among the
    processors
  • ScalParC performs a minimal amount of communication

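A distributed hash table of this kind can be simulated in a few lines. The sketch assumes a block layout in which each processor owns a contiguous range of record ids, and it batches lookups by owner. The slide does not specify the layout or the message pattern, so both are assumptions, and all names are hypothetical.

```python
def owner(rid, n_records, n_procs):
    """Processor that stores the hash-table entry for a record id (block layout)."""
    block = -(-n_records // n_procs)  # ceiling division
    return rid // block


def distributed_lookup(local_tables, queries_per_proc, n_records):
    """Simulate batched lookups against a hash table spread over processors."""
    n_procs = len(local_tables)
    answers = [dict() for _ in range(n_procs)]
    for requester, rids in enumerate(queries_per_proc):
        # Group requests by owning processor so each owner receives one batch.
        batches = {}
        for rid in rids:
            batches.setdefault(owner(rid, n_records, n_procs), []).append(rid)
        for own, batch in batches.items():
            for rid in batch:
                answers[requester][rid] = local_tables[own][rid]
    return answers


# Hypothetical setup: 6 records, 3 processors, 2 hash-table entries each.
n_records = 6
assignment = {0: "left", 1: "right", 2: "left", 3: "left", 4: "right", 5: "right"}
local_tables = [{}, {}, {}]
for rid, side in assignment.items():
    local_tables[owner(rid, n_records, 3)][rid] = side

queries = [[4, 1], [0, 5], [2, 3]]
answers = distributed_lookup(local_tables, queries, n_records)
```

No processor holds more than its share of the table, so memory per processor shrinks as processors are added.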
21
This ScalParC Design is Inspired by...
  • Communication Structure of Parallel Sparse
    Matrix-Vector Algorithms

[Figure: hash table entries distributed across processors P0, P1 and P2]
22
Parallel Runtime (Ref: Joshi, Karypis, Kumar,
1998)
128-processor Cray T3D
23
Computing Association Patterns
1. Market-basket transactions
2. Find item combinations (itemsets) that
occur frequently in data
3. Generate association rules
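
The link between frequent itemsets and rules can be made concrete with support counts and rule confidence, where the confidence of X -> Y is support(X ∪ Y) / support(X). The sketch below uses hypothetical market-basket data and hypothetical function names.

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def rule_confidence(antecedent, consequent, transactions):
    """Confidence of antecedent -> consequent: support(X ∪ Y) / support(X)."""
    return (support_count(antecedent | consequent, transactions)
            / support_count(antecedent, transactions))


# Hypothetical market-basket transactions.
baskets = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "diapers", "beer"}),
    frozenset({"milk", "diapers", "beer"}),
    frozenset({"bread", "milk", "diapers"}),
]

conf = rule_confidence(frozenset({"diapers"}), frozenset({"beer"}), baskets)
```

Here three baskets contain diapers and two of those also contain beer, so the rule {diapers} -> {beer} has confidence 2/3.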
24
Counting Candidates
  • Frequent Itemsets are found by counting
    candidates
  • Simple way
  • Search for each candidate in each transaction

Transactions (N):
  A B C D
  A C E
  B C D
  A B D E
  B C E
  B D

Candidates (M), with initial counts:
  Candidate     Count
  A B           0
  A C           0
  A D           0
  A E           0
  B C           0
  B D           0
  A B E         0
  B C D         0
  A B D E       0
  A B C D E     0
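
The simple approach can be sketched with the transactions and candidates listed on the slide. This is a minimal illustration of naive counting, in which every candidate is tested against every transaction. It is not an optimized hash-tree method, and the function name is hypothetical.

```python
def count_candidates(transactions, candidates):
    """Naive support counting: test every candidate against every transaction."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for c in candidates:
            if c <= t:  # candidate is a subset of the transaction
                counts[c] += 1
    return counts


# Transactions and candidates as listed on the slide.
transactions = [frozenset(t) for t in ("ABCD", "ACE", "BCD", "ABDE", "BCE", "BD")]
candidates = [frozenset(c) for c in
              ("AB", "AC", "AD", "AE", "BC", "BD", "ABE", "BCD", "ABDE", "ABCDE")]

counts = count_candidates(transactions, candidates)
for c in candidates:
    print(" ".join(sorted(c)), counts[c])
```

The nested loops perform one subset test per (transaction, candidate) pair, so the cost grows with the product of the two list sizes.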
25
Parallel Association Rules: Scaleup Results
(100K, 0.25) (Ref: Han, Karypis, and Kumar, 2000)
Efficient implementation of collective
communication
Dynamic restructuring of computation
26
Candidates for MineBench
27
Analysis of Benchmark Algorithms
  • Explore the bottlenecks associated with the
    current general purpose sequential and parallel
    machines
  • Explore how different architectural features
    impact the performance of data mining algorithms

28
Preliminary Evaluation of Some Sample Data Sets
  • Example: small (S), medium (M), and large (L)
    data sets
  • Execution time for some algorithms in the
    MineBench suite.

Reference: Liu Y., Pisharath J., Liao W., Memik
G., Choudhary A., Dubey P., 2004
29
Designing Efficient Kernels for Data Mining
  • Understanding the bottlenecks in executing DM
    algorithms on current architectures will help
    design new, more efficient algorithms
  • Focus will be on designing frequently used
    kernels that dominate the execution time of most
    DM algorithms
  • Both sequential and parallel versions will be
    developed

Frequency of Kernel Operations in Representative
Applications
Reference: Pisharath J., Zambreno J.,
Ozisikyilmaz B., Choudhary A., 2006
30
Conclusions
  • Data mining applications are becoming
    increasingly important
  • Current system design approaches are not adequate
    for DM applications
  • MineBench: a new benchmark suite that
    encompasses many algorithms found in data mining
  • Initial findings
  • Data mining applications are unique in terms of
    performance characteristics
  • There is much room for optimization with
    regard to data mining workloads

31
Bibliography
  • Introduction to Data Mining, Pang-Ning Tan,
    Michael Steinbach, Vipin Kumar, Addison-Wesley,
    April 2005
  • Introduction to Parallel Computing, (Second
    Edition) by Ananth Grama, Anshul Gupta, George
    Karypis, and Vipin Kumar. Addison-Wesley, 2003
  • Data Mining for Scientific and Engineering
    Applications, edited by R. Grossman, C. Kamath,
    W. P. Kegelmeyer, V. Kumar, and R. Namburu,
    Kluwer Academic Publishers, 2001
  • J. Han, R. B. Altman, V. Kumar, H. Mannila, and
    D. Pregibon, "Emerging Scientific Applications in
    Data Mining", Communications of the ACM, Volume
    45, Number 8, pp. 54-58, August 2002
  • C. Potter, P. Tan, M. Steinbach, S. Klooster, V.
    Kumar, R. Myneni, V. Genovese, "Major Disturbance
    Events in Terrestrial Ecosystems Detected using
    Global Satellite Data Sets", Global Change Biology
    9 (7), 1005-1021, 2003
  • Vipin Kumar, "Parallel and Distributed Computing
    for Cyber Security". An article based on the
    keynote talk by the author at the 17th International
    Conference on Parallel and Distributed Computing
    Systems (PDCS-2004). DS Online Journal, Volume 6,
    Number 10, October 2005
  • Ying Liu, Jayaprakash Pisharath, Wei-keng Liao,
    Gokhan Memik, Alok Choudhary, and Pradeep Dubey.
    Performance Evaluation and Characterization of
    Scalable Data Mining Algorithms. In Proceedings
    of the 16th International Conference on Parallel
    and Distributed Computing and Systems (PDCS),
    November 2004.
  • Joseph Zambreno, Berkin Ozisikyilmaz, Jayaprakash
    Pisharath, Gokhan Memik, and Alok Choudhary.
    Performance Characterization of Data Mining
    Applications using MineBench. In Proceedings of
    the 9th Workshop on Computer Architecture
    Evaluation using Commercial Workloads (CAECW-9),
    February 2006.
  • Jayaprakash Pisharath, Joseph Zambreno, Berkin
    Ozisikyilmaz, and Alok Choudhary. Accelerating
    Data Mining Workloads: Current Approaches and
    Future Challenges in System Architecture Design.
    In Proceedings of the 9th International Workshop
    on High Performance and Distributed Mining
    (HPDM), April 2006