Title: A Briefing Given to the Access Grid Community on D2K Data To Knowledge
1. A Briefing Given to the Access Grid Community on D2K - Data To Knowledge
2. Presentation Overview
- Brief Introduction to Knowledge Discovery in Databases and Data Mining
- Knowledge Discovery in Databases Framework
- Primer on Using the D2K Data To Knowledge Framework
- Questions?
3. Goals
- Understanding of the Knowledge Discovery in Databases Process
- Gain Knowledge of Basic Data Mining Operations and Techniques
- Understanding the Role of the Knowledge Discovery Framework
- Key Issues in Utilization of the D2K Framework
- Understanding the Role of Information Visualization in Data Mining
4. Motivation: Necessity Is the Mother of Invention
- Data Explosion Problem
  - Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses, and other information repositories.
- We Are Drowning in Data, but Starving for Knowledge
- Solution: Data Management Environments and Data Mining Frameworks
  - Data Warehousing and On-Line Analytical Processing
  - Extraction of Interesting Knowledge (Rules, Regularities, Patterns) from Large Data Sets and Large Databases
5. Why Data Mining? Potential Applications
- Eliminating Waste, Fraud, and Abuse
  - Taxpayer Non-compliance
  - Medicaid Claims Fraud
  - Food Stamp Program
  - Auditor Interestingness Tool
- Corporate Analysis and Risk Management
  - Resource Planning
  - Competitive Analysis
  - Finance Planning and Asset Evaluation
6. Why Data Mining? Potential Applications
- Crisis Management
  - Anticipatory Models
  - Topic Detection
  - Text Extraction
  - Network Intrusion
  - Multi-Objective Optimization
- Workforce/Education
  - Constituent Relationship Management
  - Real-time Profiling
  - Peer Review Analysis
  - Curriculum Generator
  - Retention Programs
7. Why Data Mining? Potential Applications
- Managing Natural Resources
  - Land Usage
  - Water Resource Management
- Surveillance
  - Biometrics for Identification
- Other Applications
  - Astronomy
  - Computational Biology
8. Data Mining: On What Kind of Data?
- Relational Databases
- Data Warehouses
- Transactional Databases
- Advanced Database Systems
  - Object-Relational
  - Spatial
  - Temporal
  - Text
- Heterogeneous, Legacy, and Distributed Databases
- WWW
9. Data Mining: Confluence of Multiple Disciplines
- Database Systems, Data Warehouses, and OLAP
- Machine Learning
- Statistics
- Mathematical Programming
- Visualization
- High Performance Computing
10. Why Do We Need Data Mining?
- Data volumes are too large for classical analysis approaches
  - Large number of records (10^8 to 10^12 bytes)
  - High-dimensional data (10^2 to 10^4 attributes)
- How do you explore millions of records, with tens or hundreds of fields, and find patterns?
11. Why Do We Need Data Mining?
- As databases grow, supporting the decision-support process with traditional query languages becomes infeasible
- Many queries of interest are difficult to state in a query language (the query formulation problem)
  - Find all cases of fraud
  - Find all individuals likely to need Education Credit Assistance
  - Find all documents that are similar to this customer's problem
12. What Is It?
- Knowledge Discovery in Databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
- The understandable patterns are used to
  - Make predictions or classifications about new data
  - Explain existing data
  - Summarize the contents of a large database to support decision making
  - Graphically visualize the data to aid humans in discovering deeper patterns
13. Three Primary Data Mining Paradigms
- Predictive Modeling
  - Classification (Categorical or Discrete)
  - Regression (Continuous)
- Discovery
  - Association Rules, Link Analysis, Sequences, Clustering
- Deviation Detection/Monitoring
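To make the predictive-modeling paradigm concrete, here is a minimal sketch of a nearest-neighbor classifier in plain Java. It is illustrative only and not part of D2K; the `NearestNeighbor` class and its methods are invented for this example.

```java
// Minimal 1-nearest-neighbor classifier: predicts the label of the
// training example closest (in squared Euclidean distance) to the query.
// Illustrative sketch only -- not a D2K module.
public class NearestNeighbor {
    private final double[][] examples; // training feature vectors
    private final String[] labels;     // class label per training example

    public NearestNeighbor(double[][] examples, String[] labels) {
        this.examples = examples;
        this.labels = labels;
    }

    /** Returns the label of the training example nearest to the query. */
    public String classify(double[] query) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < examples.length; i++) {
            double d = 0.0;
            for (int j = 0; j < query.length; j++) {
                double diff = examples[i][j] - query[j];
                d += diff * diff; // accumulate squared distance
            }
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return labels[best];
    }

    public static void main(String[] args) {
        double[][] train = {{0, 0}, {0, 1}, {5, 5}, {6, 5}};
        String[] y = {"low", "low", "high", "high"};
        NearestNeighbor nn = new NearestNeighbor(train, y);
        System.out.println(nn.classify(new double[]{5.5, 4.8})); // prints "high"
    }
}
```

A classifier like this fits the "Classification (Categorical or Discrete)" branch above: it maps a feature vector to a discrete label, whereas regression would return a continuous value.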
14. Knowledge Discovery in Databases Process
15. Need for a Data Mining Framework
- Visual Programming Environment
- Robust Computational Infrastructure
- Flexible And Extensible Architecture
- Rapid Application Development Environment
- Integrated Environment for Models and Visualization
- Workflow and Group Use Interface
16. D2K - Data To Knowledge
- D2K is a rapid, flexible data mining system that
integrates effective analytical data mining
methods for prediction, discovery, and anomaly
detection with data management and information
visualization.
17. D2K Infrastructure, Toolkit, Modules, and Applications
- Data Selection
  - Distributed Knowledge Sources
- Data Transformation
  - Feature Selection/Construction
  - Example Selection
- Data Modeling
  - Scalable Algorithms: Predictive, Discovery, Anomaly Detection
  - Bias Optimization
  - Layer Learning
- Model Evaluation
- Information Visualization
18. D2K/T2K/I2K - Data, Text, and Image Analysis
19. Summary
- Data mining: discovering interesting patterns from large amounts of data
- A natural evolution of database technology, in great demand, with wide applications
- The KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
- Mining can be performed on a variety of information repositories
- Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
- Importance of a data mining framework
20. D2K Toolkit
- Screenshot of the Toolkit interface, labeling its regions: Tool Menu, Tool Bar, Side Tab Panes, Workspace, Jump Up Panes
21. D2K - Software Environment for Data Mining
- Visual programming system employing a scalable framework
- Robust computational infrastructure
  - Enables processor-intensive apps; supports distributed computing
  - Enables data-intensive apps; supports multi-processor, shared-memory architectures and thread pooling
  - Very low granularity, fast data flow paradigm, integrated control flow
- Reduction of time to market
  - Increases code reuse and sharing
  - Expedites custom software development
  - Relieves the distributed computing burden
- Flexible and extensible architecture
  - Creates plug-and-play subsystem architectures and standard APIs
- Rapid application development (RAD) environment
- Integrated environment for models and visualization
22. D2K Components
- D2K Infrastructure
  - D2K API, data flow environment, distributed computing framework, and runtime system
- D2K Modules
  - Computational units written in Java that follow the D2K API
- D2K Itineraries
  - Modules that are connected to form an application
- D2K Toolkit
  - User interface for specifying and executing itineraries; provides the rapid application development environment
- D2K-Driven Applications
  - Applications that use D2K modules but do not need to run in the D2K Toolkit
23. D2K Infrastructure
- D2K Module API Specification
- Distributed Computing Framework
  - Uses socket-based connections to communicate with remote machines
  - Uses Grid services to deploy on the Grid
- Local D2K
  - Controls the execution of an itinerary
  - Manages the passing of data between modules and machines (if necessary)
- Remote D2K
  - Executes a module on a remote machine
24. D2K Modules
- Input Module: Loads data from the outside world.
  - Flat files, databases, etc.
- Data Prep Module: Performs functions to select, clean, or transform the data.
  - Binning, normalizing, feature selection, etc.
- Compute Module: Performs the main algorithmic computations.
  - Naïve Bayesian, decision tree, Apriori, etc.
- User Input Module: Requires interaction with the user.
  - Data selection, input and output selection, etc.
- Output Module: Saves data to the outside world.
  - Flat files, databases, etc.
- Visualization Module: Provides visual feedback to the user.
  - Naïve Bayesian, rule association, decision tree, parallel coordinates, 2D scatterplot, 3D surface plot
25. D2K Module Icon Description
- Module Progress Bar
  - Appears during execution to show the percentage of the total execution time spent in this module. It is green when the module is executing and red when not.
- Input Trigger
  - Specifies the input for control flow.
- Input Port
  - Rectangular shapes on the left side of the module represent the module's inputs. They are colored according to the data type they represent.
- Properties Symbol
  - If a "P" is shown in the lower left corner of the module, the module has properties that can be set before execution.
- Output Trigger
  - Specifies the output for control flow.
- Output Port
  - Rectangular shapes on the right side of the module represent the module's outputs. They are colored according to the data type they represent.
- Serializable Symbol
  - If an "S" is shown in the lower right corner of the module, the module is serializable and can be saved.
26. D2K Itineraries
- Itineraries are applications that have connected modules with their properties set.
- D2K core itineraries include:
  - Prediction
  - Discovery
  - Anomaly Detection
  - Data Selection
  - Transformation
  - Visualization
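Rule association, one of the discovery techniques these itineraries package, rests on two simple measures: the support of an itemset and the confidence of a rule. A minimal sketch of those computations (illustrative only; the `RuleMetrics` class is invented, not part of D2K):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Support and confidence for an association rule A -> B over a list of
// transactions. These are the measures Apriori-style discovery
// algorithms use to find and rank rules. Illustrative sketch only.
public class RuleMetrics {
    /** Fraction of transactions containing every item in `items`. */
    public static double support(List<Set<String>> txns, Set<String> items) {
        int count = 0;
        for (Set<String> t : txns) {
            if (t.containsAll(items)) count++;
        }
        return (double) count / txns.size();
    }

    /** Confidence of the rule antecedent -> consequent:
     *  support(antecedent AND consequent) / support(antecedent). */
    public static double confidence(List<Set<String>> txns,
                                    Set<String> antecedent, Set<String> consequent) {
        Set<String> both = new HashSet<>(antecedent);
        both.addAll(consequent);
        return support(txns, both) / support(txns, antecedent);
    }

    public static void main(String[] args) {
        List<Set<String>> txns = Arrays.asList(
            new HashSet<>(Arrays.asList("bread", "milk")),
            new HashSet<>(Arrays.asList("bread", "butter")),
            new HashSet<>(Arrays.asList("bread", "milk", "butter")),
            new HashSet<>(Arrays.asList("milk")));
        Set<String> bread = new HashSet<>(Arrays.asList("bread"));
        Set<String> milk = new HashSet<>(Arrays.asList("milk"));
        System.out.println(support(txns, bread));          // prints 0.75
        System.out.println(confidence(txns, bread, milk)); // ~0.667
    }
}
```

A discovery itinerary would compute these measures over a full database and keep only the rules whose support and confidence exceed user-set thresholds.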
27. D2K-Driven Applications
D2K-driven applications are those that use D2K modules and/or itineraries but do not require interaction with the D2K Toolkit to function. They can operate as stand-alone applications.
- Advantages of Building D2K-Driven Applications
  - Code reuse shortens development time
  - Use of the distributed computing features implemented in D2K
- Current Application Development by the ALG
  - Text Analysis (ThemeWeaver uses T2K - Text to Knowledge)
- Other Potential Application Areas
  - Image Analysis (I2K - Image to Knowledge)
28. New D2K 3.0 Features
- Extension of the existing API
  - Includes the ability to programmatically connect modules and set properties.
  - Allows D2K-driven applications to be developed.
  - Ability to pause and restart an itinerary.
- Enhanced Distributed Computing
  - Modules that are re-entrant can be executed remotely.
  - Use of Jini services to look up distributed resources.
  - Support for specifying the runtime layout of a distributed itinerary, which can be changed dynamically at runtime.
  - Processor Status Overlay
    - Shows the user how distributed computing resources are being used.
    - Shows how many resources are ready to compute on each machine.
  - Distributed Checkpointing
- Resource Manager
  - Provides an API for indicating data structures to be stored by the resource manager.
  - The resource manager provides these data structures to distributed machines.
29. Processor Status Overlay
- Represents each machine being used.
- Multiple lines represent multiple processors per machine.
30. Let's Look at D2K
- Demos
  - D2K Toolkit
  - Prediction
    - Naive Bayesian
    - Decision Tree
  - Discovery
    - Rule Association
  - Text Analysis (T2K)
  - Image Analysis (I2K)
  - Visualization
31. D2K SL
- Intuitive interfaces into a subset of D2K functionality for non-data-mining professionals.
- Transparent access to mine data stored in databases.
- Extensible from desktop to cluster to grid.
- Visualization support at all stages of the data mining process.
- Support for very large data sets.
32. New D2K User Interface: D2K SL
- Provides a step-by-step interface to guide the user in data analysis
- Uses the same D2K modules
- Provides a way to capture different experiments (streams)
33. Another View of the New D2K User Interface: D2K SL
- Helps users keep track of data
- Defines templates that can be reused in different experiments (streams)
34. How to Write a Module
- How hard is it to write a module?
- We have an API to define what a given module is.
- Most modules need the following methods implemented:
  - Module info (getModuleInfo)
  - Input and output info (getInputInfo and getOutputInfo)
  - Input and output types (getInputTypes and getOutputTypes)
  - Names (getModuleName, getInputName, getOutputName)
  - Module execution (doit)
- Flexibility exists for other methods to be overridden to provide different functionality.
- Optional methods exist for providing more information about properties, the module icon, etc.
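As a rough sketch of the shape such a module takes, here is a toy example. The abstract `Module` base class below is a stand-in invented so the sketch compiles on its own; in real D2K these methods are declared by the D2K Module API, and `doit()` would pull inputs from and push outputs to the framework's ports rather than through plain fields.

```java
// Stand-in for the D2K module base class, invented for illustration;
// the real abstract class and method signatures live in the D2K API.
abstract class Module {
    public abstract String getModuleInfo();
    public abstract String getModuleName();
    public abstract String[] getInputTypes();
    public abstract String[] getOutputTypes();
    public abstract String getInputInfo(int index);
    public abstract String getOutputInfo(int index);
    public abstract String getInputName(int index);
    public abstract String getOutputName(int index);
    public abstract void doit() throws Exception;
}

// A toy compute module that scales an array of doubles by a constant.
public class ScaleModule extends Module {
    double[] input;      // in real D2K, pulled from an input port
    double[] output;     // in real D2K, pushed to an output port
    double factor = 2.0; // a module property, settable before execution

    public String getModuleInfo() { return "Multiplies each input value by a factor."; }
    public String getModuleName() { return "Scale"; }
    public String[] getInputTypes()  { return new String[]{"[D"}; } // double[]
    public String[] getOutputTypes() { return new String[]{"[D"}; }
    public String getInputInfo(int i)  { return "The array of values to scale."; }
    public String getOutputInfo(int i) { return "The scaled array."; }
    public String getInputName(int i)  { return "Values"; }
    public String getOutputName(int i) { return "Scaled Values"; }

    // The module's main computation.
    public void doit() {
        output = new double[input.length];
        for (int i = 0; i < input.length; i++) {
            output[i] = input[i] * factor;
        }
    }
}
```

The metadata methods exist so the Toolkit can render the module's ports, tooltips, and properties panel; only `doit()` carries the actual computation.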
35. The ALG Team
- Staff
- Loretta Auvil
- Ruth Aydt
- Peter Bajcsy
- Colleen Bushell
- Dora Cai
- David Clutter
- Yair Even-Zohar
- Lisa Gatzke
- Vered Goren
- Chris Navarro
- Greg Pape
- Tom Redman
- Duane Searsmith
- Andrew Shirk
- Anca Suvaiala
- David Tcheng
- Michael Welge
- Students
- Tyler Alumbaugh
- Bradley Berkin
- Martin Butz
- Peter Groves
- Nazan Khan
- Alexander Kosorukoff
- Kiran Lakkaraju
- Sang-Chul Lee
- Sameer Mathur
- Sunayana Saha
- Arun Srinivasan
- Bei Yu