Title: A Briefing Given to the Access Grid Community on D2K Data To Knowledge
1. A Briefing Given to the Access Grid Community on D2K - Data To Knowledge
2. Presentation Overview
- Brief Introduction to Knowledge Discovery in Databases and Data Mining
- Knowledge Discovery in Databases Framework
- Primer on Using the D2K Data To Knowledge Framework
- Questions?
3. Goals
- Understanding of the Knowledge Discovery in Databases Process
- Gain Knowledge of Basic Data Mining Operations and Techniques
- Understanding the Role of the Knowledge Discovery Framework
- Key Issues in Utilization of the D2K Framework
- Understanding the Role of Information Visualization in Data Mining
4. Motivation: Necessity Is the Mother of Invention
- Data Explosion Problem
  - Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses, and other information repositories.
- We Are Drowning in Data, but Starving for Knowledge
- Solution: Data Management Environments and Data Mining Frameworks
  - Data Warehousing and On-Line Analytical Processing
  - Extraction of Interesting Knowledge (Rules, Regularities, Patterns) from Large Data Sets and Large Databases
5. Why Data Mining? Potential Applications
- Eliminating Waste, Fraud, and Abuse
  - Taxpayer Non-compliance
  - Medicaid Claims Fraud
  - Food Stamp Program
  - Auditor Interestingness Tool
- Corporate Analysis and Risk Management
  - Resource Planning
  - Competitive Analysis
  - Finance Planning and Asset Evaluation
6. Why Data Mining? Potential Applications
- Crisis Management
  - Anticipatory Models
  - Topic Detection
  - Text Extraction
  - Network Intrusion
  - Multi-Objective Optimization
- Workforce/Education
  - Constituent Relationship Management
  - Real-time Profiling
  - Peer Review Analysis
  - Curriculum Generator
  - Retention Programs
7. Why Data Mining? Potential Applications
- Managing Natural Resources
  - Land Usage
  - Water Resource Management
- Surveillance
  - Biometrics for Identification
- Other Applications
  - Astronomy
  - Computational Biology
8. Data Mining: On What Kind of Data?
- Relational Databases
- Data Warehouses
- Transactional Databases
- Advanced Database Systems
  - Object-Relational
  - Spatial
  - Temporal
  - Text
- Heterogeneous, Legacy, and Distributed Databases
- WWW
9. Data Mining: Confluence of Multiple Disciplines
- Database Systems, Data Warehouses, and OLAP
- Machine Learning
- Statistics
- Mathematical Programming
- Visualization
- High Performance Computing
10. Why Do We Need Data Mining?
- Data volumes are too large for classical analysis approaches
  - Large number of records (10^8 to 10^12 bytes)
  - High-dimensional data (10^2 to 10^4 attributes)
- How do you explore millions of records, with tens or hundreds of fields, and find patterns?
11. Why Do We Need Data Mining?
- As databases grow, supporting the decision-support process with traditional query languages becomes infeasible
- Many queries of interest are difficult to state in a query language (the query formulation problem)
  - Find all cases of fraud
  - Find all individuals likely to need Education Credit Assistance
  - Find all documents that are similar to this customer's problem
12. What Is It?
- Knowledge Discovery in Databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
- The understandable patterns are used to
  - Make predictions or classifications about new data
  - Explain existing data
  - Summarize the contents of a large database to support decision making
  - Graphically visualize the data to aid humans in discovering deeper patterns
13. Three Primary Data Mining Paradigms
- Predictive Modeling
  - Classification (Categorical or Discrete)
  - Regression (Continuous)
- Discovery
  - Association Rules, Link Analysis, Sequences, Clustering
- Deviation Detection/Monitoring
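To make the predictive-modeling paradigm concrete, here is a minimal sketch of a nearest-neighbor classifier in plain Java. It is illustrative only and not part of D2K; the `NearestNeighbor` class and its methods are invented for this example.

```java
// Minimal 1-nearest-neighbor classifier: predicts the label of the
// training example closest (in squared Euclidean distance) to the query.
// Illustrative sketch only -- not a D2K module.
public class NearestNeighbor {
    private final double[][] examples; // training feature vectors
    private final String[] labels;     // class label per training example

    public NearestNeighbor(double[][] examples, String[] labels) {
        this.examples = examples;
        this.labels = labels;
    }

    /** Returns the label of the training example nearest to the query. */
    public String classify(double[] query) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < examples.length; i++) {
            double d = 0.0;
            for (int j = 0; j < query.length; j++) {
                double diff = examples[i][j] - query[j];
                d += diff * diff; // accumulate squared distance
            }
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return labels[best];
    }

    public static void main(String[] args) {
        double[][] train = {{0, 0}, {0, 1}, {5, 5}, {6, 5}};
        String[] y = {"low", "low", "high", "high"};
        NearestNeighbor nn = new NearestNeighbor(train, y);
        System.out.println(nn.classify(new double[]{5.5, 4.8})); // prints "high"
    }
}
```

A classifier like this fits the "Classification (Categorical or Discrete)" branch above: it maps a feature vector to a discrete label, whereas regression would return a continuous value.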
14. Knowledge Discovery in Databases Process
15. Need for a Data Mining Framework
- Visual Programming Environment
- Robust Computational Infrastructure
- Flexible And Extensible Architecture
- Rapid Application Development Environment
- Integrated Environment for Models and Visualization
- Workflow and Group Use Interface
16. D2K - Data To Knowledge
- D2K is a rapid, flexible data mining system that
integrates effective analytical data mining
methods for prediction, discovery, and anomaly
detection with data management and information
visualization.
17. D2K Infrastructure, Toolkit, Modules, and Applications
- Data Selection
  - Distributed Knowledge Sources
- Data Transformation
  - Feature Selection/Construction
  - Example Selection
- Data Modeling
  - Scalable Algorithms: Predictive, Discovery, Anomaly Detection
  - Bias Optimization
  - Layer Learning
- Model Evaluation
- Information Visualization
18. D2K/T2K/I2K - Data, Text, and Image Analysis
19. Summary
- Data mining: discovering interesting patterns from large amounts of data
- A natural evolution of database technology, in great demand, with wide applications
- The KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
- Mining can be performed on a variety of information repositories
- Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
- Importance of a data mining framework
20. D2K Toolkit
- Screenshot of the Toolkit interface, labeling its regions: Tool Menu, Tool Bar, Side Tab Panes, Workspace, Jump Up Panes
21. D2K - Software Environment for Data Mining
- Visual programming system employing a scalable framework
- Robust computational infrastructure
  - Enables processor-intensive apps; supports distributed computing
  - Enables data-intensive apps; supports multi-processor, shared-memory architectures and thread pooling
  - Very low granularity, fast data flow paradigm, integrated control flow
- Reduction of time to market
  - Increases code reuse and sharing
  - Expedites custom software development
  - Relieves the distributed computing burden
- Flexible and extensible architecture
  - Creates plug-and-play subsystem architectures and standard APIs
- Rapid application development (RAD) environment
- Integrated environment for models and visualization
22. D2K Components
- D2K Infrastructure
  - D2K API, data flow environment, distributed computing framework, and runtime system
- D2K Modules
  - Computational units written in Java that follow the D2K API
- D2K Itineraries
  - Modules that are connected to form an application
- D2K Toolkit
  - User interface for specifying and executing itineraries; provides the rapid application development environment
- D2K-Driven Applications
  - Applications that use D2K modules but do not need to run in the D2K Toolkit
23. D2K Infrastructure
- D2K Module API Specification
- Distributed Computing Framework
  - Uses socket-based connections to communicate with remote machines
  - Uses Grid services to deploy on the Grid
- Local D2K
  - Controls the execution of an itinerary
  - Manages the passing of data between modules and machines (if necessary)
- Remote D2K
  - Executes a module on a remote machine
24. D2K Modules
- Input Module: Loads data from the outside world.
  - Flat files, databases, etc.
- Data Prep Module: Performs functions to select, clean, or transform the data.
  - Binning, normalizing, feature selection, etc.
- Compute Module: Performs the main algorithmic computations.
  - Naïve Bayesian, decision tree, Apriori, etc.
- User Input Module: Requires interaction with the user.
  - Data selection, input and output selection, etc.
- Output Module: Saves data to the outside world.
  - Flat files, databases, etc.
- Visualization Module: Provides visual feedback to the user.
  - Naïve Bayesian, rule association, decision tree, parallel coordinates, 2D scatterplot, 3D surface plot
25. D2K Module Icon Description
- Module Progress Bar
  - Appears during execution to show the percentage of the total execution time spent in this module. It is green when the module is executing and red when not.
- Input Trigger
  - Specifies the input for control flow.
- Input Port
  - Rectangular shapes on the left side of the module represent the module's inputs. They are colored according to the data type they represent.
- Properties Symbol
  - If a "P" is shown in the lower left corner of the module, the module has properties that can be set before execution.
- Output Trigger
  - Specifies the output for control flow.
- Output Port
  - Rectangular shapes on the right side of the module represent the module's outputs. They are colored according to the data type they represent.
- Serializable Symbol
  - If an "S" is shown in the lower right corner of the module, the module is serializable and can be saved.
26. D2K Itineraries
- Itineraries are applications that have connected modules with their properties set.
- D2K core itineraries include:
  - Prediction
  - Discovery
  - Anomaly Detection
  - Data Selection
  - Transformation
  - Visualization
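Rule association, one of the discovery techniques these itineraries package, rests on two simple measures: the support of an itemset and the confidence of a rule. A minimal sketch of those computations (illustrative only; the `RuleMetrics` class is invented, not part of D2K):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Support and confidence for an association rule A -> B over a list of
// transactions. These are the measures Apriori-style discovery
// algorithms use to find and rank rules. Illustrative sketch only.
public class RuleMetrics {
    /** Fraction of transactions containing every item in `items`. */
    public static double support(List<Set<String>> txns, Set<String> items) {
        int count = 0;
        for (Set<String> t : txns) {
            if (t.containsAll(items)) count++;
        }
        return (double) count / txns.size();
    }

    /** Confidence of the rule antecedent -> consequent:
     *  support(antecedent AND consequent) / support(antecedent). */
    public static double confidence(List<Set<String>> txns,
                                    Set<String> antecedent, Set<String> consequent) {
        Set<String> both = new HashSet<>(antecedent);
        both.addAll(consequent);
        return support(txns, both) / support(txns, antecedent);
    }

    public static void main(String[] args) {
        List<Set<String>> txns = Arrays.asList(
            new HashSet<>(Arrays.asList("bread", "milk")),
            new HashSet<>(Arrays.asList("bread", "butter")),
            new HashSet<>(Arrays.asList("bread", "milk", "butter")),
            new HashSet<>(Arrays.asList("milk")));
        Set<String> bread = new HashSet<>(Arrays.asList("bread"));
        Set<String> milk = new HashSet<>(Arrays.asList("milk"));
        System.out.println(support(txns, bread));          // prints 0.75
        System.out.println(confidence(txns, bread, milk)); // ~0.667
    }
}
```

A discovery itinerary would compute these measures over a full database and keep only the rules whose support and confidence exceed user-set thresholds.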
27. D2K-Driven Applications
D2K-driven applications are those that use D2K modules and/or itineraries but do not require interaction with the D2K Toolkit to function. They can operate as stand-alone applications.
- Advantages of Building D2K-Driven Applications
  - Code reuse shortens development time
  - Use of the distributed computing features implemented in D2K
- Current Application Development by the ALG
  - Text Analysis (ThemeWeaver uses T2K - Text to Knowledge)
- Other Potential Application Areas
  - Image Analysis (I2K - Image to Knowledge)
28. New D2K 3.0 Features
- Extension of the existing API
  - Includes the ability to programmatically connect modules and set properties.
  - Allows D2K-driven applications to be developed.
  - Ability to pause and restart an itinerary.
- Enhanced Distributed Computing
  - Modules that are re-entrant can be executed remotely.
  - Use of Jini services to look up distributed resources.
  - Support for specifying the runtime layout of a distributed itinerary, which can be changed dynamically at runtime.
  - Processor Status Overlay
    - Shows the user how distributed computing resources are being used.
    - Shows how many resources are ready to compute on each machine.
  - Distributed Checkpointing
- Resource Manager
  - Provides an API for indicating data structures to be stored by the resource manager.
  - The resource manager provides these data structures to distributed machines.
29. Processor Status Overlay
- Represents each machine being used.
- Multiple lines represent multiple processors per machine.
30. Let's Look at D2K
- Demos
  - D2K Toolkit
  - Prediction
    - Naive Bayesian
    - Decision Tree
  - Discovery
    - Rule Association
  - Text Analysis (T2K)
  - Image Analysis (I2K)
  - Visualization
31. D2K SL
- Intuitive interfaces into a subset of D2K functionality for non-data-mining professionals.
- Transparent access to mine data stored in databases.
- Extensible from desktop to cluster to grid.
- Visualization support at all stages of the data mining process.
- Support for very large data sets.
32. New D2K User Interface: D2K SL
- Provides a step-by-step interface to guide the user in data analysis
- Uses the same D2K modules
- Provides a way to capture different experiments (streams)
33. Another View of the New D2K User Interface: D2K SL
- Helps users keep track of data
- Defines templates that can be reused in different experiments (streams)
34. How to Write a Module
- How hard is it to write a module?
- We have an API to define what a given module is.
- Most modules need the following methods implemented:
  - Module info (getModuleInfo)
  - Input and output info (getInputInfo and getOutputInfo)
  - Input and output types (getInputTypes and getOutputTypes)
  - Names (getModuleName, getInputName, getOutputName)
  - Module execution (doit)
- Flexibility exists for other methods to be overridden to provide different functionality.
- Optional methods exist for providing more information about properties, the module icon, etc.
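As a rough sketch of the shape such a module takes, here is a toy example. The abstract `Module` base class below is a stand-in invented so the sketch compiles on its own; in real D2K these methods are declared by the D2K Module API, and `doit()` would pull inputs from and push outputs to the framework's ports rather than through plain fields.

```java
// Stand-in for the D2K module base class, invented for illustration;
// the real abstract class and method signatures live in the D2K API.
abstract class Module {
    public abstract String getModuleInfo();
    public abstract String getModuleName();
    public abstract String[] getInputTypes();
    public abstract String[] getOutputTypes();
    public abstract String getInputInfo(int index);
    public abstract String getOutputInfo(int index);
    public abstract String getInputName(int index);
    public abstract String getOutputName(int index);
    public abstract void doit() throws Exception;
}

// A toy compute module that scales an array of doubles by a constant.
public class ScaleModule extends Module {
    double[] input;      // in real D2K, pulled from an input port
    double[] output;     // in real D2K, pushed to an output port
    double factor = 2.0; // a module property, settable before execution

    public String getModuleInfo() { return "Multiplies each input value by a factor."; }
    public String getModuleName() { return "Scale"; }
    public String[] getInputTypes()  { return new String[]{"[D"}; } // double[]
    public String[] getOutputTypes() { return new String[]{"[D"}; }
    public String getInputInfo(int i)  { return "The array of values to scale."; }
    public String getOutputInfo(int i) { return "The scaled array."; }
    public String getInputName(int i)  { return "Values"; }
    public String getOutputName(int i) { return "Scaled Values"; }

    // The module's main computation.
    public void doit() {
        output = new double[input.length];
        for (int i = 0; i < input.length; i++) {
            output[i] = input[i] * factor;
        }
    }
}
```

The metadata methods exist so the Toolkit can render the module's ports, tooltips, and properties panel; only `doit()` carries the actual computation.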
35. The ALG Team
- Staff
- Loretta Auvil
- Ruth Aydt
- Peter Bajcsy
- Colleen Bushell
- Dora Cai
- David Clutter
- Yair Even-Zohar
- Lisa Gatzke
- Vered Goren
- Chris Navarro
- Greg Pape
- Tom Redman
- Duane Searsmith
- Andrew Shirk
- Anca Suvaiala
- David Tcheng
- Michael Welge
- Students
- Tyler Alumbaugh
- Bradley Berkin
- Martin Butz
- Peter Groves
- Nazan Khan
- Alexander Kosorukoff
- Kiran Lakkaraju
- Sang-Chul Lee
- Sameer Mathur
- Sunayana Saha
- Arun Srinivasan
- Bei Yu