A Briefing Given to the Access Grid Community on D2K - Data To Knowledge
1
A Briefing Given to the Access Grid Community on
D2K - Data To Knowledge
2
Presentation Overview
  • Brief Introduction to Knowledge Discovery in
    Databases and Data Mining
  • Knowledge Discovery in Databases Framework
  • Primer on Using the D2K Data To Knowledge
    Framework
  • Questions?

3
Goals
  • Understand the Knowledge Discovery in Databases
    Process
  • Gain Knowledge of Basic Data Mining Operations
    and Techniques
  • Understand the Role of the Knowledge Discovery
    Framework
  • Identify Key Issues in Utilizing the D2K Framework
  • Understand the Role of Information Visualization
    in Data Mining

4
Motivation: Necessity is the Mother of Invention
  • Data Explosion Problem
  • Automated Data Collection Tools and Mature
    Database Technology Lead to Tremendous Amounts of
    Data Stored in Databases, Data Warehouses, and
    Other Information Repositories
  • We Are Drowning In Data, But Starving For
    Knowledge
  • Solution: Data Management Environments and Data
    Mining Frameworks
  • Data Warehousing and On-Line Analytical
    Processing
  • Extraction of Interesting Knowledge (Rules,
    Regularities, Patterns) from Large Data Sets and
    Databases

5
Why Data Mining? - Potential Applications
  • Eliminating Waste, Fraud, Abuse
  • Taxpayer Non-compliance
  • Medicaid Claims Fraud
  • Food Stamp Program
  • Auditor Interestingness Tool
  • Corporate Analysis and Risk Management
  • Resource Planning
  • Competitive Analysis
  • Finance Planning and Asset Evaluation

6
Why Data Mining? - Potential Applications
  • Crisis Management
  • Anticipatory Models
  • Topic Detection
  • Text Extraction
  • Network Intrusion
  • Multi-Objective Optimization
  • Workforce/Education
  • Constituent Relationship Management
  • Real-time Profiling
  • Peer Review Analysis
  • Curriculum Generator
  • Retention Programs

7
Why Data Mining? - Potential Applications
  • Managing Natural Resources
  • Land Usage
  • Water Resource Management
  • Surveillance
  • Biometrics for Identification
  • Other Applications
  • Astronomy
  • Computational Biology

8
Data Mining: On What Kind of Data?
  • Relational Databases
  • Data Warehouses
  • Transactional Databases
  • Advanced Database Systems
  • Object-Relational
  • Spatial
  • Temporal
  • Text
  • Heterogeneous, Legacy, and Distributed
  • WWW

9
Data Mining: A Confluence of Multiple Disciplines
  • Database Systems, Data Warehouses, and OLAP
  • Machine Learning
  • Statistics
  • Mathematical Programming
  • Visualization
  • High Performance Computing

10
Why Do We Need Data Mining ?
  • Data volumes are too large for classical analysis
    approaches
  • Large number of records (10^8 to 10^12 bytes)
  • High dimensional data (10^2 to 10^4 attributes)
  • How do you explore millions of records, tens or
    hundreds of fields, and find patterns?

11
Why Do We Need Data Mining?
  • As databases grow, supporting the decision-making
    process with traditional query languages becomes
    infeasible
  • Many queries of interest are difficult to state
    in a query language (query formulation problem)
  • Find all cases of fraud
  • Find all individuals likely to need Education
    Credit Assistance
  • Find all documents that are similar to this
    customer's problem

12
What is It?
  • Knowledge Discovery in Databases is the
    non-trivial process of identifying valid, novel,
    potentially useful, and ultimately understandable
    patterns in data.
  • The understandable patterns are used to
  • Make predictions or classifications about new
    data
  • Explain existing data
  • Summarize the contents of a large database to
    support decision making
  • Graphically visualize data to aid humans in
    discovering deeper patterns

13
Three Primary Data Mining Paradigms
  • Predictive Modeling
  • Classification (Categorical or Discrete)
  • Regression (Continuous)
  • Discovery
  • Association Rules, Link Analysis, Sequences,
    Clustering
  • Deviation Detection/Monitoring
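To make the two predictive modeling tasks concrete, the following toy sketch (hypothetical, not D2K code; the threshold and linear model are made up for illustration) contrasts classification, which outputs a discrete category, with regression, which outputs a continuous value:

```java
// Toy contrast between the two predictive modeling tasks.
// This is an illustrative sketch, not D2K code.
public class ParadigmSketch {
    // Classification: map an input to a discrete category.
    static String classify(double income) {
        return income >= 50000 ? "high" : "low"; // hypothetical threshold
    }

    // Regression: map an input to a continuous value
    // (here, a hand-picked linear model y = 0.3x + 1000).
    static double regress(double income) {
        return 0.3 * income + 1000;
    }

    public static void main(String[] args) {
        System.out.println(classify(62000)); // discrete label
        System.out.println(regress(62000));  // continuous prediction
    }
}
```

The discovery paradigm (association rules, clustering) differs from both in that it has no single target variable to predict.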

14
Knowledge Discovery In Databases Process
15
Need for Data Mining Framework
  • Visual Programming Environment
  • Robust Computational Infrastructure
  • Flexible And Extensible Architecture
  • Rapid Application Development Environment
  • Integrated Environment For Models And
    Visualization
  • Workflow and Group Use Interface

16
D2K - Data To Knowledge
  • D2K is a rapid, flexible data mining system that
    integrates effective analytical data mining
    methods for prediction, discovery, and anomaly
    detection with data management and information
    visualization.

17
D2K Infrastructure, Toolkit, Modules, and
Applications
  • Data Selection
  • Distributed Knowledge Sources
  • Data Transformation
  • Feature Selection/ Construction
  • Example Selection
  • Data Modeling
  • Scalable Algorithms
  • Predictive
  • Discovery
  • Anomaly Detection
  • Bias Optimization
  • Layer Learning
  • Model Evaluation
  • Information Visualization

18
D2K/T2K/I2K - Data, Text, and Image Analysis
19
Summary
  • Data mining: discovering interesting patterns
    from large amounts of data
  • A natural evolution of database technology, in
    great demand, with wide applications
  • A KDD process includes data cleaning, data
    integration, data selection, transformation, data
    mining, pattern evaluation, and knowledge
    presentation
  • Mining can be performed in a variety of
    information repositories
  • Data mining functionalities: characterization,
    discrimination, association, classification,
    clustering, outlier and trend analysis, etc.
  • Importance of data mining framework

20
D2K ToolKit
Tool Menu
Tool Bar
Side Tab Panes
Workspace
Jump Up Panes
21
D2K - Software Environment for Data Mining
  • Visual programming system employing a scalable
    framework
  • Robust computational infrastructure
  • Enable processor-intensive apps, support
    distributed computing
  • Enable data-intensive apps, support
    multi-processor, shared-memory architectures,
    thread pooling
  • Very low granularity, fast data flow paradigm,
    integrated control flow
  • Reduction of time to market
  • Increase code reuse and sharing
  • Expedite custom software developments
  • Relieve distributed computing burden
  • Flexible and extensible architecture
  • Create plug and play subsystem architectures, and
    standard APIs
  • Rapid application development (RAD) environment
  • Integrated environment for models and
    visualization

22
D2K Components
  • D2K Infrastructure
  • D2K API, data flow environment, distributed
    computing framework and runtime system
  • D2K Modules
  • Computational units written in Java that follow
    the D2K API
  • D2K Itineraries
  • Modules that are connected to form an application
  • D2K Toolkit
  • User interface for specifying and executing
    itineraries; provides the rapid application
    development environment
  • D2K-Driven Applications
  • Applications that use D2K modules, but do not
    need to run in the D2K Toolkit

23
D2K Infrastructure
  • D2K Module API Specification
  • Distributed Computing Framework
  • Uses socket-based connections to communicate with
    remote machines
  • Uses Grid Services to deploy on the Grid
  • Local D2K
  • Controls the execution of an itinerary
  • Manages the passing of data between modules and
    machines (if necessary)
  • Remote D2K
  • Executes a module on a remote machine

24
D2K Modules
  • Input Module: Loads data from the outside world.
  • Flat files, databases, etc.
  • Data Prep Module: Performs functions to select,
    clean, or transform the data.
  • Binning, Normalizing, Feature Selection, etc.
  • Compute Module: Performs the main algorithmic
    computations.
  • Naïve Bayesian, Decision Tree, Apriori, etc.
  • User Input Module: Requires interaction with the
    user.
  • Data Selection, Input and Output selection, etc.
  • Output Module: Saves data to the outside world.
  • Flat files, databases, etc.
  • Visualization Module: Provides visual feedback to
    the user.
  • Naïve Bayesian, Rule Association, Decision Tree,
    Parallel Coordinates, 2D Scatterplot, 3D Surface
    Plot
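The module taxonomy above can be pictured as a small pipeline. The sketch below is plain Java (all names are hypothetical, and the real D2K modules implement the D2K API rather than static methods); it chains an input stage, a data prep stage (normalization), a compute stage (the mean), and an output stage on a toy data set:

```java
import java.util.Arrays;

// Hypothetical illustration of the D2K module stages as plain functions.
public class StageSketch {
    // Input stage: load data (hard-coded here in place of a flat file or database).
    static double[] input() { return new double[] {2.0, 4.0, 6.0, 8.0}; }

    // Data prep stage: normalize values into the range [0, 1].
    static double[] normalize(double[] xs) {
        double min = Arrays.stream(xs).min().getAsDouble();
        double max = Arrays.stream(xs).max().getAsDouble();
        double[] out = new double[xs.length];
        for (int i = 0; i < xs.length; i++) out[i] = (xs[i] - min) / (max - min);
        return out;
    }

    // Compute stage: the main algorithmic step (here, just the mean).
    static double mean(double[] xs) { return Arrays.stream(xs).average().getAsDouble(); }

    // Output stage: save or display the result.
    public static void main(String[] args) {
        System.out.println("mean of normalized data = " + mean(normalize(input())));
    }
}
```

A user input or visualization stage would sit between or after these steps; the point is only that each stage has one narrow responsibility, which is what makes the modules recombinable.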

25
D2K Module Icon Description
  • Module Progress Bar
  • Appears during execution to show the percentage
    of total execution time spent in this module. It
    is green while the module is executing and red
    when it is not.
  • Input Trigger
  • Specifies the input for control flow.
  • Input Port
  • Rectangular shapes on the left side of the module
    represent the inputs for the module. They are
    colored according to the data type that they
    represent
  • Properties Symbol
  • If a P is shown in the lower left corner of the
    module, then the module has properties that can
    be set before execution.

  • Output Trigger
  • Specifies the output for control flow.
  • Output Port
  • Rectangular shapes on the right side of the module
    represent the outputs for the module. They are
    colored according to the data type that they
    represent.
  • Serializable Symbol
  • If an S is shown in the lower right corner of the
    module, then the module is serializable and can be
    saved.
26
D2K Itineraries
  • Itineraries are applications that have connected
    modules with their properties set.
  • D2K Core Itineraries include
  • Prediction
  • Discovery
  • Anomaly Detection
  • Data Selection
  • Transformation
  • Visualization
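As a rough sketch of the itinerary idea (modules connected together with their properties set before execution), the fragment below wires two modules by hand over a shared queue. None of these class names come from the real D2K API, and the D2K Toolkit does this wiring visually; this only illustrates the concept:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of an itinerary: modules pass data over a shared queue.
public class ItinerarySketch {
    interface Module { void execute(Deque<Object> pipe); }

    // A source module pushes data; its "count" property is set before execution.
    static class Source implements Module {
        int count = 3; // module property
        public void execute(Deque<Object> pipe) {
            for (int i = 1; i <= count; i++) pipe.add(i);
        }
    }

    // A sink module consumes whatever the upstream module produced.
    static class Sink implements Module {
        int sum = 0;
        public void execute(Deque<Object> pipe) {
            while (!pipe.isEmpty()) sum += (Integer) pipe.poll();
        }
    }

    public static void main(String[] args) {
        Deque<Object> pipe = new ArrayDeque<>();
        Source src = new Source();
        Sink sink = new Sink();
        src.execute(pipe);   // "connecting" the modules: shared pipe, ordered run
        sink.execute(pipe);
        System.out.println("sum = " + sink.sum);
    }
}
```

Swapping in a different sink, or changing the source's property, changes the application without touching either module's code, which is the reuse argument behind itineraries.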

27
D2K-Driven Applications
D2K-Driven applications are those that use D2K
modules and/or itineraries but do not require
interaction with the D2K Toolkit to function.
They can operate as stand-alone applications.
  • Advantages of Building D2K-Driven Applications
  • Code reuse shortens development time
  • Use the distributed computing features
    implemented in D2K
  • Current Application Development By the ALG
  • Text Analysis (ThemeWeaver uses T2K - Text to
    Knowledge)
  • Other Potential Application Areas
  • Image Analysis (I2K Image to Knowledge)

28
New D2K 3.0 Features
  • Extension of existing API
  • Include the ability to programmatically connect
    modules and set properties.
  • Allows D2K-driven applications to be developed.
  • Ability to pause and restart an itinerary.
  • Enhanced Distributed Computing
  • Modules that are re-entrant can be executed
    remotely.
  • Use of Jini services to look up distributed
    resources.
  • Used to specify the runtime layout of a
    distributed itinerary, which can be changed
    dynamically at runtime.
  • Processor Status Overlay
  • Shows user how distributed computing resources
    are being used.
  • Shows how many resources are ready to compute on
    each machine.
  • Distributed Checkpointing
  • Resource Manager
  • Provides an API for indicating data structures to
    be stored by the resource manager.
  • Resource manager provides these data structures
    to distributed machines.

29
Processor Status Overlay
  • Represents each machine being used.
  • Multiple lines represent multiple processors per
    machine.

30
Let's Look at D2K
  • Demos
  • D2K Toolkit
  • Prediction
  • Naive Bayesian
  • Decision Tree
  • Discovery
  • Rule Association
  • Text Analysis (D2K)
  • Image Analysis (I2K)
  • Visualization

31
D2K SL
  • Intuitive interfaces into a subset of D2K
    functionality for non-data mining professionals.
  • Transparent access to mine data stored in
    databases.
  • Extensible from desktop to cluster to grid.
  • Visualization support at all stages of the data
    mining process.
  • Support for very large data sets.

32
New D2K User Interface D2K SL
  • Provides a step-by-step interface to guide users
    in data analysis
  • Uses the same D2K modules
  • Provides a way to capture different experiments
    (streams)

33
Another View of the New D2K User Interface D2K
SL
  • Helps users keep track of data
  • Define templates that can be reused in different
    experiments (streams)

34
How To Write A Module
  • How hard is it to write a module?
  • We have an API to define what a given module is.
  • Most modules need the following methods
    implemented
  • Module Info (getModuleInfo)
  • Input and Output Info (getInputInfo and
    getOutputInfo)
  • Input and Output Types (getInputTypes and
    getOutputTypes)
  • Names (getModuleName, getInputName,
    getOutputName)
  • Module execution (doit)
  • Flexibility exists for other methods to be
    overridden to provide different functionality.
  • Optional methods exist for providing more
    information about properties, module icon, etc.
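A skeleton of what this slide describes might look like the following. The method names (getModuleInfo, getInputInfo, getOutputInfo, getInputTypes, getOutputTypes, getModuleName, getInputName, getOutputName, doit) are taken from the slide itself, but the abstract base class and all signatures shown here are stand-ins, since the briefing does not give the real D2K base classes:

```java
// Stand-in base class; the real D2K infrastructure supplies its own,
// with signatures that may differ from these guesses.
abstract class Module {
    public abstract String getModuleInfo();
    public abstract String getModuleName();
    public abstract String[] getInputTypes();
    public abstract String[] getOutputTypes();
    public abstract String getInputInfo(int index);
    public abstract String getOutputInfo(int index);
    public abstract String getInputName(int index);
    public abstract String getOutputName(int index);
    public abstract void doit() throws Exception;
}

// A minimal compute-style module following the method list on the slide.
public class ExampleModule extends Module {
    public String getModuleInfo()  { return "Doubles every value in the input array."; }
    public String getModuleName()  { return "Doubler"; }
    public String[] getInputTypes()  { return new String[] {"double[]"}; }
    public String[] getOutputTypes() { return new String[] {"double[]"}; }
    public String getInputInfo(int i)  { return "Numeric values to transform."; }
    public String getOutputInfo(int i) { return "Transformed values."; }
    public String getInputName(int i)  { return "values"; }
    public String getOutputName(int i) { return "doubledValues"; }

    double[] input;   // in D2K these would arrive through the framework
    double[] output;

    // Module execution: the actual work happens in doit.
    public void doit() {
        output = new double[input.length];
        for (int i = 0; i < input.length; i++) output[i] = input[i] * 2;
    }

    public static void main(String[] args) throws Exception {
        ExampleModule m = new ExampleModule();
        m.input = new double[] {1.5, 2.5};
        m.doit();
        System.out.println(m.getModuleName() + " -> " + java.util.Arrays.toString(m.output));
    }
}
```

The metadata methods exist so the Toolkit can draw the module's icon, ports, and help text; only doit carries the algorithm.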

35
The ALG Team
  • Staff
  • Loretta Auvil
  • Ruth Aydt
  • Peter Bajcsy
  • Colleen Bushell
  • Dora Cai
  • David Clutter
  • Yair Even-Zohar
  • Lisa Gatzke
  • Vered Goren
  • Chris Navarro
  • Greg Pape
  • Tom Redman
  • Duane Searsmith
  • Andrew Shirk
  • Anca Suvaiala
  • David Tcheng
  • Michael Welge
  • Students
  • Tyler Alumbaugh
  • Bradley Berkin
  • Martin Butz
  • Peter Groves
  • Nazan Khan
  • Alexander Kosorukoff
  • Kiran Lakkaraju
  • Sang-Chul Lee
  • Sameer Mathur
  • Sunayana Saha
  • Arun Srinivasan
  • Bei Yu