Paying Attention to a Stepchild of Data Access and Integration on the Grid Named Data Preprocessing

1
Paying Attention to a Stepchild of Data Access
and Integration on the Grid Named Data
Preprocessing
  • Alexander Wöhrer, Lenka Nováková and Peter
    Brezany
  • Institute of Scientific Computing
  • University of Vienna
  • woehrer|brezany_at_par.univie.ac.at
  • Department of Cybernetics
  • Czech Technical University
  • novakova_at_labe.felk.cvut.cz

2
Content
  • Motivation
  • Tasks needed
  • Data Statistics
  • Data Preprocessing
  • Performance Tests
  • Performance setup
  • Discussion of results
  • Issues/Future work

3
Motivation
Dorian Pyle, Data Preparation for Data Mining, 1999
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2000
4
Motivation - Current Issues in DAI
[Diagram: distance between data and code. Stages shown: Data Access, Data Integration, Data Preprocessing (DPP), Data Mining. Tools shown: OGSA-DAI, OGSA-DQP, GDMS, GridMiner, Weka, SPSS.]
  • DPP for traditional row-based DM is done
    locally, with proprietary tools
  • far away from the data, which means expensive data movement
  • usefulness of the data is unknown!

5
Motivation
  • with no quality data, there can be no quality
    data mining results
  • pilot application: data from Traumatic Brain
    Injury patients for GridMiner
  • noisy
  • missing
  • most data mining methods won't work with
    unpreprocessed data
  • decision trees, ...
  • various methods: feature extraction, denoising,
    normalization, transformation, sampling
  • code should be applied as near to the data
    as possible (Gray 03)

6
Grid Data Mediation Service (GDMS)
  • tight federation, virtual integration, no proprietary solution
  • Mapping Schema (single view), built from UNION and JOIN
  • Transformation Functions (resolve heterogeneities)
  • query processing steps: Parse, Decompose, Optimize, Execute
  • Wrappers for the sources: JDBC, XMLDB, CSV File
  • OGSA-DAI R5, SQLQueryActivity
  • Supported SQL subset:
    SELECT column, column FROM table
    WHERE condition AND|OR condition
    ORDER BY column [,column] [ASC|DESC]

7
Example I: Mapping Schema
<VDSTable name="patient">
  <union kind="all">
    <join>
      <select source="xmldb:.." name="A">
        <mapSource>.</mapSource>
        <sourcePart>collectionXYZ</sourcePart>
      </select>
      <select source="jdbc://" name="B">
        <mapSource>.</mapSource>
        <sourcePart>databaseXYZ</sourcePart>
      </select>
      <joinInfo kind="inner">
        <left keys="pid"/>
        <right keys="pid"/>
      </joinInfo>
    </join>
    <select source="file://..." name="C">
      <mapSource>.</mapSource>
      <sourcePart></sourcePart>
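
The semantics of this mapping (source C combined by UNION ALL with the inner join of A and B on pid) can be sketched in plain Java. This is only a rough illustration, not GDMS code: the class name VirtualPatientView, the row representation as maps, and both method names are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the mapping semantics; not part of GDMS.
class VirtualPatientView {

    // Inner join: keep only pairs of rows whose key column values are equal.
    static List<Map<String, Object>> innerJoin(List<Map<String, Object>> left,
                                               List<Map<String, Object>> right,
                                               String key) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> l : left) {
            Object leftKey = l.get(key);
            if (leftKey == null) {
                continue; // rows without a key never match
            }
            for (Map<String, Object> r : right) {
                if (leftKey.equals(r.get(key))) {
                    Map<String, Object> row = new LinkedHashMap<>(l);
                    row.putAll(r); // columns of the right row are appended
                    out.add(row);
                }
            }
        }
        return out;
    }

    // UNION ALL: concatenate both row lists and keep duplicates.
    static List<Map<String, Object>> unionAll(List<Map<String, Object>> a,
                                              List<Map<String, Object>> b) {
        List<Map<String, Object>> out = new ArrayList<>(a);
        out.addAll(b);
        return out;
    }
}
```

A virtual patient table would then correspond to unionAll(innerJoin(a, b, "pid"), c), where a, b and c are the rows delivered by the three wrappers.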

8
Example II: Transformations
CSV file line example:
1|Woehrer|Alexander|Vienna|12/12/1912|1/1/2004
<mapSource>
  <ColSeperator>|</ColSeperator>
  <LineSeperator>\r\n</LineSeperator>
  <column ref="p_name" transform="combine(fn,ln)">
  </column>
  <column ref="ln">
    <source>2</source>
  </column>
  <column ref="fn">
    <source>3</source>
  </column>
</mapSource>

// Transformation function for the CSV file
public class TestTransform {
    public static String combine(String one, String two) {
        return one + two;
    }
}
9
Basic Idea
  • provide DPP functionality as near to the source
    as possible

[Diagram: distance between data and code. Elements shown: data, Data Access and Integration, DPP, data statistics, Data Mining.]
  • DPP for traditional row-based DM is done
    remotely, with standard methods
  • code is close to the data
  • data statistics help to decide what DPP methods
    to apply, if at all

10
Basic Idea - Solution
  • extend OGSA-DAI with DPP functionality
  • open and extensible interface/engine
  • relocating the data preprocessing task towards
    the Grid Data Service
  • 2 new groups of activities
  • DataStatistics (DS): basic and advanced
  • DPPMethods (DPP)

11
Usage Scenario I - Modes
User input
Pre-setup with local knowledge
active usage
passive usage
12
Usage Scenario II - Interactions
  • Query
  • Step 1) Get basic DS (data statistics) about some data set
  • Step 2) Try to decide if data is useful
  • yes: apply DPP
  • no: find other source
  • not enough info: Step 3) Apply advanced DS

13
Basic Data Statistics
  • extracts basic information about a dataset:
    missing/total frequency, number of distinct values,
    max, min, mean, standard deviation
  • used to decide what DPP technique to use and
    whether DPP is necessary at all (especially
    interesting for very expensive DPP methods)
  • allows usage of IsMissing probes
  • no user input required!

PMML 3.0
Statistics-Activity
Webrowset
WRSID
14
Data Statistics - PMML
<DataDictionary numberOfFields="11">
  <DataField dataType="double" name="B" optype="continuous"/>
  ...
</DataDictionary>
<ModelStats>
  <UnivariateStats field="B">
    <Counts missingFreq="0" totalFreq="25000">
      <Extension name="distinctValues" value="1000"/>
    </Counts>
    <NumericInfo maximum="4.999847" mean="2.4973846009599865"
                 minimum="3.05E-4" standardDeviation="2.089256974654939"/>
  </UnivariateStats>
  ...
</ModelStats>
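
The figures reported in the PMML above (missing and total frequency, distinct values, minimum, maximum, mean, standard deviation) could be computed for one numeric column roughly as follows. This is a minimal sketch under stated assumptions: the class BasicColumnStats and its fields are hypothetical names, a null entry stands for a missing value, and the population standard deviation is used because the slides do not say which variant the activity computes.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; names are not taken from OGSA-DAI.
class BasicColumnStats {
    int totalFreq;
    int missingFreq;
    int distinctValues;
    double minimum = Double.NaN;
    double maximum = Double.NaN;
    double mean = Double.NaN;
    double standardDeviation = Double.NaN;

    // A null entry is treated as a missing value (for example an SQL NULL).
    static BasicColumnStats of(List<Double> column) {
        BasicColumnStats s = new BasicColumnStats();
        Set<Double> distinct = new HashSet<>();
        double sum = 0.0;
        int present = 0;
        for (Double v : column) {
            s.totalFreq++;
            if (v == null) {
                s.missingFreq++;
                continue;
            }
            distinct.add(v);
            if (present == 0 || v < s.minimum) s.minimum = v;
            if (present == 0 || v > s.maximum) s.maximum = v;
            sum += v;
            present++;
        }
        s.distinctValues = distinct.size();
        if (present > 0) {
            s.mean = sum / present;
            // Second pass: population variance over the non-missing values.
            double squared = 0.0;
            for (Double v : column) {
                if (v != null) {
                    squared += (v - s.mean) * (v - s.mean);
                }
            }
            s.standardDeviation = Math.sqrt(squared / present);
        }
        return s;
    }
}
```

No user input beyond the column itself is needed, which matches the "basic" statistics group.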
15
Advanced Data Statistics
  • needs additional user inputs:
    interval specification, type of attribute (continuous, ...)
  • allows usage of IsInvalid probes

WRSID
AdvStatistics-Activity
PMML 3.0
PMML 3.0
16
Data Statistics PMML
<DataDictionary numberOfFields="2">
  <DataField dataType="double" name="A" optype="continuous">
    <Interval closure="closedOpen" leftMargin="14.0" rightMargin="16.0"/>
  </DataField>
  ...
<UnivariateStats field="A">
  <Counts invalidFreq="81" missingFreq="0" totalFreq="100">
    <Extension name="distinctValues" value="100"/>
  </Counts>
  <NumericInfo maximum="19.822993" mean="14.891346052631581"
               minimum="12.006287" standardDeviation="0.35539087801771935"/>
  <ContStats totalValuesSum="282.93557500000003">
    <Interval closure="closedOpen" leftMargin="14.0" rightMargin="16.0"/>
  </ContStats>
</UnivariateStats>
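
The interval check behind invalidFreq can be sketched as follows. The class IntervalCheck and its methods are hypothetical names chosen for this example, and only the closedOpen closure from the PMML above is handled. In that PMML, 282.935575 divided by the 19 values that are not invalid (100 minus 81) gives the reported mean, which is consistent with sum and mean being taken over the valid values only; the sketch follows that reading as an assumption.

```java
import java.util.List;

// Hypothetical sketch; names are not taken from OGSA-DAI or a PMML library.
class IntervalCheck {
    final double leftMargin;
    final double rightMargin;

    IntervalCheck(double leftMargin, double rightMargin) {
        this.leftMargin = leftMargin;
        this.rightMargin = rightMargin;
    }

    // closedOpen: the left margin is included, the right margin is excluded.
    boolean contains(double value) {
        return value >= leftMargin && value < rightMargin;
    }

    // Number of non-missing values that fall outside the interval.
    int invalidFreq(List<Double> column) {
        int invalid = 0;
        for (Double v : column) {
            if (v != null && !contains(v)) {
                invalid++;
            }
        }
        return invalid;
    }

    // Sum of the values inside the interval (assumed meaning of totalValuesSum).
    double validSum(List<Double> column) {
        double sum = 0.0;
        for (Double v : column) {
            if (v != null && contains(v)) {
                sum += v;
            }
        }
        return sum;
    }
}
```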
17
Data Preprocessing - Probes
Sampling
Row 1
  • Column-based probes can be used for
    IsMissing and IsInvalid

Col 1
Col 2
Denoising
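
One way to picture a column-based probe is as a named predicate that reports the rows it flags in a column. The sketch below is an assumption about how such a probe could look, not the prototype's design: ColumnProbe, flaggedRows and the factory methods are hypothetical names, null stands for a missing value, and the invalid check reuses a closedOpen interval.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of a column probe; not the original implementation.
class ColumnProbe {
    final String cause;
    private final Predicate<Double> matches;

    private ColumnProbe(String cause, Predicate<Double> matches) {
        this.cause = cause;
        this.matches = matches;
    }

    // Flags rows whose value is missing (null).
    static ColumnProbe isMissing() {
        return new ColumnProbe("isMissing", v -> v == null);
    }

    // Flags non-missing values outside the closedOpen interval [left, right).
    static ColumnProbe isInvalid(double leftMargin, double rightMargin) {
        return new ColumnProbe("isInvalid",
                v -> v != null && (v < leftMargin || v >= rightMargin));
    }

    // Returns the zero-based indices of the rows flagged by this probe.
    List<Integer> flaggedRows(List<Double> column) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < column.size(); i++) {
            if (matches.test(column.get(i))) {
                rows.add(i);
            }
        }
        return rows;
    }
}
```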
18
Data Preprocessing - Interface
<dppMethods name="dpp1">
  <inWebRowsetID>123</inWebRowsetID>
  <column colToApply="1">
    <methode forCause="isMissing" impl="basic.ReplaceMissingValuesByValue">
      <Param>15</Param>
    </methode>
  </column>
  <column colToApply="4">
    <methode forCause="isMissing" impl="basic.ReplaceMissingValuesByMean"/>
  </column>
  <outWebRowset name="234"/>
</dppMethods>
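
The two methods named in the request above, replacing missing values by a fixed value and by the column mean, might behave as in the sketch below. The class ReplaceMissing and its method signatures are hypothetical stand-ins for the impl names in the XML; the original classes are not shown in the slides. A null entry stands for a missing value, and the mean is assumed to be taken over the non-missing values.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for basic.ReplaceMissingValuesByValue and
// basic.ReplaceMissingValuesByMean; not the original classes.
class ReplaceMissing {

    // Returns a copy of the column with every missing value replaced.
    static List<Double> byValue(List<Double> column, double replacement) {
        List<Double> out = new ArrayList<>(column.size());
        for (Double v : column) {
            out.add(v == null ? replacement : v);
        }
        return out;
    }

    // Replaces missing values by the mean of the non-missing values.
    static List<Double> byMean(List<Double> column) {
        double sum = 0.0;
        int present = 0;
        for (Double v : column) {
            if (v != null) {
                sum += v;
                present++;
            }
        }
        if (present == 0) {
            return new ArrayList<>(column); // no mean available, leave unchanged
        }
        return byValue(column, sum / present);
    }
}
```

With the request above, column 1 would be passed to byValue with the parameter 15 and column 4 to byMean.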
19
Setup for the Performance Tests (LAN)
  • Sun Fire V880: 8 GB RAM, 4 x UltraSPARC-III 750 MHz, Solaris 9
  • OGSA-DAI R5
  • MySQL: 1,000,000 rows, 11 columns, noisy/missing data
  • FTP server (for preprocessed results)
  • thin client
  • network links: 100 Mbit/s full duplex
  • data exchanged: Webrowset, PMML

20
Performance of the Activities
21
Where we spend the time....
22
Related Work
  • DataCutter: carries out a rich set of queries and
    application-specific data transformations on data
    streams
  • Chimera: automated data generation according to
    some data derivation procedure

23
Issues/Future Work
  • expose available DPP methods
  • advanced DPP methods
  • passive mode
  • documentation of how a data set has been
    processed
  • inspect the processes to be applied in advance
  • get data unprocessed?

24
Conclusion
  • support for a row-based DPP framework for
    traditional DM is missing
  • contributions
  • Data Statistics activity: basic (no additional
    user input required) and advanced
  • Data Preprocessing activity: flexible row/column
    probes, applicable to active/passive usage mode
  • implemented centralized prototype: column probes
    for IsMissing and IsInvalid, shows feasibility
    of concepts

25
References
  • Jim Gray, Distributed Computing Economics, TR,
    2003
  • Jiawei Han and Micheline Kamber, Data Mining:
    Concepts and Techniques, 2000
  • Dorian Pyle, Data Preparation for Data Mining,
    1999
  • OGSA-DAI, www.ogsadai.org
  • Michael Beynon, Renato Ferreira, Tahsin Kurc,
    Alan Sussman and Joel Saltz, DataCutter:
    Middleware for Filtering Very Large Scientific
    Datasets on Archival Storage Systems, 2000
  • OGSA-DQP, www.ogsadai.org/dqp
  • Ian Foster, Jens Vöckler, Michael Wilde and Yong
    Zhao, Chimera: A Virtual Data System for
    Representing, Querying, and Automating Data
    Derivation, 2002