Title: Paying Attention to a Stepchild of Data Access and Integration on the Grid Named Data Preprocessing
1Paying Attention to a Stepchild of Data Access
and Integration on the Grid Named Data
Preprocessing
- Alexander Wöhrer, Lenka Nováková and Peter Brezany
- Institute of Scientific Computing
- University of Vienna
- {woehrer, brezany}_at_par.univie.ac.at
- Department of Cybernetics
- Czech Technical University
- novakova_at_labe.felk.cvut.cz
2Content
- Motivation
- Tasks needed
- Data Statistics
- Data Preprocessing
- Performance Tests
- Performance setup
- Discussion of results
- Issues/Future work
3Motivation
- Dorian Pyle, Data Preparation for Data Mining, 1999
- Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2000
4Motivation - Current Issues in DAI
[Figure: distance between data and code for the stages Data Access, Data Integration, Data Preprocessing (DPP) and Data Mining; tools shown: OGSA-DAI, OGSA-DQP, GDMS, GridMiner, Weka, SPSS]
- DPP for traditional row-based DM is done
- locally, with proprietary tools
- far away from the data => expensive data movement
- usefulness of the data unknown!
5Motivation
- with no quality data, there can be no quality data mining results
- pilot application: data from Traumatic Brain Injury patients for GridMiner
- noisy
- missing
- most data mining methods won't work with unpreprocessed data
- decision trees, ...
- various methods: feature extraction, denoising, normalization, transformation, sampling
- code should be applied as near to the data as possible (Gray 03)
6Grid Data Mediation Service (GDMS)
- tight federation
- virtual integration
- no proprietary solution
- Mapping Schema (single view)
- Transformation Functions (resolve heterogeneities)
[Figure: GDMS on top of OGSA-DAI R5. An SQLQueryActivity is handled in the phases Parse, Decompose, Optimize, Execute; wrappers connect JDBC, XMLDB and CSV file sources, whose results are combined with UNION and JOIN.]
- Supported SQL subset
- SELECT column, column FROM table
- WHERE condition AND|OR condition
- ORDER BY column, column ASC|DESC
7Example I Mapping Schema
<VDSTable name="patient">
  <union kind="all">
    <join>
      <select source="xmldb:.." name="A">
        <mapSource>.</mapSource>
        <sourcePart>collectionXYZ</sourcePart>
      </select>
      <select source="jdbc://" name="B">
        <mapSource>.</mapSource>
        <sourcePart>databaseXYZ</sourcePart>
      </select>
      <joinInfo kind="inner">
        <left keys="pid"/>
        <right keys="pid"/>
      </joinInfo>
    </join>
    <select source="file://..." name="C">
      <mapSource>.</mapSource>
      <sourcePart></sourcePart>
      ...
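The semantics of this mapping can be sketched in a few lines of Java. This is an illustration only, not GDMS code: it inner-joins two in-memory sources (A and B) on the key pid and then appends the rows of a third source (C) with UNION ALL semantics. The class name MediationSketch, the representation of a row as a Map and the helper row(...) are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of <join> with kind="inner" followed by <union kind="all">.
final class MediationSketch {

    // Inner join: keep only row pairs whose values for the key column are equal.
    static List<Map<String, String>> innerJoin(List<Map<String, String>> left,
                                               List<Map<String, String>> right,
                                               String key) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> l : left) {
            for (Map<String, String> r : right) {
                String leftKey = l.get(key);
                if (leftKey != null && leftKey.equals(r.get(key))) {
                    Map<String, String> merged = new LinkedHashMap<>(l);
                    merged.putAll(r); // columns of the right row are added
                    out.add(merged);
                }
            }
        }
        return out;
    }

    // UNION ALL: concatenate both inputs and keep duplicates.
    static List<Map<String, String>> unionAll(List<Map<String, String>> a,
                                              List<Map<String, String>> b) {
        List<Map<String, String>> out = new ArrayList<>(a);
        out.addAll(b);
        return out;
    }

    // Builds a row from alternating column names and values.
    static Map<String, String> row(String... nameValuePairs) {
        Map<String, String> m = new LinkedHashMap<>();
        for (int i = 0; i + 1 < nameValuePairs.length; i += 2) {
            m.put(nameValuePairs[i], nameValuePairs[i + 1]);
        }
        return m;
    }
}
```

A virtual patient table would then be unionAll(innerJoin(a, b, "pid"), c), computed at query time rather than materialized.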
8Example II Transformations
- CSV file line example
- 1WoehrerAlexanderVienna12/12/19121/1/2004
<mapSource>
  <ColSeperator></ColSeperator>
  <LineSeperator>\r\n</LineSeperator>

  <column ref="p_name" transform="combine(fn,ln)">
  </column>
  <column ref="ln">
    <source>2</source>
  </column>
  <column ref="fn">
    <source>3</source>
  </column>

</mapSource>
// Transformation function for the CSV file
public class TestTransform {
    public static String combine(String one, String two) {
        return one + two;
    }
}
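How such a mapping could be applied to one CSV line is sketched below. The sketch is not the GDMS implementation: the class CsvMappingSketch and the method mapLine are hypothetical, the column separator is passed in as a parameter because the separator of the example line is not recoverable here, and source positions are assumed to be counted from 1.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch: resolves ln and fn by position, then derives p_name.
final class CsvMappingSketch {

    // Same behaviour as the combine function of the TestTransform class.
    static String combine(String one, String two) {
        return one + two;
    }

    // The separator is an assumption supplied by the caller.
    static Map<String, String> mapLine(String line, String separator) {
        // Pattern.quote makes characters such as '|' match literally.
        String[] fields = line.split(Pattern.quote(separator), -1);
        Map<String, String> row = new LinkedHashMap<>();
        row.put("ln", fields[1]); // source 2, assuming positions start at 1
        row.put("fn", fields[2]); // source 3
        // p_name has no source position; it is derived via combine(fn, ln).
        row.put("p_name", combine(row.get("fn"), row.get("ln")));
        return row;
    }
}
```

Derived columns such as p_name are computed only after the positional columns they depend on have been resolved.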
9Basic Idea
- provide DPP functionality as near to the source as possible
[Figure: distance between data and code; labels: data, Data Access and Integration, DPP, Data Mining, data statistics]
- DPP for traditional row-based DM is done
- remotely, with standard methods
- code is close to the data
- data statistics help to decide
- what DPP methods to apply
- if at all
10Basic Idea - Solution
- extend OGSA-DAI with DPP functionality
- open and extensible interface/engine
- relocating the data preprocessing task towards the Grid Data Service
- 2 new groups of activities
- DataStatistics (DS)
- basic
- advanced
- DPPMethods (DPP)
11Usage Scenario I - Modes
- active usage: user input
- passive usage: pre-setup with local knowledge
12Usage Scenario II - Interactions
- Step 1) Get basic DS about some data set (Query, Data statistics)
- Step 2) Try to decide if data is useful
- yes: DPP
- no: Find other source
- not enough info: Step 3) Apply advanced DS
13Basic Data Statistics
- extracts basic information about a dataset
- missing/total frequency
- number of distinct values
- max, min, mean, standard deviation
- used to decide
- what DPP technique to use
- whether DPP is necessary at all (especially interesting for very expensive DPP methods)
- allows usage of IsMissing probes
- no user input required!
[Figure: Statistics-Activity; labels: Webrowset, WRSID, PMML 3.0]
14Data Statistics - PMML
<DataDictionary numberOfFields="11">
  <DataField dataType="double" name="B" optype="continuous"/>
  ...
</DataDictionary>
<ModelStats>
  <UnivariateStats field="B">
    <Counts missingFreq="0" totalFreq="25000">
      <Extension name="distinctValues" value="1000"/>
    </Counts>
    <NumericInfo maximum="4.999847" mean="2.4973846009599865"
                 minimum="3.05E-4" standardDeviation="2.089256974654939"/>
  </UnivariateStats>
  ...
</ModelStats>
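The per-column figures reported here (missingFreq, totalFreq, distinctValues, minimum, maximum, mean, standardDeviation) can be computed in a single pass plus one pass for the deviation. The following is a minimal sketch, not the Statistics-Activity itself. The class BasicColumnStats is hypothetical, missing values are represented as null, and the population formula is used for the standard deviation because the slides do not state which formula the prototype applies.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the figures a basic statistics pass reports per column.
final class BasicColumnStats {
    final long totalFreq;      // all rows, including missing ones
    final long missingFreq;    // rows without a value
    final int distinctValues;
    final double minimum;
    final double maximum;
    final double mean;
    final double standardDeviation;

    // null entries stand for missing values (an assumption of this sketch).
    BasicColumnStats(Double[] column) {
        Set<Double> distinct = new HashSet<>();
        long missing = 0;
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double sum = 0.0;
        for (Double v : column) {
            if (v == null) {
                missing++;
                continue;
            }
            distinct.add(v);
            min = Math.min(min, v);
            max = Math.max(max, v);
            sum += v;
        }
        long present = column.length - missing;
        double m = present > 0 ? sum / present : Double.NaN;
        double squaredDeviations = 0.0;
        for (Double v : column) {
            if (v != null) {
                squaredDeviations += (v - m) * (v - m);
            }
        }
        totalFreq = column.length;
        missingFreq = missing;
        distinctValues = distinct.size();
        minimum = present > 0 ? min : Double.NaN;
        maximum = present > 0 ? max : Double.NaN;
        mean = m;
        // Population standard deviation (assumed, see lead-in).
        standardDeviation = present > 0 ? Math.sqrt(squaredDeviations / present) : Double.NaN;
    }
}
```

No user input is needed for any of these values, which matches the idea of the basic statistics.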
15Advanced Data Statistics
- needs additional user inputs
- interval specification
- type of attribute (continuous, ...)
- allows usage of IsInvalid probes
[Figure: AdvStatistics-Activity; labels: WRSID, PMML 3.0 (twice)]
16Data Statistics PMML
<DataDictionary numberOfFields="2">
  <DataField dataType="double" name="A" optype="continuous">
    <Interval closure="closedOpen" leftMargin="14.0" rightMargin="16.0"/>
  </DataField>
  ...
...
<UnivariateStats field="A">
  <Counts invalidFreq="81" missingFreq="0" totalFreq="100">
    <Extension name="distinctValues" value="100"/>
  </Counts>
  <NumericInfo maximum="19.822993" mean="14.891346052631581"
               minimum="12.006287" standardDeviation="0.35539087801771935"/>
  <ContStats totalValuesSum="282.93557500000003">
    <Interval closure="closedOpen" leftMargin="14.0" rightMargin="16.0"/>
  </ContStats>
</UnivariateStats>
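One way invalidFreq could be derived from a user-supplied interval is sketched below. This is an illustration under assumptions, not the AdvStatistics-Activity: the class IntervalCheck and its methods are hypothetical, the four closure kinds mirror the values PMML allows for the closure attribute, and missing values (null) are counted as missing rather than invalid.

```java
// Hypothetical sketch: counts values that fall outside a user-supplied interval.
final class IntervalCheck {

    // Mirrors the PMML closure values openOpen, openClosed, closedOpen, closedClosed.
    enum Closure { OPEN_OPEN, OPEN_CLOSED, CLOSED_OPEN, CLOSED_CLOSED }

    static boolean contains(double v, double left, double right, Closure c) {
        boolean leftClosed = c == Closure.CLOSED_OPEN || c == Closure.CLOSED_CLOSED;
        boolean rightClosed = c == Closure.OPEN_CLOSED || c == Closure.CLOSED_CLOSED;
        boolean leftOk = leftClosed ? v >= left : v > left;
        boolean rightOk = rightClosed ? v <= right : v < right;
        return leftOk && rightOk;
    }

    // null entries are treated as missing, not as invalid (assumption).
    static long invalidFreq(Double[] column, double left, double right, Closure c) {
        long invalid = 0;
        for (Double v : column) {
            if (v != null && !contains(v, left, right, c)) {
                invalid++;
            }
        }
        return invalid;
    }
}
```

With closedOpen and margins 14.0 and 16.0, the value 14.0 is valid while 16.0 is not.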
17Data Preprocessing - Probes
- Column-based probes can be used for
- IsMissing
- IsInvalid
[Figure: row and column probes; labels: Row 1, Col 1, Col 2, Sampling, Denoising]
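A column probe can be thought of as a predicate that is evaluated for every cell of one column and reports the rows it fires on. The sketch below illustrates that idea only. The class ColumnProbes and its methods are hypothetical, missing values are represented as null, and the invalid check assumes a closed-open interval.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of column-based probes as predicates over cell values.
final class ColumnProbes {

    // Fires on cells that carry no value.
    static Predicate<Double> isMissing() {
        return v -> v == null;
    }

    // Fires on present values outside [left, right) (closed-open is assumed).
    static Predicate<Double> isInvalid(double left, double right) {
        return v -> v != null && (v < left || v >= right);
    }

    // Returns the zero-based row indices on which the probe fires.
    static List<Integer> matchingRows(Double[] column, Predicate<Double> probe) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < column.length; i++) {
            if (probe.test(column[i])) {
                rows.add(i);
            }
        }
        return rows;
    }
}
```

A preprocessing method then only needs to touch the rows a probe has reported for its column.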
18Data Preprocessing - Interface
<dppMethods name="dpp1">
  <inWebRowsetID>123</inWebRowsetID>
  <column colToApply="1">
    <methode forCause="isMissing" impl="basic.ReplaceMissingValuesByValue">
      <Param>15</Param>
    </methode>
  </column>
  <column colToApply="4">
    <methode forCause="isMissing" impl="basic.ReplaceMissingValuesByMean"/>
  </column>
  <outWebRowset name="234"/>
</dppMethods>
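What the two methods named in this request might do to a column is sketched below. The method names follow the impl attributes, but the class ReplaceMissingSketch, the signatures and the representation of missing values as null are assumptions, not the prototype's actual code.

```java
// Hypothetical sketch of two basic methods for the isMissing cause.
final class ReplaceMissingSketch {

    // Replaces every missing cell by a fixed value (e.g. the Param 15 above).
    static double[] byValue(Double[] column, double replacement) {
        double[] out = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            out[i] = column[i] != null ? column[i] : replacement;
        }
        return out;
    }

    // Replaces every missing cell by the mean of the present values.
    static double[] byMean(Double[] column) {
        double sum = 0.0;
        int present = 0;
        for (Double v : column) {
            if (v != null) {
                sum += v;
                present++;
            }
        }
        if (present == 0) {
            throw new IllegalArgumentException("column has no values to average");
        }
        return byValue(column, sum / present);
    }
}
```

The mean needed by byMean is already part of the basic data statistics, so a real implementation could reuse it instead of recomputing it.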
19Setup for the Performance Tests (LAN)
- FTP Server (for preprocessed results)
- Sun Fire V880, 8 GB RAM, 4 x UltraSPARC-III 750 MHz, Solaris 9
- OGSA-DAI R5
- MySQL, 1,000,000 rows, 11 columns, noisy/missing
- Thin Client
- network links: 100 Mbit/s full duplex
- exchanged formats: Webrowset, PMML
20Performance of the Activities
21Where we spend the time....
22Related Work
- DataCutter: carry out a rich set of queries and application-specific data transformations on data streams
- Chimera: automated data generation according to some data derivation procedure
23Issues/Future Work
- expose available DPP methods
- advanced DPP methods
- passive mode
- documentation of how a data set has been processed
- inspect the processes to be applied in advance
- get data unprocessed?
24Conclusion
- a row-based DPP framework for traditional DM support is missing
- contributions
- Data Statistics activity
- basic (no additional user input required)
- advanced
- Data Preprocessing activity
- flexible row and column probes
- applicable to active/passive usage mode
- implemented centralized prototype
- column probes for IsMissing, IsInvalid
- shows feasibility of concepts
25References
- Jim Gray, Distributed Computing Economics, TR, 2003
- Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2000
- Dorian Pyle, Data Preparation for Data Mining, 1999
- OGSA-DAI, www.ogsadai.org
- Michael Beynon, Renato Ferreira, Tahsin Kurc, Alan Sussman and Joel Saltz, DataCutter: Middleware for Filtering Very Large Scientific Datasets on Archival Storage Systems, 2000
- OGSA-DQP, www.ogsadai.org/dqp
- Ian Foster, Jens Vöckler, Michael Wilde and Yong Zhao, Chimera: A Virtual Data System for Representing, Querying, and Automating Data Derivation, 2002