Paying Attention to a Stepchild of Data Access and Integration on the Grid Named Data Preprocessing

Provided by: woeh

1
Paying Attention to a Stepchild of Data Access
and Integration on the Grid Named Data
Preprocessing

Alexander Wöhrer, Lenka Nováková and Peter
Brezany
Institute of Scientific Computing
University of Vienna
{woehrer, brezany}_at_par.univie.ac.at
Department of Cybernetics
Czech Technical University
novakova_at_labe.felk.cvut.cz

2
Content

Motivation
Tasks needed
Data Statistics
Data Preprocessing
Performance Tests
Performance setup
Discussion of results
Issues/Future work

3
Motivation
Dorian Pyle, Data Preparation for Data Mining, 1999
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2000
4
Motivation - Current Issues in DAI
Distance between data and code
Data Access
Data Integration
Data Preprocessing (DPP)
Data Mining
data
data
data
OGSA-DAI
OGSA-DQP GDMS
Grid Miner
Weka, SPSS

DPP for traditional row based DM done
locally, with proprietary tools
far away from the data -> expensive data movement
usefulness of the data unknown!

5
Motivation

with no quality data, there can be no quality
data mining results
pilot application data from Traumatic Brain
Injury patients for GridMiner
noisy
missing
most data mining methods won't work with
unpreprocessed data
Decision trees,....
various methods
code should be applied as near
to the data as possible (Gray 03)

Feature extraction
Denoising
Normalization
Transformation
Sampling
6
Grid Data Mediation Service (GDMS)
tight federation
Wrapper
OGSA-DAI R5
JDBC
Mapping Schema (single view)
Transformation Functions (resolve heterogeneities)
UNION
no proprietary solution
virtual integration
JOIN
Parse Decompose Optimize Execute
Wrapper
XMLDB
SQLQueryActivity
Wrapper

Supported SQL subset
SELECT column [, column] FROM table
WHERE condition [AND|OR condition]
ORDER BY column [, column] [ASC|DESC]

CSV File
7
Example I Mapping Schema

<VDSTable name="patient">
  <union kind="all">
    <join>
      <select source="xmldb:.." name="A">
        <mapSource>.</mapSource>
        <sourcePart>collectionXYZ</sourcePart>
      </select>
      <select source="jdbc://" name="B">
        <mapSource>.</mapSource>
        <sourcePart>databaseXYZ</sourcePart>
      </select>
      <joinInfo kind="inner">
        <left keys="pid"/>
        <right keys="pid"/>
      </joinInfo>
    </join>
    <select source="file://..." name="C">
      <mapSource>.</mapSource>
      <sourcePart></sourcePart>
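
The mapping above describes the virtual table patient as a UNION ALL of (A INNER JOIN B on pid) and source C. The following is a minimal Java sketch of that semantics over in-memory rows. The class name MediationSketch, the methods innerJoin and unionAll, and the row representation as maps are illustrative assumptions and are not taken from GDMS.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the mapping semantics; not GDMS code.
final class MediationSketch {

    // Inner join of two row lists on a shared key column (nested loop).
    static List<Map<String, Object>> innerJoin(List<Map<String, Object>> left,
                                               List<Map<String, Object>> right,
                                               String key) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> l : left) {
            Object lk = l.get(key);
            if (lk == null) {
                continue; // rows without a key never match
            }
            for (Map<String, Object> r : right) {
                if (lk.equals(r.get(key))) {
                    Map<String, Object> row = new LinkedHashMap<>(l);
                    row.putAll(r); // columns of the right row are appended
                    out.add(row);
                }
            }
        }
        return out;
    }

    // UNION ALL: plain concatenation, duplicates are kept.
    static List<Map<String, Object>> unionAll(List<Map<String, Object>> a,
                                              List<Map<String, Object>> b) {
        List<Map<String, Object>> out = new ArrayList<>(a);
        out.addAll(b);
        return out;
    }
}
```

With rows from A, B and C loaded, the virtual table would correspond to unionAll(innerJoin(a, b, "pid"), c). A real mediator would push as much of this work as possible down to the sources instead of joining in memory.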

8
Example II Transformations

CSV file line example
1WoehrerAlexanderVienna12/12/19121/1/2004
<mapSource>
  <ColSeperator></ColSeperator>
  <LineSeperator>\r\n</LineSeperator>
  <column ref="p_name" transform="combine(fn,ln)">
  </column>
  <column ref="ln">
    <source>2</source>
  </column>
  <column ref="fn">
    <source>3</source>
  </column>
</mapSource>

// Transformation function for the CSV file
public class TestTransform {
    // Assumes the two parts are joined with a single space.
    public static String combine(String one, String two) {
        return one + " " + two;
    }
}
9
Basic Idea

provide DPP functionality as near to the source
as possible

Distance between data and code
data
Data Access and Integration
DPP
Data Mining
data statistics

DPP for traditional row based DM done
remotely, with standard methods
code is close to the data
data statistics help to decide
what DPP methods to apply
if at all

10
Basic Idea - Solution

extend OGSA-DAI with DPP functionality
open and extensible interface/engine
relocating the data preprocessing task towards
the Grid Data Service
2 new groups of
activities
DataStatistics (DS)
basic
advanced
DPPMethods (DPP)

11
Usage Scenario I - Modes
User input
Pre-setup with local knowledge
active usage
passive usage
12
Usage Scenario II - Interactions
Find other source
Query
no
Data statistics
Step 2) Try to decide if data is useful
Step 3) Apply advanced DS
not enough info
yes

Step 1)
Get basic DS
about some
data set

DPP
13
Basic Data Statistics

extracts basic information about a dataset
missing/total frequency
number of distinct values
max, min, mean, standard deviation
used to decide
what DPP technique to use
whether DPP is necessary at all (especially
interesting for very expensive DPP methods)
allows usage of IsMissing probes
no user input required!

PMML 3.0
Statistics-Activity
Webrowset
WRSID
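
The basic statistics listed above can be gathered in a single pass over a column. The following Java sketch shows one way to do that. The class BasicColumnStats, the use of null for a missing value, and the choice of the population standard deviation are assumptions made for illustration; the slides do not say how the Statistics-Activity computes these values.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; not the OGSA-DAI Statistics-Activity implementation.
final class BasicColumnStats {
    final int totalFreq;      // all rows, including missing ones
    final int missingFreq;    // rows whose value is null
    final int distinctValues; // distinct non-missing values
    final double min;
    final double max;
    final double mean;
    final double standardDeviation; // population form (an assumption)

    private BasicColumnStats(int totalFreq, int missingFreq, int distinctValues,
                             double min, double max, double mean, double sd) {
        this.totalFreq = totalFreq;
        this.missingFreq = missingFreq;
        this.distinctValues = distinctValues;
        this.min = min;
        this.max = max;
        this.mean = mean;
        this.standardDeviation = sd;
    }

    // Single pass over one numeric column; null marks a missing value.
    static BasicColumnStats of(List<Double> column) {
        int missing = 0;
        int n = 0;
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double mean = 0.0;
        double m2 = 0.0; // running sum of squared deviations (Welford's method)
        Set<Double> distinct = new HashSet<>();
        for (Double v : column) {
            if (v == null) {
                missing++;
                continue;
            }
            n++;
            distinct.add(v);
            min = Math.min(min, v);
            max = Math.max(max, v);
            double delta = v - mean;
            mean += delta / n;
            m2 += delta * (v - mean);
        }
        if (n == 0) {
            // No usable values: numeric statistics are undefined.
            return new BasicColumnStats(column.size(), missing, 0,
                    Double.NaN, Double.NaN, Double.NaN, Double.NaN);
        }
        return new BasicColumnStats(column.size(), missing, distinct.size(),
                min, max, mean, Math.sqrt(m2 / n));
    }
}
```

Because no parameters are needed, such a pass can run without user input, which matches the slide's point that basic statistics require none.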
14
Data Statistics - PMML
<DataDictionary numberOfFields="11">
  <DataField dataType="double" name="B" optype="continuous"/>
  ...
</DataDictionary>
<ModelStats>
  <UnivariateStats field="B">
    <Counts missingFreq="0" totalFreq="25000">
      <Extension name="distinctValues" value="1000"/>
    </Counts>
    <NumericInfo maximum="4.999847"
                 mean="2.4973846009599865"
                 minimum="3.05E-4"
                 standardDeviation="2.089256974654939"/>
  </UnivariateStats>
  ...
</ModelStats>
15
Advanced Data Statistics

needs additional user inputs
interval specification
type of attribute (continuous,...)
allows usage of IsInvalid probes

WRSID
AdvStatistics-Activity
PMML 3.0
PMML 3.0
16
Data Statistics PMML
<DataDictionary numberOfFields="2">
  <DataField dataType="double" name="A" optype="continuous">
    <Interval closure="closedOpen" leftMargin="14.0" rightMargin="16.0"/>
  </DataField>
  ......
<UnivariateStats field="A">
  <Counts invalidFreq="81" missingFreq="0" totalFreq="100">
    <Extension name="distinctValues" value="100"/>
  </Counts>
  <NumericInfo maximum="19.822993"
               mean="14.891346052631581"
               minimum="12.006287"
               standardDeviation="0.35539087801771935"/>
  <ContStats totalValuesSum="282.93557500000003">
    <Interval closure="closedOpen" leftMargin="14.0" rightMargin="16.0"/>
  </ContStats>
</UnivariateStats>
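
The interval in the example uses closure="closedOpen", so a value counts as valid when 14.0 <= v < 16.0. The following Java sketch shows how an IsInvalid probe could test values against such an interval and produce an invalidFreq count. The class IntervalProbe and its methods are hypothetical names, and treating a missing value as "not invalid" is an assumption; only the four closure kinds follow the PMML Interval element.

```java
import java.util.List;

// Hypothetical sketch of an IsInvalid probe; not the AdvStatistics-Activity code.
final class IntervalProbe {
    // The four closure kinds of a PMML Interval.
    enum Closure { OPEN_OPEN, OPEN_CLOSED, CLOSED_OPEN, CLOSED_CLOSED }

    final Closure closure;
    final double leftMargin;
    final double rightMargin;

    IntervalProbe(Closure closure, double leftMargin, double rightMargin) {
        this.closure = closure;
        this.leftMargin = leftMargin;
        this.rightMargin = rightMargin;
    }

    // True if v lies inside the interval, honoring open/closed margins.
    boolean contains(double v) {
        boolean leftClosed = closure == Closure.CLOSED_OPEN
                || closure == Closure.CLOSED_CLOSED;
        boolean rightClosed = closure == Closure.OPEN_CLOSED
                || closure == Closure.CLOSED_CLOSED;
        boolean aboveLeft = leftClosed ? v >= leftMargin : v > leftMargin;
        boolean belowRight = rightClosed ? v <= rightMargin : v < rightMargin;
        return aboveLeft && belowRight;
    }

    // A value is invalid if it is present but outside the interval.
    // Missing values (null) are left to the IsMissing probe (an assumption).
    boolean isInvalid(Double v) {
        return v != null && !contains(v);
    }

    // Number of invalid values in a column.
    long invalidFreq(List<Double> column) {
        return column.stream().filter(this::isInvalid).count();
    }
}
```

In the slide's example, 81 of 100 values are invalid, which leaves 19 valid ones; totalValuesSum divided by 19 gives the reported mean, so the figures are consistent with the mean being taken over valid values only.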
17
Data Preprocessing - Probes
Sampling
Row 1

Column based probes can be
used for
IsMissing
IsInvalid

Col 1
Col 2
Denoising
18
Data Preprocessing - Interface
<dppMethods name="dpp1">
  <inWebRowsetID>123</inWebRowsetID>
  <column colToApply="1">
    <methode forCause="isMissing" impl="basic.ReplaceMissingValuesByValue">
      <Param>15</Param>
    </methode>
  </column>
  <column colToApply="4">
    <methode forCause="isMissing" impl="basic.ReplaceMissingValuesByMean"/>
  </column>
  <outWebRowset name="234"/>
</dppMethods>
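
The request above applies two methods to columns flagged by the isMissing probe: one replaces missing values with a fixed value (15), the other with the column mean. The following Java sketch shows what such methods might do to a single column. The class ReplaceMissingSketch and its method names are hypothetical; the real classes behind basic.ReplaceMissingValuesByValue and basic.ReplaceMissingValuesByMean are not shown in the slides.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical counterparts of the two DPP methods named in the request.
final class ReplaceMissingSketch {

    // Replace every missing (null) entry with a fixed value.
    static List<Double> byValue(List<Double> column, double replacement) {
        List<Double> out = new ArrayList<>(column.size());
        for (Double v : column) {
            out.add(v == null ? replacement : v);
        }
        return out;
    }

    // Replace every missing entry with the mean of the present values.
    static List<Double> byMean(List<Double> column) {
        double sum = 0.0;
        int n = 0;
        for (Double v : column) {
            if (v != null) {
                sum += v;
                n++;
            }
        }
        if (n == 0) {
            // No present values, so there is no mean to substitute.
            return new ArrayList<>(column);
        }
        return byValue(column, sum / n);
    }
}
```

The mean needed by byMean is already part of the basic data statistics, so a service-side implementation could reuse that result instead of scanning the column a second time.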
19
Setup for the Performance Tests (LAN)
FTP Server (for preprocessed results)
Sun Fire V880, 8 GB RAM, 4 x UltraSPARC-III 750 MHz,
Solaris 9
100 Mbit/s full duplex
Webrowset
OGSA-DAI R5
Thin Client
MySQL, 1,000,000 rows, 11 columns, noisy/missing values
100 Mbit/s full duplex
PMML
20
Performance of the Activities
21
Where we spend the time....
22
Related Work

DataCutter: carries out a rich set of queries and
application-specific data transformations on data
streams
Chimera: automated data generation according to
some data derivation procedure

23
Issues/Future Work

expose available DPP methods
advanced DPP methods
passive mode
documentation of how a data set has been
processed
inspect the processes to be applied in advance
get data unprocessed?

24
Conclusion

row-based DPP framework for traditional DM
support missing
contributions
Data Statistics activity
basic (no additional user input required)
advanced
Data Preprocessing activity
flexible row/column probes
applicable to active/passive usage mode
implemented centralized prototype
column probes for IsMissing, IsInvalid
shows feasibility of concepts

25
References

Jim Gray, Distributed Computing Economics, TR, 2003
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2000
Dorian Pyle, Data Preparation for Data Mining, 1999
OGSA-DAI, www.ogsadai.org
Michael Beynon, Renato Ferreira, Tahsin Kurc, Alan Sussman and Joel Saltz, DataCutter: Middleware for Filtering Very Large Scientific Datasets on Archival Storage Systems, 2000
OGSA-DQP, www.ogsadai.org/dqp
Ian Foster, Jens Vöckler, Michael Wilde and Yong Zhao, Chimera: A Virtual Data System for Representing, Querying, and Automating Data Derivation, 2002