Title: MAPPING DATA IN PEER-TO-PEER SYSTEMS: SEMANTICS AND ALGORITHMIC ISSUES
1MAPPING DATA IN PEER-TO-PEER SYSTEMS: SEMANTICS AND ALGORITHMIC ISSUES
Department of Computer Science, University of Toronto
Anastasios Kementsietsidis, Marcelo Arenas, Renee J. Miller
presented by Ahmet OLGUN, Suzan BAYHAN
2OUTLINE
- 1-ABSTRACT
- 2-INTRODUCTION
- 3-MOTIVATING EXAMPLE
- 4-MAPPING TABLES
- 5-MAPPING AS CONSTRAINTS
- 6-CONSISTENCY AND INFERENCE
- 7-THE ALGORITHM
- 8-EXPERIMENTAL RESULTS
- 9-CONCLUSIONS
3ABSTRACT
- PROBLEM OF MAPPING DATA IN PEER-TO-PEER DATA SHARING SYSTEMS (PPDSS)
- MAPPING TABLES LISTING CORRESPONDING VALUES IN A PPDSS
- WHY TABLES ARE APPROPRIATE
- A LANGUAGE TO SPECIFY MAPPING TABLES UNDER DIFFERENT SEMANTICS
- COMPLEXITY OF THE PROBLEM
- AN EFFICIENT ALGORITHM FOR ITS SOLUTION
- IMPLEMENTATION WITH EXPERIMENTAL RESULTS
- HYPERION PROJECT
4INTRODUCTION
- Traditionally, data integration and exchange between heterogeneous data sources is provided mainly through the use of views, i.e., queries
- Sources share their schemas and cooperate
- BUT IN OUR WORK SUCH CLOSE COOPERATION IS
- Not desirable (PRIVACY)
- Not feasible (maybe due to resource limitations)
5SIMILARITY WITH FILE-SHARING SYSTEMS
- TO FIND DATA WHEN THERE IS NO AGREEMENT ON THE LOGICAL DESIGN OF DATA,
- FOCUS ON VALUES AND HOW THEY CORRESPOND
- IN FILE-SHARING SYSTEMS LIKE NAPSTER AND GNUTELLA, QUERYING IS DONE BY SIMPLE VALUE SEARCH OF FILE NAMES
- QUERIES ARE OF THE FORM
- RETRIEVE ALL FILES NAMED X
- EASY BECAUSE THERE IS A CONSENSUS ON NAMES
6WHAT IF NO ACCEPTED NAMING STANDARD???
- Each peer has to develop its own naming standard
- Conforming to external standards is time-consuming and expensive
- So to search data in such environments: MAPPING TABLES that store correspondences between values.
- At their simplest, mapping tables are binary tables relating identifiers from two different sources
- Mapping Tables represent EXPERT KNOWLEDGE
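A minimal sketch of a binary mapping table used for value lookup, assuming the table is a plain list of value pairs. The identifiers and the helper names `build_index` and `translate` are hypothetical and serve only as illustration.

```python
from collections import defaultdict

# Hypothetical binary mapping table: pairs (value at source A, value at source B).
MAPPING_TABLE = [
    ("a:1", "b:X"),
    ("a:1", "b:Y"),  # one value may correspond to several values
    ("a:2", "b:Z"),
]

def build_index(table):
    """Index the table by its first column for value lookup."""
    index = defaultdict(set)
    for x, y in table:
        index[x].add(y)
    return index

def translate(index, x):
    """Return the values the table associates with x (possibly none)."""
    return set(index.get(x, set()))

index = build_index(MAPPING_TABLE)
```

A query for `a:1` at source A could then be rewritten into a search for the returned values at source B.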
7MOTIVATING EXAMPLE
- DOMAIN: BIOLOGICAL DATABASES
- GENE DATABASE: GDB
- PROTEIN DATABASE: SwissProt
- GENETIC DISORDERS AND RELATED GENES DATABASE: MIM
8EXAMPLE (CONTD)
- Integration of these resources is extremely desirable for scientists to have uniform access, BUT SEEMS UNATTAINABLE due to political, financial and technical reasons.
- Among the technical reasons: heterogeneity of sources, such as formatted files, spreadsheets, relational databases
9MAIN CHARACTERISTICS AND USE OF MAPPING TABLES
- Associations within and Across Domains
- Peer Autonomy
- Semantics
- Automated discovery of mappings
10Association within and Across Domains
- A mapping table is not necessarily a function
- With mapping tables we associate seemingly unconnected databases
- Disjoint worlds can be associated since the corresponding worlds are semantically close to each other
11Peer Autonomy
- Autonomy has high importance in peer-to-peer systems.
- Mapping tables do not restrict the operation of peers in any way beyond the agreement on values expressed in the tables.
12Mapping Table 1
Figure 1
13Semantics
- Experts have varying degrees of expertise, so mapping tables should show their confidence level
- A tuple (X,Y)
- If an X value appearing in a mapping table follows open-world semantics, then it can be associated with any Y value (partial information about X)
14Closed World
- If X follows closed-world semantics, then values in the table can only be associated with the specified Y values.
- 4 alternatives
- 1-OO (No specific information, no practical interest)
- 2-OC (Partial knowledge)
- 3-CO (Partial knowledge)
- 4-CC (Complete knowledge)
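The four alternatives can be sketched as a small check. This assumes a reading in which the first letter governs X values present in the table and the second letter governs X values missing from it; that reading, and the function name `may_associate`, are assumptions rather than something the slides state.

```python
def may_associate(table, x, y, semantics):
    """Decide whether value x may be associated with value y.

    `semantics` is a two-letter code such as "CO". Assumed reading:
    first letter applies to x values present in the table, second letter
    to x values missing from it ('O' = open world, 'C' = closed world).
    """
    present_mode, missing_mode = semantics
    listed = {b for a, b in table if a == x}
    if listed:
        # x appears in the table: open allows any y, closed only the listed ones.
        return True if present_mode == "O" else y in listed
    # x does not appear in the table: open allows any y, closed allows none.
    return missing_mode == "O"
```

Under this reading, CC treats the table as complete knowledge, while CO constrains only the values the expert actually listed.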
15Open/Closed World
Table 1: Alternative open/closed world semantics
16Automated Discovery
- Given a semantics for mapping tables, to reason about them, treat mapping tables as constraints on the exchange of information.
- Simplest way to combine tables: CONJUNCTION
17Example Mapping Tables
18MAPPING TABLES
- A, B, C, D: individual attributes
- dom(A): domain of A, like integers, characters
- U, X, Y: sets of attributes
- R: a relational schema
- R[U]: attributes U of a schema R
- r: relation instance
- t: tuples
19MAPPING TABLES(contd)
- t[X]: values of tuple t in the attributes of X
- X = {A1, A2, ..., Ak}
- dom(X) = dom(A1) × dom(A2) × ... × dom(Ak)
- To represent different semantics of mapping tables, it is necessary to introduce variables
- V: a set of variables, where V ∩ dom(A) = ∅ for each attribute A
20DEFINITION 1
- Given a set of attributes U, t is a mapping over U if for each A ∈ U, t[A] is either a constant in dom(A), a variable in V, or an expression of the form v-S, where v ∈ V and S is a finite subset of dom(A)
21DEFINITION 2
- Let X and Y be nonempty disjoint sets of attributes. A mapping table m from X to Y is a finite set of mappings over X ∪ Y such that each variable appears in at most one mapping
22DEFINITION 2 (contd)
- Set of mappings: mapping table
- Table: relation containing variables
- RESTRICTION: Each variable appears in at most one mapping
- TWO DIFFERENT MAPPINGS ARE COMPLETELY INDEPENDENT
23DEFINITION 3
- A valuation ρ over a mapping table m is a function that maps each constant value in m to itself and each variable v of m to a value in the intersection of the domains of the attributes where v appears. Furthermore, if v appears in an expression of the form v-S, then ρ(v) is not an element of S.
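Definitions 1 to 3 can be sketched in code as follows. This is a minimal illustration, assuming mappings are tuples of cells and ignoring attribute domains: any value is accepted for a variable unless it lies in an excluded set S. The helpers `const`, `var`, `matches` and `in_extension` are hypothetical names.

```python
def const(c):
    """A constant cell."""
    return ("const", c)

def var(name, excluded=()):
    """A variable cell; a non-empty `excluded` models the form v-S."""
    return ("var", name, frozenset(excluded))

def matches(mapping, ground):
    """True if some valuation turns `mapping` into the ground tuple `ground`.

    Assumption: domains are not modelled, so a variable may take any
    value that is not in its excluded set.
    """
    if len(mapping) != len(ground):
        return False
    binding = {}
    for cell, value in zip(mapping, ground):
        if cell[0] == "const":
            if cell[1] != value:
                return False
        else:
            _, name, excluded = cell
            if value in excluded:
                return False
            # A variable must take the same value everywhere it appears.
            if binding.setdefault(name, value) != value:
                return False
    return True

def in_extension(table, ground):
    """True if `ground` is obtained from some mapping of the table by a valuation."""
    return any(matches(m, ground) for m in table)
```

For example, the table `[(const("a"), const("b")), (var("v", {"a"}), var("v", {"a"}))]` maps `a` to `b` and every other value to itself.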
24MAPPING AS CONSTRAINTS
- View mapping tables as constraints on the exchange of information between sources
- Given a set of mapping constraints, we are able to infer new mapping constraints and check the consistency of the constraints
26CONSISTENCY AND INFERENCE
- Infer new mapping tables
- Combine the knowledge from mapping tables available in a network of peers
- Determine consistency of mapping tables
- Automated inference and consistency checks will help a curator to see whether the semantics are valid
27Problem Definition
- Given a mapping constraint formula (MCF) F over a set of attributes U, F is consistent if there exists a nonempty relation r over U satisfying F.
- The inference problem is the problem of verifying whether a set of MCFs implies another MCF
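For intuition, consistency of a conjunction can be checked by brute force in a restricted setting. The sketch below assumes every mapping table is ground (no variables) and that a tuple satisfies a constraint when its projection onto the constraint's attributes is a row of the table; both are simplifying assumptions, and the names `join` and `is_consistent` are hypothetical. Under these assumptions the conjunction is consistent exactly when the natural join of the tables is non-empty.

```python
def join(left, right):
    """Natural join of two relations given as (attributes, set of rows)."""
    l_attrs, l_rows = left
    r_attrs, r_rows = right
    shared = [a for a in l_attrs if a in r_attrs]
    extra = [a for a in r_attrs if a not in l_attrs]
    out = set()
    for lr in l_rows:
        l = dict(zip(l_attrs, lr))
        for rr in r_rows:
            r = dict(zip(r_attrs, rr))
            # Rows combine only if they agree on all shared attributes.
            if all(l[a] == r[a] for a in shared):
                out.add(tuple(lr) + tuple(r[a] for a in extra))
    return (tuple(l_attrs) + tuple(extra), out)

def is_consistent(constraints):
    """Consistent iff at least one tuple satisfies every ground table."""
    result = constraints[0]
    for c in constraints[1:]:
        result = join(result, c)
    return len(result[1]) > 0
```

This exhaustive join grows quickly with the number of tables, which fits the hardness result stated for the general problem.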
28Theorems
- Theorem: The consistency problem for conjunctions of mapping constraints is NP-complete.
- Theorem: If the length of the paths or the number of mapping constraints is fixed, then the consistency problem for conjunctions of mapping constraints is NP-complete.
29Assumptions
- Assumptions to solve the consistency problem
- Number of mapping constraints per peer is small
- The length of paths is small
- For example, in Gnutella paths have a maximum length of 7
30THE ALGORITHM
- P1, P2, ..., Pn: a path of peers
- Ui: set of attributes at each peer Pi
- S: set of constraints over the path
- µ: X → Y, a mapping constraint
- ext(µ) = {ρ(t) | t ∈ m and ρ is a valuation over m}
31THE ALGORITHM
- Let µ' denote the cover of S
- 1- S is consistent iff there exists t ∈ ext(µ')
- 2- ∀ µ: X → Y, S ⊨ µ iff ext(µ') ⊆ ext(µ)
- For inference, check 2 to decide whether S ⊨ µ
- For consistency, check 1.
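The two checks reduce to simple set operations once extensions are available as finite sets. The sketch below assumes the cover of the constraint set (a single mapping constraint summarizing it) has already been computed and that all extensions are finite sets of ground tuples; the function names and sample data are hypothetical.

```python
def cover_is_consistent(cover_ext):
    """Check 1: consistent iff the extension of the cover is non-empty."""
    return len(cover_ext) > 0

def implies(cover_ext, mu_ext):
    """Check 2: mu is implied iff ext(cover) is a subset of ext(mu)."""
    return cover_ext <= mu_ext

# Hypothetical extensions, for illustration only.
cover_ext = {("a1", "c1")}
mu_ext = {("a1", "c1"), ("a2", "c2")}
```

With variables in the tables the extensions are generally infinite, so a real implementation would reason symbolically instead of enumerating tuples.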
32Design Decisions
P1, P2, P3, P4 path
33Algorithm for computing the cover
- P1 sends all mapping constraints to P2
- P2 uses those constraints with its own to create a cover between P1 and P3
- P2 forwards the cover to P3
- P3 does the same thing to create a cover between P1 and P4
- P3 sends the computed cover back to P1
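The forwarding scheme above can be sketched for a simplified case, assuming a single binary ground table between each pair of neighbouring peers, so that a cover is just the composition of the tables. The names `compose` and `cover_along_path` are hypothetical.

```python
def compose(m_xy, m_yz):
    """Join two binary ground tables on the middle value, then drop it."""
    return {(x, z) for (x, y) in m_xy for (y2, z) in m_yz if y == y2}

def cover_along_path(tables):
    """Fold the tables of a path P1, P2, ..., Pn from left to right.

    tables[i] is assumed to relate values of peer i+1 to values of peer i+2.
    """
    cover = tables[0]
    for nxt in tables[1:]:
        cover = compose(cover, nxt)
    return cover
```

Each loop iteration corresponds to one peer combining the cover it received with its own table before forwarding the result.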
34Problems
- Unnecessary computation
- Cover involving A6 can be done locally
- Does not work in streaming fashion
- P1 has to wait for the whole computation to finish to get the cover between itself and P4
- So ?...
35Partitions
Figure: partitions p1 to p9 at peers P1, P2, and P3
36Description of the Algorithm
- Two phases
- Information gathering
- Computation
37Information Gathering
- P1 sends to P2 the set of attributes at each partition BUT NO MAPPINGS
- P2 computes inferred partitions
- Inferred partitions are used to discover interdependencies, or lack thereof, between partitions
- Then the computation phase
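One way to picture the inferred partitions is as groups of attribute sets that overlap directly or transitively. The sketch below assumes that two partitions depend on each other exactly when they share an attribute; that criterion and the name `merge_partitions` are assumptions made for illustration, not the algorithm's actual definition.

```python
def merge_partitions(attribute_sets):
    """Merge attribute sets that overlap, directly or transitively."""
    merged = []  # invariant: the groups in `merged` are pairwise disjoint
    for attrs in attribute_sets:
        attrs = set(attrs)
        rest = []
        for group in merged:
            if group & attrs:
                attrs |= group   # absorb every group that shares an attribute
            else:
                rest.append(group)
        merged = rest + [attrs]
    return merged
```

Only attribute names are needed for this step, which matches the point that no mappings are exchanged during information gathering.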
38Inferred Partitions
Figure: inferred partitions at peers P1 and P2
39Computation Phase
- The computation starts at the penultimate peer
- Cover between P3 and P4 computed and sent to P2
- Cover between P2 and P4 computed and streamed to P1
- Cover between P1 and P4 computed
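The backward, streaming order can be sketched with generators, again assuming one binary ground table between neighbouring peers. Each peer emits cover tuples as soon as tuples arrive from the peer after it. The names `stream_compose` and `backward_cover` are hypothetical.

```python
def stream_compose(m_xy, cover_yz_stream):
    """Yield (x, z) cover tuples as tuples of the downstream cover arrive."""
    seen = set()
    for (y, z) in cover_yz_stream:
        for (x, y2) in m_xy:
            if y2 == y and (x, z) not in seen:
                seen.add((x, z))
                yield (x, z)

def backward_cover(tables):
    """Start with the last table and work back towards P1.

    tables[i] is assumed to relate values of peer i+1 to values of peer i+2.
    """
    stream = iter(tables[-1])            # cover between the last two peers
    for table in reversed(tables[:-1]):
        stream = stream_compose(table, stream)
    return stream
```

Because the result is a lazy stream, P1 can start consuming tuples of the cover between itself and the last peer before the whole computation has finished.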
40EXPERIMENTAL RESULTS
- Do our solutions provide added value for communities that already use mapping tables extensively?
- Are the characteristics of our algorithm appropriate and effective in a peer-to-peer environment?
41Implementation
- Geographically distributed machines with one peer per machine
- Each peer has 2 modules
- The first module interacts with the storage manager to retrieve mappings and perform cover computation
- The second is the peer-to-peer networking protocol
42Implementation
- Each peer decides how much cache to use
- Biology domain: 6 biological databases used
- GDB, MIM, SwissProt, Hugo, Locus, Unigene
- Table sizes range from 7000 to 28000 mappings, with an average of 13000.
- B2B domain: business-to-business setting
43Results
- Cache sizes from 64 to 128 mappings result in the best running times for those data characteristics
- B2B
- Complex semantics for tables, but new mappings are still computed efficiently
- Total execution time scales linearly with the number of computed mappings
46CONCLUSION
- Problem of managing collections of mapping tables
- Alternative semantics for tables
- A language that allows specification of mapping tables under different semantics
- Complexity of inference and consistency
- An algorithm to solve the problem
47- ANY QUESTIONS?
- THANK YOU...