Query Processing in a Mediator System for Data and Multimedia

About This Presentation

Slides: 51

Provided by: claudio61

Transcript and Presenter's Notes

1
Query Processing in a Mediator System for Data
and Multimedia

D. Beneventano (1), C. Gennaro (2), M. Mordacchini (2),
R. Carlos Nana Mbinkeu (1)
(1) DII - Università di Modena e Reggio Emilia, via
Vignolese 905, Modena, Italy
(2) ISTI CNR, via Moruzzi 1, Pisa, Italy

2
Outline

Motivation
The system and scenario overview
Querying an ontology of data and multimedia sources
mapping
Query unfolding for multimedia conditions
ranking
Conclusion and future work

3
Motivation

We proposed a method for building a populated
domain ontology representative of a set of web
data sources.
The method exploits the capabilities of a
mediator system (MOMIS) to create an integrated
view of a set of data sources,
i.e. a domain ontology schema, and a set of
annotations linking data to the integrated view.
We extend that approach with multimedia sources,
thus obtaining a methodology for building and
querying an ontology representing data and
multimedia sources.
There are several use cases where applications
interact with ontologies of data and multimedia
sources.
Multimedia and data sources are usually
represented with different models. No standard
for representing at the same time data and
multimedia sources has been adopted by large
communities.
Different languages and interfaces for querying
traditional and multimedia data sources have been
developed. The former rely on expressive languages
that allow selection clauses to be expressed; the
latter typically implement similarity search
techniques for retrieving multimedia documents
similar to the ones provided by the user.

4
Managing a Semantic Peer: MOMIS and MILOS
The peer provides unified access to different data
sources referring to the same domain by means of
a Semantic Peer Data Ontology (SPDO), i.e. a
common representation of all the data sources
belonging to the peer.
MOMIS (Mediator envirOnment for Multiple
Information Sources) is a framework to perform
information extraction and integration of
heterogeneous, structured and semistructured,
data sources

MILOS is a general purpose Multimedia Content
Management System
Manages and serves any multimedia documents
Manages any metadata of documents

NeP4B Semantic Peer
5
Data and Multimedia Sources (DMSs)

A Data and Multimedia Source (DMS) is an
object-oriented database of metadata objects
describing a collection of multimedia documents
(such as images, videos, etc.), represented with a
schema defined in ODLI3.
The DMS schema includes, in general, a set of
standard attributes declared using standard
predefined ODLI3 types, such as string, double,
integer, etc., supporting selection predicates
typical of structured and semi-structured data,
such as =, <, >, ...
In addition, the DMS schema includes a set of
special multimedia attributes, declared by means of
special predefined classes in ODLI3, which support
similarity-based searches (full-text search,
image similarity, geographical search, etc.).

6
A sample scenario
7
A sample scenario
8
A sample scenario
9
Querying DMSs

A DMS Mi can be queried using an extension of a
standard SQL-like SELECT syntax. The WHERE clause
consists of a conjunctive combination of
predicates on the single standard attributes of
Mi. The ORDER BY ... LIMIT K clauses specify, in
practice, a top-k similarity query, as in the
following:

SELECT Mi.Ak, ..., Mi.Sl
FROM Mi
WHERE Mi.Ax op1 val1 AND Mi.Ay op2 val2 ...
ORDER BY Mi.Sw(Q1), ..., Mi.Sz(Q2)
LIMIT K
10
Querying DMSs

interface city()
// standard attributes
attribute string Name
attribute string Zip
attribute string Country
attribute integer Surface
attribute integer Population
// similarity attributes
attribute Image Photo
attribute Text Description
attribute GeoCoord GeoPosition,
// query example
SELECT Name
FROM city
WHERE Country = "Italy"
ORDER BY Photo("http://www.flickr.com/32e324e.jpg"),

This query tries to find, among all Italian cities,
the ones that best match the image given as an
example and the textual description, and that are
as near as possible to the geographical point
40.25N, 14.32E.
11
DMS Assumptions

Since we would like to build a general-purpose
framework, we make the following assumptions:
The way in which the returned objects are ordered
is not known (black box).
The DMS does not return scores associated with
the objects indicating their relevance with
respect to the query.
If no ORDER BY clause is specified, the DMS will
return the records in random order.

12
Representing the SPDO

We build a conceptualization of a set of DMSs,
composed of global classes and global attributes,
and mappings between the SPDO and the DMS
schemata.

13
Mapping

The mapping query is defined in a semi-automatic
way as follows:
A Mapping Table (MT) is specified for each global
class G, whose columns represent the n local
classes M1, ..., Mn belonging to G and whose rows
represent the h global attributes of G.
Multimedia attributes can be mapped only onto
global multimedia attributes of the same type.
Join Conditions are defined between pairs of
local classes belonging to G and allow the system
to identify instances of the same real-world
object in different sources.

14
Example of mapping
15
Mapping

Resolution Functions are introduced to solve data
conflicts among local attribute values associated
with the same real-world object. In our framework
we consider and implement some of these resolution
functions, in particular the PREFERRED function,
which takes the value of a preferred source, and
the RANDOM function, which takes a random value.
As far as multimedia attributes are concerned, we
introduce a new resolution function, called
MOST_SIMILAR, which returns the multimedia
objects most similar to the one expressed in the
query (if any).
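
A minimal sketch of the PREFERRED and RANDOM resolution functions follows. It is an illustration, not the MOMIS implementation: the function names, the dictionary layout (source name to local value) and the fallback to another non-null value when the preferred source has none are all assumptions.

```python
import random

def preferred(values, preferred_source):
    """PREFERRED: take the value coming from the preferred source.

    `values` maps a source name to the local value for one real-world object.
    Assumption: if the preferred source has no value, fall back to the first
    non-null value found in the other sources.
    """
    if values.get(preferred_source) is not None:
        return values[preferred_source]
    for value in values.values():
        if value is not None:
            return value
    return None

def random_value(values, rng=None):
    """RANDOM: take one of the non-null local values at random."""
    if rng is None:
        rng = random.Random()
    candidates = [v for v in values.values() if v is not None]
    return rng.choice(candidates) if candidates else None

# Hypothetical conflicting values for the same hotel in two sources.
stars = {"L1": 4, "L2": 3}
print(preferred(stars, "L2"))
print(random_value(stars))
```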

16
Query the SPDO

Given a global class G with m attributes, of which
k are multimedia attributes, denoted by G.S1, ..., G.Sk
(such as photo and description in the class Hotel), and
h are standard attributes, denoted by G.A1, ..., G.Ah, a
query on G (global query) is a conjunctive query,
expressed in a simple abstract SQL-like syntax
as
SELECT G.Al, ..., G.Sj
FROM G
WHERE G.Ax op1 val1
AND G.Ay op2 val2
...
ORDER BY G.Sw(Q1), ..., G.Sz(Q2)
LIMIT K

17
Query unfolding

To answer a global query on G, the query must be
rewritten as an equivalent set of queries (local
queries) expressed on the local classes L(G)
belonging to G.
The query rewriting is performed by means of
query unfolding, which consists of the following
four steps:
Computation of Local Query conditions
Computation of Residual Conditions
Fusion of local answers
Application of the Residual Condition

18
Query Fusion Ranking

Why?
Modern multimedia content managers typically
return multimedia objects (i.e., objects which
support similarity search) in decreasing order of
relevance, that is, so that the best answers are
at the top.
We want to preserve this knowledge at the global
level.
However, since we cannot exploit scores, we use
the rank as an indicator of the relevance of the
record returned.

19
Ranking the results

Our problem falls into the category of partial
rank aggregation problems, in which we merge
top-k lists rather than fully ranked lists.
We use a simple yet effective aggregation
function for ordinal ranks: the median function.
The score of an object is its median position in
all the returned lists.
The median function has been shown by Fagin et
al. to be near-optimal, even for top-k or
partial lists.
The MEDRANK algorithm is based on median rank
aggregation.

20
The MedRank algorithm

Access the rankings sequentially
when an element has appeared in more than half of
the rankings, output it in the aggregated ranking
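
The two steps above can be sketched in a few lines of Python. This is a simplified reading of the slide, not the authors' code: the function name `medrank`, the round-robin access order (one position of every list per round) and the use of `None` as padding for shorter lists are assumptions.

```python
from collections import defaultdict
from itertools import zip_longest

def medrank(rankings, k):
    """Merge ranked lists by outputting an element as soon as it has
    appeared in more than half of the rankings.

    `rankings` is a list of lists, best element first. Elements must not be
    None, because None is used as padding for shorter lists.
    """
    m = len(rankings)
    seen = defaultdict(int)   # element -> number of rankings it appeared in
    output = []
    # Sequential access: read position 1 of every list, then position 2, ...
    for same_depth in zip_longest(*rankings):
        for element in same_depth:
            if element is None:
                continue
            seen[element] += 1
            if 2 * seen[element] > m and element not in output:
                output.append(element)
                if len(output) == k:
                    return output
    return output

# Hypothetical local answers from three sources.
lists = [["A", "B", "C", "D"],
         ["B", "A", "C", "D"],
         ["B", "C", "A", "D"]]
print(medrank(lists, 3))  # ['B', 'A', 'C']
```

The algorithm can stop before reading the lists to the end, which is what makes it suitable for top-k answers.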

25
Example

We would like to find images of the Arch of
Triumph of Rome by night.
We assume we have two DMSs containing images
of monuments in the world: the first, DMS1, with
geographical coordinate search capabilities, and
the second, DMS2, with image similarity search
capabilities.

26
Example
27
SELECT * FROM DMS1 WHERE subject = Monument
ORDER BY GeoCoord(41°53'43.68"N, 12°28'56.34"E)
STOP AFTER 5
Unfortunately, if I search only by geo coordinates,
giving the coordinates of Rome as input, I find a
lot of images of the Colosseum.
[Figure: the five results returned by DMS1, at distances of 1 km, 1 km, 1 km, 1 km and 2 km. Caption: "Roma. Palazzo della Civiltà del Lavoro. EUR"]
28
SELECT * FROM DMS2 WHERE type = Monument
ORDER BY Img(URL)
STOP AFTER 5
And if I search only by similarity to an image of
the Arch of Triumph of Rome by night, I find a
lot of images of the Arch of Triumph of Paris,
which is very similar but more famous.
[Figure: the results returned by DMS2. Caption: "Roma. Palazzo della Civiltà del Lavoro. EUR"]
29
SELECT * FROM WorldMonuments WHERE Subject = Monument
ORDER BY Img(URL), GeoCoord(41°53'43.68"N, 12°28'56.34"E)
STOP AFTER 5
[Figure: result lists of MS1 and MS2 side by side, labelled "First element retrieved"; distances shown: 1 km, 1 km, 1 km, 1 km, 2 km. Caption: "Roma. Palazzo della Civiltà del Lavoro. EUR"]
30
Conclusion and future work

We presented a methodology implemented in a tool
that allows a user to create and query an
integrated view of data and multimedia sources.
Future work will be devoted to experimenting with
the tool in real scenarios. In particular, our tool
will be used for integrating business catalogs
related to the tile sector.
We think that such data may provide useful test
cases because of the need to connect data
about the features of the tiles with their
images.

31
The end
32
Building the Data Ontology: MOMIS

MOMIS (Mediator envirOnment for Multiple
Information Sources) is a framework to perform
information extraction and integration of
heterogeneous, structured and semistructured,
data sources.
Semantic Integration of Information
A common data model, ODLI3 (derived from ODL-ODMG
and I3), mapped into OLCD description logics
Tool-supported techniques to construct the Global
Virtual View (GVV):
Local sources wrapping
Local Schema Annotation w.r.t. a common lexical
ontology (WordNet)
Semi-automatic discovery of relationships between
local schemata
Clustering techniques to build the GVV
Mappings between the GVV and local schemata
(Mapping Table)
Automatic GVV Annotation w.r.t. a common lexical
ontology
OWL exportation
Global Query Management
Including services and multimedia data sources

D. Beneventano, S. Bergamaschi, F. Guerra, M. Vincini, "Synthesizing an Integrated Ontology", IEEE Internet Computing Magazine, September-October 2003, pp. 42-51.
S. Bergamaschi, S. Castano, M. Vincini, "Semantic Integration of Semistructured and Structured Data Sources", SIGMOD Record, Special Issue on Semantic Interoperability in Global Information, Vol. 28, No. 1, March 1999.
33
MOMIS architecture
[Figure: MOMIS architecture diagram; visible labels: MANUAL ANNOTATION, SEMI-AUTOMATIC ANNOTATION]
34
Mapping definition in MOMIS

Mappings among a Global Class G of the GVV and
its local classes are represented by a Mapping
Table.
Global-as-View (GAV) mappings: for each global
class G, a view VG over the local classes of G is
defined by a Full-Join Merge Operator:
Outer Join, to include in the result all
tuples of all local sources
Merge, to perform data reconciliation
(Resolution functions)

35
Building the Mappings: an example
Mapping Table of the global class Hotel
Local classes: L1.resort, L2.hotel
Data Conversion Functions: DollarEuro(mean_price)
Full Join:
Select name, avg(T_L1.price_avg,
T_L2.mean_price) as price, T_L1.Stars,
from T_L1 outer join T_L2
on (T_L1.Name = T_L2.denomination)
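
To make the Full-Join Merge concrete, here is a small in-memory sketch of the Hotel example. It assumes that names are unique within each source, that `mean_price` is in dollars and `price_avg` in euros, and that the conversion rate passed in is purely illustrative; the function name `full_join_merge` and the sample records are hypothetical.

```python
def full_join_merge(resorts, hotels, dollar_to_euro):
    """Full outer join of L1.resort and L2.hotel on Name = denomination,
    followed by a merge step that averages the available prices."""
    by_name = {}
    for row in resorts:
        by_name.setdefault(row["Name"], {})["L1"] = row
    for row in hotels:
        by_name.setdefault(row["denomination"], {})["L2"] = row

    merged = []
    for name, sides in by_name.items():
        l1, l2 = sides.get("L1"), sides.get("L2")
        prices = []
        if l1 is not None:
            prices.append(l1["price_avg"])
        if l2 is not None:
            # Data conversion function applied before the merge.
            prices.append(dollar_to_euro(l2["mean_price"]))
        merged.append({
            "name": name,
            "price": sum(prices) / len(prices),            # avg resolution
            "stars": l1["Stars"] if l1 is not None else None,
        })
    return merged

# Hypothetical sample data.
resorts = [{"Name": "Aurora", "price_avg": 100.0, "Stars": 4}]
hotels = [{"denomination": "Aurora", "mean_price": 120.0},
          {"denomination": "Bella", "mean_price": 80.0}]
for row in full_join_merge(resorts, hotels, lambda usd: usd * 0.5):
    print(row)
```

The outer join keeps "Bella" even though it appears in one source only, which is the point of using a full rather than an inner join.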
36
Global Query Management

The querying problem: how to answer queries
expressed on the GS (global queries)?
In a Virtual Data Integration system, data reside
at the data sources; query processing is therefore
based on query rewriting, i.e. rewriting a global
query as an equivalent set of queries expressed
on the local schemata of the data sources (local
queries).
GAV approach: query rewriting is performed by
unfolding, i.e. by expanding a global query on G
according to the view associated to G.

Query Optimization Techniques for the Full-Join
Merge Operator
Motivation:
full outer join queries are very expensive,
especially in a distributed environment
only limited optimization is performed on full
outer join

37
An example of Full-Join Merge Optimization
SELECT * FROM G WHERE city LIKE "Modena" AND
price < 200
AND stars = 4
AND free_wifi = true
[Figure labels: free_wifi = true; stars = 4]
38
MILOS
XML Search Engine (SOAP Web Service): structure search, fielded search, full text search, multimedia search; schema independent; XQuery support
Metadata Editor: Visual Basic (SOAP Comm.)
MultiMedia doc. serv. (SOAP Web Service): allows homogeneous access to heterogeneous media
Retrieval Interface: JSP (SOAP Comm.)
Metadata independence: the schema seen in the interface logic can be different from the one(s) used in the repository
Repository Metadata Integrator (SOAP Web Service): access to documents, access to metadata, metadata independence
39
MILOS (2)

The MILOS system is based on a three-tier
distributed architecture:
Client tier: This is the topmost level of the
system. It contains the client applications that
interact with MILOS and display results to
user applications.
Business logic: It manages query processing by
integrating and aligning information stored in
the databases. It performs reconciliation of
retrieved data by managing ranking.
Data tier: It is composed of the Large Object
Database, which physically stores the multimedia
documents managed by the system, and the metadata
database, where all metadata associated with the
multimedia items are stored.
Multimedia metadata are represented in the data
tier in XML formats. MILOS adopts a native XML
database, which supports XML query language
standards and offers advanced search and indexing
functionality on arbitrary XML documents.
The MILOS XML database provides full-text search,
automatic classification, and feature similarity
search functionalities.
The Large Object Database permits clients of
MILOS to deal with multimedia in a uniform way.

40
The MedRank algorithm

Whenever there are multiple multimedia attributes,
strange side effects can affect the precision of
the answer.
Example
Suppose we have two image databases consisting of
monument images.
MS1 provides image similarity and geographic
coordinates
MS2 provides only image similarity
The query consists of a sample image and the
coordinates of a point

41
SELECT * FROM WorldMonuments
ORDER BY Image(URL), GeoCoord(41°53'43.68"N, 12°28'56.34"E)
STOP AFTER 5
[Figure: result lists of MS1 and MS2 side by side, labelled "First element retrieved"; distances shown: 1 km, 1 km, 1 km, 1 km, 2 km. Caption: "Roma. Palazzo della Civiltà del Lavoro. EUR"]
42
DMS Assumptions

The rationale of the above assumptions is that
our aim is to work in a general environment with
heterogeneous DMSs for which we do not have any
knowledge of their scoring functions.
The motivation is that the final scores
themselves are often the result of the
contributions of the scores of each attribute. A
scoring function is therefore usually defined as
an aggregation over partial heterogeneous scores
(e.g., the relevance for text-based IR with
keyword queries, or similarity degrees for color
and texture of images in a multimedia database).
Even in the simpler case of a single multimedia
attribute, knowledge of the scores becomes
meaningless outside the context in which they are
evaluated. As an example, consider the TF-IDF
scoring function used by ordinary text search
engines. The score of a document depends on the
collection statistics, and search engines may
use different scoring algorithms.
However, the assumption that a local DMS is a
black box that does not return any score
associated with result elements does not imply
that local DMSs do not internally use scoring
functions for combining different multimedia
attributes.
Typically, modern multimedia systems use fuzzy
logic to aggregate the scores of different
multimedia attributes, which are graded in the
interval [0,1]. Classical examples of these
functions are the min and mean functions.
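
A short illustration of why scores from different systems are hard to compare: the same per-attribute scores produce different orderings under the min and mean aggregations. The object names and score values are invented for the example, and `rank_by` is a hypothetical helper.

```python
def fuzzy_min(scores):
    """Fuzzy conjunction: the aggregate is the weakest attribute score."""
    return min(scores)

def fuzzy_mean(scores):
    """Arithmetic mean of the attribute scores."""
    return sum(scores) / len(scores)

def rank_by(aggregate, scored_objects):
    """Order object ids by decreasing aggregated score."""
    return sorted(scored_objects,
                  key=lambda obj: aggregate(scored_objects[obj]),
                  reverse=True)

# Hypothetical (image similarity, text relevance) scores in [0, 1].
scores = {"X": (0.9, 0.4), "Y": (0.6, 0.6)}
print(rank_by(fuzzy_min, scores))   # ['Y', 'X']
print(rank_by(fuzzy_mean, scores))  # ['X', 'Y']
```

Two DMSs holding identical data could therefore return different rankings, which is one reason the framework relies on ranks only.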

43
Computation of Local Query conditions

Each atomic predicate Pi and similarity predicate
in the global query is rewritten into
corresponding constraints supported by the local
classes.
For example, the constraint stars = 3 is
translated into the constraint Stars = 3 for
the local class resort, and is not translated into
any constraint for the local class hotel.
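
The translation step can be sketched as a lookup in the Mapping Table. The table below is a hypothetical fragment for the global class Hotel, the `=` operator in the example is an assumed reading of the slide, and conditions on attributes that need a resolution function are deliberately left out because they are handled as residual conditions.

```python
# Hypothetical Mapping Table fragment: global attribute -> local attribute
# per local class (None means the attribute is not mapped in that class).
MAPPING_TABLE = {
    "name": {"resort": "Name", "hotel": "denomination"},
    "stars": {"resort": "Stars", "hotel": None},
}

def local_conditions(global_conditions, local_class, mapping_table=MAPPING_TABLE):
    """Rewrite (attribute, operator, value) predicates of the global query
    into predicates on one local class, dropping the ones it cannot express."""
    translated = []
    for attribute, operator, value in global_conditions:
        local_attribute = mapping_table[attribute].get(local_class)
        if local_attribute is not None:
            translated.append((local_attribute, operator, value))
    return translated

query = [("stars", "=", 3)]
print(local_conditions(query, "resort"))  # [('Stars', '=', 3)]
print(local_conditions(query, "hotel"))   # []
```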

44
Computation of Residual Conditions

Conditions on non-homogeneous standard attributes
cannot be translated into local conditions: they
are considered residual and have to be solved
at the global level.

45
Computation of Residual Conditions

For multimedia attributes we use the MOST_SIMILAR
function. For example, suppose we are searching
for images similar to one specified in the query
by means of the ORDER BY clause. If we retrieve
two or more multimedia objects with one or more
corresponding images, the MOST_SIMILAR function
will simply select the image that is most similar
to the query image.
However, since we do not know the scores, how do
we evaluate similarity?

46
Computation of Residual Conditions

Rank Based Similarity
We simply exploit the rank of the objects in the
returned list as an indicator of similarity between
the attribute values belonging to the objects.
This aspect is related to the problem of
fusion.
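
A rank-based MOST_SIMILAR can be sketched as follows. This is one possible reading of the slide, under the assumption that each candidate value carries the 1-based position its object had in the local answer and that the smallest position wins; the function name, the pair layout and the file names are hypothetical.

```python
def most_similar(candidates):
    """Pick the attribute value whose object was ranked best locally.

    `candidates` is a list of (rank, value) pairs, where rank is the 1-based
    position of the object in the list returned by its DMS. Since no scores
    are available, the rank is the only indicator of similarity used here.
    """
    ranked = [(rank, value) for rank, value in candidates if value is not None]
    if not ranked:
        return None
    return min(ranked, key=lambda pair: pair[0])[1]

# The same monument found at position 3 in one DMS and position 1 in another.
print(most_similar([(3, "arch_day.jpg"), (1, "arch_night.jpg")]))  # arch_night.jpg
```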

47
Fusion of local answers

For each local source involved in the global
query, a local query is generated and executed on
that source. The local answers are fused
into the global answer on the basis of the
mapping query qG defined for G, i.e. by using the
Full Outerjoin-merge operation:
Computation of the full outer join of the local
answers (FOJ). The result of this operation is
ordered on the basis of the multimedia attributes
specified in the query; this aspect is examined
in detail in the next slide.
Application of the Resolution Functions: for
each attribute GA of the global query, the related
Resolution Function is applied to FOJ.

48
Ranking the results

In principle, if we had ALL the (fused) records
of the result set, we could exploit an optimal rank
aggregation method based on a distance measure that
quantifies the disagreements among different
rankings.
In this respect, the overall ranking is the one
that has minimum distance to the different
rankings obtained from the different sources.
Several different distance measures are available
in the literature. However, the difficulty of solving
the distance-based rank aggregation problem is
related to the choice of the distance measure and
its corresponding complexity, which can even be
NP-hard in some cases (see the Kendall distance).
Fortunately, our case falls into the
category of partial rank aggregation
problems, in which we measure the distance
between only the top-k lists rather than fully
ranked lists.
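
As an illustration of a distance between rankings, the Kendall distance counts the pairs of elements that two rankings order differently. The sketch below assumes both rankings are full orderings of the same elements; computing the distance itself is cheap, and the hardness mentioned above concerns finding the aggregate ranking that minimizes the total distance to all inputs.

```python
from itertools import combinations

def kendall_distance(ranking_a, ranking_b):
    """Number of element pairs ordered differently by the two rankings."""
    pos_a = {element: i for i, element in enumerate(ranking_a)}
    pos_b = {element: i for i, element in enumerate(ranking_b)}
    if pos_a.keys() != pos_b.keys():
        raise ValueError("rankings must contain the same elements")
    return sum(
        1
        for x, y in combinations(ranking_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )

print(kendall_distance(["A", "B", "C"], ["B", "A", "C"]))  # 1
print(kendall_distance(["A", "B", "C"], ["C", "B", "A"]))  # 3
```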

49
Example [1]
A (1, 2, 3)   B (1, 1, 2)   C (3, 3, 4)   D (3, 4, 4)
[1] http://www.cs.helsinki.fi/u/tsaparas/InformationNetworks/lectures/lecture10.ppt
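
Reading each triple as the positions one object obtained in three rankings (an assumption, since the slide does not label the numbers), the median rank of every object and the resulting order can be computed as follows.

```python
from statistics import median

# Positions of each object in three rankings, as listed in the example.
positions = {
    "A": (1, 2, 3),
    "B": (1, 1, 2),
    "C": (3, 3, 4),
    "D": (3, 4, 4),
}

# Score of an object = its median position across the lists.
median_rank = {obj: median(ranks) for obj, ranks in positions.items()}

# Aggregated ranking: smallest median position first.
aggregated = sorted(positions, key=lambda obj: median_rank[obj])

print(median_rank)  # {'A': 2, 'B': 1, 'C': 3, 'D': 4}
print(aggregated)   # ['B', 'A', 'C', 'D']
```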
50
Combining rankings