Title: Query Processing in a Mediator System for Data and Multimedia
1Query Processing in a Mediator System for Data
and Multimedia
- D. Beneventano (1), C. Gennaro (2), M. Mordacchini (2), R. Carlos Nana Mbinkeu (1)
- (1) DII - Università di Modena e Reggio Emilia, via Vignolese 905, Modena, Italy
- (2) ISTI CNR, via Moruzzi 1, Pisa, Italy
2Outline
- Motivation
- The system and scenario overview
- Querying an ontology of data and multimedia
sources - mapping
- Query unfolding for multimedia conditions
- ranking
- Conclusion and future work
3Motivation
- We proposed a method for building a populated domain ontology representative of a set of web data sources.
- The method exploits the capabilities of a mediator system (MOMIS) to create an integrated view of a set of data sources, i.e. a domain ontology schema and a set of annotations linking data to the integrated view.
- We extend that approach to multimedia sources, thus obtaining a methodology for building and querying an ontology representing data and multimedia sources.
- There are several use cases where applications interact with ontologies of data and multimedia sources.
- Multimedia and data sources are usually represented with different models. No standard for representing data and multimedia sources at the same time has been adopted by large communities.
- Different languages and interfaces have been developed for querying traditional and multimedia data sources. The former rely on expressive languages that allow selection clauses to be expressed; the latter typically implement similarity search techniques for retrieving multimedia documents similar to the ones provided by the user.
4Managing a Semantic Peer: MOMIS and MILOS
- A Semantic Peer provides unified access to different data sources referring to the same domain by means of a Semantic Peer Data Ontology (SPDO), i.e. a common representation of all the data sources belonging to the peer.
- MOMIS (Mediator envirOnment for Multiple Information Sources) is a framework for information extraction and integration of heterogeneous, structured and semi-structured data sources.
- MILOS is a general-purpose Multimedia Content Management System
- Manages and serves any multimedia documents
- Manages any metadata of documents
NeP4B Semantic Peer
5Data and Multimedia Sources (DMSs)
- A Data and Multimedia Source (DMS) is an object-oriented database of metadata objects describing a collection of multimedia documents (such as images, videos, etc.), represented with a schema defined in ODLI3.
- The DMS schema includes, in general, a set of standard attributes declared using standard predefined ODLI3 types (string, double, integer, etc.), supporting the selection predicates typical of structured and semi-structured data, such as =, <, >, ...
- The DMS also includes a set of multimedia attributes, declared by means of special predefined ODLI3 classes, which support similarity-based searches (full-text search, image similarity, geographical search, etc.).
6A sample scenario
9Querying DMSs
- A DMS Mi can be queried using an extension of a standard SQL-like SELECT syntax. The WHERE clause consists of a conjunctive combination of predicates on the single standard attributes of Mi, as in the following:
    SELECT Mi.Ak, ..., Mi.Sl, ...
    FROM Mi
    WHERE Mi.Ax op1 val1 AND Mi.Ay op2 val2 ...
    ORDER BY Mi.Sw(Q1), Mi.Sz(Q2), ...
    LIMIT K
- The ORDER BY and LIMIT K clauses together specify, in practice, a top-k similarity query.
10Querying DMSs
    interface city()
      // standard attributes
      attribute string Name
      attribute string Zip
      attribute string Country
      attribute integer Surface
      attribute integer Population

      // similarity attributes
      attribute Image Photo
      attribute Text Description
      attribute GeoCoord GeoPosition

    // query example
    SELECT Name
    FROM city
    WHERE Country = "Italy"
    ORDER BY Photo("http://www.flickr.com/32e324e.jpg"), ...
This query tries to find, among all Italian cities, the ones that best match the image given as an example and the textual description, and that are as near as possible to the geographical point 40.25N, 14.32E.
11DMS Assumptions
- Since we would like to build a general-purpose framework, we make the following assumptions:
- The way in which the returned objects are ordered is not known (black box).
- The DMS does not return scores indicating the relevance of the objects with respect to the query.
- If no ORDER BY clause is specified, the DMS returns the records in random order.
12Representing the SPDO
- We build a conceptualization of a set of DMSs, composed of global classes and global attributes, and mappings between the SPDO and the DMS schemata.
13Mapping
- The mapping query is defined in a semi-automatic way as follows:
- A Mapping Table (MT) is specified for each global class G; its columns represent the n local classes M1, ..., Mn belonging to G and its rows represent the h global attributes of G. Multimedia attributes can be mapped only onto global multimedia attributes of the same type.
- Join Conditions are defined between pairs of local classes belonging to G and allow the system to identify instances of the same real-world object in different sources.
14Example of mapping
15Mapping
- Resolution Functions are introduced to resolve conflicts among local attribute values associated with the same real-world object. In our framework we consider and implement some such resolution functions, in particular the PREFERRED function, which takes the value of a preferred source, and the RANDOM function, which takes a random value.
- As for multimedia attributes, we introduce a new resolution function, called MOST_SIMILAR, which returns the multimedia objects most similar to the one expressed in the query (if any).
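The three resolution functions named on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MOMIS implementation: the function names, the dictionary layout (source name mapped to value), the fallback used by `preferred` when the preferred source has no value, and the rank-based reading of MOST_SIMILAR (lowest rank in the local result list wins, since scores are not available) are all choices made for this sketch.

```python
import random


def preferred(values, preferred_source):
    """PREFERRED: take the value coming from the preferred source.

    Falling back to any other non-null value when the preferred source
    has none is an assumption of this sketch.
    """
    if values.get(preferred_source) is not None:
        return values[preferred_source]
    return next((v for v in values.values() if v is not None), None)


def random_value(values, rng=random):
    """RANDOM: take the value of a randomly chosen source."""
    candidates = [v for v in values.values() if v is not None]
    return rng.choice(candidates) if candidates else None


def most_similar(ranked_values):
    """MOST_SIMILAR, rank-based: ranked_values maps source -> (rank, value),
    where rank is the position in that source's result list (1 = best)."""
    if not ranked_values:
        return None
    _, value = min(ranked_values.values(), key=lambda rv: rv[0])
    return value


# Illustrative (made-up) local values for one real-world object.
names = {"L1": "Hotel Roma", "L2": "Roma Hotel"}
photos = {"L1": (3, "a.jpg"), "L2": (1, "b.jpg")}

chosen_name = preferred(names, "L2")
chosen_photo = most_similar(photos)
```

Here `chosen_photo` comes from L2 because its image appeared higher in L2's result list than the L1 image did in L1's list.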
16Query the SPDO
- Given a global class G with m attributes, of which k are multimedia attributes, denoted by G.S1, ..., G.Sk (such as photo and description in the class Hotel), and h are standard attributes, denoted by G.A1, ..., G.Ah, a query on G (global query) is a conjunctive query, expressed in a simple abstract SQL-like syntax as:
    SELECT G.Al, ..., G.Sj, ...
    FROM G
    WHERE G.Ax op1 val1
      AND G.Ay op2 val2
      ...
    ORDER BY G.Sw(Q1), ..., G.Sz(Q2)
    LIMIT K
17Query unfolding
- To answer a global query on G, the query must be rewritten as an equivalent set of queries (local queries) expressed on the local classes L(G) belonging to G.
- The query rewriting is performed by means of query unfolding, which consists of the following four steps:
- Computation of Local Query conditions
- Computation of Residual Conditions
- Fusion of local answers
- Application of the Residual Condition
18Query Fusion Ranking
- Why?
- Modern multimedia content managers typically return multimedia objects (i.e., objects which support similarity search) in decreasing order of relevance, that is, with the best answers at the top.
- We want to preserve this knowledge at the global level.
- However, since we cannot exploit scores, we use the rank as an indicator of the relevance of the returned record.
19Ranking the results
- Our problem falls into the category of partial rank aggregation problems, in which we merge top-k lists rather than fully ranked lists.
- A simple yet effective aggregation function for ordinal ranks is the median function.
- The score of an object is its median position in all the returned lists.
- The median function has been shown by Fagin et al. to be near-optimal, even for top-k or partial lists.
- The MEDRANK algorithm is based on median rank aggregation.
20The MedRank algorithm
- Access the rankings sequentially
- when an element has appeared in more than half of
the rankings, output it in the aggregated ranking
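The two steps above can be sketched as follows. This is a minimal illustration of the idea, assuming each ranking is a Python list of distinct object identifiers, that the lists are read one position at a time in round-robin order, and that ties within the same depth are broken by list order; the function name and signature are chosen for this sketch.

```python
from collections import defaultdict


def medrank(rankings, k):
    """Return up to k items, emitted as soon as each one has been seen
    in more than half of the rankings (sequential, round-robin access)."""
    m = len(rankings)
    seen = defaultdict(int)   # item -> number of rankings it has appeared in
    emitted = set()
    output = []
    max_depth = max((len(r) for r in rankings), default=0)
    for depth in range(max_depth):
        for ranking in rankings:
            if depth >= len(ranking):
                continue  # this top-k list is exhausted
            item = ranking[depth]
            seen[item] += 1
            if seen[item] > m / 2 and item not in emitted:
                emitted.add(item)
                output.append(item)
                if len(output) == k:
                    return output
    return output


# Three illustrative rankings of the same three objects.
top2 = medrank([["a", "b", "c"], ["b", "a", "c"], ["a", "c", "b"]], k=2)

# With only two sources, an item must appear in both lists to be emitted.
fused = medrank([["x", "y", "z"], ["y", "z", "x"]], k=3)
```

Note that no scores are needed: only the order in which each source returns its objects is used, which matches the black-box assumption made for the DMSs.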
25Example
- We would like to find images of the Arch of Triumph of Rome by night.
- We assume we have two DMSs containing images of monuments around the world: the first, DMS1, with geographical coordinate search capabilities, and the second, DMS2, with image similarity search capabilities.
27SELECT * FROM DMS1 WHERE subject = Monument ORDER BY GeoCoord(41°53'43.68"N, 12°28'56.34"E) STOP AFTER 5
Unfortunately, if I search only by geographical coordinates, giving the coordinates of Rome as input, I find a lot of images of the Colosseum.
[Figure: top-5 results from DMS1, at distances of 1 km, 1 km, 1 km, 1 km and 2 km. Caption: "Roma. Palazzo della Civiltà del Lavoro. EUR"]
28SELECT * FROM DMS2 WHERE type = Monument ORDER BY Img(URL) STOP AFTER 5
And if I search only by similarity to an image of the Arch of Triumph of Rome by night, I find a lot of images of the Arch of Triumph of Paris, which is very similar but more famous.
[Figure: results from DMS2. Caption: "Roma. Palazzo della Civiltà del Lavoro. EUR"]
29SELECT * FROM WorldMonuments WHERE Subject = Monument ORDER BY Img(URL), GeoCoord(41°53'43.68"N, 12°28'56.34"E) STOP AFTER 5
[Figure: ranked lists from MS1 and MS2, labeled "First element retrieved"; distances shown are 1 km, 1 km, 1 km, 1 km and 2 km. Caption: "Roma. Palazzo della Civiltà del Lavoro. EUR"]
30Conclusion and future work
- We presented a methodology, implemented in a tool, that allows a user to create and query an integrated view of data and multimedia sources.
- Future work will be devoted to experimenting with the tool in real scenarios. In particular, our tool will be used to integrate business catalogs in the tile sector.
- We think that such data may provide useful test cases because of the need to connect data about the features of the tiles with their images.
31The end
32Building the Data Ontology: MOMIS
- MOMIS (Mediator envirOnment for Multiple Information Sources) is a framework for information extraction and integration of heterogeneous, structured and semi-structured data sources
- Semantic Integration of Information
- A common data model, ODLI3 (derived from ODL-ODMG and I3), mapped into OLCD description logics
- Tool-supported techniques to construct the Global Virtual View (GVV)
- Local sources wrapping
- Local Schema Annotation w.r.t. a common lexical ontology (WordNet)
- Semi-automatic discovery of relationships between local schemata
- Clustering techniques to build the GVV
- Mappings between the GVV and local schemata (Mapping Table)
- Automatic GVV Annotation w.r.t. a common lexical ontology
- OWL exportation
- Global Query Management
- Including services and multimedia data sources
D. Beneventano, S. Bergamaschi, F. Guerra, M. Vincini, "Synthesizing an Integrated Ontology", IEEE Internet Computing Magazine, September-October 2003, 42-51.
S. Bergamaschi, S. Castano, M. Vincini, "Semantic Integration of Semistructured and Structured Data Sources", SIGMOD Record Special Issue on Semantic Interoperability in Global Information, Vol. 28, No. 1, March 1999.
33MOMIS architecture
[Figure: MOMIS architecture, with stages labeled "Manual annotation" and "Semi-automatic annotation"]
34Mapping definition in MOMIS
- Mappings between a Global Class G of the GVV and its local classes are represented by a Mapping Table.
- Global-as-View (GAV) mappings: for each global class G, a view VG over the local classes of G is defined by a Full-Join Merge Operator:
- Outer Join, to include in the result all tuples of all local sources
- Merge, to perform data reconciliation (Resolution Functions)
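A toy version of the Full-Join Merge Operator may help make the two parts concrete. The sketch below is not the MOMIS operator itself: it assumes that local records have already been renamed to global attribute names, that a single global attribute serves as the join key, and that each resolution function receives a dictionary of source name to non-null value. All names and the sample data are hypothetical.

```python
def first_value(values):
    """Default resolution: keep the value of the first source that has one."""
    return next(iter(values.values()))


def full_join_merge(local_answers, join_key, resolvers):
    """local_answers: source name -> list of records (dicts).

    Outer-join part: every record of every source reaches the result.
    Merge part: records sharing the join key are fused attribute by
    attribute with the configured resolution functions.
    """
    attributes = sorted({a for recs in local_answers.values()
                         for rec in recs for a in rec})
    groups = {}  # join key value -> {source: record}
    for source, records in local_answers.items():
        for record in records:
            groups.setdefault(record[join_key], {})[source] = record

    result = []
    for by_source in groups.values():
        merged = {}
        for attribute in attributes:
            values = {s: r[attribute] for s, r in by_source.items()
                      if r.get(attribute) is not None}
            resolve = resolvers.get(attribute, first_value)
            merged[attribute] = resolve(values) if values else None
        result.append(merged)
    return result


def average(values):
    return sum(values.values()) / len(values)


# Made-up local answers; price is reconciled with an average.
answers = {
    "L1": [{"name": "Astoria", "price": 100, "stars": 4}],
    "L2": [{"name": "Astoria", "price": 120}, {"name": "Roma", "price": 80}],
}
hotels = full_join_merge(answers, "name", {"price": average})
```

The hotel found only in L2 still appears in `hotels`, with `stars` left as `None`, which is the behaviour a full outer join is meant to give.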
35Building the Mappings: an example
Mapping Table of the global class Hotel (local classes: L1.resort, L2.hotel)
Data Conversion Functions: DollarEuro(mean_price)
Full Join:
    Select name, avg(T_L1.price_avg, T_L2.mean_price) as price, T_L1.Stars, ...
    from T_L1 outer join T_L2
    on (T_L1.Name = T_L2.denomination)
36Global Query Management
- The querying problem: how to answer queries expressed on the GS (global queries)?
- In a Virtual Data Integration system, data reside at the data sources, so query processing is based on query rewriting: a global query is rewritten as an equivalent set of queries expressed on the local schemata of the data sources (local queries).
- GAV approach: query rewriting is performed by unfolding, i.e. by expanding a global query on G according to the view associated with G.
- Query Optimization Techniques for the Full-Join Merge Operator
- Motivation:
- full outer join queries are very expensive, especially in a distributed environment
- only limited optimization is performed on full outer joins
37An example of Full-Join Merge Optimization
    SELECT * FROM G
    WHERE city LIKE "Modena" AND price < 200
      AND stars = 4
      AND free_wifi = true
[Figure labels: AND free_wifi = true, AND stars = 4]
38MILOS
- XML Search Engine (SOAP Web Service): structure search, fielded search, full-text search, multimedia search, schema independent, XQuery support
- Metadata Editor: Visual Basic (SOAP communication)
- Multimedia document server (SOAP Web Service): allows homogeneous access to heterogeneous media
- Retrieval Interface: JSP (SOAP communication)
- Metadata independence: the schema seen in the interface logic can be different from the one(s) used in the repository
- Repository Metadata Integrator (SOAP Web Service): access to documents, access to metadata, metadata independence
39MILOS (2)
- The MILOS system is based on a three-tier distributed architecture:
- Client tier: this is the topmost level of the system. It contains the client application that interacts with MILOS and displays results to user applications.
- Business logic: it manages query processing by integrating and aligning information stored in the databases. It performs reconciliation of retrieved data by managing ranking.
- Data tier: it is composed of the Large Object Database, which physically stores the multimedia documents managed by the system, and the metadata database, where all metadata associated with the multimedia items are stored.
- Multimedia metadata are represented in the data tier in XML formats. MILOS adopts a native XML database, which supports XML query language standards and offers advanced search and indexing functionality on arbitrary XML documents.
- The MILOS XML database provides full-text search, automatic classification, and feature similarity search functionalities.
- The Large Object Database permits clients of MILOS to deal with multimedia in a uniform way.
40The MedRank algorithm
- Whenever there are multiple multimedia attributes, strange side effects can affect the precision of the answer.
- Example:
- Suppose we have two image databases consisting of monument images.
- MS1 provides image similarity and geographic coordinates
- MS2 provides only image similarity
- The query consists of a sample image and the coordinates of a point
41SELECT * FROM WorldMonuments ORDER BY Image(URL), GeoCoord(41°53'43.68"N, 12°28'56.34"E) STOP AFTER 5
[Figure: ranked lists from MS1 and MS2, labeled "First element retrieved"; distances shown are 1 km, 1 km, 1 km, 1 km and 2 km. Caption: "Roma. Palazzo della Civiltà del Lavoro. EUR"]
42DMS Assumptions
- The rationale of the above assumptions is that our aim is to work in a general environment with heterogeneous DMSs whose scoring functions are unknown to us.
- The motivation is that the final scores themselves are often the result of the contributions of the scores of each attribute. A scoring function is therefore usually defined as an aggregation over partial heterogeneous scores (e.g., the relevance for text-based IR with keyword queries, or similarity degrees for color and texture of images in a multimedia database).
- Even in the simpler case of a single multimedia attribute, knowledge of the scores becomes meaningless outside the context in which they are evaluated. As an example, consider the TF-IDF scoring function used by ordinary text search engines. The score of a document depends on the collection statistics, and search engines may use different scoring algorithms.
- However, the assumption that a local DMS is a black box that does not return any score associated with the result elements does not imply that local DMSs do not internally use scoring functions for combining different multimedia attributes.
- Typically, modern multimedia systems use fuzzy logic to aggregate the scores of different multimedia attributes, which are graded in the interval [0,1]. Classical examples of these functions are the min and mean functions.
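A tiny numeric illustration of the last point: the same partial scores give different overall scores depending on whether a source aggregates them with min or with mean, which is one reason why raw scores from different DMSs cannot be compared directly. The function names and the sample values are made up for this sketch.

```python
def fuzzy_min(scores):
    """Aggregate partial scores in [0, 1] with the min function."""
    return min(scores)


def fuzzy_mean(scores):
    """Aggregate partial scores in [0, 1] with the arithmetic mean."""
    return sum(scores) / len(scores)


# Illustrative partial scores, e.g. image similarity and text relevance.
partial_scores = [0.5, 0.25]

score_with_min = fuzzy_min(partial_scores)
score_with_mean = fuzzy_mean(partial_scores)
```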
43Computation of Local Query conditions
- Each atomic predicate Pi and each similarity predicate in the global query is rewritten into corresponding constraints supported by the local classes.
- For example, the constraint stars = 3 is translated into the constraint Stars = 3 for the local class resort, and is not translated into any constraint for the local class hotel.
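The translation in this example can be sketched as a lookup in the Mapping Table. The table below is a hypothetical fragment for the global class Hotel (the local attribute names follow the mapping example elsewhere in the deck), predicates are represented as (attribute, operator, value) triples, and the comparison operator is assumed to be equality; none of this is the actual MOMIS data structure.

```python
# Hypothetical Mapping Table fragment for the global class Hotel:
# global attribute -> {local class: local attribute, or None if unmapped}.
MAPPING_TABLE = {
    "name": {"resort": "Name", "hotel": "denomination"},
    "stars": {"resort": "Stars", "hotel": None},
}


def local_conditions(global_predicates, local_class, mapping_table=MAPPING_TABLE):
    """Rewrite global predicates into the constraints a local class supports.

    Predicates on attributes that the local class does not map are skipped.
    """
    local = []
    for attribute, op, value in global_predicates:
        local_attribute = mapping_table.get(attribute, {}).get(local_class)
        if local_attribute is not None:
            local.append((local_attribute, op, value))
    return local


query = [("stars", "=", 3)]
resort_conditions = local_conditions(query, "resort")
hotel_conditions = local_conditions(query, "hotel")
```

For `resort` the predicate is pushed down with the local attribute name, while `hotel` receives no constraint at all, so its records can only be filtered after fusion.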
44Computation of Residual Conditions
- Conditions on non-homogeneous standard attributes cannot be translated into local conditions: they are considered residual and have to be solved at the global level.
45Computation of Residual Conditions
- For multimedia attributes we use the MOST_SIMILAR function. For example, suppose we are searching for images similar to the one specified in the query by means of the ORDER BY clause. If we retrieve two or more multimedia objects with one or more corresponding images, the MOST_SIMILAR function simply selects the image that is most similar to the query image.
- However, since we do not know the scores, how do we evaluate similarity?
46Computation of Residual Conditions
- Rank-Based Similarity
- We simply exploit the rank of the objects in the returned list as an indicator of similarity between the attribute values belonging to the objects.
- This aspect is related to the fusion problem.
47Fusion of local answers
- For each local source involved in the global query, a local query is generated and executed on the local source. The local answers are fused into the global answer on the basis of the mapping query qG defined for G, i.e. by using the Full Outerjoin-merge (FOJ) operation:
- Computation of the full outer join of the local answers (FOJ). The result of this operation is ordered on the basis of the multimedia attributes specified in the query; this aspect is examined in depth in the next slide.
- Application of the Resolution Functions: for each attribute GA of the global query, the related Resolution Function is applied to FOJ.
48Ranking the results
- In principle, if we had ALL the (fused) records of the result set, we could exploit an optimal rank aggregation method based on a distance measure that quantifies the disagreements among different rankings.
- In this respect, the overall ranking is the one that has minimum distance to the different rankings obtained from the different sources.
- Several distance measures are available in the literature. However, the difficulty of solving the distance-based rank aggregation problem depends on the choice of the distance measure and on its corresponding complexity, which can even be NP-hard in some cases (see the Kendall distance).
- Fortunately, our case falls into the category of partial rank aggregation problems, in which we measure the distance between top-k lists only, rather than fully ranked lists.
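As an illustration of a distance measure between rankings, the Kendall distance mentioned above counts the pairs of objects that two rankings order differently. The sketch below assumes two full rankings over the same set of distinct objects; the function name is chosen for this example. Computing the distance between two given rankings is simple; the expensive part is searching for the aggregate ranking that minimizes the total distance to all the input rankings.

```python
from itertools import combinations


def kendall_distance(ranking_a, ranking_b):
    """Number of object pairs that the two rankings order differently."""
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    if set(pos_a) != set(pos_b):
        raise ValueError("both rankings must contain the same objects")
    return sum(
        1
        for x, y in combinations(ranking_a, 2)
        # A negative product means x and y appear in opposite order.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )


same = kendall_distance(["a", "b", "c"], ["a", "b", "c"])
one_swap = kendall_distance(["a", "b", "c"], ["b", "a", "c"])
reversed_order = kendall_distance(["a", "b", "c"], ["c", "b", "a"])
```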
49Example [1]
A: (1, 2, 3)
B: (1, 1, 2)
C: (3, 3, 4)
D: (3, 4, 4)
[1] http://www.cs.helsinki.fi/u/tsaparas/InformationNetworks/lectures/lecture10.ppt
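Reading each triple in the example above as the positions of an object in three rankings (an interpretive assumption of this sketch), the median rank of each object and the resulting aggregated order can be computed as follows.

```python
from statistics import median

# Positions of each object in the three rankings, as listed in the example.
positions = {
    "A": (1, 2, 3),
    "B": (1, 1, 2),
    "C": (3, 3, 4),
    "D": (3, 4, 4),
}

# The score of an object is its median position; lower is better.
median_rank = {obj: median(ranks) for obj, ranks in positions.items()}

# Sort objects by median rank to obtain the aggregated ranking.
aggregated = sorted(median_rank, key=median_rank.get)
```

Under this reading B gets median 1, A gets 2, C gets 3 and D gets 4, so the aggregated order is B, A, C, D.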
50Combining rankings
- In many cases the scores are not known
- e.g. meta-search engines: scores are proprietary information
- or we do not know how they were obtained
- one search engine returns score 10, the other 100. What does this mean?
- or the scores are incompatible
- apples and oranges: does it make sense to combine price with distance?
- In these cases we can only work with the rankings