CIS750 - PowerPoint PPT Presentation

Slides: 58

Provided by: Vas111

Learn more at: https://cis.temple.edu

Transcript and Presenter's Notes

Title: CIS750

1
CIS750: Seminar in Advanced Topics in Computer Science
Advanced topics in databases
Multimedia Databases

V. Megalooikonomou
Generic Multimedia Indexing
(some slides are based on notes by C. Faloutsos)

2
General Overview

Multimedia Indexing
Spatial Access Methods (SAMs)
k-d trees
Point Quadtrees
MX-Quadtree
z-ordering
R-trees
Generic Multimedia Indexing

3
Multimedia Indexing: Detailed outline

Generic Multimedia Indexing
Problem definition
Distance function
Similarity queries: types
Requirements (ideal method)
Basic idea, Lower-bounding
Gemini approach
Applications
1-D Time sequences
2-D Color images

4
Generic Multimedia Indexing - problem

Given a database of multimedia objects
Design fast search algorithms that locate objects
that match a query object, exactly or
approximately
Objects
1-d time sequences
Digitized voice or music
2-d color images
2-d or 3-d gray scale medical images
Video clips
E.g., find companies whose stock prices move
similarly

6
Generic Multimedia Indexing - problem

1st step: provide a measure for the distance
between two objects
Distance function D()
Given two objects OA, OB, the distance
(dis-similarity) of the two objects is denoted
by
D(OA, OB)
E.g., Euclidean distance (square root of the sum of
squared differences) of two equal-length time series
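As a minimal sketch of such a distance function, the Euclidean distance of two equal-length sequences can be written as follows. The function name and the sample price lists are illustrative assumptions, not part of the original slides.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length sequences of numbers."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    # Square root of the sum of squared point-wise differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical four-day price sequences.
prices_a = [10.0, 11.0, 12.0, 13.0]
prices_b = [10.0, 12.0, 12.0, 15.0]
print(euclidean_distance(prices_a, prices_b))
```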

8
Types of Similarity Queries

Similarity queries are classified into:
Whole match queries
Given a collection of N objects O1, ..., ON and a
query object Q, find the data objects that are within
distance ε from Q
Sub-pattern match queries
Given a collection of N objects O1, ..., ON, a
query (sub-)object Q, and a tolerance ε, identify
the parts of the data objects that match the
query Q
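The two query types can be sketched with plain sequential scans. This is only an illustration under assumptions: objects are numeric sequences, the distance is Euclidean, and the sub-pattern query slides a window of the query's length over each object. All names are hypothetical.

```python
import math

def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def whole_match(objects, query, eps):
    """Indices of objects within distance eps of the query (equal lengths assumed)."""
    return [i for i, obj in enumerate(objects) if dist(obj, query) <= eps]

def subpattern_match(objects, query, eps):
    """(object index, offset) of every window that matches the shorter query."""
    m = len(query)
    hits = []
    for i, obj in enumerate(objects):
        for start in range(len(obj) - m + 1):
            if dist(obj[start:start + m], query) <= eps:
                hits.append((i, start))
    return hits

series = [[1, 2, 3, 4, 5], [5, 4, 3, 2, 1], [1, 2, 3, 4, 6]]
print(whole_match(series, [1, 2, 3, 4, 5], eps=1.0))   # [0, 2]
print(subpattern_match(series, [3, 4], eps=0.5))       # [(0, 2), (2, 2)]
```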

9
Types of Similarity Queries

[Figure: time sequences S1, ..., Sn (value per day, days 1 to 365) mapped to points F(S1), ..., F(Sn) in a feature space with axes avg and std]

12
Types of Similarity Queries

Additional types of queries
K-nearest neighbor queries
Given a collection of N objects O1, ..., ON and a
query object Q, find the K data objects most
similar to Q
All-pairs queries (or spatial joins)
Given a collection of N objects O1, ..., ON, find all
pairs of objects that are within distance ε from each other
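Both additional query types can be stated as brute-force scans, which serve as a baseline for any faster method. The sketch below assumes Euclidean distance on numeric vectors; the function names and the sample points are hypothetical.

```python
import heapq
import math
from itertools import combinations

def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def k_nearest(objects, query, k):
    """Indices of the k objects closest to the query, nearest first."""
    return heapq.nsmallest(k, range(len(objects)),
                           key=lambda i: dist(objects[i], query))

def all_pairs(objects, eps):
    """Index pairs (i, j), i < j, whose objects are within eps of each other."""
    return [(i, j) for i, j in combinations(range(len(objects)), 2)
            if dist(objects[i], objects[j]) <= eps]

points = [[0, 0], [1, 0], [5, 5], [1, 1]]
print(k_nearest(points, [0, 0], k=2))   # [0, 1]
print(all_pairs(points, eps=1.0))       # [(0, 1), (1, 3)]
```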

15
Ideal method: requirements

Fast: sequential scanning and distance
calculation with each and every object is too slow
for large databases
Correct: no false dismissals. False alarms are
acceptable. Why?
Small space overhead
Dynamic: easy to insert, delete, and update
objects

16
Approach Outline

Use k feature extraction functions to map objects
into points in k-dimensional space (applying a mapping F())
Use highly fine-tuned database SAMs (Spatial
Access Methods), like R-trees, to accelerate the
search (by pruning large portions of the
database that are not promising)

20
Basic idea

Focus on whole match queries
Given a collection of N objects O1, ..., ON, a
distance/dis-similarity function D(Oi, Oj), and
a query object Q, find the data objects that are
within distance ε from Q
Sequential scanning?
May be too slow, for the following reasons:
Distance computation is expensive (e.g., edit
distance in DNA strings)
The database size N may be huge
Faster alternative?
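To see why a single distance computation can be expensive, consider the edit distance mentioned above. A minimal sketch, assuming the standard Levenshtein definition with unit costs (the slides do not fix a cost model), takes time proportional to the product of the two string lengths:

```python
def edit_distance(s, t):
    """Levenshtein distance: fewest single-character insertions,
    deletions, and substitutions that turn s into t."""
    prev = list(range(len(t) + 1))          # distances from "" to prefixes of t
    for i, cs in enumerate(s, start=1):
        curr = [i]                          # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

print(edit_distance("ACGT", "AGT"))   # 1
```

Every pair of characters is visited once, so comparing a query against N long strings by sequential scan multiplies this cost by N.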

21
Basic idea

Faster alternative:
Step 1: a "quick and dirty" test to quickly
discard the vast majority of non-qualifying
objects
Step 2: use of SAMs to search faster than
a sequential scan
Example:
Database of yearly stock price movements
Euclidean distance function
Characterize each sequence with a single number (feature)
Or use two or more features
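The two-step idea can be sketched on the stock example with the average as the single feature. This is an illustration under assumptions: the feature test is a linear scan standing in for a SAM, and the data are made up. The test is safe because the difference of two averages never exceeds the Euclidean distance of the sequences.

```python
import math

def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def average(x):
    return sum(x) / len(x)

def filter_and_refine(database, query, eps):
    """Range query: cheap one-number test first, exact distance on survivors."""
    q_avg = average(query)
    # Step 1: quick-and-dirty test on the single feature.
    candidates = [i for i, s in enumerate(database)
                  if abs(average(s) - q_avg) <= eps]
    # Step 2: the exact distance removes the false alarms.
    answers = [i for i in candidates if dist(database[i], query) <= eps]
    return candidates, answers

stocks = [[10, 11, 12, 13], [10, 11, 12, 14], [13, 12, 11, 10], [50, 51, 52, 53]]
candidates, answers = filter_and_refine(stocks, [10, 11, 12, 13], eps=1.5)
print(candidates, answers)   # [0, 1, 2] [0, 1]
```

Sequence 2 has the same average as the query but a different shape, so it survives the filter as a false alarm and is discarded by the exact distance.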

22
Basic idea - illustration

A query with tolerance ε becomes a sphere with
radius ε

24
Basic idea: caution!

The mapping F() from objects to k-d points should
not distort the distances
D(): distance of two objects
Df(): distance of the corresponding feature
vectors
Ideally, perfect preservation of distances
In practice, a guarantee of no false dismissals
How? By ensuring that the distance in f-space matches or
underestimates the distance between two objects
in the original space

25
Basic idea: Lower bounding

Let O1, O2 be two objects with distance function
D(), and let F(O1), F(O2) be their feature vectors
with distance function Df().
To guarantee no false dismissals for whole
match queries, the feature extraction function
F() should satisfy
Df(F(O1), F(O2)) ≤ D(O1, O2)
for every pair of objects O1, O2

28
Lower bounding - Proof

Let Q be the query object, O a qualifying
object, and ε the tolerance.
Prove: if object O qualifies, it will be retrieved
by a range query in the f-space
That is, D(Q, O) ≤ ε ⇒ Df(F(Q), F(O)) ≤ ε
This holds because Df(F(Q), F(O)) ≤ D(Q, O) ≤ ε
What about all-pairs? (spatial join in
f-space)
What about nearest-neighbor queries?
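For nearest-neighbor queries, one commonly described scheme reduces the problem to a range query. It is sketched here as an assumption, since the slide leaves the question open: take the k nearest points in feature space, use their largest actual distance as the radius ε, run a range query in feature space with that ε, and rank the survivors by actual distance. The average feature and the sample data are hypothetical.

```python
import heapq
import math

def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def feature(x):
    # One-dimensional feature; |avg(x) - avg(y)| never exceeds dist(x, y).
    return [sum(x) / len(x)]

def knn_lower_bound(database, query, k):
    """k nearest neighbors via a lower-bounding feature (requires k <= N)."""
    fq = feature(query)
    feats = [feature(s) for s in database]
    ids = range(len(database))
    # 1. k nearest neighbors in feature space.
    seeds = heapq.nsmallest(k, ids, key=lambda i: dist(feats[i], fq))
    # 2. Their largest actual distance is a safe search radius.
    eps = max(dist(database[i], query) for i in seeds)
    # 3. Range query in feature space with that radius.
    candidates = [i for i in ids if dist(feats[i], fq) <= eps]
    # 4. Rank the candidates by actual distance.
    return heapq.nsmallest(k, candidates, key=lambda i: dist(database[i], query))

stocks = [[10, 11, 12, 13], [10, 11, 12, 14], [13, 12, 11, 10], [50, 51, 52, 53]]
print(knn_lower_bound(stocks, [10, 11, 12, 13], k=2))   # [0, 1]
```

The radius from step 2 is safe because at least k objects lie within it, and the lower-bounding property guarantees that each of the true k nearest neighbors falls inside the feature-space range query.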

30
GEneric Multimedia object INdexIng

GEMINI approach
Determine the distance function D()
Find one or more numerical feature-extraction
functions (to provide a quick and dirty test)
Prove that Df() lower-bounds D(), to guarantee no
false dismissals
Use a SAM (e.g., R-tree) to store and retrieve
the k-d feature vectors
Note: the methodology focuses only on the speed of
search, not on the quality of the results,
which depends on the distance function

31
Generic Multimedia Object Indexing

Applications
1-d time sequences
2-d color images
Problems to solve
How to apply the lower-bounding lemma
Curse of Dimensionality (time sequences)
Cross-talk of features (color images)

35
1-D Time Sequences

Distance function: Euclidean distance
Find features that
Preserve/lower-bound the distance
Carry as much information as possible (to reduce
false alarms)
If we are allowed to use only one feature, what
would this be? The average.
Extending it:
The average of the 1st half, of the 2nd half, of the
1st quarter, etc.
Coefficients of the Fourier transform (DFT),
wavelet transform, etc.
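One way to extend the average into several features is sketched below, under the assumption of k equal-length consecutive segments (a simplification of the halves-and-quarters scheme above). Scaling each segment average by the square root of the segment length makes the Euclidean distance of the features a lower bound of the Euclidean distance of the sequences, by the Cauchy-Schwarz inequality applied per segment. Names and data are hypothetical.

```python
import math

def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def segment_averages(x, k):
    """Averages of k equal-length segments, each scaled by sqrt(segment length)."""
    n = len(x)
    if n % k != 0:
        raise ValueError("k must divide the sequence length")
    w = n // k                      # segment length
    scale = math.sqrt(w)
    return [scale * sum(x[i * w:(i + 1) * w]) / w for i in range(k)]

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1, 2, 3, 4, 6, 7, 8, 9]
print("actual distance:", dist(x, y))
for k in (1, 2, 4, 8):
    # The feature distance never exceeds the actual distance.
    print(k, dist(segment_averages(x, k), segment_averages(y, k)))
```

With k = 1 this is the single scaled average; more segments carry more information and so produce fewer false alarms.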

37
1-D Time Sequences

Show that the distance in feature space
lower-bounds the actual distance
What about DFT?
Parseval's theorem: the DFT preserves the energy
of the signal, as well as the distances between
two signals.
D(x, y) = D(X, Y)
where X and Y are the Fourier transforms of
x and y
If we keep the first k ≤ n coefficients of the DFT, we
lower-bound the actual distance
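The claim can be checked numerically with a small sketch. It assumes the orthonormal form of the DFT (scaling by 1/sqrt(n)), which is the form for which distances are preserved exactly, and uses a direct O(n^2) transform for clarity rather than an FFT. The sample sequences are hypothetical.

```python
import cmath
import math

def dft(x):
    """Orthonormal discrete Fourier transform (1/sqrt(n) scaling)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            / math.sqrt(n)
            for f in range(n)]

def dist(x, y):
    """Euclidean distance; abs() also handles complex coefficients."""
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(x, y)))

x = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 5.0, 8.0]
y = [2.0, 2.0, 3.0, 4.0, 6.0, 5.0, 7.0, 7.0]
X, Y = dft(x), dft(y)
print(dist(x, y), dist(X, Y))        # equal up to rounding (Parseval)
for k in (1, 2, 4, 8):
    print(k, dist(X[:k], Y[:k]))     # never exceeds dist(x, y)
```

Dropping coefficients only removes non-negative terms from the sum of squared differences, which is why the truncated distance cannot overestimate.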

38
1-D Time Sequences

Response time improves as the transform
concentrates more of the energy of the signal
DFT concentrates the energy for a large class of
signals, the colored noises
Colored noises: skewed energy spectrum that drops
as O(f^(-b))
The energy spectrum (or power spectrum) of a signal is
the square of the amplitude |Xf| as a function of
the frequency f
b = 2: random walks or brown noise (very
predictable)
b > 2: black noises
b = 1: pink noise
b = 0: white noise (completely unpredictable)
Colored noises appear even in images (photographs)
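Energy concentration can be illustrated by computing the energy spectrum and the fraction of energy held by the first k coefficients. The sketch assumes the orthonormal DFT and uses two deterministic signals as stand-ins: a ramp (smooth, trend-like) and a unit impulse, whose spectrum is flat like that of white noise. These signals and names are illustrative assumptions, not examples from the slides.

```python
import cmath
import math

def energy_spectrum(x):
    """|X_f|^2 for each frequency f of the orthonormal DFT of x."""
    n = len(x)
    spectrum = []
    for f in range(n):
        coeff = sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n)
                    for t in range(n)) / math.sqrt(n)
        spectrum.append(abs(coeff) ** 2)
    return spectrum

def energy_fraction(x, k):
    """Fraction of the total energy captured by the first k coefficients."""
    spectrum = energy_spectrum(x)
    return sum(spectrum[:k]) / sum(spectrum)

ramp = [float(t) for t in range(16)]     # skewed spectrum
impulse = [1.0] + [0.0] * 15             # flat spectrum
print(energy_fraction(ramp, 2), energy_fraction(impulse, 2))
```

The more skewed the spectrum, the more of the distance a few leading coefficients account for, and the fewer false alarms the filter step lets through.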

40
2-D color images

Image features for Content-Based Image Retrieval
(CBIR)
Low level
Color: color histograms
Texture: directionality, granularity, contrast
Shape: turning angle, moments of inertia,
pattern spectrum
Position: 2D strings method
etc.
Object level
Regions

41
2-D color images: Color histograms

Each color image: a 2-d array of pixels
Each pixel: 3 color components (R, G, B)
h colors, each color denoting a point in 3-d
color space (as high as 2^24 colors)
For each image, compute the h-element color
histogram: each component is the percentage of
pixels that are most similar to that color
The histogram of image I is defined as follows:
For a color Ci, Hci(I) represents the number
of pixels of color Ci in image I
OR
For any pixel in image I, Hci(I) represents the
probability of that pixel having color Ci.
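A minimal sketch of the normalized histogram follows. It assumes the image is given as a flat list of (R, G, B) tuples and that "most similar" means nearest in RGB space by Euclidean distance; the three-color palette is a hypothetical stand-in for the h representative colors.

```python
def color_histogram(pixels, palette):
    """Fraction of pixels whose nearest palette color is each palette entry."""
    counts = [0] * len(palette)
    for p in pixels:
        # Index of the palette color closest to this pixel in RGB space.
        nearest = min(range(len(palette)),
                      key=lambda i: sum((a - b) ** 2 for a, b in zip(p, palette[i])))
        counts[nearest] += 1
    return [c / len(pixels) for c in counts]

palette = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]            # h = 3 colors
pixels = [(250, 10, 5), (240, 0, 0), (10, 200, 30), (0, 0, 255)]
print(color_histogram(pixels, palette))   # [0.5, 0.25, 0.25]
```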

42
2-D color images: Color histograms

Usually, similar colors are clustered together and
one representative color is chosen for each color
bin
Most commercial CBIR systems include the color
histogram as one of the features (e.g., QBIC of
IBM)
No spatial information

43
Color histograms - distance

One method to measure the distance between two
histograms x and y is
$d_h^2(x, y) = (x - y)^T A (x - y) = \sum_{i=1}^{h} \sum_{j=1}^{h} a_{ij} (x_i - y_i)(x_j - y_j)$
where the color-to-color similarity matrix
A has entries aij that describe the similarity
between color i and color j
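A histogram distance built on a color-to-color similarity matrix can be sketched as a quadratic form, assuming the distance is the square root of (x - y)^T A (x - y). The three colors and the similarity values below are made up for illustration.

```python
import math

def quadratic_form_distance(x, y, A):
    """sqrt((x - y)^T A (x - y)) for h-element histograms x and y."""
    d = [a - b for a, b in zip(x, y)]
    h = len(d)
    # All h*h cross terms contribute, weighted by color similarity.
    total = sum(A[i][j] * d[i] * d[j] for i in range(h) for j in range(h))
    return math.sqrt(max(total, 0.0))    # guard against tiny negative rounding

# Hypothetical colors: 0 = bright red, 1 = pink, 2 = blue.
A = [[1.0, 0.8, 0.0],
     [0.8, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
all_red, all_pink, all_blue = [1, 0, 0], [0, 1, 0], [0, 0, 1]
print(quadratic_form_distance(all_red, all_pink, A))   # small: similar colors
print(quadratic_form_distance(all_red, all_blue, A))   # large: dissimilar colors
```

With the identity matrix in place of A, the measure reduces to the Euclidean distance and the red and pink images would be as far apart as the red and blue ones.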

44
Color histograms: lower bounding

Two obstacles for using color histograms as
feature vectors in GEMINI
Dimensionality curse (h is large, e.g., 64 or 128)
Distance function is quadratic
It involves all cross terms (cross-talk among
features)
- expensive to compute
- precludes the use of SAMs

[Figure: two histograms, x and q, over e.g. 64 colors; labeled colors: bright red, pink, orange]

45
Color histograms: lower bounding

1st step: define the distance function between
two color images, D() = dh()
2nd step: find numerical features (one or more)
whose Euclidean distance lower-bounds dh()
If we were allowed to use only one numerical feature to
describe the color image, what should it be?
Avg. amount of each color component (R, G, B):
$\bar{x} = (R_{avg}, G_{avg}, B_{avg})^T$
where $R_{avg} = \frac{1}{P} \sum_{p=1}^{P} R(p)$,
and similarly for G and B,
where P is the number of pixels in the
image and R(p) is the red component (intensity) of
the p-th pixel
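The average-color feature can be sketched as follows, assuming each image is a flat list of (R, G, B) tuples. The helper d_avg, the Euclidean distance between two average-color vectors, and the sample images are illustrative assumptions.

```python
import math

def average_color(pixels):
    """3-d vector (Ravg, Gavg, Bavg): mean of each component over all P pixels."""
    P = len(pixels)
    return tuple(sum(p[c] for p in pixels) / P for c in range(3))

def d_avg(pixels_a, pixels_b):
    """Euclidean distance between the two average-color vectors."""
    a, b = average_color(pixels_a), average_color(pixels_b)
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

red_and_blue = [(255, 0, 0), (0, 0, 255)]
two_purples = [(127, 0, 127), (128, 0, 128)]
print(average_color(red_and_blue))       # (127.5, 0.0, 127.5)
print(d_avg(red_and_blue, two_purples))  # 0.0
```

The two sample images look different yet share the same average color, so a filter based on this feature alone would report such a pair as a false alarm for the exact histogram distance to resolve.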

46
Color histograms: lower bounding

Given the average color vectors $\bar{x}$ and $\bar{y}$ of two
images, we define davg() as the Euclidean distance
between the 3-d average color vectors
3rd step: prove that the feature distance
davg() lower-bounds the actual distance dh()
Main idea of approach:
First, a filtering step using the average (R, G, B)
color,
then a more accurate matching using the full
h-element histogram

47
Color auto-correlogram

Pick any pixel p1 of color Ci in the image I;
at distance k away from p1, pick another pixel p2.
What is the probability that p2 is also of color
Ci?

[Figure: image I with a pixel P1 and a pixel P2 at distance k from it; is P2 also red?]

48
Color auto-correlogram

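The auto-correlogram of image I for color Ci and
distance k:
$\gamma_{C_i}^{(k)}(I) = \Pr[\, p_2 \in I_{C_i} \mid p_1 \in I_{C_i},\ |p_1 - p_2| = k \,]$
where $I_{C_i}$ is the set of pixels of I with color Ci
It integrates both color information and spatial
information.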
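A brute-force sketch of the auto-correlogram for one color and one distance is given below. It assumes the image is a grid of color labels and that pixel distance is the chessboard distance (the larger of the row and column offsets); both choices, and the sample images, are assumptions made for illustration.

```python
def auto_correlogram(image, color, k):
    """Estimate Pr[p2 has `color` | p1 has `color`, chessboard distance(p1, p2) = k]."""
    rows, cols = len(image), len(image[0])
    same = total = 0
    for r1 in range(rows):
        for c1 in range(cols):
            if image[r1][c1] != color:
                continue
            # Visit every in-bounds pixel on the square ring at distance k.
            for dr in range(-k, k + 1):
                for dc in range(-k, k + 1):
                    if max(abs(dr), abs(dc)) != k:
                        continue
                    r2, c2 = r1 + dr, c1 + dc
                    if 0 <= r2 < rows and 0 <= c2 < cols:
                        total += 1
                        same += int(image[r2][c2] == color)
    return same / total if total else 0.0

# Two 4x4 images with identical histograms (8 red, 8 blue pixels each).
blocks = ["RRBB", "RRBB", "RRBB", "RRBB"]
checker = ["RBRB", "BRBR", "RBRB", "BRBR"]
print(auto_correlogram(blocks, "R", 1), auto_correlogram(checker, "R", 1))
```

The two images cannot be told apart by their color histograms, but their auto-correlograms differ because the red pixels are arranged differently.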

49
Color auto-correlogram
50
Implementations

Pixel distance measures
Use the D8 distance (also called chessboard
distance)
Choose distance k = 1, 3, 5, 7
Computational complexity
Histogram
Correlogram

51
Implementations

Feature distance measures
D(f(I1) - f(I2)) is small ⇒ I1 and I2 are
similar.
Example: f(a) = 1000, f(a') = 1050; f(b) = 100,
f(b') = 150
For histogram
For correlogram
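The example values suggest that equal absolute differences should not always count equally: a change of 50 matters less at 1000 than at 100. The slide's own formulas are not preserved in this transcript, so the sketch below is an assumption: an L1 distance whose terms are divided by 1 plus the sum of the two feature values, a relative measure used in the correlogram literature.

```python
def absolute_l1(f1, f2):
    """Plain L1 distance between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(f1, f2))

def relative_l1(f1, f2):
    """L1 distance with each term divided by 1 + a + b, so that the same
    absolute difference counts less between large values."""
    return sum(abs(a - b) / (1 + a + b) for a, b in zip(f1, f2))

print(absolute_l1([1000], [1050]), absolute_l1([100], [150]))   # 50 50
print(relative_l1([1000], [1050]), relative_l1([100], [150]))
```

The same function applies to histogram bins or to correlogram entries, since both are vectors of non-negative feature values.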

52
Color Histogram vs Correlogram

If there is no difference between the query and
the target images, both methods have good
performance.

[Figure: query image (512 colors) with the top five results (1st to 5th) of the correlogram method and of the histogram method]

53
Color Histogram vs Correlogram

The correlogram method is more stable to color
change than the histogram method.

[Figure: query and target images; the target is ranked 1st by the correlogram method and 48th by the histogram method]

54
Color Histogram vs Correlogram

The correlogram method is more stable to large
appearance change than the histogram method

[Figure: query and target images; the target is ranked 1st by the correlogram method and 31st by the histogram method]

55
Color Histogram vs Correlogram

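The correlogram method is more stable to contrast
and brightness changes than the histogram method.

[Figure: four query images (Query 1 to Query 4) and one target image, with the target's rank under the correlogram (C) and histogram (H) methods; ranks in the order extracted: C 178th / H 230th, C 1st / H 1st, C 1st / H 3rd, C 5th / H 18th]
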
56
Color Histogram vs Correlogram

The color correlogram describes the global
distribution of local spatial correlations of
colors.
It is easy to compute
It is more stable than the color histogram method

57
Multimedia Indexing: Conclusions