Video Google: A Text Retrieval Approach to Object Matching in Videos

1
Video Google: A Text Retrieval Approach to Object Matching in Videos

Morteza Alamgir
Vision Lab

2
Motivation

IR: representation, storage, organization of, and access to information items
The focus is on the user's information need
Information item: usually text, but possibly also image, audio, video, etc.
Presently, most retrieval of non-text items is based on searching their textual descriptions.

3
Text retrieval overview

Words & Documents
Vocabulary
Weighting
Inverted file
Ranking

4
Words & Documents

Documents are parsed into words
Common words are ignored (the, an, etc.)
This is called a stop list
Words are represented by their stems
walk, walking, walks → walk
Each word is assigned a unique identifier
A document is represented by a vector with components given by the frequency of occurrence of the words it contains

5
Vocabulary

The vocabulary contains K words
Each document is represented by a K-component vector of word frequencies
(0, 0, 3, 4, ..., 5, 0, 0)

6
Example

Representation, detection and learning are the main issues that need to be tackled in designing a visual system for recognizing object categories.

7
Parse and clean
8
Creating document vector ID

Assign a unique ID to each word
Create a document vector of size K with word frequencies
(3, 7, 2, ...)/789
Or compactly, with the original order and position
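
The parsing steps from the previous slides (stop list, stemming, unique IDs, frequency vector) can be sketched in a few lines of Python. This is only an illustration: the stop list, the crude suffix-stripping `toy_stem`, and all function names are assumptions made for the example, not part of the presentation.

```python
import re
from collections import Counter

# Illustrative stop list (an assumption; real stop lists are much longer).
STOP_WORDS = {"the", "a", "an", "and", "are", "be", "for", "in", "is", "of", "that", "to"}

def toy_stem(word):
    """Very crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def parse(text):
    """Lowercase, split into words, drop stop words, reduce to stems."""
    words = re.findall(r"[a-z]+", text.lower())
    return [toy_stem(w) for w in words if w not in STOP_WORDS]

def build_vocabulary(docs):
    """Assign a unique integer ID to every stem seen in the corpus."""
    vocab = {}
    for doc in docs:
        for stem in parse(doc):
            vocab.setdefault(stem, len(vocab))
    return vocab

def document_vector(text, vocab):
    """K-component vector of stem frequencies (K = vocabulary size)."""
    counts = Counter(parse(text))
    vec = [0] * len(vocab)
    for stem, n in counts.items():
        if stem in vocab:
            vec[vocab[stem]] = n
    return vec
```

For example, with the single document "He walks. She is walking. They walk the dog.", the stems walk, walking and walks all map to the same ID, so that component of the vector is 3.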

9
Weighting

The vector components are weighted in various ways
Naive: frequency of each word
Binary: 1 if the word appears, 0 if not
tf-idf: Term Frequency - Inverse Document Frequency
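
A small sketch of the binary and tf-idf weightings. The slide does not give a formula, so this assumes one common tf-idf form, t_i = (n_id / n_d) * log(N / n_i), where n_id is the count of word i in document d, n_d the total number of words in d, n_i the number of documents containing word i, and N the number of documents. Function names are hypothetical.

```python
import math

def binary_weights(counts):
    """1 if the word appears in the document, 0 if not."""
    return [1 if c > 0 else 0 for c in counts]

def tf_idf_weights(doc_counts):
    """doc_counts: list of raw count vectors, one per document, all of length K.

    Assumed form: t_i = (n_id / n_d) * log(N / n_i).
    """
    n_docs = len(doc_counts)
    k = len(doc_counts[0])
    # n_i: number of documents that contain word i
    doc_freq = [sum(1 for d in doc_counts if d[i] > 0) for i in range(k)]
    weighted = []
    for d in doc_counts:
        total = sum(d)
        row = []
        for i in range(k):
            if d[i] == 0 or total == 0:
                row.append(0.0)
            else:
                row.append((d[i] / total) * math.log(n_docs / doc_freq[i]))
        weighted.append(row)
    return weighted
```

Note that a word occurring in every document gets weight 0 under this form, since log(N / N) = 0.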

10
Inverted File Index

Crawling stage
Parse all documents to create the document-representing vectors
Create word indices
An entry for each word in the corpus, followed by a list of all documents containing it (and its positions in them)
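
As a rough illustration of such an index, the sketch below maps each word ID to the list of (document, position) pairs where it occurs. The input layout and function names are assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict mapping doc_id -> list of word IDs in order of occurrence.

    Returns a mapping word_id -> list of (doc_id, position) pairs.
    """
    index = defaultdict(list)
    for doc_id, word_ids in docs.items():
        for position, word_id in enumerate(word_ids):
            index[word_id].append((doc_id, position))
    return index

def candidate_documents(index, query_word_ids):
    """IDs of all documents containing at least one of the query words."""
    hits = set()
    for word_id in query_word_ids:
        for doc_id, _position in index.get(word_id, []):
            hits.add(doc_id)
    return hits
```

Only the documents returned by `candidate_documents` need to be compared against a query, which is what makes the index worthwhile on a large corpus.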

11
Querying

Parse the query to create a query vector
Query: "Representation learning" → query vector (1, 0, 1, 0, 0, ...)
Retrieve the IDs of all documents containing one of the query word IDs (using the inverted file index)
Calculate the distance between the query and document vectors (angle between vectors)
Rank the results
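
The ranking step can be sketched as follows, taking "angle between vectors" to mean cosine similarity (larger means a smaller angle). Documents sharing no word with the query are skipped, standing in for the inverted-index lookup. The function names and the dict-based layout are assumptions for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_documents(query_vec, doc_vecs):
    """doc_vecs: dict doc_id -> vector. Returns (doc_id, score), best first.

    Only documents that share at least one word with the query are scored.
    """
    scored = []
    for doc_id, vec in doc_vecs.items():
        if any(q and d for q, d in zip(query_vec, vec)):
            scored.append((cosine_similarity(query_vec, vec), doc_id))
    # Highest similarity first; ties broken by document ID.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [(doc_id, score) for score, doc_id in scored]
```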

12
Ranking the query results

PageRank (PR)
Assume page A has pages T1, T2, ..., Tn linking to it
Define C(X) as the number of links going out of page X
d is a weighting (damping) factor between 0 and 1
PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Word order
Font size, font type and more
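
A minimal sketch of iterating the PageRank recurrence from Brin and Page, using the symbols defined on this slide. The dict-of-lists link graph, the fixed iteration count, and the function name are assumptions for illustration; d = 0.85 is the value Brin and Page report as typical.

```python
def page_rank(links, d=0.85, iterations=50):
    """links: dict page -> list of pages it links to.

    Every page that is linked to must also appear as a key.
    Iterates PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A,
    where C(T) is the number of links going out of T.
    """
    pages = list(links)
    pr = {p: 1.0 for p in pages}
    # incoming[p] lists the pages T that link to p
    incoming = {p: [] for p in pages}
    for src, targets in links.items():
        for dst in targets:
            incoming[dst].append(src)
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            new_pr[p] = (1 - d) + d * sum(pr[t] / len(links[t]) for t in incoming[p])
        pr = new_pr
    return pr
```

A page with no incoming links ends up with rank 1 - d, the minimum possible under this recurrence.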

13
The Visual Analogy
Text → Visual
Word → ???
Stem → ???
Document → Frame

14
Detecting Visual Words

Visual word → Descriptor
What is a good descriptor?
Invariant to different viewpoints, scale, illumination, shift and transformation
Local versus global
How to build such a descriptor?
Find invariant regions in the frame
Represent each region by a descriptor

15
Finding invariant regions

Two types of viewpoint covariant regions are computed for each frame
SA - Shape Adapted
MS - Maximally Stable

16
SA - Shape Adapted

Find interest points using the Harris corner detector
Iteratively determine the ellipse center, scale and shape around each interest point
Reference - Baumberg

17
MS - Maximally Stable

Intensity watershed image segmentation
The regions are those for which the area is approximately stationary as the intensity threshold is varied
Reference - Matas

18
Why two types of detectors ?

They are complementary representations of a frame
SA regions tend to be centered on corner-like features
MS regions correspond to blobs of high contrast (such as a dark window on a gray wall)
Each detector describes a different vocabulary (e.g. the building design and the building specification)

19
MS - SA example
[Figure: MS regions in yellow, SA regions in cyan, with a zoomed view]

20
Building the Descriptors

SIFT - Scale Invariant Feature Transform
Each elliptical region is represented by a 128-dimensional vector (Lowe)
SIFT is invariant to a shift of a few pixels (which often occurs)

21
Building the Descriptors

Removing noise: tracking and averaging
Regions are tracked across a sequence of frames using a constant velocity dynamical model
Any region which does not survive for more than three frames is rejected
Descriptors throughout each track are averaged to improve SNR
Descriptors with large covariance are rejected
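
The rejection and averaging rules can be sketched with NumPy, assuming the tracker has already grouped the descriptors of each region into a track (the constant velocity tracking itself is not shown). The summed per-dimension variance used as the "large covariance" test and the `max_variance` threshold are assumptions made for this example, as are the function and parameter names.

```python
import numpy as np

def average_stable_tracks(tracks, min_frames=4, max_variance=None):
    """tracks: list of arrays of shape (n_frames, dim), one row per frame.

    Keeps tracks that survive for more than three frames (n_frames >= 4),
    optionally drops tracks whose summed per-dimension variance exceeds
    max_variance, and returns the mean descriptor of each kept track
    as an array of shape (n_kept, dim).
    """
    kept = []
    for track in tracks:
        track = np.asarray(track, dtype=float)
        if track.shape[0] < min_frames:
            continue  # region did not survive long enough
        if max_variance is not None:
            total_var = track.var(axis=0).sum()
            if total_var > max_variance:
                continue  # descriptor is too unstable along the track
        kept.append(track.mean(axis=0))
    if not kept:
        return np.empty((0, 0))
    return np.vstack(kept)
```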

22
The Visual Analogy
Text → Visual
Word → Descriptor
Document → Frame

23
Building the Visual Stems

Cluster the descriptors into K groups using the K-means clustering algorithm
Each cluster represents a visual word in the visual vocabulary
Result:
10K SA clusters
16K MS clusters

24
K-Means Clustering

Input
A set of n unlabeled examples D = {x1, x2, ..., xn} in d-dimensional feature space
Number of clusters: K
Objective
Find a partition of D into K non-empty disjoint subsets so that the points in each subset are coherent according to a certain criterion
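
A minimal sketch of K-means (Lloyd's algorithm) with NumPy, taking squared Euclidean distance to the cluster mean as the coherence criterion. The initialization from randomly chosen data points, the iteration cap, and the function name are assumptions for illustration; the slides do not say which variant was used.

```python
import numpy as np

def k_means(points, k, iterations=100, seed=0):
    """Cluster points of shape (n, d) into k groups.

    Returns (centers, labels): centers has shape (k, d) and labels[i]
    is the index of the center assigned to points[i].
    """
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iterations):
        # Squared distance from every point to every center: shape (n, k).
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # leave an empty cluster's center unchanged
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # assignments have stopped changing the centers
        centers = new_centers
    return centers, labels
```

In the visual setting, each row of `centers` would play the role of one visual word.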

25
MS and SA Visual Words
[Figure: examples of SA and MS visual words]

26
The Visual Analogy
Text → Visual
Word → Descriptor
Document → Frame

27
Visual Stop List

The most frequent visual words, which occur in almost all images, are suppressed

[Figure: before stop list / after stop list]

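
One way to sketch a visual stop list is to rank visual words by the number of frames they appear in and zero out the top fraction. The `top_fraction` value below is an illustrative assumption, not a figure from the slides, and the function names are hypothetical.

```python
import numpy as np

def stop_list(frame_histograms, top_fraction=0.05):
    """frame_histograms: array of shape (n_frames, K) of visual word counts.

    Returns the sorted indices of the most frequent visual words,
    measured by the number of frames in which each word occurs.
    """
    hist = np.asarray(frame_histograms)
    doc_freq = (hist > 0).sum(axis=0)
    n_stop = int(round(top_fraction * hist.shape[1]))
    # Most frequent words first; stable sort keeps ties in index order.
    order = np.argsort(-doc_freq, kind="stable")
    return np.sort(order[:n_stop])

def apply_stop_list(frame_histograms, stopped):
    """Return a copy of the histograms with the stopped words set to zero."""
    hist = np.array(frame_histograms, copy=True)
    hist[:, stopped] = 0
    return hist
```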
28
Ranking Frames

Distance between vectors (as with words/documents)
Spatial consistency (analogous to word order in text)

29
Visual Google process

Preprocessing: vocabulary building, crawling frames, creating the stop list
Querying: building the query vector, ranking results

30
Vocabulary building
Subset of 48 shots is selected: 10k frames ≈ 10% of movie
Regions construction (SA and MS): 10k frames × 1600 = 1.6E6 regions
Frames tracking
Rejecting unstable regions: 1.6E6 → 200k regions
SIFT descriptors representation
Clustering descriptors using the k-means algorithm
Parameter tuning is done with the ground truth set

31
Crawling Implementation

To reduce complexity, one keyframe per second is selected (100-150k frames → 5k frames)
Descriptors are computed for stable regions in each key frame
Mean values are computed using two frames on each side of the key frame
Vocabulary: vector quantization using the nearest neighbor algorithm (with the vocabulary found from the ground truth set)
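
Nearest-neighbor vector quantization can be sketched as below: each descriptor is assigned the ID of its closest cluster center, and the frame is then summarized by a histogram of those IDs. The brute-force distance computation and the function names are assumptions for illustration; a real system with thousands of centers would likely use a faster search.

```python
import numpy as np

def quantize(descriptors, centers):
    """Assign each descriptor (n, dim) the index of its nearest center (K, dim)."""
    descriptors = np.asarray(descriptors, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Squared Euclidean distance from every descriptor to every center: (n, K).
    dists = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def frame_histogram(descriptors, centers):
    """K-component vector counting how often each visual word occurs in a frame."""
    word_ids = quantize(descriptors, centers)
    return np.bincount(word_ids, minlength=len(centers))
```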

The expressiveness of the visual vocabulary:
Frames outside the ground truth set contain new objects and scenes, and their detected regions have not been included in forming the clusters

32
Crawling movies summary
Key frames selection: 5k frames
Regions construction (SA and MS)
Frames tracking
Rejecting unstable regions
SIFT descriptors representation
Nearest neighbor for vector quantization
Tf-idf weighting
Stop list
Indexing

33
Google-like Query Object
Generate query descriptor
Use the nearest neighbor algorithm to build the query vector
Use the inverted index to find relevant frames
Doc vectors are sparse → small set
Calculate distance to relevant frames
Rank results
0.1 seconds with a Matlab implementation

34
References

J. Sivic and A. Zisserman. Video Google: A text
retrieval approach to object matching in videos.
In Proc. ICCV, 2003.
S. Brin and L. Page. The anatomy of a large-scale
hypertextual web search engine. In 7th Int. WWW
Conference, 1998.
K. Mikolajczyk and C. Schmid. An affine invariant
interest point detector. In Proc. ECCV.
Springer-Verlag, 2002.
J. Matas, O. Chum, M. Urban, and T. Pajdla.
Robust wide baseline stereo from maximally stable