Title: Video Google: A Text Retrieval Approach to Object Matching in Videos
1 Video Google: A Text Retrieval Approach to Object Matching in Videos
- Morteza Alamgir
- Vision Lab
2 Motivation
- IR (information retrieval): representation, storage, organization of, and access to information items
- Focus is on the user's information need
- Information item: usually text, but possibly also image, audio, video, etc.
- Presently, most retrieval of non-text items is based on searching their textual descriptions.
3 Text retrieval overview
- Words & Documents
- Vocabulary
- Weighting
- Inverted file
- Ranking
4 Words & Documents
- Documents are parsed into words
- Common words are ignored (the, an, etc.)
- This is called a stop list
- Words are represented by their stems
- walk, walking, walks → walk
- Each word is assigned a unique identifier
- A document is represented by a vector with components given by the frequency of occurrence of the words it contains
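
The parsing steps above can be sketched in a few lines of Python. This is only an illustration: the stop list `STOP_WORDS` and the suffix-stripping `naive_stem` are hypothetical stand-ins, and a real system would use a much longer stop list and a proper stemmer.

```python
import re

# Hypothetical stop list, chosen for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "are", "in", "to", "of", "that", "be", "for", "is"}

def naive_stem(word):
    """Very rough suffix stripping (an illustrative stand-in for a real stemmer)."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def parse(text):
    """Lowercase the text, split it into words, drop stop words, reduce to stems."""
    words = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOP_WORDS]

print(parse("He walks, she is walking, and the dogs walk"))
```

With this toy stemmer, "walk", "walking" and "walks" all map to the same stem, which is the behaviour the slide describes.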
5 Vocabulary
- The vocabulary contains K words
- Each document is represented by a K-component vector of word frequencies
- (0, 0, ..., 3, ..., 4, ..., 5, 0, 0)
6 Example
- Representation, detection and learning are the main issues that need to be tackled in designing a visual system for recognizing object categories.
7 Parse and clean
8 Creating document vector ID
- Assign a unique ID to each word
- Create a document vector of size K with word frequencies
- (3, 7, 2, ...)/789
- Or compactly, with the original order and position
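
As a rough sketch of this step, the snippet below assigns IDs in order of first appearance and builds a K-component frequency vector. The function names are hypothetical, and dividing the counts by the document length is an assumption about what the "/789" on the slide stands for.

```python
from collections import Counter

def build_vocabulary(documents):
    """Assign a unique integer ID to each distinct word, in order of first appearance."""
    vocab = {}
    for words in documents:
        for w in words:
            if w not in vocab:
                vocab[w] = len(vocab)
    return vocab

def document_vector(words, vocab):
    """Return a K-component vector of word counts divided by the document length."""
    counts = Counter(words)
    total = len(words)
    vec = [0.0] * len(vocab)
    for w, c in counts.items():
        if w in vocab:
            vec[vocab[w]] = c / total  # assumed normalization by document length
    return vec

docs = [
    ["representation", "detection", "learning"],
    ["learning", "visual", "learning", "object"],
]
vocab = build_vocabulary(docs)
vectors = [document_vector(d, vocab) for d in docs]
print(vocab)
print(vectors[1])
```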
9 Weighting
- The vector components are weighted in various ways
- Naive: frequency of each word
- Binary: 1 if the word appears, 0 if not
- tf-idf: Term Frequency, Inverse Document Frequency
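
One common form of tf-idf, and the one used in the Sivic and Zisserman paper, weights word i in document d as (n_id / n_d) * log(N / n_i), where n_id is the number of occurrences of the word in the document, n_d the number of words in the document, n_i the number of documents containing the word, and N the number of documents. A minimal sketch under that assumption (the function name and the toy data are hypothetical):

```python
import math
from collections import Counter

def tf_idf_vectors(documents, vocab):
    """documents: list of word lists; vocab: word -> index. Returns one weighted vector per document."""
    n_docs = len(documents)
    # n_i: number of documents that contain word i
    doc_freq = Counter(w for words in documents for w in set(words) if w in vocab)
    vectors = []
    for words in documents:
        counts = Counter(w for w in words if w in vocab)
        n_d = sum(counts.values())
        vec = [0.0] * len(vocab)
        for w, n_id in counts.items():
            vec[vocab[w]] = (n_id / n_d) * math.log(n_docs / doc_freq[w])
        vectors.append(vec)
    return vectors

docs = [["walk", "dog", "walk"], ["dog", "cat"], ["dog", "bird"]]
vocab = {"walk": 0, "dog": 1, "cat": 2, "bird": 3}
weighted = tf_idf_vectors(docs, vocab)
for vec in weighted:
    print([round(x, 3) for x in vec])
```

A word that appears in every document ("dog" here) gets weight zero, while a word that is frequent in one document but rare in the corpus is weighted up.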
10 Inverted File Index
- Crawling stage
- Parsing all documents to create the vectors that represent them
- Creating word indices
- An entry for each word in the corpus, followed by a list of all documents (and positions in them) where it occurs
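
A minimal sketch of such an index, assuming documents are already parsed into word lists. The nested-dictionary layout is one possible choice, not necessarily how any particular search engine stores it.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each word to {doc_id: [positions]} for every document that contains it."""
    index = defaultdict(dict)
    for doc_id, words in enumerate(documents):
        for pos, w in enumerate(words):
            index[w].setdefault(doc_id, []).append(pos)
    return index

docs = [["walk", "dog", "walk"], ["dog", "cat"]]
index = build_inverted_index(docs)
print(dict(index))
```

Looking up a word then returns only the documents that contain it, so a query never has to scan the whole corpus.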
11 Querying
- Parse the query to create the query vector
- Query: "Representation learning" → query vector (1, 0, 1, 0, 0, ...)
- Retrieve the IDs of all documents containing at least one of the query word IDs (using the inverted file index)
- Calculate the distance between the query and document vectors (angle between vectors)
- Rank the results
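
The querying steps can be sketched as follows. The data layout is an assumption made for the example: `doc_vectors` maps a document ID to its vector, and `inverted_index` maps a word ID to the set of document IDs containing that word. The angle between vectors is measured through its cosine.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def search(query_vec, doc_vectors, inverted_index):
    """Return (score, doc_id) pairs for candidate documents, best match first."""
    candidates = set()
    for word_id, weight in enumerate(query_vec):
        if weight:
            candidates |= inverted_index.get(word_id, set())
    scored = [(cosine_similarity(query_vec, doc_vectors[d]), d) for d in candidates]
    return sorted(scored, key=lambda s: (-s[0], s[1]))

doc_vectors = {0: [1, 0, 1, 0, 0], 1: [0, 2, 1, 0, 0], 2: [0, 0, 0, 3, 1]}
inverted_index = {0: {0}, 1: {1}, 2: {0, 1}, 3: {2}, 4: {2}}
query = [1, 0, 1, 0, 0]
results = search(query, doc_vectors, inverted_index)
print(results)
```

Document 2 shares no word with the query, so it is never scored at all.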
12 Ranking the query results
- Page Rank (PR)
- Assume page A has pages T1, T2, ..., Tn that link to it
- Define C(X) as the number of links going out of page X
- d is a weighting factor (0 ≤ d ≤ 1)
- Word order
- Font size, font type and more
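
With the PageRank symbols defined above, the formula given by Brin and Page is PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)). The sketch below iterates that formula on a tiny hypothetical link graph. The value d = 0.85 is a commonly quoted choice, and the fixed iteration count is an assumption made for simplicity.

```python
def page_rank(links, d=0.85, iterations=50):
    """links: page -> list of pages it links to. Returns page -> PR value."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum PR(T)/C(T) over every page T that links to this page
            incoming = sum(
                pr[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = page_rank(links)
for page, value in ranks.items():
    print(page, round(value, 3))
```

In this toy graph, page C is linked from both A and B and ends up ranked above B, which is linked only from A.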
13 The Visual Analogy
Text        Visual
Word        ???
Stem        ???
Document    Frame
14 Detecting Visual Words
- Visual word → Descriptor
- What is a good descriptor?
- Invariant to different viewpoints, scale, illumination, shift and transformation
- Local versus global
- How to build such a descriptor?
- Finding invariant regions in the frame
- Representation by a descriptor
15 Finding invariant regions
- Two types of viewpoint covariant regions are computed for each frame
- SA - Shape Adapted
- MS - Maximally Stable
16 SA - Shape Adapted
- Finding interest points using the Harris corner detector
- Iteratively determining the ellipse center, scale and shape around the interest point
- Reference: Baumberg
17 MS - Maximally Stable
- Intensity watershed image segmentation
- The regions are those for which the area is approximately stationary as the intensity threshold is varied
- Reference: Matas
18 Why two types of detectors?
- They are complementary representations of a frame
- SA regions tend to be centered on corner-like features
- MS regions correspond to blobs of high contrast (such as a dark window on a gray wall)
- Each detector describes a different vocabulary (e.g. the building design and the building specification)
19 MS - SA example
[Figure: detected regions, MS in yellow and SA in cyan, with a zoomed view]
20 Building the Descriptors
- SIFT: Scale Invariant Feature Transform
- Each elliptical region is represented by a 128-dimensional vector (Lowe)
- SIFT is invariant to a shift of a few pixels (which often occurs)
21 Building the Descriptors
- Removing noise: tracking and averaging
- Regions are tracked across a sequence of frames using a constant velocity dynamical model
- Any region which does not survive for more than three frames is rejected
- Descriptors throughout each track are averaged to improve SNR
- Descriptors with large covariance are rejected
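
The filtering and averaging can be sketched as below, assuming the tracking itself has already been done and each track is simply a list of descriptor vectors, one per frame. The function name, the scalar variance test used as a stand-in for "large covariance", and the threshold `max_variance` are all hypothetical.

```python
def average_tracks(tracks, min_frames=3, max_variance=0.1):
    """tracks: list of tracks, each a list of descriptor vectors (one per frame).
    Returns one averaged descriptor per surviving track."""
    averaged = []
    for track in tracks:
        if len(track) <= min_frames:
            continue  # region did not survive for more than min_frames frames
        n = len(track)
        dim = len(track[0])
        mean = [sum(desc[i] for desc in track) / n for i in range(dim)]
        # Mean per-dimension variance: a simple stand-in for a covariance test
        variance = sum(
            (desc[i] - mean[i]) ** 2 for desc in track for i in range(dim)
        ) / (n * dim)
        if variance > max_variance:
            continue  # descriptor varies too much along the track
        averaged.append(mean)
    return averaged

tracks = [
    [[1.0, 2.0], [1.0, 2.0]],                          # too short: rejected
    [[1.0, 2.0], [1.5, 2.0], [0.5, 2.0], [1.0, 2.0]],  # stable: kept
    [[0.0, 0.0], [5.0, 5.0], [0.0, 0.0], [5.0, 5.0]],  # noisy: rejected
]
stable = average_tracks(tracks)
print(stable)
```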
22 The Visual Analogy
Text        Visual
Word        Descriptor
Document    Frame
23 Building the Visual Stems
- Cluster the descriptors into K groups using the K-means clustering algorithm
- Each cluster represents a visual word in the visual vocabulary
- Result
- 10K SA clusters
- 16K MS clusters
24 K-Means Clustering
- Input
- A set of n unlabeled examples D = {x1, x2, ..., xn} in d-dimensional feature space
- Number of clusters: K
- Objective
- Find the partition of D into K non-empty disjoint subsets
- So that the points in each subset are coherent according to a certain criterion
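
A minimal sketch of the standard K-means iteration (Lloyd's algorithm) for the clustering problem stated above, using squared Euclidean distance as the coherence criterion. Random initialization from the data points, the fixed iteration count and the handling of empty clusters are simplifying assumptions.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, iterations=20, seed=0):
    """Partition points into k clusters; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its points
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignments = k_means(points, k=2)
print(assignments)
print(centroids)
```

In the visual setting the points would be the 128-dimensional SIFT descriptors, and each resulting centroid would play the role of one visual word.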
25 MS and SA Visual Words
[Figure: examples of SA and MS visual words]
26 The Visual Analogy
Text        Visual
Word        Descriptor
Document    Frame
27 Visual Stop List
- The most frequent visual words, which occur in almost all images, are suppressed
[Figure: before the stop list vs. after the stop list]
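
One simple way to build such a stop list is to count in how many frames each visual word occurs and suppress the top fraction. In the sketch below the function names and the `top_fraction` parameter are hypothetical, and frames are assumed to be lists of visual word IDs.

```python
from collections import Counter

def build_stop_list(frames, top_fraction=0.05):
    """frames: list of visual-word-ID lists (one per frame).
    Returns the most frequent word IDs, ranked by the number of frames containing them."""
    frame_freq = Counter(w for words in frames for w in set(words))
    n_stop = max(1, int(len(frame_freq) * top_fraction))
    return {w for w, _ in frame_freq.most_common(n_stop)}

def apply_stop_list(frames, stop_list):
    """Remove every stopped word ID from every frame."""
    return [[w for w in words if w not in stop_list] for words in frames]

frames = [[1, 2, 3], [1, 4], [1, 2, 5], [1, 6]]
stop = build_stop_list(frames, top_fraction=0.2)
filtered = apply_stop_list(frames, stop)
print(stop)
print(filtered)
```

Word 1 occurs in every frame, so it carries no discriminating information and is the one removed here.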
28 Ranking Frames
- Distance between vectors (as with words and documents)
- Spatial consistency (analogous to word order in text)
29 Visual Google process
- Preprocessing
  - Vocabulary building
  - Crawling frames
  - Creating the stop list
- Querying
  - Building the query vector
  - Ranking results
30 Vocabulary building
- A subset of 48 shots is selected: 10k frames, about 10% of the movie
- Regions construction (SA and MS): 10k frames × 1600 = 1.6E6 regions
- SIFT descriptors representation
- Frames tracking
- Rejecting unstable regions: 1.6E6 → 200k regions
- Clustering the descriptors using the k-means algorithm
- Parameter tuning is done with the ground truth set
31 Crawling Implementation
- To reduce complexity, one keyframe per second is selected (100-150k frames → 5k frames)
- Descriptors are computed for stable regions in each keyframe
- Mean values are computed using two frames on each side of the keyframe
- Vocabulary: vector quantization using the nearest neighbor algorithm (vocabulary found from the ground truth set)
- The expressiveness of the visual vocabulary
- Frames outside the ground truth set contain new objects and scenes, and their detected regions have not been included in forming the clusters
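
Nearest neighbor vector quantization, as used in the crawling stage above, can be sketched as follows: each descriptor is replaced by the index of the closest cluster centre, which serves as its visual word ID. The function names and the two-dimensional toy centroids are hypothetical; real descriptors would be 128-dimensional SIFT vectors.

```python
def quantize(descriptor, centroids):
    """Return the index of the nearest cluster centre, i.e. the visual word ID."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((d - c) ** 2 for d, c in zip(descriptor, centroids[i])),
    )

def frame_to_words(descriptors, centroids):
    """Turn the descriptors of one frame into a list of visual word IDs."""
    return [quantize(d, centroids) for d in descriptors]

centroids = [[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]]
frame_descriptors = [[0.5, 0.2], [9.0, 11.0], [1.0, 9.0], [0.1, 0.1]]
words = frame_to_words(frame_descriptors, centroids)
print(words)
```

Because every descriptor is forced onto an existing centre, regions from objects that were not in the ground truth set are still assigned a word, which is why the expressiveness of the vocabulary matters.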
32 Crawling movies summary
- Key frames selection (5k frames)
- Regions construction (SA and MS)
- SIFT descriptors representation
- Frames tracking
- Rejecting unstable regions
- Nearest neighbor for vector quantization
- tf-idf weighting
- Stop list
- Indexing
33 Google-like Query Object
- Generate the query descriptors
- Use the nearest neighbor algorithm to build the query vector
- Use the inverted index to find relevant frames
- Document vectors are sparse → small set
- Calculate the distance to the relevant frames
- Rank the results
- 0.1 seconds with a Matlab implementation
34 Reference
- J. Sivic and A. Zisserman. Video Google: A Text Retrieval Approach to Object Matching in Videos. In Proceedings of the International Conference on Computer Vision, 2003.
- S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In 7th Int. WWW Conference, 1998.
- K. Mikolajczyk and C. Schmid. An affine invariant interest point detector. In Proc. ECCV. Springer-Verlag, 2002.
- J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable