1
Video Google A Text Retrieval Approach to Object
Matching in Videos
  • Morteza Alamgir
  • Vision Lab

2
Motivation
  • IR: representation, storage, organization of,
    and access to information items
  • Focus is on the user's information need
  • Information item: usually text, but possibly also
    image, audio, video, etc.
  • Presently, most retrieval of non-text items is
    based on searching their textual descriptions.

3
Text retrieval overview
  • Words and documents
  • Vocabulary
  • Weighting
  • Inverted file
  • Ranking

4
Words and Documents
  • Documents are parsed into words
  • Common words (the, an, etc.) are ignored; this
    list of words is called a stop list
  • Words are represented by their stems
  • walk, walking, walks → walk
  • Each word is assigned a unique identifier
  • A document is represented by a vector whose
    components are the frequencies of occurrence of
    the words it contains (see the sketch below)
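
A minimal Python sketch of this parse / stop-list / stem / count
pipeline (the stop list, the toy stemmer, and all names are
illustrative assumptions, not from the slides):

    import re
    from collections import Counter

    STOP_WORDS = {"the", "an", "a", "is", "and", "that", "to", "in"}  # toy stop list

    def stem(word):
        # Toy stemmer; a real system would use e.g. the Porter stemmer.
        for suffix in ("ing", "ed", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    def document_vector(text, vocabulary):
        """Map a document to {word_id: frequency} over a fixed vocabulary of K words."""
        words = [stem(w) for w in re.findall(r"[a-z]+", text.lower())
                 if w not in STOP_WORDS]
        counts = Counter(words)
        return {vocabulary[w]: c for w, c in counts.items() if w in vocabulary}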

5
Vocabulary
  • The vocabulary contains K words
  • Each document is represented by a K-component
    vector of word frequencies
  • (0, 0, 3, 4, ..., 5, 0, 0)

6
Example
  • Representation, detection, and learning are the
    main issues that need to be tackled in designing
    a visual system for recognizing object
    categories.

7
Parse and clean
8
Creating the document vector
  • Assign a unique ID to each word
  • Create a document vector of size K with the word
    frequencies, e.g. (3, 7, 2, ...)
  • Or, more compactly, store only the words that
    occur, keeping their original order and positions

9
Weighting
  • The vector components are weighted in various
    ways
  • Naive - frequency of each word
  • Binary - 1 if the word appears, 0 if not
  • tf-idf - Term Frequency - Inverse Document
    Frequency (see the sketch below)
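
A minimal sketch of the tf-idf weight for one word in one document
(the function and argument names are illustrative):

    import math

    def tfidf(term_freq, doc_len, doc_freq, n_docs):
        """Weight = (occurrences / words in the document)
        * log(total documents / documents containing the word)."""
        tf = term_freq / doc_len
        idf = math.log(n_docs / doc_freq)
        return tf * idf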

10
Inverted File Index
  • Crawling stage
  • Parse all documents to create the vectors that
    represent them
  • Create the word index
  • An entry for each word in the corpus, followed by
    a list of all documents (and the positions within
    them) that contain it (see the sketch below)
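
A minimal sketch of such an index in Python (the data layout is an
assumption):

    from collections import defaultdict

    def build_inverted_index(documents):
        """documents: {doc_id: list of word IDs in their original order}.
        Returns {word_id: [(doc_id, position), ...]}, so a query word maps
        directly to the documents (and positions) that contain it."""
        index = defaultdict(list)
        for doc_id, words in documents.items():
            for pos, word_id in enumerate(words):
                index[word_id].append((doc_id, pos))
        return index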

11
Querying
  • Parse the query to create a query vector, e.g.
    the query "representation learning" → the query
    vector (1, 0, 1, 0, 0, ...)
  • Retrieve the IDs of all documents containing at
    least one of the query word IDs (using the
    inverted file index)
  • Calculate the distance between the query and
    document vectors (the angle between the vectors);
    see the sketch below
  • Rank the results
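
A minimal sketch, assuming the inverted-index layout above and sparse
{word_id: weight} vectors (all names are illustrative):

    import math

    def cosine_similarity(q, d):
        """Angle-based score between two sparse vectors {word_id: weight}."""
        dot = sum(w * d.get(i, 0.0) for i, w in q.items())
        nq = math.sqrt(sum(w * w for w in q.values()))
        nd = math.sqrt(sum(w * w for w in d.values()))
        return dot / (nq * nd) if nq and nd else 0.0

    def run_query(index, doc_vectors, query_vector):
        # Candidates: only the documents sharing at least one word with the query.
        candidates = {doc for w in query_vector for doc, _ in index.get(w, [])}
        scores = {doc: cosine_similarity(query_vector, doc_vectors[doc])
                  for doc in candidates}
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)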

12
Ranking the query results
  • PageRank (PR)
  • Assume page A has pages T1, T2, ..., Tn linking
    to it
  • Define C(X) as the number of links on page X
    (its outgoing links)
  • d is a weighting factor (0 ≤ d ≤ 1); see the
    formula below
  • Word order
  • Font size, font type, and more
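
With these definitions, the PageRank of page A, as given by Brin and
Page (see the references), is

    PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))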

13
The Visual Analogy
  Text        Visual
  Word        ???
  Stem        ???
  Document    Frame
14
Detecting Visual Words
  • Visual word → descriptor
  • What is a good descriptor?
  • Invariant to different viewpoints, scale,
    illumination, shift, and transformation
  • Local versus global
  • How to build such a descriptor?
  • Find invariant regions in the frame
  • Represent them by a descriptor

15
Finding invariant regions
  • Two types of viewpoint-covariant regions are
    computed for each frame
  • SA - Shape Adapted
  • MS - Maximally Stable

16
SA - Shape Adapted
  • Find interest points using the Harris corner
    detector
  • Iteratively determine the ellipse center, scale,
    and shape around each interest point
  • Reference - Baumberg

17
MS - Maximally Stable
  • Intensity watershed image segmentation
  • The regions are those for which the area is
    approximately stationary as the intensity
    threshold is varied.
  • Reference - Matas

18
Why two types of detectors?
  • They are complementary representations of a frame
  • SA regions tend to be centered at corner-like
    features
  • MS regions correspond to blobs of high contrast
    (such as a dark window on a gray wall)
  • Each detector defines a different vocabulary
    (e.g. the building design and the building
    specification); see the sketch below
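
As a rough stand-in for these two detectors, OpenCV's MSER and Harris
functions can be used; the iterative ellipse (affine) adaptation of the
SA detector is not shown, and the parameter values are illustrative:

    import cv2

    gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)

    # MSER approximates the Maximally Stable (MS) regions: high-contrast blobs.
    mser = cv2.MSER_create()
    ms_regions, _ = mser.detectRegions(gray)

    # Harris corners give the corner-like seed points that the Shape Adapted (SA)
    # detector would then refine into elliptical regions.
    response = cv2.cornerHarris(gray.astype("float32"), blockSize=2, ksize=3, k=0.04)
    sa_seeds = response > 0.01 * response.max()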

19
MS - SA example
MS - yellow, SA - cyan
Zoom
20
Building the Descriptors
  • SIFT - Scale Invariant Feature Transform
  • Each elliptical region is represented by a
    128-dimensional SIFT vector [Lowe]
  • SIFT is invariant to a shift of a few pixels
    (which often occurs); see the sketch below
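
A minimal sketch of computing 128-dimensional SIFT descriptors with
OpenCV (it assumes a build that includes SIFT; note the slides compute
SIFT on the detected elliptical regions, whereas this sketch uses
SIFT's own keypoints):

    import cv2

    gray = cv2.imread("keyframe.png", cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    # descriptors has shape (num_keypoints, 128): one 128-D vector per region.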

21
Building the Descriptors
  • Removing noise by tracking and averaging
  • Regions are tracked across a sequence of frames
    using a Constant Velocity Dynamical model
  • Any region which does not survive for more than
    three frames is rejected
  • Descriptors throughout the track are averaged to
    improve the SNR
  • Descriptors with large covariance are rejected
    (see the sketch below)
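
A minimal numpy sketch of the rejection-and-averaging step (the
covariance threshold is a hypothetical value, not from the slides):

    import numpy as np

    MIN_TRACK_LENGTH = 3   # per the slide: must survive more than three frames
    MAX_VARIANCE = 0.05    # hypothetical threshold for "large covariance"

    def stable_descriptors(tracks):
        """tracks: list of arrays of shape (track_length, 128) holding the SIFT
        descriptors of one region tracked across consecutive frames."""
        kept = []
        for track in tracks:
            if len(track) <= MIN_TRACK_LENGTH:
                continue                         # track too short: rejected
            if track.var(axis=0).mean() > MAX_VARIANCE:
                continue                         # descriptor unstable along the track
            kept.append(track.mean(axis=0))      # averaging improves the SNR
        return np.array(kept)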

22
The Visual Analogy
  Text        Visual
  Word        Descriptor
  Document    Frame
23
Building the Visual Stems
  • Cluster the descriptors into K groups using the
    K-means clustering algorithm (see the sketch
    below)
  • Each cluster represents a visual word in the
    visual vocabulary
  • Result:
  • 10K SA clusters
  • 16K MS clusters
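
A minimal sketch with scikit-learn, assuming the averaged descriptors
sit in two hypothetical (n_regions, 128) arrays:

    import numpy as np
    from sklearn.cluster import KMeans

    sa_descriptors = np.load("sa_descriptors.npy")   # hypothetical file names
    ms_descriptors = np.load("ms_descriptors.npy")

    # Separate vocabularies, with cluster counts matching the slide.
    sa_vocab = KMeans(n_clusters=10_000).fit(sa_descriptors)
    ms_vocab = KMeans(n_clusters=16_000).fit(ms_descriptors)

    # The visual word of a descriptor is the index of its nearest cluster center.
    sa_words = sa_vocab.predict(sa_descriptors)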

24
K-Means Clustering
  • Input:
  • A set of n unlabeled examples D = {x1, x2, ..., xn}
    in d-dimensional feature space
  • Number of clusters - K
  • Objective:
  • Find the partition of D into K non-empty disjoint
    subsets
  • So that the points in each subset are coherent
    according to a certain criterion (see below)
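
The usual criterion is the within-cluster sum of squared distances:
find the partition S1, ..., SK that minimizes

    Σ_k Σ_{x ∈ Sk} ||x − μk||²,   where μk is the mean of the points in Sk.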

25
MS and SA Visual Words
SA
MS
26
The Visual Analogy
  Text        Visual
  Word        Descriptor
  Document    Frame
27
Visual Stop List
  • The most frequent visual words, those that occur
    in almost all images, are suppressed (see the
    sketch below)

Before stop list
After stop list
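
A minimal sketch of building such a stop list (the cut-off fraction is
a hypothetical choice, not from the slides):

    from collections import Counter

    def build_stop_list(frame_vectors, top_fraction=0.05):
        """frame_vectors: {frame_id: {visual_word_id: count}}.
        Returns the visual words with the highest document frequency."""
        doc_freq = Counter()
        for vec in frame_vectors.values():
            doc_freq.update(vec.keys())
        n_stop = int(len(doc_freq) * top_fraction)
        return {word for word, _ in doc_freq.most_common(n_stop)}
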
28
Ranking Frames
  • Distance between vectors (as with word/document
    vectors)
  • Spatial consistency (analogous to word order in
    text); see the sketch below
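
A toy sketch of the spatial-consistency idea (not the exact scheme on
the slides): a pair of matched regions supports a frame if the two
regions are close together both in the query and in the retrieved
frame; all names and the radius are illustrative:

    import numpy as np

    def spatial_consistency_score(matches, query_xy, frame_xy, radius=50.0):
        """matches: list of (query_region_idx, frame_region_idx) pairs that share
        a visual word; query_xy, frame_xy: arrays of region center coordinates."""
        score = 0
        for i, (qa, fa) in enumerate(matches):
            for qb, fb in matches[i + 1:]:
                if (np.linalg.norm(query_xy[qa] - query_xy[qb]) < radius and
                        np.linalg.norm(frame_xy[fa] - frame_xy[fb]) < radius):
                    score += 1
        return score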

29
Visual Google process
  • Preprocessing
  • Vocabulary building
  • Crawling Frames
  • Creating Stop list
  • Querying
  • Building query vector
  • Ranking results

30
Vocabulary building
  • Subset of 48 shots is selected: 10k frames,
    about 10% of the movie
  • Regions construction (SA + MS): 1.6E6 regions
    over the 10k frames
  • Frame tracking and rejection of unstable regions:
    1.6E6 → 200k regions
  • SIFT descriptor representation
  • Clustering the descriptors using the k-means
    algorithm
  • Parameter tuning is done with the ground-truth
    set
31
Crawling Implementation
  • To reduce complexity, one key frame per second is
    selected (100-150k frames → 5k key frames)
  • Descriptors are computed for stable regions in
    each key frame
  • Mean values are computed using two frames on each
    side of the key frame
  • Vocabulary: vector quantization using the
    nearest-neighbor algorithm, against the cluster
    centers found from the ground-truth set (see the
    sketch below)
  • This tests the expressiveness of the visual
    vocabulary: frames outside the ground-truth set
    contain new objects and scenes, and their detected
    regions were not included in forming the clusters
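
A minimal sketch of this nearest-neighbor quantization step with scipy
(array names are illustrative):

    from scipy.spatial import cKDTree

    def quantize(descriptors, cluster_centers):
        """descriptors: (n, 128) stable-region descriptors of one key frame;
        cluster_centers: (K, 128) visual-word centers learned on the
        ground-truth shots. Returns the visual-word ID of each descriptor."""
        tree = cKDTree(cluster_centers)
        _, word_ids = tree.query(descriptors)
        return word_ids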

32
Crawling movies summary
  • Key frame selection → 5k frames
  • Regions construction (SA + MS)
  • Frame tracking and rejection of unstable regions
  • SIFT descriptor representation
  • Nearest-neighbor vector quantization
  • tf-idf weighting
  • Stop list
  • Indexing
33
Finding frames like the query object
  • Generate the query descriptors
  • Use the nearest-neighbor algorithm to build the
    query vector
  • Use the inverted index to find the relevant
    frames (document vectors are sparse → a small set)
  • Calculate the distance to the relevant frames
  • Rank the results
  • About 0.1 seconds in a Matlab implementation
34
References
  • J. Sivic and A. Zisserman. Video Google: A Text
    Retrieval Approach to Object Matching in Videos.
    In Proc. International Conference on Computer
    Vision, 2003.
  • S. Brin and L. Page. The anatomy of a large-scale
    hypertextual web search engine. In 7th Int. WWW
    Conference, 1998.
  • K. Mikolajczyk and C. Schmid. An affine invariant
    interest point detector. In Proc. ECCV.
    Springer-Verlag, 2002.
  • J. Matas, O. Chum, M. Urban, and T. Pajdla.
    Robust wide baseline stereo from maximally stable
    extremal regions. In Proc. BMVC, 2002.

35
  • Any questions?