Extraction of Text Objects in Video Documents: Recent Progress
(PowerPoint presentation transcript)
1
Extraction of Text Objects in Video Documents: Recent Progress
  • Jing Zhang and Rangachar Kasturi
  • University of South Florida
  • Department of Computer Science and Engineering

2
Acknowledgements
The work presented here is that of numerous
researchers from around the world. We thank
them for their contributions towards the
advances in video document processing. In
particular we would like to thank the authors of
papers whose work is cited in this presentation
and in our paper.
3
Outline
  • Introduction
  • Recent Progress
  • Performance Evaluation
  • Discussion

4
Introduction
  • Since the 1990s, with the rapid growth of
    available multimedia documents and the increasing
    demand for information indexing and retrieval,
    much effort has been devoted to text extraction
    in images and videos.

5
Introduction
  • Text Extraction in Video
  • Text consists of words, which are well-defined
    models of concepts for human communication.
  • Text objects embedded in video contain much
    semantic information related to the multimedia
    content.
  • Text extraction techniques play an important role
    in content-based multimedia information indexing
    and retrieval.

6
Introduction
  • Extracting text in video presents unique
    challenges compared to scanned documents:

Cons: low contrast; low resolution; color bleeding; unconstrained backgrounds; unknown text color, size, position, orientation, and layout.
Pros: temporal redundancy (text in video usually persists for at least several seconds, giving human viewers the time needed to read it).
7
Introduction
  • Caption Text: text artificially superimposed on
    the video at the time of editing.
  • Scene Text: text that naturally occurs in the
    field of view of the camera during video capture.
  • The extraction of scene text is a much harder
    task due to varying lighting, complex motion, and
    transformations.

(Example images: scene text and caption text)
8
Introduction
  • Five stages of text extraction in video:
  • 1) Text Detection: finding regions in a video
    frame that contain text
  • 2) Text Localization: grouping text regions into
    text instances and generating a set of tight
    bounding boxes around all text instances
  • 3) Text Tracking: following a text event as it
    moves or changes over time, and determining the
    temporal and spatial locations and extents of
    text events
  • 4) Text Binarization: binarizing the text bounded
    by text regions, marking text as one binary
    level and background as the other
  • 5) Text Recognition: performing OCR on the
    binarized text image

9
Introduction
Video Clips → Text Detection → Text Localization → Text Tracking → Text Binarization → Text Recognition → Text Objects
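
A minimal sketch of this pipeline as code (hypothetical function and type names, not from any cited paper; the stage bodies are placeholders, shown only to make the data flow between the five stages concrete):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Box:
    x: int
    y: int
    w: int
    h: int

@dataclass
class TextEvent:
    event_id: int                                    # unique identity of the text event
    frames: List[int] = field(default_factory=list)  # frame indices covered
    boxes: List[Box] = field(default_factory=list)   # one box per covered frame
    transcript: str = ""

def detect(frame) -> List[Box]: ...                  # 1) regions that contain text
def localize(frame, regions) -> List[Box]: ...       # 2) tight line-level boxes
def track(per_frame_boxes) -> List[TextEvent]: ...   # 3) temporal grouping into events
def binarize(frame, box): ...                        # 4) text/background binary mask
def recognize(patch) -> str: ...                     # 5) OCR on the binarized patch

def extract_text(frames) -> List[TextEvent]:
    per_frame = [localize(f, detect(f)) for f in frames]
    events = track(per_frame)
    for ev in events:
        patches = [binarize(frames[t], b) for t, b in zip(ev.frames, ev.boxes)]
        ev.transcript = " ".join(recognize(p) for p in patches)
    return events
```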
10
Introduction
  • The goal of text detection, text localization,
    and text tracking is to generate accurate bounding
    boxes for all text objects in video frames and to
    assign a unique identity to each text event, i.e.,
    the same text object appearing in a sequence of
    consecutive frames.

11
Introduction
  • This presentation concentrates on the approaches
    proposed for text extraction in videos during the
    most recent five years, summarizing and discussing
    the recent progress in this research area.

12
Introduction
  • Region-Based Approach: utilizes the differing
    region properties of text and background to
    extract text objects.
  • Bottom-up: separating the image into small
    regions and then grouping character regions into
    text regions (see the sketch after this list).
  • Color features, edge features, and connected
    component methods
  • Texture-Based Approach: uses the distinct texture
    properties of text to extract text objects from
    the background.
  • Top-down: extracting texture features of the
    image and then locating text regions.
  • Spatial variance, Fourier transform, wavelet
    transform, and machine learning methods.
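
A minimal bottom-up sketch of the region-based idea (using OpenCV connected components; the size thresholds and grouping rules are illustrative assumptions, not values from any cited paper):

```python
import cv2
import numpy as np

def candidate_text_lines(gray: np.ndarray) -> list:
    """Bottom-up sketch: binarize, find connected components, keep
    character-sized ones, then merge nearby components into text lines."""
    # Otsu thresholding separates high-contrast strokes from the background.
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    n, _, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    chars = []
    for i in range(1, n):  # label 0 is the background
        x, y, w, h, area = stats[i]
        # Heuristic character filter: plausible size and fill (assumed values).
        if 8 <= h <= 100 and 2 <= w <= 100 and area > 15:
            chars.append((x, y, w, h))
    chars.sort()  # left-to-right order
    lines = []
    for x, y, w, h in chars:
        if lines:
            lx, ly, lw, lh = lines[-1]
            # Merge if horizontally close and roughly vertically aligned.
            if x - (lx + lw) < h and abs(y - ly) < h // 2:
                ny = min(ly, y)
                lines[-1] = (lx, ny, x + w - lx, max(ly + lh, y + h) - ny)
                continue
        lines.append((x, y, w, h))
    return lines
```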

13
Outline
  • Introduction
  • Recent Progress
  • Performance Evaluation
  • Discussion

14
Recent Progress
  • Text extraction in video documents, as an
    important research branch of content-based
    information retrieval and indexing, continues to
    be a topic of much interest to researchers.
  • A large number of newly proposed approaches in
    the literature have contributed to impressive
    progress in text extraction techniques.

15
Recent Progress
  • Prior to 2003
  • Only a few text extraction approaches
    considered the temporal nature of video.
  • Very little work was done on scene text.
  • Objective performance evaluation metrics were
    scarce.
  • Now
  • Temporal redundancy of video is utilized by
    almost all recent text extraction approaches.
  • Scene text extraction is being extensively
    studied.
  • A comprehensive performance evaluation
    framework has been developed.

16
Recent Progress
  • The progress of text extraction in videos can be
    categorized into three types
  • New and improved text extraction approaches
  • Text extraction techniques adopted from other
    research fields
  • Text extraction approaches proposed for specific
    text types and specific genres of video documents

17
Recent Progress
  • New and improved text extraction approaches
  • The new and improved approaches play an
    important role in the recent progress of text
    extraction techniques for videos. These new
    approaches introduce not only new algorithms but
    also new understanding of the problem.

18
Recent Progress - New and improved text extraction approaches
  • H. Tran, A. Lux, H.L. Nguyen T. and A.
    Boucher, A novel approach for text detection in
    images using structural features, The 3rd
    International Conference on Advances in Pattern
    Recognition, LNCS Vol. 3686, pp. 627-635, 2005

A text string is modeled by its center line and
the skeletons of its characters, using ridges at
different hierarchical scales.
First row: images with rectangles showing the
text region. Second row: zoom on the text regions.
Third row: ridges detected at two scales (red at
the coarse scale, blue at the fine scale) in the
text region, representing local structures of text
lines regardless of the type of text.
19
H. Tran, A. Lux, H.L. Nguyen T. and A. Boucher, A
novel approach for text detection in images using
structural features, The 3rd International
Conference on Advances in Pattern Recognition,
LNCS Vol. 3686, pp. 627-635, 2005
  • Abstract. We propose a novel approach for finding
    text in images by using ridges at several scales.
    A text string is modelled by a ridge at a coarse
    scale representing its center line and numerous
    short ridges at a smaller scale representing the
    skeletons of characters. Skeleton ridges have to
    satisfy geometrical and spatial constraints such
    as the perpendicularity or non-parallelism to the
    central ridge. In this way, we obtain a
    hierarchical description of text strings, which
    can provide direct input to an OCR or a text
    analysis system. The proposed method does not
    depend on a particular alphabet, works with a
    wide variety of character sizes, and does not
    depend on the orientation of the text string. The
    experimental results show good detection performance.
  • X. Liu, H. Fu and Y. Jia. Gaussian Mixture
    Modeling and learning of Neighbor Characters for
    Multilingual Text Extraction in Images, Pattern
    Recognition, Vol. 41, pp. 484-493, 2008.

Abstract: This paper proposes an approach based
on the statistical modeling and learning of
neighboring characters to extract multilingual
texts in images. The case of three neighboring
characters is represented as the Gaussian mixture
model and discriminated from other cases by the
corresponding pseudo-probability defined under
Bayes framework. Based on this modeling, text
extraction is completed through labeling each
connected component in the binary image as
character or non-character according to its
neighbors, where a mathematical morphology based
method is introduced to detect and connect the
separated parts of each character, and a Voronoi
partition based method is advised to establish
the neighborhoods of connected components. We
further present a discriminative training
algorithm based on the maximum-minimum similarity
(MMS) criterion to estimate the parameters in the
proposed text extraction approach. Experimental
results in Chinese and English text extraction
demonstrate the effectiveness of our approach
trained with the MMS algorithm, which achieved
the precision rate of 93.56% and the recall rate
of 98.55% for the test data set. In the
experiments, we also show that the MMS provides
significant improvement of overall performance,
compared with the influential training criteria of
the maximum likelihood (ML) and the maximum
classification error (MCE).
20
Recent Progress - New and improved text extraction approaches
  • X. Liu, H. Fu and Y. Jia, Gaussian Mixture
    Modeling and learning of Neighbor Characters for
    Multilingual Text Extraction in Images, Pattern
    Recognition, Vol. 41, pp. 484-493, 2008.

The GMM-based algorithm models the text features
of three neighboring characters as a mixture of
three Gaussians to extract text objects.
An example of neighborhood computation. In each
figure, image (a) shows a binary image, where
black dots denote centroids of CCs; image (b)
shows the Delaunay triangulation of the centroids,
where each triangle corresponds to a neighbor set.
However, the neighborhoods of characters cannot be
completely reflected in the Delaunay
triangulation. Image (c) shows the solution: all
triples of nodes joined one by one in the convex
hull of the centroid set are also taken as
neighbor sets.
21
Recent Progress - New and improved text extraction approaches
  • P. Dubey, Edge Based Text Detection for
    Multi-purpose Application, Proceedings of
    International Conference on Signal Processing, IEEE,
    Vol. 4, 2006

Only vertical edge features are utilized to find
text regions, based on the observation that
vertical edges enhance the characteristics of
text and eliminate most irrelevant information.
(a) Original image, (b) detected group of
vertical lines, (c) extracted text region, (d)
result
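
A rough sketch of the vertical-edge idea (vertical Sobel response plus morphological closing to merge character edges into candidate regions; the paper's own morphological operator, segmentation, and verification steps are not reproduced, and the thresholds are assumptions):

```python
import cv2
import numpy as np

def vertical_edge_candidates(gray: np.ndarray) -> list:
    # The horizontal derivative responds to vertical edges, which are dense
    # inside character strokes and sparse in most backgrounds.
    dx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    edges = (np.abs(dx) > 80).astype(np.uint8) * 255  # assumed threshold
    # Close small horizontal gaps so edges of neighboring characters merge.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 3))
    merged = cv2.morphologyEx(edges, cv2.MORPH_CLOSE, kernel)
    contours, _ = cv2.findContours(merged, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 100]
```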
22
Recent Progress - New and improved text extraction approaches
  • K. Subramanian, P. Natarajan, M. Decerbo, and D.
    Castanon, Character-Stroke Detection for
    Text-Localization and Extraction, Proceedings of
    Ninth International Conference on Document
    Analysis and Recognition, IEEE, pp. 33-37, 2007

Character strokes are used to extract text objects
by applying three line scans (sets of pixels
along horizontal lines of an intensity image) to
detect image intensity changes.
(a) Original image; (b) intensity plots along the
blue line scan and its two neighbors offset by the
stroke width; (c) the image thresholded at 0.35;
(d) the thresholded image after morphological
operations and connected component analysis.
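
A simplified single-scan sketch of the stroke idea: on one line scan, a stroke shows up as a sharp intensity transition followed by an opposite transition within a maximum stroke width (the paper's use of three parallel scans and its exact threshold are omitted; the parameters here are assumptions):

```python
import numpy as np

def strokes_on_line(row: np.ndarray, grad_thresh: float = 40.0,
                    max_width: int = 20) -> list:
    """Return (start, end) pixel pairs on one horizontal scan line that look
    like character strokes: a sharp rise followed by a sharp fall (or the
    reverse) within max_width pixels, i.e. a roughly constant stroke width."""
    g = np.diff(row.astype(np.float64))
    strokes, open_edge = [], None
    for x, dv in enumerate(g):
        if abs(dv) < grad_thresh:
            continue  # ignore weak transitions
        if open_edge is None:
            open_edge = (x, np.sign(dv))
        else:
            x0, s0 = open_edge
            if np.sign(dv) == -s0 and x - x0 <= max_width:
                strokes.append((x0, x))   # opposite transition closes a stroke
                open_edge = None
            else:
                open_edge = (x, np.sign(dv))  # restart from this edge
    return strokes
```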
23
P. Dubey, Edge Based Text Detection for
Multi-purpose Application, Proceedings of
International Conference on Signal Processing, IEEE,
Vol. 4, 2006
  • Abstract: Text detection plays a crucial role in
    various applications. In this paper we present an
    edge-based text detection technique for complex
    images in multi-purpose applications. The
    technique applies vertical Sobel edge detection
    and a newly proposed morphological technique
    used to connect the edges to form candidate
    regions. The technique has the special advantage
    of providing a distinguishable texture on the
    text area over the others. The connected
    components are then extracted using a proposed
    segmentation algorithm. All the candidate regions
    are then verified to identify the text regions.
    The proposed technique has been tested with
    different types of images acquired from different
    input sources and environments. The experimental
    results show a high success rate.
  • K. Subramanian, P. Natarajan, M. Decerbo, and D.
    Castanon, Character-Stroke Detection for
    Text-Localization and Extraction, Proceedings of
    Ninth International Conference on Document
    Analysis and Recognition, IEEE, pp. 33-37, 2007

Abstract: In this paper, we present a new
approach for analysis of images for
text-localization and extraction. Our approach
puts very few constraints on the font, size and
color of text and is capable of handling both
scene text and artificial text well. In this paper,
we exploit two well-known features of text,
approximately constant stroke width and local
contrast, and develop a fast, simple, and
effective algorithm to detect character strokes.
We also show how these can be used for accurate
extraction and motivate some advantages of using
this approach for text localization over other
colorspace segmentation based approaches. We
analyze the performance of our stroke detection
algorithm on images collected for the
robust-reading competitions at ICDAR 2003.
24
Recent Progress - New and improved text extraction approaches
D. Crandall, S. Antani, R. Kasturi,
Extraction of special effects caption text events
from digital video, International Journal on
Document Analysis and Recognition, Vol. 5, pp.
138-157, 2003
8 × 8 block-wise DCT is applied to each video
frame. For each block, 19 optimal coefficients
that best correspond to the properties of text
are determined empirically. The sum of the
absolute values of these coefficients is computed
and regarded as a measure of the text energy of
that block. The motion vectors of MPEG-compressed
videos are used for text object tracking.
(a) Original image; (b) text energy; (c) tracking result
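
A sketch of the block-DCT text-energy measure; the 19 coefficients were chosen empirically in the paper, so the mid-frequency mask below is only a placeholder:

```python
import numpy as np
from scipy.fft import dctn

def text_energy_map(gray: np.ndarray) -> np.ndarray:
    """Sum of |DCT coefficients| in a frequency band for each 8x8 block,
    used as a per-block text-energy measure."""
    h8, w8 = gray.shape[0] // 8, gray.shape[1] // 8
    energy = np.zeros((h8, w8))
    # Placeholder band: the paper instead selects 19 empirically
    # optimal coefficients per block.
    u, v = np.meshgrid(range(8), range(8), indexing="ij")
    mask = (u + v >= 2) & (u + v <= 6)
    for by in range(h8):
        for bx in range(w8):
            block = gray[by*8:(by+1)*8, bx*8:(bx+1)*8].astype(np.float64)
            coeffs = dctn(block, norm="ortho")  # 2D DCT-II of the block
            energy[by, bx] = np.abs(coeffs[mask]).sum()
    return energy
```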
25
D. Crandall, S. Antani, R. Kasturi,
Extraction of special effects caption text events
from digital video, International Journal on
Document Analysis and Recognition, Vol. 5, pp.
138-157, 2003
  • Abstract. The popularity of digital video is
    increasing rapidly. To help users navigate
    libraries of video, algorithms that automatically
    index video based on content are needed. One
    approach is to extract text appearing in video,
    which often reflects a scene's semantic content.
    This is a difficult problem due to the
    unconstrained nature of general-purpose video.
    Text can have arbitrary color, size, and
    orientation. Backgrounds may be complex and
    changing. Most work so far has made restrictive
    assumptions about the nature of text occurring in
    video. Such work is therefore not directly
    applicable to unconstrained, general-purpose
    video. In addition, most work so far has focused
    only on detecting the spatial extent of text in
    individual video frames. However, text occurring
    in video usually persists for several seconds.
    This constitutes a text event that should be
    entered only once in the video index. Therefore
    it is also necessary to determine the temporal
    extent of text events. This is a non-trivial
    problem because text may move, rotate, grow,
    shrink, or otherwise change over time. Such text
    effects are common in television programs and
    commercials but so far have received little
    attention in the literature. This paper discusses
    detecting, binarizing, and tracking caption text
    in general-purpose MPEG-1 video. Solutions are
    proposed for each of these problems and compared
    with existing work found in the literature.

26
Recent Progress - New and improved text extraction approaches
  • In addition, many earlier text extraction
    approaches have recently been enhanced and
    extended.
  • By extracting and integrating more
    comprehensive characteristics of text objects,
    these new approaches provide more robust
    performance than previous ones.
  • Besides new approaches, many improved
    approaches have been presented to overcome the
    limitations of earlier ones.

27
Recent Progress - New and improved text extraction approaches
  • S. Lefevre, N. Vincent, Caption localization
    in video sequences by fusion of multiple
    detectors, Proceedings of Eighth International
    Conference on Document Analysis and Recognition,
    IEEE, pp. 106-110, 2005

A color-related detector, a wavelet-based texture
detector, an edge-based contour detector, and the
temporal invariance principle are adopted to
detect candidate caption regions; a parallel
fusion strategy then merges their results.
C. Mancas-Thilou, B. Gosselin, Spatial and Color
Spaces Combination for Natural Scene Text
Extraction, Proceedings of IEEE International
Conference on Image Processing, pp. 985-988,
2006.
Euclidean distance based and cosine similarity
based clustering methods are applied
complementarily in the RGB color space to
partition the original image into three clusters:
textual foreground, textual background, and noise.
Overview of the proposed algorithm combining
color and spatial information.
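
A sketch of the complementary-clustering idea: cluster RGB pixels into three groups once with Euclidean k-means and once with a cosine-style variant (k-means on length-normalized colors); the paper's rule for choosing between the two partitions is not reproduced here:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_pixels(image: np.ndarray, k: int = 3):
    """Cluster H x W x 3 RGB pixels into k groups (textual foreground,
    textual background, noise) with two complementary distances."""
    pixels = image.reshape(-1, 3).astype(np.float64)
    # Euclidean clustering: sensitive to intensity differences.
    euclid = KMeans(n_clusters=k, n_init=10).fit_predict(pixels)
    # Cosine-style clustering: normalize colors to unit length so only
    # chromaticity matters, which helps under uneven lighting.
    norms = np.linalg.norm(pixels, axis=1, keepdims=True)
    cosine = KMeans(n_clusters=k, n_init=10).fit_predict(
        pixels / np.maximum(norms, 1e-9))
    shape = image.shape[:2]
    return euclid.reshape(shape), cosine.reshape(shape)
```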
28
S. Lefevre, N. Vincent, Caption localization
in video sequences by fusion of multiple
detectors, Proceedings of Eighth International
Conference on Document Analysis and Recognition,
IEEE, pp. 106-110, 2005
  • Abstract: In this article, we focus on the
    problem of caption detection in video sequences.
    Contrary to most existing approaches based on
    a single detector followed by an ad hoc and
    costly post-processing, we have decided to
    consider several detectors and to merge their
    results in order to combine advantages of each
    one. First we made a study of captions in video
    sequences to determine how they are represented
    in images and to identify their main features
    (color constancy and background contrast, edge
    density and regularity, temporal persistence).
    Based on these features, we then select or define
    the appropriate detectors and we compare several
    fusion strategies which can be involved. The
    logical process we have followed and the
    satisfying results we have obtained let us
    validate our contribution.

C. Mancas-Thilou, B. Gosselin, Spatial and Color
Spaces Combination for Natural Scene Text
Extraction, Proceedings of IEEE International
Conference on Image Processing, pp. 985-988,
2006.
Abstract: Natural scene images have brought new
challenges in recent years, and one of them is
text understanding in images or videos. Text
extraction, which consists of segmenting the
textual foreground from the background, usually
relies on color information. Faced with the large
diversity of text information in daily life and
artistic ways of display, we are convinced that
this information alone is no longer enough, and we
present a color segmentation algorithm using
spatial information. Moreover, a new method is
proposed in this paper to handle uneven lighting,
blur and complex backgrounds, which are inherent
degradations of natural scene images. To merge
text pixels together, complementary clustering
distances are used to handle both clear,
well-contrasted images and complex, degraded
images. Tests on a public database demonstrate the
efficiency of the whole proposed method.
29
Recent Progress - New and improved text extraction approaches
  • M.R. Lyu, J. Song, M. Cai, A Comprehensive
    method for multilingual video text detection,
    localization, and extraction, IEEE Transactions
    on Circuits and Systems for Video Technology,
    Vol. 15, pp. 243-255, 2005.

The sequential multi-resolution paradigm removes
the redundancy of the parallel multi-resolution
paradigm: no text edge appears multiple times at
different resolution levels.
Sequential multiresolution paradigm
30
  • M.R. Lyu, J. Song, M. Cai, A Comprehensive
    method for multilingual video text detection,
    localization, and extraction, IEEE Transactions
    on Circuits and Systems for Video Technology,
    Vol. 15, pp. 243-255, 2005.
  • Abstract: Text in video is a very compact and
    accurate clue for video indexing and
    summarization. Most video text detection and
    extraction methods hold assumptions on text
    color, background contrast, and font style.
    Moreover, few methods can handle multilingual
    text well since different languages may have
    quite different appearances. This paper performs
    a detailed analysis of multilingual text
    characteristics, including English and Chinese.
    Based on the analysis, we propose a
    comprehensive, efficient video text detection,
    localization, and extraction method, which
    emphasizes the multilingual capability over the
    whole processing. The proposed method is also
    robust to various background complexities and
    text appearances. The text detection is carried
    out by edge detection, local thresholding, and
    hysteresis edge recovery. The coarse-to-fine
    localization scheme is then performed to identify
    text regions accurately. The text extraction
    consists of adaptive thresholding, dam point
    labeling, and inward filling. Experimental
    results on a large number of video images and
    comparisons with other methods are reported in
    detail.

31
Recent Progress - New and improved text extraction approaches
  • J. Gllavata, E. Qeli and B. Freisleben,
    Detecting Text in Videos Using Fuzzy Clustering
    Ensembles, Proceedings of the Eighth IEEE
    International Symposium on Multimedia, pp.
    283-290, 2006.

Fuzzy C-means based individual frame clustering
is replaced by the fuzzy clustering ensemble
(FCE) based multi-frame clustering to utilize
temporal redundancy.
Fuzzy cluster ensemble for text detection in
videos
32
  • J. Gllavata, E. Qeli and B. Freisleben,
    Detecting Text in Videos Using Fuzzy Clustering
    Ensembles, Proceedings of the Eighth IEEE
    International Symposium on Multimedia, pp.
    283-290, 2006.
  • Abstract: Detection and localization of text in
    videos is an important task towards enabling
    automatic content-based retrieval of digital
    video databases. However, since text is often
    displayed against a complex background, its
    detection is a challenging problem. In this
    paper, a novel approach based on fuzzy cluster
    ensemble techniques to solve this problem is
    presented. The advantage of this approach is that
    the fuzzy clustering ensemble allows the
    incremental inclusion of temporal information
    regarding the appearance of static text in
    videos. Comparative experimental results for a
    test set of 10.92 minutes of video sequences have
    shown the very good performance of the proposed
    approach, with an overall recall of 92.04% and a
    precision of 96.71%.

33
Recent Progress
  • 2. Text extraction techniques adopted from other
    research fields
  • Another encouraging development is that more
    and more techniques that have been successfully
    applied in other research fields are being
    adapted for text extraction.
  • Because these approaches were not initially
    designed for the text extraction task, many
    unique characteristics of their original research
    fields are embedded in them intrinsically.
  • Therefore, by using these approaches from
    other fields, we can view the text extraction
    problem from the viewpoints of other related
    research fields and benefit from them. This is a
    promising way to find good solutions for the text
    extraction task.

34
Recent Progress - Text extraction techniques adopted from other research fields
  • K.I. Kim, K. Jung and J.H. Kim,
    Texture-based approach for text detection in
    image using support vector machine and
    continuously adaptive mean shift algorithm, IEEE
    Transactions on Pattern Analysis and Machine
    Intelligence, Vol. 25, No. 12, pp. 1631-1638,
    2003.

The continuously adaptive mean shift algorithm
(CAMSHIFT) was initially used to detect and track
faces in a video stream.
Example of text detection using CAMSHIFT. (a)
Input image (540 × 400); (b) initial window
configuration for the CAMSHIFT iteration (5 × 5
windows located at regular intervals of (25,
25)); (c) texture-classified region (white: text
region, gray: non-text region); (d) final
detection result.
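
OpenCV exposes the CAMSHIFT iteration directly, so the adaptation amounts to feeding it a text-probability map instead of a color back-projection. A sketch (the SVM scores are assumed to be given as a 0-255 map, and the seed windows follow the 5 × 5 grid configuration shown in the figure):

```python
import cv2
import numpy as np

def camshift_text_regions(probs: np.ndarray, step: int = 25) -> list:
    """Run CAMSHIFT on a text-probability map (uint8, 0..255, e.g. scaled
    SVM outputs), starting from small seed windows on a regular grid."""
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1.0)
    h, w = probs.shape
    regions = []
    for y in range(0, h - 5, step):
        for x in range(0, w - 5, step):
            # Each iteration shifts and resizes the window toward a mode
            # of the probability map; converged windows mark text regions.
            _, window = cv2.CamShift(probs, (x, y, 5, 5), criteria)
            if window[2] > 0 and window[3] > 0:
                regions.append(window)
    return regions
```

Overlapping converged windows would still need to be merged into final text regions.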
35
Recent Progress - Text extraction techniques adopted from other research fields
H.B. Aradhye and G.K. Myers, Exploiting Videotext
Events for Improved Videotext Detection,
Proceedings of Ninth International Conference on
Document Analysis and Recognition, IEEE, pp.
894-898, 2007.
The multiscale statistical process control
(MSSPC) was originally proposed for detecting
changes in univariate and multivariate signals.
Substeps involved in the use of MSSPC for
videotext event detection
36
K.I. Kim, K. Jung and J.H. Kim,
Texture-based approach for text detection in
image using support vector machine and
continuously adaptive mean shift algorithm, IEEE
Transactions on Pattern Analysis and Machine
Intelligence, Vol. 25, No. 12, pp. 1631-1638,
2003.
  • Abstract: The current paper presents a novel
    texture-based method for detecting texts in
    images. A support vector machine (SVM) is used to
    analyze the textural properties of texts. No
    external texture feature extraction module is
    used; rather, the intensities of the raw pixels
    that make up the textural pattern are fed
    directly to the SVM, which works well even in
    high-dimensional spaces. Next, text regions are
    identified by applying a continuously adaptive
    mean shift algorithm (CAMSHIFT) to the results of
    the texture analysis. The combination of CAMSHIFT
    and SVMs produces both robust and efficient text
    detection, as time-consuming texture analyses for
    less relevant pixels are restricted, leaving only
    a small part of the input image to be
    texture-analyzed.

H.B. Aradhye and G.K. Myers, Exploiting Videotext
Events for Improved Videotext Detection,
Proceedings of Ninth International Conference on
Document Analysis and Recognition, IEEE, pp.
894-898, 2007.
Abstract: Text in video, whether overlay or
in-scene, contains a wealth of information vital
to automated content analysis systems. However,
low resolution of the imagery, coupled with
richness of the background and compression
artifacts limit the detection accuracy that can
be achieved in practice using existing text
detection algorithms. This paper presents a
novel, noncausal temporal aggregation method that
acts as a second pass over the output of an
existing text detector over the entire video
clip. A multiresolution change detection
algorithm is used along the time axis to detect
the appearance and disappearance of multiple,
concurrent lines of text, followed by recursive
time-averaged projections on the Y and X axes. This
algorithm detects and rectifies instances of
missed text and enhances spatial boundaries of
detected text lines using consensus estimates.
Experimental results, which demonstrate
significant performance gain on publicly
collected and annotated data, are presented.
37
Recent Progress - Text extraction techniques adopted from other research fields
  • D. Liu and T. Chen, Object Detection in
    Video with Graphical Models, Proceedings of IEEE
    International Conference on Acoustics, Speech and
    Signal Processing, Vol. 5, pp. 14-19, 2006.

Discriminative Random Fields (DRFs) were initially
applied to detect man-made buildings in 2D images.
(a) 2D DRF, with state si and one of its
neighbors sj . (b) 3D DRF, with multiple 2D DRFs
stacked over time. (c) 2D DRF-HMM type(A), with
intra-frame dependencies modelled by undirected
DRFs, and inter-frame dependencies modelled by
HMMs. States are shared between the two models.
38
Recent Progress - Text extraction techniques adopted from other research fields
W. M. Pan, T. D. Bui, and C. Y. Suen, Text
Segmentation from Complex Background Using Sparse
Representations, Proceedings of Ninth
International Conference on Document Analysis and
Recognition, IEEE, pp. 412-416, 2007.
Sparse representation was initially used for
research on the receptive fields of simple cells.
(a) Camera-captured image; (b) foreground text
generated by image decomposition via sparse
representations; (c) binarized result of (b)
using Otsu's method.
39
D. Liu and T. Chen, Object Detection in
Video with Graphical Models, Proceedings of IEEE
International Conference on Acoustics, Speech and
Signal Processing, Vol. 5, pp. 14-19, 2006.
  • Abstract: In this paper, we propose a general
    object detection framework which combines the
    Hidden Markov Model with Discriminative
    Random Fields. Recent object detection algorithms
    have achieved impressive results by using
    graphical models, such as Markov Random Fields.
    These models, however, have only been applied to
    two-dimensional images. In many scenarios, video
    is the directly available source rather than
    images, hence an important source of information
    for detecting objects has been omitted: the
    temporal information. To demonstrate the
    importance of temporal information, we apply
    graphical models to the task of text detection in
    video and compare the results with and without
    temporal information. We also show the superiority
    of the proposed models over simple heuristics
    such as a median filter over time.

W. M. Pan, T. D. Bui, and C. Y. Suen, Text
Segmentation from Complex Background Using Sparse
Representations, Proceedings of Ninth
International Conference on Document Analysis and
Recognition, IEEE, pp. 412-416, 2007.
Abstract: A novel text segmentation method from
complex background is presented in this paper.
The idea is inspired by the recent development in
searching for the sparse signal representation
among a family of over-complete atoms, which is
called a dictionary. We assume that the image
under investigation is composed of two
components: the foreground text and the complex
background. We further assume that the latter can
be modeled as a piece-wise smooth function. Then
we choose two dictionaries, where the first one
gives sparse representation to one component and
nonsparse representation to another while the
second one does the opposite. By looking for the
sparse representations in each dictionary, we can
decompose the image into the two composing
components. After that, text segmentation can be
easily achieved by applying simple thresholding
to the text component. Preliminary experiments
show some promising results.
40
Recent Progress
  • 3. Text extraction approaches proposed for
    specific text types and specific genres of video
    documents
  • Besides general text extraction approaches,
    an increasing number of approaches have been
    proposed for specific text types.
  • Based on domain knowledge, these specific
    approaches can take advantage of the unique
    properties of a specific text type or video genre
    and often achieve better performance than general
    approaches.

41
Recent Progress - Text extraction approaches proposed for specific text types and specific genres of video documents
W. Wu, X. Chen and J. Yang, Detection of text on
road signs from video, IEEE Transactions on
Intelligent Transportation Systems, Vol. 6, pp.
378-390, 2005.
This approach is composed of two stages: (1)
localizing road signs; (2) detecting text.
Architecture of the proposed framework
42
W. Wu, X. Chen and J. Yang, Detection of text on
road signs from video, IEEE Transactions on
Intelligent Transportation Systems, Vol. 6, pp.
378-390, 2005.
  • Abstract: A fast and robust framework for
    incrementally detecting text on road signs from
    video is presented in this paper. This new
    framework makes two main contributions. 1) The
    framework applies a divide-and-conquer strategy
    to decompose the original task into two subtasks,
    that is, the localization of road signs and the
    detection of text on the signs. The algorithms
    for the two subtasks are naturally incorporated
    into a unified framework through a feature-based
    tracking algorithm. 2) The framework provides a
    novel way to detect text from video by
    integrating two-dimensional (2-D) image features
    in each video frame (e.g., color, edges, texture)
    with the three-dimensional (3-D) geometric
    structure information of objects extracted from
    video sequence (such as the vertical plane
    property of road signs). The feasibility of the
    proposed framework has been evaluated using 22
    video sequences captured from a moving vehicle.
    This new framework gives an overall text
    detection rate of 88.9% and a false hit rate of
    9.2%. It can easily be applied to other tasks of
    text detection from video and potentially be
    embedded in a driver assistance system.

43
Recent Progress - Text extraction approaches proposed for specific text types and specific genres of video documents
C. Choudary and T. Liu, Summarization of Visual
Content in Instructional Videos, IEEE Transactions
on Multimedia, Vol. 9, pp. 1443-1455, 2007.
A content fluctuation curve based on the number of
chalk pixels is used to measure the content in
each frame of an instructional video. The frames
with enough chalk pixels are extracted as key
frames. Hausdorff distance and connected-component
decomposition are adopted to reduce the
redundancy of key frames by matching the content
and mosaicking the frames.
Comparison of our summary frames with the key
frames obtained using different key-frame
selection methods on a test video: (a) our
summarization algorithm; (b) fixed sampling; (c)
dynamic clustering; (d) tolerance band. Our
summary frames are richer in content and more
appealing.
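
A sketch of the content-fluctuation idea: count extracted chalk pixels per frame and keep frames at local maxima of the resulting curve as key-frame candidates (the paper's statistical chalk-pixel extraction, Hausdorff matching, and mosaicking are not reproduced; the threshold is an assumption):

```python
import numpy as np

def key_frame_candidates(chalk_masks: list, min_pixels: int = 500) -> list:
    """chalk_masks: one binary mask per frame marking detected chalk pixels.
    Returns indices of frames whose chalk-pixel count is a local maximum of
    the content fluctuation curve and exceeds a minimum content threshold."""
    curve = np.array([int(m.sum()) for m in chalk_masks])
    keys = []
    for t in range(1, len(curve) - 1):
        if curve[t] >= min_pixels and curve[t] >= curve[t - 1] \
                and curve[t] > curve[t + 1]:
            keys.append(t)
    return keys
```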
44
C. Choudary and T. Liu, Summarization of Visual
Content in Instructional Videos, IEEE Transactions
on Multimedia, Vol. 9, pp. 1443-1455, 2007.
  • Abstract: In instructional videos of chalkboard
    presentations, the visual content refers to the
    text and figures written on the boards. Existing
    methods on video summarization are not effective
    for this video domain because they are mainly
    based on low-level image features such as color
    and edges. In this work, we present a novel
    approach to summarizing the visual content in
    instructional videos using middle-level features.
    We first develop a robust algorithm to extract
    content text and figures from instructional
    videos by statistical modelling and clustering.
    This algorithm addresses the image noise,
    nonuniformity of the board regions, camera
    movements, occlusions, and other challenges in
    the instructional videos that are recorded in
    real classrooms. Using the extracted text and
    figures as the middle level features, we retrieve
    a set of key frames that contain most of the
    visual content. We further reduce content
    redundancy and build a mosaicked summary image by
    matching extracted content based on K-th
    Hausdorff distance and connected component
    decomposition. Performance evaluation on four
    full-length instructional videos shows that our
    algorithm is highly effective in summarizing
    instructional video content.

45
Recent Progress - Text extraction approaches proposed for specific text types and specific genres of video documents
  • Additional References
  • C. Mancas-Thilou, B. Gosselin, Spatial and Color
    Spaces Combination for Natural Scene Text
    Extraction, Proceedings of the IEEE International
    Conference on Image Processing, pp. 985-988,
    2006.
  • D. Q. Zhang and S. F. Chang, Learning to Detect
    Scene Text Using a Higher-order MRF with Belief
    Propagation, IEEE Conference on Computer Vision
    and Pattern Recognition Workshop, 2004.
  • L. Tang and J.R. Kender, A unified text
    extraction method for instructional videos,
    Proceedings of the IEEE International Conference
    on Image Processing, Vol. 3, pp. 11-14, 2005.
  • M.R. Lyu, J. Song, M. Cai, A Comprehensive method
    for multilingual video text detection,
    localization, and extraction, IEEE Transactions
    on Circuits and Systems for Video Technology,
    Vol. 15, pp. 243-255, 2005.
  • S. Lefevre, N. Vincent, Caption localization in
    video sequences by fusion of multiple detectors,
    Proceedings of the Eighth International
    Conference on Document Analysis and Recognition,
    IEEE, pp. 106-110, 2005.
  • C.C. Lee, Y.C. Chiang, C.Y. Shih, H.M. Huang,
    Caption localization and detection for news
    videos using frequency analysis and wavelet
    features, Proceedings of the IEEE International
    Conference on Tools with Artificial Intelligence,
    Vol. 2, pp. 539-542, 2007.

46
Outline
  • Introduction
  • Recent Progress
  • Performance Evaluation
  • Discussion

47
Performance Evaluation
R. Kasturi, D. Goldgof, P. Soundararajan, V.
Manohar, J. Garofolo, R. Bowers, M. Boonstra, V.
Korzhova, and J. Zhang, Framework for Performance
Evaluation of Face, Text, and Vehicle Detection
and Tracking in Video: Data, Metrics, and
Protocol, to appear in IEEE Transactions on
Pattern Analysis and Machine Intelligence, 2008.
(http://doi.ieeecomputersociety.org/10.1109/TPAMI.2008.57)
  • Evaluation Metrics
  • Video Analysis and Content Extraction (VACE)

48
Text Task Definition
  • Detection Task: spatially locate the blocks of
    text in each video frame in a video sequence
  • Text blocks (objects) contain all words in a
    particular line of text where the font and size
    are the same
  • Tracking Task: spatially/temporally locate and
    track the text objects in a video sequence
  • Recognition Task: transcribe the words in each
    frame, including their spatial location
    (detection implied)

49
Task Definition Highlights
  • Annotate oriented bounding rectangle around text
    objects (The reference annotation was done by
    VideoMining Inc., State College, PA)
  • Detection and Tracking task
  • Line level annotation with IDs maintained
  • Rules based on similarity of font, proximity and
    readability levels
  • Recognition task
  • Word Level (IDs maintained)
  • Documents
  • Annotation guidelines - Evaluation protocol
  • Tools
  • ViPER (Annotation) - USF-DATE (Scoring)

50
Data Resources
  • VIDEO

DATA           NUMBER OF CLIPS   TOTAL MINS
MICRO-CORPUS          5               10
TRAINING             50              175
TESTING              50              175
  • Micro-corpus: a small amount of data that was
    created after extensive discussions with the
    research community to act as a seed for initial
    annotation experiments and to provide new
    participants with a concrete sampling of the
    datasets and the tasks.

51
Data Resources
  • These discussions were coordinated as a series of
    weekly teleconferences with VACE contractors and
    other eminent members of the CV community.
  • The discussions made the research community a
    partner in the evaluations and helped us in:
  • selecting the video recordings to be used in the
    evaluations,
  • creating the specifications for the ground truth
    annotations and scoring tools, and
  • defining the evaluation infrastructure for the
    program.

52
Data Resources
TASK                      DOMAIN
Text Detect & Track       Broadcast News (ABC, CNN)
Face Detect & Track       Broadcast News (ABC, CNN)
Vehicle Detect & Track    Surveillance (i-LIDS)
  • MPEG-2 standard, progressive scanned at 720 × 480
    resolution. GOP (Group of Pictures) of 12 for the
    broadcast news corpus, where the frame rate was
    29.97 fps (frames per second), and GOP of 10 for
    the surveillance dataset, where the frame rate was
    25 fps.

Distributed by the Linguistic Data Consortium
(LDC): http://www.ldc.upenn.edu
The i-LIDS Multiple Camera Tracking / Parked
Vehicle Detection / Abandoned Baggage Detection
scenario datasets were developed by the UK Home
Office and CPNI. (http://scienceandresearch.homeoffice.gov.uk/hosdb/cctv-imaging-technology/video-based-detection-systems/i-lids/)
53
Reference Annotations
  • Text Ground Truth: Every new text area was marked
    with a box when it appeared in the video. The box
    was moved and scaled to fit the text as it moved
    in successive frames. This process was done at
    the text line level until the text disappeared
    from the frame.

Three readability levels:
READABILITY 0 (white): completely unreadable text
READABILITY 1 (gray): partially readable text
READABILITY 2 (black): clearly readable text
54
Reference Annotations
  • Text regions were tagged based on a comprehensive
    set of rules:
  • All text within a selected block must have the
    same readability level and type.
  • Blocks of text must have the same size and
    font.
  • The bounding box should be tight to the extent
    that there is no space between the box and the
    text.
  • Text boxes may not overlap other text boxes
    unless the characters themselves are superimposed
    atop one another.

55
Sample Annotation Clip (line-level)
56
Detection Metric
  • The Frame Detection Accuracy (FDA) measure
    calculates the spatial overlap between the ground
    truth and system output objects as the ratio of
    the spatial intersection of the two objects to
    their spatial union. The sum of all of the
    overlaps is normalized by the average of the
    number of ground truth and detected objects.

Frame Detection Accuracy (FDA)
G_i denotes the i-th ground truth object at the
sequence level and G_i(t) denotes the i-th ground
truth object in frame t. D_i denotes the i-th
detected object at the sequence level and D_i(t)
denotes the i-th detected object in frame t.
N_G^(t) and N_D^(t) denote the number of ground
truth objects and the number of detected objects
in frame t, respectively.
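
Reconstructed in LaTeX from the definitions above (the slide showed the equation as an image; this follows the VACE formulation in the cited TPAMI paper):

```latex
\mathrm{FDA}(t) = \frac{\mathrm{OverlapRatio}(t)}{\left( N_G^{(t)} + N_D^{(t)} \right)/2},
\qquad
\mathrm{OverlapRatio}(t) = \sum_{i=1}^{N_{\mathrm{mapped}}^{(t)}}
  \frac{\lvert G_i(t) \cap D_i(t) \rvert}{\lvert G_i(t) \cup D_i(t) \rvert}
```

where N_mapped^(t) is the number of mapped ground-truth/detected object pairs in frame t.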
57
Detection Metric
  • The Sequence Frame Detection Accuracy (SFDA) is
    essentially the average of the FDA measure over
    all of the relevant frames in the sequence.

Sequence Frame Detection Accuracy (SFDA)
Range: 0 to 1 (higher is better)
N_frames is the number of frames in the sequence
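
The corresponding sequence-level average, reconstructed from the definition above:

```latex
\mathrm{SFDA} = \frac{\sum_{t=1}^{N_{\mathrm{frames}}} \mathrm{FDA}(t)}
  {\sum_{t=1}^{N_{\mathrm{frames}}} \exists \left( N_G^{(t)} \ \mathrm{OR}\ N_D^{(t)} \right)}
```

The denominator counts only the frames in which at least one ground truth or detected object exists.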
58
Tracking Metric
  • The Average Tracking Accuracy (ATA) is a
    spatio-temporal measure which penalizes
    fragmentations in both the temporal and spatial
    dimensions while accounting for the number of
    objects detected and tracked, missed objects, and
    false positives.

Sequence Track Detection Accuracy (STDA)
Average Tracking Accuracy (ATA)
Range: 0 to 1 (higher is better)
N_G and N_D denote the number of unique ground
truth objects and the number of unique detected
objects in the given sequence, respectively.
Uniqueness is defined by object IDs.
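
Reconstructed from the definitions above (the slide's equation images; this follows the cited paper's formulation):

```latex
\mathrm{STDA} = \sum_{i=1}^{N_{\mathrm{mapped}}}
  \frac{\sum_{t=1}^{N_{\mathrm{frames}}}
        \lvert G_i(t) \cap D_i(t) \rvert \,/\, \lvert G_i(t) \cup D_i(t) \rvert}
       {N_{(G_i \cup D_i \neq \varnothing)}},
\qquad
\mathrm{ATA} = \frac{\mathrm{STDA}}{\left( N_G + N_D \right)/2}
```

Here N_(G_i ∪ D_i ≠ ∅) is the number of frames in which the i-th ground truth object, its mapped detection, or both are present.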
59
Example Detection Scoring
Green: detected box; Red: ground-truth box;
Yellow: overlap in mapped boxes
60
Annotation Quality
Evaluation relies on manual labeling.
10% of the entire corpus was doubly annotated by
multiple annotators and checked for quality
using the evaluation measures.
The degree of consistency becomes increasingly
important as systems approach human levels of
performance.
A high degree of consistency would be difficult
to achieve with somewhat subjective attributes
like readability.
Humans fatigue easily when performing such
tedious tasks.
61
Annotation Quality
  • For the doubly annotated corpus:
  • Average Sequence Frame Detection Accuracy
    (SFDA), text detection: 95%
  • Average Tracking Accuracy (ATA), text
    tracking: 85%

The scores for the current state-of-the-art
automatic algorithms are significantly lower than
these numbers (22% relative for text detection,
and 61% relative for text tracking).
62
Annotation Quality
Flowchart of Annotation Quality Control
Procedure. Steps denoted by dark shaded boxes
were carried out by the annotators. Steps denoted
by light shaded boxes were carried out by the
evaluators.
63
Text Detection and Tracking (VACE)
(Results chart not reproduced)
64
Text Recognition Evaluation
  • Datasets: Broadcast News
  • Training/Dry Run Development Set
  • 5 Clips
  • 14.5 minutes
  • 1181 words
  • Evaluation Set
  • 25 Clips
  • 62.5 minutes
  • 4178 word objects
  • 68,738 word frame instances

65
Text Recognition Evaluation
  • Evaluate only the most easily readable text (to
    establish a baseline at a high level of
    inter-annotator agreement)
  • Type: graphic (no scene text)
  • Readability: 2
  • Logo: false
  • Occlusion: false
  • Ambiguous: false
  • Exclude scrolling (ticker) and dynamic text
    (scoreboard)
  • Case insensitive; punctuation ignored

66
Sample Annotation Clip (Word-level)
67
Recognition Evaluation Metrics
  • Spatially map system output detected words to
    reference words, then compare the strings for
    mapped words
  • An unmapped word in system output incurs an
    Insertion (I) error
  • An unmapped word in reference incurs a Deletion
    (D) error
  • A mapped word with a character mismatch incurs a
    Substitution (S) error
  • Errors are accumulated over entire test set
  • Also generate Character Error Rate

REF:         The  raven  caws   at       midnight
Sys Output:  ---  raven  calls  at  at   midnight
Errors:       D           S         I
WER = (1 + 1 + 1) / 5 = 3/5 (60%)
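
A minimal sketch of computing WER with standard word-level Levenshtein alignment (an illustration, not the evaluation tool itself):

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """WER = (deletions + substitutions + insertions) / #reference words,
    computed with Levenshtein alignment over word tokens."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = minimum edit operations to turn r[:i] into h[:j].
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

# The slide's example: 3 errors over 5 reference words -> 0.6 (60%).
print(word_error_rate("the raven caws at midnight",
                      "raven calls at at midnight"))
```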
68
Individual Clip Word Error Rate
(Per-clip chart not reproduced)
69
Scores (Word Error Rate)
WER: 0.4233    CER: 0.2823
70
Outline
  • Introduction
  • Recent Progress
  • Performance Evaluation
  • Discussion

71
Discussion
  • The recent progress provides many promising
    solutions and research directions for the text
    extraction problem.
  • Due to the large variations of text objects in
    videos, no single approach can achieve
    satisfactory performance in all applications.
  • To further improve the performance of text
    extraction techniques, much work in the area
    remains.

72
Discussion
  • Detection and Localization
  • How to efficiently combine several complementary
    extraction algorithms to produce better
    performance, and how to extract better features
    by analyzing the shape of characters and the
    relationships between text and its background,
    still need more investigation.

73
Discussion
  • Tracking
  • Although text tracking is an indispensable step
    for text extraction in videos, not many text
    tracking approaches have been reported in recent
    years.
  • More effort is needed to focus on tracking, not
    only for static and scrolling text, but also for
    dynamic text objects (growing, shrinking, and
    rotating text).

74
Discussion
  • Datasets
  • Because most algorithms are still tested on
    their own datasets, a large, freely available,
    annotated video dataset is urgently needed in
    order to compare and evaluate all algorithms.

75
THANK YOU! See you at ICPR 2008 in December