Title: Amplifying Video Information-Seeking Success through Rich, Exploratory Interfaces
Slide 1: Amplifying Video Information-Seeking Success through Rich, Exploratory Interfaces
Mike Christel (christel_at_cs.cmu.edu)
School of Computer Science, Carnegie Mellon University
KES-IIMSS, July 11, 2008
Slide 2: Talk Outline
- Creating metadata for video information sets
- Informedia demonstrations (oral history collection, news video collection)
- Types of search beyond fact-finding
- Exploratory search through multiple views
- Evaluation hurdles
- Discussion: now is a perfect opportunity for leveraging user involvement for better video information-seeking experiences
Slide 3: User Involvement
- User Correction: corrective action for metadata errors (analogous to Harry Shum's vision at Microsoft for human-assisted computer vision success)
- User Control: driving the interface to overcome metadata errors
- User Context: more useful interfaces driven implicitly by context
Slide 4: CMU Informedia Digital Video Research
- Details at http://www.informedia.cs.cmu.edu
- Speech recognition and alignment
- Image processing
- Named entity tagging
- Synchronized metadata for search and navigation
- Fast, direct video access to oral histories, news, etc.
- Demonstration oral history corpus: 913 hours of interviews from 400 individuals, 18,254 interview story segments (average story segment length of 3 minutes)
- Demonstration news corpus: TRECVID 2006 test set (165 hours of U.S., Arabic, and Chinese news with 79,484 reference shots)
Slide 5: Speech Recognition Functions
- Generates a transcript (if one is not given) to enable text-based retrieval from spoken-language documents
- Improves text synchronization to audio/video in the presence of scripts (align speech with text; see the sketch below)
- Supplies necessary information for library segmentation and multimedia abstractions (e.g., break stories apart at silence points rather than in the middle of sentences)
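To make the alignment function concrete, here is a minimal sketch, not the Informedia implementation (which aligns at the acoustic level using the recognizer itself): script words inherit timestamps from time-stamped recognizer output via word-level edit-distance alignment. The word lists and function name are hypothetical.

def align_script_to_asr(script_words, asr_words):
    # asr_words: list of (word, start_time) pairs from the recognizer
    n, m = len(script_words), len(asr_words)
    # dp[i][j] = edit cost of aligning first i script words to first j ASR words
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if script_words[i - 1].lower() == asr_words[j - 1][0].lower() else 1
            dp[i][j] = min(dp[i - 1][j - 1] + sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # Trace back, attaching an ASR timestamp to each exactly matched script word
    times, i, j = {}, n, m
    while i > 0 and j > 0:
        sub = 0 if script_words[i - 1].lower() == asr_words[j - 1][0].lower() else 1
        if dp[i][j] == dp[i - 1][j - 1] + sub:
            if sub == 0:
                times[i - 1] = asr_words[j - 1][1]
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return times  # script word index -> start time, for matched words only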
Slide 6: Speech Alignment Example
Slide 7: Image Understanding Functions
- Scene segmentation
- Similarity matching
- Camera motion determination and object tracking
- Optical Character Recognition (OCR) on video text and titles
- Face detection and recognition
- Ongoing research work in object identification and scene characterization, e.g., indoor/outdoor, road, building, etc.
Slide 8: Images containing similar colors
Slide 9: Images containing similar shapes
Slide 10: Images containing similar content
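To illustrate the "similar colors" matching of slide 8, a minimal sketch of color-histogram comparison by histogram intersection; the uint8 RGB numpy-array inputs and the 8-level quantization are assumptions, and Informedia's actual similarity features are richer.

import numpy as np

def color_histogram(img, bins=8):
    # Quantize each RGB channel into `bins` levels and build a joint histogram
    q = (img.astype(int) // (256 // bins)).reshape(-1, 3)
    idx = q[:, 0] * bins * bins + q[:, 1] * bins + q[:, 2]
    hist = np.bincount(idx, minlength=bins ** 3).astype(float)
    return hist / hist.sum()  # normalize so image size does not matter

def color_similarity(img_a, img_b):
    # Histogram intersection: 1.0 means identical color distributions
    return np.minimum(color_histogram(img_a), color_histogram(img_b)).sum()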
Slides 11-12: Goal: Automatic Video Characterization
[Figure: Yellowstone footage broken into shots, annotated with camera motion, objects, action, captions, and scenery]
Slide 13: Automated Video Processing
- Produces descriptive metadata for video libraries
- Metadata has more errors than metadata produced by careful, human-provided annotation
- Errors in metadata can be reduced:
  - By more computation-intensive algorithms
  - By taking advantage of video frame-to-frame redundancy (sketched below)
  - By folding in context, e.g., probable text sizes in video
  - By folding in extra sources of knowledge, e.g., a dictionary for cleaning up VOCR, or labeled data revealing patterns for named entity detection
  - By human review and correction, which can generate additional labeled data for machine learning
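As a concrete illustration of the frame-to-frame redundancy bullet, a minimal sketch: the same caption is visible across many frames, so per-frame OCR errors can be voted away character by character. The per-frame OCR strings are hypothetical.

from collections import Counter

def vote_ocr(readings):
    # readings: OCR results for the same caption from consecutive frames
    width = max(len(r) for r in readings)
    padded = [r.ljust(width) for r in readings]
    # Majority vote per character position across the frames
    return "".join(Counter(col).most_common(1)[0][0]
                   for col in zip(*padded)).strip()

print(vote_ocr(["CNN NEVVS", "CNN NEWS", "CMN NEWS"]))  # -> "CNN NEWS"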
Slide 14: Camera and Motion Detection
[Figure: pan example vs. rightward object motion (not a pan left)]
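A minimal sketch of the pan-versus-object-motion distinction using dense optical flow from OpenCV (the library choice and thresholds are assumptions, not the Informedia method): a pan moves nearly the whole frame coherently, while object motion moves only a region.

import cv2
import numpy as np

def classify_horizontal_motion(prev_gray, next_gray, coherence=0.7):
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    dx = flow[..., 0].ravel()
    moving = np.abs(dx) > 1.0  # pixels with noticeable horizontal motion
    if not moving.any():
        return "static"
    # A left pan shifts scene content rightward (positive dx) everywhere;
    # a rightward-moving object yields positive dx in only part of the frame
    direction = "pan left" if dx[moving].mean() > 0 else "pan right"
    return direction if moving.mean() > coherence else "object motion"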
Slide 15: Text and Face Detection
Slide 16: Video OCR Block Diagram
[Diagram: Video -> Text Area Detection -> Text Area Preprocessing -> Commercial OCR -> ASCII Text]
Slide 17: Video Frames, Filtered Frames, AND-ed Frames (1/2 s intervals)
[Figure: video frames sampled every 1/2 s, filtered, then AND-ed together to sharpen stable caption text]
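A minimal sketch of the AND-ing step (assuming grayscale numpy frames and bright caption text): overlaid text stays fixed across frames while the background changes, so a pixel-wise minimum over the filtered frames, an "AND" for grayscale, suppresses background clutter and keeps the stable text.

import numpy as np

def and_frames(filtered_frames):
    # filtered_frames: grayscale frames sampled at 1/2 s intervals
    stacked = np.stack(filtered_frames)  # shape: (n_frames, height, width)
    return stacked.min(axis=0)  # bright only where every frame is bright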
Slide 18: VOCR Preprocessing Problems
Slide 19: Augmenting VOCR with Dictionary Look-up
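A minimal sketch of the dictionary look-up idea: snap each recognized word to the closest dictionary entry when string similarity is high enough. The tiny dictionary and the 0.7 cutoff are illustrative assumptions.

import difflib

DICTIONARY = ["hartsfield", "international", "airport", "atlanta"]

def correct(word, cutoff=0.7):
    # Closest dictionary entry above the similarity cutoff, if any
    match = difflib.get_close_matches(word.lower(), DICTIONARY, n=1, cutoff=cutoff)
    return match[0] if match else word

print(correct("A1RP0RT"))  # VOCR confusions 1/I and 0/O -> "airport"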
Slide 20: Name-It Face/Name Association
Slide 21: Named Entity Extraction
F. Kubala, R. Schwartz, R. Stone, and R. Weischedel, "Named Entity Extraction from Speech," Proc. DARPA Workshop on Broadcast News Understanding Systems, Lansdowne, VA, February 1998.
Example passage: "CNN national correspondent John Holliman is at Hartsfield International Airport in Atlanta. Good morning, John." "But there was one situation here at Hartsfield where one airplane flying from Atlanta to Newark, New Jersey yesterday had a mechanical problem and it caused a backup that spread throughout the whole system because even though there were a lot of planes flying to the New York area from the Atlanta area yesterday, ..."
Key: Place, Time, Organization/Person
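As a present-day illustration of named entity tagging on the slide's excerpt, a minimal sketch using spaCy (one of many NER toolkits; the slide itself cites BBN's 1998 speech NER work, not spaCy):

import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline with an NER component
text = ("CNN national correspondent John Holliman is at Hartsfield "
        "International Airport in Atlanta. Good morning, John.")
for ent in nlp(text).ents:
    print(ent.text, ent.label_)  # e.g., John Holliman PERSON, Atlanta GPE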
Slide 22: Enhancing Library Utility via Better Metadata
Slide 23: Improving the Interface via Usage Context
- Example: query-based thumbnail selection (sketched below)
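A minimal sketch of query-based thumbnail selection (the data structures are hypothetical): rather than a fixed keyframe, represent a segment by the shot whose transcript best matches the user's query terms.

def pick_thumbnail(segment_shots, query_terms):
    # segment_shots: list of (keyframe_id, transcript_text) pairs, one per shot
    def score(transcript):
        words = set(transcript.lower().split())
        return sum(term.lower() in words for term in query_terms)
    # Keyframe of the shot mentioning the most query terms
    return max(segment_shots, key=lambda shot: score(shot[1]))[0]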
Slide 24: Improving Utility through End-User Control
- Example: filtering a storyboard based on visual concepts, with the user controlling precision and recall (sketched below)
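A minimal sketch of that control (the score data are hypothetical): each shot carries confidence scores from automatic concept detectors, and a user-set threshold trades precision against recall: a strict threshold shows fewer, more reliable shots, while a lenient one shows more shots with more noise.

def filter_storyboard(shots, concept, threshold):
    # shots: list of dicts like {"id": 17, "scores": {"road": 0.82}}
    return [s for s in shots if s["scores"].get(concept, 0.0) >= threshold]

shots = [{"id": 1, "scores": {"road": 0.9}}, {"id": 2, "scores": {"road": 0.4}}]
print(filter_storyboard(shots, "road", 0.8))  # strict slider: only shot 1
print(filter_storyboard(shots, "road", 0.3))  # lenient slider: both shots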
Slide 25: Improving the Metadata via User Interaction
- Example: collecting positive and implicit negative sets of labeled shot data for visual concepts (sketched below)
- Reference: Ming-yu Chen et al., ACM Multimedia 2005
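A minimal sketch of harvesting such labels from interaction (the click-based policy here is an illustrative assumption, not necessarily the exact policy of Chen et al.): shots the user marks relevant become positives for a concept, and shots the user viewed but skipped become implicit negatives for detector retraining.

def harvest_labels(viewed_shots, clicked_shots):
    positives = set(clicked_shots)
    # Implicit negatives: shots shown to the user but never selected
    negatives = set(viewed_shots) - positives
    return positives, negatives

pos, neg = harvest_labels(viewed_shots=[1, 2, 3, 4], clicked_shots=[2, 4])
print(pos, neg)  # {2, 4} {1, 3}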
Slide 26: User Involvement
- User Correction: corrective action for metadata errors (analogous to Harry Shum's vision at Microsoft for human-assisted computer vision success)
- User Control: driving the interface to overcome metadata errors
- User Context: more useful interfaces driven implicitly by context
Slide 27: Video Summaries (without User Context)
- BBC rushes video summarization task in TRECVID 2007 and TRECVID 2008 shows the difficulty of the task
- A video summary is a condensed version of some information, such that various judgments about the full information can be made using only the summary, taking less time and effort than would be required using the full information source
- Maximum 4% of the original duration (2% in TRECVID 2008); see the budget sketch below
- Benefits of this TRECVID task: provides a reasonably large video collection to be summarized, a uniform method of creating ground truth, and a uniform scoring mechanism
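To make the duration budget concrete, a minimal baseline sketch (an assumption for illustration, not a TRECVID submission): keep evenly spaced one-second excerpts until the 2% budget is spent; real systems also remove repeated takes and junk frames.

def summary_excerpt_starts(video_seconds, budget_fraction=0.02, clip_len=1.0):
    n_clips = max(1, int(video_seconds * budget_fraction / clip_len))
    step = video_seconds / n_clips
    # Start times (in seconds) of the excerpts concatenated into the summary
    return [round(i * step, 1) for i in range(n_clips)]

print(summary_excerpt_starts(25 * 60))  # 25-minute rush -> 30 one-second excerpts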
Slide 28: BBC Rushes
- 42 test videos (plus development ones) from the BBC Archive
- Test videos:
  - minimum duration: 3.3 minutes
  - maximum duration: 36.4 minutes
  - mean duration: 25 minutes
- Raw (unedited) rushes video with a great deal of redundancy (repeated takes), mixed-quality audio, and junk frames
Slide 29: Video Summaries (with/without User Context)
- BBC rushes video has no context to build from
- However, users often provide cues as to what is important, as will be seen shortly
Slide 30: Storyboards: TRECVID Search Success
- For the shot-based directed search information retrieval task evaluated at TRECVID, storyboards have consistently and overwhelmingly produced the best performance (see references in paper, e.g., Snoek et al. 2007)
- Motivated users can navigate through thousands of shot thumbnails in storyboards, better even than with "extreme video retrieval" interfaces: 2487 shots on average per 15-minute topic for TRECVID 2006 (Christel/Yan, CIVR 2007)
- Storyboard benefits: packed visual overview, with only trivial interactive control needed for "overview, zoom and filter, details on demand" (Shneiderman's Visual Information-Seeking Mantra)
Slide 31: Beyond Fact-Finding
- CACM April 2006 special issue on this topic
- G. Marchionini ("Exploratory Search: From Finding to Understanding," CACM 49, April 2006) breaks down three types of search activities:
  - Lookup (fact-finding: solving a stated/understood need)
  - Learn
  - Investigate
- Computer scientists and information retrieval specialists emphasize evaluation of lookup activities (NIST TREC)
- Real-world interest in learn/investigate for an oral history collection: at a State Univ. of New York at Buffalo workshop, library science and humanities participants were quite interested in learn/investigate activities
Slide 32: Exploratory Search (Demonstrations)
- Examples where storyboards are still useful: visual review, e.g., of disaster field footage
- Where storyboards fail:
  - Showing other facets like time, space, co-occurrence, named entities (When did disasters occur? Which ones? Where?)
  - Providing collection understanding, a holistic view of what's in, say, 100s of segments or 1000s of matching shots
  - Providing a window into visually homogeneous results, e.g., results from a color search perhaps, or a corpus of just lecture slides, or head-and-shoulder interview shots
- Claim: storyboards are not sufficient, but are part of a useful suite of tools/interfaces for interactive video search
Slide 33: Anecdotal Support for Claim
- Collected 2006-2007 from:
  - Government analysts with news data
  - History students and faculty with oral history data
- Views tested:
  - Timeline
  - Visualization By Example (VIBE) plot (query terms)
  - Map View
  - Named Entity view (people, places, organizations)
  - Text-dominant views:
    - Nested Lists (pre-defined clusters by contributor)
    - Common Text (on-the-fly grouping of common phrases)
Slide 34: Anecdotal Results
- 38 HistoryMakers corpus users (mostly students, 15 female, average age 24): experienced web searchers with modest digital video experience
- 6 intelligence analysts (1 female; 2 older than 40, 3 in their 30s, 1 in their 20s): very experienced text searchers, experienced web searchers, novice video searchers
- View use was minimal aside from Common Text
- Text titling and text transcripts were used frequently
- A bit of evidence for collection understanding (e.g., differences in topic between New York and Chicago), but overall, cautious use of default settings for initial trial(s)
Slide 35: Evaluation Hurdles
- How does one evaluate information visualization for promoting exploratory video search?
- Low-level simple tasks vs. complex real-world tasks
- Even the traditional measures of effectiveness, efficiency, and satisfaction are problematic: is a fast interface for exploration good or bad?
- HCI discount usability techniques offer some support, but ecological validity may limit the impact of conclusions (e.g., HCII students found Common Text well suited for History students)
- Look to the field of Visual Analytics for help, e.g., Plaisant: "first hour with the system" studies or "developer as user" insights are too limiting; rather, consider Multi-dimensional In-depth Long-term Case studies (MILC)
Slide 36: Concluding Points - 1
- "Interactive" allows human direction to compensate for automation shortcomings and varying needs:
  - Interactive fact-finding is better than automated fact-finding in visual shot retrieval (TRECVID)
  - Interactive computer vision has successes (Harry Shum at Microsoft, Michael Brown et al. at NUS)
  - Interactive view/facet control: ??? (too early to tell)
- Users need scaffolding/support to get started
- Evaluations need to run longer term, in depth, with case studies to see what has benefit (MILC)
- Keep track of facet-based interfaces, e.g., the Bungee View work by Mark Derthick (Carnegie Mellon University) on faceted web browsing of image/video resources
Slide 37: Concluding Points - 2
- Storyboards work well for visual overview
- Video surrogates can be made more effective, efficient, and satisfying when tailored to user activity (leverage context)
- The interface should provide easy tuning of precision vs. recall
- As cheap storage and transmission produce a wealth of digital video, exploratory search will gain emphasis for video repositories
- Augment automatically produced metadata with human-provided descriptors: take advantage of what users are willing to volunteer, and in fact solicit additional feedback from humans through motivating games that allow for human computation (a research focus of Luis von Ahn at Carnegie Mellon University)
Slide 38: Credits
- Many members of the Informedia Project, the CMU research community, and The HistoryMakers contributed to this work, including:
  - Informedia Project Director: Howard Wactlar
  - The HistoryMakers Executive Director: Julieanna Richardson
  - HistoryMakers beta testers: Joe Trotter (CMU History Dept.), SUNY at Buffalo and all UB Workshop participants, Schomburg Center for Research in Black Culture, NY Public Library, Randforce Associates, University of Illinois (3 campuses)
  - Informedia user interface: Ron Conescu, Neema Moraveji
  - Informedia processing: Alex Hauptmann, Ming-yu Chen, Wei-Hao Lin, Rong Yan, Jun Yang
  - Informedia library essentials: Bob Baron, Bryan Maher
- This work was supported by the National Science Foundation under Grant Nos. IIS-0205219 and IIS-0705491