Title: Exploitation of knowledge in video recordings Dr. Alexia Briassouli, Dr. Yiannis Kompatsiaris Multim
1. Exploitation of knowledge in video recordings
Dr. Alexia Briassouli, Dr. Yiannis Kompatsiaris
Multimedia Knowledge Laboratory, CERTH-ITI
- October 24, 2008
- Thessaloniki, Greece
2. Evolution of Content
- 1-2 exabytes (millions of terabytes) of new information produced worldwide annually
- 80 billion digital images are captured each year
- Over 1 billion images related to commercial transactions are available through the Internet
- This number is estimated to increase tenfold in the next two years
- 4,000 new films are produced each year
- 300,000 films available worldwide
- 33,000 television stations and 43,000 radio stations
- 100 billion hours of audiovisual content
[Figure labels: Personal Content, Sport, News, Web, Mobile, Movies]
4. Need for annotation metadata
- The value of information depends on how easily it can be found, retrieved, accessed, filtered or managed in an active, personalized way
5. Video Analysis
- Video analysis that exploits knowledge provides significant advantages
- Improved accuracy of semantics from video
- Higher-level concepts inferred through exploitation of knowledge combined with video processing
- Knowledge about behavior, event detection
- More efficient storage, access, retrieval and dissemination of multimodal data because of the (automatically generated) annotations
6. Video Analysis in JUMAS
7. Text-based indexing
- Manual annotation
- Straightforward
- High/Semantic level
- Efficient during content creation
- Most commonly used
- Necessary in a number of applications
- - Time consuming
- - Operator-application dependent
- - Text-related problems (synonyms, etc.)
- Annotation using captions and related text
- Web, Video, Documents, etc.
- Straightforward
- High/Semantic level
- Multimodal approach
- - Text processing restrictions and limitations
- - Captions must exist
8. Addressing the Semantic Gap
- Semantic Gap for multimedia: to map automatically generated numerical low-level features to higher-level, human-understandable semantic concepts
Dominant Color Descriptor of a sky region:
    <?xml version='1.0' encoding='ISO-8859-1' ?>
    <Mpeg7 xmlns>
      <DescriptionUnit xsi:type="DescriptorCollectionType">
        <Descriptor xsi:type="DominantColorType">
          <SpatialCoherency>31</SpatialCoherency>
          <Value>
            <Percentage>31</Percentage>
            <Index>19 23 29 </Index>
            <ColorVariance>0 0 0 </ColorVariance>
          </Value>
        </Descriptor>
      </DescriptionUnit>
    </Mpeg7>
"This image contains a sky region and is a holiday image"
9. Problem definition
- Semantic image analysis: how to translate the automatically extracted visual descriptions into human-like conceptual ones
- Low-level features provide cues that strengthen or weaken evidence based on visual similarity
- Prior knowledge is needed to support semantic disambiguation
10. Knowledge Extraction: A common view
[Diagram labels: Feature extraction; Text, Image analysis; Segmentation, SVMs; Evidence generation; Vehicle, Building; Reasoning; Fusion of annotations; Consistency checking; Higher-level concepts/events; Emergency scene; Classifiers fusion; Global vs. Local; Modalities fusion; Context; Ambulance; Multimedia content annotation tools; Training; (Statistical) Modeling; Domain; Multimedia content; Annotations; Algorithms; Features; Context]
11. Knowledge from Video analysis
- Semantics from video
- Implicitly derived via machine learning methods, i.e. based on training
- SVM, HMM, Neural Networks, Bayesian Networks
- Training uses appropriate data, relevant to the semantics that interest us
- Training finds models that connect low-level features (e.g. motion trajectories) with high-level annotations
- These models are then applied to test data
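The train-then-apply workflow described on this slide can be sketched in a few lines. This is only an illustration and not the method used in the presentation: a nearest-centroid classifier stands in for the SVM/HMM/network models, and the two trajectory features (mean speed, mean direction change), their values and the labels are invented for the example.

```python
import math

def train(samples):
    """Learn one centroid per label from (feature_vector, label) pairs."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict(model, features):
    """Return the label whose centroid is closest to the feature vector."""
    return min(model, key=lambda label: math.dist(model[label], features))

# Hypothetical training data: (mean speed, mean direction change) -> annotation
training_data = [
    ((0.9, 0.1), "walking"),
    ((1.1, 0.2), "walking"),
    ((3.0, 0.3), "running"),
    ((3.4, 0.2), "running"),
]
model = train(training_data)

# The learned model is then applied to unseen (test) data.
test_label = predict(model, (2.9, 0.25))
```

Any of the learning methods listed on the slide could replace `train` and `predict` here; the point is only the separation between fitting on annotated data and applying the fitted model to new data.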
12. Classification Results (aceMedia)
Segments hypothesis set:
Natural-Person      0.456798
Sailing-Boat        0.463645
Sand                0.476777
Building            0.415358
Pavement            0.454740
Road                0.503242
Body-Of-Water       0.489957
Cliff               0.472907
Cloud               0.757926
Mountain            0.512597
Sea                 0.455338
Sky                 0.658825
Stone               0.471733
Waterfall           0.500000
Wave                0.476669
Dried-Plant         0.494825
Dried-Plant-Snowed  0.476524
Foliage             0.497562
Grass               0.491781
Tree                0.447355
Trunk               0.493255
Snow                0.467218
Sunset              0.503164
Car                 0.456347
Ground              0.454769
Lamp-Post           0.499387
Statue              0.501076
13. Frame Region Concept Association
- Region feature vector formed from local descriptors
- Individual SVM introduced for every defined local concept, receiving as input the region feature vector
- Training identical to global concept training case
- Every region evaluated by all trained SVMs; the segment's local concept hypothesis set is created
Segment's hypothesis set:
Ground       0.89
Grass        0.44
Mountain     0.21
Boat         0.07
Smoke        0.41
Dirty-Water  0.18
Trunk        0.12
Foam         0.19
Debris       0.34
Mud          0.31
Water        0.42
Sky          0.22
Ashes        0.11
Subtitles    0.24
Flames       0.13
Vehicle      0.12
Building     0.25
Foliage      0.84
Person       0.32
Road         0.39
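A hypothesis set of this kind can be produced by evaluating one classifier per concept on the same region feature vector. The sketch below assumes linear classifiers whose decision value is squashed to (0, 1) with a logistic function; the slides do not state the kernel or how confidences are obtained, so that mapping, the weights and the two-element feature vector are all hypothetical.

```python
import math

def svm_confidence(weights, bias, features):
    """Linear decision value w.x + b squashed to (0, 1) with a logistic function."""
    decision = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-decision))

def hypothesis_set(classifiers, features):
    """Evaluate the region with every concept classifier, best score first."""
    scores = {c: svm_confidence(w, b, features) for c, (w, b) in classifiers.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical per-concept linear models: concept -> (weights, bias)
classifiers = {
    "Sky": ([2.0, -1.0], 0.0),
    "Grass": ([-1.0, 2.0], 0.0),
    "Road": ([0.5, 0.5], -1.0),
}
region_features = [0.8, 0.1]  # invented feature vector, e.g. (blueness, greenness)
hypotheses = hypothesis_set(classifiers, region_features)
```

Each region keeps the full ranked list rather than only the top concept, so that later reasoning steps can revise the initial association.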
14. Initial Region-Concept Association
- Region feature vector formed from local descriptors
- Individual SVM introduced for every defined concept, receiving as input the region feature vector
- Training identical to global training case
- Every region evaluated by all trained SVMs; the segment's concept hypothesis set is created
Segment's hypothesis set:
Building     0.89
Roof         0.29
Grass        0.21
Tree         0.07
Stone        0.41
Ground       0.15
Dried-plant  0.12
Sky          0.19
Person       0.34
Trunk        0.31
Vegetation   0.42
Rock         0.22
Boat         0.11
Sand         0.44
Sea          0.13
Wave         0.12
15. Knowledge for Video analysis
- Explicit semantics from video
- Based on previously known models
- Explicitly defined models, rules, facts
- Rules from preliminary scripts and standards from similar cases
- Explicit and implicit knowledge can be combined with results from low-level video processing to extract meaningful high-level knowledge
16. System Overview
17. Video analysis
- Motion Analysis
- Motion detection
- Tracking
- Detection of when motion occurs
- Motion Segmentation
- Object segmentation based on motion characteristics
- Generation of active regions
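One simple way to detect motion and accumulate it into active regions is frame differencing. The sketch below is an assumption about how such a step could look, not the algorithm used in the presentation; the threshold value and the tiny synthetic frames are invented for illustration.

```python
def motion_mask(prev_frame, curr_frame, threshold=20):
    """Mark pixels whose grey level changed by more than `threshold`."""
    return [
        [abs(c - p) > threshold for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev_frame, curr_frame)
    ]

def activity_area(frames, threshold=20):
    """Count, per pixel, in how many frame transitions motion was detected."""
    height, width = len(frames[0]), len(frames[0][0])
    counts = [[0] * width for _ in range(height)]
    for prev_frame, curr_frame in zip(frames, frames[1:]):
        mask = motion_mask(prev_frame, curr_frame, threshold)
        for y in range(height):
            for x in range(width):
                counts[y][x] += mask[y][x]
    return counts

# Tiny synthetic sequence: a bright pixel moves left to right along the top row
frames = [
    [[200, 0, 0], [0, 0, 0]],
    [[0, 200, 0], [0, 0, 0]],
    [[0, 0, 200], [0, 0, 0]],
]
area = activity_area(frames)
```

Pixels with a non-zero count form the active region; the frame indices at which the mask is non-empty indicate when motion occurs.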
18. Activity Areas from motion analysis
19. Sub-activity Areas
- After statistical processing for temporal localization of motion and events
People walking towards each other
People leave together
People meet
20. Fight Sequence
21. Video Processing (1)
- Pre-processing
- Separate video from audio
- Split video into frames
- Noise removal via spatiotemporal filtering
- Scene/shot detection
- Shot: frames taken by a single camera
- Detect transitions between frames
- Uses only low-level information
- Scene: story-telling unit
- Uses higher-level knowledge, semantics
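Shot detection from low-level information alone is often illustrated with histogram differences between consecutive frames. The following is a minimal sketch under that assumption; the slides do not specify the detector, and the bin count, threshold and toy frames are hypothetical.

```python
def histogram(frame, bins=4, max_value=256):
    """Normalised grey-level histogram of a frame given as rows of pixel values."""
    counts = [0] * bins
    pixels = [value for row in frame for value in row]
    for value in pixels:
        counts[value * bins // max_value] += 1
    return [c / len(pixels) for c in counts]

def shot_boundaries(frames, threshold=0.5):
    """Indices i where frame i starts a new shot (large histogram change)."""
    boundaries = []
    hists = [histogram(f) for f in frames]
    for i in range(1, len(hists)):
        # Half the L1 distance between normalised histograms lies in [0, 1].
        diff = sum(abs(a - b) for a, b in zip(hists[i - 1], hists[i])) / 2
        if diff > threshold:
            boundaries.append(i)
    return boundaries

dark = [[10, 20], [30, 15]]
bright = [[220, 240], [250, 230]]
video = [dark, dark, bright, bright]
cuts = shot_boundaries(video)
```

Grouping the resulting shots into scenes is a different problem: as the slide notes, it needs higher-level knowledge and cannot be solved by a histogram threshold.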
22. Video Processing (2)
- Spatial segmentation
- Spatial segmentation in images, video frames
- Extracts object(s) based on color, texture features
- Motion segmentation
- Groups pixels with similar motion
- Spatiotemporal segmentation
- Finds objects over several frames through combination of motion, appearance features
- Merges spatial and motion segmentation results
23. Knowledge in Video Analysis (1)
- Low-level features can be combined with knowledge/rules for higher-level results
- Spatiotemporally segmented objects can be used for object recognition
- Face/gesture recognition after training with faces/gestures of significance
- Motion in specific parts of a video (e.g. near the court entrance, near the prisoner's seat) has additional significance
- Needs prior knowledge of which parts of the video frames are important and why
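Prior knowledge about important frame areas can be encoded as named zones and checked against detected motion. The sketch below assumes rectangular zones and motion reported as centroid coordinates; the zone names follow the slide's examples, while the coordinates and the function name are invented.

```python
# Hypothetical zones as (x_min, y_min, x_max, y_max) in pixel coordinates
ZONES = {
    "court entrance": (0, 0, 100, 200),
    "prisoner's seat": (300, 150, 400, 250),
}

def zones_with_motion(motion_centroids, zones=ZONES):
    """Return the names of zones containing at least one motion centroid."""
    hits = set()
    for x, y in motion_centroids:
        for name, (x0, y0, x1, y1) in zones.items():
            if x0 <= x <= x1 and y0 <= y <= y1:
                hits.add(name)
    return sorted(hits)

# Two detected motion centroids; only the first falls inside a known zone.
detected = zones_with_motion([(50, 120), (220, 40)])
```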
24. Knowledge in Video Analysis (2)
- Knowledge structures can provide additional information about the relations between different low-level features
- Interactions, e.g. two motions in opposite directions or the relation of extracted gestures, may mean something: people meeting, fighting, pointing, gesticulating
- Face recognition combined with prior knowledge can show who is present when an event occurs
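The idea that two motions in opposite directions may indicate a meeting can be expressed as an explicit rule over tracked positions. This is an illustrative, hand-written rule rather than the presentation's actual knowledge structure; the distance threshold, track format and label strings are assumptions.

```python
import math

def displacement(track):
    """Overall motion vector of a track given as a list of (x, y) positions."""
    (x0, y0), (x1, y1) = track[0], track[-1]
    return (x1 - x0, y1 - y0)

def classify_interaction(track_a, track_b, meet_distance=1.0):
    """Label two tracks using a hand-written rule (illustrative only)."""
    ax, ay = displacement(track_a)
    bx, by = displacement(track_b)
    opposite = ax * bx + ay * by < 0  # negative dot product: opposing directions
    start_gap = math.dist(track_a[0], track_b[0])
    end_gap = math.dist(track_a[-1], track_b[-1])
    if opposite and end_gap < start_gap:
        if end_gap <= meet_distance:
            return "people meet"
        return "people walking towards each other"
    return "no interaction"

person_a = [(0, 0), (2, 0), (4, 0)]
person_b = [(9, 0), (7, 0), (5, 0)]
label = classify_interaction(person_a, person_b)
```

Distinguishing a meeting from a fight would need further cues (for example gesture or motion intensity), which is where trained models and explicit rules can complement each other.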
25. Conclusions
- Combined use of video processing with knowledge can lead to richer and more accurate high-level descriptions of multimedia data
- Can be used in many more applications than currently, because the knowledge introduces flexibility and adaptability to the system
- The same algorithms and low-level features can provide much more information when used in combination with explicit and implicit knowledge
26. Thank you!
CERTH-ITI / Multimedia Knowledge Laboratory
http://mklab.iti.gr
27. Video Analysis: State of the Art
- Spatiotemporal segmentation
- Find spatiotemporally homogeneous objects, i.e. similar appearance and motion
- Apply spatial segmentation on each frame
- Match segmented objects in successive frames using low-level features (e.g. similar color, texture, continuous motion)
- Use motion information: project position of object in current/next frames
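Matching segmented objects across frames with projected positions can be sketched as a greedy assignment. The example assumes constant velocity and a single scalar colour feature per object; the cost function, weights and data are hypothetical and stand in for whatever features a real system would use.

```python
import math

def predict_position(position, velocity):
    """Project an object's position into the next frame with constant velocity."""
    return (position[0] + velocity[0], position[1] + velocity[1])

def match_objects(tracked, detections, max_cost=50.0, color_weight=0.5):
    """Greedily match tracked objects to next-frame segments.

    Cost = distance to the predicted position + weighted mean-colour difference.
    """
    matches, free = {}, set(range(len(detections)))
    for name, obj in tracked.items():
        predicted = predict_position(obj["position"], obj["velocity"])
        best, best_cost = None, max_cost
        for i in free:
            det = detections[i]
            cost = (math.dist(predicted, det["position"])
                    + color_weight * abs(obj["color"] - det["color"]))
            if cost < best_cost:
                best, best_cost = i, cost
        if best is not None:
            matches[name] = best
            free.remove(best)
    return matches

tracked = {
    "A": {"position": (10, 10), "velocity": (5, 0), "color": 200},
    "B": {"position": (40, 30), "velocity": (-5, 0), "color": 50},
}
detections = [
    {"position": (36, 30), "color": 55},
    {"position": (15, 11), "color": 198},
]
matches = match_objects(tracked, detections)
```

A greedy pass is the simplest choice; an optimal assignment method would be preferable when many objects compete for the same segments.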
28. Video Analysis: State of the Art
29. Video Analysis: State of the Art
30. Video Analysis: State of the Art
- Spatial segmentation
- Spatial segmentation in images, video frames
- Region-based: most methods group similar features like color, texture, location, based on homogeneity of intensity, texture, position
- Gradient/edge-based: detecting changes in the spatial distribution of features, e.g. pixel illumination
- Some methods combine region/edge information
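As a concrete instance of the region-based family, the sketch below grows 4-connected regions of similar intensity with a breadth-first search. It is a minimal illustration on grey levels only; the tolerance and the 3x3 image are invented, and real methods would also use color, texture and position.

```python
from collections import deque

def region_grow(image, tolerance=10):
    """Label 4-connected regions whose neighbouring pixels differ by <= tolerance."""
    height, width = len(image), len(image[0])
    labels = [[0] * width for _ in range(height)]  # 0 means "not yet labelled"
    next_label = 0
    for sy in range(height):
        for sx in range(width):
            if labels[sy][sx]:
                continue
            next_label += 1
            labels[sy][sx] = next_label
            queue = deque([(sy, sx)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not labels[ny][nx]
                            and abs(image[ny][nx] - image[y][x]) <= tolerance):
                        labels[ny][nx] = next_label
                        queue.append((ny, nx))
    return labels

image = [
    [200, 205, 50],
    [198, 202, 52],
    [60, 55, 48],
]
labels = region_grow(image)
```

The same intensity differences that stop the growth here are what a gradient/edge-based method would detect directly, which is why the two kinds of information are often combined.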