Content based multimedia Signal processing
1
Content based multimedia Signal processing
Feng-Chia University, August 11, 2004
  • Yu Hen Hu
  • University of Wisconsin Madison
  • Dept. Electrical and Computer Engineering
  • Madison, WI 53706
  • hu@engr.wisc.edu

2
Outline
  • Content-based Multimedia Signal Processing
  • Content
  • Multimedia signal processing
  • Potential Applications of CBMSP
  • MM representation,
  • MPEG-7
  • Object based organization
  • Syntactic structure
  • Semantic structure
  • MM database
  • Query
  • User profile
  • Relevance feedback
  • Similarity measure
  • Personalized MM services
  • Filtering
  • Authoring/composing
  • Intelligent surveillance
  • Object recognition
  • Action recognition
  • Event recognition
  • Content-assisted MSP
  • Post-processing
  • coding

3
What is Content?
  • (Digital) content: the syntactic and semantic
    information inherent in digital material.
  • Example: text documents, say email
  • Syntactic content: headers, fields, protocols
  • Semantic content: key words, subject, types of
    email (information/expression, etc.)
  • Example: multimedia documents, movie clips
  • Syntactic content: scene cuts, shots
  • Semantic content: motion, summary, index,
    caption, etc.

4
Content-based Multimedia Information Processing
(CBMIP)
  • Why do we need to know content?
  • Information processing, in terms of creation,
    archiving, indexing, delivering, accessing and
    other processing, requires in-depth knowledge of
    content to optimize performance.
  • What is CBMIP?
  • Use content information to enable personalized,
    intelligent processing of multimedia information

5
Current CBMIP Research
  • Structure Analysis
  • Shots, scene cut detection, etc.
  • Breaking media into atomic units
  • Reverse-engineer the media capture and editing
    process
  • Object processing
  • Breaking individual frames into objects
  • Object based content description
  • Foreground/background separation
  • Media skim generation
  • Enable browsing with light-weight devices
  • Summary, fast forward, preview generation
  • Semantic content exploited to match user needs
  • Semantic annotation
  • Recognize generic semantic content
  • Storytelling by computer
  • Appearance, situation, scene understanding
  • Detection, classification, tracking of objects
    and events

Chang, S. F., IEEE Multimedia Magazine, 2002
6
CBMIP Research That Makes Sense
  • Producing meta-data that is not readily
    available
  • If editing decisions are already recorded, there
    is no need for reverse engineering
  • Producing meta-data that a human operator has
    difficulty generating
  • E.g. low-level features, texture, etc.
  • Quantified measurements
  • Annotating content of large volume and low
    individual value
  • News archives, etc.: things that accumulate and
    whose usefulness is not known in advance

Chang, S. F., IEEE Multimedia Magazine, 2002
7
CB Multimedia Information Retrieval
  • Multimedia catalog indexing and shopping
  • Color and texture matching
  • Dining sets, wallpaper, carpet, drapery, cloth,
    etc.
  • Paintings, stock photos
  • 3D shape matching
  • Tools, nails, bolts
  • Vase, decoration, furniture
  • Melody
  • Karaoke
  • Imitation detection for IP protection
  • Logos, trademarks
  • Jewelry, paintings, artwork
  • Songs, poems, articles

8
CB Multimedia Surveillance
  • Purposes
  • Detect intrusion and illegal action so it can be
    stopped on the spot
  • Preserve relevant information for future
    references
  • Requirements
  • Understanding specific content of video/audio
  • Semantically meaningful recording and compression
    of data
  • CB indexing and summary for easy retrieval of
    archived information

9
Surveillance Applications
  • Traffic
  • Accidents, congestion, dangerous behavior
  • Security
  • Airport, train station, public buildings
  • Shopping mall, department store
  • Defense
  • Border patrol
  • Health care
  • Home care
  • Life sign monitoring
  • Smart living space
  • Monitoring human activities and taking
    appropriate actions
  • Emotion, gesture, voice, speech, body movement
    recognition
  • Suitable for monitoring tasks that involve
  • Large areas
  • Many objects present
  • Prolonged durations

10
CB Multimedia Authoring
  • Given home videos, raw video clips,
  • Perform laborious tasks for the human author
  • Segment video into shots and scenes
  • Annotate individual shots with semantic
    description
  • Automatic generation of draft of closed caption
  • Index individual shots with meta data
  • Assist user in composing
  • Story board
  • Associate shots with script
  • Formatting individual shots
  • Length stretching, shortening
  • Transition
  • Content manipulation
  • E.g. remove that unwanted person in the
    background.

11
MPEG-7 Overview
  • Objective
  • Provide inter-operability among systems and
    applications used in generation, management,
    distribution, and consumption of audio-visual
    content description.
  • Help users identify, retrieve, or filter
    audio-video information.
  • Requirement of Content Descriptors
  • Object oriented multilevel abstraction
  • Generic applications
  • Effective, comprehensive, flexible, extensible,
    scalable, and simple
  • Use XML (extensible markup language)

12
Potential Application of MPEG-7
  • Summary,
  • Generation of multimedia program guide or content
    summary
  • Generation of content description of A/V archive
    to allow seamless exchange among content creator,
    aggregator, and consumer.
  • Filtering
  • Filter and transform multimedia streams in
    resource limited environment by matching user
    preference, available resource and content
    description.
  • Retrieval
  • Recall music using samples of tunes
  • Recall pictures using sketches of shape, color,
    movement, or a description of the scenario
  • Recommendation
  • Recommend program materials by matching user
    preference (profile) to program content
  • Indexing
  • Create family photo or video library index

13
Content descriptions
  • Descriptors
  • MPEG-7 contains standardized descriptors for
    audio, visual, generic contents.
  • Standardize how these content features are
    characterized, but not how they are extracted.
  • Different levels of syntax and semantic
    descriptions are available
  • Description Scheme
  • Specify the structure and relations among
    different A/V descriptors
  • Description Definition Language (DDL)
  • Standardized language based on XML (eXtensible
    Markup Language) for defining new Ds and DSs, or
    extending and modifying existing Ds and DSs.

15
Visual Color Descriptors
  • Color space: HSV (hue-saturation-value)
  • Scalable color descriptor (SCD): a color
    histogram (uniformly quantized, 256 bins) of an
    image in HSV space, encoded by a Haar transform.
  • Color layout descriptor (CLD)
  • spatial distribution of color in an arbitrarily
    shaped region.
  • Dominant color descriptor (DCD)
  • colors are clustered first.
  • Color structure descriptor (CSD)
  • scan an 8x8 block in a sliding window, and count
    each particular color in the window.
  • Group of Frame/Group of Picture color descriptor
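The SCD bullets above can be sketched as a toy quantizer: a uniform 16 x 4 x 4 = 256-bin HSV histogram. This is illustrative only; the Haar encoding step is omitted and the exact bin split is an assumption, not the normative MPEG-7 extraction.

```python
import colorsys

def scd_histogram(pixels):
    """Toy scalable-color-style histogram: quantize HSV into
    16 x 4 x 4 = 256 uniform bins and count pixels per bin.
    pixels: iterable of (r, g, b) tuples with components in [0, 1]."""
    hist = [0] * 256
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r, g, b)
        bin_h = min(int(h * 16), 15)   # 16 hue bins
        bin_s = min(int(s * 4), 3)     # 4 saturation bins
        bin_v = min(int(v * 4), 3)     # 4 value bins
        hist[bin_h * 16 + bin_s * 4 + bin_v] += 1
    return hist
```

For example, a pure-red pixel (h=0, s=1, v=1) lands in bin 0*16 + 3*4 + 3 = 15.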

16
Visual Texture Descriptor
  • Texture Browsing D.
  • Regularity
  • 0 (irregular) to 3 (periodic)
  • Directionality
  • Up to 2 directions
  • 1-6, in 30° increments
  • Coarseness
  • 0 (fine) to 3 (coarse)
  • Edge histogram D.
  • 16 sub-images
  • 5 (edge-direction) bins per sub-image
  • Homogeneous Texture D. (HTD)
  • Divide frequency space into 30 bins (5 radial, 6
    angular)
  • 2D Gabor filter bank applied to each bin
  • Energy and energy deviation in each bin computed
    to form the descriptor.
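The edge histogram's five direction categories can be illustrated by picking the strongest of five 2x2 filter responses, as commonly described for the EHD. The filter weights below are a textbook variant, and `edge_type` is a hypothetical helper, not the standard's normative procedure.

```python
import math

# Approximate 2x2 edge filters, row-major: [a, b, c, d]
FILTERS = {
    "vertical":       [1, -1, 1, -1],
    "horizontal":     [1, 1, -1, -1],
    "diag_45":        [math.sqrt(2), 0, 0, -math.sqrt(2)],
    "diag_135":       [0, math.sqrt(2), -math.sqrt(2), 0],
    "nondirectional": [2, -2, -2, 2],
}

def edge_type(block):
    """Classify a 2x2 block of sub-block mean intensities into one
    of 5 edge categories by the largest absolute filter response."""
    return max(FILTERS,
               key=lambda k: abs(sum(f * p for f, p in zip(FILTERS[k], block))))
```

Counting these labels over the 2x2 blocks of each of the 16 sub-images would yield the 16 x 5 = 80-bin histogram.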

17
Visual Shape Descriptor
  • 3D Shape D.: shape spectrum
  • Histogram (100 bins, 12 bits/bin) of a shape
    index, computed over a 3D surface.
  • Each shape index measures local convexity.
  • Region-based D.: ART
  • Angular radial transform
  • Shape analysis based on moments
  • ART basis:
  • Vnm(ρ, θ) = exp(jmθ) Rn(ρ)
  • Rn(ρ) = 2 cos(nπρ) for n ≠ 0; Rn(ρ) = 1 for n = 0
  • Contour-based shape descriptor
  • Curvature scale space (CSS)
  • N points per curve, successively smoothed with
    the kernel (0.25, 0.5, 0.25) until the curve
    becomes convex.
  • The curvature at each point forms the curvature
    at that scale.
  • Peaks at each scale are used as features
  • 2D/3D descriptors
  • Use multiple 2D descriptors to describe 3D shape

18
Visual Motion Descriptor
  • Motion activity D.
  • Intensity
  • Direction of activity
  • Spatial distribution of activity
  • Temporal distribution of activity
  • Camera motion
  • Panning
  • Booming (lift up)
  • Tracking
  • Tilting
  • Zooming
  • Rolling (around image center)
  • Dollying (backward)

[Diagram: motion descriptors organized by scope. Video segment:
camera motion, motion activity, warping parameters (w.r.t. mosaic).
Moving region: motion trajectory, parametric motion.]

19
MPEG-7 Audio Content Descriptors
  • Spoken content Ds
  • Speaker type
  • Link type
  • Extraction info type
  • Confusion info type
  • Timbre Ds
  • Instrument
  • Harmonic instrument
  • Percussive instrument
  • Melody contour Ds
  • Contour
  • Meter
  • beat
  • 4 classes of audio signals
  • Pure music
  • Pure speech
  • Pure sound effect
  • Arbitrary sound track
  • Audio descriptors
  • Silence Ds silencetype
  • Sound effect Ds
  • Audio Spectrum
  • Sound effect features

20
Spoken content description
[Diagram: speech waveform → audio processing / ASR → MPEG-7 encoder,
producing a spoken-content header and lattice]
  • Spoken content Header
  • Word lexicon (vocabulary)
  • Phone lexicon
  • IPA (International Phonetic Alphabet)
  • SAMPA (Speech Assessment Methods Phonetic
    Alphabet)
  • Phone confusion statistics
  • Speaker
  • Spoken content lattice (word or phone)
  • Lattice Node
  • Word and phone link

  • Goal: to support robust retrieval over the
    potentially erroneous transcriptions extracted by
    an automatic speech recognition (ASR) system.

[Diagram: example spoken-content lattice with word hypotheses and
probabilities, e.g. IS (P=0.7), HIS (P=0.3), BORE (P=0.6)]
21
Multimedia Content Analysis (1)
  • What to do with the content features?
  • Structure analysis
  • Parsing multimedia object into pre-defined
    structures
  • Automated organization of information
  • Sometimes a reverse engineering task
  • Example
  • Parse email into parts
  • Parse video into shots, scene, etc.
  • Parse song into paragraph, sentence
  • Method
  • Detecting syntactic, semantic discontinuity
  • Fitting into pre-defined meta structure
  • Summary and Skimming
  • Examples
  • Thumbnails of an image,
  • Preview of a movie
  • Abstract of an article
  • Challenge
  • Extract semantically important information

22
Multimedia Content Analysis (2)
  • Information filtering
  • Examples
  • Junk email blocking
  • Personalized TV programming
  • Challenges
  • Matching users' preferences to content
  • Information retrieval
  • Example
  • Find similar picture, songs
  • Challenges
  • Similarity measures
  • Object and event recognition
  • Examples
  • Face recognition
  • Event detection
  • Object tracking
  • Challenges
  • Temporal and spatial features
  • Multi-modality features

23
SBD Problem
  • Shots
  • In professional video, a shot is often taken
    while the camera is stationary or in a regular
    movement
  • The on and off of the camera defines a shot.
    Hence, shot boundaries may be recorded during
    video acquisition.
  • A shot is also the basic unit of video editing.
    Hence shot boundaries may be recorded during
    editing process
  • Why SBD then?
  • Process existing archives of news, TV programs
    where shot information is not available
  • Process raw video clips where the recorded shots
    may not be the best unit for manipulating video

24
Shot Boundary Detection
  • Shot
  • A sequence of frames captured by one camera in a
    single continuous action in time and space
  • Shot boundary detection
  • Temporal segmentation of video
  • Types
  • Cut: abrupt changes
  • Gradual: fade-in, fade-out, dissolve
  • Methods
  • Statistical change detection of multi-dimensional
    time series
  • Content features
  • Color histogram
  • Edge pixels
  • Dominant motion vectors and residual errors
  • Feature point extraction
  • Distance measure
  • Feature dependent
  • May not lie in a normed vector space

25
SBD Methods
  • Features
  • Intensity: absolute difference of intensity
  • Robust intensity: count of pixels whose
    intensities differ by more than a threshold
  • Motion: amount of motion between corresponding
    blocks
  • Color histogram
  • Edges: numbers and positions
  • Gradual shot transitions
  • Dissolve, fade, wipe, etc.
  • Extend over multiple frames
  • The discontinuity between the two shots is spread
    over a longer interval.
  • One way is to model the transition as a separate
    shot
  • Need to model how the features characterizing
    different shots vary for different types of
    transition effects.
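A minimal sketch of histogram-based cut detection along the lines above: flag a boundary wherever the L1 distance between consecutive frames' normalized color histograms exceeds a threshold. The function name and the threshold value are assumptions; real systems tune the threshold (often adaptively) per sequence.

```python
def detect_cuts(frame_hists, threshold=0.5):
    """Flag a cut between frames i-1 and i when the L1 distance
    between their color histograms exceeds `threshold`.
    frame_hists: list of equal-length histogram lists, one per frame."""
    cuts = []
    for i in range(1, len(frame_hists)):
        d = sum(abs(a - b) for a, b in zip(frame_hists[i - 1], frame_hists[i]))
        if d > threshold:
            cuts.append(i)   # cut occurs at frame index i
    return cuts
```

Gradual transitions would spread this distance over many frames, which is why they need the multi-frame modeling the slide mentions.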

26
Shot Boundary Detection Example
  • Dissolve detection
  • L1 norm of the 1st and 2nd temporal derivatives
    of the image
  • Flash detection
  • Sudden intensity increase over 1-2 frames
  • Cut detection
  • Motion-compensated image difference

http://www-nlpir.nist.gov/projects/tvpubs/papers/clipsimag.paper.pdf
27
2003 TREC-VID Results
  • TREC-VID
  • Text REtrieval Conference (TREC)
  • Video retrieval independent evaluation
  • funded by ARDA/NIST
  • 2003 test
  • 133 hours of MPEG-1 video
  • 1998 ABC/CNN news and C-SPAN
  • 13 videos, 4.9 GB
  • 24 groups participated
  • 4 tasks
  • Shot boundary detection
  • High-level feature extraction (17 features)
  • Story segmentation and classification
  • Search (manual and interactive)

http://www-nlpir.nist.gov/projects/trecvid/
28
TRECVID 2003 Shot Boundary Detection
  • Tasks
  • Identify video shot transitions
  • Classify each identified transition into
  • Cut
  • Dissolve
  • Fadeout/in
  • Others
  • 596,604 frames were used for the SBD task in
    2003, containing 3734 shot transitions
  • Ground truth
  • Manually created
  • 3734 transitions
  • 70% cuts
  • 20% dissolves
  • 4% fades
  • 6% others
  • Performance criteria
  • Precision
  • # of correct transitions / # of transitions
    reported
  • Recall
  • # of correct transitions / # of transitions in
    the ground truth
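The precision/recall criteria can be written out directly. The exact-frame matching below is a simplifying assumption; TRECVID's evaluation allows a small frame tolerance when matching reported boundaries to the ground truth.

```python
def precision_recall(reported, ground_truth, tolerance=0):
    """Precision = correct / reported; recall = correct / ground truth.
    A reported boundary counts as correct if it lies within
    `tolerance` frames of some ground-truth boundary."""
    correct = sum(1 for r in reported
                  if any(abs(r - g) <= tolerance for g in ground_truth))
    return correct / len(reported), correct / len(ground_truth)
```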

29
TRECVID 03 Results for Cuts (Zoomed)
http://www-nlpir.nist.gov/projects/tvpubs/papers/tv3.overview.slides.pdf
30
TRECVID 03 Results for Gradual Transitions
http://www-nlpir.nist.gov/projects/tvpubs/papers/tv3.overview.slides.pdf
31
Observations
  • Most techniques are based on frame-to-frame
    comparisons, some with sliding windows
  • Comparisons are mostly based on color and on
    luminance
  • Some use adaptive thresholding, some don't
  • Most operate on the decoded video stream
  • Some have special treatment of motion during GTs,
    of flashes, of camera wipes
  • Performance is getting better

http://www-nlpir.nist.gov/projects/tvpubs/papers/tv3.overview.slides.pdf
32
Key frame Selection
  • Key-frame
  • A typical frame of a shot,
  • Representing a salient content feature
  • May have more than one key-frame per shot
  • Method
  • Fixed: e.g. first or middle frame of each shot
  • Frames closest to the cluster center of a shot
    sequence
  • A new key frame is generated if the video
    complexity of the current frame deviates from the
    key frame by more than a threshold
  • Minima of motion
  • Future directions
  • Exploit semantic information: recognize faces,
    activities
  • Higher-level syntactic structures: video editing
    effects (switching, split frame, etc.)
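The threshold-based selection method mentioned above can be sketched in a few lines. The feature representation (one vector per frame) and the threshold value are assumptions for illustration.

```python
def select_key_frames(features, threshold):
    """Threshold-based key-frame selection: the first frame is a key
    frame; declare a new key frame whenever the current frame's L1
    feature distance from the last key frame exceeds `threshold`.
    features: list of per-frame feature vectors (lists of floats)."""
    keys = [0]
    for i in range(1, len(features)):
        d = sum(abs(a - b) for a, b in zip(features[keys[-1]], features[i]))
        if d > threshold:
            keys.append(i)
    return keys
```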

33
Key Frame Selection Methods
  • Approach 1
  • Assume video complexity is a function of time,
    to be approximated piecewise-linearly.
  • Key frames are the knot points
  • Approach 2
  • Cluster video frames
  • Close in time indices
  • Similar in content
  • Question How many key frames are needed?

34
Key Frame Selection Example
  • Perceived Motion Energy (PME)
  • Product of the fraction of dominant motion
    vectors and the magnitude of the motion vectors
  • Relates to the amount of motion perceived by the
    viewer
  • The PME curve typically shows multiple triangular
    patterns over time for a video
  • The peak of each triangle is chosen as a key
    frame

Liu, et al, IEEE Trans. CSVT, Oct 2003
35
Content-based Retrieval
[Diagram: CBR input module. Multimedia data undergoes feature
extraction; the features are stored in a feature database and the
media in an image database.]
36
Multimedia CBR System Design Issues
  • Requirement analysis
  • How the multimedia materials are to be used
  • Determines what set of features are needed.
  • Archiving
  • How should individual objects be stored? At what
    granularity?
  • Indexing (query) and retrieving
  • With multi-dimensional indices, what is an
    effective and efficient retrieval method?
  • What is a suitable perceptually-consistent
    similarity measure?
  • User interface
  • Modality? Text or spoken language or others?
  • Interactive or batch? Will dialogue be available?

37
Indexing and Retrieving
  • Index
  • A very high dimensional binary vector
  • Encoding of content features
  • Text-based content can be represented with term
    vectors
  • A/V content features can be either Boolean
    vectors or term vectors
  • Retrieval
  • Retrieval is a pattern classification problem
  • Use index vector as the feature vector
  • Classify each object as relevant or irrelevant
    to a query vector (template)
  • A perceptually consistent similarity measure is
    essential

38
Term Vector Query
  • Each document is represented by a specific term
    vector
  • A term is a key-word or a phrase
  • A term vector is a vector of terms. Each
    dimension of the vector corresponds to a term.
  • The dimension of a term vector = the total number
    of distinct terms.
  • Example
  • Set of terms: {tree, cake, happy, cry, mother,
    father, big, small}
  • Documents: "Father gives me a big cake. I am so
    happy." "Mother planted a small tree."
  • Term vectors: (0, 1, 1, 0, 0, 1, 1, 0) and
    (1, 0, 0, 0, 1, 0, 0, 1)
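A minimal binary term-vector builder reproduces the example above; the whitespace-and-punctuation tokenization is a simplistic assumption.

```python
def term_vector(terms, document):
    """Binary term vector: component t is 1 if term t occurs in the
    document (case-insensitive word match), else 0."""
    words = {w.strip(".,").lower() for w in document.split()}
    return [1 if t in words else 0 for t in terms]
```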

39
Inverse Term Frequency Vector
  • A probabilistic term vector representation.
  • Relative term frequency (within a document)
  • tf(t, d) = (count of term t in d) / (# of terms
    in document d)
  • Inverse document frequency
  • df(t) = (total # of documents) / (# of documents
    containing t)
  • Weighted term frequency
  • d_t = tf(t, d) · log df(t)
  • Inverse document frequency term vector: D = (d_1,
    d_2, ...)
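The weighted term frequency d_t = tf(t, d) · log df(t) can be computed directly. Note the useful property: a term occurring in every document gets df = 1 and hence weight 0, carrying no discriminative information. Function and variable names here are illustrative.

```python
import math

def weighted_tf(docs):
    """Compute d_t = tf(t, d) * log(df(t)) for each document, where
    tf(t, d) is the relative frequency of t in d and df(t) is the
    total number of documents divided by the number containing t
    (the slide's convention, commonly known as TF-IDF).
    docs: list of token lists. Returns a dict of weights per doc."""
    n_docs = len(docs)
    vocab = sorted({t for d in docs for t in d})
    vectors = []
    for d in docs:
        vec = {}
        for t in vocab:
            tf = d.count(t) / len(d)
            df = n_docs / sum(1 for d2 in docs if t in d2)
            vec[t] = tf * math.log(df)
        vectors.append(vec)
    return vectors
```

On the slide 40 example (after stop-word removal), "great" appears in all three documents and so gets weight 0 everywhere.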

40
ITF Vector Example
  • Document 1: "The weather is great these days."
  • Document 2: "These are great ideas."
  • Document 3: "You look great."
  • Eliminate stop words: "the", "is", "these",
    "are", "you"

41
Content-based Querying
  • Keywords
  • Most natural for user
  • Semantic abstraction of content
  • Meta data
  • Ontology needed
  • Same thing, same name
  • Examples
  • Convenient for properties that are difficult to
    put into words
  • Texture, shape, color
  • Require at least one good example
  • Need to specify features that need to be matched.
  • Icon
  • Good for expressing spatial relations
  • Limited vocabulary; the need to search for an
    appropriate icon may become a retrieval problem
    itself
  • Example: composite picture of a suspect
  • Sketch
  • Requires skill
  • Non-standard; the sketch must be recognized first
  • Suitable for shape features

42
User Models
  • User Profiles
  • Categorize users using features relevant to tasks
  • Static features age, sex, etc.
  • Dynamic features activity logs, etc.
  • Derived features skill levels, preferences, etc.
  • Use of Profiles for HCI
  • Adaptation: customize HCI for different
    categories of users
  • Better understanding of users' needs

43
Content-Based Visual Query (1)
  • Advantage
  • Ease of creating, capturing, and collecting
    digital imagery
  • Approaches
  • Extract significant features (Color, Texture,
    Shape, Structure)
  • Organize Feature Vectors
  • Compute the closeness of the feature vectors
  • Retrieve matched or most similar images

44
Content-Based Visual Query (2): Improving Efficiency
  • Keyword-based search
  • Match images with particular subjects and narrow
    down the search scope
  • Clustering
  • Classify images into various categories based on
    their contents
  • Indexing
  • Applied to the image feature vectors to support
    efficient access to the database

45
Conceptual structure of the meta-search database.
46
Object and Event Recognition
  • High level understanding of the context of the
    multimedia object.
  • Most useful, yet most challenging!
  • Relates to vision research, object recognition,
    speech recognition, emotion computing,
  • Current research directions
  • Face recognition
  • Event detection
  • Story segmentation

47
Face Recognition
  • Challenging conditions
  • Lighting and shade variations, pose variations,
    time lapse, distance/resolution, disguise,
    occlusion
  • Pre-processing
  • Face detection
  • 3D model
  • Features
  • Eigen-face
  • Elastic branch net
  • 3D features
  • Invariant features
  • Classifiers
  • Template matching, ML, Bayes, SVM, neural net,
    fuzzy logic
  • Applications to CBMIP
  • Detecting all human faces or a particular human
    face from image or video sequences
  • Identify all faces belonging to the same person
  • Difficulties
  • Controlled
  • Semi-controlled
  • Uncontrolled.

48
Multi-AV-Sensor Surveillance Event Mining
http://www.ee.columbia.edu/dvmm/
49
CB Video Surveillance Architecture
Foresti, et al, IEEE Trans. MM, Dec. 2002
50
Abandoned Object Detection
Foresti, et al, IEEE Trans. MM, Dec. 2002
51
Event Detection
Foresti, et al, IEEE Trans. MM, Dec. 2002
52
Object Tracking
Foresti, et al, IEEE Trans. MM, Dec. 2002
53
Abandoned Object Detection
Foresti, et al, IEEE Trans. MM, Dec. 2002
54
Video Recurrent Event Modeling
HHMM Hierarchical Hidden Markov Model
http://www.ee.columbia.edu/dvmm/
55
Content-adaptive Video Streaming
Chang, S. F., IEEE Multimedia Magazine, 2002
56
Content-based Music Selection
  • Facts
  • Personal entertainment system can store thousands
    of songs
  • Songs are published on-line, giving more
    selections
  • Users can't remember the title, melody, or
    lyrics of every song they like
  • Problems
  • User: which song to listen to?
  • Musician: how to sell a song to the buyer who
    would like it?
  • Three goals of a personalized music DJ
  • Repetition
  • Surprise
  • Exploitation of catalog
  • Combinatorial optimization problem
  • Given a set of criteria,
  • Select a finite number of songs out of a large
    catalog
  • To satisfy given constraints

Pachet, et. al. IEEE Multimedia Magazine, Jan 2000
57
Personalized Music Disk Jockey
  • Listener's profile
  • Group profile
  • Age, etc.
  • Specified profile
  • Favorite and hated songs
  • Content features
  • Genre, tempo, musician, instruments, pub. dates,
    length, etc.
  • Motive, chord, melody
  • Lyric, title (songs)
  • Popularity, prices
  • Similarity measures
  • Surprise: dissimilarity
  • Constraints
  • Based on user profile and query
  • Individual objects
  • sequences
  • Select music objects that meet the constraints
  • Generate the sequence from selected objects while
    meeting sequence constraints
  • Stochastic optimization methods may be needed.

58
Semantic Analysis of Film
Edge detection on pace flow, and corresponding
story and event sections (Titanic)
  • Estimate camera pan/tilt parameters from video
    frames
  • Estimate pace from camera motion
  • Exploit relations between pace boundaries and
    story line boundaries

Adams, et al, IEEE Trans. Multimedia, Dec. 2002
59
Detection Results
60
ClassView
  • Hierarchical video shots classification,
    indexing, and accessing
  • Addressing the semantic gap between low-level
    video features and high-level semantic concepts.
  • Goal
  • Classify unlabeled video into semantic labels so
    that it can be accessed in a semantically
    meaningful way.
  • The concepts (labels) are organized in a tree
    structure generated by human experts with the
    help of WordNet (wordnet.princeton.edu)
  • Pre-labeled training samples provide examples
    that relate low-level content features to
    semantic labels at various levels of the
    classification tree.

Fan, et al, IEEE Trans. Multimedia, Feb.2004
61
Hierarchical Video Database Model
Fan, et al, IEEE Trans. Multimedia, Feb.2004
62
Video News Classification Tree
63
Video Cut Detection Results
  • Detected scene cuts
  • Color histogram difference at different thresholds

64
Bottom Up Procedure
65
Multimedia summary and filtering
  • Summary
  • Text: email reading
  • Image: caption generation
  • Video: highlights, story board
  • Issues
  • Segmentation
  • Clustering of segments
  • Labeling clusters
  • Associate with syntactic and semantic labels
  • Filtering
  • Same as retrieval: filter out irrelevant objects
    based on a given criterion (query)
  • Often needs to be performed based on content
    features
  • E.g. filtering traffic accidents or law
    violations from traffic monitoring videos

66
Content based Coding and Post-processing
  • Different coding decisions based on low level
    content features
  • coding mode (inter/intra selection)
  • motion estimation
  • Object based coding
  • Encoding different regions (VOP) separately
  • Using different coder for different types of
    regions
  • Multiple abstraction layer coding
  • An analysis/synthesis approach
  • Synthesize low level contents from higher level
    abstraction
  • E.g. texture synthesis
  • Content based post-processing
  • Identify content types and en synthesize low
    level content

67
Conclusion
  • Issues related to content-based multimedia
    information processing are surveyed.
  • Current focus is on low level content analysis
    based on statistical approach.
  • Statistical analysis methods, especially fusion,
    are reviewed.
  • High level knowledge based understanding should
    be incorporated in CBMIP algorithms to further
    advance the state of the art.

68
Statistical Tools for CBMIP
  • Hypothesis testing
  • Detection
  • Pattern classification
  • Pattern classifiers
  • MAP (Bayes) classifier
  • ML classifier based on mixture of Gaussians
  • Rule-based, fuzzy logic
  • Decision tree
  • Clustering-based: LVQ
  • Linear classifier
  • With kernel: SVM
  • Nearest neighbor
  • Multi-layer perceptron
  • Information fusion
  • Basically a pattern classification task
  • Decision fusion vs value fusion
  • The key
  • Feature selection!

69
MAP Maximum A Posteriori Classifier
  • The MAP classifier stipulates that a classifier
    minimizing the probability of misclassification
    should choose
  • g(x) = c(i) if
  • P(c(i)|x) > P(c(i')|x), for all i' ≠ i.
  • This is an optimal decision rule.
  • Unfortunately, in real-world applications, it is
    often difficult to estimate P(c(i)|x).
  • Fortunately, to apply the optimal MAP decision
    rule, one can instead estimate discriminant
    functions Gi(x) such that for any x ∈ X, i' ≠ i,
  • Gi(x) > Gi'(x) iff
  • P(c(i)|x) > P(c(i')|x)
  • Gi(x) can be an approximation of P(c(i)|x) or any
    function satisfying the above relationship.
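Given estimated posteriors (or any discriminant values preserving their ordering), the MAP rule is a one-liner; the function name and dict interface are illustrative.

```python
def map_classify(posteriors):
    """MAP rule: choose the class whose (estimated) posterior
    P(c(i)|x), or discriminant Gi(x), is largest.
    posteriors: dict mapping class label -> score."""
    return max(posteriors, key=posteriors.get)
```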

70
Maximum Likelihood Classifier
  • Using Bayes' rule,
  • p(c(i)|x) = p(x|c(i)) p(c(i)) / p(x).
  • Hence the MAP decision rule can be expressed as
  • g(x) = c(i) if
  • p(c(i)) p(x|c(i)) > p(c(i')) p(x|c(i')), i' ≠ i.
  • Under the assumption that the a priori
    probability is unknown, we may assume p(c(i)) =
    1/M. As such, maximizing p(x|c(i)) is equivalent
    to maximizing p(c(i)|x).
  • The likelihood function p(x|c(i)) may assume a
    uni-variate Gaussian model. That is,
  • p(x|c(i)) = N(μi, σi)
  • μi, σi can be estimated using the samples
    {x | t(x) = c(i)}.
  • The a priori probability p(c(i)) can be estimated
    as the fraction of training samples labeled c(i).
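A minimal univariate-Gaussian ML classifier following the slide, assuming equal priors; log-likelihoods are compared instead of likelihoods for numerical convenience (the constant term cancels). Names are illustrative.

```python
import math

def gaussian_ml_classify(x, class_params):
    """ML rule with univariate Gaussian likelihoods: choose the class
    maximizing p(x|c) = N(x; mu, sigma^2), equal priors assumed.
    class_params: dict mapping label -> (mu, sigma)."""
    def loglik(mu, sigma):
        # log N(x; mu, sigma^2) up to a class-independent constant
        return -math.log(sigma) - (x - mu) ** 2 / (2 * sigma ** 2)
    return max(class_params, key=lambda c: loglik(*class_params[c]))
```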

71
Nearest-Neighbor Classifier
  • Let y(1), ..., y(n) ∈ X be n samples that have
    already been classified. Given a new sample x,
    the NN decision rule chooses g(x) = c(i) if the
    sample nearest to x is labeled with c(i).
  • As n → ∞, the probability of misclassification
    using the NN classifier is at most twice that of
    the optimal (MAP) classifier.
  • k-Nearest Neighbor classifier: examine the k
    nearest classified samples, and classify x into
    the majority label among them.
  • Implementation problem: requires large storage to
    hold ALL the training samples.
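A toy 1-D k-NN classifier per the rule above; `knn_classify` and the scalar-feature assumption are illustrative, and ties go to the most common label encountered first.

```python
from collections import Counter

def knn_classify(x, samples, k=3):
    """k-NN rule: label x by majority vote among the k stored samples
    nearest to it. samples: list of (feature_value, label) pairs.
    Note the storage cost: all training samples are kept."""
    nearest = sorted(samples, key=lambda s: abs(s[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```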

72
MLP Classifier
  • Each output of an MLP is used to approximate the
    a posteriori probability P(c(i)|x) directly.
  • The classification decision then amounts to
    assigning the feature to the class whose
    corresponding MLP output is maximum.
  • During training, the classification labels
    (1-of-N encoding) are presented as target values
    (rather than the true, but unknown, a posteriori
    probabilities).
  • Denote yi(x,W) to be the ith output of the MLP,
    and ti(x) to be the corresponding target value (0
    or 1) during training.
  • Then yi(x,W) will approximate E(ti(x)|x) =
    P(c(i)|x)

73
Optimal Hyper-plane Linearly Separable Case
  • The optimal hyper-plane should be in the center
    of the gap.
  • Support vectors: the samples on the boundaries.
    The support vectors alone determine the optimal
    hyper-plane.
  • Question: how do we find the optimal hyper-plane?
  • For di = +1, g(xi) = w^T xi + b > 0, scaled so
    that wo^T xi + bo ≥ +1
  • For di = -1, g(xi) = w^T xi + b < 0, scaled so
    that wo^T xi + bo ≤ -1

74
Quadratic Optimization Problem Formulation
  • Given (xi, di), i = 1 to N, find w and b such
    that
  • Φ(w) = w^T w / 2
  • is minimized subject to the N constraints
  • di (w^T xi + b) ≥ 1, 1 ≤ i ≤ N.
  • Method of Lagrange multipliers

75
Formulating the Dual Problem
  • At the saddle point we have ∂J/∂w = 0 and
    ∂J/∂b = 0, i.e. w = Σi αi di xi and Σi αi di = 0;
    substituting these relations into the Lagrangian
    yields the
  • Dual Problem

Maximize Q(α) = Σi αi - (1/2) Σi Σj αi αj di dj xi^T xj
subject to Σi αi di = 0 and αi ≥ 0 for i = 1, 2, ..., N.
76
Inner Product Kernels
In general, the input is first transformed via a
set of nonlinear functions φi(x) and then subjected
to the hyperplane classifier. Defining the inner
product kernel as K(x, x') = φ(x)^T φ(x'), one
obtains a dual optimization problem of the same
form, with xi^T xj replaced by K(xi, xj).
Often, dim of φ (= p + 1) >> dim of x!
77
Information Fusion
  • What is fusion?
  • Decision-making (hypothesis testing) and value
    computation (estimation) based on two or more
    information sources.
  • E.g. speech recognition based on audio,
    lip-reading, and gesture
  • Why fusion?
  • It is easier to process individual info sources
    separately
  • Transmitting raw data to the fusion center is too
    costly
  • Types of fusion
  • Decision fusion vs. value fusion
  • Stacked generalization vs. mixture of experts

78
Decision Fusion
[Diagram: decision fusion architecture. Each member classifier k =
1, ..., K observes x over a high-data-rate channel and sends a local
decision dk over a low-data-rate channel; the fusion center combines
the decision vector d = [d1 d2 ... dK]^T into the fused decision
f(d).]
79
Decision Fusion Methods
  • Weighted linear combination methods
  • Assume the dk are independent
  • Majority voting
  • Weighted voting
  • Follow the leader
  • Stacked generalization
  • Fusion as pattern classification
  • Optimal decision fusion (ODF)
  • Behavior-knowledge space [Huang95]
  • Table look-up method
  • CODF
  • Complements ODF with a weighted linear
    combination method
  • Copes better when there are few training samples

80
Optimal Decision Fusion
  • If dk takes N labels and there are K member
    classifiers, there are at most N^K different
    decision vectors d.
  • A table assigning each of the N^K vectors d to a
    class label may yield optimal decision fusion
    that minimizes the probability of
    misclassification.
  • When the table is constructed using training data
    samples, this method is called the BKS method by
    Huang and Suen [Huang95].
  • Practical difficulties
  • The table becomes too large when N^K is large
  • Some entries have too few training samples
  • CODF
  • Complements ODF with a weighted linear
    combination method.
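The table construction can be sketched as follows. The training-data format and function name are assumptions; decision vectors never seen in training would need a fallback rule (e.g. the weighted linear combination used by CODF).

```python
from collections import Counter, defaultdict

def build_bks_table(training):
    """BKS/ODF sketch: map each observed decision vector
    d = (d_1, ..., d_K) to the majority true label among the training
    samples that produced that vector.
    training: list of (decision_vector, true_label) pairs."""
    by_vector = defaultdict(list)
    for decision_vec, true_label in training:
        by_vector[tuple(decision_vec)].append(true_label)
    return {v: Counter(labels).most_common(1)[0][0]
            for v, labels in by_vector.items()}
```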

81
Weighted linear combination methods
  • General solution
  • Perceptron learning problem
  • If not linearly separable, it will not converge
  • Relax the constraint
  • Replace the nonlinear step function by a
    sigmoidal or slope function
  • Support vector machine (SVM)
  • Back-propagation learning
  • Least-squares estimation
  • Majority voting
  • wk = 1, θ = K/2
  • Threshold voting
  • wk = 1, θ ≠ K/2
  • Following the leader
  • wk* = 1; wk = 0 for k ≠ k*; θ = 0,
    where classifier k* has the best performance
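The weighted linear combination rule can be sketched as below; majority voting is the special case where every weight is 1, and "follow the leader" puts all the weight on the best classifier. Names and interface are illustrative.

```python
from collections import defaultdict

def fuse_weighted(decisions, weights=None):
    """Weighted vote fusion: classifier k votes for its local label
    d_k with weight w_k; return the label with the largest total.
    With no weights given, this reduces to majority voting."""
    weights = weights or [1] * len(decisions)
    score = defaultdict(float)
    for d, w in zip(decisions, weights):
        score[d] += w
    return max(score, key=score.get)
```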

82
Mixture of Experts

z(x) = Σi gi(x) di(x), where Σi gi(x) = 1 and 0 ≤ gi(x) ≤ 1

[Diagram: mixture-of-experts architecture. Member classifiers 1..K
each produce a decision di(x) from x; a gating network produces
weights g1, ..., gK from x; the weighted sum yields the fused output
z(x).]
83
Mixture of Experts
  • Gating network,
  • located at the fusion center,
  • operates on raw data, and hence requires
    communication resources
  • For each x, minimize ||T(x) - Σk dk(x) gk(x)||
    subject to Σk gk(x) = 1 and 0 ≤ gk(x) ≤ 1; this
    gives gk(x) = 1 if dk(x) = T(x).
  • Given gk(x) for all x in the training set, dk(x)
    can be determined to train the member
    classifiers.
  • Iterative training using the EM algorithm
  • Fix gk, train dk
  • Fix dk, train gk
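The combination rule z(x) = Σi gi(x) di(x) can be checked with stand-in callables for the experts and the gate; this sketch only verifies the convexity constraint and computes the weighted sum, not the EM training loop.

```python
def moe_output(x, experts, gating):
    """Mixture-of-experts combination: z(x) = sum_i g_i(x) * d_i(x),
    where the gate outputs are nonnegative and sum to 1.
    experts, gating: lists of callables (toy stand-ins)."""
    g = [gate(x) for gate in gating]
    assert abs(sum(g) - 1) < 1e-9 and all(0 <= gi <= 1 for gi in g)
    return sum(gi * expert(x) for gi, expert in zip(g, experts))
```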