Multimedia Information Retrieval - PowerPoint PPT Presentation

About This Presentation
Title:

Multimedia Information Retrieval

Description:

Title: Data Mining at Yasuda Author: Jing Luan Last modified by: Q Created Date: 7/24/2001 2:17:31 AM Document presentation format: On-screen Show Company – PowerPoint PPT presentation

Slides: 96
Provided by: Jing114
Learn more at: http://ce.sharif.edu
Transcript and Presenter's Notes

Title: Multimedia Information Retrieval


1
Multimedia Information Retrieval
  • Modern Information Retrieval Course
  • Computer Engineering Department
  • Sharif University of Technology
  • Spring 2006

2
Outline
  • Introduction
  • Text-Based MMIR
  • Content-Based Retrieval
  • Multimedia IR Model
  • Image Retrieval
  • Audio Retrieval
  • Video Retrieval
  • Conclusions

4
Support variety of data
  • Different kinds of media
  • Image
  • Graph, ...
  • Audio
  • Music, speech, ...
  • Video

5
MMIR Motivations
  • Content, content, and more content: how to get
    what is needed?
  • Increasing availability of multimedia information
  • Difficult to find, select, filter, manage AV
    content
  • More and more situations where it is necessary to
    have information about the content

6
Key Issues in MMIR
7
Goals
  • We want to make multimedia content as searchable
    as text, because the value of content depends on
    how easy it is to find, filter, manage, and use.
  • This needs a content description method beyond
    simple text annotation

8
MMIR Approaches
  • Text Based MMIR
  • Content Based MMIR

9
Outline
  • Introduction
  • Text-Based MMIR
  • Content-Based Retrieval
  • Multimedia IR Model
  • Image Retrieval
  • Audio Retrieval
  • Video Retrieval
  • Conclusions

10
Text-Based Retrieval
  • Based on text associated with the file
  • URL
  • http://www.host.com/animals/dogs/poodle.gif
  • Alt text
  • <img src="URL" alt="picture of poodle">
  • Hyperlink text
  • <a href="URL">Sally the poodle</a>
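
As a rough illustration of the three text sources listed above (URL, alt text, hyperlink text), the following sketch collects them with Python's standard library. The class name `ImageTextCollector`, the helper `url_keywords`, and the sample page are assumptions made for this example; they are not taken from any particular search engine.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse
import re


class ImageTextCollector(HTMLParser):
    """Collects alt text of <img> tags and the text inside <a> links."""

    def __init__(self):
        super().__init__()
        self.alt_texts = []
        self.anchor_texts = []
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("alt"):
            self.alt_texts.append(attrs["alt"])
        elif tag == "a":
            self._in_anchor = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        if self._in_anchor and data.strip():
            self.anchor_texts.append(data.strip())


def url_keywords(url):
    """Split the URL path into lowercase tokens, dropping the file extension."""
    path = re.sub(r"\.[A-Za-z0-9]+$", "", urlparse(url).path)
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", path) if t]


# Hypothetical container page, modelled on the slide's poodle example.
page = ('<img src="poodle.gif" alt="picture of poodle">'
        '<a href="poodle.gif">Sally the poodle</a>')
collector = ImageTextCollector()
collector.feed(page)
keywords = url_keywords("http://www.host.com/animals/dogs/poodle.gif")
```

The collected strings could then be fed to an ordinary text index; how the three sources are weighted against each other is left open here.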

11
Text-based Search Engines
  • Indexing based on text in the container webpage
  • http://www.google.com
  • http://www.ditto.com

12
Keyword-based System
[Diagram labels: Information Need, Keyword, Automatic Annotation (including filename, video title, caption, related web page), Video Database]
13
Why does this happen?
  • Most of these search engines are keyword based
  • Have to represent your idea in keywords
  • These keywords are expected to appear in the
    filename, or corresponding webpage

14
Image: The Google Approach
  • How does image search work?
  • Google analyzes the text on the page adjacent to
    the image, the image caption and dozens of other
    factors to determine the image content. Google
    also uses sophisticated algorithms to remove
    duplicates and ensure that the highest quality
    images are presented first in your results.
  • Examples
  • Campanile tcd
  • Cliffs of Moher
  • Recall may not be great

15
Google image search
16
Google Image Search
17
Problems with Text-Based
  • The ALT text has to be written manually
  • Expensive
  • Time consuming
  • It is incomplete and subjective
  • Some features are difficult to define in text
    such as texture or object shape

18
Therefore
  • Unable to handle semantic meaning of images
  • Unable to handle visual position
  • Unable to handle time information
  • Unable to use images as query
  • ...

19
So
  • Better for simple concepts
  • e.g. A picture of a giraffe
  • Does not work for complex queries
  • e.g. A picture of a brick home with black
    shutters and white pillars, with a pickup truck
    in front of it (image)

20
Outline
  • Introduction
  • Text-Based MMIR
  • Content-Based Retrieval
  • Multimedia IR Model
  • Image Retrieval
  • Audio Retrieval
  • Video Retrieval
  • Conclusions

21
Architecture for Multimedia Retrieval
Human or machine
22
Query-retrieval matrix
[Figure: matrix of query types (text, stills, sketch, speech, sound, humming, examples) against document types (text, video, images, speech, music, sketches, multimedia)]
23
Main Components
  • Feature Extraction & Analysis
  • Description Schemes
  • Searching & Filtering
  • Examples
  • IBM's Query By Image Content (QBIC)
  • Virage's VIR Image Engine
  • Online: http://collage.nhil.com/

24
Internal representation
  • Using attributes is not sufficient
  • Feature
  • Information extracted from objects
  • Multimedia object is represented as a set of
    features
  • Features can be assigned manually, automatically,
    or using a hybrid approach

25
Features for MMIR
  • high-level features
  • words and phrases from text, speech recognition
  • medium-level features
  • face detectors, region classifiers (e.g. outdoor), etc.
  • low-level features
  • Fourier transforms, wavelet decomposition,
    texture histograms, colour histograms, shape
    primitives, filter primitives

26
Internal representation
  • Values of some specific features are assigned to
    an object by comparing the object with some
    previously classified objects
  • Feature extraction cannot be precise
  • A weight is usually assigned to each feature
    value representing the uncertainty of assigning
    such a value to that feature
  • e.g. 80% sure that a shape is a square

27
Outline
  • Introduction
  • Text-Based MMIR
  • Content-Based Retrieval
  • Multimedia IR Model
  • Image Retrieval
  • Audio Retrieval
  • Video Retrieval
  • Conclusions

28
MMIR Models: Main Components
  • Query Language
  • Indexing and Searching

29
Query languages
  • In designing a multimedia query language, two
    main aspects require attention
  • How the user enters his/her request to the system
  • Which conditions on multimedia objects can be
    specified in the user request

30
Request specification
  • Interfaces
  • Browsing and navigation
  • Specifying the conditions the objects of interest
    must satisfy, by means of queries
  • Queries can be specified in two different ways
  • Using a specific query language
  • Query by example
  • Using actual data (object example)

31
Conditions on multimedia data
  • Query predicates
  • Attribute predicates
  • Concern the attributes for which an exact value
    is supplied for each object
  • Exact-match retrieval
  • Structural predicates
  • Concern the structure of multimedia objects
  • Can be answered by metadata and information about
    the database schema
  • Find all multimedia objects containing at least
    one image and a video clip

32
Conditions on multimedia data
  • Semantic predicates
  • Concern the semantic content of the required
    data, depending on the features that have been
    extracted and stored for each multimedia object
  • Find all the red houses
  • Exact match cannot be applied

33
Indexing and searching
  • Searching similar patterns
  • Distance function
  • Given two objects, O1 and O2, the distance
    (dissimilarity) of the two objects is denoted by
    D(O1,O2)
  • Similarity queries
  • Whole match
  • Sub-pattern match
  • Nearest neighbors
  • All pairs
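
The query types above can be sketched over a small in-memory collection. This is a minimal linear-scan illustration, assuming objects are already represented as numeric feature vectors and that D is the Euclidean distance; the function names and the sample data are hypothetical, and sub-pattern matching is not covered.

```python
import math


def euclidean(a, b):
    """Distance D(O1, O2) between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def range_query(db, query, eps, dist=euclidean):
    """Whole match: ids of all objects within distance eps of the query."""
    return sorted(k for k, v in db.items() if dist(v, query) <= eps)


def nearest_neighbors(db, query, k, dist=euclidean):
    """Ids of the k objects closest to the query, nearest first."""
    return sorted(db, key=lambda key: dist(db[key], query))[:k]


def all_pairs(db, eps, dist=euclidean):
    """All pairs of stored objects that lie within eps of each other."""
    keys = sorted(db)
    return [(a, b) for i, a in enumerate(keys) for b in keys[i + 1:]
            if dist(db[a], db[b]) <= eps]


# Hypothetical 2-D feature vectors for three objects.
db = {"O1": (0.0, 0.0), "O2": (3.0, 4.0), "O3": (1.0, 0.0)}
```

A linear scan like this touches every object, which is what the access methods on the following slides try to avoid.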

34
Spatial access methods
  • Map objects into points in f-dimensional space,
    and use multiattribute access methods (also
    referred to as spatial access methods or SAMs) to
    cluster them and to search for them
  • Methods
  • R-trees and the rest of the R-tree family
  • Linear quadtrees
  • Grid-files
  • Linear quadtrees and grid files explode
    exponentially with the dimensionality

35
R-tree
  • R-tree
  • Represent a spatial object by its minimum
    bounding rectangle (MBR)
  • Data rectangles are grouped to form parent nodes
    (recursively grouped)
  • The MBR of a parent node completely contains the
    MBRs of its children
  • MBRs are allowed to overlap
  • Nodes of the tree correspond to disk pages
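
The MBR properties listed above can be illustrated with a small sketch. It only shows bounding, containment, overlap, and a window search over a tree built by hand; insertion, node splitting, and disk paging are omitted, and the names `MBR`, `Node`, `make_node`, and `search` are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MBR:
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def contains(self, other):
        return (self.xmin <= other.xmin and self.ymin <= other.ymin
                and self.xmax >= other.xmax and self.ymax >= other.ymax)

    def intersects(self, other):
        return not (other.xmin > self.xmax or other.xmax < self.xmin
                    or other.ymin > self.ymax or other.ymax < self.ymin)


def enclose(rects):
    """Smallest rectangle that contains every rectangle in rects."""
    return MBR(min(r.xmin for r in rects), min(r.ymin for r in rects),
               max(r.xmax for r in rects), max(r.ymax for r in rects))


@dataclass
class Node:
    mbr: MBR
    children: list  # Node objects, or (MBR, object_id) tuples in a leaf
    leaf: bool


def make_node(entries, leaf):
    """Group entries under a parent whose MBR encloses all of them."""
    mbrs = [e[0] for e in entries] if leaf else [e.mbr for e in entries]
    return Node(enclose(mbrs), list(entries), leaf)


def search(node, window):
    """Ids of objects whose MBR intersects the query window."""
    if not node.mbr.intersects(window):
        return []  # prune the whole subtree
    if node.leaf:
        return [oid for mbr, oid in node.children if mbr.intersects(window)]
    hits = []
    for child in node.children:
        hits.extend(search(child, window))
    return hits


leaf1 = make_node([(MBR(0, 0, 2, 2), "a"), (MBR(1, 1, 3, 3), "b")], leaf=True)
leaf2 = make_node([(MBR(5, 5, 6, 6), "c")], leaf=True)
root = make_node([leaf1, leaf2], leaf=False)
```

Because sibling MBRs may overlap, a search can descend into more than one child, which is why `search` visits every child whose rectangle intersects the window.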

36
(No Transcript)
37
Outline
  • Introduction
  • Text-Based MMIR
  • Content-Based Retrieval
  • Multimedia IR Model
  • Image Retrieval
  • Audio Retrieval
  • Video Retrieval
  • Conclusions

38
Visual Features ...
39
Histograms
Greyscale histogram of image A, assuming 256 intensity levels:
$h_A(l) = \left|\{(i,j) : A(i,j) = l\}\right|$, for $l = 1, \ldots, 256$, $i = 1, \ldots, m$, $j = 1, \ldots, n$
i.e. a count of the number of pixels at each level
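
The pixel count described above can be sketched in a few lines. This assumes the image is a list of rows of integer intensities and, unlike the slide, indexes levels from 0 to 255 as is usual in code; the function name `grey_histogram` and the sample image are hypothetical.

```python
def grey_histogram(image, levels=256):
    """Return a list h where h[l] is the number of pixels with intensity l."""
    hist = [0] * levels
    for row in image:
        for value in row:
            hist[value] += 1
    return hist


# Hypothetical 2 x 3 image (m = 2 rows, n = 3 columns).
image = [[0, 1, 1],
         [255, 1, 0]]
hist = grey_histogram(image)
```

Dividing each count by the number of pixels would give the percentages that a colour histogram reports per colour bin.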
40
Colour Histogram
  • Describes the colors in an image and their
    percentages.

41
Texture Matching
  • Texture characterizes small-scale regularity
  • Color describes pixels, texture describes regions
  • Described by several types of features
  • e.g., smoothness, periodicity, directionality
  • Perform weighted vector space matching
  • Usually in combination with a color histogram

42
Texture Test Patterns
43
Image Retrieval using low level features
  • See IBM demos at
  • http://wwwqbic.almaden.ibm.com/
  • http://mp7.watson.ibm.com/ (video)
  • Hermitage Museum
  • www.hermitagemuseum.org

44
Berkeley Blobworld
45
Berkeley Blobworld
46
But..
  • Low-level features do not work in all cases

47
Solution: Regional Low-level Image Features
  • Segmentation into objects
  • Extract low-level features from each region

48
Solution: High-level Image Features
  • Objects: Persons, Roads, Cars, Skies
  • Scenes: Indoors, Outdoors, Cityscape, Landscape,
    Water, Office, Factory
  • Events: Parade, Explosion, Picnic, Playing Soccer
  • Generated from low-level features

49
Outline
  • Introduction
  • Text-Based MMIR
  • Content-Based Retrieval
  • Multimedia IR Model
  • Image Retrieval
  • Audio Retrieval
  • Video Retrieval
  • Conclusions

50
Audio Genres
  • Important types of audio data
  • Speech-centered
  • Radio programs
  • Telephone conversations
  • Recorded meetings
  • Music-centered
  • Instrumental, vocal
  • Other sources
  • Alarms, instrumentation, surveillance, ...

51
Speech-based Documents
  • Radio/TV news retrieval.
  • Search archival radio/news broadcasts.
  • Video and audio email.
  • Knowledge management: transfer of tacit
    knowledge to others.
  • Search audio archives of meetings, lectures, etc

52
Preamble
  • Two utterances of the same words by the same
    person under the same conditions generate very
    different waveforms.
  • Loudness, pitch, brightness, bandwidth,
    harmonicity, and others are all continuous
    variables, and play the role that color and
    texture play in images.

53
Detectable Speech Features
  • Content
  • Phonemes, one-best word recognition, n-best
  • Identity
  • Speaker identification, speaker segmentation
  • Language
  • Language, dialect, accent
  • Other measurable parameters
  • Time, duration, channel, environment

54
How Speech Recognition Works
  • Three stages
  • What sounds were made?
  • Convert from waveform to subword units (phonemes)
  • How could the sounds be grouped into words?
  • Identify the most probable word segmentation
    points
  • Which of the possible words were spoken?
  • Based on likelihood of possible multiword
    sequences
  • All three stages are learned from training data
  • Using hill climbing (to train a Hidden Markov Model)

55
Speech Recognition
[Diagram: pipeline of Phoneme Detection, Word Construction, and Word Selection, producing Words. Other labels: phoneme n-grams, one-best phoneme transcription, n-best phoneme sequences, phoneme lattices, phoneme transcription dictionary, one-best word transcript, word n-gram language model]
56
Music and audio analysis
  • Music is a large and extremely variable audio
    class.
  • The range of sounds is large, from music genres
    to animal cries to synthesizer samples.
  • Any of the above can and will occur in
    combination.

57
Audio retrieval-by-content
  • Require some measure of audio similarity.
  • Most approaches to general audio retrieval take a
    perceptual approach, using measures such as
    loudness.
  • A neural net can map a sound clip to a text
    description; an obvious drawback is the
    subjective nature of audio description.

58
Sample system: Muscle Fish
  • Analyzes sound files for a specific set of
    psychoacoustic features.
  • This results in a vector of attributes that
    includes loudness, pitch, bandwidth and
    harmonicity.
  • Given enough training samples, a Gaussian
    classifier can be constructed, or the vectors
    can be used directly for retrieval.

59
  • A Euclidean distance is used as a measure of
    similarity.
  • For retrieval, the distance is computed between a
    given sound example and all other sound examples
    (about 400 in the demonstration).
  • Sounds are ranked by distance, with the closer
    ones being more similar.

60
Music and MIDI retrieval
  • Using archives of MIDI files, which are
    score-like representations of music intended for
    musical synthesizers or sequencers.
  • Given a melodic query, the MIDI files can be
    searched for similar melodies.

61
Polyphonic Music Indexing Technique
  • n-grams
  • encode music as text strings using pitch and
    onsets
  • index text words with text search engine
  • process query in the same way
  • application: e.g., Query by Humming

62
Monophonic pitch n-gramming
Interval sequence: 0 7 0 2 0 -2 0 -2 0
Interval 4-gram    String
0 7 0 2            ZGZB
7 0 2 0            GZBZ
0 2 0 -2           ZBZb
Example musical strings with interval-only
representation
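
The interval-only encoding in the example can be sketched as follows. The symbol mapping is inferred from the example strings alone (0 becomes Z, a rise of k semitones becomes the k-th uppercase letter, a fall becomes the lowercase letter), so it is an assumption rather than the exact scheme of any published system; the function names and the MIDI note numbers are likewise hypothetical.

```python
def pitch_intervals(pitches):
    """Semitone differences between consecutive notes (e.g. MIDI numbers)."""
    return [b - a for a, b in zip(pitches, pitches[1:])]


def interval_symbol(interval):
    """Map one interval to a letter: 0 -> Z, +k -> k-th uppercase, -k -> lowercase."""
    if interval == 0:
        return "Z"
    if not 1 <= abs(interval) <= 25:
        raise ValueError("interval outside the range this sketch encodes")
    letter = chr(ord("A") + abs(interval) - 1)
    return letter if interval > 0 else letter.lower()


def interval_ngrams(intervals, n=4):
    """Sliding-window n-grams over the interval string; each acts as a 'word'."""
    symbols = "".join(interval_symbol(i) for i in intervals)
    return [symbols[i:i + n] for i in range(len(symbols) - n + 1)]


# Hypothetical melody whose intervals match the slide: 0 7 0 2 0 -2 0 -2 0
melody = [60, 60, 67, 67, 69, 69, 67, 67, 65, 65]
words = interval_ngrams(pitch_intervals(melody))
```

The resulting words can be indexed by a text search engine, and a hummed query, once transcribed to pitches, can be encoded the same way.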
63
Outline
  • Introduction
  • Text-Based MMIR
  • Content-Based Retrieval
  • Multimedia IR Model
  • Image Retrieval
  • Audio Retrieval
  • Video Retrieval
  • Conclusions

64
Application
  • Increasing demand for visual information
    retrieval
  • Retrieve useful information from databases
  • Sharing and distributing video data through
    computer networks
  • Example: BBC
  • BBC archive has 500k queries plus 1M new items
    per year
  • Sample queries from the BBC:
  • Police car with blue light flashing
  • Government plan to improve reading standards
  • Two shot of Kenneth Clarke and William Hague

65
Video Search
  • Active Research Area

66
Video Search Features
  • Texture
  • One of the earliest image features (Haralick et
    al., 1970s)
  • Co-occurrence matrix
  • Orientation and distance on gray-scale pixels
  • Contrast, inverse difference moment, and entropy
    (Gotlieb & Kreyszig)
  • Human visual texture properties: coarseness,
    contrast, directionality, line-likeness,
    regularity and roughness (Tamura et al.)
  • Wavelet transforms (1990s)
  • Smith & Chang extracted mean and variance from
    wavelet subbands
  • Gabor filters
  • And so on
  • Region Segmentation
  • Partition image into regions
  • Strong segmentation: object segmentation is
    difficult.
  • Weak segmentation: region segmentation based on
    some homogeneity criteria
  • Scene Segmentation
  • Shot detection, scene detection
  • Look for changes in color, texture, brightness
  • Context-based scene segmentation applied to
    certain categories such as broadcast news
  • Color
  • Robust to background
  • Independent of size, orientation
  • Color Histogram (Swain & Ballard)
  • Histograms are sparse and sensitive to noise:
    Cumulative Histograms (Stricker & Orengo)
  • Color Moments
  • Color Sets: map RGB color space to Hue Saturation
    Value, then quantize (Smith & Chang)
  • Color layout: local color features by dividing
    image into regions
  • Color Autocorrelograms
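
To make the co-occurrence texture features in the list above concrete, here is a minimal sketch of a gray-level co-occurrence matrix for a single pixel offset, with contrast, inverse difference moment, and entropy computed from it. It uses the common textbook definitions, keeps the matrix unsymmetrised, and assumes the image is larger than the offset; the function names and sample images are hypothetical.

```python
import math
from collections import Counter


def cooccurrence(image, dx=1, dy=0):
    """Normalised counts of gray-level pairs (a, b) where b lies at offset (dx, dy) from a."""
    counts = Counter()
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[(image[r][c], image[r2][c2])] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def contrast(p):
    return sum((i - j) ** 2 * v for (i, j), v in p.items())


def inverse_difference_moment(p):
    return sum(v / (1 + (i - j) ** 2) for (i, j), v in p.items())


def entropy(p):
    return -sum(v * math.log2(v) for v in p.values() if v > 0)


# Two tiny hypothetical images: horizontal stripes and vertical stripes.
horizontal = cooccurrence([[0, 0], [1, 1]])
vertical = cooccurrence([[0, 1], [0, 1]])
```

With a horizontal offset, the horizontally striped image has zero contrast while the vertically striped one does not, which is the kind of directionality these features capture.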

67
Video Search Features
  • Face
  • Face detection is highly reliable
  • - Neural networks (Rowley)
  • - Wavelet-based histograms of facial features
    (Schneiderman)
  • Face recognition for video is still a challenging
    problem.
  • - EigenFaces: extract eigenvectors and use them
    as the feature space
  • OCR
  • OCR is a fairly successful technology.
  • Accurate, especially with good matching
    vocabularies.
  • Script recognition is still an open problem.
  • ASR
  • Automatic speech recognition is fairly accurate
    for medium to large vocabulary broadcast-type data
  • Large number of available speech vendors.
  • Still open for free conversational speech in
    noisy conditions.
  • Shape
  • Outer boundary based vs. region based
  • Fourier descriptors
  • Moment invariants
  • Finite Element Method (stiffness matrix: how each
    point is connected to others; eigenvectors of the
    matrix)
  • Turning function based (similar to Fourier
    descriptors), for convex/concave polygons (Arkin
    et al.)
  • Wavelet transforms leverage multiresolution
    (Chuang & Kuo)
  • Chamfer matching for comparing 2 shapes (linear
    dimension rather than area)
  • 3-D object representations using similar
    invariant features
  • Well-known edge detection algorithms.

68
Video Structures
  • Image structure
  • Absolute positioning, relative positioning
  • Object motion
  • Translation, rotation
  • Camera motion
  • Pan, zoom, perspective change
  • Shot transitions
  • Cut, fade, dissolve, ...

69
Typical Retrieval Framework
  • User: provides query information that represents
    his or her information needs
  • Database: stores a large collection of video data
  • Goal: find the most relevant shots in the
    database
  • Shots: the "paragraphs" of video, typically 20 to
    40 seconds long, and the basic unit of video
    retrieval

70
Bridging the Gap
[Diagram labels: Video Database, Result]
71
Automatically Structure Video Data
  • The first step for video retrieval: video
    programmes are structured into logical scenes
    and physical shots
  • If dealing with text, then the structure is
    obvious
  • paragraph, section, topic, page, etc.
  • All text-based indexing, retrieval, linking, etc.
    builds upon this structure
  • Automatic shot boundary detection and selection
    of representative keyframes is usually the first
    step
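
One simple way to approximate shot boundary detection and keyframe selection is to compare grey-level histograms of consecutive frames and declare a cut where they differ sharply. The sketch below assumes frames are small lists of rows of 0 to 255 intensities; the bin count, the threshold of 0.5, and the choice of the middle frame as keyframe are illustrative assumptions, and gradual transitions such as fades and dissolves are not handled.

```python
def histogram(frame, bins=4, levels=256):
    """Normalised grey-level histogram of one frame."""
    hist = [0] * bins
    for row in frame:
        for value in row:
            hist[value * bins // levels] += 1
    total = sum(hist)
    return [h / total for h in hist]


def histogram_difference(h1, h2):
    """Sum of absolute bin differences; ranges from 0 (same) to 2 (disjoint)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def detect_cuts(frames, threshold=0.5):
    """Indices i where frame i starts a new shot."""
    hists = [histogram(f) for f in frames]
    return [i for i in range(1, len(hists))
            if histogram_difference(hists[i - 1], hists[i]) > threshold]


def keyframes(frames, cuts):
    """Pick the middle frame of each shot as its representative keyframe."""
    starts = [0] + cuts
    ends = cuts + [len(frames)]
    return [(s + e) // 2 for s, e in zip(starts, ends)]


# Hypothetical clip: two dark frames followed by three bright frames.
dark = [[10, 20], [30, 40]]
bright = [[200, 210], [220, 230]]
frames = [dark, dark, bright, bright, bright]
cuts = detect_cuts(frames)
keys = keyframes(frames, cuts)
```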

72
Typical automatic structuring of video

a video document
73
Ideal solution
[Diagram labels: Information Need, Video Database, Video Structure, "Understanding the semantic meaning and retrieve", Result]
74
Ideal solution
  • However,
  • Hard to represent query in natural language and
    for computer to understand
  • Computers have no experience
  • Other representation restriction like position,
    time

[Diagram labels: Information Need, Video Database, Video Structure, "Understanding the semantic meaning and retrieve", Result]
75
Alternative Solution
[Diagram labels: Information Need, "Provide evidence of relevant information (text, image, audio)", Video Database, Video Structure, "Match and combine", Result]
76
Evidence-based Retrieval System
  • General framework for current video retrieval
    system
  • Video retrieval based on the evidence from both
    users and database, including
  • Text information
  • Image information
  • Motion information
  • Audio information
  • Return a relevance score for each piece of evidence
  • Combination of the scores
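
The last two steps above can be sketched as a weighted linear combination of per-modality relevance scores. Linear fusion is only one possible combination rule, and the modality names, weights, and shot ids below are made up for illustration.

```python
def combine_scores(evidence_scores, weights):
    """Rank shots by the weighted sum of their per-modality relevance scores.

    evidence_scores: {modality: {shot_id: score}}; a shot missing from a
    modality simply contributes nothing for that modality.
    """
    combined = {}
    for modality, scores in evidence_scores.items():
        w = weights.get(modality, 0.0)
        for shot, score in scores.items():
            combined[shot] = combined.get(shot, 0.0) + w * score
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical scores from a text matcher and an image matcher.
evidence = {
    "text": {"shot1": 1.0, "shot2": 0.5},
    "image": {"shot2": 1.0, "shot3": 0.5},
}
ranking = combine_scores(evidence, {"text": 0.5, "image": 0.5})
```

Here shot2 ranks first because it has support from both modalities, even though neither modality alone ranks it clearly ahead.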

77
Keyword-based System
[Diagram labels: Information Need, Keyword, Automatic Annotation (including filename, video title, caption, related web page), Video Structure, Video Database]
78
Keyword-based System
[Diagram labels: Information Need, Keyword, Automatic Annotation, Manual Annotation, Video Structure, Video Database]
79
Manual Annotation
  • Manually creating annotation/keywords for image /
    video data
  • Examples: Gettyimage.com (image retrieval)
  • Pros
  • Represent the semantic meaning of video
  • Cons
  • Time-consuming, labor-intensive
  • Keyword is not enough to represent information
    need

80
Speech and OCR transcription
[Diagram labels: Information Need, Keyword, Annotation, Speech Transcription, OCR Transcription, Video Structure, Video Database]
81
Query using speech/OCR information
Query: Find pictures of Harry Hertz, Director of
the National Quality Program, NIST
82
What do we lack?
[Diagram labels: Information Need, Keyword, Annotation, Speech Transcription, OCR Transcription, Image Information, Video Structure, Video Database]
83
Image-based Retrieval
[Diagram labels: Information Need, Keyword, Query Images, Text Information, Image Feature, Video Structure, Video Database]
84
Image-based Retrieval
[Diagram labels: Information Need, Keyword, Query Images, Text Information, Image Feature (Low-level Feature, High-level Feature), Video Structure, Video Database]
85
More Evidence in Video Retrieval
[Diagram labels: Information Need, Keyword, Query Images, Motion, Audio, Text Information, Image Information, Motion Information, Audio Information, Video Structure, Video Database]
86
MPEG-7: The Objective
  • Standardize object-based description tools for
    various types of audiovisual information,
    allowing fast and efficient content searching,
    filtering and identification, and addressing a
    large range of applications.
  • New objective for MPEG
  • MPEG-1, -2 and -4 represent the content itself
    (the bits)
  • MPEG-7 should represent information about the
    content (the bits about the bits)

87
Scope of MPEG-7
[Diagram labels: Description creation, description, Description consumption]
  • Not the description creation
  • Not the description consumption
  • Just the description !

The goal is to define the minimum that enables
interoperability.
88
MPEG-7 Terminology: Descriptor
  • Descriptor (D): A Descriptor is a representation
    of a Feature. A Descriptor defines the syntax and
    the semantics of the Feature representation.
  • Examples:
    Feature            Descriptor
    Color              Histogram of Y, U, V components
    Shape              ART moments
    Motion             Motion field, coefficients of a model
    Audio frequency    Average frequency components
    Title              Text
    Annotation         Text
    Genre              Text, index in a thesaurus

89
Outline
  • Introduction
  • Text-Based MMIR
  • Content-Based Retrieval
  • Multimedia IR Model
  • Image Retrieval
  • Audio Retrieval
  • Video Retrieval
  • Conclusions

90
Conclusions
  • Simple image retrieval is commercially available
  • Color histograms, texture, limited shape
    information
  • Segmentation-based retrieval is still in the lab
  • Keep an eye on the Berkeley group
  • Limited audio indexing is practical now
  • Audio feature matching, answering machine
    detection

91
Conclusions
  • Multimedia IR
  • Text: good solutions exist
  • Video, Image, Sound: a lot of work to do.

92
Conclusions
  • The goal of content-based video retrieval is to
    build a more intelligent video retrieval engine
    that works from semantic meaning
  • Many applications in daily life
  • Combine evidence from different aspects
  • Hot research topic, but few commercial systems
  • State-of-the-art performance is still
    unacceptable for normal users; there is room to
    improve

93
Conclusions
  • Problems with Content-Based MMIR
  • Must have an example image
  • Example image is 2-D
  • Hence only that view of the object will be
    returned
  • Large amount of image data
  • Similar colour histogram does not equal similar
    image
  • Usually the best results come from a combination
    of both text and content searching

94
Conclusions
  • Combination of multi-modal results
  • Different characteristics of multi-modal
    information
  • Text-based information: better for middle and
    high level queries
  • Image-based information: better for low and
    middle level queries
  • Combination of multi-modal information

95
Conclusions
  • Challenging research questions
  • Draws on
  • computer vision,
  • audio processing,
  • natural language analysis,
  • unstructured document analysis,
  • information retrieval,
  • information visualisation,
  • computer human interaction,
  • artificial intelligence