Techniques for Information Searching and Retrieval of Multimedia Digital Library

1
Techniques for Information Searching and
Retrieval of Multimedia Digital Library
  • Presented by Vincent Cheung
  • Supervised by Prof. Michael Lyu
  • Prof. K. W. Ng
  • 18 December, 1999

2
Abstract
  • Digital libraries are becoming more and more
    popular because of their strength in searching
    and retrieving information.
  • There is a trend toward storing more multimedia
    information instead of pure text.
  • As the nature of multimedia information is very
    different from that of pure text, new challenges
    arise in information searching and retrieval
    techniques.

3
Presentation Outline
  • General Information Retrieval Methods
  • Multimedia and Their Retrieval Techniques
  • Retrieval Techniques in Other Information
    Searching Applications
  • An Indexing Tool Implemented
  • Conclusion and Q&A Session

4
Overview - Information Searching and Retrieval
Procedures
  • Give indexes to the existing information
  • Store information with good organization
  • Get the user queries
  • Search the information
  • Evaluate the importance of all query results
  • Present the results to the users
  • Process the feedback of the users

5
Flowchart of Retrieval Processes

6
Indexing
  • Aim: to give an abstract of the document and
    label it with a few keywords
  • Manual indexing
  • Using the whole passage
  • Counting content words
  • Natural language processing
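
Counting content words can be sketched as below. This is a minimal illustration, not the presenter's tool: the stop-word list, the function name index_keywords and the sample text are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to", "with"}

def index_keywords(text, top_n=3):
    """Return the top_n most frequent content words of a document."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_n]]

doc = ("The library stores video and text. Video retrieval is harder "
       "than text retrieval in a library with video.")
print(index_keywords(doc))
```

The most frequent content words then serve as the document's index terms.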

7
Query Modification
  • Aim: to modify the query so that it yields the
    largest number of relevant results
  • Linguistic problems:
  • Some words carry out only syntactic functions
  • Different words can supply the same or related
    meaning
  • Words can be used in different senses, depending
    on context
  • Different structures can represent the same idea

8
Solving Linguistic Problems
  • Use of Dictionaries
  • Negative Dictionary
  • Thesaurus (or Synonym Dictionary)
  • Phrase Dictionary
  • Use of fuzzy logic for matching synonyms
  • Construct a set of fuzzy relations, represented
    by fuzzy graphs that are obtained from statistics
    of occurrence and co-occurrence of keywords.
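
One possible way to derive a fuzzy relation from occurrence and co-occurrence statistics is sketched below. The membership formula (co-occurrences divided by the number of documents containing either keyword) is an assumption chosen for illustration; the slides do not state which formula is used. The names fuzzy_relation and expand_query and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def fuzzy_relation(documents):
    """Map each keyword pair to a membership degree in [0, 1].

    Assumption: degree = co-occurrences / documents containing either keyword.
    """
    occurrence = Counter()
    co_occurrence = Counter()
    for keywords in documents:
        unique = sorted(set(keywords))
        occurrence.update(unique)
        co_occurrence.update(combinations(unique, 2))
    relation = {}
    for (a, b), both in co_occurrence.items():
        either = occurrence[a] + occurrence[b] - both
        relation[(a, b)] = both / either
    return relation

def expand_query(term, relation, threshold=0.5):
    """Return the term plus keywords related to it at or above the threshold."""
    related = {term}
    for (a, b), degree in relation.items():
        if degree >= threshold:
            if a == term:
                related.add(b)
            elif b == term:
                related.add(a)
    return sorted(related)

# Hypothetical keyword sets, one list per document.
docs = [
    ["film", "movie", "video"],
    ["film", "movie"],
    ["video", "audio"],
    ["movie", "film", "audio"],
]
relation = fuzzy_relation(docs)
print(relation[("film", "movie")])
print(expand_query("film", relation))
```

Keywords that nearly always appear together receive a degree close to 1 and can be treated as synonyms when a query is expanded.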

9
Searching and Storage
  • Aim: good organization in storage gives good
    performance in searching.
  • Two main principles of file organization: direct
    and inverted systems
  • Direct system: files are stored in order by
    document number, and items are retrieved by a
    sequential scan of the complete files.
  • Advantage of the direct system: it allows several
    searches to be performed at the same time.

10
Searching and Storage (cont)
  • Inverted system: files are arranged in order by a
    set of keywords or index terms. Each item is
    normally listed as many times as it has assigned
    keywords.
  • Advantage of the inverted system: only the
    sections of the files that correspond to the
    index terms used in a query need to be read
  • Other methods are variations of these two
    principles
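
The difference between the two file organizations can be sketched as follows. This is a toy in-memory illustration; the document numbers, keywords and function names are made up for the example.

```python
from collections import defaultdict

# Hypothetical collection: document number -> assigned keywords.
documents = {
    1: ["video", "retrieval"],
    2: ["image", "database"],
    3: ["video", "database"],
}

def direct_search(documents, keyword):
    """Direct system: scan every document in document-number order."""
    return [doc_id for doc_id in sorted(documents)
            if keyword in documents[doc_id]]

def build_inverted_file(documents):
    """Inverted system: list each document once under each of its keywords."""
    inverted = defaultdict(list)
    for doc_id in sorted(documents):
        for keyword in documents[doc_id]:
            inverted[keyword].append(doc_id)
    return dict(inverted)

def inverted_search(inverted, keyword):
    """Read only the section of the file for the query's index term."""
    return inverted.get(keyword, [])

inverted = build_inverted_file(documents)
print(direct_search(documents, "video"))
print(inverted_search(inverted, "video"))
```

Both searches return the same documents, but the inverted search touches only one entry while the direct search visits every document.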

11
Evaluation on Searching Results
  • Aim: to rank the list of answers from the search
    by using a ranking function
  • Different ranking functions can be used for
    calculating the weight of returned answers
  • One simple and popular function: counting the
    occurrences of query keywords
  • Not very fair: longer passages have a higher
    chance of containing more keywords
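
The bias toward long passages can be seen in a small sketch. Dividing the count by the passage length is one common remedy and is an assumption here; the slides do not name a specific correction. The passages are invented.

```python
def keyword_count_score(passage, query_keywords):
    """Simple ranking function: count occurrences of the query keywords."""
    words = passage.lower().split()
    return sum(words.count(keyword) for keyword in query_keywords)

def normalized_score(passage, query_keywords):
    """Assumed remedy: divide the count by the passage length in words."""
    words = passage.lower().split()
    if not words:
        return 0.0
    return keyword_count_score(passage, query_keywords) / len(words)

query = ["video", "retrieval"]
short_passage = "video retrieval methods"
long_passage = "video " * 3 + "other " * 27   # 30 words, 3 keyword hits

# The raw count prefers the long passage, the normalized score the short one.
print(keyword_count_score(short_passage, query),
      keyword_count_score(long_passage, query))
print(normalized_score(short_passage, query),
      normalized_score(long_passage, query))
```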

12
Feedback
  • Aim: to let users redefine the query statements
    for more responsive results
  • Users are asked to give feedback on the query
    results because of unclear queries, changes in
    user interest, etc.
  • Query statements may be modified, and the system
    should perform further searching. Relevant items
    should then correlate more highly with the
    modified query than with the original.
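
The slides do not say how the query is modified. One well-known scheme is Rocchio relevance feedback, used here purely as an illustrative stand-in: the query moves toward documents the user marked relevant and away from those marked non-relevant. The weights alpha, beta and gamma and the sample term vectors are assumptions.

```python
import math

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a modified query vector (dict of term -> weight)."""
    terms = set(query)
    for doc in relevant + non_relevant:
        terms.update(doc)
    new_query = {}
    for term in terms:
        pos = (sum(d.get(term, 0.0) for d in relevant) / len(relevant)
               if relevant else 0.0)
        neg = (sum(d.get(term, 0.0) for d in non_relevant) / len(non_relevant)
               if non_relevant else 0.0)
        weight = alpha * query.get(term, 0.0) + beta * pos - gamma * neg
        if weight > 0:          # drop terms pushed to zero or below
            new_query[term] = weight
    return new_query

def cosine(a, b):
    """Cosine correlation between two term-weight vectors."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

query = {"video": 1.0}
liked = {"video": 1.0, "subtitle": 1.0}      # marked relevant by the user
disliked = {"video": 1.0, "audio": 1.0}      # marked non-relevant
new_query = rocchio(query, [liked], [disliked])
print(cosine(query, liked), cosine(new_query, liked))
```

After feedback, the relevant document correlates more strongly with the modified query than with the original one.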

13
Flowchart of Feedback

14
Concept Based Query
  • An object-oriented method for indexing
  • Conceptual indexes (classes) are used, and a
    decision tree hierarchy is formed by those
    classes.
  • Users submit queries as before
  • Instead of answering documents, a list of
    concepts is returned first.
  • Users then narrow their search by indicating the
    desired classes or concepts
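
The two-step interaction can be sketched with a small class hierarchy. The Concept class, the helper functions and the sample hierarchy are hypothetical and only illustrate the idea of returning concepts before documents.

```python
class Concept:
    """A node in a hypothetical concept hierarchy."""

    def __init__(self, name, documents=(), children=()):
        self.name = name
        self.documents = list(documents)
        self.children = list(children)

def matching_concepts(concept, keyword):
    """Step 1: names of concepts holding documents that mention the keyword."""
    names = []
    if any(keyword in doc.lower() for doc in concept.documents):
        names.append(concept.name)
    for child in concept.children:
        names.extend(matching_concepts(child, keyword))
    return names

def documents_in(concept, chosen_name, keyword):
    """Step 2: matching documents held directly by the concept the user picked."""
    if concept.name == chosen_name:
        return [doc for doc in concept.documents if keyword in doc.lower()]
    for child in concept.children:
        found = documents_in(child, chosen_name, keyword)
        if found:
            return found
    return []

library = Concept("Media", children=[
    Concept("Video", ["Airport news video", "Football match video"]),
    Concept("Image", ["Airport photo"]),
])
print(matching_concepts(library, "airport"))
print(documents_in(library, "Image", "airport"))
```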

15
Characteristics of Multimedia
  • Large in file size
  • May be dynamic in nature (e.g. audio or video)
    instead of static (e.g. text, image)
  • No simple methods for indexing or describing the
    contents of the files
  • Various kinds of file formats (e.g. JPEG, GIF,
    TIFF for images; MOV, MPEG for video)

16
Existing Multimedia Digital Library - Informedia
  • Converting multimedia to text - speech
    recognition and optical character recognition,
    so that indexing and searching can be done by
    traditional methods
  • Face recognition - a non-text-based technique for
    matching the faces of persons in videos
  • Presenting results - poster frame, filmstrip, and
    skimming. These give users a faster review of the
    query answers for choosing the desired video

17
Internet Search Engines
  • The Internet is similar to a digital library:
  • a huge database
  • heterogeneous information
  • dynamic
  • decentralized
  • Common Internet search engines use a centralized
    index database
  • Disadvantages:
  • heavy workload on the server
  • inefficient use of bandwidth
  • poor quality of results

18
Distributed Search Engine
  • Local proxy servers can be enhanced to perform
    web searching, so that a network of search
    engines can be established
  • Response time is faster and network traffic can
    be reduced
  • Better results should be obtained

19
Video-on-Demand Systems
  • VoD systems deliver videos to clients upon their
    requests
  • A VoD system is similar to a digital library:
  • both deliver videos, which are large in size,
    upon user requests
  • Efficient retrieval is needed, and it can be
    achieved only if there is an efficient storage
    method.

20
How Data Is Stored in VoD
  • The primary design goal is to maximize the ratio
    of the number of concurrent streams to system
    cost while guaranteeing glitch-free operation
  • An array of magnetic hard disks and a large RAM
    buffer are used.
  • RAM has faster I/O rates than hard disks, so
    popular videos are kept in RAM
  • A popular video should not be stored together
    with other popular videos, which gives a better
    balance of workload.
  • RAID is used, and I/O is done by the whole array
    of disks at the same time.
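
The idea of keeping popular videos apart can be sketched with a greedy placement: each video, taken in order of popularity, goes to the disk with the lowest expected load so far. This heuristic, the function name place_videos and the demand figures are assumptions for illustration, not the scheme of any particular VoD system.

```python
import heapq

def place_videos(popularity, num_disks):
    """Assign videos to disks so that expected demand is spread evenly."""
    heap = [(0.0, disk) for disk in range(num_disks)]   # (load, disk number)
    heapq.heapify(heap)
    placement = {disk: [] for disk in range(num_disks)}
    # Most popular first; ties broken by title for determinism.
    ordered = sorted(popularity.items(), key=lambda kv: (-kv[1], kv[0]))
    for title, demand in ordered:
        load, disk = heapq.heappop(heap)                # least-loaded disk
        placement[disk].append(title)
        heapq.heappush(heap, (load + demand, disk))
    return placement

# Hypothetical share of requests per video, in percent.
popularity = {"news": 50, "movie": 40, "sports": 8, "drama": 2}
placement = place_videos(popularity, num_disks=2)
print(placement)
```

With two disks, the two most requested videos end up on different disks, and the less popular ones fill in until the loads are even.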

21
Image Databases
  • Documents are not indexed by verbal descriptions,
    as these may not describe the contents well.
  • Other means are used, e.g. histogram
    representations, shape chains, etc.
  • Similar to a digital library:
  • Both store multimedia information.
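
A histogram representation can be sketched as below, using gray levels for brevity; a color histogram would apply the same binning per channel. Histogram intersection is used as the similarity measure, which is an assumption since the slides do not name one. The pixel values are invented.

```python
def gray_histogram(pixels, bins=4, max_value=256):
    """Normalized histogram of gray values in the range 0..max_value-1."""
    hist = [0] * bins
    for value in pixels:
        hist[value * bins // max_value] += 1
    total = len(pixels)
    return [count / total for count in hist]

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means the two histograms are identical."""
    return sum(min(a, b) for a, b in zip(h1, h2))

# Hypothetical tiny images, flattened to lists of gray values.
dark = gray_histogram([10, 20, 30, 200])
also_dark = gray_histogram([5, 15, 100, 220])
bright = gray_histogram([250, 240, 230, 220])

print(histogram_intersection(dark, also_dark))
print(histogram_intersection(dark, bright))
```

A query image is indexed by its histogram, and stored images are ranked by how much their histograms overlap with it.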

22
Motion Databases
  • Implemented by Deng (1997). Closer to a digital
    library.
  • Indexes the video by three primary features:
  • color (color histogram)
  • texture (Gabor texture features)
  • motion (motion histogram)
  • Good for sports or movie data

23
Chinese Search Engines
  • Methods similar to those for English can be used
  • Chinese is very different from English as it is
    less structured. Sentences cannot be parsed
    according to the grammar
  • It is difficult to extract the idea in documents
    and identify the keywords for indexing
  • Subject-verb-object (SVO) structure can be used
    to identify the syntactic components

24
An Indexing Tool - Chinese Subtitle Extraction in
Video
  • Chinese has many dialects, but written Chinese
    characters are common everywhere
  • Many video programs have Chinese subtitles
    nowadays
  • Extracting text from digital video programs can
    help with indexing, searching and retrieval

25
Features of Subtitles
  • Characters are in foreground
  • They are monochrome
  • They are rigid, from frame to frame
  • They are upright
  • They have size restrictions
  • They contrast with the background
  • They appear in clusters at a limited distance
    aligned to a horizontal line

26
Implementation
  • Two main challenges:
  • to segment the character areas
  • to recognize the characters
  • Four phases:
  • extract the subtitle block from the background
  • extract each character from subtitle block
  • recognize the Chinese Characters
  • process the whole video

27
Sample Frame
  • ATV video news in MPEG format about Airport
    Authority
  • First, extract one frame from the video

28
Edge Filtering
  • Apply edge filtering to the frame using a Sobel
    filter.
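
A Sobel filter can be sketched in plain Python as below. The function works on a gray-level image given as a list of rows and uses |Gx| + |Gy| as the edge strength; this approximation and the zeroed border pixels are choices made for the example, not details taken from the presented tool.

```python
def sobel_magnitude(image):
    """Return |Gx| + |Gy| per pixel; border pixels are left at 0."""
    height, width = len(image), len(image[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Horizontal gradient: right column minus left column.
            gx = (image[y - 1][x + 1] + 2 * image[y][x + 1] + image[y + 1][x + 1]
                  - image[y - 1][x - 1] - 2 * image[y][x - 1] - image[y + 1][x - 1])
            # Vertical gradient: bottom row minus top row.
            gy = (image[y + 1][x - 1] + 2 * image[y + 1][x] + image[y + 1][x + 1]
                  - image[y - 1][x - 1] - 2 * image[y - 1][x] - image[y - 1][x + 1])
            edges[y][x] = abs(gx) + abs(gy)
    return edges

# Hypothetical 4x4 frame with a vertical edge between columns 1 and 2.
frame = [[0, 0, 255, 255] for _ in range(4)]
edges = sobel_magnitude(frame)
print(edges[1])
```

Pixels next to the brightness jump get a large response, while flat areas stay at zero.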

29
Subtitle Block Extraction
  • A high density of edges indicates that there is a
    subtitle block

30
Character Extraction
  • Filter out the background area and keep the
    subtitle block
  • Use the same method to segment the characters
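
Edge density can be measured with projection profiles: summing the edge map along rows locates the subtitle band, and summing along columns inside that band separates the characters. The sketch below assumes a binary edge map and hand-picked thresholds; the helper names are hypothetical.

```python
def find_runs(profile, threshold):
    """Return (start, end) pairs, end exclusive, where profile >= threshold."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value >= threshold and start is None:
            start = i
        elif value < threshold and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs

def row_profile(edges):
    """Number of edge pixels in each row."""
    return [sum(row) for row in edges]

def column_profile(edges, top, bottom):
    """Number of edge pixels in each column between rows top and bottom."""
    return [sum(edges[y][x] for y in range(top, bottom))
            for x in range(len(edges[0]))]

# Hypothetical binary edge map: 1 marks an edge pixel.
edge_map = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 1, 0],
    [1, 0, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
blocks = find_runs(row_profile(edge_map), threshold=3)
top, bottom = blocks[0]
characters = find_runs(column_profile(edge_map, top, bottom), threshold=1)
print(blocks, characters)
```

Here the rows with many edge pixels form one subtitle block, and the empty column inside it splits the block into two characters.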

31
Results of Recognition
  • A Chinese character image library is built for
    recognition
  • It holds 5401 frequently used Chinese characters
  • Simple subtraction is used for recognition
  • Characters segmented
  • Characters recognized
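
Recognition by simple subtraction can be sketched as follows: the segmented character is subtracted pixel by pixel from every image in the library, and the entry with the smallest total difference wins. The 3x3 templates and their labels are toy stand-ins for the real character images, and the assumption that all images share one size is made for the example.

```python
def difference(a, b):
    """Sum of absolute pixel differences between two same-sized images."""
    return sum(abs(pa - pb)
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))

def recognize(glyph, library):
    """Return the label of the library image closest to the glyph."""
    return min(library, key=lambda label: difference(glyph, library[label]))

# Hypothetical two-entry library of binary templates.
library = {
    "cross": [[0, 1, 0],
              [1, 1, 1],
              [0, 1, 0]],
    "box":   [[1, 1, 1],
              [1, 0, 1],
              [1, 1, 1]],
}
# A segmented character with one noisy pixel in the bottom-right corner.
glyph = [[0, 1, 0],
         [1, 1, 1],
         [0, 1, 1]]
print(recognize(glyph, library))
```

Because the comparison is purely pixel-wise, small shifts or size changes in the segmented character can easily lead to a wrong match.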

32
Evaluation
  • The success rate of segmenting the characters is
    quite high (90% in general)
  • The success rate of character recognition is low
    (15% in general)
  • Better algorithms for character recognition will
    be tried
  • The tool can be used for indexing video clips for
    a digital library

33
Conclusion
  • Information retrieval relates to many different
    fields: linguistics, image processing, data
    organization, hardware utilization, etc.
  • It involves many procedures: indexing, searching,
    organizing data, etc.
  • One specific area will be chosen to work on in
    the coming semester.

34
Q&A Session