Using HTML Metadata to Retrieve Relevant Images from the World Wide Web - PowerPoint PPT Presentation

1
Using HTML Metadata to Retrieve Relevant Images
from the World Wide Web
  • Ethan V. Munson
  • University of Wisconsin-Milwaukee

2
Why is image search important?
  • The Web is becoming the world's primary
    information source
  • Images are one of the Web's key features
  • Few WWW image search engines exist currently
  • Using textual search engines to find images
    manually is laborious

3
A Requirement for Web Image Search
  • We need an efficient method of discovering and
    indexing image content.
  • Two main sources of information about image
    content
    • image processing
    • associated text
      • text content
      • markup

4
Related work
  • QBIC (the IBM Almaden Research Center)
    • indexes and retrieves images according to
      • shape
      • color
      • texture
      • object layout
    • queries are formulated through visual examples
      • a sample image
      • user-provided sketches

5
Related work: QBIC system
6
Related work: QBIC system
7
Related work: QBIC system
8
QBIC Advantages and Disadvantages
  • Advantages
    • well-developed visual query language
    • interesting GUI
    • queries are based on image appearance
  • Disadvantages
    • works only at the primitive feature level (color,
      texture, shape)
    • doesn't recognize the semantics of an image
    • very sensitive to camera viewpoint
    • doesn't scale up to the Web

9
Related work
  • WebSeek (J. Smith and S. Chang, Columbia
    University)
    • performs a semi-automated classification of the
      images
      • automatically extracts keywords from image file
        names
      • computes the keyword histogram
      • manually creates a subject hierarchy
      • manually maps the images into the subject
        hierarchy
    • User can
      • browse the categories
      • search the categories by keyword
      • search the database using image features
        • color content
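
The two automatic steps listed for WebSeek (keywords from image file names, then a keyword histogram) can be illustrated with a minimal sketch. This is not WebSeek's actual code: the tokenization rule (split the file name on anything that is not a letter) and the function names `keywords_from_url` and `keyword_histogram` are assumptions made for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlparse


def keywords_from_url(url):
    """Split an image file name into lowercase keywords (assumed rule: letters only)."""
    file_name = urlparse(url).path.rsplit("/", 1)[-1]
    stem = file_name.rsplit(".", 1)[0]  # drop the extension, if any
    return [token.lower() for token in re.split(r"[^A-Za-z]+", stem) if token]


def keyword_histogram(urls):
    """Count how often each keyword occurs across a collection of image URLs."""
    return Counter(keyword for url in urls for keyword in keywords_from_url(url))


# Hypothetical image URLs, for illustration only.
example_urls = [
    "http://example.org/pics/sunset_beach.jpg",
    "http://example.org/a/Sunset-01.gif",
]
histogram = keyword_histogram(example_urls)
```

A histogram like this only ranks candidate terms; mapping them into a subject hierarchy is the manual step noted on the slide.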

17
WebSeek Advantages/Disadvantages
  • Advantages
    • Large index of Web images
    • Supports both text and image search
  • Disadvantages
    • Not clear that database can scale up
    • Manual categorization is very expensive
    • Relevance feedback mechanism is computationally
      expensive

18
Related work
  • WebSeer (M. Swain et al., The University of
    Chicago)
    • uses associated text and markup to supplement
      information derived from analyzing image content
    • uses multiple kinds of metadata
      • image file names
      • alternate text
      • text of a hyperlink
    • decides which images are photographs, portraits,
      or computer-generated drawings
    • research emphasized categorization, not
      metadata-based search

19
Why seek new image retrieval methods?
  • The number of WWW documents is growing rapidly
    and constantly changing
  • We need fast and efficient methods for finding
    images
  • Image processing is
    • complex
    • computationally expensive
    • limited (misses true image semantics)
    • unnecessary

20
Research Goals
  • Show that images can be found using HTML
    metadata
    • textual content
    • HTML tag structure
    • attribute values
  • Determine which metadata features are the best
    clues to image content

30
The URL Filter
  • assembles a list of URLs from the results
    returned by Alta Vista
  • parses the first page returned by Alta Vista
  • follows the URLs of results pages, retrieves
    these pages, and parses them
  • extracts list of URLs from the results pages

31
The Crawler
  • retrieves the pages
  • saves each page's HTML source code in a separate
    file

32
Tidy
  • converts arbitrary and probably ill-formed HTML
    into XHTML

33
XHTML Parser
  • parses an XHTML document
  • builds an XHTML parse tree
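
Because Tidy's output is well-formed XHTML, a generic XML parser is enough to build the tree. The sketch below uses Python's standard `xml.etree.ElementTree` as a stand-in for the parser described on the slide, which the transcript does not identify. The helper names `local_name` and `outline` are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(element):
    """Strip the '{namespace}' prefix that ElementTree adds to XHTML tag names."""
    return element.tag.split("}", 1)[-1]


def outline(element, depth=0):
    """Return the parse tree as indented tag names, one line per element."""
    lines = ["  " * depth + local_name(element)]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


# Hypothetical XHTML document, for illustration only.
xhtml = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Paris</title></head>"
    '<body><p><img src="eiffel.jpg" alt="Eiffel Tower"/></p></body>'
    "</html>"
)
root = ET.fromstring(xhtml)
tree_lines = outline(root)
```

Note that a plain XML parser rejects HTML named entities such as `&nbsp;` unless they are declared, so real pages may need extra handling even after Tidy.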

34
The Document Analyzer
  • scans the parse tree for image URLs
  • an image URL appears in either an image or anchor
    element
  • converts relative URLs into absolute URLs
  • uses various heuristics to determine which URLs
    point to relevant images
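
A minimal sketch of the scan and the relative-to-absolute conversion follows. The slide does not spell out the heuristics, so the rule used here for anchors (keep an `href` only if its path ends in a common image extension) and the extension list itself are assumptions, as is the function name `find_image_urls`.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin, urlparse

# Assumed list of image extensions; the original heuristics are not given.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".gif", ".png")


def find_image_urls(root, page_url):
    """Collect absolute image URLs from img and anchor elements, in document order."""
    image_urls = []
    for element in root.iter():
        tag = element.tag.split("}", 1)[-1].lower()  # ignore any XHTML namespace
        candidate = None
        if tag == "img":
            candidate = element.get("src")
        elif tag == "a":
            href = element.get("href")
            if href and urlparse(href).path.lower().endswith(IMAGE_EXTENSIONS):
                candidate = href
        if candidate:
            absolute = urljoin(page_url, candidate)  # resolve relative URLs
            if absolute not in image_urls:
                image_urls.append(absolute)
    return image_urls


# Hypothetical page fragment, for illustration only.
body = ET.fromstring(
    '<body><p><img src="img/tower.jpg"/></p>'
    '<a href="../big/tower.gif">full size</a>'
    '<a href="about.html">about</a></body>'
)
urls = find_image_urls(body, "http://example.org/paris/index.html")
```

Deciding which of these URLs point to relevant images is a separate step that depends on the metadata clues and the size filter.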

36
Search Strategies
  • Image file name
  • Textual content of the TITLE element
  • Value of the ALT attribute of IMG elements
  • Textual content of anchor elements
  • Value of the title attribute of anchor elements
  • Textual content of the paragraph surrounding an
    image
  • Textual content of any paragraph located within
    the same center element as the image
  • Textual content of heading elements

49
Image Retrieval Experiment
50
Experimental Questions
  • Which HTML features reveal the most information
    about an image?
  • Do particular patterns of HTML structure carry
    useful information?
  • Do image search results depend on the type of
    query?

51
Informal Experiments
  • Conducted extensive informal testing
    • to check software correctness
    • to investigate possible metadata clues
    • to determine rules for filtering out images based
      on size
      • images smaller than 65 pixels in either dimension
        almost never contained useful content
      • reduced the number of images we had to classify
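
The size rule can be written as a small filter. This sketch assumes the pixel dimensions are already known for each image (for example from the image header) and that each image is represented as a dictionary with `width` and `height` keys. That representation and the function names are hypothetical; the 65-pixel threshold is the one stated on the slide.

```python
MIN_DIMENSION = 65  # pixels; threshold stated on the slide


def is_large_enough(width, height, minimum=MIN_DIMENSION):
    """An image is kept only if neither dimension is below the minimum."""
    return width >= minimum and height >= minimum


def filter_small_images(images):
    """Drop images that are smaller than the threshold in either dimension."""
    return [
        image for image in images
        if is_large_enough(image["width"], image["height"])
    ]


# Hypothetical image records, for illustration only.
candidates = [
    {"url": "http://example.org/photo.jpg", "width": 400, "height": 300},
    {"url": "http://example.org/bullet.gif", "width": 12, "height": 12},
    {"url": "http://example.org/banner.gif", "width": 468, "height": 60},
]
kept = filter_small_images(candidates)
```

Note that the wide banner is rejected because its height alone falls below the threshold.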

52
Metadata Clues
  • Image file name
  • Textual content of the TITLE element
  • Value of the ALT attribute of IMG elements
  • Textual content of anchor elements
  • Value of the title attribute of anchor elements
  • Textual content of the paragraph surrounding an
    image
  • Textual content of any paragraph located within
    the same center element as the image
  • Textual content of heading elements

53
Query Categories
  • Famous people: Gorbachev, Yeltsin, and
    Streisand
  • Non-famous people: Yelena and Ekaterina
  • Famous places: Paris and London
  • Less-famous places: Bremen and Spokane
  • Phenomena: Explosion, Sunset, and
    Hurricane

54
Experimental Procedure
  • For each of the 12 queries
    • Alta Vista returned 200 URLs (20 groups of 10)
    • We used the first, middle, and last groups (30 URLs)
    • Downloaded pages and all images on pages
      • excluding small images (either dimension under
        65 pixels)
  • 276 pages and 1578 images were accessible
  • Manually determined relevance of each image
  • Used our system to determine the effectiveness of
    each of the 8 metadata clues
    • standard information retrieval measures:
      precision and recall
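
The sampling step (first, middle, and last groups of ten) can be sketched as below. The slide does not say which of the 20 groups counts as the "middle" one, so this sketch assumes the group at index `len(groups) // 2`, that is, the 11th of 20. The function name `sample_groups` is hypothetical.

```python
def sample_groups(urls, group_size=10):
    """Return the URLs from the first, middle, and last groups of a ranked list."""
    groups = [urls[i:i + group_size] for i in range(0, len(urls), group_size)]
    # Assumption: "middle" means the group at index len(groups) // 2.
    chosen = [groups[0], groups[len(groups) // 2], groups[-1]]
    return [url for group in chosen for url in group]


# Placeholder ranked results standing in for the 200 URLs of one query.
ranked = [f"http://example.org/result{i}" for i in range(200)]
sample = sample_groups(ranked)
```

Sampling from the top, middle, and bottom of the ranking keeps the workload at 30 pages per query while still covering low-ranked results.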

55
Information Retrieval Measures
                Retrieved   Not retrieved
Relevant            B             A
Nonrelevant         D             C
  • Recall = B/(A + B)
    • Warning: our study does not really test recall
    • We need a controlled sample of the Web, but
      instead, we are using Alta Vista's biased sample
  • Precision = B/(B + D)
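
Both measures follow directly from the four counts. In the sketch below the parameter names spell out the slide's letters (B is relevant and retrieved, A is relevant and not retrieved, D is nonrelevant and retrieved). The counts in the example are made up for illustration and are not results from the study.

```python
def recall(relevant_retrieved, relevant_not_retrieved):
    """Recall = B / (A + B): the share of relevant images that were retrieved."""
    total_relevant = relevant_retrieved + relevant_not_retrieved
    return relevant_retrieved / total_relevant if total_relevant else 0.0


def precision(relevant_retrieved, nonrelevant_retrieved):
    """Precision = B / (B + D): the share of retrieved images that are relevant."""
    total_retrieved = relevant_retrieved + nonrelevant_retrieved
    return relevant_retrieved / total_retrieved if total_retrieved else 0.0


# Hypothetical counts: B = 30, A = 10, D = 20.
example_recall = recall(30, 10)
example_precision = precision(30, 20)
```

Returning 0.0 when a denominator is zero is a convention chosen for this sketch; a clue that retrieves nothing could equally be reported as undefined.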

56
Recall Table
57
Precision Table
58
Key Results
                               Image file name   Textual content of TITLE   Value of ALT
Overall percent of recall           43.5                  62.1                  13.7
Overall percent of precision        70.7                  58.2                  87.5
  • Image file name has poor recall for people's
    names and excellent recall for less-famous
    cities
  • Famous names have poorer precision than
    non-famous and place names

59
Problems with this study
  • This is a single, small study
    • results must be replicated
  • No standard corpus for testing Web image search
    • our recall results are not reliable or truly
      sound
  • Our choice of tools may bias our results
    • Title tag may be important only because Alta
      Vista considers it important
    • Tidy may remove some clues
  • What is the structure of Text ?
  • Analysis of header clue is questionable

60
[Diagram: parse-tree fragments with Body, P, and IMG nodes]
61
Conclusion
  • Existing content-based image retrieval systems
    are not good models for Web image search
  • HTML metadata is useful for Web image search
  • Image file name and document title are most
    useful
  • Alternate text is extremely precise, when
    present
  • HTML metadata should provide faster image search
    than image processing approaches
  • no need to download and analyze images
  • can take advantage of existing search engines

62
Using HTML Metadata to Retrieve Relevant Images
from the Web
  • Ethan V. Munson
  • Dept. of Electrical Engineering and Computer
    Science
  • University of Wisconsin - Milwaukee
  • munson_at_cs.uwm.edu
  • http://www.cs.uwm.edu/multimedia