Title: Using HTML Metadata to Retrieve Relevant Images from the World Wide Web
1Using HTML Metadata to Retrieve Relevant Images
from the World Wide Web
- Ethan V. Munson
- University of Wisconsin-Milwaukee
2Why is image search important?
- The Web is becoming the world's primary
information source
- Images are one of the Web's key features
- Few WWW image search engines exist currently
- Using textual search engines to find images
manually is laborious
3A Requirement for Web Image Search
- We need an efficient method of discovering and
indexing image content.
- Two main sources of information about image
content
- image processing
- associated text
- text content
- markup
4Related work
- QBIC (the IBM Almaden Research Center)
- indexes and retrieves images according to
- shape
- color
- texture
- object layout
- queries are formulated through visual examples
- a sample image
- user provided sketches
5Related work QBIC system
8QBIC Advantages and Disadvantages
- Advantages
- well-developed visual query language
- interesting GUI
- queries are based on image appearance
- Disadvantages
- works only at the primitive feature level (color,
texture, shape)
- doesn't recognize the semantics of an image
- very sensitive to camera viewpoint
- doesn't scale up to the Web
9Related work
- WebSeek (J. Smith and S. Chang, Columbia
University)
- performs a semi-automated classification of the
images
- automatically extracts keywords from image file
names
- computes the keyword histogram
- manually creates a subject hierarchy
- manually maps the images into the subject
hierarchy
- User can
- browse the categories
- search the categories by keyword
- search the database using image features
- color content
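The file-name keyword step attributed to WebSeek above can be illustrated with a short Python sketch. This is not WebSeek's actual code: the tokenization rule (split on any run of non-letters) and the function names `filename_keywords` and `keyword_histogram` are assumptions made for illustration.

```python
import posixpath
import re
from collections import Counter
from urllib.parse import unquote, urlparse


def filename_keywords(image_url):
    """Split the file name of an image URL into lowercase alphabetic terms."""
    path = unquote(urlparse(image_url).path)
    stem, _extension = posixpath.splitext(posixpath.basename(path))
    # Assumed rule: any run of non-letters (digits, "_", "-", ".") separates terms.
    return [term.lower() for term in re.split(r"[^A-Za-z]+", stem) if term]


def keyword_histogram(image_urls):
    """Count how often each file-name term occurs across a set of image URLs."""
    counts = Counter()
    for url in image_urls:
        counts.update(filename_keywords(url))
    return counts
```

A histogram like this only shows which terms are frequent; mapping terms and images into a subject hierarchy is the manual step listed above.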
17WebSeek Advantages/Disadvantages
- Advantages
- Large index of Web images
- Supports both text and image search
- Disadvantages
- Not clear that database can scale up
- Manual categorization is very expensive
- Relevance feedback mechanism is computationally
expensive
18Related work
- WebSeer (M. Swain et al., The University of
Chicago)
- uses associated text and markup to supplement
information derived from analyzing image content
- uses multiple kinds of metadata
- image file names
- alternate text
- text of a hyperlink
- decides which images are photographs, portraits,
or computer-generated drawings
- research emphasized categorization, not
metadata-based search
19Why seek new image retrieval methods?
- The set of WWW documents is growing rapidly
and changing constantly
- We need fast and efficient methods for finding
images
- Image processing is
- complex
- computationally expensive
- limited (misses true image semantics)
- unnecessary
20Research Goals
- Show that images can be found using HTML
metadata
- textual content
- HTML tag structure
- attribute values
- Determine which metadata features are the best
clues to image content
30The URL Filter
- assembles a list of URLs from the results
returned by Alta Vista
- parses the first page returned by Alta Vista
- follows the URLs of results pages, retrieves
these pages, and parses them
- extracts list of URLs from the results pages
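The URL-extraction step can be sketched with Python's standard `html.parser`. The layout of Alta Vista's results pages is not described here, so this sketch assumes that every anchor pointing to a host other than the search engine's own is a result link; `LinkCollector` and `extract_result_urls` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the absolute target of every anchor element on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def extract_result_urls(results_html, results_page_url):
    """Return off-site http(s) links from a results page, in order, without repeats."""
    collector = LinkCollector(results_page_url)
    collector.feed(results_html)
    engine_host = urlparse(results_page_url).netloc
    seen, urls = set(), []
    for link in collector.links:
        parts = urlparse(link)
        # Assumption: links back to the engine's own host are navigation, not results.
        if parts.scheme in ("http", "https") and parts.netloc != engine_host and link not in seen:
            seen.add(link)
            urls.append(link)
    return urls
```

Links that stay on the search engine's host (such as "next page" links) are dropped here; a fuller filter would follow them to reach the remaining results pages.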
31The Crawler
- retrieves the pages
- saves each page's HTML source code in a separate
file
32Tidy
- converts arbitrary and probably ill-formed HTML
into XHTML
33XHTML Parser
- parses an XHTML document
- builds an XHTML parse tree
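Once Tidy has produced well-formed XHTML, any XML parser can build the tree. The sketch below uses Python's `xml.etree.ElementTree` as a stand-in; the original parser is not specified here. `parse_xhtml` and `outline` are hypothetical helpers, and the sketch assumes the input contains no undeclared entities such as `&nbsp;`, which this parser rejects.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def parse_xhtml(source):
    """Parse an XHTML string and strip the XHTML namespace from element tags."""
    root = ET.fromstring(source)
    for element in root.iter():
        if element.tag.startswith(XHTML_NS):
            element.tag = element.tag[len(XHTML_NS):]
    return root


def outline(element, depth=0):
    """Return the tree as a list of tag names indented by nesting depth."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines
```

Stripping the namespace is a convenience so that later steps can look up plain tag names such as `img` and `a`.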
34The Document Analyzer
- scans the parse tree for image URLs
- an image URL appears in either an image or anchor
element
- converts relative URLs into absolute URLs
- uses various heuristics to determine which URLs
point to relevant images
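The first three steps of the Document Analyzer can be sketched as follows. The relevance heuristics themselves are not shown. Treating an anchor as an image link only when its path ends in a known image extension is an assumption, as are the extension list and the name `find_image_urls`.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin, urlparse

# Assumed list of image file extensions; not taken from the original system.
IMAGE_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png")


def local_name(tag):
    """Drop any '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1].lower()


def find_image_urls(root, page_url):
    """Collect absolute image URLs from IMG and anchor elements, in document order."""
    found = []
    for element in root.iter():
        name = local_name(element.tag)
        ref = None
        if name == "img":
            ref = element.get("src")
        elif name == "a":
            href = element.get("href")
            # Keep an anchor only if it appears to point directly at an image file.
            if href and urlparse(href).path.lower().endswith(IMAGE_EXTENSIONS):
                ref = href
        if ref:
            absolute = urljoin(page_url, ref)  # relative -> absolute
            if absolute not in found:
                found.append(absolute)
    return found
```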
36Search Strategies
- Image file name
- Textual content of the TITLE element
- Value of the ALT attribute of IMG elements
- Textual content of anchor elements
- Value of the title attribute of anchor elements
- Textual content of the paragraph surrounding an
image
- Textual content of any paragraph located within
the same center element as the image
- Textual content of heading elements
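To make the strategies concrete, here is a minimal sketch that checks three of them (file name, TITLE, ALT) against a query term. The matching rule, a case-insensitive substring test, is an assumption; the original matching rule is not stated here. The sketch also assumes a parse tree with plain, un-namespaced tag names, and `clues_matching` is a hypothetical name.

```python
import posixpath
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def clues_matching(root, query):
    """Map each IMG src to the names of the clues whose text contains the query."""
    needle = query.lower()
    title = root.findtext(".//title") or ""
    results = {}
    for img in root.iter("img"):
        src = img.get("src", "")
        clues = {
            "file name": posixpath.basename(urlparse(src).path),
            "TITLE": title,
            "ALT": img.get("alt", ""),
        }
        # Assumed rule: a clue matches if the query occurs anywhere in its text.
        results[src] = [name for name, text in clues.items() if needle in text.lower()]
    return results
```

The TITLE clue applies to every image on the page, which is one reason a page-level clue can retrieve many images while being less precise than an image-level clue such as ALT.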
49Image Retrieval Experiment
50Experimental Questions
- Which HTML features reveal the most information
about an image?
- Do particular patterns of HTML structure carry
useful information?
- Do image search results depend on the type of
query?
51Informal Experiments
- Conducted extensive informal testing
- to check software correctness
- to investigate possible metadata clues
- to determine rules for filtering out images based
on size
- images smaller than 65 pixels in either dimension
almost never contained useful content
- reduced the number of images we had to classify
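The size rule above amounts to a one-line filter. In this sketch the image dimensions are passed in directly; how they are obtained (for example from WIDTH/HEIGHT attributes or from the image file) is left open, and `is_large_enough` is a hypothetical name.

```python
MIN_DIMENSION = 65  # pixels; the threshold reported above


def is_large_enough(width, height, min_dimension=MIN_DIMENSION):
    """Keep an image only if neither dimension is below the threshold."""
    return width >= min_dimension and height >= min_dimension


# Example with made-up dimensions: only the first image survives the filter.
candidates = [("photo.jpg", 320, 240), ("bullet.gif", 12, 12), ("rule.gif", 400, 4)]
kept = [name for name, width, height in candidates if is_large_enough(width, height)]
```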
52Metadata Clues
- Image file name
- Textual content of the TITLE element
- Value of the ALT attribute of IMG elements
- Textual content of anchor elements
- Value of the title attribute of anchor elements
- Textual content of the paragraph surrounding an
image
- Textual content of any paragraph located within
the same center element as the image
- Textual content of heading elements
53Query Categories
- Famous people: Gorbachev, Yeltsin, and
Streisand
- Non-famous people: Yelena and Ekaterina
- Famous places: Paris and London
- Less-famous places: Bremen and Spokane
- Phenomena: Explosion, Sunset, and
Hurricane
54Experimental Procedure
- For each of the 12 queries
- Alta Vista returned 200 URLs (20 groups of 10)
- We used first, middle, and last groups (30 URLs)
- Downloaded pages and all images on pages
- excluding small images (either dimension smaller than 65 pixels)
- 276 pages and 1578 images were accessible
- Manually determined relevance of each image
- Used our system to determine the effectiveness of
each of the 8 metadata clues
- standard information retrieval measures:
precision and recall
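The sampling of result groups can be sketched as below. Which of the 20 groups counts as the "middle" one is not stated, so the sketch assumes the group at index `len(groups) // 2`; `sample_groups` is a hypothetical name.

```python
def sample_groups(urls, group_size=10):
    """Return the URLs in the first, middle, and last groups of a ranked result list."""
    groups = [urls[i:i + group_size] for i in range(0, len(urls), group_size)]
    if len(groups) <= 3:
        # Too few groups to sample from: keep everything.
        return [url for group in groups for url in group]
    # Assumption: "middle" means the group at index len(groups) // 2.
    chosen = [groups[0], groups[len(groups) // 2], groups[-1]]
    return [url for group in chosen for url in group]
```

With 200 ranked URLs this yields 30 URLs: ranks 1-10, 101-110, and 191-200 under the stated assumption.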
55Information Retrieval Measures
- Relevant, retrieved: B
- Relevant, not retrieved: A
- Nonrelevant, not retrieved: C
- Nonrelevant, retrieved: D
- Recall = B / (A + B)
- Warning: our study does not really test recall
- We need a controlled sample of the Web, but
instead, we are using Alta Vista's biased sample
- Precision = B / (B + D)
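As a worked illustration of the two measures, the following sketch computes them from a set of retrieved images and a set of manually judged relevant images. The function name `evaluate` and the example data are made up; the letters in the comments match the cells defined on this slide.

```python
def evaluate(retrieved, relevant):
    """Return (recall, precision) for a set of retrieved and a set of relevant items."""
    retrieved, relevant = set(retrieved), set(relevant)
    b = len(retrieved & relevant)   # B: relevant, retrieved
    a = len(relevant - retrieved)   # A: relevant, not retrieved
    d = len(retrieved - relevant)   # D: nonrelevant, retrieved
    recall = b / (a + b) if (a + b) else 0.0
    precision = b / (b + d) if (b + d) else 0.0
    return recall, precision


# Made-up example: 3 images retrieved, 4 judged relevant, 2 in common.
recall, precision = evaluate({"a.jpg", "b.jpg", "c.jpg"},
                             {"a.jpg", "b.jpg", "d.jpg", "e.jpg"})
```

Cell C (nonrelevant, not retrieved) does not enter either measure.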
56Recall Table
57Precision Table
58Key Results
Metadata clue               Overall recall (%)   Overall precision (%)
Image file name             43.5                 70.7
Textual content of TITLE    62.1                 58.2
Value of ALT                13.7                 87.5
- Image file name has poor recall for people's
names and excellent recall for less-famous
cities
- Famous names have poorer precision than
non-famous and place names
59Problems with this study
- This is a single, small study
- results must be replicated
- No standard corpus for testing Web image search
- our recall results are not reliable or truly
sound
- Our choice of tools may bias our results
- Title tag may be important only because Alta
Vista considers it important
- Tidy may remove some clues
- What is the structure of Text ?
- Analysis of header clue is questionable
60(Diagram: document tree with a Body node above P and IMG nodes)
61Conclusion
- Existing content-based image retrieval systems
are not good models for Web image search
- HTML metadata is useful for Web image search
- Image file name and document title are most
useful
- Alternate text is extremely precise, when
present
- HTML metadata should provide faster image search
than image processing approaches
- no need to download and analyze images
- can take advantage of existing search engines
62Using HTML Metadata to Retrieve Relevant Images
from the Web
- Ethan V. Munson
- Dept. of Electrical Engineering and Computer
Science
- University of Wisconsin - Milwaukee
- munson_at_cs.uwm.edu
- http://www.cs.uwm.edu/multimedia