Using HTML Metadata to Retrieve Relevant Images from the World Wide Web

1
Using HTML Metadata to Retrieve Relevant Images
from the World Wide Web

Ethan V. Munson
University of Wisconsin-Milwaukee

2
Why is image search important?

The Web is becoming the world's primary
information source
Images are one of the Web's key features
Few WWW image search engines exist currently
Using textual search engines to find images
manually is laborious

3
A Requirement for Web Image Search

We need an efficient method of discovering and
indexing image content.
Two main sources of information about image
content
image processing
associated text
text content
markup

4
Related work

QBIC (the IBM Almaden Research Center)
indexes and retrieves images according to
shape
color
texture
object layout
queries are formulated through visual examples
a sample image
user provided sketches

5-7
Related work: QBIC system
8
QBIC Advantages and Disadvantages

Advantages
well-developed visual query language
interesting GUI
queries are based on image appearance
Disadvantages
works only at the primitive feature level (color,
texture, shape)
doesn't recognize the semantics of an image
very sensitive to camera viewpoint
doesn't scale up to the Web

9
Related work

WebSeek (J. Smith and S. Chang, Columbia
University)
performs a semi-automated classification of the
images
automatically extracts keywords from image file
names
computes the keyword histogram
manually creates a subject hierarchy
manually maps the images into the subject
hierarchy
User can
browse the categories
search the categories by keyword
search the database using image features
color content
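
The keyword extraction and histogram steps listed for WebSeek can be illustrated with a short sketch. This is not WebSeek's actual algorithm; the splitting rule (break the file name on any non-letter character) and the function names `filename_keywords` and `keyword_histogram` are assumptions made for illustration only.

```python
import posixpath
import re
from collections import Counter
from urllib.parse import urlparse


def filename_keywords(image_url):
    """Split the file name of an image URL into lowercase keywords."""
    path = urlparse(image_url).path
    stem = posixpath.splitext(posixpath.basename(path))[0]
    # Assumed rule: anything that is not a letter separates keywords.
    return [word.lower() for word in re.split(r"[^A-Za-z]+", stem) if word]


def keyword_histogram(image_urls):
    """Count how often each keyword occurs across a set of image URLs."""
    counts = Counter()
    for url in image_urls:
        counts.update(filename_keywords(url))
    return counts
```

A histogram built this way shows which terms occur often enough in file names to be candidate entries for a manually built subject hierarchy.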

17
WebSeek Advantages/Disadvantages

Advantages
Large index of Web images
Supports both text and image search
Disadvantages
Not clear that database can scale up
Manual categorization is very expensive
Relevance feedback mechanism is computationally
expensive

18
Related work

WebSeer (M. Swain et al., The University of
Chicago)
uses associated text and markup to supplement
information derived from analyzing image content
uses multiple kinds of metadata
image file names
alternate text
text of a hyperlink
decides which images are photographs, portraits,
or computer-generated drawings
research emphasized categorization, not
metadata-based search

19
Why seek new image retrieval methods?

The number of WWW documents is growing rapidly
and constantly changing
We need fast and efficient methods for finding
images
Image processing is
complex
computationally expensive
limited (misses true image semantics)
unnecessary

20
Research Goals

Show that images can be found using HTML
metadata
textual content
HTML tag structure
attribute values
Determine which metadata features are the best
clues to image content

30
The URL Filter

assembles a list of URLs from the results
returned by Alta Vista
parses the first page returned by Alta Vista
follows the URLs of results pages, retrieves
these pages, and parses them
extracts list of URLs from the results pages
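
A minimal sketch of the URL extraction step, assuming the results page is ordinary HTML in which each hit is an anchor element. The class `LinkCollector`, the function `extract_result_urls`, and the rule of discarding links that point back to the search engine's own host are assumptions; the slides do not say how the original filter told result links from navigation links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the absolute target of every anchor element on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            self.urls.append(urljoin(self.base_url, href))


def extract_result_urls(results_html, base_url, search_host):
    """Return off-site links from a results page, in order, without repeats."""
    collector = LinkCollector(base_url)
    collector.feed(results_html)
    seen = set()
    result_urls = []
    for url in collector.urls:
        # Assumed heuristic: links to the search engine itself are navigation.
        if urlparse(url).netloc == search_host or url in seen:
            continue
        seen.add(url)
        result_urls.append(url)
    return result_urls
```

The links that are discarded here (such as a "next page" link) are the ones a fuller version would follow to reach the remaining results pages.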

31
The Crawler

retrieves the pages
saves each page's HTML source code in a separate
file
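
The crawler step can be sketched as follows. The file naming scheme (`page_0000.html` and so on), the `fetch_page` parameter, and the choice to skip pages that cannot be retrieved are assumptions for illustration, not details taken from the original system.

```python
import os
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one page and return its raw bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def crawl(urls, out_dir, fetch_page=fetch):
    """Save each page's HTML source in its own file; return {url: path}."""
    os.makedirs(out_dir, exist_ok=True)
    saved = {}
    for index, url in enumerate(urls):
        try:
            body = fetch_page(url)
        except OSError:
            # urllib's URLError is an OSError; inaccessible pages are skipped.
            continue
        path = os.path.join(out_dir, f"page_{index:04d}.html")
        with open(path, "wb") as handle:
            handle.write(body)
        saved[url] = path
    return saved
```

Passing the download function in as a parameter keeps the file-saving logic testable without network access.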

32
Tidy

converts arbitrary and probably ill-formed HTML
into XHTML

33
XHTML Parser

parses an XHTML document
builds an XHTML parse tree
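
Because XHTML is well-formed XML, a general XML parser is enough to build the parse tree. The sketch below uses Python's standard `xml.etree.ElementTree` rather than whatever parser the original system used; the helpers `local_name` and `outline` are hypothetical, and the input is assumed to contain no named entities beyond those built into XML.

```python
import xml.etree.ElementTree as ET


def local_name(element):
    """Return the tag name without its '{namespace}' prefix."""
    return element.tag.rsplit("}", 1)[-1]


def outline(element, depth=0):
    """Return the parse tree as a list of indented tag names."""
    lines = ["  " * depth + local_name(element)]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


def parse_xhtml(source):
    """Parse an XHTML string and return the root element of its tree."""
    return ET.fromstring(source)
```

Calling `outline(parse_xhtml(source))` gives a quick view of which elements contain which, which is the information the later analysis steps rely on.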

34
The Document Analyzer

scans the parse tree for image URLs
an image URL appears in either an image or anchor
element
converts relative URLs into absolute URLs
uses various heuristics to determine which URLs
point to relevant images
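
A sketch of the scan for image URLs, assuming the page has already been converted to XHTML. The slides do not spell out the relevance heuristics, so the one used here (an anchor counts only if its target ends in a common image file extension) and the names `IMAGE_EXTENSIONS` and `find_image_urls` are assumptions.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Assumed list of extensions that mark an anchor target as an image.
IMAGE_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png")


def local_name(element):
    """Return the tag name without its '{namespace}' prefix."""
    return element.tag.rsplit("}", 1)[-1]


def find_image_urls(xhtml_source, page_url):
    """Return absolute image URLs found in img and anchor elements."""
    root = ET.fromstring(xhtml_source)
    found = []
    for element in root.iter():
        name = local_name(element)
        if name == "img":
            ref = element.get("src")
        elif name == "a":
            ref = element.get("href")
            if ref and not ref.lower().endswith(IMAGE_EXTENSIONS):
                ref = None  # anchor points at something other than an image
        else:
            ref = None
        if ref:
            absolute = urljoin(page_url, ref)  # relative -> absolute
            if absolute not in found:
                found.append(absolute)
    return found
```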

36
Search Strategies

Image file name
Textual content of the TITLE element
Value of the ALT attribute of IMG elements
Textual content of anchor elements
Value of the title attribute of anchor elements
Textual content of the paragraph surrounding an
image
Textual content of any paragraph located within
the same center element as the image
Textual content of heading elements
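
The sketch below shows how a query could be tested against some of these strategies. It covers only four of the eight (file name, TITLE, ALT, and the surrounding paragraph), and the case-insensitive substring match is an assumption; the slides do not state how the original system compared query terms with clue text. All function and key names are hypothetical.

```python
import posixpath
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def local_name(element):
    """Return the tag name without its '{namespace}' prefix."""
    return element.tag.rsplit("}", 1)[-1]


def clue_texts(root, img, parents):
    """Return {clue name: text} for one img element."""
    title = next((e for e in root.iter() if local_name(e) == "title"), None)
    # Walk up the tree to the nearest enclosing paragraph, if any.
    node = parents.get(img)
    while node is not None and local_name(node) != "p":
        node = parents.get(node)
    return {
        "file_name": posixpath.basename(urlparse(img.get("src", "")).path),
        "alt": img.get("alt", ""),
        "title": "".join(title.itertext()) if title is not None else "",
        "paragraph": "".join(node.itertext()) if node is not None else "",
    }


def matching_clues(xhtml_source, query):
    """Map each image's src to the sorted names of clues containing the query."""
    root = ET.fromstring(xhtml_source)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    query = query.lower()
    results = {}
    for element in root.iter():
        if local_name(element) != "img":
            continue
        clues = clue_texts(root, element, parents)
        results[element.get("src", "")] = sorted(
            name for name, text in clues.items() if query in text.lower()
        )
    return results
```

Recording which clue matched, rather than a single yes or no, is what allows each strategy to be scored separately.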

49
Image Retrieval Experiment
50
Experimental Questions

Which HTML features reveal the most information
about an image?
Do particular patterns of HTML structure carry
useful information?
Do image search results depend on the type of
query?

51
Informal Experiments

Conducted extensive informal testing
to check software correctness
to investigate possible metadata clues
to determine rules for filtering out images based
on size
images smaller than 65 pixels in either dimension
almost never contained useful content
reduced the number of images we had to classify
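
The size rule can be written as a simple filter. The tuple layout `(url, width, height)` and the function names are assumptions; the 65-pixel threshold is the one stated on the slide.

```python
MIN_DIMENSION = 65  # pixels; threshold taken from the slide


def is_large_enough(width, height, min_dimension=MIN_DIMENSION):
    """True unless the image is smaller than the threshold in either dimension."""
    return width >= min_dimension and height >= min_dimension


def filter_small_images(images):
    """Keep (url, width, height) entries that pass the size rule."""
    return [
        (url, width, height)
        for url, width, height in images
        if is_large_enough(width, height)
    ]
```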

52
Metadata Clues

Image file name
Textual content of the TITLE element
Value of the ALT attribute of IMG elements
Textual content of anchor elements
Value of the title attribute of anchor elements
Textual content of the paragraph surrounding an
image
Textual content of any paragraph located within
the same center element as the image
Textual content of heading elements

53
Query Categories

Famous people: Gorbachev, Yeltsin, and Streisand
Non-famous people: Yelena and Ekaterina
Famous places: Paris and London
Less-famous places: Bremen and Spokane
Phenomena: Explosion, Sunset, and Hurricane

54
Experimental Procedure

For each of the 12 queries
Alta Vista returned 200 URLs (20 groups of 10)
We used first, middle, and last groups (30 URLs)
Downloaded pages and all images on pages
excluding small images (under 65 pixels in either dimension)
276 pages and 1578 images were accessible
Manually determined relevance of each image
Used our system to determine the effectiveness of
each of the 8 metadata clues
standard information retrieval measures
precision and recall
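
The sampling step (first, middle, and last groups of 10 out of 200 URLs) can be sketched as below. The slide does not say which of the 20 groups counted as the "middle" one, so taking the group at index `len(groups) // 2` is an assumption, as is the function name `sample_groups`.

```python
def sample_groups(urls, group_size=10):
    """Return the URLs in the first, middle, and last groups of a ranked list."""
    groups = [urls[i:i + group_size] for i in range(0, len(urls), group_size)]
    if len(groups) < 3:
        return list(urls)  # too few groups to sample from
    middle = len(groups) // 2  # assumed definition of the "middle" group
    chosen = [groups[0], groups[middle], groups[-1]]
    return [url for group in chosen for url in group]
```

Sampling from the top, middle, and bottom of the ranking spreads the 30 pages across the search engine's whole result list instead of only its best hits.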

55
Information Retrieval Measures
B = relevant, retrieved
A = relevant, not retrieved
C = nonrelevant, not retrieved
D = nonrelevant, retrieved

Recall = B / (A + B)
Warning: our study does not really test recall
We need a controlled sample of the Web, but
instead, we are using Alta Vista's biased sample
Precision = B / (B + D)
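
As a worked illustration, the two measures can be computed from a set of relevant images and a set of retrieved images, with recall as B/(A + B) and precision as B/(B + D) in the slide's notation. The function name and the choice to return `None` when a denominator is zero are assumptions.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of image identifiers."""
    relevant, retrieved = set(relevant), set(retrieved)
    b = len(relevant & retrieved)   # relevant, retrieved
    a = len(relevant - retrieved)   # relevant, not retrieved
    d = len(retrieved - relevant)   # nonrelevant, retrieved
    precision = b / (b + d) if (b + d) else None
    recall = b / (a + b) if (a + b) else None
    return precision, recall
```

For example, with four relevant images of which two are among three retrieved images, precision is 2/3 and recall is 1/2.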

56
Recall Table
57
Precision Table
58
Key Results
                        Image file name   Textual content of TITLE   Value of ALT
Overall recall (%)           43.5                  62.1                  13.7
Overall precision (%)        70.7                  58.2                  87.5

Image file name has poor recall for people's
names and excellent recall for less-famous
cities
Famous names have poorer precision than
non-famous and place names

59
Problems with this study

This is a single, small study
results must be replicated
No standard corpus for testing Web image search
our recall results are not reliable or truly
sound
Our choice of tools may bias our results
Title tag may be important only because Alta
Vista considers it important
Tidy may remove some clues
What is the structure of Text ?
Analysis of header clue is questionable

60
[Figure: two parse-tree diagrams, each with Body, P, and IMG nodes]
61
Conclusion

Existing content-based image retrieval systems
are not good models for Web image search
HTML metadata is useful for Web image search
Image file name and document title are most
useful
Alternate text is extremely precise, when
present
HTML metadata should provide faster image search
than image processing approaches
no need to download and analyze images
can take advantage of existing search engines

62
Using HTML Metadata to Retrieve Relevant Images
from the Web