Title: Using HTML Textual and Structural Data for Web Image Search
1. Using HTML Textual and Structural Data for Web Image Search
Cheng Thao, Ethan Munson, Jim Dabrowski, Nikolas D. Bohne
University of Wisconsin-Milwaukee
2. Which image shows George Bush, or includes George Bush?
3. Which images are similar to this image?
5. Does the HTML source tell which image is George Bush?
<tr>
<td width=400 bgcolor=#ffffff><center><FONT FACE="Arial, Helvetica" SIZE=-1>
<IMG SRC="http://www.gopbi.com/community/groups/bigband/images/bill%20cosby.jpg" alt="" BORDER=0 VSPACE=3 HSPACE=3>
</FONT></center><em><center><FONT FACE="Arial, Helvetica" SIZE=-1>Bill Cosby</center></FONT></em></td>
<td width=210 bgcolor=#ffffff><center><FONT FACE="Arial, Helvetica" SIZE=-1>
<IMG SRC="http://www.gopbi.com/community/groups/bigband/images/betty%20white.jpg" alt="" BORDER=0 VSPACE=3 HSPACE=3>
</FONT></center><em><center><FONT FACE="Arial, Helvetica" SIZE=-1>Betty White</center></FONT></em></td>
</tr>
<tr>
<td width=400 bgcolor=#ffffff><center><FONT FACE="Arial, Helvetica" SIZE=-1>
<IMG SRC="http://www.gopbi.com/community/groups/bigband/images/tom%20brokaw.jpg" alt="" BORDER=0 VSPACE=3 HSPACE=3>
</FONT></center><em><center><FONT FACE="Arial, Helvetica" SIZE=-1>Tom Brokaw</center></FONT></em></td>
<td width=210 bgcolor=#ffffff><center><FONT FACE="Arial, Helvetica" SIZE=-1>
<IMG SRC="http://www.gopbi.com/community/groups/bigband/images/george%20bush.jpg" alt="" BORDER=0 VSPACE=3 HSPACE=3>
</FONT></center><em><center><FONT FACE="Arial, Helvetica" SIZE=-1>Pres. George Bush</center></FONT></em></td>
</tr>
<tr>
<td width=400 bgcolor=#ffffff><center><FONT FACE="Arial, Helvetica" SIZE=-1>
<IMG SRC="http://www.gopbi.com/community/groups/bigband/images/ed%20mcmahon.jpg" alt="" BORDER=0 VSPACE=3 HSPACE=3>
</FONT></center><em><center><FONT FACE="Arial, Helvetica" SIZE=-1>Ed McMahon</center></FONT></em></td>
<td width=210 bgcolor=#ffffff><center><FONT FACE="Arial, Helvetica" SIZE=-1>
<IMG SRC="http://www.gopbi.com/community/groups/bigband/images/bob%20barker.jpg" alt="" BORDER=0 VSPACE=3 HSPACE=3>
</FONT></center><em><center><FONT FACE="Arial, Helvetica" SIZE=-1>Bob Barker</center></FONT></em></td>
</tr>
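In markup like this, the only text tied to each image is the name in the same table cell. A minimal sketch of how that pairing could be read out of the source, using Python's standard html.parser; the class name CellCaptionParser and the function images_matching are illustrative and not part of the study's software.

```python
from html.parser import HTMLParser


class CellCaptionParser(HTMLParser):
    """Collect (image src, cell text) pairs, one per image found in a <td>."""

    def __init__(self):
        super().__init__()
        self.pairs = []       # finished (src, caption) tuples
        self._in_cell = False
        self._srcs = []       # image sources seen in the current cell
        self._text = []       # text fragments seen in the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._flush()     # tolerate a missing </td> before the next cell
            self._in_cell = True
        elif tag == "img" and self._in_cell:
            self._srcs.append(dict(attrs).get("src") or "")

    def handle_endtag(self, tag):
        if tag in ("td", "tr", "table"):
            self._flush()

    def handle_data(self, data):
        if self._in_cell and data.strip():
            self._text.append(data.strip())

    def _flush(self):
        caption = " ".join(self._text)
        self.pairs.extend((src, caption) for src in self._srcs)
        self._in_cell, self._srcs, self._text = False, [], []

    def close(self):
        super().close()
        self._flush()


def images_matching(html, query):
    """Return the src of every image whose cell text contains all query words."""
    parser = CellCaptionParser()
    parser.feed(html)
    parser.close()
    words = query.lower().split()
    return [src for src, caption in parser.pairs
            if all(w in caption.lower() for w in words)]
```

Fed the rows above with the query "George Bush", this sketch would pick out the george%20bush.jpg image because its cell text contains both words. It ignores nested tables and cell spans.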
6. Introduction
- Image search is difficult
  - performance is slow
  - image identification is a complex, inaccurate task
- Most research on image search has emphasized analysis of image content
- Few Web image search engines
  - commercial: Alta Vista, Google
  - research: WebSeek
- Little research in textual image search
7. HTML overview
- HTML document composed of
  - head
    - title
    - meta
  - body
    - paragraph, table, text, link, image
8. Sample HTML
<html>
<head>
<title>Sample HTML</title>
<meta name="keywords" content="html, html elements">
<meta name="description" content="showing a simple html and some html elements">
</head>
<body>
<h1>HTML overview</h1>
<p> first paragraph</p>
<table border="1">
<caption> Simple Table</caption>
<tr><td>1<td>2<td>3
<tr><td>4<td>5<td>6
<tr><td>7<td>8<td>9
</table>
Here is a photo of George Bush. <br>
<img src="g-bush.jpg">
</body>
</html>
10. Previous work - Yelena Tsymbalenko
- Studied HTML constructs to determine which can be used in image search
- Found the following to be effective
  - title of the page
  - image filename
  - image alt attribute
11. Research Goals
- What HTML features make good clues to the content of images?
  - Structural features (document, table)
  - File names or URLs
  - Formatting of material (bold, heading)
- How can clues be combined into a single relevance rating?
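One simple way to picture a combined rating, offered only as an illustration and not as a result of this study, is a weighted fraction of the clues that match the query. The function name relevance, the clue names, and every weight below are hypothetical placeholders.

```python
def relevance(clue_matches, weights):
    """Weighted fraction of matching clues, from 0.0 (none) to 1.0 (all).

    clue_matches maps a clue name to True/False (did it match the query?).
    weights maps a clue name to its importance; the values are placeholders.
    """
    total = sum(weights.values())
    if total == 0:
        return 0.0
    hit = sum(w for name, w in weights.items() if clue_matches.get(name))
    return hit / total


# Hypothetical weights, chosen only to make the arithmetic easy to follow.
example_weights = {"alt": 2.0, "filename": 1.0, "page_title": 1.0}
example_score = relevance({"alt": True, "filename": False}, example_weights)
```

A linear form is only one candidate; which function actually fits human ratings is the open question stated above.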
12. Image Search Study Process
- Downloading pages with matching text
  - Use an existing search engine to identify matches
  - These pages provide a corpus of images
  - We download pages so that our corpus remains static
  - The download acts as a snapshot
- Clue extraction
  - Analyze each page in the corpus for all possible clues to image content
- Human relevance ratings
  - A human rates whether an image is relevant to the query
- Statistical analysis to find clue-based relevance functions
13. Process: Downloading Web Pages
[Diagram labels: queries, <query>, Downloading Software, query, Search Engine, URLs, Web Pages, images]
Web pages and images are saved to local disk.
14. Design: Queries in XML
Multiple queries are stored in an XML file.
Engine: 1 = Alta Vista, 2 = Excite, 3 = Hotbot, 4 = Google
Method: 1 = or, 2 = and, 3 = expression
Example 1: search for George Bush using Alta Vista; results must have all the words.
Example 2: search for Bill Clinton using Hotbot; search for the exact expression.
<query>
  <engine> 1 </engine>
  <method> 2 </method>
  <word> George </word>
  <word> Bush </word>
</query>
<query>
  <engine> 3 </engine>
  <method> 3 </method>
  <word> Bill </word>
  <word> Clinton </word>
</query>
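A query file in this form could be read with Python's standard xml.etree.ElementTree, as sketched below. The enclosing <queries> root element is an assumption (an XML file needs a single root, and the slide does not show one), and load_queries is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Code tables as listed on the slide.
ENGINES = {1: "Alta Vista", 2: "Excite", 3: "Hotbot", 4: "Google"}
METHODS = {1: "or", 2: "and", 3: "expression"}


def load_queries(xml_text):
    """Return one dict per <query> element: engine name, method name, words."""
    root = ET.fromstring(xml_text)
    queries = []
    for q in root.iter("query"):
        queries.append({
            "engine": ENGINES[int(q.findtext("engine"))],
            "method": METHODS[int(q.findtext("method"))],
            "words": [w.text.strip() for w in q.findall("word")],
        })
    return queries


# Assumed wrapper element <queries>; the <query> content follows the slide.
SAMPLE = """
<queries>
  <query>
    <engine> 1 </engine> <method> 2 </method>
    <word> George </word> <word> Bush </word>
  </query>
</queries>
"""
```

Calling load_queries(SAMPLE) yields a single query for Alta Vista with method "and" and the words George and Bush.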
15. Process: Clue Extraction
[Diagram labels: queries, <query>, Clues Extraction Software, clues]
16. Data to be analyzed
- For each page
  - Query used to find page
  - Source URL
- For each image
  - Source URL
  - Attributes
  - Position in document
- For each clue
  - Whether clue feature occurs in document at all
  - If feature occurs with text matching the query
  - Position in document for each occurrence
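The fields listed above map naturally onto three record types. The sketch below is one possible layout, not the project's actual schema: the class and field names are illustrative, positions are assumed to be integer offsets, and attaching clue records to an image rather than to the page is also an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ClueRecord:
    name: str             # which clue, e.g. "alt" or "caption" (names assumed)
    occurs: bool          # whether the clue feature occurs in the document at all
    matches_query: bool   # whether it occurs with text matching the query
    positions: List[int] = field(default_factory=list)  # one per occurrence


@dataclass
class ImageRecord:
    source_url: str
    attributes: Dict[str, str]
    position: int         # position in document (unit assumed: integer offset)
    clues: List[ClueRecord] = field(default_factory=list)


@dataclass
class PageRecord:
    query: str            # query used to find the page
    source_url: str
    images: List[ImageRecord] = field(default_factory=list)
```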
17. Process: Relevance Rating
[Diagram labels: queries, <query>, Query image, Relevance Rating Software, Human, Relevant/not]
The software presents the images for each query from the database to the user and records the human relevance rating back to the database.
18. Clues: global
Global clues - clues that apply to every image on the page
- filename of page
- path of page
- host of page
- title element of the Web page
- keywords found in meta element
- description found in meta element
Why do we break the URL into three clues? Different parts of the URL contribute different relevance factors to the overall relevance of the images on that page.
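Splitting a page URL into the three clues can be sketched with Python's standard urllib.parse. The function name split_page_url and the example URL used in the note below are made up for illustration.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def split_page_url(url):
    """Split a URL into the host, path, and filename clues."""
    parts = urlsplit(url)
    # posixpath.split separates the last path segment from its directory.
    directory, filename = posixpath.split(parts.path)
    return {
        "host": parts.hostname or "",
        "path": unquote(directory),
        "filename": unquote(filename),   # %20 becomes a space, and so on
    }
```

For a hypothetical page at http://www.example.com/news/politics/george-bush.html, the host is www.example.com, the path is /news/politics, and the filename is george-bush.html.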
19. Clues: global
<HEAD>
<TITLE>Apple</TITLE>
<META NAME="keywords" CONTENT="Apple Computer, Power Macintosh, PowerBook, AppleWorks, WebObjects, iMovie, QuickTime, Desktop Movies, Software, Operating Systems, Mac OS, iMac, iBook">
<META NAME="Description" CONTENT="Visit www.apple.com for the latest news, the hottest products, and technical support resources from Apple Computer, Inc.">
<META HTTP-EQUIV="Expires" CONTENT="Fri, 26 Mar 1999 23:59:59 GMT">
<META NAME="Date.Modified" CONTENT="19992109">
</HEAD>
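The title, keywords, and description clues in a head section like this one could be collected as sketched below, again with the standard html.parser. HeadClueParser is an illustrative name, and other meta elements (Expires, Date.Modified) are simply skipped.

```python
from html.parser import HTMLParser


class HeadClueParser(HTMLParser):
    """Collect the page title and the keywords/description meta contents."""

    def __init__(self):
        super().__init__()
        self.clues = {"title": "", "keywords": "", "description": ""}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            a = dict(attrs)
            name = (a.get("name") or "").lower()   # "Description" -> "description"
            if name in ("keywords", "description"):
                self.clues[name] = a.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.clues["title"] += data


def head_clues(html):
    parser = HeadClueParser()
    parser.feed(html)
    parser.close()
    return parser.clues
```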
20. Clues: image file
Image file properties (external properties)
- filename
- path
- host
An image can come from a different host than the page and can have a different path.
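Because an img src may be relative or may point at another host, it has to be resolved against the page URL before its clues are taken. A minimal sketch using urllib.parse.urljoin; image_location and the example URLs in the note below are hypothetical.

```python
from urllib.parse import urljoin, urlsplit


def image_location(page_url, img_src):
    """Resolve img_src against page_url and split it into image file clues."""
    absolute = urljoin(page_url, img_src)
    parts = urlsplit(absolute)
    path, _, filename = parts.path.rpartition("/")
    return {
        "url": absolute,
        "host": parts.hostname or "",
        "path": path,
        "filename": filename,
        # True when the image is served by the same host as the page
        "same_host": parts.hostname == urlsplit(page_url).hostname,
    }
```

With a page at http://www.example.com/news/index.html, the src images/g-bush.jpg resolves to the same host under /news/images, while an absolute src on another server keeps its own host and path.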
21. Clues: common attributes
Elements have common attributes
- title - describes what the element is
- id - used to identify the element
- name - same as id, in older HTML
Clues that use these attributes: link, image, object, table, cell, row
22. Clues: Image Container
Link to an image <a>
- text enclosed within the link element
Embedded image element <img>
- alt attribute (usually describes what the image is)
Object element <object>
- text enclosed within the object element
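The three container clues can be pulled out in one pass over the page, as in this sketch. ContainerClueParser is an illustrative name; treating a link as "a link to an image" when its href ends in a common image extension is an assumption, as is reading the object's image reference from its data attribute.

```python
from html.parser import HTMLParser

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".gif", ".png")   # assumed list


class ContainerClueParser(HTMLParser):
    """Collect (kind, image reference, clue text) for <a>, <img>, and <object>."""

    def __init__(self):
        super().__init__()
        self.clues = []
        self._open = None   # [kind, reference, text fragments] while inside <a>/<object>

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "img":
            self.clues.append(("alt", a.get("src") or "", a.get("alt") or ""))
        elif tag == "a" and (a.get("href") or "").lower().endswith(IMAGE_EXTENSIONS):
            self._open = ["link", a["href"], []]
        elif tag == "object":
            self._open = ["object", a.get("data") or "", []]

    def handle_data(self, data):
        if self._open and data.strip():
            self._open[2].append(data.strip())

    def handle_endtag(self, tag):
        closing = {"link": "a", "object": "object"}
        if self._open and tag == closing[self._open[0]]:
            kind, ref, text = self._open
            self.clues.append((kind, ref, " ".join(text)))
            self._open = None


def container_clues(html):
    parser = ContainerClueParser()
    parser.feed(html)
    parser.close()
    return parser.clues
```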
23. Clues: table
Table (<table>)
- summary attribute - describes the table content
- caption - describes the table content
- row heading
- row
- column heading
- column
- cell
- neighboring cells (above, below, right, left)
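The neighboring-cell clue needs the table laid out as a grid. The sketch below builds one and looks up the four neighbors of a cell. It is deliberately simplified: it expects only the markup of a single table, and it ignores rowspan, colspan, and nested tables. TableGridParser and neighbor_text are illustrative names.

```python
from html.parser import HTMLParser


class TableGridParser(HTMLParser):
    """Build grid[row][col] = {"text": str, "images": [src, ...]} for one table."""

    def __init__(self):
        super().__init__()
        self.grid = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.grid.append([])
        elif tag in ("td", "th"):
            if not self.grid:            # cell without an explicit <tr>
                self.grid.append([])
            self.grid[-1].append({"text": "", "images": []})
        elif tag == "img" and self.grid and self.grid[-1]:
            self.grid[-1][-1]["images"].append(dict(attrs).get("src") or "")

    def handle_data(self, data):
        if self.grid and self.grid[-1] and data.strip():
            cell = self.grid[-1][-1]
            cell["text"] = (cell["text"] + " " + data.strip()).strip()


def neighbor_text(grid, row, col):
    """Text of the cells above, below, left, and right of grid[row][col]."""
    def text_at(r, c):
        if 0 <= r < len(grid) and 0 <= c < len(grid[r]):
            return grid[r][c]["text"]
        return ""                        # no neighbor on that side
    return {"above": text_at(row - 1, col), "below": text_at(row + 1, col),
            "left": text_at(row, col - 1), "right": text_at(row, col + 1)}
```

An image cell whose own text is empty can then borrow the text of the cell above or beside it as a candidate caption.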
31. Clues: headings
Heading elements (h1, h2, ..., h6)
- headings above image
- headings below image
<h1>header above image</h1>
<img src=sample.jpg>
<h2>header below image</h2>
A heading can indicate a topic, and images below the heading may relate to it. Some pages use headings as captions above images, and sometimes below images. Some headings are used where font or bold markup should be used instead.
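Recording the nearest heading above and below each image can be sketched as a single pass over the page. HeadingClueParser is an illustrative name, and "nearest" here simply means closest in document order, which is an assumption about how the clue would be defined.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingClueParser(HTMLParser):
    """For each image, record the closest heading before it and after it."""

    def __init__(self):
        super().__init__()
        self.images = []          # dicts with keys: src, above, below
        self._last_heading = ""   # text of the most recent finished heading
        self._current = None      # text fragments of the heading being read
        self._waiting = []        # images still waiting for a heading below

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self._current = []
        elif tag == "img":
            record = {"src": dict(attrs).get("src") or "",
                      "above": self._last_heading, "below": ""}
            self.images.append(record)
            self._waiting.append(record)

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data.strip())

    def handle_endtag(self, tag):
        if tag in HEADINGS and self._current is not None:
            self._last_heading = " ".join(t for t in self._current if t)
            for record in self._waiting:   # this heading is "below" them
                record["below"] = self._last_heading
            self._waiting = []
            self._current = None


def heading_clues(html):
    parser = HeadingClueParser()
    parser.feed(html)
    parser.close()
    return parser.images
```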
32. In this photo, the heading comes after the image. A heading used as a topic usually comes before the image, but some images have a heading as a caption below them.
33. Clues: text
Emphasized text elements
- bold
- italic
- underline
- strong
- emphasis
- big
Body text
- text that surrounds the image
- distance of the text from the image
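One way to make the distance clue concrete is to count words between the image and a query word in the flattened body text. The definition used below (an adjacent word has distance 1) and the names TokenStream and word_distance are assumptions for illustration; script and style content is not filtered out.

```python
from html.parser import HTMLParser


class TokenStream(HTMLParser):
    """Flatten a page into lower-cased words plus ("img", src) markers."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.tokens.append(("img", dict(attrs).get("src") or ""))

    def handle_data(self, data):
        for raw in data.split():
            word = raw.strip(".,;:!?\"'()").lower()
            if word:
                self.tokens.append(word)


def word_distance(html, src, word):
    """Smallest word count between image `src` and `word`; None if either is absent."""
    stream = TokenStream()
    stream.feed(html)
    stream.close()
    words_seen = 0
    image_at, word_at = [], []
    for tok in stream.tokens:
        if isinstance(tok, tuple):
            if tok[1] == src:
                image_at.append(words_seen)   # image sits before word index words_seen
        else:
            if tok == word.lower():
                word_at.append(words_seen)
            words_seen += 1
    distances = [(w - i + 1) if w >= i else (i - w)
                 for i in image_at for w in word_at]
    return min(distances) if distances else None
```

In the sample page's body text, "Here is a photo of George Bush." followed by the image, the word "Bush" is at distance 1 and "photo" at distance 4.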
34. Current Project Status
- Prototype download and clue extraction software nearly complete
  - now testing implementation
  - data (without human relevance ratings) in early November
- Recruiting students to build on-line relevance rating system
  - hope to get students outside lab to help with ratings via Web interface
35. Challenges for image search systems
- computing word distance from image
- stylesheets used for presentation
- table patterns
- patterns of HTML element usage
- CGI-returned images
- structural boundaries
- patterns in Web page design
- HTML generators
36. Cheng Thao, chengt_at_uwm.edu