Using HTML Textual and Structural Data for Web Image Search

1
Using HTML Textual and Structural Data for Web Image Search
Cheng Thao, Ethan Munson, Jim Dabrowski, Nikolas D. Bohne
University of Wisconsin-Milwaukee
2
Which image shows George Bush, or includes George Bush?
3
Which images are similar to this image?
4
(No Transcript)
5
Does the HTML source tell which image is George Bush?
<tr>
<td width=400 bgcolor=ffffff><center><FONT FACE="Arial, Helvetica" SIZE=-1>
<IMG SRC="http://www.gopbi.com/community/groups/bigband/images/bill%20cosby.jpg" alt="" BORDER=0 VSPACE=3 HSPACE=3>
</FONT></center><em><center><FONT FACE="Arial, Helvetica" SIZE=-1>Bill Cosby</center></FONT></em></td>
<td width=210 bgcolor=ffffff><center><FONT FACE="Arial, Helvetica" SIZE=-1>
<IMG SRC="http://www.gopbi.com/community/groups/bigband/images/betty%20white.jpg" alt="" BORDER=0 VSPACE=3 HSPACE=3>
</FONT></center><em><center><FONT FACE="Arial, Helvetica" SIZE=-1>Betty White</center></FONT></em></td>
</tr>
<tr>
<td width=400 bgcolor=ffffff><center><FONT FACE="Arial, Helvetica" SIZE=-1>
<IMG SRC="http://www.gopbi.com/community/groups/bigband/images/tom%20brokaw.jpg" alt="" BORDER=0 VSPACE=3 HSPACE=3>
</FONT></center><em><center><FONT FACE="Arial, Helvetica" SIZE=-1>Tom Brokaw</center></FONT></em></td>
<td width=210 bgcolor=ffffff><center><FONT FACE="Arial, Helvetica" SIZE=-1>
<IMG SRC="http://www.gopbi.com/community/groups/bigband/images/george%20bush.jpg" alt="" BORDER=0 VSPACE=3 HSPACE=3>
</FONT></center><em><center><FONT FACE="Arial, Helvetica" SIZE=-1>Pres. George Bush</center></FONT></em></td>
</tr>
<tr>
<td width=400 bgcolor=ffffff><center><FONT FACE="Arial, Helvetica" SIZE=-1>
<IMG SRC="http://www.gopbi.com/community/groups/bigband/images/ed%20mcmahon.jpg" alt="" BORDER=0 VSPACE=3 HSPACE=3>
</FONT></center><em><center><FONT FACE="Arial, Helvetica" SIZE=-1>Ed McMahon</center></FONT></em></td>
<td width=210 bgcolor=ffffff><center><FONT FACE="Arial, Helvetica" SIZE=-1>
<IMG SRC="http://www.gopbi.com/community/groups/bigband/images/bob%20barker.jpg" alt="" BORDER=0 VSPACE=3 HSPACE=3>
</FONT></center><em><center><FONT FACE="Arial, Helvetica" SIZE=-1>Bob Barker</center></FONT></em></td>
</tr>
6
Introduction
  • Image search is difficult
      - performance is slow
      - image identification is a complex, inaccurate task
  • Most research on image search has emphasized analysis of image content
  • Few Web image search engines
      - commercial: Alta Vista, Google
      - research: WebSeek
  • Little research in textual image search
7
HTML overview

  • An HTML document is composed of
      - head: title, meta
      - body: paragraph, table, text, link, image
8
<html>
<head>
<title>Sample HTML</title>
<meta name="keywords" content="html, html elements">
<meta name="description" content="showing a simple html and some html elements">
</head>
<body>
<h1>HTML overview</h1>
<p> first paragraph</p>
<table border="1">
<caption> Simple Table</caption>
<tr><td>1<td>2<td>3
<tr><td>4<td>5<td>6
<tr><td>7<td>8<td>9
</table>
Here is a photo of George Bush. <br>
<img src="g-bush.jpg">
</body>
</html>
9
(No Transcript)
10
Previous work - Yelena Tsymbalenko

  • Studied HTML constructs to determine which can be used in image search
  • Found the following to be effective
      - title of the page
      - image filename
      - image alt attribute
11
Research Goals
  • What HTML features make good clues to the content
    of images?
      - Structural features (document, table)
      - File names or URLs
      - Formatting of material (bold, heading)
  • How can clues be combined into a single
    relevance rating?
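
One simple way to picture combining clues into a single relevance rating is a weighted sum over the clues whose text matched the query. This is only a sketch: the clue names and weights below are made-up placeholders, not values from this study, which plans to derive its relevance functions by statistical analysis.

```python
# Hypothetical weights: illustrative values only, not measured in the study.
CLUE_WEIGHTS = {
    "page_title": 0.3,
    "image_filename": 0.4,
    "image_alt": 0.3,
}


def relevance_score(matched_clues, weights=CLUE_WEIGHTS):
    """Sum the weights of the clues whose text matched the query.

    Clues that have no assigned weight contribute nothing.
    """
    return sum(weights.get(clue, 0.0) for clue in matched_clues)


# An image whose page title and filename both matched the query.
score = relevance_score({"page_title", "image_filename"})
```

A linear combination is the simplest choice; any function that maps a set of matched clues to a number would fit the same interface.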

12
Image Search Study Process
  • Downloading pages with matching text
      - Use an existing search engine to identify matches
      - These pages provide a corpus of images
      - We download pages so that our corpus remains static
      - The download acts as a snapshot
  • Clue extraction
      - Analyze each page in the corpus for all possible clues to image content
  • Human relevance ratings
      - A human rates whether each image is relevant to the query
  • Statistical analysis to find clue-based relevance functions

13
Process: Downloading Web Pages
[Diagram labels: queries (<query>), Downloading Software, Search Engine, URLs, Web Pages, images]
Web pages and images are saved to local disk.
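
Before a page's images can be saved, their URLs have to be found in the page and resolved against the page URL. The sketch below shows one way to do that with Python's standard library. The class and function names and the example.org URL are assumptions for illustration, not part of the downloading software described here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_urls(page_url, html):
    """Return absolute URLs for the images referenced by a page."""
    collector = ImageSrcCollector()
    collector.feed(html)
    collector.close()
    return [urljoin(page_url, src) for src in collector.sources]


# Hypothetical page URL; the markup echoes the sample page from slide 8.
urls = image_urls(
    "http://example.org/news/index.html",
    'Here is a photo of George Bush. <br> <img src="g-bush.jpg">',
)
```

Resolving relative URLs at download time keeps each saved image tied to the host and path it actually came from, which matter later as separate clues.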
14
Design: Queries in XML
Multiple queries are stored in an XML file.
  • Engine: 1 = Altavista, 2 = Excite, 3 = Hotbot, 4 = Google
  • Method: 1 = or, 2 = and, 3 = expression
  • Search for George Bush using Alta Vista; results must have all the words
  • Search for Bill Clinton using Hotbot; search for the exact expression
<query> <engine> 1 </engine> <method> 2 </method>
  <word> George </word> <word> Bush </word> </query>
<query> <engine> 3 </engine> <method> 3 </method>
  <word> Bill </word> <word> Clinton </word> </query>
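
A query file in this format could be read with Python's standard XML parser, as sketched below. The slide shows only the <query> elements, so the enclosing <queries> root element and the function name are assumptions made here to get a well-formed, parseable file.

```python
import xml.etree.ElementTree as ET

# Code tables as listed on the slide.
ENGINES = {1: "Altavista", 2: "Excite", 3: "Hotbot", 4: "Google"}
METHODS = {1: "or", 2: "and", 3: "expression"}

# The <queries> root element is an assumption; the slide does not show one.
QUERY_XML = """
<queries>
  <query> <engine> 1 </engine> <method> 2 </method>
    <word> George </word> <word> Bush </word> </query>
  <query> <engine> 3 </engine> <method> 3 </method>
    <word> Bill </word> <word> Clinton </word> </query>
</queries>
"""


def load_queries(xml_text):
    """Decode each <query> into engine name, method name, and word list."""
    root = ET.fromstring(xml_text.strip())
    queries = []
    for query in root.findall("query"):
        queries.append({
            "engine": ENGINES[int(query.findtext("engine").strip())],
            "method": METHODS[int(query.findtext("method").strip())],
            "words": [word.text.strip() for word in query.findall("word")],
        })
    return queries


queries = load_queries(QUERY_XML)
```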
15
Process: Clue Extraction
[Diagram labels: queries (<query>), Clues Extraction Software, clues]


16
Data to be analyzed
  • For each page
      - Query used to find page
      - Source URL
  • For each image
      - Source URL
      - Attributes
      - Position in document
  • For each clue
      - Whether clue feature occurs in document at all
      - If feature occurs with text matching the query
      - Position in document for each occurrence
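
The data listed above could be held in records like the following. This is a sketch under assumptions: the class and field names are invented for illustration, and attaching clue occurrences to each image (rather than storing them separately) is one possible layout, not necessarily the one the project uses.

```python
from dataclasses import dataclass, field


@dataclass
class ClueOccurrence:
    clue: str            # hypothetical clue name, e.g. "image_filename"
    matches_query: bool  # does the clue text match the query?
    position: int        # position of this occurrence in the document


@dataclass
class ImageRecord:
    source_url: str
    attributes: dict
    position: int        # position of the image in the document
    clues: list = field(default_factory=list)


@dataclass
class PageRecord:
    query: str           # query used to find the page
    source_url: str
    images: list = field(default_factory=list)


def clue_occurs(image, clue):
    """Whether the clue feature occurs at all for this image."""
    return any(o.clue == clue for o in image.clues)


def clue_matches_query(image, clue):
    """Whether the clue occurs with text matching the query."""
    return any(o.clue == clue and o.matches_query for o in image.clues)


page = PageRecord(query="George Bush", source_url="http://example.org/index.html")
image = ImageRecord("http://example.org/g-bush.jpg", {"alt": ""}, position=12)
image.clues.append(ClueOccurrence("image_filename", True, 12))
page.images.append(image)
```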

17
Process: Relevance Rating
[Diagram labels: queries (<query>), Query image, Relevance Rating Software, Human, Relevant/not]
The relevance rating software presents the images for each query from the database to a human rater, and records the human relevance rating back to the database.
18
Clues: global
Global clues are clues that apply to every image on the page:
  • filename of page
  • path of page
  • host of page
  • title element of the web page
  • keywords found in meta element
  • description found in meta element
Why do we break the URL into three clues? Different parts of the URL contribute differently to the overall relevance of the images on that page.
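
Splitting a URL into host, path, and filename can be sketched with Python's standard library as follows. The function name and the returned dictionary keys are assumptions; the example URL is the image URL from the earlier HTML source slide.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def url_clues(url):
    """Break a URL into the three clues: host, path, and filename."""
    parts = urlsplit(url)
    # URL paths always use forward slashes, so posixpath is the right tool.
    path, filename = posixpath.split(parts.path)
    return {
        "host": parts.hostname or "",
        "path": path,
        # Decode escapes such as %20 so the filename reads as plain words.
        "filename": unquote(filename),
    }


clues = url_clues(
    "http://www.gopbi.com/community/groups/bigband/images/bill%20cosby.jpg"
)
```

Keeping the three parts separate lets each one carry its own weight when the clues are later combined.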
19
Clues: global
<HEAD>
<TITLE>Apple</TITLE>
<META NAME="keywords" CONTENT="Apple Computer, Power Macintosh, PowerBook, AppleWorks, WebObjects, iMovie, QuickTime, Desktop Movies, Software, Operating Systems, Mac OS, iMac, iBook">
<META NAME="Description" CONTENT="Visit www.apple.com for the latest news, the hottest products, and technical support resources from Apple Computer, Inc.">
<META HTTP-EQUIV="Expires" CONTENT="Fri, 26 Mar 1999 23:59:59 GMT">
<META NAME="Date.Modified" CONTENT="19992109">
</HEAD>
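
The title, keywords, and description clues can be pulled out of a head section like this one with Python's built-in HTML parser. The sketch below assumes a hypothetical HeadClues class, and its sample markup is a shortened version of the Apple example.

```python
from html.parser import HTMLParser


class HeadClues(HTMLParser):
    """Collect the page title and the keywords/description meta content."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            # Meta names vary in case ("Description"), so normalize them.
            name = (attributes.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.meta[name] = attributes.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Shortened sample; the CONTENT values are abbreviated for the example.
SAMPLE_HEAD = """<HEAD> <TITLE>Apple</TITLE>
<META NAME="keywords" CONTENT="Apple Computer, Power Macintosh">
<META NAME="Description" CONTENT="Visit www.apple.com for the latest news">
<META HTTP-EQUIV="Expires" CONTENT="Fri, 26 Mar 1999 23:59:59 GMT">
</HEAD>"""

head = HeadClues()
head.feed(SAMPLE_HEAD)
head.close()
```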
20
Clues: image file
Image file properties (external properties):
  • filename
  • path
  • host
An image can come from a different host than the page and can have a different path.
21
Clues: common attributes
Elements have common attributes:
  • title: describes what the element is
  • id: used in identifying the element
  • name: same as id, in older HTML
Clues that use these attributes: link, image, object, table, cell, row
22
Clues: image container
  • Link to an image (<a>): text enclosed within the link element
  • Embedded image element (<img>): alt attribute (usually describes what the image is)
  • Object element (<object>): text enclosed within the object element
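
Two of these container clues, the alt attribute and the text of a link that points to an image, can be collected in one pass over the page. The sketch below is an assumption-laden illustration: the class name, the clue labels, and the list of image file extensions are all invented here, and the <object> case is left out to keep it short.

```python
from html.parser import HTMLParser

# Assumed set of extensions that mark a link target as an image.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".gif", ".png")


class ContainerClues(HTMLParser):
    """Collect (image URL, clue name, clue text) triples."""

    def __init__(self):
        super().__init__()
        self.clues = []
        self._link_href = None
        self._link_text = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "img" and attributes.get("src"):
            self.clues.append((attributes["src"], "alt", attributes.get("alt") or ""))
        elif tag == "a":
            href = attributes.get("href") or ""
            if href.lower().endswith(IMAGE_EXTENSIONS):
                self._link_href = href
                self._link_text = []

    def handle_data(self, data):
        # Only text inside a link to an image is kept.
        if self._link_href is not None:
            self._link_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._link_href is not None:
            text = " ".join("".join(self._link_text).split())
            self.clues.append((self._link_href, "link_text", text))
            self._link_href = None


parser = ContainerClues()
parser.feed('<a href="photos/bush.jpg">Photo of George Bush</a> '
            '<img src="g-bush.jpg" alt="George Bush">')
parser.close()
```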
23
Clues: table
Table (<table>)
  • summary attribute: describes the table content
  • caption: describes table content
  • row heading
  • row
  • column heading
  • column
  • cell
  • neighboring cells (above, below, right, left)
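
Once a table has been read into a grid of cell texts, looking up the neighboring cells of the cell that holds an image is straightforward. The sketch below assumes a hypothetical grid representation (a list of rows, each a list of strings) and ignores rowspan and colspan. The sample grid is invented, with "[image]" standing in for a cell that contains an image.

```python
def neighboring_cells(grid, row, col):
    """Return the text of the cells above, below, left, and right.

    Neighbors that fall outside the table are simply omitted.
    """
    offsets = {
        "above": (-1, 0),
        "below": (1, 0),
        "left": (0, -1),
        "right": (0, 1),
    }
    neighbors = {}
    for name, (d_row, d_col) in offsets.items():
        r, c = row + d_row, col + d_col
        if 0 <= r < len(grid) and 0 <= c < len(grid[r]):
            neighbors[name] = grid[r][c]
    return neighbors


# Hypothetical layout: images in the first row, captions in the second.
grid = [
    ["[image]", "[image]"],
    ["Tom Brokaw", "Pres. George Bush"],
]
around = neighboring_cells(grid, 0, 1)
```

In a layout like this, the cell below the image is the one carrying the caption, which is why each direction is kept as a separate clue.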
24
Clues: table
25
Clues: table
26
Clues: table
27
Clues: table
28
Clues: table
29
Clues: table
30
Clues: table
31
Clues: headings
Heading elements (h1, h2, ... h6)
  • headings above image
  • headings below image
<h1>header above image</h1> <img src=sample.jpg> <h2>header below image</h2>
A heading can indicate a topic, and images below the heading may relate to it. Some pages use headings as captions above images, and sometimes below them. Some headings are used only for their appearance, where a font or bold markup should have been used.
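
Finding the nearest heading above and below each image can be sketched as a single pass over the page. The class and field names below are assumptions, and "nearest" here means nearest in document order, which is only one possible reading of the clue.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingClues(HTMLParser):
    """Record, for each image, the closest heading before and after it."""

    def __init__(self):
        super().__init__()
        self.images = []
        self._last_heading = None
        self._waiting = []      # images that have no heading below yet
        self._in_heading = False
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._in_heading = True
            self._text = []
        elif tag == "img":
            record = {
                "src": dict(attrs).get("src"),
                "heading_above": self._last_heading,
                "heading_below": None,
            }
            self.images.append(record)
            self._waiting.append(record)

    def handle_data(self, data):
        if self._in_heading:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._in_heading:
            self._in_heading = False
            heading = " ".join("".join(self._text).split())
            self._last_heading = heading
            # This heading is the first one after every waiting image.
            for record in self._waiting:
                record["heading_below"] = heading
            self._waiting = []


parser = HeadingClues()
parser.feed('<h1>header above image</h1> <img src="sample.jpg"> '
            '<h2>header below image</h2>')
parser.close()
```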
32
In this photo, the heading comes after the image.
A heading used as a topic usually comes before the image,
but some images have a heading as a caption below them.
33
Clues: text
Emphasized text elements
  • bold
  • italic
  • underline
  • strong
  • emphasis
  • big
Body text
  • text that surrounds the image
  • distance
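
One possible way to measure the distance between body text and an image is to flatten the page into a sequence of words and image markers and count positions between them. The sketch below is an assumption, not the project's definition of distance. The names are invented, and markup between the word and the image is ignored.

```python
from html.parser import HTMLParser


class WordStream(HTMLParser):
    """Flatten a page into a sequence of words and image markers."""

    def __init__(self):
        super().__init__()
        self.tokens = []  # ("word", text) or ("img", src)

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.tokens.append(("img", dict(attrs).get("src")))

    def handle_data(self, data):
        for raw in data.split():
            # Drop surrounding punctuation so "Bush." matches "bush".
            word = raw.strip(".,;:!?\"'()").lower()
            if word:
                self.tokens.append(("word", word))


def word_distance(html, src, word):
    """Fewest token positions between the image and the word, or None."""
    stream = WordStream()
    stream.feed(html)
    stream.close()
    image_at = [i for i, t in enumerate(stream.tokens) if t == ("img", src)]
    word_at = [i for i, t in enumerate(stream.tokens) if t == ("word", word.lower())]
    if not image_at or not word_at:
        return None
    return min(abs(i - j) for i in image_at for j in word_at)


page = 'Here is a photo of George Bush. <br> <img src="g-bush.jpg">'
distance = word_distance(page, "g-bush.jpg", "Bush")
```

Counting words is the crudest option; a distance based on document structure or on rendered layout would behave differently on table-heavy pages.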
34
Current Project Status

  • Prototype download and clue extraction software
    nearly complete
      - now testing implementation
      - data (without human relevance ratings) in early
        November
  • Recruiting students to build on-line relevance
    rating system
      - hope to get students outside lab to help with
        ratings via Web interface

35
Challenges for image search systems

  • computing word distance from image
  • stylesheets used for presentation
  • table patterns
  • patterns of HTML element usage
  • CGI-returned images
  • structural boundaries
  • patterns in Web page design
  • HTML generators

36
  • Cheng Thao, chengt_at_uwm.edu