Building a Web Thesaurus from Web Link Structure (Zheng Chen, Shengping Liu, Liu Wenyin, Geguang Pu, Wei-Ying Ma)

1
Building a Web Thesaurus from Web Link Structure
Zheng Chen, Shengping Liu, Liu Wenyin, Geguang Pu, Wei-Ying Ma (MSRA)
SIGIR 2003 Paper Presentation

Shui-Lung Chang
May 29, 2003

2
Motivation

To build domain-specific thesauri for search
The toughest Web-search problems:
  Word mismatch (indexers vs. users)
  Short queries
Query expansion
  (How to select the expansion terms?)
  Global analysis method
  Local analysis method

Co-occurrence analysis of terms in a corpus
  Does not work well for the Web
3
Traditional Automatic Thesaurus

The term association is estimated by counting the number of times two terms co-occur within a window of w words in a corpus
Reference:
K. W. Church and P. Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, Vol. 16, No. 1, 1990.
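
The windowed counting described on this slide can be sketched roughly as follows. This is a minimal illustration, not the implementation from the cited work; the function name `cooccurrence_counts`, the whitespace tokenization, and the choice to treat pairs as unordered are all assumptions.

```python
from collections import Counter


def cooccurrence_counts(tokens, w):
    """Count unordered term pairs that fall inside the same window of w words."""
    counts = Counter()
    for i, left in enumerate(tokens):
        # Pair tokens[i] with the next w - 1 tokens, so each pair spans at most w words.
        for right in tokens[i + 1 : i + w]:
            if left != right:
                counts[tuple(sorted((left, right)))] += 1
    return counts


# Toy corpus (assumption): a single tokenized sentence.
tokens = "web thesaurus from web link structure".split()
counts = cooccurrence_counts(tokens, w=3)
```

Here `counts[("thesaurus", "web")]` is 2, because the two words share a window twice, while words that never fall within three positions of each other have a count of 0.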

4
Motivation: Web Link Structure

Hyperlink: the major difference between a web page and pure text
Topic locality: pages connected together are more likely to be on the same topic than unconnected pages
Anchor description: anchor texts can effectively describe their target pages
Motivation of using Web link structure

Link structure: Web pages as nodes, hyperlinks as edges
Semantic network: anchor texts as nodes, semantic relations as edges
5
Overview of the Approach

Select a set of high-quality websites for a given domain
Apply link analysis techniques to construct the website content structure
  Remove navigational and meaningless links
  Discover the semantic relationships between web pages (hierarchical or horizontal)
  Summarize each web page as a concept category (web-page concept naming)
Apply a statistical method to construct the thesaurus
  Calculate the mutual information of the words/phrases within the content structures

6
Getting High Quality Websites

Obtain authoritative websites through a search engine with a successful website-ranking mechanism
Google directory search (http://directory.google.com/)

7
Website Content Structure

Two general semantic relationships between concepts:
  Aggregation: the parent concept is semantically broader than the child concept (hierarchical relationship)
  Association: concepts are semantically related to each other (horizontal relationship)
Functions of hyperlinks:
  Assist navigation
  Bring semantically related Web pages together
Semantic links:
  Explicit semantic link: the link is represented by a hyperlink
  Implicit semantic link: the link is inferred from explicit semantic links

8
Website Content Structure (cont.)

A hyperlink in the navigation structure is called a semantic link if the two connected pages have an explicit semantic relationship

http://eshop.msn.com
9
Website Content Structure (cont.)

A website content structure is defined as a directed graph G(V, E), where V is the set of nodes and E is the set of edges
A node is a 4-tuple (ID, Type, Concept, Description)
  Type: index page or content page
  Concept: a keyword or phrase that represents a web page
  Description: other attributes, e.g., <page title, ...>, <URL, ...>
An edge is a 4-tuple (Source, Target, Type, Description)
  Source, Target: the source and target nodes
  Type: aggregation or association
  Description: other attributes, e.g., anchor text
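
One way to hold the node and edge 4-tuples from this slide in code is sketched below. The class names, the dictionary-valued description field, the `children` helper, and the example URL are assumptions made for illustration; the slides only specify the tuple fields.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    id: int
    type: str      # "index" or "content"
    concept: str   # keyword or phrase representing the page
    description: Dict[str, str] = field(default_factory=dict)


@dataclass
class Edge:
    source: int
    target: int
    type: str      # "aggregation" or "association"
    description: Dict[str, str] = field(default_factory=dict)


@dataclass
class ContentStructure:
    nodes: Dict[int, Node] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes[node.id] = node

    def add_edge(self, edge: Edge) -> None:
        # Both endpoints must already exist in V.
        if edge.source not in self.nodes or edge.target not in self.nodes:
            raise KeyError("edge refers to an unknown node")
        self.edges.append(edge)

    def children(self, node_id: int) -> List[int]:
        """Targets reached from node_id through aggregation edges."""
        return [e.target for e in self.edges
                if e.source == node_id and e.type == "aggregation"]


graph = ContentStructure()
graph.add_node(Node(1, "index", "clothing", {"url": "http://example.com/clothing/"}))
graph.add_node(Node(2, "content", "children's clothes"))
graph.add_edge(Edge(1, 2, "aggregation", {"anchor_text": "children's clothes"}))
```

Keeping the relationship type on the edge means the aggregation hierarchy and the association links can be read from the same graph.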

10
Overview of the Approach

Select a set of high-quality websites for a given domain
Apply link analysis techniques to construct the website content structure
  Remove navigational and meaningless links
  Discover the semantic relationships between web pages (hierarchical or horizontal)
  Summarize each web page as a concept category (web-page concept naming)
Apply a statistical method to construct the thesaurus
  Calculate the mutual information of the words/phrases within the content structures

11
Detecting Navigational Links

Use the information encoded in the URL, i.e., the server-side local directory structure (e.g., http://google/aaa/bbb.html), to classify a link:
  Upward: to a parent directory
  Downward: to a subdirectory
  Forward: to a sub-subdirectory
  Sibling: within the same directory
  Crosswise: any case other than the above
Apply rules to detect navigational links (92.82%)
  An upward link functions as a return to the previous page
  A link within a high-level navigation bar
  A link within a navigation list
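
The five directory-based link types can be approximated from the URLs alone. The sketch below is one possible reading, not the authors' code: the function names are hypothetical, links to another host are treated as crosswise, and "upward" is taken to mean any ancestor directory rather than only the immediate parent.

```python
import posixpath
from urllib.parse import urlsplit


def _dir_parts(url):
    """Return (host, list of directory names) for a URL."""
    parts = urlsplit(url)
    directory = posixpath.dirname(parts.path)
    return parts.netloc, [p for p in directory.split("/") if p]


def classify_link(source_url, target_url):
    src_host, src = _dir_parts(source_url)
    dst_host, dst = _dir_parts(target_url)
    if src_host != dst_host:
        return "crosswise"          # assumption: other hosts count as crosswise
    if dst == src:
        return "sibling"            # same directory
    if len(dst) < len(src) and src[:len(dst)] == dst:
        return "upward"             # target directory is an ancestor of the source
    if len(dst) > len(src) and dst[:len(src)] == src:
        # One level down is "downward"; deeper is "forward".
        return "downward" if len(dst) == len(src) + 1 else "forward"
    return "crosswise"


page = "http://google/aaa/bbb.html"
kinds = [
    classify_link(page, "http://google/index.html"),
    classify_link(page, "http://google/aaa/ccc.html"),
    classify_link(page, "http://google/aaa/x/c.html"),
    classify_link(page, "http://google/aaa/x/y/c.html"),
    classify_link(page, "http://google/zzz/c.html"),
]
```

For the example page, `kinds` comes out as upward, sibling, downward, forward, crosswise, in that order.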

12
Tagging Objects in Web Pages

Identify the functions/categories of the objects on a page using the Function-based Object Model (FOM)
  Index page or content page?
  Navigation bar or list?
Reference:
J. L. Chen, et al. Function-based Object Model Towards Website Adaptation. In Proc. of the 10th World Wide Web Conference, pp. 587-596, May 2001.

13
FOM Example
14
FOM Example
15
FOM: Navigation Bar/List?
16
Discovering the Semantic Relationship

Apply the following rules:
  A link in a content page conveys an association relationship
  A link in an index page usually conveys an aggregation relationship (further revised by the following rules)
  A link conveys an aggregation relationship if it is in a navigation bar that belongs to an index page
  If two web pages have aggregation relationships in both directions, the relationship is changed to association
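
A simplified sketch of these rules follows. It assumes each link is a (source, target) pair and that page types are already known; the function name and the page names are hypothetical. The navigation-bar rule is not modeled separately here, since in this simplified reading a navigation-bar link on an index page already falls under the index-page rule.

```python
def classify_relationships(links, page_type):
    """Label each (source, target) link as aggregation or association."""
    relation = {}
    for source, target in links:
        if page_type[source] == "index":
            relation[(source, target)] = "aggregation"
        else:
            relation[(source, target)] = "association"
    # Aggregation in both directions is downgraded to association.
    for (source, target), kind in list(relation.items()):
        if kind == "aggregation" and relation.get((target, source)) == "aggregation":
            relation[(source, target)] = "association"
            relation[(target, source)] = "association"
    return relation


page_type = {"home": "index", "shoes": "index", "boot": "content"}
links = [("home", "shoes"), ("shoes", "home"), ("shoes", "boot"), ("boot", "shoes")]
relation = classify_relationships(links, page_type)
```

In this toy site, home and shoes link to each other from index pages, so both directions end up as association, while shoes to boot stays an aggregation.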

17
FOM: Index or Content Page?

By statistical analysis of out-degree (OD) and in-degree (ID):
  A page with relatively large OD or ID may be an index page
  A page with relatively small OD and ID may be a content page
Rules:
  If OD > OD0 or ID > ID0, the page is an index page
  If OD < OD0 and ID < ID0, the page is a content page
Ways to get the constants OD0 and ID0:
  The OD(i)-i (ID(i)-i) curve (i is the position of a page in sorted order)

Beeline
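
The threshold rules can be sketched as below. The thresholds are passed in as plain numbers, since the slides derive OD0 and ID0 from a curve that is not reproduced here; the function names and the example values are assumptions. The slide rules leave the boundary case (a degree exactly equal to its threshold) undefined, and this sketch treats it as a content page.

```python
from collections import Counter


def degrees(edges):
    """Return (out_degree, in_degree) counters for a list of (source, target) edges."""
    out_degree = Counter(source for source, _ in edges)
    in_degree = Counter(target for _, target in edges)
    return out_degree, in_degree


def classify_page(od, id_, od0, id0):
    # Index page if either degree exceeds its threshold; content page otherwise.
    if od > od0 or id_ > id0:
        return "index"
    return "content"


edges = [("home", "a"), ("home", "b"), ("home", "c"), ("a", "home"), ("b", "home")]
out_degree, in_degree = degrees(edges)
page_types = {
    page: classify_page(out_degree[page], in_degree[page], od0=2, id0=2)
    for page in ["home", "a", "c"]
}
```

With thresholds of 2, only the home page (out-degree 3) is labeled an index page.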
18
Notating a Web Page

Select the anchor text with the most discriminative power, measured by the TF-IDF weighting scheme

Anchor texts
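
A rough sketch of TF-IDF-based concept naming follows. The slides do not give the exact weighting, so this assumes TF is how often an anchor text points to a page and IDF is the log of the number of pages divided by the number of pages that anchor text points to. The function name, paths, and anchor texts are made up for illustration.

```python
import math
from collections import Counter


def name_pages(anchors_by_page):
    """Pick, for each page, the incoming anchor text with the highest TF-IDF."""
    n_pages = len(anchors_by_page)
    # Document frequency: number of pages each anchor text points to.
    df = Counter()
    for anchors in anchors_by_page.values():
        df.update(set(anchors))
    names = {}
    for page, anchors in anchors_by_page.items():
        if not anchors:
            continue  # pages without incoming anchor text stay unnamed
        tf = Counter(anchors)
        names[page] = max(tf, key=lambda a: tf[a] * math.log(n_pages / df[a]))
    return names


anchors_by_page = {
    "/kids": ["children's clothes", "children's clothes", "click here"],
    "/toys": ["toys", "click here"],
    "/books": ["books", "click here", "click here"],
}
names = name_pages(anchors_by_page)
```

The generic anchor "click here" points to every page, so its IDF is zero and it is never chosen, even for the page where it is the most frequent anchor.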
19
Generate the Thesaurus

A term segmentation tool (NLPWin, COLING 2000) is applied because anchor text comes in many forms, e.g., words, phrases, and short sentences.
Three relationships to extract:
  Ancestor: $ST_i(\text{ancestor}) = (n_i, \text{parents}_1(n_i), \ldots, \text{parents}_d(n_i))$
  Offspring: $ST_i(\text{offspring}) = (n_i, \text{sons}_1(n_i), \ldots, \text{sons}_d(n_i))$
  Sibling: $ST_i(\text{sibling}) = (n_i, \text{sibs}_1(n_i), \ldots, \text{sibs}_d(n_i))$

A node $n_i = \{w_{i1}, w_{i2}, \ldots, w_{im}\}$, where $w_{ij}$ is a term
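
The ancestor and offspring sub-trees can be collected level by level, as in the sketch below. It assumes that the k-th entry of a sub-tree is the set of nodes k aggregation steps away from the node; the slides do not spell this out. The sibling case is left out because its level structure is not clear from the slides. The function names and the example hierarchy are hypothetical.

```python
def subtree(neighbors, root, depth):
    """Return [[root], level 1 nodes, ..., level d nodes] following neighbors."""
    levels = [[root]]
    frontier = [root]
    for _ in range(depth):
        frontier = [n for node in frontier for n in neighbors.get(node, [])]
        if not frontier:
            break
        levels.append(frontier)
    return levels


def invert(children):
    """Build a child -> parents map from a parent -> children map."""
    parents = {}
    for parent, kids in children.items():
        for kid in kids:
            parents.setdefault(kid, []).append(parent)
    return parents


children = {
    "shopping": ["clothing", "toys"],
    "clothing": ["children's clothes", "women's shoes"],
}
offspring = subtree(children, "shopping", depth=2)
ancestor = subtree(invert(children), "children's clothes", depth=2)
```

The depth bound also keeps the walk finite if the link graph contains cycles.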
20
Generate the Thesaurus (cont.)

For each generated sub-tree (e.g., $ST_i(\text{offspring})$), the mutual information of a term pair is computed from the number of times the two terms appear together in the sub-tree
[Figure: example sub-trees of $n_i$ with $\text{sons}_1(n_i)$, $\text{sons}_2(n_i)$ and with $\text{parents}_1(n_i)$, $\text{parents}_2(n_i)$]
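
The exact formula on this slide did not survive in the transcript. As a stand-in, the sketch below uses the standard pointwise mutual information in the style of Church and Hanks, with each sub-tree treated as one co-occurrence window. This is an assumption, not necessarily the paper's definition, and the function name and toy sub-trees are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(subtrees):
    """Pointwise mutual information of term pairs, one sub-tree per window."""
    n = len(subtrees)
    term_count = Counter()
    pair_count = Counter()
    for terms in subtrees:
        unique = sorted(set(terms))
        term_count.update(unique)
        pair_count.update(combinations(unique, 2))
    return {
        (a, b): math.log2((c / n) / ((term_count[a] / n) * (term_count[b] / n)))
        for (a, b), c in pair_count.items()
    }


# Each inner list holds the terms of one sub-tree (toy data).
subtrees = [
    ["clothes", "baby", "boy"],
    ["clothes", "baby", "girl"],
    ["toy", "boy"],
    ["camera", "lens"],
]
scores = pmi_scores(subtrees)
```

Pairs that appear together more often than their individual frequencies would predict get a positive score; pairs that co-occur at chance level score zero.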
21
Generate the Thesaurus (cont.)

The entropy is used to realize the heuristic: the more sub-trees contain the term pair, the more similar the two terms are
The similarity of two terms
Those term pairs with values above a predefined threshold are selected as similar-term candidates.
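
The entropy and similarity formulas are missing from the transcript, so the sketch below only illustrates the stated heuristic under an assumption: the entropy is taken over how a pair's co-occurrence counts are spread across sub-trees, so a pair found in many sub-trees scores higher than one concentrated in a single sub-tree. The function names, the similarity values, and the threshold are hypothetical.

```python
import math


def pair_entropy(counts_per_subtree):
    """Entropy (in bits) of a term pair's count distribution over sub-trees."""
    total = sum(counts_per_subtree)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts_per_subtree if c > 0]
    return -sum(p * math.log2(p) for p in probs)


def select_candidates(similarity, threshold):
    """Keep term pairs whose similarity value exceeds the threshold."""
    return sorted(pair for pair, value in similarity.items() if value > threshold)


concentrated = pair_entropy([4, 0, 0, 0])   # all co-occurrences in one sub-tree
spread = pair_entropy([1, 1, 1, 1])         # spread evenly over four sub-trees

# Made-up similarity values, only to show the threshold step.
similarity = {("baby", "clothes"): 0.8, ("book", "clothes"): 0.2}
candidates = select_candidates(similarity, threshold=0.5)
```

The evenly spread pair has an entropy of 2 bits and the concentrated pair has 0, which matches the heuristic that wider support means greater similarity.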

22
Experiment

Three test domains (queries):
  online shopping, photography, PDA
The top 13 websites are selected and crawled
25 web pages are manually analyzed by 4 users
  Are navigational links correctly recognized?
  Are the nodes in the Web content structure correct?
15 terms and their associated terms are manually evaluated
Application of the generated thesaurus to query expansion

23
Evaluation: Navigational Links
24
Evaluation: Correct Nodes
25
Evaluation: Associated Terms

15 terms from the obtained thesaurus were chosen

26
Experiment on Query Expansion

Use the Okapi system for full-text search
10 queries for each domain
  e.g., shopping domain: women shoes, mother day gift, children's clothes, antivirus software, listening jazz, wedding dress, palm, movie about love, Cannon camera, cartoon products
The top 30 ranked documents are judged by 4 users
Methods for comparison:
  Baseline: no query expansion
  Full-text thesaurus
  Their Web Thesaurus (sibling)
  Their Web Thesaurus (offspring)
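
Thesaurus-based query expansion can be sketched as a simple lookup that appends related terms to the query. The function name, the `max_terms` cutoff, and the dictionary layout are assumptions; the example entry reuses the offspring terms that the slides list for 'children's clothes'. How the expanded query is fed to the Okapi system is not shown.

```python
def expand_query(query, thesaurus, max_terms=3):
    """Append up to max_terms related terms from the thesaurus to the query terms."""
    terms = query.split()
    for related in thesaurus.get(query, [])[:max_terms]:
        if related not in terms:
            terms.append(related)
    return terms


offspring_thesaurus = {
    "children's clothes": ["baby", "boy", "girl", "cardigan", "shirt", "sweater"],
}
expanded = expand_query("children's clothes", offspring_thesaurus)
unchanged = expand_query("wedding dress", offspring_thesaurus)
```

A query with no thesaurus entry is returned unchanged, which corresponds to the baseline with no expansion.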

27
Experiment Result: Query Expansion
28
Discussions

The baseline retrieval precision is still above 45%
The naïve automatic full-text thesaurus decreases performance
  'children's clothes': book, clothing, toy, accessory, fashion, vintage
Query expansion based on the sibling relationship performs poorly
  'children's clothes': book, toy, video, women, accessories, design
Query expansion based on the offspring relationship improves performance
  'children's clothes': baby, boy, girl, cardigan, shirt, sweater

29
The End

Anchor texts and link structure open up many possibilities for Web applications
Future work: constructing a personalized thesaurus based on users' navigation history and accessed documents
Trick on query expansion:
  Expanding with more specific terms usually achieves better precision
