Title: Building a Web Thesaurus from Web Link Structure Zheng Chen, Shenging Liu, Liu Wenyin, Geguang Pu, W
1 Building a Web Thesaurus from Web Link Structure
Zheng Chen, Shenging Liu, Liu Wenyin, Geguang Pu, Wei-Ying Ma (MSRA)
SIGIR 2003
Paper Presentation
- Shui-Lung Chang
- May 29, 2003
2 Motivation
- To build domain-specific thesauri for the purpose of search
- The toughest Web-search problems
- Word mismatch (indexers vs. users)
- Short queries
- Query expansion
- (How to select the expansion terms?)
- Global analysis method
- Local analysis method
Co-occurrence analysis of terms in a corpus
Does not work well for the Web
3 Traditional Automatic Thesaurus
- Term association is estimated by counting the number of times two terms co-occur in a window of w words in a corpus
- Reference
- K. W. Church and P. Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, Vol. 16, No. 1, 1990.
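A minimal sketch of the windowed co-occurrence idea, assuming the association score is the log ratio of the joint probability to the product of the marginal probabilities, with all counts normalized by the corpus length. The function name `association_ratios` and the toy corpus are illustrative and do not come from the slides.

```python
import math
from collections import Counter

def association_ratios(tokens, w=5):
    """Score ordered word pairs (x before y) that co-occur within a window of w words.

    Returns {(x, y): log2(P(x, y) / (P(x) * P(y)))}, with counts normalized
    by the corpus length n (an assumption of this sketch).
    """
    n = len(tokens)
    unigram = Counter(tokens)
    pair = Counter()
    for i, x in enumerate(tokens):
        # y follows x by at most w - 1 positions, so both fit in a w-word window
        for y in tokens[i + 1 : i + w]:
            pair[(x, y)] += 1
    return {
        (x, y): math.log2((c / n) / ((unigram[x] / n) * (unigram[y] / n)))
        for (x, y), c in pair.items()
    }

scores = association_ratios(["a", "b", "a", "b"], w=2)
```

Pairs are kept ordered, so ("a", "b") and ("b", "a") get separate scores; a symmetric variant would merge the two counts.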
4 Motivation: Web Link Structure
- Hyperlink: the major difference between a web page and pure text
- Topic locality: pages connected together are more likely to be on the same topic than unconnected pages
- Anchor description: anchor texts can effectively describe their target pages
- Motivation for using Web link structure
Link structure: Web pages as nodes, hyperlinks as edges
Semantic network: anchor texts as nodes, semantic relations as edges
5 Overview of the Approach
- Select a set of high-quality websites for a given domain
- Apply link analysis techniques to construct the website content structure
- Remove navigational and meaningless links
- Discover the semantic relationship between web pages (hierarchical or horizontal relationship)
- Summarize a web page to a concept category (web-page concept naming)
- Apply a statistical method to construct the thesaurus
- Calculate the mutual information of the words/phrases within the content structures
6 Getting High-Quality Websites
- Obtain the authority websites through a search engine with a successful website ranking mechanism
- Google directory search (http://directory.google.com/)
7 Website Content Structure
- Two general semantic relationships for concepts
- Aggregation: the parent concept is semantically broader than the child concept (hierarchical relationship)
- Association: concepts are semantically related to each other (horizontal relationship)
- Functions of hyperlinks
- Assist navigation
- Bring semantically related Web pages together
- Semantic links
- Explicit semantic link: the link is represented by a hyperlink
- Implicit semantic link: the link is inferred from explicit semantic links
8 Website Content Structure (cont.)
- A hyperlink in the navigation structure is called a semantic link if the two connected pages have an explicit semantic relationship
http://eshop.msn.com
9 Website Content Structure (cont.)
- A website content structure is defined as a directed graph G(V, E), where V is the set of nodes and E is the set of edges
- A node is a 4-tuple (ID, Type, Concept, Description)
- Type: index page or content page
- Concept: a keyword or phrase that represents a web page
- Description: other information, e.g., <page title, ...>, <URL, ...>
- An edge is a 4-tuple (Source, Target, Type, Description)
- Source, Target: the source and target nodes
- Type: aggregation or association
- Description: other information, e.g., anchor text
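One way to hold the two 4-tuples and the directed graph in code, as a minimal sketch. The class names, the string values for `type`, and the helper `children` are assumptions made for illustration; the slides only fix the tuple fields.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    id: int
    type: str              # "index" or "content"
    concept: str           # keyword or phrase naming the page
    description: str = ""  # e.g. page title or URL

@dataclass(frozen=True)
class Edge:
    source: int
    target: int
    type: str              # "aggregation" or "association"
    description: str = ""  # e.g. anchor text

@dataclass
class ContentStructure:
    nodes: dict = field(default_factory=dict)  # node id -> Node
    edges: list = field(default_factory=list)

    def add_node(self, node):
        self.nodes[node.id] = node

    def add_edge(self, edge):
        if edge.source not in self.nodes or edge.target not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.append(edge)

    def children(self, node_id):
        """Targets reached from node_id through aggregation edges."""
        return [e.target for e in self.edges
                if e.source == node_id and e.type == "aggregation"]

# Hypothetical three-page site
g = ContentStructure()
g.add_node(Node(1, "index", "shopping"))
g.add_node(Node(2, "index", "shoes"))
g.add_node(Node(3, "content", "gift ideas"))
g.add_edge(Edge(1, 2, "aggregation", "Shoes"))
g.add_edge(Edge(1, 3, "association", "Gift ideas"))
```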
10 Overview of the Approach
- Select a set of high-quality websites for a given domain
- Apply link analysis techniques to construct the website content structure
- Remove navigational and meaningless links
- Discover the semantic relationship between web pages (hierarchical or horizontal relationship)
- Summarize a web page to a concept category (web-page concept naming)
- Apply a statistical method to construct the thesaurus
- Calculate the mutual information of the words/phrases within the content structures
11 Detecting Navigational Links
- Use the information encoded in the URL, i.e., the server-side local directory structure (e.g., http://google/aaa/bbb.html), to classify a link
- Upward: to a parent directory
- Downward: to a subdirectory
- Forward: to a sub-subdirectory
- Sibling: within the same directory
- Crosswise: any case other than the above
- Apply rules to detect navigational links (92.82%)
- An upward link functions as a return to the previous page
- Link within a high-level navigation bar
- Link within a navigation list
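The URL-based link types can be sketched by comparing the directory parts of the source and target URLs. This is a sketch under stated assumptions: "upward" is read as any ancestor directory, "downward" as a direct subdirectory, "forward" as anything deeper, and links to a different host count as crosswise. The function name `classify_link` is hypothetical.

```python
import posixpath
from urllib.parse import urlparse

def _dir_parts(url):
    """Directory components of a URL path, e.g. /aaa/bbb.html -> ['aaa']."""
    path = urlparse(url).path
    directory = path if path.endswith("/") else posixpath.dirname(path)
    return [p for p in directory.split("/") if p]

def classify_link(source_url, target_url):
    # Assumption: a link that leaves the host is treated as crosswise.
    if urlparse(source_url).netloc != urlparse(target_url).netloc:
        return "crosswise"
    src, tgt = _dir_parts(source_url), _dir_parts(target_url)
    if tgt == src:
        return "sibling"
    if len(tgt) < len(src) and src[:len(tgt)] == tgt:
        return "upward"
    if len(tgt) > len(src) and tgt[:len(src)] == src:
        # One level deeper is a subdirectory; deeper still is a sub-subdirectory.
        return "downward" if len(tgt) == len(src) + 1 else "forward"
    return "crosswise"

kind = classify_link("http://google/aaa/bbb.html", "http://google/aaa/ccc.html")
```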
12 Tagging Objects in Web Pages
- Identify the functions/categories of the objects on a page using the Function-based Object Model (FOM)
- Index page or content page?
- Navigation bar and list?
- Reference
- J. L. Chen, et al. Function-based Object Model Towards Website Adaptation. In Proc. of the 10th World Wide Web Conference, pp. 587-596, May 2001.
13 FOM Example
14 FOM Example
15 FOM: Navigation Bar/List?
16 Discovering the Semantic Relationship
- Apply the following rules
- A link in a content page conveys an association relationship
- A link in an index page usually conveys an aggregation relationship (further revised by the following rules)
- A link conveys an aggregation relationship if it is in a navigation bar that belongs to an index page
- If two web pages have aggregation relationships in both directions, the relationship is changed to association
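The four rules can be sketched as a two-pass labeling. The input shape is an assumption: each link carries a flag saying whether it sits in a navigation bar that belongs to an index page, and `page_type` maps each page to "index" or "content". The page names are hypothetical.

```python
def label_links(links, page_type):
    """links: list of (source, target, in_index_nav_bar) tuples.
    page_type: {page: "index" | "content"}.
    Returns {(source, target): "aggregation" | "association"}.
    """
    labels = {}
    for source, target, in_index_nav_bar in links:
        # Rules 1-3: index pages and index-page navigation bars give
        # aggregation; remaining links on content pages give association.
        if in_index_nav_bar or page_type[source] == "index":
            labels[(source, target)] = "aggregation"
        else:
            labels[(source, target)] = "association"
    # Rule 4: aggregation in both directions is changed to association.
    # Collect the pairs first so the check sees the labels from the first pass.
    mutual = {(s, t) for (s, t), kind in labels.items()
              if kind == "aggregation" and labels.get((t, s)) == "aggregation"}
    for key in mutual:
        labels[key] = "association"
    return labels

page_type = {"home": "index", "shoes": "index", "nike": "content"}
links = [
    ("home", "shoes", False), ("shoes", "home", False),
    ("shoes", "nike", False), ("nike", "shoes", False),
    ("nike", "home", True),
]
labels = label_links(links, page_type)
```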
17 FOM: Index or Content Page?
- By statistical analysis (out-degree OD, in-degree ID)
- A page with a relatively large OD or ID may be an index page
- A page with a relatively small OD and ID may be a content page
- Rules
- If OD > OD0 or ID > ID0, the page is an index page
- If OD < OD0 and ID < ID0, the page is a content page
- Ways to get the constants OD0 and ID0
- The OD(i)-i (ID(i)-i) curve (i is the ordered number of a page)
Beeline
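A minimal sketch of the degree rules, taking the thresholds OD0 and ID0 as given inputs rather than deriving them from the curve. The rules leave the case of a degree exactly equal to its threshold open; this sketch treats it as a content page, which is an assumption. The function name `classify_pages` is hypothetical.

```python
from collections import Counter

def classify_pages(links, od0, id0):
    """links: list of (source, target) pairs. Returns {page: "index" | "content"}."""
    out_degree = Counter(source for source, _ in links)
    in_degree = Counter(target for _, target in links)
    pages = set(out_degree) | set(in_degree)
    return {
        page: "index" if out_degree[page] > od0 or in_degree[page] > id0 else "content"
        for page in pages
    }

types = classify_pages(
    [("home", "a"), ("home", "b"), ("home", "c"), ("a", "b")], od0=2, id0=2
)
```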
18 Notating a Web Page
- Select the anchor text with the most discriminative power, measured by the TF-IDF weighting scheme
Anchor texts
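A sketch of anchor-text selection under one possible reading of TF-IDF for this setting: TF is how often an anchor text points at the page, and IDF is the log of the number of pages divided by the number of distinct pages that the anchor text points at. The slide does not spell out the exact weighting, so both definitions are assumptions, as are the function name and the sample anchors.

```python
import math
from collections import Counter

def name_pages(anchors):
    """anchors: list of (anchor_text, target_page). Returns {page: chosen anchor text}."""
    pages = {target for _, target in anchors}
    tf = Counter(anchors)                           # (text, page) -> occurrences
    df = Counter(text for text, _ in set(anchors))  # text -> distinct pages pointed at
    names = {}
    for page in pages:
        candidates = [text for (text, target) in tf if target == page]
        # Highest TF-IDF wins; ties are broken alphabetically for determinism.
        names[page] = max(
            candidates,
            key=lambda text: (tf[(text, page)] * math.log(len(pages) / df[text]), text),
        )
    return names

names = name_pages([
    ("home", "/shoes"), ("home", "/bags"), ("home", "/hats"),
    ("shoes", "/shoes"), ("shoes", "/shoes"),
    ("bags", "/bags"), ("hats", "/hats"),
])
```

An anchor text such as "home" that points at every page gets an IDF of zero, so it is never preferred over a text that is specific to one page.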
19 Generate the Thesaurus
- A term segmentation tool (NLPWin, COLING 2000) is applied because anchor text comes in many forms, e.g., words, phrases, and short sentences
- Three relationships to extract
- Ancestor: $ST_i(\text{ancestor}) = (n_i, \text{parents}_1(n_i), \ldots, \text{parents}_d(n_i))$
- Offspring: $ST_i(\text{offspring}) = (n_i, \text{sons}_1(n_i), \ldots, \text{sons}_d(n_i))$
- Sibling: $ST_i(\text{sibling}) = (n_i, \text{sibs}_1(n_i), \ldots, \text{sibs}_d(n_i))$
A node $n_i = \{w_{i1}, w_{i2}, \ldots, w_{im}\}$, where $w_{ij}$ is a term
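A sketch of sub-tree extraction over a content structure stored as a parent-to-children mapping. Levels 1 to d are collected breadth-first. For the sibling case the sketch returns only nodes sharing a parent with the given node, since the slide does not define deeper sibling levels; that restriction, the function names, and the sample hierarchy are all assumptions.

```python
def _levels(adjacency, node, d):
    """[[node], level_1, ..., level_d], following adjacency for up to d steps."""
    levels, frontier = [[node]], [node]
    for _ in range(d):
        frontier = [m for n in frontier for m in adjacency.get(n, [])]
        if not frontier:
            break
        levels.append(frontier)
    return levels

def offspring_subtree(children, node, d):
    return _levels(children, node, d)

def ancestor_subtree(children, node, d):
    # Invert the parent -> children mapping to walk upward.
    parents = {}
    for parent, kids in children.items():
        for kid in kids:
            parents.setdefault(kid, []).append(parent)
    return _levels(parents, node, d)

def sibling_subtree(children, node):
    sibs = [k for kids in children.values() if node in kids for k in kids if k != node]
    return [[node], sibs]

# Hypothetical concept hierarchy
children = {"shopping": ["clothes", "shoes"], "clothes": ["shirt", "sweater"]}
offspring = offspring_subtree(children, "shopping", 2)
```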
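20 Generate the Thesaurus (cont.)
- For each generated sub-tree (e.g., $ST_i(\text{offspring})$), the mutual information of a term pair is computed from its co-occurrence count, which stands for the number of times the two terms appear together in the sub-tree
Figure: example sub-trees, one showing $n_i$, $\text{sons}_1(n_i)$, $\text{sons}_2(n_i)$ and one showing $\text{parents}_2(n_i)$, $\text{parents}_1(n_i)$, $n_i$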
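The slide's formula is not preserved in this transcript, so the following is only a sketch of one common form: MI(a, b) = P(a, b) log(P(a, b) / (P(a) P(b))), with probabilities estimated from how many sub-trees contain each term and each term pair. The estimator, the function name, and the sample sub-trees are assumptions and may differ from the paper's definition.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(subtrees):
    """subtrees: list of term lists, one list per sub-tree.
    Returns {(a, b): MI} with a < b alphabetically.
    """
    pair_count, term_count = Counter(), Counter()
    for terms in subtrees:
        unique = sorted(set(terms))
        term_count.update(unique)                    # sub-trees containing the term
        pair_count.update(combinations(unique, 2))   # sub-trees containing both terms
    total_pairs = sum(pair_count.values())
    total_terms = sum(term_count.values())
    mi = {}
    for (a, b), count in pair_count.items():
        p_ab = count / total_pairs
        p_a = term_count[a] / total_terms
        p_b = term_count[b] / total_terms
        mi[(a, b)] = p_ab * math.log(p_ab / (p_a * p_b))
    return mi

mi = mutual_information([["clothes", "shirt"], ["clothes", "shirt"], ["clothes", "toy"]])
```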
21 Generate the Thesaurus (cont.)
- Entropy is used to realize the heuristic: the more sub-trees contain the term pair, the more similar the two terms are
- The similarity of two terms
- Term pairs with values beyond a pre-defined threshold are selected as similar-term candidates.
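A sketch of the entropy heuristic, assuming the entropy is taken over the distribution of one term pair's co-occurrence counts across sub-trees. A pair spread evenly over many sub-trees then scores higher than a pair concentrated in one. How the entropy is combined with mutual information into the final similarity is not shown in this transcript, so the sketch stops at the entropy and the threshold filter; both function names are hypothetical.

```python
import math

def pair_entropy(counts_per_subtree):
    """Shannon entropy of one term pair's co-occurrence counts across sub-trees."""
    total = sum(counts_per_subtree)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts_per_subtree if c > 0]
    return -sum(p * math.log(p) for p in probs)

def select_candidates(similarity, threshold):
    """Keep the term pairs whose similarity value is beyond the threshold."""
    return {pair for pair, value in similarity.items() if value > threshold}

spread = pair_entropy([1, 1, 1, 1])    # pair seen once in each of four sub-trees
concentrated = pair_entropy([4])       # pair seen four times in a single sub-tree
```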
22 Experiment
- Three testing domains (queries)
- online shopping, photography, PDA
- The top 13 websites are selected and crawled
- 25 web pages are manually analyzed by 4 users
- Are navigational links correctly recognized?
- Are the nodes in the Web content structure correct?
- 15 terms and their associated terms are manually evaluated
- Application of the generated thesaurus to query expansion
23 Evaluation: Navigational Links
24 Evaluation: Correct Nodes
25 Evaluation: Associated Terms
- 15 terms from the obtained thesaurus were chosen
26 Experiment on Query Expansion
- Use the Okapi system for full-text search
- 10 queries for each domain
- e.g., Shopping domain: women shoes, mother day gift, children's clothes, antivirus software, listening jazz, wedding dress, palm, movie about love, Cannon camera, cartoon products
- The top 30 ranked documents are judged by 4 users
- Methods for comparison
- Baseline: no query expansion
- Full-text thesaurus
- Their Web Thesaurus (sibling)
- Their Web Thesaurus (offspring)
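A minimal sketch of thesaurus-based query expansion, assuming the thesaurus maps each term to a list of (related term, similarity) pairs and that the top k related terms are appended to the query. The function name, the value of k, and the similarity scores are hypothetical; how Okapi weights the added terms is not covered here.

```python
def expand_query(query, thesaurus, k=3):
    """Append up to k highest-scoring related terms for each query term."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        ranked = sorted(thesaurus.get(term, []), key=lambda pair: -pair[1])
        for related, _ in ranked[:k]:
            if related not in expanded:
                expanded.append(related)
    return expanded

# Hypothetical offspring-based entries with made-up similarity scores
thesaurus = {"clothes": [("shirt", 0.9), ("sweater", 0.8), ("toy", 0.1)]}
expanded = expand_query("childrens clothes", thesaurus, k=2)
```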
27 Experiment Result: Query Expansion
28 Discussions
- The baseline retrieval precision is still above 45%
- The naïve automatic full-text thesaurus decreases performance
- children's clothes: book, clothing, toy, accessory, fashion, vintage
- Query expansion based on the sibling relationship performs poorly
- children's clothes: book, toy, video, women, accessories, design
- Query expansion based on the offspring relationship improves performance
- children's clothes: baby, boy, girl, cardigan, shirt, sweater
29 The End
- Anchor texts and link structure open up many possibilities for Web applications
- Future work: construct a personalized thesaurus based on users' navigation history and accessed documents
- Trick on query expansion
- Expanding with more specific terms usually achieves better precision