Building a Web Thesaurus from Web Link Structure

1
Building a Web Thesaurus from Web Link Structure
Zheng Chen, Shenging Liu, Liu Wenyin, Geguang Pu, Wei-Ying Ma (MSRA)
SIGIR 2003
Paper Presentation
  • Shui-Lung Chang
  • May 29, 2003

2
Motivation
  • To build domain-specific thesauri for the purpose
    of search
  • The toughest Web-search problems
  • Word mismatch (indexers vs. users)
  • Short queries
  • Query expansion
  • (How to select the expansion terms?)
  • Global analysis method
  • Local analysis method

Co-occurrence analysis of terms in a corpus does not work well for the Web
3
Traditional Automatic Thesaurus
  • Term association is estimated by counting the
    number of times two terms co-occur within a
    window of w words in a corpus
  • Reference
  • K. W. Church and P. Hanks. Word association
    norms, mutual information and lexicography.
    Computational Linguistics, Vol. 16, No. 1, 1990.
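
The window-based counting described on this slide can be sketched as follows. This is a minimal illustration, not the implementation from the cited work: the function name `cooccurrence_counts`, the choice to count unordered pairs, and the toy token list are all assumptions.

```python
from collections import Counter


def cooccurrence_counts(tokens, w):
    """Count unordered term pairs that co-occur within a window of w words.

    Assumption: a pair is counted once for every position where the two
    terms are fewer than w words apart.
    """
    counts = Counter()
    for i, left in enumerate(tokens):
        # Pair the word at position i with the next w - 1 words.
        for right in tokens[i + 1 : i + w]:
            if left != right:
                counts[tuple(sorted((left, right)))] += 1
    return counts


# Toy corpus (hypothetical), window of 3 words.
tokens = "digital camera lens digital camera bag".split()
counts = cooccurrence_counts(tokens, w=3)
```

Counts like these are the raw input to an association measure such as mutual information.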

4
Motivation: Web Link Structure
  • Hyperlink: the major difference between a web
    page and pure text
  • Topic locality: pages connected together are
    more likely to be on the same topic than
    unconnected ones
  • Anchor description: anchor texts can effectively
    describe their target pages
  • Motivation of using Web link structure

Link structure: Web pages as nodes, hyperlinks as edges
Semantic network: anchor texts as nodes, semantic relations as edges
5
Overview of the Approach
  • Select a set of high quality websites for a given
    domain
  • Apply link analysis techniques to construct
    website content structure
  • Remove navigational and meaningless links
  • Discover the semantic relationship between web
    pages (hierarchy relationship or horizontal
    relationship)
  • Summarize a web page to a concept category
    (web-page concept naming)
  • Apply a statistical method to construct the
    thesaurus
  • Calculate the mutual information of the
    words/phrases within the content structures

6
Getting High Quality Websites
  • Obtain authoritative websites through a search
    engine with a successful website ranking
    mechanism
  • Google directory search (http://directory.google.com/)

7
Website Content Structure
  • Two general semantic relationships between concepts
  • Aggregation: the parent concept is semantically
    broader than the child concept (hierarchical
    relationship)
  • Association: concepts are semantically related
    to each other (horizontal relationship)
  • Functions of hyperlinks
  • Assist navigation
  • Bring semantically related Web pages together
  • Semantic links
  • Explicit semantic link: the link is represented
    by a hyperlink
  • Implicit semantic link: the link is inferred
    from explicit semantic links

8
Website Content Structure (cont.)
  • A hyperlink in the navigation structure is
    called a semantic link if the two pages it
    connects have an explicit semantic relationship

Example: http://eshop.msn.com
9
Website Content Structure (cont.)
  • A website content structure is defined as a
    directed graph G = (V, E), where V is the set of
    nodes and E the set of edges
  • A node is a 4-tuple (ID, Type, Concept,
    Description)
  • Type: index page or content page
  • Concept: a keyword or phrase that represents a
    web page
  • Description: other information, e.g., <page title>, <URL>
  • An edge is a 4-tuple (Source, Target, Type,
    Description)
  • Source, Target: the source and target nodes
  • Type: aggregation or association
  • Description: other information, e.g., anchor text
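
The node and edge tuples above map naturally onto small record types. The sketch below is one possible encoding, not the authors' data model: the class names, the string values used for the type fields, and the `children` helper are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    id: int
    type: str              # "index" or "content"
    concept: str           # keyword or phrase representing the page
    description: str = ""  # e.g. page title or URL


@dataclass(frozen=True)
class Edge:
    source: int            # ID of the source node
    target: int            # ID of the target node
    type: str              # "aggregation" or "association"
    description: str = ""  # e.g. anchor text


@dataclass
class ContentStructure:
    """Directed graph G = (V, E) over Node and Edge records."""
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node):
        self.nodes[node.id] = node

    def add_edge(self, edge):
        if edge.source not in self.nodes or edge.target not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append(edge)

    def children(self, node_id):
        """IDs reachable from node_id through aggregation edges."""
        return [e.target for e in self.edges
                if e.source == node_id and e.type == "aggregation"]


# Hypothetical two-page site.
g = ContentStructure()
g.add_node(Node(1, "index", "shopping", "http://eshop.msn.com"))
g.add_node(Node(2, "content", "children's clothes"))
g.add_edge(Edge(1, 2, "aggregation", "children's clothes"))
```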

10
Overview of the Approach
  • Select a set of high quality websites for a given
    domain
  • Apply link analysis techniques to construct
    website content structure
  • Remove navigational and meaningless links
  • Discover the semantic relationship between web
    pages (hierarchy relationship or horizontal
    relationship)
  • Summarize a web page to a concept category
    (web-page concept naming)
  • Apply a statistical method to construct the
    thesaurus
  • Calculate the mutual information of the
    words/phrases within the content structures

11
Detecting Navigational Links
  • Use the information encoded in the URL, i.e., the
    server-side local directory structure (e.g.,
    http://google/aaa/bbb.html), to classify a link
  • Upward: to a parent directory
  • Downward: to a subdirectory
  • Forward: to a sub-subdirectory
  • Sibling: within the same directory
  • Crosswise: any case other than the above
  • Apply rules to detect navigational links
    (92.82%)
  • An upward link functions as a return to the
    previous page
  • Link within a high-level navigation bar
  • Link within a navigation list
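
A rough sketch of the URL-based link typing on this slide, using only the directory part of each URL. The function names are hypothetical, and two choices are assumptions rather than anything stated on the slide: a link to any ancestor directory is treated as upward, and a link to a different host is treated as crosswise.

```python
import posixpath
from urllib.parse import urlparse


def directory_parts(url):
    """Return the directory components of a URL path as a list."""
    path = urlparse(url).path
    directory = path if path.endswith("/") else posixpath.dirname(path)
    return [part for part in directory.split("/") if part]


def classify_link(source_url, target_url):
    """Label a link by comparing the source and target directories."""
    if urlparse(source_url).netloc != urlparse(target_url).netloc:
        return "crosswise"  # assumption: off-site links are crosswise
    src = directory_parts(source_url)
    dst = directory_parts(target_url)
    if dst == src:
        return "sibling"
    if len(dst) < len(src) and src[:len(dst)] == dst:
        return "upward"     # assumption: any ancestor directory counts
    if len(dst) > len(src) and dst[:len(src)] == src:
        # One level deeper is a subdirectory; deeper still is "forward".
        return "downward" if len(dst) == len(src) + 1 else "forward"
    return "crosswise"


page = "http://google/aaa/bbb.html"
kinds = [
    classify_link(page, "http://google/aaa/ccc.html"),
    classify_link(page, "http://google/index.html"),
    classify_link(page, "http://google/aaa/x/c.html"),
    classify_link(page, "http://google/aaa/x/y/c.html"),
    classify_link(page, "http://google/zzz/c.html"),
]
```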

12
Tagging Objects in Web Pages
  • Identify the functions/categories of the objects
    on a page using the Function-based Object Model (FOM)
  • Index page or content page?
  • Navigation bar and list?
  • Reference
  • J. L. Chen, et al. Function-based Object Model
    Towards Website Adaptation. In Proc. of the 10th
    World Wide Web Conference, pp. 587-596, May 2001.

13
FOM Example
14
FOM Example
15
FOM: Navigation Bar/List?
16
Discovering the Semantic Relationship
  • Apply the following rules
  • A link in a content page conveys an association
    relationship
  • A link in an index page usually conveys an
    aggregation relationship (further revised by the
    following rules)
  • A link conveys an aggregation relationship if it
    is in a navigation bar that belongs to an index
    page
  • If two web pages have aggregation relationships
    in both directions, the relationship is changed
    to association
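
The rules above can be read as a two-pass labelling procedure, sketched below. The function name and input shapes are assumptions. The navigation-bar rule is not modelled separately here because, under this reading, it yields the same label as the index-page default.

```python
def label_links(links, page_type):
    """Assign "aggregation" or "association" to each (source, target) link.

    links: iterable of (source, target) pairs.
    page_type: dict mapping each page to "index" or "content".
    """
    labels = {}
    # Pass 1: the label depends on the type of the page holding the link.
    for source, target in links:
        if page_type[source] == "content":
            labels[(source, target)] = "association"
        else:
            labels[(source, target)] = "aggregation"
    # Pass 2: aggregation in both directions becomes association.
    for (source, target), label in list(labels.items()):
        if label == "aggregation" and labels.get((target, source)) == "aggregation":
            labels[(source, target)] = "association"
            labels[(target, source)] = "association"
    return labels


# Hypothetical site: two index pages linking to each other, one content page.
page_type = {"home": "index", "cameras": "index", "review": "content"}
links = [("home", "cameras"), ("cameras", "home"),
         ("cameras", "review"), ("review", "home")]
labels = label_links(links, page_type)
```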

17
FOM: Index or Content Page?
  • By statistical analysis of out-degree (OD) and
    in-degree (ID)
  • A page with relatively large OD or ID may be an
    index page
  • A page with relatively small OD and ID may be a
    content page
  • Rules
  • If OD > OD0 or ID > ID0, the page is an index
    page
  • If OD < OD0 and ID < ID0, the page is a content
    page
  • Ways to get the constants OD0 and ID0
  • The OD(i)-i (ID(i)-i) curve (i is the rank of a
    page in sorted order)

(Figure label: Beeline)
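
The degree rule can be sketched in a few lines. The thresholds are passed in as plain parameters because the slide does not spell out how OD0 and ID0 are read off the curve; the values used below are hypothetical. Treating a page that sits exactly on a threshold as a content page is also an assumption, since the slide leaves that case open.

```python
from collections import Counter


def degrees(links):
    """Return (out_degree, in_degree) counters for (source, target) links."""
    out_deg, in_deg = Counter(), Counter()
    for source, target in links:
        out_deg[source] += 1
        in_deg[target] += 1
    return out_deg, in_deg


def classify_page(od, id_, od0, id0):
    """Index page if OD > OD0 or ID > ID0, otherwise content page."""
    return "index" if od > od0 or id_ > id0 else "content"


# Hypothetical link list and thresholds.
links = [("home", "a"), ("home", "b"), ("home", "c"),
         ("a", "home"), ("b", "home")]
out_deg, in_deg = degrees(links)
pages = set(out_deg) | set(in_deg)
page_types = {p: classify_page(out_deg[p], in_deg[p], od0=2, id0=2)
              for p in pages}
```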
18
Notating a Web Page
  • Select the anchor text with the most
    discriminative power, measured by the TF-IDF
    weighting scheme

(Figure label: Anchor texts)
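
One way to read this step: each page collects the anchor texts of links pointing to it, and the text with the highest TF-IDF score becomes the page's concept name. The sketch below is an assumption about the details, which the slide does not give. In particular, the term frequency is the number of in-links using a text, the document frequency is the number of distinct target pages a text points to, and the smoothed `log(1 + N / df)` form is chosen for illustration.

```python
import math
from collections import Counter


def name_pages(anchors):
    """Pick a concept name per page from (target_page, anchor_text) pairs."""
    per_page = {}
    for page, text in anchors:
        per_page.setdefault(page, Counter())[text.lower()] += 1
    n_pages = len(per_page)
    # Document frequency: how many target pages each anchor text points to.
    df = Counter()
    for texts in per_page.values():
        df.update(texts.keys())

    def score(texts, text):
        return texts[text] * math.log(1 + n_pages / df[text])

    return {page: max(texts, key=lambda t: score(texts, t))
            for page, texts in per_page.items()}


# Hypothetical anchors: "home" points everywhere, so it discriminates poorly.
anchors = [
    ("/cameras", "Home"), ("/cameras", "home"),
    ("/cameras", "digital cameras"), ("/cameras", "digital cameras"),
    ("/pda", "home"), ("/pda", "PDA"),
]
names = name_pages(anchors)
```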
19
Generate the Thesaurus
  • A term segmentation tool (NLPWin, COLING 2000) is
    applied because anchor text comes in many forms,
    e.g., words, phrases, and short sentences.
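  • Three relationships to extract
  • Ancestor: $ST_i(\text{ancestor}) = (n_i, \text{parents}_1(n_i), \ldots, \text{parents}_d(n_i))$
  • Offspring: $ST_i(\text{offspring}) = (n_i, \text{sons}_1(n_i), \ldots, \text{sons}_d(n_i))$
  • Sibling: $ST_i(\text{sibling}) = (n_i, \text{sibs}_1(n_i), \ldots, \text{sibs}_d(n_i))$

A node $n_i = \{w_{i1}, w_{i2}, \ldots, w_{im}\}$, where $w_{ij}$ is a term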
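
The ancestor and offspring sub-trees can be collected with a level-by-level traversal of the aggregation edges, as sketched below. The helper names, the adjacency-dict input, and the toy hierarchy are assumptions. The sibling sub-tree is left out because the slide does not say how the deeper sibling levels are defined.

```python
def levels(start, neighbors, depth):
    """Return [level_1, ..., level_depth] of nodes first reached at each hop."""
    seen = {start}
    frontier = [start]
    result = []
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            for nb in neighbors.get(node, []):
                if nb not in seen:
                    seen.add(nb)
                    next_frontier.append(nb)
        result.append(next_frontier)
        frontier = next_frontier
    return result


def invert(children):
    """Turn a parent -> children mapping into child -> parents."""
    parents = {}
    for parent, kids in children.items():
        for kid in kids:
            parents.setdefault(kid, []).append(parent)
    return parents


def offspring_subtree(node, children, depth):
    """(n_i, sons_1(n_i), ..., sons_d(n_i))"""
    return (node, *levels(node, children, depth))


def ancestor_subtree(node, children, depth):
    """(n_i, parents_1(n_i), ..., parents_d(n_i))"""
    return (node, *levels(node, invert(children), depth))


# Toy aggregation hierarchy (hypothetical).
children = {
    "shopping": ["clothing", "electronics"],
    "clothing": ["children's clothes"],
    "children's clothes": ["baby", "boy", "girl"],
}
down = offspring_subtree("clothing", children, depth=2)
up = ancestor_subtree("baby", children, depth=2)
```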
20
Generate the Thesaurus (cont.)
  • For each generated sub-tree (e.g.,
    $ST_i(\text{offspring})$), the mutual information
    of a term pair is computed from the number of
    times the two terms appear together in the
    sub-tree [formula not preserved]

(Figure: example sub-trees around a node n_i, showing sons_1(n_i), sons_2(n_i), parents_1(n_i), and parents_2(n_i))
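
Since the slide's formula is not available here, the sketch below substitutes the standard pointwise mutual information of Church and Hanks, estimated over sub-trees rather than text windows. This is a stand-in under stated assumptions, not the authors' exact formula: each sub-tree is reduced to a set of terms, and probabilities are estimated as the fraction of sub-trees containing a term or a pair.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_over_subtrees(subtrees):
    """Return {(a, b): PMI in bits} for term pairs sharing a sub-tree.

    PMI(a, b) = log2( P(a, b) / (P(a) * P(b)) ), with each probability
    taken as the fraction of sub-trees containing the term(s).
    """
    n = len(subtrees)
    term_count = Counter()
    pair_count = Counter()
    for terms in subtrees:
        terms = set(terms)
        term_count.update(terms)
        pair_count.update(combinations(sorted(terms), 2))
    return {
        (a, b): math.log2(c * n / (term_count[a] * term_count[b]))
        for (a, b), c in pair_count.items()
    }


# Hypothetical sub-trees, each flattened to its set of terms.
subtrees = [
    {"clothes", "baby", "boy"},
    {"clothes", "baby", "girl"},
    {"camera", "lens"},
    {"camera", "tripod"},
]
pmi = pmi_over_subtrees(subtrees)
```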
21
Generate the Thesaurus (cont.)
  • Entropy is used to realize the heuristic: the
    more sub-trees contain the term pair, the more
    similar the two terms are
  • The similarity of two terms [formula not preserved]
  • Term pairs with values above a pre-defined
    threshold are selected as similar-term candidates.
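
The entropy heuristic and the threshold step can be illustrated separately from the similarity formula, which is not reproduced here. In this sketch the entropy is taken over the distribution of a pair's co-occurrence counts across sub-trees, so a pair spread over many sub-trees scores higher than one confined to a single sub-tree. The function names, the example counts, and the example scores are assumptions.

```python
import math


def pair_entropy(counts_per_subtree):
    """Entropy (bits) of a term pair's counts spread across sub-trees."""
    total = sum(counts_per_subtree)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts_per_subtree if c > 0]
    return sum(-p * math.log2(p) for p in probs)


def select_candidates(similarity, threshold):
    """Keep the term pairs whose similarity exceeds the threshold."""
    return sorted(pair for pair, score in similarity.items()
                  if score > threshold)


spread = pair_entropy([1, 1, 1, 1])   # pair appears in four sub-trees
confined = pair_entropy([4, 0, 0, 0])  # same total, one sub-tree

# Hypothetical similarity scores and threshold.
similarity = {("baby", "clothes"): 0.9, ("book", "clothes"): 0.2}
candidates = select_candidates(similarity, threshold=0.5)
```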

22
Experiment
  • Three test domains (queries)
  • online shopping, photography, PDA
  • The top 13 websites are selected and crawled
  • 25 web pages are manually analyzed by 4 users
  • Are navigational links correctly recognized?
  • Are the nodes in the Web content structure
    correct?
  • 15 terms and their associated terms are manually
    evaluated
  • Application of the generated thesaurus on query
    expansion

23
Evaluation: Navigational Links
24
Evaluation: Correct Nodes
25
Evaluation: Associated Terms
  • 15 terms from the obtained thesaurus were chosen

26
Experiment on Query Expansion
  • Use the Okapi system for full-text search
  • 10 queries for each domain
  • e.g., Shopping domain: women shoes, mother day
    gift, children's clothes, antivirus software,
    listening jazz, wedding dress, palm, movie about
    love, Cannon camera, cartoon products
  • The top 30 ranked documents are judged by 4 users
  • Methods for comparison
  • Baseline: no query expansion
  • Full-text thesaurus
  • Their Web Thesaurus (sibling)
  • Their Web Thesaurus (offspring)

27
Experiment Result: Query Expansion
28
Discussions
  • The baseline retrieval precision is still above
    45%
  • The naïve automatic full-text thesaurus decreases
    performance
  • 'children's clothes': book, clothing, toy,
    accessory, fashion, vintage ...
  • Query expansion based on the sibling relationship
    performs poorly
  • 'children's clothes': book, toy, video, women,
    accessories, design
  • Query expansion based on the offspring
    relationship improves performance
  • 'children's clothes': baby, boy, girl, cardigan,
    shirt, sweater
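
The expansion step compared on this slide amounts to appending thesaurus terms to the query. The sketch below uses the offspring terms quoted above as its only data. The function name and the `max_terms` cut-off are hypothetical, and how the expanded terms are weighted by the retrieval system is not modelled.

```python
def expand_query(query, thesaurus, max_terms=3):
    """Return the query followed by up to max_terms related terms."""
    related = thesaurus.get(query.lower(), [])
    return [query] + related[:max_terms]


# Offspring terms for the example query, as listed on the slide.
offspring_thesaurus = {
    "children's clothes": ["baby", "boy", "girl",
                           "cardigan", "shirt", "sweater"],
}
expanded = expand_query("children's clothes", offspring_thesaurus)
unchanged = expand_query("wedding dress", offspring_thesaurus)
```

A query with no thesaurus entry is returned as-is, which matches the baseline with no expansion.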

29
The End
  • Anchor texts and link structure offer many
    possibilities for various Web applications
  • Future work: construct a personalized thesaurus
    based on users' navigation history and accessed
    documents
  • Trick on query expansion
  • Expanding with more specific terms usually
    achieves better precision