1
Adaptive XML Search
  • Dr Wilfred Ng
  • Department of Computer Science
  • The Hong Kong University of Science and
    Technology

2
Outline
  • Motivation
  • Key-Tag Search
  • Multi-Ranker Model
  • Ranking Support Vector Machine in Voting SpyNB
    Framework (RSSF)
  • Experiments
  • Conclusions and Ongoing Work

3
Motivation
4
Why do we need an XML search engine?
  • HTML and XML data differ in nature
  • HTML data
      • Hyperlink-intensive
      • Declarative language
      • Tags have no semantic meaning
  • XML data
      • Self-describing tags
      • Extra structural information
  • XML search engines can therefore retrieve more accurate
    fragments

5
Why do we need an XML search engine?
  • Web searching
      • Document paradigm
      • Matches keywords against documents
      • Returns links to whole documents (web pages)
  • XML searching
      • Query keywords may be tags or data values
      • The structure of XML documents is diverse, e.g. DBLP
        and Shakespeare
      • Does not return the whole document, which may be 100 MB or larger
      • Returns fragments

6
DBLP
  • <dblp>
  • <incollection mdate="2002-01-03"
    key="books/acm/kim95/AnnevelinkACFHK95">
  •   <author>Jurgen Annevelink</author>
  •   <title>Object SQL - A Language for the Design
    and Implementation of Object Databases.</title>
  •   <pages>42-68</pages>
  •   <year>1995</year>
  •   <booktitle>Modern Database Systems</booktitle>
  •   <url>db/books/collections/kim95.html</url>
  • </incollection>
  • .

7
Shakespeare
  • <SPEECH>
  • <SPEAKER>OCTAVIUS CAESAR</SPEAKER>
  • <LINE>No, my most wronged sister;
    Cleopatra</LINE>
  • <LINE>Hath nodded him to her. He hath given his
    empire</LINE>
  • <LINE>Up to a whore; who now are levying</LINE>
  • <LINE>The kings o' the earth for war; he hath
    assembled</LINE>
  • <LINE>Bocchus, the king of Libya;
    Archelaus,</LINE>
  • <LINE>Of Cappadocia; Philadelphos, king</LINE>
  • <LINE>Of Paphlagonia; the Thracian king,
    Adallas;</LINE>
  • <LINE>King Malchus of Arabia; King of
    Pont;</LINE>
  • <LINE>Herod of Jewry; Mithridates, king</LINE>
  • <LINE>Of Comagene; Polemon and Amyntas,</LINE>
  • <LINE>The kings of Mede and Lycaonia,</LINE>
  • <LINE>With a more larger list of sceptres.</LINE>
  • </SPEECH>

8
Research Ideas
  • The Information Retrieval (IR) community has developed many
    ranking techniques
      • Weighted keywords
      • Vector space
  • Searching and ranking XML as plain text using IR
    techniques is possible, but
      • Too simple
      • Does not take advantage of XML data
  • Better accuracy can be achieved using features of XML
    data
      • Structures
      • Tag semantics

10
Key-Tag Search
11
Key-Tag Query vs. XQuery
  • Analogous to keywords in a Web search engine vs. SQL
  • The goals of key-tag query and XQuery are
    different
  • Key-Tag Query
      • Simple
      • Easy to understand
      • Flexible

XQuery: for $x in doc("some.xml") where $x/author[. ftcontains "Mary"] return $x/title
Key-Tag Query: <author>Mary</author>
Will users type such a complex XQuery into a search engine? It is too complicated for ordinary users.
12
Key-Tag Search Query
  • For example, <author>Mary</author>
      • Tag: author
      • Key: Mary

13
Key-Tag Query Semantics
  • A fragment is considered a result candidate if
    at least one key-tag is found in it.
  • If F1 and F2 both contain the same instance of a
    key-tag and F1 is a subtree of F2, F1 is chosen as
    the only answer.
  • For example, a query <b>b</b>
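
A minimal Python sketch of these semantics, assuming that candidate fragments are the subtrees rooted at a fixed set of tags. The function answer_fragments, the parameter fragment_tags and the sample document are illustrative only and are not taken from the original system. For each key-tag instance the code walks up to the nearest enclosing candidate root, which is the smallest fragment containing that instance.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
DOC = """<dblp>
  <article><author>Mary</author><title>XML</title></article>
  <article><author>John</author><title>SQL</title></article>
</dblp>"""


def answer_fragments(root, tag, key, fragment_tags):
    """Return the smallest candidate fragment around each key-tag match."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent = {child: p for p in root.iter() for child in p}
    answers = []
    for el in root.iter(tag):
        if key.lower() not in (el.text or "").lower():
            continue  # the tag matches but the key does not
        # Walk upwards until the nearest element that may root a fragment.
        node = el
        while node is not None and node.tag not in fragment_tags:
            node = parent.get(node)
        if node is not None and node not in answers:
            answers.append(node)
    return answers


root = ET.fromstring(DOC)
hits = answer_fragments(root, "author", "Mary", {"dblp", "article"})
print([h.tag for h in hits])  # ['article']
```

Both the dblp subtree and the first article subtree contain the matching key-tag instance; because the article is a subtree of dblp, only the article fragment is returned.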

15
Multi-Ranker Model
16
Introduction to MRM
  • Handle diversified XML documents and user
    preferences

17
Multi-Ranker Model
[Figure: the three ranking levels of the Multi-Ranker Model]
  • Adaptive Ranking Level (AR): User Profiles, RSSF, weights w11, w12, w13, w14
  • Standard Ranking Level (XR)
  • Feature Ranking Level: Keyword, Access, Path, Element, Order, Category,
    Sibling, Children, Distance, Distance-, Tag, Attribute
18
Adaptive Ranking Level (AR)
  • AR maintains a feature vector, Φ, which adapts to
    the four XRs; the vector is weighted and trained
    by RSSF
  • Φ has eight components, two for each of the four XRs
    (STR, DAT, DFT, CUS)
  • The adaptive ranking of a fragment is calculated
    by W · Φ,
  • where W is generated by RSSF, which is introduced
    later.
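
As a rough illustration of the weighted scoring described above, the Python sketch below ranks fragments by a plain dot product of a weight vector and a feature vector. All numeric values and the name adaptive_score are invented for the example; in the slides the weights come from RSSF.

```python
def adaptive_score(weights, features):
    """Dot product of the weight vector and a fragment's feature vector."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(w * f for w, f in zip(weights, features))


# Hypothetical weights for the eight components (two per XR: STR, DAT, DFT, CUS).
W = [0.5, 0.25, 0.0, 0.25, 1.0, 0.0, 0.5, 0.0]

# Hypothetical feature vectors for two fragments.
fragments = {
    "F1": [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "F2": [0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
}

scores = {name: adaptive_score(W, phi) for name, phi in fragments.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)   # {'F1': 0.5, 'F2': 1.5}
print(ranking)  # ['F2', 'F1']
```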

19
Standard Ranking Level (XR)
  • Four XRs
      • Structure ranker (STR): focuses on ranking XML
        fragments based on their structure
      • Data ranker (DAT): ignores the structure and ranks
        XML fragments by their textual data
      • System default ranker (DFT): a balance of the
        structure and data rankers
      • Customized ranker (CUS): the system administrator
        selects low-level features for tuning; in our
        experiments, the low-level features are randomly
        picked

20
Feature Ranking Level
For example, Q = <author>Mary</author>,
<title>XML</title>
  • Similarity Features
      • Keyword
      • Access
      • Path
      • Element
      • Order
      • Category

Order in Q: author > title
Sibling order in F: author > title, author > year,
title > year, first > last
Ancestor order similarity: 0
Sibling order similarity: 1/4
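
The slide does not give the formula behind the sibling order similarity. One reading that reproduces the 1/4 above is the fraction of ordered sibling pairs in F that also appear in the query order. The Python sketch below works under that assumption; the fragment content and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
import xml.etree.ElementTree as ET

# Hypothetical fragment F whose sibling orders match the four pairs on the slide.
F = ET.fromstring(
    "<article>"
    "<author><first>Mary</first><last>Lee</last></author>"
    "<title>XML</title><year>2007</year>"
    "</article>"
)


def sibling_order_pairs(root):
    """All (earlier, later) tag pairs among siblings anywhere in the fragment."""
    pairs = set()
    for parent in root.iter():
        tags = [child.tag for child in parent]
        pairs.update(combinations(tags, 2))  # keeps document order
    return pairs


def sibling_order_similarity(query_tags, root):
    """Assumed measure: share of the fragment's sibling pairs ordered as in the query."""
    query_pairs = set(combinations(query_tags, 2))
    fragment_pairs = sibling_order_pairs(root)
    if not fragment_pairs:
        return Fraction(0)
    return Fraction(len(query_pairs & fragment_pairs), len(fragment_pairs))


similarity = sibling_order_similarity(["author", "title"], F)
print(sorted(sibling_order_pairs(F)))
print(similarity)  # 1/4
```
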
21
Feature Ranking Level
For example, Q = <author>Mary</author>,
<title>XML</title>
  • Granularity Features
      • Sibling: number of fragments whose roots are dblp
      • Children: number of tags whose parent is dblp
      • Distance: length of the path from the root to the farthest
        leaf, e.g. dblp/article/author/first, length 4
      • Distance-: length of the path from the root to the nearest
        leaf, e.g. dblp/article/title, length 3
      • Tag: number of tags in F, here 7
      • Attribute: number of attributes in F, here 0
  • These features involve statistical data in the database
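
A short Python sketch of how the Distance, Tag and Attribute counts could be computed with the standard xml.etree.ElementTree module. The fragment is a hypothetical one chosen to be consistent with the numbers on the slide (7 tags, 0 attributes, leaf paths of length 4 and 3); the dictionary keys are illustrative names.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment F: dblp, article, author, first, last, title, year.
F = ET.fromstring(
    "<dblp><article>"
    "<author><first>Mary</first><last>Lee</last></author>"
    "<title>XML</title><year>2007</year>"
    "</article></dblp>"
)


def leaf_depths(node, depth=1):
    """Yield the path length (counted in tags) from the root to every leaf."""
    children = list(node)
    if not children:
        yield depth
    for child in children:
        yield from leaf_depths(child, depth + 1)


features = {
    "tag": sum(1 for _ in F.iter()),                      # number of tags in F
    "attribute": sum(len(el.attrib) for el in F.iter()),  # number of attributes in F
    "distance_farthest": max(leaf_depths(F)),             # dblp/article/author/first
    "distance_nearest": min(leaf_depths(F)),              # dblp/article/title
}
print(features)
# {'tag': 7, 'attribute': 0, 'distance_farthest': 4, 'distance_nearest': 3}
```
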
22
Highlights of MRM
  • Highly flexible
      • Adding or removing features or XRs is
        straightforward
      • Only the feature vector needs to be updated
  • Ranking level independence
      • Analogous to data independence in the relational model

24
Features of RSSF
  • Input: a set of labeled fragments
  • Output: a trained ranker
  • Naïve Bayes is a successful algorithm for
    learning to classify text documents
  • It requires a small amount of training data, both
    positive and negative samples
  • In our setting we only have labeled and
    unlabeled data, so we extend Naïve Bayes with a
    spying technique to obtain the negative training
    samples

25
The RSSF
26
Ranking SVM Techniques
  • Find a vector that makes the inequality
    F1 < F2 < F3 hold
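
To make the idea of a vector that orders the fragments concrete, here is a small Python sketch. It is not a Ranking SVM solver: it substitutes a much simpler perceptron-style update on pairwise feature differences, which only looks for some vector satisfying the ordering with a margin. The feature vectors, learning rate and margin are invented for the example.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def learn_ranking_vector(ordered_features, epochs=100, lr=0.1, margin=1.0):
    """ordered_features lists feature vectors from lowest-ranked to highest-ranked."""
    w = [0.0] * len(ordered_features[0])
    # Every (lower, higher) pair gives one constraint: w . (higher - lower) >= margin.
    pairs = [
        (low, high)
        for i, low in enumerate(ordered_features)
        for high in ordered_features[i + 1:]
    ]
    for _ in range(epochs):
        updated = False
        for low, high in pairs:
            diff = [h - l for h, l in zip(high, low)]
            if dot(w, diff) < margin:
                # Constraint violated: move w towards the difference vector.
                w = [wi + lr * d for wi, d in zip(w, diff)]
                updated = True
        if not updated:
            break  # all pairwise constraints hold
    return w


# Hypothetical feature vectors for F1, F2, F3 (desired order F1 < F2 < F3).
F1, F2, F3 = [1.0, 0.0], [0.0, 1.0], [0.0, 2.0]
w = learn_ranking_vector([F1, F2, F3])
scores = [dot(w, f) for f in (F1, F2, F3)]
print(scores[0] < scores[1] < scores[2])  # True
```

A real Ranking SVM additionally maximises the margin and tolerates violated pairs through slack variables, so it would normally be trained with a dedicated solver.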

27
Voting Spy Naïve Bayes
28
Voting Spy Naïve Bayes
[Figure: training Naïve Bayes with P1, P2 and P3; once training is completed, the estimated negatives are obtained. Legend: Positive, Unclassified, Negative]
29
Voting Spy Naïve Bayes
[Figure: the final estimated negative set is decided by voting. F11 appears three times, F12 twice, F13 and F14 once each. Legend: Positive, Unclassified, Negative]
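
The voting step can be sketched in Python as follows, assuming each run produces its own set of estimated negatives and that the voting threshold is the minimum fraction of runs that must agree. Both assumptions, the function name vote_negatives and the grouping of fragments into runs are illustrative; only the vote counts (F11 three times, F12 twice, F13 and F14 once) follow the slide.

```python
from collections import Counter


def vote_negatives(per_run_negatives, threshold):
    """Keep fragments estimated negative in at least `threshold` of the runs."""
    n_runs = len(per_run_negatives)
    votes = Counter(f for run in per_run_negatives for f in set(run))
    return {f for f, count in votes.items() if count / n_runs >= threshold}


# Hypothetical per-run estimates giving F11 three votes, F12 two, F13 and F14 one.
runs = [
    {"F11", "F12", "F13"},
    {"F11", "F12"},
    {"F11", "F14"},
]

print(sorted(vote_negatives(runs, threshold=0.5)))  # ['F11', 'F12']
print(sorted(vote_negatives(runs, threshold=1.0)))  # ['F11']
```

Raising the threshold keeps fewer fragments as negatives and lowering it keeps more.
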
31
Effect of Varying Voting Threshold
[Figure] X-axis: voting threshold. Y-axis: relative average rank of
labeled fragments = new average rank /
original average rank
32
Effectiveness of Low-Level Features on XR
  • In this experiment, we remove individual
    low-level features from the STR and DAT rankers and
    measure the new precision
  • The queries we use can be found in the appendix
    of the proposal

33
Processing Time
34
Comparison with TopX
  • TopX is a search engine for XML data, available
    online
  • State-of-the-art XML search engine
  • We measure MAP and precision_at_k
      • MAP (mean average precision): the average precision over 100
        recall points for each query, averaged over all queries
      • precision_at_k (top-k precision): number of relevant results in
        the top k, divided by k
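
A Python sketch of the two measures, under stated assumptions: each ranked list is given as 0/1 relevance flags, every relevant result is assumed to appear somewhere in the list, and the 100 recall points are taken as evenly spaced levels 0.01, 0.02, ..., 1.00 with interpolated precision. The slides do not spell out these details, so this is one plausible reading and not the evaluation code used in the experiments.

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k results that are relevant (relevance is a 0/1 list)."""
    return sum(relevance[:k]) / k


def average_precision(relevance, n_points=100):
    """Interpolated precision averaged over n_points evenly spaced recall levels."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    hits = 0
    curve = []  # (recall, precision) after each rank
    for rank, rel in enumerate(relevance, start=1):
        hits += rel
        curve.append((hits / total_relevant, hits / rank))
    total = 0.0
    for i in range(1, n_points + 1):
        level = i / n_points
        # Interpolated precision: best precision at any recall >= this level.
        total += max((p for r, p in curve if r >= level), default=0.0)
    return total / n_points


def mean_average_precision(runs):
    """Mean of the per-query average precision values."""
    return sum(average_precision(r) for r in runs) / len(runs)


# Hypothetical relevance judgments for two queries.
query_runs = [[1, 0, 1], [1, 1]]
print(precision_at_k(query_runs[0], 2))   # 0.5
print(mean_average_precision(query_runs))
```
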
36
Further remarks
  • Searching and ranking XML data are important,
    since existing Web search engines cannot handle
    them well
  • We present an effective approach to adaptive
    XML searching and ranking that extends
    traditional IR techniques by considering
    different features of XML data

37
Ongoing Work: INEX 2007
  • The Initiative for the Evaluation of XML Retrieval
    (INEX)
      • A community that aims to provide large test data sets
        and scoring methods for researchers to evaluate
        their retrieval systems
      • It has been attracting attention recently
  • We participated in INEX in 2006 and 2007
  • The INEX 2007 collection is a Wikipedia XML corpus
    with a set of 659,388 XML documents
  • We are running experiments using their data and
    queries

38
Ongoing Work: INEX 2007
39
Ongoing Work: Merging
  • Displaying a list of fragments one by one to the
    user may not be adequate in an XML setting.
      • Fragments may be scattered across the list
      • Duplicated fragments in different structures
  • Refine a search query to obtain more and better
    results
  • Idea: make use of the schema information (DTD),
    consider the fragments as entities, and merge
    them in a concise way

40
My Publications
  • Ho-Lam LAU and Wilfred NG. A Multi-Ranker Model
    for Adaptive XML Searching. Accepted, to
    appear in VLDB Journal, (2007).
  • Ho-Lam LAU and Wilfred NG. Towards an Adaptive
    Information Merging Using Selected XML Fragments.
    International Conference on Database Systems for
    Advanced Applications, DASFAA 2007, Lecture Notes
    in Computer Science Vol. 4443, Bangkok, Thailand,
    pp. 1013-1019, (2007).
  • James CHENG and Wilfred NG. A Development of
    Hash-Lookup Trees to Support Querying Streaming
    XML. International Conference on Database Systems
    for Advanced Applications, DASFAA 2007, Lecture
    Notes in Computer Science Vol. 4443, Bangkok,
    Thailand, pp. 768-780, (2007).
  • Wilfred NG and James CHENG. An Efficient Index
    Lattice for XML Query Evaluation. International
    Conference on Database Systems for Advanced
    Applications, DASFAA 2007, Lecture Notes in
    Computer Science Vol. 4443, Bangkok, Thailand,
    pp. 753-767, (2007).
  • Wilfred NG and Ho-Lam LAU. A Co-Training
    Framework for Searching XML Documents.
    Information Systems, 32(3), pp. 477-503, (2007).
  • Yin YANG, Wilfred NG, Ho-Lam LAU and James CHENG.
    An Efficient Approach to Support Querying
    Secure Outsourced XML Information. Conference on
    Advanced Information Systems Engineering, CAiSE
    2006, Lecture Notes in Computer Science Vol.
    4007, Luxembourg, pp. 157-171, (2006).
  • Wilfred NG and Ho-Lam LAU. Effective Approaches
    for Watermarking XML Data. 10th International
    Conference on Database Systems for Advanced
    Applications, DASFAA 2005, Lecture Notes in
    Computer Science Vol. 3453, Beijing, China,
    pp. 68-80, (2005).
  • Ho-Lam LAU and Wilfred NG. A Unifying Framework
    for Merging and Evaluating XML Information. 10th
    International Conference on Database Systems for
    Advanced Applications, DASFAA 2005, Lecture Notes
    in Computer Science Vol. 3453, Beijing, China,
    pp. 81-94, (2005).