High Performance Database Lab - PowerPoint PPT Presentation

1
High Performance Database Lab
Ph.D. Candidates: Jun Zhang, Zhaohui Peng
Supervisor: Prof. Shan Wang
School of Information, Renmin University of China
Key Laboratory of Data Engineering and Knowledge Engineering, MOE
2006-05-25
2
Outline
  1. Introduction to Our Lab
  2. Introduction to KSORD
  3. Our work on KSORD

3
High Performance Database Lab
  • High Performance Database Lab, Key Laboratory of
    Data Engineering and Knowledge Engineering, MOE
  • Cooperation with NCR Teradata: Data Warehouse and
    Business Intelligence United Lab
  • Cooperation with HP Labs, China: Main Memory DBMS,
    Parallel DBMS
  • Composition: 7 faculty, 20 graduate students (9
    Ph.D. students), 3 research groups:
  •   1. Core Technology of Databases
  •   2. Grid Data Management
  •   3. Database and IR

4
High Performance Database Lab
  • 1 Core Technology of Databases
  •   (1) KingbaseES
  • A large-scale, general-purpose, highly efficient
    national relational DBMS of China with
    independent intellectual property rights
    (database expert company: Basesoft IT Ltd)
  • Awarded by the 863 research project of China;
    KingbaseES got the highest score in the national
    DBMS testing of China in 2005
  • KingbaseES is mainly applied in e-government,
    education, manufacturing, mobile communications,
    etc.
  •   (2) DBMS Self-tuning and Self-management
  • Self-configuration, self-optimization,
    self-healing, and self-protection make the DBMS
    more intelligent and available.
  •   (3) Main Memory DBMS, Parallel DBMS
  •   (4) XML RDBMS

5
High Performance Database Lab
  • 2 Grid Data Management
  •   (1) Information grid and platform construction
  •   (2) Massive Data Management in Grid
  •   (3) PDBMS
  • Existing P2P systems lack the data management
    capabilities that are typically found in a DBMS.
    PDBMS presents a flexible framework for sharing
    data across heterogeneous data sources. Our main
    research focuses on the quality of semantic
    mapping, query algorithms, and data consistency.
6
High Performance Database Lab
  • DB-IR
  • Information retrieval on relational
    databases
  •   (1) keyword search techniques
  •   (2) semantic keyword search
  •   (3) clustering search results
  •   (4) relevance feedback

7
Outline
  1. Introduction to Our Lab
  2. Introduction to KSORD
  3. Our work on KSORD

8
DB-IR is a Hot Topic
  • SIGMOD 2006: Effective Keyword Search in
    Relational Databases
  • SIGMOD 2005 Panel: Databases and Information
    Retrieval: Rethinking the Great Divide
  • VLDB 2005: Two Sessions (Session 18: DB and IR 1;
    Session 22: DB and IR 2)
  • VLDB 2004: DB-IR Tutorial
  • VLDB 2002 Tutorial: Text Search for Fine-grained
    Semi-structured Data
9
How to integrate DB and IR ?
  • Option 1: Tie together existing DB and IR systems
  • Example: Approaches based on SQL/MM
  • Option 2: Extend existing DB systems with IR
    functionality, or vice versa
  • Example: Add searching and ranking to an RDBMS
  • Option 3: Design a new data management system
    from the ground up
  • Example: Quark data management system

10
Why is KSORD Necessary ?
  • IR systems search unstructured data by keywords.
  • Results are usually imprecise and incomplete.
  • DB systems search structured data with SQL.
  • Results are sound and complete; all results are
    equally good.

Can we search databases with keywords? Yes.
Is searching databases with keywords necessary?
(Motivation)
11
Why is KSORD Necessary ?
  • Web users and casual users expect to query
    databases with free-form keyword queries.
  • Just like searching the Web with a search
    engine
  • Web users don't know the database schema or SQL
  • Hidden Web problem: search hidden databases with
    keyword queries.
  • Most data on the Web is stored in databases and
    is hidden from search engines; only a small part
    of the data on the Web can be found by search
    engines
  • There is a mismatch between the search
    interfaces of search engines and databases
  • If database systems support keyword search,
    publishing or searching a database on the Web is
    expected to become simpler and easier, and the
    deep Web problem can be alleviated.

12
Why is KSORD Necessary ?
  • Keyword query: a unified query language/interface
    for integrating diverse kinds of information
    systems
  • Modern information systems should manage many
    kinds of data:
  • structured relational data: SQL
  • semi-structured XML documents: XQuery
  • unstructured text documents: keyword query
  • Different query languages must be used to
    search different kinds of data

13
What is KSORD ?
Query: gray transaction

AUTHOR
      AuthorID  Name
  a1  A101      Donald D. Chamberlin
  a2  A102      Jim Gray
  a3  A103      Vera Watson

WRITE
      PaperID  AuthorID
  w1  P101     A101
  w2  P102     A102
  w3  P103     A102

PAPER
      PaperID  Title                                           Year  Type
  p1  P101     Specifying Queries as Relational Expressions    1974  inproceedings
  p2  P102     Transaction Processing Concepts and Techniques  1982  book
  p3  P103     Database and Transaction Processing Benchmarks  1992  inproceedings

Result: a2-w2-p2, a2-w3-p3
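The result on this slide can be reproduced with a minimal sketch. It assumes a single fixed join path (AUTHOR to WRITE to PAPER) and whole-word, case-insensitive matching; the function name `keyword_search` is hypothetical, and a real KSORD system would consider many join paths and rank the results.

```python
# Toy tables from the slide; the ids a1, w1, p1, ... are the slide's tuple labels.
AUTHOR = {"a1": ("A101", "Donald D. Chamberlin"),
          "a2": ("A102", "Jim Gray"),
          "a3": ("A103", "Vera Watson")}
WRITE = {"w1": ("P101", "A101"),
         "w2": ("P102", "A102"),
         "w3": ("P103", "A102")}
PAPER = {"p1": ("P101", "Specifying Queries as Relational Expressions"),
         "p2": ("P102", "Transaction Processing Concepts and Techniques"),
         "p3": ("P103", "Database and Transaction Processing Benchmarks")}


def keyword_search(query):
    """Return AUTHOR-WRITE-PAPER joined tuple trees that contain every keyword."""
    keywords = query.lower().split()
    results = []
    for w_id, (paper_id, author_id) in WRITE.items():
        # Follow both foreign keys of the WRITE tuple.
        a_id = next(k for k, v in AUTHOR.items() if v[0] == author_id)
        p_id = next(k for k, v in PAPER.items() if v[0] == paper_id)
        words = (AUTHOR[a_id][1] + " " + PAPER[p_id][1]).lower().split()
        if all(kw in words for kw in keywords):
            results.append(f"{a_id}-{w_id}-{p_id}")
    return results


print(keyword_search("gray transaction"))  # ['a2-w2-p2', 'a2-w3-p3']
```

Neither keyword occurs in a single tuple here; the answers only appear once tuples are joined through their key references, which is what distinguishes keyword search over relational data from plain text search.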
14
How to Realize KSORD ?
In integrating database and IR techniques,
we focus on how to implement IR in relational
databases.
  • Method 1: improve the DBMS
  • Improve the relational algebra model, query
    processor, and SQL to enhance proximity search,
    top-k, and semantic search in databases, e.g.
    [FR97] (PRA, ACM97), [IAE03] (Top-k, VLDB03),
    [DCE04] (OSS, VLDB2004)
  • A new-generation DBMS should have these
    properties
  • Method 2: middleware
  • Based on the full-text indexing provided by the
    RDBMS, keyword search over relational databases
    lets casual users search relational databases
    with keyword queries (a set of keywords), just
    like searching the Web, without any knowledge of
    the database schema or any need to write SQL
    queries
  • e.g. [GSV98] (VLDB98), [BHN02] (ICDE02),
    [HP02] (VLDB02, VLDB03), [BHP04] (VLDB04)

15
How to Realize KSORD ?
  • Method 2: middleware
  • (1) Offline systems
  • Retrieve results for a keyword query from an
    intermediate representation generated by
    "crawling" the database in advance. Offline
    systems execute queries efficiently, but they
    cannot see up-to-date data in time, and they need
    a long preprocessing time and a large amount of
    physical space to generate the intermediate
    representation.
  • EKSO [SU03] (Stanford03): indexes text objects
    (virtual documents)
  • DataSpot [DEG98] (VLDB98): builds an external
    graph-based hyperbase
  • DbSurfer [WLK03]: indexes the textual content of
    tuples as virtual web pages
  • ObjectRank [BHP04] (VLDB04): ObjectRank is
    similar to PageRank.
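The offline approach can be illustrated with a minimal sketch, assuming the crawl has already produced one "virtual document" per tuple. The names `virtual_docs`, `build_index`, and `lookup` are hypothetical, and none of the systems listed above is claimed to work exactly this way.

```python
from collections import defaultdict

# Hypothetical virtual documents produced by crawling the database once, offline.
virtual_docs = {
    "a2": "Jim Gray",
    "p2": "Transaction Processing Concepts and Techniques",
    "p3": "Database and Transaction Processing Benchmarks",
}


def build_index(docs):
    """Offline step: map each term to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def lookup(index, query):
    """Online step: ids of virtual documents that contain all keywords."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []


index = build_index(virtual_docs)
print(lookup(index, "transaction processing"))  # ['p2', 'p3']

# Data added after the crawl is invisible until the index is rebuilt.
virtual_docs["p4"] = "Transaction Recovery"  # hypothetical new tuple
print(lookup(index, "recovery"))                      # []
print(lookup(build_index(virtual_docs), "recovery"))  # ['p4']
```

The last three lines show the trade-off named on the slide: lookups are cheap, but the index goes stale as soon as the database changes.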

16
How to Realize KSORD ?
  • (2) Online systems
  • Convert a keyword query into many SQL queries
    and run them against the database itself. Online
    systems can retrieve the latest data from the
    database, but their execution may be inefficient
    because the converted SQL queries usually
    contain many join operators.
  • Data-graph-based method
  • BANKS [HAL02] (ICDE02): models the whole
    database as a graph
  • Schema-graph-based method
  • DBXplorer [ACD02] (ICDE02), DISCOVER [hr02]
    (VLDB02), EfficientIR [HR03] (VLDB03): model the
    database schema as a graph
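A minimal sketch of the online approach, using Python's built-in `sqlite3` module and the toy AUTHOR/WRITE/PAPER data from the earlier example slide. It assumes one hand-written join query stands in for a generated one, and it tries both assignments of keywords to relations; the table and function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (author_id TEXT, name TEXT);
    CREATE TABLE writes (paper_id TEXT, author_id TEXT);
    CREATE TABLE paper  (paper_id TEXT, title TEXT);
    INSERT INTO author VALUES
        ('A101', 'Donald D. Chamberlin'), ('A102', 'Jim Gray'), ('A103', 'Vera Watson');
    INSERT INTO writes VALUES ('P101', 'A101'), ('P102', 'A102'), ('P103', 'A102');
    INSERT INTO paper VALUES
        ('P101', 'Specifying Queries as Relational Expressions'),
        ('P102', 'Transaction Processing Concepts and Techniques'),
        ('P103', 'Database and Transaction Processing Benchmarks');
""")

# One join query for the path author -> writes -> paper.
CN_SQL = """
    SELECT a.author_id, p.paper_id
    FROM author AS a
    JOIN writes AS w ON w.author_id = a.author_id
    JOIN paper  AS p ON p.paper_id = w.paper_id
    WHERE a.name LIKE ? AND p.title LIKE ?
    ORDER BY p.paper_id
"""


def run_candidate_network(author_kw, paper_kw):
    """Run the join with one keyword bound to each end of the path."""
    params = (f"%{author_kw}%", f"%{paper_kw}%")
    return conn.execute(CN_SQL, params).fetchall()


def keyword_query(kw1, kw2):
    """Try both keyword-to-relation assignments and collect the rows."""
    rows = []
    for a_kw, p_kw in ((kw1, kw2), (kw2, kw1)):
        rows.extend(run_candidate_network(a_kw, p_kw))
    return rows


print(keyword_query("gray", "transaction"))  # [('A102', 'P102'), ('A102', 'P103')]
```

Even this two-keyword query over three tables needs two joins per SQL statement and two statements; with larger schemas the number of generated queries and joins grows quickly, which is the inefficiency the slide points out.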

17
What is Our Previous Work? (Basic Work)
  • Survey: Search relational databases with
    keywords [SK05] (JCST)
  • Prototype system implementations: BANKS, DISCOVER
  • Prototype study: IR-Style (DISCOVER II)
18
What is Our Previous Work? (Innovative Work)
  • SEEKER: Keyword-based Information Retrieval Over
    Relational Databases
  • DETECTOR: A universal database retrieval system
    based on dynamic database
  • Research on New Preprocessing Technology for
    Keyword Search in Databases
  • A Study of Content-based Search Techniques in
    Peer-to-Peer Network
  • A Study of Integration Techniques of
    Heterogeneous Information Sources Based on Grid

19
Outline
  1. Introduction to Our Lab
  2. Introduction to KSORD
  3. Our work on KSORD

20
What is Worth Doing?
  • How to Realize KSORD?
  • SEEKER (schema-graph-based online system, done)
  • DETECTOR (data-graph-based online system, done)
  • ITREKS (offline system, in progress)
  • How to Improve KSORD?
  • Efficiency
  • HUNTER (preprocessing techniques, done)
  • Effectiveness
  • Effective Keyword Search in Relational Databases
    (Fang Liu, SIGMOD06)

21
What are We Doing?
  • How to Improve KSORD?
  • Efficiency
  • QuickCN
  • PreCN: Preprocessing Candidate Networks for
    Efficient Keyword Search over Databases
    (submitted)
  • CLASCN: Candidate Network Selection for Efficient
    Top-k Keyword Queries over Databases (submitted)
  • JoinCN: in progress
  • Effectiveness
  • Result Representation
  • TreeCluster: Clustering Results of Keyword Search
    over Databases. WAIM2006, Hong Kong (accepted)
  • Semantic Search
  • Si-SEEKER: Ontology-based Semantic Search over
    Databases. KSEM2006, August 5, Guilin,
    China (accepted)

22
Our Framework
SemCN (Semantic Search, Ontology)
PreCN (Preprocessing)
CLASCN (Classification, Learning, and Selection)
JoinCN (DBMS, Top-k Algorithm)
23
SemCN
SemCN (Semantic Search, Ontology)
24
SemCN
25
SemCN
  • Basic Ideas
  • Exploit a domain ontology to construct semantic
    indexes in a relational database
  • Semantic indexes support semantic search, just
    as full-text indexes support keyword search.
  • Use the Generalized Vector Space Model to compute
    semantic similarity from the hierarchical
    structure of the domain ontology
  • Semantic search combined with keyword search

26
SemCN
27
SemCN
  • Domain Ontology
  • ACM CCS 1998 (1475 concepts, 2 relationships:
    subClassOf, relatedTo)
  • Data set: DBLP
  • Annotation
  • SIGMOD XML data set (477 annotated papers and
    1369 semantic index entries)
  • Crawling the ACM Digital Library: about 20,000
    papers (in progress)
  • Concept Extractor
  • A simple Concept Extractor (Stanford Parser)

28
SemCN
Domain Ontology: Computer Science
29
SemCN
  • How to compute semantic similarity?
  • Data Concept Vector and Query Concept Vector
  • Structured data: semantic indexes
  • D -> Concept Extractor -> D(C1, C2, C3, ...)
  • Keyword query: Concept Extractor
  • Q -> Concept Extractor -> Q(C1, C2, C3, ...)
  • For example:
  • Keyword query "Data Management" -> Q(H.2)
  • Data (paper title) "Enriching the conceptual
    basis for query formulation through relationship
    semantics in databases" -> D(H.2.1.1, H.2.3.4)
  • In the classic vector space model, the concept
    dimensions are supposed to be perpendicular to
    each other, so the dot product of two different
    concepts is zero?

30
SemCN
Generalized Vector Space Model [GMW03]
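A minimal sketch of why a generalized model helps, using the query Q(H.2) and the paper D(H.2.1.1, H.2.3.4) from the previous slide. The slides do not give the formula; the sketch assumes the depth-based dot product 2 * depth(LCA) / (depth(c1) + depth(c2)) between two concepts, which is how the GMW03 approach is understood here, plus unit concept weights and a virtual root at depth 0. All function names are hypothetical.

```python
import math


def ancestors(concept):
    """'H.2.1.1' -> ['H', 'H.2', 'H.2.1', 'H.2.1.1'] (path below the virtual root)."""
    parts = concept.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


def depth(concept):
    """Depth in the hierarchy; the virtual root has depth 0, 'H' has depth 1."""
    return len(concept.split("."))


def concept_dot(c1, c2):
    """Assumed dot product of two concepts: 2 * depth(LCA) / (depth(c1) + depth(c2))."""
    common = 0
    for x, y in zip(ancestors(c1), ancestors(c2)):
        if x != y:
            break
        common += 1
    return 2.0 * common / (depth(c1) + depth(c2))


def gvsm_similarity(q, d):
    """Cosine similarity where concept axes are not assumed to be orthogonal."""
    def dot(u, v):
        return sum(wu * wv * concept_dot(cu, cv)
                   for cu, wu in u.items() for cv, wv in v.items())
    denom = math.sqrt(dot(q, q) * dot(d, d))
    return dot(q, d) / denom if denom else 0.0


def classic_similarity(q, d):
    """Classic cosine similarity: different concepts contribute nothing."""
    dot = sum(w * d.get(c, 0.0) for c, w in q.items())
    norm = (math.sqrt(sum(w * w for w in q.values()))
            * math.sqrt(sum(w * w for w in d.values())))
    return dot / norm if norm else 0.0


Q = {"H.2": 1.0}
D = {"H.2.1.1": 1.0, "H.2.3.4": 1.0}
print(classic_similarity(Q, D))         # 0.0
print(round(gvsm_similarity(Q, D), 4))  # 0.7698
```

Under the classic model the query and the paper share no concept and score zero, although both paper concepts sit below H.2 in the hierarchy; the generalized model gives them a high similarity.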
31
SemCN
32
SemCN
Full-Text Indexes
Semantic Indexes
33
SemCN
Show the effectiveness
Efficiency is future work!
34
SemCN
Tuple Sets Merger
Score Normalization
Combine two kinds of Scores
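The three steps named on this slide could look like the following minimal sketch. The slide does not say how scores are normalized or combined, so min-max normalization, a union of the two tuple sets, and a linear combination with weight `alpha` are all assumptions, and every name is hypothetical.

```python
def min_max(scores):
    """Score normalization (assumed): rescale scores into [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def combine(keyword_scores, semantic_scores, alpha=0.5):
    """Merge the two tuple sets and linearly combine the normalized scores."""
    kw, sem = min_max(keyword_scores), min_max(semantic_scores)
    merged = set(kw) | set(sem)  # tuple sets merger: union of both result sets
    return {t: alpha * kw.get(t, 0.0) + (1 - alpha) * sem.get(t, 0.0)
            for t in merged}


# Hypothetical scores for three tuples.
keyword_scores = {"p1": 2.0, "p2": 6.0}
semantic_scores = {"p2": 0.2, "p3": 0.8}
print(sorted(combine(keyword_scores, semantic_scores).items()))
# [('p1', 0.0), ('p2', 0.5), ('p3', 0.5)]
```

Normalizing first matters because keyword scores and semantic scores live on different scales; without it, one kind of score would dominate the combination.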
35
SemCN
36
SemCN
37
PreCN
PreCN (Preprocessing)
38
PreCN
Fig. 2: Comparing Tts, Tcn and Tsql. Fix MaxCNsize =
5 and topk = 100, vary KeywNum
39
PreCN
40
PreCN
  • Basic Ideas
  • Preprocess the maximum Tuple Sets Graph to
    generate CNs in advance
  • Preprocess the schema information (which is
    stable), not the data itself (which changes).
  • Retrieve the set of CNs for a user query
    instead of generating CNs on the fly
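The idea of moving CN generation offline can be sketched as follows. This is a strong simplification and not the PreCN algorithm: candidate networks are reduced to simple join paths in a hypothetical three-relation schema graph, stored in a table keyed by their endpoint relations, and fetched at query time. Real CNs are trees of tuple sets and may visit a relation more than once.

```python
# Hypothetical schema graph: relation -> relations it joins with.
SCHEMA_GRAPH = {"AUTHOR": ["WRITE"],
                "WRITE": ["AUTHOR", "PAPER"],
                "PAPER": ["WRITE"]}


def precompute_paths(graph, max_size):
    """Offline step: enumerate simple join paths with at most max_size relations."""
    table = {}

    def extend(path):
        # Store each path once, in a canonical direction, keyed by its endpoints.
        canonical = tuple(path) if path[0] <= path[-1] else tuple(reversed(path))
        table.setdefault(frozenset((path[0], path[-1])), set()).add(canonical)
        if len(path) < max_size:
            for nxt in graph[path[-1]]:
                if nxt not in path:
                    extend(path + [nxt])

    for start in graph:
        extend([start])
    return table


def lookup_cns(table, keyword_relations):
    """Online step: fetch stored paths whose endpoints all match some keyword."""
    rels = sorted(keyword_relations)
    found = set()
    for i, r1 in enumerate(rels):
        for r2 in rels[i:]:
            found |= table.get(frozenset((r1, r2)), set())
    return sorted(found, key=lambda p: (len(p), p))


table = precompute_paths(SCHEMA_GRAPH, max_size=3)
print(lookup_cns(table, {"AUTHOR", "PAPER"}))
# [('AUTHOR',), ('PAPER',), ('AUTHOR', 'WRITE', 'PAPER')]
```

Because the table depends only on the schema graph and the size limit, it stays valid while the data changes, which is the point of preprocessing the schema rather than the data.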

41
PreCN
42
PreCN
43
PreCN
44
PreCN
45
PreCN
46
PreCN
47
PreCN
48
CLASCN
CLASCN (Classification, Learning, and Selection)
49
CLASCN
50
CLASCN
Important observation: the top-k results are
distributed over only a few CNs, while tens or
hundreds of CNs can be generated for a keyword
query. For example, for the DBLP
(http://dblp.uni-trier.de/) database, the top 100
results per user query are distributed over only 3
CNs on average, and the top 100 results for a large
number of user queries are distributed over only
about 22 of all CNs.
51
CLASCN
  • Basic Ideas
  • Each CN is viewed as a database
  • Construct a CN language model, CN(k1, k2, ..., km)
  • Compute the similarity between a user query
    Q(k1, k2, ..., km) and the CNs
  • Select the most promising CNs to produce the
    top-k results.
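The selection step can be sketched as a small ranking problem. The slides do not specify the language model or the similarity measure, so this sketch assumes a unigram term-count model per CN and a query log-likelihood with additive smoothing (`mu`); the CN names, term counts, and function names are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(query_terms, model, vocab_size, mu=1.0):
    """Smoothed log-probability that the CN's language model generates the query."""
    total = sum(model.values())
    return sum(math.log((model[t] + mu) / (total + mu * vocab_size))
               for t in query_terms)


def select_cns(query, cn_models, top_n=1):
    """Rank CNs by query likelihood and keep the top_n most promising ones."""
    terms = query.lower().split()
    vocab = set(terms).union(*(m.keys() for m in cn_models.values()))
    ranked = sorted(cn_models,
                    key=lambda cn: log_likelihood(terms, cn_models[cn], len(vocab)),
                    reverse=True)
    return ranked[:top_n]


# Hypothetical language models, e.g. learned from terms seen in each CN's results.
cn_models = {
    "AUTHOR-WRITE-PAPER": Counter(["gray", "transaction", "transaction",
                                   "chamberlin", "queries"]),
    "PAPER": Counter(["transaction", "database", "benchmarks"]),
}
print(select_cns("gray transaction", cn_models))  # ['AUTHOR-WRITE-PAPER']
```

Only the selected CNs are then turned into SQL and executed, which is how selection saves work when the top-k results come from a few CNs.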

52
CLASCN
53
CLASCN
54
CLASCN
55
CLASCN
Selecting
Learning
56
CLASCN
57
CLASCN
58
CLASCN
59
CLASCN
60
CLASCN
61
CLASCN
62
References(1)
[HW05] Yingjie He, Shan Wang. Efficient Top-k Query Processing in Pure Peer-to-Peer Network. Journal of Software, 2005, 16(4): 540-552.
[HFW04] Yingjie He, Yanfeng Su, Shan Wang, Xiaoyong Du. Efficient top-k query processing in P2P network. Database and Expert Systems Applications (DEXA 2004), Proceedings, Lecture Notes in Computer Science 3180, pp. 381-390, August 2004, Spain.
[SK05] Shan Wang, Kun-Long Zhang. Searching Databases with Keywords. Journal of Computer Science and Technology, 2005, 20(1): 55-62.
[WW05] Jijun Wen, Shan Wang. SEEKER: Keyword-based Information Retrieval Over Relational Databases. Journal of Software, 2005.
[WZ04] Shan Wang, Kunlong Zhang. Database system on the Grid. Journal of Computer Applications, 24(10): 1-3, 2004.
[MZW04] Xiaofeng Meng, Longxiang Zhou, Shan Wang. State of the Art and Trends in Database Research. Journal of Software, 2004, 15(12): 1822-1836.
[HYJ04] Yingjie He. A Study of Content-based Search Techniques in Peer-to-Peer Network. PhD thesis, Renmin University of China, 2004.
63
References(2)
[WEN05] Jijun Wen. Keyword-based Information Retrieval over Relational Databases. PhD thesis, Renmin University of China, 2005.
[ZKL05] Kunlong Zhang. Research on New Preprocessing Technology for Keyword Search in Databases. PhD thesis, Renmin University of China, 2005.
[YJL05] Jiali Yao. DETECTOR: A universal database retrieval system based on dynamic database. Master thesis, Renmin University of China, 2005.
[WYT05] Yunting Wang. A Study of Integration Techniques of Heterogeneous Information Sources Based on Grid. Master thesis, Renmin University of China, 2005.
[WAN05] Shan Wang et al. Database and Information System Research and Challenge (1988-2003 research reports). Higher Education Press, 2005.8.
64
References(3)
[FR97] Norbert Fuhr, Thomas Rolleke. A Probabilistic Relational Algebra for the Integration of Information Retrieval and Database Systems. ACM Transactions on Information Systems, 15(1): 32-66, 1997.
[IAE03] I. Ilyas, W. Aref, and A. Elmagarmid. Supporting Top-k Join Queries in Relational Databases. In Proceedings of the 29th International Conference on Very Large Data Bases, 2003.
[LCI05] Chengkai Li, Kevin Chen-Chuan Chang, Ihab F. Ilyas, Sumin Song. RankSQL: Query Algebra and Optimization for Relational Top-k Queries. SIGMOD 2005, pp. 131-142.
[DCE04] Souripriya Das, Eugene Inseok Chong, George Eadon, Jagannathan Srinivasan. Supporting Ontology-Based Semantic Matching in RDBMS. Proceedings of the Thirtieth International Conference on Very Large Data Bases, 2004, pp. 1054-1065.
[SB98] G. Salton, C. Buckley. Term-Weighting Approaches in Automatic Retrieval. Information Processing and Management, 24(5), 1998, pp. 513-523.
65
References(4)
[GSV98] R. Goldman, N. Shivakumar, S. Venkatasubramanian, and H. Garcia-Molina. Proximity Search in Databases. In Proceedings of the 24th International Conference on Very Large Databases, 1998.
[BHN02] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan. Keyword Searching and Browsing in Databases using BANKS. In Proceedings of the 18th International Conference on Data Engineering, 2002.
[ABP02] B. Aditya, Gaurav Bhalotia, Parag, Charuta Nakhe, Arvind Hulgeri, Soumen Chakrabarti, and S. Sudarshan. BANKS: Browsing and keyword searching in relational databases. In Proceedings of the 28th International Conference on Very Large Data Bases, 2002. Demonstration.
[ACD02] S. Agrawal, S. Chaudhuri, and G. Das. DBXplorer: A System For Keyword-Based Search Over Relational Databases. In Proceedings of the 18th International Conference on Data Engineering, 2002.
[HP02] V. Hristidis and Y. Papakonstantinou. DISCOVER: Keyword Search in Relational Databases. In Proceedings of the 28th International Conference on Very Large Data Bases, 2002.
66
References(5)
[HGP03] V. Hristidis, L. Gravano, and Y. Papakonstantinou. Efficient IR-Style Keyword Search over Relational Databases. In Proceedings of the 29th International Conference on Very Large Data Bases, 2003.
[SW03] Qi Su, Jennifer Widom. Efficient and Extensible Keyword Search over Relational Databases. Stanford University Technical Report, 2003.
[BHP04] A. Balmin, V. Hristidis, and Y. Papakonstantinou. ObjectRank: Authority-Based Keyword Search in Databases. In Proceedings of the 30th International Conference on Very Large Data Bases, 2004.
[SU03] Q. Su, J. Widom. Indexing relational database content offline for efficient keyword-based search. Technical Report, http://dbpubs.stanford.edu/pub/2003-13, Stanford University, 2003.
[DEG98] S. Dar, G. Entin, S. Geva, and E. Palmon. DTL's DataSpot: Database exploration using plain language. In Proceedings of the 24th International Conference on Very Large Databases, 1998.
[WLK03] R. Wheeldon, M. Levene, and K. Keenoy. Search and Navigation in Relational Databases. http://arxiv.org/abs/cs.DB/0307073
[GMW03] P. Ganesan, H. Garcia-Molina, and J. Widom. Exploiting Hierarchical Domain Structure to Compute Similarity. ACM Trans. Inf. Syst. 21(1), 2003, pp. 64-93.
67
High Performance Database Lab
Q & A
Thanks!
Ph.D. Candidates: Jun Zhang (zhangjun11_at_ruc.edu.cn),
Zhaohui Peng (pengch_at_ruc.edu.cn)
Supervisor: Prof. Shan Wang
School of Information, Renmin University of China
Key Laboratory of Data Engineering and Knowledge Engineering, MOE
2006-05-25