in collaboration with - PowerPoint PPT Presentation

About This Presentation
Title:

in collaboration with

Description:

in collaboration with Georgiana Ifrim, Gjergji Kasneci, Josiane Parreira, Maya Ramanath, Ralf Schenkel, Fabian Suchanek, Martin Theobald – PowerPoint PPT presentation

Slides: 45
Provided by: weik2

Transcript and Presenter's Notes

Title: in collaboration with


1
in collaboration with Georgiana Ifrim, Gjergji
Kasneci, Josiane Parreira, Maya Ramanath, Ralf
Schenkel, Fabian Suchanek, Martin Theobald
2
DB and IR: Two Parallel Universes
Database Systems (DB) vs. Information Retrieval (IR)
  • canonical application: DB: accounting; IR: libraries
  • data type: DB: numbers, short strings; IR: text
  • foundation: DB: algebraic / logic based; IR: probabilistic / statistics based
  • search paradigm: DB: Boolean retrieval (exact queries, result sets/bags); IR: ranked retrieval (vague queries, result lists)
  • market leaders: DB: Oracle, IBM DB2, MS SQL Server, etc.; IR: Google, Yahoo!, MSN, Verity, Fast, etc.
3
Why DB&IR Now? Application Needs
Simplify life for application areas like
  • Global health-care management for monitoring epidemics
  • News archives for journalists, press agencies, etc.
  • Product catalogs for houses, cars, vacation places, etc.
  • Customer support & CRM in insurance, telecom, retail, software, etc.
  • Bulletin boards for social communities
  • Enterprise search for projects, skills, know-how, etc.
  • Personalized & collaborative search in digital libraries, Web, etc.
  • Comprehensive archive of blogs with time-travel search

Typical data:
  Disease (DId, Name, Category, Pathogen, ...)
  UMLS-Categories (...)
  Patient (..., Age, HId, Date, Report, TreatedDId)
  Hospital (HId, Address, ...)
Typical query: symptoms of tropical virus diseases and reported anomalies with young patients in central Europe in the last two weeks
4
Why DB&IR Now? Platform Desiderata
  • Structured search (SQL, XQuery) on structured data (records): DB systems
  • Unstructured search (keywords) on unstructured data (documents): IR systems, search engines
  • Unstructured search (keywords) on structured data (records): keyword search on relational graphs (IIT Bombay, UCSD, MSR, Hebrew U, CU Hong Kong, Duke U, ...)
  • Structured search (SQL, XQuery) on unstructured data (documents): querying entities & relations from IE (MSR Beijing, UW Seattle, IBM Almaden, UIUC, MPI, ...)
5
Why DB&IR Forever?
Turn the Web, Web2.0, and Web3.0 into the world's most comprehensive knowledge base (semantic DB)!
  • Data enrichment at very large scale
  • Text and speech are key sources of knowledge production (publications, patents, conferences, meetings, ...)

6
Outline
Past
Matter, Antimatter, and Wormholes

Present
XML and Graph IR

Future
From Data to Knowledge

7
Quiz Time
Gerard Salton: in which country was he born (and did he grow up)?
A: USA  B: England  C: Netherlands  D: Germany  E: Singapore  F: Indonesia
Answer: D, Germany
Gerard Salton (1927-1995), Prof. at Cornell Univ. 1965-1995
8
Parallel Universes: A Closer Look
Matter (DB)
  • user = programmer
  • query = precise spec. of info request
  • interaction via API
  • strength: indexing, QP
  • weakness: user model
  • eval. measure: efficiency (throughput, response time, TPC-H, XMark, ...)
Antimatter (IR)
  • user = your kids
  • query = approximation of user's real info needs
  • interaction process via GUI
  • strength: ranking model
  • weakness: interoperability
  • eval. measure: effectiveness (precision, recall, F1, MAP, NDCG, TREC & INEX benchmarks, ...)

9
[Timeline figure, 1990 to 2005, spanning the range between DB and IR: Prob. DB (Cavallo & Pittarelli), Prob. Tuples (Barbara et al.), VAGUE (Motro), Prob. Datalog (Fuhr et al.), Proximal Nodes (Baeza-Yates et al.), WHIRL (Cohen), XPath, XPath Full-Text, INEX]
10
WHIRL: IR over Relations [W.W. Cohen: SIGMOD'98]
Add text-similarity selection and join to relational algebra
Example:
  Select * From Movies M, Reviews R
  Where M.Plot ~ 'fight' And M.Year > 1990 And R.Rating > 3
  And M.Title ~ R.Title And M.Plot ~ R.Comment

Movies
  Title | Plot | Year
  Matrix | In the near future ... computer hacker Neo ... fight training ... | 1999
  Hero | In ancient China ... fights ... sword fight ... fights Broken Sword ... | 2002
  Shrek 2 | In Far Far Away ... our lovely hero ... fights with cat killer ... | 2004

Reviews
  Title | Comment | Rating
  Matrix 1 | ... cool fights ... new techniques ... | 4
  Matrix Reloaded | ... fights ... and more fights ... fairly boring ... | 1
  Matrix Eigenvalues | ... matrix spectrum ... orthonormal ... | 5
  Ying xiong aka. Hero | ... fight for peace ... sword fight ... dramatic colors ... | 5

  • DB&IR for query-time data integration
  • More recent work: MinorThird, Spider, DBLife, etc.
  • But scoring models fairly ad hoc
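The text-similarity join in the example (M.Title matched approximately against R.Title) can be illustrated with a small sketch. It assumes tf-idf weighted term vectors compared by cosine similarity; the weighting formula and the function names (`tfidf_vectors`, `similarity_join`) are illustrative choices, not WHIRL's actual implementation.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Return one L2-normalised tf-idf vector (dict term -> weight) per text."""
    docs = [Counter(t.lower().split()) for t in texts]
    n = len(docs)
    df = Counter(term for d in docs for term in d)  # document frequency per term
    vectors = []
    for d in docs:
        vec = {t: tf * math.log(1 + n / df[t]) for t, tf in d.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine(u, v):
    """Dot product of two normalised sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def similarity_join(left, right, k=3):
    """Rank all (left, right) string pairs by cosine similarity, best first."""
    vecs = tfidf_vectors(left + right)
    lv, rv = vecs[:len(left)], vecs[len(left):]
    pairs = [(cosine(a, b), l, r)
             for l, a in zip(left, lv) for r, b in zip(right, rv)]
    pairs.sort(key=lambda p: p[0], reverse=True)
    return pairs[:k]

# Titles taken from the Movies and Reviews tables above.
movie_titles = ["Matrix", "Hero", "Shrek 2"]
review_titles = ["Matrix 1", "Matrix Reloaded", "Matrix Eigenvalues",
                 "Ying xiong aka. Hero"]
top = similarity_join(movie_titles, review_titles)
```

Because the join returns ranked pairs rather than exact matches, "Matrix Eigenvalues" still pairs with "Matrix"; a condition such as M.Plot ~ R.Comment would have to push that pair down, which is where the quality of the scoring model matters.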
11
XXL: Early XML IR
[Anja Theobald, GW: Adding Relevance to XML, WebDB'00]
Union of heterogeneous sources without global schema
Similarity-aware XPath: //Professor // SB //Course // IR //Research // XML
Which professors from Saarbruecken (SB) are
teaching IR and have research projects on XML?
12
XXL: Early XML IR
[Anja Theobald, GW: Adding Relevance to XML, WebDB'00]
Motivation: Union of heterogeneous sources has no schema
Similarity-aware XPath: //Professor // Saarbruecken //Course // IR //Research // XML
Which professors from Saarbruecken (SB) are teaching IR and have research projects on XML?
  • Scoring and ranking
      • tf*idf for content condition
      • ontological similarity for relaxed tag condition
      • score aggregation with probabilistic independence

13
The Past: Lessons Learned
  • DB&IR added flexible ranking to (semi-)structured querying to cope with schema and instance diversity
  • but ranking seems ad hoc and not consistently good in benchmarks
  • winning a benchmark needs tuning, but tuning is easier if ranking is principled!
  • ontologies are a mixed blessing: quality is diverse, concept similarity is subtle, danger of topic drift
  • ontology-based query expansion (into large disjunctions) poses an efficiency challenge

14
Outline
✓ Past
Matter, Antimatter, and Wormholes
Present
XML and Graph IR

Future
From Data to Knowledge

15
Quiz Time
Which is the largest XML data collection in the universe?
A: Yahoo! Answers  B: INEX benchmark  C: Derwent WPI  D: Elsevier Scopus  E: 51.com  F: Traffic violations in EU
Answer: C, Derwent WPI
16
TopX: 2nd Generation XML IR
[Martin Theobald, Ralf Schenkel, GW: VLDB'05, VLDB Journal]
  • Exploit tags & structure for better precision
  • Can relax tag names & structure for better recall
  • Principled ranking by probabilistic IR (Okapi BM25 for XML)
  • Efficient top-k query processing (using improved TA)
  • Robust ontology integration (self-throttling to avoid topic drift)
  • Efficient query expansion (on demand, by extended TA)
  • Relevance feedback for automatic query rewriting

Semantic XPath Full-Text query:
  /Article [ftcontains(//Person, "Max Planck")]
           [ftcontains(//Work, "quantum physics")]
  //Children[@Gender = "female"]//Birthdates
supported by TopX engine:
http://infao5501.ag5.mpi-sb.mpg.de:8080/topx/
http://topx.sourceforge.net
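The bullets above refer to top-k query processing with an improved TA (threshold algorithm). As a point of reference, the following is a minimal sketch of the textbook threshold algorithm, not TopX's improved or extended variant. It assumes one score list per query condition, summation as the aggregation function, and a score of 0 for items missing from a list; the item names and scores are made up.

```python
import heapq

def threshold_algorithm(lists, k):
    """Textbook TA over per-condition score lists (dict item -> score).

    Returns (top-k list of (item, total score), number of sorted-access rounds).
    """
    # Sorted access: each list ordered by descending score.
    sorted_lists = [sorted(l.items(), key=lambda kv: kv[1], reverse=True)
                    for l in lists]
    seen = {}
    depth = 0
    max_depth = max(len(s) for s in sorted_lists)
    while depth < max_depth:
        last_scores = []
        for s in sorted_lists:
            if depth < len(s):
                item, score = s[depth]
                last_scores.append(score)
                if item not in seen:
                    # Random access: look up the item's score in every list.
                    seen[item] = sum(l.get(item, 0.0) for l in lists)
            else:
                last_scores.append(0.0)  # exhausted list contributes nothing
        depth += 1
        # No unseen item can score higher than the sum of the last seen scores.
        threshold = sum(last_scores)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top, depth
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1]), depth

# Hypothetical scores of three documents for two query conditions.
lists = [
    {"d1": 0.9, "d2": 0.8, "d3": 0.1},
    {"d2": 0.7, "d3": 0.6, "d1": 0.2},
]
top, depth = threshold_algorithm(lists, k=1)
```

The point of the threshold test is early termination: here the best document is confirmed after two rounds of sorted access, without scanning the lists to the end.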

17
Commercial Break
[Martin Theobald, Ralf Schenkel, GW: VLDB'05]
TopX demo today, 3:30 - 5:30
18
Principled Ranking by Probabilistic IR
"God does not play dice." (Einstein) IR does.
binary features, conditional independence of features [Robertson & Sparck-Jones 1976]
related to but different from statistical language models
odds for item d with terms d_i being relevant for query q = {q_1, ..., q_m}:
[odds formula with per-term parameters p_i and q_i not preserved in this transcript]
  • Now estimate p_i and q_i values from
      • relevance feedback,
      • pseudo-relevance feedback,
      • corpus statistics
  • by MLE (with statistical smoothing)
  • and store precomputed p_i, q_i in index
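For reference, the odds under binary features and conditional independence are usually written in the textbook form of the binary independence model shown below. This is the standard formulation and may differ in notation from the slide; the symbols r_i, n_i, R and N (relevant items containing term i, all items containing term i, number of relevant items, collection size) and the 0.5 smoothing constants are assumptions taken from the common Robertson/Sparck-Jones estimate. Note that q_i here denotes a probability, not the i-th query term.

```latex
O(R \mid d, q) \;\propto\; \prod_{i:\; d_i = 1,\; t_i \in q}
    \frac{p_i \,(1 - q_i)}{q_i \,(1 - p_i)},
\qquad
p_i = P(d_i = 1 \mid R), \quad q_i = P(d_i = 1 \mid \bar{R})

% A common smoothed estimate from relevance feedback:
p_i \approx \frac{r_i + 0.5}{R + 1}, \qquad
q_i \approx \frac{n_i - r_i + 0.5}{N - R + 1}
```

Taking logarithms turns the product into a sum of per-term weights, which is why precomputed p_i and q_i values can be stored in the index.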

19
Probabilistic Ranking for SQL
[S. Chaudhuri, G. Das, V. Hristidis, GW: TODS'06]
SQL queries that return many answers need ranking
  • Examples
      • Houses (Id, City, Price, Rooms, View, Pool, SchoolDistrict, ...)
        Select * From Houses Where View = 'Lake' And City In ('Redmond', 'Bellevue')
      • Movies (Id, Title, Genre, Country, Era, Format, Director, Actor1, Actor2, ...)
        Select * From Movies Where Genre = 'Romance' And Era = '90s'

odds for tuple d with attributes X ∪ Y being relevant for query q: X1 = x1 ∧ ... ∧ Xm = xm
Estimate probs, exploiting workload W
  • Example: frequent queries
      • ... Where Genre = 'Romance' And Actor1 = 'Hugh Grant'
      • ... Where Actor1 = 'Hugh Grant' And Actor2 = 'Julia Roberts'
    boost HG and JR movies in ranking for Genre = 'Romance' And Era = '90s'
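The workload idea can be sketched in a few lines. This is a deliberately simplified stand-in for the probabilistic model of the TODS'06 paper: it only counts how often a tuple's unspecified attribute values occur in past queries and sums log(1 + frequency). The movie rows, titles, and the function names (`workload_frequencies`, `rank`) are hypothetical.

```python
import math
from collections import Counter

def workload_frequencies(workload):
    """Count how often each (attribute, value) condition occurs in past queries."""
    freq = Counter()
    for past_query in workload:
        freq.update(past_query.items())
    return freq

def rank(tuples, query, workload):
    """Return tuples matching the query, ordered by a workload-based score.

    Simplification: only attributes NOT specified in the query contribute,
    each with log(1 + workload frequency of its value).
    """
    freq = workload_frequencies(workload)
    matches = [t for t in tuples
               if all(t.get(a) == v for a, v in query.items())]

    def score(t):
        return sum(math.log(1 + freq[(a, v)])
                   for a, v in t.items() if a not in query)

    return sorted(matches, key=score, reverse=True)

# Hypothetical rows; the workload mirrors the two frequent queries above.
movies = [
    {"Title": "Movie B", "Genre": "Romance", "Era": "90s",
     "Actor1": "Some Actor", "Actor2": "Other Actor"},
    {"Title": "Movie A", "Genre": "Romance", "Era": "90s",
     "Actor1": "Hugh Grant", "Actor2": "Julia Roberts"},
    {"Title": "Movie C", "Genre": "Action", "Era": "90s",
     "Actor1": "Hugh Grant", "Actor2": "Other Actor"},
]
workload = [
    {"Genre": "Romance", "Actor1": "Hugh Grant"},
    {"Actor1": "Hugh Grant", "Actor2": "Julia Roberts"},
]
ranked = rank(movies, {"Genre": "Romance", "Era": "90s"}, workload)
```

With this toy workload, the romance from the 90s starring Hugh Grant and Julia Roberts is ranked ahead of the other matching row, which is the boosting effect the slide describes.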

20
From Tables and Trees to Graphs
[BANKS, Discover, DBExplorer, KUPS, SphereSearch, BLINKS]
Schema-agnostic keyword search over multiple tables: graph of tuples with foreign-key relationships as edges
Example:
  Conferences (CId, Title, Location, Year)
  Journals (JId, Title)
  CPublications (PId, Title, CId)
  JPublications (PId, Title, Vol, No, Year)
  Authors (PId, Person)
  Editors (CId, Person)
  Select * From * Where * Contains 'Gray, DeWitt, XML, Performance' And Year > 95
  • Related use cases
  • XML beyond trees
  • RDF graphs
  • ER graphs (e.g. from IE)
  • social networks

Result is a connected tree with nodes that contain as many query keywords as possible
Ranking: combines a nodeScore based on tf*idf or prob. IR with an edgeScore reflecting the importance of relationships (or confidence, authority, etc.)
Top-k querying: compute best trees, e.g. Steiner trees (NP-hard)
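Since exact Steiner trees are NP-hard, systems in this line of work rely on heuristics. The sketch below shows one simple heuristic under stated assumptions: run a breadth-first search from the nodes matching each keyword and pick the root whose summed hop distance to all keywords is smallest. It ignores nodeScore and edgeScore entirely, and the tuple graph, node texts, and function names are made up for illustration; it is not the algorithm of any of the systems listed above.

```python
from collections import deque

def bfs_distances(graph, sources):
    """Multi-source BFS: hop distance from the nearest source to each reachable node."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def best_roots(graph, node_text, keywords, k=1):
    """Return up to k (cost, root) pairs; cost = sum of distances to each keyword."""
    per_keyword = []
    for kw in keywords:
        hits = [n for n, text in node_text.items() if kw.lower() in text.lower()]
        per_keyword.append(bfs_distances(graph, hits))
    candidates = []
    for n in node_text:
        # A root qualifies only if it can reach a match for every keyword.
        if all(n in d for d in per_keyword):
            candidates.append((sum(d[n] for d in per_keyword), n))
    return sorted(candidates)[:k]

# Toy tuple graph: publications linked to author tuples via foreign keys.
node_text = {
    "p1": "XML query performance",
    "a1": "Gray",
    "a2": "DeWitt",
    "p2": "Transaction processing",
    "a3": "Gray",
}
graph = {  # undirected adjacency lists
    "p1": ["a1", "a2"], "a1": ["p1"], "a2": ["p1"],
    "p2": ["a3"], "a3": ["p2"],
}
result = best_roots(graph, node_text, ["Gray", "DeWitt", "XML"])
```

In this toy graph the publication tuple p1 is the best root, because it contains one keyword itself and is one hop away from the two author tuples that contain the others.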
21
The Present: Observations & Opportunities
  • Probabilistic IR and statistical language models yield principled ranking and high effectiveness (related to prob. relational models (Suciu, Getoor, ...) but different)
  • Structural similarity and ranking based on tree edit distance (FleXPath, Timber, ...)
  • Aim for comprehensive XML ranking model capturing content, structure, ontologies
  • Aim to generate structure skeleton in XPath query from user feedback
  • Good progress on performance, but still many open efficiency issues

22
Outline
✓ Past
Matter, Antimatter, and Wormholes
✓ Present
XML and Graph IR
Future
From Data to Knowledge

23
Quiz Time
Who said: "Information is not knowledge. Knowledge is not wisdom. Wisdom is not truth. Truth is not beauty. Beauty is not love. Love is not music. Music is the best."?
A: Richard Feynman  B: Sigmund Freud  C: Larry Page  D: Frank Zappa  E: Marie Curie  F: Lao-tse
Answer: D, Frank Zappa
24
Knowledge Queries
Turn the Web, Web2.0, and Web3.0 into the world's most comprehensive knowledge base (semantic DB)!
Answer knowledge queries such as
  • proteins that inhibit both protease and some other enzyme
  • neutron stars with X-ray bursts > 10^40 erg s^-1 & black holes in 10
  • differences in Rembetiko music from Greece and from Turkey
  • connection between Thomas Mann and Goethe
  • market impact of Web2.0 technology in December 2006
  • sympathy or antipathy for Germany from May to August 2006
  • Nobel laureate who survived both world wars and his children
  • drama with three women making a prophecy to a British nobleman that he will become king
25
Three Roads to Knowledge
  • Handcrafted High-Quality Knowledge Bases (Semantic-Web-style ontologies, encyclopedias, etc.)
  • Large-scale Information Extraction & Harvesting (using pattern matching, NLP, statistical learning, etc. for product search, Web entity/object search, ...)
  • Social Wisdom from Web 2.0 Communities (social tagging, folksonomies, human computing, e.g. del.icio.us, flickr, answers.yahoo, iknow.baidu, ...)

26
High-Quality Knowledge Sources
  • universal common-sense ontologies
      • SUMO (Suggested Upper Merged Ontology): 60 000 OWL axioms
      • Cyc: 5 Mio. facts (OpenCyc: 2 Mio. facts)
  • domain-specific ontologies
      • UMLS (Unified Medical Language System): 1 Mio. biomedical concepts, 135 categories, 54 relations (e.g. virus causes disease symptom)
      • GeneOntology, etc.
  • thesauri and concept networks
      • WordNet: 200 000 concepts (word senses) and hypernym/hyponym relations
      • can be cast into OWL-lite (or typed graph with statistical weights)
  • lexical sources
      • Wikipedia (1.8 Mio. articles, 40 Mio. links, 100 languages) etc.
  • hand-tagged natural-language corpora
      • TEI (Text Encoding Initiative) markup of historic encyclopedia
      • FrameNet: sentences classified into frames with semantic roles

growing with strong momentum
27
High-Quality Knowledge Sources
General-purpose thesauri and concept networks: WordNet family
  • can be cast into OWL-lite or into a graph, with weights for relation strengths (derived from co-occurrence statistics)

enzyme -- (any of several complex proteins that are produced by cells and act as catalysts in specific biochemical reactions)
  => protein -- (any of a large group of nitrogenous organic compounds that are essential constituents of living cells ...)
    => macromolecule, supermolecule
      ...
      => organic compound -- (any compound of carbon and another element or a radical)
  ...
  => catalyst, accelerator -- ((chemistry) a substance that initiates or accelerates a chemical reaction without itself being affected)
    => activator -- ((biology) any agency bringing about activation ...)

28
High-Quality Knowledge Sources
Wikipedia and other lexical sources
29
Exploit Hand-Crafted Knowledge
Wikipedia, WordNet, and other lexical sources
Infobox_Scientist
  name = Max Planck
  birth_date = April 23, 1858
  birth_place = Kiel, Germany
  death_date = October 4, 1947
  death_place = Göttingen, Germany
  residence = Germany
  nationality = Germany|German
  field = Physicist
  work_institution = University of Kiel; Humboldt-Universität zu Berlin; Georg-August-Universität Göttingen
  alma_mater = Ludwig-Maximilians-Universität München
  doctoral_advisor = Philipp von Jolly
  doctoral_students = Gustav Ludwig Hertz
  known_for = Planck's constant, Quantum mechanics|quantum theory
  prizes = Nobel Prize in Physics (1918)
30
YAGO: Yet Another Great Ontology
[F. Suchanek, G. Kasneci, GW: WWW 2007]
  • Turn Wikipedia into explicit knowledge base (semantic DB)
  • Exploit hand-crafted categories and templates
  • Represent facts as explicit knowledge triples: relation (entity1, entity2) (in 1st-order logic, compatible with RDF, OWL-lite, XML, etc.)
  • Map (and disambiguate) relations into WordNet concept DAG

Examples:
  relation | entity1 | entity2
  bornIn | Max_Planck | Kiel
  isInstanceOf | Kiel | City
31
YAGO Knowledge Representation
Accuracy: 97%
[Figure: YAGO knowledge graph in three layers. Concepts: Entity, Person, Location, Scientist, City, Country, Biologist, Physicist, linked by subclass edges. Individuals: Max_Planck, Erwin_Planck, Kiel, Nobel Prize, April 23, 1858, October 4, 1947, linked by instanceOf, bornIn, hasWon, FatherOf, bornOn, and diedOn edges. Words: "Dr. Planck", "Max Karl Ernst Ludwig Planck", "Max Planck", linked to individuals by means edges.]
Online access and download at http://www.mpi-inf.mpg.de/suchanek/yago/
32
NAGA: Graph IR on YAGO
[G. Kasneci et al.: WWW'07]
Graph-based search on YAGO-style knowledge bases with built-in ranking based on confidence and informativeness → statistical language model for result graphs
conjunctive queries
[Query graph: nodes x, scientist, Kiel; edges isa, bornIn]
queries with regular expressions
[Query graph: nodes Ling, x, scientist, y, Zhejiang, Beng Chin Ooi; edge labels isa, hasFirstName | hasLastName, worksFor, (coAuthor | advisor), locatedIn]
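To make the notion of a conjunctive query over knowledge triples concrete, here is a minimal sketch that evaluates a query such as "x isa scientist and x bornIn Kiel" by pattern matching with variable bindings. The triple layout (subject, relation, object), the "?x" variable convention, and the tiny fact set are assumptions for illustration; NAGA's query language, its regular-expression edges, and its ranking are not modelled.

```python
def match(pattern, triple, binding):
    """Extend binding so that pattern matches triple, or return None.

    Pattern components starting with '?' are variables.
    """
    new_binding = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            # Bind the variable, or check consistency with an earlier binding.
            if new_binding.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return new_binding

def query(triples, patterns):
    """Evaluate a conjunction of triple patterns; return all variable bindings."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for b in bindings:
            for t in triples:
                nb = match(pattern, t, b)
                if nb is not None:
                    extended.append(nb)
        bindings = extended
    return bindings

# Small illustrative fact set in (subject, relation, object) form.
facts = [
    ("Max_Planck", "bornIn", "Kiel"),
    ("Max_Planck", "isa", "scientist"),
    ("Albert_Einstein", "bornIn", "Ulm"),
    ("Albert_Einstein", "isa", "scientist"),
    ("Kiel", "isa", "city"),
]
answers = query(facts, [("?x", "isa", "scientist"), ("?x", "bornIn", "Kiel")])
```

Every binding that survives all patterns is an answer; a system like the one described would additionally score each answer graph instead of returning an unordered set.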
33
Ranking Factors
  • Confidence
  • Prefer results that are likely to be correct
  • Certainty of IE
  • Authenticity and Authority of Sources

bornIn (Max Planck, Kiel) from "Max Planck was born in Kiel" (Wikipedia)
livesIn (Elvis Presley, Mars) from "They believe Elvis hides on Mars" (Martian Bloggeria)
  • Informativeness
  • Prefer results that are likely important
  • May prefer results that are likely new to user
  • Frequency in answer
  • Frequency in corpus (e.g. Web)
  • Frequency in query log

q: isa (Einstein, y)
  isa (Einstein, scientist) vs. isa (Einstein, vegetarian)
q: isa (x, vegetarian)
  isa (Einstein, vegetarian) vs. isa (Al Nobody, vegetarian)
  • Compactness
  • Prefer results that are tightly connected
  • Size of answer graph

[Figure: answer graph with nodes vegetarian, Tom Cruise, Einstein, Bohr, Nobel Prize, 1962 and edges isa, bornIn, won, diedIn]
34
Information Extraction (IE) Text to Records
combine NLP, pattern matching, lexicons,
statistical learning
35
Knowledge Acquisition from the Web
  • Learn Semantic Relations from Entire Corpora at Large Scale (as exhaustively as possible but with high accuracy)
  • Examples
      • all cities, all basketball players, all composers
      • headquarters of companies, CEOs of companies, synonyms of proteins
      • birthdates of people, capitals of countries, rivers in cities
      • which musician plays which instruments
      • who discovered or invented what
      • which enzyme catalyzes which biochemical reaction

Existing approaches and tools (Snowball [Gravano et al. 2000], KnowItAll [Etzioni et al. 2004], ...): almost-unsupervised pattern matching and learning:
seeds (known facts) → patterns (in text) → (extraction) rule → (new) facts
36
Methods for Web-Scale Fact Extraction
seeds → text → rules → new facts
Example:
  seed | text | rule
  city (Seattle) | in downtown Seattle | in downtown X
  city (Seattle) | Seattle and other towns | X and other towns
  city (Las Vegas) | Las Vegas and other towns | X and other towns
  plays (Zappa, guitar) | playing guitar ... Zappa | playing Y ... X
  plays (Davis, trumpet) | Davis ... blows trumpet | X ... blows Y
Rules applied to new text yield new facts:
  in downtown Beijing → city (Beijing)
  Coltrane blows sax → plays (Coltrane, sax)
New facts then serve as seeds for new rules:
  city (Beijing) | old center of Beijing | old center of X
  plays (Coltrane, sax) | sax player Coltrane | Y player X
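The seeds → text → rules → new facts loop can be sketched for the unary city relation as follows. This is a toy illustration under strong assumptions: text snippets are matched as whole strings, a rule is a snippet with the seed replaced by a placeholder, and entities are runs of capitalised words. The placeholder "[X]" and the function names are hypothetical, and no pattern scoring or filtering (as real systems such as Snowball or KnowItAll need) is included.

```python
import re

PLACEHOLDER = "[X]"
# Assumption: an entity is one or more capitalised words, e.g. "Las Vegas".
ENTITY = r"([A-Z]\w+(?: [A-Z]\w+)*)"

def learn_patterns(seeds, snippets):
    """Turn every snippet that mentions a seed entity into a pattern."""
    patterns = set()
    for entity in seeds:
        for s in snippets:
            if entity in s:
                patterns.add(s.replace(entity, PLACEHOLDER))
    return patterns

def apply_patterns(patterns, snippets):
    """Match patterns against snippets and collect the entities they capture."""
    found = set()
    for p in patterns:
        prefix, suffix = p.split(PLACEHOLDER, 1)
        regex = re.escape(prefix) + ENTITY + re.escape(suffix)
        for s in snippets:
            m = re.fullmatch(regex, s)
            if m:
                found.add(m.group(1))
    return found

# Snippets taken from the example above.
snippets = [
    "in downtown Seattle",
    "Seattle and other towns",
    "Las Vegas and other towns",
    "in downtown Beijing",
]
seeds = {"Seattle"}
patterns = learn_patterns(seeds, snippets)
new_cities = apply_patterns(patterns, snippets) - seeds
```

Starting from the single seed city (Seattle), one iteration learns the two rules from the example and extracts Las Vegas and Beijing; feeding these back in as seeds is what makes the process iterative, and also what lets errors propagate if rules are not filtered.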
37
Performance of Web-IE
State-of-the-art precision/recall results (in percent):
  relation | precision | recall | corpus | systems
  countries | 80 | 90 | Web | KnowItAll
  cities | 80 | ??? | Web | KnowItAll
  scientists | 60 | ??? | Web | KnowItAll
  headquarters | 90 | 50 | News | Snowball, LEILA
  birthdates | 80 | 70 | Wikipedia | LEILA
  instanceOf | 40 | 20 | Web | Text2Onto, LEILA
  Open IE | 80 | ??? | Web | TextRunner
precision value-chain: entities 80, attributes 70, facts 60, events 50
Anecdotal evidence:
  invented (A.G. Bell, telephone)
  married (Hillary Clinton, Bill Clinton)
  isa (yoga, relaxation technique)
  isa (zearalenone, mycotoxin)
  contains (chocolate, theobromine)
  contains (Singapore sling, gin)

  invented (Johannes Kepler, logarithm tables)
  married (Segolene Royal, Francois Hollande)
  isa (yoga, excellent way)
  isa (your day, good one)
  contains (chocolate, raisins)
  plays (the liver, central role)
  makes (everybody, mistakes)
38
Beyond Surface Learning with LEILA
Learning to Extract Information by Linguistic Analysis [F. Suchanek, G. Ifrim, GW: KDD'06]
Limitation of surface patterns: who discovered or invented what
  "Tesla's work formed the basis of AC electric power ..."
  "Al Gore funded more work for a better basis of the Internet ..."
Almost-unsupervised Statistical Learning with Dependency Parsing
  • LEILA outperforms other Web-IE methods in terms of precision, recall, F1, but
      • dependency parser is slow
      • one relation at a time

39
IE Efficiency and Accuracy Tradeoffs
see also tutorials by Cohen, Doan/Ramakrishnan/Vaithyanathan, Agichtein/Sarawagi
IE is cool, but what's in it for DB folks?
  • precision vs. recall: two-stage processing (filter pipeline)
      • recall-oriented harvesting
      • precision-oriented scrutinizing
  • preprocessing
      • indexing: NLP trees & graphs, N-grams, PoS-tag patterns?
      • exploit ontologies? exploit usage logs?
  • turn crawl & extract into set-oriented query processing
      • candidate finding
      • efficient phrase, pattern, and proximity queries
      • optimizing entire text-mining workflows [Ipeirotis et al.: SIGMOD'06]

40
The Future: Challenges
  • Generalize YAGO approach (Wikipedia + WordNet)
      • Methods for comprehensive, highly accurate mappings across many knowledge sources
      • cross-lingual, cross-temporal
      • scalable in size, diversity, number of sources
  • Pursue DB support towards efficient IE (and NLP)
      • Achieve Web-scale IE throughput that can sustain rate of new content production (e.g. blogs)
      • with > 90% accuracy and Wikipedia-like coverage
  • Integrate handcrafted knowledge with NLP/ML-based IE
  • Incorporate social tagging and human computing

41
Outline
✓ Past
Matter, Antimatter, and Wormholes
✓ Present
XML and Graph IR
✓ Future
From Data to Knowledge
42
Major Trends in DB and IR
Database Systems | Information Retrieval
  malleable schema (later) | deep NLP, adding structure
  record linkage | info extraction
  graph mining | entity-relationship graph IR
  dataspaces | Web objects
  ontologies | statistical language models
  data uncertainty | ranking
  programmability | search as Web Service
  Web 2.0 | Web 2.0
43
Conclusion
  • DB&IR integration agenda
      • models: ranking, ontologies, prob. SQL?, graph IR?
      • languages and APIs: XQuery Full-Text?
      • systems: drop SQL, go light-weight? combine with P2P, Deep Web, ...?
  • Rethink progress measures and experimental methodology
  • Address killer app(s) and grand challenge(s)
      • from data to knowledge (Web, products, enterprises)
      • integrate knowledge bases, info extraction, social wisdom
      • cope with uncertainty; ranking as first-class principle
  • Bridge cultural differences between DB and IR
      • co-locate SIGIR and SIGMOD

44
DB&IR: Both Sides Now
Joni Mitchell (1969): Both Sides Now
  I've looked at life from both sides now,
  From up and down, and still somehow
  It's life's illusions I recall.
  I really don't know life at all.
Thank You!