Joint work with - PowerPoint PPT Presentation


Transcript and Presenter's Notes

Title: Joint work with


1
Joint work with Georgiana Ifrim, Gjergji Kasneci, Thomas Neumann, Maya Ramanath, Fabian Suchanek
2
Vision
Opportunity: Turn the Web (and Web 2.0 and Web 3.0 ...) into the world's most comprehensive knowledge base
  • Approach
  • 1) harvest and combine
  • hand-crafted knowledge sources
  • (Semantic Web, ontologies)
  • automatic knowledge extraction
  • (Statistical Web, text mining)
  • social communities and human computing
  • (Social Web, Web 2.0)
  • 2) express knowledge queries, search, and rank
  • 3) everything efficient and scalable

3
Why Google and Wikipedia Are Not Enough
Answer knowledge queries such as:
  • proteins that inhibit proteases and other human enzymes
  • connection between Thomas Mann and Goethe
  • German Nobel prize winner who survived both world wars and all of his four children
  • politicians who are also scientists
4
Why Google and Wikipedia Are Not Enough
Which politicians are also scientists?
  • What is lacking?
  • Information is not Knowledge.
  • Knowledge is not Wisdom.
  • Wisdom is not Truth.
  • Truth is not Beauty.
  • Beauty is not Music.
  • Music is the best.
  • (Frank Zappa)
  • extract facts from Web pages
  • capture user intention by concepts, entities, relations

5
Related Work
[Figure: map of related systems across three research areas]
Areas: information extraction & ontology building; Web entity search & QA; semistructured IR & graph search
Systems: Cimple/DBlife, Libra, TextRunner, START, Avatar, Answers, UIMA, Hakia, Powerset, Freebase, Cyc, EntityRank, DBpedia, TopX, XQ-FT, Yago, Naga, Tijah, SPARQL, DBexplorer, Banks, SWSE
6
Outline
✓ Motivation
Information Extraction & Knowledge Harvesting (YAGO)
Ranking for Search over Entity-Relation Graphs (NAGA)
Efficient Query Processing (RDF-3X)
Conclusion
7
High-Quality Knowledge Sources
General-purpose ontologies and thesauri: WordNet family
  • 200 000 concepts and relations
  • can be cast into
  • description logics or
  • graph, with weights for relation strengths (derived from co-occurrence statistics)

scientist, man of science (a person with advanced knowledge)
  > cosmographer, cosmographist
  > biologist, life scientist
  > chemist
  > cognitive scientist
  > computer scientist
  ...
  > principal investigator, PI
HAS INSTANCE
  > Bacon, Roger Bacon
8
Exploit Hand-Crafted Knowledge
Wikipedia, WordNet, and other lexical sources
Infobox_Scientist
name = Max Planck
birth_date = April 23, 1858
birth_place = Kiel, Germany
death_date = October 4, 1947
death_place = Göttingen, Germany
residence = Germany
nationality = German
field = Physicist
work_institution = University of Kiel, Humboldt-Universität zu Berlin, Georg-August-Universität Göttingen
alma_mater = Ludwig-Maximilians-Universität München
doctoral_advisor = Philipp von Jolly
doctoral_students = Gustav Ludwig Hertz
known_for = Planck's constant, quantum theory
prizes = Nobel Prize in Physics (1918)
9
Exploit Hand-Crafted Knowledge
Wikipedia, WordNet, and other lexical sources
10
YAGO: Yet Another Great Ontology (F. Suchanek et al., WWW'07)
  • Turn Wikipedia into explicit knowledge base
    (semantic DB)
  • keep source pages as witnesses
  • Exploit hand-crafted categories and infobox
    templates
  • Represent facts as explicit knowledge triples
  • relation (entity1, entity2)
  • (in FOL, compatible with RDF, OWL-lite, XML,
    etc.)
  • Map (and disambiguate) relations into WordNet
    concept DAG

Examples:
relation        entity1       entity2
bornIn          Max_Planck    Kiel
isInstanceOf    Kiel          City
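
The step from infobox attributes to explicit relation (entity1, entity2) triples can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not YAGO's actual extractor: the mapping table `ATTRIBUTE_TO_RELATION`, the helper `infobox_to_triples`, the relation name `diedIn`, and the simplified infobox values are all hypothetical.

```python
# Hypothetical mapping from infobox attribute names to relation names.
ATTRIBUTE_TO_RELATION = {
    "birth_place": "bornIn",
    "birth_date": "bornOn",
    "death_place": "diedIn",
    "death_date": "diedOn",
}


def infobox_to_triples(entity, infobox):
    """Turn infobox attributes into (relation, entity1, entity2) triples.

    Attributes without a known relation are skipped.
    """
    triples = []
    for attribute, value in infobox.items():
        relation = ATTRIBUTE_TO_RELATION.get(attribute)
        if relation is not None:
            triples.append((relation, entity, value))
    return triples


# Simplified, hand-written infobox (assumption: values already cleaned).
planck_infobox = {
    "name": "Max Planck",
    "birth_date": "April 23, 1858",
    "birth_place": "Kiel",
    "residence": "Germany",
}

facts = infobox_to_triples("Max_Planck", planck_infobox)
for relation, entity1, entity2 in facts:
    print(f"{relation} ({entity1}, {entity2})")
```

Attributes that have no entry in the mapping (here `name` and `residence`) are simply dropped, which mirrors the idea of extracting only relations that are explicitly modeled.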
11
YAGO Knowledge Base (F. Suchanek et al., WWW'07)
Accuracy ≈ 95%
[Figure: excerpt of the YAGO knowledge graph in three layers]
concepts: Entity, Person, Location, Scientist, City, Country, Biologist, Physicist (connected by subclass edges)
individuals: Max_Planck, Erwin_Planck, Kiel, Nobel Prize (edges: instanceOf, bornIn, hasWon, FatherOf; bornOn April 23, 1858; diedOn October 4, 1947)
words: "Dr. Planck", "Max Karl Ernst Ludwig Planck", "Max Planck" (connected to the individual by means edges)
Online access and download at http://www.mpi-inf.mpg.de/suchanek/yago/
12
Wikipedia Harvesting: Difficulties & Solutions
  • instanceOf relation: misleading and difficult category names
  • (disputed articles, particle physics, American Music of the 20th Century, Nobel laureates in physics, naturalized citizens of the United States, ...)
  • → noun group parser; ignore when head word in singular
  • isA relation: mapping categories onto WordNet classes
  • Nobel laureates in physics → Nobel_laureates, people from Kiel → person
  • → map to (singular of) head; exploit synsets and statistics
  • Entity name ambiguities
  • St. Petersburg, Saint Petersburg, M31, NGC224 → means ...
  • → exploit Wikipedia redirects & disambiguations, WN synsets
  • type checking for scrutinizing candidates
  • accept fact candidate only if arguments have proper classes
  • marriedTo (Max Planck, quantum physics) ∉ Person × Person
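
The type-checking rule in the last bullets (accept a fact candidate only if its arguments have the proper classes) can be illustrated with a minimal Python sketch. Everything named here is an assumption for the example: the tables `SIGNATURES`, `INSTANCE_OF`, and `SUBCLASS_OF`, the class names, and the simplification to a single-parent class hierarchy (the slides describe a concept DAG).

```python
# Hypothetical relation signatures: relation -> (class of arg 1, class of arg 2).
SIGNATURES = {
    "marriedTo": ("person", "person"),
    "bornIn": ("person", "city"),
}

# Hypothetical instanceOf and subclass facts (single parent per class for brevity).
INSTANCE_OF = {
    "Max_Planck": "physicist",
    "Kiel": "city",
    "quantum_physics": "discipline",
}
SUBCLASS_OF = {
    "physicist": "scientist",
    "scientist": "person",
}


def has_class(entity, target):
    """Return True if the entity's class equals target or is a subclass of it."""
    cls = INSTANCE_OF.get(entity)
    while cls is not None:
        if cls == target:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def accept(relation, arg1, arg2):
    """Accept a fact candidate only if both arguments have the proper classes."""
    if relation not in SIGNATURES:
        return False
    domain, range_ = SIGNATURES[relation]
    return has_class(arg1, domain) and has_class(arg2, range_)


print(accept("bornIn", "Max_Planck", "Kiel"))
print(accept("marriedTo", "Max_Planck", "quantum_physics"))
```

The second candidate is rejected because `quantum_physics` is not a person, which is the situation the marriedTo example on the slide describes.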

13
Higher-Order Facts in YAGO
[Figure: two CapitalOf facts involving Bonn, Berlin, and Germany]
14
Outline
✓ Motivation
✓ Information Extraction & Knowledge Harvesting (YAGO)
Ranking for Search over Entity-Relation Graphs (NAGA)
Efficient Query Processing (RDF-3X)
Conclusion
15
NAGA: Graph Search (G. Kasneci et al., ICDE'08)
Graph-based search on YAGO-style knowledge bases
with built-in ranking based on confidence and
informativeness
[Figure: four kinds of example query graphs]
discovery queries: nodes x, politician, scientist; edges isa
connectedness queries: nodes Thomas Mann, German novelist, Goethe; edges isa
complex queries (with regular expressions): nodes x, scientist, p, computer science, u, university, Switzerland; edges isa, wonPrize, inField, worksAt, graduatedFrom, locatedIn
queries over reified facts: nodes c, city, Germany, 1988; edges isa, capitalOf, validIn
16
Search Results Without Ranking
q: Fisher isa scientist, Fisher isa x
@Fisher = Ronald_Fisher, @scientist = scientist_109871938, X = alumnus_109165182
@Fisher = Irving_Fisher, @scientist = scientist_109871938, X = social_scientist_109927304
@Fisher = James_Fisher, @scientist = scientist_10981938, X = ornithologist_109711173
@Fisher = Ronald_Fisher, @scientist = scientist_109871938, X = theorist_110008610
@Fisher = Ronald_Fisher, @scientist = scientist_109871938, X = colleague_109301221
@Fisher = Ronald_Fisher, @scientist = scientist_109871938, X = organism_100003226
mathematician_109635652   subClassOf>   scientist_109871938
Alumni_of_Gonville_and_Caius_College,_Cambridge   subClassOf>   alumnus_109165182
"Fisher"   familyNameOf>   Ronald_Fisher
Ronald_Fisher   type>   Alumni_of_Gonville_and_Caius_College,_Cambridge
Ronald_Fisher   type>   20th_century_mathematicians
"scientist"   means>   scientist_109871938

17
Ranking with Statistical Language Model
q: Fisher isa scientist, Fisher isa x
@Fisher = Ronald_Fisher, @scientist = scientist_109871938, X = mathematician_109635652
@Fisher = Ronald_Fisher, @scientist = scientist_109871938, X = statistician_109958989
@Fisher = Ronald_Fisher, @scientist = scientist_109871938, X = president_109787431
@Fisher = Ronald_Fisher, @scientist = scientist_109871938, X = geneticist_109475749
@Fisher = Ronald_Fisher, @scientist = scientist_109871938, X = scientist_109871938
Score: 7.184462521168058E-13
mathematician_109635652   subClassOf>   scientist_109871938
"Fisher"   familyNameOf>   Ronald_Fisher
Ronald_Fisher   type>   20th_century_mathematicians
"scientist"   means>   scientist_109871938
20th_century_mathematicians   subClassOf>   mathematician_109635652
→ statistical language model for result graphs
Online access at http://www.mpi-inf.mpg.de/kasneci/naga/
18
Ranking Factors
  • Confidence
  • Prefer results that are likely to be correct
  • Certainty of IE
  • Authenticity and Authority of Sources

bornIn (Max Planck, Kiel) from "Max Planck was born in Kiel" (Wikipedia)
livesIn (Elvis Presley, Mars) from "They believe Elvis hides on Mars" (Martian Bloggeria)
  • Informativeness
  • Prefer results that are likely important
  • May prefer results that are likely new to user
  • Frequency in answer
  • Frequency in corpus (e.g. Web)
  • Frequency in query log

q: isa (Einstein, y)
  isa (Einstein, scientist)
  isa (Einstein, vegetarian)
q: isa (x, vegetarian)
  isa (Einstein, vegetarian)
  isa (Al Nobody, vegetarian)
  • Compactness
  • Prefer results that are tightly connected
  • Size of answer graph

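[Figure: two answer graphs connecting Einstein and Bohr]
compact: Einstein won Nobel Prize, Bohr won Nobel Prize
loose: Einstein isa vegetarian, Tom Cruise isa vegetarian, Tom Cruise bornIn 1962, Bohr diedIn 1962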
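
How confidence and informativeness might interact for the two isa queries on this slide can be sketched as follows. This is a toy scoring function, not NAGA's statistical language model, whose details are not given here. The witness counts, the confidence values, and the choice to multiply confidence by the relative frequency among matching facts are all assumptions made for illustration.

```python
# Hypothetical witness counts: how often each fact was seen in a corpus.
WITNESS_COUNTS = {
    ("isa", "Einstein", "scientist"): 900,
    ("isa", "Einstein", "vegetarian"): 100,
    ("isa", "Al_Nobody", "vegetarian"): 1,
}

# Hypothetical extraction confidence per fact, in [0, 1].
CONFIDENCE = {
    ("isa", "Einstein", "scientist"): 0.95,
    ("isa", "Einstein", "vegetarian"): 0.90,
    ("isa", "Al_Nobody", "vegetarian"): 0.90,
}


def matches(fact, pattern):
    """A pattern position that is None acts as a variable."""
    return all(p is None or p == f for f, p in zip(fact, pattern))


def rank(pattern):
    """Rank matching facts by confidence * relative frequency among the matches."""
    candidates = [f for f in WITNESS_COUNTS if matches(f, pattern)]
    total = sum(WITNESS_COUNTS[f] for f in candidates)
    scored = [(CONFIDENCE[f] * WITNESS_COUNTS[f] / total, f) for f in candidates]
    return [f for _, f in sorted(scored, reverse=True)]


# q: isa (Einstein, y)
print(rank(("isa", "Einstein", None)))
# q: isa (x, vegetarian)
print(rank(("isa", None, "vegetarian")))
```

With these made-up numbers, "scientist" ranks above "vegetarian" for Einstein, and Einstein ranks above Al Nobody among vegetarians, matching the orderings shown on the slide.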
19
NAGA Example
Query: x isa politician, x isa scientist
Results: Benjamin Franklin, Paul Wolfowitz, Angela Merkel
20
Outline
✓ Motivation
✓ Information Extraction & Knowledge Harvesting (YAGO)
✓ Ranking for Search over Entity-Relation Graphs (NAGA)
Efficient Query Processing (RDF-3X)
Conclusion
21
Why RDF? Why a New DB Engine?
[Figure: knowledge-graph excerpt with classes Person, Location, Scientist, City, Physicist (subclass edges), individuals Max_Planck, Erwin_Planck, Kiel, Nobel Prize (edges instanceOf, bornIn, FatherOf, hasWon), and dates Apr 23, 1858 (bornOn) and Oct 4, 1947 (diedOn)]
(id1, Name, Max Planck), (id1, bornOn, 23 Apr 1858), (id1, bornIn, id2),
(id2, Name, Kiel), (id2, locatedIn, id3), (id3, Name, Germany),
(id1, FatherOf, id4), (id4, Name, Erwin Planck), ...
  • RDF triples (subject, property/predicate, value/object)
  • pay-as-you-go: schema-agnostic or schema later
  • RDF triples form fine-grained ER graph
  • queries bound to need many star-joins and long chain-joins
  • physical design critical, but hardly predictable workload

22
SPARQL Query Language
SPJ combinations of triple patterns
Example: Select ?c Where { ?p isa scientist . ?p bornIn ?t . ?p hasWon ?a . ?t inCountry ?c . ?a Name NobelPrize }
options for filter predicates, duplicate handling, wildcard join, etc.
Example: Select Distinct ?c Where { ?p ?r1 ?t . ?t ?r2 ?c . ?c isa <country> . ?p bornOn ?b . Filter (?b > 1945) }
support for RDFS types
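
To make "combinations of triple patterns" concrete, here is a small Python sketch that evaluates the first example query against a handful of triples. It is a naive nested-loop matcher written for illustration, not how a SPARQL engine is implemented. The sample data in `TRIPLES` (including the identifier `prize1`) and the helper names `is_var`, `match_pattern`, and `evaluate` are hypothetical.

```python
# Hypothetical sample data as (subject, predicate, object) triples.
TRIPLES = [
    ("Max_Planck", "isa", "scientist"),
    ("Max_Planck", "bornIn", "Kiel"),
    ("Max_Planck", "hasWon", "prize1"),
    ("Kiel", "inCountry", "Germany"),
    ("prize1", "Name", "NobelPrize"),
    ("Erwin_Planck", "bornIn", "Berlin"),
    ("Berlin", "inCountry", "Germany"),
]


def is_var(term):
    """Variables are written with a leading question mark, as in SPARQL."""
    return term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Extend binding so that pattern matches triple, or return None."""
    new_binding = dict(binding)
    for term, value in zip(pattern, triple):
        if is_var(term):
            # Bind the variable, or check consistency with an earlier binding.
            if new_binding.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new_binding


def evaluate(patterns, triples):
    """Join all triple patterns with a naive nested-loop strategy."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in triples
            if (extended := match_pattern(pattern, triple, binding)) is not None
        ]
    return bindings


query = [
    ("?p", "isa", "scientist"),
    ("?p", "bornIn", "?t"),
    ("?p", "hasWon", "?a"),
    ("?t", "inCountry", "?c"),
    ("?a", "Name", "NobelPrize"),
]
countries = [b["?c"] for b in evaluate(query, TRIPLES)]
print(countries)
```

The three patterns sharing `?p` form a star-join, while `?p bornIn ?t . ?t inCountry ?c` is a short chain-join.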
23
RDF-3X: a RISC-style Engine
  • Design rationale
  • RDF-specific engine
  • Simplify operations
  • Reduce implementation choices
  • Optimize for common case
  • Eliminate tuning knobs
  • Key principles
  • Mapping dictionary for encoding all literals
    into ids
  • Exhaustive indexing of id triples
  • Index-only store, high compression
  • QP mostly merge joins with order-preservation
  • Very fast DP-based query optimizer
  • Frequent-paths synopses, property-value
    histograms

Benchmarks on > 50 Mio. triples: << 100 ms response times for queries with > 10 joins
Three important things in DBS: performance, performance, performance!
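
Three of the key principles above (a mapping dictionary, exhaustive indexing of id triples, and merge joins over sorted data) can be illustrated with a toy in-memory sketch in Python. It only mirrors the ideas, not the engine's actual storage layout, compression, or optimizer. The sample triples and all function names (`build_dictionary`, `build_indexes`, `scan`, `merge_join`) are assumptions made for the example.

```python
from bisect import bisect_left
from itertools import permutations

# Hypothetical sample data as (subject, predicate, object) triples.
TRIPLES = [
    ("Max_Planck", "isa", "scientist"),
    ("Max_Planck", "hasWon", "Nobel_Prize"),
    ("Niels_Bohr", "isa", "scientist"),
    ("Niels_Bohr", "hasWon", "Nobel_Prize"),
    ("Thomas_Mann", "hasWon", "Nobel_Prize"),
    ("Angela_Merkel", "isa", "scientist"),
]


def build_dictionary(triples):
    """Map every distinct term to an integer id."""
    ids = {}
    for triple in triples:
        for term in triple:
            ids.setdefault(term, len(ids))
    return ids


def build_indexes(id_triples):
    """Keep one sorted copy of the id triples per ordering (SPO, SOP, PSO, ...)."""
    indexes = {}
    for order in permutations(range(3)):
        name = "".join("SPO"[i] for i in order)
        indexes[name] = sorted(tuple(t[i] for i in order) for t in id_triples)
    return indexes


def scan(index, prefix):
    """Return the sorted third components of all entries with the given 2-id prefix."""
    start = bisect_left(index, prefix)
    values = []
    for entry in index[start:]:
        if entry[:2] != prefix:
            break
        values.append(entry[2])
    return values


def merge_join(left, right):
    """Intersect two sorted, duplicate-free id lists in one linear pass."""
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            result.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return result


ids = build_dictionary(TRIPLES)
names = {i: term for term, i in ids.items()}
indexes = build_indexes([tuple(ids[term] for term in t) for t in TRIPLES])

# ?x isa scientist . ?x hasWon Nobel_Prize, answered from the POS ordering.
scientists = scan(indexes["POS"], (ids["isa"], ids["scientist"]))
winners = scan(indexes["POS"], (ids["hasWon"], ids["Nobel_Prize"]))
winners_who_are_scientists = [names[i] for i in merge_join(scientists, winners)]
print(winners_who_are_scientists)
```

Because every ordering is kept sorted, both scans already deliver subject ids in order, so the join needs no extra sorting step. That is the intuition behind "merge joins with order-preservation".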
24
Outline
✓ Motivation
✓ Information Extraction & Knowledge Harvesting (YAGO)
✓ Ranking for Search over Entity-Relation Graphs (NAGA)
✓ Efficient Query Processing (RDF-3X)
Conclusion
25
Large-Scale Knowledge Gathering
Turn Web (2.0, 3.0, ...) into world's most comprehensive knowledge base
  • Semantic Web: ontologies, encyclopedia
  • Statistical Web: info extraction, text mining
  • Social Web: Web 2.0 communities, human computing
26
Thank You!