1
In collaboration with Giorgiana Ifrim, Gjergji
Kasneci, Josiane Parreira, Maya Ramanath, Ralf
Schenkel, Fabian Suchanek, Martin Theobald
2
Vision
Opportunity: Turn the Web (and Web 2.0 and Web
3.0 ...) into the world's most comprehensive
knowledge base (semantic DB)
Challenge: seize the opportunity and make it happen!
Approach: combine and exploit synergies of
  • hand-crafted, high-quality knowledge sources
    → Semantic Web
  • automatic knowledge extraction
    → Statistical Web
  • social networks and human computing
    → Social Web

3
Proof of Relevance
Vannevar Bush, "As We May Think", 1945:
"There is a growing mountain of research. A
memex is a device in which an individual stores
all his books, records, and communications, and
which is mechanized so that it may be consulted
with exceeding speed and flexibility. It is an
enlarged intimate supplement to his memory."
4
Proof of Relevance
Tim Berners-Lee: In the Semantic Web information
is given well-defined meaning.
Jim Gray: ... system can answer questions
about the text as precisely and quickly
as a human expert.
Brewster Kahle: The goal of universal access to
our cultural heritage is within our grasp.
Jimmy Wales: Our big-picture vision is to share
knowledge with all of humanity.
Al Gore: The future will be better
tomorrow.
5
Proof of Relevance ?
"To know that we know what we know, and that we
do not know what we do not know, that is true
knowledge."
"You cannot open a book without learning
something."
"A journey of a thousand miles begins with a
single step."
Confucius, 551-479 BC
"Ignorance is less remote from the truth than
prejudice."
"When science, art, literature, and philosophy are
simply the manifestation of personality, they
can make a man's name live for thousands of
years."
"Sentences are like sharp nails, which force truth
upon our memories."
Denis Diderot, 1713-1784
6
Why Google and Wikipedia Are Not Enough
Turn the Web, Web2.0, and Web3.0 into the world's
most comprehensive knowledge base (semantic
DB/graph)!
Answer knowledge queries such as:
  • proteins that inhibit both protease and some
    other enzyme
  • neutron stars with X-ray bursts > 10^40 erg s^-1
    black holes in 10
  • differences in Rembetiko music from Greece and
    from Turkey
  • connection between Thomas Mann and Goethe
  • market impact of Web2.0 technology in December
    2006
  • sympathy or antipathy for Germany from May to
    August 2006
  • Nobel laureate who survived both world wars and
    his children
  • drama with three women making a prophecy to a
    British nobleman that he will become king
7
Outline
Introduction: Search for Knowledge
Harvesting Knowledge
  • Leibniz Approach
  • Planck Approach
  • Darwin Approach
Conclusion

8
Three Roads to Knowledge
Leibniz Approach: Handcrafted High-Quality
Knowledge Sources (Semantic Web)
Planck Approach: Large-scale Information
Extraction & Harvesting (Statistical Web)
Darwin Approach: Social Wisdom from Web 2.0
Communities (Social Web)
9
Leibniz Approach (Semantic Web)
  • Handcrafted High-Quality Knowledge
  • Ontologies and other Lexical Sources
  • Build on Rigorous Knowledge Atoms
  • (Characteristica Universalis)

Gottfried Wilhelm Leibniz (1646 - 1716)
10
High-Quality Knowledge Sources
General-purpose ontologies for Semantic Web:
SUMO, Cyc, etc.
11
High-Quality Knowledge Sources
General-purpose thesauri and concept networks:
WordNet family
woman, adult female -- (an adult female person)
  => amazon, virago -- (a large strong and
     aggressive woman)
  => donna -- (an Italian woman of rank)
  => geisha, geisha girl -- (...)
  => lady -- (a polite name for any woman)
  ...
  => wife -- (a married woman, a man's partner
     in marriage)
  => witch -- (a being, usually female, imagined
     to have special powers derived from the devil)
12
High-Quality Knowledge Sources
General-purpose thesauri and concept networks
WordNet family
  • 200 000 concepts and relations
  • can be cast into
  • description logics or
  • graph, with weights for relation strengths
  • (derived from co-occurrence statistics)

enzyme -- (any of several complex proteins that
are produced by cells and act as catalysts in
specific biochemical reactions)
  => protein -- (any of a large group of
     nitrogenous organic compounds that are
     essential constituents of living cells ...)
    => macromolecule, supermolecule
    ...
    => organic compound -- (any compound of carbon
       and another element or a radical)
  ...
  => catalyst, accelerator -- ((chemistry) a
     substance that initiates or accelerates a
     chemical reaction without itself being
     affected)
    => activator -- ((biology) any agency bringing
       about activation ...)
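
One way to read "graph, with weights for relation strengths" is as a weighted hypernym graph. The following is a minimal sketch, not WordNet's actual API: the name `HYPERNYMS`, the function `ancestors`, and all edge weights are assumptions made for illustration (real weights would be derived from co-occurrence statistics), and only edges shown directly on the slide are included.

```python
# Hypernym edges taken from the slide; the weights are illustrative placeholders,
# not values from WordNet or from the talk.
HYPERNYMS = {
    "enzyme": [("protein", 1.0), ("catalyst", 0.5)],
    "protein": [("macromolecule", 0.5)],
    "catalyst": [("activator", 0.5)],
}


def ancestors(graph, concept):
    """Return every reachable hypernym with the strength of its best path
    (the product of the edge weights along that path)."""
    best = {}
    stack = [(concept, 1.0)]
    while stack:
        node, strength = stack.pop()
        for parent, weight in graph.get(node, []):
            path_strength = strength * weight
            if path_strength > best.get(parent, 0.0):
                best[parent] = path_strength
                stack.append((parent, path_strength))
    return best


print(ancestors(HYPERNYMS, "enzyme"))
```

Multiplying weights along a path is only one possible way to combine relation strengths; it is used here because it keeps distant concepts weaker than direct hypernyms.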

13
High-Quality Knowledge Sources
Domain ontologies (UMLS, GeneOntology, etc.)
  • 1 million biomedical concepts, 135 categories,
  • 54 relationships (e.g. virus causes (disease
    symptom) )

14
High-Quality Knowledge Sources
Wikipedia and other lexical sources
  • 2 million articles
  • 40 million hyperlinks
  • many 1000s of categories and lists
  • more than 100 languages
  • growing very fast

15
Exploit Hand-Crafted Knowledge
Wikipedia, WordNet, and other lexical sources
Infobox_Scientist
  name              = Max Planck
  birth_date        = April 23, 1858
  birth_place       = Kiel, Germany
  death_date        = October 4, 1947
  death_place       = Göttingen, Germany
  residence         = Germany
  nationality       = German
  field             = Physicist
  work_institution  = University of Kiel,
                      Humboldt-Universität zu Berlin,
                      Georg-August-Universität Göttingen
  alma_mater        = Ludwig-Maximilians-Universität München
  doctoral_advisor  = Philipp von Jolly
  doctoral_students = Gustav Ludwig Hertz
  known_for         = Planck's constant, quantum theory
  prizes            = Nobel Prize in Physics (1918)
16
YAGO: Yet Another Great Ontology [F. Suchanek, G.
Kasneci, G. Weikum: WWW'07]
  • Turn Wikipedia into explicit knowledge base
    (semantic DB)
  • Exploit hand-crafted categories and templates
  • Represent facts as explicit knowledge triples
    relation (entity1, entity2)
    (in FOL, compatible with RDF, OWL-lite, XML,
    etc.)
  • Map (and disambiguate) relations into WordNet
    concept DAG

Examples:
relation       entity1      entity2
bornIn         Max_Planck   Kiel
isInstanceOf   Kiel         City
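
The triple representation relation (entity1, entity2) can be sketched in a few lines of Python. This is an illustration only, not YAGO's storage format: the `Fact` type and the `lookup` helper are hypothetical names, and the two facts are the examples from the slide.

```python
from collections import namedtuple

# One knowledge triple: relation (entity1, entity2).
Fact = namedtuple("Fact", ["relation", "entity1", "entity2"])

FACTS = [
    Fact("bornIn", "Max_Planck", "Kiel"),
    Fact("isInstanceOf", "Kiel", "City"),
]


def lookup(facts, relation=None, entity1=None, entity2=None):
    """Return all facts that agree with every argument that is not None."""
    wanted = Fact(relation, entity1, entity2)
    return [
        fact for fact in facts
        if all(w is None or w == v for w, v in zip(wanted, fact))
    ]


print(lookup(FACTS, entity1="Kiel"))
```

Leaving an argument as `None` turns it into a wildcard, so the same helper answers "where was Max_Planck born?" and "what is Kiel an instance of?".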
17
YAGO Knowledge Representation
Accuracy ≈ 97%
[Figure: knowledge graph in three layers.
Concepts: Entity, Person, Location, Scientist,
City, Country, Biologist, Physicist, linked by
subclass edges.
Individuals: Max_Planck, Erwin_Planck, Kiel, Nobel
Prize, April 23, 1858, October 4, 1947, linked by
instanceOf, bornIn, hasWon, FatherOf, bornOn,
diedOn edges.
Words: "Dr. Planck", "Max Karl Ernst Ludwig
Planck", "Max Planck", linked to individuals by
means edges.]
Online access and download at
http://www.mpi-inf.mpg.de/suchanek/yago/
18
YAGO Disambiguation & Uncertainty
capture confidence value for each fact
[Figure: YAGO graph with a confidence value
between 0 and 1 on every edge.
Concepts: Entity, Person, Location, City, Country,
Mythological Figure, Celebrity (subclass edges,
confidence 1.0).
Individuals: Paris(Myth.), Paris(France), France,
Paris Hilton (instanceOf and locatedIn edges,
confidences between 0.4 and 1.0).
Words: "Paris", "France", "La Grande Nation"
(means edges, confidences between 0.05 and 0.95).]
additional harvesting of relations from
natural-language texts by info-extraction tools
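
A minimal sketch of how per-fact confidence values might be used to disambiguate a word such as "Paris". Everything here is an assumption for illustration: the tables `MEANS` and `INSTANCE_OF`, the function `resolve`, the concrete numbers, and the choice to multiply confidences (which treats the two facts as independent). It is not YAGO's actual disambiguation procedure.

```python
# word -> {individual: confidence of the "means" edge}; numbers are illustrative.
MEANS = {
    "Paris": {"Paris(France)": 0.7, "Paris(Myth.)": 0.1, "Paris Hilton": 0.2},
}

# (individual, concept) -> confidence of the "instanceOf" edge; numbers are illustrative.
INSTANCE_OF = {
    ("Paris(France)", "City"): 1.0,
    ("Paris(Myth.)", "Mythological Figure"): 0.9,
    ("Paris Hilton", "Celebrity"): 0.8,
}


def resolve(word, concept=None):
    """Pick the most confident individual for a word, optionally requiring
    that the individual is an instance of the given concept."""
    scores = {}
    for individual, confidence in MEANS.get(word, {}).items():
        if concept is not None:
            confidence *= INSTANCE_OF.get((individual, concept), 0.0)
        scores[individual] = confidence
    best = max(scores, key=scores.get, default=None)
    if best is None or scores[best] <= 0.0:
        return None
    return best


print(resolve("Paris"))
print(resolve("Paris", "Celebrity"))
```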
19
NAGA: Graph IR on YAGO [G. Kasneci et al.: WWW'07]
Graph-based search on YAGO-style knowledge bases
with built-in ranking based on statistical
language model
discovery queries
[Figure: query graphs over variables x, y, a, b
with edges isa, bornIn, hasWon, diedOn, hasSon and
the comparison >; constants: scientist, Kiel,
Nobel prize]
connectedness queries
[Figure: query graph connecting Thomas Mann and
Goethe; labels: isa, German novelist]
queries with regular expressions
[Figure: query graph over variables x, y; labels:
Ling, isa, scientist, hasFirstName hasLastName,
worksFor, (coAuthor advisor), locatedIn, Zhejiang,
Beng Chin Ooi]
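
A discovery query such as "x isa scientist, x bornIn Kiel" can be evaluated by matching each query edge against the stored facts while keeping variable bindings consistent. The sketch below is a naive backtracking matcher written for illustration; it is not NAGA's query processor and does no ranking. The `$` prefix for variables, the function name `match`, and the small fact list are assumptions.

```python
# A tiny fact base of (subject, relation, object) triples, for illustration only.
FACTS = [
    ("Max_Planck", "isa", "scientist"),
    ("Max_Planck", "bornIn", "Kiel"),
    ("Max_Planck", "hasWon", "Nobel_Prize"),
    ("Albert_Einstein", "isa", "scientist"),
    ("Albert_Einstein", "bornIn", "Ulm"),
]


def match(query, facts, binding=None):
    """Yield every variable binding that satisfies all edges of the query.
    Terms starting with '$' are variables; everything else is a constant."""
    binding = binding or {}
    if not query:
        yield dict(binding)
        return
    (subj, pred, obj), rest = query[0], query[1:]
    for f_subj, f_pred, f_obj in facts:
        if f_pred != pred:
            continue
        candidate = dict(binding)
        consistent = True
        for term, value in ((subj, f_subj), (obj, f_obj)):
            if term.startswith("$"):
                # Bind the variable, or check it against an earlier binding.
                consistent = candidate.setdefault(term, value) == value
            else:
                consistent = term == value
            if not consistent:
                break
        if consistent:
            yield from match(rest, facts, candidate)


query = [("$x", "isa", "scientist"), ("$x", "bornIn", "Kiel")]
print(list(match(query, FACTS)))
```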
20
NAGA: Searching Knowledge
q: Fisher isa scientist, Fisher isa x
@Fisher         @scientist            X
Ronald_Fisher   scientist_109871938   alumnus_109165182
Irving_Fisher   scientist_109871938   social_scientist_109927304
James_Fisher    scientist_109871938   ornithologist_109711173
Ronald_Fisher   scientist_109871938   theorist_110008610
Ronald_Fisher   scientist_109871938   colleague_109301221
Ronald_Fisher   scientist_109871938   organism_100003226
mathematician_109635652 --subClassOf--> scientist_109871938
Alumni_of_Gonville_and_Caius_College,_Cambridge --subClassOf--> alumnus_109165182
"Fisher" --familyNameOf--> Ronald_Fisher
Ronald_Fisher --type--> Alumni_of_Gonville_and_Caius_College,_Cambridge
Ronald_Fisher --type--> 20th_century_mathematicians
"scientist" --means--> scientist_109871938

21
NAGA: Searching & Ranking Knowledge
q: Fisher isa scientist, Fisher isa x
@Fisher         @scientist            X
Ronald_Fisher   scientist_109871938   mathematician_109635652
Ronald_Fisher   scientist_109871938   statistician_109958989
Ronald_Fisher   scientist_109871938   president_109787431
Ronald_Fisher   scientist_109871938   geneticist_109475749
Ronald_Fisher   scientist_109871938   scientist_109871938
Score: 7.184462521168058E-13
mathematician_109635652 --subClassOf--> scientist_109871938
"Fisher" --familyNameOf--> Ronald_Fisher
Ronald_Fisher --type--> 20th_century_mathematicians
"scientist" --means--> scientist_109871938
20th_century_mathematicians --subClassOf--> mathematician_109635652
Online access at http://www.mpi-inf.mpg.de/kasneci/naga/
22
Ranking Factors
  • Confidence
    Prefer results that are likely to be correct
    • Certainty of IE
    • Authenticity and Authority of Sources
    bornIn (Max Planck, Kiel) from "Max Planck was
    born in Kiel" (Wikipedia)
    livesIn (Elvis Presley, Mars) from "They believe
    Elvis hides on Mars" (Martian Bloggeria)
  • Informativeness
    Prefer results that are likely important
    May prefer results that are likely new to user
    • Frequency in answer
    • Frequency in corpus (e.g. Web)
    • Frequency in query log
    q: isa (Einstein, y)
       isa (Einstein, scientist)
       isa (Einstein, vegetarian)
    q: isa (x, vegetarian)
       isa (Einstein, vegetarian)
       isa (Al Nobody, vegetarian)
  • Compactness
    Prefer results that are tightly connected
    • Size of answer graph
    [Figure: answer graphs connecting Einstein, Bohr,
    Tom Cruise, vegetarian, Nobel Prize, and 1962 via
    isa, bornIn, won, and diedIn edges]
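
The informativeness factor can be illustrated with a frequency-based score: among all facts that match a query, prefer the one seen most often in the corpus. This is a simplified sketch under stated assumptions, not NAGA's actual statistical language model; the `COUNTS` table holds made-up witness counts and `rank` is a hypothetical helper.

```python
# Hypothetical number of corpus witnesses per fact (not real statistics).
COUNTS = {
    ("Einstein", "isa", "scientist"): 1000,
    ("Einstein", "isa", "vegetarian"): 10,
    ("Al Nobody", "isa", "vegetarian"): 1,
}


def rank(query, counts):
    """Rank the facts matching a query by their share of all matching witnesses.
    The query is a (subject, relation, object) triple with None for the variable."""
    matching = {
        fact: n for fact, n in counts.items()
        if all(q is None or q == v for q, v in zip(query, fact))
    }
    total = sum(matching.values())
    scored = [(fact, n / total) for fact, n in matching.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# q: isa (Einstein, y)
print(rank(("Einstein", "isa", None), COUNTS))
# q: isa (x, vegetarian)
print(rank((None, "isa", "vegetarian"), COUNTS))
```

With these counts, "scientist" ranks first for the Einstein query and Einstein ranks first for the vegetarian query, which mirrors the ordering on the slide.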
23
Summary of Leibniz Approach
Hand-crafted knowledge sources are great assets,
but expensive, partial, and isolated
Great mileage even from informal & semiformal
sources
Connecting & reconciling different sources gives
added value (and sometimes is not even that hard)
  • Challenge:
    Develop methods for comprehensive, highly
    accurate mappings across many knowledge sources
    • Cross-lingual
    • Cross-temporal
    • Scalable

24
Planck Approach (Statistical Web)
  • Information Extraction & Harvesting
  • Gather Entities, Relations, Facts
  • Live with Uncertainty

Max Planck (1858 - 1947)
25
Information Extraction (IE): Text to Records
combine NLP, pattern matching, lexicons,
statistical learning
26
IE Technology: Rules, Patterns, Learning
  • For natural-language text and for heterogeneous
    sources
  • NLP techniques (parser, PoS tagging) for
    tokenization
  • identify patterns (e.g. regular expressions) as
    features
  • train statistical learners for segmentation and
    labeling
  • use learned model to automatically tag newly
    seen input

Training data:
The WWW conference takes place in Banff in Canada.
Today's keynote speaker is Dr. Berners-Lee from
W3C. The panel in Edinburgh, chaired by Ron
Brachman from Yahoo!, ...
(tags: <location>, <organization>, <person>,
<event>)
Ian Foster, father of the Grid, talks at the GES
conference in Germany on 05/02/07.
(tags: <person>, <event>, <location>, <date>)
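
To make the tagging step concrete, here is a toy tagger that labels the second training sentence using a small lexicon plus one regular expression. It is a deliberately simple stand-in for illustration: the slide describes trained statistical learners, which this sketch does not implement. The `LEXICON` entries are copied from the training sentences, and the names `LEXICON`, `DATE`, and `tag` are assumptions.

```python
import re

# Entity names copied from the training sentences; a real system would learn
# a model instead of relying on a hand-written lexicon.
LEXICON = {
    "person": ["Ian Foster", "Ron Brachman", "Dr. Berners-Lee"],
    "event": ["WWW conference", "GES conference"],
    "location": ["Banff", "Canada", "Edinburgh", "Germany"],
    "organization": ["W3C", "Yahoo!"],
}

# A regular expression used as a pattern feature for dates like 05/02/07.
DATE = re.compile(r"\d{2}/\d{2}/\d{2}")


def tag(text):
    """Return (mention, label) pairs in the order they occur in the text."""
    spans = []
    for label, names in LEXICON.items():
        for name in names:
            for m in re.finditer(re.escape(name), text):
                spans.append((m.start(), m.group(), label))
    for m in DATE.finditer(text):
        spans.append((m.start(), m.group(), "date"))
    return [(mention, label) for _, mention, label in sorted(spans)]


sentence = ("Ian Foster, father of the Grid, talks at the GES "
            "conference in Germany on 05/02/07.")
print(tag(sentence))
```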
27
Knowledge Acquisition from the Web
  • Learn Semantic Relations from Entire Corpora at
    Large Scale
  • (as exhaustively as possible but with high
    accuracy)
  • Examples
  • all cities, all basketball players, all
    composers
  • headquarters of companies, CEOs of companies,
    synonyms of proteins
  • birthdates of people, capitals of countries,
    rivers in cities
  • which musician plays which instruments
  • who discovered or invented what
  • which enzyme catalyzes which biochemical
    reaction

Existing approaches and tools use
almost-unsupervised pattern matching and learning:
seeds (known facts) → patterns (in text) →
(extraction) rule → (new) facts
28
Methods for Web-Scale Fact Extraction
seeds → text → rules → new facts

Example:
seed fact               text                        rule
city (Seattle)          in downtown Seattle         in downtown X
city (Seattle)          Seattle and other towns     X and other towns
city (Las Vegas)        Las Vegas and other towns   X and other towns
plays (Zappa, guitar)   playing guitar Zappa        playing Y X
plays (Davis, trumpet)  Davis blows trumpet         X blows Y

Applying the rules to new text yields new facts:
in downtown Delhi       → city (Delhi)
Coltrane blows sax      → plays (Coltrane, sax)

The new facts in turn yield new rules:
new fact                text                        rule
city (Delhi)            old center of Delhi         old center of X
plays (Coltrane, sax)   sax player Coltrane         Y player X
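
The seeds, text, rules, new facts loop can be sketched as one round of bootstrapping. This is a minimal illustration under simplifying assumptions, not the method of any particular system: patterns are whole phrases with `<X>` and `<Y>` placeholders, X is assumed to be a capitalized name and Y a single word, and the function names are hypothetical. The seeds and phrases are the examples from the slide.

```python
import re

# Assumed shapes of the arguments: X is a capitalized name, Y is one word.
ENTITY = {"X": r"[A-Z]\w*(?: [A-Z]\w*)*", "Y": r"\w+"}


def induce_patterns(seeds, sentences):
    """Replace seed arguments found in a sentence by placeholders."""
    patterns = set()
    for relation, args in seeds:
        for sentence in sentences:
            if all(arg in sentence for arg in args):
                pattern = sentence
                for var, arg in zip(("<X>", "<Y>"), args):
                    pattern = pattern.replace(arg, var)
                patterns.add((relation, pattern))
    return patterns


def to_regex(pattern):
    """Compile a placeholder pattern into a regex with named groups X and Y."""
    parts = re.split(r"(<X>|<Y>)", pattern)
    return re.compile("".join(
        "(?P<%s>%s)" % (p[1], ENTITY[p[1]]) if p in ("<X>", "<Y>") else re.escape(p)
        for p in parts
    ))


def extract(patterns, sentences):
    """Apply the induced rules to unseen text and collect new facts."""
    facts = set()
    for relation, pattern in patterns:
        regex = to_regex(pattern)
        for sentence in sentences:
            m = regex.fullmatch(sentence)
            if m:
                groups = m.groupdict()
                args = tuple(groups[v] for v in ("X", "Y") if v in groups)
                facts.add((relation, args))
    return facts


seeds = [("city", ("Seattle",)), ("city", ("Las Vegas",)),
         ("plays", ("Zappa", "guitar")), ("plays", ("Davis", "trumpet"))]
text = ["in downtown Seattle", "Seattle and other towns",
        "Las Vegas and other towns", "playing guitar Zappa",
        "Davis blows trumpet"]
rules = induce_patterns(seeds, text)
print(extract(rules, ["in downtown Delhi", "Coltrane blows sax"]))
```

Feeding the extracted facts back in as seeds gives the next round, which is how phrases such as "old center of Delhi" would produce further rules.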
29
Performance of Web-IE
State-of-the-art precision/recall results
relation     precision  recall  corpus     systems
countries    80%        90%     Web        KnowItAll
cities       80%        ???     Web        KnowItAll
scientists   60%        ???     Web        KnowItAll
CEOs         80%        50%     News       Snowball, LEILA
birthdates   80%        70%     Wikipedia  LEILA
instanceOf   40%        20%     Web        Text2Onto, LEILA
precision value-chain: entities 80%, attributes
70%, facts 60%, events 50%
Anecdotal evidence:
invented (A.G. Bell, telephone)
married (Hillary Clinton, Bill Clinton)
isa (yoga, relaxation technique)
isa (zearalenone, mycotoxin)
contains (chocolate, theobromine)
contains (Singapore sling, gin)

invented (Johannes Kepler, logarithm tables)
married (Segolene Royal, Francois Hollande)
isa (yoga, excellent way)
isa (your day, good one)
contains (chocolate, raisins)
plays (the liver, central role)
makes (everybody, mistakes)
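
For reference, precision and recall over sets of extracted facts can be computed as follows. The sketch uses a few of the anecdotal facts above; which facts count as correct (`gold`) is an assumption made for this example, not an evaluation result from the talk.

```python
def precision_recall(extracted, gold):
    """precision = correct / extracted, recall = correct / gold."""
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Facts an extractor returned (two useful, two noisy).
extracted = {
    ("isa", "yoga", "relaxation technique"),
    ("isa", "yoga", "excellent way"),
    ("contains", "chocolate", "theobromine"),
    ("contains", "chocolate", "raisins"),
}

# Assumed reference set for this toy example.
gold = {
    ("isa", "yoga", "relaxation technique"),
    ("contains", "chocolate", "theobromine"),
    ("contains", "Singapore sling", "gin"),
}

print(precision_recall(extracted, gold))
```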
30
Beyond Surface Learning with LEILA
Learning to Extract Information by Linguistic
Analysis [F. Suchanek et al.: KDD'06]
Limitation of surface patterns:
who discovered or invented what
  "Tesla's work formed the basis of AC electric
  power"
  "Al Gore funded more work for a better basis of
  the Internet"
Almost-unsupervised Statistical Learning with
Dependency Parsing
LEILA outperforms other Web-IE methods in
precision and recall, but dependency parser is
slow
31
IE Efficiency and Accuracy Tradeoffs
see also tutorials by Cohen,
Doan/Ramakrishnan/Vaithyanathan,
Agichtein/Sarawagi
IE is cool, but what's in it for DB folks?
  • precision vs. recall: two-stage processing
    (filter pipeline)
    • recall-oriented harvesting
    • precision-oriented scrutinizing
  • preprocessing
    • indexing: NLP trees & graphs, N-grams, PoS-tag
      patterns ?
    • exploit ontologies? exploit usage logs ?
  • turn crawl&extract into set-oriented query
    processing
    • candidate finding
    • efficient phrase, pattern, and proximity queries
  • optimizing entire text-mining workflows
    [Ipeirotis et al.: SIGMOD'06]
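
The two-stage filter pipeline can be sketched as a cheap, recall-oriented pass followed by a stricter, precision-oriented pass. The concrete filters below (a substring test, then a full-sentence regular expression) are assumptions chosen to keep the example short; real scrutinizing would typically involve deeper NLP. The names `harvest`, `scrutinize`, `NAME`, and `RULE` are hypothetical.

```python
import re

# A name is one or more capitalized tokens, e.g. "A.G. Bell".
NAME = r"[A-Z][\w.]*(?: [A-Z][\w.]*)*"
RULE = re.compile(r"(?P<who>%s) invented the (?P<what>\w+)" % NAME)


def harvest(sentences):
    """Stage 1 (recall-oriented): keep anything that might state an invention."""
    return [s for s in sentences if "invented" in s]


def scrutinize(candidates):
    """Stage 2 (precision-oriented): accept only sentences that fit the rule."""
    facts = []
    for sentence in candidates:
        m = RULE.fullmatch(sentence)
        if m:
            facts.append(("invented", m.group("who"), m.group("what")))
    return facts


sentences = [
    "A.G. Bell invented the telephone",
    "Nobody knows who invented the wheel",
    "Zappa plays guitar",
]
candidates = harvest(sentences)
print(candidates)
print(scrutinize(candidates))
```

The expensive second stage only ever sees the candidates that survived the cheap first stage, which is the point of the pipeline.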

32
Summary of Planck Approach
Human text (and speech) is diverse and produced
at higher rate than manual high-quality
annotations
IE offers reasonably robust and scalable methods
for harvesting named entities and binary relations
→ Deep NLP and advanced ML are computational
  bottleneck
→ Disambiguation (entity matching, record
  linkage) needed, e.g.:
  Joe Hellerstein (UC Berkeley) vs.
  Prof. Joseph M. Hellerstein, California
  Max Planck Institute vs. MPI vs.
  MPI Message Passing Institute
  • Challenge:
    Achieve Web-scale IE throughput that can
    sustain rate of new content production (e.g.
    blogs)
    (may need large-scale P2P/Grid)
    with > 90% accuracy and Wikipedia-like coverage

33
Darwin Approach (Social Web)
  • Social Wisdom & Natural Selection
  • Evolution of (Web 2.0) species
  • Survival of the fittest

Charles Darwin (1809 - 1882)
34
Wisdom of Crowds at Work on Web 2.0
  • Information enrichment & knowledge extraction by
    humans
  • Collaborative Recommendations & QA
    • Amazon (product ratings & reviews, recommended
      products)
    • Netflix: movie DVD rentals → $1 million
      Challenge
    • answers.yahoo, iknow.baidu, etc.
  • Social Tagging and Folksonomies
    • del.icio.us: Web bookmarks and tags
    • flickr: photo annotation, categorization, rating
    • YouTube: same for video
  • Human Computing in Game Form
    • ESP and Google Image Labeler: image tagging
    • Peekaboom: image segmenting and tagging
    • Verbosity: facts from natural-language sentences
  • Online Communities
    • dblife.cs.wisc.edu for database research
    • www.lt-world.org for language technology
    • Yahoo! Groups, Myspace, Facebook, etc. etc.

35
Social Tagging Example Flickr (1)
36
Social Tagging Example Flickr (2)
37
Social Tagging Example Flickr (3)
38
Social-Tagging Community
> 1 million users, > 100 million photos,
> 1 billion tags, 30% monthly growth
Source: www.flickr.com
39
ESP Game [Luis von Ahn et al. 2004]
played against random, anonymous partner on
Internet
taboo: pyramid, Louvre, museum, Paris, art
  • Game with a purpose
  • Collects annotations (wisdom)
  • Can exploit tag statistics (crowds)
  • Attracts people, fun to play, some play hours
  • ESP game collected > 10 million tags from
    > 20000 users
  • 5000 people could tag all photos on the Web in 4
    weeks (human computing)

my labels:
my labels: reflection
your partner has suggested 3 labels
my labels: reflection; water
your partner has suggested 7 labels
my labels: reflection; water; Mitterand; Mona Lisa
your partner has suggested 11 labels
my labels: reflection; water; Mitterand; Mona
Lisa; metro lignes 7, 14
your partner has suggested 17 labels
my labels: reflection; water; Mitterand; Mona
Lisa; metro lignes 7, 14; Da Vinci code
Congratulations! You scored 1 point!
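
The scoring rule of the game (a point is earned when both players type the same label, and taboo words do not count) can be sketched as below. This simplification ignores timing and treats the partner's labels as a set; the partner's labels and the function name `first_agreement` are made up for the example.

```python
def first_agreement(my_labels, partner_labels, taboo):
    """Return my first label that the partner also suggested, ignoring case
    and excluding taboo words; None if the players never agree."""
    def norm(label):
        return label.strip().lower()

    banned = {norm(t) for t in taboo}
    partner = {norm(label) for label in partner_labels} - banned
    for label in my_labels:
        if norm(label) in partner:
            return label
    return None


taboo = ["pyramid", "Louvre", "museum", "Paris", "art"]
mine = ["reflection", "water", "Mitterand", "Mona Lisa",
        "metro lignes 7, 14", "Da Vinci code"]
# Hypothetical partner input; the real game never shows it to the player.
partner = ["glass", "pyramid", "tourists", "da vinci code"]

print(first_agreement(mine, partner, taboo))
```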
40
More Human Computing
  • Verbosity [von Ahn 2006]
    • Collect common-knowledge facts (relation
      instances)
    • 2 players: Narrator (N) and Guesser (G)
    • N gives stylized clues:
      is a kind of ..., is used for ..., is typically
      near/in/on ..., is the opposite of ...
    • random pairing for independence,
    • can build statistics over many games for same
      concept
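
Building statistics over many games can be as simple as counting how many independent games produced the same clue for a concept and keeping only the clues with enough support. The sketch below assumes each game yields a list of (concept, clue template, value) facts; the example games, the threshold, and the name `aggregate` are hypothetical.

```python
from collections import Counter


def aggregate(games, min_support=2):
    """Count in how many games each fact occurred and keep the facts
    that were produced in at least min_support independent games."""
    support = Counter(fact for game in games for fact in set(game))
    return {fact: n for fact, n in support.items() if n >= min_support}


# Made-up game logs for the concept "milk".
games = [
    [("milk", "is a kind of", "drink"), ("milk", "is typically near/in/on", "fridge")],
    [("milk", "is a kind of", "drink")],
    [("milk", "is the opposite of", "juice")],
]

print(aggregate(games))
```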

Peekaboom, Phetch, etc.: locating & tagging
objects in images, finding images, etc.
  • incentives to play ?
  • game design for moving up the value-chain ?

41
Dark Side of Social Wisdom
  • Spam (Web spam: not just for email anymore)
    • lucky online casino, easy MBA diploma, cheap
      V!-4-gra, etc.
    • lawsuits about appropriate Google rank
  • Truthiness
    • degree to which something is truthy (not
      necessarily facty)
    • truthy: property of something you know from
      your guts
  • Disputes
    • editorial fights over critical Wikipedia
      articles
    • Citizendium: new endeavor with "gentle expert
      oversight"
  • Dishonesty, Bias, ...

44
The Wisdom of Crowds: PageRank
PageRank (PR): links are endorsements & increase
page authority, authority is
higher if links come from high-authority pages
Social Ranking
[Formula for PR not preserved in this transcript]
equivalent to principal eigenvector
random walk: uniformly random choice of links +
random jumps
add bias to transitions and jumps
for personal PR, TrustRank, etc.
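
A minimal sketch of the random-walk reading of PageRank, assuming the standard textbook formulation (the exact formula used on the slide is not shown here): with probability `damping` the walker follows a uniformly chosen outgoing link, otherwise it jumps to a uniformly chosen page. The function name, the default parameters, and the three-page example graph are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration for PageRank on a dict {page: [linked pages]}."""
    nodes = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        # Random-jump share that every page receives.
        new = {p: (1.0 - damping) / n for p in nodes}
        for p in nodes:
            # A page without out-links spreads its rank over all pages.
            targets = links.get(p) or nodes
            share = damping * rank[p] / len(targets)
            for q in targets:
                new[q] += share
        rank = new
    return rank


links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
for page, score in sorted(pagerank(links).items()):
    print(page, round(score, 3))
```

Biasing the random jump toward a chosen set of pages, instead of jumping uniformly, is the usual way to obtain personalized variants.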
45
The Wisdom of Crowds: Beyond PR
[Figure: graph over users, tags, docs]
Typed graphs: data items, users, friends, groups,
postings, ratings,
queries, clicks, ... with weighted edges →
spectral analysis of various graphs
Evolving over time → tensor analysis
46
Decentralized Graph Analysis
  • Graph spectral analysis applied to
  • pages, sites, tags, users, groups, queries,
    clicks, opinions, etc. as nodes
  • assessment and interaction relations as weighted
    edges
  • can compute various notions of authority,
    reputation, trust, quality

  • Decentralized computation in peer-to-peer network
  • with arbitrary, a-priori unknown overlaps of
    graph fragments

47
JXP Algorithm [J.X. Parreira, G. Weikum:
WebDB'05, VLDB'06]
Decentralized, asynchronous, peer-to-peer
algorithm based on theory of Markov-chain
aggregation (state lumping)
[P.J. Courtois 1977, C.D. Meyer 1988]
  • each peer aggregates non-local part of global
    graph into world node
  • peers meet randomly,
  • exchange data about their local computations,
    and
  • iterate their local computations

Theorem: authority scores from local
computations converge to global scores
supported by Minerva system:
http://www.mpi-inf.mpg.de/departments/d5/software/minerva/index.html
48
Summary of Darwin Approach
Social tagging and social networks (Web 2.0) are
potentially valuable knowledge sources
Games (human computing) are an interesting way of
enticing knowledge input and collecting
statistics
Spectral analysis is a highly versatile tool for
rating & ranking that can be extended and scaled
by decentralized algorithms
  • Challenges:
    • Design a game that intrigues serious scientists
      to semantically annotate their scholarly work
    • Develop an analysis method that identifies the
      best facts, resilient to egoistic and malicious
      behaviors (incl. coalitions)

49
Outline
Introduction: Search for Knowledge
Harvesting Knowledge
  • Leibniz Approach
  • Planck Approach
  • Darwin Approach
Conclusion

50
Summary
  • Harvesting knowledge & organizing in semantic
    DB/graph for
    • scholarly Web,
    • digital libraries,
    • enterprise know-how,
    • online communities, etc.
  • Three roads to knowledge:
    • Leibniz / Semantic Web: ontologies,
      encyclopedia, etc.
    • Planck / Statistical Web: large-scale IE from
      text, speech, etc.
    • Darwin / Social Web: wisdom of crowds, tagging,
      folksonomies
  • Not covered here: search and ranking
    • graph IR (for ER graphs, RDF, cross-linked XML,
      etc.)
    • new ranking models (e.g. statistical LM for
      graphs)
    • efficient and scalable query processing

51
Major Challenges
  • Generalize YAGO approach (Wikipedia & WordNet)
    • Methods for comprehensive, highly accurate
      mappings across many knowledge sources
    • cross-lingual, cross-temporal
    • scalable in size, diversity, number of sources
  • Pursue DB support towards efficient IE (and NLP)
    • Achieve Web-scale IE throughput that can
      sustain rate of new content production (e.g.
      blogs)
    • with > 90% accuracy and Wikipedia-like coverage
  • Integrate handcrafted knowledge with
    NLP/ML-based IE
  • Incorporate social tagging and human computing

52
Potential Synergies among Leibniz, Planck, and
Darwin
[Figure: cycle linking Leibniz (Semantic Web),
Planck (Statistical Web), and Darwin (Social Web);
labels: knowledge core, bootstrap, validate,
emerge, statistics feedback, communities]
53
Thank you!