Creating and Exploiting a Web of Semantic Data

Slides: 58
Provided by: JHU85

Transcript and Presenter's Notes

Title: Creating and Exploiting a Web of Semantic Data


1
Creating and Exploiting a Web of Semantic Data
2
Overview
  • Introduction
  • Semantic Web 101
  • Recent Semantic Web trends
  • Examples: DBpedia, Wikitology
  • Conclusion

3
The Age of Big Data
  • Massive amounts of data are available today
  • Advances in many fields are driven by the availability
    of unstructured data, e.g., text, audio, images
  • Increasingly, large amounts of structured and
    semi-structured data are also online
  • Much of this is available in the Semantic Web
    language RDF, fostering integration and
    interoperability
  • Such structured data is especially important for
    the sciences

4
Twenty years ago
  • Tim Berners-Lee's 1989 WWW proposal described a
    web of relationships among named objects
    unifying many information management tasks
  • Capsule history:
  • Guha's MCF (94)
  • XML + MCF => RDF (96)
  • RDF + OO => RDFS (99)
  • RDFS + KR => DAML+OIL (00)
  • W3C's SW activity (01)
  • W3C's OWL (03)
  • SPARQL, RDFa (08)
  • Rules (09)
  • http://www.w3.org/History/1989/proposal.html

5
Ten years ago
  • The W3C started developing standards for the
    Semantic Web
  • The vision, technology and use cases are still
    evolving
  • Moving from a web of documents to a web of data

6
Today
4.5 billion integrated facts published on the Web
as RDF Linked Open Data
7
Tomorrow
Large collections of integrated facts published
on the Web for many disciplines and domains
8
W3C's Semantic Web Goal
  • "The Semantic Web is an extension of the current
    web in which information is given well-defined
    meaning, better enabling computers and people to
    work in cooperation."
  • -- Berners-Lee, Hendler and Lassila, "The Semantic
    Web", Scientific American, 2001

9
From a Web of linked documents
10
To a Web of linked data
11
Contrast with a non-Web approach
  • The W3C Semantic Web approach is
  • Distributed
  • Open
  • Non-proprietary
  • Standards based

12
How can we share data on the Web?
  • POX, Plain Old XML, is one approach, but it has
    deficiencies
  • The Semantic Web languages RDF and OWL offer a
    simpler and more abstract data model (a graph)
    that is better for integration
  • Their well-defined semantics support knowledge
    modeling and inference
  • Supported by a stable, funded standards
    organization, the World Wide Web Consortium

13
Simple RDF Example
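[Figure: RDF graph with the following nodes and labelled arcs]
  • Subject: http://umbc.edu/finin/talks/idm02/
  • dc:Title -> "Intelligent Information Systems on the Web
    and in the Aether"
  • dc:Creator -> a blank node (note: blank node), which has
  • bib:Aff -> http://umbc.edu/
  • bib:name -> "Tim Finin"
  • bib:email -> "finin@umbc.edu"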
14
The RDF Data Model
  • An RDF document is an unordered collection of
    statements, each with a subject, predicate and
    object
  • Each such triple can be thought of as a labelled
    arc in a graph
  • Statements describe properties of resources
  • A resource is any object that can be referenced
    or denoted by a URI
  • Properties themselves are also resources (URIs)
  • Dereferencing a URI produces useful additional
    information, e.g., a definition or additional
    facts
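The data model above can be sketched in a few lines of Python. This is only an illustration, not how any particular RDF library stores data: the graph is a plain set of (subject, predicate, object) tuples, `_:b1` is a made-up label standing in for the blank node, and `match` is a hypothetical helper name.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]

DOC = "http://umbc.edu/finin/talks/idm02/"
DC = "http://purl.org/dc/elements/1.1/"

# An RDF graph modeled as an unordered set of statements.
# "_:b1" is an illustrative stand-in for a blank node.
graph: Set[Triple] = {
    (DOC, DC + "title",
     "Intelligent Information Systems on the Web and in the Aether"),
    (DOC, DC + "creator", "_:b1"),
    ("_:b1", "bib:name", "Tim Finin"),
}


def match(g: Set[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> List[Triple]:
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [t for t in g
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


# All statements whose subject is the talk's URI.
about_doc = match(graph, s=DOC)
```

Because the graph is a set, statement order carries no meaning, which mirrors the "unordered collection" point above.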

15
RDF is the first SW language
  • RDF is a simple language for graph-based
    representations
  • The RDF data model has three views
  • Graph: good for human viewing
  • XML encoding: good for machine processing
  • <rdf:RDF ..> <.> <.> </rdf:RDF>
  • Triples: good for storage and reasoning
  • stmt(docInst, rdf_type, Document)
  • stmt(personInst, rdf_type, Person)
  • stmt(inroomInst, rdf_type, InRoom)
  • stmt(personInst, holding, docInst)
  • stmt(inroomInst, person, personInst)
16
XML encoding for RDF
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:bib="http://daml.umbc.edu/ontologies/bib/">
  <rdf:Description rdf:about="http://umbc.edu/finin/talks/idm02/">
    <dc:title>Intelligent Information ... and in the Aether</dc:title>
    <dc:creator>
      <rdf:Description>
        <bib:Name>Tim Finin</bib:Name>
        <bib:Email>finin@umbc.edu</bib:Email>
        <bib:Aff rdf:resource="http://umbc.edu/" />
      </rdf:Description>
    </dc:creator>
  </rdf:Description>
</rdf:RDF>
17
N3 is a friendlier encoding
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix bib: <http://daml.umbc.edu/ontologies/bib/> .

<http://umbc.edu/finin/talks/idm02/>
    dc:title "Intelligent ... and in the Aether" ;
    dc:creator
        [ bib:Name "Tim Finin" ;
          bib:Email "finin@umbc.edu" ;
          bib:Aff "http://umbc.edu/" ] .

18
RDFS supports simple inferences
  • RDF Schema adds vocabulary for classes,
    properties and constraints
  • An RDF ontology plus some RDF statements may
    imply additional RDF statements (not possible in
    XML)
  • Note that this is part of the data model and not
    of the accessing or processing code.

@prefix rdfs: <http://www.....> .
@prefix : <genesis.n3> .
:parent a rdf:Property ;
    rdfs:domain :person ;
    rdfs:range :person .
:mother rdfs:subPropertyOf :parent ;
    rdfs:domain :woman ;
    rdfs:range :person .
:eve :mother :cain .

Implied statements:
person a class. woman subClass person. mother a property.
eve a person, a woman; parent cain. cain a person.
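
How a few of these statements follow from the others can be sketched as a small forward-chaining loop. This is a minimal illustration under stated assumptions, not a full RDFS reasoner: it applies only four of the standard RDFS entailment patterns (domain, range, subPropertyOf, subClassOf), uses short prefixed strings instead of full URIs, and the function name `rdfs_closure` is made up.

```python
def rdfs_closure(triples):
    """Apply four RDFS rules repeatedly until no new triples appear."""
    inferred = set(triples)
    while True:
        new = set()
        for (s, p, o) in inferred:
            for (s2, p2, o2) in inferred:
                # (p subPropertyOf q) and (s p o)  =>  (s q o)
                if p2 == "rdfs:subPropertyOf" and s2 == p:
                    new.add((s, o2, o))
                # (p domain C) and (s p o)  =>  (s type C)
                if p2 == "rdfs:domain" and s2 == p:
                    new.add((s, "rdf:type", o2))
                # (p range C) and (s p o)  =>  (o type C)
                if p2 == "rdfs:range" and s2 == p:
                    new.add((o, "rdf:type", o2))
                # (C subClassOf D) and (s type C)  =>  (s type D)
                if p == "rdf:type" and p2 == "rdfs:subClassOf" and s2 == o:
                    new.add((s, "rdf:type", o2))
        if new <= inferred:          # fixed point reached
            return inferred
        inferred |= new


# The genesis example from the slide, written as plain tuples.
genesis = {
    ("parent", "rdfs:domain", "person"),
    ("parent", "rdfs:range", "person"),
    ("mother", "rdfs:subPropertyOf", "parent"),
    ("mother", "rdfs:domain", "woman"),
    ("mother", "rdfs:range", "person"),
    ("eve", "mother", "cain"),
}

closure = rdfs_closure(genesis)
```

From the single fact that eve is cain's mother, the loop derives that eve is cain's parent, that eve is a woman and a person, and that cain is a person. The quadratic loop is chosen for readability; a real store would index triples by predicate.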
19
OWL adds further richness
  • OWL adds richer representational vocabulary, e.g.
  • parentOf is the inverse of childOf
  • Every person has exactly one mother
  • Every person is a man or a woman but not both
  • A man is the equivalent of a person with a sex
    property with value male
  • OWL is based on description logic, a subset of
    logic with efficient reasoners that are complete
  • Good algorithms for reasoning about descriptions
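
Two of the constructs listed above, inverse properties and "a man or a woman but not both", can be illustrated with a toy check. This is a sketch only and far short of a description logic reasoner; the vocabulary (`parentOf`, `childOf`, `Man`, `Woman`), the individual `pat`, and both function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical declarations echoing the slide's examples.
INVERSES = {"parentOf": "childOf", "childOf": "parentOf"}
DISJOINT = [("Man", "Woman")]


def add_inverses(triples, inverses=INVERSES):
    """For every (s, p, o) whose p has a declared inverse q, add (o, q, s)."""
    out = set(triples)
    for s, p, o in triples:
        if p in inverses:
            out.add((o, inverses[p], s))
    return out


def disjointness_violations(triples, disjoint=DISJOINT):
    """Return individuals typed with two classes that are declared disjoint."""
    types = defaultdict(set)
    for s, p, o in triples:
        if p == "rdf:type":
            types[s].add(o)
    return sorted(s for s, ts in types.items()
                  for a, b in disjoint if a in ts and b in ts)


facts = {
    ("eve", "parentOf", "cain"),
    ("eve", "rdf:type", "Woman"),
    ("pat", "rdf:type", "Man"),     # made-up individual that breaks
    ("pat", "rdf:type", "Woman"),   # the "not both" constraint
}

expanded = add_inverses(facts)
violations = disjointness_violations(facts)
```

The first function adds new facts, while the second detects an inconsistency; OWL reasoners do both kinds of work, with much richer vocabulary.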

20
That was then, this is now
  • 1996-2000: focus on RDF and data
  • 2000-2007: focus on OWL, developing ontologies,
    sophisticated reasoning
  • 2008-: integrating and exploiting large RDF data
    collections backed by lightweight ontologies

21
A Linked Data story
  • Wikipedia as a source of knowledge
  • Wikis are a great way to collaborate on building
    up knowledge resources
  • Wikipedia as an ontology
  • Every Wikipedia page is a concept or object
  • Wikipedia as RDF data
  • Map this ontology into RDF
  • DBpedia as the lynchpin for Linked Data
  • Exploit its breadth of coverage to integrate
    things

22
Populating Freebase KB
23
Underlying Powerset's KB
24
Mined by TrueKnowledge
25
Wikipedia as an ontology
  • Using Wikipedia as an ontology
  • each article (3M) is an ontology concept or
    instance
  • terms linked via category system (200k), infobox
    template use, inter-article links, infobox links
  • Article history contains metadata for trust,
    provenance, etc.
  • It's a consensus ontology with broad coverage
  • Created and maintained by a diverse community for
    free!
  • Multilingual
  • Very current
  • Overall content quality is high

26
Wikipedia as an ontology
  • Uncategorized and miscategorized articles
  • Many administrative categories (e.g., articles
    needing revision) and useless ones (e.g., 1949 births)
  • Multiple infobox templates for the same class
  • Multiple infobox attribute names for same
    property
  • No datatypes or domains for infobox attribute
    values
  • etc.

27
DBpedia: Wikipedia in RDF
  • A community effort to extract structured
    information from Wikipedia and publish it as RDF
    on the Web
  • Effort started in 2006 with EU funding
  • Data and software open sourced
  • DBpedia doesn't extract information from
    Wikipedia's text, but from its structured
    information, e.g., links, categories, infoboxes

28
DBpedia Linked Data lynchpin
29
http://lookup.dbpedia.org/
30
(No Transcript)
31
(No Transcript)
32
(No Transcript)
33
DBpedia uses WP structured data
  • DBpedia extracts structured data from Wikipedia,
    especially from Infoboxes

34
DBpedia ontology
  • DBpedia 3.2 (Nov 2008) added a manually
    constructed ontology with
  • 170 classes in a subsumption hierarchy
  • 880K instances
  • 940 properties with domain and range
  • A partial, manual mapping was constructed from
    infobox attributes to these terms
  • Current domain and range constraints are loose
  • Namespace: http://dbpedia.org/ontology/

Place      248,000
Person     214,000
Work       193,000
Species     90,000
Org.        76,000
Building    23,000
35
Person
56 properties
36
Organisation
50 properties
37
Place
110 properties
38
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbp: <http://dbpedia.org/resource/>
PREFIX dbpo: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?Property ?Place
WHERE {
  dbp:Barack_Obama ?Property ?Place .
  ?Place rdf:type dbpo:Place .
}
http://dbpedia.org/sparql/
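
A query like this one can be sent to a SPARQL endpoint as an HTTP GET with the query text in a `query` parameter, as the SPARQL protocol describes. The sketch below only builds the request and does not send it. It assumes the endpoint lives at `http://dbpedia.org/sparql` and that it honors an `Accept` header asking for JSON results; `build_request` is a made-up helper name.

```python
import urllib.parse
import urllib.request

# Assumed endpoint URL; adjust if the service has moved.
ENDPOINT = "http://dbpedia.org/sparql"

QUERY = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbp: <http://dbpedia.org/resource/>
PREFIX dbpo: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?Property ?Place
WHERE {
  dbp:Barack_Obama ?Property ?Place .
  ?Place rdf:type dbpo:Place .
}
"""


def build_request(endpoint: str, query: str) -> urllib.request.Request:
    """Encode the query into the URL and ask for JSON-formatted results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"})


request = build_request(ENDPOINT, QUERY)

# To actually run the query (requires network access):
#   import json
#   with urllib.request.urlopen(request) as response:
#       results = json.load(response)
```

Keeping request construction separate from the network call makes the encoding step easy to inspect before anything is sent.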
39
DBpedia Linked Data lynchpin
40
Consider Baltimore, MD
41
Looking at the RDF description
  • We find assertions equating DBpedia's object for
    Baltimore with those in other LOD datasets:
  • dbpedia:Baltimore%2C_Maryland
  • owl:sameAs census:us/md/counties/baltimore/baltimore ;
  • owl:sameAs cyc:concept/Mx4rvVin-5wpEbGdrcN5Y29ycA ;
  • owl:sameAs freebase:guid.9202a8c04000641f800000000004921a ;
  • owl:sameAs geonames:4347778/ .
  • Since owl:sameAs is defined as an equivalence
    relation, the mapping works both ways
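
Because owl:sameAs is an equivalence relation, a set of sameAs links partitions identifiers into groups that all denote the same thing. One common way to compute those groups is a union-find structure; the sketch below is an illustration of that idea, not something the slides prescribe, and `same_as_classes` is a hypothetical name. The identifiers are the ones from the Baltimore example.

```python
def same_as_classes(pairs):
    """Group identifiers into equivalence classes given sameAs pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    classes = {}
    for x in list(parent):
        classes.setdefault(find(x), set()).add(x)
    return list(classes.values())


BALTIMORE = "dbpedia:Baltimore%2C_Maryland"
links = [
    (BALTIMORE, "census:us/md/counties/baltimore/baltimore"),
    (BALTIMORE, "cyc:concept/Mx4rvVin-5wpEbGdrcN5Y29ycA"),
    (BALTIMORE, "freebase:guid.9202a8c04000641f800000000004921a"),
    (BALTIMORE, "geonames:4347778/"),
]

groups = same_as_classes(links)
```

With these four links, all five identifiers end up in one group, so a fact stated about the GeoNames entry can be related to the Cyc concept even though no link connects them directly.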

42
Linked Data Cloud, March 2009
43
Four principles for linked data
  • Use URIs to identify things that you expose to
    the Web as resources
  • Use HTTP URIs so that people can locate and look
    up (dereference) these things.
  • When someone looks up a URI, provide useful
    information
  • Include links to other, related URIs in the
    exposed data as a means of improving information
    discovery on the Web

-- Tim Berners-Lee, 2006
44
4.5 billion triples for free
  • The full public LOD dataset has about 4.5 billion
    triples as of March 2009
  • Linking assertions are spotty, but probably
    include on the order of 10M equivalences
  • Availability
  • Download the data in RDF
  • Query it via public SPARQL servers
  • Load it as an Amazon EC2 public dataset
  • Launch it and the required software as an Amazon
    public AMI image

45
Wikitology
  • We've been exploring a different approach to
    deriving an ontology from Wikipedia through a
    series of use cases
  • Identifying user context in a collaboration
    system from documents viewed (2006)
  • Improving IR accuracy by adding Wikitology tags
    to documents (2007)
  • ACE: cross-document co-reference resolution for
    named entities in text (2008)
  • TAC KBP: knowledge base population from text
    (2009)
  • Improving Web search engines by tagging documents
    and queries (2009)

46
Wikitology 2.0 (2008)
[Figure: Wikitology 2.0 architecture. Labels: RDF, text, graphs, Freebase KB, Yago, WordNet, human input and editing, databases]
47
Wikitology tagging
  • Using Serif's output, we produced an entity
    document for each entity.
  • Each included the entity's name, nominal and
    pronominal mentions, APF type and subtype, and
    words in a window around the mentions
  • We tagged entity documents using Wikitology,
    producing vectors of (1) terms and (2) categories
    for the entity
  • We used the vectors to compute features measuring
    entity pair similarity/dissimilarity
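
One such feature, cosine similarity between two tag vectors, can be sketched as follows. The first vector reuses article-tag weights shown on the next slide; the second vector and its weights are invented purely for illustration, and how the original system normalized or truncated its vectors is not stated here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {tag: weight}."""
    dot = sum(w * v[tag] for tag, w in u.items() if tag in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Article tag vector for one entity document (weights from the slides).
entity_a = {
    "Webster_Hubbell": 1.000,
    "United_States_v._Hubbell": 0.377,
    "Whitewater_controversy": 0.222,
}

# Hypothetical vector for a second entity document (made-up weights).
entity_b = {
    "Webster_Hubbell": 0.800,
    "Whitewater_controversy": 0.300,
    "Kenneth_Starr": 0.250,
}

similarity = cosine(entity_a, entity_b)
```

Storing vectors as dictionaries keeps the computation proportional to the number of non-zero tags, which matters when the tag space is every Wikipedia article or category.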

48
Wikitology Entity Document Tags
Wikitology article tag vector
  • Webster_Hubbell 1.000
  • Hubbell_Trading_Post_National_Historic_Site 0.379
  • United_States_v._Hubbell 0.377
  • Hubbell_Center 0.226
  • Whitewater_controversy 0.222
Wikitology category tag vector
  • Clinton_administration_controversies 0.204
  • American_political_scandals 0.204
  • Living_people 0.201
  • 1949_births 0.167
  • People_from_Arkansas 0.167
  • Arkansas_politicians 0.167
  • American_tax_evaders 0.167
  • Arkansas_lawyers 0.167
  • Wikitology entity document
  • <DOC>
  • <DOCNO>ABC19980430.1830.0091.LDC2000T44-E2
    </DOCNO>
  • <TEXT>
  • Webb Hubbell
  • PER
  • Individual
  • NAM "Hubbell" "Hubbells" "Webb Hubbell"
    "Webb_Hubbell"
  • PRO "he" "him" "his"
  • abc's accountant after again ago all alleges
    alone also and arranged attorney avoid been
    before being betray but came can cat charges
    cheating circle clearly close concluded
    conspiracy cooperate counsel counsel's department
    did disgrace do dog dollars earned eightynine
    enough evasion feel financial firm first four
    friend friends going got grand happening has he
    help him his hope house hubbell hubbells hundred
    hush income increase independent indict indicted
    indictment inner investigating jackie jackie_judd
    jail jordan judd jury justice kantor ken knew
    lady late law left lie little make many mickey
    mid money mr my nineteen nineties ninetyfour not
    nothing now office other others paying
    peter_jennings president's pressure pressured
    probe prosecutors questions reported reveal rock
    saddened said schemed seen seven since starr
    statement such tax taxes tell them they thousand
    time today ultimately vernon washington webb
    webb_hubbell were what's whether which white
    whitewater why wife years
  • </TEXT>
  • </DOC>

Fields of the entity document, in order: name; type and subtype; mention heads (NAM, PRO); words surrounding mentions
49
Top Ten Features (by F1)
Prec.  Recall  F1    Feature Description
90.8   76.6    83.1  some NAM mention has an exact match
92.9   71.6    80.9  Dice score of NAM strings (based on the intersection of NAM strings, not words or n-grams of NAM strings)
95.1   65.0    77.2  the/a longest NAM mention is an exact match
86.9   66.2    75.1  Similarity based on cosine similarity of Wikitology Article Medium article tag vector
86.1   65.4    74.3  Similarity based on cosine similarity of Wikitology Article Long article tag vector
64.8   82.9    72.8  Dice score of character bigrams from the 'longest' NAM string
95.9   56.2    70.9  all NAM mentions have an exact match in the other pair
85.3   52.5    65.0  Similarity based on a match of entities' top Wikitology article tag
85.3   52.3    64.8  Similarity based on a match of entities' top Wikitology article tag
85.7   32.9    47.5  Pair has a known alias
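
Two quantities in this table are easy to sketch: the Dice score over character bigrams and the F1 column itself. The code below is an illustration under assumptions: it treats bigrams as a set (the original feature may have counted repeats), and the function names are made up. F1 here is the usual harmonic mean of precision and recall.

```python
def char_bigrams(text):
    """Set of adjacent character pairs in a string."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def dice(a, b):
    """Dice coefficient of two sets: 2|A and B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Dice over character bigrams of two NAM strings from the entity document.
bigram_score = dice(char_bigrams("Hubbell"), char_bigrams("Hubbells"))

# Dice over whole NAM strings, as in the second table row.
string_score = dice({"Hubbell", "Webb Hubbell"}, {"Hubbell", "Webster Hubbell"})

# Recomputing F1 for the first table row from its precision and recall.
first_row_f1 = round(f1(90.8, 76.6), 1)
```

The bigram variant tolerates small spelling differences between mentions, which is consistent with its higher recall and lower precision in the table compared with exact string matching.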
50
Knowledge Base Population
  • The 2009 NIST Text Analysis Conference (TAC) will
    include a new Knowledge Base Population track
  • Goal: discover information about named entities
    (people, organizations, places) and incorporate
    it into a KB
  • TAC KBP has two related tasks
  • Entity linking: document entity mention -> KB entity
  • Slot filling: given a document entity mention,
    find missing slot values in a large corpus

51
KBs and IE are Symbiotic
KB info helps interpret text
Knowledge Base
Information Extraction from Text
IE helps populate KBs
52
Wikitology 3.0 (2009)
[Figure: Wikitology 3.0 architecture. Labels: Articles, IR collection, Category Links Graph, Infobox Graph, Page Link Graph, Wikitology Code, Application Specific Algorithms, RDF reasoner, Relational Database, Triple Store, Linked Semantic Web data and ontologies]
53
Wikipedia's social network
  • Wikipedia has an implicit social network that
    can help disambiguate PER mentions
  • Resolving PER mentions in a short document to KB
    people who are linked in the KB is good
  • The same can be done for the network of ORG and
    GPE entities

54
WSN Data
  • We extracted 213K people from DBpedia's
    infobox dataset, 30K of whom participate in an
    infobox link to another person
  • We extracted 875K people from Freebase, 616K of
    whom were linked to Wikipedia pages, 431K of whom
    are in one of 4.8M person-person article links
  • Consider a document that mentions two people:
    George Bush and Mr. Quayle

55
Which Bush, which Quayle?
Six George Bushes
Nine Male Quayles
56
A simple closeness metric
  • Let Si = the set of two-hop neighbors of entity i
  • Cij = |intersection(Si, Sj)| / |union(Si, Sj)|
  • Cij > 0 for six of the 56 possible pairs
  • 0.43 George_H._W._Bush -- Dan_Quayle
  • 0.24 George_W._Bush -- Dan_Quayle
  • 0.18 George_Bush_(biblical_scholar) -- Dan_Quayle
  • 0.02 George_Bush_(biblical_scholar) -- James_C._Quayle
  • 0.02 George_H._W._Bush -- Anthony_Quayle
  • 0.01 George_H._W._Bush -- James_C._Quayle
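
The metric above is the Jaccard overlap of two-hop neighborhoods and can be sketched directly. Several details are assumptions rather than facts from the slides: the graph is a toy adjacency dictionary with made-up node names, "two-hop neighbors" is read as everything reachable in one or two steps, and the node itself is excluded from its own neighborhood.

```python
def two_hop_neighbors(graph, node):
    """Nodes reachable from `node` in one or two steps, excluding `node`."""
    one_hop = set(graph.get(node, ()))
    reached = set(one_hop)
    for neighbor in one_hop:
        reached |= set(graph.get(neighbor, ()))
    reached.discard(node)
    return reached


def closeness(graph, a, b):
    """|Sa and Sb| / |Sa or Sb| over the two-hop neighborhoods of a and b."""
    sa = two_hop_neighbors(graph, a)
    sb = two_hop_neighbors(graph, b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


# Toy undirected link graph with hypothetical node names.
toy_graph = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A"},
    "D": {"B", "E"},
    "E": {"D"},
    "X": {"Y"},
    "Y": {"X"},
}

near = closeness(toy_graph, "A", "E")   # neighborhoods overlap
far = closeness(toy_graph, "A", "X")    # disconnected components
```

To disambiguate a pair of mentions, one would score every candidate pair this way and prefer the highest-scoring combination, as the ranked list above does for the Bush and Quayle candidates.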

57
Application to TAC KBP
  • Using entity network data extracted from DBpedia
    and Wikipedia provides evidence to support KBP
    tasks
  • Mapping document mentions into infobox entities
  • Mapping potential slot fillers into infobox
    entities
  • Evaluating the coherence of entities as potential
    slot fillers

58
Conclusion
  • The Semantic Web approach is a powerful approach
    for data interoperability and integration
  • The research focus is shifting to a Web of Data
    perspective
  • Many research issues remain: uncertainty,
    provenance, trust, parallel graph algorithms,
    reasoning over billions of triples, user-friendly
    tools, etc.
  • Just as the Web enhances human intelligence, the
    Semantic Web will enhance machine intelligence
  • The ideas and technology are still evolving

59
http://ebiquity.umbc.edu/