  • Introduction to RDF, Jena, SparQL, and the
    Semantic Web
  • Michael Grobe
  • Pervasive Technology Institute
  • Indiana University
  • October 12, 2009

  • This presentation in perspective
  • This is one of a series of presentations on the
    Linked Data Web and semantic technologies:
  • Introduction to ontologies
  • This one, on RDF, Jena, SparQL, and the Semantic
    Web
  • Using inference and OWL
  • In general, these semantic technology topics seem
    deceptively simple, but are fraught with
    complications, limitations, and qualifications,
    especially when the casual user attempts to
    compare them with relational data approaches to
    the same or similar problems.

  • Topics
  • Simple introduction to the semantic approach
  • - sentences as triples and graphs
  • - sentence components encoded using URIs
  • - serializing sentences using the Resource
    Description Framework (RDF)
  • - storing semantically encoded data in
  • - browsing information encoded in RDF
  • Accessing and querying semantic data
  • - Introduction to SparQL
  • - Free-standing query clients: Twinkle,
    RDF-gravity, Explorator
  • - Jena: software for manipulating triples
  • Preeminent semantic resources
  • - DBpedia
  • - Bio2RDF: semantic web atlas of postgenomic
    knowledge
  • - Queries using Virtuoso SparQL and iSparQL

  • From raw data to sentences
  • Here is some information that might be useful to
  • Smith 21
  • Smith Jones
  • Do you get it?
  • Would it help to see the data tables? Perhaps
    you could guess what I'm trying to say if you
    look at column names.
  • What's missing here: the relationships between
    the separate pieces of data.
  • In natural languages these relationships are
    established by using predicates to form
    sentences that connect these components,
  • . . . as in the sentences on the next slide.

  • Sentences
  • . . . some information in sentence form
  • Smith has age 21.
  • Jones has age 45.
  • Blake has age 12.
  • George has age 21.
  • Smith has favorite friend Jones.
  • Jones has favorite friend Smith.
  • Blake has favorite friend Blake.
  • George has favorite friend Smith.
  • where each sentence has the form
  • Subject Predicate Object
  • also known as a triple.

  • A Sentence base
  • We can put these triples into one or more files
    to build a sentence base to hold these
  • To help with manipulation and searching, each
    grammatical component is stored and accessed
    separately, so that each sentence retains its
    triple form
  • Subject   Predicate             Object
  • Smith     has age               21
  • Jones     has age               45
  • Blake     has age               12
  • George    has age               21
  • Smith     has favorite friend   Jones
  • Jones     has favorite friend   Smith
  • Blake     has favorite friend   Blake
  • George    has favorite friend   Smith

  • Query sentences
  • We can query such information with queries like
  • "Someone has favorite friend Smith?"
  • where "Someone" acts like a variable and
    resolves as the list
  • Jones
  • George
  • because the pattern "Someone has favorite friend
    Smith" matches both triples
  • Jones has favorite friend Smith
  • George has favorite friend Smith
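
The matching step above can be sketched in a few lines of Python. This is only an illustration, not how a real triplestore works: the list `TRIPLES` and the helper `match` are hypothetical names, and `None` stands in for the "Someone" variable.

```python
# Toy sentence base: each sentence is a (subject, predicate, object) tuple.
TRIPLES = [
    ("Smith", "has age", 21),
    ("Jones", "has age", 45),
    ("Blake", "has age", 12),
    ("George", "has age", 21),
    ("Smith", "has favorite friend", "Jones"),
    ("Jones", "has favorite friend", "Smith"),
    ("Blake", "has favorite friend", "Blake"),
    ("George", "has favorite friend", "Smith"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every component that is not None.

    A component left as None behaves like a variable such as "Someone".
    """
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# "Someone has favorite friend Smith?"
someone = [s for (s, _, _) in match(TRIPLES, predicate="has favorite friend", obj="Smith")]
print(someone)  # prints ['Jones', 'George']
```

Leaving a different component as `None` asks a different question; for example, `match(TRIPLES, subject="Blake")` lists everything the sentence base says about Blake.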

  • Query sentences
  • We can interpret a more complicated query like
  • "Someone has favorite friend Smith and has age
    21?"
  • as a pair of requirements
  • "Someone has favorite friend Smith?"
  • and
  • "Someone has age 21?"
  • where we mean that the same someone has both
    characteristics . . .
  • in which case Someone will resolve as "George",
    since George is the only Someone who satisfies
    both requirements via the following triples
  • George has age 21
  • George has favorite friend Smith
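
One way to picture "the same someone satisfies both requirements" is to carry variable bindings from one pattern to the next. The sketch below assumes variables are written as strings starting with "?"; the helpers `extend` and `query` are hypothetical names, and the approach is a naive nested loop rather than anything a production query engine would use.

```python
TRIPLES = [
    ("Smith", "has age", 21),
    ("Jones", "has age", 45),
    ("Blake", "has age", 12),
    ("George", "has age", 21),
    ("Smith", "has favorite friend", "Jones"),
    ("Jones", "has favorite friend", "Smith"),
    ("Blake", "has favorite friend", "Blake"),
    ("George", "has favorite friend", "Smith"),
]


def is_variable(term):
    """Variables are strings that start with '?', e.g. '?someone'."""
    return isinstance(term, str) and term.startswith("?")


def extend(binding, pattern, triple):
    """Return a copy of `binding` extended so that `pattern` equals `triple`.

    Returns None if a constant differs or a variable is already bound
    to a different value.
    """
    extended = dict(binding)
    for pattern_term, triple_term in zip(pattern, triple):
        if is_variable(pattern_term):
            if extended.get(pattern_term, triple_term) != triple_term:
                return None
            extended[pattern_term] = triple_term
        elif pattern_term != triple_term:
            return None
    return extended


def query(triples, patterns):
    """Apply the patterns one after another, keeping only consistent bindings."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for binding in solutions:
            for triple in triples:
                extended = extend(binding, pattern, triple)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


result = query(TRIPLES, [
    ("?someone", "has favorite friend", "Smith"),
    ("?someone", "has age", 21),
])
print(result)  # prints [{'?someone': 'George'}]
```

Jones also has Smith as favorite friend, but the binding for `?someone` made by the first pattern fails the second pattern, so only George survives.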

  • Using graphs to represent sentences
  • If we want to complicate things, we can also
    represent the same information in graph form, as
    with these 2 graphs that represent the 2 kinds of
    information in the collection of sentences:
  • Graph 1: Person ages. Graph 2: Favorite friends.
  • Typically we don't really want to complicate
    these issues, but the semantic web literature
    often thinks in graph terms and some
    applications display results as visual graphs.

  • Using graphs to represent sentences
  • Here the 2 graphs are combined using named edges
    to represent 2 kinds of information associated
    with the same 4 persons.
  • Graph 3: Person ages (age) and favorite friends
  • Each arc represents the predicate of a
    sentence, connecting a subject with an
    object. (Note that a subject may have > 0
    arcs of each type.)

  • Using URIs and URLs to identify predicates and
  • Now, if it hadn't already happened, someone would
    come up with the idea to use URLs to point to Web
    documents that describe the exact meaning of
    each predicate, or metadata.
  • For example, http://...
    could contain a definition of "favorite friend",
    and other documents would define "BFF",
    "long-time-friend", "family-friend", "friends
    with benefits", etc.
  • And, in fact, these definitions could themselves
    refer to other definitions, like some superset
    of relationships such as
  • http://...
  • or the personal_relationships file could include
    a collection of subset definitions that we might
    refer to like
  • http://...
  • using the # convention for targeting a specific
    location within a URL.
  • Note that this form of metadata is not the only
    useful form of metadata, but it is clearly
    integrated with the data in a unique fashion.
  • The basic triple structure of each sentence
    provides another (implicit) form of metadata.

  • The sentences as a set of 8 triples (2 for each
  • -------------------------------------
  • Subject   Predicate      Object
  • Blake     example:fav    Blake
  • Blake     info:has_age   "12"
  • Jones     example:fav    Smith
  • Jones     info:has_age   "35"
  • George    example:fav    Smith
  • George    info:has_age   "21"
  • Smith     example:fav    Jones
  • Smith     info:has_age   "21"
  • -------------------------------------
  • Here the abbreviation example stands for

  • Representing sentence components using URIs
  • To specify exactly which person named Blake,
    Smith, etc. we are referring to, we can again
    use URIs.
  • --------------------------------------------------
  • Subject        Predicate
  • <http://...>   example:fav
  • <http://...>   info:has_age
  • <http://...>   example:fav
  • <http://...>   info:has_age
  • <http://...>   example:fav
  • <http://...>   info:has_age
  • <http://...>   example:fav
  • <http://...>   info:has_age
  • --------------------------------------------------

  • Triplestore summary and outrageous claims
  • Sentences are composed of subject, predicate,
    object triples.
  • Subjects and predicates are specified as URIs
    that may be dereferenceable, and predicate URLs
    may provide metadata describing the meaning of
    the predicate.
  • A collection of triples can be represented as a
    graph, and may be known as a graph.
  • Sentences are stored in triplestores or quad
    stores (when they are members of identifiable
    graphs whose names give the 4th component).
  • Triples will contain URIs that
  • - identify and/or name resources (subjects
    and/or objects),
  • - serve to identify and/or reference
    predicate definitions, and
  • - identify object data types (as in
    "25"^^xsd:int).
  • One way to think about this is that triplestores
    do NOT contain data, but rather sentences,
    information, assertions (not necessarily true
    or correct assertions), units of thought
    (Mons), or maybe little chunks o' meaning.

  • Triples may be serialized in various forms
  • There are several ways to convert such triples
    into a serialized, or text-based, form. Here is
    the simplest: Turtle (for Terse RDF Triple
    Language), a subset of N3 (for Notation 3), with
    each statement holding a subject, a predicate,
    and an object, and ending with a "."
  • @prefix example: <http://...onal_relationships> .
  • @prefix info: <http://...ristics> .
  • <http://...> example:fav <http://...> .
  • <http://...> info:has_age "12" .
  • <http://...> example:fav <http://...> .
  • <http://...> info:has_age "35" .
  • <http://...> example:fav <http://...> .
  • <http://...> info:has_age "21" .
  • <http://...> example:fav <http://...> .
  • <http://...> info:has_age "21" .
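
To make the serialization step concrete, here is a small Python sketch that writes triples in this style. The namespace URIs under `http://example.org/` are placeholders invented for the illustration (the original URIs are not shown in full), and `to_turtle` is a hypothetical helper, not part of any RDF library.

```python
# Placeholder namespaces, assumed for this sketch only.
PREFIXES = {
    "example": "http://example.org/personal_relationships#",
    "info": "http://example.org/personal_characteristics#",
}
PEOPLE = "http://example.org/people/"

TRIPLES = [
    ("Blake", "example:fav", "Blake"),
    ("Blake", "info:has_age", 12),
    ("Jones", "example:fav", "Smith"),
    ("Jones", "info:has_age", 35),
]


def to_turtle(triples):
    """Serialize (subject, predicate, object) tuples as Turtle statements.

    String objects are treated as person names and written as URIs;
    any other object is written as a quoted literal.
    """
    lines = [f"@prefix {name}: <{uri}> ." for name, uri in PREFIXES.items()]
    for subject, predicate, obj in triples:
        if isinstance(obj, str):
            obj_text = f"<{PEOPLE}{obj}>"
        else:
            obj_text = f'"{obj}"'
        lines.append(f"<{PEOPLE}{subject}> {predicate} {obj_text} .")
    return "\n".join(lines)


print(to_turtle(TRIPLES))
```

Every statement ends with " ." and the prefixed names (`example:fav`, `info:has_age`) expand to full URIs through the `@prefix` lines at the top.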

  • Triples may be serialized in various forms
  • Another serialization format is the standard
    RDF/XML syntax of the Resource Description
    Framework (RDF), which is used in this encoding
    of the Smith information (with
    non-dereferenceable URIs):
  • <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:example="http://fake.ho...">
  • <example:Person rdf:about="http://...">
  • <example:name>Smith</example:name>
  • <example:has_age>21</example:has_age>
  • <example:fav rdf:resource="http://..."/>
  • </example:Person>
  • </rdf:RDF>
  • Note: There exist other, standard schemas for
    encoding personal information, such as the Friend
    of a Friend (FOAF) schema.

  • Dereferenceable URI version of the Smith RDF
  • Here is the same information encoded with
    dereferenceable URIs, URIs that can actually be
    accessed and from which content can be
  • <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:example="http://fake.ho...">
  • <example:Person rdf:about="http://discern.uits.iu.edu:8421/...">
  • <example:name>Smith</example:name>
  • <example:has_age>21</example:has_age>
  • <example:fav rdf:resource="http://discern.uits.iu.edu:8421..."/>
  • </example:Person>
  • </rdf:RDF>

  • Browsing RDF documents
  • Here is a view of the Smith RDF file from within
    Firefox using the Tabulator plug-in
  • You can click on the jones.rdf link to see the
    Jones record, and browse from there, or choose
    the Person link to examine its definition (if its

  • The Semantic Web
  • In general, if URIs are dereferenceable they can
    link into a Gigantic Global Graph, usually known
    as the Linked Data Web or the Semantic Web.
  • "If HTML and the Web make all online documents
    look like one huge book, RDF, schema, and
    inference languages will make all the data (sic)
    in the world look like one huge database."
  • --TimBL

  • Documents in RDF format may be interrogated
  • - by physical inspection (for anyone willing to
    read XML)
  • - by using an RDF browser (like the Tabulator
    plug-in, etc.)
  • - by writing programs (in Jena, for example)
    that read RDF files,
  • construct the represented graphs internally,
    and then
  • - access graph triples in sequential order,
  • - select triples according to specified
    content, and/or
  • - apply SparQL queries and access results in
    sequential order
  • - using command-line tools that apply SparQL
    queries, and/or
  • - using GUI interfaces accepting SparQL queries
  • - written in text, or
  • - represented graphically
  • - using SparQL endpoints that accept
    queries embedded in URLs

  • A SparQL example
  • If http://discern.uits.iu.edu:8421/all-persons.rd
  • contains all the triples listed earlier, then
    this SparQL query should find all the triples
    related to Smith:
  • select ?p ?o
  • from <http://...>
  • where {
  • <http://discern.uits.iu.edu:8421/smith.rdf>
    ?p ?o .
  • }
  • Intuitively, this query asks "Smith has what
    relationship(s) to whom/what?"
  • and should identify these 2 value pairs:
  • <http://...>

  • Another SparQL example
  • If
  • http://discern.uits.iu.edu:8421/all-persons.rd
  • contains all the triples listed earlier, then
    this SparQL query simply asks for a list of all
    those triple values:
  • select *
  • from <http://discern.uits.iu.edu:8421/all-persons.r...>
  • where {
  • ?sub ?pred ?obj .
  • }
  • Intuitively, this query asks "Who has what
    relationship to whom?"
  • ?sub, ?pred, and ?obj will each be assigned one
    or more values as the query is satisfied, and all
    three will be printed (*).

  • Results of the single (unified) file SparQL query
  • --------------------------------------------------
  • sub                         pred
  • http://...8421/blake.rdf    example:fav
  • http://...8421/blake.rdf    example:has_age
  • http://...8421/jones.rdf    example:fav
  • http://...8421/jones.rdf    example:has_age
  • http://...8421/george.rdf   example:fav
  • http://...8421/george.rdf   example:has_age
  • http://...8421/smith.rdf    example:fav
  • http://...8421/smith.rdf    example:has_age
  • --------------------------------------------------

  • A distributed SparQL query against 4 separate
    RDF files
  • The next query searches the same data broken
    into 4 dereferenceable files, one for each
    subject:
  • select *
  • from <http://discern.uits.iu.edu:8421/smith.rdf>
  • from <http://discern.uits.iu.edu:8421/jones.rdf>
  • from <http://discern.uits.iu.edu:8421/george.rdf>
  • from <http://discern.uits.iu.edu:8421/blake.rdf>
  • where {
  • ?sub ?pred ?obj .
  • }
  • The results of this query will be the same as the
    results for the single file query (though order
    may vary due to remote URL access latency).
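
Why the distributed query returns the same triples as the single-file query can be sketched with plain Python sets: several sources are merged into one graph before the pattern is applied. The four sets below are hypothetical stand-ins for the per-person files, with short names in place of full URIs, and `merge_graphs` is an illustrative helper.

```python
# Stand-ins for the four per-person RDF files (contents abbreviated).
SMITH = {("smith", "fav", "jones"), ("smith", "has_age", "21")}
JONES = {("jones", "fav", "smith"), ("jones", "has_age", "35")}
GEORGE = {("george", "fav", "smith"), ("george", "has_age", "21")}
BLAKE = {("blake", "fav", "blake"), ("blake", "has_age", "12")}


def merge_graphs(*graphs):
    """Union several sets of triples into one graph.

    A set has no inherent order and drops duplicates, so the result does
    not depend on which source happens to arrive first.
    """
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


merged = merge_graphs(SMITH, JONES, GEORGE, BLAKE)
for triple in sorted(merged):
    print(triple)
```

Sorting is only for display; the merged graph itself is unordered, which matches the remark that result order may vary.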

  • Use SparQL to find the predicates
  • This SparQL example query simply asks for a list
    of all the unique predicates that occur in all
    the triples:
  • select distinct ?p
  • from <http://discern...8421/friend-network.rdf>
  • where {
  • ?s ?p ?o .
  • }
  • If you don't use distinct you will get multiple
    occurrences of the same predicate.
  • This can be very useful when you are trying to
    figure out what predicates are available to
    interrogate a triplestore that you don't know
    much about.

  • SparQL (incomplete) basic syntax
  • SELECT some_variable_list
  • FROM <some_RDF_source_URI>
  • WHERE {
  • some_n3_triple_pattern .
  • another_n3_triple_pattern .
  • }
  • Notes
  • - the < and > characters are required.
  • - other query forms (CONSTRUCT, ASK, DESCRIBE)
    may be used in place of SELECT.
  • - * is a valid variable list, specifying every
    variable included in a triple pattern, and the
    list may be preceded by DISTINCT, which will
    prevent duplicate result rows.
  • - there may be multiple FROM clauses, whose
    targets will be combined and treated as a single
    graph.
  • - a "." separating multiple triple patterns is
    intuitively similar to a natural language "and",
    but actually behaves like an SQL natural join.

  • Optional clauses in SparQL queries
  • Clauses permitted within a where clause:
  • optional { triple_pattern } identifies a triple
    that need not appear in an RDF target; its
    absence will not prohibit a pattern match.
  • filter restricts variable matches in the
    preceding triple to specified filter patterns, as
    in
  • ?s ?p ?date FILTER ( ?date >
    "2005-01-01T00:00:00Z"^^xsd:dateTime )
  • or
  • ?s ?p ?d FILTER ( xsd:dateTime( ?d ) <
    xsd:dateTime( "2005-01-01T00:00:00Z" ) )
  • or
  • ?s ?p ?name FILTER regex( ?name, "smi",
    some_flag )
  • union: where clauses may be constructed as
  • { triple_pattern_1 } UNION { triple_pattern_2 }
  • and any RDF element matching either of these
    triples will be included in the resulting output.
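
The behavior of optional and filter can be imitated in Python to show what each one contributes. This is a rough analogy under simplifying assumptions: the dictionaries `AGES` and `NICKNAMES` are invented sample data, the function name is hypothetical, and `None` marks a variable that the optional pattern left unbound.

```python
import re

AGES = {"Smith": 21, "Jones": 45, "Blake": 12, "George": 21}
NICKNAMES = {"Smith": "Smitty"}  # invented data: most people have no nickname


def people_with_optional_nickname(name_pattern):
    """Imitate a query of the shape

        ?person has_age ?age .
        OPTIONAL { ?person nickname ?nick }
        FILTER regex( ?person, name_pattern, "i" )

    and return (person, age, nickname-or-None) rows.
    """
    rows = []
    for person, age in AGES.items():
        # FILTER: discard solutions whose name fails the regular expression.
        if not re.search(name_pattern, person, re.IGNORECASE):
            continue
        # OPTIONAL: keep the row even when no nickname triple exists.
        rows.append((person, age, NICKNAMES.get(person)))
    return rows


print(people_with_optional_nickname("smi"))  # prints [('Smith', 21, 'Smitty')]
print(people_with_optional_nickname("o"))    # prints [('Jones', 45, None), ('George', 21, None)]
```

Without the optional behavior, Jones and George would disappear from the second result because they have no nickname; the filter, by contrast, removes rows outright.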

  • Some useful SparQL pattern patterns
  • Display two property values of some entity
    (<some_URI>) on the same line:
  • select *
  • where {
  • <some_URI> <some_predicate> ?o .
  • <the_same_URI> <some_other_predicate> ?o1 .
  • }
  • Example using the friend information and PREFIX:
  • PREFIX example: <http://...>
  • PREFIX info: <http://...>
  • select
  • where

  • Some more useful SparQL pattern patterns
  • Merge results of 2 pattern matches into a single
    output column:
  • select *
  • where {
  • { <some_URI> <some_predicate> ?o . }
  • UNION
  • { <some_other_URI> <some_other_predicate> ?o . }
  • }
  • Example:
  • PREFIX example: <http://...>
  • PREFIX info: <http://...>
  • select

  • Some more useful SparQL pattern patterns
  • Slowly find all triples whose object components
    mention hexokinase:
  • select *
  • where {
  • ?s ?p ?o . FILTER regex( ?o, "hexokinase" ) .
  • }
  • Quickly find all entries with object components
    mentioning hexokinase, but works only through a
    Virtuoso SparQL endpoint when applied to indexed
    graphs (and will return nothing when applied to a
    non-indexed graph):
  • select *
  • where {
  • ?s1 ?p1 ?o1 .
  • ?o1 bif:contains "hexokinase" .
  • }

  • SparQL desktop client Twinkle (version of the
    upward paths query)

  • SparQL desktop client RDF-gravity (using the
    friend data)

  • SparQL desktop client Explorator RDF explorer
  • The Explorator can download (extracts from)
    multiple RDF resources, and manipulate them in
    combination. Here with the Russian lakes example.
  • This approach provides an interface using a set
    algebra model of data manipulation. (See Araujo,
    et al. and http//

  • Jena
  • The Java-based Jena package from HP Labs allows
    users to manipulate and query RDF graphs. You
    can write a program that uses Jena classes to
  • - retrieve and parse an RDF file containing a
    graph or a collection of graphs,
  • - store it in memory,
  • - examine each triple in turn, examine one
    component (say, the subject) of each triple in
    turn, or examine only triples that meet
    specified criteria, and
  • - write a serialized version of a graph to a
    file or STDOUT.
  • For example, one might examine each stored triple
    searching for a specific reference URI, or for a
    specific literal value, as with a search for
    triples containing a specific value,
    "21"^^xsd:age, in their object portions.
  • An RDF graph is stored in Jena as a model, and
    a Jena model is created by a factory, as in
  • Model m = ModelFactory.createDefaultModel();
  • Once a model has been defined, Jena can populate
    it by reading data from files, backend data
    bases, etc. in various formats, and once it has
    been populated, Jena can perform set operations
    on pairs of populated models and/or search models
    for specific values or combinations (patterns) of

  • Jena
  • For example, there are several methods for
    creating iterators over a model to access
    specific components. Iterators may be built by
  • - listing the components of each triple
  • - model.listSubjects()
  • - model.listObjects()
  • - comparing a specific component with a
    specified value, as in
  • model.listSubjectsWithProperty( Property p,
    RDFNode object )
  • which will get you a collection of subjects
    possessing property/predicate p and specific
    value object
  • - comparing all components against specific
    values in 2 steps
  • - construct a selector possessing specific
    values s, p and o

  • Preeminent Linked Data resources
  • The DBpedia and Bio2RDF
  • The DBpedia is a community effort to extract
    structured information from Wikipedia and to make
    this information available on the Web
  • DBpedia currently holds over 200 million triples,
    harvested by scraping the Infoboxes included
    within the Wikipedia.
  • The DBpedia is currently housed in an OpenLink
    Virtuoso Universal Database, which can store
    relational, object, XML, and semantic
  • Details at http//

  • Bio2RDF Atlas of postgenomic knowledge
  • Bio2RDF integrates (extracts from) some 40
    biomedical information resources (such as GO,
    Uniprot, etc.) recoded in RDF (> 2 Gtriples)
  • - currently runs over the Virtuoso Universal
    Database server at
  • http//
  • but each resource has its own SparQL endpoint,
    in addition to
  • the endpoint accessing the unified triplestore
  • http//
  • - a list of included resources is at
  • (http//
  • and includes links to the SparQL endpoint for
    each resource,
  • as well as descriptions of the resource contents
    and triple counts.
  • - raw text N3 formats for this data use
    around 1 TB, but install in
  • much less space within Virtuoso (perhaps
    100 GB).
  • - there is also a Bio2RDF proxy service that
    takes queries and

  • Resources included in Bio2RDF
  • (downloadable from http//
  • PubMed INOH
  • GeneID IProClass
  • UniProt MGI
  • UniRef CellMap
  • UniParc BioPAX
  • Kegg Pathway InterPro
  • CPATH Pfam
  • Reactome PROSITE
  • Biocyc Protein
  • MeSH SID
  • CPD Kegg Ligand for chemical compound PubChem
  • GL Kegg Ligand for carbohydrate structure UniSTS
  • EC Homologene

  • Bio2RDF resources
  • (Edge width is proportional to link density.)

  • SparQL endpoints
  • Triplestores like the Virtuoso Universal Database
    Server publish SparQL endpoints that will take
    SparQL queries through several interfaces.
  • For example you can query the DBpedia through a
    Virtuoso SparQL endpoint at
  • http//
  • by sending SparQL queries
  • - encoded in URLs addressed to the triplestore
    endpoint, like
  • http://...
  • WHERE { ?s ?p ?o .
  • ?o bif:contains Goethe_Johann_Wolfgang . }
  • - entered into Web forms that present text areas
    into which one can
  • enter queries, as on the next pages
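
Embedding a query in a URL mostly comes down to percent-encoding it into a query-string parameter; the SparQL protocol names that parameter `query`. The sketch below only builds the URL and sends nothing over the network. The endpoint address is a placeholder, and `build_query_url` is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ENDPOINT = "http://example.org/sparql"  # placeholder, not a real endpoint


def build_query_url(endpoint, sparql):
    """Return a GET URL that carries the query text in the `query` parameter."""
    return endpoint + "?" + urlencode({"query": sparql})


sparql = "SELECT * WHERE { ?s ?p ?o . } LIMIT 10"
url = build_query_url(ENDPOINT, sparql)
print(url)

# Decoding the URL again recovers the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == sparql)  # prints True
```

Characters such as `?`, `{`, `}` and spaces are not safe inside a URL, which is why hand-typed query URLs tend to look unreadable; letting a library do the encoding avoids mistakes.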

  • The SparQL interface to DBpedia

  • The iSparQL Advanced interface to DBpedia

  • The iSparQL QBE interface to DBpedia (close up)
  • Here is the same query in graphical form as
    constructed using the iSparql QBE interface

  • The iSparQL QBE interface to DBpedia

  • Results from the iSparql text and/or QBE queries

  • Using SparQL to get RDF extracts
  • Suppose you want to build a local RDF triplestore
    from DBpedia containing only the Goethe entries,
    or import these entries into some other desktop
    client like the Explorator.
  • Documents returned by SparQL select queries are
    usually not RDF documents. They may not have
    triples, and they are usually structured for
    display or storage in HTML, Excel or some other
  • You can use the CONSTRUCT command (in place of
    SELECT) within a SparQL query to build a proper
    RDF formatted response:
  • construct {
  • <http://..._Goethe> ?p ?o
  • }
  • where {
  • <http://..._Goethe> ?p ?o .
  • }
  • The structure of the triple to be created is
    specified in the construct clause.

  • Ontologies
  • The term ontology is used in different ways by
    different people.
  • Pidcock writes that "People use the word to mean
    different things, e.g. glossaries and data
    dictionaries, thesauri and taxonomies, schema and
    data models, and formal ontologies and
  • And Uschold writes "An ontology may take a
    variety of forms, but necessarily it will include
    a vocabulary of terms, and some specification of
    their meaning. . . This includes definitions and
    an indication of how concepts are inter-related
    which collectively impose a structure on the
    domain and constrain the possible interpretations
    of terms."

  • The DBpedia ontology
  • The DBpedia ontology is a shallow, cross-domain
    ontology. In
  • http://...
  • it appears as a tree with maximum depth of 4.
  • The top-level class is Thing, and the first
    sublevel classes are Person, Organization,
    Anatomical structure, Place, Species, etc.
  • The next level under Person includes Scientist,
    College Coach, Monarch, Politician, etc.
  • Some classes are also assigned properties. For
    example, a species may have Order and Family
    properties (even though an organism's Order and
    Family could be inferred from its position in the
    (ontology that is the) evolutionary tree).

  • Wikipedia Infoboxes
  • The DBpedia gets its information from the
    Wikipedia Infoboxes, such as this one for
    Johann Wolfgang von Goethe that appears on his
    Wikipedia page.
  • Infobox contents are mapped to DBpedia ontology
    classes and properties, which are used as RDF
  • Here the Goethe resource is
  • http//
  • Johann_Wolfgang_von_Goethe
  • and you know how to find all the predicates and
    objects by now?

  • The DBpedia ontology
  • Here is a query to find all the Places known to
  • select distinct *
  • where {
  • ?s a <http://...> .
  • }
  • limit 1000
  • And a query to find every person's birth info:
  • select ?s ?o
  • where {
  • ?s a <http://...> .
  • ?s <http://...> ?o
  • }
  • limit 1000

  • The DBpedia faceted browser
  • DBpedia ontology and property components are
    displayed in the left column and can be used to
    define filters for viewing content.

  • GO An example biomedical ontology (or 3)
  • Consider the Gene Ontology, very widely used in,
    and probably crucial for, bioinformatics and
    biological research.
  • The Gene Ontology actually has 3 major
    components, or separate sections for defining
    terms related to
  • - Biological Process,
  • - Cellular Component (physical structures or
  • locations within biological cells), and
  • - Molecular Function,
  • each of which defines several thousand terms.

  • A small portion of the Molecular Function portion
    of the Gene Ontology Directed Acyclic Graph (DAG)

  • Find parents of GO:0004003 in the example GO DAG
    using a SparQL query
  • select *
  • where {
  • <http://...>
  • <http://...>
  • ?parent .
  • }
  • Result
  • -----------------------------------
  • parent
  • <http://...>
  • <http://...>
  • Find all 3-element paths up from GO:0004003
  • PREFIX go: <http://...>
  • select *
  • where {
  • <http://...>
  • go:is_a
  • ?a .
  • ?a go:is_a ?b .
  • ?b go:is_a ?c .
  • }
  • Note the use of PREFIX to define the abbreviation
    go:, for which the full URI will be substituted.
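
The three chained is_a patterns amount to walking three steps up the graph, with each step free to branch. A minimal Python sketch of the same idea follows; the term names `t0` to `t4` are made-up placeholders rather than real GO identifiers, and `parents` and `three_step_paths` are hypothetical helpers.

```python
# (child, parent) pairs standing in for is_a edges of a small DAG.
IS_A = [
    ("t0", "t1"),
    ("t0", "t2"),
    ("t1", "t3"),
    ("t2", "t3"),
    ("t3", "t4"),
]


def parents(term):
    """Return every direct parent of `term`, in edge order."""
    return [parent for child, parent in IS_A if child == term]


def three_step_paths(start):
    """Return all (a, b, c) with: start is_a a, a is_a b, b is_a c."""
    return [
        (a, b, c)
        for a in parents(start)
        for b in parents(a)
        for c in parents(b)
    ]


print(three_step_paths("t0"))  # prints [('t1', 't3', 't4'), ('t2', 't3', 't4')]
```

Because a term in a DAG can have more than one parent, a single starting term can yield several paths, which is why the query returns multiple rows.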

  • Find all 3-element paths up from GO:0004003 using
    the example GO DAG
  • a            b            c
  • http://...   http://...   http://...
  • http://...   http://...   http://...
  • http://...   http://...   http://...
  • http://...   http://...   http://...

  • Find all 3-element paths up from GO:0004003 using
    SQL
  • select
  • a.parent_id, b.parent_id, c.parent_id
  • from
  • GO.molecular_function_DAG a
  • join
  • GO.molecular_function_DAG b
  • on
  • a.parent_id = b.child_id
  • join
  • GO.molecular_function_DAG c
  • on
  • b.parent_id = c.child_id
  • where
  • a.child_id like 'GO:0004003'
  • This query is posed as a series of joins on the
    GO.molecular_function_DAG just as the SparQL
    version uses structures like

  • Auer and Lehmann asked
  • What DO Innsbruck and Leipzig have in common?
  • . . .or to be more exact
  • What query will reveal what properties 2 entities
    have in common?
  • select *
  • where {
  • < . . . Innsbruck> ?p ?o .
  • < . . . Leipzig> ?p ?o .
  • }
  • will direct the resolver to find every
    characteristic of each city and see which
    characteristics are shared by both cities.
  • This doesn't have an equivalent in SQL because
    you can't treat table and column names as
    variables in SQL.
  • (You can of course get around this by using
    system tables, or by storing all your data
    normalized as a single table containing 3
    columns, which might not be a bad idea in some
    unusual circumstances.)
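
Seen from the data side, the "in common" query is a set intersection over (predicate, object) pairs. The Python sketch below uses invented entities and properties (`CityA`, `CityB` and their values are placeholders, not facts about Innsbruck or Leipzig), and `shared_properties` is a hypothetical helper.

```python
# Invented sample triples; none of these are real facts about real cities.
TRIPLES = [
    ("CityA", "type", "City"),
    ("CityA", "country", "X"),
    ("CityA", "hasUniversity", "yes"),
    ("CityB", "type", "City"),
    ("CityB", "country", "Y"),
    ("CityB", "hasUniversity", "yes"),
]


def properties_of(triples, subject):
    """Collect the (predicate, object) pairs asserted about one subject."""
    return {(p, o) for (s, p, o) in triples if s == subject}


def shared_properties(triples, first, second):
    """Return the (predicate, object) pairs that both subjects have."""
    return properties_of(triples, first) & properties_of(triples, second)


common = shared_properties(TRIPLES, "CityA", "CityB")
print(sorted(common))  # prints [('hasUniversity', 'yes'), ('type', 'City')]
```

The predicate is treated as ordinary data here, which is the point of the comparison with SQL: nothing in the code needs to know in advance which properties exist.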

  • Auer and Lehmann asked
  • What DO Innsbruck and Leipzig have in common?
  • . . .or to extend this train of thought
  • What query will reveal what properties Innsbruck
    and Leipzig do NOT have in common?
  • And can these ideas be extended to notions of
    semantic similarity or semantic distance
    between resources?
  • Or extended to a notion of semantic clustering?
  • We might want to ask questions like
  • Which cities are most like Innsbruck?
  • Which cities are most unlike Innsbruck?
  • Which cities are more like Innsbruck than any
    other city?
  • How can we cluster cities into functional

  • What do go:0004145 and go:0004059 have in common?
  • select *
  • where {
  • <http://...> ?predicate ?object .
  • <http://...> ?predicate ?object .
  • }
  • --------------------------------------------------
  • predicate    object
  • --------------------------------------------------
  • http://...   http://...
  • http://...   http://...
  • --------------------------------------------------

  • Evaluating the semantic approach?
  • The semantic approach is complicated, often
    produces ugly-looking and slow results, and new
    tools emerge like Topsy . . .
  • . . . but it allows users to do some things more
    easily than they can be done using the relational
  • Information stored in sentences is easier for
    (some) users to understand, and extract relevant
  • Data merged with metadata makes metadata easy to
  • Being sentence-based, SparQL may be more
    intuitive (and more declarative?) than SQL, and
    may more easily support the use of ontologies and
  • Distributed information can be more easily
    utilized: users can access multiple RDF documents
    in a single SparQL query, and even browse
    distributed RDF sources as part of the LDW or
  • Information resources can often be more easily
    integrated. Since no unified storage schema is
    required, RDF versions of multiple resources can
    be manipulated within the same triplestore, and
    ontologies may be exploited in a more natural
  • Some types of queries are much more easily
    composed than they could be in SQL (Leipzig and
    Innsbruck).
  • Conclusions?
  • The usefulness of the semantic approach is
    difficult to evaluate, but it is safe to say the
    relational model is not going away.
  • Use/value depends on who's doing what, using what
    information, over what platforms, and how usage
    patterns will vary over time (and they will!).
  • The semantic approach appears to be especially
    useful for integrating information resources and
    for finding connections/relationships, but
    integrating resources is not straightforward (see
    Satoo, et al. and Antezana, et al. for examples),
    nor is quantifying connectivity.
  • You may need to differentiate between the
    semantic approach itself, and the distributed
    capabilities of the Semantic Web. (Do the RDF
    warehouses contradict the underlying intent to
    support distributed information resources?)
  • Where and how should metadata/semantics be
    injected into the data stack? (caBIG does it
  • Where and how should ontologies be applied in
    information management? (Using ontologies is
    mostly orthogonal to RDF proper, but see Renear,
    et al.)
  • What kinds of relational/semantic technology
    integrations are possible? Which will prove
  • If we have an RDF version of Wikipedia, can we
    have an ontology-enabled, RDF version of Pubmed?
    (Consider Enju and TexFlame.)

  • A long term role for the semantic approach?
  • This from the Oracle Semantic Technologies Center
    (circa 2001!)
  • "By the end of this short paper, the reader
    should understand the overall superiority of
    Semantic Web technologies and be able to describe
    why it is very likely that they will be embedded
    in the fabric of nearly all data-intensive
    software within several years."
  • -- Jeff Pollock, Oracle
  • http//

  • A long term role in scholarly communication?
  • The Concept Web, according to their Web site
  • a dynamic, interactive fabric of concepts and
    their relationships. The Concept Web is
    constructed from, inter alia, research
    literature, Internet databases and other web
    sites together with off-line resources.
  • Mission of the Alliance
  • To enable an open collaborative environment to
    jointly address the challenges associated with
    high volume scholarly and professional data
    production, storage, interoperability and
    analyses for knowledge discovery.
  • Specific goals
  • The development and refinement of ways to
    capture information in Semantically Rich
    Triples, and to store, manage, and query such
  • "The big issue we have here is this perverse
    situation in publishing, or in formal scholarly
    communication, where researchers take data,
    convert it into narrative form, and then employ
    really complex text-mining tools based on complex
    natural language processing . . . to try and
    turn this stuff into data again." (Bilder, 2009)

  • For more information, see
  • Antezana, Erik, et al., "BioGateway: a semantic
    systems biology tool for the life sciences," BMC
    Bioinformatics, 2009.
  • http://...
  • Auer, Soren and Jens Lehmann, "What do Innsbruck
    and Leipzig have in common? Extracting Semantics
    from Wiki Content," European Semantic Web
    Conference (ESWC), 2007.
  • Bilder, Geoffrey, Conceptweblog Conference
    Podcast, Concept Web Alliance Inaugural Meeting,
    May 2009. http://...
  • Bizer, Christian, Tom Heath, Tim Berners-Lee,
    "Linked Data--The story so far."
  • http://...
  • Herper, Matthew, Forbes Magazine, November 10,
    2008. http://...
  • Grobe, Michael, "RDF, Jena, SparQL, and the
    Semantic Web," SIGUCCS, 2009.
  • Araujo S., Schwabe D., Barbosa S., "Experimenting
    with Explorator: a Direct Manipulation Generic
    RDF Browser and Querying Tool," Visual Interfaces
    to the Social and the Semantic Web (VISSW 2009),
    Sanibel Island, Florida, February 2009.
  • http://...