Self-organization and the Semantic Web

1
Self-organization and the Semantic Web
  • Steffen Staab
  • New Trends in Semantic Web
  • December 2, 2004

2
Estimations of Data Sizes
  • My personal data: about 30 GByte
  • SAP: 10^4 tables
  • Large insurance company: 5000 databases
  • Google: 8,000,000,000 URLs
  • About 90% of web content comes from underlying
    databases
  • 95% of data is not in databases (files, etc.)

3
Data Integration Purpose
  • ERP: 10^4 tables
  • WWW: 10^10 documents

Find & Condense Content
  • eLearning: 10^6 schools, colleges ...
  • Email (Staab): 24874
  • Content Management: 10^6 documents
  • Laptop file system: 17150 data files

4
Data Integration Capabilities
Self-organising systems
  • Manual data integration technology and
    maintenance are feasible for up to 10^2 databases

5
Dimensions of Self-organization
  • Peer-to-Peer-like systems
  • Ontology Learning & Population
  • Automatic mapping
  • Self-adaptive query routing
  • Peer-to-peer services
  • Autonomy
  • Terminology
  • Terminology mapping
  • Query routing
  • Self-organising services

7
The OL Layer Cake
  • Rules
  • Relations: cure(dom:DOCTOR, range:DISEASE)
  • Concept Hierarchies: is_a(DOCTOR, PERSON)
  • Concepts: DISEASE (disease, illness)
  • Terms: disease, illness, hospital
8
The ontology population/semantic annotation
problem in 4 cartoons
9
The annotation problem from a scientific point
of view
10
The annotation problem in practice
11
The vicious cycle
12
Current State-of-the-art
  • Large-scale IE: SemTag & Seeker @ WWW'03
  • only disambiguation w.r.t. TAP
  • Standard IE (MUC)
  • need for handcrafted rules
  • ML-based IE (e.g. Amilcare @ OntoMat, MnM)
  • need for a hand-annotated training corpus
  • does not scale to large numbers of concepts
  • rule induction takes time
  • KnowItAll (Etzioni et al., WWW'04)
  • shallow (pattern-matching-based) approach

13
The Self-Annotating Web
  • There is a huge amount of implicit knowledge in
    the Web
  • Make use of this implicit knowledge together with
    statistical information to propose formal
    annotations and overcome the vicious cycle
  • semantics ≈ syntax + statistics?
  • Annotation by maximal statistical evidence

PANKOW: Pattern-based Annotation by Knowledge On
the Web
14
A small quiz
What is Laksa?
A: dish
B: city
C: temple
D: mountain
15
Asking Google!
  • "cities such as Laksa": 0 hits
  • "dishes such as Laksa": 10 hits
  • "mountains such as Laksa": 0 hits
  • "temples such as Laksa": 0 hits
  • Google knows more than all of you together!
  • Example of using syntactic information plus
    statistics to derive semantic information

16
Patterns
  • HEARST1: <CONCEPT>s such as <INSTANCE>
  • HEARST2: such <CONCEPT>s as <INSTANCE>
  • HEARST3: <CONCEPT>s, (especially/including) <INSTANCE>
  • HEARST4: <INSTANCE> (and/or) other <CONCEPT>s
  • Examples
  • dishes such as Laksa
  • such dishes as Laksa
  • dishes, especially Laksa
  • dishes, including Laksa
  • Laksa and other dishes
  • Laksa or other dishes

17
Patterns (Contd)
  • DEFINITE1: the <INSTANCE> <CONCEPT>
  • DEFINITE2: the <CONCEPT> <INSTANCE>
  • APPOSITION: <INSTANCE>, a <CONCEPT>
  • COPULA: <INSTANCE> is a <CONCEPT>
  • Examples
  • the Laksa dish
  • the dish Laksa
  • Laksa, a dish
  • Laksa is a dish
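
The pattern slides can be read as string templates that are filled with an instance and a candidate concept. The following is a minimal sketch of that instantiation step, not the PANKOW implementation itself: the names `PATTERNS`, `pluralize` and `instantiate` are hypothetical, HEARST3 and HEARST4 are split into their two variants, and the pluralization rule is deliberately naive.

```python
# Hypothetical templates mirroring the pattern names used on the slides.
PATTERNS = {
    "HEARST1": "{concept_pl} such as {instance}",
    "HEARST2": "such {concept_pl} as {instance}",
    "HEARST3a": "{concept_pl}, especially {instance}",
    "HEARST3b": "{concept_pl}, including {instance}",
    "HEARST4a": "{instance} and other {concept_pl}",
    "HEARST4b": "{instance} or other {concept_pl}",
    "DEFINITE1": "the {instance} {concept}",
    "DEFINITE2": "the {concept} {instance}",
    "APPOSITION": "{instance}, a {concept}",
    "COPULA": "{instance} is a {concept}",
}


def pluralize(noun):
    """Naive English plural; an assumption, good enough for the toy concepts."""
    if noun.endswith(("s", "sh", "ch", "x")):
        return noun + "es"
    if len(noun) > 1 and noun.endswith("y") and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"
    return noun + "s"


def instantiate(instance, concept):
    """Return one query phrase per pattern for an (instance, concept) pair."""
    return {
        name: template.format(
            instance=instance, concept=concept, concept_pl=pluralize(concept)
        )
        for name, template in PATTERNS.items()
    }


phrases = instantiate("Laksa", "dish")
print(phrases["HEARST1"])  # dishes such as Laksa
print(phrases["COPULA"])   # Laksa is a dish
```

Each generated phrase would then be sent to a search engine as an exact-phrase query; that step is left out here because it depends on an external service.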

18
PANKOW Process
19
Asking Google (more formally)
  • Instance i ∈ I, concept c ∈ C, pattern p ∈
    {Hearst1, ..., Copula}: count(i,c,p) returns the
    number of Google hits of the instantiated pattern
  • E.g. count(Laksa,dish) := count(Laksa,dish,def1) + ...

  • Restrict to the best ones beyond a threshold
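
The counting scheme above can be sketched as a sum over patterns followed by a thresholded argmax. This is an illustrative sketch only: the `count` callable stands in for a search-engine hit count (here backed by a toy dictionary that reuses the numbers from the "Asking Google!" slide), and the function names and the strict "greater than threshold" rule are assumptions.

```python
def aggregate(instance, concept, patterns, count):
    """count(i, c) := sum over all patterns p of count(i, c, p)."""
    return sum(count(instance, concept, p) for p in patterns)


def best_concept(instance, concepts, patterns, count, threshold):
    """Pick the concept with maximal evidence, if it lies beyond the threshold."""
    scored = [(aggregate(instance, c, patterns, count), c) for c in concepts]
    total, concept = max(scored)
    return (concept, total) if total > threshold else None


# Toy hit counts; a real system would query a search engine here.
TOY_HITS = {("Laksa", "dish", "HEARST1"): 10}


def toy_count(instance, concept, pattern):
    return TOY_HITS.get((instance, concept, pattern), 0)


candidates = ["city", "dish", "mountain", "temple"]
print(best_concept("Laksa", candidates, ["HEARST1", "COPULA"], toy_count, 5))
```

Passing the hit counter in as a parameter keeps the selection logic independent of any particular search API.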

20
Examples
Instance        Concept         Count
Atlantic        city            1520837
Bahamas         island          649166
USA             country         582275
Connecticut     state           302814
Caribbean       sea             227279
Mediterranean   sea             212284
Canada          country         176783
Guatemala       city            174439
Africa          region          131063
Australia       country         128607
France          country         125863
Germany         country         124421
Easter          island          96585
St Lawrence     river           65095
Commonwealth    state           49692
New Zealand     island          40711
Adriatic        sea             39726
Netherlands     country         37926
St John         church          34021
Belgium         country         33847
San Juan        island          31994
Mayotte         island          31540
EU              country         28035
UNESCO          organization    27739
Austria         group           24266
Greece          island          23021
Malawi          lake            21081
Israel          country         19732
Perth           street          17880
Luxembourg      city            16393
Nigeria         state           15650
St Croix        river           14952
Nakuru          lake            14840
Kenya           country         14382
Benin           city            14126
Cape Town       city            13768
21
Evaluation Scenario
  • Corpus: 45 texts from
    http://www.lonelyplanet.com/destinations
  • Ontology: tourism ontology from the GETESS project
  • Concepts: original 1043; pruned 682
  • Manual annotation by two subjects
  • A: 436 instance/concept assignments
  • B: 392 instance/concept assignments
  • Overlap: 277 instances (Gold Standard)
  • A and B used 59 different concepts
  • Categorial (Kappa) agreement on 277 instances:
    63.5%
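
For readers unfamiliar with the Kappa statistic mentioned above, here is a small sketch of Cohen's kappa for two annotators. Whether the slide's figure was computed in exactly this way is an assumption, and the labels below are invented toy data, not the evaluation data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: both annotators pick the same concept independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy annotations for four instances (hypothetical).
annotator_a = ["city", "city", "country", "island"]
annotator_b = ["city", "country", "country", "island"]
print(round(cohen_kappa(annotator_a, annotator_b), 3))
```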

22
Results
23
Comparison
24
Dimensions of Self-organization
  • Peer-to-Peer-like systems
  • Ontology Learning & Population
  • Automatic mapping
  • Self-adaptive query routing
  • Peer-to-peer services
  • Autonomy
  • Terminology
  • Terminology mapping
  • Query routing
  • Self-organising services

25
Bibliography Use Case
I am searching for publications about Semantics.
Do you have items about Semantics?
Bibster Network
I know a peer sharing metadata about Semantics.
26
Bibster Screenshot
Open Source: http://bibster.sourceforge.net/
27
Sample BibTeX Entry
    @ARTICLE{codd81relational,
      author  = {Edgar F. Codd},
      title   = {The capabilities of relational
                 database management systems},
      journal = {IBM Research Report, San Jose,
                 California},
      volume  = {RJ3132},
      year    = {1981}
    }

28
Sample Entry
29
BIBSTER Lifecycle
  • Wrapping / Scraping
  • RDF Store: Sesame
  • SeRQL
  • INGA: Interest-based Node Grouping
    Architecture
  • Duplicate Detection
  • Generation of Data @ Peer
  • Storage @ Peer
  • Querying @ Peer
  • Query Routing in Network
  • Answering to Peer

30
  • Expertise-based Peer Selection

31
Expertise-Based Peer Selection
  • Expertise: abstract semantic description of the
    knowledge base of a peer, expressed using a
    shared ontology
  • Advertisements: promote semantic descriptions
    of expertise in the network
  • Peer Selection: ranks peers according to the
    similarity between their expertise and the query
    subject w.r.t. the shared ontology
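
Peer selection as described above amounts to scoring each peer's advertised expertise against the query subject and sorting. The sketch below uses a toy topic tree and a simple path-length similarity; both are assumptions chosen for illustration, not the similarity function used in Bibster.

```python
# Hypothetical toy fragment of a topic hierarchy (child -> parent).
PARENT = {
    "Information Systems": "Root",
    "Database Management": "Information Systems",
    "Information Storage and Retrieval": "Information Systems",
    "Computing Methodologies": "Root",
    "Artificial Intelligence": "Computing Methodologies",
    "Robotics": "Artificial Intelligence",
}


def ancestors(topic):
    """Topic followed by all of its ancestors up to the root."""
    path = [topic]
    while topic in PARENT:
        topic = PARENT[topic]
        path.append(topic)
    return path


def path_length(a, b):
    """Number of edges between two topics via their lowest common ancestor."""
    depth_in_b = {t: i for i, t in enumerate(ancestors(b))}
    for i, t in enumerate(ancestors(a)):
        if t in depth_in_b:
            return i + depth_in_b[t]
    raise ValueError("topics are not in the same hierarchy")


def similarity(a, b):
    """Assumed similarity: 1 for identical topics, decaying with distance."""
    return 1.0 / (1.0 + path_length(a, b))


def rank_peers(query_topic, expertise):
    """Sort peers by the best similarity between their topics and the query."""
    scores = {
        peer: max(similarity(query_topic, t) for t in topics)
        for peer, topics in expertise.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


expertise = {
    "peer1": {"Database Management"},
    "peer2": {"Robotics"},
    "peer3": {"Information Storage and Retrieval"},
}
print(rank_peers("Database Management", expertise))
```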

32
Expertise-Based Peer Selection
Similarity Function
Find articles by Codd about Database Management

Peer 1
Peer 2
33
Semantic topology
  • Advertising strategy determines
  • to whom to send advertisements (e.g. random,
    semantically close)
  • which advertisements to accept (e.g. all,
    semantically close)
  • Semantic topology: formed by the knowledge about
    the expertise of other peers
  • Idea: cluster peers with similar expertise
  • Route queries along the gradient of increasing
    similarity between expertise and query subject
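
Routing along a similarity gradient can be sketched as a hill climb over the peers each node knows about. Everything below is a toy: the peer names, the acquaintance lists, and the similarity table are invented, and the stop rule (halt when no known peer is strictly more similar) is an assumption.

```python
# Hypothetical pairwise topic similarities; identical topics score 1.0.
TOY_SIM = {
    frozenset(("Robotics", "Database Management")): 0.1,
    frozenset(("Artificial Intelligence", "Database Management")): 0.2,
    frozenset(("Information Systems", "Database Management")): 0.5,
}


def toy_sim(a, b):
    return 1.0 if a == b else TOY_SIM.get(frozenset((a, b)), 0.0)


def route(query_topic, start, known_peers, expertise, sim, max_hops):
    """Follow the known peer with the highest similarity while it improves."""
    path = [start]
    current = start
    for _ in range(max_hops):
        candidates = known_peers.get(current, [])
        if not candidates:
            break
        best = max(candidates, key=lambda p: sim(expertise[p], query_topic))
        if sim(expertise[best], query_topic) <= sim(expertise[current], query_topic):
            break  # local maximum of the gradient reached
        path.append(best)
        current = best
    return path


expertise = {
    "p_rob": "Robotics",
    "p_ai": "Artificial Intelligence",
    "p_is": "Information Systems",
    "p_db": "Database Management",
}
known_peers = {
    "p_rob": ["p_ai"],
    "p_ai": ["p_rob", "p_is"],
    "p_is": ["p_ai", "p_db"],
    "p_db": ["p_is"],
}
print(route("Database Management", "p_rob", known_peers, expertise, toy_sim, 5))
```

The `max_hops` argument corresponds to the "maximum number of hops" input parameter that appears later in the evaluation criteria.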

34
Semantic Topologies
[Figure: network of peers labeled with their expertise topics (Digital
Libraries, Database Management, Information Search and Retrieval,
Information Systems, Artificial Intelligence, Information Storage and
Retrieval, Robotics), showing the query "Find articles by Codd about
Database Management" and its query result.]
35
Simulation of the Scenario
  • DBLP data set (380440 publications)
  • Document classification using the ACM topic hierarchy
    (based on title); classified subset of 126247
    publications
  • Document Distribution
  • Topic distribution: one peer for each of the ACM
    topics (1287 peers)
  • Proceedings distribution: according to
    proceedings and journals (2335 peers)
  • Simulation Steps
  • Set up network topology
  • Advertise knowledge
  • Query processing

36
Evaluation Criteria
  • Output Parameters
  • Peer Selection (Peer Level)
  • Recall: how many of the relevant peers were
    reached?
  • Precision: how many of the reached peers were
    relevant?
  • Query Answering (Document Level)
  • Recall: how many of the relevant documents were
    returned?
  • Number of messages
  • Input Parameters
  • Distribution of documents
  • Peer selection function
  • Advertising strategy
  • Maximum number of hops
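
The peer-level output parameters reduce to set arithmetic over reached and relevant peers. A minimal sketch, with hypothetical peer identifiers:

```python
def precision_recall(reached, relevant):
    """Precision and recall of the reached peers against the relevant peers."""
    reached, relevant = set(reached), set(relevant)
    hits = len(reached & relevant)
    precision = hits / len(reached) if reached else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy example: four peers were reached, three peers hold relevant documents.
precision, recall = precision_recall({"p1", "p2", "p3", "p4"}, {"p2", "p4", "p5"})
print(precision, round(recall, 3))
```

The same function applies at the document level if the two sets contain document identifiers instead of peers.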

37
Hypotheses for Simulation
  • Expertise based selection is better than a naive
    broadcast approach based on random selection.
  • Using a shared ontology with a metric for
    semantic similarity improves the system compared
    with an approach with exact matches (e.g. keyword
    based)
  • Performance can be improved further, if the
    semantic topology reflects the semantic
    similarity of the expertise of the peers
  • The "perfect" topology: perfect results if the
    semantic topology coincides with a distribution
    of the documents according to the shared ontology

38
Experimental Settings
  • Setting 1: baseline, naively selects random
    peers
  • Setting 2: expertise-based selection using a
    similarity measure
  • Setting 3: peers accept advertisements that are
    semantically similar to their own expertise
  • Setting 4: perfect topology, where the topology
    coincides with the ACM topic hierarchy

39
Recall (Peer Selection)
40
Precision (Peer Selection)
41
Number of Messages
42
Simulation Results
43
Advertisement-based Approach
  • Expertise-based peer selection improves
    performance of peer selection by an order of
    magnitude
  • Ontology-based similarity measure allows further
    improvements
  • Semantic topology that mirrors the domain
    ontology yields best results
  • Test-driven in http://bibster.semanticweb.org

44
... many open questions
  • Still an eager approach
  • What about real data?
  • What about changes in the data?
  • Now a lazy approach!
  • Learning and Recommending Shortcuts in
    Semantic Peer-to-Peer Networks: INGA

45
Social expert network
I am searching for publications about Semantic
Web.
Bibster Network
Do you have items about Semantics?
Here is an entry of the book Handbook on
Ontologies.
Bootstrapping shortcut
Content shortcut
Expert's expert
Expert
Recommender shortcut
Expert's expert
I know a peer sharing metadata about Semantics.
46
Semantic overlay network
I am searching for publications about Semantic
Web.
Query independent shortcut
Content shortcut
Recommender shortcut
47
Semantic overlay network
I am searching for publications about Semantic
Web.
Content shortcut
48
Semantic overlay network
I am searching for publications about Logics.
Recommender shortcut
49
Semantic overlay network
I am searching for publications about Robotics.
Query independent shortcut
50
Semantic overlay network
I am new to the network and search for archeology.
Baseline (e.g. JXTA visibility)
51
Build content shortcut index
  • Send query using most promising available layer
    of semantic overlay topology
  • Evaluate result of query
  • Update shortcut index

52
Content Provider Shortcut Creation
53
Shortcut Index
54
Build recommender shortcut index
  • Active
  • When answers are returned, including the query
    message path
  • The last-but-one peer in the path is a recommender peer
  • Passive
  • Listen to incoming queries
  • If a query is relevant to one's interests, add the
    querying peer as recommender
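
The active and passive rules above can be sketched as two small functions. The path ordering (querying peer first, answering peer last), the minimum path length of three, the Jaccard similarity over keyword sets, and the threshold value are all assumptions made for this illustration.

```python
def jaccard(a, b):
    """Toy relevance measure: overlap of two keyword sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def active_recommender(message_path):
    """Last-but-one peer on an answer's path, assumed ordered querier -> answerer.

    With fewer than three peers the last-but-one peer would be the querier
    itself, so no recommender is returned.
    """
    if len(message_path) < 3:
        return None
    return message_path[-2]


def passive_recommender(query_terms, own_interests, querying_peer, threshold=0.5):
    """Add the querying peer as recommender if its query matches our interests."""
    if jaccard(query_terms, own_interests) > threshold:
        return querying_peer
    return None


print(active_recommender(["pq", "p7", "p3", "p9"]))
print(passive_recommender({"semantic", "web"}, {"semantic", "web", "ontology"}, "pq"))
```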

55
Recommender Shortcut Creation
56
Shortcut Index - 2
57
Query independent shortcut
58
Limit index size
  • Retain only a small number of shortcuts in the
    index (e.g. 40 in our experiments)
  • Delete based on least utility
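
A bounded index with least-utility eviction might look like the sketch below. The slide does not say how utility is measured, so accumulating a gain per useful answer is an assumption, as are the class and method names.

```python
class ShortcutIndex:
    """Keeps at most `capacity` shortcuts and evicts the least useful one."""

    def __init__(self, capacity=40):
        self.capacity = capacity
        self.utility = {}  # peer id -> accumulated utility

    def record(self, peer, gain=1.0):
        """Credit a peer for a useful answer, then enforce the size limit."""
        self.utility[peer] = self.utility.get(peer, 0.0) + gain
        while len(self.utility) > self.capacity:
            # Ties are resolved in favor of evicting the oldest entry.
            worst = min(self.utility, key=self.utility.get)
            del self.utility[worst]

    def peers(self):
        return set(self.utility)


index = ShortcutIndex(capacity=2)
for peer in ["a", "a", "b", "c"]:
    index.record(peer)
print(sorted(index.peers()))
```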

59
while forwarding/answering a query
  • Active forwarding of Pq.Bo: the current message
    contains Pq.Bo of the querying peer → compare
    against Pi.Bo and use it if better
  • Interest-based indexing: if similarity(query,
    content_i) > threshold, then add Pq to our list of
    recommender peers
  • Add own Pid to message

60
Query routing
  • Greedy search preferring query dependent
    shortcuts
  • Query independent and baseline shortcuts for
    fallback

Fireworks in regions of high similarity between
content and query
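
The layered selection described on this slide can be sketched as follows: query-dependent shortcuts (content, then recommender) are tried first, and query-independent and baseline shortcuts fill the remaining slots. The word-overlap similarity, the parameter names, and the values of `k` and `threshold` are assumptions; the "fireworks" step is not modeled.

```python
def word_overlap(a, b):
    """Toy similarity between a query and a shortcut's topic string."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def select_peers(query, content, recommender, independent, baseline,
                 k=3, threshold=0.5):
    """Greedy top-k selection over the shortcut layers, in priority order."""
    chosen = []

    def add(peers):
        for peer in peers:
            if peer not in chosen and len(chosen) < k:
                chosen.append(peer)

    # Query-dependent layers: map peer -> topic the shortcut was created for.
    for layer in (content, recommender):
        matching = [p for p in layer if word_overlap(query, layer[p]) > threshold]
        matching.sort(key=lambda p: word_overlap(query, layer[p]), reverse=True)
        add(matching)
    # Fallback layers are plain peer lists that ignore the query.
    add(independent)
    add(baseline)
    return chosen


content = {"p1": "semantic web"}
recommender = {"p2": "semantic web services"}
independent = ["p3"]
baseline = ["p4", "p5"]
print(select_peers("semantic web", content, recommender, independent, baseline))
print(select_peers("robotics", content, recommender, independent, baseline))
```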
61
Random contribution to query routing
  • Greedy search preferring query dependent
    shortcuts
  • Query independent and baseline shortcuts for
    fallback

Fireworks in regions of high similarity between
content and query
62
Experimental hypotheses
  • INGA performs at least as well in terms of recall
    as the naive algorithm, KUNWADEE
    (Sripanidkulchai et al.) and REMINDIN
  • INGA performs better in terms of messages per
    query than the naive algorithm, KUNWADEE and
    REMINDIN
  • The gain in efficiency can be attributed in equal
    measure to the different layers
  • A dynamic combination of query-dependent and
    query-independent search strategies reduces the
    number of messages consumed per query while
    retaining a high recall

63
Comparison of Query Routing Algorithms (recall)
64
Comparison of Routing Algorithms (messages)
65
Contribution of different layers (peer f-measure)
66
Contribution of different layers to message
reduction (messages)
67
Lessons learned
  • Focus on interest based shortcuts.
  • Interest based Listening
  • High Degree Shortcuts
  • Scrutinize the result messages of one's issued
    queries to create content provider and
    recommender shortcuts
  • Prefer a query dependent search strategy
  • Greedy
  • top-k
  • Use a highest out degree strategy for baseline
    selection

68
Relevant Publications
  • Peer-to-Peer-like systems
  • Ontology Learning
  • Automatic mapping
  • Self-adaptive query routing
  • Peer-to-peer services
  • ISWC-04 [1]
  • WWW-04 [1], ECAI-04,
  • SIGKDD Expl., WWW-05 [1] (submitted)
  • ISWC-04 [2], WWW-05 [2] (submitted)
  • WWW-04 [2], WWW-05 [3] (submitted)
  • EU IST Integrated Project Adaptive Services
    Grid

Thank You!
69
Thank You!