Ontology-Driven Automatic Entity Disambiguation in Unstructured Text


1
Ontology-Driven Automatic Entity Disambiguation
in Unstructured Text
  • Joseph Hassell, Boanerges Aleman-Meza, I. Budak
    Arpinar
  • Presented by Abhelaksh Thakur

2
What is the paper about?
  • Problem: Entity disambiguation
  • How to exploit background information for entity
    disambiguation?
  • Aim
  • To create a method to disambiguate entities
    within unstructured text by using clues in the
    text and exploiting metadata from an ontology.
  • Also, to provide an implementation of the method
    using a very large, real-world ontology.
  • Dataset used
  • Approach
  • Algorithm
  • Evaluation of the algorithm (Precision and Recall)
  • Related work
  • Future work

3
TERMS USED
  • Entity disambiguation
  • Ontology
  • Semantic web
  • RDF
  • DBLP
  • DBWorld

4
DATASET
  • DBLP: Digital Bibliography & Library Project
  • Contains bibliographic information for computer
    science researchers, journals and proceedings.
  • It indexes more than 725,000 articles and
    contains a few thousand links to home pages of
    computer scientists.
  • It provides two XML files:
  • The first contains objects such as authors,
    proceedings and journals.
  • The second contains lists of papers.
  • The first XML file was converted into RDF
    containing 3,079,414 entities.

5
DATASET(cont.)
  • DBWorld
  • DBWorld is a mailing list that carries information
    about upcoming conferences in the database
    field.
  • An HTML scraper was created that visits the
    DBWorld site and downloads only the posts that
    contain "Call for Papers", "Call for
    Participation" or "CFP" in the subject.
  • The system disambiguates the people listed in
    these postings and provides a URI to the
    corresponding entity in the ontology.
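
The subject filter described above can be sketched in a few lines. This is a minimal illustration, not the authors' scraper: the function name `is_cfp_subject`, the case-insensitive matching, and the sample subjects are all assumptions made for the example.

```python
import re

# The three keywords come from the slide; matching them case-insensitively
# (and treating "CFP" as a whole word) is an assumption of this sketch.
CFP_PATTERN = re.compile(
    r"call for papers|call for participation|\bcfp\b", re.IGNORECASE
)


def is_cfp_subject(subject: str) -> bool:
    """Return True if a post subject looks like a call for papers/participation."""
    return CFP_PATTERN.search(subject) is not None


# Hypothetical subjects, invented for illustration only.
subjects = [
    "CFP: Workshop on Semantic Data Management",
    "Call for Participation - Summer School on Databases",
    "Postdoc position in query optimization",
]
kept = [s for s in subjects if is_cfp_subject(s)]
print(kept)  # the first two subjects are kept, the job posting is dropped
```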

6
APPROACH
  • Aim: To find the correct entity among the
    various possible matches.

7
APPROACH (cont.)
  • Scenario: Disambiguating researchers by their
    names appearing in the DBWorld postings.
  • Spotting entity names.
  • Text proximity relationships.
  • Text co-occurrence relationships.
  • Popular entities.
  • Semantic relationships.

9
ENTITY DISAMBIGUATION ALGORITHM - Terms Used
  • es: confidence score
  • cf: initial confidence score
  • acf: initial abbreviated confidence score
  • pr: proximity score
  • co: text co-occurrence score
  • sr: semantic relationship score
  • pe: popular entity score

10
ENTITY DISAMBIGUATION ALGORITHM
  • Step 1: Spot entity names
  • es = cf / (no. of entities with the same label)
  • es = acf / (no. of related entities in the ontology)
  • The acf feature is used for abbreviated names and
    can be turned on or off.
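
A minimal sketch of the spotting step, assuming the ontology labels are available as a plain dictionary from label to entity URIs. The function name `spot_candidates`, the `ex:` URIs, and the sample text are hypothetical; the default `cf = 90` is the value listed on the input-settings slide. Plain substring matching stands in for whatever name spotting the authors actually used.

```python
def spot_candidates(text, label_index, cf=90.0):
    """Create one candidate per ontology entity whose label occurs in the text.

    Each candidate starts with es = cf / (number of entities sharing the label),
    so an unambiguous name starts at cf and an ambiguous one starts lower.
    """
    candidates = {}
    for label, uris in label_index.items():
        if label in text:  # naive substring spotting, an assumption of this sketch
            for uri in uris:
                candidates[uri] = cf / len(uris)
    return candidates


# Hypothetical label index: two ontology entities share the label "J. Smith".
label_index = {
    "J. Smith": ["ex:smith_1", "ex:smith_2"],
    "A. Jones": ["ex:jones_1"],
    "B. Lee": ["ex:lee_1"],
}
text = "Program committee: J. Smith, A. Jones"
print(spot_candidates(text, label_index))
# {'ex:smith_1': 45.0, 'ex:smith_2': 45.0, 'ex:jones_1': 90.0}
```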

11
  • Step 2: Spot literal values of text-proximity
    relationships
  • es = es + pr if a text-proximity relationship is found
    near the candidate entity
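
One way to read "near" is as a character window around the spotted name. The sketch below assumes that the charOffset setting (50) is the size of that window on each side and that the proximity score pr is added to es; both are interpretations of the slides, and the names and affiliation strings are invented for illustration.

```python
def proximity_bonus(text, name, literal, char_offset=50, pr=50.0):
    """Return pr if `literal` lies within `char_offset` characters of `name`.

    `literal` stands for a value related to the candidate in the ontology,
    for example an affiliation. Every occurrence of `name` is checked.
    """
    start = text.find(name)
    while start != -1:
        window_start = max(0, start - char_offset)
        window_end = start + len(name) + char_offset
        if literal in text[window_start:window_end]:
            return pr
        start = text.find(name, start + 1)
    return 0.0


# Hypothetical posting text and affiliations.
text = "PC members: J. Smith (Example University), A. Jones (Sample Institute)"
es = 45.0  # e.g. cf = 90 split between two candidates named "J. Smith"
es_match = es + proximity_bonus(text, "J. Smith", "Example University")
es_other = es + proximity_bonus(text, "J. Smith", "Other College")
print(es_match, es_other)  # 95.0 45.0
```

A fixed character window is crude: in a dense list of names, the affiliation of a neighbouring person can also fall inside the window.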

12
  • Step 3: Spot literal values of text co-occurrence
    relationships
  • es = es + co if a text co-occurrence relationship is
    found.
  • Step 4: Using popular entities
  • es = es + pe for those entities that have many
    relationships that were predefined as popular.

13
  • Step 5: Using semantic relationships
  • es = es + sr
  • a) Choose the candidate entity with the highest es and
    increase the es of entities that have semantic
    relationships (e.g. co-authorship) in the ontology
    with it.
  • b) Ignore candidate entities with low es.

14
  • Step 5 (contd.)
  • Threshold on es: es(t)
  • Iterate until at least one candidate entity has
    its es increased over es(t).
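
The iteration in Step 5 can be sketched as score propagation over the ontology's relationship graph. The stopping rule here is an interpretation of the slide, not a confirmed detail: each candidate whose es exceeds the threshold boosts its still-uncertain neighbours once, and the loop repeats as long as a boost lifts a new candidate over the threshold. The entity keys and the `related` map are hypothetical.

```python
def propagate_semantic(scores, related, sr=20.0, threshold=90.0):
    """Boost candidates that are semantically related to confident candidates.

    scores:  candidate -> es after the earlier steps
    related: candidate -> set of candidates it is related to (e.g. co-authors)
    """
    scores = dict(scores)  # work on a copy
    used = set()           # confident candidates that already boosted neighbours
    while True:
        confident = sorted(
            (e for e in scores if scores[e] > threshold and e not in used),
            key=lambda e: scores[e],
            reverse=True,  # highest es first, as on the slide
        )
        if not confident:
            return scores
        for entity in confident:
            used.add(entity)
            for neighbour in related.get(entity, ()):
                # Only candidates still at or below the threshold are boosted.
                if neighbour in scores and scores[neighbour] <= threshold:
                    scores[neighbour] += sr


# Hypothetical candidates: "a" is confident, "b" co-authored with "a" and "c".
scores = {"a": 95.0, "b": 75.0, "c": 60.0, "d": 30.0}
related = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
print(propagate_semantic(scores, related))
# {'a': 95.0, 'b': 95.0, 'c': 80.0, 'd': 30.0}
```

In this run "b" crosses the threshold thanks to "a", then boosts "c", which stays below the threshold; "d" has no relationships and is ignored.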

15
ENTITY DISAMBIGUATION ALGORITHM
  • Algorithm Disambiguation()
      for (each entity in ontology)
          if (entity found in document)
              create candidate entity
              es for candidate entity <- cf / (entities in ontology)
      for (each candidate entity)
          search for candidate entity's text proximity relationship
          if (text proximity relationship found near candidate entity)
              es for candidate entity <- es for candidate entity + pr
          search for candidate entity's text co-occurrence relationship
          if (text co-occurrence relationship found)
              es for candidate entity <- es for candidate entity + co
          if (ten or more popular entity relationships exist)
              es for candidate entity <- es for candidate entity + pe

16
SAMPLE OUTPUT OF THE ALGORITHM
17
EVALUATION OF THE ALGORITHM
  • Disambiguated dataset (A)
  • Set of entities found by our algorithm (B)
  • Precision = size of (A ∩ B) / size of (B)
  • Recall = size of (A ∩ B) / size of (A)
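
The two measures translate directly into set operations. The sketch below uses a small made-up example; the entity identifiers are hypothetical, and returning 0.0 for an empty set is a convention chosen here, not something the slides specify.

```python
def precision_recall(gold, found):
    """Compute precision and recall from two sets of entity identifiers.

    gold:  the manually disambiguated dataset (A)
    found: the entities returned by the algorithm (B)
    """
    correct = len(gold & found)                          # size of (A ∩ B)
    precision = correct / len(found) if found else 0.0   # convention for empty B
    recall = correct / len(gold) if gold else 0.0        # convention for empty A
    return precision, recall


gold = {"e1", "e2", "e3", "e4"}
found = {"e1", "e2", "e5"}
precision, recall = precision_recall(gold, found)
print(round(precision, 3), recall)  # 0.667 0.5
```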

18
INPUT SETTINGS
  Description                             Variable    Value
  -                                       charOffset  50
  Text proximity relationships            pr          50
  Text co-occurrence relationships        co          10
  Popular entity score                    pe          10
  Semantic relationship                   sr          20
  Initial confidence score                cf          90
  Initial abbreviated confidence score    acf         70
  Threshold                               threshold   90
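
To see how these settings interact, the sketch below combines the first four steps for a single candidate. It assumes that each kind of evidence adds its score to es, which is how the scores are read here, and it takes the "ten or more" popularity cut-off from the algorithm slide. The class `Settings`, the function `score_candidate`, and its parameters are names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Input settings from the slide (field names are this sketch's own)."""
    char_offset: int = 50
    pr: float = 50.0
    co: float = 10.0
    pe: float = 10.0
    sr: float = 20.0
    cf: float = 90.0
    acf: float = 70.0
    threshold: float = 90.0


def score_candidate(s, same_label_count, near_literal, cooccurring_literal,
                    popular_links):
    """Combine steps 1 to 4 for one candidate, assuming additive scoring."""
    es = s.cf / same_label_count   # step 1: split cf among same-label entities
    if near_literal:               # step 2: text-proximity relationship found
        es += s.pr
    if cooccurring_literal:        # step 3: text co-occurrence relationship found
        es += s.co
    if popular_links >= 10:        # step 4: ten or more popular relationships
        es += s.pe
    return es


settings = Settings()
strong = score_candidate(settings, 2, True, True, 12)   # 45 + 50 + 10 + 10
weak = score_candidate(settings, 2, False, False, 0)    # 45 only
print(strong, strong > settings.threshold)  # 115.0 True
print(weak, weak > settings.threshold)      # 45.0 False
```

With these values, a name shared by two ontology entities cannot pass the threshold of 90 on co-occurrence and popularity alone (45 + 10 + 10 = 65); it needs proximity evidence or boosts from semantic relationships.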

19
RESULTS
  • Correct disambiguations: 602 (A ∩ B)
  • Found entities: 620 (B)
  • Total entities: 758 (A)
  • Precision: 97.1%
  • Recall: 79.4%
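
As a check, the reported percentages follow from the counts above and the definitions of precision and recall:

```latex
\[
\text{Precision} = \frac{|A \cap B|}{|B|} = \frac{602}{620} \approx 0.971,
\qquad
\text{Recall} = \frac{|A \cap B|}{|A|} = \frac{602}{758} \approx 0.794
\]
```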

20
MEASURE OF PRECISION AND RECALL ON A PER-DOCUMENT BASIS
21
CHARACTERISTICS OF THE ALGORITHM
  • Performs well on unstructured text
  • Reduces need for string similarity computations.
  • Does not require training data

22
RELATED WORK
  • KIM
  • Automatic ontology population system.
  • Natural language processor / indexing.
  • SCORE
  • Semantic metadata management system.
  • Uses associations from a knowledge base.
  • It is a commercial system.
  • ESpotter
  • Uses a lexicon and/or patterns to recognize named
    entities.

23
FUTURE WORK
  • Integration of results of entity disambiguation
    into a more robust platform like UIMA
    (Unstructured Information Management Architecture)

24
THANK YOU..!!