Entity Resolution in Relational Data

Transcript and Presenter's Notes

Title: Entity Resolution in Relational Data


1
Entity Resolution in Relational Data
  • Indrajit Bhattacharya and Lise Getoor
  • University of Maryland, College Park

2
Entity Resolution
  • The Problem
  • Relational Entity Resolution
  • Algorithms
  • Graph-based Clustering (GBC-ER)
  • Probabilistic Model (LDA-ER)
  • Experiments
  • Current Projects

3
InfoVis Co-Author Network Fragment
4
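Hus Su / Hua Su
[Figure: co-author network before and after resolution]
5
L. Tweedie / Lisa Tweedie
[Figure: co-author network before and after resolution]
6
H. Dawkes / Huw Dawkes
[Figure: co-author network before and after resolution]
7
B. Spence / Bob Spence
[Figure: co-author network before and after resolution]
8
Bob Spence / Robert Spence
[Figure: co-author network before and after resolution]
9
Initial vs. Final
[Figure: co-author network before and after resolution]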
10
The Entity Resolution Problem
James Smith
John Smith
John Smith
Jim Smith
J Smith
James Smith
Jon Smith
Jonathan Smith
J Smith
Jonthan Smith
  • Issues
  • Identification
  • Disambiguation

11
The Entity Resolution Problem
James Smith
John Smith
John Smith
James Smith
Jim Smith
J Smith
J Smith
Jonathan Smith
Jon Smith
Jonthan Smith
  • Unsupervised clustering approach
  • Number of clusters/entities unknown a priori

12
Attribute-based Entity Resolution
Pair-wise classification (similarity score per reference pair)
  • J Smith / James Smith: ?
  • Jim Smith / James Smith: 0.8
  • J Smith / James Smith: ?
  • John Smith / James Smith: 0.1
  • James Smith / Jon Smith: 0.7
  • James Smith / Jonthan Smith: 0.05
  • Inability to disambiguate
  • Choosing threshold: precision/recall tradeoff
  • Perform transitive closure?
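
The threshold and transitive-closure questions above can be made concrete with a small sketch. It assumes pairwise scores are already available; the function name, the reference list and the scores below are hypothetical illustrations, not values taken from the slides.

```python
def transitive_closure(refs, scored_pairs, threshold):
    """Group references connected by any chain of pairs scoring >= threshold."""
    parent = {r: r for r in refs}

    def find(x):
        # Follow parent pointers to the cluster representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        if score >= threshold:
            parent[find(a)] = find(b)  # union the two clusters

    clusters = {}
    for r in refs:
        clusters.setdefault(find(r), set()).add(r)
    return sorted(clusters.values(), key=sorted)


# Hypothetical references and pairwise scores, for illustration only.
refs = ["James Smith", "Jim Smith", "John Smith", "Jon Smith"]
scored_pairs = [
    ("Jim Smith", "James Smith", 0.8),
    ("John Smith", "James Smith", 0.1),
    ("Jon Smith", "James Smith", 0.7),
    ("Jon Smith", "John Smith", 0.9),
]

loose = transitive_closure(refs, scored_pairs, threshold=0.6)
strict = transitive_closure(refs, scored_pairs, threshold=0.85)
```

With the lower threshold, John Smith and James Smith end up in the same cluster through Jon Smith even though their direct score is 0.1, which is the risk the transitive-closure bullet hints at. The higher threshold avoids that merge but also misses the Jim Smith / James Smith match.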

13
Relational Entity Resolution
  • References not observed independently
  • Links between references indicate relations
    between the entities
  • Co-author relations for bibliographic data
  • Use relations to improve identification and
    disambiguation

14
Relational Identification
Very similar names. Added evidence from shared
co-authors
15
Relational Disambiguation
Very similar names but no shared collaborators
16
Collective Entity Resolution Using Relations
One resolution provides evidence for another => joint resolution
17
Relational Constraints For Resolution
Co-authors are typically distinct
18
Entity Resolution
  • The Problem
  • Relational Entity Resolution
  • Algorithms
  • Graph-based Clustering (GBC-ER)
  • Probabilistic Model (LDA-ER)
  • Experiments
  • Conclusion

19
Similarity Measure For Clustering
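  • $\mathrm{sim}(c_i, c_j) = (1 - \alpha)\,\mathrm{sim}_{attr}(c_i, c_j) + \alpha\,\mathrm{sim}_{rel}(c_i, c_j)$
  • $\mathrm{sim}_{rel}$: relational similarity between clusters
  • $\mathrm{sim}_{attr}$: attribute similarity between clusters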
  • Attribute Similarity: Compare attributes of
    individual references in the two clusters
  • Name: Single-Valued Attribute
  • Cluster Similarity Metric / Representative
    Attribute
  • Jaro / Jaro-Winkler / Levenshtein similarity with
    TF-IDF weights
  • Multi-Valued Attributes
  • Countries, Addresses, Keywords, Classifications
  • Vector with TF-IDF weights: Cosine Similarity
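
As a rough illustration of the multi-valued case, the sketch below builds TF-IDF vectors over keyword lists and compares them with cosine similarity. The weighting variant (raw term frequency times log(n/df) + 1), the function names and the keyword data are assumptions made for the example; the slides do not specify them.

```python
import math
from collections import Counter


def tfidf_vectors(value_lists):
    """Turn each list of attribute values into a sparse TF-IDF vector (dict)."""
    n = len(value_lists)
    # Document frequency: in how many lists does each value occur?
    df = Counter(v for values in value_lists for v in set(values))
    vectors = []
    for values in value_lists:
        tf = Counter(values)
        # Assumed weighting: tf * (log(n / df) + 1); other variants exist.
        vectors.append({v: tf[v] * (math.log(n / df[v]) + 1.0) for v in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical keyword sets for three clusters.
keywords = [
    ["databases", "data mining", "machine learning"],
    ["data mining", "machine learning", "statistics"],
    ["particle physics", "cosmology"],
]
vecs = tfidf_vectors(keywords)
sim_close = cosine(vecs[0], vecs[1])  # two shared keywords
sim_far = cosine(vecs[0], vecs[2])    # no shared keywords
```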

20
Similarity Measure For Clustering
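  • $\mathrm{sim}(c_i, c_j) = (1 - \alpha)\,\mathrm{sim}_{attr}(c_i, c_j) + \alpha\,\mathrm{sim}_{rel}(c_i, c_j)$
  • $\mathrm{sim}_{rel}$: relational similarity between clusters
  • $\mathrm{sim}_{attr}$: attribute similarity between clusters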
  • Relational Similarity: Use set similarity (e.g.
    Jaccard) to find shared clusters (resolutions)
    between links
  • Neighborhood Similarity
  • Compare neighborhoods of two clusters
  • Reduce set of sets to multiset
  • Cheaper approximation
  • Edge Detail Similarity
  • Compare individual links of two clusters
  • Set of sets similarity
  • Expensive
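
A minimal sketch of the neighborhood variant, assuming each cluster's links are given as lists of neighboring cluster labels. The links are flattened into one multiset per cluster, compared with a multiset Jaccard coefficient, and then blended with an attribute score as a weighted sum. The function names, cluster labels, attribute score and weight are hypothetical.

```python
from collections import Counter


def neighborhood(links):
    """Flatten a cluster's links (a set of sets) into one multiset of labels."""
    return Counter(label for link in links for label in link)


def multiset_jaccard(m1, m2):
    """Jaccard coefficient on multisets: min counts over max counts."""
    union = sum((m1 | m2).values())
    if union == 0:
        return 0.0
    return sum((m1 & m2).values()) / union


def combined_similarity(sim_attr, sim_rel, alpha):
    """Weighted combination of attribute and relational similarity."""
    return (1 - alpha) * sim_attr + alpha * sim_rel


# Hypothetical links: co-author cluster labels for two candidate clusters.
links_a = [["c_tweedie", "c_dawkes"], ["c_tweedie"]]
links_b = [["c_tweedie"]]

sim_rel = multiset_jaccard(neighborhood(links_a), neighborhood(links_b))
score = combined_similarity(sim_attr=0.9, sim_rel=sim_rel, alpha=0.5)
```

Here the two neighborhoods share one occurrence of `c_tweedie` out of three in the multiset union, so the relational term is 1/3. Edge detail similarity would instead compare the individual links pairwise, which is why the slide lists it as the expensive option.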

21
Approach 1: Algorithm (GBC-ER)
  • Perform blocking step, which quickly identifies
    candidate duplicates
  • Iteratively merge the most similar cluster pairs
  • Similarities are dynamic: update related
    similarities after each merge
  • Indexed priority queue for fast update and
    extraction
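
The merge loop can be sketched as greedy agglomerative clustering over a heap. This is a simplified stand-in and not the authors' implementation: it skips blocking (all pairs are candidates), replaces the indexed priority queue with Python's `heapq` plus lazy discarding of stale entries, and only recomputes similarities that involve the newly merged cluster. The function name, score table and threshold are hypothetical.

```python
import heapq
from itertools import combinations


def greedy_merge(items, similarity, threshold):
    """Repeatedly merge the most similar cluster pair until none reaches threshold.

    similarity(a, b) takes two frozensets of items and returns a float.
    """
    clusters = {i: frozenset([item]) for i, item in enumerate(items)}
    next_id = len(clusters)
    heap = []  # max-heap simulated with negated similarities
    for i, j in combinations(clusters, 2):
        heapq.heappush(heap, (-similarity(clusters[i], clusters[j]), i, j))

    while heap:
        neg_sim, i, j = heapq.heappop(heap)
        if i not in clusters or j not in clusters:
            continue  # stale entry: one side was already merged away
        if -neg_sim < threshold:
            break  # best remaining pair is not similar enough
        merged = clusters.pop(i) | clusters.pop(j)
        # Similarities are dynamic: score the new cluster against the rest.
        for k, other in clusters.items():
            heapq.heappush(heap, (-similarity(merged, other), k, next_id))
        clusters[next_id] = merged
        next_id += 1
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical pairwise name scores; unlisted pairs score 0.0.
pair_scores = {
    frozenset(["B. Spence", "Bob Spence"]): 0.9,
    frozenset(["Bob Spence", "Robert Spence"]): 0.7,
    frozenset(["B. Spence", "Robert Spence"]): 0.6,
}


def single_link(a, b):
    """Cluster similarity as the best score over cross-cluster pairs."""
    return max(pair_scores.get(frozenset([x, y]), 0.0) for x in a for y in b)


names = ["B. Spence", "Bob Spence", "Robert Spence", "H. Dawkes"]
result = greedy_merge(names, single_link, threshold=0.5)
```

With relational similarity in the score, a merge can also change the similarity of pairs that merely neighbor the merged cluster. Those entries would need updating too, which is what an indexed priority queue makes cheap.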

22
Approach 2: Latent Dirichlet Model for ER
  • Probabilistic model of entity collaboration
    groups
  • Entities (authors) belong to groups
  • Entities (authors) in a link (document) depend on
    the groups that are involved
  • Latent group variable for each reference
  • Group labels and entity labels unobserved
  • For details see "A Latent Dirichlet Model for
    Unsupervised Entity Resolution" at SIAM Data
    Mining Conference, 2006 (Winner of Best Paper
    Award)

23
Evaluation Datasets
  • CiteSeer
  • Machine Learning Citations
  • Originally created by Lawrence et al.
  • 2,892 references to 1,165 true authors
  • 1,504 links
  • arXiv HEP
  • Papers from High Energy Physics
  • Used for KDD-Cup 03 Data Cleaning Challenge
  • 58,515 references to 9,200 true authors
  • 29,555 links
  • BioBase
  • Biology papers on immunology and infectious
    diseases
  • IBM KDD Challenge dataset constructed at Cornell
  • 156,156 publications, 831,991 author references
  • Ground truth for only 1060 references

24
Experimental Evaluation
  • Compare relational ER methods, GBC-Nbr and
    GBC-Edge, with baselines
  • ATTR
  • Pairwise duplicate decisions using Soft-TFIDF
  • Secondary string similarity: Scaled
    Levenshtein (SL), Jaro (JA), Jaro-Winkler (JW)
  • ATTR*
  • Transitive Closure over pairwise decisions
  • Precision, Recall and F1 over pairwise decisions
  • Requires similarity threshold
  • Report best performance over all thresholds
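
Pairwise precision, recall and F1 can be computed by expanding each clustering into its set of co-referent pairs. The sketch below assumes both the predicted and the true resolution are given as lists of clusters; the function names and reference ids are hypothetical.

```python
from itertools import combinations


def pairs_from_clusters(clusters):
    """All unordered pairs of references placed in the same cluster."""
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}


def pairwise_prf(predicted, truth):
    """Precision, recall and F1 over pairwise co-reference decisions."""
    pred = pairs_from_clusters(predicted)
    true = pairs_from_clusters(truth)
    tp = len(pred & true)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical example: five references, two true entities.
truth = [["r1", "r2", "r3"], ["r4", "r5"]]
predicted = [["r1", "r2"], ["r3", "r4", "r5"]]
precision, recall, f1 = pairwise_prf(predicted, truth)
```

Reporting the best performance over all thresholds then amounts to running the resolver at each threshold and keeping the highest F1.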

25
GBC Results: Best F1
  • Relational measures improve performance over
    attribute baseline in terms of precision, recall
    and F1
  • Neighborhood similarity performs almost as well as
    edge detail, or better
  • Neighborhood similarity much faster than edge
    detail

26
Structural Difference between Data Sets
  • Percentage of ambiguous references
  • 0.5% for CiteSeer
  • 9% for HEP
  • 32% for BioBase
  • Average number of collaborators per author
  • 2.15 for CiteSeer
  • 4.5 for HEP
  • Average number of references per author
  • 2.5 for CiteSeer
  • 6.4 for HEP
  • 106 for BioBase

27
Synthetic Data Generator
  • Data generator mimics real collaborations
  • Create collaboration graph in Stage 1
  • Create documents from this graph in Stage 2
  • Can control
  • Number of entities and documents
  • Average number of collaborators per author
  • Average number of references per entity
  • Average number of references per document
  • Percentage of ambiguous references
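
The slides do not describe the generator's internals, so the following is only a guess at what a two-stage design could look like. Stage 1 draws a random collaboration graph, and Stage 2 samples documents from a lead author and some of its collaborators. Every name, parameter and naming scheme here is hypothetical, including the device of making a reference ambiguous by abbreviating the first name so that two entities share it.

```python
import random


def generate(num_entities, num_documents, avg_collaborators,
             refs_per_document, ambiguity, seed=0):
    """Return (neighbors, documents); each document is a list of (name, entity)."""
    rng = random.Random(seed)

    # Stage 1: random collaboration graph with the requested average degree.
    max_edges = num_entities * (num_entities - 1) // 2
    num_edges = min(int(num_entities * avg_collaborators / 2), max_edges)
    edges = set()
    while len(edges) < num_edges:
        a, b = sorted(rng.sample(range(num_entities), 2))
        edges.add((a, b))
    neighbors = {e: set() for e in range(num_entities)}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Stage 2: each document has a lead author plus some of its collaborators.
    documents = []
    for _ in range(num_documents):
        lead = rng.randrange(num_entities)
        pool = sorted(neighbors[lead])
        coauthors = rng.sample(pool, min(len(pool), refs_per_document - 1))
        refs = []
        for e in [lead] + coauthors:
            # Entities 2k and 2k+1 share a last name; the abbreviated form
            # is therefore ambiguous between them (assumed naming scheme).
            if rng.random() < ambiguity:
                name = f"F. Last{e // 2}"
            else:
                name = f"First{e} Last{e // 2}"
            refs.append((name, e))
        documents.append(refs)
    return neighbors, documents


neighbors, documents = generate(
    num_entities=10, num_documents=20, avg_collaborators=2.0,
    refs_per_document=3, ambiguity=0.3, seed=42,
)
```

Because the true entity is stored with every reference, any resolution of the generated names can be scored exactly, and each knob can be varied on its own.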

28
Trends in Synthetic Data
  • Improvement increases sharply with higher
    ambiguity in references

29
Trends in Synthetic Data
  • Improvement increases with more references per
    author

30
Trends in Synthetic Data
  • Improvement increases with more references per
    document

31
Current Projects
  • Entity Resolution in Geospatial Data
  • Using spatial information, location name
    information and location type information
  • D-Dupe Interactive ER Tool
  • Simple user-interface for entity resolution
  • Accepted to new Visual Analytics conference
  • Query-time Entity Resolution
  • Goal: Allow users to query an unresolved database
  • Adaptive strategy constructs set of relevant
    references and performs collective resolution
  • Preliminary results: adaptive strategy is as
    accurate and 200x faster
  • Ontology Alignment (work w/ Octavian Udrea)
  • Combines relational clustering with logical
    inference (e.g. equivalence and subsumption)
  • Results in a 40% improvement in recall on 30 OWL
    Lite ontology pairs

32
ER References
  • Bibliographic Data
  • Author resolution using co-author links
  • Graph-based Clustering (GBC-ER)
    (DMKD 04, LinkKDD 04, Book
    Chapter, Tech Report)
  • LDA-based Group Model (LDA-ER) (SDM 06, best
    paper award)
  • Query-based Entity Resolution (QB-ER):
    participants in IBM KDD Entity Resolution
    Challenge (KDD 06)
  • Email Archives
  • Name reference resolution using email traffic
    network
  • Using a variety of temporal social network
    models (SDM 06)
  • Natural Language
  • Sense resolution using translation links in
    parallel corpora (ACL 04)
  • Sense Model: Senses in different languages depend
    directly on each other
  • Concept Model: Semantic sense groups or Concepts
    relate senses from different languages

33
Thanks!!