Social Network Analysis on Name Disambiguation

1
Social Network Analysis on Name Disambiguation

On, Byung-Won
U. British Columbia
Nov. 12, 2008

2
Outline

Motivation
Problem Definition
Solution
Context Information
Similarity Function
Our Framework
Experimental Analysis
Summary

3
Name Disambiguation @ DLs
Jeffrey D. Ullman @ Stanford Univ.
The same author's name mistakenly appears under
multiple name variants.
Name Disambiguation Problem
Detect/consolidate all name variants!!
4
Problem Definition
Given two lists of names $X$ and $Y$, for each name $x \in X$,
find a set of names $y_1, y_2, \ldots, y_m \in Y$
such that $y_i$ ($1 \le i \le m$) is a variant of $x$.
[Figure: example name lists X and Y; entries shown include A. Elbert, Paul R. McJones, Frank Manola, F. Manola, Karl Swartz, and Karl L. Swartz.]
5
Solution

Treat additional information associated with x
(resp. y) as a string.
What is the additional info?
Compute all pair-wise string similarities.
How can similarities be measured?
If similarity(x, y) ≥ θ, y is a name variant of
x.

6
Context Information

Hypothesis
If two author names refer to the same person, their
citations will share more coauthors and more common
title/venue tokens.
Information associated with an author @ DL
Author field
Shawn R. Jeffrey, Michael J. Franklin, Alon Y.
Halevy
Title field
Pay-as-you-go user feedback for dataspace systems
Venue field
SIGMOD 2008
Ex: Alon Y. Levy vs. Alon Halevy
Alon Y. Levy: a set of title tokens {data,
management, integration}
Alon Halevy: a set of title tokens {data,
integration, lineage}

7
Similarity Function

Why?
Most useful for matching problems with little
prior knowledge or unstructured data (Cohen et
al. 2003)
Character-based similarity metrics
Edit-distance, Affine Gap, Smith-Waterman, Jaro,
etc.
Token-based similarity metrics
Jaccard, TF/IDF cosine similarity, Monge-Elkan,
etc.

8
Similarity Function

Every similarity function tends to work well on
particular data sets
Each function has pros and cons in measuring the
similarity between two strings
Variations of token order
Jaccard(Jeffrey D Ullman, Ullman Jeffrey) = 0.67
Jaro(Jeffrey D Ullman, Ullman Jeffrey) = 0
Spelling errors
Jaccard(Jeffrey D Ullman, Jeffrey Ullmann) = 0.25
Jaro(Jeffrey D Ullman, Jeffrey Ullmann) = 0.94
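The two Jaccard values above can be checked with a minimal sketch. It assumes names are split on whitespace and compared case-insensitively; the helper names `tokens` and `jaccard` are illustrative and not taken from the presentation.

```python
def tokens(name):
    """Split a name into a set of lowercase, whitespace-separated tokens."""
    return set(name.lower().split())


def jaccard(s, t):
    """Token-based Jaccard: |S ∩ T| / |S ∪ T| (0.0 when both are empty)."""
    s_tok, t_tok = tokens(s), tokens(t)
    union = s_tok | t_tok
    return len(s_tok & t_tok) / len(union) if union else 0.0


# Token order does not matter: 2 shared tokens out of 3 distinct tokens.
print(round(jaccard("Jeffrey D Ullman", "Ullman Jeffrey"), 2))   # 0.67
# A one-letter spelling change makes "Ullman" and "Ullmann" different tokens.
print(round(jaccard("Jeffrey D Ullman", "Jeffrey Ullmann"), 2))  # 0.25
```

Because the comparison is on whole tokens, the measure is robust to reordering but sensitive to spelling errors, which is the trade-off this slide illustrates.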

9
Similarity Function

Given two strings S and T as the input
Jaccard = $|S \cap T| / |S \cup T|$
Cosine similarity
S (resp. T) is represented as a vector $V_S$ (resp.
$V_T$).
$\cos(\theta) = (V_S \cdot V_T) / (\|V_S\| \, \|V_T\|)$
Edit-distance (e.g., Levenshtein distance)
The cost of the best sequence of edit operations that
converts S to T.
The operations can be character insertion,
deletion, or substitution.
Each operation must be assigned a cost.
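A minimal sketch of the cosine and edit-distance definitions above. Two assumptions are made here that the slide does not state: the cosine vectors hold raw token counts (no TF/IDF weighting), and every edit operation has unit cost. The function names are illustrative.

```python
import math
from collections import Counter


def cosine(s_tokens, t_tokens):
    """Cosine of the angle between two token-count vectors."""
    vs, vt = Counter(s_tokens), Counter(t_tokens)
    dot = sum(vs[k] * vt[k] for k in vs.keys() & vt.keys())
    norm = (math.sqrt(sum(c * c for c in vs.values()))
            * math.sqrt(sum(c * c for c in vt.values())))
    return dot / norm if norm else 0.0


def levenshtein(s, t):
    """Minimum number of unit-cost insertions, deletions, or
    substitutions needed to convert string s into string t."""
    prev = list(range(len(t) + 1))  # distances from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                 # delete cs
                            curr[j - 1] + 1,             # insert ct
                            prev[j - 1] + (cs != ct)))   # substitute or match
        prev = curr
    return prev[-1]


print(levenshtein("ullman", "ullmann"))  # 1
```

Note that edit distance is a cost (lower means more similar), so it would need to be normalized or inverted before being compared against a similarity threshold.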

10
Our Framework

Similarity Function (sim)
Jaccard, Cosine similarity, or Edit-distance
Input of each similarity function
Given two authors x and y
S: a set of coauthor names (title tokens, or
venue tokens) collected from x's citations
T: a set of coauthor names (title tokens, or
venue tokens) collected from y's citations
If sim(S, T) ≥ θ, y is a name variant of x.
Ex. sim(S, T) = 0.6 > θ (0.5): consider x and y to
be identical.

11
james smith's citations
james smith, gene golub, xml query, vldb 06
james smith, gene golub, xml preprocess, cikm 07
james smith, xml security, vldb 08
smith, j.'s citations
smith, j., golub, g., xml query, very large database 06
smith, j., golub, g., xml preprocessing, cikm 07
smith, j., xml security, very large database 08
Context information (e.g., title tokens)
S (james smith) = {xml, query, preprocess, security}
T (smith, j.) = {xml, query, preprocessing, security}
Similarity function (e.g., Jaccard)
sim(S, T) = 3/5 = 0.6
sim(S, T) ≥ θ (0.5), so smith, j. is a name variant of james smith
Duplicate name graph: the nodes james smith and smith, j. are connected by an edge
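The worked example on this slide can be reproduced with a short sketch. It assumes title tokens are obtained by whitespace splitting with no stemming, so "preprocess" and "preprocessing" stay distinct; `title_tokens`, `jaccard`, and `is_variant` are hypothetical helper names.

```python
def title_tokens(titles):
    """Collect the set of lowercase tokens over all citation titles."""
    return {tok for title in titles for tok in title.lower().split()}


def jaccard(s, t):
    """Set-based Jaccard: |S ∩ T| / |S ∪ T| (0.0 when both are empty)."""
    union = s | t
    return len(s & t) / len(union) if union else 0.0


def is_variant(s, t, theta=0.5):
    """Decide that two names are variants when sim(S, T) >= theta."""
    return jaccard(s, t) >= theta


S = title_tokens(["xml query", "xml preprocess", "xml security"])      # james smith
T = title_tokens(["xml query", "xml preprocessing", "xml security"])   # smith, j.

print(jaccard(S, T))     # 0.6  (3 shared tokens out of 5 distinct tokens)
print(is_variant(S, T))  # True (0.6 >= 0.5)
```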
12
Objective

Represent the name disambiguation problem as a graph
A duplicate name graph is formed from the pair-wise
similarities between its nodes (author names).
If two nodes are connected in the graph, they are
name variants.
By observing topological features of the graph,
investigate the effectiveness of similarity
functions and context information
Similarity functions: Jaccard, Cosine similarity, Edit-distance
Context information: a set of coauthors, title tokens, venue tokens
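One way to build such a duplicate name graph is sketched below, assuming Jaccard similarity over context token sets and a threshold θ = 0.5. The adjacency-dict representation, the function names, and the context tokens given for "gene golub" are assumptions made for illustration only.

```python
from itertools import combinations


def jaccard(s, t):
    """Set-based Jaccard: |S ∩ T| / |S ∪ T| (0.0 when both are empty)."""
    union = s | t
    return len(s & t) / len(union) if union else 0.0


def duplicate_name_graph(context, theta=0.5):
    """Connect two names with an edge when their context sets are similar."""
    graph = {name: set() for name in context}
    for a, b in combinations(context, 2):
        if jaccard(context[a], context[b]) >= theta:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def components(graph):
    """Group names into connected components (candidate variant clusters)."""
    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node] - group)
        seen |= group
        groups.append(group)
    return groups


context = {
    "james smith": {"xml", "query", "preprocess", "security"},
    "smith, j.": {"xml", "query", "preprocessing", "security"},
    "gene golub": {"matrix", "computation"},  # made-up tokens for illustration
}
graph = duplicate_name_graph(context)
print(components(graph))
```

Each connected component is a group of names the similarity function treats as variants of one another, so the shape of these components is what the topological analysis inspects.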

13
Topological Features
14
Experimental Analysis

128 real author names and variants
Manually collected from ACM Portal
Manually verified that name variants (e.g., Chong Kwan
Un vs. C. K. Un, Chong K Un) refer to the same author
in ACM
From the 128 author names:
E.g., two name variants: Chun Wu Leng vs. Chun-Wu
Leng
Consider Chun Wu Leng as the representative name
Consider Chun-Wu Leng as a variant name
# of representative names: 43
Each representative name has 2.98 name variants on average
Max. # of variants: 5 (A. Y. Halevy, Alon
Halevy, Alon Levy, Alon Y. Halevy, Alon Y. Levy)

15
(No Transcript)
16

Each representative name has at most 2 variants
If a similarity function (e.g., Cosine
similarity) identifies variants effectively,
the graph should consist of many small trees (a forest)
and show the topological features of a random graph.
But the duplicate name graph is a scale-free
network
Power-law degree distribution
So the Cosine similarity function does not find
identical author names effectively.
Due to false positives (co-authors) in the graph
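The scale-free observation rests on the degree distribution P(k), the fraction of nodes with k neighbours, which in a scale-free network falls off roughly as a power law. A minimal sketch of computing P(k) from an adjacency dict follows; the representation and the small star-shaped graph are hypothetical and are not the presentation's data.

```python
from collections import Counter


def degree_distribution(graph):
    """Return {k: fraction of nodes with degree k}, sorted by k."""
    counts = Counter(len(neighbours) for neighbours in graph.values())
    n = len(graph)
    return {k: c / n for k, c in sorted(counts.items())}


# Hypothetical example: one hub name linked to four other names,
# e.g. through false-positive matches.
star = {
    "hub": {"a", "b", "c", "d"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub"},
    "d": {"hub"},
}
print(degree_distribution(star))  # {1: 0.8, 4: 0.2}
```

A few high-degree hubs alongside many low-degree nodes is the pattern one would look for; if variants were identified cleanly, most nodes would instead have small and similar degrees.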

17
(No Transcript)
18
(No Transcript)
19
Summary

Analyze/visualize the name disambiguation problem
using social network analysis methods
Jaccard, Cosine similarity, and Edit-distance do
not work effectively
The duplicate name graph shows scale-free topological features
Best is Jaccard or Cosine similarity using
context info of coauthors or title tokens

20
TF/IDF Cosine Similarity