Name games The author of the Iliad is either Homer or, if not Homer, somebody else of the same name.

Provided by: davidwa4

1
Name games
"The author of the Iliad is either Homer or, if not
Homer, somebody else of the same name."
Aldous Huxley (1894 - 1963)
Adrian Esterman, John Bass
2
Name Comparison Techniques
  • Exact matches are not sufficient - names can be
    misspelled for many different reasons
  • Need special algorithms to see if two names are
    approximately the same.
  • Phonetic coding algorithms
  • Similarity indices

3
Phonetic coding algorithms
  • Soundex
  • Phonix
  • NYSIIS
  • Metaphone
4
Soundex system (Russell, 1890)
1. Use the first letter of the name
2. Drop a, e, i, o, u, h, w and y
3. Create a three-digit code according to:
   Code    Letters
   1       b, f, p, v
   2       c, g, j, k, q, s, x, z
   3       d, t
   4       l
   5       m, n
   6       r
4. Drop adjacent letters having the same code
5. If necessary, pad with zeros
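The five steps above can be sketched in Python as follows. This is a minimal illustration of the procedure as the slide states it, not a reference implementation: the function name `soundex` is chosen here, the code table is assumed to include z (code 2) and t (code 3) as in standard Soundex, and it is an assumption that a code repeating the first letter's own code is also dropped. Other Soundex variants treat h, w and vowels differently when collapsing repeats.

```python
# Letter-to-digit table from the slide (z and t filled in as in standard Soundex).
SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}


def soundex(name: str) -> str:
    """Return a 4-character code: first letter plus three digits."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]  # Step 1: keep the first letter

    # Step 2: drop a, e, i, o, u, h, w, y (anything without a code).
    # Step 3: map the remaining letters to digits.
    digits = [SOUNDEX_CODES[ch] for ch in letters[1:] if ch in SOUNDEX_CODES]

    # Step 4: drop a digit that repeats the one just before it.
    # Assumption: the first letter's own code counts as "the one before".
    collapsed = []
    prev = SOUNDEX_CODES.get(first)
    for digit in digits:
        if digit != prev:
            collapsed.append(digit)
        prev = digit

    # Step 5: pad with zeros, then keep the letter plus three digits.
    return (first.upper() + "".join(collapsed) + "000")[:4]


for name in ("Adrian", "Esterman", "Estermann", "Easterman"):
    print(name, soundex(name))
```

Under these assumptions the three spellings Esterman, Estermann and Easterman all produce the same code.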
5
Soundex system (Russell, 1890)
Examples
Name                Soundex code
Adrian Esterman     A365 E236
Adrian Estermann    A365 E236
Adrian Easterman    A365 E236
6
Soundex system (Russell, 1890)
  • Original Soundex is flawed
  • geared towards English names
  • can't account for an incorrect first letter
  • longer names are truncated

7
Phonix system (Gadd, 1990)
Soundex variant in which letters are mapped to a
set of codes using the same algorithm, but a
slightly different set of codes. Prior to
mapping about 160 letter-group transformations
are used to standardise the string. For example,
the sequence tjV (where V is any vowel) is
mapped to chV if it occurs at the start of a
string.
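A single Phonix-style transformation, the tjV rule mentioned above, might look like this in Python. It is only a sketch of that one rule: the function name `apply_tj_rule` is hypothetical, "any vowel" is assumed to mean a, e, i, o or u, and the remaining transformations of the real algorithm are not shown.

```python
import re


def apply_tj_rule(name: str) -> str:
    """Rewrite a leading 'tj' + vowel as 'ch' + vowel; leave other names alone."""
    # ^ anchors the rule to the start of the string, as the slide requires.
    # The lookahead (?=[aeiou]) checks for a vowel without consuming it.
    return re.sub(r"^tj(?=[aeiou])", "ch", name.lower())


print(apply_tj_rule("Tjarks"))
print(apply_tj_rule("Atje"))
```

The second call is unchanged because the sequence does not occur at the start of the string.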
8
New York State Identification and Intelligence
System (NYSIIS) (Taft, 1970)
  • Builds a phonetic code of up to 6 letters for
    each name
  • Much more complicated algorithm
  • Builds more precise phonetic codes
  • Vowels are retained and all mapped to the
    letter A
  • NYSIIS was found to be 98.7% accurate,
    whereas Soundex was 96.0% accurate

9
NYSIIS (Taft, 1970)
Examples
Name         NYSIIS code
Borthwick    BARTWA
Walker       WALCAR
Papadouka    PAPADA
10
Metaphone (Philips, 1990)
A similar code to Soundex; however, only consonants
are retained, and these are reduced to a set of 16
consonant codes (not digits):
B X S K J T F H L M N P R 0 W Y
where 0 represents the TH sound
11
Metaphone (Philips, 1990)
Examples
Name           Metaphone code
Johnson        JNSN
Jensen         JNSN
Fitzwilliam    FTSWLM
12
Similarity Indices
  • Edit distance
  • Jaro-Winkler
13
Edit-Distance
  • Edit distance operations
  • Insertion, where an extra character is inserted
    into the string
  • Deletion, where a character has been removed from
    the string
  • Transposition, in which two characters are
    reversed in their sequence
  • Substitution, where one character is replaced by
    another (equivalent to an insertion followed by a
    deletion)

14
Edit distance
  • Edit distance is measured as a count of edit
    distance operations from one string to another
  • Strings with a small edit distance are likely to
    be similar
  • Example
  • internatianl to international has an edit
    distance of 2,
  • i.e., 1 transposition (an for na) and 1 deletion
    (the missing o)
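The count in the example can be checked with a short dynamic-programming sketch. It assumes one particular way of counting: insertion, deletion, substitution and adjacent transposition each cost 1 (often called optimal string alignment distance). The slides do not say which variant was intended, and the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Count insertions, deletions, substitutions and adjacent transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and the first j of b.
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Transposition: the last two characters are swapped.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(edit_distance("internatianl", "international"))  # 2
print(edit_distance("Gray", "Grey"))                   # 1
```

If substitution were instead counted as an insertion plus a deletion, pairs such as Gray and Grey would score 2 rather than 1.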

15
Jaro-Winkler (1994)
  • Developed at the U.S. Census and used in their
    post-enumeration survey
  • Works as follows
  • Input: Name1, Name2
  • Output: Number between 0 and 1
  • 1.0 = Perfect match
  • 0.5 = Partial match
  • 0.0 = Completely different

16
Jaro-Winkler
  • Matching characters are letters in name1 that
    also occur in name2 within about half the length
    of the name from the same position (the usual
    window is half the longer length, minus one).
    Thus, for two eight-character strings, the second
    letter of name1 should appear in positions 1 to 5
    of name2
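The eight-character example is consistent with the commonly used Jaro search window of half the longer length, minus one. The sketch below assumes that definition, which the slide does not spell out; the helper names `match_window` and `candidate_positions` are hypothetical.

```python
def match_window(len1: int, len2: int) -> int:
    """How far either side of a position a matching letter may sit."""
    return max(max(len1, len2) // 2 - 1, 0)


def candidate_positions(i: int, len1: int, len2: int) -> range:
    """1-based positions in name2 searched for the letter at position i of name1."""
    w = match_window(len1, len2)
    return range(max(1, i - w), min(len2, i + w) + 1)


# Second letter of name1, two eight-character names:
print(list(candidate_positions(2, 8, 8)))  # [1, 2, 3, 4, 5]
```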

17
Jaro-Winkler
J-W
c = Number of matching characters
s1 = Length of name 1
s2 = Length of name 2
t = Number of transpositions
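In terms of these symbols, the standard Jaro similarity is (c/s1 + c/s2 + (c - t)/c) / 3, and Winkler's adjustment adds a bonus for a shared prefix. The sketch below implements that textbook form and may differ in detail from the version used at the Census Bureau. The following are assumptions rather than anything stated on the slides: t is taken as half the number of matched characters that are out of order, the prefix weight is `p=0.1`, and the prefix is capped at 4 characters.

```python
def jaro(name1: str, name2: str) -> float:
    """Jaro similarity: (c/s1 + c/s2 + (c - t)/c) / 3."""
    if name1 == name2:
        return 1.0
    s1, s2 = len(name1), len(name2)
    if s1 == 0 or s2 == 0:
        return 0.0

    window = max(max(s1, s2) // 2 - 1, 0)
    used1 = [False] * s1
    used2 = [False] * s2
    c = 0  # number of matching characters
    for i, ch in enumerate(name1):
        lo, hi = max(0, i - window), min(s2, i + window + 1)
        for j in range(lo, hi):
            if not used2[j] and name2[j] == ch:
                used1[i] = used2[j] = True
                c += 1
                break
    if c == 0:
        return 0.0

    # t = half the number of matched characters that appear in a different order.
    seq1 = [ch for ch, u in zip(name1, used1) if u]
    seq2 = [ch for ch, u in zip(name2, used2) if u]
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2

    return (c / s1 + c / s2 + (c - t) / c) / 3


def jaro_winkler(name1: str, name2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Boost the Jaro score when the names share a common prefix."""
    score = jaro(name1, name2)
    prefix = 0
    for a, b in zip(name1, name2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * p * (1 - score)


print(round(jaro("MARTHA", "MARHTA"), 4))          # 0.9444
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

The prefix bonus reflects the observation that errors are less common at the start of a name, so agreement there is weighted more heavily.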
18
  • However, despite even the most sophisticated
    algorithms for comparing names, the weird and
    wonderful way in which people and cultures use
    names is a problem . . . .

19
Nicknames
  • James > Jim, Jimmy
  • Elizabeth > Beth, Liz, Liza, Betsy, Betty
    (Beth could also be short for Bethwyn)
  • Margaret > Meg, Peg, Peggy, Margie
  • Raymond > Ray (commonly also used as a name
    in itself)

20
Variant spellings
  • Smith, Smythe
  • Thomson, Thompson
  • Gray, Grey
  • Catherine, Katherine, Catharine, Kathryn, Kate
  • Brian, Bryan, Bryn

21
Changing ethnicity
  • Stefan > Stephen, Steven, Steve
  • Josef > Joseph, Joe
  • Giovanni > John
  • Harold Morris > Harry Maurice
  • Smit > Smith
  • Johansson > Johnson

22
Pilgrimages
  • Douf > El Hadji Douf

23
Gender indicators
  • Vietnamese
  • Van indicates male, Thi indicates female
  • Dzung Do Van Dzung Do Dzung Van Do

24
Family and given name positions
  • Jiang Xiao
  • Xiao Jiang
  • Which is the given name?

25
Given name order
  • Alfred John Bass
  • John Bass
  • Alfred Bass
  • A John Bass
  • John Alfred Bass

26
Hyphenated names
  • Drake-Brockman
  • Lee-Steere
  • Owen-Smith > Smith > Owen
  • De-Klerk

27
Names with more than one word
  • du Plessis
  • el Garrouj
  • van der Merwe
  • O Day
  • Castelnuovo Tedesco

28
How many ways can a name appear?
  • ODell
  • Odell
  • O-Dell
  • O,Dell
  • O Dell
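One common way to cope with variants like these is to normalise a name before comparing it. The sketch below lowercases the name and strips everything except the letters a to z. The function name `normalise_surname` is hypothetical, and this rule is an assumption for illustration: it would also discard accented letters, which a real system would need to handle separately.

```python
import re


def normalise_surname(name: str) -> str:
    """Lowercase and remove spaces, hyphens, commas and other non-letters."""
    # Note: [^a-z] also removes accented letters; this is a deliberate
    # simplification for the example.
    return re.sub(r"[^a-z]", "", name.lower())


variants = ["ODell", "Odell", "O-Dell", "O,Dell", "O Dell"]
print({normalise_surname(v) for v in variants})  # {'odell'}
```

After normalisation all five spellings collapse to one key, which can then be passed to a phonetic code or similarity index.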

29
Name changes
  • Marriage
  • Adoption
  • Divorce
  • Forced into hiding
  • Personal choice

30
  • . . . . and, of coarse, people make
    mistakes

31
  • . . . . and, of course, people make
    mistakes

32
  • Thank you !