Title: Name games
1Name games
The author of the Iliad is either Homer or, if not Homer, somebody else of the same name.
Aldous Huxley (1894 - 1963)
Adrian Esterman
John Bass
2Name Comparison Techniques
- Exact matches are not sufficient: names can be misspelled for many different reasons
- Need special algorithms to see if two names are approximately the same.
- Phonetic coding algorithms
- Similarity indices
3Phonetic coding algorithms
- Soundex
- Phonix
- NYSIIS
- Metaphone
4Soundex system (Russell, 1890)
1. Use the first letter of name
2. Drop a, e, i, o, u, h, w and y
3. Create three digit code according to:
   Code  Letter
   1     b, f, p, v
   2     c, g, j, k, q, s, x, z
   3     d, t
   4     l
   5     m, n
   6     r
4. Drop adjacent letters having the same code
5. If necessary pad with zeros
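The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation: it assumes z and t belong to codes 2 and 3 (as in standard Soundex), and it collapses repeated codes only after the vowels and h, w, y have been dropped, which is the order the list gives. The names `CODES` and `soundex` are chosen here for illustration.

```python
# Letter-to-digit table from step 3 (z and t are assumed, as in standard Soundex).
CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return a 4-character code: first letter plus three digits."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0].upper()                       # step 1
    # Steps 2 and 3: letters without a code (vowels, h, w, y) are dropped.
    digits = [CODES[c] for c in letters[1:] if c in CODES]
    collapsed = []                                   # step 4
    for d in digits:
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    return (first + "".join(collapsed) + "000")[:4]  # step 5: pad and truncate


for part in ("Adrian", "Esterman", "Estermann", "Easterman"):
    print(part, soundex(part))
```

Because the code is cut to one letter and three digits, the three spellings of the surname all collapse to the same value, which is the behaviour the method relies on.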
5Soundex system (Russell, 1890)
Examples
Name               Soundex code
Adrian Esterman    A365 E236
Adrian Estermann   A365 E236
Adrian Easterman   A365 E236
6Soundex system (Russell, 1890)
- Original soundex is flawed
- geared towards English names
- can't account for incorrect first letter
- longer names are truncated
7Phonix system (Gadd, 1990)
Soundex variant in which letters are mapped to a
set of codes using the same algorithm, but a
slightly different set of codes. Prior to
mapping about 160 letter-group transformations
are used to standardise the string. For example,
the sequence tjV (where V is any vowel) is
mapped to chV if it occurs at the start of a
string.
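A single transformation of this kind could be written as a regular-expression rewrite. The sketch below covers only the tjV example quoted above, not the full Phonix rule set, and it assumes that "vowel" means a, e, i, o or u. The function name `apply_tj_rule` is hypothetical.

```python
import re


def apply_tj_rule(name):
    """Rewrite a leading 'tj' + vowel as 'ch' + vowel (one Phonix-style rule)."""
    # ^ anchors the rule to the start of the string; the lookahead keeps the vowel.
    return re.sub(r"^tj(?=[aeiou])", "ch", name.lower())


print(apply_tj_rule("Tjaden"))   # rule applies at the start of the string
print(apply_tj_rule("Katja"))    # 'tj' is not at the start, so unchanged
```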
8New York State Identification and Intelligence Algorithm (NYSIIS) (Taft, 1970)
- Builds a phonetic code of up to 6 letters for each name
- Much more complicated algorithm than Soundex
- Builds more precise phonetic codes
- Vowels are retained and all mapped to the letter A
- They found that NYSIIS is 98.7% accurate, whereas Soundex is 96.0% accurate
9NYSIIS (Taft, 1970)
Examples
Name        NYSIIS code
Borthwick   BARTWA
Walker      WALCAR
Papadouka   PAPADA
10Metaphone (Philips, 1990)
A similar code to Soundex; however, only consonants are
retained, and these are reduced to a set of 16 consonant
symbols (not digits):
B X S K J T F H L M N P R 0 W Y
where 0 represents the TH sound
11Metaphone (Philips, 1990)
Examples
Name          Metaphone code
Johnson       JNSN
Jensen        JNSN
Fitzwilliam   FTSWLM
12Similarity Indices
- Edit distance
- Jaro-Winkler
13Edit-Distance
- Edit distance operations
- Insertion, where an extra character is inserted into the string
- Deletion, where a character has been removed from the string
- Transposition, in which two characters are reversed in their sequence
- Substitution, which is an insertion followed by a deletion
14Edit distance
- Edit distance is measured as a count of edit distance operations from one string to another
- Strings with a small edit distance are likely to be similar
- Example
- internatianl to international has an edit distance of 2, i.e., 1 transposition and 1 deletion (the misspelling swaps "n" and "a" and drops the "o")
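One way to compute such a count is the "optimal string alignment" form of Damerau-Levenshtein distance, sketched below in Python. It assumes each of the four operations (insertion, deletion, substitution, transposition of adjacent characters) costs 1; other cost schemes are possible. The function name `edit_distance` is illustrative.

```python
def edit_distance(a, b):
    """Optimal string alignment distance between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and the first j chars of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # delete all i characters
    for j in range(cols):
        d[0][j] = j                      # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


print(edit_distance("internatianl", "international"))  # 2
print(edit_distance("Thomson", "Thompson"))            # 1
```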
15Jaro-Winkler (1994)
- Developed at the U.S. Census Bureau and used in their post-enumeration survey
- Works as follows
- Input: Name1, Name2
- Output: Number between 0 and 1
- 1.0: Perfect match
- 0.5: Partial match
- 0.0: Completely different
16Jaro-Winkler
- Matching characters are letters of name1 that also occur in
name2 within a window of roughly half the name length
(conventionally floor(longer length / 2) - 1 positions either
side). Thus, for two eight-character strings, the second letter
of name1 should appear in positions 1 through 5 of name2
17Jaro-Winkler
J-W
c = Number of matching characters
s1 = Length of name 1
s2 = Length of name 2
t = Number of transpositions
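Assuming the score meant here is the standard Jaro similarity with Winkler's common-prefix bonus, it can be written in the symbols above as J = (c/s1 + c/s2 + (c - t)/c) / 3, and J-W = J + L * p * (1 - J), where L is the length of the shared prefix. The Python sketch below follows that assumption. It takes t as half the number of matched characters that are out of order, and uses the conventional p = 0.1 with a prefix cap of 4. The function and parameter names are illustrative.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1] (standard formulation, assumed here)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)   # matching window
    matched1 = [False] * len1
    matched2 = [False] * len2
    c = 0                                       # number of matching characters
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                c += 1
                break
    if c == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2                        # transpositions
    return (c / len1 + c / len2 + (c - t) / c) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro score boosted for a shared prefix of up to max_prefix characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```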
18
- However, despite even the most sophisticated algorithms for comparing names, the weird and wonderful ways in which people and cultures use names remain a problem . . . .
19Nicknames
- James > Jim, Jimmy
- Elizabeth > Beth, Liz, Liza, Betsy, Betty (Beth could also be short for Bethwyn)
- Margaret > Meg, Peg, Peggy, Margie
- Raymond > Ray (commonly also used as a name in itself)
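One possible way to handle nicknames in a comparison is a lookup table from each nickname to the set of formal names it might stand for. The Python sketch below uses only the examples listed above. The table is illustrative rather than complete, and the names `NICKNAMES`, `candidates` and `may_match` are hypothetical.

```python
# Each nickname maps to every formal name it could stand for.
NICKNAMES = {
    "jim": {"james"}, "jimmy": {"james"},
    "beth": {"elizabeth", "bethwyn"},          # Beth is ambiguous
    "liz": {"elizabeth"}, "liza": {"elizabeth"},
    "betsy": {"elizabeth"}, "betty": {"elizabeth"},
    "meg": {"margaret"}, "peg": {"margaret"},
    "peggy": {"margaret"}, "margie": {"margaret"},
    "ray": {"raymond"},
}


def candidates(name):
    """Formal names a given name could stand for, plus the name itself."""
    key = name.strip().lower()
    # Including the key itself lets names such as Ray match as names in their own right.
    return NICKNAMES.get(key, set()) | {key}


def may_match(a, b):
    """True if the two given names share at least one candidate formal name."""
    return bool(candidates(a) & candidates(b))


print(may_match("Jim", "James"))     # True
print(may_match("Beth", "Bethwyn"))  # True
print(may_match("Jim", "Ray"))       # False
```

Returning a set rather than a single formal name keeps ambiguous cases such as Beth open instead of forcing a possibly wrong choice.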
20Variant spellings
- Smith, Smythe
- Thomson, Thompson
- Gray, Grey
- Catherine, Katherine, Catharine, Kathryn, Kate
- Brian, Bryan, Bryn
21Changing ethnicity
- Stefan > Stephen, Steven, Steve
- Josef > Joseph, Joe
- Giovanni > John
- Harold Morris > Harry Maurice
- Smit > Smith
- Johansson > Johnson
22Pilgrimages
23Gender indicators
- Vietnamese
- Van indicates male, Thi indicates female
- Dzung Do Van Dzung Do Dzung Van Do
24Family and given name positions
- Jiang Xiao
- Xiao Jiang
- Which is the given name?
25Given name order
- Alfred John Bass
- John Bass
- Alfred Bass
- A John Bass
- John Alfred Bass
26Hyphenated names
- Drake-Brockman
- Lee-Steere
- Owen-Smith > Smith > Owen
- De-Klerk
27Names with more than one word
- du Plessis
- el Garrouj
- van der Merwe
- O Day
- Castelnuovo Tedesco
28How many ways can a name appear?
- O'Dell
- Odell
- O-Dell
- O,Dell
- O Dell
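A common first step before any comparison is to reduce such variants to one key by ignoring case, spacing and punctuation. The Python sketch below is one way to do that. The variant list includes the apostrophe form as an assumption, and the function name `normalise_surname` is illustrative.

```python
def normalise_surname(name):
    """Lower-case the name and keep letters only (drops spaces, hyphens, commas, apostrophes)."""
    return "".join(ch for ch in name.casefold() if ch.isalpha())


variants = ["O'Dell", "ODell", "Odell", "O-Dell", "O,Dell", "O Dell"]
keys = {normalise_surname(v) for v in variants}
print(keys)  # {'odell'}
```

The same key would also merge names that are genuinely different once punctuation is removed, so it is better suited to producing candidate matches than to confirming them.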
29Name changes
- Marriage
- Adoption
- Divorce
- Forced into hiding
- Personal choice
30- . . . . and, of coarse, people make
mistakes
31- . . . . and, of course, people make
mistakes