K-Nearest Neighbors (kNN)

Transcript and Presenter's Notes
1
K-Nearest Neighbors (kNN)
  • Given a case base CB, a new problem P, and a
    similarity metric sim
  • Obtain the k cases in CB that are most similar
    to P according to sim
  • Reminder: we used a priority list holding the top k
    most similar cases obtained so far (a sketch follows below)
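A minimal sketch of that priority-list idea in Python, using a bounded min-heap from heapq; the case base, the similarity function, and the city data below are illustrative assumptions, not part of the slides' code.

import heapq

def knn_retrieve(case_base, problem, sim, k):
    # A bounded min-heap holds the k most similar cases seen so far; its root
    # is the *least* similar retained case and is evicted when a better one arrives.
    heap = []  # entries: (similarity, tie-breaker, case)
    for i, case in enumerate(case_base):
        s = sim(problem, case)
        if len(heap) < k:
            heapq.heappush(heap, (s, i, case))
        elif s > heap[0][0]:               # more similar than the worst retained case
            heapq.heapreplace(heap, (s, i, case))
    return [c for s, i, c in sorted(heap, reverse=True)]   # most similar first

# Illustrative usage with the cities from the later slides; similarity is taken
# to be the negated squared Euclidean distance (an assumption).
cities = {"Toronto": (60, 75), "Buffalo": (80, 65), "Denver": (5, 45),
          "Chicago": (35, 40), "Omaha": (25, 35), "Atlanta": (85, 15),
          "Mobile": (50, 10), "Miami": (90, 5)}
sim = lambda p, name: -((p[0] - cities[name][0]) ** 2 + (p[1] - cities[name][1]) ** 2)
print(knn_retrieve(list(cities), (32, 45), sim, k=3))      # ['Chicago', 'Omaha', 'Denver']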

2
Forms of Retrieval
  • Sequential Retrieval
  • Two-Step Retrieval
  • Retrieval with Indexed Cases

3
Retrieval with Indexed Cases
  • Sources:
  • Bergmann's book
  • Davenport and Prusak's book on Advanced Data
    Structures
  • Samet's book on Data Structures

4
Range Search
[Figure: a range query over the space of known problems]
5
K-D Trees
  • Idea: partition the case base into smaller
    fragments
  • Representation of a k-dimensional space in a
    binary tree
  • Similar to a decision tree: comparisons at the nodes
  • During retrieval:
  • Search for a leaf, but
  • Unlike decision trees, backtracking may occur

6
Definition K-D Trees
  • Given:
  • k types T1, …, Tk for the attributes A1, …, Ak
  • A case base CB containing cases in T1 × … × Tk
  • A parameter b (size of bucket)
  • A k-d tree T(CB) for a case base CB is a binary
    tree defined as follows:
  • If |CB| < b, then T(CB) is a leaf node (a bucket)
  • Else T(CB) defines a tree such that:
  • The root is marked with an attribute Ai and a
    value v for Ai, and
  • The two k-d trees T({c ∈ CB : c.Ai < v})
    and T({c ∈ CB : c.Ai ≥ v}) are the left
    and right subtrees of the root (a construction
    sketch follows)
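A minimal construction sketch in Python that follows this definition; choosing the split attribute round-robin and the split value as the median are illustrative assumptions, since the definition only asks for some attribute Ai and value v.

from dataclasses import dataclass
from typing import List, Optional, Tuple

Case = Tuple[float, ...]                 # one value per attribute A1..Ak

@dataclass
class KDNode:
    axis: Optional[int] = None           # internal node: index of the split attribute Ai
    value: Optional[float] = None        # split value v
    left: Optional["KDNode"] = None      # cases with case[axis] <  value
    right: Optional["KDNode"] = None     # cases with case[axis] >= value
    bucket: Optional[List[Case]] = None  # leaf node: a bucket of cases

def build_kdtree(cases: List[Case], k: int, b: int, depth: int = 0) -> KDNode:
    if len(cases) < b:                           # |CB| < b: make a bucket leaf
        return KDNode(bucket=list(cases))
    axis = depth % k                             # round-robin attribute choice (assumption)
    ordered = sorted(cases, key=lambda c: c[axis])
    value = ordered[len(ordered) // 2][axis]     # median as the split value (assumption)
    left = [c for c in ordered if c[axis] < value]
    right = [c for c in ordered if c[axis] >= value]
    if not left or not right:                    # all values equal: fall back to a bucket
        return KDNode(bucket=list(cases))
    return KDNode(axis=axis, value=value,
                  left=build_kdtree(left, k, b, depth + 1),
                  right=build_kdtree(right, k, b, depth + 1))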

7
Example
[Figure: the eight cities plotted on a 100×100 map, and the corresponding k-d tree.
Root: split on A1 at 35 — A1 < 35 leads to the bucket {Denver (5,45), Omaha (25,35)};
A1 ≥ 35 leads to a split on A2 at 40. A2 < 40 splits on A1 at 85 into the buckets
{Mobile (50,10)} and {Atlanta (85,15), Miami (90,5)}; A2 ≥ 40 splits on A1 at 60 into
the buckets {Chicago (35,40)} and {Toronto (60,75), Buffalo (80,65)}.]
  • Notes:
  • Supports Euclidean distance
  • May require backtracking
  • Closest city to P(32,45)? (see the search sketch below)
  • Priority lists are used for computing kNN
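A hedged sketch of nearest-neighbor retrieval on the KDNode structure built above: descend to the leaf whose region contains the query, then backtrack into the other subtree only when the splitting plane is closer than the best case found so far. Squared Euclidean distance is used for concreteness.

def kd_nearest(node, query, best=None):
    # Returns (best_case, best_squared_distance); the far subtree is revisited
    # (backtracking) only when the splitting plane is closer than the best so far.
    if best is None:
        best = (None, float("inf"))
    if node.bucket is not None:                          # leaf: scan the bucket
        for c in node.bucket:
            d = sum((a - q) ** 2 for a, q in zip(c, query))
            if d < best[1]:
                best = (c, d)
        return best
    near, far = ((node.left, node.right) if query[node.axis] < node.value
                 else (node.right, node.left))
    best = kd_nearest(near, query, best)                 # descend towards the query first
    if (query[node.axis] - node.value) ** 2 < best[1]:   # backtracking test
        best = kd_nearest(far, query, best)
    return best

For instance, building a tree over the eight city coordinates and querying (32, 45) returns Chicago (35, 40), which answers the question above (the exact splits depend on the construction choices, but the answer does not).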

8
Using Decision Trees as Index
[Figure: a standard decision tree — an internal node tests an attribute Ai and branches on its values v1, v2, …, vn]
  • Notes
  • Supports Hamming distance
  • May require backtracking
  • Operates in a similar fashion to k-d trees
  • Priority lists are used for computing kNN

9
Variation Point QuadTree
  • Particularly suited for performing range search
    (i.e., similarity assessment)
  • Adequate when there are few, numerical, and
    known-to-be-important attributes
  • A node in a (point) quadtree contains:
  • 4 pointers: quad[NW], quad[NE],
    quad[SW], and quad[SE]
  • point, of type DataPoint, which in turn contains:
  • name
  • (x,y) coordinates (see the sketch below)
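A direct transcription of this node layout into Python; the class and field names are illustrative.

from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class DataPoint:
    name: str
    x: float
    y: float

@dataclass
class QuadNode:
    point: DataPoint
    # The four quadrant pointers: quad["NW"], quad["NE"], quad["SW"], quad["SE"].
    quad: Dict[str, Optional["QuadNode"]] = field(
        default_factory=lambda: {"NW": None, "NE": None, "SW": None, "SE": None})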

10
Example
[Figure: the cities plotted on a 100×100 map — Toronto (60,75), Buffalo (80,65), Denver (5,45), Chicago (35,40), Omaha (25,35), Atlanta (85,15), Mobile (50,10), Miami (90,5)]
Insertion order: Chicago, Mobile, Toronto,
Buffalo, Denver, Omaha, Atlanta, and Miami
11
Insertion in Quadtree
[Figure: the resulting point quadtree — Chicago at the root; Denver (NW), Toronto (NE), Omaha (SW), and Mobile (SE) as its children; Buffalo in the SE quadrant of Toronto; Atlanta (NE) and Miami (SE) under Mobile]
12
Insertion Procedure
We define a new type quadrant = {NW, NE, SW, SE}

function PT_compare(DataPoint dP, dR): quadrant
  // returns the quadrant where dP belongs relative to dR
  if (dP.x < dR.x) then
    if (dP.y < dR.y) then return SW else return NW
  else
    if (dP.y < dR.y) then return SE else return NE
13
Insertion Procedure (Cont.)
procedure PT_insert(Pointer P, R)  // inserts P in the tree rooted at R
  Pointer T   // points to the current node being examined
  Pointer F   // points to the parent of T
  Quadrant Q  // auxiliary variable
  T ← R; F ← null
  while not(T = null) and not(equalCoord(P.point, T.point)) do
    F ← T
    Q ← PT_compare(P.point, T.point)
    T ← T.quad[Q]
  if (T = null) then F.quad[Q] ← P
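The same two routines expressed as runnable Python against the QuadNode/DataPoint sketch from slide 9; the behavior (silently ignoring a point with duplicate coordinates, and assuming a non-empty tree) follows the pseudocode above.

def pt_compare(dP, dR):
    # Quadrant where dP belongs relative to dR (same convention as PT_compare).
    if dP.x < dR.x:
        return "SW" if dP.y < dR.y else "NW"
    return "SE" if dP.y < dR.y else "NE"

def pt_insert(P, R):
    # Inserts node P into the quadtree rooted at R (R is assumed non-null).
    T, F, Q = R, None, None            # current node, its parent, last quadrant taken
    while T is not None and (T.point.x, T.point.y) != (P.point.x, P.point.y):
        F, Q = T, pt_compare(P.point, T.point)
        T = T.quad[Q]
    if T is None:                      # fell off the tree: hang P under its parent
        F.quad[Q] = P

Inserting the cities in the order given on slide 10 (Chicago first, as the root) reproduces the quadtree sketched on slide 11.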
14
Search
Typical query: 'find all cities within 50 miles
of Washington, DC'
In the initial example: find all cities within 8
data units of (83,13)
  • Solution:
  • Discard the NW, SW and NE quadrants of Chicago (that is,
    only examine SE)
  • There is no need to search the NW and SW quadrants of Mobile
    (a range-search sketch follows)
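A sketch of such a range (circle) query over the QuadNode structure, pruning every quadrant whose rectangle cannot intersect the query circle; the explicit 0-100 bounding box matches the example map and is an assumption.

import math

def pt_range_search(node, cx, cy, r, xlo=0, ylo=0, xhi=100, yhi=100, hits=None):
    # Collects the names of all points within distance r of (cx, cy);
    # (xlo, ylo, xhi, yhi) is the rectangle covered by `node`.
    if hits is None:
        hits = []
    if node is None:
        return hits
    dx = max(xlo - cx, 0, cx - xhi)            # distance from the query point
    dy = max(ylo - cy, 0, cy - yhi)            # to this node's rectangle
    if math.hypot(dx, dy) > r:                 # prune: the circle misses the rectangle
        return hits
    p = node.point
    if math.hypot(p.x - cx, p.y - cy) <= r:
        hits.append(p.name)
    pt_range_search(node.quad["NW"], cx, cy, r, xlo, p.y, p.x, yhi, hits)
    pt_range_search(node.quad["NE"], cx, cy, r, p.x, p.y, xhi, yhi, hits)
    pt_range_search(node.quad["SW"], cx, cy, r, xlo, ylo, p.x, p.y, hits)
    pt_range_search(node.quad["SE"], cx, cy, r, p.x, ylo, xhi, p.y, hits)
    return hits

On the example tree, pt_range_search(root, 83, 13, 8) prunes everything except the SE quadrant of Chicago and the NE and SE quadrants of Mobile, and returns Atlanta.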

15
Search (II)
Let R be the root of the quadtree. Which of R's quadrants
need to be inspected if R falls in the given region?

[Figure: the query circle of radius r around the query point A, with the
surrounding plane divided into regions numbered 1-12]

Region   Quadrants of R to inspect
1        SE
2        SW, SE
8        NW
11       NW, NE, SE
16
Priority Queues
  • Typical example: printing in a Unix/Linux
    environment. Printing jobs have different
    priorities.
  • These priorities may override the FIFO policy of
    the queues (i.e., jobs with the highest
    priorities will get printed first).
  • Operations supported in a priority queue:
  • Insert a new element
  • Extract/delete the element with the lowest
    priority value
  • In search trees, the priority is based on the
    distance
  • Insertion and deletion can be done in O(log N) and
    look-ahead in O(1) (see the example below)
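A brief illustration using Python's heapq, a binary min-heap in which push and pop are O(log n) and peeking at the minimum is O(1); the key here is the distance, as in the search trees above (the numbers are taken from the later quadtree example).

import heapq

pq = []                                   # a min-heap keyed by distance
heapq.heappush(pq, (4225, "Chicago"))     # insert: O(log n)
heapq.heappush(pq, (25, "Toronto"))
heapq.heappush(pq, (0, "Mobile"))
print(pq[0])                              # look-ahead (peek): O(1)   -> (0, 'Mobile')
print(heapq.heappop(pq))                  # extract minimum: O(log n) -> (0, 'Mobile')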

17
Nearest-Neighbor Search
Problem: given a point quadtree T and a point P,
find the node in T that is closest to P
Idea: traverse the quadtree maintaining a
priority list, candidates, ordered by the distance
from P to the quadrants containing the candidate
nodes
[Figure: the example map with the query point P(95,15), closest to Atlanta (85,15) and Miami (90,5)]
18
Distance from P to a Quadrant
Let f⁻¹ be the inverse of the distance-similarity
compatible function. For a node point (x,y) whose SE
quadrant contains P:

distance(P, NW) = f⁻¹(sim(P, (x, y)))
distance(P, NE) = f⁻¹(sim(P, (P.x, y)))
distance(P, SW) = f⁻¹(sim(P, (x, P.y)))
distance(P, SE) = 0

[Figure: P and its projections P1–P4 onto the boundaries of the four quadrants of (x,y)]
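A small helper mirroring these formulas, with plain Euclidean distance standing in for f⁻¹(sim(·,·)); that substitution, and the function name, are assumptions for concreteness.

import math

def quadrant_distance(px, py, x, y, quadrant):
    # Distance from P=(px, py) to the closest point of the given quadrant of
    # the node point (x, y); 0 if P already lies inside that quadrant.
    qx = min(px, x) if quadrant in ("NW", "SW") else max(px, x)   # clamp x onto the quadrant
    qy = min(py, y) if quadrant in ("SW", "SE") else max(py, y)   # clamp y onto the quadrant
    return math.hypot(px - qx, py - qy)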
19
Idea of the Algorithm
Candidates: Chicago (4225)
Buffer: null (∞)
[Figure: the example map with the query point P(95,15)]
After expanding Chicago:
Candidates: Mobile (0), Toronto (25), Omaha (60),
Denver (4225)
Buffer: Chicago (4225)
20
List of Candidates
  • Termination test: Buffer.distance <
    distance(candidates.top, P)
  • If yes, then return Buffer
  • If no, then continue
  • In this particular example the answer is no, since
    Mobile's quadrant is closer to P than Chicago
  • Examine the quadrant of the top of candidates
    (Mobile) and make it the new buffer

distance(P, NE) = 0, distance(P, SE) = 5
[Figure: the quadrants of Mobile (50,10) — Atlanta (85,15) in NE, Miami (90,5) in SE, and the query point P(95,15)]
Buffer: Mobile (1625)
21
Finally the Nearest Neighbor is Found
Candidates: Atlanta (0), Miami (5), Toronto (25),
Omaha (60), Denver (4225)
Buffer: Atlanta (100)
A new iteration:
Candidates: Miami (5), Toronto (25), Omaha (60),
Denver (4225)
The algorithm terminates, since the distance from
Atlanta to P is less than the distance from Miami
to P
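A sketch of the whole procedure as a best-first search: a priority queue of (distance-to-quadrant, node) candidates plus a buffer holding the best point seen so far. It reuses the quadrant_distance helper sketched after slide 18 and plain Euclidean point distances, which is an assumption.

import heapq
import itertools
import math

def pt_nearest(root, px, py):
    counter = itertools.count()                    # tie-breaker for equal distances
    candidates = [(0.0, next(counter), root)]
    best_name, best_dist = None, float("inf")      # the "buffer"
    while candidates:
        d_quad, _, node = heapq.heappop(candidates)
        if d_quad >= best_dist:                    # termination test from slide 20
            break
        p = node.point
        d = math.hypot(p.x - px, p.y - py)
        if d < best_dist:                          # better point found: update the buffer
            best_name, best_dist = p.name, d
        for q in ("NW", "NE", "SW", "SE"):         # enqueue non-empty quadrants by distance
            child = node.quad[q]
            if child is not None:
                dq = quadrant_distance(px, py, p.x, p.y, q)
                heapq.heappush(candidates, (dq, next(counter), child))
    return best_name, best_dist

On the example with P(95,15), the nodes are expanded in the order Chicago, Mobile, Atlanta, Miami, and the search stops once the best remaining quadrant distance (Toronto's, 25) can no longer beat Atlanta at distance 10.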
22
Complexity
  • Experiments show that random insertion of N nodes
    is roughly O(N log₄ N)
  • Thus, insertion of a single node is O(log₄ N)
  • But the worst case (actual complexity) can be much
    worse
  • Range search can be performed in O(2·N^(1/2))

23
Delete
  • First idea:
  • Find the node N that you want to delete
  • Delete N and all of its descendants ND
  • For each node in ND, add it back into the tree

A terrible idea: it is far too inefficient!
24
Idealized Deletion in Quadtrees
If a point A is to be deleted, find a point B such
that the region between A and B is empty, and
replace A with B.
[Figure: points A and B with the empty (hatched) region between them]
Why?
Because all the remaining points will be in the
same quadrants relative to B as they are relative
to A. For example, Omaha could replace Chicago as
the root.
25
Problem with Idealized Situation
First problem: a lot of effort is required to
find such a B.
In the following example, which point (C, F, D or
A) has a hatched region with A?
Answer: none! Second problem: no such B may
exist!
26
Problem with Defining a New Root
Several points will have to be re-positioned.
[Figure: the quadtree before and after the deletion, with the old root and the new root marked]
27
Deletion Process
Delete(P):
1. If P is a leaf, then just delete it!
2. If P has a single child C, then replace P with C.
3. For all other cases:
   3.1 Compute 4 candidate nodes, one for each
       quadrant under P
   3.2 Select one of the candidate nodes, N, according
       to certain criteria
   3.3 Delete several nodes under P and collect them in
       a list, ADD. Also delete N.
   3.4 Make N.point the new root: P.point ← N.point
   3.5 Re-insert all nodes in ADD
(a simplified sketch follows)
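A deliberately simplified sketch reusing the QuadNode, pt_compare, and pt_insert sketches from the earlier slides: cases 1 and 2 are handled as described, but for the general case it simply collects and re-inserts every point below P instead of implementing the candidate-selection criteria of steps 3.1-3.4, which are left unspecified here.

def collect_points(node, out):
    # Gather every DataPoint in the subtree rooted at `node`.
    if node is not None:
        out.append(node.point)
        for child in node.quad.values():
            collect_points(child, out)
    return out

def pt_delete(root, target):
    # Delete the node whose point has the same coordinates as `target`;
    # returns the (possibly new) root of the quadtree.
    parent, q, node = None, None, root
    while node is not None and (node.point.x, node.point.y) != (target.x, target.y):
        parent, q = node, pt_compare(target, node.point)
        node = node.quad[q]
    if node is None:                                  # target not in the tree
        return root
    children = [c for c in node.quad.values() if c is not None]
    if len(children) <= 1:                            # cases 1 and 2: leaf or single child
        replacement = children[0] if children else None
    else:                                             # general case: rebuild the subtree
        add = []
        for c in children:
            collect_points(c, add)
        replacement = QuadNode(add[0])                # simplification: any collected point works
        for p in add[1:]:
            pt_insert(QuadNode(p), replacement)
    if parent is None:                                # we deleted the root
        return replacement
    parent.quad[q] = replacement
    return root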
28
A Word of Warning About Deletion
  • In databases, deletion is frequently not done
    immediately because it is so time-consuming.
  • Sometimes they don't even do insertions
    immediately!
  • Instead they keep a log with all deletions (and
    additions), and periodically (e.g., every night or
    weekend) the log is traversed to update the
    database. The technique is called Differential
    Databases.
  • Deleting cases is part of the general problem of
    case-base maintenance.

29
Properties of Retrieval with Indexed Cases
  • Advantages
  • Efficient retrieval
  • Incremental: there is no need to rebuild the index
    every time a new case is entered
  • α-error does not occur
  • Disadvantages
  • Cost of construction is high
  • Only works for monotonic similarity relations