K-Nearest Neighbors (kNN)

Transcript and Presenter's Notes
1
K-Nearest Neighbors (kNN)
  • Given a case base CB, a new problem P, and a
    similarity metric sim
  • Obtain the k cases in CB that are most similar
    to P according to sim
  • Reminder: we used a priority list holding the top k
    most similar cases obtained so far (a sketch follows below)
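A minimal sketch of that priority-list idea in Python, using a bounded min-heap from heapq; the case base, the similarity function, and the city data below are illustrative assumptions, not part of the slides' code.

import heapq

def knn_retrieve(case_base, problem, sim, k):
    # A bounded min-heap holds the k most similar cases seen so far; its root
    # is the *least* similar retained case and is evicted when a better one arrives.
    heap = []  # entries: (similarity, tie-breaker, case)
    for i, case in enumerate(case_base):
        s = sim(problem, case)
        if len(heap) < k:
            heapq.heappush(heap, (s, i, case))
        elif s > heap[0][0]:               # more similar than the worst retained case
            heapq.heapreplace(heap, (s, i, case))
    return [c for s, i, c in sorted(heap, reverse=True)]   # most similar first

# Illustrative usage with the cities from the later slides; similarity is taken
# to be the negated squared Euclidean distance (an assumption).
cities = {"Toronto": (60, 75), "Buffalo": (80, 65), "Denver": (5, 45),
          "Chicago": (35, 40), "Omaha": (25, 35), "Atlanta": (85, 15),
          "Mobile": (50, 10), "Miami": (90, 5)}
sim = lambda p, name: -((p[0] - cities[name][0]) ** 2 + (p[1] - cities[name][1]) ** 2)
print(knn_retrieve(list(cities), (32, 45), sim, k=3))      # ['Chicago', 'Omaha', 'Denver']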

2
Forms of Retrieval
  • Sequential Retrieval
  • Two-Step Retrieval
  • Retrieval with Indexed Cases

3
Retrieval with Indexed Cases
  • Sources:
  • Bergmann's book
  • Davenport and Prusak's book on Advanced Data
    Structures
  • Samet's book on Data Structures

4
Range Search
[Figure: a range query over the space of known problems]
5
K-D Trees
  • Idea: partition the case base into smaller
    fragments
  • Representation of a k-dimensional space in a
    binary tree
  • Similar to a decision tree: comparisons at the nodes
  • During retrieval:
  • Search for a leaf, but
  • Unlike decision trees, backtracking may occur

6
Definition K-D Trees
  • Given:
  • k types T1, …, Tk for the attributes A1, …, Ak
  • A case base CB containing cases in T1 × … × Tk
  • A parameter b (size of bucket)
  • A k-d tree T(CB) for a case base CB is a binary
    tree defined as follows:
  • If |CB| < b, then T(CB) is a leaf node (a bucket)
  • Else T(CB) defines a tree such that:
  • The root is marked with an attribute Ai and a
    value v for Ai, and
  • The two k-d trees T({c ∈ CB : c.Ai < v})
    and T({c ∈ CB : c.Ai ≥ v}) are the left
    and right subtrees of the root (a construction
    sketch follows)
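A minimal construction sketch in Python that follows this definition; choosing the split attribute round-robin and the split value as the median are illustrative assumptions, since the definition only asks for some attribute Ai and value v.

from dataclasses import dataclass
from typing import List, Optional, Tuple

Case = Tuple[float, ...]                 # one value per attribute A1..Ak

@dataclass
class KDNode:
    axis: Optional[int] = None           # internal node: index of the split attribute Ai
    value: Optional[float] = None        # split value v
    left: Optional["KDNode"] = None      # cases with case[axis] <  value
    right: Optional["KDNode"] = None     # cases with case[axis] >= value
    bucket: Optional[List[Case]] = None  # leaf node: a bucket of cases

def build_kdtree(cases: List[Case], k: int, b: int, depth: int = 0) -> KDNode:
    if len(cases) < b:                           # |CB| < b: make a bucket leaf
        return KDNode(bucket=list(cases))
    axis = depth % k                             # round-robin attribute choice (assumption)
    ordered = sorted(cases, key=lambda c: c[axis])
    value = ordered[len(ordered) // 2][axis]     # median as the split value (assumption)
    left = [c for c in ordered if c[axis] < value]
    right = [c for c in ordered if c[axis] >= value]
    if not left or not right:                    # all values equal: fall back to a bucket
        return KDNode(bucket=list(cases))
    return KDNode(axis=axis, value=value,
                  left=build_kdtree(left, k, b, depth + 1),
                  right=build_kdtree(right, k, b, depth + 1))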

7
Example
[Figure: the eight cities plotted on a 100×100 map, and the corresponding k-d tree.
Root: split on A1 at 35 — A1 < 35 leads to the bucket {Denver (5,45), Omaha (25,35)};
A1 ≥ 35 leads to a split on A2 at 40. A2 < 40 splits on A1 at 85 into the buckets
{Mobile (50,10)} and {Atlanta (85,15), Miami (90,5)}; A2 ≥ 40 splits on A1 at 60 into
the buckets {Chicago (35,40)} and {Toronto (60,75), Buffalo (80,65)}.]
  • Notes:
  • Supports Euclidean distance
  • May require backtracking
  • Closest city to P(32,45)? (see the search sketch below)
  • Priority lists are used for computing kNN
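A hedged sketch of nearest-neighbor retrieval on the KDNode structure built above: descend to the leaf whose region contains the query, then backtrack into the other subtree only when the splitting plane is closer than the best case found so far. Squared Euclidean distance is used for concreteness.

def kd_nearest(node, query, best=None):
    # Returns (best_case, best_squared_distance); the far subtree is revisited
    # (backtracking) only when the splitting plane is closer than the best so far.
    if best is None:
        best = (None, float("inf"))
    if node.bucket is not None:                          # leaf: scan the bucket
        for c in node.bucket:
            d = sum((a - q) ** 2 for a, q in zip(c, query))
            if d < best[1]:
                best = (c, d)
        return best
    near, far = ((node.left, node.right) if query[node.axis] < node.value
                 else (node.right, node.left))
    best = kd_nearest(near, query, best)                 # descend towards the query first
    if (query[node.axis] - node.value) ** 2 < best[1]:   # backtracking test
        best = kd_nearest(far, query, best)
    return best

For instance, building a tree over the eight city coordinates and querying (32, 45) returns Chicago (35, 40), which answers the question above (the exact splits depend on the construction choices, but the answer does not).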

8
Using Decision Trees as Index
[Figure: a standard decision tree — an internal node tests an attribute Ai and branches on its values v1, v2, …, vn]
  • Notes
  • Supports Hamming distance
  • May require backtracking
  • Operates in a similar fashion to k-d trees
  • Priority lists are used for computing kNN

9
Variation Point QuadTree
  • Particularly suited for performing range search
    (i.e., similarity assessment)
  • Adequate when there are few, numerical, and
    known-to-be-important attributes
  • A node in a (point) quadtree contains:
  • 4 pointers: quad[NW], quad[NE],
    quad[SW], and quad[SE]
  • point, of type DataPoint, which in turn contains:
  • name
  • (x,y) coordinates (see the sketch below)
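A direct transcription of this node layout into Python; the class and field names are illustrative.

from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class DataPoint:
    name: str
    x: float
    y: float

@dataclass
class QuadNode:
    point: DataPoint
    # The four quadrant pointers: quad["NW"], quad["NE"], quad["SW"], quad["SE"].
    quad: Dict[str, Optional["QuadNode"]] = field(
        default_factory=lambda: {"NW": None, "NE": None, "SW": None, "SE": None})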

10
Example
[Figure: the cities plotted on a 100×100 map — Toronto (60,75), Buffalo (80,65), Denver (5,45), Chicago (35,40), Omaha (25,35), Atlanta (85,15), Mobile (50,10), Miami (90,5)]
Insertion order: Chicago, Mobile, Toronto,
Buffalo, Denver, Omaha, Atlanta, and Miami
11
Insertion in Quadtree
[Figure: the resulting point quadtree — Chicago at the root; Denver (NW), Toronto (NE), Omaha (SW), and Mobile (SE) as its children; Buffalo in the SE quadrant of Toronto; Atlanta (NE) and Miami (SE) under Mobile]
12
Insertion Procedure
We define a new type quadrant = {NW, NE, SW, SE}

function PT_compare(DataPoint dP, dR): quadrant
  // returns the quadrant where dP belongs relative to dR
  if (dP.x < dR.x) then
    if (dP.y < dR.y) then return SW else return NW
  else
    if (dP.y < dR.y) then return SE else return NE
13
Insertion Procedure (Cont.)
procedure PT_insert(Pointer P, R)  // inserts P in the tree rooted at R
  Pointer T   // points to the current node being examined
  Pointer F   // points to the parent of T
  Quadrant Q  // auxiliary variable
  T ← R; F ← null
  while not(T = null) and not(equalCoord(P.point, T.point)) do
    F ← T
    Q ← PT_compare(P.point, T.point)
    T ← T.quad[Q]
  if (T = null) then F.quad[Q] ← P
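The same two routines expressed as runnable Python against the QuadNode/DataPoint sketch from slide 9; the behavior (silently ignoring a point with duplicate coordinates, and assuming a non-empty tree) follows the pseudocode above.

def pt_compare(dP, dR):
    # Quadrant where dP belongs relative to dR (same convention as PT_compare).
    if dP.x < dR.x:
        return "SW" if dP.y < dR.y else "NW"
    return "SE" if dP.y < dR.y else "NE"

def pt_insert(P, R):
    # Inserts node P into the quadtree rooted at R (R is assumed non-null).
    T, F, Q = R, None, None            # current node, its parent, last quadrant taken
    while T is not None and (T.point.x, T.point.y) != (P.point.x, P.point.y):
        F, Q = T, pt_compare(P.point, T.point)
        T = T.quad[Q]
    if T is None:                      # fell off the tree: hang P under its parent
        F.quad[Q] = P

Inserting the cities in the order given on slide 10 (Chicago first, as the root) reproduces the quadtree sketched on slide 11.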
14
Search
Typical query: 'find all cities within 50 miles
of Washington, DC'
In the initial example: find all cities within 8
data units of (83,13)
  • Solution:
  • Discard the NW, SW and NE quadrants of Chicago (that is,
    only examine SE)
  • There is no need to search the NW and SW quadrants of Mobile
    (a range-search sketch follows)
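A sketch of such a range (circle) query over the QuadNode structure, pruning every quadrant whose rectangle cannot intersect the query circle; the explicit 0-100 bounding box matches the example map and is an assumption.

import math

def pt_range_search(node, cx, cy, r, xlo=0, ylo=0, xhi=100, yhi=100, hits=None):
    # Collects the names of all points within distance r of (cx, cy);
    # (xlo, ylo, xhi, yhi) is the rectangle covered by `node`.
    if hits is None:
        hits = []
    if node is None:
        return hits
    dx = max(xlo - cx, 0, cx - xhi)            # distance from the query point
    dy = max(ylo - cy, 0, cy - yhi)            # to this node's rectangle
    if math.hypot(dx, dy) > r:                 # prune: the circle misses the rectangle
        return hits
    p = node.point
    if math.hypot(p.x - cx, p.y - cy) <= r:
        hits.append(p.name)
    pt_range_search(node.quad["NW"], cx, cy, r, xlo, p.y, p.x, yhi, hits)
    pt_range_search(node.quad["NE"], cx, cy, r, p.x, p.y, xhi, yhi, hits)
    pt_range_search(node.quad["SW"], cx, cy, r, xlo, ylo, p.x, p.y, hits)
    pt_range_search(node.quad["SE"], cx, cy, r, p.x, ylo, xhi, p.y, hits)
    return hits

On the example tree, pt_range_search(root, 83, 13, 8) prunes everything except the SE quadrant of Chicago and the NE and SE quadrants of Mobile, and returns Atlanta.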

15
Search (II)
Let R be the root of the quadtree. Which of R's quadrants
need to be inspected if R falls in the given region?

[Figure: the query circle of radius r around the query point A, with the
surrounding plane divided into regions numbered 1-12]

Region   Quadrants of R to inspect
1        SE
2        SW, SE
8        NW
11       NW, NE, SE
16
Priority Queues
  • Typical example: printing in a Unix/Linux
    environment. Printing jobs have different
    priorities.
  • These priorities may override the FIFO policy of
    the queues (i.e., jobs with the highest
    priorities will get printed first).
  • Operations supported in a priority queue:
  • Insert a new element
  • Extract/delete the element with the lowest
    priority value
  • In search trees, the priority is based on the
    distance
  • Insertion and deletion can be done in O(log N) and
    look-ahead in O(1) (see the example below)
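A brief illustration using Python's heapq, a binary min-heap in which push and pop are O(log n) and peeking at the minimum is O(1); the key here is the distance, as in the search trees above (the numbers are taken from the later quadtree example).

import heapq

pq = []                                   # a min-heap keyed by distance
heapq.heappush(pq, (4225, "Chicago"))     # insert: O(log n)
heapq.heappush(pq, (25, "Toronto"))
heapq.heappush(pq, (0, "Mobile"))
print(pq[0])                              # look-ahead (peek): O(1)   -> (0, 'Mobile')
print(heapq.heappop(pq))                  # extract minimum: O(log n) -> (0, 'Mobile')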

17
Nearest-Neighbor Search
Problem: given a point quadtree T and a point P,
find the node in T that is closest to P
Idea: traverse the quadtree maintaining a
priority list, candidates, ordered by the distance
from P to the quadrants containing the candidate
nodes
[Figure: the example map with the query point P(95,15), closest to Atlanta (85,15) and Miami (90,5)]
18
Distance from P to a Quadrant
Let f⁻¹ be the inverse of the distance-similarity
compatible function. For a node point (x,y) whose SE
quadrant contains P:

distance(P, NW) = f⁻¹(sim(P, (x, y)))
distance(P, NE) = f⁻¹(sim(P, (P.x, y)))
distance(P, SW) = f⁻¹(sim(P, (x, P.y)))
distance(P, SE) = 0

[Figure: P and its projections P1–P4 onto the boundaries of the four quadrants of (x,y)]
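A small helper mirroring these formulas, with plain Euclidean distance standing in for f⁻¹(sim(·,·)); that substitution, and the function name, are assumptions for concreteness.

import math

def quadrant_distance(px, py, x, y, quadrant):
    # Distance from P=(px, py) to the closest point of the given quadrant of
    # the node point (x, y); 0 if P already lies inside that quadrant.
    qx = min(px, x) if quadrant in ("NW", "SW") else max(px, x)   # clamp x onto the quadrant
    qy = min(py, y) if quadrant in ("SW", "SE") else max(py, y)   # clamp y onto the quadrant
    return math.hypot(px - qx, py - qy)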
19
Idea of the Algorithm
Candidates: Chicago (4225)
Buffer: null (∞)
[Figure: the example map with the query point P(95,15)]
After expanding Chicago:
Candidates: Mobile (0), Toronto (25), Omaha (60),
Denver (4225)
Buffer: Chicago (4225)
20
List of Candidates
  • Termination test: Buffer.distance <
    distance(candidates.top, P)
  • If yes, then return Buffer
  • If no, then continue
  • In this particular example the answer is no, since
    Mobile's quadrant is closer to P than Chicago
  • Examine the quadrant of the top of candidates
    (Mobile) and make it the new buffer

distance(P, NE) = 0, distance(P, SE) = 5
[Figure: the quadrants of Mobile (50,10) — Atlanta (85,15) in NE, Miami (90,5) in SE, and the query point P(95,15)]
Buffer: Mobile (1625)
21
Finally the Nearest Neighbor is Found
Candidates: Atlanta (0), Miami (5), Toronto (25),
Omaha (60), Denver (4225)
Buffer: Atlanta (100)
A new iteration:
Candidates: Miami (5), Toronto (25), Omaha (60),
Denver (4225)
The algorithm terminates, since the distance from
Atlanta to P is less than the distance from Miami
to P
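A sketch of the whole procedure as a best-first search: a priority queue of (distance-to-quadrant, node) candidates plus a buffer holding the best point seen so far. It reuses the quadrant_distance helper sketched after slide 18 and plain Euclidean point distances, which is an assumption.

import heapq
import itertools
import math

def pt_nearest(root, px, py):
    counter = itertools.count()                    # tie-breaker for equal distances
    candidates = [(0.0, next(counter), root)]
    best_name, best_dist = None, float("inf")      # the "buffer"
    while candidates:
        d_quad, _, node = heapq.heappop(candidates)
        if d_quad >= best_dist:                    # termination test from slide 20
            break
        p = node.point
        d = math.hypot(p.x - px, p.y - py)
        if d < best_dist:                          # better point found: update the buffer
            best_name, best_dist = p.name, d
        for q in ("NW", "NE", "SW", "SE"):         # enqueue non-empty quadrants by distance
            child = node.quad[q]
            if child is not None:
                dq = quadrant_distance(px, py, p.x, p.y, q)
                heapq.heappush(candidates, (dq, next(counter), child))
    return best_name, best_dist

On the example with P(95,15), the nodes are expanded in the order Chicago, Mobile, Atlanta, Miami, and the search stops once the best remaining quadrant distance (Toronto's, 25) can no longer beat Atlanta at distance 10.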
22
Complexity
  • Experiments show that random insertion of N nodes
    is roughly O(N log₄ N)
  • Thus, insertion of a single node is O(log₄ N)
  • But the worst case (actual complexity) can be much
    worse
  • Range search can be performed in O(2·N^(1/2))

23
Delete
  • First idea:
  • Find the node N that you want to delete
  • Delete N and all of its descendants ND
  • For each node in ND, add it back into the tree

A terrible idea: it is far too inefficient!
24
Idealized Deletion in Quadtrees
If a point A is to be deleted, find a point B such
that the region between A and B is empty, and
replace A with B.
[Figure: points A and B with the empty (hatched) region between them]
Why?
Because all the remaining points will be in the
same quadrants relative to B as they are relative
to A. For example, Omaha could replace Chicago as
the root.
25
Problem with Idealized Situation
First problem: a lot of effort is required to
find such a B.
In the following example, which point (C, F, D or
A) has a hatched region with A?
Answer: none! Second problem: no such B may
exist!
26
Problem with Defining a New Root
Several points will have to be re-positioned.
[Figure: the quadtree before and after the deletion, with the old root and the new root marked]
27
Deletion Process
Delete(P):
1. If P is a leaf, then just delete it!
2. If P has a single child C, then replace P with C.
3. For all other cases:
   3.1 Compute 4 candidate nodes, one for each
       quadrant under P
   3.2 Select one of the candidate nodes, N, according
       to certain criteria
   3.3 Delete several nodes under P and collect them in
       a list, ADD. Also delete N.
   3.4 Make N.point the new root: P.point ← N.point
   3.5 Re-insert all nodes in ADD
(a simplified sketch follows)
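A deliberately simplified sketch reusing the QuadNode, pt_compare, and pt_insert sketches from the earlier slides: cases 1 and 2 are handled as described, but for the general case it simply collects and re-inserts every point below P instead of implementing the candidate-selection criteria of steps 3.1-3.4, which are left unspecified here.

def collect_points(node, out):
    # Gather every DataPoint in the subtree rooted at `node`.
    if node is not None:
        out.append(node.point)
        for child in node.quad.values():
            collect_points(child, out)
    return out

def pt_delete(root, target):
    # Delete the node whose point has the same coordinates as `target`;
    # returns the (possibly new) root of the quadtree.
    parent, q, node = None, None, root
    while node is not None and (node.point.x, node.point.y) != (target.x, target.y):
        parent, q = node, pt_compare(target, node.point)
        node = node.quad[q]
    if node is None:                                  # target not in the tree
        return root
    children = [c for c in node.quad.values() if c is not None]
    if len(children) <= 1:                            # cases 1 and 2: leaf or single child
        replacement = children[0] if children else None
    else:                                             # general case: rebuild the subtree
        add = []
        for c in children:
            collect_points(c, add)
        replacement = QuadNode(add[0])                # simplification: any collected point works
        for p in add[1:]:
            pt_insert(QuadNode(p), replacement)
    if parent is None:                                # we deleted the root
        return replacement
    parent.quad[q] = replacement
    return root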
28
A Word of Warning About Deletion
  • In databases, deletion is frequently not done
    immediately because it is so time-consuming.
  • Sometimes they don't even do insertions
    immediately!
  • Instead they keep a log with all deletions (and
    additions), and periodically (e.g., every night or
    weekend) the log is traversed to update the
    database. The technique is called Differential
    Databases.
  • Deleting cases is part of the general problem of
    case-base maintenance.

29
Properties of Retrieval with Indexed Cases
  • Advantages
  • Efficient retrieval
  • Incremental: there is no need to rebuild the index
    every time a new case is entered
  • α-error does not occur
  • Disadvantages
  • Cost of construction is high
  • Only works for monotonic similarity relations