Title: Improved Sketching of Hamming Distance with Error Correcting
1. Improved Sketching of Hamming Distance with Error Correcting
Ely Porat, Bar-Ilan University / Google Inc
Ohad Lipsky, Bar-Ilan University / Check Point Inc
December 2003
2. Problem Definition (1)
[Figure: Alice holds a string T_A of length n; Bob holds a string T_B of length n. Goal: compute hamm(T_A, T_B).]
Given k, a bound on the number of mismatches.
3. Problem Definition (2)
[Figure: T_A and T_B, each of length n, are compressed into small sketches S_A and S_B.]
- Calculate hamm(T_A, T_B) given only S_A, S_B.
- Find the mistakes.
- Given k, a bound on the number of mismatches.
4. Motivations
- Databases
- Internet
- Error correcting
[Figure: Routers A, B, C, D exchanging data.]
5. Outline
- Simple solution
- Error correcting
- Improved solution
- Improving further
- Recursion
- File sharing
6. Simplest Solution - O(k² log 1/δ)
- Binary alphabet.
- Allocate k² cells.
- Hash each bit of the input array to one of the cells.
- In each cell keep the XOR of all the values hashed to it.
[Figure: bits 0, 1, 1, 0 being hashed into cells.]
7. Simplest Solution - O(k² log 1/δ)
[Figure: the bits 0, 1, 0, 0 and 1, 1, 0, 0 of the two strings hashed into the same cells.]
8. Simplest Solution - O(k² log 1/δ)
- By the birthday principle, the probability that 2 errors fall into the same cell is constant.
- Repeat log 1/δ times to push the failure probability down to δ.
[Figure: bits 0, 1, 1, 0 in cells.]
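The k²-cell XOR scheme above can be sketched in a few lines of Python. This is our own toy illustration (names like `make_sketch` are ours, not the talk's code); the shared seed stands in for the common random hash function Alice and Bob must agree on.

```python
import random

def make_sketch(text, k, seed=0):
    """Hash each position into one of k*k cells and XOR the bits per cell."""
    rng = random.Random(seed)
    # Shared random map: position -> cell (both parties must use the same one).
    cell_of = [rng.randrange(k * k) for _ in range(len(text))]
    cells = [0] * (k * k)
    for i, bit in enumerate(text):
        cells[cell_of[i]] ^= bit
    return cells

def estimate_hamming(sketch_a, sketch_b):
    # Every cell receiving an odd number of mismatches ends up differing.
    return sum(1 for a, b in zip(sketch_a, sketch_b) if a != b)

ta = [0, 1, 1, 0, 1, 0, 0, 1]
tb = [0, 1, 0, 0, 1, 0, 1, 1]   # 2 mismatches (positions 2 and 6)
print(estimate_hamming(make_sketch(ta, 2), make_sketch(tb, 2)))
```

Unless two mismatches collide in the same cell (the birthday event the slides bound), the estimate equals the true Hamming distance.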
9. Alphabet
- Denote by S the size of the alphabet.
- We can encode each letter by its unary representation:
  0 → 1000000…0, 1 → 0100000…0, …, S−1 → 0000000…1.
- The only effect is that each mismatch is counted twice:
  e.g. 0 → 1000000…0 and 5 → 0000010…0 differ in exactly 2 bits.
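A minimal Python check of the claim that unary (one-hot) encoding turns each symbol mismatch into exactly two bit mismatches (the helper names here are ours):

```python
def unary(c, S):
    """One-hot ('unary') encoding of symbol c over an alphabet of size S."""
    return [1 if j == c else 0 for j in range(S)]

def encode(text, S):
    return [bit for c in text for bit in unary(c, S)]

S = 8
a = encode([0, 5, 3], S)
b = encode([0, 2, 3], S)
bit_mismatches = sum(x != y for x, y in zip(a, b))
print(bit_mismatches)  # 2: the single symbol mismatch (5 vs 2) is counted twice
```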
10. Error correcting - O(k² log NS)
- Here we allocate two kinds of k² cells: k² cells of log S bits and k² cells of log NS bits.
- C1[h(i)] accumulates the values A_i hashed to it; e.g. 5, 8, 3, 2.
- C2[h(i)] accumulates i·A_i; e.g. 15, 6, 7, 8.
11. Error correcting - O(k² log NS)
- As before, with probability 1/2 no two errors fall into the same cell.
[Figure: C1[h(i)] = Σ A_i cells 5, 8, 3, 2; C2[h(i)] = Σ i·A_i cells 15, 6, 7, 8.]
12. Error correcting - O(k² log NS)
- From the red cells we get the value differences:
[Figure: Alice's C1 cells 5, 8, 3, 2 vs Bob's 5, 6, 3, 2; differences 8 − 6 and 5 − 3.]
13. Error correcting - O(k² log NS)
- From the blue cells we get i·(value difference) per cell, which locates the error:
[Figure: Alice's C2 cells 15, 11, 7, 5 vs Bob's 15, 9, 7, 5; from 11 − 9 together with the C1 difference 5 − 3, the error position i is recovered.]
14. Error correcting - O(k² log NS)
- The probability of success is about 1/2.
- To lower the failure probability, run it 3 times.
- Each run yields a list of possible mistakes.
- Output all the mistakes that appear in at least 2 of the 3 runs.
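The C1/C2 idea of the last few slides can be rendered as a toy Python sketch (our names and numbers, not the authors' implementation): each cell keeps the sum of the values and the sum of index·value, so a cell holding a single mismatch reveals its position as the ratio of the two differences.

```python
import random

def locate_sketch(vals, k, seed=0):
    """Hash positions into k*k cells; keep C1 = sum(value) and C2 = sum(i*value)."""
    rng = random.Random(seed)
    cell_of = [rng.randrange(k * k) for _ in range(len(vals))]
    c1 = [0] * (k * k)
    c2 = [0] * (k * k)
    for i, v in enumerate(vals):
        c1[cell_of[i]] += v
        c2[cell_of[i]] += i * v
    return c1, c2

def find_errors(sketch_a, sketch_b):
    """A cell holding a single mismatch reveals its position as C2-diff / C1-diff."""
    (c1a, c2a), (c1b, c2b) = sketch_a, sketch_b
    positions = []
    for j in range(len(c1a)):
        d1, d2 = c1a[j] - c1b[j], c2a[j] - c2b[j]
        if d1 != 0 and d2 % d1 == 0:     # colliding errors may yield junk here
            positions.append(d2 // d1)
    return positions

a = [5, 8, 3, 2, 7, 1]
b = [5, 8, 3, 9, 7, 1]               # a single mismatch at position 3
print(find_errors(locate_sketch(a, 4), locate_sketch(b, 4)))  # [3]
```

When two errors share a cell the ratio is meaningless, which is exactly why the slides repeat the scheme and take a majority vote.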
15. O(k log²k) - Solution
- The idea is a two-stage hash.
- First hash into k/log k buckets; w.h.p. each bucket gets O(log k) errors.
[Bar-Yossef, Jayram, Kumar, Sivakumar 03]
16. O(k log²k) - Solution
- Keep an accumulated XOR.
- The probability of failure is less than 1/2.
- Run it 2 log k times and take the max; the failure probability drops below 1/k².
- O(log k) per run, O(log²k) in total; space O(log³k).
[Bar-Yossef, Jayram, Kumar, Sivakumar 03]
17. O(k log²k) - Solution
- k/log k buckets, each with a sketch of size O(log³k); total O(k log²k).
- P(failure) ≤ (k/log k) · (1/k²).
[Bar-Yossef, Jayram, Kumar, Sivakumar 03]
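The first-stage claim, that hashing k errors into k/log k buckets leaves only O(log k) errors per bucket w.h.p., can be sanity-checked with a toy simulation (the function and parameters are ours):

```python
import math
import random

def max_bucket_load(k, seed=0):
    """Throw k error positions into k/log2(k) buckets; report the fullest bucket."""
    rng = random.Random(seed)
    n_buckets = max(1, k // int(math.log2(k)))
    loads = [0] * n_buckets
    for _ in range(k):
        loads[rng.randrange(n_buckets)] += 1
    return max(loads)

k = 1024  # expected load per bucket is log2(k) = 10
print(max_bucket_load(k))
```

The maximum load stays within a small factor of the expected log k, which is what lets each bucket be handled by a small sketch.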
18. O(k log²k / log log k) - Idea (recursion)
- Hash into k/log k buckets.
- The failure probability drops to Pr(F)^(log k / log log k).
- log k / log log k runs; take the max.
19. Error Correcting O(k log NS)
[Figure: Alice holds T_A, Bob holds T_B, each of length n.]
- Powers r⁰, r¹, r², …
- A prime p ∈ Θ(N³S).
- Constant probability.
20. Error Correcting O(k log NS)
[Figure: Alice holds T_A, Bob holds T_B, each of length n.]
- If we are wrong, w.h.p. j ≤ n.
21. Error Correcting O(k log NS)
[Figure: Alice holds T_A, Bob holds T_B, each of length n.]
- Send the pairs (rʲ, aⱼ − bⱼ).
22. Error Correcting O(k log NS)
[Figure: Alice holds T_A, Bob holds T_B, each of length n.]
- O(k ln k)
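In the spirit of slides 19-22, a polynomial fingerprint compares Σ aᵢ rⁱ mod p for a random r. The prime below and the helper names are our stand-ins (a real instance would use the much larger p ∈ Θ(N³S)):

```python
import random

P = 1_000_003  # a prime; stands in for the much larger p = Θ(N³S)

def fingerprint(a, r, p=P):
    """Evaluate sum(a_i * r^i) mod p by Horner's rule."""
    acc = 0
    for v in reversed(a):
        acc = (acc * r + v) % p
    return acc

ta = [5, 8, 3, 2, 7, 1]
tb = [5, 6, 3, 2, 7, 4]
r = random.Random(0).randrange(1, P)
# Equal strings always agree; different strings collide for at most
# deg < n choices of r, i.e. with probability under n/p.
print(fingerprint(ta, r) == fingerprint(ta, r),
      fingerprint(ta, r) == fingerprint(tb, r))
```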
23. Recursion
[Figure: Alice and Bob hold T_A and T_B of length n; the problem is reduced to strings of length ck.]
24. Recursion
[Figure: as before; the reduced instance of size ck is handled by the O(k log NS) sketch.]
25. Complexity
[Figure: T_A and T_B of length n are compressed into sketches S_A and S_B.]
- Sketch size: O(k log NS)
- Computing a sketch: O(n log k)
- Comparing sketches: O(k log k)
26. O(k log k) - Solution
- We can just encode in unary, hash the input into k³ cells, and then run the O(k log NS) algorithm on the result; with only k³ cells the log NS factor becomes O(log k), giving O(k log k).
27. Reed-Solomon Codes
We managed to develop a deterministic algorithm based on Reed-Solomon codes, but its encoding and decoding are slower.
[Amir, Farach 95] [Feigenbaum, Ishai, Malkin, Nissim, Strauss, Wright 01] [Bar-Yossef, Jayram, Kumar, Sivakumar 03] [Efremenko, Porat, Rothschild 06] [Efremenko, Porat 07]
28. File Sharing
Napster
[Figure: a single source serving a file of n blocks to clients.]
- The source has to stay until someone has the whole file (and is willing to stay).
- There is a bottleneck at the end.
29. File Sharing
emule/kazaa/torrent
[Figure: the source distributing blocks among peers.]
- The source has to send n ln n blocks before disconnecting.
- Sometimes there are bottlenecks.
30. Improved File Sharing - Ver 1
[Figure: the file a₀a₁a₂…aₙ₋₁ held by the source; points out of n⁶ are sent.]
31. Improved File Sharing - Ver 1
- n⁶ possible points.
- Each client that got n points can recreate the file.
- No more n ln n.
- Almost no bottlenecks.
32. Improved File Sharing - Ver 2
[Figure: the file a₀a₁a₂…aₙ₋₁ held by the source.]
- Send linear equations on the file.
33. Improved File Sharing - Ver 2
[Figure: the file a₀a₁a₂…aₙ₋₁ held by the source.]
- Problems:
  1. Encoding is heavy: to build each packet we need to go over the whole file.
  2. Decoding is very heavy: O(n²) block operations, O(n³) field operations.
- Facts:
  1. If you get n(1/2 − ε) random combinations of two blocks, w.h.p. you won't have dependent ones.
  2. If you have d pair combinations, you can easily reduce your system to n − d variables.
- Solution: use sparse functionals.
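The "linear equations on the file" idea can be sketched as random GF(2) combinations of blocks, decoded by Gaussian elimination. This toy is our own construction (and dense, rather than the sparse functionals the slide proposes); it also shows where the O(n³) field-operation decoding cost comes from.

```python
import random

def random_packet(blocks, rng):
    """One packet: a random 0/1 coefficient vector and the XOR of chosen blocks."""
    coeffs = [rng.randrange(2) for _ in blocks]
    payload = 0
    for c, b in zip(coeffs, blocks):
        if c:
            payload ^= b
    return coeffs, payload

def decode(packets, n):
    """Gauss-Jordan elimination over GF(2); None until n independent equations."""
    rows = [[list(c), p] for c, p in packets]
    pivots = []
    for col in range(n):
        pivot = next((r for r in rows if r[0][col]), None)
        if pivot is None:
            return None                  # rank too low so far
        rows.remove(pivot)
        for r in rows + pivots:          # clear this column everywhere else
            if r[0][col]:
                r[0] = [x ^ y for x, y in zip(r[0], pivot[0])]
                r[1] ^= pivot[1]
        pivots.append(pivot)
    out = [0] * n
    for r in pivots:                     # each pivot row has a single 1 left
        out[r[0].index(1)] = r[1]
    return out

blocks = [3, 14, 7, 9]
rng = random.Random(2)
packets, result = [], None
while result is None:                    # keep receiving packets until decodable
    packets.append(random_packet(blocks, rng))
    result = decode(packets, len(blocks))
print(result)  # [3, 14, 7, 9]
```

Each extra random packet is independent of the span so far with probability at least 1/2, so only slightly more than n packets are needed on average.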
34. Improved File Sharing - Ver 2
[Figure: the file a₀a₁a₂…aₙ₋₁ held by the source.]
- Features:
  - Backward compatibility.
  - Even if you don't have the whole file, you can mix functionals.