Title: Improved Sketching of Hamming Distance with Error Correcting
1. Improved Sketching of Hamming Distance with Error Correcting
Ely Porat, Bar-Ilan University / Google Inc
Ohad Lipsky, Bar-Ilan University / Check Point Inc
December 2003
2. Problem Definition (1)
[Figure: Alice holds a string T_A of length n; Bob holds a string T_B of length n. Goal: compute hamm(T_A, T_B).]
Given k, a bound on the number of mismatches.
3. Problem Definition (2)
[Figure: T_A and T_B, each of length n, are compressed into small sketches S_A and S_B.]
- Calculate hamm(T_A, T_B) given only S_A, S_B.
- Find the mistakes.
- Given k, a bound on the number of mismatches.
4. Motivations
- Databases
- Internet
- Error correcting
[Figure: Routers A, B, C, D exchanging data.]
5. Outline
- Simple solution
- Error correcting
- Improved solution
- Improving further
- Recursion
- File sharing
6. Simplest Solution - O(k² log 1/δ)
- Binary alphabet.
- Allocate k² cells.
- Hash each bit of the input array to one of the cells.
- In each cell keep the XOR of all the values hashed to it.
[Figure: bits 0, 1, 1, 0 being hashed into cells.]
7. Simplest Solution - O(k² log 1/δ)
[Figure: the bits 0, 1, 0, 0 and 1, 1, 0, 0 of the two strings hashed into the same cells.]
8. Simplest Solution - O(k² log 1/δ)
- By the birthday principle, the probability that 2 errors fall into the same cell is constant.
- Repeat log 1/δ times to push the failure probability down to δ.
[Figure: bits 0, 1, 1, 0 in cells.]
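The k²-cell XOR scheme above can be sketched in a few lines of Python. This is our own toy illustration (names like `make_sketch` are ours, not the talk's code); the shared seed stands in for the common random hash function Alice and Bob must agree on.

```python
import random

def make_sketch(text, k, seed=0):
    """Hash each position into one of k*k cells and XOR the bits per cell."""
    rng = random.Random(seed)
    # Shared random map: position -> cell (both parties must use the same one).
    cell_of = [rng.randrange(k * k) for _ in range(len(text))]
    cells = [0] * (k * k)
    for i, bit in enumerate(text):
        cells[cell_of[i]] ^= bit
    return cells

def estimate_hamming(sketch_a, sketch_b):
    # Every cell receiving an odd number of mismatches ends up differing.
    return sum(1 for a, b in zip(sketch_a, sketch_b) if a != b)

ta = [0, 1, 1, 0, 1, 0, 0, 1]
tb = [0, 1, 0, 0, 1, 0, 1, 1]   # 2 mismatches (positions 2 and 6)
print(estimate_hamming(make_sketch(ta, 2), make_sketch(tb, 2)))
```

Unless two mismatches collide in the same cell (the birthday event the slides bound), the estimate equals the true Hamming distance.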
9. Alphabet
- Denote by S the size of the alphabet.
- We can encode each letter by its unary representation:
  0 → 1000000…0, 1 → 0100000…0, …, S−1 → 0000000…1.
- The only effect is that each mismatch is counted twice:
  e.g. 0 → 1000000…0 and 5 → 0000010…0 differ in exactly 2 bits.
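A minimal Python check of the claim that unary (one-hot) encoding turns each symbol mismatch into exactly two bit mismatches (the helper names here are ours):

```python
def unary(c, S):
    """One-hot ('unary') encoding of symbol c over an alphabet of size S."""
    return [1 if j == c else 0 for j in range(S)]

def encode(text, S):
    return [bit for c in text for bit in unary(c, S)]

S = 8
a = encode([0, 5, 3], S)
b = encode([0, 2, 3], S)
bit_mismatches = sum(x != y for x, y in zip(a, b))
print(bit_mismatches)  # 2: the single symbol mismatch (5 vs 2) is counted twice
```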
10. Error correcting - O(k² log NS)
- Here we allocate two kinds of k² cells: k² cells of log S bits and k² cells of log NS bits.
- C1[h(i)] accumulates the values A_i hashed to it; e.g. 5, 8, 3, 2.
- C2[h(i)] accumulates i·A_i; e.g. 15, 6, 7, 8.
11. Error correcting - O(k² log NS)
- As before, with probability 1/2 no two errors fall into the same cell.
[Figure: C1[h(i)] = Σ A_i cells 5, 8, 3, 2; C2[h(i)] = Σ i·A_i cells 15, 6, 7, 8.]
12. Error correcting - O(k² log NS)
- From the red cells we get the value differences:
[Figure: Alice's C1 cells 5, 8, 3, 2 vs Bob's 5, 6, 3, 2; differences 8 − 6 and 5 − 3.]
13. Error correcting - O(k² log NS)
- From the blue cells we get i·(value difference) per cell, which locates the error:
[Figure: Alice's C2 cells 15, 11, 7, 5 vs Bob's 15, 9, 7, 5; from 11 − 9 together with the C1 difference 5 − 3, the error position i is recovered.]
14. Error correcting - O(k² log NS)
- The probability of success is about 1/2.
- To lower the failure probability, run it 3 times.
- Each run yields a list of possible mistakes.
- Output all the mistakes that appear in at least 2 of the 3 runs.
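The C1/C2 idea of the last few slides can be rendered as a toy Python sketch (our names and numbers, not the authors' implementation): each cell keeps the sum of the values and the sum of index·value, so a cell holding a single mismatch reveals its position as the ratio of the two differences.

```python
import random

def locate_sketch(vals, k, seed=0):
    """Hash positions into k*k cells; keep C1 = sum(value) and C2 = sum(i*value)."""
    rng = random.Random(seed)
    cell_of = [rng.randrange(k * k) for _ in range(len(vals))]
    c1 = [0] * (k * k)
    c2 = [0] * (k * k)
    for i, v in enumerate(vals):
        c1[cell_of[i]] += v
        c2[cell_of[i]] += i * v
    return c1, c2

def find_errors(sketch_a, sketch_b):
    """A cell holding a single mismatch reveals its position as C2-diff / C1-diff."""
    (c1a, c2a), (c1b, c2b) = sketch_a, sketch_b
    positions = []
    for j in range(len(c1a)):
        d1, d2 = c1a[j] - c1b[j], c2a[j] - c2b[j]
        if d1 != 0 and d2 % d1 == 0:     # colliding errors may yield junk here
            positions.append(d2 // d1)
    return positions

a = [5, 8, 3, 2, 7, 1]
b = [5, 8, 3, 9, 7, 1]               # a single mismatch at position 3
print(find_errors(locate_sketch(a, 4), locate_sketch(b, 4)))  # [3]
```

When two errors share a cell the ratio is meaningless, which is exactly why the slides repeat the scheme and take a majority vote.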
15. O(k log²k) - Solution
- The idea is a two-stage hash.
- First hash into k/log k buckets; w.h.p. each bucket gets O(log k) errors.
[Bar-Yossef, Jayram, Kumar, Sivakumar 03]
16. O(k log²k) - Solution
- Keep an accumulated XOR.
- The probability of failure is less than 1/2.
- Run it 2 log k times and take the max; the failure probability drops below 1/k².
- O(log k) per run, O(log²k) in total; space O(log³k).
[Bar-Yossef, Jayram, Kumar, Sivakumar 03]
17. O(k log²k) - Solution
- k/log k buckets, each with a sketch of size O(log³k); total O(k log²k).
- P(failure) ≤ (k/log k) · (1/k²).
[Bar-Yossef, Jayram, Kumar, Sivakumar 03]
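The first-stage claim, that hashing k errors into k/log k buckets leaves only O(log k) errors per bucket w.h.p., can be sanity-checked with a toy simulation (the function and parameters are ours):

```python
import math
import random

def max_bucket_load(k, seed=0):
    """Throw k error positions into k/log2(k) buckets; report the fullest bucket."""
    rng = random.Random(seed)
    n_buckets = max(1, k // int(math.log2(k)))
    loads = [0] * n_buckets
    for _ in range(k):
        loads[rng.randrange(n_buckets)] += 1
    return max(loads)

k = 1024  # expected load per bucket is log2(k) = 10
print(max_bucket_load(k))
```

The maximum load stays within a small factor of the expected log k, which is what lets each bucket be handled by a small sketch.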
18. O(k log²k / log log k) - Idea (recursion)
- Hash into k/log k buckets.
- The failure probability drops to Pr(F)^(log k / log log k).
- log k / log log k runs; take the max.
19. Error Correcting O(k log NS)
[Figure: Alice holds T_A, Bob holds T_B, each of length n.]
- Powers r⁰, r¹, r², …
- A prime p ∈ Θ(N³S).
- Constant probability.
20. Error Correcting O(k log NS)
[Figure: Alice holds T_A, Bob holds T_B, each of length n.]
- If we are wrong, w.h.p. j ≤ n.
21. Error Correcting O(k log NS)
[Figure: Alice holds T_A, Bob holds T_B, each of length n.]
- Send the pairs (rʲ, aⱼ − bⱼ).
22. Error Correcting O(k log NS)
[Figure: Alice holds T_A, Bob holds T_B, each of length n.]
- O(k ln k)
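In the spirit of slides 19-22, a polynomial fingerprint compares Σ aᵢ rⁱ mod p for a random r. The prime below and the helper names are our stand-ins (a real instance would use the much larger p ∈ Θ(N³S)):

```python
import random

P = 1_000_003  # a prime; stands in for the much larger p = Θ(N³S)

def fingerprint(a, r, p=P):
    """Evaluate sum(a_i * r^i) mod p by Horner's rule."""
    acc = 0
    for v in reversed(a):
        acc = (acc * r + v) % p
    return acc

ta = [5, 8, 3, 2, 7, 1]
tb = [5, 6, 3, 2, 7, 4]
r = random.Random(0).randrange(1, P)
# Equal strings always agree; different strings collide for at most
# deg < n choices of r, i.e. with probability under n/p.
print(fingerprint(ta, r) == fingerprint(ta, r),
      fingerprint(ta, r) == fingerprint(tb, r))
```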
23. Recursion
[Figure: Alice and Bob hold T_A and T_B of length n; the problem is reduced to strings of length ck.]
24. Recursion
[Figure: as before; the reduced instance of size ck is handled by the O(k log NS) sketch.]
25. Complexity
[Figure: T_A and T_B of length n are compressed into sketches S_A and S_B.]
- Sketch size: O(k log NS)
- Computing a sketch: O(n log k)
- Comparing sketches: O(k log k)
26. O(k log k) - Solution
- We can just encode in unary, hash the input into k³ cells, and then run the O(k log NS) algorithm on the result; with only k³ cells the log NS factor becomes O(log k), giving O(k log k).
27. Reed-Solomon Codes
We managed to develop a deterministic algorithm based on Reed-Solomon codes, but its encoding and decoding are slower.
[Amir, Farach 95] [Feigenbaum, Ishai, Malkin, Nissim, Strauss, Wright 01] [Bar-Yossef, Jayram, Kumar, Sivakumar 03] [Efremenko, Porat, Rothschild 06] [Efremenko, Porat 07]
28. File Sharing
Napster
[Figure: a single source serving a file of n blocks to clients.]
- The source has to stay until someone has the whole file (and is willing to stay).
- There is a bottleneck at the end.
29. File Sharing
emule/kazaa/torrent
[Figure: the source distributing blocks among peers.]
- The source has to send n ln n blocks before disconnecting.
- Sometimes there are bottlenecks.
30. Improved File Sharing - Ver 1
[Figure: the file a₀a₁a₂…aₙ₋₁ held by the source; points out of n⁶ are sent.]
31. Improved File Sharing - Ver 1
- n⁶ possible points.
- Each client that got n points can recreate the file.
- No more n ln n.
- Almost no bottlenecks.
32. Improved File Sharing - Ver 2
[Figure: the file a₀a₁a₂…aₙ₋₁ held by the source.]
- Send linear equations on the file.
33. Improved File Sharing - Ver 2
[Figure: the file a₀a₁a₂…aₙ₋₁ held by the source.]
- Problems:
  1. Encoding is heavy: to build each packet we need to go over the whole file.
  2. Decoding is very heavy: O(n²) block operations, O(n³) field operations.
- Facts:
  1. If you get n(1/2 − ε) random combinations of two blocks, w.h.p. you won't have dependent ones.
  2. If you have d pair combinations, you can easily reduce your system to n − d variables.
- Solution: use sparse functionals.
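The "linear equations on the file" idea can be sketched as random GF(2) combinations of blocks, decoded by Gaussian elimination. This toy is our own construction (and dense, rather than the sparse functionals the slide proposes); it also shows where the O(n³) field-operation decoding cost comes from.

```python
import random

def random_packet(blocks, rng):
    """One packet: a random 0/1 coefficient vector and the XOR of chosen blocks."""
    coeffs = [rng.randrange(2) for _ in blocks]
    payload = 0
    for c, b in zip(coeffs, blocks):
        if c:
            payload ^= b
    return coeffs, payload

def decode(packets, n):
    """Gauss-Jordan elimination over GF(2); None until n independent equations."""
    rows = [[list(c), p] for c, p in packets]
    pivots = []
    for col in range(n):
        pivot = next((r for r in rows if r[0][col]), None)
        if pivot is None:
            return None                  # rank too low so far
        rows.remove(pivot)
        for r in rows + pivots:          # clear this column everywhere else
            if r[0][col]:
                r[0] = [x ^ y for x, y in zip(r[0], pivot[0])]
                r[1] ^= pivot[1]
        pivots.append(pivot)
    out = [0] * n
    for r in pivots:                     # each pivot row has a single 1 left
        out[r[0].index(1)] = r[1]
    return out

blocks = [3, 14, 7, 9]
rng = random.Random(2)
packets, result = [], None
while result is None:                    # keep receiving packets until decodable
    packets.append(random_packet(blocks, rng))
    result = decode(packets, len(blocks))
print(result)  # [3, 14, 7, 9]
```

Each extra random packet is independent of the span so far with probability at least 1/2, so only slightly more than n packets are needed on average.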
34. Improved File Sharing - Ver 2
[Figure: the file a₀a₁a₂…aₙ₋₁ held by the source.]
- Features:
  - Backward compatibility.
  - Even if you don't have the whole file, you can mix functionals.