Pairwise Document Similarity in Large Collections with MapReduce

Tamer Elsayed, Jimmy Lin, and Douglas W. Oard. University of Maryland, College Park. Human Language Technology Center of ... Okapi BM25. Subsets of collection ...

Learn more at: http://www.cs.umd.edu
1
Pairwise Document Similarity in Large Collections with MapReduce
  • Tamer Elsayed, Jimmy Lin, and Douglas W. Oard
  • University of Maryland, College Park
  • Human Language Technology Center of Excellence and UMIACS CLIP Lab

2
Abstract Problem
  • Applications
  • Clustering
  • Coreference resolution
  • more-like-that queries

3
Trivial Solution
  • Load each vector $O(N)$ times
  • Load each term $O(\mathrm{df}_t^2)$ times
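
As a rough illustration of the trivial solution, the sketch below scores every document pair with a sparse dot product, so each vector is touched once per partner. The function name `naive_pairwise` and the toy weights are assumptions made for this example, not taken from the slides.

```python
from itertools import combinations

def naive_pairwise(vectors):
    """Score every document pair by a sparse dot product (brute force)."""
    scores = {}
    for (i, vi), (j, vj) in combinations(sorted(vectors.items()), 2):
        # Iterate over the shorter vector and probe the longer one.
        small, large = (vi, vj) if len(vi) <= len(vj) else (vj, vi)
        score = sum(w * large[t] for t, w in small.items() if t in large)
        if score:
            scores[(i, j)] = score
    return scores

# Hypothetical toy weights (raw term counts), for illustration only.
toy_vectors = {
    "d1": {"a": 2, "b": 1},
    "d2": {"a": 1, "c": 1},
    "d3": {"c": 3},
}
toy_scores = naive_pairwise(toy_vectors)
```

With N documents the loop visits N(N-1)/2 pairs, including pairs that share no term at all.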

Goal
Scalable and efficient solution for large collections
4
Better Solution
Each term contributes only if it appears in both documents of a pair
  • Load weights for each term once
  • Each term contributes $O(\mathrm{df}_t^2)$ partial scores
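
One way to write the idea down, assuming similarity is an inner product of term weights (the symbols $w_{t,d}$ for the weight of term $t$ in document $d$ and $\mathrm{df}_t$ for its document frequency are notation chosen here):

```latex
\[
\mathrm{sim}(d_i, d_j) = \sum_{t \in d_i \cap d_j} w_{t,d_i}\, w_{t,d_j}
\]
\[
\#\{\text{partial scores from term } t\}
  = \binom{\mathrm{df}_t}{2}
  = \frac{\mathrm{df}_t(\mathrm{df}_t - 1)}{2}
  = O(\mathrm{df}_t^2)
\]
```

Only terms shared by both documents appear in the sum, so a term's postings can be loaded once and expanded into all pairs of documents that contain it.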

5
MapReduce Framework
[Figure: MapReduce data flow. (a) Map: each map task reads input (k1, v1) and emits (k2, v2). (b) Shuffle: values are grouped by key. (c) Reduce: each reduce task receives the values for one key k2 and emits output (k3, v3).]
The framework handles low-level details transparently
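
The three phases can be mimicked in memory with a few lines of Python. This is only a sketch of the programming model, not Hadoop's API; `run_mapreduce`, `count_mapper`, and `count_reducer` are hypothetical names, and word counting is used purely as a familiar example.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle, and reduce in memory on (key, value) records."""
    # (a) Map: each input (k1, v1) yields zero or more (k2, v2) pairs.
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(mapper(k1, v1))
    # (b) Shuffle: group all values by their intermediate key.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # (c) Reduce: each (k2, [v2, ...]) group yields output (k3, v3) pairs.
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output

def count_mapper(doc_id, text):
    for word in text.split():
        yield word, 1

def count_reducer(word, counts):
    yield word, sum(counts)

word_counts = run_mapreduce(
    [("d1", "map shuffle reduce"), ("d2", "map reduce")],
    count_mapper,
    count_reducer,
)
```

A real framework distributes the map and reduce calls across machines and performs the shuffle over the network; the user still writes only the two functions.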
6
Decomposition
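Each term contributes only if it appears in both documents of a pair
  • map: multiply the weights of a term into partial scores
  • reduce: sum the partial scores of each document pair
  • Load weights for each term once
  • Each term contributes $O(\mathrm{df}_t^2)$ partial scores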

7
Standard Indexing
[Figure: standard indexing as a MapReduce job. (a) Map: tokenize each doc. (b) Shuffle: group values by terms. (c) Reduce: combine the values for each term into a posting list.]
8
Indexing (3-doc toy collection)
[Figure: indexing a 3-document toy collection. Documents: "Clinton Obama Clinton", "Clinton Cheney", "Clinton Barack Obama". Indexing produces one posting list per term (Clinton, Cheney, Barack, Obama), each holding per-document term frequencies.]
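
A minimal in-memory sketch of this indexing step on the toy collection follows. The document texts come from the slide; the ids `d1` to `d3`, the whitespace tokenizer, and the function name `build_postings` are assumptions for illustration.

```python
from collections import Counter, defaultdict

def build_postings(docs):
    """Tokenize each doc, then combine (doc_id, tf) entries per term."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        # Tokenize step: count term frequencies within one document.
        for term, tf in Counter(docs[doc_id].split()).items():
            # Combine step: append this doc's entry to the term's posting list.
            postings[term].append((doc_id, tf))
    return dict(postings)

# Document texts are from the slide; the ids d1..d3 are assumed labels.
toy_docs = {
    "d1": "Clinton Obama Clinton",
    "d2": "Clinton Cheney",
    "d3": "Clinton Barack Obama",
}
toy_postings = build_postings(toy_docs)
```

Here the postings carry raw term frequencies; a weighting scheme such as Okapi BM25 would replace them with term weights.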
9
Pairwise Similarity
[Figure: pairwise similarity on the toy collection, starting from the posting lists for Clinton, Cheney, Barack, and Obama. (a) Generate pairs. (b) Group pairs. (c) Sum pairs.]
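
The three steps can be sketched in memory as below, assuming raw term frequencies as weights and the document labels `d1` to `d3` (both are assumptions; the slide text does not preserve them). `pairwise_from_postings` is a hypothetical helper name.

```python
from collections import defaultdict
from itertools import combinations

def pairwise_from_postings(postings):
    """Generate pairs per term, group by pair, and sum partial scores."""
    grouped = defaultdict(list)
    for term, plist in postings.items():
        # (a) Generate pairs: one partial score per pair of postings.
        for (di, wi), (dj, wj) in combinations(sorted(plist), 2):
            # (b) Group pairs: collect partial scores under the pair key.
            grouped[(di, dj)].append(wi * wj)
    # (c) Sum pairs: add up the partial scores of each document pair.
    return {pair: sum(parts) for pair, parts in grouped.items()}

# Postings for the slide's toy collection; ids and tf weights are assumed.
toy_postings = {
    "Clinton": [("d1", 2), ("d2", 1), ("d3", 1)],
    "Cheney": [("d2", 1)],
    "Barack": [("d3", 1)],
    "Obama": [("d1", 1), ("d3", 1)],
}
toy_similarity = pairwise_from_postings(toy_postings)
```

Terms that occur in a single document (Cheney, Barack) generate no pairs, while Clinton, which occurs in all three, generates three.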
10
Pairwise Similarity (abstract)
[Figure: pairwise similarity as a MapReduce job. (a) Generate pairs: multiply weights within each term's postings. (b) Group pairs: shuffle groups values by pairs. (c) Sum pairs: sum the partial scores into a similarity per pair.]
11
Experimental Setup
  • Hadoop 0.16.0
  • Open source MapReduce implementation
  • Cluster of 19 machines
  • Each with two single-core processors
  • Aquaint-2 collection
  • 906K documents
  • Okapi BM25
  • Subsets of collection

12
Efficiency (disk space)
Aquaint-2 Collection, 906k docs
8 trillion intermediate pairs
Hadoop, 19 PCs, each with 2 single-core processors, 4 GB memory, 100 GB disk
13
Terms Zipfian Distribution
Each term t contributes $O(\mathrm{df}_t^2)$ partial results
Very few terms dominate the computations
  • Most frequent term ("said") → 3%
  • Most frequent 10 terms → 15%
  • Most frequent 100 terms → 57%
  • Most frequent 1000 terms → 95%
[Figure: doc freq (df) vs. term rank, marking 0.1% of total terms (99.9% df-cut).]
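
To see why a df-cut helps, the sketch below counts intermediate pairs before and after dropping the most frequent terms. The document frequencies are made-up numbers and the helper names are hypothetical; the cut is expressed as the fraction of terms kept, so 0.999 would correspond to a 99.9% df-cut.

```python
def pairs_generated(dfs):
    """Total intermediate pairs: df * (df - 1) / 2 per term."""
    return sum(df * (df - 1) // 2 for df in dfs.values())

def apply_df_cut(dfs, cut):
    """Keep the lowest-df fraction `cut` of terms; drop the most frequent."""
    ranked = sorted(dfs, key=lambda t: (dfs[t], t))
    keep = int(len(ranked) * cut)
    return {t: dfs[t] for t in ranked[:keep]}

# Hypothetical document frequencies, for illustration only.
toy_dfs = {"said": 100, "year": 40, "new": 30, "rare1": 3, "rare2": 2}
before = pairs_generated(toy_dfs)
after = pairs_generated(apply_df_cut(toy_dfs, 0.8))
```

In this toy setting, dropping the single most frequent term removes about 80% of the pairs, which mirrors the skew described above.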
14
Efficiency (disk space)
Aquaint-2 Collection, 906k docs
8 trillion intermediate pairs
0.5 trillion intermediate pairs
Hadoop, 19 PCs, each with 2 single-core processors, 4 GB memory, 100 GB disk
15
Effectiveness (recent work)
  • Drop 0.1% of terms
  • Near-linear growth
  • Fit on disk
  • Cost 2% in effectiveness
Hadoop, 19 PCs, each with 2 single-core processors, 4 GB memory, 100 GB disk
16
Ivory
  • Open source implementation
  • Java 1.5, Hadoop 0.16.0
  • Available soon

17
Conclusion
  • Simple and efficient MapReduce solution
  • Many HLT problems can also be hadoopified
  • E.g., Statistical MT (see paper in StatMT
    workshop)
  • Shuffling is critical
  • df-cut controls efficiency vs. effectiveness
    tradeoff
  • 99.9% df-cut achieves 98% relative accuracy

18
Future work
  • Apply to larger collections!
  • Develop analytical model
  • Measure effectiveness for different applications

19
Thank You!
20
Algorithm
  • Matrix must fit in memory
  • Works for small collections
  • Otherwise disk access optimization