Title: The Link Database: Fast Access to Graphs of the Web
1. The Link Database: Fast Access to Graphs of the Web
- Written by K. Randall, R. Stata, R. Wickremesinghe, J. L. Wiener
- Presented by Xiaoguang Qi
- CSE 397/497 WWW Search Engine
- Nov 19, 2004
2. Overview
- Introduction
- Background and Link 1
- Link 2
- Single list compression and starts array compression
- Link 3
- Interlist compression
- Measurements
- Conclusion
3. Introduction
- What is the Link Database?
- Part of the Connectivity Server
- Provides fast access to the hyperlink data
- What is the Connectivity Server?
- A special-purpose database
- Models the web as a graph
- URLs -> nodes; hyperlinks -> directed edges
4. Applications of the Link Database
- PageRank
- rank the web pages based on their inlinks
- HITS
- refine web query results by examining subgraphs of the web
- Mirror detection
- use links and URLs to find sites that mirror each other's contents
- Studies of web structure
5. Goal of the Link Database
- Reduce the amount of space required
- Provide fast decompression speed
- Allow random access
- Effective on small amounts of data
- Adjacency list (68 bytes on average)
6. Existing Work in the Area
- Many compression techniques exist
- However, nearly all algorithms are optimized for sequential access
- Random access needed
- Most algorithms are optimized for large data sets
- Adjacency lists are small
- Most focus on reducing the disk-space and I/O requirements of disk-based databases
- Reduction of memory requirements needed
7. Existing Work in the Area (Cont.)
- Google must use compression techniques to store the Web graph
- Unpublished
- AltaVista
- Uses Link2, described in this paper
8. Background and Link 1
- Links files
- The URL database
- The link database
9. Links Files
- A sequence of records
- Each record consists of a source URL followed by a sequence of destination URLs
http://www.foo.com/
http://www.foo.com/css/foostyle.css
http://www.foo.com/images/logo.gif
http://www.foo.com/images/navigation.gif
http://www.foo.com/about/
http://www.foo.com/products/
http://www.foo.com/jobs/
10. The URL Database
- Three kinds of representations for URLs
- Text: the original URL
- Fingerprint: a 64-bit hash of the URL text
- URL-id: a sequentially assigned 32-bit integer
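The three representations can be tied together in a small sketch. The slides do not say which hash produces the fingerprint, so `fingerprint64` below swaps in BLAKE2b truncated to 64 bits from Python's `hashlib` purely as a stand-in; the `UrlDatabase` class and its method names are also hypothetical.

```python
import hashlib


def fingerprint64(url):
    """Return a 64-bit integer hash of the URL text.

    BLAKE2b is an assumption made for illustration, not the hash named in the slides.
    """
    digest = hashlib.blake2b(url.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class UrlDatabase:
    """Hypothetical container linking text, fingerprint, and URL-id."""

    def __init__(self, urls):
        # URL-ids are assigned sequentially starting at 1, in the order given.
        self.id_to_url = {i: url for i, url in enumerate(urls, start=1)}
        self.fp_to_id = {fingerprint64(url): i for i, url in self.id_to_url.items()}

    def url_id(self, url):
        """Text -> fingerprint -> URL-id."""
        return self.fp_to_id[fingerprint64(url)]

    def url_text(self, url_id):
        """URL-id -> text."""
        return self.id_to_url[url_id]
```

The point of the sketch is only that lookups go through the fixed-width fingerprint, while the compact URL-id is what the link database itself stores.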
11. URL-ids
- Sequentially assigned from 1 to N
- Divide the URLs into three partitions based on their degree
- Indegree or outdegree > 254: high degree
- 24 to 254: medium degree
- Both < 24: low degree
- Assign URL-ids by partition
- Inside each partition, by lexicographic order
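The assignment rule above can be sketched as a single sort. The thresholds come from the slide; the order in which the partitions are numbered (high, then medium, then low) and the function names are assumptions made for illustration.

```python
def partition_of(indegree, outdegree):
    """Classify a URL by the larger of its two degrees."""
    top = max(indegree, outdegree)
    if top > 254:
        return 0  # high degree
    if top >= 24:
        return 1  # medium degree
    return 2      # low degree: both degrees below 24


def assign_url_ids(degrees):
    """degrees maps URL text -> (indegree, outdegree).

    Returns URL text -> URL-id, numbered from 1 by partition and then
    lexicographically within each partition. The partition order is assumed.
    """
    ordered = sorted(degrees, key=lambda url: (partition_of(*degrees[url]), url))
    return {url: url_id for url_id, url in enumerate(ordered, start=1)}
```

Because URLs on the same host share a prefix, the lexicographic step is what places them at nearby URL-ids within a partition.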
12. The Link Database
- Maps from each URL-id to the sets of URL-ids that are its outlinks and its inlinks
- A and A^T (the adjacency matrix and its transpose)
Recall the way we store a sparse matrix?
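One way to picture the sparse-matrix analogy is a compressed-row layout: one flat array of link targets plus a starts array that says where each URL-id's list begins. This is a minimal sketch under that reading; it uses 0-based URL-ids for easy indexing (the slides number from 1), and the function names are hypothetical.

```python
def build_link_db(adjacency, num_urls):
    """adjacency maps URL-id -> sorted list of outlink URL-ids (0-based here).

    Returns (starts, links): links holds all adjacency lists back to back,
    and starts[i]:starts[i + 1] is the slice belonging to URL-id i.
    """
    starts = [0]
    links = []
    for url_id in range(num_urls):
        links.extend(adjacency.get(url_id, []))
        starts.append(len(links))
    return starts, links


def neighbors(starts, links, url_id):
    """Random access to one adjacency list."""
    return links[starts[url_id]:starts[url_id + 1]]


def transpose(adjacency, num_urls):
    """Build the inlink lists (A^T) from the outlink lists (A)."""
    inlinks = {url_id: [] for url_id in range(num_urls)}
    for source in range(num_urls):
        for target in adjacency.get(source, []):
            inlinks[target].append(source)
    return inlinks
```

Storing both A and A^T in this form gives direct lookup of outlinks and inlinks for any URL-id.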
13. Link 2: Single List Compression and Starts Array Compression
- Two facts motivate the compression algorithm
- The majority of links are local: 80% of links point to URLs on the same host
- URL-ids on the same host tend to be close to one another
14. Single List Compression
- Delta values: the differences between neighbors of the list
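A minimal sketch of delta coding for one sorted adjacency list follows. It assumes the first delta is taken relative to the source URL-id, which the slide does not state; the function names and the numbers in the usage note are illustrative only.

```python
def to_deltas(source_id, outlinks):
    """Replace each URL-id by its difference from the previous entry.

    Assumption: the first entry is measured against the source URL-id,
    so it can be negative; later deltas of a sorted list are positive.
    """
    deltas = []
    previous = source_id
    for target in outlinks:
        deltas.append(target - previous)
        previous = target
    return deltas


def from_deltas(source_id, deltas):
    """Invert to_deltas by accumulating the differences."""
    outlinks = []
    current = source_id
    for delta in deltas:
        current += delta
        outlinks.append(current)
    return outlinks
```

For a source with URL-id 104 and outlinks 101, 132, 174, `to_deltas` yields -3, 31, 42. Because most links are local, the deltas are small numbers, which is what makes a variable-length code pay off.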
15. Variable-length Nybble Code
- A sequence of 4-bit strings
- The first 3 bits: an unsigned number
- The last bit: a stop bit
- E.g. (28)_10 = (11100)_2
- Nybble encoding: 0111 1000
- What about negative values?
- Not as good as Huffman coding in terms of compression
- However, provides faster decompression
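The example on this slide can be reproduced with a short sketch. The flag convention is inferred from that one example (a flag of 1 means another nybble follows, 0 marks the last nybble), so treat it as an assumption; the sketch handles unsigned values only and leaves the negative-value question open, as the slide does.

```python
def nybble_encode(value):
    """Encode a non-negative integer as a list of 4-bit values.

    Each nybble carries 3 data bits (most significant group first) followed
    by one flag bit: 1 if more nybbles follow, 0 on the last nybble.
    The flag polarity is inferred from the slide's example for 28.
    """
    if value < 0:
        raise ValueError("this sketch covers unsigned values only")
    groups = []
    while True:
        groups.append(value & 0b111)
        value >>= 3
        if value == 0:
            break
    groups.reverse()
    last = len(groups) - 1
    return [(group << 1) | (1 if i < last else 0) for i, group in enumerate(groups)]


def nybble_decode(nybbles):
    """Decode one value, stopping at the first nybble whose flag bit is 0."""
    value = 0
    for nybble in nybbles:
        value = (value << 3) | (nybble >> 1)
        if nybble & 1 == 0:
            break
    return value


def as_bits(nybbles):
    """Format nybbles the way the slide writes them."""
    return " ".join(format(nybble, "04b") for nybble in nybbles)
```

Here `as_bits(nybble_encode(28))` gives `0111 1000`, matching the slide. Decoding needs only shifts and masks per nybble, which is consistent with the slide's point about decompression speed.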
16. Starts Array Compression
- For each of the three URL partitions mentioned, we use a different number of bits to encode the starts entries
- Why?
- Most of the URLs (74% in this data set) are in the low-degree partition
- i.e. the majority of entries in the starts array are in the low-degree partition, where indices take a little over 8 bits each
17. Link 3: Interlist Compression
- Pages with close URL-ids have many links in common
18. Select and Union Compression
- Select compression
- Choose a previous adjacency list as the representative list
- Other adjacency lists are represented by the differences between themselves and the representative list
- Additions
- Deletions
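Select compression can be sketched as a set difference against the best of a few preceding lists. The cost measure (number of additions plus deletions), the `window` parameter, and all function names are assumptions for illustration; the slides do not specify how the representative is scored.

```python
def diff_encode(representative, adjacency_list):
    """Describe adjacency_list as deletions from and additions to representative."""
    rep_set, adj_set = set(representative), set(adjacency_list)
    deletions = sorted(rep_set - adj_set)
    additions = sorted(adj_set - rep_set)
    return deletions, additions


def diff_decode(representative, deletions, additions):
    """Rebuild the adjacency list from the representative and the differences."""
    dropped = set(deletions)
    kept = [url_id for url_id in representative if url_id not in dropped]
    return sorted(kept + list(additions))


def select_representative(previous_lists, adjacency_list, window):
    """Pick the cheapest representative among the last `window` lists.

    Returns (offset, deletions, additions), where offset counts back from the
    current list, or (None, [], adjacency_list) when no candidate beats
    storing the list on its own.
    """
    best = (None, [], list(adjacency_list))
    best_cost = len(adjacency_list)
    for offset in range(1, min(window, len(previous_lists)) + 1):
        candidate = previous_lists[-offset]
        deletions, additions = diff_encode(candidate, adjacency_list)
        cost = len(deletions) + len(additions)
        if cost < best_cost:
            best, best_cost = (offset, deletions, additions), cost
    return best
```

When neighboring pages share most of their links, the additions and deletions are far shorter than the list itself, which is where the extra compression comes from.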
19. Select and Union Compression (Cont.)
- Union compression
- Similar to select compression
- Except the representative list is the union of a
set of adjacency lists
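A sketch of the union variant: the representative is the union of a block of lists, so every member is a subset of it and can be described by deletions alone. Encoding those deletions as a 0/1 membership mask is an assumption made here for illustration, as are the function names.

```python
def union_encode(block):
    """block is a list of adjacency lists that share one representative.

    Returns (representative, masks): the sorted union of all lists, and for
    each list a 0/1 mask marking which representative entries it contains.
    """
    representative = sorted(set().union(*block))
    masks = []
    for adjacency_list in block:
        members = set(adjacency_list)
        masks.append([1 if url_id in members else 0 for url_id in representative])
    return representative, masks


def union_decode(representative, mask):
    """Recover one adjacency list from the representative and its mask."""
    return [url_id for url_id, bit in zip(representative, mask) if bit]
```

Unlike select, every list in the block refers to the same representative, so decoding one list never requires decoding another adjacency list first.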
20. Variations for Select and Union Compression
- Problem
- Select may create long chains of encoding, which can increase decompression time
- Solutions
- Consider variations of select and union
21. Variations for Select and Union Compression (Cont.)
22. LimitSelect-K-L
- Choose LimitSelect-K-L because
- It avoids the chain problem
- It achieves better compression with short chains
than Block-Select-K-B and Union
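The chain limit can be sketched by tracking, for each list, how many representatives must be decoded before it. These slides do not define K and L, so the sketch assumes K is the number of preceding lists considered as candidates and L is the maximum chain length; the cost measure and names are likewise hypothetical.

```python
def limit_select(lists, k, l):
    """Choose a representative for each adjacency list under a chain limit.

    Assumptions: candidates are the k lists immediately before the current
    one, and a list whose chain already has length l cannot be chosen, so no
    chain ever exceeds l hops.

    Returns (refs, depth): refs[i] is the index of the chosen representative
    (or None), and depth[i] is the number of hops needed to decode list i.
    """
    refs = []
    depth = []
    for i, adjacency_list in enumerate(lists):
        current = set(adjacency_list)
        best_ref, best_cost = None, len(adjacency_list)
        for j in range(max(0, i - k), i):
            if depth[j] >= l:
                continue  # choosing j would make the chain too long
            cost = len(current ^ set(lists[j]))  # additions plus deletions
            if cost < best_cost:
                best_ref, best_cost = j, cost
        refs.append(best_ref)
        depth.append(0 if best_ref is None else depth[best_ref] + 1)
    return refs, depth
```

With four nested lists, `k=3`, and `l=2`, the last list falls back to an earlier, slightly worse representative rather than extending the chain to three hops: compression is traded for bounded decompression time.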
23. LimitSelect-K-L (Cont.)
24. Measurements
- Each step from Link1 to Link2 to Link3
approximately doubles the number of pages we can
handle on our 16 GB machine, but each step also
costs us in access time
25. Conclusion
- Reduce the amount of space required from 32 bits to less than 6 bits per link: better than a ¼ compression ratio
- Effective on the small amounts of data in an adjacency list
- Allow random access to individual lists
- Provide fast decompression speed
26. My Opinion
- Good points
- Good balance between the compression ratio and decompression speed
- Provides algorithms optimized for the link database
27. Questions?