The Link Database: Fast Access to Graphs of the Web


1
The Link Database: Fast Access to Graphs of the Web
  • Written by K. Randall, R. Stata, R. Wickremesinghe,
    J. L. Wiener
  • Presented by Xiaoguang Qi
  • CSE 397/497 WWW Search Engine
  • Nov 19, 2004

2
Overview
  • Introduction
  • Background and Link 1
  • Link 2: single list compression and starts array
    compression
  • Link 3: interlist compression
  • Measurements
  • Conclusion

3
Introduction
  • What is the Link Database?
  • Part of the Connectivity Server
  • Provides fast access to the hyperlink data
  • What is the Connectivity Server?
  • A special-purpose database
  • Models the web as a graph
  • URLs -> nodes; hyperlinks -> directed edges

4
Applications of the Link Database
Introduction
  • PageRank
  • rank the web pages based on their inlinks
  • HITS
  • refine web query results by examining subgraphs
    of the web
  • Mirror detection
  • use links and URLs to find sites that mirror each
    other's contents
  • Studies of web structure

5
Goal of the Link Database
Introduction
  • Reduce the amount of space required
  • Provide fast decompression speed
  • Allow random access
  • Effective on small amounts of data
  • An adjacency list is small (68 bytes on average)

6
Existing Work in the Area
Introduction
  • Many compression techniques exist
  • However, nearly all algorithms are optimized for
    sequential access
  • Random access needed
  • Most algorithms are optimized for large data sets
  • Adjacency lists are small
  • Focus on reducing disk-space and I/O requirements
    of disk-based databases
  • Reduction of memory requirements needed

7
Existing Work in the Area (Cont.)
Introduction
  • Google must use compression techniques to store
    the Web graph
  • Unpublished
  • AltaVista
  • Uses Link 2, described in this paper

8
Background and Link 1
  • Links files
  • The URL database
  • The link database

9
Links files
Link 1
  • A sequence of records
  • Each record consists of a source URL followed by
    a sequence of destination URLs

Source URL:
http://www.foo.com/
Destination URLs:
http://www.foo.com/css/foostyle.css
http://www.foo.com/images/logo.gif
http://www.foo.com/images/navigation.gif
http://www.foo.com/about/
http://www.foo.com/products/
http://www.foo.com/jobs/

10
The URL Database
Link 1
  • Three kinds of representations for URLs
  • Text: the original URL
  • Fingerprint: a 64-bit hash of the URL text
  • URL-id: a sequentially assigned 32-bit integer

11
URL-ids
Link 1
  • Sequentially assigned from 1 to N
  • Divide the URLs into three partitions based on
    their degree
  • Indegree or outdegree > 254: high degree
  • 24 to 254: medium degree
  • Both < 24: low degree
  • Assign URL-ids by partition
  • Inside each partition, by lexicographic order
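
A minimal sketch of the assignment described above. The function name, the dictionary inputs, and the choice to number the high-degree partition first are assumptions made for illustration; the slide only states the thresholds and the lexicographic order inside each partition.

```python
def assign_url_ids(urls, indegree, outdegree):
    """Return {url: id} with ids 1..N, grouped by degree partition.

    Assumption: partitions are numbered high, then medium, then low.
    """
    def partition(url):
        d_in = indegree.get(url, 0)
        d_out = outdegree.get(url, 0)
        if d_in > 254 or d_out > 254:
            return 0  # high degree
        if d_in >= 24 or d_out >= 24:
            return 1  # medium degree
        return 2      # low degree: both below 24

    # Sort by partition first, then lexicographically by URL text.
    ordered = sorted(urls, key=lambda u: (partition(u), u))
    return {url: i for i, url in enumerate(ordered, start=1)}
```

Because URLs on the same host share a prefix, the lexicographic order inside a partition tends to give them neighboring ids.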

12
The Link Database
Link 1
  • Maps from each URL-id to the sets of URL-ids that
    are its outlinks and its inlinks
  • A and A^T (the adjacency matrix and its transpose)

Recall the way we store a sparse matrix?
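
One way to picture the sparse-matrix analogy is a compressed-row layout: a starts array indexed by URL-id and one flat links array. The sketch below is illustrative only; the function names are hypothetical and the lists are left uncompressed.

```python
def build_link_db(num_urls, edges):
    """Build (starts, links) from (src_id, dst_id) pairs; ids run 1..num_urls."""
    lists = [[] for _ in range(num_urls + 1)]
    for src, dst in edges:
        lists[src].append(dst)
    starts = [0] * (num_urls + 2)  # starts[0] is unused
    links = []
    for uid in range(1, num_urls + 1):
        starts[uid] = len(links)
        links.extend(sorted(lists[uid]))
    starts[num_urls + 1] = len(links)  # sentinel marking the end of the last list
    return starts, links


def adjacency(starts, links, uid):
    """Random access to one adjacency list."""
    return links[starts[uid]:starts[uid + 1]]


def transpose(edges):
    """Reverse every edge; building from this gives the inlink database (A^T)."""
    return [(dst, src) for src, dst in edges]
```

Building once from the edges and once from `transpose(edges)` gives the outlink and inlink maps respectively.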

13
Link 2: Single List Compression and Starts Array Compression
  • Two facts motivate the compression algorithm
  • The majority of links are local: 80% of links
    point to URLs on the same host
  • URL-ids on the same host tend to be close to one
    another

14
Single List Compression
Link 2
  • Delta values: the differences between neighboring
    entries of the list
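
A small sketch of delta values on a sorted adjacency list. Taking the first delta relative to the source URL-id is an assumption here (the slide only mentions differences between neighbors), and the function names are hypothetical.

```python
def to_deltas(source_id, adjacency):
    """Replace each URL-id by its gap from the previous one.

    Assumption: the first gap is measured from the source URL-id,
    so it can be negative; later gaps are positive for a sorted list.
    """
    deltas = []
    prev = source_id
    for dst in sorted(adjacency):
        deltas.append(dst - prev)
        prev = dst
    return deltas


def from_deltas(source_id, deltas):
    """Invert to_deltas by accumulating the gaps."""
    out = []
    prev = source_id
    for d in deltas:
        prev += d
        out.append(prev)
    return out
```

Since most links are local and local URL-ids are close together, the gaps are usually much smaller than the ids themselves.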

15
Variable-length Nybble Code
Link 2
  • A sequence of 4-bit strings (nybbles)
  • The first 3 bits: an unsigned number
  • The last bit: a stop bit (1 if more nybbles follow,
    0 on the final nybble, as the example shows)
  • E.g. (28)_10 = (11100)_2
  • Nybble encoding: 0111 1000
  • What about negative values?
  • Not as good as Huffman coding in terms of
    compression
  • However, provides faster decompression
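
The code can be sketched as follows. The encoder reproduces the slide's example for 28; the `to_unsigned` mapping is only one possible answer to the question about negative values (sign kept in the lowest bit) and is an assumption, as are all function names.

```python
def nybble_encode(n):
    """Encode a non-negative integer as a list of 4-bit values."""
    if n < 0:
        raise ValueError("map negative values to unsigned first")
    groups = []
    while True:
        groups.append(n & 0b111)  # peel off 3 data bits at a time
        n >>= 3
        if n == 0:
            break
    groups.reverse()              # most significant group first
    last = len(groups) - 1
    # Low bit is 1 while more nybbles follow, 0 on the final nybble.
    return [(g << 1) | (1 if i < last else 0) for i, g in enumerate(groups)]


def nybble_decode(nybbles, pos=0):
    """Decode one value starting at pos; return (value, next position)."""
    value = 0
    while True:
        nyb = nybbles[pos]
        pos += 1
        value = (value << 3) | (nyb >> 1)
        if nyb & 1 == 0:
            return value, pos


def to_unsigned(d):
    """Assumed mapping for signed deltas: magnitude shifted left, sign in bit 0."""
    return (abs(d) << 1) | (1 if d < 0 else 0)


def from_unsigned(u):
    """Invert to_unsigned."""
    magnitude = u >> 1
    return -magnitude if u & 1 else magnitude
```

For example, `" ".join(format(x, "04b") for x in nybble_encode(28))` yields `0111 1000`, matching the slide. Decoding needs only shifts and masks, which is where the speed advantage over Huffman coding comes from.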

16
Starts Array Compression
Link 2
  • For each of the three URL partitions mentioned,
    we use a different number of bits to encode the
    starts entries
  • Why?
  • Most of the URLs (74% in this data set) are in
    the low-degree partition
  • I.e. the majority of entries in the starts array
    are in the low-degree partition, where indices
    take a little over 8 bits each
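
One common way to get close to 8 bits per index is to store a full-width base for every block of entries and a narrow offset for each entry. The sketch below illustrates that idea only; the block size of 16, the 8-bit offset width, and the function names are assumptions, not values taken from the slides.

```python
def compress_starts(starts, block=16, offset_bits=8):
    """Split a non-decreasing starts array into per-block bases and small offsets."""
    bases, offsets = [], []
    limit = 1 << offset_bits
    for i, s in enumerate(starts):
        if i % block == 0:
            bases.append(s)       # one full-width base per block
        off = s - bases[-1]
        if off >= limit:
            raise ValueError("offset too wide: use a smaller block or wider offsets")
        offsets.append(off)
    return bases, offsets


def lookup_start(bases, offsets, i, block=16):
    """Random access: base of the block plus the entry's offset."""
    return bases[i // block] + offsets[i]


def bits_per_entry(block=16, offset_bits=8, base_bits=32):
    """Amortized cost of one starts entry under this layout."""
    return offset_bits + base_bits / block
```

With these illustrative parameters the amortized cost is 10 bits per entry. Low-degree pages have short lists, so their offsets stay small, whereas high-degree pages would need wider entries.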

17
Link 3: Interlist Compression
  • Pages with close URL-ids have many links in common

18
Select and Union Compression
Link 3
  • Select compression
  • Choose a previous adjacency list as the
    representative list
  • Each other adjacency list is represented by the
    differences between it and the representative
    list
  • Additions
  • Deletions
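
The additions and deletions can be sketched with plain set differences. The function names are hypothetical, and the physical encoding of the two difference lists is left out.

```python
def encode_select(adjacency, representative):
    """Describe adjacency relative to a representative list."""
    adj, rep = set(adjacency), set(representative)
    additions = sorted(adj - rep)  # in this list but not in the representative
    deletions = sorted(rep - adj)  # in the representative but not in this list
    return additions, deletions


def decode_select(representative, additions, deletions):
    """Rebuild the adjacency list from the representative and the differences."""
    dropped = set(deletions)
    kept = [x for x in representative if x not in dropped]
    return sorted(kept + list(additions))
```

When two pages share most of their links, both difference lists are short, which is where the saving comes from.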

19
Select and Union Compression (Cont.)
Link 3
  • Union compression
  • Similar to select compression
  • Except the representative list is the union of a
    set of adjacency lists

20
Variations For Select and Union Compression
Link 3
  • Problem
  • Select may create long chains of encodings, which
    can increase decompression time
  • Solutions
  • Consider variations of it

21
Variations For Select and Union Compression (Cont.)
Link 3

22
LimitSelect-K-L
Link 3
  • Choose LimitSelect-K-L because
  • It avoids the long-chain problem
  • It achieves better compression with short chains
    than Block-Select-K-B and Union
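
The transcript does not define K and L. The sketch below assumes K is the number of preceding lists considered as representatives and L is the maximum chain length allowed; the cost measure (number of additions plus deletions) and all names are illustrative assumptions too.

```python
def choose_representatives(lists, k=8, max_chain=4):
    """Pick, for each list, a representative among the preceding k lists.

    Returns (refs, chain): refs[i] is the chosen index or None, and
    chain[i] is how many lists must be decoded before list i.
    """
    refs, chain = [], []
    for i, adjacency in enumerate(lists):
        cur = set(adjacency)
        best_ref, best_cost = None, len(cur)  # cost with no representative
        for j in range(max(0, i - k), i):
            if chain[j] + 1 > max_chain:
                continue                      # would make the chain too long
            cost = len(cur ^ set(lists[j]))   # additions plus deletions
            if cost < best_cost:
                best_ref, best_cost = j, cost
        refs.append(best_ref)
        chain.append(0 if best_ref is None else chain[best_ref] + 1)
    return refs, chain
```

Capping the chain bounds how many lists must be decompressed to recover any single list, at the price of sometimes passing over the closest match.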

23
LimitSelect-K-L (Cont.)
Link 3

24
Measurements
  • Each step from Link 1 to Link 2 to Link 3
    approximately doubles the number of pages we can
    handle on our 16 GB machine, but each step also
    costs us in access time

25
Conclusion
  • Reduce the amount of space required from 32 bits
    to fewer than 6 bits per link: better than a ¼
    compression ratio
  • Effective on the small amounts of data in an
    adjacency list
  • Allow random access to individual lists
  • Provide fast decompression speed

26
My Opinion
  • Good points
  • Good balance between the compression ratio and
    decompression speed
  • Provides algorithms optimized for the link database

27
Questions?