Title: Querying Internet Graphs with Recursive Queries
- Boon Thau Loo
- Joseph Hellerstein, Ion Stoica, Ryan Huebsch, Timothy Roscoe, Scott Shenker
- Many others in the PIER group
STREAM meeting (07 May 2004)
2. Outline
- Introduction
- Applications
- Queries on Distributed Data Streams
- Query Processing and Optimization: Early Insights
3. Recursive Queries: A Review
- Hot research topic in the 80s and 90s.
- Citeseer: recursive queries (244), transitive closure (2745), deductive databases (1276)
- A recursive query allows a query result to be defined in terms of itself.
- Useful for querying graph structures.
- Transitive closure example: computes the set of all nodes reachable from a node
  - R1: reachable(X,Y) :- link(X,Y)
  - R2: reachable(X,Y) :- link(X,Z), reachable(Z,Y)
  - Query: ?- reachable(a,N)
[Figure: rule R2 as a join of link L(X,Z), from X to Z, with reachable R(Z,Y), from Z to Y]
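The transitive closure rules R1 and R2 can be sketched in a few lines of Python. This is only an illustration on a single machine, not PIER's distributed execution; the function name `reachable`, the `links` set, and the sample graph are assumptions made for the example.

```python
def reachable(links, source):
    """Return all nodes reachable from `source` over directed `links`."""
    # Index link(X, Z) facts by X so each join step is a dictionary lookup.
    out_edges = {}
    for x, z in links:
        out_edges.setdefault(x, set()).add(z)

    # R1: every direct neighbor of the source is reachable.
    found = set(out_edges.get(source, ()))
    frontier = set(found)

    # Recursive step: follow one more link from the newly found nodes only,
    # and stop when no new facts are derived (a fixpoint).
    while frontier:
        new = set()
        for z in frontier:
            new |= out_edges.get(z, set())
        frontier = new - found
        found |= frontier
    return found


# Hypothetical sample graph: a -> b -> c -> a, plus d -> a.
links = {("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")}
print(sorted(reachable(links, "a")))  # answers ?- reachable(a, N)
```

Because of the cycle a, b, c, node a is reachable from itself, while d is not reachable from a.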
4. Why Recursive Queries?
- The Internet is a graph
  - Physical links
  - Routing tables
  - Multicast trees on overlay networks
  - Hypertext structures
  - Peer-to-Peer networks
- Recursive queries are a powerful tool for understanding and controlling structural properties of networks.
- Generic substrate for developing more flexible and powerful routing protocols.
5. Why P2P?
- Queried graph is too large and dynamic for centralized processing.
- Many network applications are inherently decentralized, so it is natural to query them in a decentralized fashion.
6. Recursive Network Queries
- Embedded Network Queries
  - Each node embeds query processing functionality.
  - Query processor at each node has access to the local node's routing table.
- External Network Queries
  - Query is executed on a separate Distributed Hash Table (DHT) based query infrastructure such as PIER.
7. Applications
[Figure: applications arranged by query type (Embedded Network Queries, External Network Queries) and by purpose]
Network Monitoring
- Gnutella Monitoring
- Distributed Web Crawler
Routing
- End-host Routing Infrastructure
- DHT Routing
8. App 1: Gnutella Network Monitoring
- Motivated by our earlier work on a DHT-Gnutella bridge (IPTPS 2004)
- Can we leverage a distributed set of nodes to take an accurate snapshot and perform long-term monitoring of a large, dynamic network like Gnutella?
9. Crawl as a Query
Distributed Gnutella Crawler
- R1: node(X,0) :- startset(X)
- R2: link(X,Y) :- node(X,Hop), gnutellaPing(X,Y)
- R3: node(Y,Hop) :- node(X,Hop-1), link(X,Y), Hop <= K
- Query: ?- node(N,D)
What the rules mean
- R1: Provide a set of seed IP addresses X of Gnutella nodes.
- R2: Ping the nodes at these addresses X, and get the set of neighbors Y.
- R3: Crawl the neighbors Y as long as they are within K hops.
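The three crawl rules can be read as a hop-bounded breadth-first traversal. The sketch below is a single-process illustration, not the distributed crawler itself. The `ping` callback stands in for `gnutellaPing`, and the `topology` dictionary is a made-up network used in place of real Gnutella pings. As a simplification, each node keeps only the smallest hop count at which it was found.

```python
def crawl(startset, ping, max_hops):
    """Return (nodes, links): nodes maps address -> hop count, links is a set of (X, Y)."""
    nodes = {x: 0 for x in startset}      # R1: seed nodes are at hop 0
    links = set()
    frontier = list(nodes)
    while frontier:
        next_frontier = []
        for x in frontier:
            hop = nodes[x]
            for y in ping(x):             # R2: ping X and learn its neighbors Y
                links.add((x, y))
                # R3: add Y as a node only if it is new and within max_hops.
                if y not in nodes and hop + 1 <= max_hops:
                    nodes[y] = hop + 1
                    next_frontier.append(y)
        frontier = next_frontier
    return nodes, links


# Hypothetical topology used in place of real Gnutella pings.
topology = {"s": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["d"], "d": []}
nodes, links = crawl(["s"], lambda x: topology.get(x, []), max_hops=2)
print(nodes)
```

With K = 2, node d is seen as a neighbor of c (so the link is recorded) but is not crawled, because it lies three hops from the seed.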
10. Other Queries
- Crawl is only one of the many queries we can run.
- Measurement-based queries
  - What is the diameter of the Gnutella network?
  - What is the robustness of the Gnutella network? (number of ways to route from one node to another)
  - Summarize search horizon (number of ultrapeers, number of files)
- Routing-based queries
  - Direct search query towards high-degree nodes
11. App 2: Network Introspection
- A network needs to monitor its structural properties under network churn.
- Important metrics under churn for a Distributed Hash Table (DHT)
  - Dynamic Resilience
    - How many possible live paths are there between any two nodes?
  - Average Path Length
    - Given a routing algorithm, what is the average number of hops between any two nodes?
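The average path length metric can be illustrated with a small centralized computation. The sketch below assumes shortest-path routing as a stand-in for the DHT's actual routing algorithm, which the slides do not specify. The names `hop_counts` and `average_path_length` and the four-node ring are assumptions made for the example.

```python
from collections import deque


def hop_counts(graph, source):
    """Breadth-first search: hop count from `source` to every node it can reach."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def average_path_length(graph):
    """Mean hop count over all ordered pairs (source, target) that are connected."""
    total = pairs = 0
    for source in graph:
        for target, d in hop_counts(graph, source).items():
            if target != source:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0


# Hypothetical overlay: a directed ring of four nodes.
ring = {0: [1], 1: [2], 2: [3], 3: [0]}
print(average_path_length(ring))
```

On this ring every node reaches the other three in 1, 2 and 3 hops, so the average is 2.0.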
12. App 3: Routing Infrastructure
(Adapted slide from Karthik's Sahara retreat presentation)
Internet Indirection Infrastructure (i3)
- Applications demand greater flexibility in route selection.
- Currently, clients query centralized servers to set up application-specific paths.
- Recursive queries support decentralized application-specific routing.
13. App 4: Decentralized Focused Web Crawler (with Owen Cooper, Sailesh Krishnamurthy)
- Massively distributed web crawler
  - P2P users donate excess bandwidth and computation resources to crawl the web.
  - Completely decentralized, organized using Distributed Hash Tables (DHTs)
- Crawl as a declarative recursive query
- Focused crawls with user customization
  - Seed URLs: Which pages to begin crawling from?
  - Ordering: Which pages to crawl first?
  - Coverage: Which pages to avoid crawling?
- Distributed PageRank as a recursive query
14. Dynamic Graphs: Distributed Data Streams
- In practice, the graphs are distributed, dynamic and often based on soft state.
- Queries are also long-running.
- Applications often require real-time information
  - Find the current best route from node a to b
  - Is the DHT handling network churn well at this moment?
  - Direct search query towards high-degree nodes.
- Are there (new) lessons from streaming DB work that apply here?
15. Recursive View Maintenance
- Current state of the network (query results) can be stored in the DHT as materialized views.
  - Long-term statistics
  - Routing tables
  - Cache to be shared across queries.
- View Maintenance
  - As nodes enter and leave the system, the materialized view needs to be incrementally updated.
  - Facts or tuples derived from expired base tuples need to be invalidated and removed from the network.
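One way to picture incremental maintenance of a recursive view is a delete-and-rederive scheme over the reachable view. This is a centralized sketch under that assumption, not a description of how PIER maintains views in the DHT. The class `ReachableView` and its methods are hypothetical names.

```python
class ReachableView:
    """Materialized reachable(X, Y) view, maintained as links arrive and expire."""

    def __init__(self):
        self.links = set()   # base tuples: link(X, Y)
        self.view = set()    # derived tuples: reachable(X, Y)

    def _from(self, x):
        """Recompute the nodes reachable from x using the current links."""
        seen, stack = set(), [x]
        while stack:
            u = stack.pop()
            for (a, b) in self.links:
                if a == u and b not in seen:
                    seen.add(b)
                    stack.append(b)
        return seen

    def insert_link(self, u, v):
        """Incremental insert: connect everything that reaches u to everything v reaches."""
        self.links.add((u, v))
        sources = {u} | {x for (x, y) in self.view if y == u}
        targets = {v} | {y for (x, y) in self.view if x == v}
        self.view |= {(x, y) for x in sources for y in targets}

    def expire_link(self, u, v):
        """Invalidate facts that may depend on the expired link, then re-derive them."""
        self.links.discard((u, v))
        # Over-delete: only sources that could reach u may have used link(u, v).
        affected = {u} | {x for (x, y) in self.view if y == u}
        self.view = {(x, y) for (x, y) in self.view if x not in affected}
        # Re-derive facts for the affected sources from the surviving links.
        for x in affected:
            self.view |= {(x, y) for y in self._from(x)}


view = ReachableView()
for edge in [("a", "b"), ("b", "c"), ("c", "d")]:
    view.insert_link(*edge)
view.expire_link("b", "c")
print(sorted(view.view))
```

After link(b,c) expires, only reachable(a,b) and reachable(c,d) remain. Facts for source c are left alone because c could never reach b.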
16. Work Sharing
- Sharing within a query
  - May come for free during query execution.
- Sharing across queries is critical for scalability
  - Results of previous queries need to be cached in the DHT to be reused in subsequent queries.
[Figure: Best-path(b,c,path) between nodes b and c]
17. Recursive Query Processing with PIER
- PIER: A relational query processor over Distributed Hash Tables (DHTs)
[Figure: 1. Datalog, 2. Query Rewrite and Optimization, 3. PIER Query]
18. Early Insights
- Our experiments have focused on commonly used queries (reachable, shortest-path) over static graphs.
- Several query execution techniques
  - Semi-naïve (hop-at-a-time) vs Smart (squaring)
  - Left vs right recursion
    - reachable(X,Y) :- link(X,Z), reachable(Z,Y)
    - reachable(X,Y) :- reachable(X,Z), link(Z,Y)
  - Magic sets: avoid sending data/query to irrelevant nodes in the network.
- Natural communication patterns
  - Reachable query exhibits a communication pattern like that of a distance-vector routing protocol.
  - Work sharing or memoization is observed.
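The difference between semi-naïve (hop-at-a-time) and smart (squaring) evaluation can be seen in a small single-machine sketch. It only counts evaluation rounds and says nothing about the latency or communication costs measured in the PIER experiments. The function names and the chain graph are assumptions made for the example.

```python
def semi_naive(links):
    """Extend only newly derived paths by one link per round. Returns (closure, rounds)."""
    closure = set(links)
    delta = set(links)
    rounds = 0
    while delta:
        rounds += 1
        # Join the new facts from the previous round with the base links.
        derived = {(x, y) for (x, z) in delta for (z2, y) in links if z == z2}
        delta = derived - closure
        closure |= delta
    return closure, rounds


def squaring(links):
    """Join the closure with itself each round, doubling the path length covered."""
    closure = set(links)
    rounds = 0
    while True:
        rounds += 1
        derived = {(x, y) for (x, z) in closure for (z2, y) in closure if z == z2}
        if derived <= closure:   # fixpoint: nothing new was derived
            return closure, rounds
        closure |= derived


# Hypothetical input: a chain 0 -> 1 -> ... -> 8 (eight links).
chain = {(i, i + 1) for i in range(8)}
hop_closure, hop_rounds = semi_naive(chain)
sq_closure, sq_rounds = squaring(chain)
print(hop_rounds, sq_rounds)
```

Both strategies derive the same closure. Squaring needs fewer rounds on a long chain, but each round joins the whole closure with itself rather than only the new facts, which is one source of the tradeoffs between the techniques.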
19. Early Insights
- Choice of query execution techniques offers tradeoffs in latency, work sharing and communication overhead.
- Factors affecting query execution
  - Graph topology (size, diameter and density)
  - Queries (degree of overlap)
- Dynamic graphs with changing topology mean that one execution strategy may not work for the entire query duration.
20. References
- http://pier.cs.berkeley.edu
- Reading List
  - http://www.cs.berkeley.edu/boonloo/research/pier/readinglist.html
- Overview
  - Boon Thau Loo, Ryan Huebsch, Joseph M. Hellerstein, Timothy Roscoe, and Ion Stoica. Analyzing P2P Overlays and Recursive Queries. Intel Research, IRB-TR-03-045, Nov. 17, 2003.
- Application
  - Boon Thau Loo, Owen Cooper, Sailesh Krishnamurthy. Distributed Web Crawling over DHTs. UC Berkeley Technical Report UCB//CSD-4-1305, Feb 2004.
- Query Execution and Optimization
  - Distributed Network Topology Analysis with Recursive Queries. Class Project Report.