1
Querying Internet Graphs with Recursive Queries
  • Boon Thau Loo
  • Joseph Hellerstein, Ion Stoica, Ryan Huebsch,
    Timothy Roscoe, Scott Shenker
  • many others in the PIER group

STREAM meeting (07 May 2004)
2
Outline
  • Introduction
  • Applications
  • Queries on Distributed Data Streams
  • Query Processing and Optimization: Early Insights

3
Recursive Queries: A Review
  • Hot research topic in the 80s and 90s.
  • CiteSeer: recursive queries (244), transitive
    closure (2745), deductive databases (1276)
  • A recursive query allows a query result to be
    defined in terms of itself.
  • Useful for querying graph structures.
  • Transitive closure example: computes the set of
    all nodes reachable from a node
  • R1: reachable(X,Y) :- link(X,Y)
  • R2: reachable(X,Y) :- link(X,Z), reachable(Z,Y)
  • Query: ?- reachable(a,N)

[Figure: path X → Z → Y, with edge L(X,Z) joined to R(Z,Y)]
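
Rules R1 and R2 above can be evaluated bottom-up until no new facts appear. The following is a minimal single-machine sketch in Python, not PIER code; the function name `reachable` and the sample link set are assumptions made for illustration.

```python
def reachable(links):
    """Evaluate R1 and R2 bottom-up over a set of (src, dst) link tuples."""
    # R1: reachable(X,Y) :- link(X,Y)
    result = set(links)
    delta = set(links)
    while delta:
        # R2: reachable(X,Y) :- link(X,Z), reachable(Z,Y)
        # Join the links only against facts derived in the previous round.
        derived = {(x, y) for (x, z) in links for (w, y) in delta if w == z}
        delta = derived - result
        result |= delta
    return result


# Hypothetical link table: e -> a -> b -> c -> d
links = {("a", "b"), ("b", "c"), ("c", "d"), ("e", "a")}
closure = reachable(links)

# Query: ?- reachable(a, N)
from_a = sorted(y for (x, y) in closure if x == "a")
print(from_a)
```

The loop stops when a round derives nothing new, which is the fixpoint that defines the result of the recursive query.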
4
Why Recursive Queries?
  • The Internet is a graph
  • Physical links
  • Routing tables
  • Multicast trees on overlay networks
  • Hypertext structures
  • Peer-to-Peer networks
  • Recursive queries are a powerful tool for
    understanding and controlling structural
    properties of networks.
  • Generic substrate for developing more flexible
    and powerful routing protocols.

5
Why P2P?
  • Queried graph is too large and dynamic for
    centralized processing.
  • Many network applications are inherently
    decentralized. It is natural to query them in a
    decentralized fashion.

6
Recursive Network Queries
  • Embedded Network Queries
  • Each node embeds query processing functionality.
  • The query processor at each node has access to
    the local node's routing table.
  • External Network Queries
  • Query is executed on a separate Distributed Hash
    Table (DHT) based query infrastructure such as
    PIER.

7
Applications
Embedded Network Queries
External Network Queries
  • Gnutella Monitoring
  • Distributed Web Crawler

Network Monitoring
  • DHT Introspection
  • End-host Routing Infrastructure
  • DHT Routing
  • Directed Gnutella Search

Routing
8
App 1: Gnutella Network Monitoring
  • Motivated by our earlier work on a DHT-Gnutella
    bridge (IPTPS 2004)
  • Can we leverage a distributed set of nodes to
    take an accurate snapshot and perform long-term
    monitoring of a large, dynamic network like
    Gnutella?

9
Crawl as a Query
Distributed Gnutella Crawler
  R1: node(X,0) :- startset(X)
  R2: link(X,Y) :- node(X,Hop), gnutellaPing(X,Y)
  R3: node(Y,Hop) :- node(X,Hop-1), link(X,Y), Hop <= K
  Query: ?- node(N,D)
What the rules mean
  R1: Provide a set of seed IP addresses X of Gnutella nodes.
  R2: Ping the nodes at these addresses X, and get the set of
      neighbors Y.
  R3: Crawl the neighbors Y as long as they are within K hops.
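
The three crawl rules can be read as a hop-bounded breadth-first traversal. The sketch below is a centralized Python approximation under stated assumptions: `gnutella_ping` is a hypothetical callable standing in for a real Gnutella ping, the topology is a made-up dictionary, and only the smallest hop count per node is kept (the Datalog rules would derive every hop count at which a node is reached).

```python
def crawl(startset, gnutella_ping, k):
    """Return ({node: hop}, links) for nodes within k hops of the seeds."""
    node = {x: 0 for x in startset}              # R1: seeds are at hop 0
    links = set()
    frontier = list(startset)
    while frontier:
        next_frontier = []
        for x in frontier:
            for y in gnutella_ping(x):           # R2: ping x, learn neighbors y
                links.add((x, y))
                hop = node[x] + 1
                if hop <= k and y not in node:   # R3: stay within k hops
                    node[y] = hop
                    next_frontier.append(y)
        frontier = next_frontier
    return node, links


# Hypothetical topology used in place of live Gnutella pings.
neighbors = {"n1": ["n2", "n3"], "n2": ["n4"], "n3": ["n4"], "n4": ["n5"], "n5": []}
nodes, links = crawl(["n1"], lambda x: neighbors.get(x, []), k=2)
print(nodes)
```

With k=2 the node n5 is discovered as a neighbor of n4 (the link is recorded) but is not itself crawled.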
10
Other Queries
  • Crawl is only one of the many queries we can run.
  • Measurement-based queries
  • What is the diameter of the Gnutella network?
  • What is the robustness of the Gnutella network?
    (number of ways to route from one node to
    another)
  • Summarize search horizon (number of ultrapeers,
    number of files)
  • Routing-based queries
  • Direct search query towards high degree nodes

11
App 2: Network Introspection
  • A network needs to monitor its structural
    properties under network churn.
  • Important metrics under churn for a Distributed
    Hash Table (DHT)
  • Dynamic Resilience
  • How many possible live paths are there between
    any two nodes?
  • Average Path Length
  • Given routing algorithm, what is the average
    number of hops between any two nodes?

12
App 3: Routing Infrastructure
Adapted slide from Karthik's Sahara retreat
presentation
Internet Indirection Infrastructure (i3)
  • Applications demand greater flexibility in route
    selection.
  • Currently, clients query centralized servers to
    setup application-specific paths.
  • Recursive queries support decentralized
    application-specific routing.

13
App 4: Decentralized Focused Web Crawler (with
Owen Cooper, Sailesh Krishnamurthy)
  • Massively distributed web crawler
  • P2P users donate excess bandwidth and computation
    resources to crawl the web.
  • Completely decentralized, organized using
    Distributed Hash Tables (DHTs)
  • Crawl as a declarative recursive query
  • Focused Crawls with user customization
  • Seed URLs: Which pages to begin crawling from?
  • Ordering: Which pages to crawl first?
  • Coverage: Which pages to avoid crawling?
  • Distributed PageRank as a recursive query
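
The three customization points for a focused crawl (seed URLs, ordering, coverage) map naturally onto three parameters of a crawl loop. The Python sketch below is a single-process illustration, not the DHT-based crawler itself; `fetch_links`, `priority`, `should_skip`, and the sample link table are all hypothetical.

```python
import heapq


def focused_crawl(seeds, fetch_links, priority, should_skip, limit):
    """Crawl up to `limit` pages, lowest priority value first."""
    # Seed URLs: where the crawl begins. Coverage: skipped URLs never enter the queue.
    heap = [(priority(u), u) for u in seeds if not should_skip(u)]
    heapq.heapify(heap)
    seen = {u for _, u in heap}
    order = []
    while heap and len(order) < limit:
        _, url = heapq.heappop(heap)          # Ordering: best-ranked page next
        order.append(url)
        for out in fetch_links(url):
            if out not in seen and not should_skip(out):
                seen.add(out)
                heapq.heappush(heap, (priority(out), out))
    return order


# Hypothetical link table standing in for real page fetches.
web = {
    "a.example/": ["a.example/db", "b.example/ads", "a.example/p2p"],
    "a.example/db": ["a.example/db/pier"],
}
order = focused_crawl(
    seeds=["a.example/"],
    fetch_links=lambda u: web.get(u, []),
    priority=len,                        # assumption: shorter URLs first
    should_skip=lambda u: "ads" in u,    # assumption: avoid ad pages
    limit=10,
)
print(order)
```

In the decentralized setting the priority queue would be partitioned across nodes; this sketch only shows how the three user choices shape the crawl order.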

14
Dynamic Graphs: Distributed Data Streams
  • In practice, the graphs are distributed, dynamic
    and often based on soft-state
  • Queries are also long-running.
  • Applications often require real-time information
  • Find the current best route from node a to node b
  • Is the DHT handling network churn well at this
    moment?
  • Direct search query towards high-degree nodes.
  • Are there (new) lessons from streaming DB work
    that apply here?

15
Recursive View Maintenance
  • Current state of the network (query results) can
    be stored in the DHT as materialized views.
  • Long-term statistics
  • Routing tables
  • Cache to be shared across queries.
  • View Maintenance
  • As nodes enter and leave the system, the
    materialized view needs to be incrementally
    updated.
  • Facts or Tuples derived from expired base tuples
    need to be invalidated and removed from the
    network.
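
To make the invalidation requirement concrete, the Python sketch below uses the simplest possible baseline, full recomputation followed by a diff, rather than an incremental maintenance algorithm. It only shows which derived `reachable` facts must be removed once a base `link` tuple expires; the function names and sample data are assumptions.

```python
def closure(links):
    """Materialize reachable(X,Y) from a set of link(X,Y) tuples."""
    result, delta = set(links), set(links)
    while delta:
        derived = {(x, y) for (x, z) in links for (w, y) in delta if w == z}
        delta = derived - result
        result |= delta
    return result


def expire(links, view, expired):
    """Drop expired links, rebuild the view, and report invalidated facts."""
    remaining = links - expired
    new_view = closure(remaining)
    invalidated = view - new_view
    return remaining, new_view, invalidated


links = {("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")}
view = closure(links)

# The base tuple link(b,c) expires, for example because node b left.
links, view, invalidated = expire(links, view, {("b", "c")})
print(sorted(invalidated))
```

Here reachable(a,d) survives because it still has a derivation through link(a,c), while reachable(b,c) and reachable(b,d) lose their only support. An incremental scheme would aim to reach the same result without rebuilding the whole view.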

16
Work Sharing
  • Sharing within a query
  • May come for free during query execution.
  • Sharing across queries is critical for
    scalability
  • Results of previous queries need to be cached in
    the DHT to be reused in subsequent queries.

[Figure: Best-path(b,c,path) between nodes b and c]
17
Recursive Query Processing with PIER
  • PIER: A relational query processor over
    Distributed Hash Tables (DHTs)

[Figure: 1. Datalog → 2. Query Rewrite and Optimization → 3. PIER Query]
18
Early Insights
  • Our experiments have focused on commonly used
    queries (reachable, shortest-path) over static
    graphs.
  • Several query execution techniques
  • Semi-naïve (hop-at-a-time) vs Smart
    (Squaring)
  • Left vs right recursion
  • reachable(X,Y) :- link(X,Z), reachable(Z,Y)
  • reachable(X,Y) :- reachable(X,Z), link(Z,Y)
  • Magic sets: Avoid sending data/query to
    irrelevant nodes in the network.
  • Natural communication patterns
  • The reachable query exhibits a communication
    pattern like that of a distance-vector routing
    protocol.
  • Work-sharing or memoization is observed.
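
The difference between semi-naïve (hop-at-a-time) and smart (squaring) evaluation shows up in the number of rounds needed to reach the fixpoint. The Python sketch below is a single-machine illustration with hypothetical function names; it counts rounds only and ignores the communication cost that matters in a distributed setting.

```python
def semi_naive(links):
    """Extend paths by one link per round, joining only new facts."""
    result, delta, rounds = set(links), set(links), 0
    while delta:
        rounds += 1
        derived = {(x, y) for (x, z) in links for (w, y) in delta if w == z}
        delta = derived - result
        result |= delta
    return result, rounds


def squaring(links):
    """Join the current result with itself, doubling path length per round."""
    result, rounds = set(links), 0
    while True:
        rounds += 1
        derived = {(x, y) for (x, z) in result for (w, y) in result if w == z}
        if derived <= result:
            return result, rounds
        result |= derived


# Hypothetical chain of 9 nodes: n0 -> n1 -> ... -> n8
chain = {(f"n{i}", f"n{i + 1}") for i in range(8)}
hop_result, hop_rounds = semi_naive(chain)
sq_result, sq_rounds = squaring(chain)
print(hop_rounds, sq_rounds)
```

Both strategies compute the same closure. On this chain squaring needs fewer rounds, but each round joins the full result with itself, which is one source of the tradeoffs among latency, work-sharing, and communication overhead.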

19
Early Insights
  • Choice of query execution techniques offers
    tradeoffs in latency, work-sharing and
    communication overhead.
  • Factors affecting query execution
  • Graph topology (size, diameter and density)
  • Queries (degree of overlap)
  • Dynamic graphs with changing topology mean that
    one execution strategy may not work for the
    entire query duration.

20
References
  • http://pier.cs.berkeley.edu
  • Reading List
  • http://www.cs.berkeley.edu/boonloo/research/pier/readinglist.html
  • Overview
  • Boon Thau Loo, Ryan Huebsch, Joseph M.
    Hellerstein, Timothy Roscoe, and Ion Stoica.
    Analyzing P2P Overlays and Recursive Queries.
    Intel Research, IRB-TR-03-045, Nov. 17, 2003.
  • Application
  • Boon Thau Loo, Owen Cooper, and Sailesh
    Krishnamurthy. Distributed Web Crawling over
    DHTs. UC Berkeley Technical Report
    UCB//CSD-4-1305, Feb. 2004.
  • Query Execution and Optimization
  • Distributed Network Topology Analysis with
    Recursive Queries. Class project report.