Federated Facts and Figures - PowerPoint PPT Presentation

Slides: 56
Provided by: JosephMHe5
Learn more at: https://dsf.berkeley.edu

1
Federated Facts and Figures
  • Joseph M. Hellerstein
  • UC Berkeley

2
Road Map
  • The Deep Web and the FFF
  • An Overview of Telegraph
  • Demo: Election 2000
  • From Tapping to Trawling
  • A Taste of Policy and Countermeasures
  • Delicious Snacks

3
Meet the Deep Web
  • Available in your browser, but not via hyperlinks
  • Accessed via forms (press the submit button)
  • Typically runs some code to generate data
  • E.g. call out to a database, or run some
    servlet
  • Pretty-print results in HTML
  • Dynamic HTML
  • Estimated to be 400x larger than the surface
    web
  • Not accessible in the search engines
  • Typically crawl hyperlinks only

4
Federated Facts and Figures
  • One part of the deep web: more full-text
    documents
  • E.g. archived newspaper articles, legal
    documents, etc.
  • Figure out how to fetch these, then add to search
    engine
  • Various people working on this (e.g.
    CompletePlanet)
  • Another part: Facts and Figures
  • I.e. structured database data
  • Fetch is only the first challenge
  • Want to combine (federate) these databases
  • Want to search by criteria other than keywords
  • Want to analyze the data en masse
  • I.e. want full query power, not just search
  • Search was always easy
  • Ranking not clearly appropriate here

5
Meet the FFF
6
Meet the FFF
7
Meet the FFF
8
(No Transcript)
9
Meet the FFF
10
Telegraph
http://telegraph.cs.berkeley.edu
  • An adaptive dataflow system
  • Dataflow
  • siphon data from the deep web and other data
    pools
  • harness data streaming from sensors and traces
  • flow these data streams through code
  • Adaptive
  • sensor nets & wide area networks: volatile!
  • like Telegraph Avenue
  • needs to be cool with volatile mix from all
    over the world
  • adaptive techniques route data to machines and
    code
  • marriage of queries, app-level route/filter,
    machine learning
  • First apps
  • Facts and Figures Federation: Election 2000
  • Continuous queries on sensor nets
  • Rich queries on Peer-to-Peer
  • Joe Hellerstein, Mike Franklin, & co.

11
Dataflow Commonalities
  • Dataflow at the heart of queries and networks
  • Query engines move records through operators
  • Networks move packets through routers
  • Networked data-intensive apps: an emerging middle
    ground
  • Database Systems
  • High-function, high integrity, carefully
    administered. Compile intelligent query plans
    based on data models and statistical properties,
    query semantics.
  • Networks
  • Low-function, high availability, federated
    administration. Adapt to performance
    variabilities, treat data and code as opaque for
    loose coupling.

12
Long-Running Dataflows on the FFF
  • Not precomputed like web indexes
  • Need online systems & apps for online performance
    goals
  • Subject of prior work in CONTROL project
  • Combo of query processing, sampling/estimation,
    HCI

[Figure: plot against Time with a 100 mark on the vertical axis, comparing Online and Traditional curves]
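
The contrast between online and traditional processing can be illustrated with a running estimate that is usable long before the full answer is ready. This is a minimal sketch, not code from the CONTROL project: the function name `online_average` is hypothetical, the data is synthetic, and the confidence interval is a normal approximation that assumes rows arrive in random order.

```python
import math
import random

def online_average(values, report_every=1000, z=1.96):
    """Yield (rows_seen, running_mean, half_width) as rows stream in.

    Uses Welford's update for the running variance; the interval is a
    normal-approximation confidence interval, valid only if rows arrive
    in random order (an assumption of this sketch).
    """
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n % report_every == 0:
            var = m2 / (n - 1) if n > 1 else 0.0
            yield n, mean, z * math.sqrt(var / n)

rng = random.Random(0)
data = [rng.uniform(0, 100) for _ in range(10_000)]  # synthetic rows
estimates = list(online_average(data))
for n, mean, half in estimates[:3]:
    print(f"after {n} rows: {mean:.2f} +/- {half:.2f}")
```

A traditional engine would report only the final row of `estimates`; the online version lets the user stop as soon as the interval is tight enough.
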
13
Telegraph Architecture
  • Telegraph executes Dataflow Graphs
  • Extensible set of operators
  • With extensible optimization rules
  • Data access operators
  • TeSS: the Telegraph Screen Scraper
  • Napster/Gnutella readers
  • File readers
  • Data processing operators
  • Selections (filters), Joins, Drill-Down/Roll-Up,
    Aggregation
  • Adaptivity Operators
  • Eddies, STeMs, FLuX, etc.

14
Screen Scraping: TeSS
  • Screen scrapers do two things
  • Fetch: emulate a web user clicking
  • Parse: extract info from resulting HTML/XML
  • Somebody has to train the screen scraper
  • Need a separate wrapper for each site
  • Some research work on making this process
    semi-automatic
  • TeSS is an open-source screen-scraper
  • Available at http://telegraph.cs.berkeley.edu/tess
  • Written by a (superstar) sophomore!
  • Simple scripting interface targeted today
  • Moving towards GUI for non-technical users (by
    example)
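
The "parse" half of a screen scraper can be sketched with Python's standard `html.parser` module. This is not TeSS or its scripting interface: the class name `TableScraper` and the sample page are invented for illustration, and the fetch step is skipped by parsing a hard-coded string.

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:                 # skip header-only rows
                self.rows.append(tuple(self._row))
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

# Hypothetical result page, standing in for what a form submission returns.
PAGE = """
<html><body><table>
<tr><th>Name</th><th>City</th></tr>
<tr><td>A. Smith</td><td>Berkeley</td></tr>
<tr><td>B. Smith</td><td>Oakland</td></tr>
</table></body></html>
"""

scraper = TableScraper()
scraper.feed(PAGE)
print(scraper.rows)
```

A wrapper like this is tied to one site's layout, which is why each site needs its own.
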

15
First Demo: Election 2000
16
From Tapping to Trawling
  • Telegraph allows users to pose rich queries over
    the deep web
  • But sometimes would like to be more aggressive
  • Preload a telegraph cache
  • Access a variety of data for offline mining
  • More (we'll see soon!)
  • Want something like a webcrawler for FFF
  • But FFF is too big.
  • Want to trawl for interesting stuff hidden
    there.

17
From Tapping to Trawling
18
From Tapping to Trawling
19
(No Transcript)
20
(No Transcript)
21
From Tapping to Trawling
[Figure: dataflow diagram. Labels: Name, Address, DupElim, Anywho Name, Yahoo Maps, Eddy, Infospace Name, Infospace Street, Smith]
22
API Challenges in Trawling
  • Load APIs on the web today: service and silence
  • Various policies at the servers, hard to learn
  • No analogy to robots.txt (which is too limiting
    anyhow)
  • Feedback can be delayed, painful
  • Solutions
  • Be very conservative
  • Make out-of-band (human) arrangements
  • Both seem inefficient
  • Finding new sites to trawl is hard
  • Have to wrap them: fetch is easyish, parse
    hardish
  • XML will help a little here
  • Query? Or Update? Again, an API problem!
  • Imagine we auto-trawled AnyWho and WeSpamYou.com
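
One way to read "be very conservative" when the only feedback is service or silence is exponential backoff. The sketch below is an assumption about what a polite trawler might do, not a description of Telegraph: `polite_fetch` and `flaky_server` are hypothetical names, and the server is simulated so the example runs instantly.

```python
def polite_fetch(fetch, request, max_tries=5, base_delay=1.0, sleep=None):
    """Retry `fetch(request)` with exponentially growing pauses.

    `fetch` returns a result, or None to model "silence" from the server.
    `sleep` is injectable so the example runs instantly; pass time.sleep
    for real pauses.
    """
    waited = []
    delay = base_delay
    for _ in range(max_tries):
        result = fetch(request)
        if result is not None:
            return result, waited
        waited.append(delay)
        if sleep is not None:
            sleep(delay)
        delay *= 2                      # back off: 1, 2, 4, ... seconds
    return None, waited

# Toy server that stays silent for the first three requests.
calls = {"n": 0}
def flaky_server(request):
    calls["n"] += 1
    return None if calls["n"] <= 3 else f"results for {request}"

result, waited = polite_fetch(flaky_server, "Smith")
print(result, waited)
```

The cost of this caution is visible in `waited`: the client spends most of its time idle, which is the inefficiency the slide complains about.
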

23
Trawling Domains
  • Can now collect lists of
  • Names (First, Last), Addresses, Companies,
    Cities, States, etc. etc.
  • Can keep lists organized by site and in toto
  • Allows for offline mining, etc.
  • Q: Do webgraph mining techniques apply to facts
    and figures?

24
Exploiting Enumerated Domains I
  • Can trawl any site on known domains!
  • Suddenly the deep web is not so hidden.
  • In essence, we expand our trawl
  • Can use pre-existing domains to trawl further
  • Or, can add new sites to the trawl process
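
Expanding a trawl from known domains can be sketched as a breadth-first search in which every value harvested from one site becomes a probe for another. Everything here is illustrative: the two dictionaries stand in for wrapped deep-web forms, and the names `FAKE_SITE`, `FAKE_CITY_SITE`, and `trawl` are hypothetical.

```python
from collections import deque

# Stand-in for a deep-web form: last name -> (name, city) records.
# A real trawl would call a site wrapper here instead.
FAKE_SITE = {
    "Smith": [("Smith", "Berkeley"), ("Smith", "Oakland")],
    "Jones": [("Jones", "Oakland")],
}
# Second stand-in form: city -> last names listed there.
FAKE_CITY_SITE = {
    "Berkeley": ["Smith", "Nguyen"],
    "Oakland": ["Jones", "Smith"],
}

def trawl(seed_names, max_probes=10):
    """Breadth-first expansion of two enumerated domains (names, cities)."""
    names, cities = set(seed_names), set()
    frontier = deque(("name", n) for n in seed_names)
    probes = 0
    while frontier and probes < max_probes:
        kind, value = frontier.popleft()
        probes += 1
        if kind == "name":
            for _, city in FAKE_SITE.get(value, []):
                if city not in cities:
                    cities.add(city)
                    frontier.append(("city", city))
        else:
            for name in FAKE_CITY_SITE.get(value, []):
                if name not in names:
                    names.add(name)
                    frontier.append(("name", name))
    return names, cities, probes

names, cities, probes = trawl(["Smith"])
print(sorted(names), sorted(cities), probes)
```

The `max_probes` budget is the only thing keeping the trawl bounded, which matters once the sites are real and much larger than these toys.
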

25
Exploiting Enumerated Domains II
  • Trawling gets a sample (signature) of a site's
    content
  • Analogous to a random walk, but needs to be
    characterized better
  • Can identify that 2 sites have related subsets of
    domains
  • Helps with the query composition problem
  • Rich query interfaces tend to be non-trivial
  • What sites to use? How to combine them?
  • Imagine:
  • Traditional search engine experience to pick some
    sites
  • System suggests how to join the sites in a
    meaningful way
  • As you build the query, you always see
    incremental results
  • Refine query as the data pours in
  • Berkeley CONTROL project has been incremental
    queries
  • Blends search, query, browse and mine

26
A Sampler of FFF Policy Issues
  • Statistical DB Security Issues
  • Facing the Power of the FFF
  • False combinations
  • Combination strength
  • What is trawling?
  • Copying? So what?
  • Akamai for the deep web?
  • Cracking?

27
Sampler of Countermeasures
  • Trawl detection
  • And Distributed Trawl Detection
  • Metadata Watermarking
  • Provenance, Lineage, Disclaimers
  • Stockpiling Spam

28
Delicious Snacks
  • "Concepts are delicious snacks with which we try
    to alleviate our amazement" -- A. J. Heschel,
    Man Is Not Alone

29
Technical Snacks
  • Adaptive Dataflow
  • Systems Learning
  • Incremental continuous querying
  • And online, bounded trawling
  • Adds an HCI component to the above
  • FFF APIs, standards
  • The wrapper-writing bottleneck: XML?
  • Backoff APIs?
  • Search vs. Update
  • Mining trawls

30
More Technical Snacks
  • Tie-ins with Security
  • Applications beyond FFF
  • Sensors
  • P2P
  • Overlay Networks

31
Policy Questions
  • Presenting & Interpreting Data
  • Not just search
  • Privacy: What is it, what's it for?
  • Leading Indicators from the FFF

32
More?
  • http://telegraph.cs.berkeley.edu
  • jmh_at_cs.berkeley.edu
  • Collaborators
  • Mike Franklin, Hal Varian -- UCB
  • Lisa Hellerstein & Torsten Suel -- Polytechnic
  • Sirish Chandrasekaran, Amol Deshpande, Sam
    Madden, Vijayshankar Raman, Fred Reiss, Mehul
    Shah -- UCB

33
Backup Slides
34
Telegraph: Adaptive Dataflow
  • Mixed design philosophy
  • Tolerate loose coupling and partial failure
  • Adapt online and provide best-effort results
  • Learn statistical properties online
  • Exploit knowledge of semantics via extensible
    optimization infrastructures
  • Target new networked, data-intensive applications

35
Adaptive Systems: General Flavor
  • Repeat
  • Observe (model) environment
  • Use observation to choose behavior
  • Take action
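
The observe, choose, act loop can be sketched in a few lines. This is a toy under stated assumptions, not Telegraph code: `adaptive_loop` is a hypothetical name, the "environment" is a scripted latency trace for two made-up sources A and B, and the model is a simple exponentially weighted average.

```python
def adaptive_loop(latency_trace, alpha=0.5):
    """Repeat: observe latencies, pick the source that looks fastest, act.

    latency_trace is a list of dicts {source: latency_seen_this_round};
    in this toy every source is observed every round, which a real
    system usually cannot assume.
    """
    estimate = {}
    choices = []
    for observed in latency_trace:
        # Observe (model) environment: exponentially weighted average.
        for source, latency in observed.items():
            old = estimate.get(source, latency)
            estimate[source] = alpha * latency + (1 - alpha) * old
        # Use observation to choose behavior.
        best = min(estimate, key=estimate.get)
        # Take action (here: just record which source we would read from).
        choices.append(best)
    return choices

# Source A starts fast, then degrades; source B is steady.
trace = [{"A": 10, "B": 40}] * 3 + [{"A": 90, "B": 40}] * 3
print(adaptive_loop(trace))
```

How quickly the loop switches away from A depends on `alpha`: a larger value forgets old observations faster.
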

36
Adaptive Dataflow in DBs: History
  • Rich But Unacknowledged History
  • Codd's data independence predicated on
    adaptivity!
  • adapt opaquely to changing schema and storage
  • Query optimization does it!
  • statistics-driven optimization
  • key differentiator between DBMSs and other systems

37
Adaptivity in Current DBs
  • Limited & coarse grain
  • Repeat
  • Observe (model) environment
  • runstats (once per week!!): model changes in data
  • Use observation to choose behavior
  • query optimization: fixes a single static query
    plan
  • Take action
  • query execution: blindly follows plan

38
What's So Hard Here?
  • Volatile regime
  • Data flows unpredictably from sources
  • Code performs unpredictably along flows
  • Continuous volatility due to many decentralized
    systems
  • Lots of choices
  • Choice of services
  • Choice of machines
  • Choice of info: sensor fusion, data reduction,
    etc.
  • Order of operation
  • Maintenance
  • Federated world
  • Partial failure is the common case
  • Adaptivity required!

39
Adaptive Query Processing Work
[Figure: adaptive query processing work arranged by Frequency of Adaptivity. Labels: System R, Late Binding, Per Query, Competition & Sampling, Inter-Operator, Query Scrambling, Ingres DECOMP, Eddies, Future Work]
  • Late Binding: Dynamic, Parametric
    [HP88, GW89, IN92, GC94, AC96, LP97]
  • Per Query: Mariposa [SA96], ASE [CR94]
  • Competition: RDB [AZ96]
  • Inter-Op: [KD98], Tukwila [IF99]
  • Query Scrambling: [AF96, UFA98]
  • Survey: Hellerstein, Franklin, et al., DE
    Bulletin 2000

40
A Networking Problem!?
  • Networks do dataflow!
  • Significant history of adaptive techniques
  • E.g. TCP congestion control
  • E.g. routing
  • But traditionally much lower function
  • Ship bitstreams
  • Minimal, fixed code
  • Lately, moving up the foodchain?
  • app-level routing
  • active networks
  • politics of growth
  • assumption of complexity = assumption of liability

41
Networking Code as Dataflow?
  • States & Events, Not Threads
  • Asynchronous events natural to networks
  • State machines in protocol specification and
    system code
  • Low-overhead, spreading to big systems
  • Totally different programming style
  • remaining area of hacker machismo
  • Eventflow optimization
  • Can't eventflow be adaptively optimized like
    dataflow?
  • Why didn't that happen years ago?
  • Hold this thought

42
Query Plans are Dataflow Too
  • Programming model: iterators
  • old idea, widely used in DB query processing
  • object with three methods
  • Init(), GetNext(), Close()
  • input/output types
  • query plan: graph of iterators
  • pipelining: iterators that return results before
    children Close()
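
The iterator model above can be sketched as follows. The class names `Scan` and `Select` and the driver `run` are illustrative choices, and the methods are spelled in Python style (`init`, `get_next`, `close`); `None` is assumed to mark end of stream.

```python
class Scan:
    """Leaf iterator over an in-memory list of rows."""
    def __init__(self, rows):
        self.rows = rows
    def init(self):
        self.pos = 0
    def get_next(self):
        if self.pos >= len(self.rows):
            return None          # end of stream
        row = self.rows[self.pos]
        self.pos += 1
        return row
    def close(self):
        self.pos = None

class Select:
    """Filter iterator: passes through rows that satisfy a predicate."""
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
    def init(self):
        self.child.init()
    def get_next(self):
        while True:
            row = self.child.get_next()
            if row is None or self.predicate(row):
                return row
    def close(self):
        self.child.close()

def run(plan):
    """Drive the root iterator: init, pull until exhausted, close."""
    plan.init()
    out = []
    while (row := plan.get_next()) is not None:
        out.append(row)
    plan.close()
    return out

# A two-node query plan: Select on top of Scan.
plan = Select(Scan([("Smith", "DC"), ("Jones", "CA"), ("Smith", "CA")]),
              lambda r: r[0] == "Smith")
print(run(plan))
```

`Select` is pipelining in the slide's sense: it hands back each matching row as soon as it finds one, well before its child is closed.
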

43
Clever Dataflow Tricks
  • Volcano: exchange iterator [Graefe]
  • encapsulate exchange logic in an iterator
  • not in the dataflow system
  • Box-and-arrow programming can ignore parallelism
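
The idea of hiding parallelism inside an iterator can be sketched with a thread and a bounded queue. This is a heavily simplified stand-in, not Volcano's exchange operator: it has a single producer, does no partitioning, and the class name `Exchange` and sentinel `_DONE` are choices made for this example.

```python
import queue
import threading

_DONE = object()  # sentinel marking end of the child's stream

class Exchange:
    """Runs the child plan in its own thread; parent sees a plain iterator."""
    def __init__(self, child_rows, capacity=4):
        self.child_rows = child_rows      # any iterable standing in for a child plan
        self.capacity = capacity
    def init(self):
        self.q = queue.Queue(maxsize=self.capacity)
        self.worker = threading.Thread(target=self._produce, daemon=True)
        self.worker.start()
    def _produce(self):
        for row in self.child_rows:
            self.q.put(row)               # blocks when the buffer is full
        self.q.put(_DONE)
    def get_next(self):
        row = self.q.get()
        return None if row is _DONE else row
    def close(self):
        self.worker.join()

ex = Exchange(range(10))
ex.init()
rows = []
while (r := ex.get_next()) is not None:
    rows.append(r)
ex.close()
print(rows)
```

The consumer loop is identical to one written for a single-threaded iterator, which is the point: the box-and-arrow program does not change when the arrow crosses a thread boundary.
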

44
Some Solutions We're Focusing On
  • Rivers
  • Adaptive partitioning of work across machines
  • Eddies
  • Adaptive ordering of pipelined operations
  • Quality of Service
  • Online aggregation & data reduction: CONTROL
  • MUST have app-semantics
  • Often may want user interaction
  • UI models of temporal interest
  • Data Dissemination
  • Adaptively choosing what to send, what to cache

45
River
  • Berkeley built the world-record sorting machine
  • On the NOW: 100 Sun workstations + SAN
  • Only beat the record under ideal conditions
  • No such thing in practice!
  • (Arpaci-Dusseau)^2
  • with Culler, Hellerstein, Patterson
  • River: adaptive dataflow on clusters
  • One main idea: Distributed Queues
  • adaptive exchange operator
  • Simplifies management and programming
  • Remzi Arpaci-Dusseau, Eric Anderson, Noah
    Treuhaft
  • w/Culler, Hellerstein, Patterson, Yelick
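
Why a distributed queue helps when conditions are not ideal can be shown with a small simulation. This is not River's implementation: the functions `static_partition` and `distributed_queue` and the worker speeds are invented, and the model ignores network and queueing costs.

```python
import heapq

def static_partition(n_items, speeds):
    """Each worker gets an equal share up front; finish time = slowest worker."""
    share = n_items / len(speeds)
    return max(share / s for s in speeds)

def distributed_queue(n_items, speeds):
    """Workers pull one item at a time from a shared queue when idle."""
    # Heap of (time_worker_becomes_free, worker_id).
    free_at = [(0.0, w) for w in range(len(speeds))]
    heapq.heapify(free_at)
    done = [0] * len(speeds)
    finish = 0.0
    for _ in range(n_items):
        t, w = heapq.heappop(free_at)
        t += 1.0 / speeds[w]          # time to process one item
        done[w] += 1
        finish = max(finish, t)
        heapq.heappush(free_at, (t, w))
    return finish, done

speeds = [1.0, 1.0, 1.0, 0.25]       # items per time unit; one slow node
print(static_partition(120, speeds))
finish, done = distributed_queue(120, speeds)
print(finish, done)
```

With static partitioning the one slow node dictates the finish time; with the shared queue it simply ends up handling fewer items.
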

46
River
47
Multi-Operator Query Plans
  • Deal with pipelines of commutative operators
  • Adapt at finer granularity than current DBMSs

48
Continuous Adaptivity: Eddies
Eddy
  • A pipelining tuple-routing iterator
  • just like join or sort or exchange
  • Works best with other pipelining operators
  • like Ripple Joins, online reordering, etc.
  • Ron Avnur & Joe Hellerstein

49
Continuous Adaptivity: Eddies
Eddy
  • How to order and reorder operators over time
  • based on performance, economic/admin feedback
  • Vs. River
  • River optimizes each operator horizontally
  • Eddies optimize a pipeline vertically

50
Continuous Adaptivity: Eddies
  • Adjusts flow adaptively
  • Tuples routed through ops in different orders
  • Visit each op once before output
  • Naïve routing policy
  • All ops fetch from eddy as fast as possible
  • A la River
  • Turns out, doesn't quite work
  • Only measures rate of work, not benefit
  • Lottery-based routing
  • Uses lottery scheduling to address a bandit
    problem
  • Kris Hildrum, et al. looking at formalizing this
  • Various AI students looking at Reinforcement
    Learning
  • Competitive Eddies
  • Throw in redundant data access and code modules!

51
An Aside: n-Arm Bandits
  • A little machine learning problem
  • Each arm pays off differently
  • Explore? Or Exploit?
  • Sometimes want to randomly choose an arm
  • Usually want to go with the best
  • If probabilities are stationary, dampen
    exploration over time
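
The explore/exploit tradeoff can be sketched with an epsilon-greedy player whose exploration rate shrinks over time. This is one standard strategy chosen for illustration, not necessarily what the Telegraph group used; the payoff probabilities, the `1/sqrt(t)` decay schedule, and the name `epsilon_greedy` are assumptions.

```python
import random

def epsilon_greedy(payoff_probs, pulls=5000, seed=0):
    """Play an n-armed bandit; exploration rate decays as 1/sqrt(t)."""
    rng = random.Random(seed)
    n = len(payoff_probs)
    counts = [0] * n
    means = [0.0] * n
    for t in range(1, pulls + 1):
        epsilon = min(1.0, 1.0 / t ** 0.5)    # dampen exploration over time
        if rng.random() < epsilon:
            arm = rng.randrange(n)             # explore: random arm
        else:
            arm = max(range(n), key=means.__getitem__)  # exploit: best so far
        reward = 1.0 if rng.random() < payoff_probs[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts, means

# Three arms with stationary payoff probabilities (made up for the example).
counts, means = epsilon_greedy([0.2, 0.5, 0.8])
print(counts, [round(m, 2) for m in means])
```

If the payoff probabilities drift, decaying exploration to zero is a mistake, which is why the slide conditions it on stationarity.
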

52
Eddies with Lottery Scheduling
  • Operator gets 1 ticket when it takes a tuple
  • Favor operators that run fast (low cost)
  • Operator loses a ticket when it returns a tuple
  • Favor operators with high rejection rate
  • Low selectivity
  • Lottery Scheduling
  • When two ops vie for the same tuple, hold a
    lottery
  • Never let any operator go to zero tickets
  • Support occasional random exploration
  • Set up inflation (forgetting) to adapt over
    time
  • E.g. tix = α·oldtix + newtix
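
The ticket rules on this slide can be sketched as a small simulation. It is a toy under stated assumptions rather than the Eddy implementation: operators are plain filters, every operator is always eligible, operator cost is not modeled, the forgetting step from the last bullet is left out, and the names `Op` and `eddy` are hypothetical.

```python
import random

class Op:
    def __init__(self, name, predicate):
        self.name, self.predicate = name, predicate
        self.tickets = 1                     # never allowed to reach zero

def eddy(tuples, ops, seed=0):
    """Route each tuple through every op once, in lottery-chosen order."""
    rng = random.Random(seed)
    output, calls = [], 0
    for tup in tuples:
        pending = list(ops)
        alive = True
        while pending and alive:
            # Lottery: pick among eligible ops, weighted by ticket count.
            op = rng.choices(pending, weights=[o.tickets for o in pending])[0]
            pending.remove(op)
            op.tickets += 1                  # credit: op took a tuple
            calls += 1
            if op.predicate(tup):
                op.tickets = max(1, op.tickets - 1)   # debit: tuple returned
            else:
                alive = False                # tuple rejected; credit is kept
        if alive:
            output.append(tup)
    return output, calls

ops = [Op("x_is_even", lambda x: x % 2 == 0),       # passes 1/2 of tuples
       Op("x_div_by_10", lambda x: x % 10 == 0)]    # passes 1/10 of tuples
out, calls = eddy(range(1000), ops)
print(len(out), calls, {o.name: o.tickets for o in ops})
```

The operator that rejects more tuples keeps more of its tickets, so it tends to win the lottery and see tuples first, which reduces the total number of operator calls.
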

53
Promising!
  • Initial performance results
  • Ongoing work on proofs of convergence
  • have analysis for constrained case

54
To Be Continued
  • Tune & formalize policy
  • Competitive eddies
  • Source & Join selection
  • Requires duplicate management
  • Parallelism
  • Eddies + Rivers?
  • Reliability
  • Long-running flows
  • Rivers + RAID-style computation

55
To Be Continued, cont.
  • What about wide area?
  • data reduction
  • sensor fusion
  • asynchronous communication
  • Continuous queries
  • events
  • disconnected operation
  • Lower-level eventflow?
  • can eddies, rivers, etc. be brought to bear on
    programming?