Federated Facts and Figures presentation

Title: Federated Facts and Figures

1
Federated Facts and Figures

Joseph M. Hellerstein
UC Berkeley

2
Road Map

The Deep Web and the FFF
An Overview of Telegraph
Demo: Election 2000
From Tapping to Trawling
A Taste of Policy and Countermeasures
Delicious Snacks

3
Meet the Deep Web

Available in your browser, but not via hyperlinks
Accessed via forms (press the submit button)
Typically runs some code to generate data
E.g. call out to a database, or run some
servlet
Pretty-print results in HTML
Dynamic HTML
Estimated to be 400x larger than the surface
web
Not accessible in the search engines
Typically crawl hyperlinks only

4
Federated Facts and Figures

One part of the deep web: more full-text
documents
E.g. archived newspaper articles, legal
documents, etc.
Figure out how to fetch these, then add to search
engine
Various people working on this (e.g.
CompletePlanet)
Another part: Facts and Figures
I.e. structured database data
Fetch is only the first challenge
Want to combine (federate) these databases
Want to search by criteria other than keywords
Want to analyze the data en masse
I.e. want full query power, not just search
Search was always easy
Ranking not clearly appropriate here

5
Meet the FFF
6
Meet the FFF
7
Meet the FFF
8
(No Transcript)
9
Meet the FFF
10
Telegraph
http://telegraph.cs.berkeley.edu

An adaptive dataflow system
Dataflow
siphon data from the deep web and other data
pools
harness data streaming from sensors and traces
flow these data streams through code
Adaptive
sensor nets, wide area networks: volatile!
like Telegraph Avenue
needs to be cool with volatile mix from all
over the world
adaptive techniques route data to machines and
code
marriage of queries, app-level route/filter,
machine learning
First apps
Facts and Figures Federation: Election 2000
Continuous queries on sensor nets
Rich queries on Peer-to-Peer
Joe Hellerstein, Mike Franklin, & co.

11
Dataflow Commonalities

Dataflow at the heart of queries and networks
Query engines move records through operators
Networks move packets through routers
Networked data-intensive apps: an emerging middle
ground
Database Systems
High-function, high integrity, carefully
administered. Compile intelligent query plans
based on data models and statistical properties,
query semantics.
Networks
Low-function, high availability, federated
administration. Adapt to performance
variabilities, treat data and code as opaque for
loose coupling.

12
Long-Running Dataflows on the FFF

Not precomputed like web indexes
Need online systems & apps for online performance
goals
Subject of prior work in CONTROL project
Combo of query processing, sampling/estimation,
HCI

[Figure: plot against Time, vertical axis up to 100, comparing Online and Traditional processing]
13
Telegraph Architecture

Telegraph executes Dataflow Graphs
Extensible set of operators
With extensible optimization rules
Data access operators
TeSS: the Telegraph Screen Scraper
Napster/Gnutella readers
File readers
Data processing operators
Selections (filters), Joins, Drill-Down/Roll-Up,
Aggregation
Adaptivity Operators
Eddies, STeMs, FLuX, etc.

14
Screen Scraping: TeSS

Screen scrapers do two things
Fetch: emulate a web user clicking
Parse: extract info from resulting HTML/XML
Somebody has to train the screen scraper
Need a separate wrapper for each site
Some research work on making this process
semi-automatic
TeSS is an open-source screen-scraper
Available at http://telegraph.cs.berkeley.edu/tess
Written by a (superstar) sophomore!
Simple scripting interface targeted today
Moving towards GUI for non-technical users (by
example)
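
The parse half of a wrapper can be sketched in a few lines. This is a minimal illustration using Python's standard html.parser, not TeSS's actual scripting interface; the sample page, the TableScraper class and the parse_records helper are all hypothetical, and a real wrapper would first fetch the page by submitting a form.

```python
from html.parser import HTMLParser

# Hypothetical HTML, standing in for the page a form submission might return.
SAMPLE_PAGE = """
<html><body>
<table>
  <tr><th>Name</th><th>City</th></tr>
  <tr><td>Ann Smith</td><td>Berkeley</td></tr>
  <tr><td>Bob Smith</td><td>Oakland</td></tr>
</table>
</body></html>
"""


class TableScraper(HTMLParser):
    """Collects every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def parse_records(html):
    """Turn the first row into field names and the rest into records."""
    scraper = TableScraper()
    scraper.feed(html)
    header, *body = scraper.rows
    return [dict(zip(header, row)) for row in body]


records = parse_records(SAMPLE_PAGE)
```

The per-site knowledge lives in which tags count as a record; that is the part somebody has to train or write for each site.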

15
First Demo: Election 2000
16
From Tapping to Trawling

Telegraph allows users to pose rich queries over
the deep web
But sometimes would like to be more aggressive
Preload a Telegraph cache
Access a variety of data for offline mining
More (we'll see soon!)
Want something like a webcrawler for FFF
But FFF is too big.
Want to trawl for interesting stuff hidden
there.

17
From Tapping to Trawling
18
From Tapping to Trawling
19
(No Transcript)
20
(No Transcript)
21
From Tapping to Trawling
[Figure: dataflow diagram. Labels: Name, Address, DupElim, Anywho Name, Yahoo Maps, Eddy, Infospace Name, Infospace Street, Smith]
22
API Challenges in Trawling

Load APIs on the web today: service and silence
Various policies at the servers, hard to learn
No analogy to robots.txt (which is too limiting
anyhow)
Feedback can be delayed, painful
Solutions
Be very conservative
Make out-of-band (human) arrangements
Both seem inefficient
Finding new sites to trawl is hard
Have to wrap them: fetch is easyish, parse
hardish
XML will help a little here
Query? Or Update? Again, an API problem!
Imagine we auto-trawled AnyWho and WeSpamYou.com

23
Trawling Domains

Can now collect lists of
Names (First, Last), Addresses, Companies,
Cities, States, etc. etc.
Can keep lists organized by site and in toto
Allows for offline mining, etc.
Q: Do webgraph mining techniques apply to facts
and figures?

24
Exploiting Enumerated Domains I

Can trawl any site on known domains!
Suddenly the deep web is not so hidden.
In essence, we expand our trawl
Can use pre-existing domains to trawl further
Or, can add new sites to the trawl process

25
Exploiting Enumerated Domains II

Trawling gets a sample (signature) of a site's
content
Analogous to a random walk, but needs to be
characterized better
Can identify that 2 sites have related subsets of
domains
Helps with the query composition problem
Rich query interfaces tend to be non-trivial
What sites to use? How to combine them?
Imagine
Traditional search engine experience to pick some
sites
System suggests how to join the sites in a
meaningful way
As you build the query, you always see
incremental results
Refine query as the data pours in
Berkeley CONTROL project has worked on incremental
queries
Blends search, query, browse and mine

26
A Sampler of FFF Policy Issues

Statistical DB Security Issues
Facing the Power of the FFF
False combinations
Combination strength
What is trawling?
Copying? So what?
Akamai for the deep web?
Cracking?

27
Sampler of Countermeasures

Trawl detection
And Distributed Trawl Detection
Metadata Watermarking
Provenance, Lineage, Disclaimers
Stockpiling Spam

28
Delicious Snacks

"Concepts are delicious snacks with which we try
to alleviate our amazement" -- A. J. Heschel,
Man Is Not Alone

29
Technical Snacks

Adaptive Dataflow
Systems Learning
Incremental continuous querying
And online, bounded trawling
Adds an HCI component to the above
FFF APIs, standards
The wrapper-writing bottleneck: XML?
Backoff APIs?
Search vs. Update
Mining trawls

30
More Technical Snacks

Tie-ins with Security
Applications beyond FFF
Sensors
P2P
Overlay Networks

31
Policy Questions

Presenting & Interpreting Data
Not just search
Privacy: What is it, what's it for?
Leading Indicators from the FFF

32
More?

http://telegraph.cs.berkeley.edu
jmh_at_cs.berkeley.edu
Collaborators
Mike Franklin, Hal Varian -- UCB
Lisa Hellerstein, Torsten Suel -- Polytechnic
Sirish Chandrasekaran, Amol Deshpande, Sam
Madden, Vijayshankar Raman, Fred Reiss, Mehul
Shah -- UCB

33
Backup Slides
34
Telegraph: Adaptive Dataflow

Mixed design philosophy
Tolerate loose coupling and partial failure
Adapt online and provide best-effort results
Learn statistical properties online
Exploit knowledge of semantics via extensible
optimization infrastructures
Target new networked, data-intensive applications

35
Adaptive Systems: General Flavor

Repeat
Observe (model) environment
Use observation to choose behavior
Take action

36
Adaptive Dataflow in DBs: History

Rich But Unacknowledged History
Codd's data independence predicated on
adaptivity!
adapt opaquely to changing schema and storage
Query optimization does it!
statistics-driven optimization
key differentiator between DBMSs and other systems

37
Adaptivity in Current DBs

Limited & coarse grain
Repeat
Observe (model) environment
runstats (once per week!!): model changes in data
Use observation to choose behavior
query optimization: fixes a single static query
plan
Take action
query execution: blindly follow plan

38
What's So Hard Here?

Volatile regime
Data flows unpredictably from sources
Code performs unpredictably along flows
Continuous volatility due to many decentralized
systems
Lots of choices
Choice of services
Choice of machines
Choice of info: sensor fusion, data reduction,
etc.
Order of operation
Maintenance
Federated world
Partial failure is the common case
Adaptivity required!

39
Adaptive Query Processing Work
[Figure: chart with axis "Frequency of Adaptivity". Labels: Competition Sampling, Query Scrambling, Ingres DECOMP, Inter-Operator, Late Binding, Future Work, Per Query, System R, Eddies]

Late Binding: Dynamic, Parametric
[HP88, GW89, IN92, GC94, AC96, LP97]
Per Query: Mariposa [SA96], ASE [CR94]
Competition: RDB [AZ96]
Inter-Op: [KD98], Tukwila [IF99]
Query Scrambling: [AF96, UFA98]
Survey: Hellerstein, Franklin, et al., DE
Bulletin 2000

40
A Networking Problem!?

Networks do dataflow!
Significant history of adaptive techniques
E.g. TCP congestion control
E.g. routing
But traditionally much lower function
Ship bitstreams
Minimal, fixed code
Lately, moving up the foodchain?
app-level routing
active networks
politics of growth
assumption of complexity = assumption of liability

41
Networking Code as Dataflow?

States & Events, Not Threads
Asynchronous events natural to networks
State machines in protocol specification and
system code
Low-overhead, spreading to big systems
Totally different programming style
remaining area of hacker machismo
Eventflow optimization
Can't eventflow be adaptively optimized like
dataflow?
Why didn't that happen years ago?
Hold this thought

42
Query Plans are Dataflow Too

Programming model: iterators
old idea, widely used in DB query processing
object with three methods
Init(), GetNext(), Close()
input/output types
query plan: graph of iterators
pipelining: iterators that return results before
children Close()
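
The iterator model can be sketched in a few lines. The class names Scan and Select, the use of None as the end-of-stream marker, and the sample rows are assumptions of this sketch; only the three-method shape (Init, GetNext, Close) comes from the slide.

```python
class Scan:
    """Leaf iterator over an in-memory list of rows."""

    def __init__(self, rows):
        self.rows = rows
        self.pos = None

    def init(self):
        self.pos = 0

    def get_next(self):
        if self.pos >= len(self.rows):
            return None          # None marks end of stream in this sketch
        row = self.rows[self.pos]
        self.pos += 1
        return row

    def close(self):
        self.pos = None


class Select:
    """Filter iterator: pulls from its child until a row passes the predicate."""

    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def init(self):
        self.child.init()

    def get_next(self):
        while True:
            row = self.child.get_next()
            if row is None or self.predicate(row):
                return row

    def close(self):
        self.child.close()


def run(plan):
    """Drive the root iterator: Init once, GetNext until exhausted, then Close."""
    plan.init()
    out = []
    while (row := plan.get_next()) is not None:
        out.append(row)
    plan.close()
    return out


# Made-up rows; the plan is a two-node graph: Select over Scan.
rows = [{"county": "A", "votes": 120}, {"county": "B", "votes": 80}, {"county": "C", "votes": 200}]
plan = Select(Scan(rows), lambda r: r["votes"] > 100)
result = run(plan)
```

Select is pipelining in the sense above: it hands back each qualifying row as soon as it finds one, long before its child is closed.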

43
Clever Dataflow Tricks

Volcano: exchange iterator [Graefe]
encapsulate exchange logic in an iterator
not in the dataflow system
Box-and-arrow programming can ignore parallelism

44
Some Solutions We're Focusing On

Rivers
Adaptive partitioning of work across machines
Eddies
Adaptive ordering of pipelined operations
Quality of Service
Online aggregation & data reduction: CONTROL
MUST have app-semantics
Often may want user interaction
UI models of temporal interest
Data Dissemination
Adaptively choosing what to send, what to cache

45
River

Berkeley built the world-record sorting machine
On the NOW: 100 Sun workstations + SAN
Only beat the record under ideal conditions
No such thing in practice!
(Arpaci-Dusseau)^2
with Culler, Hellerstein, Patterson
River: adaptive dataflow on clusters
One main idea: Distributed Queues
adaptive exchange operator
Simplifies management and programming
Remzi Arpaci-Dusseau, Eric Anderson, Noah
Treuhaft
w/Culler, Hellerstein, Patterson, Yelick
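
Why a shared queue helps when machines run at different speeds can be seen in a small simulation. This is a toy model, not River's implementation: the function names, the three consumers and their per-item service times are assumptions, and the queue is modelled simply as "the next free consumer takes the next item".

```python
import heapq


def simulate_shared_queue(num_items, service_times):
    """Each consumer pulls the next item as soon as it is free.
    service_times: seconds per item for each consumer.
    Returns (items handled per consumer, overall finish time)."""
    # Heap of (time_when_free, consumer_index); everyone is idle at t=0.
    free_at = [(0.0, i) for i in range(len(service_times))]
    heapq.heapify(free_at)
    handled = [0] * len(service_times)
    finish = 0.0
    for _ in range(num_items):
        t, i = heapq.heappop(free_at)
        done = t + service_times[i]
        handled[i] += 1
        finish = max(finish, done)
        heapq.heappush(free_at, (done, i))
    return handled, finish


def static_partition(num_items, service_times):
    """Finish time when every consumer is handed an equal share up front."""
    share = num_items // len(service_times)
    return max(share * s for s in service_times)


# Two fast consumers and one that is four times slower.
handled, adaptive_finish = simulate_shared_queue(90, [1.0, 1.0, 4.0])
static_finish = static_partition(90, [1.0, 1.0, 4.0])
```

With equal shares the slow machine sets the pace; with the shared queue the fast machines absorb most of the work, so the run finishes in a third of the time.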

46
River
47
Multi-Operator Query Plans

Deal with pipelines of commutative operators
Adapt at finer granularity than current DBMSs

48
Continuous Adaptivity: Eddies
Eddy

A pipelining tuple-routing iterator
just like join or sort or exchange
Works best with other pipelining operators
like Ripple Joins, online reordering, etc.
Ron Avnur & Joe Hellerstein

49
Continuous Adaptivity: Eddies
Eddy

How to order and reorder operators over time
based on performance, economic/admin feedback
Vs. River
River optimizes each operator horizontally
Eddies optimize a pipeline vertically

50
Continuous Adaptivity: Eddies

Adjusts flow adaptively
Tuples routed through ops in different orders
Visit each op once before output
Naïve routing policy
All ops fetch from eddy as fast as possible
A la River
Turns out, doesn't quite work
Only measures rate of work, not benefit
Lottery-based routing
Uses lottery scheduling to address a bandit
problem
Kris Hildrum, et al. looking at formalizing this
Various AI students looking at Reinforcement
Learning
Competitive Eddies
Throw in redundant data access and code modules!

51
An Aside: n-Arm Bandits

A little machine learning problem
Each arm pays off differently
Explore? Or Exploit?
Sometimes want to randomly choose an arm
Usually want to go with the best
If probabilities are stationary, dampen
exploration over time
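
A minimal sketch of the explore/exploit trade-off with dampened exploration. The epsilon-greedy rule and the 1/sqrt(t) decay schedule are common choices assumed here for illustration, not something the talk prescribes; the payoff probabilities and the run_bandit name are made up.

```python
import random


def run_bandit(payoff_probs, pulls, seed=0):
    """Epsilon-greedy play against arms with fixed (stationary) payoff odds."""
    rng = random.Random(seed)
    counts = [0] * len(payoff_probs)
    means = [0.0] * len(payoff_probs)
    for t in range(1, pulls + 1):
        epsilon = 1.0 / (t ** 0.5)          # exploration rate shrinks over time
        if rng.random() < epsilon:
            arm = rng.randrange(len(payoff_probs))                  # explore
        else:
            arm = max(range(len(means)), key=means.__getitem__)     # exploit
        reward = 1.0 if rng.random() < payoff_probs[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]           # running average
    return counts, means


counts, means = run_bandit([0.2, 0.5, 0.8], pulls=5000)
```

If the payoff odds drifted over time, the exploration rate could not be allowed to shrink toward zero, which is the situation a long-running dataflow is in.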

52
Eddies with Lottery Scheduling

Operator gets 1 ticket when it takes a tuple
Favor operators that run fast (low cost)
Operator loses a ticket when it returns a tuple
Favor operators with high rejection rate
Low selectivity
Lottery Scheduling
When two ops vie for the same tuple, hold a
lottery
Never let any operator go to zero tickets
Support occasional random exploration
Set up inflation (forgetting) to adapt over
time
E.g. tix = α · oldtix + newtix
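
The ticket rules can be sketched as a toy eddy. This is an illustration under assumptions, not Telegraph's code: the Eddy class, the two filter operators, the forgetting factor alpha = 0.99 and the choice to age tickets once per tuple are all hypothetical, and operator cost is ignored so only selectivity drives the lottery.

```python
import random


class Eddy:
    """Toy tuple router; operators are given as name -> predicate."""

    def __init__(self, operators, alpha=0.99, seed=0):
        self.ops = dict(operators)
        self.alpha = alpha                       # forgetting factor (assumed value)
        self.tickets = {name: 1.0 for name in self.ops}
        self.calls = {name: 0 for name in self.ops}
        self.rng = random.Random(seed)

    def route(self, tup):
        """Send tup through every operator once; True if it survives them all."""
        # Forgetting: old tickets fade, but never below one ticket.
        for name in self.tickets:
            self.tickets[name] = max(1.0, self.alpha * self.tickets[name])
        pending = list(self.ops)
        while pending:
            # Lottery among the operators this tuple has not visited yet.
            weights = [self.tickets[name] for name in pending]
            name = self.rng.choices(pending, weights=weights)[0]
            pending.remove(name)
            self.calls[name] += 1
            self.tickets[name] += 1              # gains a ticket on taking a tuple
            if not self.ops[name](tup):
                return False                     # tuple rejected: ticket is kept
            # Tuple returned to the eddy: the ticket is given back.
            self.tickets[name] = max(1.0, self.tickets[name] - 1)
        return True


eddy = Eddy({
    "selective": lambda x: x % 10 == 0,   # passes 10% of tuples
    "loose": lambda x: x % 10 != 9,       # passes 90% of tuples
})
survivors = [x for x in range(1000) if eddy.route(x)]
```

The output is the same whatever order the lottery picks; only the amount of work changes, since the operator that rejects more tuples tends to hold more tickets and so sees tuples first.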

53
Promising!

Initial performance results
Ongoing work on proofs of convergence
have analysis for constrained case

54
To Be Continued

Tune & formalize policy
Competitive eddies
Source & Join selection
Requires duplicate management
Parallelism
Eddies + Rivers?
Reliability
Long-running flows
Rivers + RAID-style computation

55
To Be Continued, cont.

What about wide area?
data reduction
sensor fusion
asynchronous communication
Continuous queries
events
disconnected operation
Lower-level eventflow?
can eddies, rivers, etc. be brought to bear on
programming?
