Title: Scalable Trusted Computing: Engineering challenge, or something more fundamental?
1Scalable Trusted Computing: Engineering challenge, or something more fundamental?
- Ken Birman
- Cornell University
2Cornell Quicksilver Project
- Krzys Ostrowski: The key player
- Ken Birman, Danny Dolev: Collaborators and research supervisors
- Mahesh Balakrishnan, Maya Haridasan, Tudor Marian, Amar Phanishayee, Robbert van Renesse, Einar Vollset, Hakim Weatherspoon: Offered valuable comments and criticisms
3Trusted Computing
- A vague term with many meanings
- For individual platforms, integrity of the computing base
- Availability and exploitation of TPM h/w
- Proofs of correctness for key components
- Security policy specification, enforcement
- Scalable trust issues arise mostly in distributed
settings
4System model
- A world of
- Actors: Sally, Ted, ...
- Groups: Sally_Advisors = {Ted, Alice, ...}
- Objects: travel_plans.html, investments.xls
- Actions: Open, Edit, ...
- Policies
- (Actor, Object, Action) → {Permit, Deny}
- Places: Ted_Desktop, Sally_Phone, ...
5Rules
- If Emp.place ∈ Secure_Places and Emp ∈ Client_Advisors then Allow Open Client_Investments.xls
- Can Ted, working at Ted_Desktop, open Sally_Investments.xls?
- ... yes, if Ted_Desktop ∈ Secure_Places
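A minimal sketch of how the rule on this slide might be evaluated, assuming the groups and places are held in plain in-memory sets. The names `SECURE_PLACES`, `ADVISORS`, `Request` and `decide` are hypothetical and only illustrate the (Actor, Object, Action) → {Permit, Deny} mapping; they are not part of any system described in the talk.

```python
from dataclasses import dataclass

# Hypothetical data, named after the examples on the slides.
SECURE_PLACES = {"Ted_Desktop"}
ADVISORS = {"Sally": {"Ted", "Alice"}}   # client -> that client's advisor group


@dataclass(frozen=True)
class Request:
    actor: str    # e.g. "Ted"
    place: str    # e.g. "Ted_Desktop"
    action: str   # e.g. "Open"
    obj: str      # e.g. "Sally_Investments.xls"


def decide(req: Request) -> str:
    """Return "Permit" or "Deny" for one request under the single rule above."""
    # Split "Sally_Investments.xls" into the client and the file name.
    client, _, name = req.obj.partition("_")
    if (req.action == "Open"
            and name == "Investments.xls"
            and req.place in SECURE_PLACES
            and req.actor in ADVISORS.get(client, set())):
        return "Permit"
    return "Deny"
```

Under these assumptions, Ted is permitted from Ted_Desktop and denied from any place that is not in `SECURE_PLACES`.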
6Miscellaneous stuff
- Policy changes all the time
- Like a database receiving updates
- E.g. as new actors are added, old ones leave the system, etc
- ... and they have a temporal scope
- Starting at time t=19 and continuing until now, Ted is permitted to access Sally's file investments.xls
7Order dependent decisions
- Consider rules such as
- Only one person can use the cluster at a time.
- The meeting room is limited to three people
- While people lacking clearance are present, no classified information can be exposed
- These are sensitive to the order in which conflicting events occur
- Central "clearinghouse" decides what to allow based on order in which it sees events
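The order sensitivity can be illustrated with a small sketch of a central clearinghouse for the meeting-room rule. The class `MeetingRoom` and the helper `run` are hypothetical names introduced for this example; the point is only that the same set of requests yields different decisions depending on the order in which the clearinghouse sees them.

```python
class MeetingRoom:
    """Central clearinghouse for the 'limited to three people' rule."""

    def __init__(self, capacity=3):
        self.capacity = capacity
        self.present = set()

    def enter(self, person):
        # Decisions depend on who has already been admitted.
        if person in self.present or len(self.present) < self.capacity:
            self.present.add(person)
            return "Permit"
        return "Deny"


def run(order):
    """Feed entry requests to a fresh clearinghouse in the given order."""
    room = MeetingRoom()
    return {person: room.enter(person) for person in order}
```

With four people and a capacity of three, whoever arrives last is denied, so two replicas that saw the events in different orders would disagree unless they first agree on one order.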
8Goal: Enforce policy
investments.xls
Read
(data)
Policy Database
9... reduction to a proof
- Each time an action is attempted, system must develop a proof either that the action should be blocked or allowed
- For example, might use the BAN logic
- For the sake of argument, let's assume we know how to do all this on a single machine
10Implications of scale
- We'll be forced to replicate and decentralize the policy enforcement function
- For ownership: allows local policy to be stored close to the entity that owns it
- For performance and scalability
- For fault-tolerance
11Decentralized policy enforcement
investments.xls
Read
(data)
Policy Database
Original Scheme
12Decentralized policy enforcement
investments.xls
Read
(data)
Policy DB 1
Policy DB 2
New Scheme
13So how do we decentralize?
- Consistency: the bane of decentralization
- We want a system to behave as if all decisions occur in a single "rules" database
- Yet want the decisions to actually occur in a decentralized way... a replicated policy database
- System needs to handle concurrent events in a consistent manner
14So how do we decentralize?
- More formally
- Analogy: database 1-copy serializability
- Any run of the decentralized system should be indistinguishable from some run of a centralized system
15But this is a familiar problem!
- Database researchers know it as the atomic commit problem.
- Distributed systems people call it:
- State machine replication
- Virtual synchrony
- Paxos-style replication
- and because of this we know a lot about the
question!
16... replicated data with abcast
- Closely related to the atomic broadcast problem within a group
- Abcast sends a message to all the members of a group
- Protocol guarantees order, fault-tolerance
- Solves consensus
- Indeed, a dynamic policy repository would need abcast if we wanted to parallelize it for speed or replicate it for fault-tolerance!
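A sketch of why ordered delivery keeps replicated policy databases consistent, assuming a single fixed sequencer that stamps every update. `Sequencer` and `Replica` are hypothetical names; this shows only the ordering half of abcast, and a lone sequencer is itself a single point of failure, so a real protocol would also need the fault-tolerance guarantee.

```python
class Sequencer:
    """Hypothetical fixed sequencer: assigns one global order to updates."""

    def __init__(self):
        self.seq = 0

    def stamp(self, update):
        stamped = (self.seq, update)
        self.seq += 1
        return stamped


class Replica:
    """One copy of the policy database; applies updates in sequence order."""

    def __init__(self):
        self.rules = {}      # (actor, object, action) -> "Permit" or "Deny"
        self.next_seq = 0
        self.pending = {}    # seq -> update that arrived ahead of its turn

    def deliver(self, seq, update):
        # Hold back early arrivals until every earlier update has been applied.
        self.pending[seq] = update
        while self.next_seq in self.pending:
            key, decision = self.pending.pop(self.next_seq)
            self.rules[key] = decision
            self.next_seq += 1
```

Two replicas that receive the same stamped updates in different network orders end in the same state, which is the behavior a single rules database would show.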
17A slight digression
- Consensus is a classical problem in distributed systems
- N processes
- They start execution with inputs ∈ {0,1}
- Asynchronous, reliable network
- At most 1 process fails by halting (crash)
- Goal: protocol whereby all decide same value v, and v was an input
18Distributed Consensus
"Jenkins, if I want another yes-man, I'll build one!"
Lee Lorenz, Brent Sheppard
19Asynchronous networks
- No common clocks or shared notion of time (local ideas of time are fine, but different processes may have very different "clocks")
- No way to know how long a message will take to get from A to B
- Messages are never lost in the network
20Fault-tolerant protocol
- Collect votes from all N processes
- At most one is faulty, so if one doesn't respond, count that vote as 0
- Compute majority
- Tell everyone the outcome
- They decide (they accept outcome)
- but this has a problem! Why?
21What makes consensus hard?
- Fundamentally, the issue revolves around membership
- In an asynchronous environment, we can't detect failures reliably
- A faulty process stops sending messages but a "slow" message might confuse us
- Yet when the vote is nearly a tie, this confusing situation really matters
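The near-tie case can be made concrete with a small sketch of the vote-collecting protocol. The function `tally` is a hypothetical helper: it counts any process that has not responded as voting 0, exactly as the protocol prescribes, and shows that the outcome flips when one "yes" vote is merely slow rather than lost.

```python
def tally(votes, responded):
    """Majority over N votes, counting every non-responder as a 0 vote.

    votes:     list of 0/1 inputs, one per process
    responded: set of process indices whose vote arrived before the
               collector gave up waiting
    """
    counted = [v if i in responded else 0 for i, v in enumerate(votes)]
    # Strict majority of 1s decides 1; otherwise decide 0.
    return 1 if sum(counted) * 2 > len(counted) else 0
```

With votes `[1, 1, 1, 0, 0]`, the result is 1 if every vote arrives in time and 0 if the third process is treated as failed, even though it may only be slow.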
22Some bad news
- FLP result shows that fault-tolerant consensus protocols always have non-terminating runs.
- All of the mechanisms we discussed are equivalent to consensus
- Impossibility of non-blocking commit is a similar result from database community
23But how bad is this news?
- In practice, these impossibility results don't hold up so well
- Both define "impossible" as "not always possible"
- In fact, with probabilities, the FLP scenario is of probability zero
- ... must ask: does a probability zero result even hold in a "real system"?
- Indeed, people build consensus-based systems all the time
24Solving consensus
- Systems that "solve" consensus often use a membership service
- This GMS functions as an oracle, a trusted status reporting function
- Then consensus protocol involves a kind of 2-phase protocol that runs over the output of the GMS
- It is known precisely when such a solution will be able to make progress
25More bad news
- Consensus protocols don't scale!
- Isis (virtual synchrony) new view protocol
- Selects a leader; normally 2-phase, 3 if leader dies
- Each phase is a 1-n multicast followed by an n-1 convergecast (can tolerate n/2-1 failures)
- Paxos decree protocol
- Basic protocol has no leader and could have rollbacks with probability linear in n
- Faster-Paxos is isomorphic to the Isis view protocol (!)
- ... both are linear in group size.
- Regular Paxos might be O(n^2) because of rollbacks
26Work-arounds?
- Only run the consensus protocol in the group membership service, or GMS
- It has a small number of members, like 3-5
- They run a protocol like the Isis one
- Track membership (and other "global" state) on behalf of everything in the system as a whole
- Scalability of consensus won't matter
27But this is centralized
- Recall our earlier discussion
- Any central service running on behalf of the whole system will become burdened if the system gets big enough
- Can we decentralize our GMS service?
28GMS in a large system
Global events are inputs to the GMS
Output is the official record of events that
mattered to the system
GMS
29Hierarchical, federated GMS
- Quicksilver V2 (QS2) constructs a hierarchy of GMS state machines
- In this approach, each event is associated with some GMS that owns the relevant official record
GMS0
GMS2
GMS1
30Delegation of roles
- One (important) use of the GMS is to track membership in our rule enforcement subsystem
- But "delegate" responsibility for classes of actions to subsystems that can own and handle them locally
- GMS "reports" the delegation events
- In effect, it tells nodes in the system about the system configuration and about their roles
- And as conditions change, it reports new events
31Delegation
In my capacity as President of the United States, I authorize John Pigg to oversee this nation's banks
Thank you, sir! You can trust me
32Delegation
GMS0
GMS1
Policy subsystem
33Delegation example
- IBM might delegate the handling of access to its Kingston facility to the security scanners at the doors
- Events associated with Kingston access don't need to pass through the GMS
- Instead, they "exist" entirely within the group of security scanners
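A sketch of the delegation idea, assuming the root GMS records only the delegation itself while the delegate group keeps the record of the events it now owns. `Owner`, `RootGMS`, `delegate` and `owner_of` are hypothetical names; in a real system nodes would presumably learn the delegation from the events the GMS reports, and the `owner_of` lookup here merely stands in for that cached knowledge.

```python
class Owner:
    """A group that keeps the official record for some event categories."""

    def __init__(self, name):
        self.name = name
        self.record = []

    def log(self, category, event):
        self.record.append((category, event))


class RootGMS(Owner):
    """Top-level GMS: owns everything that has not been delegated."""

    def __init__(self):
        super().__init__("GMS0")
        self.roles = {}   # category -> group that owns it

    def delegate(self, category, group):
        self.roles[category] = group
        # The GMS records (and would report) only the delegation event.
        self.log("delegation", (category, group.name))

    def owner_of(self, category):
        return self.roles.get(category, self)
```

After `gms.delegate("kingston_access", scanners)`, door events are logged by the scanner group alone, and the root record grows only when the configuration changes.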
34... giving rise to pub/sub groups
- Our vision spawns lots and lots of groups that own various aspects of trust enforcement
- The scanners at the doors
- The security subsystems on our desktops
- The key management system for a VPN
- etc
- A nice match with publish-subscribe
35Publish-subscribe in a nutshell
- Publish(topic, message)
- Subscribe(topic, handler)
- Basic idea:
- Platform invokes handler(message) each time a topic match arises
- Fancier versions also support history mechanisms (lets a joining process catch up)
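A minimal in-process sketch of the two calls, assuming exact-match topics and an unbounded per-topic history. The class `PubSub` and the `replay` flag are hypothetical; they illustrate the Publish/Subscribe interface and the "catch up" idea, not the API of any particular product.

```python
from collections import defaultdict


class PubSub:
    def __init__(self):
        self.handlers = defaultdict(list)   # topic -> list of handlers
        self.history = defaultdict(list)    # topic -> messages published so far

    def subscribe(self, topic, handler, replay=False):
        # With replay=True a late joiner first receives the past messages.
        if replay:
            for message in self.history[topic]:
                handler(message)
        self.handlers[topic].append(handler)

    def publish(self, topic, message):
        self.history[topic].append(message)
        # Iterate over a copy so handlers may subscribe while being notified.
        for handler in list(self.handlers[topic]):
            handler(message)
```

A subscriber that joins with `replay=True` sees earlier messages on the topic before any new ones, in publication order.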
36Publish-subscribe in a nutshell
- Concept first mentioned by Willy Zwaenepoel in a paper on multicast in the V system
- First implementation was Frank Schmuck's Isis "news" tool
- Later re-invented in TIB message bus
- Also known as "event notification"... very popular
37Other kinds of published events
- Changes in the user set
- For example, IBM hired Sally. Jeff left his job at CIA. Halliburton snapped him up
- Or the group set
- Jeff will be handling the Iraq account
- Or the rules
- Jeff will have access to the secret archives
- Sally is no longer allowed to access them
38But this raises problems
- If "actors" only have partial knowledge
- E.g. the Cornell library door access system only knows things normally needed by that door
- ... then we will need to support out-of-band interrogation of remote policy databases in some cases
39A Scalable Trust Architecture
GMS hierarchy tracks configuration events
GMS
GMS
GMS
Pub/sub framework
Role delegation
Slave system applies policy
Master enterprise policy DB
Knowledge limited to locally useful policy
Central database tracks overall policy
Enterprise policy system for some company or
entity
40A Scalable Trust Architecture
- Enterprises talk to one another when decisions require non-local information
PeopleSoft
Inquiry
FBI
(policy)
Cornell University
41www.zombiesattackithaca.com
42Open questions?
- Minimal trust
- A problem reminiscent of zero-knowledge
- Example
- FBI is investigating reports of zombies in Cornell's Mann Library. Mulder is assigned to the case.
- The Cornell Mann Library must verify that he is authorized to study the situation
- But does FBI need to reveal to Cornell that the "Cigarette Man" actually runs the show?
43Other research questions
- Pub-sub systems are organized around topics, to which applications subscribe
- But in a large-scale security policy system, how would one structure these topics?
- Topics are like file names (paths)
- But we still would need an agreed upon layout
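One possible reading of "topics are like paths" is sketched here, assuming a purely hypothetical layout of the form `/enterprise/site/kind` and a `*` wildcard that matches exactly one path segment. The function `matches` and the example topic names are illustrative only; the slide leaves the actual layout as an open question.

```python
def matches(pattern, topic):
    """True if a path-like pattern matches a topic.

    A '*' in the pattern matches exactly one segment of the topic.
    """
    p = pattern.strip("/").split("/")
    t = topic.strip("/").split("/")
    return len(p) == len(t) and all(a in ("*", b) for a, b in zip(p, t))
```

Under this layout a subscriber could watch door-access events at every Cornell site with `/cornell/*/door-access` while ignoring other enterprises.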
44Practical research question
- State transfer is the problem of initializing a database or service when it joins the system after an outage
- How would we implement a rapid and secure state transfer, so that a joining security policy enforcement module can quickly come up to date?
- Once it's online, the pub-sub system reports updates on topics that matter to it
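A sketch of one common shape for state transfer, assuming every update carries a consecutive version number: the joiner copies a snapshot, then applies any updates that were published while the copy was in flight. `PolicyReplica` and `join` are hypothetical names, and the sketch ignores the "secure" half of the question (authenticating the snapshot and the updates).

```python
class PolicyReplica:
    def __init__(self):
        self.version = 0
        self.rules = {}   # (actor, object, action) -> "Permit" or "Deny"

    def apply(self, version, key, decision):
        # Apply only the next update in sequence; duplicates and stale
        # updates already covered by the snapshot are ignored.
        if version == self.version + 1:
            self.rules[key] = decision
            self.version = version

    def snapshot(self):
        return self.version, dict(self.rules)


def join(snapshot, buffered_updates):
    """Build a replica from a snapshot plus updates buffered during the copy."""
    replica = PolicyReplica()
    replica.version, replica.rules = snapshot[0], dict(snapshot[1])
    for version, key, decision in sorted(buffered_updates, key=lambda u: u[0]):
        replica.apply(version, key, decision)
    return replica
```

Once `join` returns, the new replica can take live updates from the pub-sub system through the same `apply` path.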
45Practical research question
- Designing secure protocols for inter-enterprise queries
- This could draw on the secured Internet transaction architecture
- A hierarchy of credential databases
- Used to authenticate enterprises to one another so that they can share keys
- They employ the keys to secure "queries"
46Recap?
- We've suggested that scalable trust comes down to "emulation" of a trusted single-node rule enforcement service by a distributed service
- And that service needs to deal with dynamics such as changing actor set, object set, rule set, group membership
47Recap?
- Concerns that any single node
- Would be politically unworkable
- Would impose a maximum capacity limit
- Won't be fault-tolerant
- ... pushed for a decentralized alternative
- Needed to make a decentralized service emulate a
centralized one
48Recap?
- This led us to recognize that our problem is an instance of an older problem: replication of a state machine or an abstract data type
- The problem reduces to consensus and hence is impossible
- ... but we chose to accept "Mission Impossible V"
49... Impossible? Who cares!
- We decided that the impossibility results were irrelevant to real systems
- Federation addressed by building a hierarchy of GMS services
- Each supported by a group of servers
- Each GMS owns a category of global events
- Now can create pub/sub topics for the various forms of information used in our decentralized policy database
- ... enabling decentralized policy enforcement
50QS2: A work in progress
- We're building Quicksilver, V2 (aka QS2)
- Under development by Krzys Ostrowski at Cornell, with help from Ken Birman, Danny Dolev (HUJI)
- Some parts already exist and can be downloaded now: Quicksilver Scalable Multicast (QSM).
- Focus is on reliable and scalable message delivery even with huge numbers of groups or severe stress on the system
51Quicksilver Architecture
- Our solution
- Assumes low latencies, IP multicast
- A layered platform, native hosting on .NET
Applications (any language)
Quicksilver pub-sub API
our platform
GMS
Strongly-typed .NET group endpoints
Properties Framework endows groups with stronger
properties
Quicksilver Scalable Multicast (C# / .NET)
52Quicksilver: Major ideas
- Maps overlapping groups down to "regions"
- Engineering challenge: application may belong to thousands of groups; efficiency of mapping is key
- Multicast is done by IP multicast, per-region
- Discovers failures using circulating tokens
- Local repair avoids overloading sender
- Eventually will support strong reliability model too
- Novel rate limited sending scheme
53Members of a region have similar group
membership
QSM runs protocols that aggregate over regions,
improving scalability
In traditional group multicast systems, groups
run independently
Hierarchical aggregation used for groups that
span multiple regions
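A sketch of the group-to-region mapping, under the simplifying assumption that a region is the set of nodes with exactly the same group membership (the slide says only "similar"). The functions `regions` and `regions_for_group` are hypothetical and are not taken from the QSM code.

```python
from collections import defaultdict


def regions(membership):
    """Partition nodes into regions keyed by their exact set of groups.

    membership: dict mapping node name -> iterable of group names
    returns:    dict mapping frozenset of groups -> set of nodes
    """
    by_groups = defaultdict(set)
    for node, groups in membership.items():
        by_groups[frozenset(groups)].add(node)
    return dict(by_groups)


def regions_for_group(group, region_map):
    """Regions a multicast to `group` must reach (one send per region)."""
    return [nodes for groups, nodes in region_map.items() if group in groups]
```

A multicast to a group then becomes one per-region send for each region returned by `regions_for_group`, rather than one protocol instance per group.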
55Connections to type theory
- We're developing a new high-level language for endowing groups with "types"
- Such as security or reliability properties
- Internally, QS2 will "compile" from this language down to protocols that amortize costs across groups
- Externally, we are integrating QS2 types with types in the operating system / runtime environment (right now, Windows .NET)
- Many challenging research topics in this area!
- http://www.cs.cornell.edu/projects/quicksilver/
56Open questions?
- Not all policy databases are amenable to decentralized enforcement
- Must have enough information at the point of enforcement to construct proofs
- Is this problem tractable? Complexity?
- More research is needed on the question of federation of policy databases with "minimal disclosure"
57Open questions?
- We lack a constructive logic of distributed, fault-tolerant systems
- Part of the issue is exemplified by the FLP problem: logic has yet to deal with the pragmatics of real-world systems
- Part of the problem resides in type theory: we lack true distributed type mechanisms