Title: FLARe:%20a%20Fault-tolerant%20Lightweight%20Adaptive%20Real-time%20Middleware%20for%20Distributed%20Real-time%20and%20Embedded%20Systems
1FLARe a Fault-tolerant Lightweight Adaptive
Real-time Middleware for Distributed Real-time
and Embedded Systems
Jaiganesh Balasubramanian jai_at_dre.vanderbilt.edu
http//www.dre.vanderbilt.edu/jai
Dr. Aniruddha S. Gokhale gokhale_at_dre.vanderbilt.
edu (Co-Advisor)
Dr. Douglas C. Schmidt schmidt_at_dre.vanderbilt.edu
(Advisor)
Department of Electrical Engineering and Computer
Science Vanderbilt University, Nashville, TN, USA
Middleware 2007 Doctoral Symposium (MDS
2007) Newport Beach, CA, USA
2Focus Distributed Real-time Embedded (DRE)
Systems
- Stringent simultaneous QoS demands, e.g., never
die, soft real-time, etc. - predominantly stateless, tolerates weaker
consistency if stateful - Distributed Object Computing middleware used to
design and develop DRE systems - support for highly available systems (e.g.,
FT-CORBA) - end-to-end predictable behavior for requests
(e.g., RT-CORBA)
- Goal is to provide real-time fault tolerance to
DRE systems - FT uses redundancy RT assured by resource
management
3Determining the Replication Scheme for DRE Systems
- Active replication
- client requests multicast and executed at all the
replicas - strong state consistency
- deterministic behavior of replicas
- very fast recovery
- resource-expensive
- Passive replication
- low resource/execution overhead
- better suited for weaker consistency
- no restrictions on deterministic behavior
- enables making tradeoffs between FT and resource
consumption - applies to a class of soft real-time DRE systems
- Passive replication better suited for our purpose
- Goal is to provide RTFT for DRE systems using
passive replication
4Challenges Using Passive Replication for DRE
Systems
- Challenge 1 Maintain real-time performance of
applications at all times - Focus Real-time performance after failover
- Decision-making algorithms used for electing a
new primary - Client response times depend on the loads of the
processor hosting the failover target - Task deadlines are met if the CPU utilization is
under a threshold - Failure could affect multiple clients failover
to multiple processors
5Challenges Using Passive Replication for DRE
Systems
- Challenge 2 Fast failover on client side
- Focus Faster and predictable failover
- Client-side middleware could maintain static list
of references - Round-robin approach of trying out different
references - Faster failover but not appropriate failover
- No RT guarantee after failover
- Client-side middleware need to be updated with
references based on dynamic operating conditions
6Challenges Using Passive Replication for DRE
Systems
- Challenge 3 FTRT in spite of resource overloads
- Focus Dynamic reconfigurations and overload
management - Long running systems continued operation
through many failures - Periodic loss of resources simultaneous
failures - Graceful degradation of applications
- Operate higher priority applications at all times
- Overload management predictable and fast
- Alternate degraded and assured functionality
7Challenges Using Passive Replication for DRE
Systems
- Challenge 4 Resource-aware stateful replication
- Focus State consistency in stateful DRE systems
- State transfer requires CPU and network
reservations - Support for resource-constrained operations
- Different consistency models strong, weak, and
no consistency - Adapt consistency of certain tolerant
applications depending on available resources - Utility optimizations better state consistency
for higher priority applications
8Our Approach FLARe RT-FT Middleware
- FLARe Fault-tolerant Lightweight Adaptive
Real-time Middleware - Transparent and Fast Failover
- Redirection using client-side portable
interceptors - catches COMM_FAILURE exceptions and transparently
throws LOCATION_FORWARD exceptions - Failure detection can be improved with better
protocols e.g., SCTP
9Our Approach FLARe RT-FT Middleware
- Real-time performance after failover
- monitor CPU utilizations at hosts where backups
are deployed - adaptive failover target selection algorithms
operated by a resource manager - failover targets chosen on the least loaded host
hosting the backups - better chance to provide RT performance
10Our Approach FLARe RT-FT Middleware
- Predictable failover
- failover target decisions computed periodically
by the resource manager - conveyed to client-side middleware agents
forwarding agents - agents work in tandem with portable interceptors
- redirect clients quickly and predictably to
appropriate targets - agents periodically/proactively updated when
targets change
11Current Progress
- Current Progress
- Initial prototype of FLARe developed using The
ACE ORB (TAO) - Stateless FT using passive replication
- Implemented a resource-aware adaptive failover
target selection algorithm - Compared and contrasted the performance of the FT
middleware when using static failover strategies
versus adaptive failover strategies - Significant reduction in client response times
and system utilization
FLARe is open-source and available at
www.dre.vanderbilt.edu
12Proposed Research and Expected Milestones
- Overload management
- investigate overload management algorithms that
do not degrade application QoS - minimum client disturbance
- implemented within the resource manager
- extreme resource constrained operating conditions
investigate opportunities to change
implementations and reduce overloads - Utility optimizations when to degrade QoS and
when not to - RT/FT trade-offs
Deadline March 2008
13Proposed Research and Expected Milestones
- State Synchronization
- View state synchronization as an aperiodic
scheduling problem - more slack available more time available to
synchronize state - slack devoted for higher priority applications
always - availability of slack support for different
consistency management schemes (e.g., weak,
strong, none) - application informs middleware when to
synchronize state
Deadline July 2008
14Proposed Research and Expected Milestones
- Network Reservations
- View real-time fault-tolerance as an end-to-end
scheduling problem - network reservations are required for state
transfers - without reservations, no predictability
- middleware-mediated mechanisms to use external
network QoS mechanisms such as DiffServ - network monitors for alternate routes in the
presence of failures (leverage existing network
research)
Deadline September 2008
15Concluding Remarks
- Passive replication a promising approach for
DRE systems - Resource-aware adaptive fault-tolerance
required for adapting passive replication for DRE
system requirements - Adaptive algorithms required for trading off RT
versus FT requirements - Middleware transparently supports FT for
applications works in conjunction with adaptive
algorithms to take care of RT requirements as well
FLARe is open-source and available at
www.dre.vanderbilt.edu