Title: Middle-R: A Middleware for Dynamically Adaptive Database Replication
1. Middle-R: A Middleware for Dynamically Adaptive Database Replication
- R. Jiménez-Peris, M. Patiño-Martínez, Jesús Milán
- Distributed Systems Laboratory (LSD)
- Universidad Politécnica de Madrid (UPM)
2. Symmetric vs. Asymmetric Processing
- Transactions in a replicated system can be processed either:
  - Symmetrically: all replicas process the whole transaction.
    - This approach can only scale by introducing queries in the workload.
  - Asymmetrically: one replica processes the transaction and the other replicas just apply the resulting updates.
    - This approach can scale, depending on the ratio between the cost of executing the whole transaction and the cost of just applying the updates.
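The scaling argument above can be sketched with a simple analytical model (our own illustrative formulation, not a model from the talk): each site spends its capacity on its local transactions plus on applying the writesets of remote updates.

```python
def system_capacity(n, site_capacity, update_fraction, apply_ratio):
    """Maximum total tps of n replicas: each site runs `local` tps of its own
    transactions and applies the updates of the other n - 1 sites, so
    local * (1 + update_fraction * (n - 1) * apply_ratio) <= site_capacity."""
    local = site_capacity / (1 + update_fraction * (n - 1) * apply_ratio)
    return n * local

# Symmetric processing (apply_ratio = 1): an update-only workload never
# exceeds the capacity of a single site.
print(system_capacity(10, 4, 1.0, 1.0))   # 4.0 tps, same as one site
# Asymmetric processing with a cheap apply step leaves spare capacity.
print(system_capacity(10, 4, 1.0, 0.2))   # about 14.3 tps
```

The second call uses an apply cost of one fifth of full execution; lowering that ratio, or lowering the update fraction, is what creates headroom for scale-out.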
3. Scalability of Symmetric Systems
(figure: scalability of a symmetric system)
4. Scalability of Asymmetric Systems
- In an asymmetric system, the transaction is fully executed at its master site.
- Non-master sites only apply the updates.
- This approach leaves some spare computing power, which enables the scalability.
5. Comparing the Scalability
6. Taxonomy of Eager Database Replication
- White box: modifying the database engine (Kemme's Postgres-R, VLDB00, TODS00).
  - It can use either symmetric or asymmetric processing.
- Black box: at the middleware level, without assuming anything about the database (Yair Amir et al., ICDCS02).
  - Inherently symmetric approach.
  - Transactions are executed sequentially by all replicas.
- Gray box: at the middleware level, based on get/set updates services (our approach, ICDCS02).
  - It can use symmetric processing.
  - It can also use asymmetric processing, provided the database offers two services to get/set the updates of a transaction. This is the approach we have taken.
7. Assumptions in Middle-R
- Each site has the entire database (no partial replication).
- Read one, write all available.
- We work on a LAN.
- Virtually synchronous group communication is available.
- The underlying database provides two basic services (similar to the CORBA ones):
  - get state: returns a list of the physical updates performed by a transaction.
  - set state: applies the physical updates of a transaction at a site.
- Our approach exploits application semantics: we assume that the database is partitioned in some arbitrary way and that it is known which data partitions are going to be accessed by a transaction.
  - This allows us to execute transactions from different partitions in parallel. Transactions spanning several partitions are also considered.
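As an illustration of the last assumption, here is a toy scheduler (class and method names are ours, not Middle-R's API) that lets transactions on disjoint partitions proceed in parallel while serializing those that share a partition:

```python
from collections import defaultdict

class PartitionScheduler:
    def __init__(self):
        self.queues = defaultdict(list)  # partition -> pending txn ids, FIFO

    def submit(self, txn_id, partitions):
        # the partitions a transaction will access are known in advance
        for p in partitions:
            self.queues[p].append(txn_id)

    def runnable(self):
        """A transaction may run when it is at the head of the queue of
        every partition it accesses."""
        heads = {p: q[0] for p, q in self.queues.items() if q}
        return {t for t in heads.values()
                if all(heads.get(p) == t
                       for p, q in self.queues.items() if t in q)}

s = PartitionScheduler()
s.submit("T1", ["accounts"])
s.submit("T2", ["orders"])
s.submit("T3", ["accounts", "orders"])  # spans two partitions
print(sorted(s.runnable()))  # ['T1', 'T2']: they run in parallel; T3 waits
```

T3 becomes runnable only once T1 and T2 have left the heads of both queues, which is the serialization the slide alludes to for multi-partition transactions.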
8. Protocol Overview (DISC00)
(figure: the client interacts with the middleware layer, which in turn interacts with the database layer)
9. Integrating the Middleware with the Application Server
- JBoss accesses databases through JDBC.
- In order to integrate the middleware with JBoss, it will be necessary to develop a JDBC driver.
- This JDBC driver will access the middleware by multicasting requests to the middleware instances at each site.
10. Integrating the Middleware with the Application Server
(figure: several JBoss instances, each with its own JDBC driver, multicast requests over a group communication bus to the Middle-R instances, each running on top of a DB)
11. Integrating the Middleware with the Application Server
- If JBoss is replicated, some issues must be tackled:
  - Independently of the kind of replication in JBoss, duplicated requests might reach the replicated database.
  - Active replication provokes the duplication of every request.
  - Other replication strategies might generate duplicate requests upon fail-over (i.e., requests made by the failed primary might be resubmitted by the new primary).
- The middleware requires that duplicates of a request be identified identically (i.e., carry the same request identifier).
- Provided this guarantee, the middleware will enforce the removal of duplicate requests.
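A minimal sketch of that duplicate-elimination guarantee (the class is hypothetical, not Middle-R code): as long as resubmissions carry the same identifier, the middleware can return the cached reply instead of re-executing.

```python
class DuplicateFilter:
    def __init__(self):
        self.seen = {}  # request id -> cached reply

    def handle(self, request_id, execute):
        if request_id in self.seen:          # duplicate: do not re-execute
            return self.seen[request_id]
        reply = execute()
        self.seen[request_id] = reply
        return reply

f = DuplicateFilter()
counter = {"n": 0}
def work():
    counter["n"] += 1
    return "ok"

f.handle("req-42", work)
f.handle("req-42", work)   # resubmitted by the new primary after fail-over
print(counter["n"])        # 1: executed only once
```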
12. Automatic DB Partitioning
- Middle-R exploits application semantics, that is, it requires partitioning the DB in some arbitrary way and knowing in advance which partitions each transaction is going to access.
- In our previous work, this partitioning was performed by the programmer.
  - For each stored procedure accessing the DB, a function was provided that, taking the parameters of the invocation, determined the partitions that would be accessed by the stored procedure invocation.
- This is a limitation of the previous approach that has to be overcome in Adapt.
- The DB partitioning should be transparent to users and therefore performed automatically, on a partition-per-table basis (at least).
13. Automatic DB Partitioning
- The second issue is how to know in advance which partitions a particular transaction is going to access.
- Our new approach will analyze the submitted SQL statements on the fly to determine which partitions they will access.
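A rough illustration of such on-the-fly analysis, assuming one partition per table (real SQL parsing is considerably more involved; the regexes below only convey the idea):

```python
import re

# Patterns that capture the table name after common SQL keywords.
TABLE_PATTERNS = [
    r"\bfrom\s+(\w+)",
    r"\bjoin\s+(\w+)",
    r"\bupdate\s+(\w+)",
    r"\binsert\s+into\s+(\w+)",
]

def partitions_of(sql):
    """Return the set of table-level partitions a statement touches."""
    sql = sql.lower()
    found = set()
    for pat in TABLE_PATTERNS:
        found.update(re.findall(pat, sql))
    return found

print(sorted(partitions_of("SELECT * FROM accounts JOIN orders o")))
# ['accounts', 'orders']
print(sorted(partitions_of("UPDATE accounts SET balance = 0")))
# ['accounts']
```

The union of the partitions of all statements seen so far gives the partition set of the transaction, refined as more statements arrive.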
14. DB Interaction Model
- Our previous work assumed that each transaction was submitted to the middleware in a single message.
- This model was suitable for stored procedures.
- However, this interaction model does not match the one adopted by JDBC:
  - Under JDBC a transaction might span an arbitrary number of requests.
  - Under JDBC a transaction might be distributed, so the XA interface should be supported for distributed atomic commit.
- For this reason, we are extending the underlying replication protocol to deal with transactions spanning multiple messages.
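The extended interaction model can be sketched as a per-transaction context that accumulates JDBC statements until commit (an illustration of the idea, not the actual protocol extension):

```python
class TransactionContext:
    def __init__(self):
        self.open = {}   # txn id -> list of statements received so far

    def begin(self, txn):
        self.open[txn] = []

    def statement(self, txn, sql):
        self.open[txn].append(sql)   # one JDBC request among arbitrarily many

    def commit(self, txn):
        stmts = self.open.pop(txn)
        # here the replication protocol would run (total order, get state, ...)
        return stmts

ctx = TransactionContext()
ctx.begin("T1")
ctx.statement("T1", "UPDATE accounts SET balance = balance - 10 WHERE id = 1")
ctx.statement("T1", "UPDATE accounts SET balance = balance + 10 WHERE id = 2")
print(len(ctx.commit("T1")))   # 2: the transaction spanned two requests
```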
15. Dynamic Adaptability
- The following dynamic adaptability properties are considered:
  - Online recovery: whilst a new (or failed) replica is being recovered, the system continues its regular processing without disruption (the SRDS02 approach, which extends ideas from DSN01 to the middleware context).
  - Load balancing: the masters of the different partitions are reassigned to balance the load dynamically.
  - Admission control: depending on the workload, the optimal number of transactions active in the system changes. The limit on active transactions is dynamically adapted to reach the maximum throughput for each workload.
16. Dynamic Adaptability: Online Recovery (SRDS02)
- Recovery is performed on a per-partition basis.
- Recovery is not performed during the state transfer associated with the view change, to prevent the blocking of regular requests.
- Once a partition is recovered at a recovering replica, that replica can start processing requests on the partition even though the other partitions are not yet recovered.
- Recovery is flexible, enabling load balancing policies to take the load of recovery into account:
  - The recovery can use one or more recoverers.
  - Each recoverer can recover one or more partitions.
17. Dynamic Adaptability: Online Recovery
- Replicas might recover in a cascading fashion.
- The online recovery protocol deals efficiently with cascading recoveries.
- Basically, it prevents redundancies in the recovery process as follows:
  - A replica that starts recovery whilst the recovery of another replica is underway is not delayed until the whole recovery completes.
  - Neither is a new recovery started in parallel (which would yield redundant recoveries).
  - Instead, this replica joins the ongoing recovery process at the next partition to be recovered.
- In this way, cascading recovering replicas share the recovery of common partitions.
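The partition-sharing idea can be illustrated with plain data structures (the function and its signature are ours, a simplification of the protocol): a replica joining while partition i is being transferred shares the transfer of the remaining partitions and receives the ones it missed separately.

```python
def plan_recovery(partitions, join_index):
    """join_index: replica -> index of the next partition to be transferred
    when that replica joined the recovery. Returns (shared, remainder):
    transfers each replica shares with the others, and the partitions it
    missed, which are transferred to it alone afterwards."""
    shared = {r: partitions[i:] for r, i in join_index.items()}
    remainder = {r: partitions[:i] for r, i in join_index.items()}
    return shared, remainder

parts = ["P1", "P2", "P3", "P4"]
# R2 joins while P1 is being transferred to R1, so it joins at P2;
# P2..P4 are then sent once and shared by R1 and R2.
shared, remainder = plan_recovery(parts, {"R1": 0, "R2": 1})
print(shared["R2"])     # ['P2', 'P3', 'P4']
print(remainder["R2"])  # ['P1'] is transferred to R2 alone, not re-sent to R1
```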
18. Dynamic Adaptability: Load Balancing
- The middleware approach has the advantage that every replica knows, without any additional information, the load of every other replica.
- This allows load balancing to be achieved with very little overhead.
- One of the main difficulties of load balancing is determining the current load of each replica.
- We are currently modeling the behavior of the DB to be able to determine the current load of each replica dynamically.
- These models will enable the middleware to determine which replicas have become saturated, so that their load can be redistributed.
- The load is redistributed by reducing the number of partitions that are mastered by an overloaded replica.
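A toy version of this redistribution policy (the load metric of one unit per partition is a deliberate oversimplification of the behavior models mentioned above):

```python
def rebalance(masters, load, threshold):
    """masters: partition -> master replica; load: replica -> current load.
    While a replica is above the threshold, move partitions it masters
    to the currently least-loaded replica."""
    for partition, replica in list(masters.items()):
        if load[replica] > threshold:
            target = min(load, key=load.get)   # least-loaded replica
            if target != replica:
                masters[partition] = target
                load[replica] -= 1             # crude one-unit load model
                load[target] += 1
    return masters

masters = {"P1": "R1", "P2": "R1", "P3": "R1", "P4": "R2"}
load = {"R1": 9, "R2": 2, "R3": 1}
rebalance(masters, load, threshold=5)
print(masters)   # the saturated R1 has shed its partitions to R2 and R3
```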
19. Dynamic Adaptability: Load Balancing during Online Recovery
- Load balancing will also control online recovery, adapting it to the load conditions:
  - When the system load is low, it will increase the resources devoted to recovery to accelerate it, taking advantage of the spare computing resources.
  - When the system load increases, it will dynamically decrease the resources devoted to recovery to cope with the new load.
20. Dynamic Adaptability: Admission Control
- The maximum throughput for a workload is reached with a given number of concurrent transactions in the system.
- Once this threshold is exceeded, the DB begins to thrash.
- This threshold is different for each workload, so it needs to be adapted dynamically to achieve the maximum throughput for the changing workload.
- The middleware has a pool of connections with the DB, and it can control transaction admission to attain the optimal degree of concurrency.
- We are developing behavior models that will enable us to find the thrashing point dynamically and adapt the threshold of the admission control accordingly.
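One possible shape for such an adaptive controller (a hill-climbing sketch under our own assumptions, not the behavior models mentioned above): raise the concurrency limit while throughput keeps improving, and back off once it drops, homing in on the thrashing point.

```python
class AdmissionController:
    def __init__(self, limit=10):
        self.limit = limit              # max concurrently active transactions
        self.last_throughput = 0.0

    def adapt(self, throughput):
        if throughput >= self.last_throughput:
            self.limit += 1                        # still below thrashing
        else:
            self.limit = max(1, self.limit - 2)    # back off: DB thrashing
        self.last_throughput = throughput

    def admit(self, active_transactions):
        return active_transactions < self.limit

ac = AdmissionController(limit=10)
for tp in [100, 120, 130, 125]:   # observed throughput per interval
    ac.adapt(tp)
print(ac.limit)   # 11: grew to 13, backed off after throughput dropped
```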
21. Wide Area Replication
- The underlying protocols in the middleware are amenable to being used in a WAN.
- We are currently studying which new requirements arise in a WAN, to find problems that might require changes in the protocols.
- Replication across a WAN helps to survive catastrophic failures; it is also needed by many multinational companies with branches spanning different countries.
  - For the former scenario we contemplate a replica at each geographic location.
  - For the latter scenario we contemplate a cluster at each geographic location.
22. Partial Replication
- Scalability in the middleware, although good, is limited by the overhead induced by propagating the updates to all the replicas (see SRDS01 for an analytical model determining the precise scalability of the approach).
- This limitation can be overcome by means of partial replication.
- In this way, each partition can be dynamically replicated to the optimal level.
- However, partial replication introduces new complications, such as queries spanning multiple partitions that cannot be performed on a replica that does not hold a copy of all the accessed partitions.
23. Conclusions
- Extensions to our previous work, together with the JDBC driver, will enable the use of our middleware approach to provide dynamically adaptable DB replication for JBoss.
- The flexibility of the middleware approach enables us to contribute to different issues regarding dynamic adaptability, such as online recovery, dynamic admission control, dynamic load balancing, dynamically changing the degree of partial replication, etc.
24. Optimistic Delivery (KPAS99)
(figure: two timelines comparing transaction latency.
a) Replication protocol with non-optimistic totally ordered multicast: the execution of the transaction starts only after the total order multicast completes, so the latency for a transaction is the multicast time plus the execution time.
b) Replication protocol with optimistic totally ordered multicast: execution starts at opt-delivery and overlaps with the rest of the total order multicast; the totally ordered delivery only confirms the outcome, so the latency for a transaction is shorter.)
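The latency saving in the figure can be expressed as a small calculation (illustrative numbers, not measurements from the talk): without optimism the two phases are sequential; with optimistic delivery the execution overlaps the establishment of the total order, so commit waits only for whichever finishes last.

```python
def latency(multicast_ms, execution_ms, optimistic):
    """Transaction latency under the two delivery schemes of the figure.
    This simplified model assumes opt-delivery happens immediately, so the
    overlap is total; in practice there is a small opt-delivery delay."""
    if optimistic:
        return max(multicast_ms, execution_ms)   # phases overlap
    return multicast_ms + execution_ms           # phases are sequential

print(latency(5, 10, optimistic=False))  # 15
print(latency(5, 10, optimistic=True))   # 10: the multicast latency is hidden
```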
25. Advantages of Optimistic Delivery
- For the optimism to create problems, two things must happen at the same time:
  - Messages get out of order (unlikely in a LAN).
  - The corresponding transactions conflict.
- The resulting probability is very low, and we can make it even lower (transaction reordering at the primary).
- The cost of group communication is thus minimized.
26. Experimental Setup
- Database: PostgreSQL.
- Group communication: Ensemble.
- Network: 100 Mbit Ethernet.
- 15 database sites (each a Sun Ultra 5 running Solaris).
- Two kinds of transactions were used in the workload:
  - Queries (only reads).
  - Pure updates (only writes).
27. Experiments
- "The Dangers of Replication" claims that none of these statements is true of conventional eager replication protocols:
  - 1. Using replication does not make the system worse.
  - 2. Adding more replicas increases the throughput of the system.
  - 3. The increase in throughput does not affect the response time.
  - 4. Acceptable overhead in worst-case scenarios.
28. Comparison with Distributed Locking (load: 5 tps)
29. 2. Throughput Scalability
30. 3. Response Time Analysis
31. 3. Response Time Analysis
32. 3. Response Time Analysis
33. 4. Coordination Overhead
34. Conclusions
- Consistent replication can be implemented at the middleware level.
- Achieving efficiency requires understanding the dangers of replication:
  - Only one message per transaction.
  - Asymmetric processing.
  - Reduced communication latency.
  - Reduced abort rates.
- Our system demonstrates different ways to address all of these problems.
35. Ongoing Work
- We are using the middleware to implement replication in object containers (e.g., J2EE, CORBA).
- Tests are underway to use the system to implement replication across the Internet.
- Porting the system to Spread (Amir et al.).
- Load balancing for web servers based on replicated databases.
- Online recovery and dynamic system reconfiguration:
  - DSN 2001: Kemme, Bartoli, Babaoglu.
  - SRDS 2002: Jiménez, Patiño, Alonso.
36. Analytical vs. Empirical Measures
37. How Can the Middleware Perform with Faster Databases?
- The 1-update transaction took 10 ms to execute, whilst an 8-update transaction took 55 ms.
- This means that, in a faster database, for transactions lasting within these ranges we can obtain similar scalabilities (until some bottleneck is reached, most likely group communication).
- The determinant factor of scalability is the ratio between the cost of executing the whole transaction and the cost of just applying its updates; this factor, although it can be reduced, will always be significant (in Postgres it was 0.16 for 8-update transactions and 0.2 for 1-update transactions).
38. Background
- Replication has been used for two different and mutually exclusive purposes in transactional systems:
  - To increase availability (eager replication) by providing redundancy, at the cost of throughput and scalability.
  - To increase throughput and scalability by distributing the work among replicas (lazy replication), at the cost of consistency.
- We want both availability and performance.
- However, Gray in "The Dangers of Replication" (SIGMOD96) stated that eager replication could not scale.
39. Motivation
- Postgres-R (KA00) showed how to combine database replication with group communication to implement a scalable solution within a database.
- We extended this work (PJKA00) by exploring how to implement replication outside the database:
  - The protocol is provably correct.
  - It could be implemented as middleware.
  - It scales (e.g., adding more sites increases the capacity).
- In this talk we discuss the performance of such a protocol as implemented on a cluster of computers connected through a LAN, and show that it can be used in a wide range of applications.
40. Eager Data Replication
- There is a copy of the database at each site.
- Every replica can perform update transactions (update everywhere).
- Transaction updates must be propagated to the rest of the replicas.
- Queries (read-only transactions) are executed at a single replica.
41. Understanding the Scalability of Data Replication
- Symmetric system: assume sites with a processing capacity of 4 tps.
- Each transaction executed by a site induces a load of one transaction on each other site.
- Hence the capacity of the system is at most the capacity of a single site: 4 tps.
42. Asymmetric Systems
- In an asymmetric system the work performed by a replica consists of:
  - Local transactions, i.e., transactions submitted to the replica.
  - Remote transactions, i.e., update transactions submitted to other replicas.
43. A Middleware Replication Layer
(figure: two replica managers, Replica Manager X and Replica Manager Y, each containing a queue manager, a communication manager, and a connection manager, each running on top of a PostgreSQL instance and connected to the other through group communication)
44. A Middleware Replication Layer
- The replication system has been implemented as a middleware layer that runs on top of off-the-shelf, non-distributed databases or other data stores (e.g., an object container such as CORBA).
- This layer only requires two simple services from the underlying data repository:
  - get state: returns a list of the physical updates performed by a transaction.
  - set state: applies the physical updates of a transaction at a replica.
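A minimal sketch of these two services (the service names come from the slides; the signatures and data layout are our own assumptions): the master executes the transaction fully and exposes its writeset through get state, while the other replicas apply it through set state without re-executing any SQL.

```python
class Replica:
    def __init__(self):
        self.rows = {}        # (table, key) -> value
        self.writesets = {}   # txn id -> physical updates of that txn

    def execute(self, txn, updates):
        """Full execution at the master; record the physical updates."""
        self.rows.update(updates)
        self.writesets[txn] = updates

    def get_state(self, txn):
        """Return the list of physical updates performed by txn."""
        return self.writesets[txn]

    def set_state(self, writeset):
        """Apply a writeset produced elsewhere, without re-executing."""
        self.rows.update(writeset)

master, replica = Replica(), Replica()
master.execute("T1", {("accounts", 1): 90, ("accounts", 2): 110})
replica.set_state(master.get_state("T1"))   # cheap apply at the replica
print(replica.rows[("accounts", 1)])        # 90
```

This asymmetry, full execution once and cheap application everywhere else, is what the scalability discussion earlier in the deck relies on.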
45. Exp. 1: Comparison with Distributed Locking
- In this experiment we compared our system with a commercial database that uses distributed locking and eager replication to guarantee full consistency of the replicas.
- A small load of 5 transactions per second was used for this experiment.
46. Response Time Analysis
- The goal of this experiment is to show that transaction latency remains stable for loads within the scalability interval.
- For each configuration and update rate, the load is increased until the response time degenerates.
47. Exp. 2: Throughput Scalability
- This experiment tested how the throughput of the system varies with an increasing number of replicas.
- In particular, we wanted to know the power of the cluster relative to a single site.
48. Measuring the Overhead
- The latency of short transactions is extremely sensitive to any overhead.
- The goal of this experiment is to measure how the response time is affected by the overhead introduced by the middleware layer.
- In this experiment the shortest update transaction was used: a transaction with a single update.
49. Motivation and Background
- Eager replication is the textbook approach to achieving availability.
  - Yet, very few database products provide consistent replication.
  - The reasons were explained by Gray in "The Dangers of Replication" (SIGMOD96).
- Postgres-R (KA00) showed how to avoid these dangers and implement eager replication within a DB:
  - It combines transaction processing and group communication.
  - It uses asymmetric processing.
  - It showed how to embed these techniques in a real database engine.
50. Motivation and Background
- A subsequent approach explored scalable eager DB replication outside the DB, at the middleware level (DISC00, ICDCS02).
- Experiments showed that it was possible to achieve replication at the middleware level with a scalability close to the one achieved within the database.
51. Two Crucial Issues
- Processing should be asymmetric:
  - otherwise it does not scale,
  - but it is difficult to do outside the database.
- Avoid the latency introduced by group communication (especially for large groups):
  - otherwise the response time suffers,
  - but we need the group communication semantics.