Title: DM216 The Three Ps of Replication in a High Availability Environment
1DM216The Three Ps of Replication in a High
Availability Environment
Tony T. Sheng Technical Team Lead PLW Data
Architecture Bell Atlantic tony.t.sheng_at_bellatlant
ic.com
2The Three Ps of Replication in a High
Availability Environment
- Who am I?
- Overview of Configuration
- Best Practices
- Pitfalls
- The Principles
- Q A
3The Three Ps of Replication in a High
Availability Environment
- Who am I?
- Tech Lead for a DBA team
- ASE/Rep Server administration for DEV/TEST
- Development of processes for schema changes and
custom data conversion to support application - Infrastructure and configuration design
- Production Support for poor performance
- Around 5 years of experience with Sybase products
4Configuration Overview
- Overview of PLW/CW configuration
- Homogenous dataservers
- Bi directional replication
- No stored procedures, no function strings yet
- No routes, one RS serves PDB-gtRDB
- No Sybase Warm Standby
5Configuration Overview
6Configuration Overview
7Configuration Overview
- R4 to R2 W1CP (in more detail)
8Configuration Overview
9Configuration Overview
- Overview of Configuration
- Why is this so confusing?
- TUXEDO, XA Server for 2-phase commit
- Application architecture start up is too long
- Can use the Rep Server to queue for maintenance
and availability - Alternative to System 12
- Wanted DBAs to be disconnected from the failover
10Best Practices
- Design
- Rep Server
- On another host to get data out of the Rep Agent
fast - Its own host to reduce network/CPU bottleneck
- Lots of disk so you dont run out of SD
- Hold lots of transactions in the SD if you need
to - Data Model
- Login and timestamp field on every table
- Could be used in the future for conflict
resolution
11Best Practices
- Design
- Data Server
- Hardware layout and software configuration all to
accommodate more users in case of failover - Each database resides on its own set of disks, no
controller/disk contention - Use 11.5 Rep Agent and Rep Server
- Hold lots of days in transaction logs if you need
to
12Best Practices
- Standards
- Data Definition Language
- Each table has unique index and primary key that
are relevant (used for auto generation of
repdefs) - Object name allows generation of a repdef that is
less than 30 characters - UNIX host and Dataserver naming standards
- Each host has a isXXXXXX name
- Each ASE servers is a SQLNAAXXXXXX
13Best Practices
- Standards
- UNIX host and Data Server naming standards
- Each host has a isXXXXXX name
- is001277
- Each ASE servers is a SQLNAAXXXXXX
- SQL1LW001277, SQL1CW002147
- Can have a backupserver, XA Server and Rep Server
- REP1LW001277, REP2LW001277, REP1CW002147
- SYB_BACKUP1LW001277
14Best Practices
- Standards
- Replication Objects
- Repdefs - rmachinedatabaseobject
- r1001931w1mdtlb
- Subscriptions - smachinedatabaseobject
- s1001929w1mdtlb
- Function strings names would also be standard
15Best Practices
- Standards
- Every Sybase server is configured identically.
- Filesystems, partitions, disk names, database
names - Configuration changes between the type of
application, ie. sqljumbo, replarge, etc. - Flat File Site Specific Information (SSI)
- Flat files to map database names and replication
flow - rep_env.hostname
- PRSPDSPDBRRSRDSRDB
16Best Practices
- Standards
- Allows getting a quick diagnosis
- Assists in getting a quick handle on replication
flow - Assists to drill down to specific objects that
have issues - The key is to recover quickly from problems
17Best Practices
- Operational Procedures
- Who updates where ?
- Replication is on the whole time and is
bidirectional - Transactions executed by the maintenance user are
not replicated - (see send_maint_xacts_to_replicate under
sp_config_rep_agent)
18Best Practices
- Operational Procedures
- Who updates where ?
- Application user groups are GRANTed/REVOKEd
permission based upon operating state (normal
versus failover) - One dataserver is always considered the REAL ONE,
data on this server can be assumed to be correct
19Best Practices
- Operational Procedures
- Constraints
- Used in the data model to keep integrity of the
data - Can bring the Rep Server thread down because of
the sequence of the SQL - We have been having OPS drop them on the
replicate side when necessary - Probably should include a DROP/ADD constraint
process during the failover
20Best Practices
- Operational Procedures
- Autocorrection
- Can help get rid of duplicate key errors that are
stuck in the queue when synching the data. - Cant be used with Replicate Minimal Columns
(which we dont do but probably should)
21Best Practices
- Monitor
- You must monitor!!
- Rep Agent and Transaction Logs
- Ran into some bugs with Rep Agent hanging
- Check that the transaction log dumps are not
growing in size - This would infer that the Rep Agent is not moving
through the transaction log and moving the
secondary truncation pointer
22Best Practices
- Monitor
- Replication Server DSI Threads
- Had a few days where the thread(s) were down (OPS
error) - Takes time to recover from this to drain the
queue - Vulnerable if failover is needed. You would have
the potential for in flight transactions. (BIG
headache)
23Best Practices
- Monitor
- Some kind of replication heartbeat
- rep_verify process
- Heartbeat table every 10 minutes
24Best Practices
- Monitor
- Extra Monitoring
- Rep Server queue throughput
- Difference between first and last block every
hour - Capture trends for load, additional data, peak
volumes, future capacity
25Best Practices
- Monitor
- Rep Server queue throughput
26Best Practices
- Monitor
- Other various monitoring
- Latency
- Have not noticed too much latency but could
assimilate data with queue throughput - rs_lastcommit timestamp on replicate database -
the last time an update flowed through - select getdate(), db_name, origin,
datediff(ss,origin_time,dest_commit_time)from
rs_lastcommit where origingt0 and origin_time gt
"Jan 1 1900"
27Best Practices
- Monitor
- Other various monitoring
- Aggregate information for our rollup replication
89354 ---- md 1449 ---- wv 328029 ----
pa 63383 ---- dc 25255 ---- de 126779 ----
va 98929 ---- nj 733178
SQL1R2051972 -----------
733178 SQL1R2051973 ----------- 733178
28Best Practices
- Tools and Utilities
- droprep.sh DBNAME ALL
- Utility to drop all replication components for a
database or all databases on the dataserver - Large schema changes when it is easier to rebuild
all replication - Drops subs/repdefs
- Shutdown Rep Agents and disable
- Remove secondary truncation point (most important)
29Best Practices
- Tools and Utilities
- Standard process to build replication
- Establish connection for PDB/RDB
- Turn on Rep Agent for PDB
- Build repdefs/subs
- Set replicate bit
- Grant permissions to maintenance user
30Best Practices
- Tools and Utilities
- REPON/REPOFF.sh
- Utility to temporarily turn off and on
replication - Shutdown Rep Agents and disable secondary
truncation point - Resets truncation point, turns on Rep Agent
31Best Practices
- Tools and Utilities
- dump/rcp/load
- Set of scripts to synch up databases from primary
to replicate - Make sure no users on the replicate so the
database can be loaded - Must reset generation id, reset truncation point,
reconfigure Rep Agent to the correct Rep Server
(two way)
32Best Practices
- Tools and Utilities
- REPdelta
- Process which adds/drops/synchs replication
objects to database schema changes - Reads from flat files that have been generated by
data model process
33Pitfalls
- Recovering Data from Operational Errors
- Queue was down and a failover was executed
- Have to figure out what updates were held at the
PDB and what updates were executed at the RDB.
Timestamp field can help here.
34Pitfalls
- Application Doing Very Bad Things
- A HUGE Transaction
- One large update took more than 75000 locks on
the replicate database. - Rep Server would open the transaction, flow, get
an error for insufficient locks, rollback the
transaction and then try again (for 36 hours) - Had to shutdown the RDS and reconfigure the
number of locks and let the Rep Server replicate
the transaction - Got lucky because the RDS was not in use
35Pitfalls
- Application Doing Very Bad Things
- Moving the Primary Key of a Unique Index Column
to another value - This is a documented restriction.
- Causes the DSI thread to go down because the
order of the transaction causes a duplicate key
violation - Must skip the transaction
- see Chapter 8 of the Rep Server Admin Guide
36Pitfalls
- Application Doing Very Bad Things
- Moving the Primary Key of a Unique Index Column
to another value
Table cu_ln clustered unique index is on cu_f,
cu_ln_f update cu_ln set a.cu_ln_f (select
a.cu_ln_f max(b.cu_ln_f) from cu_ln b
where b.cu_f 23732123) from cu_ln a where
a.cu_f 982323433 go
37Pitfalls
- Application Doing Very Bad Things
- Moving the Primary Key of a Unique Index Column
to another value - the workaround involves making all the updates to
a temp table and then inserting them and deleting
the old rows
1. select into getoldculn from cu_ln where
cu_f OLD 2. update getoldculn set cu_ln_f
(select cu_ln_f max(cu_ln_f) from getoldculn
where cu_f OLD 3. insert into cu_ln select
from getoldculn where cu_f OLD
38Pitfalls
- DBA functions
- Tried to move the logsegment off a device onto
another device using sp_dropsegment and
sp_extendsegment - alter database xlvww1ma log on SYBLDEV26 500
- go
- use xlvww1ma
- go
- sp_dropsegment "logsegment", "xlvww1ma","SYBLDEV04
" - go
- sp_extendsegment "default", "xlvww1ma","SYBLDEV04"
- go
39Pitfalls
- DBA functions
- Tried to move the logsegment off a device onto
another device using sp_dropsegment and
sp_extendsegment - Ended up getting 605 and 692 errors on the log
segment of the database. - Should have turned off the Rep Agent before any
of this work. - Recovered by reloading the databases. (painful)
40Pitfalls
- DBA functions
- Bad constraint added manually
- Had a DSI thread that was up but no data would
flow. - Maintenance user spid IOs never increased.
- Queue would drain when autocorrection was turned
on - Found a bad constraint on the replicate database
(most likely put in by manual process - C_AD_gssl_adr_rng_1
- adr FOREIGN KEY (adr_f) SELF REFERENCES
adr(adr_f)
41Pitfalls
- Synching databases
- You will probably have to synch your data
periodically - Skipped transactions
- Incorrect failover
- Make this a documented standard procedure so it
is readily available. - Use autocorrection to avoid the first few
duplicate key errors from data that is in the log
42The Principles
- A Replication Lab
- Have a restricted development lab that mimics
production - Owned by DBAs, no other teams have access
- Same DS/RS configuration
- Number of databases, Rep Servers, rs_config,
sp_configure - Used to isolate replication problems
- Mock deploy replication fixes
- Test out new replication topologies
- Recreate production problems
43The Principles
- Send all SQL through the SW lifecycle
- When we get burned, 90 is from SQL that didnt
go through the software lifecycle. - Be stringent about application transaction size
- every SQL is reviewed by DBA team for
- Performance standards (valid SARGs, good
showplan, etc.) - Replication Server standards
- (no update table set col1T where col1R) -gt
sends into a big update transaction on the
replicate
44The Principles
- More Principles We Have Learned
- Have good standards
- Assist you in debugging quickly
- Start simple with Rep Server
- Monitor, Monitor, Monitor
- Know your Operating Procedures
- Step by Step of failover/failback process
45What Is Next?
- Replicating data through different schemas
- combination of function strings/operational steps
depending on the type of change - Configuration that is similar to System 12
- move to hot/cold configuration where the disks
are shared and switched - no Rep Server for local failover but still for
DR mode data replication
46Q A