Recovery

About This Presentation

Title:

Recovery

Description:

Recovery Chapter 9, 10, 11.1-11.4 in Gray and Reuter Adapted from s by J. Gray & A. Reuter – PowerPoint PPT presentation

Number of Views:114

Avg rating:3.0/5.0

Slides: 101

Provided by: rutg155

Learn more at: http://eceweb1.rutgers.edu

Category:

more less

Transcript and Presenter's Notes

Title: Recovery

1
Recovery
Chapter 9, 10, 11.1-11.4 in Gray and Reuter

Adapted from slides by J. Gray A. Reuter

2
Failure Types

Transaction Failure
Transaction issues abort
System Failure
Volatile memory is corrupted
Media Failure
Stable storage is corrupted

3
Failure Model (Assumptions)

Failures can always be detected (Failstop)
System
Defensive programming
Error detecting codes (parity, checksums, etc.)
Media
Redundant information (e.g., description of block
in header)
Checksums

4
System Failure Recovery

Goal
At point of failure, history is H.
Recovery must restore DB to final state defined
by C(H).
All information needed to accomplish this must be
in stable storage

5
Normal (no failure) Transaction Execution

TM generates the TRID at Begin_Work().
Coordinates Commit,
RM joins work, generates log records, allows
commit

6
The Resource manager view

Boolean Prepare(LSN ) / invoked at ? 1. Return
vote on commit /
void Commit() / called at commit ?2 /
void Abort() / called at failed commit ? 2 or
abort /
void UNDO(LSN) / Undo the log record with this
LSN /
void REDO(LSN) / Redo the log record with this
LSN /
void TM_Startup(LSN) / TM restarting. Passes RM
ckpt LSN /
LSN Checkpoint(LSN low_water) / TM
checkpointing, Return RM ckpt LSN,
set low water LSN /

7
The Transaction Manager

Transaction rollback.
coordinates transaction rollback to a savepoint
or abort rollbacks can be initiated by any
participant.
Resource manager restart.
If an RM fails and restarts, TM presents
checkpoint anchor RM undo/redo log
System restart.
TM drives local RM recovery (like RM restart)
TM resolves any in-doubt distributed transactions
Media recovery.
TM helps RM reconstruct damaged objects by
providing
archive copies of object the log of object
since archived.
Node restart.
Transaction commit among independent TMs when a
TM fails.

8
When a Transaction Aborts

At transaction rollback
TM drives undo of each RM joined to the
transaction
Can be to savepoint 0 (abort) or partial rollback.

9
The Transaction Manager at Restart/Recovery

At restart, TM reading the log drives RM
recovery.
Single log scan.
Single resolver of transactions.
Multiple logs possible, but more complex/more
work.

10
Resource Manager ConceptsUndo Redo Protocol
11
Resource Manager Concepts Transaction UNDO
Protocol
declare cursor for transaction_log select
rmid, lsn / a cursor on the transaction's
log / from log / it returns the resource
manager name / where trid trid / and
record id (log sequence number) / descending
lsn / and returns records in LIFO
order / void transaction_undo(TRID trid) /
Undo the specified transaction. / int
sqlcode / event variables set by sql
/ open cursor transaction_log / open an
sql cursor on the trans log / while
(TRUE) / scan trans log backwards undo
each/ / fetch the next most recent log
rec / fetch transaction_log into rmid, lsn
/ / if (sqlcode ! 0) break
/ if no more, trans is undone, end loop /
rmid.undo(lsn) / tell RM to undo that
record / close cursor
transaction_log / Undo scan is complete, close
cursor / / return to caller /
12
Resource Manager Concepts Restart REDO Protocol
void log_redo(void) declare cursor for the_log
/ declare cursor from log start
forward / select rmid, lsn / gets RM id and
log record id (lsn) / from log / of all log
records. / ascending lsn / in FIFO
order / open cursor the_log / open an
sql cursor on the log table / while (TRUE) /
Scan log forward redo each record. / fetch
the_log into rmid, lsn / fetch the next log
record / if (sqlcode ! 0) break / if
no more, then all redone, end loop /
rmid.redo(lsn) / tell RM to redo that
record / close cursor the_log / Redo
scan complete, close cursor / /
return to caller /

Note REDO forwards, UNDO backwards

13
Idempotence

F(F(X)) F(X) Needed in case restart fails
(and restarts)
Redo(Redo(old_state,log), log)
Redo(new_state,log) new_state
Undo(Undo(new_state,log), log)
Undo(old_state,log) old_state

14
Testable State Can Tell If It Happened.

IF operation not idempotent AND state not
testable
THEN recovery is impossible

15
Kinds of Logging

Physical
Keep old and new value of container (page,
file,...)
Pro Simple
Allows recovery of physical object (e.g. broken
page)
Con Generates LOTS of log data
Logical
Keep call params such that you can compute F(x),
F-1(x)
Pro Sounds simple
Compact log.
Con Doesn't work (wrong failure model).
Operations do not fail cleanly.

16
Sample Physical LOG RECORD
struct compressed_log_record_for_page_update /
/ int opcode / opcode will
say compressed page update/ filename fname /
name of file that was updated / long pageno
/ page that was updated / long offset /
offset within page that was updated / long len
gth / length of field that was
updated / char old_valuelength / old
value of field / char new_valuelength
/ new value of field / /
/

Ordinary sequential insert is OK.
Update of sorted (B-tree) page
update LSN
update page space map
update pointer to record
insert record at correct spot (move 1/2 the
others)
Essentially writes whole page (old and new).
16KB log records for 100-byte updates.

17
Sample Physical LOG RECORD
struct logical_log_record_for_insert /
/ int opcode / opcode will
says insert / filename fname / name of
file that was updated / long length /
length of record that was updated / char record
length / value record / /
/

Very compact.
Implies page update(s) for record (may be many
pages long).
Implies index updates (may be many indices on
base table)

18
The trouble with Logical Logging

Logical logging needs to start UNDO/REDO with an
action-consistent state.
Partial Actions
If an action runs to completion, we can use
inverse action to UNDO operation.
What if action fails part of the way through? How
do we put system in consistent state.
for example insert (table, record)
ALL or NONE of the indices should be updated when
logical UNDO/REDO is invoked.
Action Consistency
After a system failure, the state of persistent
storage may not be action consistent.
How can we restore an action consistent state?

19
Making Logical Logging Work Shadows

Keep old copy of each page
Reset page to old copy at abort (no undo log)
Discard old copy at commit.
Handles all online failures due to
Logic e.g. duplicate key.
Limit ran out of space
Contention deadlock
Problem forces page locking, only one updater
per page.
What about restart?
Need to atomically write out all changed pages.

20
Making Logical Logging Work Shadows

Perform same shadow trick at disc level.
Keep shadow copy of old pages.
Write out new pages.
In one careful write, write out new page root.
Makes update atomic

21
Shadows

Pro Simple
Not such a bad deal with non-volatile ram
Con page locking
extra space
extra overhead (for page maps)
extra IO
declusters sequential data

22
Logical vs Physio-logical Logging
Note physical log records would be bigger for
sorted pages.
23
Physiological Logging Rules

Complex operations are a sequence of simple
operations on pages.
Each operation is constructed as a
mini-transaction
lock the object in exclusive mode
transform the object
generate an UNDO-REDO log record
record log LSN in object
unlock the object.
Action Consistent Object
When object semaphore free, no ops in progress.
Log-Consistency
Log contains log records of all complete page
actions.

24
Physiological Logging Rules - Online Operation

Each operation is structured as a
mini-transaction.
Each operation generates an UNDO record.
No page operation fails with the semaphore set.
(exception handler must clean up state and UNFIX
any pages).
Then Rollback can be physical to a page and
logical within page.

25
Physiological Logging Rules - Restart Operation

Need Page-Action consistent persistent state.
Pages are action consistent.
Committed actions can be redone from log.
Uncommitted actions can be undone from log.
WAL Write Ahead Log
Write undo/redo log records before overwriting
disk page
Only write action-consistent pages
Force-Log-At-Commit
Make transaction log records durable at commit.

26
WAL and Force at Commit

WAL Write Ahead Log
write page
get page semaphore
copy page to buffer
give page semaphore / avoids holding
semaphore during IO /
Force_log(Page(LSN)) /WAL logic, probably
already flushed/
Write buffer to disk.
WAL gives idempotence and testability.
Force-Log-At-Commit
At commit phase 1
Force_log(transaction.max_lsn)

27
The One Bit Resource Manager

Manages an array of transactional bits (the free
space bit map).
i get_bit() / gets a free bit and sets
it /
give_bit(i) / returns a free bit /

28
The Bitmap and Its Log Records

The Data Structure
struct / layout of the one-bit RM data
structure /
LSN lsn / page LSN for WAL
protocol /
xsemaphore sem / semaphore regulates
access to the page /
Boolean bitBITS / page.biti TRUE gt
biti is free /
page / allocates the page
structure /
The Log Records
struct / log record format for the one-bit RM /
int index / index of bit that was
updated /
Boolean value / new value of bitindex
/
log_rec / log record used
by the one-bit RM /
const int rec_size sizeof(log_rec) /size of
the log record body. /

29
Page and Log Consistency for 1-Bit RM

Data dirty if reflects an uncommitted transaction
update. Otherwise, data is clean.
Page Consistency
No clean free bit has been given to any
transaction.
Every clean busy bit was given to exactly one
transaction.
Dirty bits locked in X mode by updating
transactions.
The page.lsn reflects most recent log record for
page.
Log Consistency
Log contains a record for every completed
mini-transaction update to the page.

30
give_bit()

get_bit() give_bit(i) temporarily violate page
consistency.
Mini-transaction holds semaphore while violating
consistency.
Makes page log mutually consistent before
releasing sem.
each mini-transaction observes a consistent page
state.
void give_bit(int i) / free a bit /
if (LOCK_GRANTEDlock(i,LOCK_X,LOCK_LONG,0))
/ Lock bit /
Xsem_get(page.sem) / get page sem /
page.biti TRUE / free the bit /
log_rec.index i / generate log rec/
log_rec.value TRUE /saying bit is free/
page.lsn log_insert(log_rec,rec_size)
/write log recupdate lsn/
Xsem_give(page.sem) / page consistent/
else / if lock failed, caller doesn't own
bit, /
Abort_Work() / in that case abort
caller's trans /
return

31
get_bit()

int get_bit(void) / allocate a bit to and
returns bit index /
int i / loop variable /
Xsem_get(page.sem) / get the page
semaphore /
for ( i 0 iltBITS i) / loop looking for a
free bit /
if (page.biti) / if bit is free, may be
dirty (so locked) /
if (LOCK_GRANTED lock(i,LOCK_X,LOCK_LONG,
0))/ lock bit /
page.biti FALSE / got lock on it /
log_rec.value FALSE / generate log rec
describing update/
log_rec.index i
page.lsn log_insert(log_rec,rec_size) /
write log recupdatelsn /
Xsem_give(page.sem) / page now consistent,
give up sem/
return i / return to caller /
/ try next free bit, /
Xsem_give(page.sem) / if no free bits, give
up semaphore /
Abort_Work() / abort transaction /
return -1 / returns -1 if no bits are
available. /

32
Compensation Logging

Undo may generate a log record recording undo
step
Makes Page LSN monotonic

33
1-bit RM UNDO Callback

void undo(LSN lsn) / undo a one-bit RM
operation /
int i / bit index /
Boolean value / old bit value from log rec to
be undone/
log_rec_header header / buffer to hold log
record header /
rec_size log_read_lsn(lsn,header,0,log_rec,big)
/ read log rec /
Xsem_get(page.sem) / get the page
semaphore /
i log_rec.index / get bit index from log
record /
value ! log_rec.value / get complement of
new bit value/
page.biti value / update bit to old
value /
log_rec.value value / make a compensation
log record /
page.lsn log_insert(log_rec,rec_size) /
log it and bump page lsn/
Xsem_give(page.sem) / free the page
semaphore /
return

34
1-bit RM REDO Callback

void redo( LSN lsn) / redo an free space
operation /
int i / bit index /
Boolean value / new bit value from log rec to
be redone/
log_rec_header header / buffer to hold log
record header /
rec_size log_read_lsn(lsn,header,0,log_rec,big)
/ read log record /
i log_rec.index / Get bit index /
lock(i,LOCK_X,LOCK_LONG,0) / get lock on the
bit (often not needed) /
Xsem_get(page.sem) / get the page
semaphore /
if (page.lsn lt lsn) / if bit version older
than log record /
value log_rec.value / then redo the op.
get new bit value /
page.biti value / apply new bit value to
bit /
page.lsn lsn / advance the page lsn /
Xsem_give(page.sem) / free the page
semaphore /
return

35
1-BIT Rm Noise Callbacks

Boolean prepare(LSN lsn) / 1-bit RM has no
phase 1 work /
lsn NULLlsn return TRUE / /
void Commit(void ) / Commit release locks
/
unlock_class(LOCK_LONG, TRUE, MyRMID()) /
return /
void Abort(void ) / Abort release all locks
/
unlock_class(LOCK_LONG, TRUE, MyRMID()) /
return /
Boolean savepoint((LSN lsn) / no work to do
at savepoint /
lsn NULLlsn return TRUE
void UNDO_savepoint(LSN lsn) / rollback work or
abort transaction/
if (savepoint 0) / if at savepoint zero
(abort) /
unlock_class(LOCK_LONG, TRUE, MyRMID()) /
release all locks/

36
Summary

Model Complex actions are a page action
sequence.
LSN Each page carries an LSN and a semaphore.
ReadFix Read acquires semaphore in shared mode.
WriteFix Update actions (1) get semaphore in
exclusive mode, (2) generate one or more log
records covering the page, (3) advance the page
LSN to match highest LSN (4) give semaphore
WAL log_flush(page.LSN) before overwriting
persistent page
FORCE AT COMMIT force all log records up to the
commit LSN at commit
Compensation Logging Invalidate undone log
record with a compensating log record.
Idempotence via LSN page LSN makes REDO
idempotent

37
Two Phase Commit

Getting two or more logs to agree
Getting two or more RMs to agree
Atomically and Durably
Even in case one of them fails and restarts.
The TM phases
Prepare. Invoke each joined RM asking for its
vote.
Decide. If all vote yes, durably write commit
log record.
Commit. Invoke each joined RM, telling it commit
decision.
Complete. Write commit completion when all RM
ACK.

38
Centralized Case of Two Phase Commit

Each participant (TM RM) goes through a
sequence of states

39
Transitions in Case of Restart
Active state not persistent, others are
persistent For both TM and RM. Log records make
them persistent (redo) TM tries to drive states
to the right. (to committed, aborted)
Prepared
Committing
Committed
Active
Null
Aborted
Aborting
40
Successful two phase commit

Call flow from TM to each RM joined to
transaction
If TM and RM share the same log,
the RM FORCE can piggyback on the TM FORCE
One IO to commit a transaction (less if commit is
grouped)

41
Abort Two Phase Commit

If RM sends "NO" or no response (timeout), TM
starts abort.
Calls UNDO of each trans log record
May stop at a savepoint.
At begin_trans it calls ABORT() callback of each
joined RM

42
Full Transaction State Diagram
43
CHECKPOINTING

Commit consistent checkpoints
Stop admitting new transactions and wait until
all active transactions complete (abort or
commit)
Flush all dirty cache slots
Write checkpoint record to log
During recovery, begin forward scan at last
checkpoint record.
After last checkpoint, every element in DB
contained its last committed value.
If an element does not contain its last committed
value, it must have been updated after the
checkpoint

44
Fuzzy Checkpointing

Commit consistent checkpoint has two drawbacks
A lot of disk I/O is needed (1000 pages _at_
5mS/page 5sec)
Must wait until all active transactions terminate
(2 sec.)
Protocol
Stop processing new operations- Wait until all
active ones complete.
Flush every cache slot that has not been flushed
since last checkpoint. Stable-LSN lt
checkpoint-LSN.
Update stable-LSN of all buffers flushed.
Write checkpoint record including-
Active transaction list
List of data items and stable-LSNs of all dirty
slots

45
Restart Algorithm

Locate Penultimate Checkpoint (the checkpoint
preceding the last one)
Add all transactions in checkpoint record to
active transaction list.
Forward scan of log (starting at penultimate
checkpoint)
Call rm_redo() for each log record
On BEGIN_TRANSACTION log record add transaction
to active transaction list
On COMMIT or ABORT log record, remove transaction
from active transaction list.

46
Restart Algorithm (Cont.)

For each transaction in active transaction list
Retreive the transactions log records in reverse
order (last record of transaction retreived
first)
For each log record, call rm_undo()

47
Simple Recovery Method (Bernstein)

Restart Algorithm
redone undone ?
Scan log from last record to first. For each log
record Ti, x, vbefore, vafter do
If x ? (redone ? undone) then
If Ti is committed then
restore xs cache slot to vafter
redone redone ? x
Otherwise
restore xs cache slot to vbefore
undone undone ? x

48
Simple Recovery

Assumptions
Strict 2PL - locking at granularity of page
Before and after-image are complete pages
Every element x is restored to its last committed
value by Restart.
If last update to x is by a committed
transaction, the value it wrote is restored and
no further changes are made. (x ? redone)
If last update to x is by aborted or active
transaction Ti, the before image of x wrt Ti is
restored and no further change to x is made. (x ?
undone)
Because histories are strict, this value was
written by the last transaction to commit and
write x.

49
Record Level Locking

History is not strict with respect to pages
anymore. It is, however, strict with respect to
individual tuples.
A single page LSN is not enough. Consider the
following example
r1, r2, and r3 are records on the same page.
Log wir1 ci wjr2 wkr3 ck aj
Page LSN 1 3 4
?
The abort of Tj is processed by restoring the
before image of r2 in the page.
What should we do to the LSN?
If we leave it at 4 and a system failure occurs,
what will happen?
If we set it to 3 what happens?

50
Record Level Locking

Need to log the undo operations we apply.
This way, the undo in the history above will
update the LSN of the page to 6, the LSN of the
compensation record.
Log wir1 ci wjr2 wkr3 ck undo(wjr2)
Page LSN 1 3 4 6
Recovery when page-LSN is 1
Redo log records 3, 4 and 6
Undo log records 6 and 3
Add records 7 and 8 to end of log
undo(undo(wjr2)) and undo(wjr2) - page LSN
8.

51
Record Level Locking (Cont.)

Recovery when page-LSN is 4
Redo log record 6.
Undo log records 6 and 3
Add records 7 and 8 to end of log
undo(undo(wjr2)) and undo(wjr2) - page LSN
8.
Recovery when page-LSN is 6
Redo nothing
Undo log records 6 and 3
Add records 7 and 8 to end of log
undo(undo(wjr2)) and undo(wjr2) - page LSN
8.

52
Record Level Locking (Cont.)

Recovery when page-LSN is 7 (recovery fails)
Log wir1 ci wjr2 wkr3 ck
undo(wjr2) wjr2) undo(wjr2)
Page LSN 1 3 4 6
7 8
Redo log record 8
Undo log records 8, 7, 6 and 3.
Add records 9, 10, 11 and 12 to end of log
undo(undo(wjr2)), undo(wjr2),
undo(undo(wjr2)) and undo(wjr2) - page LSN
12.

53
ARIES

Source
C. Mohan, et.al., ARIES A Transaction Recovery
Method Supporting Fine-Granularity Locking and
Partial Rollbacks Using Write-Ahead Logging in
ACM Transactions on Database Systems, Vol. 17,
No. 1, March 1992.
Context
ARIES is an integrated set of recovery,
concurrency control, and indexing methods.
Explains state of the practice for Transaction
Processing. In some ways superior to academic
work in TP.
Concurrency control, buffering, and logging
assumptions are similar to what we have been
discussing.

54
Issues and Goals

Types of logging
Value logging Physical logging
Page oriented logging Physiological logging
Operation logging Logical logging
Buffer management
Force policy - Force updates of transactions
before they commit. (No REDO required)
Steal policy - Buffer manager can steal pages
dirtied by uncommitted transactions. (requires
UNDO)
STEAL / NO FORCE
Most desirable as it imposes few constraints on
BM.
This is model we follow in project.

55
Commutativity-based Concurrency Control

Definitions of conflicting operations can be
generalized based on the the notion of
commutativity
Operations p and q conflict if they do not
commute
Different definitions of commutativity are
possible.
Operations p and q commute with respect to ? if
for any two states st1 ? st2, the values returned
by p and q are the same and the state after
executing p q beginning in state st1 is
equivalent (?) to the state after executing q p
beginning in state st2.
If we are interested in states of bank accounts
where two states are equivalent if the bank
account has the same balance
debit operations (that do not return the current
balance commute)
credit operations do not commute in general
Other examples of commuting operations are
increment / decrement (assuming signed numbers
with no overflow)

56
Failure Assumptions

Transaction / System / Media
Process failure
What is it?
Can it be handled using one of the other three
approaches?

57
Record Keeping

Transaction table
TRID
State of transaction (prepared or unprepared)
LastLSN - Last log record written by transaction
UndoNxtLSN - LSN of next record to process during
rollback
Dirty page table
PageID
RecLSN - Beginning of updates not yet applied to
page (How is this maintained?)

58
Normal Processing

Updates (order of events)
Order of events
acquire lock (exclusive)
acquire latch (exclusive)
perform update
insert log record
update LSN in page being updated
unlatch page
unlock record

59
Normal Processing

Discussion
Assume that lock names are (page, directory
offset).
How can we determine the name of the lock we
need?
What happens if we change the order, and latch
first and then lock?

60
Rollback

Update records
Fix page
Undo log record
Write compensation record
Update page LSN (LSN of compensation record)
Update transaction table last LSN
Process PrevLSN next
Compensation records
Nothing to undo
Process UndoNxtLSN next

61
Checkpoints

Done asynchronously
Write begin_chkpt record
Write contents of transaction table
Write dirty page table to log file
Write end_chkpt record
Update checkpoint LSN in master record

62
Restart Processing

Analysis pass
Forward scan of log starting at begin_chkpt
Process end_chkpt
transaction table and dirty page table entries
Process other log records
If transaction not in transaction table, add it.
If page is not in dirty page table add it
Return RedoLSN minimum RecLSN of all entries in
dirty page table.
Checkpoints are asynchronous and so we may have
transactions executing during a checkpoint.
How are effects of transactions that are active
during checkpoint accounted for?

63
REDO Pass

Begin forward scan at RedoLSN
Process update and compensation records
PageID is dirty and RecLSN lt Log.LSN
Page.LSN lt Log.LSN
Redo log record
update Page.LSN
Page.LSN gt Log.LSN
RecLSN Page.LSN 1

64
UNDO Pass

Find last log record of all active transactions.
Update record
Undo update
Write compensation record
update page LSN (LSN of compensation record)
UndoNxtLSN LogRec.PrevLSN
If no more records for TX
Write end_transaction record
Delete transaction from transaction table

65
UNDO Pass (Cont.)

Compensation records
UndoNxtLSN LogRec.UndoNxtLSN
(Skip over intermediate records from same
transaction.)
Recovery Example (from paper)

66
Nested Top Actions

Recovery in a B-tree
Need to log page splits so that they can be
redone if a failure occurs before we finish.
Page allocation
Copy some records to new page
Post the new key-pointer pair to parent.
Once index is restructured, release locks on the
pages that are modified (hold lock on key range)

67
Nested Top Actions (Cont.)

Dont want to undo split if transaction that
caused split aborts.
It will probably be needed again in the near
future.
Another example of problem is page allocation
Undo of allocation may result in committed
transactions work being undone. Why?

68
Requirements

Actions of incomplete Nested Top Actions (NTA)
are redone regardless of transaction outcome.
Incomplete NTAs are undone regardless of outcome.
Complete NTAs are not undone even if transaction
fails. Inverse operations (e.g., deallocation)
will be invoked when necessary.

69
Aries Solution

Write to mark beginning of nested top action.
At end add a CLR which points to log record
just before .
During forward pass the work will be redone.
During backward pass if is not found, the
work will be undone.
If is found, undo pass will skip operations
in parenthesis.

70
Discussion

What is the advantage of the analyis pass?
There is a lot of effort spent to justify
replaying history.
What is the disadvantage?
What advantages does it provide?
If the recovery method in your project provided
NTAs, how would you use them?
Could you use redo-only log records for the same
purpose?

71
TPC Benchmark

Goal- Provide a method of characterizing
performance of OLTP systems
Accounts for hardware/software costs over 5 year
period
Report both transactions per second (tps) and
/tps
Faster systems (measured in tps) operate with
larger databases.

72
TPC

Models large Bank with multiple branches
Based on single Deposit transaction
UPDATE Account SET Abalance Abalance delta
WHERE Aid Aid
SELECT Abalance INTO Abalance
FROM accounts
WHERE Aid Aid
UPDATE Tellers SET Tbalance Tbalance delta
WHERE Tid Tid
UPDATE Branches SET Bbalance Bbalance delta
WHERE Bid Bid
INSERT INTO history(Tid, Bid, Aid, delta, time)
VALUES (Tid, Bid, Aid, delta, CURRRENT)

73
Scaling

For every tps of the system under test, there are
at least
1 branch records
10 teller records
100,000 account records
What is purpose of this scaling?
Does it favor systems with low tps?

74
Project

Test case for project
Initialization - (dbs_tpc_init.c) Create four
tables with index on primary key.
Workload - (dbs_tpc.c)
multiple clients (2 - 5)
Each executes 20 transactions and quits
Verification - (dbs_tpc_test.c)
Prints each branch balance, sum of all teller
balances in each branch, sum of all account
balances for each branch.
If ACID properties are working, balances should
be same.
Testing
Run multiple client verify
Run multiple clients kill RM perform recovery
verify

75
Discussion

What throughput can you achieve with page-level
locking?
What locks must be acquired?
What is probability of lock conflict?
How long will conflict persist?
What throughput would you expect with tuple-level
locking?

76
Newer TPC Benchmarks

More info at http//www-europe.sgi.com/Technology/
tpc.html or http//www.tpc.org
TPC-C
Models warehouse distribution system
Nine tables representing warehouse, district,
customer, order, stock, and history information.
Five transaction types - creating a new order,
making a payment to a customer's balance,
determining the status of an order, the delivery
of orders, and determining the current stock
level
Scaling by warehouse
A 10,000 tpmC requires about half a terabyte of
on-line storage

77
Newer TPC Benchmarks

More info at http//www.tpc.org
Newer benchmarks
TPC-H - Ad-hoc Decision support
TPC-R - Business Reporting Decision support
TPC-W - Transactional WEB benchmark
e-commerce and business to business intranet

78
Log Manager

Concept
Log is a history of all changes to the state.
Log old state gives new state
Log new state gives old state
Log is a sequential file.
Complete log is the complete history
Current state is just a "cache" of the log
records.
Performance
What kind of log performance is important for
High throughput?
Low response time?

79
The Log Table

Log table is a sequential set (relation).
Log Records have standard part and then a log
body.
Often want to query table via one attribute or
another
RMID, TRID, timestamp
create domain LSN unsigned integer(64) -- log
sequence number (file , rba)
create domain RMID unsigned integer -- resource
manager identifier
create domain TRID char(12) -- transaction
identifier
create table log_table (
lsn LSN, -- the records log sequence number
prev_lsn LSN, -- the lsn of the previous
record in log
timestamp TIMESTAMP, -- time log record was
created
resource_manager RMID, -- resource mgr that
wrote this record
trid TRID, -- id of transaction that wrote
this record
tran_prev_lsn LSN, -- prev log record of this
transaction (or 0)
body varchar, -- log data rm understands
it
primary key (lsn) -- lsn is primary key
foreign key (prev_lsn) -- previous log record
in this table
references a_log_table(lsn), --

80
Log is complete history
B files
A files
Log Table
Log Anchor
lsn
prev_lsn
resource_mgr
trid
tran_prev_lsn

body
Archive

Log anchor points at chain of each transaction.
May maintain other chains.
Log records map to sequence of N-plexed files
Old files are archived.
Eventually, archive files are discarded (weeks,
months, never)

81
The Log LSN

Each log record has a logical sequence number.
This number (LSN for Log Sequence Number) plays a
key role in many algorithms.
Key property - If action A happened after action
B then
LSN(A) gt LSN(B).

82
Accessing the Log

Read a log record
long log_read_lsn( LSN lsn, / lsn of record
to be read /
log_record_header header, / header of
record to be read /
long offset, / offset into body to
start read /
pointer buffer, / buffer to receive log
data /
long n) / length of buffer /
Find current end of log
LSN log_max_lsn(void) / returns the current
maximum lsn of the log table./
Insert Log record - Log manager fills in header.
LSN log_insert( char buffer, long n) / log
body is buffer0..n-1 /
Force Log - up to a certain LSN to persistent
storage
LSN log_flush( LSN lsn, Boolean lazy) / lazy
waits for a batch write or
timeout boxcar) /

83
Accessing the Log (Cont.)

Many real interfaces allow some of
empty buffer to allow RM to fill it in (avoids
data copies)
gather take log data from many buffers.
Few offer SQL access to the log.

84
Summary of Log Structure
B file
A file
durable
storage
log page header
Log pages
in buffer pool
empty page in
Pages written in next write
buffer pool
current end of log
end of
header
durable
Log Table
body
log
85
Log Anchor Logging and Locking
typedef struct filename tablename / name
of log table / struct log_files/ A
B file prefix names active file
/ xsemaphore lock / semaphore
regulates log write / LSN prev_lsn / LSN
of most recent write / LSN lsn
/ LSN of next record / LSN
durable_lsn / max lsn in durable
storage / LSN TM_anchor_lsn / lsn of
trans mgr's last ckpt / struct / array of
open log parts / long partno /
partition number / int
os_fnum / operating system file /
part MAXOPENS / /
log_anchor / /

Log records never updated only inserted and read
- So no locks needed on log.
Semaphore (or something) needed on "end" of log
to manage space/growth/LSN for inserts

86
Log Insert

Log semaphore covers
Incrementing LSN
Finding the log end
filling in the page(s)
allocating space on a page, perhaps allocating
new pages.
Log write Daemons
Log Semaphore can be a hotspot so No IO under
semaphore
Flush to persistent storage (disc) is done
asynchronously.
Demons driven by timers and by events (requests)
Demons need not touch end-of-log semaphore

87
Group Commit (Boxcaring)

Batch processing of log writes.
Forcing log buffer each time requested may write
many incomplete buffers
Most efficient to write large buffers to log
Must delay some log_flush() requests until buffer
is full.
Example - Boxcaring
Modern SCSI disks have highest write bandwidth
around 64KB block size
Transfers per second is lt 15ms. (10ms for seek
and rotation)
Each block contains 325 log records of 200 bytes
each.
TPC benchmark generates around four records per
transaction.
This will allow 5416 TPS
If receive 1,000 log force requests/second why
not just execute 50 of them?
Response time will be the same (20ms).
IOs will be 20x fewer
CPU will be 10x smaller (10x fewer dispatches,
20x fewer OS IO).
Without it, systems are limited to about
50tps no ping-pong
100tps ping-pong.
With it, systems are limited to disc bandwidth
gtgt10ktps.

88
Example - Synchronous write per flush

Perform log write every time a log_flush() is
received.
Transfer small blocks (4 200 bytes) requires
10ms.
Each block holds log records for one transaction.
This allows about 100 TPS

89
WADS- Giving the Log Disc Zero Latency

Log disc is dedicated, so only has rotational
latency.
Reserve some cylinders on the disc as scratch.
For each write
Write at current position on next track (zero
latency).
When have a full-track (or two) of log data
consolidate the write in ram
do a single LARGE write (100KB 1 rotation) to
the log.
cost of this is seek rotation 20ms.
Reserved area is called Write Ahead Data Set
(WADS).
At restart
read cylinders
gather recent log data
rewrite end of log.

90
Synchronization on B-Trees What Is the Problem?

B-Trees are fully redundant structures, which can
be reconstructed from the tuples therefore, no
synchronization should be required at all.
However, some queries operate on the index only.
This requires all operations on B-trees to be
serializable with the operations on the tuples.
Standard two-phase locking with the nodes as the
objects is not feasible for performance reasons.

91
Protecting Tree Traversal
1. semaphore on Q
Node Q at level i
search path
2. follow search path
3. semaphore on R
Node R at level i1
4. release sem. on Q
92
B-Trees and Value Locks
B-tree
14 P1 P2 16 P3 17 P4 P5 P6 20 P7 22 P8 P9 P10
93
Making Lock Names

To implement value locking, we need to build lock
names according to the following rule
LockN TableName, IndexName, KeyValue.
KeyValue in turn is a composite
KeyValue AttributeValue, TupleIdentifier.

94
Key Range Locking on B-Trees

Details
Allow for duplicate key values by modeling a
tuple as ltc, kgt where c is the key value and k is
a tuple pointer.
A retreival of c1, c2 retreives all tuples
ltc1, kmingt, ltc2, kmaxgt.
Insertions and deletions apply to a single tuple
ltc1, kigt.
retrieval(c1?? K ? c2) (Range query)
retrieval(k c) is special case of this -
retrieval(c ?? K ? c)
1. Get shared semaphore on first leaf page that
contains the range.
2. Get shared locks on all existing key values
c c0 ? c ? c2 in sequence, where c0 is
largest key such that c0 ? c1
A key lock on key value c1 protects range of keys
from ltc1, kmingt, ltc2, kmingt) where c2 is next
key value after c1.
3. Lock tuple in shared mode.

95
Key Range Locking on B-Trees

4. Use coupling technique when acquiring
semaphores
5. Locks are held until commit. Semaphores are
released earlier.
insert(ltK, tidgt)
1. Get exclusive semaphore on leaf page.
2. Get an instant IX lock on c1 where c1 is
largest key value less than or equal to K.
This ensures that no search has started using the
rangeltc1, kmingt, ltc2, kmingt) which includes the
key being inserted.
Multiple insertions into ltc1, kmingt, ltc2, kmingt)
can proceed concurrently.
3. If c1 ? K get an exclusive lock on K
This protects the new key range until the
transaction commits.
4. Get X Lock on tuple.

96
Key Range Locking on B-Trees

delete(ltK, tidgt)
1. Get exclusive semaphore on leaf page.
2. Let ltc1, kdgt be largest index entry ? ltK,
tidgt.
If c1 K, get an IX lock on K.
This does not change the key values that exist in
the relation. An ix lock is enough to discourage
a search.
If c1 lt K, get an X lock on c1 and K.
In this case, the number of key ranges is
reduced. The x lock on K forces searches for K to
wait, and an X lock on c1 will force an insertion
into this range to be delayed.
3. Get X lock on tuple.
4. Hold lock(s) until commit

97
B-Tree Recovery
98
B-Tree Recovery Based on Physiological Logging

Cover all B-tree operations with semaphores on
all affected pages.
For each logical update a log record with the
logical UNDO (only) operation must be moved to
the log
While the update operation is being performed,
physical REDO (only) log records are written.
After all REDO records are safely in the log, the
exclusive semaphores can be released.

99
The Two Phases of B-Tree-Recovery

Phase1 Go forward through the log up to its
current end, applying all REDO records to the
tree.
Phase2 Go backward to the Begin of transaction
record of the oldest incomplete transaction,
executing the UNDO operations on the tree for all
losers along the way.

Recovery - PowerPoint PPT Presentation

Recovery

Recovery Chapter 9, 10, 11.1-11.4 in Gray and Reuter Adapted from s by J. Gray & A. Reuter – PowerPoint PPT presentation