1
Concurrency Control On Inverted Lists
CS223 Transaction Processing and Distributed
Data Management
  • Alexander Behm
  • University of California, Irvine
  • Instructor Prof. Sharad Mehrotra
  • Based on BMV96, KAM96, MOH93
  • (see references for details)

2
Overview
  • Introduction to Inverted Lists
  • Transactions for Inverted Lists
  • GOLD Text Indexing Engine
  • Summary
  • ARIES/LHS
  • References

3
Introduction to Inverted Lists
Suppose we have a set of documents with keywords
Doc1 Keywords: Transaction, Concurrency
Doc2 Keywords: Serializability, Transaction
Doc3 Keywords: Database, Transaction
  • How can we efficiently do keyword queries? Such as:
  • Get documents with keyword1 AND keyword2 AND ...
  • Get documents with keyword1 OR keyword2 OR ...
  • Get documents with keyword1 AND keyword2 AND NOT ...
  • Etc.

4
Introduction to Inverted Lists
A popular solution in Information Retrieval (IR): create an inverted list index
Index on keywords → inverted lists (containing document IDs)
  • Database: 1 3 5 6
  • Transaction: 1 2 5
  • Concurrency: 2 7 8
  • Serializability: 3 4 7 8
  • Phantom: 5 7 8

In essence: for each keyword, keep a list of documents that contain it
5
Introduction to Inverted Lists
How can we answer queries?
Get documents with Database AND Transaction AND Phantom
  • Database: 1 3 5 6
  • Transaction: 1 2 5
  • Concurrency: 2 7 8
  • Serializability: 3 4 7 8
  • Phantom: 5 7 8
Create the set intersection of the Database, Transaction, and Phantom lists: 5
DocID 5 is a result!

Other operations are modeled in similar fashion, e.g. OR by set union, AND NOT by set difference, etc.
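
The mapping from Boolean queries to set operations can be sketched as follows. This is a minimal illustration that reuses the inverted lists from the slide; the function names `query_and`, `query_or`, and `query_and_not` are hypothetical and not taken from any of the cited papers.

```python
# Inverted lists from the slide: keyword -> set of document IDs.
index = {
    "Database": {1, 3, 5, 6},
    "Transaction": {1, 2, 5},
    "Concurrency": {2, 7, 8},
    "Serializability": {3, 4, 7, 8},
    "Phantom": {5, 7, 8},
}

def query_and(index, keywords):
    """AND: intersect the inverted lists of all keywords."""
    lists = [index.get(k, set()) for k in keywords]
    return set.intersection(*lists) if lists else set()

def query_or(index, keywords):
    """OR: union of the inverted lists of all keywords."""
    return set().union(*(index.get(k, set()) for k in keywords))

def query_and_not(index, include, exclude):
    """AND NOT: set difference between the AND result and the excluded lists."""
    return query_and(index, include) - query_or(index, exclude)

print(sorted(query_and(index, ["Database", "Transaction", "Phantom"])))  # [5]
print(sorted(query_or(index, ["Concurrency", "Phantom"])))               # [2, 5, 7, 8]
print(sorted(query_and_not(index, ["Database"], ["Transaction"])))       # [3, 6]
```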
6
Inverted Lists Good and Bad
  • Good
  • when answering queries only look at inverted
    lists
  • only documents matching query are retrieved
  • if inverted lists are sorted, union, intersection,
    etc. can be implemented efficiently
  • other applications in exact and fuzzy string
    matching
  • Bad
  • inverted list structure can become very large
  • updates may need to modify several inverted
    lists at a time, i.e. updates are expensive
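
Why sorted lists help can be seen in a merge-style intersection, which visits each list once. This is a generic sketch of the standard technique, not code from the cited papers; `intersect_sorted` is a hypothetical name.

```python
def intersect_sorted(a, b):
    """Intersect two ascending lists of doc IDs in O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])   # doc ID present in both lists
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1             # advance the list with the smaller head
        else:
            j += 1
    return out

print(intersect_sorted([1, 3, 5, 6], [1, 2, 5]))  # [1, 5]
```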

7
Transactions for Inverted Lists
Characteristics of Information Retrieval Systems
  • For some IR systems being up to date is not
    critical, can be read-only
  • E.g. an online shop where new products are
    (typically) not added every minute
  • Read-only systems perform updates offline, e.g.
    at night in a batch
  • No concurrency control needed, no transactions
    needed
  • For other systems being up to date is critical,
    e.g. news systems
  • Most relevant documents to a query may be most
    recent ones
  • Updates may be frequent (but typically still
    less frequent than reads)
  • Concurrency control needed, e.g. model queries
    as transactions

8
Transactions for Inverted Lists
What about traditional CC mechanisms, e.g. 2PL?
Consider adding this document: Doc9, Keywords: Database, Concurrency, Phantom
Timeline under 2PL: 2PL acquires long-term locks → perform updates → 2PL releases locks
Inverted lists in the figure:
  • Database: 1 3 5 6
  • Transaction: 1 2 5
  • Concurrency: 2 7 8
  • Serializability: 3 4 7 8
  • Phantom: 5 7 8
A read query accessing Database AND Serializability must wait for the whole update to complete → BAD
9
Transactions for Inverted Lists
Observations by KAM96
  • Write-Write conflicts
  • most write operations are actually appends, i.e.
    we append a new docID to some lists
  • (we discuss deletions/updates later)
  • appends are idempotent and commute (consider
    inverted list as a set of docIDs)
  • Read-Write conflicts
  • write operations do not depend on reads, i.e.
    transactions are write only or read only
  • read operations do not need to wait until whole
    update has completed
  • need only wait until the conflicting lists have
    been updated
  • may read each list directly after conflicting
    update has completed
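
The claim that appends are idempotent and commute can be checked directly when an inverted list is modeled as a set of docIDs. The sketch below is only an illustration of that observation; `append_doc` is a hypothetical helper.

```python
from itertools import permutations

def append_doc(inverted_list, doc_id):
    """Append modeled as set insertion; returns a new (frozen) set."""
    return inverted_list | {doc_id}

start = frozenset({2, 7, 8})   # the Concurrency list from the slides
appends = [9, 10, 9]           # docID 9 is appended twice on purpose

# Apply the appends in every possible order and collect the final lists.
results = set()
for order in permutations(appends):
    current = start
    for doc_id in order:
        current = append_doc(current, doc_id)
    results.add(current)

print(len(results))  # 1: every order (and the repeated append) gives the same list
```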

10
Transactions for Inverted Lists
I. Using latches: example
T1: w(Database), w(Transaction)
T2: r(Database), r(Concurrency)
Schedule: w1(Database), r2(Database), r2(Concurrency), w1(Transaction)
Latch sequence in the figure: T1 locks, T1 releases, T2 locks, T2 releases, T2 locks
Inverted lists in the figure:
  • Database: 1 3 5 6
  • Transaction: 1 2 5
  • Concurrency: 2 7 8
  • Serializability: 3 4 7 8
  • Phantom: 5 7 8


11
Transactions for Inverted Lists
II. Keeping track of read-write dependencies
T1: add docID 9 with keywords Database, Concurrency
T2: get docs containing Database AND Concurrency
Since we are using latches and not e.g. 2PL, the following schedule is possible:
S: w1(Database), r2(Database), r2(Concurrency), w1(Concurrency)
We miss docID 9 because it has yet to be added to the Concurrency list.
Results of T2 are still correct, meaning no false answers are returned.
However, we miss the most recent (and possibly most relevant) document!
To meet the recency requirement we track dependencies for read transactions.
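
The schedule on this slide can be replayed in a few lines. This is a single-threaded sketch under the assumption that a latch is a short-term lock held for exactly one list operation; the helpers `write` and `read` are hypothetical, and the lists are the ones from the earlier slides.

```python
import threading

lists = {"Database": {1, 3, 5, 6}, "Concurrency": {2, 7, 8}}
latches = {keyword: threading.Lock() for keyword in lists}

def write(keyword, doc_id):
    with latches[keyword]:            # latch released right after this one append
        lists[keyword].add(doc_id)

def read(keyword):
    with latches[keyword]:            # latch released right after this one read
        return set(lists[keyword])

# Replay: w1(Database), r2(Database), r2(Concurrency), w1(Concurrency)
write("Database", 9)
t2_database = read("Database")
t2_concurrency = read("Concurrency")
write("Concurrency", 9)

early = t2_database & t2_concurrency           # what T2 returns
late = read("Database") & read("Concurrency")  # same query after T1 has finished

print(early)  # set(): no false answers, but docID 9 is missed
print(late)   # {9}
```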
12
Transactions for Inverted Lists
III. Operation reordering: example
T1: w(x), w(y), w(z)
T2: w(a), w(b), w(z)
T3: r(z), r(b), r(m)
  • Say we first scheduled T1 and T2
  • S: w1(x), w1(y), w1(z), w2(a), w2(b), w2(z)
  • Now T3 arrives at the scheduler
  • We reorder operations to minimize the wait for T3
  • S: w1(z), w2(z), r3(z), w2(b), r3(b), r3(m),
    w1(x), w1(y), w2(a)
  • Why can we do this?
  • We use latches, so locks are released immediately
  • Writes are independent of reads, and each
    transaction writes a data item no more than once
  • ⇒ order of execution for writes can be chosen
    arbitrarily (assuming appends)
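
One way to picture the reordering step: for each item the reader needs, move the pending writes on that item to the front, then issue the read, and leave the non-conflicting writes for last. The sketch below is an illustrative interpretation, not the scheduler from KAM96; `reorder` and its arguments are hypothetical names.

```python
def reorder(pending_writes, reader, reads):
    """pending_writes: list of (txn, item) in their original order.
    reads: items the reader accesses, in order.
    Returns a schedule as strings such as 'w1(z)' or 'r3(z)'."""
    remaining = list(pending_writes)
    schedule = []
    for item in reads:
        # Conflicting writes on this item go first, then the read.
        schedule += [f"w{t}({x})" for t, x in remaining if x == item]
        remaining = [(t, x) for t, x in remaining if x != item]
        schedule.append(f"r{reader}({item})")
    # Writes that no read depends on are executed last.
    schedule += [f"w{t}({x})" for t, x in remaining]
    return schedule

pending = [(1, "x"), (1, "y"), (1, "z"), (2, "a"), (2, "b"), (2, "z")]
print(", ".join(reorder(pending, 3, ["z", "b", "m"])))
# w1(z), w2(z), r3(z), w2(b), r3(b), r3(m), w1(x), w1(y), w2(a)
```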

13
Data structures
  • T.ACCESS: inverted lists that need to be locked by transaction T
  • ACTIVE: list of active update transactions
  • T.CONFLICT: list of conflicting active update transactions
Flow for a transaction T
  • Determine T.ACCESS
  • Update? Y: add T to ACTIVE. N: fill T.CONFLICT using ACTIVE
  • Reorder operations
  • Request locks in T.ACCESS
Lock request (two variants in the figure)
  • If granted list is empty: issue read lock. Else: put on wait queue
  • Safe check against conflict list (and processed list). If safe AND NOT granted to write: issue read lock. Else: put on wait queue
Unlock request (two variants in the figure)
  • Add transaction ID to PROCESSED. Release lock. If wait queue NOT empty: grant lock to next T in wait queue
  • Release lock. If wait queue not empty: grant lock to next transaction in queue
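
A possible reading of the conflict list and the safe check is sketched below. It assumes that a read of a list is safe once every conflicting update transaction that touches that list appears in the list's processed set, and that no writer currently holds the list. That rule is an interpretation of the flowchart, and all names (`conflict_set`, `read_is_safe`, `processed`, `write_locked`) are hypothetical.

```python
def conflict_set(reader_access, active):
    """T.CONFLICT: active update transactions sharing a list with the reader.
    active maps an update transaction to the set of lists it accesses."""
    return {txn for txn, access in active.items() if access & reader_access}

def read_is_safe(keyword, conflict, active, processed, write_locked):
    """Assumed safe check for one list: no writer holds it, and every
    conflicting update that touches it has already processed it."""
    waiting_for = {txn for txn in conflict if keyword in active[txn]}
    return keyword not in write_locked and waiting_for <= processed.get(keyword, set())

# T1 adds docID 9 to Database and Concurrency; so far it has updated only Database.
active = {"T1": {"Database", "Concurrency"}}
processed = {"Database": {"T1"}}
write_locked = set()

conflict = conflict_set({"Database", "Concurrency"}, active)
print(read_is_safe("Database", conflict, active, processed, write_locked))     # True
print(read_is_safe("Concurrency", conflict, active, processed, write_locked))  # False
```

With this check the reader waits only on the Concurrency list, not on the whole update.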
14
Notes/Thoughts
  • Deletions/Updates
  • Deletions can be handled by the proposed CC
    mechanism
  • Updates can be seen as collection of appends and
    deletions
  • Reducing locking overhead
  • Main property: an inverted list is accessed at
    most once by a transaction
  • Appends can be aggregated into mini-batches
  • Deletions can be memorized in a purged list
  • When the purged list reaches a certain size,
    deletions are performed in a batch
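
The mini-batch and purged-list ideas can be sketched for a single inverted list as follows. This is an illustrative sketch under simple assumptions: the thresholds `append_batch` and `purge_batch` and the class `BatchedList` are hypothetical, and reads filter out purged docIDs while not yet seeing pending appends (recency traded for fewer list updates).

```python
class BatchedList:
    def __init__(self, doc_ids, append_batch=3, purge_batch=2):
        self.doc_ids = sorted(doc_ids)   # the stored inverted list
        self.pending = []                # appends not yet applied
        self.purged = set()              # deletions not yet applied
        self.append_batch = append_batch
        self.purge_batch = purge_batch

    def append(self, doc_id):
        self.pending.append(doc_id)
        if len(self.pending) >= self.append_batch:
            # Apply the whole mini-batch with one list update.
            self.doc_ids = sorted(set(self.doc_ids) | set(self.pending))
            self.pending = []

    def delete(self, doc_id):
        self.purged.add(doc_id)
        if len(self.purged) >= self.purge_batch:
            # Apply all memorized deletions in one batch.
            self.doc_ids = [d for d in self.doc_ids if d not in self.purged]
            self.pending = [d for d in self.pending if d not in self.purged]
            self.purged = set()

    def read(self):
        # Purged docIDs are hidden even before the batch delete runs.
        return [d for d in self.doc_ids if d not in self.purged]

concurrency = BatchedList([2, 7, 8], append_batch=2, purge_batch=2)
concurrency.append(9)
print(concurrency.read())   # [2, 7, 8]: docID 9 is still in the mini-batch
concurrency.append(10)
print(concurrency.read())   # [2, 7, 8, 9, 10]
```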

15
Notes/Thoughts
  • Critical assessment
  • Very nice: recency can be traded for performance
    using mini-batches!
  • Concurrency increased and locking overhead
    reduced
  • Many algorithms rely on inverted lists being
    sorted
  • Appends may not be commutative anymore, cannot
    be reordered arbitrarily
  • Sorting requirement may impose further
    restrictions on the scheduling of operations
  • In worst case read-queries need to check for
    conflicts with a number of queries equal to MPL
    (multi-programming level)

16
GOLD Text Indexing Engine
Components: DataStructure Manager and OverFlow Manager, each providing access for a QUERY
Structures shown in the figure (storage labels: DISK, MEMORY)
  • Inverted Lists: Database: 1 3 5; Transaction: 1 2
  • Overflow Inverted Lists: Database: 6 8; Transaction: 7 9
  • Insertion List: DocID6, DocID7
  • Position Lists: Database: 1,35 3,44 5,50; Transaction: 1,5 2,80
  • Documents: DocID1, DocID2, DocID3
  • Other structures
17
GOLD Text Indexing Engine
Multi-Layered System
  • Concurrency Control Details
  • Inverted List traversal done with lock coupling
  • L0-L2: CC done via conflict matrix
  • Novelty in L3: Timestamp Locking CC protocol

Multi-Level Concurrency Control (figure): one level labeled Timestamp Locking, three levels labeled Locking
18
GOLD Text Indexing Engine
L3 Concurrency Control Protocol
Basic Idea: use timestamps and locks at the document level.
Locks are used to prevent insertions and deletions from creating an inconsistent state.
Retrieve does not care about locks. To avoid incorrect query results, timestamps are used.
Delete
  • Lock document
  • Delete from Inverted Lists but keep document
  • Get next TS
  • Check for active retrieve operations with smaller TS. If any exists, wait.
  • Delete Document
Insert
  • Lock document
  • Initialize Doc.TS to large value
  • Perform Insertion
  • Get next TS and set Doc.TS
Retrieve
  • Get next TS
  • Ignore Docs with greater TS
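
The three operations can be sketched in a small single-threaded model. This is only an illustration of the timestamp rules listed on this slide: document locks are omitted, waiting is replaced by a `False` return value, and the class and method names (`GoldL3Sketch`, `try_finish_delete`, and so on) are hypothetical rather than taken from the GOLD paper.

```python
import itertools

class GoldL3Sketch:
    def __init__(self):
        self._clock = itertools.count(1)   # source of "next TS"
        self.lists = {}                    # keyword -> set of doc IDs
        self.doc_ts = {}                   # doc ID -> Doc.TS
        self.documents = {}                # doc ID -> keywords (stands in for the document)
        self.active_retrieves = set()      # timestamps of running retrieves

    def insert(self, doc_id, keywords):
        self.doc_ts[doc_id] = float("inf")         # initialize Doc.TS to large value
        self.documents[doc_id] = list(keywords)
        for k in keywords:                         # perform insertion
            self.lists.setdefault(k, set()).add(doc_id)
        self.doc_ts[doc_id] = next(self._clock)    # get next TS and set Doc.TS

    def begin_retrieve(self):
        ts = next(self._clock)
        self.active_retrieves.add(ts)
        return ts

    def retrieve(self, ts, keyword):
        # Ignore documents with a greater timestamp.
        return {d for d in self.lists.get(keyword, set()) if self.doc_ts[d] <= ts}

    def end_retrieve(self, ts):
        self.active_retrieves.discard(ts)

    def begin_delete(self, doc_id):
        for k in self.documents[doc_id]:   # delete from lists but keep document
            self.lists[k].discard(doc_id)
        return next(self._clock)

    def try_finish_delete(self, doc_id, ts):
        if any(r < ts for r in self.active_retrieves):
            return False                   # a real system would wait here
        del self.documents[doc_id]
        del self.doc_ts[doc_id]
        return True

gold = GoldL3Sketch()
old_reader = gold.begin_retrieve()
gold.insert(9, ["Database"])
print(gold.retrieve(old_reader, "Database"))   # set(): docID 9 has a greater TS
new_reader = gold.begin_retrieve()
print(gold.retrieve(new_reader, "Database"))   # {9}
```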
19
Summary
Both papers identify the problem of interleaving retrieve and write/delete/update operations
Solutions
GOLD
  • 1. Insert Ts set the document timestamp; retrieve operations ignore documents with greater timestamp
  • 2. Delete Ts remove entries from inverted lists, then check for concurrent retrieves with smaller timestamp and wait before deleting the document
KAM96
  • 1. Retrieve Ts check the set of common inverted lists with update transactions
  • 2. Operations are reordered such that conflicting updates are executed first
  • 3. For each inverted list, retrieve Ts wait until the conflicting update has been performed
In terms of concurrency
  • INSERT: GOLD > KAM96 (GOLD could miss some recent results)
  • DELETE: GOLD < KAM96
20
ARIES/LHS
Keywords are typically mapped to the corresponding inverted list by hashing.
Only point queries are required for keywords, therefore hashing is a good solution.
BUT: What hashing technique can we use? How can we do CC on the hash table?
MOH93 studies CC for Linear Hashing with Separators (LHS)
Basic operations in LHS: Insert, Delete, Update, File Contraction, File Expansion, Retrieve
May cause record relocation!
  • Basic Idea
  • Writing Ts get X lock on the records they modify,
    NOT on records they relocate
  • Uncommitted relocations can be modified by other
    Ts → high concurrency
  • Read Ts get S lock on records
  • Recovery algorithms need to be prepared to handle
    the above
  • Problem becomes ensuring only correct answers are
    returned
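
Why a file expansion relocates records can be seen in a sketch of plain linear hashing. This is the textbook scheme only; it does not model the separators, signatures, logging, or locking of ARIES/LHS, and `LinearHashFile` with its fields is a hypothetical name. Integer keys stand in for hash values.

```python
class LinearHashFile:
    def __init__(self, n0=2):
        self.n0 = n0        # initial number of buckets
        self.level = 0      # number of completed doubling rounds
        self.split = 0      # next bucket to be split
        self.buckets = [[] for _ in range(n0)]

    def address(self, key_hash):
        a = key_hash % (self.n0 * 2 ** self.level)
        if a < self.split:  # bucket already split in this round
            a = key_hash % (self.n0 * 2 ** (self.level + 1))
        return a

    def insert(self, key_hash):
        self.buckets[self.address(key_hash)].append(key_hash)

    def expand(self):
        """File expansion: split one bucket and relocate part of its records."""
        new_mod = self.n0 * 2 ** (self.level + 1)
        old = self.buckets[self.split]
        self.buckets[self.split] = [k for k in old if k % new_mod == self.split]
        self.buckets.append([k for k in old if k % new_mod != self.split])
        self.split += 1
        if self.split == self.n0 * 2 ** self.level:   # round complete
            self.level += 1
            self.split = 0

f = LinearHashFile(n0=2)
for key_hash in range(8):
    f.insert(key_hash)
print(f.buckets)   # [[0, 2, 4, 6], [1, 3, 5, 7]]
f.expand()
print(f.buckets)   # [[0, 4], [1, 3, 5, 7], [2, 6]]: records 2 and 6 were relocated
```

The relocated records 2 and 6 were moved by the expansion, not by the transactions that wrote them, which is the situation the locking rule on this slide addresses.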

21
ARIES/LHS
  • Retrieval of records is supported by an in-memory
    data structure, signature table (ST)
  • current_page of a record may be different from
    home_page
  • ST helps to quickly identify current_page for a
    given query without causing disk I/O
  • CC problem: we need to ensure page signatures and
    ST are in sync, i.e. consistent

ST Latch
  • File Expansion/Contraction get X latch on ST
  • Signatures depend on the number of pages
  • STE latches used to increase concurrency (details
    complex)
  • Other operations get S on ST

Figure: three pages, each paired with an STE latch; T acquires latches
22
References
  • KAM96: Mohan Kamath, Krithi Ramamritham. Efficient Transaction Support for Dynamic Information Retrieval Systems. Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1996.
  • BMV96: D. Barbara, S. Mehrotra, P. Vallabhaneni. The Gold Text Indexing Engine. Proceedings of the 12th International Conference on Data Engineering (ICDE'96), 1996.
  • MOH93: C. Mohan. ARIES/LHS: A Concurrency Control and Recovery Method Using Write-Ahead Logging for Linear Hashing with Separators. Proceedings of the Ninth International Conference on Data Engineering (ICDE), 1993.
23
Issue with GOLD (my opinion)
Inverted List Index
  • A: 1 2 3
  • Z: 2 3 4
T1: Get Docs containing A AND B AND C ... AND NOT Z
T2: Delete Document 3
TS(T1) < TS(T2)

The following interleaving is possible: r1(A), d2(A,3), r1(B), r1(C), ..., d2(Z,3), r1(Z)
T1 will identify 3 as a result, and since TS(T1) < TS(T2) the document has not been deleted yet!
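
The anomaly can be replayed with the two lists from this slide. To keep the sketch small, the query is reduced to A AND NOT Z (the B and C terms are left out, which is an assumption made only for this illustration), and the interleaving is executed step by step in a single thread.

```python
lists = {"A": {1, 2, 3}, "Z": {2, 3, 4}}

# Serial results for A AND NOT Z, before and after deleting document 3.
before_delete = {1, 2, 3} - {2, 3, 4}   # T1 entirely before T2
after_delete = {1, 2} - {2, 4}          # T1 entirely after T2

# Interleaving: T1 reads A, T2 deletes 3 from A and Z, T1 reads Z.
t1_a = set(lists["A"])       # T1 reads A
lists["A"].discard(3)        # T2 removes docID 3 from A
lists["Z"].discard(3)        # T2 removes docID 3 from Z
t1_z = set(lists["Z"])       # T1 reads Z, where docID 3 is already gone

t1_result = t1_a - t1_z
print(sorted(before_delete), sorted(after_delete))  # [1] [1]
print(sorted(t1_result))                            # [1, 3]: docID 3 is a false answer
```

Neither serial order returns docID 3, so the interleaved result is not equivalent to any serial execution.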