Title: SDDS distributed index architecture and rank queries optimization in distributed objectoriented data
1SDDS distributed index architecture and rank
queries optimization in distributed
object-oriented databases
- Tomasz Kowalski
- Advisor
- Kazimierz Subieta D.Sc.
2Presentation plan
- What is index
- Indexing strategies in distributed environment
- Architecture of SDDS index
- New approach to optimize rank queries with SDDS
- Sequences and rank queries in SBQL
- Overview of an approach
- Selecting k smallest element in the set -
parallel algorithm in SBQL - Properties and advantages of proposed strategy
- Conclusion
3What is index
- Auxiliary (redundant) database structures stored
on the server side, - Used to quicken searching of an objects by
reducing the amount of data to be searched - Essential element in optimizing big databases,
- Most popular techniques of implementing indices
are B-tree (balanced tree) and hash indices.
4Indices structure
- Two column table, where
- I column consists of unique key values (i.e.
input for search procedure) - II column consists of non-key values (i.e.
reference to objects)
Example index for Department objects according
to Location attribute
5Indices example
(Depart where Radom ? Location) join
(Employs.Emp where Position manager)
- Index_Depart_Location
- parameters String Location
- returns bag of departament references
(Index_Depart_Location(Radom ) join
(Employs.Emp where Position manager)
6Indices example (2)
(Depart where Radom ? Location) join
(Employs.Emp where Position manager)
- Index_Depart_Loc_Emp_Pos
- parameters String Location, String Position
- returns bag of structures including departament
and employee references
Index_Depart_Loc_Emp_Pos(Radom, manager)
7Indices division
- dense index for each key value there is
separate position in index - range index each index position holds values
connected with a range - often used in lt, gt, , optimizing (order
preserving range function)
Example range index for Employees objects
according to Salary attribute
8Index properties
- Index transparency
- programmer should not be aware that indices
exist. - administrator of the database can freely generate
new indices and remove them without changing code
of applications. - Automatic index updating
- result of changes in database.
- index cohesion is maintained by automatic
mechanism should improving, eliminating or
generating new index in case of database update.
9Query optimization architecture with indices
query
query parser
Client
query syntax tree
Rewriting query optimizer
Query Interpreter
Index repository
Static stackS_ENVS
ENVS
Index administration module
Static stackS_QRES
QRES
Query evaluation cost module index profit
estimation module
Meta-base
Data repository
10Automatic index updatingin distributed
enviroment
- each site containing data has to maintain index
convergence with data (by inserting, deleting,
modifying index) - index key value is calculated according to
selected attribute definition in integrating
global view and contributory local view - main difficulty is the form of non-key value
which is responsible for direct access to indexed
virtual objects in virtual repository!
11Global views integration of distributed data
- Example view
- create view EmpDef
- virtual objects Emp
- return ((Cracow.Prac union Warsaw.Prac union
Wroclaw.Prac union Lodz.Prac) as b) -
- on_retrieve do
- return b.(deref(Nazwisko) as Name,
deref(Zarobek) as Salary) -
-
-
- i.e. Index is created on Salary attribute of Emp
virtual objects.
12Automatic index updating server architecture
Emp definition
Globalschema
Contributory schema
Index automatic updating service
Prac definition
Contributory views
Localschema
Local resources
Local repository
Cracow peer
13Distributed environment schema
I
Index_Emp_Salary Data
Emp Data
GRID Server
GRID
Client
14Example query evaluation
- Query COUNT(Emp WHERE 1800 Salary AND Salary lt
2500)
results
query
15Local indexing strategy
- Query COUNT(Emp WHERE 1800 Salary AND Salary lt
2500)
results
I
I
query
I
I
16Central indexing strategy
- Query Count(Index_Emp_Salary(1800, 2500))
result
query
I
17Distributed scalable indexing strategy
- Query Count(Index_Emp_Salary(1800, 2500))
result
I
I
I
query
I
I
I
I
I
I
I
I
I
18Global indexing strategies
- Central index
- efficient for selective queries
- indexing server is hot spot (overloading,
failure) - Distributed index
- uses GRID computing potential
- more complex architecture
- more communication necessary
19SDDS distributed index
- Scalable Distributed Data Structure (Witold
Litwin D.Sc. Universite Paris) - High-availability tolerate the unavailability
of k 1 server sites (parity calculus uses the
Reed-Solomon Codes) - supports parallel/distributed queries evaluation
- LHRS generalizes Linear Hashing to distributed
RAM or disk files - avoids centralized address calculus
20Further SDDS properties
- provides concurrency transparency
- does not assume any size and capacity constraints
- SDDS file expands on new servers when optimal
load of current servers is reached - access and modification of SDDS file does not
demand global refresh on servers or clients
21LH algorithm performance
- Over 65 of SDDS file is used
- Number of messages between servers per random
insertion is 1 in general - Number of messages between servers per key search
is 2 in general - Parallel operations on SDDS M buckets requires at
most 2M1 messages and 1 between 1 and O(log2M)
rounds of messages
22SDDS index utilization architecture client (1)
- Client addresses directly or indirectly
integrated data defined by its global schema
(EmpDef view). - Client has knowledge about existing indices and
some cost estimations. - Client utilizes indices according to query
optimizer architecture. - I.e. query
- COUNT(Emp WHERE 1800 Salary AND Salary lt 2500)
client should rewrite to utilize index in
following way - COUNT(Index_Emp_Salary(1800, 2500))
- or using data-independent index count method
- Count_Index_Emp_Salary(1800, 2500)
23SDDS index utilization architecture server (2)
- Query COUNT(Emp WHERE 1800 Salary AND Salary lt
2500)
Rewritten query Count(Index_Emp_Salary(1800,
2500))
I
- Client send rewritten query to any of grids site
- Servers index service module is responsible to
reply clients query. - Server uses SDDS addressing algorithm.
24SDDS index utilization architecture grid (3)
- Query Count(Index_Emp_Salary(1800, 2500))
addressing adjustment
I
I
I
query
I
I
- Server index service module executes parallel
query over SDDS - Server knows most peers participating in indexing
and has own distributed index image - Each query result contains SDDS image adjustment
25SDDS index utilization architecture addressing
(4)
Query Count(Index_Emp_Salary(1800, 2500))
I
I
I
3
2
6
I
4
I
I
5
I
1
0
- Server 4 calculates servers containing searched
part of index - Site 4 index service module sends SDDS queries to
1st and 2nd server in parallel. - Each site indexes some range.
Figure. Incorrect index image table on site 4
(server processing query)
26SDDS index utilization architecture addressing
adjustment (5)
Query Count(Index_Emp_Salary(1800, 2500))
addressing adjustment
I
I
I
3
2
6
I
4
I
I
5
I
1
0
- Server 2 resend index query to 6th site using
up-to-date index image information. - Sites reply to 4th site with searched data and
index image adjustment . - Each site contains dense index information (for
bucket splitting)
Figure. Correct index image table on site 2
(server corrects addressing error)
27Distributed indexing in GRID with SDDS - problems
- main difficulty is how to preserve virtual
pointers to data organized by view definitions.
Is updateable view definition sufficient to
provide such a functionality. - wrappers should provide fast access methods to
single virtual objects (use of tids, primary
keys) - wrappers to local data scheme should provide to
index information about changes of data in local
repository
28Rank queries importance
- top k-query importance
- similarity queries in multimedia databases
- searching web databases
- more complex ranking functions are used,
dedicated ranking system are developed, ranking
relational algebras etc.
29Use of ORDER BY in SBQL rank queries
- (Emp order by Salary) returns
- sequence Emp1, Emp2, Emp3
- ((Emp order by Salary) number as n) returns
- sequence structEmp1, n(1), structEmp2,
n(2), structEmp3, n(3), - Rank query syntax
- (((Emp order by Salary) number as n) where n
10) returns - sequence structEmp1, n(1), structEmp2,
n(2), structEmp3, n(3),, structEmp10,
n(10) - that is 10 worse earning employees
- Alternative syntax for rank queries
- (Emp order by Salary)1..10
30Sequence optimizing in query languages
- It is common opinion that
- Sequence significantly reduce query optimizing
potential - Methods based on rewriting and indexing cannot be
used - Order forced by sequences leads to sequence
processing (one element followed by another)
31Symbols used in discussed example
- n number of employees stored in database
- qEMP number of servers holding employees
salaries data - qIND number of servers holding employees
salaries data index - qGRID number of servers in GRID
32Execution of rank query through sorting
- It is necessary to send all data to chosen server
before final evaluation (Total Data Shipping) - Sorting of whole set cuts down performance
regardless of data distribution cannot be better
then O(nlog(q)), because of merge cost in the
chosen server for final evaluation - Data distribution is disadvantage
- Index cannot be taken into an account
33New optimizing strategy for rank queries - example
- ((Emp order by Salary) number as n) where n 10
- After detecting suitable rank query in syntax
tree it can be replaced with the following kind
of query - (FindKElement(Emp, Salary, 10) group as x).
- (((Emp where Salary x.Salary) order by
Salary) number as n) where n 10 - where
- Method FindKElement returns reference to the 10th
worst earning employee.
34New optimizing strategy for rank queries -
performance
- (FindKElement(Emp, Salary, 10) group as x).
- (((Emp where Salary x.Salary) order by
Salary) number as n) where n 10 - It is assumed that considered rank query selects
small number of employees from whole dataset - Performance of evaluating the second part of the
query is relevant due to introducing sub-query
(Emp where Salary x.Salary) - small dataset selected
- possibility to use static decomposition strategy
- possibility to increase performance with local or
distributed indexing - Performance of query evaluation strongly depends
on FindKElement methods performance!
35FindKElement method algorithms
- Classic Hoares algorithm for selecting k biggest
element of the set, on example of employees
salary. - Parallel Hoares algorithm
- Proposed FindKElement parallel algorithm using
distributed index SDDS - Conclusion
36Hoares algorithm
- Faster then obvious algorithm sorting of the
given set and selecting k element - O(nlog(n)). - Modification of quicksort algorithm by omitting
sorting both parts of divided set - O(n). - Based on bisection algorithm
- Consumes additional memory to store copy of data
to swap freely elements if needed by algorithm - In each iteration algorithm splits examined data
rank on to parts (smaller and greater then
selected pivot element). Then the part containing
kth element is selected as rank for next iteration
37Hoares algorithm work schema
- white color selected subset containing k
element - orange color subset rejected (without k
element) - grey color subset not evaluated in iteration
- Sali is random salary from selected set (white
color) from i-1 iteration
n
k
Iteration
UNORDERED SET OF SALARIES
1
Set of salaries lt Sal2
Rejected set of salaries Sal2
2
Salaries lt Sal3
Rejected set of salaries Sal2
Sal3 salaries ltSal2
3
Salaries lt Sal3
Rejected set of salaries Sal2
4
Clog(n)
38Parallel modification of Hoares algorithm
- Hoares algorithm working parallel on servers
- Part of algorithm must be given to each of qEMP
servers holding employees salaries data. Each
servers must communicate with main controlling
algorithm to decide which part of divided set is
to be examined in next iteration - Performance O(n/qEMP) only if salaries data are
equally distributed on servers - Hoares algorithm working on one server
- Performance loss O(n)
- Total data shipping necessary
- No possibility to utilize salaries indices
39Alternative algorithm in SBQL start conditions
- S_MIN MIN(Emp.Salary)
- S_MAX MAX(Emp.Salary)
- S_COUNT COUNT(Emp.Salary)
- ITERATION 1
- MIN_COUNT 0
- MAX_COUNT 0
- n S_COUNT
MIN_COUNT 0
S_COUNT n
MAX_COUNT 0
UNORDERED SET OF SALARIES
k
40Alternative algorithm in SBQL - main loop
- while (ITERATION lt log2(n))
- Mean (S_MIN S_MAX)/2
- MinM_COUNT COUNT(Emp WHERE S_MIN Salary AND
Salary lt Mean) - Mean_COUNT COUNT(Emp WHERE Salary Mean)
- MMax_COUNT S_COUNT MinM_COUNT Mean_Count
- If (MinM_Count MIN_Count gt k) S_MAX Mean
MAX_COUNT MMax_COUNT - else S_MIN Mean MIN_COUNT MinM_Count
- S_COUNT n MIN_COUNT MAX_COUNT
- return ((Emp WHERE S_MIN Salary AND Salary
S_MAX) ORDER BY Salary)(k MIN_COUNT)
ITERATION
MinM_COUNT
MMax_COUNT
1
Salaries lt Mean
Mean lt Salaries
MIN_COUNT
MMax_COUNT
S_MIN Mean
MinM_COUNT
2
Salaries lt S_MIN
Mean lt Salaries lt S_MAX
S_MIN lt Salaries lt Mean
S_MAX Mean
MIN_COUNT
MAX_COUNT
MinM_COUNT
MMax_COUNT
3
Salaries lt S_MIN
S_MAX lt Salaries
k
41Alternative algorithm in SBQL example iteration
1
Table 1. Examined salaries range values
Table 2. Number of employees earning examined
ranges of salaries
1000 employees
MIN_COUNT 0
MAX_COUNT 0
UNORDERED SET OF SALARIES
232 employees
767 employees
Salaries lt Mean
Mean lt Salaries
Result S_MIN MEAN
300th employee
42Alternative algorithm in SBQL example iteration
2
Table 1. Examined salaries range values
Table 2. Number of employees earning examined
ranges of salaries
MIN_COUNT 232
MAX_COUNT 0
768 employees
Salaries lt S_MIN
S_MIN Salaries
232
463 employees
303 employees
Salaries lt S_MIN
Mean lt Salaries
S_MIN Salaries lt Mean
Result S_MAX MEAN
300th employee
43Alternative algorithm in SBQL example iteration
3
Table 1. Examined salaries range values
Table 2. Number of employees earning examined
ranges of salaries
MIN_COUNT 232
MAX_COUNT 463
305 employees
Salaries lt S_MIN
S_MAX lt Salaries
S_MIN Salaries S_MAX
232
463
130
174
Salaries lt S_MIN
S_MAX lt Salaries
Result S_MAX MEAN
300th employee
44Alternative algorithm in SBQL example all
iterations
Table. Examined salaries range values and
adequate number of employees
With this bisection algorithm after 7 iterations
employees set has been truncated from 1000 to 23
employees. Now following query can be
used return ((Emp WHERE 3250 Salary AND Salary
3325) ORDER BY Salary)300 294 1
45Alternative algorithm,indexing
- Possible dense or range index on salaries. I.
eCOUNT(Emp WHERE S_MIN Salary AND Salary lt
Mean) can be replaced with - COUNT(IndexEmpSalary(S_Min, Mean))
- WHERE use efficiently range index on salaries
- COUNT query is mostly computed on the index
itself (f. e. similarly like in Oracle databases) - COUNT Real data participation in evaluation of
such a query is very little
46Alternative algorithm,indexing - SDDS
- With SDDS computations can be distributed over
qIND servers! - SDDS expands when optimal load for used servers
is reached, so - qIND is limited only by number of GRID units
(qIND qGRID) - qIND does not depend on the number qEMP servers
holding salaries - As a general rule qIND is much bigger then qEMP
- SDDS unlike central index supports parallel
evaluation
47Alternative algorithm,indexing - advantages
- SDDS unlike central index supports parallel
evaluation - Regardless of salaries data distribution on GRID
servers performance is - O(nlog(qind)/qind) for pessimistic number of
rounds of messages for parrallel bucket queries - O(n/qind) if index is used often by clients
- Simplicity
- Algorithm can be compiled and run on one server!
- It uses standard SBQL language operators which
can be evaluated in parallel - Index is transparent for GRID programmer or user
48Alternative algorithm,main properties
- There is no need for total data shipping
- Algorithm mainly reads data and compares
- Small memory consumption and number of write
operations - Utilizes distributed indices efficiently
- Performance of algorithm using SDDS is near to
optimal for GRID O(n/qIMP)! - Performance without distribution and indices is
not worse then sorting O(nlog(n))
49Comparison of algorithms in distributed
environment
50New optimizing strategy for rank queries -
advantages
- It uses efficient algorithms in distributed
environment presented in previous section - It avoids Total Data Shipping
- It takes advantage from data distribution
- It is suitable for using distributed index
- Optimal performance
- Does not depend on actual data distribution
- It is possible to use it (in some range) in other
database system (f.e. relational)
51Rank queries evaluation strategies - comparison
52Conclusion
- Distributed indexing of sets often used in
queries or statistical computations causes
serious performance growth - Thanks to distributed index some solutions can be
directly copied from non-distributed systems to
GRID - Proposed algorithm can be used in optimizing
evaluation of some SBQL queries containing ORDER
BY operator, which goal is not to sort the whole
given set, but to obtain subset out of it (i. e.
50 best earning employees) - Main difficulty is to introduce flexible solution
to preserve direct access to virtual store objects
53Conclusion future work
- Benchmark tests of proposed rank queries optimize
strategy without SDDS index (using operators
distribution implemented?) - Implementing local query optimization
architecture using indices - Solution for indexing of global data (virtual
pointers?) architecture, implementation - More optimizations efficiently utilizing SDDS
54Project eGov-busWIDDOK scholarship
- eGov-Bus, Advanced eGovernment Information
Service BusVIth Frame Program - Stypendium Docelowe w ramach Mechanizmu
Wspierania Innowacyjnej Dzialalnosci Doktorantów
(WIDDOK)
55- Dziekuje serdecznie
- za uwage!