SDDS distributed index architecture and rank queries optimization in distributed objectoriented data - PowerPoint PPT Presentation

1 / 55
About This Presentation
Title:

SDDS distributed index architecture and rank queries optimization in distributed objectoriented data

Description:

Selecting k smallest element in the set - parallel algorithm in SBQL ... Used to quicken searching of an objects by reducing the amount of data to be searched ... – PowerPoint PPT presentation

Number of Views:69
Avg rating:3.0/5.0
Slides: 56
Provided by: tomaszk
Category:

less

Transcript and Presenter's Notes

Title: SDDS distributed index architecture and rank queries optimization in distributed objectoriented data


1
SDDS distributed index architecture and rank
queries optimization in distributed
object-oriented databases
  • Tomasz Kowalski
  • Advisor
  • Kazimierz Subieta D.Sc.

2
Presentation plan
  • What is index
  • Indexing strategies in distributed environment
  • Architecture of SDDS index
  • New approach to optimize rank queries with SDDS
  • Sequences and rank queries in SBQL
  • Overview of an approach
  • Selecting k smallest element in the set -
    parallel algorithm in SBQL
  • Properties and advantages of proposed strategy
  • Conclusion

3
What is index
  • Auxiliary (redundant) database structures stored
    on the server side,
  • Used to quicken searching of an objects by
    reducing the amount of data to be searched
  • Essential element in optimizing big databases,
  • Most popular techniques of implementing indices
    are B-tree (balanced tree) and hash indices.

4
Indices structure
  • Two column table, where
  • I column consists of unique key values (i.e.
    input for search procedure)
  • II column consists of non-key values (i.e.
    reference to objects)

Example index for Department objects according
to Location attribute
5
Indices example
(Depart where Radom ? Location) join
(Employs.Emp where Position manager)
  • Index_Depart_Location
  • parameters String Location
  • returns bag of departament references

(Index_Depart_Location(Radom ) join
(Employs.Emp where Position manager)
6
Indices example (2)
(Depart where Radom ? Location) join
(Employs.Emp where Position manager)
  • Index_Depart_Loc_Emp_Pos
  • parameters String Location, String Position
  • returns bag of structures including departament
    and employee references

Index_Depart_Loc_Emp_Pos(Radom, manager)
7
Indices division
  • dense index for each key value there is
    separate position in index
  • range index each index position holds values
    connected with a range
  • often used in lt, gt, , optimizing (order
    preserving range function)

Example range index for Employees objects
according to Salary attribute
8
Index properties
  • Index transparency
  • programmer should not be aware that indices
    exist.
  • administrator of the database can freely generate
    new indices and remove them without changing code
    of applications.
  • Automatic index updating
  • result of changes in database.
  • index cohesion is maintained by automatic
    mechanism should improving, eliminating or
    generating new index in case of database update.

9
Query optimization architecture with indices
query
query parser
Client
query syntax tree
Rewriting query optimizer
Query Interpreter
Index repository
Static stackS_ENVS
ENVS
Index administration module
Static stackS_QRES
QRES
Query evaluation cost module index profit
estimation module
Meta-base
Data repository
10
Automatic index updatingin distributed
enviroment
  • each site containing data has to maintain index
    convergence with data (by inserting, deleting,
    modifying index)
  • index key value is calculated according to
    selected attribute definition in integrating
    global view and contributory local view
  • main difficulty is the form of non-key value
    which is responsible for direct access to indexed
    virtual objects in virtual repository!

11
Global views integration of distributed data
  • Example view
  • create view EmpDef
  • virtual objects Emp
  • return ((Cracow.Prac union Warsaw.Prac union
    Wroclaw.Prac union Lodz.Prac) as b)
  • on_retrieve do
  • return b.(deref(Nazwisko) as Name,
    deref(Zarobek) as Salary)
  • i.e. Index is created on Salary attribute of Emp
    virtual objects.

12
Automatic index updating server architecture
Emp definition
Globalschema
Contributory schema
Index automatic updating service
Prac definition
Contributory views
Localschema
Local resources
Local repository
Cracow peer
13
Distributed environment schema
I
Index_Emp_Salary Data
Emp Data
GRID Server
GRID
Client
14
Example query evaluation
  • Query COUNT(Emp WHERE 1800 Salary AND Salary lt
    2500)

results
query
15
Local indexing strategy
  • Query COUNT(Emp WHERE 1800 Salary AND Salary lt
    2500)

results
I
I
query
I
I
16
Central indexing strategy
  • Query Count(Index_Emp_Salary(1800, 2500))

result
query
I
17
Distributed scalable indexing strategy
  • Query Count(Index_Emp_Salary(1800, 2500))

result
I
I
I
query
I
I
I
I
I
I
I
I
I
18
Global indexing strategies
  • Central index
  • efficient for selective queries
  • indexing server is hot spot (overloading,
    failure)
  • Distributed index
  • uses GRID computing potential
  • more complex architecture
  • more communication necessary

19
SDDS distributed index
  • Scalable Distributed Data Structure (Witold
    Litwin D.Sc. Universite Paris)
  • High-availability tolerate the unavailability
    of k 1 server sites (parity calculus uses the
    Reed-Solomon Codes)
  • supports parallel/distributed queries evaluation
  • LHRS generalizes Linear Hashing to distributed
    RAM or disk files
  • avoids centralized address calculus

20
Further SDDS properties
  • provides concurrency transparency
  • does not assume any size and capacity constraints
  • SDDS file expands on new servers when optimal
    load of current servers is reached
  • access and modification of SDDS file does not
    demand global refresh on servers or clients

21
LH algorithm performance
  • Over 65 of SDDS file is used
  • Number of messages between servers per random
    insertion is 1 in general
  • Number of messages between servers per key search
    is 2 in general
  • Parallel operations on SDDS M buckets requires at
    most 2M1 messages and 1 between 1 and O(log2M)
    rounds of messages

22
SDDS index utilization architecture client (1)
  • Client addresses directly or indirectly
    integrated data defined by its global schema
    (EmpDef view).
  • Client has knowledge about existing indices and
    some cost estimations.
  • Client utilizes indices according to query
    optimizer architecture.
  • I.e. query
  • COUNT(Emp WHERE 1800 Salary AND Salary lt 2500)
    client should rewrite to utilize index in
    following way
  • COUNT(Index_Emp_Salary(1800, 2500))
  • or using data-independent index count method
  • Count_Index_Emp_Salary(1800, 2500)

23
SDDS index utilization architecture server (2)
  • Query COUNT(Emp WHERE 1800 Salary AND Salary lt
    2500)

Rewritten query Count(Index_Emp_Salary(1800,
2500))
I
  • Client send rewritten query to any of grids site
  • Servers index service module is responsible to
    reply clients query.
  • Server uses SDDS addressing algorithm.

24
SDDS index utilization architecture grid (3)
  • Query Count(Index_Emp_Salary(1800, 2500))

addressing adjustment
I
I
I
query
I
I
  • Server index service module executes parallel
    query over SDDS
  • Server knows most peers participating in indexing
    and has own distributed index image
  • Each query result contains SDDS image adjustment

25
SDDS index utilization architecture addressing
(4)
Query Count(Index_Emp_Salary(1800, 2500))
I
I
I
3
2
6
I
4
I
I
5
I
1
0
  • Server 4 calculates servers containing searched
    part of index
  • Site 4 index service module sends SDDS queries to
    1st and 2nd server in parallel.
  • Each site indexes some range.

Figure. Incorrect index image table on site 4
(server processing query)
26
SDDS index utilization architecture addressing
adjustment (5)
Query Count(Index_Emp_Salary(1800, 2500))
addressing adjustment
I
I
I
3
2
6
I
4
I
I
5
I
1
0
  • Server 2 resend index query to 6th site using
    up-to-date index image information.
  • Sites reply to 4th site with searched data and
    index image adjustment .
  • Each site contains dense index information (for
    bucket splitting)

Figure. Correct index image table on site 2
(server corrects addressing error)
27
Distributed indexing in GRID with SDDS - problems
  • main difficulty is how to preserve virtual
    pointers to data organized by view definitions.
    Is updateable view definition sufficient to
    provide such a functionality.
  • wrappers should provide fast access methods to
    single virtual objects (use of tids, primary
    keys)
  • wrappers to local data scheme should provide to
    index information about changes of data in local
    repository

28
Rank queries importance
  • top k-query importance
  • similarity queries in multimedia databases
  • searching web databases
  • more complex ranking functions are used,
    dedicated ranking system are developed, ranking
    relational algebras etc.

29
Use of ORDER BY in SBQL rank queries
  • (Emp order by Salary) returns
  • sequence Emp1, Emp2, Emp3
  • ((Emp order by Salary) number as n) returns
  • sequence structEmp1, n(1), structEmp2,
    n(2), structEmp3, n(3),
  • Rank query syntax
  • (((Emp order by Salary) number as n) where n
    10) returns
  • sequence structEmp1, n(1), structEmp2,
    n(2), structEmp3, n(3),, structEmp10,
    n(10)
  • that is 10 worse earning employees
  • Alternative syntax for rank queries
  • (Emp order by Salary)1..10

30
Sequence optimizing in query languages
  • It is common opinion that
  • Sequence significantly reduce query optimizing
    potential
  • Methods based on rewriting and indexing cannot be
    used
  • Order forced by sequences leads to sequence
    processing (one element followed by another)

31
Symbols used in discussed example
  • n number of employees stored in database
  • qEMP number of servers holding employees
    salaries data
  • qIND number of servers holding employees
    salaries data index
  • qGRID number of servers in GRID

32
Execution of rank query through sorting
  • It is necessary to send all data to chosen server
    before final evaluation (Total Data Shipping)
  • Sorting of whole set cuts down performance
    regardless of data distribution cannot be better
    then O(nlog(q)), because of merge cost in the
    chosen server for final evaluation
  • Data distribution is disadvantage
  • Index cannot be taken into an account

33
New optimizing strategy for rank queries - example
  • ((Emp order by Salary) number as n) where n 10
  • After detecting suitable rank query in syntax
    tree it can be replaced with the following kind
    of query
  • (FindKElement(Emp, Salary, 10) group as x).
  • (((Emp where Salary x.Salary) order by
    Salary) number as n) where n 10
  • where
  • Method FindKElement returns reference to the 10th
    worst earning employee.

34
New optimizing strategy for rank queries -
performance
  • (FindKElement(Emp, Salary, 10) group as x).
  • (((Emp where Salary x.Salary) order by
    Salary) number as n) where n 10
  • It is assumed that considered rank query selects
    small number of employees from whole dataset
  • Performance of evaluating the second part of the
    query is relevant due to introducing sub-query
    (Emp where Salary x.Salary)
  • small dataset selected
  • possibility to use static decomposition strategy
  • possibility to increase performance with local or
    distributed indexing
  • Performance of query evaluation strongly depends
    on FindKElement methods performance!

35
FindKElement method algorithms
  • Classic Hoares algorithm for selecting k biggest
    element of the set, on example of employees
    salary.
  • Parallel Hoares algorithm
  • Proposed FindKElement parallel algorithm using
    distributed index SDDS
  • Conclusion

36
Hoares algorithm
  • Faster then obvious algorithm sorting of the
    given set and selecting k element - O(nlog(n)).
  • Modification of quicksort algorithm by omitting
    sorting both parts of divided set - O(n).
  • Based on bisection algorithm
  • Consumes additional memory to store copy of data
    to swap freely elements if needed by algorithm
  • In each iteration algorithm splits examined data
    rank on to parts (smaller and greater then
    selected pivot element). Then the part containing
    kth element is selected as rank for next iteration

37
Hoares algorithm work schema
  • white color selected subset containing k
    element
  • orange color subset rejected (without k
    element)
  • grey color subset not evaluated in iteration
  • Sali is random salary from selected set (white
    color) from i-1 iteration

n
k
Iteration
UNORDERED SET OF SALARIES
1
Set of salaries lt Sal2
Rejected set of salaries Sal2
2
Salaries lt Sal3
Rejected set of salaries Sal2
Sal3 salaries ltSal2
3
Salaries lt Sal3
Rejected set of salaries Sal2
4
Clog(n)
38
Parallel modification of Hoares algorithm
  • Hoares algorithm working parallel on servers
  • Part of algorithm must be given to each of qEMP
    servers holding employees salaries data. Each
    servers must communicate with main controlling
    algorithm to decide which part of divided set is
    to be examined in next iteration
  • Performance O(n/qEMP) only if salaries data are
    equally distributed on servers
  • Hoares algorithm working on one server
  • Performance loss O(n)
  • Total data shipping necessary
  • No possibility to utilize salaries indices

39
Alternative algorithm in SBQL start conditions
  • S_MIN MIN(Emp.Salary)
  • S_MAX MAX(Emp.Salary)
  • S_COUNT COUNT(Emp.Salary)
  • ITERATION 1
  • MIN_COUNT 0
  • MAX_COUNT 0
  • n S_COUNT

MIN_COUNT 0
S_COUNT n
MAX_COUNT 0
UNORDERED SET OF SALARIES
k
40
Alternative algorithm in SBQL - main loop
  • while (ITERATION lt log2(n))
  • Mean (S_MIN S_MAX)/2
  • MinM_COUNT COUNT(Emp WHERE S_MIN Salary AND
    Salary lt Mean)
  • Mean_COUNT COUNT(Emp WHERE Salary Mean)
  • MMax_COUNT S_COUNT MinM_COUNT Mean_Count
  • If (MinM_Count MIN_Count gt k) S_MAX Mean
    MAX_COUNT MMax_COUNT
  • else S_MIN Mean MIN_COUNT MinM_Count
  • S_COUNT n MIN_COUNT MAX_COUNT
  • return ((Emp WHERE S_MIN Salary AND Salary
    S_MAX) ORDER BY Salary)(k MIN_COUNT)

ITERATION
MinM_COUNT
MMax_COUNT
1
Salaries lt Mean
Mean lt Salaries
MIN_COUNT
MMax_COUNT
S_MIN Mean
MinM_COUNT
2
Salaries lt S_MIN
Mean lt Salaries lt S_MAX
S_MIN lt Salaries lt Mean
S_MAX Mean
MIN_COUNT
MAX_COUNT
MinM_COUNT
MMax_COUNT
3
Salaries lt S_MIN
S_MAX lt Salaries
k
41
Alternative algorithm in SBQL example iteration
1
Table 1. Examined salaries range values
Table 2. Number of employees earning examined
ranges of salaries
1000 employees
MIN_COUNT 0
MAX_COUNT 0
UNORDERED SET OF SALARIES
232 employees
767 employees
Salaries lt Mean
Mean lt Salaries
Result S_MIN MEAN
300th employee
42
Alternative algorithm in SBQL example iteration
2
Table 1. Examined salaries range values
Table 2. Number of employees earning examined
ranges of salaries
MIN_COUNT 232
MAX_COUNT 0
768 employees
Salaries lt S_MIN
S_MIN Salaries
232
463 employees
303 employees
Salaries lt S_MIN
Mean lt Salaries
S_MIN Salaries lt Mean
Result S_MAX MEAN
300th employee
43
Alternative algorithm in SBQL example iteration
3
Table 1. Examined salaries range values
Table 2. Number of employees earning examined
ranges of salaries
MIN_COUNT 232
MAX_COUNT 463
305 employees
Salaries lt S_MIN
S_MAX lt Salaries
S_MIN Salaries S_MAX
232
463
130
174
Salaries lt S_MIN
S_MAX lt Salaries
Result S_MAX MEAN
300th employee
44
Alternative algorithm in SBQL example all
iterations
Table. Examined salaries range values and
adequate number of employees
With this bisection algorithm after 7 iterations
employees set has been truncated from 1000 to 23
employees. Now following query can be
used return ((Emp WHERE 3250 Salary AND Salary
3325) ORDER BY Salary)300 294 1
45
Alternative algorithm,indexing
  • Possible dense or range index on salaries. I.
    eCOUNT(Emp WHERE S_MIN Salary AND Salary lt
    Mean) can be replaced with
  • COUNT(IndexEmpSalary(S_Min, Mean))
  • WHERE use efficiently range index on salaries
  • COUNT query is mostly computed on the index
    itself (f. e. similarly like in Oracle databases)
  • COUNT Real data participation in evaluation of
    such a query is very little

46
Alternative algorithm,indexing - SDDS
  • With SDDS computations can be distributed over
    qIND servers!
  • SDDS expands when optimal load for used servers
    is reached, so
  • qIND is limited only by number of GRID units
    (qIND qGRID)
  • qIND does not depend on the number qEMP servers
    holding salaries
  • As a general rule qIND is much bigger then qEMP
  • SDDS unlike central index supports parallel
    evaluation

47
Alternative algorithm,indexing - advantages
  • SDDS unlike central index supports parallel
    evaluation
  • Regardless of salaries data distribution on GRID
    servers performance is
  • O(nlog(qind)/qind) for pessimistic number of
    rounds of messages for parrallel bucket queries
  • O(n/qind) if index is used often by clients
  • Simplicity
  • Algorithm can be compiled and run on one server!
  • It uses standard SBQL language operators which
    can be evaluated in parallel
  • Index is transparent for GRID programmer or user

48
Alternative algorithm,main properties
  • There is no need for total data shipping
  • Algorithm mainly reads data and compares
  • Small memory consumption and number of write
    operations
  • Utilizes distributed indices efficiently
  • Performance of algorithm using SDDS is near to
    optimal for GRID O(n/qIMP)!
  • Performance without distribution and indices is
    not worse then sorting O(nlog(n))

49
Comparison of algorithms in distributed
environment
50
New optimizing strategy for rank queries -
advantages
  • It uses efficient algorithms in distributed
    environment presented in previous section
  • It avoids Total Data Shipping
  • It takes advantage from data distribution
  • It is suitable for using distributed index
  • Optimal performance
  • Does not depend on actual data distribution
  • It is possible to use it (in some range) in other
    database system (f.e. relational)

51
Rank queries evaluation strategies - comparison
52
Conclusion
  • Distributed indexing of sets often used in
    queries or statistical computations causes
    serious performance growth
  • Thanks to distributed index some solutions can be
    directly copied from non-distributed systems to
    GRID
  • Proposed algorithm can be used in optimizing
    evaluation of some SBQL queries containing ORDER
    BY operator, which goal is not to sort the whole
    given set, but to obtain subset out of it (i. e.
    50 best earning employees)
  • Main difficulty is to introduce flexible solution
    to preserve direct access to virtual store objects

53
Conclusion future work
  • Benchmark tests of proposed rank queries optimize
    strategy without SDDS index (using operators
    distribution implemented?)
  • Implementing local query optimization
    architecture using indices
  • Solution for indexing of global data (virtual
    pointers?) architecture, implementation
  • More optimizations efficiently utilizing SDDS

54
Project eGov-busWIDDOK scholarship
  • eGov-Bus, Advanced eGovernment Information
    Service BusVIth Frame Program
  • Stypendium Docelowe w ramach Mechanizmu
    Wspierania Innowacyjnej Dzialalnosci Doktorantów
    (WIDDOK)

55
  • Dziekuje serdecznie
  • za uwage!
Write a Comment
User Comments (0)
About PowerShow.com