1
Large RAC Benchmark - Lessons Learned
Hanan Hit
NOCOUG Feb 14, 2006
2
Today's agenda
  • SUT Definitions
  • Benchmark Goals
  • RAC Mode Used
  • OS (HP-UX) best practices
  • HMP - Hyper Messaging Protocol
  • UDP Usage
  • OS Scheduling
  • Deadlocks
  • Startup & Shutdown
  • Execution Plans
  • ORA Parameters
  • Multiple Block Sizes

3
The Tested Application
  • Type 1 - PowerBuilder front end on Windows,
    Tuxedo middleware, and an Oracle backend.
  • Type 2 - Java on Sun servers using WebLogic as
    the middle tier.
  • Type 3 - Batch processing written in
    Pro*C/Pro*COBOL.
  • All the Oracle database servers are HP servers
    running Oracle 9.2.0.5 on HP-UX 11i v2.

4
SUT
5
SUT (Cont.)
  • Database size: 30 TB
  • Raw devices
  • Non-shared Oracle Home
  • Two full XP 12K storage arrays from HP
  • 82 TB each (1,152 disks, 73 GB 15K RPM)
  • 128 GB cache
  • 120,000 IOPS (maximum random disk performance)

6
Machines
7
Machines (Cont.)
8
Machines (Cont.)
9
Benchmark Goals
  • Manual Load Balancing
  • Online only - 6K Type 1, 3.5K Type 2 (Test 1)
  • All the connections to the RAC instances should
    be balanced.
  • The throughput should be comparable with the
    earlier non-RAC test environment.
  • Online & Batch - 6K Type 1, 3.5K Type 2, 500 Type
    3 (Test 2)
  • All the connections to the RAC instances should
    be balanced.
  • The throughput should be comparable with the
    earlier non-RAC tests.
  • Type 3 timings should be comparable with the
    non-RAC tests.
  • Online only - 12K Type 1, 7K Type 2 (Test 3)
  • All the connections to the RAC instances should
    be balanced.
  • The throughput should increase by 95% compared
    with the earlier Test 1.

10
Benchmark Goals (Cont.)
  • RAC Load Balancing
  • Online only - 6K Type 1, 3.5K Type 2 (Test 1)
  • All the connections to the RAC instances should
    be balanced.
  • The throughput should be comparable with the
    earlier non-RAC test environment.
  • Configuration/Setup
  • Amend /etc/tnsnames.ora with the RAC load
    balancing strings (see the sketch after this
    list).
  • RAC Failover Testing
  • Online only - 6K Type 1, 3.5K Type 2 (Test 1)
  • All the database connections that were connected
    to the RAC3 instance should fail over
    successfully to the surviving RAC1 instance.
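  • A minimal sketch of such a load-balancing alias;
    the alias, host names, port, and service name
    below are placeholders, not the values used in
    the benchmark:

    RACDB =
      (DESCRIPTION =
        (ADDRESS_LIST =
          (LOAD_BALANCE = on)
          (FAILOVER = on)
          (ADDRESS = (PROTOCOL = TCP)(HOST = rac1)(PORT = 1521))
          (ADDRESS = (PROTOCOL = TCP)(HOST = rac3)(PORT = 1521))
        )
        (CONNECT_DATA =
          (SERVICE_NAME = racdb)
          (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC))
        )
      )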

11
RAC - Mode
12
OS Port Specific Best Practices
  • IPC Protocols
  • HP's HMP
  • HP Tru64's RDG (Reliable DataGram)
  • Sun's RSM (Remote Shared Memory)
  • VIA on Intel
  • In general, the recommendation is to use a
    low-latency user-mode IPC rather than UDP or TCP.

13
What is HyperFabric?
  • HyperFabric is a high-speed cluster interconnect
    fabric that supports both industry-standard
    TCP/UDP over IP and HP's proprietary Hyper
    Messaging Protocol (HMP).
  • HyperFabric extends the scalability and
    reliability of TCP/UDP by providing transparent
    load balancing of connection traffic across
    multiple network interface cards (NICs) and
    transparent failover of traffic from one card to
    another without invoking MC/ServiceGuard.
  • The HyperFabric NIC incorporates a network
    processor that implements HP's Hyper Messaging
    Protocol and provides lower latency and lower
    host CPU utilization for standard TCP/UDP
    benchmarks over HyperFabric when compared to
    Gigabit Ethernet.
  • Hewlett-Packard released HyperFabric in 1998 with
    a link rate of 2.56 Gbps over copper. In 2001,
    Hewlett-Packard released HyperFabric 2 with a
    link rate of 4.0 Gbps over fiber, compatible
    with the copper HyperFabric interface. Both
    HyperFabric products support clusters of up to
    64 nodes.

14
HyperFabric Switches
  • Hewlett-Packard provides the fastest cluster
    interconnect via its proprietary HyperFabric
    switches.
  • The latest product is HyperFabric 2, a new set of
    hardware components with fiber connectors that
    enable a low-latency, high-bandwidth system
    interconnect.
  • With fiber interfaces, HyperFabric 2 provides
    speeds of up to 4 Gbps in full duplex over
    distances of up to 200 meters.
  • HyperFabric 2 also provides excellent scalability
    by supporting up to 16 hosts via point-to-point
    connectivity and up to 64 hosts via fabric
    switches. It is backward compatible with previous
    versions of HyperFabric and available on IA-64
    and PA-RISC servers.

15
Hyper Messaging Protocol (HMP)
  • HMP, also known as 'Lowfat', is HP's reliable
    user-mode IPC protocol.
  • Hyper Messaging Protocol significantly expands on
    the feature set provided by TCP/UDP by providing
    a true reliable datagram model for both remote
    direct memory access (RDMA) and traditional
    message semantics.
  • Coupled with OS-bypass capability and the
    hardware support for protocol offload provided by
    HyperFabric, HMP provides high bandwidth, low
    latency and extremely low CPU utilization, with
    an interface and feature set optimized for
    business-critical parallel applications such as
    Oracle 9i RAC.

16
How to Relink Oracle with HMP?
  • OS configuration
  • setprivgrp dba MLOCK
  • To persist over reboots, add to /etc/privgroup:
    dba MLOCK
  • The kernel MAXFILES parameter must be at least
    1024
  • HMP configuration lives in
    /opt/clic/lib/skgxp/skclic.conf
  • HMP relink
  • make -f ins_rdbms.mk rac_on ipc_hms ioracle
  • UDP relink (back to the default)
  • make -f ins_rdbms.mk rac_on ipc_udp ioracle
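  • A sketch of the full relink sequence, assuming
    the usual location of ins_rdbms.mk under
    $ORACLE_HOME:

    # as the oracle software owner, with the instances down
    cd $ORACLE_HOME/rdbms/lib
    # switch the RAC IPC driver to HMP and relink the oracle binary
    make -f ins_rdbms.mk rac_on ipc_hms ioracle
    # to revert to the default UDP driver, relink again
    make -f ins_rdbms.mk rac_on ipc_udp ioracle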

17
HMP Diagnostics
  • Check the alert.log and verify that you are using
    the HMP protocol.
  • netstat -in - check that the clicN interfaces are
    up on all nodes and that the MTU is at least 32K.
  • The cluster_interconnects init.ora (spfile)
    parameter will not show the IP address and
    shouldn't be used.
  • oradebug setmypid
  • oradebug ipc
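  • A sketch of the oradebug check from SQL*Plus (run
    as SYSDBA); the resulting trace file in
    user_dump_dest names the IPC transport the
    instance is using:

    SQL> oradebug setmypid
    Statement processed.
    SQL> oradebug ipc
    Information written to trace file.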

18
HMP Findings
  • Each Oracle shadow process requires 300-400K of
    additional memory.
  • The system couldn't scale the number of
    concurrent connections beyond 1K, and no
    additional connections could be established to
    either of the instances once this threshold had
    been reached.
  • The maximum (single-card) memory was 4 GB (during
    the benchmark timeframe).
  • Once the above threshold was reached we couldn't
    start more than a single instance; the other ones
    were left waiting.
  • Direct connections could still be established to
    the single node that was active.

19
HMP Findings (Cont.)
  • netstat may show that most of the traffic is
    going to a single interface, but this is really
    not a concern: the traffic is load balanced in
    the fabric, and the command is not able to
    observe it.
  • Roll back to UDP over HyperFabric, as this is
    usually more than good enough.

20
UDP Findings
  • Use UDP over HyperFabric, as this has the ability
    to use multiple cards.
  • Because of the failover scenarios we had to be
    able to cope with at least twice the per-node
    load on a single node, hence a minimum of 20K
    concurrent connections.
  • The initial setup failed with ORA-27504 errors
    pointing to OS error 227.
  • The cause was found to be an OS limit on the
    number of anonymous UDP ports, which defaults to
    16,384 (ports 49152 through 65535).
  • The solution is to lower the starting port from
    49152 to 20000:
  • /usr/bin/ndd -set /dev/udp udp_smallest_anon_port
    20000
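  • A sketch of checking and changing the range; on
    HP-UX the setting can be made persistent across
    reboots in /etc/rc.config.d/nddconf:

    # read the current floor of the anonymous UDP port range
    /usr/bin/ndd -get /dev/udp udp_smallest_anon_port
    # lower it so roughly 45K anonymous UDP ports are available
    /usr/bin/ndd -set /dev/udp udp_smallest_anon_port 20000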

21
Scheduling - Process Prioritization
  • Most UNIX OSes implement time-sharing (TS) and
    real-time (RT) scheduling.
  • The default for user processes is TS.
  • With the regular (TS) scheduler, the longer a
    process runs the weaker its priority gets, and
    the chance of its preemption increases.
  • The cost of a context switch is likely to be very
    high, especially for the LMD/LMS processes in RAC.
  • The SCHED_NOAGE scheduler allows processes to
    hold the CPU until they give it up.
  • Starting all processes at the same, highest
    priority of this scheduler removes the
    possibility that they will steal the CPU from
    each other.

22
SCHED_NOAGE
  • Unlike the normal TS policy, a process scheduled
    with SCHED_NOAGE will not have its priority
    increased or decreased, nor will it be preempted
    by priority aging.
  • How to implement
  • As root (lost on reboot):
  • setprivgrp -g dba RTSCHED RTPRIO
  • To persist, enter in /etc/privgroup: dba RTSCHED
    RTPRIO
  • In init.ora (spfile) enter:
  • hpux_sched_noage = 178
  • Check (ps) that the Oracle background processes
    use priority 178.
  • In RAC, make sure this is true on all nodes.
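  • A sketch of the verification step; on every node
    the PRI column of ps -efl should show 178 for the
    Oracle background processes:

    # list the background processes and inspect the PRI column
    ps -efl | grep ora_ | grep -v grep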

23
Deadlock etc.
  • Deadlock situations can escalate into a serious
    issue in a RAC environment.
  • Please make sure to check these issues prior to
    any RAC implementation.
  • Hints that might help:
  • event = "7445 trace name ERRORSTACK level 3"
  • We encountered a situation in which an instance
    crashed due to deadlocks.
  • RAC1 crashed while the databases on the other
    nodes remained up.

24
Deadlock etc. (Cont.)
  • In a RAC environment deadlock detection works
    differently than in non-RAC. In non-RAC it
    updates the alert.log with ORA-60 and generates a
    user trace file, but in a RAC environment it puts
    the deadlock graph only in the LMD trace file
    (background_dump_dest; no user trace file).
  • Since we are dealing with global resources in
    RAC, the deadlock detection mechanism is
    different from a non-RAC environment. In RAC, the
    LMD process periodically checks for deadlocks on
    the database. How often it checks is controlled
    by the parameter _lm_dd_interval. The default is
    60 seconds on most versions / platforms and can
    be checked with the following query:
  • select x.ksppinm, y.ksppstvl
    from x$ksppi x, x$ksppcv y
    where x.indx = y.indx
    and x.ksppinm like '%lm_dd%'
    order by x.ksppinm;
  • Setting _lm_dd_interval too low can cause LMD to
    consume too much CPU. Setting it too high can
    cause applications to hang because deadlocks
    aren't being detected quickly enough. The default
    is sufficient for most cases.

25
Deadlock etc. (Cont.)
  • Deadlocks may occur on the same node or on
    different nodes in the cluster.
  • Diagnosability of ORA-60 errors on a single
    instance is easier than the case where the
    deadlock spans multiple instances (getting the
    SQL statement).
  • For deadlocks spanning multiple instances,
    starting from 9.2.0.6 / 10.1.0.3 Oracle dumps
    additional diagnostics for deadlocks in a RAC
    environment.
  • After checking the deadlock graph in the LMD
    trace file, you can set the following on all
    instances (spfile):
  • event = "60 trace name errorstack level 3; name
    systemstate level 10"

26
Startup & Shutdown
  • Be cautious while performing startup and shutdown
    operations on such large SGAs.
  • Shutdown may take a very long time (half an hour).
  • While performing the procedure below we got a few
    Oracle errors:
  • srvctl start database -d <db_name>
  • Bug 2540942 (LMS may spin in kjctr_rksxp() during
    simultaneous shutdown).
  • During a simultaneous shutdown of several
    instances with large buffer caches there is a
    chance of running out of tickets due to the
    number of messages that need to be sent for
    cleanup purposes. The most visible side effect is
    LMS spinning in kjctr_rksxp(); the ORA-600 [504]
    is a side effect of this.
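  • Given the bug above, a sketch of a serialized
    shutdown that stops one instance at a time
    instead of the whole database at once; the
    database and instance names are placeholders:

    srvctl stop instance -d RACDB -i RACDB1
    srvctl stop instance -d RACDB -i RACDB2
    srvctl stop instance -d RACDB -i RACDB3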

27
Different Execution Plans
  • Node 1 Example

Enter value for hashvalue: 2601636153

SQL_HASH    SQL_TEXT
----------  --------------------------------------------------
2601636153  select members.MEMBER_ID into :b0:b1
            from (select t.MEMBER_ID
                  from UFMI_INV u, RM_MEMBER_ID t
                  where u.URBAN_ID(+) = :b2
                    and u.FLEET_ID(+) = :b3
                    and u.MEMBER_ID(+) = t.MEMBER_ID
                    and u.MEMBER_ID is null
                    and t.MEMBER_ID > :b4
                    and t.MEMBER_ID = t.MEMBER_ID) members
            where ROWNUM ...

Enter value for hashvalue: 2601636153

Plan Table
--------------------------------------------------------------
Operation           Name              Starts
--------------------------------------------------------------
INDEX RANGE SCAN    UFMI_INV_PK       4
SORT JOIN                             4
INDEX RANGE SCAN    RM_MEMBER_ID_PK   142K
MERGE JOIN OUTER
FILTER
FILTER
VIEW                                  142K
COUNT STOPKEY
SELECT STATEMENT
--------------------------------------------------------------
12 rows selected.
28
Different Execution Plans (Cont.)
  • Node 3 Example

Enter value for hashvalue: 2601636153

SQL_HASH    SQL_TEXT
----------  --------------------------------------------------
2601636153  select members.MEMBER_ID into :b0:b1
            from (select t.MEMBER_ID
                  from UFMI_INV u, RM_MEMBER_ID t
                  where u.URBAN_ID(+) = :b2
                    and u.FLEET_ID(+) = :b3
                    and u.MEMBER_ID(+) = t.MEMBER_ID
                    and u.MEMBER_ID is null
                    and t.MEMBER_ID > :b4
                    and t.MEMBER_ID = t.MEMBER_ID) members
            where ROWNUM ...

Enter value for hashvalue: 2601636153

Plan Table
--------------------------------------------------------------
Operation           Name              Starts
--------------------------------------------------------------
INDEX RANGE SCAN    UFMI_INV_PK       1
INDEX RANGE SCAN    RM_MEMBER_ID_PK   1
NESTED LOOPS OUTER
FILTER
FILTER
VIEW                                  1
COUNT STOPKEY
SELECT STATEMENT
--------------------------------------------------------------
11 rows selected.
29
What might be the Reason?
  • The first set of binds on node 1 returned a
    larger result set.
  • Bind Peeking
  • Versions 9i / 10g use bind variable peeking:
  • The CBO makes the assumption that the bind
    variable values used for the first execution of a
    SQL statement are representative of the bind
    variable values to be used in future executions
    of the same SQL statement.
  • Bind variable peeking occurs every time the
    cursor is parsed or loaded.
  • If the data is skewed, the execution plan may
    change when, for example, the shared pool is
    flushed or the underlying objects are analyzed.
  • The optimizer makes different estimates because
    of the different parameters set on each node.
  • Histograms are built on these columns.

30
How to Solve the Issue?
  • Turn off bind peeking (_optim_peek_user_binds =
    false)
  • Pros - will eliminate any bind peeking.
  • Cons - this affects all statements, some of which
    could already be benefiting from bind peeking.
  • Hints (if the data is skewed or if the result
    sets of the binds vary a lot)
  • Pros - always get the same plan, regardless of
    peeked binds.
  • Cons - this would unfortunately require an
    application change.
  • Use Stored Outlines (we chose this option; see
    the sketch below)
  • Pros - no need to change the application; applied
    immediately.
  • Cons - extra DBA maintenance, and any change to
    the query will disable the use of the outline.
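  • A sketch of capturing an outline for a single
    statement; the outline name, category, and
    statement below are hypothetical, and the CREATE
    ANY OUTLINE privilege is required:

    CREATE OR REPLACE OUTLINE member_lookup
      FOR CATEGORY bench ON
      SELECT t.MEMBER_ID FROM RM_MEMBER_ID t
      WHERE t.MEMBER_ID > :b4;
    -- activate the category so parsed statements pick up the outline
    ALTER SYSTEM SET use_stored_outlines = bench;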

31
Hidden Parameters & More
  • _row_cr
  • "Global cache cr request" event - this event is
    generated when an instance has requested a
    consistent-read data block (or undo segment
    header) and the block to be transferred hasn't
    arrived at the requesting instance. Please make
    sure to check V$CR_BLOCK_SERVER.
  • Using _row_cr = TRUE (default FALSE) may reduce
    the number of CR rollbacks and avoid a costly
    cleanup/rollback in RAC. Instead of performing
    block cleanup/rollback, this parameter makes
    Oracle try to generate a CR copy of a given row.
  • Once we used this parameter we hit another bug,
    3577772.8, which we found during a Direct Path
    export.

32
Hidden Parameters & More (Cont.)
  • FAST_START_MTTR_TARGET
  • This parameter guarantees that instance or crash
    recovery will be completed within this time
    frame; when set to a non-zero value it uses
    incremental checkpoints. The default value is 300
    seconds (5 minutes).
  • The customer was using 0 seconds; once we set
    this parameter to the Oracle default we reduced
    the inter-node communication as well as the
    overall I/O on the system. Please make sure to
    amend it to suit your site requirements.
  • _BUMP_HIGHWATER_MARK_COUNT
  • HW enqueue (high water mark) - the high water
    mark of a table may be constantly increased while
    the table is growing and running out of blocks
    for new inserts on the free list. The default
    value for bumping up the HWM is 0, which may be
    too low on a busy system or for the average
    insert size.
  • We increased it to 5 and nearly eliminated this
    enqueue altogether.
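  • A sketch of applying both settings cluster-wide
    through the spfile; underscore parameters must be
    quoted and should normally be changed only under
    Oracle Support's guidance:

    ALTER SYSTEM SET fast_start_mttr_target = 300
      SCOPE=BOTH SID='*';
    ALTER SYSTEM SET "_bump_highwater_mark_count" = 5
      SCOPE=SPFILE SID='*';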

33
Multiple Block Sizes?
  • Single instance was using KEEP buffer pool nearly
    the size as the DEFAULT buffer pool.
  • We tried once to segregate the index from the
    data on a smaller buffer pool (2k for indexes).
  • ORA-00379 no free buffers available in buffer
    pool DEFAULT for block size 2K
  • Due to the un-symmetric machines we encountered
    the above error once the proper buffer pool was
    not defined on all instances.
  • The recommendations - Combine the different
    buffer pool caches into one, no need to use KEEP,
    RECYCLE. Oracle can manage them well.
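  • If a non-default block size is kept anyway, the
    matching cache must be defined on every instance;
    a sketch, with a placeholder size:

    -- define the 2K cache on all instances so any node
    -- can serve 2K blocks
    ALTER SYSTEM SET db_2k_cache_size = 256M SCOPE=BOTH SID='*';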

34
Summary
  • Address application deadlocks. This is especially
    important when moving to RAC.
  • Combine the different buffer caches into one; no
    need to use KEEP or RECYCLE. Oracle can manage
    them well.
  • Move to locally managed tablespaces with ASSM
    (Automatic Segment Space Management).
  • Don't consider using HMP when going to production
    on very large systems. Using the fabric over UDP
    should be enough.
  • Try to use hints once you encounter different
    execution plans (binds).
  • If possible, partition the applications across a
    few physical nodes (only with really high
    concurrency volumes).
  • All bugs have been fixed in newer versions of
    Oracle.
  • Take extra precautions when performing startup
    and shutdown in such large environments.

35
References
  • HP Cluster Install
  • Oracle 9i Best Practices: Performance Diagnostics
    & Tuning
  • http://www.oracle.com/technology/documentation/index.html
    (and search for HMP)

36
  • Thanks