Oracle DB Statistics: What do they actually do and why do I care about the details?

1
Oracle DB Statistics: What do they actually do and
why do I care about the details?
  • Why the optimizer does what it does

2
Scenario
  • The CEO of your organization is waiting for the
    management reports to present to the board.
    He's getting nervous... He really needs those
    reports

3
Scenario
  • The CEO's admin assistant says the reports are
    running slow... something about the database
  • How does the CEO respond?

4
Scenario
  • Did they run database statistics?
  • Why did he say this? Because statistics are
    MAGICAL

5
Goals of the Presentation
  • Discuss what statistics actually collect
  • Where are they stored
  • How are they used
  • And through this, acquire an understanding of
  • How the optimizer works
  • It truly isn't always bad SQL
  • Provide insight into why the optimizer occasionally
    picks a bad plan over a good plan
  • It isn't magical. Just as in most disciplines,
    without a basic understanding of the
    fundamentals, your skills can only reach so high

6
Example
  • drop table PATIENT_LIST;
  • -- Create table w/ 200 rows
  • create table PATIENT_LIST(last_name char(1000),
    first_name char(1000), gender char(1),
    first_preg_date date);
  • insert into patient_list select ' ', '
    ','F',sysdate - rownum from dba_tables where
    rownum < 101;
  • insert into patient_list select ' ', '
    ','M',NULL from dba_tables where rownum < 101;
  • -- create index
  • create index patient_list_ndx on
    patient_list(gender, first_preg_date);
  • -- collect stats
  • exec dbms_stats.gather_table_stats(user,
    'PATIENT_LIST', estimate_percent => 100,
    method_opt => 'FOR ALL COLUMNS SIZE 1', cascade
    => true)
  • set autotrace on
  • select * from patient_list where gender = 'M' and
    first_preg_date is not null;

7
Here is the explain plan
  • The explain plan
  • 0  SELECT STATEMENT Optimizer=CHOOSE (Cost=2
    Card=50 Bytes=750)
  • 1 0  TABLE ACCESS (FULL) OF 'PATIENT_LIST'
    (TABLE) (Cost=2 Card=50 Bytes=750)
  • The CARD (CARDINALITY) is 50. Cardinality is
    the number of rows the optimizer expects to come
    back
  • TEST TIME! What is the explanation? (don't
    worry, it's multiple choice)

8
What does this Mean?
  • (a) MEN REVOLT!!!! We no longer need women to
    reproduce!
  • (b) The optimizer is flawed, file an SR!
  • (c) Statistics simply have their limitations

9
How did Oracle calculate 50?
  • 200 rows in the table
  • 50% have first_preg_date NULL
  • 2 distinct values for gender = 50%
  • 200 rows x 50% x 50% = 50 rows
  • Not rocket science, but it is obviously way off
  • If the table happened to be 10TB, the calculation
    would have occurred the same way
  • Sounds like a mathematical/statistical
    limitation
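The arithmetic on this slide can be sketched directly. This is a simplified model of the optimizer's independent-selectivity math, not Oracle's internal code:

```python
# Simplified model of the optimizer's estimate for:
#   where gender = 'M' and first_preg_date is not null
num_rows = 200
num_nulls_first_preg_date = 100   # half the rows have a NULL date
num_distinct_gender = 2

sel_not_null = (num_rows - num_nulls_first_preg_date) / num_rows  # 0.5
sel_gender = 1 / num_distinct_gender                              # 0.5

# Selectivities are multiplied as if the columns were independent.
estimated_card = round(num_rows * sel_not_null * sel_gender)
print(estimated_card)  # 50, even though the real row count is 0
```

The two predicates are perfectly correlated in this data (all men have a NULL date), but basic column stats cannot express that, so the independence assumption produces Card=50.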

10
Limitation? Oracle should fix that
  • Statistics collect only very rudimentary
    metadata about tables, indexes and table columns
  • To blindly widen the net could cause your
    statistical metadata to grow exponentially,
    slowing the optimizer significantly (not
    to mention the stats job) for the sake of only a
    relatively small number of queries

11
In a nutshell, how does the optimizer work?
  • Upon a hard parse, Oracle comes up with a large
    number of possible execution plans
  • The optimizer simply uses basic mathematical
    statistics (and limited Oracle proprietary
    algorithms) to assign cardinalities to each step
    of these execution plans (stats are used here)
  • Then the optimizer assigns a cost to each step
    (how much work to collect that cardinality)
    (stats are used here)
  • Then for each plan, the optimizer sums the total
    cost and picks the plan with the lowest number
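The loop above can be reduced to a toy sketch. Plan names and cost numbers here are purely illustrative, not Oracle internals:

```python
# Candidate plans, each a list of (step, cost) pairs.
# Statistics are what feed the per-step cost numbers.
plans = {
    "full table scan": [("TABLE ACCESS FULL", 2)],
    "index range scan": [("INDEX RANGE SCAN", 2),
                         ("TABLE ACCESS BY INDEX ROWID", 3)],
}

def total_cost(steps):
    # Sum the per-step costs; the optimizer keeps the cheapest total.
    return sum(cost for _name, cost in steps)

best_plan = min(plans, key=lambda p: total_cost(plans[p]))
print(best_plan)  # full table scan (total 2 beats total 5)
```

The key point for the rest of the deck: if bad stats make a cardinality wrong, every cost derived from it is wrong, and the "lowest number" can belong to a terrible plan.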

12
What are Table Level Statistics
  • ALL_TABLES
  • The Biggies
  • NUM_ROWS - Number of rows in the table
  • BLOCKS - Number of used data blocks in the table
  • Considerations
  • DEGREE - Number of threads per instance for
    scanning the table (parallelism)
  • PARTITIONED - Indicates whether this table is
    partitioned. Set to YES if it is partitioned.
  • AVG_ROW_LEN - Average length of a row in the
    table (in bytes)

13
What are Column Statistics
  • This is where method_opt comes in
  • FOR ALL COLUMNS
  • FOR ALL INDEXED COLUMNS
  • ALL_TAB_COL_STATISTICS
  • The Biggies
  • NUM_DISTINCT - Number of distinct values in the
    column
  • LOW_VALUE RAW(32) - Low value in the column
  • HIGH_VALUE RAW(32) - High value in the column
  • NUM_NULLS - Number of nulls in the column
  • Considerations
  • AVG_COL_LEN - Average length of the column
    (in bytes)
  • HISTOGRAM - Indicates existence/type of histogram
    (we shall discuss this later)
  • Calculated based on other stats
  • DENSITY - Density of the column
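A minimal sketch of how DENSITY usually relates to the other column stats when no histogram exists (an assumed simplification, not Oracle's exact formula):

```python
# With no histogram, density is typically 1/NUM_DISTINCT, and an
# equality predicate is estimated as density * (non-null rows).
num_rows = 200
num_distinct = 2
num_nulls = 0

density = 1 / num_distinct
est_equality_card = round((num_rows - num_nulls) * density)
print(density, est_equality_card)  # 0.5 100
```

Once a histogram exists, DENSITY is computed differently, which is one reason histograms change equality estimates so dramatically.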

14
What is out of Range?
  • This can occur when a where clause predicate
    on a column is less than LOW_VALUE or greater
    than HIGH_VALUE
  • Can be caused by
  • big data shifts dumped into the table after the
    most recent analyze
  • Sample rate too low
  • Problem can appear random when using bind
    variables
  • Nice to know, but the values of LOW_VALUE and
    HIGH_VALUE in ALL_TAB_COL_STATISTICS
    are RAW data type; how do I translate?

15
Why do I care about Out of Range
  • The further you get out of range, the lower the
    expected cardinality
  • The optimizer can greatly underestimate the
    cardinality of the rows coming back causing a
    suboptimal plan (plan starts in the wrong place,
    wrong index is used)
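One way to picture the decay described above, as an illustrative linear model (Oracle's exact out-of-range arithmetic differs by version):

```python
def est_card(value, low, high, num_rows, num_distinct):
    # In-range equality estimate: num_rows / num_distinct.
    base = num_rows / num_distinct
    if low <= value <= high:
        return base
    # Out of range: prorate the estimate down linearly with the
    # distance past the known range, never dropping below 1 row.
    distance = value - high if value > high else low - value
    frac = max(0.0, 1.0 - distance / (high - low))
    return max(1.0, base * frac)

print(est_card(50, 1, 100, 1000, 10))   # in range: 100.0
print(est_card(150, 1, 100, 1000, 10))  # part-way out: roughly halved
print(est_card(300, 1, 100, 1000, 10))  # far out: floor of 1.0
```

The further the predicate value drifts past HIGH_VALUE (for example, dates newer than the last analyze), the closer the estimate falls toward 1 row, and a 1-row estimate invites exactly the wrong-index plans this slide warns about.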

16
How do you detect Out of Range Issues
  • This is tough. If you keep your stats from
    getting stale, look at the slow SQL plan in
    v$sql_plan. Look at the cardinalities; if one
    appears way off, this might be it (but just
    because you don't see the symptoms doesn't mean
    this isn't the problem). (Much worse w/ binds.)
  • When using binds, try to extract the first-parse
    bind variables from the data dictionary or AWR.
    You may have to trust your gut.

17
LOW_VALUE - HIGH_VALUE conversion
  • create or replace function display_raw (rawval
    raw, type varchar2)
  • return varchar2
  • is
  • cn number;
  • cv varchar2(32);
  • cd date;
  • cnv nvarchar2(32);
  • cr rowid;
  • cc char(32);
  • begin
  • if (type = 'NUMBER') then
  • dbms_stats.convert_raw_value(rawval, cn);
  • return to_char(cn);
  • elsif (type = 'VARCHAR2') then
  • dbms_stats.convert_raw_value(rawval, cv);
  • return to_char(cv);
  • elsif (type = 'DATE') or (type LIKE
    'TIMESTAMP%') then
  • dbms_stats.convert_raw_value(rawval, cd);
  • return to_char(cd);
  • elsif (type = 'NVARCHAR2') then
  • dbms_stats.convert_raw_value_nvarchar(rawval, cnv);
  • return to_char(cnv);
  • elsif (type = 'ROWID') then
  • dbms_stats.convert_raw_value_rowid(rawval, cr);
  • return to_char(cr);
  • elsif (type = 'CHAR') then
  • dbms_stats.convert_raw_value(rawval, cc);
  • return cc;
  • else
  • return 'UNKNOWN DATATYPE';
  • end if;
  • end;

18
Here is the SQL
  • select
  • a.column_name,
  • display_raw(a.low_value,b.data_type) as
    low_val,
  • display_raw(a.high_value,b.data_type) as
    high_val,
  • b.data_type, a.NUM_DISTINCT, a.DENSITY,
    a.NUM_NULLS
  • from
  • all_tab_col_statistics a, all_tab_cols b
  • where
    a.table_name = 'ACAT_CTRL_FACT' and
    a.table_name = b.table_name and
    a.column_name = b.column_name and
    a.owner = b.owner

19
What are Index Stats
  • ALL_IND_STATISTICS
  • BLEVEL - B-tree level
  • LEAF_BLOCKS - Number of leaf blocks in the index
  • DISTINCT_KEYS - Number of distinct keys in the
    index key
  • CLUSTERING_FACTOR - Indicates the amount of order
    of the rows in the table based on the values of
    the index.
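These four stats feed the classic index range-scan costing formula, roughly blevel + leaf_blocks x selectivity + clustering_factor x selectivity. A sketch (illustrative numbers) of why CLUSTERING_FACTOR matters so much:

```python
def index_range_scan_cost(blevel, leaf_blocks, clustering_factor, sel):
    # Walk the branch blocks, scan a fraction of the leaf blocks, then
    # visit table blocks -- the last term is where clustering dominates.
    return blevel + leaf_blocks * sel + clustering_factor * sel

# Same index shape and 1% selectivity; only the row ordering differs.
well_clustered = index_range_scan_cost(2, 1000, 5_000, 0.01)
badly_clustered = index_range_scan_cost(2, 1000, 500_000, 0.01)
print(well_clustered, badly_clustered)  # 62.0 vs 5012.0
```

Two indexes returning the same rows can differ in cost by two orders of magnitude purely because of how the table rows are ordered relative to the index.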

20
Why do I have to sometimes run stats in the middle
of the day to fix a query?
  • Likely candidates
  • Cursor invalidation (for reused SQL)
  • Simply a bad peeked bind variable
  • Out of range issue due to new data or a stats
    sampling issue
  • Significant data change at the column level
    (update) since last analyze
  • Significant change in the number of rows (or
    table size), typically because stats haven't
    been run for a long time

21
Statistics Types on Partitioned tables
  • Statistics are collected two ways on a
    partitioned table (or 3 on subpartitioned)
  • Global (Same as a normal table)
  • ALL_TABLES
  • ALL_TAB_COL_STATISTICS
  • ALL_TAB_HISTOGRAMS
  • ALL_INDEXES
  • Partitioned
  • ALL_TAB_PARTITIONS
  • ALL_PART_COL_STATISTICS
  • ALL_PART_HISTOGRAMS
  • ALL_IND_PARTITIONS
  • Subpartitions
  • ALL_SUBPART_COL_STATISTICS
  • ALL_SUBPART_HISTOGRAMS
  • ALL_TAB_SUBPARTITIONS

22
Why so many statistics on Partitioned Tables
  • When Oracle receives a SQL statement that
    references a partitioned table, it will use the
    values in the where clause to use the lowest
    degree of stats possible
  • Select * from part_table (global stats)
  • Select * from part_table where partitioned_key =
    :X (partition stats)
  • Select * from part_table where sub_partitioned_key =
    :X (subpartition stats)
  • Select * from part_table where partitioned_key in
    (:X, :Y) (global stats)

23
Gotchas with stats on partitioned tables
  • If you use bind variables (i.e. partitioned_key =
    :X and gender = :Y), the database will find
    the best plan for that value for the data in
    :X's partition. Maybe that is a bad plan in
    another partition.
  • Imagine if the first hard-parsed query in the
    morning hits an empty partition (cheap FTS).
    Then the next query hits a big partition. Same
    plan (supposedly some work is done on this issue
    in 11G)

24
Gotchas with stats on partitioned tables
  • Imagine a credit card transaction table partitioned
    by load date (1000 days)
  • Each load date will contain 4 partial transaction
    dates
  • No histograms
  • Local stats on any given partition say 4 distinct
    values for load date
  • Global stats say 1004 distinct values
  • Index on load date
  • All input values are in range

  • Select * from CC_TRANS where load_date = :X and
    trans_date = :Y
  • Optimizer says FTS (full partition scan) best plan -
    GOOD
  • Select * from CC_TRANS where load_date in (:X, :Z)
    and trans_date = :Y
  • Optimizer says index lookup best plan - BAD
  • Select * from CC_TRANS where load_date > :X
    and trans_date = :Y
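How the IN-list flips to a bad index plan can be sketched numerically. All row counts below are illustrative; the sketch assumes the optimizer multiplies the two selectivities as if independent, and reads the 1004 distinct values as applying to the transaction-date column:

```python
num_rows = 1_000_000_000
rows_per_partition = num_rows // 1000          # 1000 daily partitions

# Global-stats estimate: independent selectivities multiplied together.
sel_load_date_in_two = 2 / 1000
sel_trans_date = 1 / 1004
estimated = round(num_rows * sel_load_date_in_two * sel_trans_date)

# Reality: trans_date and load_date are correlated, so inside the two
# chosen partitions that trans_date covers roughly a quarter of the rows.
actual = rows_per_partition // 4

print(estimated, actual)  # ~2,000 estimated vs 250,000 actual
```

With an estimate of a couple of thousand rows the index lookup looks cheap; fetching a quarter of a million scattered rows through it is anything but.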

25
Out of Range issues
  • Partitioned tables tend to be big tables, so
    adding 1M rows may not be a big change (stale).
  • If you are adding a new partition every day
    (partitioned by transaction date), every day the
    max(transaction date) for the global table grows.
    Global stats don't know. Every day it will grow
    further out of range. Execution plans
    against global stats may perform terribly when
    SQL tries to hit recent out-of-range values.
  • In 10.2.0.4 and after, a new copy table statistics
    procedure is available to counteract some of
    these issues

26
Histograms what are they
  • Dan Hotka: "Histograms are a cost-based
    optimizer (CBO) feature that allows for Oracle to
    see the possible number of values of a particular
    column. This is known as data skewing and
    histograms can track the number of occurrences of
    a particular data value when deciding on what
    type of index to use or even whether to use an
    index at all. Histograms give the CBO the ability
    to see the uniqueness of a particular data value."

27
Histogram metadata
  • ALL_TAB_HISTOGRAMS
  • ENDPOINT_NUMBER - Histogram bucket number
  • ENDPOINT_VALUE - Normalized endpoint value for
    this bucket
  • ENDPOINT_ACTUAL_VALUE - Actual (not normalized)
    string value of the endpoint for this bucket

28
Lets give Histograms a look
  • We will create a frequency-based histogram
  • create table gender as select 'M' gender from
    dba_objects;
  • insert into gender select 'F' from dba_objects
    where rownum < 10;
  • commit;
  • EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname =>
    user, tabname => 'GENDER', ESTIMATE_PERCENT
    => 100, method_opt => 'FOR ALL COLUMNS SIZE 10')
  • select num_rows from all_tables where owner =
    user and table_name = 'GENDER';
  • Result: 36769

29
Histograms, still looking
  • SELECT column_name,
  •        ENDPOINT_NUMBER,
  •        endpoint_value,
  •        ENDPOINT_ACTUAL_VALUE,
  •        SUBSTR (SYSTEM.hexstr (endpoint_value),
           1, 30)
  • FROM all_tab_histograms a
  • WHERE table_name = 'GENDER' AND owner = USER
  • hexstr function comes from asktom.oracle.com

30
Histogram Explain Plans
  • set echo on
  • set autotrace traceonly explain
  • SQL> select * from gender where gender = 'F'
  • Execution Plan
  • ----------------------------------------------------------
  • 0  SELECT STATEMENT Optimizer=CHOOSE
    (Cost=19 Card=9 Bytes=18)
  • 1 0  TABLE ACCESS (FULL) OF 'GENDER'
    (TABLE) (Cost=19 Card=9 Bytes=18)
  • SQL> select * from gender where gender = 'M'
  • Execution Plan
  • ----------------------------------------------------------
  • 0  SELECT STATEMENT Optimizer=CHOOSE
    (Cost=19 Card=36760 Bytes=73520)
  • 1 0  TABLE ACCESS (FULL) OF 'GENDER'
    (TABLE) (Cost=19 Card=36760 Bytes=73520)

31
Histograms: two types
  • Frequency based (less than 255 sampled values)
  • Army personnel table
  • Female: 5%
  • Male: 95%
  • Height based (not bound by number of distinct values)
  • Bell-shaped curve distribution, i.e. IQ scale
  • Say (making this up)
  • 3% of adults have an IQ between 0 and 70
  • 47% of adults have an IQ between 70 and 100
  • 47% of adults have an IQ between 100 and 130
  • 3% of adults have an IQ between 130 and 200
  • Oracle's implementation will break up the numbers
    evenly by the range (for instance, with 10 buckets, 10%
    of the rows will be in each bucket)
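A frequency histogram is essentially a per-value row count, which is why the GENDER plans earlier show Card=9 and Card=36760. A minimal sketch of the idea (the fallback value for unseen data is illustrative):

```python
# Value -> row count, as a frequency histogram on GENDER would record it.
histogram = {"F": 9, "M": 36760}

def est_card(value):
    # A value the histogram has never seen gets a tiny fallback estimate.
    return histogram.get(value, 1)

print(est_card("F"), est_card("M"), est_card("X"))  # 9 36760 1
```

With one bucket per distinct value, skewed equality predicates are estimated almost exactly; the trouble starts with values the histogram never sampled.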

32
Histograms in 10G
  • 10G defaults to SIZE AUTO in method_opt
  • This means there is an algorithm in Oracle to
    view recent behavior in the database to determine
    if histograms are needed
  • I believe 10G's SIZE AUTO is too aggressive in
    creating histograms. It creates histograms on too
    many columns
  • And histogram collection time is significant and
    resource intensive

33
10G Histogram Oddities
  • I have seen 10G put height-based histograms on
    last name columns
  • If histograms tell us there are 1M rows with
    last names between Pallack and Phillips, how can
    histograms help the optimizer guess how many
    Penellopeys (uncommon) there are?
  • Clearly there are common last names, but frequency-
    based histograms can't hold them (too many), and
    height based don't make sense.

34
Histogram Bind Variables
  • The combination of histograms and bind variable
    usage can get you in trouble (11G has some
    advances in this)
  • The first hard parse determines your plan. What
    if the bind variable value references infrequent
    skewed data which is not used again, and subsequent
    queries go after data on the other side of the
    skew?
  • I'm not saying lack of histograms doesn't have
    its downfalls, but the existence of histograms
    can cause large volatility, which brings the
    phone calls.
  • If there is a small, discrete set of values, consider
    using literals in the where predicate for histogram
    columns for the best plan, every time.
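The bind-peeking trap can be sketched as a plan cache keyed only by the SQL text. Names and thresholds here are illustrative; real cursor sharing is far more involved:

```python
plan_cache = {}
histogram = {"F": 9, "M": 36760}   # skewed column, as in the GENDER demo

def get_plan(sql_text, bind_value):
    if sql_text not in plan_cache:
        # Hard parse: peek at THIS bind value and pick a plan for it.
        rows = histogram.get(bind_value, 1)
        plan_cache[sql_text] = "INDEX SCAN" if rows < 1000 else "FULL SCAN"
    # Soft parse: the cached plan is reused; the new bind is ignored.
    return plan_cache[sql_text]

first = get_plan("select ... where gender = :b", "F")  # rare value parses first
later = get_plan("select ... where gender = :b", "M")  # common value reuses plan
print(first, later)  # INDEX SCAN INDEX SCAN
```

Whichever value happens to hard parse first wins the plan for everyone, which is exactly the volatility described above.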

35
But 10G takes care of Stale Stats
  • Oracle documentation: "If a monitored table has
    been modified more than 10%, then these
    statistics are considered stale and gathered
    again"
  • A good solid rule for 80% of all tables.
  • But staleness truly cannot be measured by the number
    of modified rows alone. Internal data swings
    (most likely on bigger tables) can cause HUGE
    issues. Consider a transaction table with a new
    distinct ORDR CODE for every day.
  • If there is a frequency histogram on this column,
    a new ORDR CODE will not be in it. The
    optimizer, using the histogram data in
    combination with the column's distinct values
    (all_tab_cols) and the number of rows in
    all_tables, may guess as low as 1 row.
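Both points above can be sketched together: one new day of rows never trips the 10% rule on a big table, yet the new code is invisible to the frequency histogram. Numbers are illustrative:

```python
def is_stale(num_rows, rows_modified, threshold=0.10):
    # The 10G staleness test: more than 10% of rows modified.
    return rows_modified > threshold * num_rows

# Frequency histogram on the ORDR CODE column before today's load.
histogram = {"DAY_001": 400_000, "DAY_002": 410_000}

def est_card(value):
    return histogram.get(value, 1)   # unseen value -> near-zero estimate

print(is_stale(10_000_000, 410_000))  # False: only ~4% modified, not stale
print(est_card("DAY_003"))            # 1: today's new code looks tiny
```

The table is "not stale" by the rule, yet every query against today's data is planned off a 1-row guess.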

36
Statistics Guidelines
  • Many tables feed batch processes. If
    possible, the app teams should run stats after
    each batch.
  • Partitioned tables should collect global stats
    each week
  • Tables partitioned by month should additionally
    collect global stats immediately after the first
    load of monthly data.
  • If you periodically find bad plans based on
    wrong statistics, even after an analyze, up the
    sample size.
  • Sure, 10G has a default daily stats job, but don't
    hesitate to use your own