Oracle DB Statistics: What do they actually do and why do I care about the details?

1
Oracle DB Statistics: What do they actually do and
why do I care about the details?
  • Why the optimizer does what it does

2
Scenario
  • The CEO of your organization is waiting for the
    management reports to present to the board.
    He's getting nervous... He really needs those
    reports

3
Scenario
  • The CEO's admin assistant says the reports are
    running slow... something about the database
  • How does the CEO respond?

4
Scenario
  • Did they run database statistics?
  • Why did he say this? Because statistics are
    MAGICAL

5
Goals of the Presentation
  • Discuss what statistics actually collect
  • Where are they stored
  • How are they used
  • And through this, acquire an understanding of
  • How the optimizer works
  • It truly isn't always bad SQL
  • Provide insight into why the optimizer occasionally
    picks a bad plan over a good plan
  • It isn't magical. Just as in most disciplines,
    without a basic understanding of the
    fundamentals, your skills can only reach so high

6
Example
  • drop table PATIENT_LIST;
  • -- Create table w/ 200 rows
  • create table PATIENT_LIST(last_name char(1000),
    first_name char(1000), gender char(1),
    first_preg_date date);
  • insert into patient_list select ' ', '
    ','F',sysdate - rownum from dba_tables where
    rownum < 101;
  • insert into patient_list select ' ', '
    ','M',NULL from dba_tables where rownum < 101;
  • -- create index
  • create index patient_list_ndx on
    patient_list(gender, first_preg_date);
  • -- collect stats
  • exec dbms_stats.gather_table_stats(user,
    'PATIENT_LIST', estimate_percent => 100,
    method_opt => 'FOR ALL COLUMNS SIZE 1', cascade
    => true)
  • set autotrace on
  • select * from patient_list where gender = 'M' and
    first_preg_date is not null;

7
Here is the explain plan
  • The explain plan
  • 0  SELECT STATEMENT Optimizer=CHOOSE (Cost=2
    Card=50 Bytes=750)
  • 1 0  TABLE ACCESS (FULL) OF 'PATIENT_LIST'
    (TABLE) (Cost=2 Card=50 Bytes=750)
  • The CARD (CARDINALITY) is 50. Cardinality is
    the number of rows the optimizer expects to come
    back
  • TEST TIME! What is the explanation? (don't
    worry, it's multiple choice)

8
What does this Mean?
  • (a) MEN REVOLT!!!! We no longer need women to
    reproduce!
  • (b) The optimizer is flawed, file an SR!
  • (c) Statistics simply have their limitations

9
How did Oracle calculate 50?
  • 200 rows in the table
  • 50% have first_preg_date NULL
  • 2 distinct values for gender = 50%
  • 200 rows x 50% x 50% = 50 rows
  • Not rocket science, but it is obviously way off
  • If the table happened to be 10TB, the calculation
    would have occurred the same way
  • Sounds like a mathematical/statistical
    limitation
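The arithmetic on this slide can be sketched directly. This is a simplified model of the optimizer's independent-selectivity math, not Oracle's internal code:

```python
# Simplified model of the optimizer's estimate for:
#   where gender = 'M' and first_preg_date is not null
num_rows = 200
num_nulls_first_preg_date = 100   # half the rows have a NULL date
num_distinct_gender = 2

sel_not_null = (num_rows - num_nulls_first_preg_date) / num_rows  # 0.5
sel_gender = 1 / num_distinct_gender                              # 0.5

# Selectivities are multiplied as if the columns were independent.
estimated_card = round(num_rows * sel_not_null * sel_gender)
print(estimated_card)  # 50, even though the real row count is 0
```

The two predicates are perfectly correlated in this data (all men have a NULL date), but basic column stats cannot express that, so the independence assumption produces Card=50.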

10
Limitation? Oracle should fix that
  • Statistics collect only very rudimentary
    metadata about tables, indexes and table columns
  • To blindly widen the net could cause your
    statistical metadata to grow exponentially,
    slowing the optimizer significantly (not
    to mention the stats job) for the sake of only a
    relatively small number of queries

11
In a nutshell, how does the optimizer work?
  • Upon a hard parse, Oracle comes up with a large
    number of possible execution plans
  • The optimizer simply uses basic mathematical
    statistics (and limited Oracle proprietary
    algorithms) to assign cardinalities to each step
    of these execution plans (stats are used here)
  • Then the optimizer assigns a cost to each step
    (how much work to collect that cardinality)
    (stats are used here)
  • Then for each plan, the optimizer sums the total
    cost and picks the plan with the lowest number
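The loop above can be reduced to a toy sketch. Plan names and cost numbers here are purely illustrative, not Oracle internals:

```python
# Candidate plans, each a list of (step, cost) pairs.
# Statistics are what feed the per-step cost numbers.
plans = {
    "full table scan": [("TABLE ACCESS FULL", 2)],
    "index range scan": [("INDEX RANGE SCAN", 2),
                         ("TABLE ACCESS BY INDEX ROWID", 3)],
}

def total_cost(steps):
    # Sum the per-step costs; the optimizer keeps the cheapest total.
    return sum(cost for _name, cost in steps)

best_plan = min(plans, key=lambda p: total_cost(plans[p]))
print(best_plan)  # full table scan (total 2 beats total 5)
```

The key point for the rest of the deck: if bad stats make a cardinality wrong, every cost derived from it is wrong, and the "lowest number" can belong to a terrible plan.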

12
What are Table Level Statistics
  • ALL_TABLES
  • The Biggies
  • NUM_ROWS - Number of rows in the table
  • BLOCKS - Number of used data blocks in the table
  • Considerations
  • DEGREE - Number of threads per instance for
    scanning the table (parallelism)
  • PARTITIONED - Indicates whether this table is
    partitioned. Set to YES if it is partitioned.
  • AVG_ROW_LEN - Average length of a row in the
    table (in bytes)

13
What are Column Statistics
  • This is where method_opt comes in
  • FOR ALL COLUMNS
  • FOR ALL INDEXED COLUMNS
  • ALL_TAB_COL_STATISTICS
  • The Biggies
  • NUM_DISTINCT - Number of distinct values in the
    column
  • LOW_VALUE RAW(32) - Low value in the column
  • HIGH_VALUE RAW(32) - High value in the column
  • NUM_NULLS - Number of nulls in the column
  • Considerations
  • AVG_COL_LEN - Average length of the column
    (in bytes)
  • HISTOGRAM - Indicates existence/type of histogram
    (we shall discuss this later)
  • Calculated based on other stats
  • DENSITY - Density of the column
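A minimal sketch of how DENSITY usually relates to the other column stats when no histogram exists (an assumed simplification, not Oracle's exact formula):

```python
# With no histogram, density is typically 1/NUM_DISTINCT, and an
# equality predicate is estimated as density * (non-null rows).
num_rows = 200
num_distinct = 2
num_nulls = 0

density = 1 / num_distinct
est_equality_card = round((num_rows - num_nulls) * density)
print(density, est_equality_card)  # 0.5 100
```

Once a histogram exists, DENSITY is computed differently, which is one reason histograms change equality estimates so dramatically.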

14
What is out of Range?
  • This can occur when a where clause predicate
    on a column is less than LOW_VALUE or greater
    than HIGH_VALUE
  • Can be caused by
  • big data shifts dumped into the table after the
    most recent analyze
  • Sample rate too low
  • Problem can appear random when using bind
    variables
  • Nice to know, but the values of LOW_VALUE and
    HIGH_VALUE in ALL_TAB_COL_STATISTICS
    are RAW data type; how do I translate?

15
Why do I care about Out of Range
  • The further you get out of range, the lower the
    expected cardinality
  • The optimizer can greatly underestimate the
    cardinality of the rows coming back causing a
    suboptimal plan (plan starts in the wrong place,
    wrong index is used)
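One way to picture the decay described above, as an illustrative linear model (Oracle's exact out-of-range arithmetic differs by version):

```python
def est_card(value, low, high, num_rows, num_distinct):
    # In-range equality estimate: num_rows / num_distinct.
    base = num_rows / num_distinct
    if low <= value <= high:
        return base
    # Out of range: prorate the estimate down linearly with the
    # distance past the known range, never dropping below 1 row.
    distance = value - high if value > high else low - value
    frac = max(0.0, 1.0 - distance / (high - low))
    return max(1.0, base * frac)

print(est_card(50, 1, 100, 1000, 10))   # in range: 100.0
print(est_card(150, 1, 100, 1000, 10))  # part-way out: roughly halved
print(est_card(300, 1, 100, 1000, 10))  # far out: floor of 1.0
```

The further the predicate value drifts past HIGH_VALUE (for example, dates newer than the last analyze), the closer the estimate falls toward 1 row, and a 1-row estimate invites exactly the wrong-index plans this slide warns about.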

16
How do you detect Out of Range Issues
  • This is tough. If you keep your stats from
    getting stale, look at the slow SQL plan in
    v$sql_plan. Look at the cardinalities; if one
    appears way off, this might be it (but just
    because you don't see the symptoms doesn't mean
    this isn't the problem). (Much worse w/ binds.)
  • When using binds, try to extract the first-parse
    bind variables from the data dictionary or AWR.
    You may have to trust your gut.

17
LOW_VALUE - HIGH_VALUE conversion
  • create or replace function display_raw (rawval
    raw, type varchar2)
  • return varchar2
  • is
  • cn number;
  • cv varchar2(32);
  • cd date;
  • cnv nvarchar2(32);
  • cr rowid;
  • cc char(32);
  • begin
  • if (type = 'NUMBER') then
  • dbms_stats.convert_raw_value(rawval, cn);
  • return to_char(cn);
  • elsif (type = 'VARCHAR2') then
  • dbms_stats.convert_raw_value(rawval, cv);
  • return to_char(cv);
  • elsif (type = 'DATE') or (type LIKE
    'TIMESTAMP%') then
  • dbms_stats.convert_raw_value(rawval, cd);
  • return to_char(cd);
  • elsif (type = 'NVARCHAR2') then
  • dbms_stats.convert_raw_value_nvarchar(rawval, cnv);
  • return to_char(cnv);
  • elsif (type = 'ROWID') then
  • dbms_stats.convert_raw_value_rowid(rawval, cr);
  • return to_char(cr);
  • elsif (type = 'CHAR') then
  • dbms_stats.convert_raw_value(rawval, cc);
  • return cc;
  • else
  • return 'UNKNOWN DATATYPE';
  • end if;
  • end;

18
Here is the SQL
  • select
  • a.column_name,
  • display_raw(a.low_value,b.data_type) as
    low_val,
  • display_raw(a.high_value,b.data_type) as
    high_val,
  • b.data_type, a.NUM_DISTINCT, a.DENSITY,
    a.NUM_NULLS
  • from
  • all_tab_col_statistics a, all_tab_cols b
  • where
    a.table_name = 'ACAT_CTRL_FACT' and
    a.table_name = b.table_name and
    a.column_name = b.column_name and
    a.owner = b.owner

19
What are Index Stats
  • ALL_IND_STATISTICS
  • BLEVEL - B-tree level
  • LEAF_BLOCKS - Number of leaf blocks in the index
  • DISTINCT_KEYS - Number of distinct keys in the
    index key
  • CLUSTERING_FACTOR - Indicates the amount of order
    of the rows in the table based on the values of
    the index.
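These four stats feed the classic index range-scan costing formula, roughly blevel + leaf_blocks x selectivity + clustering_factor x selectivity. A sketch (illustrative numbers) of why CLUSTERING_FACTOR matters so much:

```python
def index_range_scan_cost(blevel, leaf_blocks, clustering_factor, sel):
    # Walk the branch blocks, scan a fraction of the leaf blocks, then
    # visit table blocks -- the last term is where clustering dominates.
    return blevel + leaf_blocks * sel + clustering_factor * sel

# Same index shape and 1% selectivity; only the row ordering differs.
well_clustered = index_range_scan_cost(2, 1000, 5_000, 0.01)
badly_clustered = index_range_scan_cost(2, 1000, 500_000, 0.01)
print(well_clustered, badly_clustered)  # 62.0 vs 5012.0
```

Two indexes returning the same rows can differ in cost by two orders of magnitude purely because of how the table rows are ordered relative to the index.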

20
Why do I have to sometimes run stats in the middle
of the day to fix a query?
  • Likely candidates
  • Cursor invalidation (for reused SQL)
  • Simply a bad peeked bind variable
  • Out of range issue due to new data or a stats
    sampling issue
  • Significant data change at the column level
    (update) since last analyze
  • Significant change in the number of rows (or
    table size), typically because stats haven't
    been run for a long time

21
Statistics Types on Partitioned tables
  • Statistics are collected two ways on a
    partitioned table (or 3 on subpartitioned)
  • Global (Same as a normal table)
  • ALL_TABLES
  • ALL_TAB_COL_STATISTICS
  • ALL_TAB_HISTOGRAMS
  • ALL_INDEXES
  • Partitioned
  • ALL_TAB_PARTITIONS
  • ALL_PART_COL_STATISTICS
  • ALL_PART_HISTOGRAMS
  • ALL_IND_PARTITIONS
  • Subpartitions
  • ALL_SUBPART_COL_STATISTICS
  • ALL_SUBPART_HISTOGRAMS
  • ALL_TAB_SUBPARTITIONS

22
Why so many statistics on Partitioned Tables
  • When Oracle receives a SQL statement that
    references a partitioned table, it will use the
    values in the where clause to use the lowest
    degree of stats possible
  • Select * from part_table (global stats)
  • Select * from part_table where partitioned_key =
    :X (partition stats)
  • Select * from part_table where sub_partitioned_key =
    :X (subpartition stats)
  • Select * from part_table where partitioned_key in
    (:X, :Y) (global stats)

23
Gotchas with stats on partitioned tables
  • If you use bind variables (i.e. partitioned_key =
    :X and gender = :Y), the database will find
    the best plan for that value for the data in
    :X's partition. Maybe that is a bad plan in
    another partition.
  • Imagine if the first hard-parsed query in the
    morning hits an empty partition (cheap FTS).
    Then the next query hits a big partition. Same
    plan (supposedly some work is done on this issue
    in 11G)

24
Gotchas with stats on partitioned tables
  • Imagine a credit card transaction table partitioned
    by load date (1000 days)
  • Each load date will contain 4 partial transaction
    dates
  • No histograms
  • Local stats on any given partition say 4 distinct
    values for load date
  • Global stats say 1004 distinct values
  • Index on load date
  • All input values are in range

  • Select * from CC_TRANS where load_date = :X and
    trans_date = :Y
  • Optimizer says FTS (full partition scan) best plan -
    GOOD
  • Select * from CC_TRANS where load_date in (:X, :Z)
    and trans_date = :Y
  • Optimizer says index lookup best plan - BAD
  • Select * from CC_TRANS where load_date > :X
    and trans_date = :Y
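How the IN-list flips to a bad index plan can be sketched numerically. All row counts below are illustrative; the sketch assumes the optimizer multiplies the two selectivities as if independent, and reads the 1004 distinct values as applying to the transaction-date column:

```python
num_rows = 1_000_000_000
rows_per_partition = num_rows // 1000          # 1000 daily partitions

# Global-stats estimate: independent selectivities multiplied together.
sel_load_date_in_two = 2 / 1000
sel_trans_date = 1 / 1004
estimated = round(num_rows * sel_load_date_in_two * sel_trans_date)

# Reality: trans_date and load_date are correlated, so inside the two
# chosen partitions that trans_date covers roughly a quarter of the rows.
actual = rows_per_partition // 4

print(estimated, actual)  # ~2,000 estimated vs 250,000 actual
```

With an estimate of a couple of thousand rows the index lookup looks cheap; fetching a quarter of a million scattered rows through it is anything but.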

25
Out of Range issues
  • Partitioned tables tend to be big tables, so
    adding 1M rows may not be a big change (stale).
  • If you are adding a new partition every day
    (partitioned by transaction date), every day the
    max(transaction date) for the global table grows.
    Global stats don't know. Every day it will grow
    further out of range. Execution plans
    against global stats may perform terribly when
    SQL tries to hit recent out-of-range values.
  • In 10.2.0.4 and after, a new copy table statistics
    procedure is available to counteract some of
    these issues

26
Histograms what are they
  • Dan Hotka: "Histograms are a cost-based
    optimizer (CBO) feature that allows for Oracle to
    see the possible number of values of a particular
    column. This is known as data skewing and
    histograms can track the number of occurrences of
    a particular data value when deciding on what
    type of index to use or even whether to use an
    index at all. Histograms give the CBO the ability
    to see the uniqueness of a particular data value."

27
Histogram metadata
  • ALL_TAB_HISTOGRAMS
  • ENDPOINT_NUMBER - Histogram bucket number
  • ENDPOINT_VALUE - Normalized endpoint value for
    this bucket
  • ENDPOINT_ACTUAL_VALUE - Actual (not normalized)
    string value of the endpoint for this bucket

28
Lets give Histograms a look
  • We will create a frequency-based histogram
  • create table gender as select 'M' gender from
    dba_objects;
  • insert into gender select 'F' from dba_objects
    where rownum < 10;
  • commit;
  • EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname =>
    user, tabname => 'GENDER', ESTIMATE_PERCENT
    => 100, method_opt => 'FOR ALL COLUMNS SIZE 10')
  • select num_rows from all_tables where owner =
    user and table_name = 'GENDER';
  • Result: 36769

29
Histograms, still looking
  • SELECT column_name,
  •        ENDPOINT_NUMBER,
  •        endpoint_value,
  •        ENDPOINT_ACTUAL_VALUE,
  •        SUBSTR (SYSTEM.hexstr (endpoint_value),
           1, 30)
  • FROM all_tab_histograms a
  • WHERE table_name = 'GENDER' AND owner = USER
  • hexstr function comes from asktom.oracle.com

30
Histogram Explain Plans
  • set echo on
  • set autotrace traceonly explain
  • SQL> select * from gender where gender = 'F'
  • Execution Plan
  • ----------------------------------------------------------
  • 0  SELECT STATEMENT Optimizer=CHOOSE
    (Cost=19 Card=9 Bytes=18)
  • 1 0  TABLE ACCESS (FULL) OF 'GENDER'
    (TABLE) (Cost=19 Card=9 Bytes=18)
  • SQL> select * from gender where gender = 'M'
  • Execution Plan
  • ----------------------------------------------------------
  • 0  SELECT STATEMENT Optimizer=CHOOSE
    (Cost=19 Card=36760 Bytes=73520)
  • 1 0  TABLE ACCESS (FULL) OF 'GENDER'
    (TABLE) (Cost=19 Card=36760 Bytes=73520)

31
Histograms: two types
  • Frequency based (less than 255 sampled values)
  • Army personnel table
  • Female: 5%
  • Male: 95%
  • Height based (not bound by number of distinct values)
  • Bell-shaped curve distribution, i.e. IQ scale
  • Say (making this up)
  • 3% of adults have an IQ between 0 and 70
  • 47% of adults have an IQ between 70 and 100
  • 47% of adults have an IQ between 100 and 130
  • 3% of adults have an IQ between 130 and 200
  • Oracle's implementation will break up the numbers
    evenly by the range (for instance, with 10 buckets, 10%
    of the rows will be in each bucket)
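A frequency histogram is essentially a per-value row count, which is why the GENDER plans earlier show Card=9 and Card=36760. A minimal sketch of the idea (the fallback value for unseen data is illustrative):

```python
# Value -> row count, as a frequency histogram on GENDER would record it.
histogram = {"F": 9, "M": 36760}

def est_card(value):
    # A value the histogram has never seen gets a tiny fallback estimate.
    return histogram.get(value, 1)

print(est_card("F"), est_card("M"), est_card("X"))  # 9 36760 1
```

With one bucket per distinct value, skewed equality predicates are estimated almost exactly; the trouble starts with values the histogram never sampled.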

32
Histograms in 10G
  • 10G defaults to SIZE AUTO in method_opt
  • This means there is an algorithm in Oracle to
    view recent behavior in the database to determine
    if histograms are needed
  • I believe 10G's SIZE AUTO is too aggressive in
    creating histograms. It creates histograms on too
    many columns
  • And histogram collection time is significant and
    resource intensive

33
10G Histogram Oddities
  • I have seen 10G put height-based histograms on
    last name columns
  • If histograms tell us there are 1M rows with
    last names between Pallack and Phillips, how can
    histograms help the optimizer guess how many
    Penellopeys (uncommon) there are?
  • Clearly there are common last names, but frequency-
    based histograms can't hold them (too many), and
    height based don't make sense.

34
Histogram Bind Variables
  • The combination of histograms and bind variable
    usage can get you in trouble (11G has some
    advances in this)
  • The first hard parse determines your plan. What
    if the bind variable value references infrequent
    skewed data which is not used again, and subsequent
    queries go after data on the other side of the
    skew?
  • I'm not saying lack of histograms doesn't have
    its downfalls, but the existence of histograms
    can cause large volatility, which brings the
    phone calls.
  • If there is a small, discrete set of values, consider
    using literals in the where predicate for histogram
    columns for the best plan, every time.
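The bind-peeking trap can be sketched as a plan cache keyed only by the SQL text. Names and thresholds here are illustrative; real cursor sharing is far more involved:

```python
plan_cache = {}
histogram = {"F": 9, "M": 36760}   # skewed column, as in the GENDER demo

def get_plan(sql_text, bind_value):
    if sql_text not in plan_cache:
        # Hard parse: peek at THIS bind value and pick a plan for it.
        rows = histogram.get(bind_value, 1)
        plan_cache[sql_text] = "INDEX SCAN" if rows < 1000 else "FULL SCAN"
    # Soft parse: the cached plan is reused; the new bind is ignored.
    return plan_cache[sql_text]

first = get_plan("select ... where gender = :b", "F")  # rare value parses first
later = get_plan("select ... where gender = :b", "M")  # common value reuses plan
print(first, later)  # INDEX SCAN INDEX SCAN
```

Whichever value happens to hard parse first wins the plan for everyone, which is exactly the volatility described above.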

35
But 10G takes care of Stale Stats
  • Oracle documentation: "If a monitored table has
    been modified more than 10%, then these
    statistics are considered stale and gathered
    again"
  • A good solid rule for 80% of all tables.
  • But staleness truly cannot be measured by the number
    of modified rows alone. Internal data swings
    (most likely on bigger tables) can cause HUGE
    issues. Consider a transaction table with a new
    distinct ORDR CODE for every day.
  • If there is a frequency histogram on this column,
    a new ORDR CODE will not be in it. The
    optimizer, using the histogram data in
    combination with the column's distinct values
    (all_tab_cols) and the number of rows in
    all_tables, may guess as low as 1 row.
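Both points above can be sketched together: one new day of rows never trips the 10% rule on a big table, yet the new code is invisible to the frequency histogram. Numbers are illustrative:

```python
def is_stale(num_rows, rows_modified, threshold=0.10):
    # The 10G staleness test: more than 10% of rows modified.
    return rows_modified > threshold * num_rows

# Frequency histogram on the ORDR CODE column before today's load.
histogram = {"DAY_001": 400_000, "DAY_002": 410_000}

def est_card(value):
    return histogram.get(value, 1)   # unseen value -> near-zero estimate

print(is_stale(10_000_000, 410_000))  # False: only ~4% modified, not stale
print(est_card("DAY_003"))            # 1: today's new code looks tiny
```

The table is "not stale" by the rule, yet every query against today's data is planned off a 1-row guess.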

36
Statistics Guidelines
  • Many tables feed batch processes. If
    possible, the app teams should run stats after
    each batch.
  • Partitioned tables should collect global stats
    each week
  • Tables partitioned by month should additionally
    collect global stats immediately after the first
    load of monthly data.
  • If you periodically find bad plans based on
    wrong statistics, even after an analyze, up the
    sample size.
  • Sure, 10G has a default daily stats job, but don't
    hesitate to use your own