Root Cause and Other DBA Urban Legends - PowerPoint PPT Presentation

About This Presentation
Title:

Root Cause and Other DBA Urban Legends

Description:

Root Cause and Other DBA Urban Legends. Brian Hitchcock. OCP 10g DBA. Sun Microsystems ... Perhaps these are Urban Legends? Root cause. Test system(s) Specific ... – PowerPoint PPT presentation

Number of Views:70
Avg rating:3.0/5.0
Slides: 46
Provided by: brianhi
Learn more at: http://www.nocoug.org
Category:
Tags: dba | cause | legends | root | urban

less

Transcript and Presenter's Notes

Title: Root Cause and Other DBA Urban Legends


1
Root Cause and Other DBA Urban Legends
  • Brian HitchcockOCP 10g DBA
  • Sun Microsystems
  • brian.hitchcock_at_sun.com
  • brhora_at_aol.com
  • www.brianhitchcock.net

SunFed DBA
Brian Hitchcock October 19, 2006
Page 1
2
For DBA Issues
  • I'm told that I must find root cause,
  • Can't resolve the issue without root cause
  • Must have testing environment that is
  • Similar enough to production to recreate the
    issue
  • And the fix
  • Need extensive expertise to solve performance
    issues
  • 10046 trace
  • Wait events
  • Extents size/number
  • Physical spindles
  • Rebuilding indexes
  • Undocumented init parameters

3
What is My Experience
  • Centralized DBA support team
  • 2000 databases
  • Database users open cases with DBA team
  • Cases randomly assigned to DBAs
  • Not assigned based on experience or expertise
  • My cases are probably typical of all the cases

4
My Experience
  • Finding root cause costs money
  • Building and maintaining test system(s) costs
    money
  • Test system
  • Needs to be able to recreate production issue
  • Database, network, apps and web servers, load
    balancers
  • Users around the world
  • Root cause and test system(s) are only worthwhile
  • If having them is less expensive than not having
    them.
  • Perhaps these are Urban Legends?
  • Root cause
  • Test system(s)
  • Specific expertise required

5
My Experiences
  • Review major cases I worked on over the last 2
    years
  • Cases are presented in chronological order,
    oldest first
  • For each case
  • What worked, i.e. what was the real-world
    solution?
  • Was root cause identified?
  • Was a test system available to verify root cause
  • and the solution?
  • What extensive expertise was required?

6
Keeping Score
  • For each case, look for 3 things
  • Root cause
  • Test system(s)
  • Specific expertise
  • Listed earlier (10046, wait events, extents,
    etc.)
  • Were any of the following useful in resolving the
    issue?
  • Open Mind (anything can and will happen)
  • Communication (it's difficult)
  • Simple solutions
  • Any active db users?
  • Reboot one or more components?

7
Case 1 CRM Local Language
  • Moved to UTF8, add local language columns
  • SQL ran slower
  • 10046 trace, explain plan
  • Found the single step slowing execution
  • Existing index wasnt updated to have new local
    language column
  • Recreated index
  • Performance returned to normal

8
Case 1 Three Things
  • Root cause
  • Existing indexes not optimal for new schema
  • Corrected indexes fixed the problem
  • Test system identified the issue
  • Specific Expertise
  • 10046
  • Explain plan
  • Indexes

9
Case 2 Storefront Slow
  • App support team reports database slow
  • Check database, 1-3 active users at most
  • Active users gone in 1-2 seconds
  • App users report 30 second response time
  • Have App support person use app
  • Watch database
  • See single active user connect, complete,
    disconnect
  • 2 seconds maximum
  • Remainder of response time
  • Web servers, load balancers etc.
  • Application server was locked up reboot fixed it

10
Case 2 Three Things
  • Root cause
  • Not identified
  • Test Environment
  • Exists, but not even close to production
  • Complex production application environment
  • Accurate test environment
  • Expensive to build and maintain
  • Hard to find qualified testers
  • No one completely understands how production was
    built
  • Impact of layoffs, outsourcing
  • Hard to recreate what you dont understand
  • No specific expertise required

11
Case 3 Contracts Slow
  • App users report message
  • Problem contacting the database No available
    resource
  • Sar shows CPU 90 idle
  • Statspack snapshot for last hour shows top wait
    events
  • 70 CPU time
  • 20 db file read
  • App server rebooted, performance returns to normal

12
Case 3 Three Things
  • Root cause unknown
  • How to determine what caused the app server to
    hang?
  • Test system does not have all the components of
    prod
  • Dont know why production locked up
  • Tough to reproduce in test system
  • Simple solutions no specific expertise required
  • Reboot app server
  • Low-risk, cheap, quick
  • No special experience required
  • If it doesnt work, time to put more resources
    into the issue

13
Case 4 Storefront
  • Users report app hanging
  • Few or no active users in database
  • Active users taking much longer than normal
  • Isolate single SQL statement that is running slow
  • Explain plans show good/bad execution plans
  • Changes from good to bad at random times
  • Various attempts to isolate give conflicting
    results
  • We assume that the problem is stable, it isnt
  • 10053 trace, watch optimizer choose execution
    plan
  • Optimizer changes from good to bad plan
  • one column of one table has 2 versus 3 distinct
    values

14
Case 4 Three Things
  • Root cause
  • not sure
  • App developers gone (outsourced)
  • No one knows how code really works
  • Why are these values appearing and disappearing?
  • App is inserting/deleting the rows with the 3rd
    distinct value
  • This change causes optimizer to choose bad plan
  • Bug or feature?
  • Fix?
  • Create rows so column always has 3 distinct values

15
Case 4 Three Things (contd)
  • No test system
  • Test system has small percentage of production
    data
  • App system too complex to reproduce
  • Multiple strings, load balancers, app servers
  • Hardware, support, upgrades, patches
  • Difficult to get testing resources people are
    expensive
  • Changes are tested for functionality
  • Changes arent tested for performance
  • Until released in production
  • Specific expertise
  • Explain plan
  • 10053 trace

16
Case 5 Customer Demo
  • Users report slow performance
  • Database shows 4 inactive sessions
  • Inactive sessions holding locks
  • Kill these 4 sessions
  • Performance is fine

17
Case 5 Three Things
  • Root cause unknown
  • Why some sessions inactive and holding locks?
  • App developers gone
  • Easier to kill sessions once in a while
  • Real root cause would be expensive to find
  • Test system
  • Dont have test system app developed on the
    cheap
  • Dont know how app works
  • No specific expertise required
  • Kill database sessions

18
Case 6 Executive Dashboard
  • User reports database is sorting dates
    incorrectly
  • Phone on mute laugh
  • Users have really good drugs today!
  • Connect to db
  • Verify that indeed dates are being sorted
    backwards
  • From NLS experience, look at actual bytes of
    dates
  • There are extra bytes that I cant explain
  • Go back to the docs for DATE type
  • Expand DATE format to include all the possible
    fields
  • Suddenly it all makes sense!

19
Case 6 Whats the Story?
  • The dates are being correctly sorted because
  • They are from BC!
  • OK, now what?
  • This is sales data from Fred Flintstone?
  • Is Barney setting up his own IT department?
  • Check the basics
  • App code uses 10g JDBC
  • User says this was a requirement
  • But the database is 9i it just gets better and
    better
  • User finds Metalink note
  • 10g JDBC issues with 9i database
  • Doesnt describe our issue, but its a start

20
Case 6 What to Do?
  • Setup a 10g database and test the app code
    against it
  • No test system, in fact, no other system of any
    kind
  • Critical executive reporting system, cant be
    down for long
  • Very limited disk space
  • Why not upgrade existing 9i database?
  • Upgrades can cause problems
  • If this isnt a 10g to 9i issue, why risk the db
    upgrade?
  • Install 10g db, full export 9i db, import into
    10g db
  • User tests app code against 10g database
  • No more dates from BC!
  • Remove 9i database, expand 10g database to match
  • User happy
  • Executives reports dont show sales to the
    Flintstones

21
Case 6 Three Things
  • Root cause
  • Not clear
  • Tried 10g database and issue went away
  • Did we really identify the root cause?
  • Was it a feature of 10g JDBC? A bug?
  • Test system
  • None
  • Expertise required
  • Minimal configuration control would have
    prevented this
  • Hardest part was accepting what Oracle was
    telling us
  • Dates from BC
  • Not at all the way things are supposed to be

22
Case 7 Customer Demo
  • Users calls, application is slow
  • Blocking processes, restart database twice in one
    day
  • App still slow
  • Ask user to connect and start using app
  • I watch database for active users
  • Over 20 seconds before new db user appears
  • Db user is done and gone in less than a second
  • User reports 30 seconds before results appear in
    browser
  • I tried ping between the db server and the app
    server
  • 200ms with packet loss
  • Ask network group to investigate
  • Network switch in data center has failed

23
Case 7 Three Things
  • Root cause
  • Network switch no question
  • This was not a database problem
  • But we could have wasted a lot of time with SQL
    tracing etc.
  • Need to confirm that the database is the problem
  • Then work the issue as a db tuning issue
  • No test system available
  • How would you reproduce the data center network?
  • Expertise required
  • Look for active users in database
  • ping between db and app server

24
Case 8 Data Warehouse
  • User reports application slow
  • User cant truncate a table command hangs
  • Loading data into warehouse is also hanging
  • Watch database while user starts load process
  • During load
  • User is using third-party app to do the load
  • No active users performing inserts
  • When no load is happening
  • Truncate table runs quickly
  • User contacts app vendor
  • Vendor changes some parameters for the load
    process
  • Data load runs normally, truncate table runs
    normally

25
Case 8 Three Things
  • Root cause
  • Looking at the db showed no active inserts during
    data load
  • Needed to verify that db wasnt the issue
  • Force vendor to perform
  • Test system
  • This was a dev database so there was a test
    system
  • Data load issues identified, resolved in dev
    environment
  • Expertise required
  • Quickly check for database problems
  • Communicate with user to understand what is
    happening
  • Politics dealing with vendor
  • Vendor really, really wants it to be a database
    problem

26
Case 9 Software Registry
  • User trying to truncate table
  • Part of a data feed process
  • Other users in database
  • Performing deletes on same table
  • Blocking truncate
  • Kill delete processes
  • Truncate still blocked
  • Watch database processes
  • Delete process starts every hour on the hour
  • Cron job starts delete hourly
  • Part of data feed process
  • Why dont the app owners know this?

27
Case 9 Three Things
  • Root cause
  • Unknown
  • Why was cron in place that user didnt know
    about?
  • Test system exists
  • But doesnt have the production data feeds
  • Expertise required
  • None basic DBA skills
  • Not a database performance issue
  • Basic app configuration was the issue

28
Case 10 CRM Reporting
  • User reports app is too slow
  • Specific selects are taking 10-15 seconds
  • STATSPACK snapshots are being taken
  • Check snapshots over same time for last week
  • Performance of the top SQL hasnt changed
  • User agrees that performance hasnt changed
  • Resources not available to make performance
    better
  • When is a performance problem not a problem?
  • When you dont want to pay to fix it

29
Case 10 Three Things
  • Root cause
  • User perception
  • Test system
  • None
  • Expertise required
  • Document that performance hasnt changed
  • Offer to work the issue if user will get funding

30
Case 11 Configurator
  • Users getting error
  • Cant allocate memory in shared pool
  • What is causing this?
  • Spend some time looking at the database
  • Looks normal
  • Reboot database to clear shared pool
  • Watch database after restart
  • Shared pool doesnt fill up

31
Case 11 Three Things
  • Root cause
  • Dont know why shared pool fills up
  • Test system
  • How to recreate problem in test system?
  • Expertise needed
  • Basic DBA skills
  • Flush shared pool
  • Problem didnt reoccur
  • One-time or infrequent set of circumstances

32
Case 12 Pricing Database
  • User reports slow performance
  • Check database server
  • CPU and iowait very high
  • Watch SQL executing
  • Explain plan for worst SQL shows full table scans
  • No indexes on tables being scanned
  • Test server
  • Same tables do have indexes
  • While recreating indexes
  • All the needed indexes reappear
  • Cron job for pricing data runs every two weeks
  • Rebuilds indexes and loads data

33
Case 12 Three Things
  • Root cause
  • What caused indexes to be dropped?
  • Unknown
  • Test system
  • Used to verify that indexes were missing
  • Cant reproduce whatever dropped the indexes
  • Expertise required
  • Basic DBA skills

34
Case 13 Contracts Database
  • User reports error when selecting from table
  • Invalid row id
  • Some queries on same table dont report error
  • Looking at table and row ids
  • Some queries use index
  • No errors
  • Other queries use table
  • Specific rows have invalid row ids
  • Rebuild table
  • Create table as select from

35
Case 13 Three Things
  • Root cause
  • Why did row ids become invalid?
  • Unknown
  • Test system
  • Exists, cant reproduce cause of invalid row ids
  • Expertise required
  • Basic DBA skills

36
Case 14 Software Download
  • User reports application slow in Dev environment
  • SQL so slow, application server times out
  • Sar shows CPU near 0 idle
  • User executes problem SQL
  • Watch database
  • SQL completes, very slowly, database is working
  • Explain plan shows full table scans
  • Look at tables involved
  • Production last analyzed NULL, indexes, runs
    fast
  • Test recently analyzed, indexes, runs slow
  • Drop stats, rebuild indexes, reanalyze
  • Performance returns to normal

37
Case 14 Three Things
  • Root cause
  • What happened to indexes, statistics in
    production?
  • Unknown
  • Test system
  • Exists
  • Doesnt match production which is correct?
  • Cant reproduce dropped indexes and/or statistics
  • Expertise required
  • Basic DBA skills

38
Case 15 BRIO
  • User reports SQL failing with error
  • Cant allocate memory in shared pool
  • Watch database
  • Same SQL runs most of the time
  • Error occurs once in a while
  • Solutions
  • Reduce sort area size
  • Free up more memory for shared pool
  • Increase physical memory assigned to database
  • Error occurred so infrequently
  • Users decided not to change system

39
Case 15 Three Things
  • Root cause
  • Unknown
  • Just too many users for a brief time?
  • Test system
  • Yes, but how to reproduce this issue (user load)?
  • Expertise required
  • Basic DBA skills

40
Case 16 Revenue App
  • Automated alert
  • Report log switching hung
  • Database stopped for transactions
  • Examine database
  • Cant find anything wrong
  • Restart database
  • Log switch works
  • No further alerts
  • No user problems

41
Case 16 Three Things
  • Root Cause
  • Unknown
  • Test system
  • Exists, but no help
  • Expertise required
  • Basic DBA skills

42
Scoreboard
43
Observations
  • How many times did we
  • Identify root cause?
  • Have a complete test system?
  • Need specific expertise?
  • Un-scientific results
  • Root cause 2 out of 16 13
  • Test system 3 out of 16 19
  • Specific expertise indexes 1/16 6
  • Simple solutions worked 94 of the time

44
Conclusions
  • Brians off his meds
  • Ignore him this isnt typical
  • Urban legends?
  • You dont have to find root cause
  • You dont have to have a complete test
    environment
  • You dont need extensive expertise in specific
    DBA areas
  • You do need
  • DBA experience
  • Open mind strange stuff happens all the time
  • Communication skills you dont know whats
    happening
  • Looking at training resources
  • Why become more expert at things that are rarely
    needed?
  • Why not become familiar with things you know
    little about?

45
Opinion
  • Specific expertise can be obtained when needed
  • Based on the results shown
  • 94 of the time, dont need expertise to be
    productive
  • 15 of 16 cases solved with general DBA skills
  • Some expertise can be automated or outsourced
  • Tracing, wait events
  • Focus on skills that cant be put into a GUI
  • Communication
  • Ability to solve problems, not just database
    issues
Write a Comment
User Comments (0)
About PowerShow.com