Deal with Production Issues - PowerPoint PPT Presentation

Slides: 42
Provided by: bigapplez

1
Deal with Production Issues
  • Suggestions from ITIL

2
Problems to solve
  • Long resolution time
  • Neglected issues
  • Issues we lose track of until our users remind us
  • Recurring issues
  • Inconsistency in response time
  • Developers are constantly distracted by having to
    resolve issues

3
Goal
  • Manage issues in a consistent manner
  • Fast resolution
  • Reduce client impact
  • Proactively resolve issues before they impact
    clients

4
Basic Concepts
  • Incident: any event that is not part of the
    standard operation of a service and that causes,
    or may cause, an interruption to, or a reduction
    in, the quality of that service
  • Problem: a condition, often identified as the
    cause of multiple incidents that exhibit common
    symptoms
  • Known Error: a condition identified by successful
    diagnosis of the root cause of a problem and the
    subsequent development of a workaround

5
Relationship of the three
  • A Problem is the root cause of its Incidents
  • An Incident is the manifestation of an underlying
    Problem
  • One Problem can cause many Incidents
  • A Known Error is a Problem with a known root cause
    and a known workaround
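
The relationship above can be sketched as a small data model. This is only an illustration, not part of ITIL itself: the class names, field names and the `link` helper are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Problem:
    problem_id: str
    root_cause: Optional[str] = None   # unknown until diagnosis succeeds
    workaround: Optional[str] = None   # unknown until one is developed
    incident_ids: List[str] = field(default_factory=list)  # one Problem, many Incidents

    def is_known_error(self) -> bool:
        # A Known Error is a Problem with a known root cause and a known workaround.
        return self.root_cause is not None and self.workaround is not None


@dataclass
class Incident:
    incident_id: str
    symptoms: str
    problem_id: Optional[str] = None   # set once the underlying Problem is identified


def link(incident: Incident, problem: Problem) -> None:
    """Record that this Incident is a manifestation of the given Problem."""
    incident.problem_id = problem.problem_id
    if incident.incident_id not in problem.incident_ids:
        problem.incident_ids.append(incident.incident_id)
```

Keeping the link on both sides lets the Service Desk jump from an incident to its problem, and lets Problem Management see every incident a problem has caused.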

6
Manage Incident vs. Manage Problem
  • Different goals
  • Incident Management focuses on restoring the
    service operation as quickly as possible
  • Problem Management focuses on finding and
    eliminating the root cause
  • Different actions
  • Incident Management applies workarounds or
    temporary fixes to quickly restore the services
  • Problem Management issues a change to
    fundamentally eliminate the root cause
  • Incident Management is reactive and Problem
    Management is proactive
  • Incident Management emphasizes speed and Problem
    Management emphasizes quality

7
Common mistakes
  • Spending a great deal of time and effort finding
    the root cause before the service level is
    recovered
  • Stopping the investigation after an incident is
    fixed by a workaround
  • Letting the same incident occur repeatedly without
    understanding the root cause

8
Solutions from ITIL
  • Separate out Incident Management and Problem
    Management into two independent but related
    processes
  • Handle incidents (restore service) as quickly as
    possible
  • Proactively and independently work on resolving
    problems
  • Wisely manage Known Errors

9
Incident Management
  • Always remember that the goal is to restore the
    service level as quickly as possible
  • How to go fast?
  • Classification
  • Match known errors and known workarounds
  • Appropriate escalation
  • Go fast, but don't go crazy. Don't miss:
  • Record
  • Prioritize
  • Follow up

10
Incident Management Process

11
Acceptance And Record
  • Benefits of recording
  • Helps diagnose new incidents based on known
    incidents
  • Helps Problem Management find the root cause
  • Makes it easy to determine the impact
  • Makes it possible to track and control issue
    resolution
  • Incident Reporting Channels
  • User
  • System Monitor/Alert
  • IT person

12
Incident Record
  • Unique ID
  • Basic diagnosis info
  • Timestamp
  • Symptoms
  • User info (name, contact info)
  • Who's responsible
  • Additional information
  • Screenshots
  • Logs
  • Status
  • New, Accepted, Scheduled, Assigned, Active,
    Suspended, Resolved, Terminated
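
As a rough sketch, the record fields and statuses listed above could be captured like this. The field names, the use of a random UUID as the unique ID, and the UTC timestamp are assumptions for illustration; a real tracking tool will define its own schema.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List


class Status(Enum):
    # The eight statuses listed on the slide.
    NEW = "New"
    ACCEPTED = "Accepted"
    SCHEDULED = "Scheduled"
    ASSIGNED = "Assigned"
    ACTIVE = "Active"
    SUSPENDED = "Suspended"
    RESOLVED = "Resolved"
    TERMINATED = "Terminated"


@dataclass
class IncidentRecord:
    symptoms: str
    user_name: str
    user_contact: str
    owner: str  # who is responsible for the incident
    # Assumed ID scheme: a random UUID generated when the record is created.
    incident_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    attachments: List[str] = field(default_factory=list)  # paths to screenshots, logs
    status: Status = Status.NEW
```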

13
Classification
  • Classification
  • Possible reasons (application, network, database,
    business logic, etc.)
  • Supporting group (application group, database
    group, infrastructure group, network group, etc.)
  • Prioritize
  • Priority = Impact × Urgency
  • Determine the resolution timeline (resolve within
    X hours) based on the Service Level Agreement
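
A minimal sketch of the priority calculation follows. The three-level scale, the score thresholds and the resolution targets are all hypothetical placeholders; the real values must come from your own Service Level Agreement.

```python
from datetime import timedelta

# Assumed three-level scale for both impact and urgency.
LEVELS = {"low": 1, "medium": 2, "high": 3}

# Hypothetical SLA table: (minimum score, resolve within), highest score first.
SLA_TARGETS = [
    (9, timedelta(hours=1)),
    (6, timedelta(hours=4)),
    (3, timedelta(hours=8)),
    (1, timedelta(hours=24)),
]


def priority_score(impact: str, urgency: str) -> int:
    """Priority = Impact x Urgency, on the assumed 1-3 scale."""
    return LEVELS[impact] * LEVELS[urgency]


def resolution_target(impact: str, urgency: str) -> timedelta:
    """Look up the resolution timeline for the computed priority."""
    score = priority_score(impact, urgency)
    for min_score, target in SLA_TARGETS:
        if score >= min_score:
            return target
    raise ValueError(f"no SLA target for score {score}")
```

Writing the table down explicitly is what makes response times consistent: two people classifying the same incident get the same deadline.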

14
Preliminary Support
  • Preliminary Response
  • Acknowledge acceptance
  • Collect basic info
  • Provide basic help to the user
  • Service Requests
  • A Service Request is a standard service such as
    checking a status or resetting a password
  • Go through the standard procedure to handle
    service requests

15
Match
  • Match known errors
  • Known solution
  • Known workaround
  • Known resolution procedure
  • Match existing incidents
  • Link the new incident with the existing incidents
  • Increase the impact level of the existing
    incident
  • If the existing one is already being worked on,
    inform the responsible person/group
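
One simple way to sketch the matching step is keyword overlap between the reported symptoms and the symptoms recorded for each known error. The `KnownError` fields, the tokenizer and the `min_overlap` threshold are assumptions for illustration; a production knowledge base would likely use a proper search index.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class KnownError:
    error_id: str
    symptom_keywords: Set[str]  # lowercase keywords describing the symptoms
    workaround: str


def tokenize(text: str) -> Set[str]:
    """Split free text into a set of lowercase alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def match_known_error(symptoms: str,
                      known_errors: List[KnownError],
                      min_overlap: int = 2) -> Optional[KnownError]:
    """Return the known error sharing the most keywords with the symptoms,
    or None if no known error shares at least min_overlap keywords."""
    words = tokenize(symptoms)
    best, best_overlap = None, 0
    for known_error in known_errors:
        overlap = len(words & known_error.symptom_keywords)
        if overlap >= min_overlap and overlap > best_overlap:
            best, best_overlap = known_error, overlap
    return best
```

When a match is found, the Service Desk can apply the recorded workaround immediately instead of starting a new investigation.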

16
Investigation and Diagnosis
  • Escalation
  • Functional escalation (Technical escalation):
    involve more technical experts, involve teams in
    other functional groups, or involve external
    suppliers
  • Hierarchical escalation (Management escalation):
    escalate to a higher-level management team

17
Escalation by Priorities
  • A (Service Desk)
  • B (Second Line)
  • C (Third Line, Supplier)
  • D (Incident Manager)
  • E (Division Management)
  • F (Corporate Management)
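
The original slide was a diagram whose timing rules are not preserved in this transcript, so the following is only a sketch of what a time-bound escalation rule could look like. The thresholds (fractions of the SLA window already used) are invented for illustration, as is the assumption that A to C form the functional track and D to F the hierarchical track.

```python
from typing import List, Optional, Tuple

# Hypothetical thresholds: fraction of the SLA window elapsed -> level reached.
FUNCTIONAL_STEPS = [
    (0.00, "A (Service Desk)"),
    (0.25, "B (Second Line)"),
    (0.50, "C (Third Line, Supplier)"),
]
HIERARCHICAL_STEPS = [
    (0.50, "D (Incident Manager)"),
    (0.75, "E (Division Management)"),
    (1.00, "F (Corporate Management)"),
]


def _latest(steps: List[Tuple[float, str]], elapsed_fraction: float) -> Optional[str]:
    """Return the last level whose threshold has been reached, if any."""
    current = None
    for threshold, level in steps:  # steps are sorted by ascending threshold
        if elapsed_fraction >= threshold:
            current = level
    return current


def escalation_levels(elapsed_fraction: float) -> Tuple[Optional[str], Optional[str]]:
    """Return (functional level, hierarchical level) for an open incident."""
    return (_latest(FUNCTIONAL_STEPS, elapsed_fraction),
            _latest(HIERARCHICAL_STEPS, elapsed_fraction))
```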

18
Investigation Activities
  • Assign dedicated support person
  • Collect basic info
  • Query historical data
  • Recent releases
  • Recent changes
  • Workload trend
  • Analyze
  • Again, don't spend too much time finding the
    root cause. Find a workaround as soon as
    possible!

19
Resolve and recover
  • Resolution (workarounds or permanent fix)
  • Create a Request For Change (RFC)
  • Approve RFC
  • Implement Change.
  • Record the analysis, the root cause, the
    workaround and the solution
  • Leave the incident in Open status when a
    resolution hasn't been found

20
Termination
  • Contact the user to confirm incident is resolved
  • Change the Incident status to Closed
  • Update the Incident record to reflect the final
    priority, impact, user and root cause

21
Track and Monitor
  • Assign an owner to each incident. Usually it's
    the Service Desk person.
  • Provide feedback to the users after a change
  • Enforce the escalation based on the priority

22
Problem Management
  • Problem Control
  • Find the root cause of a problem
  • Turn a problem into a Known Error
  • Error Control
  • Control and Monitor the Known Errors until they
    are appropriately handled
  • Proactive Problem Management
  • Resolve problems before they cause any incidents

23
Problem Control

24
Identify Problems
  • Analyze the trends of incidents
  • Likely to reoccur
  • Likely more will occur
  • Likely to have larger impact
  • Analyze the weaknesses of the infrastructure
  • Availability
  • Capability
  • A significant incident (outage)

25
Diagnosis
  • Recreate incident in testing environment
  • Link the modules with incidents
  • Review the latest changes
  • After the root cause of a problem is found, this
    problem becomes a Known Error

26
Temporary Fixes
  • It's important to find a temporary fix if the
    problem causes a significant incident
  • If the temporary fix involves changes to the
    infrastructure, a Request For Change must be
    submitted. (Later, another RFC may be submitted
    to fix the root cause.)
  • For urgent problems, the Emergency Change Request
    process should be initiated.

27
Error Control

28
Identify and Record Known Error
  • Identify
  • Find the root cause of a problem
  • Link a problem with a known error
  • Record
  • Assign an ID
  • Symptoms
  • Root cause
  • Status
  • Notification
  • Notify incident management team. They can
    associate new incidents with known errors

29
Determine the solution
  • Evaluate based on
  • Service Level Agreement
  • Impact and Urgency
  • Cost and benefit
  • Possible solutions
  • Temporary fixes
  • Permanent fixes
  • No fix (cost is greater than benefits)
  • Record the decision in Problem Database

30
Known Errors from other environments
  • Known errors from development environment
  • We may choose to release with some minor known
    issues
  • Known errors from suppliers
  • Usually reported in the release notes
  • Record, Monitor and Track those known errors
  • Relate problems with those known errors

31
PIR (Post Implementation Review)
  • Normal problems
  • Confirm all the related incidents are closed
  • Verify that the problem record is complete
    (symptoms, root cause and solutions)
  • Change the problem status to Resolved
  • Significant problems
  • What went well?
  • What went wrong?
  • How to do better next time?
  • How to prevent similar issues from happening
    again?

32
Track and Monitor
  • Track the full lifecycle of each known error
  • Reevaluate impact and urgency. Adjust the
    priorities accordingly.
  • Monitor the progress of the diagnosis and
    implementation of the solution. Monitor the
    implementation of the RFC.

33
Proactive Problem Management
  • Focus on the quality of the service and the
    infrastructure
  • Analyze operational trends
  • Detect the potential incidents and prevent them
    from happening
  • Find out the weak points of the infrastructure or
    the overloaded components

34
Ideas to improve our Production Support process
  • Idea 1: Create an independent Problem Management
    Team
  • Idea 2: Create a Problem Database
  • Idea 3: Define the Production Support Procedure
  • Idea 4: Review and revise the procedures for
    using TeamTrack
  • Idea 5: Enforce Post Implementation Review
  • Idea 6: Proactively manage problems
  • Idea 7 (optional): Acquire Service Desk software
    to facilitate the process

35
Create an independent Problem Management Team.
  • Can be a full time team or a part time team
  • Appoint a Problem Management Manager. This must
    be a different person from the Production
    Support Manager, because their goals, schedules
    and requirements are different.
  • Responsible for managing all the production
    problems (not incidents) for multiple
    applications
  • Identify problems
  • Record problems
  • Find and evaluate solutions
  • Track the progress till closure
  • Work closely with the existing Production Support
    team.

36
Create a Problem Database
  • An easy-to-search knowledge database
  • Include problems and known errors
  • Track symptoms, root causes, temporary fixes,
    workarounds, and permanent solutions
  • Include all the known errors in DEV and
    unresolved or deferred defects in QA/RATE
    environments
  • Maintained by the Problem Management Team
  • Will be used by the Production Support team to
    match incidents and resolve them quickly

37
Define the Production Support Procedure (Work Instructions)
  • Create a formal and detailed document. Train the
    Production Support Team to follow the new
    procedure
  • Start with ITIL Incident Management Process.
    Adjust it to our own situation and tools
  • Clearly define how to calculate priorities
  • Clearly define the time-bound escalation
    procedure
  • Clearly define the monitoring and tracking steps

38
Review and define the procedure for using TeamTrack
  • TeamTrack is our existing Incident Tracking
    system
  • Review the functions of TeamTrack
  • Redefine the incident escalation process
    according to ITIL suggestions
  • Define the interface between PC Support and IT
    Production Support Team
  • Communication channel
  • Roles and responsibilities
  • Escalation
  • Track and Control
  • Knowledge sharing

39
Enforce PIR
  • Contact each user to confirm all the incidents
    are closed
  • Make sure the Problem record is complete and
    useful
  • Identify issues in the Incident and Problem
    Management process. Add those to Problem database.

40
Proactively Manage Problems
  • Responsibility of the Problem Management Team.
  • Perform the following activities
  • Analyze incidents to find the trend
  • Analyze infrastructure to identify possible
    bottlenecks
  • Run fail-over and stress tests
  • Apply a problem solution across multiple related
    applications
  • Establish and maintain the Production Monitor
    System to proactively detect system anomalies
  • Evaluate how many problems are proactively
    identified and resolved

41
Service Desk Software
  • Evaluate the existing TeamTrack software and see
    if it covers our needs
  • Other popular options
  • HP OpenView Service Desk
  • Remedy Strategic Service Suite
  • CA Unicenter Service Desk