pSeries Hardware Support Center Error Log Analysis and Coordination: 12 Years of rsense

Description:

Parity Error on PCI bus between Planar 0 and Adapter Slot 16 at 11:17:00 on 12/13 ...

Provided by: IBMU328
1
pSeries Hardware Support Center Error Log
Analysis and Coordination: 12 Years of rsense
  • Daniel J. Henderson
  • p and i Series HW Availability Lead

2
  • The slides that follow (excepting this one) are
    meant for poster board display, to be arranged on a
    tri-fold poster as follows:

[Poster title panel: rsense pSeries Support Center Error Log Analysis; Dan Henderson, p and i Series Availability Lead]
3
rsense pSeries Support Center Error Log
Decode/Analysis and Correlation
Daniel J. Henderson
p and i Series Availability RAS Lead
4
Traditional HW Error Logging in IBM RS/6000
Systems
  • HW platform and device driver errors are logged in
    the OS error log
  • Information essential for repair is logged in
    customer/servicer-readable form:
    • A Service Request Code (SRC) number for lookup in
      service publications
    • A general description of the type of failure
    • FRU numbers telling what parts to replace for the
      failure
  • Detailed information, known as sense data,
    explaining the exact nature of the failure and the
    associated hardware state, is logged in an ASCII
    hex format
  • Error Log Analysis programs in the OS
    • Identified log entries and reported SRC and
      FRU callouts sufficient to direct service repair,
      but did
    • Very little decoding of sense data
    • Very little, if any, correlation of multiple
      errors in the log to either
      • Modify the hardware action plan
      • or threshold recoverable errors

[Figure: Traditional pSeries Approach To Error Logging and Analysis. Labels: OS, OS Error Log, Device Drivers, I/O, System Fw, HW Service Processor]
5
Sample AIX Log Entry
(Callout: Customer Recommendation: Run Diagnostics)

LABEL:           SCAN_ERROR_CHRP
IDENTIFIER:      BFE4C025

Date/Time:       Sat Jan 29 18:17:43 2005
Sequence Number: 25111
Machine Id:      00CEE05D4C00
Node Id:         va-txdb01
Class:           H
Type:            PERM
Resource Name:   sysplanar0
Resource Class:  planar
Resource Type:   sysplanar_rspc
Location:

Description
UNDETERMINED ERROR

Failure Causes
UNDETERMINED

Recommended Actions
RUN SYSTEM DIAGNOSTICS.

Detail Data
PROBLEM DATA
0644 00E0 0000 06EC CE00 8E00
0000 0000 0000 0000 4942 4D00 5048 0030 0100 6300
2005 0129 1706 3453 2005 0129 1706 3459 4500
0113 0000 0000 0000 0000 0000 0000 500F FAB9 500F
FAB9 5548 0018 0100 4F30 2003 4000 0000 F962 0000
A902 0000 0000 5053 0104 0101 6300 0201 0009 0000
00FC 0200 00F0 28DA 5410 C100 98A0 2000 0000 0000
0000 0000 0013 0081 1930 FFFF FFFF 4231 3230 4636
3637 2020 2020 2020 2020 2020 2020 2020 2020 2020
2020 2020 2020 C000 002D 4C2C 481C 5537 3837 392E
3030 312E 4451 4430 345A 452D 5032 2D43 312D 4338
0000 4944 1C1D 3533 5033 3233 3200 3330 4141 594C
3130 3234 3236 3530 3535 4D52 1001 0000 0000 0000
0048 0081 028A 542C 481C 5537 3837 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000
............................ 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000
---------------------------------------------------------------------------
Diagnostic Results
B120F667 Memory subsystem including external cache
Unrecovered Error, general. Refer to the system service
documentation for more information.

Error log information:
  FRU 53P3232  S/N YL1024265055  CCIN 30AA  Location U7879.001.DQD04ZE-P2-C1-C8  Priority H
  FRU 53P3232  S/N YL10242650Z0  CCIN 30AA  Location U7879.001.DQD04ZE-P2-C1-C1  Priority L
  Maintenance Procedure FSPSP1

OS-based Error Log Analysis says what to
replace, but gives little information as to why.
6
Support Center Error Log Analysis: dsense
  • In a hardware support center, decoding of sense
    data originally proceeded manually, in order to
    • Determine patterns of errors across multiple
      systems to look for pervasive issues
    • Provide additional fault detection/isolation when
      the original hardware action plan did not
      satisfy customer needs
  • dsense was created in the early 90s to automate
    the translation of hex bytes, giving
    • A human-readable description of the pertinent
      data in each byte
    • A bottom-line analysis of what each error log
      entry means, creating a one-line description of each
      error
    • A summarization of multiple errors using the
      one-line descriptions, to give a much more accurate
      picture of system behavior over multiple days and
      multiple log entries
  • dsense eventually shipped as part of AIX
    diagnostics to allow on-the-spot analysis of
    errors rather than waiting for data to be
    transmitted to a support center.

OS Error Log
  Log Entry 1 at time a: 1234 5678 9134
  Log Entry 2 at time b: abcd 3433 6763
  Log Entry 3 at time c: 3432 3434 3432
  • dsense output of OS error log
    • Log Entry 1
      • Hex Data Means
      • a) b) c) and d)
      • Bottom Line: Device www experienced xxx kind of
        error
    • Log Entry 2
      • Hex Data Means
      • e) f) g) and h)
      • Bottom Line: Device yyy experienced zzz kind of
        error
    • Log Entry 3
      • Hex Data Means
      • b) c) and d)
      • Bottom Line: Device www experienced xxx kind of
        error
    • --------------------------------------------------------------------
    • Summary of Errors
      • Device www experienced 2 errors of kind xxx
      • Device yyy experienced 1 errors of kind zzz
    • --------------------------------------------------------------------
    • Chronological summary of errors
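
The summarization step illustrated above can be sketched in a few lines of Python. This is only an illustration of the idea, not dsense code: the `DecodedEntry` class, its field names and the `summarize` function are hypothetical, and the placeholder devices and error kinds (www, xxx, ...) are the ones used on the slide.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodedEntry:
    """One decoded log entry (hypothetical structure, not the dsense format)."""
    timestamp: str  # sortable time label, e.g. an ISO timestamp
    device: str
    kind: str

    @property
    def bottom_line(self) -> str:
        return f"Device {self.device} experienced {self.kind} kind of error"


def summarize(entries):
    """Return (summary by count, chronological summary) as lists of lines."""
    counts = Counter((e.device, e.kind) for e in entries)
    by_count = [
        f"Device {device} experienced {n} errors of kind {kind}"
        for (device, kind), n in counts.most_common()
    ]
    chronological = [
        f"{e.timestamp} {e.bottom_line}"
        for e in sorted(entries, key=lambda e: e.timestamp)
    ]
    return by_count, chronological


if __name__ == "__main__":
    log = [
        DecodedEntry("a", "www", "xxx"),
        DecodedEntry("b", "yyy", "zzz"),
        DecodedEntry("c", "www", "xxx"),
    ]
    for line in summarize(log)[0]:
        print(line)
```

Keying the counter on the bottom-line fields rather than on the raw hex is what lets entries with different sense data collapse into one summary line.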

7
pSeries Environment Error Log Analysis Challenge:
rsense Response
  • In pSeries, a single hardware platform can host
    multiple OS images and I/O virtualization.
  • On high-end systems, a hardware management console
    consolidates error logs from multiple OS images to
    report basic error information for service, but
    not detailed support center information
  • Requirements for detailed support center analysis
    are even greater than before
  • The rsense program was created concurrently with pSeries to
    provide that level of support center analysis.
  • Functionality is expanding with advances in
    partitioning and virtualization to provide
    • Cross OS and platform
    • Summarization of multiple logs
    • Correlation of log entries to modify parts
      replacement strategies
    • Thresholding of soft errors
    • Pervasive issue detection

[Figure labels: Linux, AIX (x3), Error log (x3), Hypervisor (x2), Service Focal Point, Service Action Event, Hardware Mngmt Console, p690 hardware / Service Processor, I/O, p690 hardware]

Single, Unified Decode Error Log Results

Summary of Entries
Errors with no Resource Type
  Error os_event_scan () on resource appeared 2 times
    Processor Hang detected due to internal source)
  Error os_event_scan () on resource appeared 4 times
    Most likely Processor subsystem tests detected a fault in a component
Entries on Node JTH02U1M
Errors with no Resource Type
  Error ERRLOG_ON (9DBCFDEE) on resource errdemon appeared 7 times
    ERROR LOGGING TURNED ON
  Error ERRLOG_OFF (192AC071) on resource errdemon appeared 7 times
    ERROR LOGGING TURNED OFF
  Error RMCD_INFO_0_ST (A6DF45AA) on resource RMCdaemon appeared 13 times
    ....
8
Same Log Filtered Through Rsense (Abbreviated) 1-3
All the initial text information (label,
identifier, resources, etc.) is decoded by the error
log parser and redisplayed as part of the
output. Various labels are made variables for
use by rsense scripts. The description
"Undetermined Error" becomes the default bottom-line
summary of the error used in rsense
summaries.
Through a defined algorithm, the event manager in
rsense calls the appropriate script for decoding
this SCAN_ERROR_CHRP entry:
  ! SCAN_ERROR_CHRP
    Decodes rpa fmted logs.
    Set a variable indicating which byte to start decoding from
    rpa_rc_at 0
    Advance decode pointer to the byte
    gt rpa_rc_at
    Call routines which decode this error
    call link_crit_err
    call rpa_decode
    Log this as a critical error for later summary
    call log_crit_err
Script to decode the first 10 bits:
    Display section header, using the rpa_section_at variable
    to keep track of sections displayed
    rpa_section_at RTAS Error Return Information rpa_section_at
    rpa_section_at 1
    At the current location in sense data, display 32 bits of data
    as a single hex word (return code)
    0 - _at_ 32 Return Code (0x.8x)
    At the same location, display the first 8 bits as version
    - 07 _at_ Bits 007 (0x.2x) Version d _at_rpa_version
    Dec. bits 810 as a severity, creating a variable, severity
    - 810 _at_ 810 (0x.2x) Severity\ _at_severity
    Display additional info about the severity
    select (severity)
      case 0x05
        Fatal Error
      case 0x04
        Non-Fatal Error
      case 0x03
        Error Sync
      case 0x02
        Warning
      case 0x01
        Event
      case 0x00
        No Error
      default
        Undefined Value
    endselect
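
The bit-field decode that the script performs can be approximated in Python as follows. This is a sketch under stated assumptions, not the rsense implementation: it assumes the bit ranges are 0 to 7 (version) and 8 to 10 (severity) with bit 0 as the most significant bit, and it assumes the first four bytes of the PROBLEM DATA shown earlier (0644 00E0) form the return-code word. The helper name `extract_bits` is hypothetical.

```python
# Severity texts as listed in the select/case block on the slide.
SEVERITY_TEXT = {
    0x05: "Fatal Error",
    0x04: "Non-Fatal Error",
    0x03: "Error Sync",
    0x02: "Warning",
    0x01: "Event",
    0x00: "No Error",
}


def extract_bits(data: bytes, first: int, last: int) -> int:
    """Return bits first..last (inclusive) of data, counting bit 0 as the
    most significant bit of the first byte."""
    total_bits = len(data) * 8
    value = int.from_bytes(data, "big")
    width = last - first + 1
    shift = total_bits - 1 - last
    return (value >> shift) & ((1 << width) - 1)


def decode_header(sense: bytes) -> dict:
    """Decode the return-code word, version and severity from sense data."""
    word = sense[:4]
    severity = extract_bits(word, 8, 10)
    return {
        "return_code": f"0x{int.from_bytes(word, 'big'):08x}",
        "version": extract_bits(word, 0, 7),
        "severity": severity,
        "severity_text": SEVERITY_TEXT.get(severity, "Undefined Value"),
    }


if __name__ == "__main__":
    print(decode_header(bytes.fromhex("0644 00E0")))
```

Numbering bits from the most significant end matches the way the slide talks about "the first 8 bits"; a spec that numbers from the least significant bit would need the shift computed differently.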

9
Same Log Filtered Through Rsense (Abbreviated) 2-3
The diagnostic ela program decoded the first word
of an SRC. This decode gives extended details
about the first SRC. The SRC itself is
hyperlinked to another support center tool that
provides an additional level of decode for some
cases.
    First 16 bits of the SRC decode to an src_area as described below
    select src_area
      case 0xb110
        Processor subsystem event or error
      case 0xb111
        Processor FRU event or error
      case 0xb112
        Processor chip (including cache) event
      case 0xb113
        Processor unit (CPU) event or error
      case 0xb114
        Processor/system bus controller and ..
      case 0xb120
        Memory subsystem event or error
        select ref_code
          case 0xf667
            UE during maint scrub or L3 connections test
        endselect
      ...
    endselect
In the above script, a command beginning with a
and a space is a print command. A command
starting with is a print command where the
text printed is identified as a bottom-line
summary of a problem supplanting any previous
bottom line summary encountered.
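
The drill-down from SRC area to reference code can be mirrored with nested lookup tables. The Python sketch below uses only the codes quoted on the slide (the truncated 0xb114 text is left out) and assumes, since the slide does not say so, that ref_code is the low 16 bits of the first SRC word. `decode_src` and the table names are hypothetical.

```python
# Area descriptions quoted from the slide's select/case block.
SRC_AREAS = {
    0xB110: "Processor subsystem event or error",
    0xB111: "Processor FRU event or error",
    0xB112: "Processor chip (including cache) event",
    0xB113: "Processor unit (CPU) event or error",
    0xB120: "Memory subsystem event or error",
}

# More specific bottom lines, keyed by area and then by reference code.
REF_CODES = {
    0xB120: {0xF667: "UE during maint scrub or L3 connections test"},
}


def decode_src(src_word: int):
    """Return (area description, bottom line) for a 32-bit SRC word.

    The most specific text found supplants the more general one, which
    mimics the 'latest bottom line wins' behaviour described above.
    """
    src_area = (src_word >> 16) & 0xFFFF   # first 16 bits
    ref_code = src_word & 0xFFFF           # assumed: remaining 16 bits
    area_text = SRC_AREAS.get(src_area, "Undefined SRC area")
    bottom_line = REF_CODES.get(src_area, {}).get(ref_code, area_text)
    return area_text, bottom_line


if __name__ == "__main__":
    print(decode_src(0xB120F667))
```

For the SRC B120F667 from the sample log entry, the general text is the memory subsystem description and the bottom line becomes the reference-code text.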
10
Same Log Filtered Through Rsense (Abbreviated) 3-3
Diagnostic ELA could recognize the FRUs (parts to
replace), but this decode gives additional
information.
The log may contain many sections of User Data. Some
will be automatically decoded by rsense. Others
will only be understood by the developer. In the
latter case, a built-in script language function
is used to format the data conveniently into a
hex/ASCII dump display (illustrated below).
  ! rpa_v6_ud
    ---------------------- User Data Section ----------------------
    nbytes sec_len - 8
    select (sec_subtype _at_)
      opt (0x01)
        indent 5
        call rpa_v6_ud_hwregs
      opt (0x10)
        call rpa_v6_device_driver_dets
      ...
      default
        hex_dump nbytes cur_offset
        Above was Non-Decoded User Data
    endselect
The bottom-line summary (the reason why to replace)
is redisplayed for convenience and stored in a
critical errors database along with other
variables for use in displaying a final summary.
  ! log_crit_err
    Records certain info about "key" error
    Bottom Line cur_descript
    glo.logged_crits 1
    Store all the local variables for this error log entry
    dbstore
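
A hex/ASCII dump of the kind the hex_dump fallback produces is easy to approximate. The Python sketch below is an illustration only: the functions `parse_sense` and `hex_ascii_dump` and the output layout are assumptions, not the rsense built-in. The sample bytes are copied from the PROBLEM DATA of the AIX log entry shown earlier.

```python
def parse_sense(text: str) -> bytes:
    """Convert space-separated ASCII hex sense data into raw bytes."""
    return bytes.fromhex(text)


def hex_ascii_dump(data: bytes, width: int = 16) -> list:
    """Format data as dump lines: offset, hex bytes, printable ASCII."""
    pad = width * 3 - 1  # width of a full row of "XX " groups, minus last space
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02X}" for b in chunk)
        # Replace non-printable bytes with '.' as dump tools usually do.
        ascii_part = "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in chunk)
        lines.append(f"{offset:04X}  {hex_part:<{pad}}  |{ascii_part}|")
    return lines


# Thirteen 16-bit words taken from the sample log's sense data.
SAMPLE = "5537 3837 392E 3030 312E 4451 4430 345A 452D 5032 2D43 312D 4338"

if __name__ == "__main__":
    for line in hex_ascii_dump(parse_sense(SAMPLE)):
        print(line)
```

Run on these bytes, the ASCII column spells out the location code U7879.001.DQD04ZE-P2-C1-C8, the same location that the diagnostic results report for the first FRU.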
11
rsense internals
Event Manager
  • Error Log Parser, for each log entry
    • Creates internal variables for the defined
      (non-hex sense data) fields of each log
    • Identifies and stores valid hex sense data into
      an indexable array
    • Through the Event Manager, dispatches a script for
      each log entry based on log type
  • rsense scripts for events such as
    • rsense program start
    • Phase one parsing of log start
    • End of phase one / start of phase two parsing
    • End of phase two

Log Summarizer: Gives a summary view,
chronologically and by count, of each log entry
using the bottom-line description.
rsense Program: C (40k lines); object
oriented; compiled for AIX and for Linux; Dan
Henderson author and maintainer.
Script Interpreter: Like many interpreted
languages, it can execute logical, numeric and string
operations in a structured program fashion (with
flow control and subroutines). It has additional
built-in constructs to efficiently process a
large data array, and to compose and display data
values from arbitrary bits and bytes. It formats a plain
language description of the data values using
constructs to test for the value of bits,
bytes and words. It supports a drill-down approach
for determining the best bottom-line short text
representation of the meaning of all the data
decoded.
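
As a rough illustration of the parser and event manager split described above, the Python sketch below registers decode routines by log label and falls back to the log's own description when no routine exists. Everything here (the `script` decorator, `parse_entry`, `dispatch`, the dictionary layout) is hypothetical and much simpler than rsense itself.

```python
HANDLERS = {}


def script(label):
    """Register a decode routine for one log label (hypothetical API)."""
    def register(func):
        HANDLERS[label] = func
        return func
    return register


def parse_entry(fields: dict, sense_hex: str) -> dict:
    """Turn the text fields into variables and the hex into an indexable array."""
    return {
        "vars": dict(fields),
        "sense": bytes.fromhex(sense_hex),
        # The log's description is the default bottom line.
        "bottom_line": fields.get("Description", ""),
    }


@script("SCAN_ERROR_CHRP")
def decode_scan_error(entry: dict) -> None:
    # Illustrative decode only: report the first sense byte as the version.
    entry["bottom_line"] = f"RPA log, version {entry['sense'][0]}"


def default_script(entry: dict) -> None:
    """No decode available: keep the default bottom line."""


def dispatch(entry: dict) -> str:
    """Call the routine matching the entry's label; return the bottom line."""
    handler = HANDLERS.get(entry["vars"].get("LABEL"), default_script)
    handler(entry)
    return entry["bottom_line"]


if __name__ == "__main__":
    e = parse_entry({"LABEL": "SCAN_ERROR_CHRP",
                     "Description": "UNDETERMINED ERROR"}, "0644 00E0")
    print(dispatch(e))
```

Keeping log-format parsing separate from sense-data decoding is what allows one decode routine to serve logs that arrive in different containers.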
12
rsense Scripting Language
  • Basic syntax from the early 1990s
    • Meant to quickly take a bit/byte spec
      and mark it up to create a decoding script
      routine
    • Many ways to parse and dissect an array of sense
      data, including positional and relative
      procession through the data
    • Data decodable as numeric, Boolean or ASCII
    • Lightweight syntax to display text strings based
      on decoded values and to identify candidates for
      the bottom-line description of a problem
  • Scripts stored in script libraries for easy
    maintenance
  • Scripts translated into an internal executable
    code representation using a just-in-time
    approach
  • Standard script language features, including
    • Structured language flow control
    • Local and global variables
    • Integer arithmetic
    • Rich set of string and ASCII/hex conversion
      routines
  • Advanced features for error log correlation and
    summarization
    • Variables can be stored into a database
    • Database can be queried and sorted
      • For custom summarizations
      • To do advanced error log correlation of multiple
        errors
    • Built-in functions for creating and manipulating
      timestamps

13
Sample rsense Customized Summary
  ! postprocess_v6sum
    Summary of Key Errors
    dbselect default
    dbasort first_src
    if (db.have_a_rec)
      do_rec 0
      Store first record
      rc strcpy(cnt_abstract,db.abstract)
      rc strcpy(cnt_descript,db.cur_descript)
      cnt_first_src db.first_src
      rc strcpy(cnt_all_loccodes,db.all_loccodes)
      store_cmp cnt_abstract cnt_descript cnt_first_src cnt_all_loccodes
      cnt 0
    endif
    while (db.have_a_rec)
      Get next record
      dbnext
      cnt cnt 1
      test_a_rec db.have_a_rec
      if (! db.have_a_rec)
        do_rec 1
      else
        rc strcpy(new_abstract,db.abstract)
        rc strcpy(new_descript,db.cur_descript)
        new_first_src db.first_src
        rc strcpy(new_all_loccodes,db.all_loccodes)
        cur_cmp new_abstract new_descript new_first_src new_all_loccodes
        rc strcmp(store_cmp,cur_cmp)
        if (rc 0)
          Will have to print off previous cnt record
          do_rec 1
        endif
      endif
      if (do_rec)
        Will display old record
        cnt Incidents of cnt_abstract -- cnt_descript SRC .8x\ cnt_first_src
        if (! strcmp(cnt_all_loccodes,""))
          CALLOUTS cnt_all_loccodes
        else
        endif
        Make new record old
        cnt 0
        do_rec 0
        rc strcpy(store_cmp,cur_cmp)
        rc strcpy(cnt_abstract,new_abstract)
        rc strcpy(cnt_descript,new_descript)
        cnt_first_src new_first_src
        rc strcpy(cnt_all_loccodes,new_all_loccodes)
      endif
    endwhile
The first summary is customized for key errors (by the
script shown to the right) using database
queries. (Note that the summary gives quick access to
the platform error essentials: SRC, FRU callouts,
and the bottom line.)
rsense generates the second summary automatically (no
customized script). It summarizes by count and
type (including the bottom-line summary). rsense can
also automatically generate a chronological
summary.
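
The customized summary amounts to sorting the stored records and counting runs of identical key errors. A compact Python equivalent of that idea, using `itertools.groupby`, might look like the sketch below. The record field names follow the script (abstract, cur_descript, first_src, all_loccodes); the function name, the dictionary layout and the choice to print callouts only when they are non-empty are assumptions.

```python
from itertools import groupby


def key_error_summary(records):
    """Count identical key errors after sorting, one output line per group."""
    def key(r):
        return (r["first_src"], r["abstract"], r["cur_descript"], r["all_loccodes"])

    lines = []
    for (src, abstract, descript, loccodes), group in groupby(
            sorted(records, key=key), key=key):
        count = sum(1 for _ in group)
        line = f"{count} Incidents of {abstract} -- {descript} SRC {src:08X}"
        if loccodes:  # assumed: callouts are shown only when present
            line += f" CALLOUTS {loccodes}"
        lines.append(line)
    return lines


if __name__ == "__main__":
    ue = {"first_src": 0xB120F667, "abstract": "SCAN_ERROR_CHRP",
          "cur_descript": "UE during maint scrub or L3 connections test",
          "all_loccodes": "U7879.001.DQD04ZE-P2-C1-C8"}
    for line in key_error_summary([ue, dict(ue)]):
        print(line)
```

Sorting first is what makes a single pass sufficient: identical records become adjacent, so a group ends exactly when the comparison key changes, which is the same trigger the script uses to print the previous record.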
14
Multiple Log Coordination: Application One
  • Common hardware shared across two systems or
    operating system images.
  • Shared hardware unable to communicate error
    information directly
  • Any single OS instance unable to localize source
    of the fault
  • Error Log Coordination could determine if fault
    is with one node or the other, or the device in
    the middle

[Figure: Node A and Node B attached to a common I/O device (e.g. switch); the Node A Log and Node B Log are the inputs to rsense]

rsense Pass One: Decodes logs from the nodes. Stores
summary info into the database.
  • rsense Pass Two
    • Decodes and displays each log entry
      as encountered.
    • When a fault is encountered that may be
      between the two nodes, queries the database for the presence
      of faults in the other node
    • Time-based functions can limit the search to faults
      within the same timespan
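
The two-pass idea can be sketched in Python as below. The decision rule used here is an assumption made for illustration (a fault seen by both nodes within the same time window points at the shared device, a fault seen by only one node points at that node); the slides do not spell out the actual rule. The record layout, function names and the five-minute window are hypothetical.

```python
from datetime import datetime, timedelta


def faults_on_other_node(db, fault, window):
    """Pass-two query: faults from a different node within the time window."""
    return [
        rec for rec in db
        if rec["node"] != fault["node"]
        and abs(rec["time"] - fault["time"]) <= window
    ]


def localize(db, fault, window=timedelta(minutes=5)):
    """Suggest where a link fault most likely sits (illustrative rule only)."""
    if faults_on_other_node(db, fault, window):
        return "common device suspected"
    return f"node {fault['node']} suspected"


if __name__ == "__main__":
    t0 = datetime(2005, 1, 29, 18, 17, 43)
    # Pass one: summary records from both nodes' logs go into one database.
    db = [
        {"node": "A", "time": t0, "bottom_line": "link error"},
        {"node": "B", "time": t0 + timedelta(minutes=2), "bottom_line": "link error"},
    ]
    print(localize(db, db[0]))
```

Pass one only fills the database; the correlation happens in pass two, when each fault can be checked against everything already stored.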
15
Multiple Log Coordination: Application Two
  • A fault encountered at IPL must be coordinated with a
    previous run-time event
  • Graphically, the previous example log for a system
    showed:

Error A -- During run time, a Memory UE
(uncorrectable error) occurred. Suspected FRUs:
Memory DIMMs C6 and C3 on Card 2
Error B -- Log entry indicating System IPL
Error C -- During IPL, a Memory UE (uncorrectable
error) occurred. Suspected FRUs: Memory DIMMs C1
and C8 on Card 1
  • Each fault handled individually would end in
    replacing 2 DIMMs on each of 2 memory cards
  • The odds of two simultaneous multi-bit errors on 2
    cards are small
  • Correlating these two errors might lead to a
    different callout
  • The firmware handling the IPL error is different from
    the firmware handling the run-time error
  • Correlation through firmware is not easily
    accomplished
  • Correlation in the support center by a tool like
    rsense could be effective, especially if the
    problem in question is relatively frequent.
  • Possible method
    • Pass 1: All errors are summarized and stored in
      the database
    • Pass 2: For Errors A and B as shown below

[Flowchart: Pass 2 handling]
Err B (IPL time):
  Determine time of last IPL
  Type A encountered within reboot_time_pls_minutes of IPL?
    n: Unaltered parts callout
    otherwise: Make no parts callout, indicating this error is an
      artifact of previous error A
Err A (run time):
  Determine time of next IPL
  Type B err seen immediately after IPL?
    n: Unaltered parts callout
    otherwise: Unaltered parts callout, but acknowledge that Err
      B correlated with this one
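
One way to read the flowchart is sketched below in Python. This is an interpretation, not the rsense logic: the function names, the 30-minute default for reboot_time_pls_minutes and the 10-minute meaning of "immediately after IPL" are all assumptions.

```python
from datetime import datetime, timedelta

UNALTERED = "unaltered parts callout"


def callout_for_ipl_error(err_time, ipl_times, runtime_error_times,
                          reboot_time_pls_minutes=30):
    """IPL-time error: suppress the callout if a run-time error shortly
    preceded the IPL it belongs to."""
    earlier = [t for t in ipl_times if t <= err_time]
    if not earlier:
        return UNALTERED
    last_ipl = max(earlier)
    window = timedelta(minutes=reboot_time_pls_minutes)
    if any(last_ipl - window <= t <= last_ipl for t in runtime_error_times):
        return "no parts callout: artifact of previous run-time error"
    return UNALTERED


def callout_for_runtime_error(err_time, ipl_times, ipl_error_times,
                              after_ipl_minutes=10):
    """Run-time error: keep the callout, but note a correlated IPL error."""
    later = [t for t in ipl_times if t > err_time]
    if not later:
        return UNALTERED
    next_ipl = min(later)
    limit = next_ipl + timedelta(minutes=after_ipl_minutes)
    if any(next_ipl <= t <= limit for t in ipl_error_times):
        return UNALTERED + "; correlated with IPL-time error"
    return UNALTERED


if __name__ == "__main__":
    runtime_ue = datetime(2005, 1, 29, 10, 0)
    ipl = datetime(2005, 1, 29, 10, 20)
    ipl_ue = datetime(2005, 1, 29, 10, 22)
    print(callout_for_ipl_error(ipl_ue, [ipl], [runtime_ue]))
    print(callout_for_runtime_error(runtime_ue, [ipl], [ipl_ue]))
```

With this reading, the memory DIMMs called out by the run-time error are still replaced, while the IPL-time callout on the other card is withheld as a likely side effect.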
16
rsense in Product Engineering: PFA and Data Mining
  • In a support center, rsense scripts are easily
    written to mine called-home error log entries, to
    investigate pervasive issues and to very quickly
    make ad-hoc studies.
  • Some advantages of using rsense over other
    scripting methods:
    • In-depth analyses of sense data for many
      different platforms and devices have already been written
      for pSeries.
    • Since rsense separates the decoding of the log
      format from the decoding of sense data, the same
      script can be written to analyze Linux,
      AIX and service processor firmware logs
    • Built-in functions of rsense simplify the process
      of script writing

DASD Pervasive Issue: A hard-drive error was
discovered in the field with a particular
detailed sense data error entry. With no special
programming, rsense was capable of scanning a
library of multiple system logs and identifying the
extent of the issue in existing systems.
System Average Uptime: To answer the question of
the average time between IPLs of a model type,
rsense was used to examine multiple error log
entries and calculate results. Though not a
difficult programming assignment in any
interpreted language, rsense, with its built-in
database and time calculation features, made the
script programming time a matter of a few minutes.
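
The uptime question reduces to averaging the gaps between consecutive IPL timestamps. A minimal Python sketch of that calculation follows; the function name and the assumption that IPL times have already been extracted from the logs are hypothetical.

```python
from datetime import datetime


def average_uptime_hours(ipl_times):
    """Average time between consecutive IPLs, in hours.

    Returns None when fewer than two IPLs are known, since no interval
    can be measured.
    """
    times = sorted(ipl_times)
    if len(times) < 2:
        return None
    gaps = [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]
    return sum(gaps) / len(gaps) / 3600.0


if __name__ == "__main__":
    ipls = [datetime(2005, 1, 1), datetime(2005, 1, 2), datetime(2005, 1, 4)]
    print(average_uptime_hours(ipls))
```

Note that this measures the interval between IPLs, as the slide describes, and so slightly overstates uptime by including the time each system spent down before rebooting.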
  • pSeries predictive failure analysis can be
    accomplished at the
    • Hardware level, reported to the operating system.
    • Firmware level, reported to the operating system.
    • Device driver level, reported to the operating
      system.
    • OS-based error log analysis program, looking at
      multiple log entries
  • Maturity of the predictive failure analysis
    algorithms often dictates which approach can be
    used.
  • When a product is shipped without a PFA feature,
    however, ad-hoc PFA analysis can be piloted
    through rsense/dsense and then migrated into a
    delivered product through one of the above
    mechanisms

7) Calculations Assuming a 2.3 GB Tape
    maxtape 2292769
    Compute tape used and tape used as fraction
    tapeused maxtape - remaining
    perused ( tapeused 100) / maxtape
    perfrac ((( tapeused 100) maxtape ) 100) / maxtape
    Compute soft err rate as percent of tape used
    if( tapeused gt 0 )
      pererrs (softerr 100 )/ tapeused
      fractp (((softerr 100) tapeused) 100) / tapeused
    else
      pererrs 0
      fractp 0
    endif
    Computations
    maxtape d maxtape
    tapeused maxtape - remaining d - d d maxtape remaining tapeused
    d blocks or (d..2d ) of the tape used \ tapeused perused perfrac
    had d Soft errors softerr
    Error Percentage d..2d pererrs fractp
    if (read_cmnd)
      if (pererrs gt 1)
        Calculated error percentage exceeds max (1) allowed on read
      endif
    elseif (write_cmnd)
      if (pererrs gt 1)
        Calculated error percentage exceeds max (2) allowed on write
      endif
    endif
Tape Drive Predictive Failure Analysis
Example: After a tape product was shipped, FA and
log analysis on multiple drives showed that a
threshold of rd/wr soft errors was indicative of
future tape/drive failure. Soft errors were
logged in the error log, but no analysis was being
made of whether hardware should be replaced. An
rsense/dsense script was added to the decode of
tape logs so that the support center could determine whether
to replace a tape/drive, based on a calculation
involving counters logged in sense data. Later, the
soft error analysis was incorporated into the OS diagnostic
Error Log Analysis.
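
The tape calculation uses integer arithmetic to get a percentage with two fractional digits. The Python sketch below follows the same steps; it assumes the operators lost from the transcript are multiplication and modulo, and it takes the read and write limits (1 and 2 percent) from the message texts rather than from the comparisons, which show 1 in both branches. The function names are hypothetical.

```python
MAX_TAPE_BLOCKS = 2292769  # blocks on a 2.3 GB tape, from the script
READ_MAX_PERCENT = 1       # assumed from "exceeds max (1) allowed on read"
WRITE_MAX_PERCENT = 2      # assumed from "exceeds max (2) allowed on write"


def percent_with_fraction(numerator: int, denominator: int):
    """Integer-only percentage: (whole percent, hundredths of a percent)."""
    scaled = numerator * 100
    whole = scaled // denominator
    hundredths = ((scaled % denominator) * 100) // denominator
    return whole, hundredths


def tape_soft_error_check(remaining: int, softerr: int, command: str,
                          maxtape: int = MAX_TAPE_BLOCKS) -> dict:
    """Compute tape usage and soft error rate, and test the threshold."""
    tapeused = maxtape - remaining
    perused, perfrac = percent_with_fraction(tapeused, maxtape)
    if tapeused > 0:
        pererrs, fractp = percent_with_fraction(softerr, tapeused)
    else:
        pererrs, fractp = 0, 0
    limit = READ_MAX_PERCENT if command == "read" else WRITE_MAX_PERCENT
    return {
        "tapeused": tapeused,
        "used_percent": f"{perused}.{perfrac:02d}",
        "error_percent": f"{pererrs}.{fractp:02d}",
        # Like the script, compare only the whole-percent part.
        "exceeds": pererrs > limit,
    }


if __name__ == "__main__":
    print(tape_soft_error_check(MAX_TAPE_BLOCKS - 1000, 25, "read"))
```

Because only the whole-percent part is compared, an error rate of 2.50 percent exceeds the read limit but not the write limit in this sketch.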
17
rsense Future
  • rsense continues to be enhanced to meet support
    center needs
  • Possible future activities for consideration
    • Parsing and decoding the Service Action Event log of
      the Hardware Management Console
      • Summer 2005
    • Support for Linux evlog as it becomes adopted for
      logging of pSeries
    • Support for analysis of system soft errors as
      these are called home in pSeries
    • Incorporation of rsense capabilities for decode
      of logs generated for the xSeries product space