Title: Continuity Equations: Analytical Monitoring of Business Processes and Anomaly Detection in Continuou
1 Continuity Equations: Analytical Monitoring of Business Processes and Anomaly Detection in Continuous Auditing
- Michael G. Alles
- Alexander Kogan
- Miklos A. Vasarhelyi
- Jia Wu
- Rutgers University
- Nov, 2005
2 Data-oriented CA: Automation of Substantive Testing
- Formalization of BP rules as data integrity constraints.
- Verification of data integrity → identification of exceptions.
- Selection of critical BP metrics and development of stable business flow (continuity) equations.
- Monitoring of continuity equation residuals → identification of anomalies.
3 Establishing Data Integrity: A Procurement Example
- Referential integrity along the business cycle and identification of completed cycles:
  - P.O. → Shipment receipt → voucher payment.
- Identification of data consistency issues and automatic alarms to resolve exceptions:
  - Changes in purchase order vendor numbers.
  - Discrepancies between the totals and the sums of line items.
  - Discrepancies between matched voucher amounts.
4 Detection of Exceptions
- Referential integrity violations:
  - PO without matching requisition.
  - Received item without matching PO.
  - Payments without matching received items.
- Data integrity violations:
  - PO has zero order quantity.
  - Received item has negative quantity.
  - Invalid payment check numbers (e.g., all 0s).
  - Gross payment amount is smaller than net payment amount.
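The checks listed on this slide can be expressed as simple rules over the transaction records. The following is a minimal sketch, assuming hypothetical record layouts: the field names (`req_id`, `po_id`, `receipt_id`, `check_no`, `gross`, `net`) and the sample data are illustrative and do not come from the audited system.

```python
# Hypothetical sample records; all field names are assumptions for illustration.
requisitions = [{"req_id": "R1"}, {"req_id": "R2"}]
purchase_orders = [
    {"po_id": "P1", "req_id": "R1", "quantity": 10},
    {"po_id": "P2", "req_id": "R9", "quantity": 5},   # no matching requisition
    {"po_id": "P3", "req_id": "R2", "quantity": 0},   # zero order quantity
]
receipts = [
    {"receipt_id": "V1", "po_id": "P1", "quantity": 10},
    {"receipt_id": "V2", "po_id": "P7", "quantity": 3},   # no matching PO
    {"receipt_id": "V3", "po_id": "P3", "quantity": -2},  # negative quantity
]
payments = [
    {"payment_id": "Y1", "receipt_id": "V1", "check_no": "004512",
     "gross": 100.0, "net": 98.0},
    {"payment_id": "Y2", "receipt_id": "V8", "check_no": "000000",
     "gross": 50.0, "net": 55.0},  # violates three rules at once
]


def find_exceptions(requisitions, purchase_orders, receipts, payments):
    """Return (record_id, rule) pairs, one for each violated rule."""
    req_ids = {r["req_id"] for r in requisitions}
    po_ids = {p["po_id"] for p in purchase_orders}
    receipt_ids = {r["receipt_id"] for r in receipts}
    found = []

    # Referential and data integrity checks on purchase orders.
    for po in purchase_orders:
        if po["req_id"] not in req_ids:
            found.append((po["po_id"], "PO without matching requisition"))
        if po["quantity"] == 0:
            found.append((po["po_id"], "PO has zero order quantity"))

    # Checks on received items.
    for rec in receipts:
        if rec["po_id"] not in po_ids:
            found.append((rec["receipt_id"], "Received item without matching PO"))
        if rec["quantity"] < 0:
            found.append((rec["receipt_id"], "Received item has negative quantity"))

    # Checks on payments.
    for pay in payments:
        if pay["receipt_id"] not in receipt_ids:
            found.append((pay["payment_id"], "Payment without matching received item"))
        # A check number made only of zeros (or an empty one) is treated as invalid.
        if set(pay["check_no"]) <= {"0"}:
            found.append((pay["payment_id"], "Invalid payment check number"))
        if pay["gross"] < pay["net"]:
            found.append((pay["payment_id"], "Gross payment smaller than net payment"))

    return found


exceptions = find_exceptions(requisitions, purchase_orders, receipts, payments)
for record_id, rule in exceptions:
    print(record_id, "-", rule)
```

In a production setting these rules would more likely run as queries against the transaction database; plain Python sets are used here only to keep the sketch self-contained.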
5 Advanced Analytics in CA: BP Modeling Using Continuity Equations
- Continuity equations:
  - Statistical models capturing relationships between various business processes.
  - Can be used as expectation models in the analytical procedures of continuous auditing.
  - Originated in physical sciences (various conservation laws, e.g., mass, momentum).
- Continuity equations are developed using the methodologies of:
  - Simultaneous equation modeling (SEM).
  - Multivariate time series modeling (MTSM).
6 Basic Procurement Cycle
[Diagram: P.O.(t1) → Receive(t2) → Voucher(t3), with lag t2-t1 between order and receipt and lag t3-t2 between receipt and voucher]
7 Continuity Equations of Basic Procurement Cycle
- Receive(t2) = P.O.(t1)
- Voucher(t3) = Receive(t2)
- Aren't partial deliveries allowed?
- Are all orders delivered after exactly the same time lag?
- Are there any feedback loops?
8 Inferred Analytical Model of Procurement
- P.O.(t) = 0.24 P.O.(t-4) + 0.25 P.O.(t-14) + 0.56 Receive(t-15) + ε_PO
- Receive(t) = 0.26 P.O.(t-4) + 0.21 P.O.(t-6) + 0.60 Voucher(t-10) + ε_R
- Voucher(t) = 0.73 Receive(t-1) - 0.25 P.O.(t-7) + 0.22 P.O.(t-17) + 0.24 Receive(t-17) + ε_V
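To illustrate how such a lagged equation produces a prediction, here is a minimal sketch that evaluates the P.O. equation from this slide. It assumes the terms of the equation are joined by plus signs, that each series is a list indexed by period, and that the error term is dropped when predicting. The names `PO_EQUATION` and `predict` are hypothetical.

```python
# Each term is (series name, lag in periods, coefficient).
# Coefficients are taken from the P.O. equation on the slide; joining the
# terms with "+" is an assumption about the original notation.
PO_EQUATION = [("po", 4, 0.24), ("po", 14, 0.25), ("receive", 15, 0.56)]


def predict(equation, series, t):
    """Evaluate a lagged linear equation at period t (error term omitted)."""
    max_lag = max(lag for _, lag, _ in equation)
    if t < max_lag:
        raise ValueError(f"need at least {max_lag} periods of history before t")
    return sum(coef * series[name][t - lag] for name, lag, coef in equation)


# Made-up daily series, only to show the mechanics.
series = {
    "po": [float(i) for i in range(20)],
    "receive": [2.0 * i for i in range(20)],
}
predicted_po = predict(PO_EQUATION, series, t=16)
print(round(predicted_po, 2))
```

The Receive and Voucher equations can be evaluated the same way by supplying their own term lists.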
9 Detection of Anomalies
- Anomalies are detected if:
  - Observed P.O.(t) < Predicted P.O.(t) - Var
  - or
  - Observed P.O.(t) > Predicted P.O.(t) + Var
- Similarly for:
  - Receive(t)
  - Voucher(t)
- Var = acceptable threshold of variance.
- If there is an anomaly → generate alarm!
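The detection rule on this slide amounts to a two-sided band around the prediction. A minimal sketch follows, with hypothetical function names and made-up numbers; how `var` is chosen is left open here.

```python
def is_anomaly(observed, predicted, var):
    """True if the observation falls outside [predicted - var, predicted + var]."""
    return observed < predicted - var or observed > predicted + var


def alarms(observed_series, predicted_series, var):
    """Return the periods whose observed value triggers an alarm."""
    return [
        t
        for t, (obs, pred) in enumerate(zip(observed_series, predicted_series))
        if is_anomaly(obs, pred, var)
    ]


# Made-up values: periods 1 and 2 fall outside the +/- 20 band.
observed = [100.0, 130.0, 70.0, 110.0]
predicted = [100.0, 100.0, 100.0, 100.0]
alarm_periods = alarms(observed, predicted, var=20.0)
print(alarm_periods)
```

The same rule applies unchanged to Receive(t) and Voucher(t) with their own predictions and thresholds.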
10 Steps of Analytical Modeling and Monitoring Using Continuity Equations
- Choose essential business processes to model (purchasing, payments, etc.).
- Define (physical, financial, etc.) metrics to represent each process, e.g., amount of purchase orders, quantity of items received, number of payment vouchers processed.
- Choose the levels of aggregation of metrics:
  - By time (hourly, daily, weekly), by business unit, by customer or vendor, by type of products or services, etc.
11 Steps of Analytical Modeling and Monitoring Using Continuity Equations - II
- Identify and estimate stable statistical relationships between business process metrics: Continuity Equations (CEs).
- Define acceptable thresholds of variance from the expected relationships.
- If the variances (residuals) exceed the acceptable levels, alert human auditors to investigate the anomaly (i.e., the relevant sub-population of transactions).
12 How Do We Evaluate CE Models?
- Linear Regression Model is the classical benchmark for comparison.
- Models are compared on two aspects:
  - Prediction Accuracy.
  - Anomaly Detection Capability.
- Mean Absolute Percentage Error (MAPE) is used to measure prediction accuracy.
  - MAPE = Abs(predicted value - actual value) / (actual value) × 100%
- A good analytical model is expected to have high prediction accuracy, i.e., low MAPE.
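As a worked example of the MAPE formula, here is a minimal sketch that averages the absolute percentage error over all observations. The function name `mape` and the sample numbers are illustrative only.

```python
def mape(predicted, actual):
    """Mean absolute percentage error, in percent.

    Raises ZeroDivisionError if any actual value is zero, since the
    percentage error is undefined in that case.
    """
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(p - a) / abs(a) for p, a in zip(predicted, actual)]
    return 100.0 * sum(errors) / len(errors)


# Two predictions, each off by 10% of the actual value.
example_mape = mape(predicted=[110.0, 90.0], actual=[100.0, 100.0])
print(example_mape)
```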
13 Prediction Accuracy Comparison: Results Analysis
- Prediction accuracy comparison results:
  - Linear regression (best).
  - Multivariate Time Series (middle).
  - Simultaneous Equations (worst).
- Difference is small (< 2%).
- Noise in our data sets may pollute the results.
- Prediction accuracy is relatively good for all three models:
  - MAPE is around 0.40 (Leitch and Chen 2003).
  - Other studies report over 100% MAPE.
14 Simulating Error Stream: The Ultimate Test of CA Analytics
- Seed errors of various magnitudes into a randomly chosen subset of the holdout sample.
- Identify anomalies as those observations in the holdout sample for which the variance exceeds the acceptable threshold of variance.
- Test whether anomalies are the observations with seeded errors, and count the number of false positives and false negatives.
- Repeat this simulation several times by choosing different random subsets to seed errors into.
15 Acceptable Threshold of Variance
- What to use as acceptable threshold of variance?
- Prediction Interval:
  - Confidence interval for the predicted variable value.
- Anomalies are detected if:
  - Value in the observation < lower confidence limit, or
  - Value in the observation > upper confidence limit.
16 Error Seeding Procedure
- To simulate an anomaly detection scenario, we seed errors into the hold-out data set (47 obs.).
- Original anomalies are detected before error seeding.
- Errors are seeded into 8 randomly selected observations which do not have original anomalies.
- 5 different error magnitudes are used, one for each round of error seeding (10%, 50%, 100%, 200% and 400% of the actual value of the seeded observation).
- The above procedure is repeated 10 times to reduce the variance of the results.
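The seeding procedure can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the error is assumed to be added on top of the actual value (the slide does not state the sign), the hold-out values and the original anomaly positions are made up, and all names are hypothetical.

```python
import random

ERROR_MAGNITUDES = [0.10, 0.50, 1.00, 2.00, 4.00]  # fractions of the actual value
N_SEEDED = 8         # observations seeded per round
N_REPETITIONS = 10   # repetitions of the whole procedure


def seed_errors(actual, original_anomalies, magnitude, rng, n_seeded=N_SEEDED):
    """Return a copy of `actual` with errors seeded, plus the seeded positions.

    Only observations that are not original anomalies are eligible.
    Assumption: the error inflates the value by `magnitude` times the actual.
    """
    eligible = [i for i in range(len(actual)) if i not in original_anomalies]
    seeded_idx = sorted(rng.sample(eligible, n_seeded))
    seeded = list(actual)
    for i in seeded_idx:
        seeded[i] = actual[i] * (1.0 + magnitude)
    return seeded, seeded_idx


rng = random.Random(2005)                       # fixed seed for reproducibility
holdout = [100.0 + i for i in range(47)]        # made-up hold-out sample (47 obs.)
original_anomalies = {3, 17}                    # made-up positions

runs = []
for repetition in range(N_REPETITIONS):
    for magnitude in ERROR_MAGNITUDES:
        seeded, seeded_idx = seed_errors(holdout, original_anomalies, magnitude, rng)
        runs.append((repetition, magnitude, seeded_idx))

print(len(runs), "seeded data sets generated")
```

Each seeded data set would then be passed through the anomaly detection rule, and the flagged observations compared with `seeded_idx`.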
17 Measuring Anomaly Detection
- False positive error (false alarm, Type I error): a non-anomaly mistakenly detected by the model as an anomaly. Decreases efficiency.
- False negative error (Type II error): an anomaly that the model failed to detect. Decreases effectiveness.
- Detection rate is used for clarity of presentation: the rate of successful detection of seeded errors.
  - Detection rate = 1 - False Negative Error Rate
- A good analytical model is expected to have good anomaly detection capability: low false negative error rate (i.e., high detection rate) and low false positive error rate.
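These measures can be computed by comparing the set of flagged observations with the set of seeded ones. The sketch below makes two assumptions that the slide does not spell out: the false negative rate is taken over the seeded observations, and the false positive rate over the non-seeded ones. The function name and the sample numbers are illustrative.

```python
def detection_metrics(flagged, seeded, n_obs):
    """Compute detection, false negative and false positive rates.

    flagged: positions the model reported as anomalies
    seeded:  positions where errors were actually seeded
    n_obs:   total number of observations examined
    """
    flagged, seeded = set(flagged), set(seeded)
    n_clean = n_obs - len(seeded)
    if not seeded or n_clean <= 0:
        raise ValueError("need at least one seeded and one non-seeded observation")
    false_negative_rate = len(seeded - flagged) / len(seeded)
    false_positive_rate = len(flagged - seeded) / n_clean
    return {
        "detection_rate": 1.0 - false_negative_rate,
        "false_negative_rate": false_negative_rate,
        "false_positive_rate": false_positive_rate,
    }


# Made-up outcome: 3 of 8 seeded errors caught, 1 false alarm among 39 clean obs.
metrics = detection_metrics(
    flagged=[1, 2, 3, 40], seeded=[1, 2, 3, 4, 5, 6, 7, 8], n_obs=47
)
print(metrics)
```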
18 Simulated Error Correction
- CA makes it possible to investigate a detected anomaly in (nearly) real-time.
- Anomaly investigation can likely correct a detected problem in (nearly) real-time.
- Real-time problem correction results in utilizing the actual (not erroneous) values in analytical BP models for future predictions.
- Real-time error correction is likely to benefit future anomaly detection, and the magnitude of this benefit can be evaluated using simulation.
19 Benefit of Real-time Error Correction: MTSM
20 Anomaly Detection Rate Comparison Results
21 False Positive Error Comparison
22 Anomaly Detection Rate Comparison: Results Analysis
- SEM and MTSM outperform the linear regression model when the error magnitudes are large, even though linear regression has a slightly better detection rate when the error magnitudes are small.
- It is more important to detect material errors than non-material errors.
23 Concluding Remarks
- New CA-enabled analytical audit methodology: simultaneous relationships between highly disaggregated BP metrics.
- How to automate the inference and estimation of numerous CE models?
- How to identify and remove outliers from the historical data to estimate statistically valid CEs (step-wise re-estimation of CEs)?
- How to identify the need to re-estimate a CE model (trends in residuals)?
- How to make it worthwhile (trade-off between effectiveness, efficiency and timeliness)?
- Any patterns for detected errors?