Title: Panorama for detecting outliers methods in structural surveys implementation on French and Ukrainian data
1Panorama for detecting outliers methods in
structural surveys implementation on French and
Ukrainian data
- Olga A. Vasyechko, Research Institute of Statistics of Ukraine, O.Vasechko_at_ukrstat.gov.ua
- Noureddine Benlagha, Université Paris 2, ERMES UMR 7017 (CNRS), blnouri2002_at_yahoo.fr
- Michel Grun-Rehomme, Université Paris 2, ERMES UMR 7017 (CNRS), grun_at_u-paris2.fr
2Introduction(1)
- Statistical analysis requires ensuring:
- The quality of data
- The robustness of indicators
3Introduction(2)
- Two categories of enterprises
- Small enterprises
- Big and middle enterprises
- We consider the turnover as the main variable
4Why is it necessary to detect outliers before
mining the data?
- Theoretical reasons: extreme values increase the variance and degrade the accuracy of estimates
- Practical reasons: we have to detect outliers to prepare the next surveys
5Some classical tests of outliers detection
- Grubbs (1950, 1969)
- Grubbs and Beck (1972), Tietjen and Moore (1972)
- Rosner (1975)
- Atkinson A.C., Koopman S.J., Shephard N. (1997)
- Tancredi et al. (2002)
- Dominici F., Cope L., Naiman D.Q., Zeger S.L. (2005)
6Objective
- Use different methods to detect atypical units in structural business surveys
7Different Methods
- Algebraic method
- New non-parametric method
- Graphical method: box plot
- Probabilistic model: extreme value theory
8Application
- These various methods are applied to
- French data
- Ukrainian data
9Algebraic methods(1)
- The distance from the unit to the center of the
distribution
- xi: the value of unit i
- m: central tendency parameter
- s: scale parameter
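A minimal sketch of this kind of distance, assuming the usual standardized form d_i = |x_i - m| / s. The choice of the median for m and the median absolute deviation for s, the cutoff of 3, and the turnover figures are illustrative assumptions, not values taken from the slides.

```python
import statistics


def standardized_distances(values, robust=True):
    """Return d_i = |x_i - m| / s for every unit.

    robust=True  -> m = median, s = median absolute deviation (assumption)
    robust=False -> m = mean,   s = sample standard deviation
    """
    if robust:
        m = statistics.median(values)
        s = statistics.median([abs(x - m) for x in values])
    else:
        m = statistics.mean(values)
        s = statistics.stdev(values)
    if s == 0:
        raise ValueError("scale parameter is zero; distances are undefined")
    return [abs(x - m) / s for x in values]


def flag_outliers(values, cutoff=3.0, robust=True):
    """Return the values whose standardized distance exceeds the cutoff."""
    distances = standardized_distances(values, robust)
    return [x for x, d in zip(values, distances) if d > cutoff]


# Hypothetical turnover figures, for illustration only.
turnover = [12, 15, 14, 13, 16, 15, 14, 250]
print(flag_outliers(turnover))
```

With the mean and standard deviation, a single very large unit inflates s and can mask itself, which is why the robust pair is used as the default here.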
10Algebraic methods(2)
- Hidiroglou and Berthelot's interval
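A hedged sketch of a quartile-based interval in the spirit of Hidiroglou and Berthelot: the bounds are the median minus or plus C times a quartile distance, and each quartile distance is floored at |A x median| so the interval does not collapse when the quartiles are very close to the median. The constants A = 0.05 and C = 4, the function name, and the data are assumptions made for illustration; the exact variant shown on the slide is not known.

```python
import statistics


def hb_interval(values, c=4.0, a=0.05):
    """Return (lower, upper) acceptance bounds around the median."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    # Quartile distances, floored at |a * median|.
    d_q1 = max(med - q1, abs(a * med))
    d_q3 = max(q3 - med, abs(a * med))
    return med - c * d_q1, med + c * d_q3


# Hypothetical turnover figures, for illustration only.
turnover = [12, 15, 14, 13, 16, 15, 14, 250]
lower, upper = hb_interval(turnover)
outliers = [x for x in turnover if x < lower or x > upper]
print(lower, upper, outliers)
```

Because the lower and upper distances are computed separately, the interval can be asymmetric, which suits skewed variables such as turnover.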
11Graphic method
- Box plot method
- Tukey's limits of a box plot
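Tukey's limits are commonly taken as Q1 - 1.5 IQR and Q3 + 1.5 IQR, where IQR = Q3 - Q1. The sketch below assumes that standard multiplier of 1.5 and one particular quartile convention (the "inclusive" method of Python's statistics module); the data are hypothetical.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return the (lower, upper) box plot fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical turnover figures, for illustration only.
turnover = [12, 15, 14, 13, 16, 15, 14, 250]
low, high = tukey_fences(turnover)
outside = [x for x in turnover if x < low or x > high]
print(low, high, outside)
```

Different quartile conventions give slightly different fences on small samples, so the convention should be fixed before comparing results across data sets.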
12A non parametric method to detect extreme
values(1)
- Two aspects
- The distance
- The form of distribution
13A non parametric method of extreme values
detection (2)
- The indicator of contribution
14A non parametric method of extreme values
detection (4)
- Properties of the indicator
- This indicator has the following properties
- (1) 0 ≤ In(i) ≤ 1 for any i and N, and it can reach its end points
- (2) In(i) is increasing over the values above the mean of X and decreasing otherwise
- (3) In(i) can admit a point of inflection over the values above the mean of X
15A non parametric method of extreme values
detection (5)
- The consequences
- If In(i) is close to 1, then unit i is an extreme value
- Otherwise, unit i is a normal observation
16The extreme value theory
- Two approaches
- The classical method (EVT)
- The peak over threshold (POT)
17The peak over threshold method
- Two problems
- Estimating three parameters of the distribution
- Fixing the threshold
18The generalized Pareto distribution
19The generalized Pareto distribution
- G(y): the generalized extreme value distribution
- H(y): the generalized Pareto distribution
- ξ: the shape parameter (the tail index)
- µ: the location parameter
- σ: the scale parameter
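For reference, the standard textbook forms of the two distributions are sketched below, with ξ the shape parameter (tail index), µ the location parameter and σ the scale parameter. These are the usual parameterizations for ξ ≠ 0 and are an assumption about what the slides displayed.

```latex
% Generalized extreme value distribution, defined where 1 + \xi (y - \mu)/\sigma > 0
G(y) = \exp\left\{ -\left[ 1 + \xi \, \frac{y - \mu}{\sigma} \right]^{-1/\xi} \right\}

% Generalized Pareto distribution, same parameters
H(y) = 1 - \left[ 1 + \xi \, \frac{y - \mu}{\sigma} \right]^{-1/\xi}

% Link between the two, valid where \ln G(y) > -1
H(y) = 1 + \ln G(y)
```

A positive ξ corresponds to a heavy (Pareto-type) tail, and the limit ξ → 0 gives the exponential tail.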
20Estimation of the tail index (ξ)
- Maximum likelihood estimator
- Hill estimator (1975)
- Pickands estimator (1975)
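As an illustration of one of these estimators, here is a minimal sketch of the Hill estimator: the average of the log-ratios of the k largest observations to the (k+1)-th largest. It assumes positive data and a heavy tail (ξ > 0); the function name and the choice of k are hypothetical and left to the analyst.

```python
import math


def hill_estimator(values, k):
    """Hill estimate of the tail index from the k largest observations."""
    xs = sorted(values, reverse=True)
    if not 0 < k < len(xs):
        raise ValueError("k must satisfy 0 < k < number of observations")
    threshold = xs[k]  # the (k+1)-th largest observation
    if threshold <= 0:
        raise ValueError("the Hill estimator requires positive observations")
    return sum(math.log(x / threshold) for x in xs[:k]) / k


# Hypothetical values, for illustration only.
print(hill_estimator([1, 2, 4, 8, 16], k=2))
```

In practice the estimate is usually plotted against k, and a value is read from a range of k where the curve is roughly stable.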
21Choice of the threshold(1)
- Two methods
- The mean excess function
- Using the extreme quantile
22Choice of the threshold(2)
- The mean excess function.
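The mean excess function is e(u) = E[X - u | X > u]. For a generalized Pareto tail with shape ξ < 1 it is linear in u, so a common rule is to pick the threshold above which the empirical curve looks roughly linear. The sketch below computes the empirical version; the function names and the data are illustrative assumptions.

```python
def mean_excess(values, u):
    """Empirical mean excess over threshold u: average of (x - u) for x > u."""
    excesses = [x - u for x in values if x > u]
    if not excesses:
        raise ValueError("no observation exceeds the threshold")
    return sum(excesses) / len(excesses)


def mean_excess_curve(values):
    """Return (u, e(u)) pairs, using each distinct observation except the
    largest as a candidate threshold."""
    candidates = sorted(set(values))[:-1]
    return [(u, mean_excess(values, u)) for u in candidates]


# Hypothetical values, for illustration only.
for u, e in mean_excess_curve([1, 2, 3, 4, 10]):
    print(u, e)
```

The last points of the curve rest on very few exceedances and are noisy, so they are usually given little weight when judging linearity.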
23Choice of the threshold(3)
- Using the extreme quantile.
- Where F^(-1) is the inverse of the distribution function of X.
24Approximation of the GPD extreme quantile
- Problem: estimating ξn, µn, σn
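Once the parameters are estimated, an extreme quantile can be approximated with the standard peaks-over-threshold formula x_p = u + (σ/ξ) [((n/N_u)(1 - p))^(-ξ) - 1], where u is the threshold, n the sample size and N_u the number of exceedances. The sketch below assumes this textbook form with the threshold playing the role of the location; the parameter values in the example are made up and nothing here is fitted to the survey data.

```python
import math


def pot_quantile(u, xi, sigma, n, n_exceed, p):
    """Approximate the p-quantile of X from a GPD fitted above threshold u.

    u        : threshold
    xi       : estimated shape parameter (tail index)
    sigma    : estimated scale parameter (must be positive)
    n        : total number of observations
    n_exceed : number of observations above u
    p        : target probability level, e.g. 0.99
    """
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    if not 0 < n_exceed <= n:
        raise ValueError("n_exceed must satisfy 0 < n_exceed <= n")
    ratio = (n / n_exceed) * (1.0 - p)
    if ratio > 1:
        raise ValueError("p corresponds to a level below the threshold")
    if abs(xi) < 1e-12:
        # Limit case xi -> 0: exponential tail.
        return u - sigma * math.log(ratio)
    return u + (sigma / xi) * (ratio ** (-xi) - 1.0)


# Made-up parameter values, for illustration only.
print(pot_quantile(u=10.0, xi=0.5, sigma=2.0, n=1000, n_exceed=100, p=0.99))
```

Units above such a quantile can then be treated as candidates for extreme values, subject to the sensitivity of the result to the chosen threshold.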
25Data
- French and Ukrainian data: turnover volumes (in 2003) of small enterprises
- 4 divisions
- Metalworking (28)
- Construction (45)
- Retail trade (52)
- Computer operations (72)
26Empirical results
- Software used
- SAS software
- Extreme software
- Matlab software
27Results
28Conclusion
- A principal component analysis
- The first principal axis (55% of explained inertia) contrasts the pair (box plot, In) with the other criteria.
- The second principal axis (26% of explained inertia) represents the extreme value theory.
- The last axis corresponds to the expert's point of view.
30 Q2006