- Bodhisattva Sen

Topics

- Fractile Graphical Analysis (FGA): definition, motivation, Prof. Mahalanobis' method
- Statistical Inference Using FGA: our ideas
- Two notions of Multivariate Quantiles: Geometric Quantiles and PCMs Fractile
- Extension of FGA to the Multiple Covariate Setup, using the above two notions of quantiles
- Evaluation of the performance of the methods using synthetic data and real life data

Want to compare the regression functions for two populations with a single covariate.

- We could look at $E(Y_1 \mid X_1 = x)$ and $E(Y_2 \mid X_2 = x)$.
- Instead we look at $E(Y_1 \mid F_1(X_1) = t)$ and $E(Y_2 \mid F_2(X_2) = t)$, where $F_1$, $F_2$ are the distribution functions of $X_1$ and $X_2$.

When $X_1$ and $X_2$ are not in comparable scales, the above does the necessary standardization.

Mahalanobis' Idea for Comparing Fractile Graphs (Econometrica, 1960)

- Divide the dataset into fractile groups according to the x-variable. Plot the y-averages for each fractile group and join the consecutive points (y-averages) by straight lines.
- Compute the area between the two fractile graphs for the two different samples. This is called the separation area.
- Using two sub-samples from the same population, the significance of the observed separation area is tested.
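
The first two steps can be sketched in a few lines of Python. This is only an illustrative sketch, not the original computation: the number of fractile groups `g`, the use of group midpoints `(j + 0.5)/g` as plotting positions, and the trapezoid rule for the area are all assumptions made here.

```python
import numpy as np

def fractile_graph(x, y, g=10):
    """Return (plotting positions, y-averages) for g fractile groups.

    Points are sorted by x and split into g groups of (nearly) equal size;
    the plotting position of group j is assumed to be (j + 0.5) / g.
    """
    order = np.argsort(x)
    groups = np.array_split(np.asarray(y, dtype=float)[order], g)
    means = np.array([grp.mean() for grp in groups])
    positions = (np.arange(g) + 0.5) / g
    return positions, means

def separation_area(x1, y1, x2, y2, g=10, grid_size=1001):
    """Area between the two piecewise-linear fractile graphs."""
    t1, m1 = fractile_graph(x1, y1, g)
    t2, m2 = fractile_graph(x2, y2, g)
    grid = np.linspace(t1[0], t1[-1], grid_size)
    # np.interp joins consecutive y-averages by straight lines
    diff = np.abs(np.interp(grid, t1, m1) - np.interp(grid, t2, m2))
    # trapezoid rule on the common grid
    return float(np.sum((diff[1:] + diff[:-1]) / 2 * np.diff(grid)))
```

Calling `separation_area(x1, y1, x2, y2)` on two samples gives the observed separation area; the same function applied to two sub-samples of one sample gives an error area in the sense of the next slide.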

Limitations of Mahalanobis' Idea

- The exact distribution of the test statistic (the separation area) is not known. Only some approximations were tried by Mahalanobis and his co-workers.
- He divided the original dataset into two parts and used the area between the two sub-samples (called the error area) to test the significance of the observed separation area. The error area calculated in this way has a different variance than the separation area, due to the decrease in the number of sample points.
- The accuracy of the approximation was poor, and this was known to Mahalanobis.

Statistical Inference in FGA: Our Ideas

- We discuss two methods to test the equality of the regression functions (single covariate):
  - Swap the Response (Method I)
  - Resampling from the Joint Density of (X,Y) (Method II)

Swap the Response

- Transform the covariates into the corresponding quantiles, i.e., the ith ranked x-value is transformed to $i/(n+1)$ (where $n$ is the number of data points).
- Draw the fractile graphs by smoothing the y-values using the usual kernel regression estimates and compute the separation area between them.
- To test the significance of the observed separation area we resample from the two samples in the following way: we form two resampled datasets and compute the separation area between them.
  - Case (I): Suppose $n_1 = n_2 = n$ (i.e., both samples are of the same size). For the ith data point (i.e., with quantile value $i/(n+1)$) we swap the y-values of the two datasets with probability 0.5 and keep the original y-values with probability 0.5.
  - Case (II): The sample sizes are different. We interpolate the dataset from the 2nd population to find the y-value at the $i/(n_1+1)$th quantile and then repeat the same idea as above. Similarly we get the other resampled dataset. (Modified Swap Method)
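
A minimal sketch of Case (I) in Python is given below. It is not the original implementation: the Gaussian kernel, the fixed bandwidth `h`, the evaluation grid, and the way the P-value is computed (fraction of resampled areas at least as large as the observed one) are assumptions made for illustration. The function names are hypothetical, and ties in the x-values are ignored.

```python
import numpy as np

def nw_smooth(t, y, grid, h):
    """Nadaraya-Watson kernel regression estimate with a Gaussian kernel."""
    w = np.exp(-0.5 * ((grid[:, None] - t[None, :]) / h) ** 2)
    return (w @ y) / w.sum(axis=1)

def area_between(t, ya, yb, grid, h):
    """Separation area between two smoothed fractile graphs."""
    diff = np.abs(nw_smooth(t, ya, grid, h) - nw_smooth(t, yb, grid, h))
    return float(np.sum((diff[1:] + diff[:-1]) / 2 * np.diff(grid)))

def swap_test(x1, y1, x2, y2, h=0.1, n_resamples=200, seed=0):
    """Swap the Response test for two samples of equal size (Case I)."""
    if len(x1) != len(x2):
        raise ValueError("this sketch covers equal sample sizes only")
    rng = np.random.default_rng(seed)
    # order the responses by the rank of their covariate
    ys1 = np.asarray(y1, dtype=float)[np.argsort(x1)]
    ys2 = np.asarray(y2, dtype=float)[np.argsort(x2)]
    n = len(ys1)
    t = np.arange(1, n + 1) / (n + 1)   # i-th ranked x -> i/(n+1)
    grid = np.linspace(t[0], t[-1], 201)
    observed = area_between(t, ys1, ys2, grid, h)
    resampled = np.empty(n_resamples)
    for b in range(n_resamples):
        swap = rng.random(n) < 0.5      # swap each pair with probability 0.5
        r1 = np.where(swap, ys2, ys1)
        r2 = np.where(swap, ys1, ys2)
        resampled[b] = area_between(t, r1, r2, grid, h)
    p_value = float(np.mean(resampled >= observed))
    return observed, p_value
```

For unequal sample sizes, the second sample's responses would first be interpolated to the quantile positions of the first sample before swapping, as described in Case (II).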

Resample from the Joint Density of (X,Y)

- We use the usual kernel density estimates with very small bandwidth (bandwidth proportional to the inverse of the sample size) to generate a pair of resampled datasets from each population. It is like resampling from a smoothed version of the empirical distribution of (X,Y).
- We compute the separation area for each pair of resampled datasets from the same population. It gives us the distribution of the separation area under $H_0$. (Note that we have 2 separation areas corresponding to the 2 populations.)
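
The resampling scheme can be sketched as follows. Sampling from a Gaussian kernel density estimate is equivalent to picking a data point at random and adding Gaussian noise with standard deviation equal to the bandwidth, which is what the sketch does. The Gaussian kernel, the scaling of the bandwidth by each coordinate's standard deviation, the constant `c`, and the smoothing bandwidth `h` of the fractile graphs are assumptions made here, not details taken from the slides.

```python
import numpy as np

def fractile_area(x1, y1, x2, y2, h=0.1, grid_size=201):
    """Area between two kernel-smoothed fractile graphs on [0.05, 0.95]."""
    grid = np.linspace(0.05, 0.95, grid_size)
    curves = []
    for x, y in ((x1, y1), (x2, y2)):
        t = (np.argsort(np.argsort(x)) + 1) / (len(x) + 1)  # rank i -> i/(n+1)
        w = np.exp(-0.5 * ((grid[:, None] - t[None, :]) / h) ** 2)
        curves.append((w @ np.asarray(y, dtype=float)) / w.sum(axis=1))
    diff = np.abs(curves[0] - curves[1])
    return float(np.sum((diff[1:] + diff[:-1]) / 2 * np.diff(grid)))

def smoothed_resample(x, y, rng, c=1.0):
    """Draw n points from a Gaussian KDE of (X, Y) with bandwidth c / n."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    h = c / n                            # bandwidth proportional to 1/n
    idx = rng.integers(0, n, size=n)     # pick data points with replacement
    x_new = x[idx] + rng.normal(0.0, h * x.std(), size=n)
    y_new = y[idx] + rng.normal(0.0, h * y.std(), size=n)
    return x_new, y_new

def null_areas(x, y, n_resamples=200, seed=0):
    """Separation areas between pairs of resamples from one population."""
    rng = np.random.default_rng(seed)
    areas = np.empty(n_resamples)
    for b in range(n_resamples):
        xa, ya = smoothed_resample(x, y, rng)
        xb, yb = smoothed_resample(x, y, rng)
        areas[b] = fractile_area(xa, ya, xb, yb)
    return areas

def joint_density_test(x1, y1, x2, y2, n_resamples=200, seed=0):
    """Observed area and one P-value per population."""
    observed = fractile_area(x1, y1, x2, y2)
    p1 = float(np.mean(null_areas(x1, y1, n_resamples, seed) >= observed))
    p2 = float(np.mean(null_areas(x2, y2, n_resamples, seed + 1) >= observed))
    return observed, p1, p2
```

The two returned P-values correspond to the two null distributions, one simulated from each population.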

Plots of datasets along with the Fractile Graphs. The models are $Y = 1.0\,X + e$ and $Y = 1.2\,X + e$, where $e \sim N(0, 0.09)$, $X \sim N(0, 1)$, and the number of data points is $N = 100$.

Swap the Response Method: an illustration

Resample from the Joint Distribution of (X,Y): an illustration

How do we define Quantiles in the Multivariate setup?

We discuss two such notions of Multivariate Quantiles:

- 1. Geometric Quantiles
- 2. PCMs Fractile

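Geometric Quantiles

Chaudhuri (JASA 1996), Koltchinskii (Annals of Stat. 1997)

- If $X \sim P$ ($X \in \mathbb{R}^d$) then we define the Geometric Quantile $Q_P(u)$ for $u \in B^{(d)} = \{u : \|u\| < 1\}$ as

  $Q_P(u) = \arg\min_{Q \in \mathbb{R}^d} E\{\Phi(u, X - Q) - \Phi(u, X)\}$

  where $\Phi(u, t) = \|t\| + \langle u, t \rangle$, $\|\cdot\|$ is the usual Euclidean norm and $\langle \cdot, \cdot \rangle$ denotes the usual inner product.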

Properties of Geometric Quantiles

- The solution $Q_P(u)$ always exists for any $u$, and it is unique if $P$ is not supported on a straight line.
- $Q_P(u)$ characterizes the associated distribution, i.e., $Q_{P_1}(u) = Q_{P_2}(u)$ for all $u$ implies $P_1 = P_2$.
- Computation of the sample geometric quantile for the data set $X_1, X_2, \ldots, X_n$, via

  $\hat{Q}_n(u) = \arg\min_{Q \in \mathbb{R}^d} \sum_{i=1}^{n} \{\|X_i - Q\| + \langle u, X_i - Q \rangle\}$,

  is straightforward.
- The Geometric Quantiles have an asymptotic multivariate normal distribution and the convergence rate is $n^{1/2}$.
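
As an illustration of the computation, the sample geometric quantile can be obtained by minimizing the convex criterion $\sum_i \{\|X_i - Q\| + \langle u, X_i - Q \rangle\}$ numerically. The sketch below uses `scipy.optimize.minimize` with BFGS started from the sample mean; the choice of optimizer and starting point is an assumption made here, and other algorithms would do equally well.

```python
import numpy as np
from scipy.optimize import minimize

def geometric_quantile(X, u):
    """Sample geometric quantile of the rows of X for a vector u, ||u|| < 1."""
    X = np.asarray(X, dtype=float)
    u = np.asarray(u, dtype=float)
    if np.linalg.norm(u) >= 1:
        raise ValueError("u must lie in the open unit ball")

    def objective(q):
        r = X - q
        # average of ||X_i - q|| + <u, X_i - q>
        return np.mean(np.linalg.norm(r, axis=1) + r @ u)

    def gradient(q):
        r = X - q
        norms = np.linalg.norm(r, axis=1)
        keep = norms > 0                 # skip data points equal to q
        units = r[keep] / norms[keep, None]
        return -(units.sum(axis=0) / len(X)) - u

    result = minimize(objective, X.mean(axis=0), jac=gradient, method="BFGS")
    return result.x
```

With `u = 0` this returns the spatial median; moving `u` towards the boundary of the unit ball moves the quantile outward in the direction of `u`.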

Drawbacks of Geometric Quantile

- No simple distributional interpretation, as was the case with the univariate quantiles.
- Though the Geometric Quantile is equivariant with respect to shifts, orthogonal transformations and homogeneous scale transformations, it is not equivariant under heterogeneous scale transformations.
- (Affine equivariant versions of Geometric Quantiles are defined using Transformation-Retransformation methods; see Biman Chakraborty, "On Affine Equivariant Multivariate Quantiles", Ann. Inst. Stat. Math., Vol. 53, No. 2, 380-403 (2001).) However, we do not intend to consider them here.

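Another notion of Multivariate Quantile: PCMs Fractile (Mahalanobis, 1970)

For a d-dimensional random vector $X = (X_1, X_2, \ldots, X_d) \sim P$, we define the PCMs distribution function $H_P$ from $\mathbb{R}^d$ to $[0,1]^d$ as

$H_P(x) = \big(F_1(x_1),\ F_{2|1}(x_2 \mid x_1),\ \ldots,\ F_{d|1,\ldots,d-1}(x_d \mid x_1, \ldots, x_{d-1})\big)$

where $F_1$ is the distribution function of $X_1$ and $F_{k|1,\ldots,k-1}$ is the conditional distribution function of $X_k$ given $X_1, \ldots, X_{k-1}$.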
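
A small worked example may help. Assuming the PCMs distribution function is built from successive conditional distribution functions (the distribution function of the first coordinate, then that of the second coordinate given the first, and so on), it has a closed form for a standard bivariate normal vector with correlation `rho`, since $X_2 \mid X_1 = x_1 \sim N(\rho x_1, 1 - \rho^2)$. The function name below is hypothetical.

```python
import numpy as np
from scipy.stats import norm

def pcm_distribution_bivariate_normal(x1, x2, rho):
    """PCMs distribution function for a standard bivariate normal vector.

    Assumes H(x1, x2) = (F_1(x1), F_{2|1}(x2 | x1)), where
    X2 | X1 = x1 is N(rho * x1, 1 - rho**2).
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    first = norm.cdf(x1)
    second = norm.cdf((x2 - rho * x1) / np.sqrt(1.0 - rho ** 2))
    return np.array([first, second])
```

Applied to draws from this bivariate normal distribution, the two output coordinates behave like independent Uniform(0,1) variables, which is the simple probabilistic interpretation that makes this fractile attractive.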

Why should we consider PCMs fractile?

- If P and Q are two distributions with continuous densities on $\mathbb{R}^d$ and $H_P(x) = H_Q(x)$ for all $x \in \mathbb{R}^d$, then $P = Q$.
- Define $Z = (Z_1, Z_2, \ldots, Z_d) \sim Q$ such that $Z_i = f_i(X_i)$ for all $i = 1, \ldots, d$, where each $f_i$ is a strictly monotone (increasing) function on $\mathbb{R}$. Then $H_P(x) = H_Q(f(x))$, where $f(x) = (f_1(x_1), f_2(x_2), \ldots, f_d(x_d))$. This shows that the PCMs fractile is equivariant under co-ordinate wise monotonic transformations.
- The PCMs fractile has simple probabilistic interpretations.

Drawbacks of PCMs Fractile

- Requires the computation of the conditional distributions, which are estimated by kernel density estimation. This estimation is very unstable in high dimensions due to the lack of data points. It is also computationally very intensive and time consuming.
- The PCMs Fractile map is not $n^{1/2}$-consistent, as it uses density estimates which are $n^{\nu}$-consistent ($\nu < 1/2$). $\nu$ may be much smaller than 1/2 when the covariate dimension is high (Curse of Dimensionality in Non Parametric Function Estimation).
- PCMs Fractile depends on the ordering of the co-ordinate random variables: we have to fix an order and then work with it.

Extension of Fractile Graphical Analysis in the Multiple Covariate Case

- We use two different notions of Multivariate Quantiles (Geometric Quantiles and PCMs Fractile).
- We extend the method of Resampling from the joint density of (X,Y) using both the above notions of Multivariate Quantiles.

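Methodology using PCMs Fractile

1. Transform the datasets to $\{(H(X_{1i}), Y_{1i}) : i = 1, \ldots, n_1\}$ and $\{(H(X_{2j}), Y_{2j}) : j = 1, \ldots, n_2\}$, where $H$ is the PCMs distribution function (of the covariate in the corresponding population).

2. Smooth the two transformed datasets by using the usual multivariate kernel regression estimates to get the two fractile surfaces.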

3. The difference between the two fractile graphs gives the separation volume between the two populations.

4. To find the significance level for the observed volume we resample a pair of data sets from the same population using a very small bandwidth (from a fitted joint density of X and Y) and recalculate the volume between the two graphs for the same population. In this way we try to estimate the distribution of the separation volume under $H_0$.

Remarks

- We used Least Squares Cross Validation (LSCV) methods in choosing the bandwidth parameters in kernel regression and kernel density estimates (note that we use LSCV optimal bandwidths for the KDE in the transformation $x \mapsto H(x)$).
- The method described depends highly on the fitted joint distribution of the covariate (i.e., X), as we are computing the Quantiles from this kernel density estimate.
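
For the kernel regression part, one common form of least squares cross validation is leave-one-out: each response is predicted from all other points, and the bandwidth with the smallest average squared prediction error is kept. The sketch below assumes a Gaussian kernel, a single covariate and a user-supplied grid of candidate bandwidths; whether this matches the exact LSCV criterion used in the talk is not stated in the slides.

```python
import numpy as np

def loo_cv_score(t, y, h):
    """Leave-one-out squared prediction error of a Nadaraya-Watson fit."""
    t, y = np.asarray(t, dtype=float), np.asarray(y, dtype=float)
    w = np.exp(-0.5 * ((t[:, None] - t[None, :]) / h) ** 2)
    np.fill_diagonal(w, 0.0)             # leave the i-th point out
    denom = w.sum(axis=1)
    if np.any(denom <= 0):               # bandwidth too small to predict
        return np.inf
    pred = (w @ y) / denom
    return float(np.mean((y - pred) ** 2))

def lscv_bandwidth(t, y, candidates):
    """Candidate bandwidth with the smallest leave-one-out score."""
    scores = [loo_cv_score(t, y, h) for h in candidates]
    return float(candidates[int(np.argmin(scores))])
```

A typical call is `lscv_bandwidth(t, y, np.linspace(0.02, 0.5, 25))`, where `t` holds the covariate values already transformed to the fractile scale.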

Methodology using Geometric Quantiles

- We transform the covariates by using the transformation $x \mapsto F_P(x)$, where $F_P$ is the M-distribution function. The empirical M-distribution function is $\frac{1}{n} \sum_i (x - X_i)\,\|x - X_i\|^{-1}$.
- We use the kernel regression estimates to smooth the transformed data sets using LSCV (Least Squares Cross Validation) optimal bandwidths.
- We compute the fractile volume between the two surfaces (for the two populations).
- We resample a pair of datasets from each population using kernel density estimates with very small bandwidths and compute the resampled fractile volume. In this way we simulate the distribution of the fractile volume under both the populations separately.
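
The empirical M-distribution function in the first step is the average of the unit vectors pointing from the data points to `x`. A short sketch follows; the convention that a term with `x` equal to a data point contributes the zero vector is an assumption made here.

```python
import numpy as np

def empirical_m_distribution(points, sample):
    """Evaluate (1/n) * sum_i (x - X_i) / ||x - X_i|| at each row x of points."""
    points = np.atleast_2d(np.asarray(points, dtype=float))
    sample = np.asarray(sample, dtype=float)
    diff = points[:, None, :] - sample[None, :, :]     # shape (m, n, d)
    norms = np.linalg.norm(diff, axis=2)
    # where x == X_i the difference is already zero; avoid dividing by 0
    safe = np.where(norms > 0, norms, 1.0)
    units = diff / safe[:, :, None]
    return units.mean(axis=1)                          # shape (m, d)
```

Transforming a sample by its own empirical M-distribution function, `empirical_m_distribution(X, X)`, maps every covariate vector into the closed unit ball, which then plays the role of the fractile scale in the smoothing step.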

Remarks

- For choosing the kernel regression bandwidths for X we assume $h_1 = h_2 = \cdots = h_d$, where $h_i$ is the bandwidth for $X_i$ (note that $X = (X_1, \ldots, X_d)$). This keeps the computation relatively simple and feasible.
- This method has the advantage that we do not need any LSCV optimal kernel density estimates, as was the case with PCMs Fractile.

Evaluation of the Methods using Synthetic Data

- Models considered
  - Single covariate
    - $Y = a + bX + e$, where $e \sim N(0, \sigma^2)$, $X \sim N(0, 1)$.
    - $Y = 1/(1 + X^{2k}) + e$, where $k = 0.5, 1.0, 1.5, 2.0, 2.5$.
    - $Y = c_s X^{2s} e^{-X^2} + e$, where $s = 1, 2, 3, 4$ and $c_s$ is a normalizing constant.
  - Multiple covariates (2 covariates)
    - $Y = a + b_1 X_1 + b_2 X_2 + e$, where $e \sim N(0, \sigma^2)$, $X_1 \sim N(0, 1)$, $X_2 \sim N(0, 1)$.

Remarks

- The Swap Method has a natural tendency to lower the resampled fractile area. Thus it exhibits high power but also has a high observed level.
- The Swap Method essentially works on the number of crossings of the two datasets.
- The Swap Method works better when the error variance is large compared to the variance of X.
- As the difference in the variance of the error terms in the two models increases, the two P-values for the Resampling from Joint Density Method show a significant difference.
- The Resampling from Joint Density Method gives good results in all the models considered by us, whereas the Swap the Response Method behaves very unsatisfactorily in the two nonlinear models considered.
- We recommend the Resampling from Joint Density method.

Case Studies

- State wise data from scheduled commercial banks (RBI BSR data)
  - Amount Outstanding (as a fraction of Credit Limit) VS Credit Limit.
  - Credit Deposit Ratio VS Total Deposit.
  - Credit Deposit Ratio VS Distribution of Employees (officers, clerks and sub-ordinates) (2 covariates and 3 covariates).
  - Credit Deposit Ratio VS accounts and offices.
- RBI data on private corporate firms
  - Gross Company Profits VS Sales and Paid Up Capital.

Amount Outstanding (as a ratio of Credit Limit) VS Credit Limit

Credit Deposit Ratio VS Total Deposit

Grey Scale Image of the P-values for the two methods with varying kernel regression bandwidths (Amount Outstanding (as ratio of Credit Limit) VS Credit Limit for the years 1996 and 1998). Labels: Using 1996, Using 1998. Panels: Resample from the Joint Density of (X,Y); Swap the Response Method.

Grey Scale Image of the P-values for the two methods with varying kernel regression bandwidths (Amount Outstanding (as ratio of Credit Limit) VS Credit Limit for the years 1997 and 2000). Labels: Using 1997, Using 2000. Panels: Resample from the Joint Density of (X,Y); Swap the Response Method.

Remarks (for Single Covariate)

- The fractile graph of the year 2002 is very different from the other years.
- Over the years the ratio of Amount Outstanding to Credit Limit has decreased.
- For the pair wise comparison tests between Credit Deposit Ratio and Total Deposit, most of the P-values are very high. The fractile graph for the year 2002 is somewhat different from the rest.

Remarks (for Multiple Covariates)

- With the inclusion of the subordinates the regression functions change (as indicated by the change in P-values).
- The subordinates have decreased steadily over the years from 1996 to 2002.
- The P-values for the pair wise comparisons for Credit Deposit Ratio VS offices and accounts are very high.

Thank you