Resistant Estimation of Multivariate Location and Scatter

Compute a multivariate location and scatter estimate with a high breakdown point. This can be thought of as estimating the mean and covariance of the good part of the data. cov.mve and cov.mcd are compatibility wrappers.

cov.rob(x, cor = FALSE, quantile.used = floor((n + p + 1)/2),
        method = c("mve", "mcd", "classical"),
        nsamp = "best", seed)

cov.mve(...)
cov.mcd(...)

Arguments

x	a matrix or data frame.
cor	should the returned result include a correlation matrix?
quantile.used	the minimum number of data points regarded as `good` points.
method	the method to be used -- minimum volume ellipsoid, minimum covariance determinant or classical product-moment. Using `cov.mve` or `cov.mcd` forces `mve` or `mcd` respectively.
nsamp	the number of samples, or one of `"best"`, `"exact"` or `"sample"`. If `"sample"`, the number chosen is `min(5*p, 3000)`, taken from Rousseeuw and Hubert (1997). If `"best"`, exhaustive enumeration is done up to 5000 samples. If `"exact"`, exhaustive enumeration will be attempted however many samples are needed.
seed	the seed to be used for random sampling: see `RNGkind`. The current value of `.Random.seed` will be preserved if it is set.
...	arguments to `cov.rob` other than `method`.
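
As a rough illustration of how `quantile.used` and `nsamp = "exact"` relate to the size of the data, the sketch below works through the numbers for `stack.x`. The names `h` and `n_exact` are introduced here for illustration only, and the assumption that samples have size `p + 1` is read off the example output rather than stated in the argument descriptions.

```r
library(MASS)  # provides cov.rob; stack.x comes from the datasets package

n <- nrow(stack.x)
p <- ncol(stack.x)

# Default quantile.used, as given in the usage line
h <- floor((n + p + 1) / 2)

# Number of candidate samples under exhaustive enumeration, assuming
# samples of size p + 1 (the size reported in the example output)
n_exact <- choose(n, p + 1)

fit <- cov.rob(stack.x, method = "mcd", nsamp = "exact")
length(fit$best) == h   # for MCD, best is the best set of size quantile.used
fit$msg                 # reports the singular-sample count out of the total
```

Because `nsamp = "exact"` involves no random sampling, this call does not depend on the state of the random number generator.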

Value

A list with components

center

the final estimate of location.

cov

the final estimate of scatter.

cor

(only if cor = TRUE) the estimate of the correlation matrix.

msg

message giving the number of singular samples out of the total.

crit

the value of the criterion on the log scale. For MCD this is the determinant, and for MVE it is proportional to the volume.

best

the subset used. For MVE the best sample, for MCD the best set of size quantile.used.

n.obs

total number of observations.
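
A minimal sketch of working with the returned list, assuming the default MVE method on `stackloss`. The comparison against `cov2cor` is an illustration of how `cor` relates to `cov`, not a description of how the function computes it internally.

```r
library(MASS)

set.seed(123)                         # the default search involves random sampling
fit <- cov.rob(stackloss, cor = TRUE)

names(fit)        # component names of the returned list
fit$center        # robust location estimate
fit$n.obs         # number of rows in the input

# The correlation component should agree with rescaling the scatter estimate
all.equal(fit$cor, cov2cor(fit$cov), check.attributes = FALSE)
```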

Details

For method "mve", an approximate search is made for a subset of size quantile.used whose enclosing ellipsoid has the smallest volume; for method "mcd" it is the volume of the Gaussian confidence ellipsoid, equivalently the determinant of the classical covariance matrix, that is minimized. The mean of the subset provides a first estimate of the location, and the rescaled covariance matrix a first estimate of scatter. The Mahalanobis distances of all the points from the location estimate for this covariance matrix are calculated, and those points within the 97.5% point under Gaussian assumptions are declared to be good. The final estimates are the mean and rescaled covariance of the good points.

The rescaling is by the appropriate percentile under Gaussian data; in addition the first covariance matrix has an ad hoc finite-sample correction given by Marazzi.
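
The distance-based screening described above can be imitated from the outside with `mahalanobis` and a chi-squared cutoff. The sketch below is a simplification: it uses the final estimates rather than the first-stage ones, and it omits the rescaling and the finite-sample correction, so the flagged set need not match the points treated as good internally. The names `d2` and `cutoff` are illustrative.

```r
library(MASS)

fit <- cov.rob(stack.x, method = "mcd", nsamp = "exact")

# Squared Mahalanobis distances of every row from the robust centre
d2 <- mahalanobis(stack.x, center = fit$center, cov = fit$cov)

# 97.5% point of the chi-squared distribution with p degrees of freedom,
# the usual reference for squared distances under Gaussian data
cutoff <- qchisq(0.975, df = ncol(stack.x))

which(d2 > cutoff)   # rows that look outlying relative to the robust fit
```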

For method "mve" the search is made over ellipsoids determined by the covariance matrix of p of the data points. For method "mcd" an additional improvement step suggested by Rousseeuw and van Driessen (1999) is used, in which once a subset of size quantile.used is selected, an ellipsoid based on its covariance is tested (as this will have no larger a determinant, and may be smaller).

References

P. J. Rousseeuw and A. M. Leroy (1987) Robust Regression and Outlier Detection. Wiley.

A. Marazzi (1993) Algorithms, Routines and S Functions for Robust Statistics. Wadsworth and Brooks/Cole.

P. J. Rousseeuw and B. C. van Zomeren (1990) Unmasking multivariate outliers and leverage points. Journal of the American Statistical Association 85, 633--639.

P. J. Rousseeuw and K. van Driessen (1999) A fast algorithm for the minimum covariance determinant estimator. Technometrics 41, 212--223.

P. J. Rousseeuw and M. Hubert (1997) Recent developments in PROGRESS. In L1-Statistical Procedures and Related Topics, ed. Y. Dodge, IMS Lecture Notes volume 31, pp. 201--214.

Examples

set.seed(123)
cov.rob(stackloss)
#> $center
#>   Air.Flow Water.Temp Acid.Conc. stack.loss 
#>    56.3750    20.0000    85.4375    13.0625 
#> 
#> $cov
#>             Air.Flow Water.Temp Acid.Conc. stack.loss
#> Air.Flow   23.050000   6.666667  16.625000  19.308333
#> Water.Temp  6.666667   5.733333   5.333333   7.733333
#> Acid.Conc. 16.625000   5.333333  34.395833  13.837500
#> stack.loss 19.308333   7.733333  13.837500  18.462500
#> 
#> $msg
#> [1] "20 singular samples of size 5 out of 2500"
#> 
#> $crit
#> [1] 19.89056
#> 
#> $best
#>  [1]  5  6  7  8  9 10 11 12 15 16 18 19 20
#> 
#> $n.obs
#> [1] 21
#> 
cov.rob(stack.x, method = "mcd", nsamp = "exact")
#> $center
#>   Air.Flow Water.Temp Acid.Conc. 
#>   56.70588   20.23529   85.52941 
#> 
#> $cov
#>             Air.Flow Water.Temp Acid.Conc.
#> Air.Flow   23.470588   7.573529  16.102941
#> Water.Temp  7.573529   6.316176   5.367647
#> Acid.Conc. 16.102941   5.367647  32.389706
#> 
#> $msg
#> [1] "266 singular samples of size 4 out of 5985"
#> 
#> $crit
#> [1] 5.472581
#> 
#> $best
#>  [1]  4  5  6  7  8  9 10 11 12 13 14 20
#> 
#> $n.obs
#> [1] 21
#>