Compute gap statistics for k-means-cluster

An implementation of the gap statistic algorithm from Tibshirani, Walther, and Hastie's "Estimating the number of clusters in a data set via the gap statistic". This function calls the `clusGap` function of the cluster package to compute the data for the plot.
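As a rough sketch of what such a call may look like, the snippet below runs `clusGap` from the cluster package directly, with `kmeans` as the clustering function. The values chosen for `K.max`, `B` and `nstart`, and the variable names, are illustrative assumptions and not necessarily what sjc.kgap does internally.

```r
library(cluster)

set.seed(42)                      # clusGap draws random reference samples
x <- as.matrix(iris[, 1:4])       # numeric matrix: rows are observations

# k-means as the clustering function; nstart is passed on to kmeans()
gap <- clusGap(x, FUNcluster = kmeans, K.max = 6, B = 20,
               verbose = FALSE, nstart = 10)

# gap$Tab holds one row per k with columns logW, E.logW, gap and SE.sim
tab <- gap$Tab
print(tab)

# pick the number of clusters from the gap values and their standard errors
k_hat <- maxSE(tab[, "gap"], tab[, "SE.sim"],
               method = "Tibs2001SEmax", SE.factor = 1)
k_hat
```

A small `B` keeps the sketch fast; more Monte Carlo samples give more stable standard errors at the cost of run time.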

sjc.kgap(x, max = 10, B = 100, SE.factor = 1, method = "Tibs2001SEmax",
  plotResults = TRUE)

Arguments

x	matrix, where rows are observations and columns are individual dimensions, to compute and plot the gap statistic (according to a uniform reference distribution).
max	maximum number of clusters to consider, must be at least two. Default is 10.
B	integer, number of Monte Carlo ("bootstrap") samples. Default is 100.
SE.factor	[When `method` contains "SE"] Tibshirani et al. proposed the "1 S.E." rule for determining the optimal number of clusters. With an SE.factor f, the more general "f S.E." rule is used. Default is 1.
method	character string indicating how the "optimal" number of clusters, k^, is computed from the gap statistics (and their standard deviations), or more generally how the location k^ of the maximum of f[k] should be determined. Default is `"Tibs2001SEmax"`. Possible values are:
	`"globalmax"`: the location of the global maximum, i.e., which.max(f).
	`"firstmax"`: the location of the first local maximum.
	`"Tibs2001SEmax"`: the criterion proposed by Tibshirani et al. (2001): "the smallest k such that f(k) >= f(k+1) - s_{k+1}". Note that this chooses k = 1 when all standard deviations are larger than the differences f(k+1) - f(k).
	`"firstSEmax"`: the location of the first f() value which is not smaller than the first local maximum minus SE.factor * SE.f[], i.e., within an "f S.E." range of that maximum (see also SE.factor).
	`"globalSEmax"`: (used in Dudoit and Fridlyand (2002), supposedly following Tibshirani's proposition) the location of the first f() value which is not smaller than the global maximum minus SE.factor * SE.f[], i.e., within an "f S.E." range of that maximum (see also SE.factor).
plotResults	logical, if `TRUE` (default), a graph visualizing the gap statistic will be plotted. Use `FALSE` to omit the plot.
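To see how the choice of `method` and `SE.factor` can change the selected number of clusters, the following sketch applies `maxSE` from the cluster package to a small, made-up vector of gap values. The numbers in `f` and `se` are hypothetical and serve only to make the rules disagree with one another; they do not come from any real data set.

```r
library(cluster)

# hypothetical gap values f[k] and standard errors for k = 1..6
f  <- c(0.10, 0.30, 0.38, 0.36, 0.50, 0.45)
se <- rep(0.10, length(f))

methods <- c("globalmax", "firstmax", "Tibs2001SEmax",
             "firstSEmax", "globalSEmax")

# selected k for each rule, using the default "1 S.E." factor
k_by_method <- sapply(methods, function(m) {
  maxSE(f, se, method = m, SE.factor = 1)
})
print(k_by_method)

# a larger SE.factor widens the tolerance band below the maximum,
# so an earlier (smaller) k may be accepted
k_wide <- maxSE(f, se, method = "globalSEmax", SE.factor = 1.5)
k_wide
```

The rules that ignore the standard errors pick a location of a maximum, while the "SE" rules may settle on a smaller k whose gap value lies close enough to it.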

Value

An object containing the data frame used for plotting, the ggplot object, and the number of clusters found.

References

Tibshirani R, Walther G, Hastie T (2001) Estimating the number of clusters in a data set via the gap statistic. J. R. Statist. Soc. B, 63, Part 2, pp. 411-423
Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K (2013) cluster: Cluster Analysis Basics and Extensions. R package version 1.14.4

Examples

# NOT RUN {
# plot gap statistic and determine best number of clusters
# in mtcars dataset
sjc.kgap(mtcars)

# and in iris dataset
sjc.kgap(iris[, 1:4])
# }
