An implementation of the gap statistic algorithm from Tibshirani, Walther, and Hastie's "Estimating the number of clusters in a data set via the gap statistic". This function calls clusGap() from the cluster package to compute the data for the plot.

sjc.kgap(x, max = 10, B = 100, SE.factor = 1, method = "Tibs2001SEmax",
  plotResults = TRUE)

Arguments

x

matrix, where rows are observations and columns are individual dimensions, to compute and plot the gap statistic (according to a uniform reference distribution).

max

maximum number of clusters to consider, must be at least two. Default is 10.

B

integer, number of Monte Carlo ("bootstrap") samples. Default is 100.

SE.factor

[Only used when method contains "SE"] To determine the optimal number of clusters, Tibshirani et al. proposed the "1 S.E." rule. With an SE.factor f, the more general "f S.E." rule is used. Default is 1.

method

character string indicating how the "optimal" number of clusters, k^, is computed from the gap statistics (and their standard deviations), or more generally how the location k^ of the maximum of f[k] should be determined. Default is "Tibs2001SEmax". Possible values are:

"globalmax"

simply corresponds to the global maximum, i.e., is which.max(f).

"firstmax"

gives the location of the first local maximum.

"Tibs2001SEmax"

uses the criterion Tibshirani et al. (2001) proposed: "the smallest k such that f(k) >= f(k+1) - s_{k+1}". Note that this chooses k = 1 when all standard deviations are larger than the differences f(k+1) - f(k).

"firstSEmax"

is the location of the first f() value which is not smaller than the first local maximum minus SE.factor * SE.f[], i.e., within an "f S.E." range of that maximum (see also SE.factor).

"globalSEmax"

(used in Dudoit and Fridlyand (2002), supposedly following Tibshirani's proposition) is the location of the first f() value which is not smaller than the global maximum minus SE.factor * SE.f[], i.e., within an "f S.E." range of that maximum (see also SE.factor).
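
The five rules can be compared side by side on a small example. The following is a minimal sketch in base R, not the package's own code: the helper pick_k() is a hypothetical name, and the gap values f and standard errors se are made-up numbers chosen only for illustration.

```r
# Made-up gap values f[k] and standard errors se[k] for k = 1..6
f  <- c(0.20, 0.45, 0.60, 0.58, 0.62, 0.61)
se <- c(0.03, 0.03, 0.04, 0.04, 0.05, 0.05)

# Hypothetical helper (not part of the package): pick k from f and se
# following the rule descriptions above.
pick_k <- function(f, se, method, SE.factor = 1) {
  K <- length(f)
  fSE <- SE.factor * se
  # Location of the first local maximum: first k where f stops increasing
  decr <- diff(f) <= 0
  first_max <- if (any(decr)) which(decr)[1] else K
  # Smallest k <= nc whose f[k] lies within fSE[nc] of f[nc]
  within_se <- function(nc) which(f[seq_len(nc)] >= f[nc] - fSE[nc])[1]
  switch(method,
    globalmax     = which.max(f),
    firstmax      = first_max,
    Tibs2001SEmax = {
      # smallest k such that f(k) >= f(k+1) - s_{k+1}
      ok <- f[-K] >= f[-1] - fSE[-1]
      if (any(ok)) which(ok)[1] else K
    },
    firstSEmax    = within_se(first_max),
    globalSEmax   = within_se(which.max(f))
  )
}

methods <- c("globalmax", "firstmax", "Tibs2001SEmax",
             "firstSEmax", "globalSEmax")
picks <- sapply(methods, function(m) pick_k(f, se, m))
print(picks)
```

With these numbers, "globalmax" follows the late peak of the curve while the other rules settle on a smaller k, which is the usual reason to prefer an S.E.-based rule when the gap curve flattens out. In practice, the maxSE() function of the cluster package provides these rules.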

plotResults

logical, if TRUE (default), a graph visualizing the gap statistic will be plotted. Use FALSE to omit the plot.

Value

An object containing the data frame used for plotting, the ggplot object, and the number of clusters found.

References

  • Tibshirani, R., Walther, G., Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. J. R. Statist. Soc. B, 63, Part 2, pp. 411-423.

  • Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., Hornik, K. (2013). cluster: Cluster Analysis Basics and Extensions. R package version 1.14.4.

Examples

# NOT RUN {
# plot gap statistic and determine best number of clusters
# in mtcars dataset
sjc.kgap(mtcars)

# and in iris dataset
sjc.kgap(iris[,1:4])
# }
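
Since the function delegates to clusGap(), the underlying computation can also be run directly. The sketch below assumes the cluster package is installed and uses kmeans as the clustering function; which clustering function sjc.kgap itself passes to clusGap() is not stated here, so that choice, along with the smaller K.max and B values picked for speed, is an assumption.

```r
library(cluster)

set.seed(123)                    # the reference samples are random
x <- as.matrix(iris[, 1:4])      # rows = observations, columns = dimensions

# Gap statistic for k = 1..6 with 20 Monte Carlo reference samples;
# nstart is passed on to kmeans
gap <- clusGap(x, FUNcluster = kmeans, K.max = 6, B = 20,
               nstart = 10, verbose = FALSE)

# Apply the default rule of sjc.kgap to the gap values and their SEs
k_hat <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"],
               method = "Tibs2001SEmax", SE.factor = 1)
print(k_hat)
```

The Tab component holds one row per k, including the gap value and its simulation standard error, which is the kind of data the plot is built from.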