Univariate Segmentation {Ckmeans.1d.dp} | R Documentation |
Perform optimal univariate k-segmentation.
Cksegs.1d.dp(y, k=c(1,9), x=seq_along(y), method=c("quadratic", "linear", "loglinear"), estimate.k=c("BIC", "BIC 3.4.12"))
y |
a numeric vector of y values. Values can be negative. |
k |
either an exact integer number of clusters, or a vector of length two specifying the minimum and maximum numbers of clusters to be examined. The default is c(1,9). |
x |
an optional numeric vector of data to be clustered. All |
method |
a character string to specify the speedup method applied to the original cubic-runtime dynamic programming. The default is "quadratic". |
estimate.k |
a character string to specify the method to estimate optimal |
Cksegs.1d.dp minimizes the within-cluster sum of squared distances on y. It offers an optimal piecewise constant approximation of y within clusters of x. Only method="quadratic" guarantees optimality. The "linear" and "loglinear" options are faster but not always optimal, and are provided for comparison purposes.
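To make the objective concrete, the following base R sketch evaluates one given segmentation of y: it fits the mean of each segment (the piecewise constant approximation) and sums the squared residuals. The helper name segment_cost and its ends argument are hypothetical and are not part of the package; the sketch assumes unweighted data and the default x=seq_along(y), so that segments are contiguous runs of indices.

```r
# Hypothetical helper, not part of the package: cost of ONE given segmentation.
# `ends` holds the last index of each segment, in strictly increasing order.
segment_cost <- function(y, ends) {
  stopifnot(length(ends) >= 1,
            !is.unsorted(ends, strictly = TRUE),
            ends[1] >= 1,
            ends[length(ends)] == length(y))
  starts <- c(1, head(ends, -1) + 1)                  # first index of each segment
  label <- rep(seq_along(ends), times = ends - starts + 1)
  centers <- as.numeric(tapply(y, label, mean))       # one mean per segment
  fitted <- centers[label]                            # piecewise constant fit
  list(cluster = label,
       centers = centers,
       fitted = fitted,
       tot.withinss = sum((y - fitted)^2))
}

y <- c(1, 1, 1, 2, 2, 2, 4, 4, 4, 4)
good <- segment_cost(y, ends = c(3, 6, 10))  # boundaries match the level changes
bad  <- segment_cost(y, ends = c(2, 6, 10))  # first boundary is one step early
good$tot.withinss
bad$tot.withinss
```

The segmentation with the smaller tot.withinss is the better piecewise constant fit; the function documented here searches over all segmentations with k segments rather than scoring a single one.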
The Bayesian information criterion (BIC) method to select the optimal k has been updated to deal with duplicates in the data. Otherwise, the estimated k is the same as in previous versions. Set estimate.k="BIC" to use the latest method; use estimate.k="BIC 3.4.12" to use the BIC method of version 3.4.12 or earlier to estimate k from the given range. This option is effective only when a range for k is provided.
method specifies one of three options to speed up the original dynamic programming, which has a runtime cubic in the sample size n. The default "quadratic" option, with a runtime of O(kn^2), guarantees optimality. The next two options do not guarantee optimality. The "linear" option, with a total runtime of O(n lg n + kn), or O(kn) if x is already sorted in ascending order, is the fastest option but uses the most memory (still O(kn)); the "loglinear" option, with a runtime of O(kn lg n), is slightly slower but uses the least memory.
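As an illustration of where an O(kn^2) runtime can come from, here is a minimal base R sketch of a dynamic program for optimal k-segmentation. It is not the package's implementation: the name ksegs_dp is hypothetical, the data are assumed unweighted and already ordered by x, and prefix sums are used so that the cost of any candidate segment is available in constant time.

```r
# Illustrative sketch only; `ksegs_dp` is a hypothetical name, not package code.
ksegs_dp <- function(y, k) {
  n <- length(y)
  stopifnot(k >= 1, k <= n)
  s1 <- c(0, cumsum(y))     # prefix sums of y
  s2 <- c(0, cumsum(y^2))   # prefix sums of y^2
  # Sum of squared deviations of y[i..j] from its own mean, in O(1).
  ssq <- function(i, j) {
    m <- j - i + 1
    s <- s1[j + 1] - s1[i]
    max(0, (s2[j + 1] - s2[i]) - s * s / m)  # guard against rounding below 0
  }
  D <- matrix(Inf, nrow = k, ncol = n)  # D[q, j]: best cost of y[1..j] in q segments
  B <- matrix(1L, nrow = k, ncol = n)   # B[q, j]: start index of the last segment
  for (j in 1:n) D[1, j] <- ssq(1, j)
  if (k >= 2) for (q in 2:k) {
    for (j in q:n) {
      for (i in q:j) {                  # the last segment is y[i..j]
        cost <- D[q - 1, i - 1] + ssq(i, j)
        if (cost < D[q, j]) {
          D[q, j] <- cost
          B[q, j] <- i
        }
      }
    }
  }
  # Backtrack from the last segment to the first to recover the labels.
  cluster <- integer(n)
  j <- n
  for (q in k:1) {
    i <- B[q, j]
    cluster[i:j] <- q
    j <- i - 1L
  }
  list(cluster = cluster, tot.withinss = D[k, n])
}

res <- ksegs_dp(c(1, 1, 1, 2, 2, 2, 4, 4, 4, 4), k = 3)
res$cluster
res$tot.withinss
```

The three nested loops over q, j, and i give the O(kn^2) count. Because every start position of the last segment is examined, this sketch is exhaustive for a fixed k; it makes no attempt at the shortcuts that the faster options rely on.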
An object of class "Cksegs.1d.dp". It is a list containing the following components:
cluster |
a vector of clusters assigned to each element in |
centers |
a numeric vector of the (weighted) means for each cluster. |
withinss |
a numeric vector of the (weighted) within-cluster sum of squares for each cluster. |
size |
a vector of the (weighted) number of elements in each cluster. |
totss |
total sum of (weighted) squared distances between each element and the sample mean. This statistic is not dependent on the clustering result. |
tot.withinss |
total sum of (weighted) within-cluster squared distances between each element and its cluster mean. This statistic is minimized given the number of clusters. |
betweenss |
sum of (weighted) squared distances between each cluster mean and sample mean. This statistic is maximized given the number of clusters. |
xname |
a character string. The actual name of the |
yname |
a character string. The actual name of the |
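The sum-of-squares components listed above are related to one another. The following base R sketch recomputes them for a given cluster assignment in the unweighted case and checks that totss equals tot.withinss plus betweenss. The helper name ss_summary is hypothetical, and weighting each cluster's squared distance by the cluster size in betweenss is an assumption made here so that the decomposition holds; it is not taken from the package source.

```r
# Hypothetical helper, not package code: unweighted sum-of-squares summary.
ss_summary <- function(y, cluster) {
  grand <- mean(y)
  centers <- as.numeric(tapply(y, cluster, mean))
  size <- as.numeric(table(cluster))
  withinss <- as.numeric(tapply(y, cluster, function(v) sum((v - mean(v))^2)))
  list(centers = centers,
       size = size,
       withinss = withinss,
       totss = sum((y - grand)^2),
       tot.withinss = sum(withinss),
       # Assumption: each cluster contributes size * (center - grand mean)^2.
       betweenss = sum(size * (centers - grand)^2))
}

y <- c(1, 1, 1.1, 1, 2, 2.5, 2, 4, 5, 4, 4)
cluster <- rep(1:3, times = c(4, 3, 4))
s <- ss_summary(y, cluster)
all.equal(s$totss, s$tot.withinss + s$betweenss)
```

Since totss does not depend on the clustering, minimizing tot.withinss for a fixed number of clusters is the same as maximizing betweenss under this definition.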
The class has a print and a plot method: print.Cksegs.1d.dp and plot.Cksegs.1d.dp.
Joe Song
Wang, H. and Song, M. (2011) Ckmeans.1d.dp: optimal k-means clustering in one dimension by dynamic programming. The R Journal 3(2), 29–33. Retrieved from https://journal.r-project.org/archive/2011-2/RJournal_2011-2_Wang+Song.pdf
plot.Cksegs.1d.dp and print.Cksegs.1d.dp.
library(Ckmeans.1d.dp)

# Ex 1. Segmenting by y, compared with weighted k-means on x
y <- c(1,1,1,2,2,2,4,4,4,4)

res <- Cksegs.1d.dp(y, k=c(1:10))
main <- "k-segs giving 3 clusters\nsucceeded in finding segments"

opar <- par(mfrow=c(1,2))
plot(res, main=main, xlab="x")

# Here y is passed as the weights of a weighted k-means on x
res <- Ckmeans.1d.dp(x=seq_along(y), k=c(1:10), y)
main <- "Weighted k-means giving 1 cluster\nfailed to find segments"
plot(res, main=main, xlab="x")

par(opar)

# Ex 2. Segmenting by y
y <- c(1,1,1.1,1, 2,2.5,2, 4,5,4,4)
res <- Cksegs.1d.dp(y, k=c(1:10))
plot(res, xlab="x")

# Ex 3. Segmenting a sinusoidal curve by y
x <- 1:125
y <- sin(x * .2)
res.q <- Cksegs.1d.dp(y, k=8, x=x)
plot(res.q, lwd=3, xlab="x")

# Ex 4. Segmenting noisy step data by y, comparing the three methods
y <- rep(c(1,-3,4,-2), each=20)
y <- y + 0.5*rnorm(length(y))
k <- 1:10

res.q <- Cksegs.1d.dp(y, k=k, method="quadratic")
main <- paste("Cksegs (method=\"quadratic\"):\ntot.withinss =",
              format(res.q$tot.withinss, digits=4),
              "BIC =", format(res.q$BIC[length(res.q$size)], digits=4),
              "\nGUARANTEED TO BE OPTIMAL")
plot(res.q, main=main, xlab="x")

res.l <- Cksegs.1d.dp(y, k=k, method="linear")
main <- paste("Cksegs (method=\"linear\"):\ntot.withinss =",
              format(res.l$tot.withinss, digits=4),
              "BIC =", format(res.l$BIC[length(res.l$size)], digits=4),
              "\nFAST BUT MAY NOT BE OPTIMAL")
plot(res.l, main=main, xlab="x")

res.g <- Cksegs.1d.dp(y, k=k, method="loglinear")
main <- paste("Cksegs (method=\"loglinear\"):\ntot.withinss =",
              format(res.g$tot.withinss, digits=4),
              "BIC =", format(res.g$BIC[length(res.g$size)], digits=4),
              "\nFAST BUT MAY NOT BE OPTIMAL")
plot(res.g, main=main, xlab="x")