fastcpd
takes in formulas, data, families and extra
parameters and returns a fastcpd
object.
Usage
fastcpd(
  formula = y ~ . - 1,
  data,
  beta = NULL,
  segment_count = 10,
  trim = 0.025,
  momentum_coef = 0,
  k = function(x) 0,
  family = NULL,
  epsilon = 1e-10,
  min_prob = 10^10,
  winsorise_minval = -20,
  winsorise_maxval = 20,
  p = ncol(data) - 1,
  order = c(0, 0, 0),
  cost = NULL,
  cost_gradient = NULL,
  cost_hessian = NULL,
  cp_only = FALSE,
  vanilla_percentage = 0,
  warm_start = FALSE,
  lower = NULL,
  upper = NULL,
  line_search = c(1),
  ...
)
Arguments
- formula
A formula object specifying the model to be fitted. The optional response variable should be on the left-hand side of the formula and the covariates on the right-hand side. The intercept term should be removed from the formula. The response variable is not necessary if the data considered is not of regression type; for example, a mean or variance change model does not necessarily have a response variable. By default an intercept column will be added to the data, similar to the lm function in R, so it is suggested that the user remove the intercept term by appending - 1 to the formula. The default formula is suitable for regression data sets with a one-dimensional response variable and the rest of the columns being covariates without intercept. The variable names used in the formula should be consistent with the column names of the data frame provided in data.
- data
A data frame containing the data to be segmented, where each row denotes one data point. In one-dimensional response variable regression settings, the first column is the response variable and the rest are covariates. The response is not necessary in the case of mean change or variance change, in which case the formula needs to be adjusted accordingly.
- beta
Initial cost value specified in the algorithm in the paper. For the proper choice of a value, please refer to the paper. If not specified, the BIC criterion is used to obtain a proper value, i.e., beta = (p + 1) * log(nrow(data)) / 2.
- segment_count
Number of segments for the initial guess. If not specified, the initial guess on the number of segments is 10.
- trim
Trimming for the boundary change points so that a change point close to the boundary will not be counted as a change point. This parameter also specifies the minimum distance between two change points: if several change points have mutual distances smaller than trim * nrow(data), those change points will be merged into one single change point. For example, with 300 data points and the default trim = 0.025, change points fewer than 7.5 observations apart are merged. The value of this parameter should be between 0 and 1.
- momentum_coef
Momentum coefficient to be applied to each update. This parameter is useful when the loss function is badly shaped, so that maintaining a momentum from the previous update is desired. The default value is 0, meaning the algorithm does not maintain a momentum by default. A schematic of the update is sketched below.
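A rough sketch of a momentum update, assuming the usual momentum heuristic (illustrative only, not the package internals); the gradient and step size values are made up:

momentum_coef <- 0.5
theta <- c(0, 0)
momentum <- c(0, 0)
gradient <- c(1, -1)  # hypothetical gradient at theta
step_size <- 0.1
momentum <- momentum_coef * momentum - step_size * gradient
theta <- theta + momentum  # with momentum_coef = 0 this is a plain gradient step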
- k
Function on the number of epochs in SGD. k should be a function taking a single parameter x, the number of data points considered since the last segmentation. The return value of the function should be an integer indicating how many epochs should be performed apart from the default update. By default the function returns 0, meaning no extra epochs will be used to update the parameters. Example usage (n denotes the number of data points):

k = function(x) {
  if (x < n / segment_count / 4 * 1) 3
  else if (x < n / segment_count / 4 * 2) 2
  else if (x < n / segment_count / 4 * 3) 1
  else 0
}

This function will perform 3 epochs for the first quarter of the data, 2 epochs for the second quarter, 1 epoch for the third quarter and no extra epochs for the last quarter. Experiments show that performing multiple epochs can significantly affect the performance of the algorithm. This parameter is left for users to tune the performance of the algorithm if the result is not ideal. Details are discussed in the paper.
- family
Family of the model. Can be "lm", "binomial", "poisson", "lasso", "custom", "ar", "var", "ma", "arima", "garch" or NULL. For simplicity, the user can also omit this parameter, indicating that they will be using their own cost functions. Omitting the parameter is the same as specifying it as "custom" or NULL, in which case users must specify the cost function, with optional gradient and corresponding Hessian matrix functions.
- epsilon
Epsilon to avoid numerical issues. Only used for the Hessian computation in logistic regression and Poisson regression.
- min_prob
Minimum probability to avoid numerical issues. Only used for Poisson regression.
- winsorise_minval
Minimum value for the parameter in Poisson regression to be winsorised.
- winsorise_maxval
Maximum value for the parameter in Poisson regression to be winsorised.
- p
Number of covariates in the model. If not specified, the number of covariates will be inferred from the data, i.e., p = ncol(data) - 1. This parameter is superseded by order in the case of time series models: "ar", "var", "arima".
- order
Order of the AR(p), VAR(p) or ARIMA(p, d, q) model, as illustrated in the sketch below.
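A short sketch of order usage, reusing the fastcpd.ts helper that appears in the examples below; the simulated data is illustrative only:

set.seed(1)
x <- c(arima.sim(list(ar = 0.5), n = 300))
ar_result <- fastcpd.ts(x, "ar", order = 1)  # AR(1): a single integer
# For ARIMA models, order is a length-3 vector, e.g. order = c(1, 0, 1).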
- cost
Cost function to be used. This and the following two parameters should not be specified at the same time as family. If not specified, the default is the negative log-likelihood for the corresponding family. Custom cost functions can be provided in the following two formats:

cost = function(data) {...}
cost = function(data, theta) {...}

In both formats, users should implement the cost value calculation based on the data provided, where the data parameter can be considered as a segment of the original data frame in the form of a matrix. The first format is used when the cost function has an explicit solution, in which case the cost value can be calculated directly from the data. The second format is used when the cost function does not have an explicit solution, in which case the cost value is calculated from the data and the estimated parameters. If only the data argument is provided, fastcpd performs the vanilla PELT algorithm, since no parameter updating is performed.
- cost_gradient
Gradient function for the custom cost function. Example usage:

cost_gradient = function(data, theta) {
  ...
  return(gradient)
}

The gradient function should take two parameters, the first being a segment of the data in the form of a matrix and the second being the estimated parameters. It should return the gradient of the cost function with respect to the parameters.
- cost_hessian
Hessian function for the custom cost function. Similar to the gradient function, the Hessian function should take two parameters, the first being a segment of the data in the form of a matrix and the second being the estimated parameters. It should return the Hessian matrix of the cost function with respect to the parameters. A minimal sketch of all three functions follows; complete custom-cost examples appear in the Examples section below.
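A minimal sketch of a custom cost triple with the expected signatures, using least-squares loss as an illustration (the lm_loss names are hypothetical, not part of the package; the pattern mirrors the logistic and Huber examples below):

lm_loss <- function(data, theta) {
  # Cost of a segment: residual sum of squares (up to a constant).
  residual <- data[, 1] - data[, -1, drop = FALSE] %*% theta
  sum(residual^2) / 2
}
lm_loss_gradient <- function(data, theta) {
  # Gradient at the most recent observation in the segment.
  row <- data[nrow(data), ]
  c(-(row[1] - row[-1] %*% theta)) * row[-1]
}
lm_loss_hessian <- function(data, theta) {
  # Hessian at the most recent observation in the segment.
  row <- data[nrow(data), -1]
  outer(row, row)
}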
- cp_only
If TRUE, only the change points are returned. Otherwise, the cost function values and the estimated parameters for each segment are also returned. By default the value is set to FALSE so that plot can be used to visualize the results for a built-in model. Keeping cp_only = FALSE has some performance impact, since the cost values and estimated parameters for each segment need to be calculated and stored. If users are only interested in the change points, setting cp_only to TRUE will help with the computational cost.
- vanilla_percentage
The percentage of the data to be processed through vanilla PELT. The range should be between 0 and 1. The fastcpd algorithm is based on gradient descent, so a good starting estimate can be crucial. At the beginning of the algorithm, vanilla PELT can be performed to obtain a relatively accurate estimate of the parameters despite the small amount of data being used. If set to 0, all data will be processed through sequential gradient descent. If set to 1, all data will be processed through vanilla PELT. If the cost function has an explicit solution, i.e., does not depend on coefficients (as in the mean change case), this parameter will be set to 1. If the value is between 0 and 1, the first vanilla_percentage * nrow(data) data points will be processed through vanilla PELT and the rest through sequential gradient descent. See the sketch after this argument list.
- warm_start
If TRUE, the algorithm will use the estimated parameters from the previous segment as the initial value for the current segment. This parameter is only used for "glm" families.
- lower
Lower bound for the parameters. Used to specify the domain of the parameters after each gradient descent step. If not specified, the lower bound will be set to -Inf for all parameters.
- upper
Upper bound for the parameters. Used to specify the domain of the parameters after each gradient descent step. If not specified, the upper bound will be set to Inf for all parameters.
- line_search
If a vector of numeric values is provided, a line search will be performed to find the optimal step size for each update (see the sketch after this argument list).
- ...
Parameters specifically used for time series models. In the current implementation, only include.mean is not ignored; it is used in the ARIMA and GARCH models.
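A hypothetical call combining the optimization-control arguments above (vanilla_percentage, lower, upper and line_search); the data set and tuning values are illustrative only, and the bounds are assumed to be per-coefficient vectors:

library(fastcpd)
set.seed(1)
x <- matrix(rnorm(600), ncol = 2)
y <- c(
  x[1:150, ] %*% c(1, -1) + rnorm(150),
  x[151:300, ] %*% c(-1, 1) + rnorm(150)
)
result <- fastcpd(
  formula = y ~ . - 1,
  data = data.frame(y = y, x = x),
  family = "lm",
  vanilla_percentage = 0.2,     # warm up with vanilla PELT on the first 20%
  lower = rep(-10, 2),          # box constraints on the two coefficients
  upper = rep(10, 2),
  line_search = c(1, 0.5, 0.1)  # candidate step sizes tried at each update
)
result@cp_set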
References
Zhang X, Dawn T (2023). "Sequential Gradient Descent and Quasi-Newton's Method for Change-Point Analysis." In Ruiz F, Dy J, van de Meent J-W (eds.), Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, 1129-1143. https://proceedings.mlr.press/v206/zhang23b.html.
Examples
# \donttest{
if (!requireNamespace("ggplot2", quietly = TRUE)) utils::install.packages(
  "ggplot2", repos = "https://cloud.r-project.org", quiet = TRUE
)
### linear regression
library(fastcpd)
set.seed(1)
p <- 3
x <- mvtnorm::rmvnorm(300, rep(0, p), diag(p))
theta_0 <- rbind(c(1, 1.2, -1), c(-1, 0, 0.5), c(0.5, -0.3, 0.2))
y <- c(
  x[1:100, ] %*% theta_0[1, ] + rnorm(100, 0, 1),
  x[101:200, ] %*% theta_0[2, ] + rnorm(100, 0, 1),
  x[201:300, ] %*% theta_0[3, ] + rnorm(100, 0, 1)
)
result <- fastcpd(
  formula = y ~ . - 1,
  data = data.frame(y = y, x = x),
  family = "lm"
)
plot(result)
summary(result)
#>
#> Call:
#> fastcpd(formula = y ~ . - 1, data = data.frame(y = y, x = x),
#> family = "lm")
#>
#> Change points:
#> 98 202
#>
#> Cost values:
#> 53.44023 53.1441 45.04974
#>
#> Parameters:
#> segment 1 segment 2 segment 3
#> 1 0.9704022 -1.07884004 0.5925092
#> 2 1.1786074 -0.01757927 -0.5287126
#> 3 -0.9258587 0.63906143 0.1929411
### linear regression with one-dimensional covariate
library(fastcpd)
set.seed(1)
p <- 1
x <- mvtnorm::rmvnorm(300, rep(0, p), diag(p))
theta_0 <- matrix(c(1, -1, 0.5))
y <- c(
  x[1:100, ] * theta_0[1, ] + rnorm(100, 0, 1),
  x[101:200, ] * theta_0[2, ] + rnorm(100, 0, 1),
  x[201:300, ] * theta_0[3, ] + rnorm(100, 0, 1)
)
result <- fastcpd(
  formula = y ~ . - 1,
  data = data.frame(y = y, x = x),
  family = "lm"
)
plot(result)
summary(result)
#>
#> Call:
#> fastcpd(formula = y ~ . - 1, data = data.frame(y = y, x = x),
#> family = "lm")
#>
#> Change points:
#> 99 197
#>
#> Cost values:
#> 48.36336 65.71493 46.37389
#>
#> Parameters:
#> segment 1 segment 2 segment 3
#> 1 0.956995 -0.8488617 0.4562731
### linear regression with noise variance not equal to 1
library(fastcpd)
set.seed(1)
p <- 4
n <- 300
cp <- c(100, 200)
x <- mvtnorm::rmvnorm(n, rep(0, p), diag(p))
theta_0 <- rbind(c(1, 3.2, -1, 0), c(-1, -0.5, 2.5, -2), c(0.8, -0.3, 1, 1))
y <- c(
  x[1:cp[1], ] %*% theta_0[1, ] + rnorm(cp[1], 0, sd = 3),
  x[(cp[1] + 1):cp[2], ] %*% theta_0[2, ] + rnorm(cp[2] - cp[1], 0, sd = 3),
  x[(cp[2] + 1):n, ] %*% theta_0[3, ] + rnorm(n - cp[2], 0, sd = 3)
)
result <- fastcpd(
  data = data.frame(y = y, x = x),
  family = "lm"
)
summary(result)
#>
#> Call:
#> fastcpd(data = data.frame(y = y, x = x), family = "lm")
#>
#> Change points:
#> 97 214
#>
#> Cost values:
#> 519.8276 488.0173 408.1008
#>
#> Parameters:
#> segment 1 segment 2 segment 3
#> 1 0.74291290 -0.4872500 0.80982091
#> 2 3.69465275 -0.6402096 0.09460024
#> 3 -1.24746871 2.1728986 0.87547701
#> 4 0.09579985 -1.6546439 1.52837626
### logistic regression
library(fastcpd)
set.seed(1)
x <- matrix(rnorm(1500, 0, 1), ncol = 5)
theta <- rbind(rnorm(5, 0, 1), rnorm(5, 2, 1))
y <- c(
  rbinom(125, 1, 1 / (1 + exp(-x[1:125, ] %*% theta[1, ]))),
  rbinom(175, 1, 1 / (1 + exp(-x[126:300, ] %*% theta[2, ])))
)
result <- suppressWarnings(fastcpd(
  formula = y ~ . - 1,
  data = data.frame(y = y, x = x),
  family = "binomial"
))
summary(result)
#>
#> Call:
#> fastcpd(formula = y ~ . - 1, data = data.frame(y = y, x = x),
#> family = "binomial")
#>
#> Change points:
#> 126
#>
#> Cost values:
#> 56.90525 30.76875
#>
#> Parameters:
#> segment 1 segment 2
#> 1 0.7259293 1.878525
#> 2 -1.0294802 2.704376
#> 3 1.0576503 3.702310
#> 4 -0.8812767 2.258796
#> 5 0.2419351 2.524173
### poisson regression
library(fastcpd)
set.seed(1)
p <- 3
x <- mvtnorm::rmvnorm(1500, rep(0, p), diag(p))
delta <- rnorm(p)
theta_0 <- c(1, 1.2, -1)
y <- c(
  rpois(300, exp(x[1:300, ] %*% theta_0)),
  rpois(400, exp(x[301:700, ] %*% (theta_0 + delta))),
  rpois(300, exp(x[701:1000, ] %*% theta_0)),
  rpois(100, exp(x[1001:1100, ] %*% (theta_0 - delta))),
  rpois(200, exp(x[1101:1300, ] %*% theta_0)),
  rpois(200, exp(x[1301:1500, ] %*% (theta_0 + delta)))
)
result <- fastcpd(
  formula = y ~ . - 1,
  data = data.frame(y = y, x = x),
  beta = (p + 1) * log(1500) / 2,
  k = function(x) 0,
  family = "poisson",
  epsilon = 1e-5
)
summary(result)
#>
#> Call:
#> fastcpd(formula = y ~ . - 1, data = data.frame(y = y, x = x),
#> beta = (p + 1) * log(1500)/2, k = function(x) 0, family = "poisson",
#> epsilon = 1e-05)
#>
#> Change points:
#> 329 728 1021 1107 1325
#>
#> Cost values:
#> 14425.87 13971.23 697.2187 107.5353 380.7153 51.93594
#>
#> Parameters:
#> segment 1 segment 2 segment 3 segment 4 segment 5 segment 6
#> 1 2.60927673 1.9255183 0.7405125 -0.3965022 1.117753 2.5479308
#> 2 0.02398457 0.1068924 1.4721444 1.8677797 1.019035 0.4947115
#> 3 -1.34361104 -2.7353603 -0.8906937 0.4651667 -1.178933 -2.5038966
result_two_epochs <- fastcpd(
  formula = y ~ . - 1,
  data = data.frame(y = y, x = x),
  beta = (p + 1) * log(1500) / 2,
  k = function(x) 1,
  family = "poisson",
  epsilon = 1e-4
)
summary(result_two_epochs)
#>
#> Call:
#> fastcpd(formula = y ~ . - 1, data = data.frame(y = y, x = x),
#> beta = (p + 1) * log(1500)/2, k = function(x) 1, family = "poisson",
#> epsilon = 1e-04)
#>
#> Change points:
#> 328 716 1020 1102 1323
#>
#> Cost values:
#> 14417.14 2976.961 717.4614 31.48528 296.6285 53.94423
#>
#> Parameters:
#> segment 1 segment 2 segment 3 segment 4 segment 5 segment 6
#> 1 2.60955822 2.4484869 0.7832980 -0.5008107 1.105317 2.5479958
#> 2 0.02371536 0.4084502 1.4456715 1.9282798 1.057743 0.4951862
#> 3 -1.34277129 -2.5426556 -0.8989812 0.5197285 -1.128259 -2.5035143
### penalized linear regression
library(fastcpd)
set.seed(1)
n <- 1500
p_true <- 6
p <- 50
x <- mvtnorm::rmvnorm(1500, rep(0, p), diag(p))
theta_0 <- rbind(
  runif(p_true, -5, -2),
  runif(p_true, -3, 3),
  runif(p_true, 2, 5),
  runif(p_true, -5, 5)
)
theta_0 <- cbind(theta_0, matrix(0, ncol = p - p_true, nrow = 4))
y <- c(
  x[1:300, ] %*% theta_0[1, ] + rnorm(300, 0, 1),
  x[301:700, ] %*% theta_0[2, ] + rnorm(400, 0, 1),
  x[701:1000, ] %*% theta_0[3, ] + rnorm(300, 0, 1),
  x[1001:1500, ] %*% theta_0[4, ] + rnorm(500, 0, 1)
)
result <- fastcpd(
  formula = y ~ . - 1,
  data = data.frame(y = y, x = x),
  family = "lasso"
)
plot(result)
summary(result)
#>
#> Call:
#> fastcpd(formula = y ~ . - 1, data = data.frame(y = y, x = x),
#> family = "lasso")
#>
#> Change points:
#> 300 700 1000
#>
#> Cost values:
#> 208.798 261.4523 189.2066 308.0149
#>
#> Parameters:
#> 50 x 4 sparse Matrix of class "dgCMatrix"
#> segment 1 segment 2 segment 3 segment 4
#> [1,] -2.877991 0.3218542 4.031359 .
#> [2,] -2.814714 -0.3591500 3.896588 3.0998182
#> [3,] -2.802267 -0.1887120 2.558823 2.7608711
#> [4,] -1.911901 0.4617167 3.310737 -0.5238929
#> [5,] -3.053765 -0.4520943 2.080669 -3.3659884
#> [6,] -1.825253 0.4036493 4.784314 .
#> [7,] . . . .
#> [8,] . . . .
#> [9,] . . . .
#> [10,] . . . .
#> [11,] . . . .
#> [12,] . . . .
#> [13,] . . . .
#> [14,] . . . .
#> [15,] . . . .
#> [16,] . . . .
#> [17,] . . . .
#> [18,] . . . .
#> [19,] . . . .
#> [20,] . . . .
#> [21,] . . . .
#> [22,] . . . .
#> [23,] . . . .
#> [24,] . . . .
#> [25,] . . . .
#> [26,] . . . .
#> [27,] . . . .
#> [28,] . . . .
#> [29,] . . . .
#> [30,] . . . .
#> [31,] . . . .
#> [32,] . . . .
#> [33,] . . . .
#> [34,] . . . .
#> [35,] . . . .
#> [36,] . . . .
#> [37,] . . . .
#> [38,] . . . .
#> [39,] . . . .
#> [40,] . . . .
#> [41,] . . . .
#> [42,] . . . .
#> [43,] . . . .
#> [44,] . . . .
#> [45,] . . . .
#> [46,] . . . .
#> [47,] . . . .
#> [48,] . . . .
#> [49,] . . . .
#> [50,] . . . .
### ar(1) model
library(fastcpd)
set.seed(1)
n <- 1000
p <- 1
x <- rep(0, n + 1)
for (i in 1:600) {
  x[i + 1] <- 0.6 * x[i] + rnorm(1)
}
for (i in 601:1000) {
  x[i + 1] <- 0.3 * x[i] + rnorm(1)
}
result <- fastcpd_ts(x, "ar", 1)
summary(result)
#>
#> Call:
#> fastcpd_ts(data = x, family = "ar", order = 1)
#>
#> Change points:
#> 609
#>
#> Cost values:
#> 304.2952 228.4288
#>
#> Parameters:
#> segment 1 segment 2
#> 1 0.5648258 0.2227463
plot(result)
### ar(3) model with innovation standard deviation 3
library(fastcpd)
set.seed(1)
n <- 1000
p <- 1
x <- rep(0, n + 3)
for (i in 1:600) {
  x[i + 3] <- 0.6 * x[i + 2] - 0.2 * x[i + 1] + 0.1 * x[i] + rnorm(1, 0, 3)
}
for (i in 601:1000) {
  x[i + 3] <- 0.3 * x[i + 2] + 0.4 * x[i + 1] + 0.2 * x[i] + rnorm(1, 0, 3)
}
result <- fastcpd.ts(x, "ar", 3)
summary(result)
#>
#> Call:
#> fastcpd.ts(data = x, family = "ar", order = 3)
#>
#> Change points:
#> 615
#>
#> Cost values:
#> 2753.547 2022.597
#>
#> Parameters:
#> segment 1 segment 2
#> 1 0.57616905 0.13006290
#> 2 -0.21476408 -0.03084403
#> 3 0.07938272 -0.04544551
plot(result)
### custom logistic regression
library(fastcpd)
set.seed(1)
p <- 5
x <- matrix(rnorm(375 * p, 0, 1), ncol = p)
theta <- rbind(rnorm(p, 0, 1), rnorm(p, 2, 1))
y <- c(
  rbinom(200, 1, 1 / (1 + exp(-x[1:200, ] %*% theta[1, ]))),
  rbinom(175, 1, 1 / (1 + exp(-x[201:375, ] %*% theta[2, ])))
)
data <- data.frame(y = y, x = x)
result_builtin <- suppressWarnings(fastcpd(
  formula = y ~ . - 1,
  data = data,
  family = "binomial"
))
logistic_loss <- function(data, theta) {
  x <- data[, -1]
  y <- data[, 1]
  u <- x %*% theta
  nll <- -y * u + log(1 + exp(u))
  # For large u, log(1 + exp(u)) is approximately u; this branch avoids
  # overflow in exp(u).
  nll[u > 10] <- -y[u > 10] * u[u > 10] + u[u > 10]
  sum(nll)
}
logistic_loss_gradient <- function(data, theta) {
  # Gradient evaluated at the most recent observation only.
  x <- data[nrow(data), -1]
  y <- data[nrow(data), 1]
  c(-(y - 1 / (1 + exp(-x %*% theta)))) * x
}
logistic_loss_hessian <- function(data, theta) {
  x <- data[nrow(data), -1]
  prob <- 1 / (1 + exp(-x %*% theta))
  (x %o% x) * c((1 - prob) * prob)
}
result_custom <- fastcpd(
  formula = y ~ . - 1,
  data = data,
  epsilon = 1e-5,
  cost = logistic_loss,
  cost_gradient = logistic_loss_gradient,
  cost_hessian = logistic_loss_hessian
)
cat(
  "Change points detected by built-in logistic regression model: ",
  result_builtin@cp_set, "\n",
  "Change points detected by custom logistic regression model: ",
  result_custom@cp_set, "\n",
  sep = ""
)
#> Change points detected by built-in logistic regression model: 200
#> Change points detected by custom logistic regression model: 201
result_custom_two_epochs <- fastcpd(
  formula = y ~ . - 1,
  data = data,
  k = function(x) 1,
  epsilon = 1e-5,
  cost = logistic_loss,
  cost_gradient = logistic_loss_gradient,
  cost_hessian = logistic_loss_hessian
)
summary(result_custom_two_epochs)
#>
#> Call:
#> fastcpd(formula = y ~ . - 1, data = data, k = function(x) 1,
#> epsilon = 1e-05, cost = logistic_loss, cost_gradient = logistic_loss_gradient,
#> cost_hessian = logistic_loss_hessian)
#>
#> Change points:
#> 200
#>
#> Parameters:
#> segment 1 segment 2
#> 1 -0.6235240 2.0066479
#> 2 -1.6767614 1.6278889
#> 3 -1.7973433 4.6422022
#> 4 -0.4842969 -0.1521062
#> 5 2.0797875 2.4047092
### custom cost function mean change
library(fastcpd)
set.seed(1)
p <- 1
data <- rbind(
  mvtnorm::rmvnorm(300, mean = rep(0, p), sigma = diag(100, p)),
  mvtnorm::rmvnorm(400, mean = rep(50, p), sigma = diag(100, p)),
  mvtnorm::rmvnorm(300, mean = rep(2, p), sigma = diag(100, p))
)
segment_count_guess <- 10
block_size <- max(floor(sqrt(nrow(data)) / (segment_count_guess + 1)), 2)
block_count <- floor(nrow(data) / block_size)
# Estimate the (assumed common) noise variance by averaging
# within-block sample variances.
data_all_vars <- rep(0, block_count)
for (block_index in seq_len(block_count)) {
  block_start <- (block_index - 1) * block_size + 1
  block_end <- if (block_index < block_count) {
    block_index * block_size
  } else {
    nrow(data)
  }
  data_all_vars[block_index] <- var(data[block_start:block_end, ])
}
data_all_var <- mean(data_all_vars)
mean_loss <- function(data) {
  n <- nrow(data)
  n / 2 * (
    log(data_all_var) + log(2 * pi) +
      sum((data - colMeans(data))^2 / data_all_var) / n
  )
}
mean_loss_result <- fastcpd(
  formula = ~ . - 1,
  data = data.frame(data),
  beta = (p + 1) * log(nrow(data)) / 2,
  p = p,
  cost = mean_loss
)
summary(mean_loss_result)
#>
#> Call:
#> fastcpd(formula = ~. - 1, data = data.frame(data), beta = (p +
#> 1) * log(nrow(data))/2, p = p, cost = mean_loss)
#>
#> Change points:
#> 300 700
### custom cost function multivariate mean change
library(fastcpd)
set.seed(1)
p <- 3
data <- rbind(
  mvtnorm::rmvnorm(300, mean = rep(0, p), sigma = diag(100, p)),
  mvtnorm::rmvnorm(400, mean = rep(50, p), sigma = diag(100, p)),
  mvtnorm::rmvnorm(300, mean = rep(2, p), sigma = diag(100, p))
)
segment_count_guess <- 5
block_size <- max(floor(sqrt(nrow(data)) / (segment_count_guess + 1)), 2)
block_count <- floor(nrow(data) / block_size)
data_all_covs <- array(NA, dim = c(block_count, p, p))
for (block_index in seq_len(block_count)) {
  block_start <- (block_index - 1) * block_size + 1
  block_end <- if (block_index < block_count) {
    block_index * block_size
  } else {
    nrow(data)
  }
  data_all_covs[block_index, , ] <- cov(data[block_start:block_end, ])
}
data_all_cov <- colMeans(data_all_covs)
mean_loss <- function(data) {
  n <- nrow(data)
  demeaned_data <- sweep(data, 2, colMeans(data))
  n / 2 * (
    log(det(data_all_cov)) + p * log(2 * pi) +
      sum(diag(solve(data_all_cov, crossprod(demeaned_data)))) / n
  )
}
mean_loss_result <- fastcpd(
  formula = ~ . - 1,
  data = data.frame(data),
  beta = (p + 1) * log(nrow(data)) / 2,
  p = p,
  cost = mean_loss
)
summary(mean_loss_result)
#>
#> Call:
#> fastcpd(formula = ~. - 1, data = data.frame(data), beta = (p +
#> 1) * log(nrow(data))/2, p = p, cost = mean_loss)
#>
#> Change points:
#> 300 700
### custom cost function variance change
library(fastcpd)
set.seed(1)
p <- 1
data <- rbind.data.frame(
  mvtnorm::rmvnorm(300, mean = rep(0, p), sigma = diag(1, p)),
  mvtnorm::rmvnorm(400, mean = rep(0, p), sigma = diag(50, p)),
  mvtnorm::rmvnorm(300, mean = rep(0, p), sigma = diag(2, p))
)
data_all_mean <- colMeans(data)
var_loss <- function(data) {
  n <- nrow(data)
  data_cov <- crossprod(sweep(data, 2, data_all_mean)) / (n - 1)
  n / 2 * (log(data_cov) + log(2 * pi) + (n - 1) / n)
}
var_loss_result <- fastcpd(
  formula = ~ . - 1,
  data = data,
  beta = (p + 1) * log(nrow(data)) / 2,
  p = p,
  cost = var_loss
)
summary(var_loss_result)
#>
#> Call:
#> fastcpd(formula = ~. - 1, data = data, beta = (p + 1) * log(nrow(data))/2,
#> p = p, cost = var_loss)
#>
#> Change points:
#> 300 699
### custom cost function multivariate variance change
library(fastcpd)
set.seed(1)
p <- 3
data <- rbind.data.frame(
  mvtnorm::rmvnorm(
    300, rep(0, p), crossprod(matrix(runif(p^2) * 2 - 1, p))
  ),
  mvtnorm::rmvnorm(
    400, rep(0, p), crossprod(matrix(runif(p^2) * 2 - 1, p))
  ),
  mvtnorm::rmvnorm(
    300, rep(0, p), crossprod(matrix(runif(p^2) * 2 - 1, p))
  )
)
data_all_mean <- colMeans(data)
var_loss <- function(data) {
  n <- nrow(data)
  p <- ncol(data)
  if (n < p) {
    data_cov <- diag(p)
  } else {
    data_cov <- crossprod(sweep(data, 2, data_all_mean)) / (n - 1)
  }
  n / 2 * (log(det(data_cov)) + p * log(2 * pi) + p * (n - 1) / n)
}
var_loss_result <- fastcpd(
  formula = ~ . - 1,
  data = data,
  beta = (p^2 + 1) * log(nrow(data)) / 2,
  trim = 0.1,
  p = p^2,
  cost = var_loss
)
summary(var_loss_result)
#>
#> Call:
#> fastcpd(formula = ~. - 1, data = data, beta = (p^2 + 1) * log(nrow(data))/2,
#> trim = 0.1, p = p^2, cost = var_loss)
#>
#> Change points:
#> 300 700
### custom cost function mean or variance change
library(fastcpd)
set.seed(1)
p <- 1
data <- rbind.data.frame(
  mvtnorm::rmvnorm(300, mean = rep(0, p), sigma = diag(1, p)),
  mvtnorm::rmvnorm(400, mean = rep(10, p), sigma = diag(1, p)),
  mvtnorm::rmvnorm(300, mean = rep(0, p), sigma = diag(50, p)),
  mvtnorm::rmvnorm(300, mean = rep(0, p), sigma = diag(1, p)),
  mvtnorm::rmvnorm(400, mean = rep(10, p), sigma = diag(1, p)),
  mvtnorm::rmvnorm(300, mean = rep(10, p), sigma = diag(50, p))
)
meanvar_loss <- function(data) {
  n <- nrow(data)
  data_cov <- 1
  if (n > 1) {
    data_cov <- var(data)
  }
  n / 2 * (log(data_cov) + log(2 * pi) + (n - 1) / n)
}
meanvar_loss_result <- fastcpd(
  formula = ~ . - 1,
  data = data,
  beta = (p^2 + p + 1) * log(nrow(data)) / 2,
  p = p^2 + p,
  cost = meanvar_loss
)
summary(meanvar_loss_result)
#>
#> Call:
#> fastcpd(formula = ~. - 1, data = data, beta = (p^2 + p + 1) *
#> log(nrow(data))/2, p = p^2 + p, cost = meanvar_loss)
#>
#> Change points:
#> 300 700 1000 1300 1700
### custom cost function multivariate mean or variance change
library(fastcpd)
set.seed(1)
p <- 3
data <- rbind.data.frame(
  mvtnorm::rmvnorm(300, mean = rep(0, p), sigma = diag(1, p)),
  mvtnorm::rmvnorm(400, mean = rep(10, p), sigma = diag(1, p)),
  mvtnorm::rmvnorm(300, mean = rep(0, p), sigma = diag(50, p)),
  mvtnorm::rmvnorm(300, mean = rep(0, p), sigma = diag(1, p)),
  mvtnorm::rmvnorm(400, mean = rep(10, p), sigma = diag(1, p)),
  mvtnorm::rmvnorm(300, mean = rep(10, p), sigma = diag(50, p))
)
meanvar_loss <- function(data) {
  n <- nrow(data)
  p <- ncol(data)
  if (n <= p) {
    data_cov <- diag(p)
  } else {
    data_cov <- cov(data)
  }
  n / 2 * (log(det(data_cov)) + p * log(2 * pi) + p * (n - 1) / n)
}
meanvar_loss_result <- fastcpd(
  formula = ~ . - 1,
  data = data,
  beta = (p^2 + p + 1) * log(nrow(data)) / 2,
  trim = 0.01,
  p = p^2 + p,
  cost = meanvar_loss
)
summary(meanvar_loss_result)
#>
#> Call:
#> fastcpd(formula = ~. - 1, data = data, beta = (p^2 + p + 1) *
#> log(nrow(data))/2, trim = 0.01, p = p^2 + p, cost = meanvar_loss)
#>
#> Change points:
#> 300 700 1000 1300 1700
### custom cost function huber regression
library(fastcpd)
set.seed(1)
n <- 400 + 300 + 500
p <- 5
x <- mvtnorm::rmvnorm(n, mean = rep(0, p), sigma = diag(p))
theta <- rbind(
  mvtnorm::rmvnorm(1, mean = rep(0, p - 3), sigma = diag(p - 3)),
  mvtnorm::rmvnorm(1, mean = rep(5, p - 3), sigma = diag(p - 3)),
  mvtnorm::rmvnorm(1, mean = rep(9, p - 3), sigma = diag(p - 3))
)
theta <- cbind(theta, matrix(0, 3, 3))
theta <- theta[rep(seq_len(3), c(400, 300, 500)), ]
y_true <- rowSums(x * theta)
factor <- c(
  2 * stats::rbinom(400, size = 1, prob = 0.95) - 1,
  2 * stats::rbinom(300, size = 1, prob = 0.95) - 1,
  2 * stats::rbinom(500, size = 1, prob = 0.95) - 1
)
y <- factor * y_true + stats::rnorm(n)
data <- cbind.data.frame(y, x)
huber_threshold <- 1
huber_loss <- function(data, theta) {
  residual <- data[, 1] - data[, -1, drop = FALSE] %*% theta
  indicator <- abs(residual) <= huber_threshold
  sum(
    residual^2 / 2 * indicator +
      huber_threshold * (
        abs(residual) - huber_threshold / 2
      ) * (1 - indicator)
  )
}
huber_loss_gradient <- function(data, theta) {
  residual <- c(data[nrow(data), 1] - data[nrow(data), -1] %*% theta)
  if (abs(residual) <= huber_threshold) {
    -residual * data[nrow(data), -1]
  } else {
    -huber_threshold * sign(residual) * data[nrow(data), -1]
  }
}
huber_loss_hessian <- function(data, theta) {
  residual <- c(data[nrow(data), 1] - data[nrow(data), -1] %*% theta)
  if (abs(residual) <= huber_threshold) {
    outer(data[nrow(data), -1], data[nrow(data), -1])
  } else {
    # The Huber loss is linear outside the threshold, so its true Hessian
    # is zero there; a small ridge keeps the quasi-Newton update stable.
    0.01 * diag(length(theta))
  }
}
huber_regression_result <- fastcpd(
  formula = y ~ . - 1,
  data = data,
  beta = (p + 1) * log(n) / 2,
  cost = huber_loss,
  cost_gradient = huber_loss_gradient,
  cost_hessian = huber_loss_hessian
)
summary(huber_regression_result)
#>
#> Call:
#> fastcpd(formula = y ~ . - 1, data = data, beta = (p + 1) * log(n)/2,
#> cost = huber_loss, cost_gradient = huber_loss_gradient, cost_hessian = huber_loss_hessian)
#>
#> Change points:
#> 401 726
#>
#> Parameters:
#> segment 1 segment 2 segment 3
#> 1 -0.52615415 2.77991463 8.744706508
#> 2 -1.02443443 5.06390528 9.506534878
#> 3 -0.09220421 0.01647923 -0.008908851
#> 4 -0.01326592 -0.08103008 -0.047909865
#> 5 0.02526703 0.01329142 0.025171681
# }