fastcpd takes in formulas, data, families and extra parameters and returns a fastcpd object.

Usage

fastcpd(
  formula = y ~ . - 1,
  data,
  beta = NULL,
  segment_count = 10,
  trim = 0.025,
  momentum_coef = 0,
  k = function(x) 0,
  family = NULL,
  epsilon = 1e-10,
  min_prob = 10^10,
  winsorise_minval = -20,
  winsorise_maxval = 20,
  p = ncol(data) - 1,
  order = c(0, 0, 0),
  cost = NULL,
  cost_gradient = NULL,
  cost_hessian = NULL,
  cp_only = FALSE,
  vanilla_percentage = 0,
  warm_start = FALSE,
  lower = NULL,
  upper = NULL,
  line_search = c(1),
  ...
)

Arguments

formula

A formula object specifying the model to be fitted. The optional response variable should be on the left hand side of the formula and the covariates on the right hand side. The intercept term should be removed from the formula. The response variable is not necessary if the data are not of regression type; for example, a mean or variance change model does not need a response variable. By default an intercept column will be added to the data, similar to the lm function in R, so it is suggested that the user remove the intercept term by appending - 1 to the formula. The default formula is suitable for regression data sets with a one-dimensional response variable and the remaining columns as covariates, without an intercept. The variable names used in the formula should be consistent with the column names of the data frame provided in data.
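
For example (a sketch; custom_loss is a placeholder for a user-supplied cost function such as those in the Examples below):

  # regression data: the first column of the data frame is the response
  fastcpd(formula = y ~ . - 1, data = data.frame(y = y, x = x), family = "lm")
  # mean or variance change: no response variable at all
  fastcpd(formula = ~ . - 1, data = data.frame(x), cost = custom_loss, p = p)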

data

A data frame containing the data to be segmented, where each row denotes a data point. In one-dimensional response variable regression settings, the first column is the response variable and the rest are covariates. The response is not necessary in the case of mean change or variance change, in which case the formula needs to be adjusted accordingly.

beta

Initial cost value, as specified in the algorithm in the paper. For the proper choice of a value, please refer to the paper. If not specified, the BIC criterion is used to obtain a proper value, i.e., beta = (p + 1) * log(nrow(data)) / 2.
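
The default can be written out explicitly; a minimal sketch:

  p <- ncol(data) - 1
  beta <- (p + 1) * log(nrow(data)) / 2  # BIC-style penalty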

segment_count

Number of segments for initial guess. If not specified, the initial guess on the number of segments is 10.

trim

Trimming for the boundary change points, so that a change point close to the boundary will not be counted. This parameter also specifies the minimum distance between two change points: if several change points have mutual distances smaller than trim * nrow(data), those change points will be merged into one single change point. The value of this parameter should be between 0 and 1.
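
As a quick arithmetic sketch: with 300 data points and the default trim of 0.025, any change point within 0.025 * 300 = 7.5 points of the boundary is dropped, and change points less than 7.5 points apart are merged into one.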

momentum_coef

Momentum coefficient to be applied to each update. This parameter is useful when the loss function is badly shaped, so that maintaining momentum from the previous update is desired. The default value is 0, meaning no momentum is maintained.

k

Function specifying the number of extra epochs in SGD. k should be a function taking a single parameter x, the number of data points considered since the last segmentation point. The return value should be an integer indicating how many epochs should be performed in addition to the default update. By default the function returns 0, meaning no extra epochs are used to update the parameters. Example usage (assuming n, the sample size, and segment_count are defined):

  k = function(x) {
    if (x < n / segment_count / 4 * 1) 3
    else if (x < n / segment_count / 4 * 2) 2
    else if (x < n / segment_count / 4 * 3) 1
    else 0
  }

This function performs 3 epochs for the first quarter of the data, 2 epochs for the second quarter, 1 epoch for the third quarter and no extra epochs for the last quarter. Experiments show that performing multiple epochs can significantly affect the performance of the algorithm. This parameter is left for users to tune the algorithm if the result is not ideal. Details are discussed in the paper.

family

Family of the model. Can be "lm", "binomial", "poisson", "lasso", "custom", "ar", "var", "ma", "arima", "garch" or NULL. For simplicity, the user can also omit this parameter, indicating that their own cost functions will be used. Omitting the parameter is the same as specifying it as "custom" or NULL, in which case the user must specify the cost function, with optional gradient and corresponding Hessian matrix functions.

epsilon

Epsilon to avoid numerical issues. Only used for the Hessian computation in Logistic Regression and Poisson Regression.

min_prob

Minimum probability to avoid numerical issues. Only used for Poisson Regression.

winsorise_minval

Minimum value for the parameter in Poisson Regression to be winsorised.

winsorise_maxval

Maximum value for the parameter in Poisson Regression to be winsorised.

p

Number of covariates in the model. If not specified, the number of covariates will be inferred from the data, i.e., p = ncol(data) - 1. This parameter is superseded by order in the case of time series models: "ar", "var", "arima".

order

Order of the AR(p), VAR(p) or ARIMA(p, d, q) model.
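
For example (a sketch, following the time series examples below):

  fastcpd.ts(x, "ar", 3)             # AR(3): a scalar order
  fastcpd.ts(x, "arima", c(1, 1, 1)) # ARIMA(1, 1, 1): a length-three order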

cost

Cost function to be used. This parameter and the following two should not be specified at the same time as family. If not specified, the default is the negative log-likelihood for the corresponding family. Custom cost functions can be provided in the following two formats:

  • cost = function(data) {...}

  • cost = function(data, theta) {...}

In both methods, users should implement the cost value calculation based on the data provided, where the data parameter can be considered a segment of the original data frame in the form of a matrix. The first method is used when the cost function has an explicit solution, in which case the cost value can be calculated directly from the data. The second method is used when the cost function does not have an explicit solution, in which case the cost value is calculated from the data and the estimated parameters. If only the data argument is provided, fastcpd performs the vanilla PELT algorithm, since no parameter updating is performed.
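
A minimal sketch of the two signatures, using least squares for illustration (only the data and theta arguments are part of the interface; the function names are illustrative):

  # explicit-solution form: triggers the vanilla PELT algorithm
  lm_cost <- function(data) {
    fit <- lm.fit(data[, -1, drop = FALSE], data[, 1])
    sum(fit$residuals^2) / 2
  }
  # parametric form: used with the gradient-based updates
  lm_cost_theta <- function(data, theta) {
    sum((data[, 1] - data[, -1, drop = FALSE] %*% theta)^2) / 2
  }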

cost_gradient

Gradient function for the custom cost function. Example usage:

  cost_gradient = function(data, theta) {
    ...
    return(gradient)
  }

The gradient function should take two parameters, the first one being a segment of the data in the format of a matrix, the second one being the estimated parameters. The gradient function should return the gradient of the cost function with respect to the data and parameters.

cost_hessian

Hessian function for the custom cost function. Similar to the gradient function, the Hessian function should take two parameters, the first one being a segment of the data in the format of a matrix, the second one being the estimated parameters. The Hessian function should return the Hessian matrix of the cost function with respect to the data and parameters.
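
To make the pair concrete, a minimal least-squares sketch (illustrative only; the built-in "lm" family already handles this internally). Following the custom-cost examples below, the gradient and Hessian are evaluated at the most recent data point:

  lm_cost_gradient <- function(data, theta) {
    x <- data[nrow(data), -1]
    c(x %*% theta - data[nrow(data), 1]) * x
  }
  lm_cost_hessian <- function(data, theta) {
    x <- data[nrow(data), -1]
    outer(x, x)
  }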

cp_only

If TRUE, only the change points are returned. Otherwise, the cost values and the estimated parameters for each segment are also returned. By default the value is FALSE, so that plot can be used to visualize the results for a built-in model. Keeping cp_only = FALSE has some performance impact, since the cost values and estimated parameters for each segment need to be calculated and stored. If users are only interested in the change points, setting cp_only to TRUE reduces the computational cost.

vanilla_percentage

The proportion of the data to be processed through vanilla PELT. The range should be between 0 and 1. The fastcpd algorithm is based on gradient descent, so a good starting estimate can be crucial. At the beginning of the algorithm, vanilla PELT can be performed to obtain a relatively accurate estimate of the parameters despite the small amount of data used. If set to 0, all data are processed through sequential gradient descent. If set to 1, all data are processed through vanilla PELT. If the cost function has an explicit solution, i.e., does not depend on coefficients, as in the mean change case, this parameter is set to 1. If the value is between 0 and 1, the first vanilla_percentage * nrow(data) data points are processed through vanilla PELT and the rest through sequential gradient descent.
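
For example, to obtain starting estimates from the first 20% of the data via vanilla PELT before switching to sequential gradient descent (a sketch; the right proportion depends on the data):

  result <- fastcpd(
    formula = y ~ . - 1,
    data = data.frame(y = y, x = x),
    family = "binomial",
    vanilla_percentage = 0.2
  )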

warm_start

If TRUE, the algorithm will use the estimated parameters from the previous segment as the initial value for the current segment. This parameter is only used for "glm" families.

lower

Lower bound for the parameters. Used to specify the domain of the parameters after each gradient descent step. If not specified, the lower bound is -Inf for all parameters.

upper

Upper bound for the parameters. Used to specify the domain of the parameters after each gradient descent step. If not specified, the upper bound is Inf for all parameters.
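
For example, to constrain every coefficient to [-10, 10] after each gradient descent step (a sketch, assuming p covariates):

  result <- fastcpd(
    formula = y ~ . - 1,
    data = data.frame(y = y, x = x),
    family = "lm",
    lower = rep(-10, p),
    upper = rep(10, p)
  )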

line_search

If a vector of numeric values is provided, line search will be performed over these candidate step sizes to find the optimal step size for each update.
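
For example, to search over the candidate step sizes 1, 0.5 and 0.1 at each update (a sketch, reusing the custom logistic cost functions from the Examples below):

  result <- fastcpd(
    formula = y ~ . - 1,
    data = data,
    cost = logistic_loss,
    cost_gradient = logistic_loss_gradient,
    cost_hessian = logistic_loss_hessian,
    line_search = c(1, 0.5, 0.1)
  )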

...

Parameters specifically used for time series models. In the current implementation, only include.mean is honored, and only for the ARIMA and GARCH models; all other parameters passed here are ignored.

Value

An object of class fastcpd.

References

Zhang X, Dawn T (2023). ``Sequential Gradient Descent and Quasi-Newton's Method for Change-Point Analysis.'' In Ruiz, Francisco, Dy, Jennifer, van de Meent, Jan-Willem (eds.), Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, 1129-1143. https://proceedings.mlr.press/v206/zhang23b.html.

Examples

# \donttest{
if (!requireNamespace("ggplot2", quietly = TRUE)) utils::install.packages(
  "ggplot2", repos = "https://cloud.r-project.org", quiet = TRUE
)

### linear regression
library(fastcpd)
set.seed(1)
p <- 3
x <- mvtnorm::rmvnorm(300, rep(0, p), diag(p))
theta_0 <- rbind(c(1, 1.2, -1), c(-1, 0, 0.5), c(0.5, -0.3, 0.2))
y <- c(
  x[1:100, ] %*% theta_0[1, ] + rnorm(100, 0, 1),
  x[101:200, ] %*% theta_0[2, ] + rnorm(100, 0, 1),
  x[201:300, ] %*% theta_0[3, ] + rnorm(100, 0, 1)
)
result <- fastcpd(
  formula = y ~ . - 1,
  data = data.frame(y = y, x = x),
  family = "lm"
)
plot(result)

summary(result)
#> 
#> Call:
#> fastcpd(formula = y ~ . - 1, data = data.frame(y = y, x = x), 
#>     family = "lm")
#> 
#> Change points:
#> 98 202 
#> 
#> Cost values:
#> 53.44023 53.1441 45.04974 
#> 
#> Parameters:
#>    segment 1   segment 2  segment 3
#> 1  0.9704022 -1.07884004  0.5925092
#> 2  1.1786074 -0.01757927 -0.5287126
#> 3 -0.9258587  0.63906143  0.1929411

### linear regression with one-dimensional covariate
library(fastcpd)
set.seed(1)
p <- 1
x <- mvtnorm::rmvnorm(300, rep(0, p), diag(p))
theta_0 <- matrix(c(1, -1, 0.5))
y <- c(
  x[1:100, ] * theta_0[1, ] + rnorm(100, 0, 1),
  x[101:200, ] * theta_0[2, ] + rnorm(100, 0, 1),
  x[201:300, ] * theta_0[3, ] + rnorm(100, 0, 1)
)
result <- fastcpd(
  formula = y ~ . - 1,
  data = data.frame(y = y, x = x),
  family = "lm"
)
plot(result)

summary(result)
#> 
#> Call:
#> fastcpd(formula = y ~ . - 1, data = data.frame(y = y, x = x), 
#>     family = "lm")
#> 
#> Change points:
#> 99 197 
#> 
#> Cost values:
#> 48.36336 65.71493 46.37389 
#> 
#> Parameters:
#>   segment 1  segment 2 segment 3
#> 1  0.956995 -0.8488617 0.4562731

### linear regression with noise variance not equal to 1
library(fastcpd)
set.seed(1)
p <- 4
n <- 300
cp <- c(100, 200)
x <- mvtnorm::rmvnorm(n, rep(0, p), diag(p))
theta_0 <- rbind(c(1, 3.2, -1, 0), c(-1, -0.5, 2.5, -2), c(0.8, -0.3, 1, 1))
y <- c(
  x[1:cp[1], ] %*% theta_0[1, ] + rnorm(cp[1], 0, sd = 3),
  x[(cp[1] + 1):cp[2], ] %*% theta_0[2, ] + rnorm(cp[2] - cp[1], 0, sd = 3),
  x[(cp[2] + 1):n, ] %*% theta_0[3, ] + rnorm(n - cp[2], 0, sd = 3)
)

result <- fastcpd(
  data = data.frame(y = y, x = x),
  family = "lm"
)
summary(result)
#> 
#> Call:
#> fastcpd(data = data.frame(y = y, x = x), family = "lm")
#> 
#> Change points:
#> 97 214 
#> 
#> Cost values:
#> 519.8276 488.0173 408.1008 
#> 
#> Parameters:
#>     segment 1  segment 2  segment 3
#> 1  0.74291290 -0.4872500 0.80982091
#> 2  3.69465275 -0.6402096 0.09460024
#> 3 -1.24746871  2.1728986 0.87547701
#> 4  0.09579985 -1.6546439 1.52837626

### logistic regression
library(fastcpd)
set.seed(1)
x <- matrix(rnorm(1500, 0, 1), ncol = 5)
theta <- rbind(rnorm(5, 0, 1), rnorm(5, 2, 1))
y <- c(
  rbinom(125, 1, 1 / (1 + exp(-x[1:125, ] %*% theta[1, ]))),
  rbinom(175, 1, 1 / (1 + exp(-x[126:300, ] %*% theta[2, ])))
)
result <- suppressWarnings(fastcpd(
  formula = y ~ . - 1,
  data = data.frame(y = y, x = x),
  family = "binomial"
))
summary(result)
#> 
#> Call:
#> fastcpd(formula = y ~ . - 1, data = data.frame(y = y, x = x), 
#>     family = "binomial")
#> 
#> Change points:
#> 126 
#> 
#> Cost values:
#> 56.90525 30.76875 
#> 
#> Parameters:
#>    segment 1 segment 2
#> 1  0.7259293  1.878525
#> 2 -1.0294802  2.704376
#> 3  1.0576503  3.702310
#> 4 -0.8812767  2.258796
#> 5  0.2419351  2.524173

### poisson regression
library(fastcpd)
set.seed(1)
p <- 3
x <- mvtnorm::rmvnorm(1500, rep(0, p), diag(p))
delta <- rnorm(p)
theta_0 <- c(1, 1.2, -1)
y <- c(
  rpois(300, exp(x[1:300, ] %*% theta_0)),
  rpois(400, exp(x[301:700, ] %*% (theta_0 + delta))),
  rpois(300, exp(x[701:1000, ] %*% theta_0)),
  rpois(100, exp(x[1001:1100, ] %*% (theta_0 - delta))),
  rpois(200, exp(x[1101:1300, ] %*% theta_0)),
  rpois(200, exp(x[1301:1500, ] %*% (theta_0 + delta)))
)
result <- fastcpd(
  formula = y ~ . - 1,
  data = data.frame(y = y, x = x),
  beta = (p + 1) * log(1500) / 2,
  k = function(x) 0,
  family = "poisson",
  epsilon = 1e-5
)
summary(result)
#> 
#> Call:
#> fastcpd(formula = y ~ . - 1, data = data.frame(y = y, x = x), 
#>     beta = (p + 1) * log(1500)/2, k = function(x) 0, family = "poisson", 
#>     epsilon = 1e-05)
#> 
#> Change points:
#> 329 728 1021 1107 1325 
#> 
#> Cost values:
#> 14425.87 13971.23 697.2187 107.5353 380.7153 51.93594 
#> 
#> Parameters:
#>     segment 1  segment 2  segment 3  segment 4 segment 5  segment 6
#> 1  2.60927673  1.9255183  0.7405125 -0.3965022  1.117753  2.5479308
#> 2  0.02398457  0.1068924  1.4721444  1.8677797  1.019035  0.4947115
#> 3 -1.34361104 -2.7353603 -0.8906937  0.4651667 -1.178933 -2.5038966
result_two_epochs <- fastcpd(
  formula = y ~ . - 1,
  data = data.frame(y = y, x = x),
  beta = (p + 1) * log(1500) / 2,
  k = function(x) 1,
  family = "poisson",
  epsilon = 1e-4
)
summary(result_two_epochs)
#> 
#> Call:
#> fastcpd(formula = y ~ . - 1, data = data.frame(y = y, x = x), 
#>     beta = (p + 1) * log(1500)/2, k = function(x) 1, family = "poisson", 
#>     epsilon = 1e-04)
#> 
#> Change points:
#> 328 716 1020 1102 1323 
#> 
#> Cost values:
#> 14417.14 2976.961 717.4614 31.48528 296.6285 53.94423 
#> 
#> Parameters:
#>     segment 1  segment 2  segment 3  segment 4 segment 5  segment 6
#> 1  2.60955822  2.4484869  0.7832980 -0.5008107  1.105317  2.5479958
#> 2  0.02371536  0.4084502  1.4456715  1.9282798  1.057743  0.4951862
#> 3 -1.34277129 -2.5426556 -0.8989812  0.5197285 -1.128259 -2.5035143

### penalized linear regression
library(fastcpd)
set.seed(1)
n <- 1500
p_true <- 6
p <- 50
x <- mvtnorm::rmvnorm(1500, rep(0, p), diag(p))
theta_0 <- rbind(
  runif(p_true, -5, -2),
  runif(p_true, -3, 3),
  runif(p_true, 2, 5),
  runif(p_true, -5, 5)
)
theta_0 <- cbind(theta_0, matrix(0, ncol = p - p_true, nrow = 4))
y <- c(
  x[1:300, ] %*% theta_0[1, ] + rnorm(300, 0, 1),
  x[301:700, ] %*% theta_0[2, ] + rnorm(400, 0, 1),
  x[701:1000, ] %*% theta_0[3, ] + rnorm(300, 0, 1),
  x[1001:1500, ] %*% theta_0[4, ] + rnorm(500, 0, 1)
)
result <- fastcpd(
  formula = y ~ . - 1,
  data = data.frame(y = y, x = x),
  family = "lasso"
)
plot(result)

summary(result)
#> 
#> Call:
#> fastcpd(formula = y ~ . - 1, data = data.frame(y = y, x = x), 
#>     family = "lasso")
#> 
#> Change points:
#> 300 700 1000 
#> 
#> Cost values:
#> 208.798 261.4523 189.2066 308.0149 
#> 
#> Parameters:
#> 50 x 4 sparse Matrix of class "dgCMatrix"
#>       segment 1  segment 2 segment 3  segment 4
#>  [1,] -2.877991  0.3218542  4.031359  .        
#>  [2,] -2.814714 -0.3591500  3.896588  3.0998182
#>  [3,] -2.802267 -0.1887120  2.558823  2.7608711
#>  [4,] -1.911901  0.4617167  3.310737 -0.5238929
#>  [5,] -3.053765 -0.4520943  2.080669 -3.3659884
#>  [6,] -1.825253  0.4036493  4.784314  .        
#>  [7,]  .         .          .         .        
#>  [8,]  .         .          .         .        
#>  [9,]  .         .          .         .        
#> [10,]  .         .          .         .        
#> [11,]  .         .          .         .        
#> [12,]  .         .          .         .        
#> [13,]  .         .          .         .        
#> [14,]  .         .          .         .        
#> [15,]  .         .          .         .        
#> [16,]  .         .          .         .        
#> [17,]  .         .          .         .        
#> [18,]  .         .          .         .        
#> [19,]  .         .          .         .        
#> [20,]  .         .          .         .        
#> [21,]  .         .          .         .        
#> [22,]  .         .          .         .        
#> [23,]  .         .          .         .        
#> [24,]  .         .          .         .        
#> [25,]  .         .          .         .        
#> [26,]  .         .          .         .        
#> [27,]  .         .          .         .        
#> [28,]  .         .          .         .        
#> [29,]  .         .          .         .        
#> [30,]  .         .          .         .        
#> [31,]  .         .          .         .        
#> [32,]  .         .          .         .        
#> [33,]  .         .          .         .        
#> [34,]  .         .          .         .        
#> [35,]  .         .          .         .        
#> [36,]  .         .          .         .        
#> [37,]  .         .          .         .        
#> [38,]  .         .          .         .        
#> [39,]  .         .          .         .        
#> [40,]  .         .          .         .        
#> [41,]  .         .          .         .        
#> [42,]  .         .          .         .        
#> [43,]  .         .          .         .        
#> [44,]  .         .          .         .        
#> [45,]  .         .          .         .        
#> [46,]  .         .          .         .        
#> [47,]  .         .          .         .        
#> [48,]  .         .          .         .        
#> [49,]  .         .          .         .        
#> [50,]  .         .          .         .        

### ar(1) model
library(fastcpd)
set.seed(1)
n <- 1000
p <- 1
x <- rep(0, n + 1)
for (i in 1:600) {
  x[i + 1] <- 0.6 * x[i] + rnorm(1)
}
for (i in 601:1000) {
  x[i + 1] <- 0.3 * x[i] + rnorm(1)
}
result <- fastcpd_ts(x, "ar", 1)
summary(result)
#> 
#> Call:
#> fastcpd_ts(data = x, family = "ar", order = 1)
#> 
#> Change points:
#> 609 
#> 
#> Cost values:
#> 304.2952 228.4288 
#> 
#> Parameters:
#>   segment 1 segment 2
#> 1 0.5648258 0.2227463
plot(result)


### ar(3) model with innovation standard deviation 3
library(fastcpd)
set.seed(1)
n <- 1000
p <- 1
x <- rep(0, n + 3)
for (i in 1:600) {
  x[i + 3] <- 0.6 * x[i + 2] - 0.2 * x[i + 1] + 0.1 * x[i] + rnorm(1, 0, 3)
}
for (i in 601:1000) {
  x[i + 3] <- 0.3 * x[i + 2] + 0.4 * x[i + 1] + 0.2 * x[i] + rnorm(1, 0, 3)
}
result <- fastcpd.ts(x, "ar", 3)
summary(result)
#> 
#> Call:
#> fastcpd.ts(data = x, family = "ar", order = 3)
#> 
#> Change points:
#> 615 
#> 
#> Cost values:
#> 2753.547 2022.597 
#> 
#> Parameters:
#>     segment 1   segment 2
#> 1  0.57616905  0.13006290
#> 2 -0.21476408 -0.03084403
#> 3  0.07938272 -0.04544551
plot(result)


### custom logistic regression
library(fastcpd)
set.seed(1)
p <- 5
x <- matrix(rnorm(375 * p, 0, 1), ncol = p)
theta <- rbind(rnorm(p, 0, 1), rnorm(p, 2, 1))
y <- c(
  rbinom(200, 1, 1 / (1 + exp(-x[1:200, ] %*% theta[1, ]))),
  rbinom(175, 1, 1 / (1 + exp(-x[201:375, ] %*% theta[2, ])))
)
data <- data.frame(y = y, x = x)
result_builtin <- suppressWarnings(fastcpd(
  formula = y ~ . - 1,
  data = data,
  family = "binomial"
))
logistic_loss <- function(data, theta) {
  x <- data[, -1]
  y <- data[, 1]
  u <- x %*% theta
  nll <- -y * u + log(1 + exp(u))
  nll[u > 10] <- -y[u > 10] * u[u > 10] + u[u > 10]
  sum(nll)
}
logistic_loss_gradient <- function(data, theta) {
  x <- data[nrow(data), -1]
  y <- data[nrow(data), 1]
  c(-(y - 1 / (1 + exp(-x %*% theta)))) * x
}
logistic_loss_hessian <- function(data, theta) {
  x <- data[nrow(data), -1]
  prob <- 1 / (1 + exp(-x %*% theta))
  (x %o% x) * c((1 - prob) * prob)
}
result_custom <- fastcpd(
  formula = y ~ . - 1,
  data = data,
  epsilon = 1e-5,
  cost = logistic_loss,
  cost_gradient = logistic_loss_gradient,
  cost_hessian = logistic_loss_hessian
)
cat(
  "Change points detected by built-in logistic regression model: ",
  result_builtin@cp_set, "\n",
  "Change points detected by custom logistic regression model: ",
  result_custom@cp_set, "\n",
  sep = ""
)
#> Change points detected by built-in logistic regression model: 200
#> Change points detected by custom logistic regression model: 201
result_custom_two_epochs <- fastcpd(
  formula = y ~ . - 1,
  data = data,
  k = function(x) 1,
  epsilon = 1e-5,
  cost = logistic_loss,
  cost_gradient = logistic_loss_gradient,
  cost_hessian = logistic_loss_hessian
)
summary(result_custom_two_epochs)
#> 
#> Call:
#> fastcpd(formula = y ~ . - 1, data = data, k = function(x) 1, 
#>     epsilon = 1e-05, cost = logistic_loss, cost_gradient = logistic_loss_gradient, 
#>     cost_hessian = logistic_loss_hessian)
#> 
#> Change points:
#> 200 
#> 
#> Parameters:
#>    segment 1  segment 2
#> 1 -0.6235240  2.0066479
#> 2 -1.6767614  1.6278889
#> 3 -1.7973433  4.6422022
#> 4 -0.4842969 -0.1521062
#> 5  2.0797875  2.4047092

### custom cost function mean change
library(fastcpd)
set.seed(1)
p <- 1
data <- rbind(
  mvtnorm::rmvnorm(300, mean = rep(0, p), sigma = diag(100, p)),
  mvtnorm::rmvnorm(400, mean = rep(50, p), sigma = diag(100, p)),
  mvtnorm::rmvnorm(300, mean = rep(2, p), sigma = diag(100, p))
)
segment_count_guess <- 10
block_size <- max(floor(sqrt(nrow(data)) / (segment_count_guess + 1)), 2)
block_count <- floor(nrow(data) / block_size)
data_all_vars <- rep(0, block_count)
for (block_index in seq_len(block_count)) {
  block_start <- (block_index - 1) * block_size + 1
  block_end <- if (block_index < block_count) {
    block_index * block_size
  } else {
    nrow(data)
  }
  data_all_vars[block_index] <- var(data[block_start:block_end, ])
}
data_all_var <- mean(data_all_vars)
mean_loss <- function(data) {
  n <- nrow(data)
  n / 2 * (
    log(data_all_var) + log(2 * pi) +
      sum((data - colMeans(data))^2 / data_all_var) / n
  )
}
mean_loss_result <- fastcpd(
  formula = ~ . - 1,
  data = data.frame(data),
  beta = (p + 1) * log(nrow(data)) / 2,
  p = p,
  cost = mean_loss
)
summary(mean_loss_result)
#> 
#> Call:
#> fastcpd(formula = ~. - 1, data = data.frame(data), beta = (p + 
#>     1) * log(nrow(data))/2, p = p, cost = mean_loss)
#> 
#> Change points:
#> 300 700 

### custom cost function multivariate mean change
library(fastcpd)
set.seed(1)
p <- 3
data <- rbind(
  mvtnorm::rmvnorm(300, mean = rep(0, p), sigma = diag(100, p)),
  mvtnorm::rmvnorm(400, mean = rep(50, p), sigma = diag(100, p)),
  mvtnorm::rmvnorm(300, mean = rep(2, p), sigma = diag(100, p))
)
segment_count_guess <- 5
block_size <- max(floor(sqrt(nrow(data)) / (segment_count_guess + 1)), 2)
block_count <- floor(nrow(data) / block_size)
data_all_covs <- array(NA, dim = c(block_count, p, p))
for (block_index in seq_len(block_count)) {
  block_start <- (block_index - 1) * block_size + 1
  block_end <- if (block_index < block_count) {
    block_index * block_size
  } else {
    nrow(data)
  }
  data_all_covs[block_index, , ] <- cov(data[block_start:block_end, ])
}
data_all_cov <- colMeans(data_all_covs)
mean_loss <- function(data) {
  n <- nrow(data)
  demeaned_data <- sweep(data, 2, colMeans(data))
  n / 2 * (
    log(det(data_all_cov)) + p * log(2 * pi) +
      sum(diag(solve(data_all_cov, crossprod(demeaned_data)))) / n
  )
}
mean_loss_result <- fastcpd(
  formula = ~ . - 1,
  data = data.frame(data),
  beta = (p + 1) * log(nrow(data)) / 2,
  p = p,
  cost = mean_loss
)
summary(mean_loss_result)
#> 
#> Call:
#> fastcpd(formula = ~. - 1, data = data.frame(data), beta = (p + 
#>     1) * log(nrow(data))/2, p = p, cost = mean_loss)
#> 
#> Change points:
#> 300 700 

### custom cost function variance change
library(fastcpd)
set.seed(1)
p <- 1
data <- rbind.data.frame(
  mvtnorm::rmvnorm(300, mean = rep(0, p), sigma = diag(1, p)),
  mvtnorm::rmvnorm(400, mean = rep(0, p), sigma = diag(50, p)),
  mvtnorm::rmvnorm(300, mean = rep(0, p), sigma = diag(2, p))
)
data_all_mean <- colMeans(data)
var_loss <- function(data) {
  n <- nrow(data)
  data_cov <- crossprod(sweep(data, 2, data_all_mean)) / (n - 1)
  n / 2 * (log(data_cov) + log(2 * pi) + (n - 1) / n)
}
var_loss_result <- fastcpd(
  formula = ~ . - 1,
  data = data,
  beta = (p + 1) * log(nrow(data)) / 2,
  p = p,
  cost = var_loss
)
summary(var_loss_result)
#> 
#> Call:
#> fastcpd(formula = ~. - 1, data = data, beta = (p + 1) * log(nrow(data))/2, 
#>     p = p, cost = var_loss)
#> 
#> Change points:
#> 300 699 

### custom cost function multivariate variance change
library(fastcpd)
set.seed(1)
p <- 3
data <- rbind.data.frame(
  mvtnorm::rmvnorm(
    300, rep(0, p), crossprod(matrix(runif(p^2) * 2 - 1, p))
  ),
  mvtnorm::rmvnorm(
    400, rep(0, p), crossprod(matrix(runif(p^2) * 2 - 1, p))
  ),
  mvtnorm::rmvnorm(
    300, rep(0, p), crossprod(matrix(runif(p^2) * 2 - 1, p))
  )
)
data_all_mean <- colMeans(data)
var_loss <- function(data) {
  n <- nrow(data)
  p <- ncol(data)
  if (n < p) {
    data_cov <- diag(p)
  } else {
    data_cov <- crossprod(sweep(data, 2, data_all_mean)) / (n - 1)
  }
  n / 2 * (log(det(data_cov)) + p * log(2 * pi) + p * (n - 1) / n)
}
var_loss_result <- fastcpd(
  formula = ~ . - 1,
  data = data,
  beta = (p^2 + 1) * log(nrow(data)) / 2,
  trim = 0.1,
  p = p^2,
  cost = var_loss
)
summary(var_loss_result)
#> 
#> Call:
#> fastcpd(formula = ~. - 1, data = data, beta = (p^2 + 1) * log(nrow(data))/2, 
#>     trim = 0.1, p = p^2, cost = var_loss)
#> 
#> Change points:
#> 300 700 

### custom cost function mean or variance change
library(fastcpd)
set.seed(1)
p <- 1
data <- rbind.data.frame(
  mvtnorm::rmvnorm(300, mean = rep(0, p), sigma = diag(1, p)),
  mvtnorm::rmvnorm(400, mean = rep(10, p), sigma = diag(1, p)),
  mvtnorm::rmvnorm(300, mean = rep(0, p), sigma = diag(50, p)),
  mvtnorm::rmvnorm(300, mean = rep(0, p), sigma = diag(1, p)),
  mvtnorm::rmvnorm(400, mean = rep(10, p), sigma = diag(1, p)),
  mvtnorm::rmvnorm(300, mean = rep(10, p), sigma = diag(50, p))
)
meanvar_loss <- function(data) {
  n <- nrow(data)
  data_cov <- 1
  if (n > 1) {
    data_cov <- var(data)
  }
  n / 2 * (log(data_cov) + log(2 * pi) + (n - 1) / n)
}
meanvar_loss_result <- fastcpd(
  formula = ~ . - 1,
  data = data,
  beta = (p^2 + p + 1) * log(nrow(data)) / 2,
  p = p^2 + p,
  cost = meanvar_loss
)
summary(meanvar_loss_result)
#> 
#> Call:
#> fastcpd(formula = ~. - 1, data = data, beta = (p^2 + p + 1) * 
#>     log(nrow(data))/2, p = p^2 + p, cost = meanvar_loss)
#> 
#> Change points:
#> 300 700 1000 1300 1700 

### custom cost function multivariate mean or variance change
library(fastcpd)
set.seed(1)
p <- 3
data <- rbind.data.frame(
  mvtnorm::rmvnorm(300, mean = rep(0, p), sigma = diag(1, p)),
  mvtnorm::rmvnorm(400, mean = rep(10, p), sigma = diag(1, p)),
  mvtnorm::rmvnorm(300, mean = rep(0, p), sigma = diag(50, p)),
  mvtnorm::rmvnorm(300, mean = rep(0, p), sigma = diag(1, p)),
  mvtnorm::rmvnorm(400, mean = rep(10, p), sigma = diag(1, p)),
  mvtnorm::rmvnorm(300, mean = rep(10, p), sigma = diag(50, p))
)
meanvar_loss <- function(data) {
  n <- nrow(data)
  p <- ncol(data)
  if (n <= p) {
    data_cov <- diag(p)
  } else {
    data_cov <- cov(data)
  }
  n / 2 * (log(det(data_cov)) + p * log(2 * pi) + p * (n - 1) / n)
}
meanvar_loss_result <- fastcpd(
  formula = ~ . - 1,
  data = data,
  beta = (p^2 + p + 1) * log(nrow(data)) / 2,
  trim = 0.01,
  p = p^2 + p,
  cost = meanvar_loss
)
summary(meanvar_loss_result)
#> 
#> Call:
#> fastcpd(formula = ~. - 1, data = data, beta = (p^2 + p + 1) * 
#>     log(nrow(data))/2, trim = 0.01, p = p^2 + p, cost = meanvar_loss)
#> 
#> Change points:
#> 300 700 1000 1300 1700 

### custom cost function huber regression
library(fastcpd)
set.seed(1)
n <- 400 + 300 + 500
p <- 5
x <- mvtnorm::rmvnorm(n, mean = rep(0, p), sigma = diag(p))
theta <- rbind(
  mvtnorm::rmvnorm(1, mean = rep(0, p - 3), sigma = diag(p - 3)),
  mvtnorm::rmvnorm(1, mean = rep(5, p - 3), sigma = diag(p - 3)),
  mvtnorm::rmvnorm(1, mean = rep(9, p - 3), sigma = diag(p - 3))
)
theta <- cbind(theta, matrix(0, 3, 3))
theta <- theta[rep(seq_len(3), c(400, 300, 500)), ]
y_true <- rowSums(x * theta)
factor <- c(
  2 * stats::rbinom(400, size = 1, prob = 0.95) - 1,
  2 * stats::rbinom(300, size = 1, prob = 0.95) - 1,
  2 * stats::rbinom(500, size = 1, prob = 0.95) - 1
)
y <- factor * y_true + stats::rnorm(n)
data <- cbind.data.frame(y, x)
huber_threshold <- 1
huber_loss <- function(data, theta) {
  residual <- data[, 1] - data[, -1, drop = FALSE] %*% theta
  indicator <- abs(residual) <= huber_threshold
  sum(
    residual^2 / 2 * indicator +
      huber_threshold * (
        abs(residual) - huber_threshold / 2
      ) * (1 - indicator)
  )
}
huber_loss_gradient <- function(data, theta) {
  residual <- c(data[nrow(data), 1] - data[nrow(data), -1] %*% theta)
  if (abs(residual) <= huber_threshold) {
    -residual * data[nrow(data), -1]
  } else {
    -huber_threshold * sign(residual) * data[nrow(data), -1]
  }
}
huber_loss_hessian <- function(data, theta) {
  residual <- c(data[nrow(data), 1] - data[nrow(data), -1] %*% theta)
  if (abs(residual) <= huber_threshold) {
    outer(data[nrow(data), -1], data[nrow(data), -1])
  } else {
    0.01 * diag(length(theta))
  }
}
huber_regression_result <- fastcpd(
  formula = y ~ . - 1,
  data = data,
  beta = (p + 1) * log(n) / 2,
  cost = huber_loss,
  cost_gradient = huber_loss_gradient,
  cost_hessian = huber_loss_hessian
)
summary(huber_regression_result)
#> 
#> Call:
#> fastcpd(formula = y ~ . - 1, data = data, beta = (p + 1) * log(n)/2, 
#>     cost = huber_loss, cost_gradient = huber_loss_gradient, cost_hessian = huber_loss_hessian)
#> 
#> Change points:
#> 401 726 
#> 
#> Parameters:
#>     segment 1   segment 2    segment 3
#> 1 -0.52615415  2.77991463  8.744706508
#> 2 -1.02443443  5.06390528  9.506534878
#> 3 -0.09220421  0.01647923 -0.008908851
#> 4 -0.01326592 -0.08103008 -0.047909865
#> 5  0.02526703  0.01329142  0.025171681
# }