Here we provide a brief tutorial of the BayesChange
package. The BayesChange
package contains two main
functions: one that performs change points detection on time series and
survival functions and one that perform clustering of time series and
survival functions with common change points. Here we briefly show how
to implement these.
The function detect_cp
provide a method for detecting
change points, it is based on the work Martínez
and Mena (2014) and on Corradin, Danese,
and Ongaro (2022).
Depending on the structure of the data, detect_cp
might
perform change points detection on univariate time series or
multivariate time series. For example we can create a vector of 100
observations where the first 50 observations are sampled from a normal
distribution with mean 0 and variance 0.1 and the other 50 observations
still from a normal distribution with mean 0 but variance 0.25.
Now we can run the function detect_cp
, as arguments of
the function we need to specify the number of iterations, the number of
burn-in steps and a list with the the autoregressive coefficient
phi
for the likelihood of the data, the parameters
a
, b
, c
for the priors and the
probability q
of performing a split at each step. Since we
deal with time series we need also to specify
kernel = "ts"
.
out <- detect_cp(data = data_uni,
n_iterations = 1000, n_burnin = 100,
params = list(q = 0.25, phi = 0.1, a = 1, b = 1, c = 0.1), kernel = "ts")
#> Completed: 100/1000 - in 0.01 sec
#> Completed: 200/1000 - in 0.025 sec
#> Completed: 300/1000 - in 0.04 sec
#> Completed: 400/1000 - in 0.051 sec
#> Completed: 500/1000 - in 0.061 sec
#> Completed: 600/1000 - in 0.075 sec
#> Completed: 700/1000 - in 0.084 sec
#> Completed: 800/1000 - in 0.095 sec
#> Completed: 900/1000 - in 0.108 sec
#> Completed: 1000/1000 - in 0.117 sec
With the methods print
and summary
we can
get information about the algorithm.
print(out)
#> DetectCpObj object
#> Type: change points detection on univariate time series
summary(out)
#> DetectCpObj object
#> Detecting change points on an univariate time series:
#> Number of burn-in iterations: 100
#> Number of MCMC iterations: 900
#> Computational time: 0.12 seconds
In order to get a point estimate of the change points we can use the
method posterior_estimate
that uses the method
salso by David B. Dahl and Müller
(2022) to get the final latent order and then detect the change
points.
The package also provides a method for plotting the change points.
If we define instead a matrix of data, detect_cp
automatically performs a multivariate change points detection
method.
data_multi <- matrix(NA, nrow = 3, ncol = 100)
data_multi[1,] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))
data_multi[2,] <- as.numeric(c(rnorm(50,0,0.125), rnorm(50,1,0.225)))
data_multi[3,] <- as.numeric(c(rnorm(50,0,0.175), rnorm(50,1,0.280)))
Arguments k_0
, nu_0
, phi_0
,
m_0
, par_theta_c
, par_theta_d
and
prior_var_gamma
correspond to the parameters of the prior
distributions for the multivariate likelihood.
out <- detect_cp(data = data_multi, n_iterations = 1000, n_burnin = 100,
q = 0.25, params = list(k_0 = 0.25, nu_0 = 4, phi_0 = diag(1,3,3),
m_0 = rep(0,3), par_theta_c = 2, par_theta_d = 0.2,
prior_var_gamma = 0.1), kernel = "ts")
#> Completed: 100/1000 - in 0.012 sec
#> Completed: 200/1000 - in 0.028 sec
#> Completed: 300/1000 - in 0.051 sec
#> Completed: 400/1000 - in 0.118 sec
#> Completed: 500/1000 - in 0.269 sec
#> Completed: 600/1000 - in 0.412 sec
#> Completed: 700/1000 - in 0.507 sec
#> Completed: 800/1000 - in 0.671 sec
#> Completed: 900/1000 - in 1 sec
#> Completed: 1000/1000 - in 1.369 sec
table(posterior_estimate(out, loss = "binder"))
#>
#> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
#> 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
#> 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2
#> 53 54 55 56 57 58 59
#> 1 2 1 1 15 2 20
Function detect_cp
can also be used to detect change
points on survival functions. We define a matrix of one row and a vector
with the infection rates.
With function sim_epi_data
we simulate a set of
infection times.
inf_times <- sim_epi_data(10000, 10, 100, betas, 1/8)
inf_times_vec <- rep(0,100)
names(inf_times_vec) <- as.character(1:100)
for(j in 1:100){
if(as.character(j) %in% names(table(floor(inf_times)))){
inf_times_vec[j] = table(floor(inf_times))[which(names(table(floor(inf_times))) == j)]
}
}
data_mat[1,] <- inf_times_vec
To run detect_cp
on epidemiological data we need to set
kernel = "epi"
. Moreover, besides the usual parameters, we
need to set the number of Monte Carlo replications M
for
the approximation of the integrated likelihood and the recovery rate
xi
. a0
and b0
are optional and
correspond to the parameters of the gamma distribution for the
integration of the likelihood.
out <- detect_cp(data = data_mat, n_iterations = 200, n_burnin = 50,
params = list(xi = 1/8, a0 = 40, b0 = 10, M = 1000), kernel = "epi")
#> Completed: 20/200 - in 1.152 sec
#> Completed: 40/200 - in 2.337 sec
#> Completed: 60/200 - in 3.504 sec
#> Completed: 80/200 - in 4.692 sec
#> Completed: 100/200 - in 5.87 sec
#> Completed: 120/200 - in 7.036 sec
#> Completed: 140/200 - in 8.207 sec
#> Completed: 160/200 - in 9.416 sec
#> Completed: 180/200 - in 10.582 sec
#> Completed: 200/200 - in 12.012 sec
print(out)
#> DetectCpObj object
#> Type: change points detection on a survival function
table(posterior_estimate(out, loss = "binder"))
#>
#> 1 2
#> 33 67
Also here, with function plot
we can plot the survival
function and the position of the change points.
BayesChange
contains another function,
clust_cp
, that cluster respectively univariate and
multivariate time series and survival functions with common change
points. Details about this methods can be found in Corradin et al. (2024).
In clust_cp
the argument kernel
must be
specified, if data are time series then kernel = "ts"
must
be set. Then the algorithm automatically detects if data are univariate
or multivariate.
If time series are univariate we need to set a matrix where each row is a time series.
data_mat <- matrix(NA, nrow = 5, ncol = 100)
data_mat[1,] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))
data_mat[2,] <- as.numeric(c(rnorm(50,0,0.125), rnorm(50,1,0.225)))
data_mat[3,] <- as.numeric(c(rnorm(50,0,0.175), rnorm(50,1,0.280)))
data_mat[4,] <- as.numeric(c(rnorm(25,0,0.135), rnorm(75,1,0.225)))
data_mat[5,] <- as.numeric(c(rnorm(25,0,0.155), rnorm(75,1,0.280)))
Arguments that need to be specified in clust_cp
are the
number of iterations n_iterations
, the number of elements
in the normalisation constant B
, the split-and-merge step
L
performed when a new partition is proposed and a list
with the parameters of the algorithm, the likelihood and the
priors..
out <- clust_cp(data = data_mat, n_iterations = 1000, n_burnin = 100,
kernel = "ts",
params = list(B = 1000, L = 1, gamma = 0.5))
#> Normalization constant - completed: 100/1000 - in 0.004 sec
#> Normalization constant - completed: 200/1000 - in 0.008 sec
#> Normalization constant - completed: 300/1000 - in 0.012 sec
#> Normalization constant - completed: 400/1000 - in 0.016 sec
#> Normalization constant - completed: 500/1000 - in 0.021 sec
#> Normalization constant - completed: 600/1000 - in 0.025 sec
#> Normalization constant - completed: 700/1000 - in 0.029 sec
#> Normalization constant - completed: 800/1000 - in 0.033 sec
#> Normalization constant - completed: 900/1000 - in 0.037 sec
#> Normalization constant - completed: 1000/1000 - in 0.042 sec
#>
#> ------ MAIN LOOP ------
#>
#> Completed: 100/1000 - in 0.055 sec
#> Completed: 200/1000 - in 0.116 sec
#> Completed: 300/1000 - in 0.179 sec
#> Completed: 400/1000 - in 0.239 sec
#> Completed: 500/1000 - in 0.298 sec
#> Completed: 600/1000 - in 0.352 sec
#> Completed: 700/1000 - in 0.405 sec
#> Completed: 800/1000 - in 0.466 sec
#> Completed: 900/1000 - in 0.522 sec
#> Completed: 1000/1000 - in 0.576 sec
posterior_estimate(out, loss = "binder")
#> Warning in salso::salso(mcmc_chain, loss = "binder", maxNClusters =
#> maxNClusters, : The number of clusters equals the default maximum possible
#> number of clusters.
#> [1] 1 2 3 4 5
Method plot
for clustering univariate time series
represents the data colored according to the assigned cluster.
plot(out, loss = "binder")
#> Warning in salso::salso(mcmc_chain, loss = "binder", maxNClusters =
#> maxNClusters, : The number of clusters equals the default maximum possible
#> number of clusters.
If time series are multivariate, data must be an array, where each element is a multivariate time series represented by a matrix. Each row of the matrix is a component of the time series.
data_array <- array(data = NA, dim = c(3,100,5))
data_array[1,,1] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))
data_array[2,,1] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))
data_array[3,,1] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))
data_array[1,,2] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))
data_array[2,,2] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))
data_array[3,,2] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))
data_array[1,,3] <- as.numeric(c(rnorm(50,0,0.175), rnorm(50,1,0.280)))
data_array[2,,3] <- as.numeric(c(rnorm(50,0,0.175), rnorm(50,1,0.280)))
data_array[3,,3] <- as.numeric(c(rnorm(50,0,0.175), rnorm(50,1,0.280)))
data_array[1,,4] <- as.numeric(c(rnorm(25,0,0.135), rnorm(75,1,0.225)))
data_array[2,,4] <- as.numeric(c(rnorm(25,0,0.135), rnorm(75,1,0.225)))
data_array[3,,4] <- as.numeric(c(rnorm(25,0,0.135), rnorm(75,1,0.225)))
data_array[1,,5] <- as.numeric(c(rnorm(25,0,0.155), rnorm(75,1,0.280)))
data_array[2,,5] <- as.numeric(c(rnorm(25,0,0.155), rnorm(75,1,0.280)))
data_array[3,,5] <- as.numeric(c(rnorm(25,0,0.155), rnorm(75,1,0.280)))
out <- clust_cp(data = data_array, n_iterations = 1000, n_burnin = 100,
kernel = "ts", params = list(gamma = 0.1, k_0 = 0.25, nu_0 = 5, phi_0 = diag(0.1,3,3), m_0 = rep(0,3)))
#> Normalization constant - completed: 100/1000 - in 0.009 sec
#> Normalization constant - completed: 200/1000 - in 0.019 sec
#> Normalization constant - completed: 300/1000 - in 0.028 sec
#> Normalization constant - completed: 400/1000 - in 0.038 sec
#> Normalization constant - completed: 500/1000 - in 0.047 sec
#> Normalization constant - completed: 600/1000 - in 0.056 sec
#> Normalization constant - completed: 700/1000 - in 0.064 sec
#> Normalization constant - completed: 800/1000 - in 0.071 sec
#> Normalization constant - completed: 900/1000 - in 0.079 sec
#> Normalization constant - completed: 1000/1000 - in 0.087 sec
#>
#> ------ MAIN LOOP ------
#>
#> Completed: 100/1000 - in 0.084 sec
#> Completed: 200/1000 - in 0.155 sec
#> Completed: 300/1000 - in 0.23 sec
#> Completed: 400/1000 - in 0.306 sec
#> Completed: 500/1000 - in 0.385 sec
#> Completed: 600/1000 - in 0.477 sec
#> Completed: 700/1000 - in 0.558 sec
#> Completed: 800/1000 - in 0.634 sec
#> Completed: 900/1000 - in 0.71 sec
#> Completed: 1000/1000 - in 0.791 sec
posterior_estimate(out, loss = "binder")
#> [1] 1 2 3 4 4
Finally, if we set kernel = "epi"
, clust_cp
cluster survival functions with common change points. Also here details
can be found in Corradin et al.
(2024).
Data are a matrix where each row is the number of infected at each
time. Inside this package is included the function
sim_epi_data
that simulates infection times.
data_mat <- matrix(NA, nrow = 5, ncol = 50)
betas <- list(c(rep(0.45, 25),rep(0.14,25)),
c(rep(0.55, 25),rep(0.11,25)),
c(rep(0.50, 25),rep(0.12,25)),
c(rep(0.52, 10),rep(0.15,40)),
c(rep(0.53, 10),rep(0.13,40)))
inf_times <- list()
for(i in 1:5){
inf_times[[i]] <- sim_epi_data(S0 = 10000, I0 = 10, max_time = 50, beta_vec = betas[[i]], xi_0 = 1/8)
vec <- rep(0,50)
names(vec) <- as.character(1:50)
for(j in 1:50){
if(as.character(j) %in% names(table(floor(inf_times[[i]])))){
vec[j] = table(floor(inf_times[[i]]))[which(names(table(floor(inf_times[[i]]))) == j)]
}
}
data_mat[i,] <- vec
}
out <- clust_cp(data = data_mat, n_iterations = 100, n_burnin = 10,
kernel = "epi",
list(M = 100, B = 1000, L = 1, q = 0.1, gamma = 1/8))
#> Normalization constant - completed: 100/1000 - in 0.514 sec
#> Normalization constant - completed: 200/1000 - in 1.035 sec
#> Normalization constant - completed: 300/1000 - in 1.544 sec
#> Normalization constant - completed: 400/1000 - in 2.058 sec
#> Normalization constant - completed: 500/1000 - in 2.562 sec
#> Normalization constant - completed: 600/1000 - in 3.082 sec
#> Normalization constant - completed: 700/1000 - in 3.629 sec
#> Normalization constant - completed: 800/1000 - in 4.199 sec
#> Normalization constant - completed: 900/1000 - in 4.702 sec
#> Normalization constant - completed: 1000/1000 - in 5.223 sec
#>
#> ------ MAIN LOOP ------
#>
#> Completed: 10/100 - in 0.548 sec
#> Completed: 20/100 - in 1.292 sec
#> Completed: 30/100 - in 2.013 sec
#> Completed: 40/100 - in 2.939 sec
#> Completed: 50/100 - in 3.913 sec
#> Completed: 60/100 - in 4.81 sec
#> Completed: 70/100 - in 5.811 sec
#> Completed: 80/100 - in 6.771 sec
#> Completed: 90/100 - in 7.756 sec
#> Completed: 100/100 - in 8.642 sec
posterior_estimate(out, loss = "binder")
#> [1] 1 2 2 1 1