Title: | Bayesian Online Changepoint Detection |
---|---|
Description: | Implements the Bayesian online changepoint detection method by Adams and MacKay (2007) <arXiv:0710.3742> for univariate or multivariate data. Gaussian and Poisson probability models are implemented. Provides post-processing functions with alternative ways to extract changepoints. |
Authors: | Andrea Pagotto |
Maintainer: | Andrea Pagotto <[email protected]> |
License: | GPL-3 |
Version: | 0.1.1 |
Built: | 2025-02-13 03:57:10 UTC |
Source: | https://github.com/anjapago/ocp |
Provides an implementation of Bayesian online changepoint detection. Handles multivariate and missing data. Computes the set of changepoints with highest probability in an online way (updating the results with each incoming point). Also provides post-processing functions with alternative ways to extract changepoints.
Pagotto, Andrea
Hazard function for use with gaussian underlying distribution.
const_hazard(r, lambda)
r |
The current R vector length. |
lambda |
The parameter for the hazard function. |
A vector of the hazard function for the length of the current R vector.
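As a rough illustration of what a constant hazard looks like, the sketch below assumes the common convention from Adams and MacKay in which the hazard is 1/lambda at every run length. The name constant_hazard_sketch is hypothetical and the package's own const_hazard may be defined differently.

```r
# Hypothetical sketch, not the package's const_hazard():
# the same changepoint probability, 1 / lambda, for each of r run lengths.
constant_hazard_sketch <- function(r, lambda) {
  rep(1 / lambda, r)
}

h <- constant_hazard_sketch(5, lambda = 100)
h
```

Under this assumption the gap between changepoints follows a geometric distribution with mean lambda, so a larger lambda expresses a prior belief in rarer changepoints.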
H<- const_hazard(10, 1/100)
This function calculates the changepoints with the highest probability in the online algorithm. It takes in the current probabilities at time t, along with the results from earlier time points in the form of a list of lists. It does not recalculate the result at every possible end point, because the main loop of the online changepoint detection does this as it iterates: the probmaxes and cps lists are returned and passed into the function again each time.
findCPprobs(currrunprobs, probmaxes, logprobcpstrunc, Rlength, t, minsep = 3, maxsep = 90, ppres = FALSE)
currrunprobs |
The most recently calculated "R" vector of run length probabilities (sums to 1). |
probmaxes |
The probabilities of the set of changepoints with the highest probability for each preceding time point. |
logprobcpstrunc |
The set of changepoints with the highest probability for each previous time point. |
Rlength |
The length of the current R vector, to use in case it was truncated. |
t |
The current time point. |
minsep |
The minimum distance of separation allowed for eligible changepoint locations to be included in the list of changepoints with the highest probability. |
maxsep |
The maximum distance of separation allowed for eligible changepoint locations to be included in the list of changepoints with the highest probability. |
ppres |
Set to TRUE to return optional outputs, which are useful for plotting and inspecting the algorithm but are not required. |
Two lists needed for calculating the changepoints at the next incoming time point: the vector of max probabilities for each time point (probmaxes), and the list of changepoints with the highest probability at each time point (changepoints: a list of lists). It also returns ppresult: optional outputs, which will be NULL if ppres=FALSE.
Data used in the LREC paper on the 2016 eurogames tweets. Includes a column with the counts of numbers of tweets. The columns present in the matrix are the three sentiment scores: "neg", "neu", and "pos".
http://www.lrec-conf.org/proceedings/lrec2018/pdf/335.pdf
demo(eurogames)
Data used in the LREC paper on the 2016 eurogames tweets. Includes a column with the counts of numbers of tweets. The columns present in the matrix are the three sentiment scores: "neg", "neu", and "pos", and an additional column for the total number of tweets: "counts".
http://www.lrec-conf.org/proceedings/lrec2018/pdf/335.pdf
Takes in the desired initialization parameters and initializes the vectors needed for the gaussian probability function gaussian_update.
gaussian_init(init_params = list(m = 0, k = 0.01, a = 0.01, b = 1e-04), dims)
init_params |
The list of parameters to be used for initialization |
dims |
the dimensionality of the data |
List of vectors to be used in the iteratively updating algorithm of parameters describing the underlying gaussian distribution of the data.
Updates the parameters of the gaussians based on each possible run length, after taking into consideration the most recent data point
gaussian_update(datapt, update_params0, update_paramsT, Rlength, skippt = FALSE)
datapt |
the current data point |
update_params0 |
The initialization parameters, corresponding to predicting a changepoint (run length=0) |
update_paramsT |
The vectors of parameters corresponding to each possible run length, updated with each incoming data point |
Rlength |
the length of the current vector of possible run lengths |
skippt |
Set to TRUE if the current point should be skipped during the parameter update because it was missing; the default is FALSE. |
The list of the parameters for gaussians corresponding to each possible run length up to the current data point. Lengths of vectors should correspond to the length of the R vector ("run length vector").
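To make the per-run-length update concrete, here is a minimal sketch of a normal-gamma conjugate update, the standard choice for a Gaussian with unknown mean and variance. It assumes the parameters m, k, a and b play the usual normal-gamma roles; the function name normal_gamma_update is hypothetical and this is not the package's gaussian_update.

```r
# Hypothetical sketch, not the package's gaussian_update():
# update one parameter set per run length with the new point x, and
# prepend the prior so that run length 0 (a fresh changepoint) is included.
normal_gamma_update <- function(x, prior, params) {
  with(params, list(
    m = c(prior$m, (k * m + x) / (k + 1)),
    k = c(prior$k, k + 1),
    a = c(prior$a, a + 0.5),
    b = c(prior$b, b + k * (x - m)^2 / (2 * (k + 1)))
  ))
}

prior <- list(m = 0, k = 0.01, a = 0.01, b = 1e-04)
after_one <- normal_gamma_update(1.5, prior, prior)
after_two <- normal_gamma_update(0.7, prior, after_one)
lengths(after_two)  # one entry per possible run length
```

Each call grows every vector by one element, which is why the vector lengths track the length of the run length vector.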
Compute the probability of observing the current point, given the current parameters of the gaussian for each possible run length. Returns a vector of predictive probabilities from each possible run length, the parameters of the gaussian, the most likely mean of the current gaussian, and the current point.
gaussianProb(update_params0, update_paramsT, datapt, time, cps, missPts, Rlength, skippt = FALSE)
update_params0 |
The initialization parameters, corresponding to predicting a changepoint (run length=0) |
update_paramsT |
The vectors of parameters corresponding to each possible run length, updated with each incoming data point |
datapt |
the current data point |
time |
the number of time points passed so far |
cps |
the current most likely list of changepoints |
missPts |
the method set to handle missing points |
Rlength |
the length of the current vector of possible run lengths |
skippt |
If the current point should be skipped in the updating because it was missing, and missPts was set to skip |
Returns a vector of predictive probabilities from each possible run length, the parameters of the gaussian, the most likely mean of the current gaussian, and the current point.
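The predictive probability for each run length can be sketched as follows, assuming normal-gamma parameters m, k, a and b, for which the posterior predictive is a Student t density. The helper name gaussian_predictive and the parameter values are illustrative assumptions, not taken from the package.

```r
# Hypothetical sketch, not the package's gaussianProb():
# Student t predictive density of x under each run length's parameters.
gaussian_predictive <- function(x, params) {
  scale <- sqrt(params$b * (params$k + 1) / (params$a * params$k))
  dt((x - params$m) / scale, df = 2 * params$a) / scale
}

# Illustrative values: run length 0 (prior) and one longer run
params <- list(m = c(0, 0.2), k = c(0.01, 5.01),
               a = c(0.01, 2.51), b = c(1e-04, 3))
pred <- gaussian_predictive(0.5, params)
pred
```

The result has one density value per run length, which is the form the run length recursion needs.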
This function initializes the ocpd object. It returns an ocpd object with no data, but with the matrices and vectors set up so that they can be added to while the algorithm runs.
initOCPD(dims, init_params = list(list(m = 0, k = 0.01, a = 0.01, b = 1e-04)), initProb = c(gaussian_init))
dims |
The dimensions calculated from the first input data points. |
init_params |
The list of params required to initialize the underlying distribution model. |
initProb |
The chosen type of underlying distribution. |
oCPD object initialized with initialization settings.
empty_ocpd <- initOCPD(1) # initialize object with 1 dimension
Computes the negative-binomial posterior predictive density from input parameter vectors corresponding to each possible run length for the current time point. Outputs a vector of probabilities for use in the accompanying poisson functions.
negbinpdf(x, a, b)
x |
the current data point |
a |
matrix of alpha params |
b |
matrix of beta params |
Matrix of negative binomial pdf values corresponding to each possible run length, for use in accompanying poisson probability functions.
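For reference, a negative-binomial predictive can be sketched with base R, assuming that a is the shape and b the rate of a gamma prior on the Poisson rate. That parameterization is an assumption, the name negbin_predictive is hypothetical, and the package's negbinpdf may use a different convention.

```r
# Hypothetical sketch, not the package's negbinpdf():
# posterior predictive of a count x under a Gamma(shape = a, rate = b) prior
# on the Poisson rate, evaluated for each run length's (a, b) pair.
negbin_predictive <- function(x, a, b) {
  dnbinom(x, size = a, prob = b / (b + 1))
}

p <- negbin_predictive(4, a = c(1, 5), b = c(1, 2))
p
```

This relies on the standard result that mixing a Poisson over a gamma-distributed rate gives a negative binomial.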
The main algorithm, called "Bayesian Online Changepoint Detection". Input is data in the form of a matrix and, optionally, an existing ocp object to build on. Output is the list of changepoints and other values calculated while running the model.
onlineCPD(datapts, oCPD = NULL, missPts = "none",
  hazard_func = function(x, lambda) { const_hazard(x, lambda = 100) },
  probModel = list("g"),
  init_params = list(list(m = 0, k = 0.01, a = 0.01, b = 1e-04)),
  multivariate = FALSE, cpthreshold = 0.5, truncRlim = .Machine$double.xmin,
  minRlength = 1, maxRlength = 10^4, minsep = 1, maxsep = 10^4,
  timing = FALSE, getR = FALSE, optionalOutputs = FALSE, printupdates = FALSE)
datapts |
the input data in form of a matrix, where the rows correspond to each data point, and the columns correspond to each dimension. |
oCPD |
An ocp object computed in a previous run of the algorithm. It can be built upon with the input data points, as long as the settings for both are the same. |
missPts |
This setting indicates how to deal with missing points (e.g. NA). The options are: "mean", "prev", "none", and a numeric value. If the data is multivariate, the numeric replacement value can be either a single value that applies to all dimensions, or a vector of the same length as the number of dimensions of the data. |
hazard_func |
This setting allows choosing a hazard function, and also setting the constants within that function. For example, the default hazard function is: function(x, lambda)const_hazard(x, lambda=100) and the lambda can be set as appropriate. |
probModel |
This parameter selects the probability model used to calculate the predictive probabilities and update the parameters of the model. The default, list("g") in the usage above, uses a gaussian underlying distribution. |
init_params |
The parameters used to initialize the probability model. The default settings correspond to the input default gaussian model. |
multivariate |
This setting indicates if the incoming data is multivariate or univariate. |
cpthreshold |
Probability threshold for the method of extracting a list of all changepoints that have a run length probability higher than a specified value. The default is set to 0.5. |
truncRlim |
The probability threshold to begin truncating the R vector. The R vector is a vector of run-length probabilities. To prevent truncation, set this to 0. The default in the usage above is .Machine$double.xmin; the paper suggests 10^(-4). |
minRlength |
The minimum size the run length probabilities vector must be before beginning to check for the truncation threshold. |
maxRlength |
The maximum size the R vector is allowed to be, before enforcing truncation to happen. |
minsep |
This setting constrains the possible changepoint locations considered in determining the optimal set of changepoints. It excludes candidate changepoints that are closer together than the value of minsep. The default in the usage above is 1. |
maxsep |
This setting constrains the possible changepoint locations considered in determining the optimal set of changepoints. It excludes candidate changepoints that are farther apart than the value of maxsep. The default in the usage above is 10^4. |
timing |
Set to TRUE to print out times while the algorithm runs, to track its progress. |
getR |
To output the full R matrix, set this setting to TRUE. Outputting this matrix causes a major slow down in efficiency. |
optionalOutputs |
Output additional values calculated during running the algorithm, including a matrix containing all the input data, the predictive probability vectors at each step of the algorithm, and the vector of means at each step of the algorithm. |
printupdates |
This setting prints out updates on the progress of the algorithm if set to TRUE. |
An ocp object containing the main output: a list of changepoints from each time point, and many additional outputs: the number of time points, the initial settings of the algorithm, the current model parameters, the means from each time point, the most recently processed point, the most recently calculated vector of run length probabilities, and a vector of probabilities of changepoints at each time point.
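The core run length recursion behind this kind of algorithm can be sketched in a few lines of base R. The version below is a simplified, univariate illustration under stated assumptions: a constant hazard of 1/100, a normal-gamma prior with the default-looking values from the usage above, no truncation of the R vector and no handling of missing points. The name bocpd_sketch and its return fields are hypothetical; this is not the package's onlineCPD.

```r
# Hypothetical sketch, not the package's onlineCPD().
bocpd_sketch <- function(x, hazard = 1 / 100,
                         prior = list(m = 0, k = 0.01, a = 0.01, b = 1e-04)) {
  n <- length(x)
  m <- prior$m; k <- prior$k; a <- prior$a; b <- prior$b
  R <- 1                    # run length 0 has probability 1 before any data
  map_run <- integer(n)     # most probable run length after each point
  for (t in seq_len(n)) {
    # Student t predictive density of x[t] under each run length
    scale <- sqrt(b * (k + 1) / (a * k))
    pred <- dt((x[t] - m) / scale, df = 2 * a) / scale
    growth <- R * pred * (1 - hazard)   # the current run continues
    cp <- sum(R * pred * hazard)        # a changepoint resets the run to 0
    R <- c(cp, growth)
    R <- R / sum(R)                     # normalise so the vector sums to 1
    # Conjugate update using the old m and k; prepend the prior for run 0
    b <- c(prior$b, b + k * (x[t] - m)^2 / (2 * (k + 1)))
    m <- c(prior$m, (k * m + x[t]) / (k + 1))
    a <- c(prior$a, a + 0.5)
    k <- c(prior$k, k + 1)
    map_run[t] <- which.max(R) - 1
  }
  list(R = R, map_run = map_run)
}

set.seed(42)
x <- c(rnorm(50), rnorm(50, 100))
res <- bocpd_sketch(x)
tail(res$map_run, 1)  # most probable run length at the final point
```

With a mean shift this large, the most probable final run length should sit close to the number of points observed since the shift. Without truncation the R vector grows by one element per point, which is the cost that the truncRlim, minRlength and maxRlength settings are meant to control.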
simdatapts <- c(rnorm(n = 50), rnorm(n = 50, 100))
ocpd1 <- onlineCPD(simdatapts)
ocpd1$changepoint_lists # view the changepoint lists
Plot ocpd object, to show the data and the R matrix probabilities.
## S3 method for class 'ocp'
plot(x, data = NULL, Rmat = NULL, graph_changepoints = TRUE,
  graph_probabilities = TRUE, showmaxes = TRUE, showmeans = TRUE,
  showcps = TRUE, showdata = TRUE, showRprobs = TRUE, cplistID = 3,
  main_title = "", trueCPs = NULL, showdataleg = TRUE, timepoints = NULL,
  timeunits = NULL, grey_digits = 4, varnames = NULL, ...)
x |
the ocp object to plot |
data |
optional input data to plot |
Rmat |
optional input Rmat to plot |
graph_changepoints |
set to TRUE to graph the changepoints |
graph_probabilities |
set TRUE to show R matrix graphed |
showmaxes |
set TRUE to show the maxes in each column in the R matrix plot |
showmeans |
set TRUE to show the means on the changepoints plot |
showcps |
set TRUE to show the locations of changepoints |
showdata |
set TRUE to show the actual data points |
showRprobs |
set TRUE to show the probabilities in the R matrix |
cplistID |
method of extracting the changepoints: either "colmaxes", "threshcps", or "maxCPs", stored in the "changepoint_lists" in the ocpd object |
main_title |
The main title for both plots, e.g. "Eurogames Data" |
trueCPs |
input the true known changepoints for comparison |
showdataleg |
Set TRUE to show the legend for the data points. Set to FALSE if there are too many dimensions, since the legend will be crowded. |
timepoints |
List of timepoints to use as x-axis labels. |
timeunits |
Units to display for the timescale on the plot. |
grey_digits |
The limit of decimal places to keep in the probability before converting to an index in the grey-scale, controls amount of detail and darkness of the shading on the plot. |
varnames |
List of variable names to display in the legend. |
... |
(optional) additional arguments, ignored. |
simdatapts <- c(rnorm(n = 50), rnorm(n = 50, 100))
ocpd1 <- onlineCPD(simdatapts, getR = TRUE)
plot(ocpd1) # basic plot
plot(ocpd1, data = simdatapts) # plot with the original data
plot(ocpd1, trueCPs = c(1, 51)) # plot showing the true changepoints
plot(ocpd1, main_title = "Example plot", showmaxes = FALSE) # not showing max probabilities
plot(ocpd1, graph_changepoints = FALSE) # not showing the changepoints plot
plot(ocpd1, graph_probabilities = FALSE) # not showing the R matrix
# plotting R with maxes but no probabilities,
# and not showing the locations of the found changepoints
plot(ocpd1, showRprobs = FALSE, showcps = FALSE)
Takes in the desired initialization parameters and initializes the vectors needed for the poisson probability function poisson_update.
poisson_init(init_params = list(a = 1, b = 1), dims)
init_params |
The list of parameters to be used for initialization |
dims |
the dimensionality of the data |
List of vectors to be used in the iteratively updating algorithm of parameters describing the underlying poisson distribution of the data.
Updates the parameters of the poissons based on each possible run length, after taking into consideration the most recent data point
poisson_update(datapt, update_params0, update_paramsT, Rlength, skippt = FALSE)
datapt |
the current data point |
update_params0 |
The initialization parameters, corresponding to predicting a changepoint (run length=0) |
update_paramsT |
The vectors of parameters corresponding to each possible run length, updated with each incoming data point |
Rlength |
the length of the current vector of possible run lengths |
skippt |
If the current point should be skipped in the updating because it was missing, and missPts was set to skip |
The list of the parameters for the poisson models corresponding to each possible run length up to the current data point. Lengths of vectors should correspond to the length of the R vector ("run length vector").
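A gamma-Poisson conjugate update of this kind can be sketched as below, assuming a is the gamma shape and b the gamma rate. The function name gamma_poisson_update is hypothetical and this is not the package's poisson_update.

```r
# Hypothetical sketch, not the package's poisson_update():
# add the new count to the shape and one observation to the rate for every
# run length, then prepend the prior for run length 0.
gamma_poisson_update <- function(x, prior, params) {
  list(a = c(prior$a, params$a + x),
       b = c(prior$b, params$b + 1))
}

prior <- list(a = 1, b = 1)
params <- prior
for (count in c(3, 0, 5)) {
  params <- gamma_poisson_update(count, prior, params)
}
params$a / params$b  # posterior mean rate for each run length
```

The ratio a / b is the posterior mean of the Poisson rate, so the last line gives one estimated rate per run length.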
Compute the probability of observing the current point, given the current parameters of the poisson for each possible run length. Returns a vector of predictive probabilities from each possible run length, the parameters of the poisson, the most likely lambda of the current poisson, and the current point.
poissonProb(update_params0, update_paramsT, datapt, time, cps, missPts, Rlength, skippt = FALSE)
update_params0 |
The initialization parameters, corresponding to predicting a changepoint (run length=0) |
update_paramsT |
The vectors of parameters corresponding to each possible run length, updated with each incoming data point |
datapt |
the current data point |
time |
the number of time points passed so far |
cps |
the current most likely list of changepoints |
missPts |
the method set to handle missing points |
Rlength |
the length of the current vector of possible run lengths |
skippt |
If the current point should be skipped in the updating because it was missing, and missPts was set to skip |
Returns a vector of predictive probabilities from each possible run length, the parameters of the poisson, the most likely lambda of the current poisson, and the current point.
Print information about the ocpd object.
## S3 method for class 'ocp'
print(x, ...)
x |
the object to print |
... |
(optional) additional arguments, ignored. |
simdatapts <- c(rnorm(n = 50), rnorm(n = 50, 100))
ocpd1 <- onlineCPD(simdatapts)
print(ocpd1)
Print out information about the ocpd object.
## S3 method for class 'ocp'
str(object, ...)
object |
the object to show |
... |
(optional) additional arguments, ignored. |
simdatapts <- c(rnorm(n = 50), rnorm(n = 50, 100))
ocpd1 <- onlineCPD(simdatapts)
str(ocpd1)
Computes the student pdf from input parameter vectors corresponding to each possible run length for the current time point. Outputs a vector of probabilities for use in the accompanying gaussian functions.
studentpdf(x, mu, var, nu)
x |
the current data point |
mu |
vector of means |
var |
var parameter of student pdf (the variance, or squared scale) |
nu |
nu parameter of student pdf, the degrees of freedom (related to the number of points so far) |
Vector of student pdf values corresponding to each possible run length, for use in accompanying gaussian probability functions.
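For readers who want to see the density written out, the sketch below evaluates a location-scale Student t density on the log scale, assuming var is the squared scale and nu the degrees of freedom. The name student_pdf_sketch is hypothetical and the package's studentpdf may differ in detail.

```r
# Hypothetical sketch, not the package's studentpdf():
# Student t density with location mu, squared scale var and nu degrees of
# freedom, computed via logs for numerical stability.
student_pdf_sketch <- function(x, mu, var, nu) {
  log_c <- lgamma((nu + 1) / 2) - lgamma(nu / 2) - 0.5 * log(nu * pi * var)
  exp(log_c - (nu + 1) / 2 * log1p((x - mu)^2 / (nu * var)))
}

# One value per run length: vectors of means, variances and degrees of freedom
dens <- student_pdf_sketch(0.5, mu = c(0, 0.2), var = c(1, 4), nu = c(1, 5))
dens
```

Under these assumptions the result agrees with base R's dt() after standardising, that is dt((x - mu) / sqrt(var), df = nu) / sqrt(var).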
Print out ocpd object summary.
## S3 method for class 'ocp'
summary(object, ...)
object |
the object to summarize |
... |
(optional) additional arguments, ignored. |
simdatapts <- c(rnorm(n = 50), rnorm(n = 50, 100))
ocpd1 <- onlineCPD(simdatapts)
summary(ocpd1)