Detailing common methods and preprocessing steps for regime shift detection methods
Published
December 13, 2024
Code
library(here)library(tidyverse)library(rshift)library(gmRi)library(mgcv)library(EnvCpt)# Make sure lag is done correctlyconflicted::conflict_prefer("lag", "stats")conflicted::conflict_prefer("filter", "dplyr")# Path to timeserieslob_ecol_path <-cs_path("mills", "Projects/Lobster ECOL")# Load and add inshore/offshore labels on the daily datalobecol_fvcom <-read_csv(str_c(lob_ecol_path, "FVCOM_processed/area_timeseries/all_regions_fvcom_temperatures_daily.csv")) %>%mutate(depth_type =if_else(area_id %in%c("GOM_GBK", "SNE"), "offshore", "nearshore"))theme_set(theme_minimal())
Code
# Use GMRI style for fontsuse_gmri_style_rmd()
Regime Shifts & Tipping Points
This is a methods exploration of various packages available for regime shifts and for ecological tipping point detection. I will also try to flag and cite key ideas found in the literature.
What is a regime shift / tipping point - Core Ideas
The concept of tipping points usually implies discontinuous regime shifts of systems and includes three characteristics: (1) abrupt change, (2) hysteresis, and (3) non-stationary functional relationships. -@blöcker2023
Hysteresis: When the pathway of recovery of an ecosystem differs from its pathway of degradation (Suding & Hobbs 2009)
Alternative-Steady State: Since the first papers on the subject appeared, two perspectives have developed to describe how communities shift from one stable state to another. One assumes a constant environment with shifts in variables such as population density, and the other anticipates changes to underlying parameters or environmental “drivers”. (Beisner et al. 2003)
Non-Stationary Relationships:
Resilience:
Detection Methods
Abrupt Shifts
Statistical Changepoint Analysis
Sequential t-test Analysis of Regime Shifts (STARS, rodionov 2004).
Originally an Excel add-in. STARS mean-shift testing has been implemented with {rshift}
Outlier weighting and prewhitening routines have been implemented in Stirnimann 2019’s repository {rstars}.
{EnvCpt} Detection of Structural Changes in Climate and Environment Time Series using 12 methods and evaluated with AIC. Based on paper by Beaulieu et al 2012.
Non-stationarity in the SRR (stock recruit relationship)
Trend Changes
{EnvCpt} has model forms that evaluate non-stationary trends
The echo: false option disables the printing of code (only output is displayed).
STARS Implementation - Basic
One of the most broadly cited/used regime shift methods is Sequential t-test Analysis of Regime Shifts (STARS, rodionov 2004).
This approach is made available as an excel macro through Sergei Rodionov’s personal site. There have also been recent adaptations of this approach implemented in R/python.
The original method was built to detect shifts in the mean for some series of data. The approach was later improved, now incorporating outlier weighting via a Huber’s weight argument, as well as red-noise detection and “prewhitening” routines. This additions lowered the orignal algorithms sensitivity to outliers and reduced the likelihood of false-positive detections in cases where an autocorrelative structure is present.
In R, the methodology for implementing the STARS approach is available in {rshift}, however at this time the outlier detection and prewhitening routines are not part of the package.
Another repository/package in R (rstars) contains the rodionov stars implementation, with huber weighting, and functions for the four prewhitening routines described in Stirnimann 2019.
Basic Implementation - rshift
The STARS method requires several arguments: data a dataframe containing a time index column time and the variable of interest col, an argument for the minimum length for a regime l (integer value using whatever timestep time is in), and a detection probability prob with which to flag a regime shift.
Code
# Test timeseriestest_area <-"GOM_GBK"tester <-filter(lobecol_fvcom, area_id == test_area)# Regime Shift Testt1_regimes <- rshift::Rodionov(data = tester, col ="bottom_t", time ="time", l =365*7, prob =0.05, startrow =1, merge = T )# PlotRSI_graph( t1_regimes, col ="bottom_t", time ="time", rsi ="RSI", mean_lines =TRUE)
Upon quick glance we may get the sense that there is something going on that is leading to many values near the end of the timeseries flagged for different areas of interest.
STARS v3.2+ Enhancments
Since the initial 2004 paper (and not included in rshift presently), Rodionov added a number of enhancements and recommendations to the approach which improve its ability to detect true shifts in the mean in the presence of outliers, autocorrelation (red-noise), and trends.
Outlier Detection / Weighting
Due to outliers, the average is not representative for the mean value of the regimes, and this may significantly affect the results of the regime shift detection. Ideally the weight for the data value should be chosen such that it is small if that value is considered as an outlier. To handle the outliers, the program uses the Huber’s weight function, which is calculated here as
weight = min( 1, parameter/(|anomaly|))
Where anomaly is the deviation from the expected mean value of the new regime normalized by the standard deviation averaged for all consecutive sections of the cut-off length in the series. If anomalies are less than or equal to the value of the parameter then their weights are equal to one. Otherwise, the weights are inversely proportional to the distance from the expected mean value of the new regime.
Red Noise Removal
One pre-processing / pre-whitening step done by Rodionov in newer STARS versions is accounting for variability related to “red noise” processes. These are sequential runs (autocorrelation) in the data which can appear as regimes themselves and interfere with regime shift identification.
Red noise is the same as a first-order autoregressive (AR1) stationary Gaussian process with a positive correlation at lag 1, also called the Markov process. If red noise or serial correlation is present in the time series, it is necessary to either adjust the significance level of the shifts by calculating the effective degrees of freedom, or use a so-called “prewhitening” procedure prior to application of a regime shift detection method. In any case, it requires an estimate of the AR1 autoregressive parameter, which is equal to the sample lag-1 autocorrelation coefficient (r1).
Prewhitening describes aproaches used to remove “red-noise” from a time series before analysis. There are three methods evaluated in Stirnimann et al. 2019 that were added in the Rodionov 2006 study include:
Ordinary Least Squares (OLS)
Marriott-Pope and Kendall (MPK)
Inverse Proportionality with four corrections (IP4)
A written description of the approach that Rodionov used/uses can be found here: https://www.beringclimate.noaa.gov/regimes/help3.html
Note that the OLS estimate is calculated using the entire time series. The MPK and IP4 methods break the time series into subsamples, estimate bias corrected r1 for each subsample and then use the median value of all estimates. The suggested subsample size m is calculated as m = (l + 1)/3, where l is the cutoff length. It is recommended to experiment with different subsample sizes to see how it affects the r1 estimate.
The following chunk of code is to my understanding the OLS approach, which is done on the full dataset:
Code
# Take the tester to demo prewhitening# Fit AR(1) Modelfit <-arima(ts(tester$bottom_t), order =c(1, 0, 0))# Extract Residualsprewhitened_series <-residuals(fit)# Plot the original and pre-whitened seriestester %>%mutate(mpk = prewhitened_series) %>%ggplot(aes(time)) +geom_line(aes(y = bottom_t, color ="Original Time Series")) +geom_line(aes(y = mpk, color ="Pre-Whitened Time Series"))
Code
# # Test for Remaining Autocorrelation# Box.test(prewhitened_series, lag = 10, type = "Ljung-Box")# # # Get the AR1 coefficient# summary(fit)# coef(fit)
This approach would need to be adjusted to faithfully follow the Rodionov 2006 methods to include subsampling for the MPK and IP4 approaches:
The MPK and IP4 methods break the time series into subsamples, estimate bias corrected r1 for each subsample and then use the median value of all estimates. The suggested subsample size m is calculated as m = (l + 1)/3, where l is the cutoff length. It is recommended to experiment with different subsample sizes to see how it affects the r1 estimate.
Thankfully, code from the Stirnimann paper is available which implements the other methods.
{rstars} Implementation
We can try the more detailed approach using the {rstars} implementation by sourcing the repo functions which have been downloaded locally:
This repository was made by Luca Stirnimann for their 2019 publication.
Code
# # Data used as an example looks like this# PDO <- read.table(here::here("rstars-master","PDO.txt"),header = T, dec = ".")# Stirnimann used these values in their paper:# l = 5,10,15,17.5 years, with monthly data# Huber = 1# Subsampling = (l + 1)/3# Load the function(s)source(here::here("rstars-master","rSTARS.R"))# Need it to be a dataframe not a tibble for indexing to worktester_df <-as.data.frame(tester[,c("time", "bottom_t")])test_rstars <-rstars(data.timeseries = tester_df, l.cutoff =365*7, pValue =0.05, Huber =1, Endfunction = T, # Some behavior at the end of the timeseriespreWhitening = T, # Apply prewhitening T/FOLS = F, # OLSprewhitening method T/FMPK = T, # Marriott-Pope + Kennedy prewhitening method T/FIP4 = F, # IP4 prewhitening method T/FSubsampleSize = (365*7+1)/3, # subsampling rate for hubers + prewhiteningreturnResults = T # Return the results as a dataframe ) # # Saves 3 Files before I modified it:# filtered_ts <- read_table(str_c(here::here("rstars_results/tester_"), "Filteredts.txt"))# tester_rsi <- read_table(str_c(here::here("rstars_results/tester_"), "RSI.txt"))# tester_mean <- read_table(str_c(here::here("rstars_results/tester_"), "tsMean.txt"))# Results now look like this:ggplot(test_rstars) +geom_line(aes(time, bottom_t), linewidth =0.5, alpha =0.5) +geom_line(aes(time, regime_mu), linewidth =1) +geom_vline(data =filter(test_rstars, RSI !=0),aes(xintercept = time), color ="red", linewidth =1) +labs(x ="", y ="Bottom Temp")
Code
# Results now look like this on anomaly scales:ggplot(test_rstars) +geom_line(aes(time, bottom_t_pw), linewidth =0.5, alpha =0.5) +geom_line(aes(time, regime_mu_pw), linewidth =1) +geom_vline(data =filter(test_rstars, RSI !=0),aes(xintercept = time), color ="red", linewidth =1) +scale_y_continuous(limits =c(-0.35, 0.35)) +labs(x ="", y ="Prewhitened Temperature")
Running the code from this repository on the same daily bottom temperature series yields the above results.
Everything is done with a single function, and so its possible (once the pre-processing has been done) to run each timeseries of interest using map/lapply/etc.
And if we did all of them for bottom temperature, we’d return a bunch of lines that look like this:
Code
# Load that data againbot_temp_shifts <-read_csv(here::here("rstars_results/lobecol_shifts.csv"))# Plot the RSIbot_temp_shifts %>%ggplot() +geom_line(aes(time, RSI), linewidth =0.5, alpha =0.5) +geom_hline(yintercept =0) +facet_grid() +labs(x ="Time",y ="RSI",title ="")
Preprocessing Topics:
It is strongly recommended by a number of publications to apply a number of preprocessing steps when performing a changepoint/breakpoint or mean-shift tests. These preprocessing steps increase the likelihood that detected events are true abrupt shifts in the data, and not other common processes. They also point out that prewhitening alone does not resolve sytemic processes in the data which can present
From Lund et al. 2023
To account for correlation in an AMOC changepoint analysis, prewhitening first fits a pth-order autoregressive [AR(p)] model to the series (this assumes nonperiodic data).
In their review paper they also note a number of best practices and pitfalls (some below, not exhaustive).
BEST PRACTICE 3 Account for all potential features in the mean of a series. If in doubt, allow for a trend and/or seasonality and use the statistical methods to distinguish which features are present in the series.
As an aside, we comment that model fitting is not about maximizing or mini- mizing p values, but rather making sure that all relevant statistical features are accounted for in a parsimonious model
PITFALL 5: APPLYING MEAN SHIFT MCPT TECHNIQUES TO SERIES WITH TRENDS OR SEASONALITY WITHOUT ACCOUNTING FOR THESE FEATURES. Similar to AMOC techniques in pitfall 2, applying a MCPT technique that neglects trends and seasonality can result in spurious changepoint declarations. For example, an increasing long-term trend will likely be estimated as a series of change-points acting as an increasing stairway.
In a multiple changepoint analysis of a daily series, methods may flag many spurious changepoints within a year in an attempt to track the seasonal mean cycle should it be ignored in the modeling cycle
Based on reading the review paper.
If we wish to continue with STARS we should:
Deal with seasonality (subtract seasonal mean)
Deal with any long-term trend (remove trend)
Then-prewhiten to address AR()
Then run through rstars
If we want to use EnvCPT:
Deal with seasonality (subtract seasonal mean)
then run through envcpt
Removing Trend & Seasonality
STARS works best when the time series contains no deterministic trend, such as the PDO series, from which the global warming trend is removed by design. However, assuming the AR1 autoregressive parameter is not changing over time, both MPK and IP4 procedures can accurately estimate it not only in the presence of regime shifts, but also in the presence of a weak-to-moderate trend. Subsequent prewhitening eliminates both the linear trend and red noise, without having a substantial impact on the magnitude of regime shifts (Rodionov, 2006).
In summary, if a trend and/or serial correlation are present in the time series, it is critically important to eliminate them using either MPK or IP4 prewhitening procedure, specifically designed to work with time series containing regime shifts. If a linear trend is strong, it is better to detrend the time series first to improve accuracy of estimation of the lag-1 autocorrelation coefficient.
De-trending can be done quickly with lm
Code
# linear model with timetester_trend_mod <-lm(bottom_t ~ time, tester)# residuals are the detrended timeseries# we can re-add global mean if we want it in original unitstester <- broom::augment(x = tester_trend_mod) %>%rename(trend_fit = .fitted,trend_resid = .resid)# # Plot detrended timeseries# tester %>% # ggplot(aes(time, trend_resid)) +# geom_line() +# labs(title = "Detrended Timeseries")# Next, prewhitening# Fit AR(1) Modelar_fit <-arima(ts(tester$trend_resid), order =c(1, 0, 0))# Extract Residualstester$detrend_prewhitened <-as.numeric(residuals(ar_fit))# Plot the original and pre-whitened seriestester %>%ggplot(aes(time)) +geom_line(aes(y = bottom_t, color ="Original Time Series")) +geom_line(aes(y = trend_resid, color ="Detrended Time Series")) +geom_line(aes(y = detrend_prewhitened, color ="Detrended & Pre-Whitened")) +labs(title ="Linear Detrending + Prewhitening of Bottom Temperature")
To remove a seasonal trend as well, we can use a number of methods, but using a GAM we could do the trend and seasonality in one step:
Code
# Load GAM librarylibrary(mgcv)# Make a day of year value to cycle overtester <- tester %>%mutate(yday = lubridate::yday(time))# Model a linear trend and the seasonal cycleseasonal_gam <-gam( bottom_t ~s(yday, bs ="cc") + time, data = tester,method ="REML")# Add the predictions and residualstester <- broom::augment(x = seasonal_gam) %>%rename(seasonal_fit = .fitted,seasonal_resid = .resid)# Plot the fit of that processtester %>%ggplot(aes(x = time)) +geom_line(aes(y = bottom_t, color ="Original Timeseries"), alpha =0.75, linewidth =1) +geom_line(aes(y = seasonal_fit, color ="Seasonal Cycle + Trend"), alpha =0.75, linewidth =0.5) +labs(color ="Data")
The last step is to remove autocorrelation:
Code
# Next, prewhitening# Fit AR(1) Modelseason_ar_fit <-arima(ts(tester$seasonal_resid), order =c(1, 0, 0))# Extract Residualstester$season_detrend_prewhitened <-as.numeric(residuals(season_ar_fit))# Make the Detrended Datasettester %>%ggplot(aes(x = time)) +geom_line(aes(y = bottom_t, color ="Original Timeseries"), alpha =0.75, linewidth =1) +geom_line(aes(y = seasonal_fit, color ="Seasonal Cycle + Trend"), alpha =0.75, linewidth =0.5) +geom_line(aes(y = seasonal_resid, color ="Detrend: Seasonal Cycle + Trend"), alpha =0.75, linewidth =1) +geom_line(aes(y = season_detrend_prewhitened, color ="Seasonality + Trend Removed + PW"), alpha =0.75, linewidth =0.5) +labs(title ="Full Prep: Remove Seasonality + Trend, Pre-whiten")
If we pass this new pre-procesed timeseries into the rstars function we can get the following new changepoints:
Code
# Need it to be a dataframe not a tibble for indexing to workpreprocess_df <-as.data.frame(tester[,c("time", "season_detrend_prewhitened")])# Run it without prewhitening# Or do the prewhitening on the detrended data?preprocess_rstars <-rstars(data.timeseries = preprocess_df, l.cutoff =365*7, pValue =0.05, Huber =1, Endfunction = T, # Some behavior at the end of the timeseriespreWhitening = F, # Apply prewhitening T/FOLS = F, # OLSprewhitening method T/FMPK = F, # Marriott-Pope + Kennedy prewhitening method T/FIP4 = F, # IP4 prewhitening method T/FSubsampleSize = (365*7+1)/3, # subsampling rate for hubers + prewhiteningreturnResults = T # Return the results as a dataframe ) # Results now look like this:ggplot(preprocess_rstars) +geom_line(aes(time, season_detrend_prewhitened), linewidth =0.5, alpha =0.5) +geom_line(aes(time, regime_mu), linewidth =1) +geom_vline(data =filter(preprocess_rstars, RSI !=0),aes(xintercept = time), color ="red", linewidth =1) +scale_y_continuous(limits =c(-0.35, 0.35)) +labs(x ="", y ="Detrended Bottom Temp Anomaly",title ="STARS Results - After full pre-processing steps",subtitle = test_area)
EnvCpt Implementation
{EnvCpt} is another changepoint implementation that fits a variety of model forms based on the work of Beaulieu and Killick (2018) .
This is Rodionov’s rundown of their methods:
They used their own method named EnvCpt that fits eight models with different combinations of a trend, changepoints (regime shifts) and red noise and then identifies the most appropriate one according to the Akaike information criterion.
This approach throws a range of 12 possible trend and changepoint model structures at a dataset and can quickly visualize each. AIC/BIC are then used to evaluate support for the different changepoint structures.
Code
# The envcpt function is the workhorsetest_envcpt <-envcpt(tester$bottom_t, minseglen =365*7)
# Can plot the breaks found easilyplot(preprocess_envcpt)
Code
# Pull details on the winner (most parsimonious)best_cpt <-pull_best(preprocess_envcpt)#ggplot(preprocess_df) +geom_line(aes(time, season_detrend_prewhitened), linewidth =0.5, color ="gray50") +geom_vline(data =data.frame(x = tester$time[best_cpt$best_summ@cpts]),aes(xintercept = x), linetype =1, linewidth =1, color ="royalblue") +theme_classic() +labs(x ="time", y ="bottom temperature", title = best_cpt$best_type)
Source Code
---title: "Regime Shift / Tipping Points"description: | Detailing common methods and preprocessing steps for regime shift detection methodsdate: "Updated on: `r Sys.Date()`"format: html: code-fold: true code-tools: true df-print: kable self-contained: trueexecute: echo: true warning: false message: false fig.align: "center" comment: ""---```{r}library(here)library(tidyverse)library(rshift)library(gmRi)library(mgcv)library(EnvCpt)# Make sure lag is done correctlyconflicted::conflict_prefer("lag", "stats")conflicted::conflict_prefer("filter", "dplyr")# Path to timeserieslob_ecol_path <-cs_path("mills", "Projects/Lobster ECOL")# Load and add inshore/offshore labels on the daily datalobecol_fvcom <-read_csv(str_c(lob_ecol_path, "FVCOM_processed/area_timeseries/all_regions_fvcom_temperatures_daily.csv")) %>%mutate(depth_type =if_else(area_id %in%c("GOM_GBK", "SNE"), "offshore", "nearshore"))theme_set(theme_minimal())``````{r}#| label: style-sheet#| results: asis# Use GMRI style for fontsuse_gmri_style_rmd()``````{r}#| label: fonts-config#| echo: falselibrary(showtext)# Path to the directory containing the font file (replace with your actual path)font_dir <-paste0(system.file("stylesheets", package ="gmRi"), "/GMRI_fonts/Avenir/")# Register the fontfont_add(family ="Avenir",file.path(font_dir, "LTe50342.ttf"),bold =file.path(font_dir, "LTe50340.ttf"),italic =file.path(font_dir, "LTe50343.ttf"),bolditalic =file.path(font_dir, "LTe50347.ttf"))# Load the fontshowtext::showtext_auto()```# Regime Shifts & Tipping PointsThis is a methods exploration of various packages available for regime shifts and for ecological tipping point detection. I will also try to flag and cite key ideas found in the literature.## What is a regime shift / tipping point - Core Ideas> The concept of tipping points usually implies discontinuous regime shifts of systems and includes three characteristics: (1) abrupt change, (2) hysteresis, and (3) non-stationary functional relationships. -@blöcker2023**Hysteresis:** When the pathway of recovery of an ecosystem differs from its pathway of degradation (Suding & Hobbs 2009)**Alternative-Steady State:** Since the first papers on the subject appeared, two perspectives have developed to describe how communities shift from one stable state to another. One assumes a constant environment with shifts in variables such as population density, and the other anticipates changes to underlying parameters or environmental “drivers”. (Beisner et al. 2003)**Non-Stationary Relationships:****Resilience:**## Detection Methods### Abrupt Shifts- Statistical Changepoint Analysis - Sequential t-test Analysis of Regime Shifts (STARS, rodionov 2004). - Originally an Excel add-in. STARS mean-shift testing has been implemented with [{rshift}](https://github.com/alexhroom/rshift) - Outlier weighting and prewhitening routines have been implemented in Stirnimann 2019's repository [{rstars}](https://github.com/LStirnimann/rstars/tree/master). - Additional code snippet also available [here](https://figshare.com/articles/dataset/Supplement_1_Annotated_R_scripts_and_sediment_core_data_used_to_investigate_regime_shift_dynamics_in_a_Galapagos_coastal_lagoon_/3560655?file=5632209) - [{EnvCpt}]() Detection of Structural Changes in Climate and Environment Time Series using 12 methods and evaluated with AIC. Based on paper by Beaulieu et al 2012. - [{bcp}](https://github.com/swang87/bcp) Bayesian changepoint detection - [{changepoint}](https://cran.r-project.org/web/packages/changepoint/index.html) Binary segmentation algorithm *BinSeg*, multiple changepoint search - [{mcp}](https://lindeloev.github.io/mcp/index.html) Regression with multiple change points/trend evaluation - {strucchange} - {rpart}: Package for chronological clustering, a clustering approach for multivariate timeseries. See: Peretti et al 2017### Hysteresis- Temporal evolution of relationships e.g. SSB \~ F - stats::ccf() for visual inspection of correlation for SSB \~ F at $t-n$ time lags. - [{strucchange}](https://cran.r-project.org/web/packages/strucchange/index.html)### Non-Stationary Relationships- Non-stationarity in the SRR (stock recruit relationship)### Trend Changes- [{EnvCpt}]() has model forms that evaluate non-stationary trendsThe `echo: false` option disables the printing of code (only output is displayed).# STARS Implementation - BasicOne of the most broadly cited/used regime shift methods is Sequential t-test Analysis of Regime Shifts (STARS, rodionov 2004).This approach is made available as an excel macro through [Sergei Rodionov's personal site](https://sites.google.com/view/regime-shift-test/downloads?authuser=0). There have also been recent adaptations of this approach implemented in R/python.The original method was built to detect shifts in the mean for some series of data. The approach was later improved, now incorporating outlier weighting via a Huber's weight argument, as well as red-noise detection and "prewhitening" routines. This additions lowered the orignal algorithms sensitivity to outliers and reduced the likelihood of false-positive detections in cases where an autocorrelative structure is present.In R, the methodology for implementing the STARS approach is available in {rshift}, however at this time the outlier detection and prewhitening routines are not part of the package.Another [repository/package in R (rstars)](https://github.com/LStirnimann/rstars/tree/master) contains the rodionov stars implementation, with huber weighting, and functions for the four prewhitening routines described in Stirnimann 2019.### Basic Implementation - rshiftThe STARS method requires several arguments: `data` a dataframe containing a time index column `time` and the variable of interest `col`, an argument for the minimum length for a regime `l` (integer value using whatever timestep `time` is in), and a detection probability `prob` with which to flag a regime shift.```{r}# Test timeseriestest_area <-"GOM_GBK"tester <-filter(lobecol_fvcom, area_id == test_area)# Regime Shift Testt1_regimes <- rshift::Rodionov(data = tester, col ="bottom_t", time ="time", l =365*7, prob =0.05, startrow =1, merge = T )# PlotRSI_graph( t1_regimes, col ="bottom_t", time ="time", rsi ="RSI", mean_lines =TRUE)```Upon quick glance we may get the sense that there is something going on that is leading to many values near the end of the timeseries flagged for different areas of interest.# STARS v3.2+ EnhancmentsSince the initial 2004 paper (and not included in rshift presently), Rodionov added a number of enhancements and recommendations to the approach which improve its ability to detect true shifts in the mean in the presence of outliers, autocorrelation (red-noise), and trends.### Outlier Detection / WeightingDue to outliers, the average is not representative for the mean value of the regimes, and this may significantly affect the results of the regime shift detection. Ideally the weight for the data value should be chosen such that it is small if that value is considered as an outlier. To handle the outliers, the program uses the Huber's weight function, which is calculated here as`weight = min( 1, parameter/(|anomaly|))`Where `anomaly` is the deviation from the expected mean value of the new regime normalized by the standard deviation averaged for all consecutive sections of the cut-off length in the series. If anomalies are less than or equal to the value of the parameter then their weights are equal to one. Otherwise, the weights are inversely proportional to the distance from the expected mean value of the new regime.### Red Noise RemovalOne pre-processing / pre-whitening step done by Rodionov in newer STARS versions is accounting for variability related to "red noise" processes. These are sequential runs (autocorrelation) in the data which can appear as regimes themselves and interfere with regime shift identification.> Red noise is the same as a first-order autoregressive (AR1) stationary Gaussian process with a positive correlation at lag 1, also called the Markov process. If red noise or serial correlation is present in the time series, it is necessary to either adjust the significance level of the shifts by calculating the effective degrees of freedom, or use a so-called "prewhitening" procedure prior to application of a regime shift detection method. In any case, it requires an estimate of the AR1 autoregressive parameter, which is equal to the sample lag-1 autocorrelation coefficient (r1).Prewhitening describes aproaches used to remove "red-noise" from a time series before analysis. There are three methods evaluated in Stirnimann et al. 2019 that were added in the Rodionov 2006 study include: * Ordinary Least Squares (OLS)\ * Marriott-Pope and Kendall (MPK)\ * Inverse Proportionality with four corrections (IP4)A written description of the approach that Rodionov used/uses can be found here: https://www.beringclimate.noaa.gov/regimes/help3.html> Note that the OLS estimate is calculated using the entire time series. The MPK and IP4 methods break the time series into subsamples, estimate bias corrected r1 for each subsample and then use the median value of all estimates. The suggested subsample size m is calculated as m = (l + 1)/3, where l is the cutoff length. It is recommended to experiment with different subsample sizes to see how it affects the r1 estimate.The following chunk of code is to my understanding the OLS approach, which is done on the full dataset:```{r}# Take the tester to demo prewhitening# Fit AR(1) Modelfit <-arima(ts(tester$bottom_t), order =c(1, 0, 0))# Extract Residualsprewhitened_series <-residuals(fit)# Plot the original and pre-whitened seriestester %>%mutate(mpk = prewhitened_series) %>%ggplot(aes(time)) +geom_line(aes(y = bottom_t, color ="Original Time Series")) +geom_line(aes(y = mpk, color ="Pre-Whitened Time Series"))# # Test for Remaining Autocorrelation# Box.test(prewhitened_series, lag = 10, type = "Ljung-Box")# # # Get the AR1 coefficient# summary(fit)# coef(fit)```This approach would need to be adjusted to faithfully follow the Rodionov 2006 methods to include subsampling for the MPK and IP4 approaches:> The MPK and IP4 methods break the time series into subsamples, estimate bias corrected r1 for each subsample and then use the median value of all estimates. The suggested subsample size m is calculated as m = (l + 1)/3, where l is the cutoff length. It is recommended to experiment with different subsample sizes to see how it affects the r1 estimate.Thankfully, code from the Stirnimann paper is available which implements the other methods.# {rstars} ImplementationWe can try the more detailed approach using the [{rstars}](https://github.com/LStirnimann/rstars) implementation by sourcing the repo functions which have been downloaded locally:This repository was made by Luca Stirnimann for their 2019 publication.```{r}# # Data used as an example looks like this# PDO <- read.table(here::here("rstars-master","PDO.txt"),header = T, dec = ".")# Stirnimann used these values in their paper:# l = 5,10,15,17.5 years, with monthly data# Huber = 1# Subsampling = (l + 1)/3# Load the function(s)source(here::here("rstars-master","rSTARS.R"))# Need it to be a dataframe not a tibble for indexing to worktester_df <-as.data.frame(tester[,c("time", "bottom_t")])test_rstars <-rstars(data.timeseries = tester_df, l.cutoff =365*7, pValue =0.05, Huber =1, Endfunction = T, # Some behavior at the end of the timeseriespreWhitening = T, # Apply prewhitening T/FOLS = F, # OLSprewhitening method T/FMPK = T, # Marriott-Pope + Kennedy prewhitening method T/FIP4 = F, # IP4 prewhitening method T/FSubsampleSize = (365*7+1)/3, # subsampling rate for hubers + prewhiteningreturnResults = T # Return the results as a dataframe ) # # Saves 3 Files before I modified it:# filtered_ts <- read_table(str_c(here::here("rstars_results/tester_"), "Filteredts.txt"))# tester_rsi <- read_table(str_c(here::here("rstars_results/tester_"), "RSI.txt"))# tester_mean <- read_table(str_c(here::here("rstars_results/tester_"), "tsMean.txt"))# Results now look like this:ggplot(test_rstars) +geom_line(aes(time, bottom_t), linewidth =0.5, alpha =0.5) +geom_line(aes(time, regime_mu), linewidth =1) +geom_vline(data =filter(test_rstars, RSI !=0),aes(xintercept = time), color ="red", linewidth =1) +labs(x ="", y ="Bottom Temp")# Results now look like this on anomaly scales:ggplot(test_rstars) +geom_line(aes(time, bottom_t_pw), linewidth =0.5, alpha =0.5) +geom_line(aes(time, regime_mu_pw), linewidth =1) +geom_vline(data =filter(test_rstars, RSI !=0),aes(xintercept = time), color ="red", linewidth =1) +scale_y_continuous(limits =c(-0.35, 0.35)) +labs(x ="", y ="Prewhitened Temperature")```Running the code from this repository on the same daily bottom temperature series yields the above results.Everything is done with a single function, and so its possible (once the pre-processing has been done) to run each timeseries of interest using map/lapply/etc.```{r}# # Run them all# bot_temp_shifts <- lobecol_fvcom %>%# split(.$area_id) %>%# map_dfr(function(.x){# # rstars(# data.timeseries = as.data.frame(.x[,c("time", "bottom_t")]),# l.cutoff = 365*7, # pValue = 0.05, # Huber = 1, # Endfunction = F,# preWhitening = T, # OLS = F, # MPK = T, # IP4 = F, # SubsampleSize = (365*7 + 1)/3, # returnResults = T)# # }, .id = "area_id")# # # # Save it# write_csv(bot_temp_shifts, here::here("rstars_results/lobecol_shifts.csv"))```And if we did all of them for bottom temperature, we'd return a bunch of lines that look like this:```{r}#| eval: true# Load that data againbot_temp_shifts <-read_csv(here::here("rstars_results/lobecol_shifts.csv"))# Plot the RSIbot_temp_shifts %>%ggplot() +geom_line(aes(time, RSI), linewidth =0.5, alpha =0.5) +geom_hline(yintercept =0) +facet_grid() +labs(x ="Time",y ="RSI",title ="")```## Preprocessing Topics: It is strongly recommended by a number of publications to apply a number of preprocessing steps when performing a changepoint/breakpoint or mean-shift tests. These preprocessing steps increase the likelihood that detected events are true abrupt shifts in the data, and not other common processes. They also point out that prewhitening alone does not resolve sytemic processes in the data which can presentFrom Lund et al. 2023>To account for correlation in an AMOC changepoint analysis, prewhitening first fits a pth-order autoregressive [AR(p)] model to the series (this assumes nonperiodic data).In their review paper they also note a number of best practices and pitfalls (some below, not exhaustive).> BEST PRACTICE 3Account for all potential features in the mean of a series. If in doubt, allow for a trend and/or seasonality and use the statistical methods to distinguish which features are present in the series.> As an aside, we comment that model fitting is not about maximizing or mini-mizing p values, but rather making sure that all relevant statistical features are accounted for in a parsimonious model> PITFALL 5: APPLYING MEAN SHIFT MCPT TECHNIQUES TO SERIES WITH TRENDS ORSEASONALITY WITHOUT ACCOUNTING FOR THESE FEATURES. Similar to AMOC techniques in pitfall 2, applying a MCPT technique that neglects trends and seasonality can result in spurious changepoint declarations. For example, an increasing long-term trend will likely be estimated as a series of change-points acting as an increasing stairway. > In a multiple changepoint analysis of a daily series, methods may flag many spurious changepoints within a year in an attempt to track the seasonal mean cycle should it be ignored in the modeling cycleBased on reading the review paper. If we wish to continue with STARS we should: 1. Deal with seasonality (subtract seasonal mean) 2. Deal with any long-term trend (remove trend) 3. Then-prewhiten to address AR() 4. Then run through rstarsIf we want to use EnvCPT: 1. Deal with seasonality (subtract seasonal mean) 2. then run through envcpt### Removing Trend & Seasonality> STARS works best when the time series contains no deterministic trend, such as the PDO series, from which the global warming trend is removed by design. However, assuming the AR1 autoregressive parameter is not changing over time, both MPK and IP4 procedures can accurately estimate it not only in the presence of regime shifts, but also in the presence of a weak-to-moderate trend. Subsequent prewhitening eliminates both the linear trend and red noise, without having a substantial impact on the magnitude of regime shifts (Rodionov, 2006).> In summary, if a trend and/or serial correlation are present in the time series, it is critically important to eliminate them using either MPK or IP4 prewhitening procedure, specifically designed to work with time series containing regime shifts. If a linear trend is strong, it is better to detrend the time series first to improve accuracy of estimation of the lag-1 autocorrelation coefficient.De-trending can be done quickly with `lm````{r}# linear model with timetester_trend_mod <-lm(bottom_t ~ time, tester)# residuals are the detrended timeseries# we can re-add global mean if we want it in original unitstester <- broom::augment(x = tester_trend_mod) %>%rename(trend_fit = .fitted,trend_resid = .resid)# # Plot detrended timeseries# tester %>% # ggplot(aes(time, trend_resid)) +# geom_line() +# labs(title = "Detrended Timeseries")# Next, prewhitening# Fit AR(1) Modelar_fit <-arima(ts(tester$trend_resid), order =c(1, 0, 0))# Extract Residualstester$detrend_prewhitened <-as.numeric(residuals(ar_fit))# Plot the original and pre-whitened seriestester %>%ggplot(aes(time)) +geom_line(aes(y = bottom_t, color ="Original Time Series")) +geom_line(aes(y = trend_resid, color ="Detrended Time Series")) +geom_line(aes(y = detrend_prewhitened, color ="Detrended & Pre-Whitened")) +labs(title ="Linear Detrending + Prewhitening of Bottom Temperature")```To remove a seasonal trend as well, we can use a number of methods, but using a GAM we could do the trend and seasonality in one step:```{r}# Load GAM librarylibrary(mgcv)# Make a day of year value to cycle overtester <- tester %>%mutate(yday = lubridate::yday(time))# Model a linear trend and the seasonal cycleseasonal_gam <-gam( bottom_t ~s(yday, bs ="cc") + time, data = tester,method ="REML")# Add the predictions and residualstester <- broom::augment(x = seasonal_gam) %>%rename(seasonal_fit = .fitted,seasonal_resid = .resid)# Plot the fit of that processtester %>%ggplot(aes(x = time)) +geom_line(aes(y = bottom_t, color ="Original Timeseries"), alpha =0.75, linewidth =1) +geom_line(aes(y = seasonal_fit, color ="Seasonal Cycle + Trend"), alpha =0.75, linewidth =0.5) +labs(color ="Data")```The last step is to remove autocorrelation:```{r}# Next, prewhitening# Fit AR(1) Modelseason_ar_fit <-arima(ts(tester$seasonal_resid), order =c(1, 0, 0))# Extract Residualstester$season_detrend_prewhitened <-as.numeric(residuals(season_ar_fit))# Make the Detrended Datasettester %>%ggplot(aes(x = time)) +geom_line(aes(y = bottom_t, color ="Original Timeseries"), alpha =0.75, linewidth =1) +geom_line(aes(y = seasonal_fit, color ="Seasonal Cycle + Trend"), alpha =0.75, linewidth =0.5) +geom_line(aes(y = seasonal_resid, color ="Detrend: Seasonal Cycle + Trend"), alpha =0.75, linewidth =1) +geom_line(aes(y = season_detrend_prewhitened, color ="Seasonality + Trend Removed + PW"), alpha =0.75, linewidth =0.5) +labs(title ="Full Prep: Remove Seasonality + Trend, Pre-whiten")```If we pass this new pre-procesed timeseries into the rstars function we can get the following new changepoints:```{r}# Need it to be a dataframe not a tibble for indexing to workpreprocess_df <-as.data.frame(tester[,c("time", "season_detrend_prewhitened")])# Run it without prewhitening# Or do the prewhitening on the detrended data?preprocess_rstars <-rstars(data.timeseries = preprocess_df, l.cutoff =365*7, pValue =0.05, Huber =1, Endfunction = T, # Some behavior at the end of the timeseriespreWhitening = F, # Apply prewhitening T/FOLS = F, # OLSprewhitening method T/FMPK = F, # Marriott-Pope + Kennedy prewhitening method T/FIP4 = F, # IP4 prewhitening method T/FSubsampleSize = (365*7+1)/3, # subsampling rate for hubers + prewhiteningreturnResults = T # Return the results as a dataframe ) # Results now look like this:ggplot(preprocess_rstars) +geom_line(aes(time, season_detrend_prewhitened), linewidth =0.5, alpha =0.5) +geom_line(aes(time, regime_mu), linewidth =1) +geom_vline(data =filter(preprocess_rstars, RSI !=0),aes(xintercept = time), color ="red", linewidth =1) +scale_y_continuous(limits =c(-0.35, 0.35)) +labs(x ="", y ="Detrended Bottom Temp Anomaly",title ="STARS Results - After full pre-processing steps",subtitle = test_area)```# EnvCpt Implementation{EnvCpt} is another changepoint implementation that fits a variety of model forms based on the work of Beaulieu and Killick (2018) .This is Rodionov's rundown of their methods:> They used their own method named EnvCpt that fits eight models with different combinations of a trend, changepoints (regime shifts) and red noise and then identifies the most appropriate one according to the Akaike information criterion.This approach throws a range of 12 possible trend and changepoint model structures at a dataset and can quickly visualize each. AIC/BIC are then used to evaluate support for the different changepoint structures.```{r}# The envcpt function is the workhorsetest_envcpt <-envcpt(tester$bottom_t, minseglen =365*7)# Can plot the breaks found easilyplot(test_envcpt)# Details on bestpull_best <-function(x){ best <-names(which.min(AIC(x))) best_summ <- x[[best]] x$best_type <- best x$best_summ <- best_summreturn(x)}```With this test timeseries the structure with most support is the "meanar2cpt", which is a mean changepoint structure with an Ar(2) structure:```{r}# Pull details on the winner (most parsimonious)best_cpt <-pull_best(test_envcpt)# best_cpt$best_type# best_cpt$best_summ# test_envcpt$trendar2cpt# Plot GOM, the only one with changepoints#str(best_cpt$best_summ) # structure#best_cpt$best_summ@cpts # changepoint locationsggplot(tester) +geom_line(aes(time, bottom_t), linewidth =0.5, color ="gray50") +geom_vline(data =data.frame(x = tester$time[best_cpt$best_summ@cpts]),aes(xintercept = x), linetype =1, linewidth =1, color ="royalblue") +theme_classic() +labs(x ="time", y ="bottom temperature", title = best_cpt$best_type)```As with STARS it may be necessary to remove a trend prior to changepoint detection.Here are the results when done on the pre-processed timeseries from above:```{r}# The envcpt function is the workhorsepreprocess_envcpt <-envcpt( preprocess_df$season_detrend_prewhitened, minseglen =365*7)# Can plot the breaks found easilyplot(preprocess_envcpt)# Pull details on the winner (most parsimonious)best_cpt <-pull_best(preprocess_envcpt)#ggplot(preprocess_df) +geom_line(aes(time, season_detrend_prewhitened), linewidth =0.5, color ="gray50") +geom_vline(data =data.frame(x = tester$time[best_cpt$best_summ@cpts]),aes(xintercept = x), linetype =1, linewidth =1, color ="royalblue") +theme_classic() +labs(x ="time", y ="bottom temperature", title = best_cpt$best_type)```