library(impart)
library(deming) # Required for deming::theilsen()

This vignette demonstrates how to use impart to monitor the amount of information contained in the accrued data so that analyses are neither under- nor over-powered. To see all available vignettes in impart, use the vignette() function:

vignette(package = "impart")

Monitoring Ongoing Studies

In a fixed sample size design, the timing of the final analysis is based on the last participant’s last visit. Group sequential analyses are performed when pre-specified fractions of the maximum sample size have their final outcome observed. The timing of analyses in such studies depends only on counting the number of final outcomes observed during the study.

In information-monitored designs, the times at which analyses are performed and recruitment is stopped depend on when the information reaches pre-specified thresholds. The amount of information contained in the data depends on the number of observations, the completeness of the data, the analytic methods used, and the interrelationships among the observed data. Depending on the analysis methods used, an individual could contribute baseline covariates and treatment assignment information, post-randomization outcomes, and even post-randomization auxiliary variables.

Challenges in Information Monitoring

A new challenge in analyzing data in an ongoing trial is how to appropriately treat missing data. Suppose the following are observed:

  • A participant is newly enrolled, has their baseline covariates measured, is randomized, is still on study, and has not yet entered any follow-up windows
  • A participant is randomized, but did not attend any follow-up visits

In both of these cases, the covariates and treatment assignment are observed, but the outcomes are missing. In the first case, outcomes are not yet observed: while they are missing, they could still be observed once the participant enters the study window. In the second case, they are known to be missing: the study window has closed without the outcome being observed.

A convention can be used to differentiate outcomes that are not yet observed from those known to be missing:

  • Completed assessments have both an observed outcome and an observed assessment time.
  • Missed assessments have a missing outcome, but an observed assessment time:
    • If an assessment occurred, but no valid outcome was obtained, the time of the assessment is used.
    • If an assessment was missed, the end of the study window is used, indicating the last time an outcome could have been observed per protocol.
  • Not-yet-observed assessments are missing both an outcome and an assessment time.
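
The convention above can be sketched in base R. This is a minimal illustration, not part of impart: the helper name classify_assessment and the toy values are assumptions made for this example.

```r
# Hypothetical helper (not part of impart): classify one assessment
# from its outcome value and its recorded assessment time.
classify_assessment <- function(outcome, assessment_time) {
  ifelse(
    !is.na(outcome) & !is.na(assessment_time), "completed",
    ifelse(
      is.na(outcome) & !is.na(assessment_time), "missed",
      ifelse(
        is.na(outcome) & is.na(assessment_time), "not yet observed",
        NA_character_ # Outcome without a time: violates the convention
      )
    )
  )
}

# Toy values for illustration only
outcome <- c(1.59, NA, NA)
assessment_time <- c(25.0, 55.1, NA)

status <- classify_assessment(outcome, assessment_time)
print(status)
```

An outcome recorded without an assessment time does not fit any of the three categories, so the sketch returns NA for it rather than guessing.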

Example Data

Information monitoring will be illustrated using a simulated dataset named example_1:

head(example_1)
#>   .id        x_1        x_2        x_3        x_4 tx      y_1       y_2
#> 1   1  2.0742970  0.1971432 -0.8425884  0.2794844  0 1.591873 -4.535711
#> 2   2  0.2165473 -0.7384296  0.1315016 -1.2419134  1       NA        NA
#> 3   3  0.8294726  0.4997821  1.6932555 -0.4063889  0       NA        NA
#> 4   4 -1.0206893 -0.2189937 -1.7719120  0.1936013  1 1.212620 -4.533776
#> 5   5 -0.0417332  0.9282685  0.8078133  0.9317145  0 8.655326  6.970372
#> 6   6  0.7275778  1.1756811  0.0226265 -0.2556343  1 6.902055 17.381316
#>        y_3       y_4 .enrollment_time     .t_1      .t_2      .t_3     .t_4
#> 1 13.98543 -1.320242          2.24846 25.01538  56.62636  98.51499 133.0050
#> 2       NA        NA         11.05565 55.05565  85.05565 115.05565 145.0556
#> 3       NA        NA         16.96591 60.96591  90.96591 120.96591 150.9659
#> 4 11.17615 -6.629545         25.13396 59.84544  81.58577 127.78659 154.3419
#> 5 17.62329  9.126240         50.07301 75.94952 111.21967 143.00338 181.4162
#> 6 -2.42570  3.549977         50.93935 80.29181 114.96781 151.34249 177.3846

The dataset is in wide format, with one row per individual. There are four baseline covariates: x_1, x_2, x_3, and x_4. Treatment assignment is indicated by a binary indicator (tx), with the time from study initiation to randomization indicated by .enrollment_time. There are four outcomes assessed at 30, 60, 90, and 120 days post-randomization: y_1, y_2, y_3 and y_4. The time to each observed outcome is indicated in columns .t_1, .t_2, .t_3 and .t_4. For missed study visits, the time recorded is the last day within the study window at which the individual’s outcome could have been assessed.

Preparing Interim Datasets

The function prepare_monitored_study_data allows the user to create indicator variables for each outcome, with each variable indicating whether an outcome is observed (1), is known to be missing (0), or has not yet been observed (NA). This allows software to appropriately handle missing information during an ongoing study.

prepare_monitored_study_data retains only the columns of data relevant to the analysis at hand: covariates, study entry/enrollment time, treatment assignment, outcomes, and the times at which outcomes were measured.

# Obtain time of last event
last_event <-
  example_1[, c(".enrollment_time", ".t_1", ".t_2", ".t_3", ".t_4")] |>
  unlist() |>
  max(na.rm = TRUE) |>
  ceiling()

prepared_final_data <-
  prepare_monitored_study_data(
    data = example_1,
    study_time = last_event,
    id_variable = ".id",
    covariates_variables = c("x_1", "x_2", "x_3", "x_4"),
    enrollment_time_variable = ".enrollment_time",
    treatment_variable = "tx",
    outcome_variables = c("y_1", "y_2", "y_3", "y_4"),
    outcome_time_variables = c(".t_1", ".t_2", ".t_3", ".t_4"), 
    # Observe missingness 1 week after target study visit
    observe_missing_times = c(30, 60, 90, 120) + 7
  )

The resulting object contains the prepared dataset, the original dataset, the study time of the data lock, and a list of variables and their role in analyses.

Reverting to Information Earlier in the Study

These conventions can also be used to take a dataset from one point in time during the study and revert it to the information that was available at an earlier point in the study. This can be useful for determining how quickly information is accruing during an ongoing study. Let $(X_{1}, \ldots, X_{p})$, $A$, $(Y_{1}, \ldots, Y_{J})$, and $(T_{Y_{1}}, \ldots, T_{Y_{J}})$ respectively denote baseline covariates, treatment assignment, the outcomes observed at study visits $j = 1, \ldots, J$, and the times at which the study outcomes are observed. Let $T_{A}$ denote the time of randomization, and let $w^{c}_{j}$ denote the closing of the study window for visit $j$.

To obtain the information available at $t$ days after the randomization of the first participant:

  1. Retain only participants where $T_{A} < t$: i.e. those randomized by study time $t$
  2. Set outcome $Y_{j}$ to unobserved if $T_{Y_{j}} > t$: i.e. outcomes not observed by time $t$
  3. Set $Y_{j}$ to missing if $T_{Y_{j}} - T_{A} > w^{c}_{j}$: otherwise, treat the outcome as not yet observed.
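
The three steps above can be sketched in base R for a single outcome. This is an illustration under stated assumptions, not the implementation used by data_at_time_t(), which may differ in detail: the helper revert_to_time, the argument window_close, and the column names of the toy data frame are hypothetical. The toy values are the enrollment times, y_1 values, and .t_1 values from the first six rows of example_1, and window_close = 37 mirrors the first entry of observe_missing_times used earlier (30 + 7).

```r
# Hypothetical helper (not part of impart): revert a single outcome to
# the information available at study time `t`.
revert_to_time <- function(data, t, window_close) {
  # Step 1: retain participants randomized before study time t
  data <- data[data$enrollment_time < t, , drop = FALSE]

  # Step 2: outcomes with a recorded time after t are not yet observed
  not_yet <- !is.na(data$outcome_time) & data$outcome_time > t
  data$outcome[not_yet] <- NA
  data$outcome_time[not_yet] <- NA

  # Step 3: indicator is 1 if observed, 0 if known missing (recorded
  # time falls after the window closes), NA if not yet observed
  elapsed <- data$outcome_time - data$enrollment_time
  data$r <- ifelse(
    !is.na(data$outcome), 1,
    ifelse(!is.na(elapsed) & elapsed > window_close, 0, NA)
  )
  data
}

# First six rows of example_1 (enrollment time, y_1, and .t_1)
toy <- data.frame(
  id = 1:6,
  enrollment_time = c(2.24846, 11.05565, 16.96591, 25.13396, 50.07301, 50.93935),
  outcome = c(1.591873, NA, NA, 1.212620, 8.655326, 6.902055),
  outcome_time = c(25.01538, 55.05565, 60.96591, 59.84544, 75.94952, 80.29181)
)

at_90 <- revert_to_time(toy, t = 90, window_close = 37)
at_50 <- revert_to_time(toy, t = 50, window_close = 37)
print(at_90[, c("id", "r")])
print(at_50[, c("id", "r")])
```

At study day 50 only the first four participants have been randomized, and only participant 1 has an outcome on record; the other three are not yet observed.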

For example, the data known at 90 days can be obtained using the data_at_time_t() function as follows:

data_90 <-
  data_at_time_t(
    prepared_data = prepared_final_data,
    study_time = 90
  )

Consider 90 days after study initiation:

  • Participants 1, 4, 5, and 6 have had their first post-randomization outcome obtained: .r_1 is 1
  • Participants 2 and 3 are known to have missed their first post-randomization assessment: y_1 is NA and .r_1 is 0
  • No participant in these rows has a first assessment still pending at day 90, which would appear as .r_1 being NA

show_cols <- c(".id", "x_2", "x_3", "x_4", "tx")

# Original Data
data_90$original_data[1:6, c(show_cols, ".enrollment_time", ".t_1", "y_1")]
#>   .id        x_2        x_3        x_4 tx .enrollment_time     .t_1      y_1
#> 1   1  0.1971432 -0.8425884  0.2794844  0          2.24846 25.01538 1.591873
#> 2   2 -0.7384296  0.1315016 -1.2419134  1         11.05565 55.05565       NA
#> 3   3  0.4997821  1.6932555 -0.4063889  0         16.96591 60.96591       NA
#> 4   4 -0.2189937 -1.7719120  0.1936013  1         25.13396 59.84544 1.212620
#> 5   5  0.9282685  0.8078133  0.9317145  0         50.07301 75.94952 8.655326
#> 6   6  1.1756811  0.0226265 -0.2556343  1         50.93935 80.29181 6.902055

# Data known at study day 90
data_90$data[1:6, c(show_cols, ".e", ".t_1", "y_1", ".r_1")]
#>   .id        x_2        x_3        x_4 tx       .e     .t_1      y_1 .r_1
#> 1   1  0.1971432 -0.8425884  0.2794844  0  2.24846 25.01538 1.591873    1
#> 2   2 -0.7384296  0.1315016 -1.2419134  1 11.05565 55.05565       NA    0
#> 3   3  0.4997821  1.6932555 -0.4063889  0 16.96591 60.96591       NA    0
#> 4   4 -0.2189937 -1.7719120  0.1936013  1 25.13396 59.84544 1.212620    1
#> 5   5  0.9282685  0.8078133  0.9317145  0 50.07301 75.94952 8.655326    1
#> 6   6  1.1756811  0.0226265 -0.2556343  1 50.93935 80.29181 6.902055    1

If observe_missing_times is set to 0 for each outcome, any outcome that is NA and has a recorded time of assessment is treated as missing.

Plotting Observed Number of Outcomes

One part of monitoring involves determining how many individuals contribute a given type of information to analyses, including baseline covariates, intermediate outcomes, and final outcomes. For binary and time-to-event analyses, monitoring should include the number of observed events, not just the number of participants who have completed monitoring.

Plotting can be done for all available events:

# Plot events at the end of the study
plot_outcome_counts(
  prepared_data = prepared_final_data
)

A plot of the study events over the entire study: randomized participants, intermediate outcomes, and final outcomes.

Plotting can also be done for events known at a particular point in time:

# Plot events two years into the study
plot_outcome_counts(
  prepared_data = prepared_final_data,
  study_time = 731 # 2 Years
)

A plot of the study events over the first two years of the study: randomized participants, intermediate outcomes, and final outcomes.

Tabulating Event Times

The times of each event can be obtained using count_outcomes:

example_1_counts <-
  count_outcomes(
    prepared_data = prepared_final_data
  )

# Timing of first n randomizations
subset(
  example_1_counts,
  event == "randomization"
) |> head()
#>           event     time count_total count_complete count_events
#> 1 randomization  2.24846           1              1            1
#> 2 randomization 11.05565           2              2            2
#> 3 randomization 16.96591           3              3            3
#> 4 randomization 25.13396           4              4            4
#> 5 randomization 50.07301           5              5            5
#> 6 randomization 50.93935           6              6            6

# Timing of first n observations of `y_4`
subset(
  example_1_counts,
  event == "y_4"
) |> head()
#>      event     time count_total count_complete count_events
#> 1201   y_4 133.0050           1              1           NA
#> 1202   y_4 145.0556           2              1           NA
#> 1203   y_4 150.9659           3              1           NA
#> 1204   y_4 154.3419           4              2           NA
#> 1205   y_4 171.3021           5              3           NA
#> 1206   y_4 177.3249           6              4           NA

# Find when n observations of `y_4` are first available:
subset(
  example_1_counts,
  event == "y_4" & count_complete == 70
)
#>      event     time count_total count_complete count_events
#> 1296   y_4 707.0069          96             70           NA

This can also be used with data_at_time_t to reconstruct the study data when a particular number of observations have accrued:

# Reconstruct the data when N = 70 final outcomes were obtained
data_n_final_70 <-
  data_at_time_t(
    prepared_data = prepared_final_data,
    study_time = 
      # Time when 70 final outcomes are observed:
      ceiling(
        subset(
          example_1_counts,
          event == "y_4" & count_complete == 70
        )$time
      )
  )

data_n_70 <- data_n_final_70$data
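
The idea behind the tabulation of event times can be sketched in base R: sort the assessment times, then accumulate the total number of assessments and the number with an observed outcome. This is an illustration only, not the implementation of count_outcomes. It uses the y_4 and .t_4 values from the first six rows of example_1, so the counts agree with the count_outcomes output only until another participant's assessment falls in between.

```r
# y_4 and .t_4 for the first six rows of example_1
t_4 <- c(133.0050, 145.0556, 150.9659, 154.3419, 181.4162, 177.3846)
y_4 <- c(-1.320242, NA, NA, -6.629545, 9.126240, 3.549977)

# Order assessments by study time, then accumulate counts
ord <- order(t_4)
counts <- data.frame(
  event = "y_4",
  time = t_4[ord],
  count_total = seq_along(ord),              # All assessments to date
  count_complete = cumsum(!is.na(y_4[ord]))  # Assessments with an outcome
)
print(counts)

# First time at which 3 complete outcomes are available
first_time_3 <- counts$time[which(counts$count_complete >= 3)[1]]
print(first_time_3)
```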

Monitoring Information Levels

Information can be computed using the estimate_information function: users pass the function which conducts the analysis (estimation_function), along with a list of parameters the function requires (estimation_arguments). When there are multiple analyses, the orthogonalize argument specifies whether the test statistics and covariance should be orthogonalized to meet the independent increments assumption. A random number generator seed (rng_seed) can be supplied for reproducibility. By default, only the variance and information are returned, which can be compared against the target information level for analyses:

information_n_70 <-
  estimate_information(
    data = data_n_70,
    monitored_design = NULL,
    estimation_function = standardization,
    estimation_arguments =
      list(
        estimand = "difference",
        y0_formula = y_4 ~ x_1 + x_2 + x_3 + x_4,
        y1_formula = y_4 ~ x_1 + x_2 + x_3 + x_4,
        family = gaussian,
        treatment_column = "tx",
        outcome_indicator_column = ".r_4"
      ),
    orthogonalize = TRUE,
    rng_seed = 23456
  )

information_n_70$covariance_uncorrected
#>           estimates
#> estimates  4.333183
information_n_70$information
#> estimates 
#> 0.2307772
information_n_70$covariance_orthogonal_uncorrected
#>           estimates
#> estimates  4.333183
information_n_70$information_orthogonal
#>           estimates
#> estimates 0.2307772
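
In the output above, the reported information is the reciprocal of the reported variance. This can be checked directly from the printed values:

```r
# Variance reported in covariance_uncorrected above
variance <- 4.333183

# Information is the inverse of the variance of the estimator
information <- 1 / variance
print(information)
```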

When a study is designed with interim and final analyses, monitored_design allows the user to pass the results of previously conducted analyses to estimate_information.

Using the design specified in the impart_study_design vignette:

# Universal Study Design Parameters
minimum_difference <- 5 # Effect Size: Difference in Means of 5 or greater
alpha <- 0.05 # Type I Error Rate
power <- 0.9 # Statistical Power
test_sides <- 2 # Direction of Alternatives

# Determine information required to achieve desired power at fixed error rate
information_single_stage <-
  impart::required_information_single_stage(
    delta = minimum_difference,
    alpha = alpha,
    power = power
  )

# Group Sequential Design Parameters
information_rates <-
  c(0.50, 0.75, 1.00) # Analyses at 50%, 75%, and 100% of the Total Information
type_of_design <- "asOF" # O'Brien-Fleming Alpha Spending
type_beta_spending <- "bsOF" # O'Brien-Fleming Beta Spending

# Set up group sequential testing procedure
trial_design <-
  rpact::getDesignGroupSequential(
    alpha = alpha,
    beta = 1 - power,
    sided = 2,
    informationRates = information_rates,
    typeOfDesign = type_of_design,
    typeBetaSpending = type_beta_spending,
    bindingFutility = FALSE
  )

# Inflate information level to account for multiple testing
information_adaptive <-
  impart::required_information_sequential(
    information_single_stage = information_single_stage,
    trial_design = trial_design
  )

# Initialize the monitored design
monitored_design <-
  initialize_monitored_design(
    trial_design = trial_design,
    null_value = 0,
    maximum_sample_size = 280,
    information_target = information_adaptive,
    orthogonalize = TRUE,
    rng_seed_analysis = 54321
  )

The current information fraction can be computed:

information_n_70$information/information_adaptive
#> estimates 
#> 0.5062808
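
As a rough check on these quantities, the sketch below computes the single-stage information from the textbook formula for a two-sided test, $\left((z_{1 - \alpha/2} + z_{1 - \beta})/\delta\right)^{2}$, and recovers the information target implied by the outputs above. That required_information_single_stage uses this exact formula is an assumption, not something confirmed here, and the variable names are chosen for this example.

```r
# Design parameters used above
delta <- 5
alpha <- 0.05
power <- 0.9

# Textbook single-stage information for a two-sided test (assumed to
# correspond to what required_information_single_stage computes)
info_single_stage <- ((qnorm(1 - alpha / 2) + qnorm(power)) / delta)^2

# Target implied by the printed outputs: information / information fraction
implied_target <- 0.2307772 / 0.5062808

print(info_single_stage)
print(implied_target)
```

The implied target is larger than the single-stage value, which is the direction expected when the information level is inflated to account for multiple analyses.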

Tools for Monitoring Information

While estimate_information can provide an estimate of the information at a particular point in the study, understanding the rate at which information is accruing over time can be useful in projecting when pre-specified information levels may be met:

data_n_70_trajectory <- 
  information_trajectory(
    prepared_data = data_n_final_70,
    monitored_design = monitored_design,
    estimation_function = standardization,
    estimation_arguments =
      list(
        estimand = "difference",
        y0_formula = y_4 ~ x_1 + x_2 + x_3 + x_4,
        y1_formula = y_4 ~ x_1 + x_2 + x_3 + x_4,
        family = gaussian,
        treatment_column = "tx",
        outcome_indicator_column = ".r_4"
      ),
    correction_function = standardization_correction,
    orthogonalize = TRUE,
    n_min = 40,
    n_increment = 2,
    rng_seed = 23456,
    # Note: the default is control = monitored_analysis_control(), which
    # uses more bootstrap replicates than the testing control used here
    control = monitored_analysis_control_testing()
  )

data_n_70_trajectory
#>       times randomization y_1 y_2 y_3 y_4 information information_lag_1
#> 1  491.3059            79  59  49  43  40  0.02354317                NA
#> 2  506.9791            83  64  52  46  42  0.06373427        0.02354317
#> 3  508.6034            83  64  52  46  42  0.06373427        0.06373427
#> 4  518.1410            84  64  54  47  44  0.06273970        0.06373427
#> 5  527.2583            85  65  54  50  44  0.06569907        0.06273970
#> 6  527.6485            85  65  54  50  46  0.06672800        0.06569907
#> 7  538.5259            86  68  56  52  48  0.11713994        0.06672800
#> 8  540.6071            87  68  57  52  48  0.09814035        0.11713994
#> 9  551.8682            90  69  57  53  50  0.12625125        0.09814035
#> 10 555.9573            92  70  60  54  50  0.12003936        0.12625125
#> 11 578.2366            95  73  61  56  52  0.14664434        0.12003936
#> 12 579.6637            95  73  61  57  52  0.14664434        0.14664434
#> 13 582.8345            95  73  63  57  52  0.14664434        0.14664434
#> 14 597.3935            96  77  63  58  54  0.15105217        0.14664434
#> 15 630.7461           100  79  70  64  56  0.14185643        0.15105217
#> 16 633.0925           100  79  70  64  58  0.15632099        0.14185643
#> 17 635.4027           100  79  70  64  60  0.16605940        0.15632099
#> 18 658.6659           109  82  72  69  62  0.18410677        0.16605940
#> 19 662.5691           110  82  72  69  64  0.17192467        0.18410677
#> 20 675.4510           113  87  74  70  66  0.17923912        0.17192467
#> 21 701.1256           117  96  80  71  68  0.17267007        0.17923912
#> 22 707.0069           117  96  81  73  70  0.18025072        0.17267007
#>    information_change information_pct_change information_fraction
#> 1                  NA                     NA           0.05164918
#> 2        0.0401910997              63.060421           0.13982070
#> 3        0.0000000000               0.000000           0.13982070
#> 4       -0.0009945712              -1.585234           0.13763880
#> 5        0.0029593721               4.504435           0.14413109
#> 6        0.0010289294               1.541975           0.14638836
#> 7        0.0504119382              43.035653           0.25698243
#> 8       -0.0189995846             -19.359605           0.21530101
#> 9        0.0281108949              22.265835           0.27697089
#> 10      -0.0062118854              -5.174874           0.26334321
#> 11       0.0266049799              18.142520           0.32170941
#> 12       0.0000000000               0.000000           0.32170941
#> 13       0.0000000000               0.000000           0.32170941
#> 14       0.0044078303               2.918085           0.33137934
#> 15      -0.0091957493              -6.482434           0.31120564
#> 16       0.0144645633               9.253117           0.34293810
#> 17       0.0097384163               5.864417           0.36430230
#> 18       0.0180473670               9.802663           0.40389475
#> 19      -0.0121821058              -7.085723           0.37716956
#> 20       0.0073144562               4.080837           0.39321606
#> 21      -0.0065690556              -3.804397           0.37880482
#> 22       0.0075806509               4.205615           0.39543531

This trajectory can be smoothed using regression, ideally using a method resistant to outliers, such as deming::theilsen. Inverse regression can be used to obtain an estimated number of outcomes necessary to achieve a given level of information:

plot(
  information_fraction ~ y_4,
  data = data_n_70_trajectory,
  ylim = c(0, 1)
)

abline(
  lm(
    formula = information_fraction ~ y_4,
    data = data_n_70_trajectory
  ),
  lty = 1
)

# Requires `deming` package
abline(
  deming::theilsen(
    formula = information_fraction ~ y_4,
    data = data_n_70_trajectory
  ),
  lty = 3
)

abline(
  h = monitored_design$original_design$information_fractions,
  lty = 2
)

A plot of the time series of information with linear regression fits superimposed on top.
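
Inverse regression on the trajectory can be sketched as follows. To stay within base R, this example uses lm rather than deming::theilsen, so it is less resistant to outliers than the approach recommended above. The y_4 counts and information fractions are copied from the printed trajectory, and the targets are the information rates of the design. Extrapolating a straight line well beyond the observed counts is a strong assumption, so the results are rough projections only.

```r
# Final outcome counts and information fractions from the trajectory above
trajectory <- data.frame(
  y_4 = c(40, 42, 42, 44, 44, 46, 48, 48, 50, 50, 52,
          52, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70),
  information_fraction = c(
    0.05164918, 0.13982070, 0.13982070, 0.13763880, 0.14413109,
    0.14638836, 0.25698243, 0.21530101, 0.27697089, 0.26334321,
    0.32170941, 0.32170941, 0.32170941, 0.33137934, 0.31120564,
    0.34293810, 0.36430230, 0.40389475, 0.37716956, 0.39321606,
    0.37880482, 0.39543531
  )
)

# Regress information fraction on the number of final outcomes
fit <- lm(information_fraction ~ y_4, data = trajectory)
intercept <- coef(fit)[["(Intercept)"]]
slope <- coef(fit)[["y_4"]]

# Invert the fitted line: outcomes needed to reach each target fraction
targets <- c(0.50, 0.75, 1.00)
n_needed <- ceiling((targets - intercept) / slope)
print(data.frame(target = targets, n_needed = n_needed))
```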