Monitoring Information for a Continuous Outcome
Source: vignettes/monitoring_continuous.Rmd
This vignette demonstrates how to use impart to monitor the amount of information contained in the accrued data so that analyses are not under- or over-powered. To see all available vignettes in impart, use the vignette() command:
vignette(package = "impart")
Monitoring Ongoing Studies
In a fixed sample size design, the timing of the final analysis is based on the last participant’s last visit. Group sequential analyses are performed when pre-specified fractions of the maximum sample size have their final outcome observed. Timing analyses in such studies depends only on counting the number of final outcomes observed during the study.
In information-monitored designs, the times at which analyses are performed and recruitment is stopped depend on when the information reaches pre-specified thresholds. The amount of information contained in the data depends on the number of observations, the completeness of the data, the analytic methods used, and the interrelationships among the observed data. Depending on the analysis methods used, an individual could contribute baseline covariates and treatment assignment information, post-randomization outcomes, and even post-randomization auxiliary variables.
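As a rough illustration (a base-R sketch, not impart code), the statistical information about a difference in means is the reciprocal of the variance of its estimator: it grows as more outcomes are observed and shrinks as outcome variability increases.
# Base-R sketch: under equal allocation, the information for a difference in
# means is 1 / Var(estimate) = n_per_arm / (2 * sd_outcome^2)
information_difference_in_means <- function(n_per_arm, sd_outcome) {
  variance_of_estimate <- 2*(sd_outcome^2)/n_per_arm
  1/variance_of_estimate
}

information_difference_in_means(n_per_arm = c(50, 100, 150), sd_outcome = 10)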
Challenges in Information Monitoring
A new challenge in analyzing data in an ongoing trial is how to appropriately treat missing data. Suppose the following are observed:
- A participant is newly enrolled, has their baseline covariates measured, is randomized, is still on study, and has not yet entered any follow-up windows
- A participant is randomized, but did not attend any follow-up visits
In both of these cases, the covariates and treatment assignment are observed, but the outcomes are missing. In the first case, the outcomes are not yet observed: while they are missing, they could still be observed once the participant enters the follow-up window. In the second case, they are known to be missing: the study window closed without the outcome being observed.
A convention can be used to differentiate outcomes that are not yet observed from those known to be missing:
- Completed assessments have both an observed outcome and an observed assessment time.
- Missed assessments have a missing outcome, but an observed assessment time:
- If an assessment occurred, but no valid outcome was obtained, the time of the assessment is used.
- If an assessment was missed, the end of the study window is used, indicating the last time an outcome could have been observed per protocol.
- Not-yet-observed assessments are missing both an outcome and an assessment time.
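The sketch below (plain base R with hypothetical values, not impart code) encodes these three cases using an outcome column, an assessment-time column, and a derived indicator.
# Hypothetical rows: a completed assessment, a missed assessment (outcome
# missing, window-close time recorded), and a not-yet-observed assessment
assessments <-
  data.frame(
    y_1 = c(1.6, NA, NA),
    .t_1 = c(25.0, 55.1, NA)
  )

# Indicator: 1 = observed, 0 = known to be missing, NA = not yet observed
assessments$.r_1 <-
  ifelse(
    test = !is.na(assessments$y_1),
    yes = 1,
    no = ifelse(test = !is.na(assessments$.t_1), yes = 0, no = NA)
  )

assessments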
Example Data
Information monitoring will be illustrated using a simulated dataset named example_1:
head(example_1)
#> .id x_1 x_2 x_3 x_4 tx y_1 y_2
#> 1 1 2.0742970 0.1971432 -0.8425884 0.2794844 0 1.591873 -4.535711
#> 2 2 0.2165473 -0.7384296 0.1315016 -1.2419134 1 NA NA
#> 3 3 0.8294726 0.4997821 1.6932555 -0.4063889 0 NA NA
#> 4 4 -1.0206893 -0.2189937 -1.7719120 0.1936013 1 1.212620 -4.533776
#> 5 5 -0.0417332 0.9282685 0.8078133 0.9317145 0 8.655326 6.970372
#> 6 6 0.7275778 1.1756811 0.0226265 -0.2556343 1 6.902055 17.381316
#> y_3 y_4 .enrollment_time .t_1 .t_2 .t_3 .t_4
#> 1 13.98543 -1.320242 2.24846 25.01538 56.62636 98.51499 133.0050
#> 2 NA NA 11.05565 55.05565 85.05565 115.05565 145.0556
#> 3 NA NA 16.96591 60.96591 90.96591 120.96591 150.9659
#> 4 11.17615 -6.629545 25.13396 59.84544 81.58577 127.78659 154.3419
#> 5 17.62329 9.126240 50.07301 75.94952 111.21967 143.00338 181.4162
#> 6 -2.42570 3.549977 50.93935 80.29181 114.96781 151.34249 177.3846
The dataset is in wide format, with one row per individual. There are four baseline covariates: x_1, x_2, x_3, and x_4. Treatment assignment is indicated by a binary indicator (tx), with the time from study initiation to randomization indicated by .enrollment_time. There are four outcomes assessed at 30, 60, 90, and 120 days post-randomization: y_1, y_2, y_3, and y_4. The time to each observed outcome is indicated in columns .t_1, .t_2, .t_3, and .t_4. For missed study visits, the time recorded is the last day within the study window at which the individual’s outcome could have been assessed.
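As a quick check (a base-R sketch; for missed visits the recorded time is the end of the study window rather than the target day, so values run slightly past the targets), the elapsed time from randomization to each recorded assessment can be summarized:
# Days from randomization to each recorded assessment time
days_from_randomization <-
  example_1[, c(".t_1", ".t_2", ".t_3", ".t_4")] - example_1$.enrollment_time

summary(days_from_randomization)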
Preparing Interim Datasets
The prepare_monitored_study_data function allows the user to create indicator variables for each outcome, with each variable indicating whether an outcome is observed (1), is known to be missing (0), or has not yet been observed (NA). This allows software to appropriately handle missing information during an ongoing study.
prepare_monitored_study_data retains only the columns of data relevant to the analysis at hand: covariates, study entry/enrollment time, treatment assignment, outcomes, and the times at which outcomes were measured.
# Obtain time of last event
last_event <-
example_1[, c(".enrollment_time", ".t_1", ".t_2", ".t_3", ".t_4")] |>
unlist() |>
max(na.rm = TRUE) |>
ceiling()
prepared_final_data <-
prepare_monitored_study_data(
data = example_1,
study_time = last_event,
id_variable = ".id",
covariates_variables = c("x_1", "x_2", "x_3", "x_4"),
enrollment_time_variable = ".enrollment_time",
treatment_variable = "tx",
outcome_variables = c("y_1", "y_2", "y_3", "y_4"),
outcome_time_variables = c(".t_1", ".t_2", ".t_3", ".t_4"),
# Observe missingness 1 week after target study visit
observe_missing_times = c(30, 60, 90, 120) + 7
)
The resulting object contains the prepared dataset, the original dataset, the study time of the data lock, and a list of variables and their role in analyses.
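These components can be inspected directly (a short sketch using base R):
# Top-level components of the prepared data object
str(prepared_final_data, max.level = 1)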
Reverting to Information Earlier in the Study
These conventions can also be used to take a dataset from one point in time during the study, and revert to the information that was only available at an earlier point in the study. This can be useful for determining how quickly information is accruing during an ongoing study. Let $X$, $A$, $Y_{j}$, and $T_{j}$ respectively denote baseline covariates, treatment assignment, the outcome observed at study visit $j$, and the time at which the outcome at visit $j$ is observed. Let $E$ denote the time from study initiation to randomization, and let $C_{j}$ denote the closing of the study window for visit $j$.
To obtain the information available at $t$ days after the randomization of the first participant:
- Retain only participants where $E \le t$: i.e. those randomized by study time $t$
- Set outcome $Y_{j}$ to unobserved if $T_{j} > t$: i.e. outcomes not observed by study time $t$
- Set $Y_{j}$ to missing if $C_{j} \le t$; otherwise, treat the outcome as not yet observed.
For example, the data known at 90 days can be obtained using the data_at_time_t() function as follows:
data_90 <-
data_at_time_t(
prepared_data = prepared_final_data,
study_time = 90
)
Consider 90 days after study initiation:
- Participants 1, 4, 5, and 6 have had their first post-randomization outcome observed: .r_1 is 1
- Participants 2 and 3 are known to have missed their first post-randomization assessment: .r_1 is 0
- A participant whose first outcome was not yet observed and whose assessment window had not yet closed by day 90 would have .r_1 equal to NA
show_cols <- c(".id", "x_2", "x_3", "x_4", "tx")
# Original Data
data_90$original_data[1:6, c(show_cols, ".enrollment_time", ".t_1", "y_1")]
#> .id x_2 x_3 x_4 tx .enrollment_time .t_1 y_1
#> 1 1 0.1971432 -0.8425884 0.2794844 0 2.24846 25.01538 1.591873
#> 2 2 -0.7384296 0.1315016 -1.2419134 1 11.05565 55.05565 NA
#> 3 3 0.4997821 1.6932555 -0.4063889 0 16.96591 60.96591 NA
#> 4 4 -0.2189937 -1.7719120 0.1936013 1 25.13396 59.84544 1.212620
#> 5 5 0.9282685 0.8078133 0.9317145 0 50.07301 75.94952 8.655326
#> 6 6 1.1756811 0.0226265 -0.2556343 1 50.93935 80.29181 6.902055
# Data known at study day 90
data_90$data[1:6, c(show_cols, ".e", ".t_1", "y_1", ".r_1")]
#> .id x_2 x_3 x_4 tx .e .t_1 y_1 .r_1
#> 1 1 0.1971432 -0.8425884 0.2794844 0 2.24846 25.01538 1.591873 1
#> 2 2 -0.7384296 0.1315016 -1.2419134 1 11.05565 55.05565 NA 0
#> 3 3 0.4997821 1.6932555 -0.4063889 0 16.96591 60.96591 NA 0
#> 4 4 -0.2189937 -1.7719120 0.1936013 1 25.13396 59.84544 1.212620 1
#> 5 5 0.9282685 0.8078133 0.9317145 0 50.07301 75.94952 8.655326 1
#> 6 6 1.1756811 0.0226265 -0.2556343 1 50.93935 80.29181 6.902055 1
If observe_missing_times were set to 0 for each outcome, any outcome that is NA and has a recorded time of assessment would be treated as missing.
Plotting Observed Number of Outcomes
One part of monitoring involves determining how many individuals contribute a given type of information to analyses, including baseline covariates, intermediate outcomes, and final outcomes. For binary and time-to-event analyses, monitoring should include the number of observed events, not just the number of participants who have completed monitoring.
Plotting can be done for all available events:
# Plot events at the end of the study
plot_outcome_counts(
prepared_data = prepared_final_data
)
Plotting can also be done for events known at a particular point in time:
# Plot events two years into the study
plot_outcome_counts(
prepared_data = prepared_final_data,
study_time = 731 # 2 Years
)
Tabulating Event Times
The times of each event can be obtained using count_outcomes:
example_1_counts <-
count_outcomes(
prepared_data = prepared_final_data
)
# Timing of first n randomizations
subset(
example_1_counts,
event == "randomization"
) |> head()
#> event time count_total count_complete count_events
#> 1 randomization 2.24846 1 1 1
#> 2 randomization 11.05565 2 2 2
#> 3 randomization 16.96591 3 3 3
#> 4 randomization 25.13396 4 4 4
#> 5 randomization 50.07301 5 5 5
#> 6 randomization 50.93935 6 6 6
# Timing of first n observations of `y_4`
subset(
example_1_counts,
event == "y_4"
) |> head()
#> event time count_total count_complete count_events
#> 1201 y_4 133.0050 1 1 NA
#> 1202 y_4 145.0556 2 1 NA
#> 1203 y_4 150.9659 3 1 NA
#> 1204 y_4 154.3419 4 2 NA
#> 1205 y_4 171.3021 5 3 NA
#> 1206 y_4 177.3249 6 4 NA
# Find when n observations of `y_4` are first available:
subset(
example_1_counts,
event == "y_4" & count_complete == 70
)
#> event time count_total count_complete count_events
#> 1296 y_4 707.0069 96 70 NA
This can also be used with data_at_time_t to reconstruct the study data when a particular number of observations have accrued:
# Reconstruct the data when N = 70 final outcomes were obtained
data_n_final_70 <-
data_at_time_t(
prepared_data = prepared_final_data,
study_time =
# Time when 70 final outcomes are observed:
ceiling(
subset(
example_1_counts,
event == "y_4" & count_complete == 70
)$time
)
)
data_n_70 <- data_n_final_70$data
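A quick base-R check tabulates how many final outcomes are observed, known to be missing, or not yet observed in the reconstructed data:
# Final outcome indicator: 1 = observed, 0 = known missing, NA = not yet observed
table(data_n_70$.r_4, useNA = "ifany")

# Number of observed final outcomes; this should match the 70 targeted above
sum(!is.na(data_n_70$y_4))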
Monitoring Information Levels
Information can be computed using the estimate_information function: users pass the function which conducts the analysis (estimation_function), along with a list of parameters the function requires (estimation_arguments). When there are multiple analyses, the orthogonalize argument specifies whether the test statistics and covariance should be orthogonalized to meet the independent increments assumption. A random number generator seed (rng_seed) can be supplied for reproducibility. By default, only the variance and information are returned, which can be compared against the target information level for analyses:
information_n_70 <-
estimate_information(
data = data_n_70,
monitored_design = NULL,
estimation_function = standardization,
estimation_arguments =
list(
estimand = "difference",
y0_formula = y_4 ~ x_1 + x_2 + x_3 + x_4,
y1_formula = y_4 ~ x_1 + x_2 + x_3 + x_4,
family = gaussian,
treatment_column = "tx",
outcome_indicator_column = ".r_4"
),
orthogonalize = TRUE,
rng_seed = 23456
)
information_n_70$covariance_uncorrected
#> estimates
#> estimates 4.333183
information_n_70$information
#> estimates
#> 0.2307772
information_n_70$covariance_orthogonal_uncorrected
#> estimates
#> estimates 4.333183
information_n_70$information_orthogonal
#> estimates
#> estimates 0.2307772
When a study is designed with interim and final analyses, monitored_design allows the user to pass the results of previously conducted analyses to estimate_information. Using the design specified in the impart_study_design vignette:
# Universal Study Design Parameters
minimum_difference <- 5 # Effect Size: Difference in Means of 5 or greater
alpha <- 0.05 # Type I Error Rate
power <- 0.9 # Statistical Power
test_sides <- 2 # Direction of Alternatives
# Determine information required to achieve desired power at fixed error rate
information_single_stage <-
impart::required_information_single_stage(
delta = minimum_difference,
alpha = alpha,
power = power
)
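# Sanity check (not part of the original vignette): for a two-sided test, the
# usual fixed-design information for detecting a difference `delta` with the
# stated power is ((qnorm(1 - alpha/2) + qnorm(power)) / delta)^2; shown here
# for comparison with `information_single_stage`.
((qnorm(p = 1 - alpha/2) + qnorm(p = power))/minimum_difference)^2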
# Group Sequential Design Parameters
information_rates <-
c(0.50, 0.75, 1.00) # Analyses at 50%, 75%, and 100% of the Total Information
type_of_design <- "asOF" # O'Brien-Fleming Alpha Spending
type_beta_spending <- "bsOF" # O'Brien-Fleming Beta Spending
# Set up group sequential testing procedure
trial_design <-
rpact::getDesignGroupSequential(
alpha = alpha,
beta = 1 - power,
sided = 2,
informationRates = information_rates,
typeOfDesign = type_of_design,
typeBetaSpending = type_beta_spending,
bindingFutility = FALSE
)
# Inflate information level to account for multiple testing
information_adaptive <-
impart::required_information_sequential(
information_single_stage = information_single_stage,
trial_design = trial_design
)
# Initialize the monitored design
monitored_design <-
initialize_monitored_design(
trial_design = trial_design,
null_value = 0,
maximum_sample_size = 280,
information_target = information_adaptive,
orthogonalize = TRUE,
rng_seed_analysis = 54321
)
The current information fraction can be computed:
information_n_70$information/information_adaptive
#> estimates
#> 0.5062808
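This can be compared against the planned analysis timings (a short sketch using objects created above):
# Compare the accrued information fraction to the pre-specified analysis timings
current_fraction <- information_n_70$information/information_adaptive
current_fraction >= information_rates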
Tools for Monitoring Information
While estimate_information can provide an estimate of the information at a particular point in the study, understanding the rate at which information is accruing over time can be useful in projecting when pre-specified information levels may be met:
data_n_70_trajectory <-
information_trajectory(
prepared_data = data_n_final_70,
monitored_design = monitored_design,
estimation_function = standardization,
estimation_arguments =
list(
estimand = "difference",
y0_formula = y_4 ~ x_1 + x_2 + x_3 + x_4,
y1_formula = y_4 ~ x_1 + x_2 + x_3 + x_4,
family = gaussian,
treatment_column = "tx",
outcome_indicator_column = ".r_4"
),
correction_function = standardization_correction,
orthogonalize = TRUE,
n_min = 40,
n_increment = 2,
rng_seed = 23456,
# Note: control = monitored_analysis_control() is the default
# This does more bootstrap replicates by default
control = monitored_analysis_control_testing()
)
data_n_70_trajectory
#> times randomization y_1 y_2 y_3 y_4 information information_lag_1
#> 1 491.3059 79 59 49 43 40 0.02354317 NA
#> 2 506.9791 83 64 52 46 42 0.06373427 0.02354317
#> 3 508.6034 83 64 52 46 42 0.06373427 0.06373427
#> 4 518.1410 84 64 54 47 44 0.06273970 0.06373427
#> 5 527.2583 85 65 54 50 44 0.06569907 0.06273970
#> 6 527.6485 85 65 54 50 46 0.06672800 0.06569907
#> 7 538.5259 86 68 56 52 48 0.11713994 0.06672800
#> 8 540.6071 87 68 57 52 48 0.09814035 0.11713994
#> 9 551.8682 90 69 57 53 50 0.12625125 0.09814035
#> 10 555.9573 92 70 60 54 50 0.12003936 0.12625125
#> 11 578.2366 95 73 61 56 52 0.14664434 0.12003936
#> 12 579.6637 95 73 61 57 52 0.14664434 0.14664434
#> 13 582.8345 95 73 63 57 52 0.14664434 0.14664434
#> 14 597.3935 96 77 63 58 54 0.15105217 0.14664434
#> 15 630.7461 100 79 70 64 56 0.14185643 0.15105217
#> 16 633.0925 100 79 70 64 58 0.15632099 0.14185643
#> 17 635.4027 100 79 70 64 60 0.16605940 0.15632099
#> 18 658.6659 109 82 72 69 62 0.18410677 0.16605940
#> 19 662.5691 110 82 72 69 64 0.17192467 0.18410677
#> 20 675.4510 113 87 74 70 66 0.17923912 0.17192467
#> 21 701.1256 117 96 80 71 68 0.17267007 0.17923912
#> 22 707.0069 117 96 81 73 70 0.18025072 0.17267007
#> information_change information_pct_change information_fraction
#> 1 NA NA 0.05164918
#> 2 0.0401910997 63.060421 0.13982070
#> 3 0.0000000000 0.000000 0.13982070
#> 4 -0.0009945712 -1.585234 0.13763880
#> 5 0.0029593721 4.504435 0.14413109
#> 6 0.0010289294 1.541975 0.14638836
#> 7 0.0504119382 43.035653 0.25698243
#> 8 -0.0189995846 -19.359605 0.21530101
#> 9 0.0281108949 22.265835 0.27697089
#> 10 -0.0062118854 -5.174874 0.26334321
#> 11 0.0266049799 18.142520 0.32170941
#> 12 0.0000000000 0.000000 0.32170941
#> 13 0.0000000000 0.000000 0.32170941
#> 14 0.0044078303 2.918085 0.33137934
#> 15 -0.0091957493 -6.482434 0.31120564
#> 16 0.0144645633 9.253117 0.34293810
#> 17 0.0097384163 5.864417 0.36430230
#> 18 0.0180473670 9.802663 0.40389475
#> 19 -0.0121821058 -7.085723 0.37716956
#> 20 0.0073144562 4.080837 0.39321606
#> 21 -0.0065690556 -3.804397 0.37880482
#> 22 0.0075806509 4.205615 0.39543531
This trajectory can be smoothed using regression, ideally with a method resistant to outliers, such as deming::theilsen. Inverse regression can then be used to obtain an estimated number of outcomes needed to achieve a given level of information:
plot(
information_fraction ~ y_4,
data = data_n_70_trajectory,
ylim = c(0, 1)
)
abline(
lm(
formula = information_fraction ~ y_4,
data = data_n_70_trajectory
),
lty = 1
)
# Requires `deming` package
abline(
deming::theilsen(
formula = information_fraction ~ y_4,
data = data_n_70_trajectory
),
lty = 3
)
abline(
h = monitored_design$original_design$information_fractions,
lty = 2
)
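The inverse-regression step mentioned above can be sketched as follows (assuming the deming package and the trajectory computed earlier; the projected count is an extrapolation and should be interpreted cautiously):
# Fit an outlier-resistant trend of information fraction on observed final outcomes
trend_fit <-
  deming::theilsen(
    formula = information_fraction ~ y_4,
    data = data_n_70_trajectory
  )

# Invert the fitted line to project the number of final outcomes at which a
# target information fraction (here, 50% for the first interim analysis) is reached
target_fraction <- 0.50
trend_coefficients <- coef(trend_fit)
ceiling((target_fraction - trend_coefficients[1])/trend_coefficients[2])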