\documentclass{sciposter}
\usepackage{lipsum}
\usepackage{epsfig}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{multicol}
\usepackage{graphicx,url}
%\usepackage[portuges, brazil]{babel}
\usepackage[utf8]{inputenc}
%\usepackage{fancybullets}
\newtheorem{Def}{Definition}
\newtheorem{theorem}{Theorem}
\title{Predictive Posterior Power for Sample Size Re-estimation}
% Project title
\author{Marc Sobel and Ibrahim Turkoz}
% Author names
\institute
{Dept of Statistics, Temple University \\ Janssen Research and Development, LLC, Titusville, NJ
}
% Institution name and address
\email{marc.sobel@temple.edu and iturkoz@its.jnj.com}
% Author email addresses
%\date is unused by the current \maketitle
\rightlogo[1]{}
\leftlogo[1]{}
% Displays the logos (right and left)
% Prefer png or jpg files, ideally kept in the same folder as the .tex file
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% Begin of Document
\begin{document}
%define conference poster is presented at (appears as footer)
\conference{\bf ASA Conference 2014}
%\LEFTSIDEfootlogo
% Uncomment to put footer logo on left side, and
% conference name on right side of footer
% Some examples of caption control (remove % to check result)
%\renewcommand{\algorithmname}{Algoritme} % for Dutch
%\renewcommand{\mastercapstartstyle}[1]{\textit{\textbf{#1}}}
%\renewcommand{\algcapstartstyle}[1]{\textsc{\textbf{#1}}}
%\renewcommand{\algcapbodystyle}{\bfseries}
%\renewcommand{\thealgorithm}{\Roman{algorithm}}
\maketitle
%%% Begin of Multicols-Environment
\begin{multicols}{3}
%%% Abstract
\begin{abstract}
Information before unblinding regarding the success of confirmatory clinical trials is highly uncertain. Estimates of expected future power which purport to use this information for purposes of sample size adjustment after given interim points need to reflect this uncertainty. Estimates of future power at later interim points need to track the evolution of the clinical trial. We employ sequential models to describe this evolution.
We show that current techniques using point estimates of auxiliary parameters for estimating expected power: (i) fail to describe the range of likely power obtained after the anticipated data are observed, (ii) fail to adjust to different kinds of thresholds, and (iii) fail to adjust to the changing patient population. Our algorithms address each of these shortcomings. We show that the uncertainty arising from clinical trials is characterized by filtering later auxiliary parameters through their earlier counterparts and employing the resulting posterior distribution to estimate power. We devise MCMC-based algorithms to implement sample size adjustments after the first interim point. Bayesian models are designed to implement these adjustments in settings where both hard and soft thresholds for distinguishing the presence of treatment effects are present. Sequential MCMC-based algorithms are devised to implement accurate sample size adjustments for multiple interim points. We apply these suggested algorithms to a depression trial for purposes of illustration.
\end{abstract}
%%% Introduction
\section{Introduction}
\label{introduction}
\begin{itemize}
\item[$\circ$]
During the design of a confirmatory clinical trial, the required information is often not fully available, and the information that is used is often subject to a high degree of uncertainty.
\item[$\circ$] This information includes, but is not limited to, the expected treatment differences, the assumed population variance, and estimated dropout rates.
\item[$\circ$]
Group sequential and adaptive designs enable the evaluation of uncertainty in the planning phase without compromising the integrity of the trial.
\item[$\circ$]
At interim points during the trial, re-evaluations of preplanned effect sizes and variance estimates may be beneficial. If the original assumptions appear to be incorrect, adjustments can be made to improve the chance that the trial will reach a definitive conclusion. One such adjustment, which has been discussed extensively in the literature, is to modify the sample size (i.e., sample size re-estimation).
\item[$\circ$]
Breaking the blind to perform sample size adjustment in a clinical trial is frequently resource intensive. Significant credibility issues arise when the sample size is examined using unblinded data. Unblinding may inflate the Type I error rate.
\item[$\circ$]
The 2010 draft guidelines \cite{guidance}
on adaptive designs recommend that:
\begin{enumerate} \item[(1)] blinded sample size adjustment procedures increase the potential for a successful study while maintaining Type I error control, \item[(2)] blinded sample size adjustment procedures greatly reduce the risk of bias, and
\item[(3)] estimators of variance in support of sample size readjustment are subject to increased variability during the course of the trial.
\end{enumerate}
\item[$\circ$]
ICH guidelines
\cite{guidance1}
also cover blinded sample size adjustment.
\item[$\circ$]
In view of recommendations (1) and (2), we adopt sample size procedures which provide sample size adjustments in blinded settings. In view of (3), we adopt procedures which: (a) take account of the error resulting from estimating the variance, and (b) adjust to changes in the variance and associated auxiliary parameters over the course of the trial.
\item[$\circ$]
We adopt a Bayesian approach to estimating the variance and associated auxiliary parameters at each stage of the trial; we use particle filter models to adjust for changes in the auxiliary parameters.
%%% Historical Overview
\end{itemize}
\normalsize
\section{Overview}
There have been three main approaches to sample size determination in general settings in which hierarchical hypotheses are being tested \cite{Pezeshk} and \cite{Adcock}.
\begin{enumerate}
\item[(i)]
Predictive approaches to sample size determination
\cite{Carlin}, \cite{Geisser}, and \cite{Zelen}.
\item[(ii)]
Goal oriented approaches to sample size determination \cite{Santis2}, \cite{Smith},
\cite{Santis1}, and \cite{Berry}.
\label{goal-oriented}
\item[(iii)]
Sample size determination using power estimation. Sample sizes are determined by calculating the ``future power'' obtainable from adding future observations to the test statistic (used to distinguish whether a significant response is present) \cite{Gould} and \cite{Santis}.
Historical data have been used in this setting together with the EM algorithm.
\label{power estimation}
\end{enumerate}
We focus on sample size determination using power estimation in blinded settings (item (iii) above). Our methodology can also be applied to goal-oriented sample size determination (item (ii) above, \cite{Shuster}, \cite{Self1}, and \cite{Self2}).
We leave this for future work.
Posthoc power \cite{Lenth}
is the retrospective power of an observed effect based on the sample size and parameter estimates. We compare our results below with those obtained using posthoc power calculations made after the additional subjects have been observed and unblinding has occurred. Gould and Shih \cite{Gould1}
calculate power and expected power (using what we refer to below as the approximate strategy) in blinded settings. The approximate strategy fails to:
\begin{itemize}
\item[$\circ$] provide a range of expected power with regard to what is achievable;
\item[$\circ$] adjust to the presence of soft treatment effect thresholds; these occur when there is disagreement over which threshold to use; and
\item[$\circ$] adjust to changing patient populations (i.e., the heterogeneity of early versus later enrolled patients).
\end{itemize}
Our proposed methodology employs a Bayesian strategy to address all of these shortcomings.
\begin{itemize}
\item[$\circ$]
We provide a Markov chain Monte Carlo (MCMC) approach to calculating expected power in both hard and soft threshold settings.
\item[$\circ$] Particle filters are utilized to formulate models which properly adjust to changing patient populations.
\end{itemize}
\begin{itemize}
\item[$\circ$]
A wide variety of Bayesian strategies for sample size determination have been proposed in the literature \cite{Smith}, \cite{Joseph}, and \cite{Hartley}.
Related to these are a number of model selection approaches which employ simulation-based methods \cite{Santis2}, \cite{O'Hagan}, \cite{Gelfand}, and \cite{Gelfand1}.
\item[$\circ$]
Most Bayesian and model selection strategies involve providing sample size adjustment at a single interim point. The problem of providing a sample size adjustment after a number of interim points have been observed has received much less attention. Below, we offer a sequential framework for addressing this problem.
\item[$\circ$] We address the issue of how patient population changes between interim points influence sample size recommendations by using particle filter methodology
\cite{Doucet1} and \cite{Carvalho1}.
\item[$\circ$]
Our methodology combines nonsequential Bayesian and model selection strategies for sample size estimation with their sequential counterparts.
\end{itemize}
\vspace{1in}
%%% Former Work
\section{Previous Work}
Gould and Shih (\cite{Gould} and \cite{Gould1})
discussed modifying the design of ongoing trials without unblinding by providing an adjusted version of the one-sample variance estimator. They proposed a procedure to estimate the within-group variance for sample size re-estimation without unblinding the clinical trial data at interim stages using the EM algorithm. This procedure made use of Maximum Marginal Likelihood Estimates (MMLEs) of within-group variability.
Friede and Kieser
\cite{Friede1} and \cite{Friede2}
questioned the reliability of the within-group variance estimates of the Gould and Shih approach
\cite{Gould1}
and later provided a number of alternatives for blinded sample size evaluations. Xing \cite{Xing}
used the enrollment order of subjects and the randomization block size to estimate the within-group variance.
%%% Setup
\section{Setup}
We propose using information from blinded data. The purpose of this research is to provide a framework for sample size determination under these conditions. We assume two subject groups; our methodology readily extends to more than two groups.
\begin{enumerate}
\item[(i)]
Assume $n$ identical, mutually independent subjects are randomly assigned to the control or experimental treatment groups with known probabilities $1-p$ and $p$, respectively.
\item[(ii)] The parameter $\delta$ corresponds to the treatment effect in the clinical trial; $\theta$ includes all the auxiliary parameters, such as the pooled standard deviation.
The parameters $\theta$ and $\delta$ are both assumed to be unknown.
\item[(iii)] Observed subject responses:
\begin{itemize}
\item[(a)]
Observed subject responses $X_i$ in the experimental treatment arm are assumed to be distributed according to $f_1(x \vert \delta,\theta)$, with known density $f_1$.
\item[(b)]
Observed subject responses in the control arm are assumed to be distributed according to $f_0(x \vert \delta, \theta)$, with known density $f_0$.
\end{itemize}
\item[(iv)]
We use the notation $Z_i=1$ to indicate that subject $i$ is assigned the treatment; $Z_i=0$ denotes the control group assignment. The probability that $Z_i=1$ is assumed to be the known value $p$ ($i=1,\ldots,n$).
\item[(v)]
We use the notation $\mathbf{X}=(X_1,...,X_n)$ for the interim sample having size $n$. We anticipate that the additional, as yet unobserved, $m$ observations $\mathbf{X}^{(new)}=(X_{1}^{(new)},...,X_{m}^{(new)})$ can also be selected from the same families of distributions.
\item[(vi)]
We would like to distinguish between a null and alternative model. We will be concerned with two different settings in which sample size adjustments can be implemented:
\begin{enumerate} \item[(a)] In the hypothesis testing setting, tests are devised in which the assumed threshold, distinguishing whether a treatment effect is present, is fixed over the entire length of the trial; \item[(b)] In the model selection setting, more conservative tests with noisy thresholds are devised. \end{enumerate}
Tests used under a model selection setting are distinguished from those devised in hypothesis testing settings by the assumption of a prior distribution with additional noise for the treatment effect; in this Section we use the notation $\lambda$ for the additional (auxiliary) parameters introduced in this case.
\end{enumerate}
\begin{enumerate}
\item[(i)] Hypothesis Testing Setting: \\
In this setting, we decide which of two disjoint sets (referred to below as $g_0$ and $g_a$) the parameter $\delta$ belongs to. Group sequential and adaptive designs enable the evaluation of uncertainty in the planning phase without compromising the integrity of the trial. These techniques, including sample size re-estimation, have been discussed extensively in the literature.
Abusing notation slightly, we assume, in this case, that $\theta$ has prior $h(\theta)$. \\
\item[(ii)] The Model Selection Setting: \\
In this setting, the aforementioned models correspond to two distinct families of prior distributions.
One model assumes that the parameter of interest $\delta$ is distributed according to the family of distributions, $g_0(\bullet \vert \lambda)$ with unknown (auxiliary) parameter $\lambda$. The other model assumes that $\delta$ is distributed according to the family of distributions $g_a(\bullet \vert \lambda)$. We test the hypotheses:
\begin{eqnarray} H_0: \delta & \sim & g_0(\bullet \vert \lambda) \nonumber \\
H_a: \delta & \sim & g_a(\bullet \vert \lambda) \nonumber \end{eqnarray}
where $\lambda$ is assumed to be independent of the parameter $\delta$ but possibly not independent of $\theta$. In view of this, we assume that the parameters $\theta$ and $\lambda$ have joint prior $h(\theta,\lambda)$.
\end{enumerate}
Assuming responses from $n$ subjects are observed, our primary objective is to calculate the additional sample size $m$ required to differentiate between the models $g_0$ and $g_a$.
Classical statistics interprets this requirement in terms of choosing a number $m$ of additional observations needed to ensure that the resulting power of the test distinguishing between the two models is above a particular threshold. We adopt this viewpoint.
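
As a brief illustration of the blinded structure described in items (i)--(v) above (a sketch added for clarity; the symbol $L$ is introduced here only for this purpose): because the assignment $Z_i$ is unobserved before unblinding, marginalizing over $Z_i$ gives each blinded response the two-component mixture density
\[ p \, f_1(x \vert \delta,\theta) + (1-p) \, f_0(x \vert \delta,\theta), \]
so that, by the assumed independence of subjects, the blinded likelihood of the interim sample takes the form
\[ L(\delta,\theta \vert \mathbf{X}) = \prod_{i=1}^{n} \left[ p \, f_1(X_i \vert \delta,\theta) + (1-p) \, f_0(X_i \vert \delta,\theta) \right]. \]
Posterior inference about the auxiliary parameters in blinded settings is based on mixtures of this kind.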
\newcommand{\imsize}{0.45\columnwidth}
\section{Theory and Methods}
\label{Theory}
\begin{itemize}
\item[$\circ$]
The proposed strategy assumes that the parameters $\lambda$ and $\theta$ are estimated a posteriori using the data $\mathbf{X}=(X_1,...,X_{n})$. We use simulated data $\mathbf{X}^*=(X_1^*,...,X_n^*)$ and yet to be observed random variables, $\mathbf{X}^{(new)}$ (of size $m$) to estimate the power.
\item[$\circ$]
For this purpose we employ conditional likelihood ratio tests
(\cite{Lehmann} and \cite{Meng}) and a simulation based approach
(\cite{Santis2}, \cite{Santis1}, \cite{Gelfand}, and \cite{Gelfand1}).
Our approach makes substantial use of MCMC methodologies \cite{Robert}.
\item[$\circ$]
In the model selection setting, we employ null and alternative posterior distributions (\cite{Weiss} and \cite{Spiegelhalter})
which are calculated using marginalized likelihoods. This is frequently equivalent, in the model selection setting, to calculating the posterior probabilities of the null and alternative hypotheses when the simulated and additional data are generated from their (respective) null and alternative predictive posterior distributions
\cite{Gelfand1} and \cite{Weiss}.
In the hypothesis testing setting, we construct (generalized) conditional likelihood ratio statistics (CLRT) by choosing parameters least favorable to the null and alternative hypotheses
\cite{Lehmann} and \cite{Meng}.
\item[$\circ$]
For reasons of convenience, we employ the test value, $T_{n+m}(\mathbf{X^*,X^{(new)}},\lambda, \theta,\mathbf{Z^*,Z^{(new)}})$, corresponding to the additive inverse of the conditional likelihood ratio statistic. In hypothesis testing settings, we drop $\lambda$ from the description of $T_{n+m}$. The binary treatment assignment variables $\mathbf{Z}^*$ and $\mathbf{Z}^{(new)}$, associated respectively with the simulated and as yet unobserved values are assumed to have their prior distribution: $P(Z=1)=p$. For purposes of simplification, we assume that $p$ is 0.5 and that the experimental design has two treatment groups. Our results generalize easily to arbitrary $p$ with more than two treatment groups. In both the critical value and power calculations described below, we make extensive use of the law of large numbers and the central limit theorem
\cite{Gnedenko}.
We adopt the notation $(\bullet)$ for the density of the variable $\bullet$. The notation, $P_{H}$ denotes the posterior probability operator over the simulated and anticipated observations $\mathbf{X^*,X^{(new)}}$ under the null and alternative hypotheses $H=H_0,H_a$, respectively.
\item[$\circ$]
The critical value of the test is a parameter $Crit=Crit(\lambda,\theta)$ whose posterior distribution is calculated, with $T_{n+m}=T_{n+m}(\mathbf{X^*,X^{(new)}},\lambda,\theta,\mathbf{Z^*,Z^{(new)}})$, from:
\begin{eqnarray} \alpha & = &
P_{H_0} \left( T_{n+m} < Crit \Big| \lambda, \theta \right) \label{null} \nonumber \\
(\mathbf{X^*},\mathbf{X}^{(new)}) & \propto & m_{H_0}(\bullet \vert \lambda,\theta, \mathbf{Z^*,Z^{(new)}})(\mathbf{Z^*,Z^{(new)}}) \nonumber \\
\left(\lambda, \theta \Big| \mathbf{X} \right) & \propto & m_{H_0}(\mathbf{X} \vert \lambda,\theta,\mathbf{Z}) h(\theta,\lambda)(\mathbf{Z}) \label{posterior} \label{critical} \end{eqnarray}
The notation $m_H(\bullet \vert \lambda,\ldots)$, used in equations (\ref{critical}) and (\ref{power}), refers to the distribution of the simulated and anticipated observations marginalized under the hypothesis $H=H_0,H_a$;
the notation $m_H(\mathbf{X} \vert \lambda, \theta, \mathbf{Z})$ refers to the distribution of the observed interim data marginalized under the same hypothesis.
\item[$\circ$]
The power of the test, also a parameter, is then calculated using the previously calculated critical parameter and the anticipated observations $\mathbf{X^{(new)}}$
together with the above specifications via:
\begin{eqnarray} {\rm Power}(n+m \vert \lambda,\theta) & = & P_{H_a} \left( T_{n+m} < Crit \right) \nonumber \\
(\mathbf{X}^{(new)},\mathbf{X^*}) & \propto & m_{H_a}(\bullet \vert \lambda, \theta, \mathbf{Z^*,Z^{(new)}})(\mathbf{Z^*,Z^{(new)}}) \nonumber \\
\left(\lambda, \theta \Big| \mathbf{X} \right) & \propto & m_{H_a}(\mathbf{X} \vert \lambda, \theta,\mathbf{Z} ) h(\theta,\lambda)(\mathbf{Z}) \label{power}
\end{eqnarray}
\item[$\circ$]
Having simulated the power a posteriori, we calculate highest posterior density (HPD) intervals for the power and present them in lieu of fixed power estimates.
The critical value given in equation (\ref{critical}) and the power given in equation (\ref{power}) are analogous to the (Bayesian) sample size determination (SSD) separation quantities given in Wang and Gelfand (equations 10a and 10b of \cite{Gelfand1}).
\end{itemize}
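
One way to read equations (\ref{critical}) and (\ref{power}) computationally is sketched here. This is an illustrative Monte Carlo approximation rather than a statement of the implemented algorithm; the number of simulated data sets $S$, the index $s$, and the hat notation are labels introduced only for this sketch. For a single posterior draw of the auxiliary parameters under each hypothesis, simulate $S$ values $T_{n+m}^{(s)}[H]$ of the test statistic from the corresponding predictive distribution, $H=H_0,H_a$. The critical value for that draw may then be approximated by the empirical $\alpha$-quantile of the null values,
\[ \widehat{Crit} = \mbox{$\alpha$-quantile of } \left\{ T_{n+m}^{(s)}[H_0] \right\}_{s=1}^{S}, \]
and the power for that draw by the fraction of alternative values falling below it,
\[ \widehat{\rm Power} = \frac{1}{S} \sum_{s=1}^{S} \mathbf{1} \left\{ T_{n+m}^{(s)}[H_a] < \widehat{Crit} \right\}. \]
Repeating this over many posterior draws yields a sample from the posterior distribution of the power, from which an HPD interval can be read off.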
\section{Relevant Theoretical Results}
\label{Theorem}
The primary theorems relevant to using our formulation are the following:
We use the notation $\widehat{\delta}_{n+m}[H]=\widehat{\delta}_{n+m}(\lambda,\theta)[H]$ for the value of $\delta$ resulting from maximizing the likelihood combining the simulated and additional observations under the given hypothesis $H=H_0,H_a$. To simplify notation, we do not drop the auxiliary parameter $\lambda$ from discussion of the hypothesis testing setting, below. In accordance with this simplification, $I(X)$ denotes a given posterior HPD interval for the parameters $\lambda$ and $\theta$.
\begin{theorem}
\label{critical value}
The critical value parameter $Crit=Crit(\lambda,\theta)$ can be chosen to satisfy the significance level condition,
\begin{equation} P_{H_0} \left( T_{n+m}(\mathbf{X^*,X^{(new)}},\lambda,\theta) < Crit \right)= \alpha \end{equation}
if, for the conditional HPD interval $\lambda,\theta \in I(X)$ having given size, eventually as $m \to \infty$:
\begin{equation} \widehat{\delta}_{n+m}(\lambda,\theta)[H_0] \in g_0, \forall \lambda,\theta \in I(X) \label{HT condition1} \end{equation}
or, in the model selection setting, $\forall \epsilon>0$
\begin{equation} \lim_{m \to \infty} P_{\delta \sim g_0} \left( \vert \widehat{\delta}_{n+m}(\lambda,\theta)[H_0]- \delta \vert < \epsilon \vert X \right)=1, \quad \forall \lambda,\theta \in I(X) \label{MS condition1} \end{equation}
\end{theorem}
\begin{theorem}
\label{power theorem}
Using the same notation as was introduced in Theorem \ref{critical value}, and assuming that
the critical value parameter $Crit=Crit(\lambda,\theta)$ has been chosen, the condition,
\begin{equation} P_{H_a} \left( T_{n+m}(\mathbf{X^*,X^{(new)}},\lambda,\theta)< Crit \right) \geq 1 - \beta \end{equation}
is satisfied for large enough $m$ if, for the conditional HPD interval $\lambda,\theta \in I(X)$, having given size, eventually as $m$ tends to infinity,
\begin{equation} \widehat{\delta}_{n+m}(\lambda,\theta)[H_a] \in g_a, \forall \lambda,\theta \in I(X) \label{HT condition2} \end{equation}
or, in the model selection setting, $\forall \epsilon>0$,
\begin{equation} \lim_{m \to \infty} P_{\delta \sim g_a,\lambda,\theta \vert X} \left( \vert \widehat{\delta}_{n+m}(\lambda,\theta)[H_a]- \delta \vert < \epsilon \vert X \right)=1 \label{MS condition2} \end{equation}
\end{theorem}
Theorems \ref{critical value} and \ref{power theorem} hold in the hypothesis testing setting if, for example, the assumed prior for $\theta$ is proper and supported on the whole real line.
The theorems hold in the model selection setting if, in addition to the aforementioned assumption, the assumed prior for $\lambda$ is proper and has conditional support (for all $\theta$) on the whole real line.
\section{Hypothesis Testing Setting: Depression Trial Example}
\label{simple}
\begin{itemize}
\item[$\circ$]
Assume $n$ identical, mutually independent subject responses, $\mathbf{X}=(X_1,...,X_{n})$ are observed at the first interim stage. Each subject is randomly assigned to the control or experimental treatment groups with known probabilities $1-p$ and $p$, respectively. This setting can easily be adapted to more than two treatment groups. We assume that lower response scores indicate improvement.
\item[(i)]
The average effect of the treatment is denoted by the parameter $\delta$; the mean control response is denoted by the auxiliary parameter $\mu$. The pooled standard deviation is denoted by the auxiliary parameter $\tau$. Using the notation of Section \ref{Theory}:
\begin{enumerate} \item[(a)] the auxiliary parameter $\theta$ corresponds to $(\mu,\tau)$, \item[(b)] subject responses in the treatment arm are distributed according to $f_1(\bullet \vert \delta, \theta)= {\cal N}(\mu-\delta,\tau)$, and \item[(c)] subject responses in the control arm are distributed according to
$f_0(\bullet \vert \delta,\theta)={\cal N}(\mu,\tau)$.
\end{enumerate}
\item[(ii)]
We test the hypotheses
\begin{eqnarray} H_0: \delta & = & 0; \label{HT0} \\
H_a: \delta & > & \delta_1 \label{HTA} \end{eqnarray}
The latent variable $Z_i$ is 1 if the $i$'th subject is in the treatment arm and 0 otherwise.
For simplicity we assume two treatment groups with a 1:1 allocation ratio; thus $p$=0.5.
\item[(iii)]
We employ the quantity,
\begin{equation} T(X,Z,\tau,\mu)= \frac{\sum_i (X_i- \mu)Z_i}{\tau^2} \end{equation}
which is the additive inverse of the conditional likelihood ratio, up to a constant of proportionality.
\item[(iv)]
We calculate the posterior distributions under the null and separately under the alternative.
\item[(v)]
Parameters calculated under the null posterior are denoted by: $\mu_0 \ {\rm and} \ \tau_0 $; those under the alternative posterior are denoted by: $\mu_a \ {\rm and} \ \tau_a$. We assume $\tau^2$ has an (approximately indifferent) inverse gamma prior with shape hyperparameter 1 and (a small) scale hyperparameter $\epsilon_1$.
\end{itemize}
\label{our approach}
{\bf The proposed approach}:
For a given significance level ($\alpha=0.05$), the critical parameter $c$ is needed to compute the power associated with $m$ additional observations. We characterize the posterior distribution of $c$ using $B$ MCMC simulations, indexed by $b=1,\ldots,B$, of the marginal null posterior distribution. We characterize the posterior distribution of the power using $B$ analogous simulations of the marginal alternative posterior distribution. Below, $\Phi$ denotes the standard normal ${\rm cdf}$.
\begin{eqnarray} c^{(b)} & = & \frac{\Phi^{-1}(\alpha) \sqrt{n+m}}{\tau_0^{(b)} \sqrt{2}} \label{critpower} \end{eqnarray}
\[ {\rm power}^{(b)}[n+m] = \]
\begin{equation} \Phi \left\{ {\left[ {\frac{{{\tau_a^{(b)}}}}{{{\tau_0^{(b)}}}}{\Phi ^{ - 1}}\left( \alpha \right) + \frac{{{\delta _1}}}{{\tau_a^{(b)}}}\left( {\sqrt {\frac{{n + m}}{2}} } \right)} \right]} \right\} \label{rpower} \end{equation}
Power is properly estimated by an HPD interval taking the form:
\begin{equation} \underline{\rm power}[n+m] < {\rm power} < \overline{\rm power}[n+m] \label{HPD} \end{equation}
where $\underline{\rm power}[n+m]$ denotes a lower posterior quantile and $\overline{\rm power}[n+m]$ an upper posterior quantile measurement. We employ HPD power estimates as described in equation (\ref{HPD}).
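
The forms of equations (\ref{critpower}) and (\ref{rpower}) can be motivated by a short normal-approximation argument. The following is a sketch under simplifying assumptions introduced here for illustration: we condition on exactly $(n+m)/2$ subjects being in the treatment arm, treat $\mu$ and $\tau$ as fixed within a single draw, and evaluate the alternative at the boundary $\delta=\delta_1$. Write $T=\sum_i (X_i-\mu)Z_i/\tau^2$, with $E_H$ and ${\rm sd}_H$ denoting its mean and standard deviation under $H$. Under $H_0$ each treated response has mean $\mu$ and standard deviation $\tau_0$, so
\[ E_{H_0}[T]=0, \qquad {\rm sd}_{H_0}[T]=\frac{\sqrt{n+m}}{\tau_0 \sqrt{2}}, \]
and solving $P_{H_0}(T<c)=\alpha$ gives $c=\Phi^{-1}(\alpha) \, {\rm sd}_{H_0}[T]$, which is equation (\ref{critpower}). Under $H_a$ each treated response has mean $\mu-\delta_1$ and standard deviation $\tau_a$, so
\[ E_{H_a}[T]=-\frac{(n+m) \, \delta_1}{2\tau_a^2}, \qquad {\rm sd}_{H_a}[T]=\frac{\sqrt{n+m}}{\tau_a \sqrt{2}}. \]
Standardizing then gives
\begin{eqnarray*} P_{H_a}(T<c) & = & \Phi \left( \frac{c-E_{H_a}[T]}{{\rm sd}_{H_a}[T]} \right) \\
& = & \Phi \left( \frac{\tau_a}{\tau_0} \Phi^{-1}(\alpha) + \frac{\delta_1}{\tau_a} \sqrt{\frac{n+m}{2}} \right), \end{eqnarray*}
which agrees with equation (\ref{rpower}) once $\tau_0$ and $\tau_a$ are replaced by their posterior draws $\tau_0^{(b)}$ and $\tau_a^{(b)}$.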
In order to apply the proposed methods, we considered a placebo-controlled study of depression. The details of this trial are given in Mahmoud et al.
\cite{Mahmoud}.
Adult outpatients with major depressive disorders who had an incomplete response to antidepressant treatment were randomly assigned (1:1) to active drug or placebo regimens for 6 weeks in a double-blind multicenter trial. The primary efficacy endpoint was the mean difference between treatments at endpoint using a 17-item Hamilton Rating Scale for Depression (HRSD-17). A sample size of 116 patients in each group was anticipated to have 90$\%$ power to detect a difference in mean HRSD-17 total score change from baseline of 3.0 units assuming that the common standard deviation was 7 using a two-group t-test with a 0.05 two-sided significance level. Allowing for dropouts, approximately 270 subjects were expected to be randomized.
Enrollment visit dates were used to order subject entry into the trial. Sample size assumptions were evaluated for demonstration purposes after the 100$^{\rm th}$, 150$^{\rm th}$, 200$^{\rm th}$, and 250$^{\rm th}$ subject completed the trial. In the left panel of Figure \ref{combined}, 90$\%$ HPD intervals are given for expected power after the data from the first 100 subjects had been examined. The null and alternative posterior distributions of the pooled standard deviation $\tau$ in the hypothesis testing (respectively, model selection) setting are given in the left (respectively, right) panels of Figure \ref{tausigma}; their means correspond roughly to the pooled standard deviation estimates computed in the original study. In the next Section, we examine multiple stage sample size determination in the context of this example.
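
To show how equation (\ref{rpower}) responds to the auxiliary parameters, consider a purely illustrative plug-in calculation. The inputs are assumptions chosen for this sketch, not results from the trial: we take $\delta_1=3$ (the design difference), $\alpha=0.05$, $n+m=150$, and suppose that a single posterior draw has $\tau_0=\tau_a=7$ (the design standard deviation). Then $\Phi^{-1}(0.05) \approx -1.645$ and $\sqrt{150/2} \approx 8.660$, so that
\[ {\rm power} \approx \Phi \left( -1.645 + \frac{3}{7} \times 8.660 \right) = \Phi(2.067) \approx 0.98. \]
If the draw were instead $\tau_0=\tau_a=10$, the same formula would give
\[ {\rm power} \approx \Phi \left( -1.645 + \frac{3}{10} \times 8.660 \right) = \Phi(0.953) \approx 0.83. \]
Variation of this kind across posterior draws of the standard deviation is what an HPD interval for the power summarizes, and what a single plug-in estimate conceals.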
\begin{figure}
\includegraphics[width=1 \textwidth]{combined.pdf}
\caption{90 Percent HPD intervals for expected power using the first 100 subjects when 50 additional observations are anticipated for the Depression Trial in the hypothesis testing and model selection settings, respectively. Posthoc power was calculated to be 66$\%$ after the 150th observation. This is well within the HPD interval given in the left panel.}
\label{combined}
\end{figure}
\begin{table}
\includegraphics[width=1 \textwidth]{posthocpower.pdf}
\caption{Posthoc power, calculated in an unblinded setting, for the Depression data}
\label{posthocpower}
\end{table}
\begin{figure}
\includegraphics[width=1 \textwidth]{tausigma.pdf}
\caption{Null and Alternative posterior densities of the $\tau$ and $\sigma$ parameters in the model selection and hypothesis testing setting}
\label{tausigma}
\end{figure}
\vspace{2.7in}
\section{Clinical Trial Sample Size Adjustments in a Model Selection Setting}
\begin{itemize}
\item[$\circ$]
The model selection setting is similar to that given in Section \ref{simple}, but we now incorporate a variety of judgments about the threshold $\delta_1$, distinguishing whether a treatment effect is present.
\item[$\circ$] By adding noise to both the null and alternative hypotheses, we effectively incorporate all of these judgments; we call this the ``model selection setting.'' Tests in model selection settings are more conservative and hence give rise to smaller expected power than their hypothesis testing counterparts.
\item[$\circ$]
The average effect of the treatment is denoted by the parameter $\delta$; the mean control response is denoted by $\mu$. The pooled standard deviation is denoted by $\tau$. Subjects in the treatment arm are assumed to be distributed according to ${\cal N}(\mu-\delta,\tau)$; subjects in the control arm are distributed according to
${\cal N}(\mu,\tau)$. Our objective in this Section is to test the noisy (vague) null and alternative hypotheses given below.
\item[$\circ$]
The null and alternative hypotheses, for fixed, known $\delta_1$, are:
\begin{eqnarray} H_0: \delta & \sim & {\cal N}(0,\sigma^2) \\
H_a: \delta & \sim & {\cal N}(\delta_1,\sigma^2) \end{eqnarray}
\item[$\circ$]
These hypotheses effectively add the (Gaussian) noise factor ${\cal N}(0,\sigma^2)$ to the null hypothesis
assumed in equation (\ref{HT0}) and to the alternative hypothesis in equation (\ref{HTA}).
\item[$\circ$]
The addition of noise converts the assumed hard threshold $\delta_1$ into a soft threshold.
\item[$\circ$]
In the notation of Section \ref{Theory}, the auxiliary parameter $\lambda$ corresponds to the parameter $\sigma$, defined above.
\item[$\circ$]
The marginal likelihoods (ML) under the null and alternative hypotheses are:
\[ {\rm ML \ under \ null} \propto \frac{\exp \left\{ -(1/2)\sum_i \left( \frac{ (X_i - \mu)^2}{Z_i\sigma^2+\tau^2} \right) \right\}}{\prod_i \sqrt{\left(Z_i\sigma^2+\tau^2 \right)}} \]
\[ {\rm ML \ under \ alternative} \propto \frac{\exp \left\{ -(1/2) \sum_i \left( \frac{ (X_i - \mu+\delta_1 Z_i)^2}{Z_i\sigma^2+\tau^2} \right) \right\}}{\prod_i \sqrt{\left(Z_i\sigma^2+\tau^2 \right)}} \]
\end{itemize}
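
The variance term $Z_i\sigma^2+\tau^2$ appearing in the marginal likelihoods above can be checked directly. As a sketch, assume (this is our reading of the product form, not something stated explicitly) that the Gaussian noise in the treatment effect enters independently for each subject, and write $\delta_i$ and $e_i$ (symbols introduced only for this illustration) for the subject-level effect and residual, so that $X_i = \mu - \delta_i Z_i + e_i$, where $e_i$ has mean 0 and variance $\tau^2$ and $\delta_i$ has variance $\sigma^2$ and mean 0 under $H_0$ or $\delta_1$ under $H_a$. Then, since $Z_i^2=Z_i$,
\begin{eqnarray*} E[X_i \vert Z_i] & = & \mu - E[\delta_i] \, Z_i, \\
{\rm Var}[X_i \vert Z_i] & = & Z_i \sigma^2 + \tau^2. \end{eqnarray*}
This reproduces the centering $(X_i-\mu)^2$ under the null, the centering $(X_i-\mu+\delta_1 Z_i)^2$ under the alternative, and the common denominator $Z_i\sigma^2+\tau^2$.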
\begin{itemize}
\label{reparametrize}
\item[$\circ$]
Let $\psi=\sigma^2+\tau^2$.
We assume a nearly indifferent inverse gamma prior for $\psi$, with shape parameter 1 and scale $\epsilon_1$; the usual Bernoulli prior ($p, \ 1-p$) for the $Z$'s; and the prior described in Section \ref{simple} for $\tau^2$.
The parameter $\sigma^2$ inherits its prior from those given for $\tau^2$ and $\psi$.
\item[$\circ$]
We can compute the critical value parameter by first marginalizing over the null and separately over the alternative hypotheses (see, e.g., \cite{Weiss}).
\item[$\circ$]
The additive inverse of the conditional likelihood ratio statistic is:
\[ T_n(\mathbf{X}, \mu,\sigma,\tau, \mathbf{Z}) \propto \sum_i \left( \frac{Z_i(X_i - \mu)^2-Z_i(X_i - \mu + \delta_1)^2}{Z_i \sigma^2 + \tau^2} \right) \]
\[ \propto \frac{\sum_{i=1}^n Z_i(X_i-\mu)}{\psi} \]
\end{itemize}
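
The reduction from the first to the second display of the statistic above can be verified term by term; the following short check is added for illustration. Terms with $Z_i=0$ vanish, and for $Z_i=1$ the denominator equals $\sigma^2+\tau^2=\psi$, while
\[ (X_i-\mu)^2-(X_i-\mu+\delta_1)^2 = -2\delta_1 (X_i-\mu) - \delta_1^2. \]
Hence
\begin{eqnarray*} \lefteqn{\sum_i \frac{Z_i(X_i - \mu)^2-Z_i(X_i - \mu + \delta_1)^2}{Z_i \sigma^2 + \tau^2}} \\
& = & -\frac{2\delta_1}{\psi} \sum_i Z_i (X_i-\mu) - \frac{\delta_1^2}{\psi} \sum_i Z_i. \end{eqnarray*}
The second term does not involve the responses, so, up to the factor $-2\delta_1$ and an additive term depending only on the assignments, the statistic is $\sum_i Z_i(X_i-\mu)/\psi$. The negative factor is consistent with working with the additive inverse.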
\begin{itemize}
\item[$\circ$] The proposed approach: We adopt the same conventions as were adopted in Section \ref{our approach} (above). The notation
$\psi$ is as defined above. The quantities $\psi_0$ and $\psi_a$ denote the parameter $\psi$ under the null and alternative posterior distributions, respectively. The critical value and power parameters are computed at significance level $\alpha$ as:
\begin{equation} c^{(b)} = \frac{\Phi^{-1}(\alpha) \sqrt{n+m}}{\sqrt{\psi_0^{(b)}} \sqrt{2}} \end{equation}
\[ {\rm power}^{(b)}[n+m] = \]
\begin{equation} \quad \Phi \left\{ {\left[ {\frac{{{\sqrt{\psi_a^{(b)}}}}}{{{\sqrt{\psi_0^{(b)}}}}}{\Phi ^{ - 1}}\left( \alpha \right) + \frac{{{\delta _1}}}{{\sqrt{\psi_a^{(b)}}}}\left( {\sqrt {\frac{{n + m}}{2}} } \right)} \right]} \right\} \end{equation}
\item[$\circ$]
As an example, we describe our results for the depression study.
The parameters $\tau$ and $\sigma$ take on a variety of values in this case under both the null and alternative posterior distributions, as a consequence of the noisy nature of the test (see Figure \ref{tausigma}).
\item[$\circ$]
Note the lower expected power in this case. This is a consequence of the fact that the hypotheses are noisier and hence provide less evidence of future power (see Figure \ref{combined}, right panel).
\end{itemize}
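To make the power formula above concrete, the following worked evaluation uses purely hypothetical values for a single posterior draw $b$ (they are not taken from the depression study): $\alpha=0.05$, $\psi_0^{(b)}=\psi_a^{(b)}=4$, $\delta_1=0.5$, and $n+m=200$. With $\Phi^{-1}(0.05)\approx -1.645$,
\begin{eqnarray*}
{\rm power}^{(b)}[200] & = & \Phi \left\{ \frac{2}{2}(-1.645)+\frac{0.5}{2}\sqrt{\frac{200}{2}} \right\} \\
& = & \Phi(-1.645+2.5) \\
& = & \Phi(0.855) \approx 0.80 .
\end{eqnarray*}
Keeping everything else fixed but taking a draw with $\psi_a^{(b)}=6.25$ gives
\begin{eqnarray*}
{\rm power}^{(b)}[200] & = & \Phi \left\{ \frac{2.5}{2}(-1.645)+\frac{0.5}{2.5}\sqrt{\frac{200}{2}} \right\} \\
& = & \Phi(-2.056+2) \approx 0.48 .
\end{eqnarray*}
In this illustration, a moderate change in the drawn value of $\psi_a^{(b)}$ moves the power from about 0.80 to about 0.48, which is why the power is summarized over all draws $b$ rather than at a single point estimate.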
\section{Advanced Stage Sample Size Determination}
\begin{figure}
\includegraphics[width=1 \textwidth]{HPDmultiple.pdf}
\caption{90\% HPD intervals for the advanced stage power estimates after samples of sizes 100 and 50 have been observed. The left panel estimates power in the depression trial assuming a change in reliability; the right panel estimates power assuming no change in reliability. Table 1 gives a (posthoc) power of 78\% after 200 observations. This is roughly comparable to the lower HPD quantile but not the median HPD quantile.}
\label{HPDmultiple}
\end{figure}
\begin{itemize}
\item[$\circ$]
Nearly all ongoing clinical trials are monitored continuously, and blinded data sets become available at pre-specified periodic intervals. This provides ample opportunity to examine blinded data at various interim points.
\item[$\circ$] Patients enrolled early frequently behave differently from those entering the study later.
\item[$\circ$] In the depression trial introduced above, patients enrolled early demonstrated more reliable behavior than those entering the study later. In this case, predictions are improved by giving more weight to earlier patients.
\item[$\circ$]
In this section, we propose an algorithm for calculating sample size adjustments at a later interim point that takes account of the aforementioned reliability concerns. This enables us to accurately update the auxiliary parameters using all of the data observed before the adjustment is recommended. We refer to this below as advanced stage sample size adjustment.
\item[$\circ$]
We demonstrate our results for two interim points; generalizations to more than two interim points are clear.
\end{itemize}
Using notation analogous to that introduced in Section \ref{Theory}:
\begin{enumerate}
\item[(i)]
Assume that at interim stage $j$ ($j=1,2$), $n_j$ identical, mutually independent subjects are randomly assigned to treatment groups with known probability $p$.
\item[(ii)]
\begin{itemize}
\item[$\circ$] Subjects in the experimental treatment arm with observed values $X_{i,j}$ ($i=1,\ldots,n_j$; $j=1,2$) are modeled as coming from the normal distribution ${\cal N}(\mu-\delta Z_{i,j}, \tau_j)$.
\item[$\circ$] Subjects in the control arm with observed values $X_{i,j}$ are modeled as coming from the normal distribution ${\cal N}(\mu,\tau_j)$.
\end{itemize}
\item[(iii)]
We use the notation $Z_{i,j}=1$ to indicate that subject $i$ at interim stage $j$ is assigned the treatment and $Z_{i,j}=0$ to denote assignment to the control group. The probability of $Z_{i,j}=1$ is assumed to be $p$. We use the notation $\mathbf{X}_j=(X_{1,j},\ldots,X_{n_j,j})$ for the interim sample of size $n_j$ ($j=1,2$).
\item[(iv)]
We anticipate that the additional, as yet unobserved, $m$ observations $\mathbf{X}^{(new)}=\mathbf{X}_{n+1:n+m}=(X_{n+1},\ldots,X_{n+m})$, where $n=n_1+n_2$, are generated from the normal distribution ${\cal N}(\mu-\delta Z_{i},\tau_2)$ ($i=n+1,\ldots,n+m$). We test the null and alternative hypotheses given by
\begin{eqnarray} H_0: \delta & = & 0 \\
H_a: \delta & > & \delta_1. \end{eqnarray}
\item[(v)]
We use the notation $f_{h,j}$ to denote the likelihood under hypothesis $h=0,a$ at interim point $j$ and $\tau_{h,j}$ for the scale parameter under hypothesis $h$ at interim point $j$. We omit mention of $\mu$ in this notation.
\item[(vi)] The hyperparameters $\kappa_{h1}$ and $\kappa_{h2}$, whose ratio is larger than 1, are respectively the shape and scale of a gamma random variable that tends to take values larger than 1.
\item[(vii)] The gamma variable $c_h$ with shape $\kappa_{h1}$ and scale $\kappa_{h2}$, used below, reflects the presumed reduction in certainty in going from the first to the second interim data set; we use the notation ${\it Gamma}(\kappa_{h1},\kappa_{h2})$ for the resulting gamma distribution. We assume the same approximate indifference prior for $\tau_{h,1}$ as was assumed for $\tau_h$ above.
\item[(viii)] We employ the model:
\begin{eqnarray} \mathbf{X}_{1} & \sim & f_{h,1}(\bullet \vert \tau_{h,1}) \nonumber \\
\tau_{h,2} & = & c_h \tau_{h,1}, \qquad c_h \sim {\it Gamma}(\kappa_{h1},\kappa_{h2}) \label{particle filter prior} \\
\mathbf{X}_{2} & \sim & f_{h,2}(\bullet \vert \tau_{h,2}) \nonumber
\end{eqnarray}
\item[(ix)]
Note that, by assumption, the prior distribution for $\tau_{h,2}$ assigns a large prior probability to the event that $\tau_{h,2}$ is larger than $\tau_{h,1}$; the size of this probability depends on the hyperparameters $\kappa_{h1}$ and $\kappa_{h2}$ for $h=0,a$. We adopt the notation $(\bullet \vert \mathbf{X}_1,\mathbf{X}_2)_{H_h}$ to denote the posterior distribution of $\tau_{h,2}$ under hypothesis $H_h$ ($h=0,a$) given all of the observed data.
\item[(x)]
Posterior inference in this case makes use of standard particle filter algorithms
(see \cite{Doucet1} and \cite{Carvalho1}).
\item[(xi)]
We calculate critical values and power using the posterior distributions:
\[ \tau_{0,2}^{(b)} \sim \left( \bullet \vert \mathbf{X}_1,\mathbf{X}_2 \right)_{H_0}; \quad b=1,\ldots,B \]
\[ \tau_{a,2}^{(b)} \sim \left( \bullet \vert \mathbf{X}_1,\mathbf{X}_2 \right)_{H_a}; \quad b=1,\ldots,B \]
\item[(xii)]
The critical values and power can then be calculated using:
\begin{eqnarray} c^{(b)} & = & \frac{\Phi^{-1}(\alpha) \sqrt{n+m}}{\tau_{0,2}^{(b)} \sqrt{2}} \end{eqnarray}
\[ {\rm power}^{(b)}[n+m] = \]
\begin{equation} \Phi \left\{ {\left[ {\frac{{{\tau _{a,2}^{(b)}}}}{{{\tau _{0,2}^{(b)}}}}{\Phi ^{ - 1}}\left( \alpha \right) + \frac{{{\delta _1}}}{{\tau _{a,2}^{(b)}}}\left( {\sqrt {\frac{{n + m}}{2}} } \right)} \right]} \right\} \end{equation}
(for $n=n_1+n_2$).
\item[(xiii)]
We calculate estimated power using the data from the depression trial described previously. The first interim point comes after 100 data points have been observed; the second comes after an additional 50 data points have been observed. We assumed $\kappa_{h1}=3$ and
$\kappa_{h2}=2$. Our results were compared with those for which no change in reliability was assumed (i.e., the original framework). The upper and median quantiles of the adjusted future power estimates given in Figure \ref{HPDmultiple} are comparable to their unadjusted counterparts. The lower quantile of the adjusted future power estimate is substantially smaller than its unadjusted counterpart; this is a consequence of the fact that, by adjusting for the greater reliability of earlier patients, we give less weight to the accumulated evidence against the null at interim point 2.
We note that Table \ref{posthocpower} gives a (posthoc) power of 0.78 after 200 observations. This is well within the adjusted HPD interval but roughly outside the unadjusted HPD interval (see Figure \ref{HPDmultiple}).
\end{enumerate}
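For the hyperparameter values used in item (xiii), the prior probability mentioned in item (ix) can be worked out in closed form. The calculation below is a sketch that takes $\kappa_{h2}$ to be a scale parameter, as stated in item (vi), and uses the fact that a gamma variable with shape 3 and scale $\kappa_{h2}$ has survival function $e^{-x/\kappa_{h2}}\{1+x/\kappa_{h2}+(x/\kappa_{h2})^2/2\}$:
\begin{eqnarray*}
E(c_h) & = & \kappa_{h1}\kappa_{h2} = 3 \times 2 = 6, \\
P(\tau_{h,2}>\tau_{h,1}) & = & P(c_h>1) \\
& = & e^{-1/2}\left(1+\frac{1}{2}+\frac{1}{8}\right) \approx 0.986 .
\end{eqnarray*}
If $\kappa_{h2}$ were instead read as a rate parameter, the same calculation would give $E(c_h)=3/2$ and $P(c_h>1)=5e^{-2}\approx 0.677$.

One standard way to obtain the draws in item (xi) is a sampling importance resampling step. This is an illustrative scheme in the spirit of \cite{Doucet1}, not necessarily the implementation used for the results reported here; the symbols $\tilde{\tau}_{h,2}^{(b)}$ and $w^{(b)}$ are notation introduced only for this sketch. Given draws $\tau_{h,1}^{(b)}$ ($b=1,\ldots,B$) from the posterior of $\tau_{h,1}$ given $\mathbf{X}_1$ under hypothesis $h$:
\begin{eqnarray*}
c_h^{(b)} & \sim & {\it Gamma}(\kappa_{h1},\kappa_{h2}), \\
\tilde{\tau}_{h,2}^{(b)} & = & c_h^{(b)} \tau_{h,1}^{(b)}, \\
w^{(b)} & \propto & f_{h,2}(\mathbf{X}_2 \vert \tilde{\tau}_{h,2}^{(b)}),
\end{eqnarray*}
after which the $\tau_{h,2}^{(b)}$ are obtained by resampling the $\tilde{\tau}_{h,2}^{(b)}$ with probabilities proportional to $w^{(b)}$.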
\section{Conclusion}
\begin{enumerate}
\item[(i)]
We have argued in favor of:
\begin{enumerate} \item[(a)] providing sample size adjustments before unblinding, \item[(b)] providing adjustments in both soft and hard threshold settings, and \item[(c)] providing more accurate and more flexible auxiliary parameter (e.g., variance) estimators in support of changes in patient population. \end{enumerate}
\item[(ii)]
In further support we note the many guidelines recommending these adjustment changes (see Section \ref{introduction}).
\item[(iii)]
The information available before unblinding, although useful, is highly uncertain.
Estimates of expected (future) power obtained in this setting need to reflect this uncertainty. We have shown that current techniques using point estimates of auxiliary parameters for estimating expected power fail to:
\begin{enumerate} \item[(a)] accurately describe the range of likely power obtained after the anticipated data are observed, \item[(b)] anticipate the need for sample size adjustments in the presence of both hard and soft threshold settings, and \item[(c)] adjust to changes in the patient population. \end{enumerate}
\item[(iv)]
The procedures devised above address all of these shortcomings.
\item[(v)]
Breaking the blind to perform sample size adjustment in a clinical trial is resource intensive; blinded sample size re-estimation is generally well accepted by regulators.
Nearly all ongoing clinical trials are monitored continuously and data sets become available at periodic intervals. This monitoring provides ample opportunity to examine blinded data at various interim points. The data set consisting of the collection of all interim data sets is the {\it combined} data set. Patients enrolled in earlier interim data sets may demonstrate more reliable behavior than patients entering the study at a later point.
\item[(vi)]
The proposed multistage algorithm provides flexibility in assigning weights to auxiliary parameters associated with different interim data points, according to the subjective assessment of the researcher.
\item[(vii)]
For the depression example, predictions are frequently more accurate when the pooled standard deviation for early enrolling patients is assumed, a priori, to be smaller than the pooled standard deviation for their later enrolling counterparts. We have argued that this difference in accuracy should be modeled by filtering auxiliary parameters arising from later interim points through those arising from earlier interim points.
\item[(viii)]
Particle filter models
were shown to provide an appropriate mechanism for modelling these prior relationships.
\item[(ix)]
In the depression trial example, the differences in response between the last set of subjects and the first 200 subjects were apparent. This response heterogeneity had a significant effect on the posthoc power, underscoring the need for estimates of future power which accurately model it. The uncertainty in the information available before unblinding is accurately characterized by statistical models which make use of the posterior distribution, conditional on the observed response data, of the auxiliary parameters. More generally, response heterogeneity over the course of a clinical trial is a common problem; it is hoped that the suggested methodology can be useful.
\end{enumerate}
%%% References
%% Note: use of BibTeX als works!!
\begin{thebibliography} {99}
\bibitem{guidance} (2010) FDA Guidelines, web address: \\ http://www.fda.gov/downloads/Drugs/.../ \\ Guidances/ucm201790.pdf .
\bibitem{guidance1} (2010) ICH Guidelines, web address: \\
http://www.fda.gov/downloads/Drugs/ \\ GuidanceComplianceRegulatoryInformation \\
/Guidances/ucm073137.pdf .
\bibitem{Gould} Gould AL, Shih WJ. (1998) Modifying the design of ongoing trials without unblinding. {\it Statistics in Medicine}, 17, pp. 89-100.
\bibitem{Gould1} Gould AL, Shih WJ. (1992) Sample size re-estimation without unblinding for normally distributed outcomes with unknown variance. {\it Communications in Statistics - Theory and Methods}, 21, pp. 2833-2853.
\bibitem{Friede1} Friede T, Kieser M. (2002) On the inappropriateness of an EM algorithm based procedure for blinded sample size re-estimation. {\it Statistics in Medicine}, 21, pp. 165-176.
\bibitem{Friede2} Friede T, Kieser M. (2003) Blinded sample size assessment in non-inferiority and equivalence trials. {\it Statistics in Medicine}, 22, pp. 995-1007.
\bibitem{Xing} Xing B, Ganju J. (2005) A method to estimate the variance of an endpoint from an on-going blinded trial. {\it Statistics in Medicine}, 24, pp. 1807-1814.
\bibitem{Kieser} Kieser M, Friede T. (2003) Simple procedures for blinded sample size adjustment that do not affect the type I error rate. {\it Statistics in Medicine}, 22, pp. 3571-3581.
\bibitem{Pezeshk} Pezeshk, H. (2003) Bayesian techniques for sample size determination in clinical trials: a short review.
{\it Statistical Methods in Medical Research}, 12, pp. 489-504.
\bibitem{Adcock} Adcock, C.J., (1997) Sample Size Determination: A Review. {\it The Statistician}, 46.
\bibitem{Carlin} Zhong, W., et al. (2013) A two-stage Bayesian design with sample size reestimation and subgroup analysis for phase II binary response trials. {\it Contemporary Clinical Trials}, 36(2), pp. 587-596.
\bibitem{Santis2} Santis, F.D. and Spezzaferri, F. (1997) Alternative Bayes factors for model selection. {\it The Canadian Journal of Statistics}, 25, pp 503-515.
\bibitem{Smith} Sahu,S.K., and Smith, T.M.F. (2006) A Bayesian method for sample size determination with practical applications. {\it J.R. Statist. Soc. A} 169, pp. 235-253.
\bibitem{Santis1} Santis, F.D. (2007) Using historical data for Bayesian sample size determination. {\it J.R. Statist. Soc. A.}, 170, pp. 95-113.
\bibitem{Geisser} Geisser, S., and Eddy, W., (1979) A predictive approach to model selection, {\it J. Amer. Statist. Assoc.}, 74, pp 153-160.
\bibitem{Zelen} Lee, S.J., and Zelen, M. (2000) Clinical Trials and Sample Size Considerations: Another Perspective {\it Statistical Science}, 15, pp. 95-110.
\bibitem{Berry} Inoue, L.Y.T., et al. (2005) Relationship between Bayesian and Frequentist Sample Size Determination. {\it The American Statistician}, 59, pp. 79-87.
\bibitem{Santis} Santis, F.D., and Spezzaferri, F., (1997) Alternative Bayes Factors for Model Selection. {\it The Canadian Journal of Statistics}, 25. pp. 503-515.
\bibitem{Shuster} Shuster, J.J. (1993) {\it Practical Handbook of Sample Size Guidelines for Clinical Trials}, CRC Press, Boca Raton, FL.
\bibitem{Lenth} Lenth, R. (2013) Posthoc Power: Tables and Commentary. {\it Technical Report 368, Department of Statistics and Accounting, University of Iowa}, Available on the web.
\bibitem{Self1} Self, S.G. and Mauritsen, R.H., (1988) Power/sample size calculations for generalized linear models. {\it Biometrics}, 44, pp 79-86.
\bibitem{Self2} Self, S.G. and Mauritsen, R.H., (1992) Power calculations for likelihood ratio tests in generalized linear models. {\it Biometrics}, 48, pp 31-39.
\bibitem{Aitkin} Aitkin, M. (1991) Posterior Bayes Factors. {\it J.R. Statist. Soc. B}, 53, pp. 111-142.
\bibitem{Joseph} Joseph, L. and Belisle, P. (1997) Bayesian sample size determination for normal means and differences between normal means. {\it The Statistician}, 46, pp. 209-226.
\bibitem{O'Hagan} O'Hagan, A. (1995) Fractional Bayes Factors for Model Comparison. {\it Journal of the Royal Statistical Society B}, 57, pp. 99-138.
\bibitem{Belisle} Belisle, P. and Joseph, L. (1997) Bayesian sample size determination for normal means and differences between normal means. {\it The Statistician}, 46, pp. 209-226.
\bibitem{Hartley} Hartley, A. (2012) Adaptive blinded sample size adjustment for comparing two normal means - a mostly Bayesian approach. {\it Pharmaceutical Statistics}, 11, pp. 230-240.
\bibitem{Gelfand} Gelfand, A., and Dey, D.K., (1994) Bayesian model choice: Asymptotics and Exact calculation. {\it Journal of the Royal Statistical Soc. Ser. B}, 56, pp. 501-506.
\bibitem{Doucet1} Doucet, A., Godsill, S., and Andrieu, C. (2000) On sequential Monte Carlo sampling
methods for Bayesian filtering. {\it Statistics and Computing}, 10, pp. 197-208.
\bibitem{Carvalho1} Carvalho, C., et al. (2010) Particle Learning and Smoothing. {\it Statistical Science}, 25, pp. 88-106.
\bibitem{Gnedenko} Gnedenko, B.V. (1969) {\it The Theory of Probability}, Mir Publishers.
\bibitem{Lehmann} Lehmann, E. (1986) {\it Testing Statistical Hypotheses}, Wiley, Second Edition, New York.
\bibitem{Meng} Meng, X.-L. (1994) Posterior Predictive p-Values. {\it The Annals of Statistics}, 22, pp. 1142-1160.
\bibitem{Gelfand1} Wang, F., and Gelfand A. (2002) A Simulation-based Approach to Bayesian sample size determination for performance under a given model and for separating models. {\it Statistical Science}, 17, pp. 193-208.
\bibitem{Robert} Robert, C., and Casella, G. (2004) {\it Monte Carlo Statistical Methods}, Springer, Second Edition.
\bibitem{Weiss} Weiss, R. (1997) Bayesian sample size calculations for hypothesis testing. {\it The Statistician}, 46.
\bibitem{Spiegelhalter} Spiegelhalter, D.J., Best, N.G., Carlin, B.P., and Van Der Linde, A. (2002) Bayesian measures of model complexity and fit. {\it J. Roy. Statist. Soc. Ser. B}.
\bibitem{Mahmoud} Mahmoud, R.A., et al. (2007) Risperidone for Treatment-Refractory Major Depressive Disorder: A Randomized Trial, {\it Ann Intern Med.}, 147(9), pp 593-602.
\end{thebibliography}
\end{multicols}
\end{document}