EPI 260: Statistics in Phase II Clinical Trials

Jimmy Hwang, Ph.D.
Biostatistics Core, Cancer Center, UC San Francisco
April 29, 2010

Early Phase Clinical Development: Phase II studies, Statistics (in syllabus)

- Purpose of Phase II clinical studies
- Phase II study design
- formulation of testable hypotheses
- determine the study endpoints and when they will be evaluated
- define the population to be studied
- select the appropriate study design
- Determine the required sample size by making assumptions about the extent of benefit to be achieved with the new treatment and acceptable errors in making a final decision about whether the null hypothesis can be rejected
- Methods for statistical analysis

Four types of trial designs (1)

- Phase I: pharmacologically oriented
- The safe dose range
- The side effects
- How the body copes with the drug
- If the treatment shrinks cancer
- Phase II: preliminary evidence of efficacy and safety
- If the new treatment works well enough to test in phase 3
- Which types of cancer it is effective against
- More about side effects and how to manage them
- More about the most effective dose to use

Four types of trial designs (2)

- Phase III: new treatments are compared with the best currently available treatment (the standard treatment).
- A completely new treatment with the standard treatment
- Different doses or ways of giving a standard treatment
- A new radiotherapy schedule with the standard one
- Phase IV: post-marketing surveillance
- More about the side effects and safety of the drug
- What the long term risks and benefits are
- How well the drug works when it's used more widely than in clinical trials

Statistical Considerations

- Define Clinical Question (Objectives).
- Study Development and Protocol Development
- Types of Study (pilot, clinical trial, observational, etc.)
- Endpoints (feasibility and appropriateness)
- Protocol Development (objectives, aims, statistical design, patient selection, data collection procedures, number of points, stopping rules and interim analysis, statistical endpoints, analysis plan, sample size)
- During Study: Randomization, data quality control, interim analysis and/or monitoring of patient safety
- Study Finishing: Data lock, data analysis and interpretation, assist decisions for the follow-up studies, and preparation for papers and presentations

Statistical Perspectives

- Philosophy of inference divides statisticians: Frequentists versus Bayesian
- Statistical procedures are not standardized.
- Things to consider
- Randomization
- Intent-to-treat Design
- Unbalanced groups
- Stratification
- Large-scale, small clinical trials, meta analysis
- Adjusted or weighted analysis
- Trials can provide confirmatory evidence.
- Other methods are valid for making clinical inferences.

Basic Question

- Clinical reasoning requires generalizing from individual patients.
- Statistical reasoning emphasizes inference based on structured data processing.

- Which treatment is safer and better?

- Benefit could be defined as
- Antitumor activity
- Safety
- The pharmacokinetics or pharmacodynamics
- The biologic correlates which may predict response or resistance to treatment and/or toxicity

Intent-to-treat (ITT) Principle

- Unlike animal studies, the investigator cannot dictate what a participant should do in a clinical trial.
- A participant may forget to take the pills, receive a dose reduction due to toxicity, drop out from the study at any point, or be lost to follow-up.
- Use only full compliers? Use all subjects?
- ITT compares intervention strategies and not interventions.

Standards of Ethical Conduct

- The study participants must give voluntary consent.
- There must be no reasonable alternative to conducting the experiment.
- The anticipated results must have a basis in biological knowledge and animal experimentation.
- The procedures should avoid unnecessary suffering and injury.
- There is no expectation for death or disability as a result of the trial.

Standards of Ethical Conduct

- The degree of risk for the patient is consistent with the humanitarian importance of the study.
- The subjects are protected against even a remote possibility of death or injury.
- The study must be conducted by qualified scientists.
- The subject can stop participation at will.
- The investigator has an obligation to terminate the experiment if injury seems likely.

Study Protocol

- Every well-designed study requires a protocol.
- Protocol is a written agreement between investigators, participants, and the scientific community.
- Protocol is a comprehensive operational manual. It specifies the standard operating procedure (SOP).

Defining study questions

- Each clinical trial must have a primary question.
- The primary question, as well as any secondary or subsidiary questions, should be carefully selected, clearly defined, and stated in advance.
- Selection of the questions
- Primary and secondary objectives
- Interventions
- Response variables
- Surrogate endpoints, biomarkers

Primary Objective

- Define one question the investigators are most interested in answering and that is capable of being adequately answered.
- Define the primary endpoint
- Toxicity, efficacy (response/survival), QOL
- Define the type of study
- Hypothesis testing or estimation
- Superiority or equivalence trials
- The sample size is based on the primary objective.

Secondary Objectives

- Different endpoints
- Subgroup hypotheses
- Prospectively defined
- Based on reasonable expectations
- Limited in number
- Hypothesis testing vs. hypothesis generating
- Hunting expedition vs. fishing expedition
- Multiplicity Issues

What Study Aims Tell You

- Type of study general design
- (pilot, phase I, II or III study arms)
- Who is eligible
- Outcome measure
- (e.g. toxicity, response, duration, biomarker)
- When outcome will be evaluated
- (Timing of evaluations)

Interim Analysis: Why?

- Many trials require large N and/or long duration.
- Interim analysis can result in more efficient designs, and the correct conclusion can be reached sooner.
- Ethical considerations
- Pace of scientific advancement demands learning from the observed data.
- Public health concerns, pressure from activists
- Requirement from IRB and other regulatory agencies

Interim Analysis: Factors to Consider before Early Termination

- Possible difference in prognostic factors among arms
- Bias in assessing response variables
- Impact of missing data
- Differential concomitant tx or adherence
- Differential side effects
- Secondary outcomes
- Internal consistency
- External consistency, other trials

Interim Analysis: Reasons for early stopping

- Efficacy: Treatments are convincingly different or not different (as judged by impartial knowledgeable experts)
- Toxicity: Serious Adverse Events, side effects or toxicity are too severe (outweigh the potential benefits)
- Futility: Significant difference at the end of the trial is unlikely
- Data are of poor quality
- Accrual is too slow to finish in a timely fashion
- More information becomes available outside the study (unnecessary or unethical to continue)
- Scientific questions are no longer important
- Poor adherence (preventing answers to basic question)
- Resources to study are lost or no longer available
- Fraud or misconduct undermines study integrity.

Interim Analysis: To Stop or Not To Stop?

- How sure?
- Is the evidence strong enough, or just due to stochastic variation, or imbalance in covariates or other factors?
- Wrongly stopping for efficacy: false positive
- False claim that the drug is active
- Waste time and money for future development
- Wrongly stopping for futility: false negative
- Kill a promising drug
- Group ethics vs. individual ethics

Data Assessment: Reasons for Noncompliance

- Toxicity or side effects
- Involving life style/behavior change
- Complex or inconvenient interventions
- Insufficient or no understanding of instructions
- Change of mind, refusal
- Lack of family support
- If non-compliance is treatment dependent, it will result in biased data

Data Assessment: Non-adherence

- Includes non- or partial compliers, drop-ins and drop-outs
- Could be due to toxicity, lack of efficacy, refusal.
- Need to compare the non-adherence rate between arms
- Exclude in the analysis
- Rationale: pts not taking medication will not benefit from it.
- Compare the optimal intervention vs. control
- Can lead to biased result
- Include in the analysis
- Intent-to-treat (ITT) principle
- Power reduced but also less bias
- More relevant to generalize study result to the real world setting
- Do both. Sensitivity analysis

Data Assessment: Poor Quality or Missing Data

- Missing visits may or may not be due to outcomes related to treatment, such as pts' health status
- Informative or non-informative missing
- Missing completely at random
- Missing at random (missingness does not depend on unobserved values)
- Not missing at random
- Available methods
- Complete case analysis
- Last value carried forward
- Single imputation
- Multiple imputation
- Sensitivity analysis

Defining Response Variables

- Dose limiting toxicities (DLT), complications
- Response, incidence of a disease, total mortality, death from a specific cause
- Overall survival, time to progression, time to cancer
- Blood pressure, biomarkers, PSA, CD4 count
- Quality of life
- Cost and ease of administering the intervention
- In general, a single response variable should be identified to answer the primary question.

Defining Response Variables

- Define the questions prospectively and specifically
- Study drug can increase the response rate (PR+CR) from 25% to 50% in patients with a certain cancer
- The primary response variable can be assessed in all participants and as completely as possible
- Informative drop-out or lost to f/u due to toxicity
- Participation generally ends when the primary response variable occurs
- Off-drug, off-study, extended f/u
- Response variables should be unbiased and precisely assessed
- Hard, objective endpoints vs. soft, subjective endpoints
- Standardization of evaluation, central lab and pre-trial training

Scales of measurement

- Nominal
- Ordinal
- Interval
- Ratio

Statistical Methods for Categorical Data

- Goal: Analysis
- Describe one group: Proportion
- Compare one group to a hypothetical value: Chi-square test
- Compare two unpaired groups: Chi-square test
- Compare two paired groups: McNemar's test
- Compare three or more unmatched groups: Chi-square test
- Model the effect of multiple prognostic variables: Logistic regression
- When sample size is small, use Fisher's exact test

Statistical Methods for Continuous Data

- Goal: Analysis
- Describe one group: Mean, SD
- Compare one group to a hypothetical value: One-sample t-test
- Compare two unpaired groups: Two-sample t-test
- Compare paired data: Paired t-test
- Compare three or more unmatched groups: One-way ANOVA

Statistical Methods for Non-Parametric Data

- Goal: Analysis
- Describe one group: Median, Percentiles
- Compare one group to a hypothetical value: Signed-rank test
- Compare two unpaired groups: Mann-Whitney test (Wilcoxon rank sum test)
- Compare paired data: Signed-rank test
- Compare three or more unmatched groups: Kruskal-Wallis test

Statistical Methods for Survival Data

- Goal: Analysis
- Describe one group: Kaplan-Meier
- Compare two unpaired groups: log-rank test
- Compare three or more unmatched groups/continuous risk factors: Cox regression
- Model the effect of multiple prognostic factors: Cox regression

Samples and Population

- Research findings are based on samples drawn from populations
- Inferential statistics allow us to infer what the population is like, based on sample data
- Population: the defined group of individuals from which a sample is drawn
- Sample should closely reflect the population; otherwise there is sampling bias.

Sampling

- The process of choosing members of a population to be included in the sample
- Research uses data from a sample to make inferences about a population.

Variability

- How much do scores vary about the average?
- Variance = (sum of squared deviations of each score from the mean)/(n-1)
- Variance is small when scores are close to the mean
- Standard deviation = square root of variance
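
The variance and standard deviation definitions above can be sketched in a few lines. This is a minimal illustration only; the two lists of scores are hypothetical and were chosen so that both have a mean of 10.

```python
import math

def sample_variance(scores):
    """Sum of squared deviations from the mean, divided by (n - 1)."""
    n = len(scores)
    mean = sum(scores) / n
    return sum((x - mean) ** 2 for x in scores) / (n - 1)

def sample_sd(scores):
    """Standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(scores))

# Hypothetical scores (not from the lecture); both lists have mean 10.
tight = [9, 10, 10, 11]    # scores close to the mean
spread = [2, 8, 12, 18]    # scores far from the mean

print(sample_variance(tight), sample_sd(tight))
print(sample_variance(spread), sample_sd(spread))
```

The list whose scores sit close to the mean gives the smaller variance, as the slide states.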

Within-group variability

- Variability within-groups is measured by the variance and divided by sample size
- Tells us how far individual scores deviate from the group mean
- This reflects "error"
- The number becomes lower with increasing sample size

Two Group Means

- Ask samples of males and females about their number of doctor visits during the past year
- Suppose the mean for males is 1.3 and the mean for females is 2.1

Do males and females differ?

- Is the mean number for males different from the mean number for females?
- Obviously, the sample means are different
- Can we infer that the population means differ as well?

What's the Problem?

- The difference observed in the samples may be real
- However, the difference could just reflect the fact that there is some chance of error: there is always a margin of error around the sample value

Hypothesis Testing

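α = Type I error (level of significance)
β = Type II error (1 - β = Power)

Truth vs. test decision:
- H0 true, do not reject H0: correct, probability 1 - α (specificity)
- H0 true, reject H0: Type I error, probability α (false +)
- H0 false, do not reject H0: Type II error, probability β (false -)
- H0 false, reject H0: correct, probability 1 - β = power (sensitivity)

Inverse relationship between α and β for a given sample size.

Sample Size Calculation: find N such that α and β are both under control. Typically, compute N for a given α to yield (1 - β) x 100% power. For example, compute N for α = 0.05 to yield 80% power.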

Null and Research Hypotheses

- Null hypothesis H0
- Population means are in fact equal
- Any mean difference observed in the samples reflects the margin of error
- "Straw man", or what you want to reject
- Any observed deviation from what we expect to see is due to chance variability
- Research hypothesis H1
- Population means are not equal
- The mean difference observed is real
- (Claim, or what you want to accept or test)

Alternative Hypotheses H1

Is the "New" Treatment
- Different from the standard? (2-sided)
- Better than the standard? (1-sided, directional)
- Not different from the standard? (Equivalence)
- Not worse than the standard? (Non-inferiority)

Hypothesis testing

- Problem: Determine whether or not the population means of two groups of subjects truly differ with respect to the outcome of interest.
- Solution: Assume that the two groups do not differ, and see if the sample data disagree with this assumption. That is, perform a hypothesis test.

Hypothesis testing (cont'd)

- The null hypothesis assumes that there is no difference in outcome between the two groups.
- The alternative hypothesis assumes that one group has a more favorable outcome than the other.
- The research hypothesis is usually the alternative hypothesis.

Hypothesis testing (cont'd)

- To do a hypothesis test
- Calculate a test statistic from the data.
- Determine whether the value of the test statistic is likely or unlikely under the null hypothesis.
- If the value is very unlikely, reject the null hypothesis.

Hypothesis testing (cont'd)

- Problem: we might reject the null hypothesis when it is true.
- That is, we might commit Type I error.
- Solution: Construct the test so that there is only a 5% chance of incorrectly rejecting the null hypothesis.
- That is, the level of the test (alpha) is 0.05.

Type I Error

- The chance of rejecting a NULL which is true is α; this type of mistake is called a Type I error or false positive
- Reject the null hypothesis when it is true
- Likelihood is set by the alpha level decision rule (usually .05)
- 5% is a reasonably low probability of being wrong, but it could be set lower
- For early phase II trials, we often use more liberal type I error rates so as not to miss possibly active treatments
- In medical contexts, the specificity of a test is the chance that the test result is negative given that the subject is negative; this is just 1 - α

P < .05

- The alpha level for rejecting the null hypothesis is conventionally set as .05
- Obtained sample data are inconsistent with what the null hypothesis expects
- Reject the null hypothesis and therefore accept the research hypothesis
- Therefore, conclude that the obtained difference in means is statistically significant

Type II Error

- Incorrectly accepting the null hypothesis when there really is a difference
- The chance of not rejecting a NULL which is false is β; this type of mistake is called a Type II error or a false negative
- In medical contexts, the sensitivity of a test is the chance that the test result is positive given that the subject is positive; this is just 1 - β, also called power

Power

- Probability of correctly rejecting the null hypothesis
- 1 - Beta
- Power is higher with
- Large sample size
- Large difference between group means
- Low within-group variability
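
The three drivers of power listed above can be seen in a small calculation. The sketch below assumes a one-sided, one-sample z-test with known standard deviation (a normal approximation that the slides do not specify); the function name and the input values are hypothetical.

```python
from statistics import NormalDist

def power_one_sided_z(delta, sigma, n, alpha=0.05):
    """Approximate power of a one-sided, one-sample z-test.

    delta: true difference from the null value
    sigma: within-group standard deviation (assumed known)
    n:     sample size
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha)            # rejection cut-off under H0
    return z.cdf(delta * n ** 0.5 / sigma - z_crit)

# Hypothetical inputs, varied one at a time from a baseline.
baseline = power_one_sided_z(delta=5, sigma=20, n=50)
larger_n = power_one_sided_z(delta=5, sigma=20, n=200)
larger_delta = power_one_sided_z(delta=10, sigma=20, n=50)
lower_sd = power_one_sided_z(delta=5, sigma=10, n=50)

print(baseline, larger_n, larger_delta, lower_sd)
```

Each change (larger sample, larger difference, lower variability) raises the power relative to the baseline, and with delta = 0 the function returns alpha itself.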

What is p value?

- The p-value is the probability of obtaining data as extreme as, or more extreme than, the observed result when the null hypothesis is true.
- That is, the p-value measures the strength of the evidence against the null hypothesis.
- For a level 0.05 test, we reject the null hypothesis if the p-value is 0.05 or less.
- Smaller p-values mean stronger evidence against H0.
- Statistical Significance or Clinical Significance
- Large samples: small differences may be significant
- Small samples: large differences may not be significant
- The frequentist inference depends on sample space, i.e. the design.

What is p value?

- Decide on whether or not to reject the NULL hypothesis H0 based on the chance of obtaining a TS as or more extreme (as far away from what we expected or even farther, in the direction of the ALT) than the one we got, ASSUMING THE NULL IS TRUE
- The likelihood of observing the same outcome or one more extreme if the study were carried out again and the NULL were true.
- This chance is called the observed significance level, or p-value
- A TS with a p-value less than some prespecified false positive level (or size) α is said to be statistically significant at that level

What is p value?

- The interpretation of a p-value is a little tricky. In particular, it does NOT tell us the probability that the NULL hypothesis is true
- The p-value represents the chance that we would see a difference as big as we saw (or bigger) if there were really nothing happening other than chance variability.
- p = 0.08: 8 times out of 100 the same result or more extreme would occur due to chance alone
- A single convenient number giving a measure of the degree of surprise which the experiment should cause a believer of the null hypothesis

Judging a p-value

- p > 0.05: The results are not statistically significant
- p between 0.05 and 0.10: A trend toward statistical significance
- p between 0.01 and 0.05: The results are significant.
- p between 0.001 and 0.01: The results are highly significant.
- p < 0.001: The results are very highly significant

Statistical Significance Tests

- Significance tests provide a way of making a decision about the population means
- There are many such tests used for different types of data. But all use the same logic

Test statistic

- Measure how far the observed data are from what is expected assuming the NULL (H0) by computing the value of a test statistic (TS) from the data
- The particular TS computed depends on the parameter
- For example, to test the population mean µ, the TS is the sample mean (or standardized sample mean)

Example

- An experiment is conducted to study the effect of exercise on the reduction of the cholesterol level in slightly obese patients considered to be at risk for heart attack. 80 patients are put on a specified exercise plan while maintaining a normal diet. At the end of 4 weeks the change in cholesterol level will be noted. It is thought that the program will reduce the average cholesterol reading by more than 25 points.
- Data
- sample mean = 27
- sample SD = 18

Steps in hypothesis testing (I)

- 1. Identify the population parameter being tested (i.e. population mean). Here, the parameter being tested is the population mean cholesterol reading µ
- 2. Formulate the NULL (H0) and ALT hypotheses (H1)
- H0: µ = 25 (or µ ≤ 25)
- Ha: µ > 25
- 3. Compute the test statistic (TS)
- t = (27 - 25)/(18/√80) = .99

Steps in hypothesis testing (II)

- Compute the p-value.
- Here, p = P(T_79 > .99) = .16
- (Optional) Decision Rule
- REJECT H0 if the p-value ≤ α
- (This is a type of argument by contradiction)
- A typical value of α is .05, but there's no law that it needs to be. If we use .05, the decision here will be:
- DO NOT REJECT H0
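
The cholesterol example can be reproduced numerically. The sketch below uses only the Python standard library; because the standard library has no t distribution, the upper-tail probability is obtained by integrating the t density with Simpson's rule. That integration is an implementation choice made here, not something the lecture prescribes.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(t, df, steps=2000):
    """P(T_df > t) for t >= 0: one half minus the integral of the density on [0, t]."""
    h = t / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

# Values from the example: n = 80, sample mean 27, sample SD 18, H0: mu = 25.
n, sample_mean, sample_sd, mu0 = 80, 27, 18, 25
t_stat = (sample_mean - mu0) / (sample_sd / math.sqrt(n))
p_value = t_upper_tail(t_stat, df=n - 1)

alpha = 0.05
decision = "REJECT H0" if p_value <= alpha else "DO NOT REJECT H0"
print(round(t_stat, 2), round(p_value, 2), decision)
```

With α = .05 the rule does not reject H0, which agrees with the conclusion on the slide.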

Summary

Hypotheses
- Null: New drug doesn't work
- Alternative: New drug works

Decisions
- New drug works: Correctly reject H0 (Power)
- Abandon new drug: Correctly don't reject H0
- Proceed with an ineffective drug: Type I error
- Abandon a drug that might work: Type II error

Pitfalls in hypothesis testing

- Even if a result is statistically significant, it can still be due to chance
- Statistical significance is not the same as practical importance
- A test of significance does not say how important the difference is, or what caused it
- A test does not check the study design. If the test is applied to a nonrandom sample (or the whole population), the p-value may be meaningless
- Data-snooping makes p-values hard to interpret

Introduction to Permutation test (Rank Test)

- A type of nonparametric hypothesis test
- Also called randomization test, exact test
- Very widely applicable class of tests
- Introduced in the 1930s
- Usually require only a few weak assumptions
- Often shows good power

5 Steps to a permutation test

- 1. Analyze the problem: identify the NULL and ALT hypotheses
- 2. Choose a test statistic (TS)
- 3. Compute the TS for the original labeling of the observations
- 4. Rearrange (permute) the labels and recompute the TS for the rearranged labels (do for all possible permutations)
- 5. Decide whether to reject NULL based on this permutation distribution

Permutations

- A permutation is a reordering of the numbers 1, ..., n
- Example: What are some permutations of the numbers 1, 2, 3, 4?
- The NULL specifies that the permutations are all equally likely
- The sampling distribution of the TS under the NULL is computed by forming all permutations, calculating the TS for each and considering these values all equally likely

Example

- Suppose we wish to compare the length of stay in the hospital for patients with the same diagnosis at two different hospitals. We have the following results
- 1st hospital
- 21, 10, 32, 60, 8, 44, 29, 5, 13, 26, 33
- 2nd hospital
- 86, 27, 10, 68, 87, 76, 125, 60, 35, 73, 96, 44, 238
- How could we carry out a permutation test to test the NULL hypothesis of no difference between two hospitals?
- Why is a t test not useful in this case?

Example

- The distribution of length of stay is very skewed and far from normal distribution.
- Using Rank-sum test,
- R = 83.5, T = 3.10, p = 0.002
- This is an example of an unpaired 2 sample test
- Here, we have to find all of the combinations (since order within each group doesn't matter)
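
The five steps can be sketched for the two-hospital data. Enumerating every way of choosing 11 of the 24 observations is possible but slow, so this sketch swaps in a Monte Carlo approximation: it draws 20,000 random relabelings with a fixed seed. The test statistic is the rank sum of the first hospital, with tied values given their average rank. The number of relabelings, the seed, and the two-sided comparison are assumptions made here.

```python
import random

hospital_1 = [21, 10, 32, 60, 8, 44, 29, 5, 13, 26, 33]
hospital_2 = [86, 27, 10, 68, 87, 76, 125, 60, 35, 73, 96, 44, 238]

def midranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

# Step 3: TS for the original labeling (rank sum of the first hospital).
pooled_ranks = midranks(hospital_1 + hospital_2)
n1 = len(hospital_1)
observed = sum(pooled_ranks[:n1])
expected = n1 * (len(pooled_ranks) + 1) / 2   # rank sum expected under the NULL

# Step 4: relabel at random and recompute the TS each time.
rng = random.Random(0)
n_perm = 20000
labels = pooled_ranks[:]
extreme = 0
for _ in range(n_perm):
    rng.shuffle(labels)
    if abs(sum(labels[:n1]) - expected) >= abs(observed - expected):
        extreme += 1

# Step 5: two-sided Monte Carlo p-value (add-one correction avoids p = 0).
p_value = (extreme + 1) / (n_perm + 1)
print(observed, p_value)
```

The observed rank sum matches the R = 83.5 reported on the slide; the Monte Carlo p-value varies slightly with the seed and the number of relabelings.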

Advantages

- Can get a permutation test for any TS, even if its sampling distribution is unknown
- This gives more freedom in choosing a TS
- Can use on unbalanced designs
- Can combine dependent tests on mixtures of different data types (e.g. with numerical and categorical data)

Limitations

- Assumption that the observations are exchangeable under the NULL
- This allows us to randomly move observations between the groups
- For example, when testing for a difference in 2 group means you would need to assume that the distributions in both groups have the same shape and spread
- Cannot use for testing hypotheses in a single population, or to compare groups that are different under the NULL

Introduction to ROC curves

- ROC: Receiver Operating Characteristic
- Started in electronic signal detection theory (1940s - 1950s)
- Has become very popular in biomedical applications, particularly radiology and imaging
- Also used in machine learning applications to assess classifiers
- Can be used to compare tests/procedures
- True positive rate (sensitivity) vs. false positive rate (1-specificity)

Examples using ROC analysis

- Threshold selection for tuning an already trained classifier (e.g. neural nets)
- Defining signal thresholds in DNA microarrays
- Comparing test statistics for identifying differentially expressed genes in replicated microarray data
- Assessing performance of different protein prediction algorithms
- Inferring protein homology

ROC curves simplest case

- Consider diagnostic test for a disease
- Test has 2 possible outcomes
- positive: suggesting presence of disease
- negative
- An individual can test either positive or negative for the disease

Specific Example

[Figure; axis label: Test Result]

Threshold

[Figure; axis label: Test Result]

Four groups

[Figure; regions: True Positives, True Negatives, False Negatives, False Positives; axis label: Test Result]

Moving the threshold

[Figure; regions: True Positives, True Negatives, False Negatives, False Positives; axis label: Test Result]

ROC Curve

[Figure; axes: True positive rate (sensitivity) vs. False Positive Rate (1-specificity)]

Area under ROC curve (AUC)

- Overall measure of test performance
- Comparisons between two tests based on differences between (estimated) AUC
- For continuous data, AUC is equivalent to the Mann-Whitney U-statistic (non-parametric test of difference in location between two populations)

Interpretation of AUC

- The probability that the test result from a randomly chosen diseased individual is more indicative of disease than that from a randomly chosen nondiseased individual: P(Xi > Xj | Di = 1, Dj = 0)
- A nonparametric distance between disease/nondisease test results.
- No clinically relevant meaning
- A lot of the area is coming from the range of large false positive values; no one cares what's going on in that region.
- The curves might cross, so that there might be a meaningful difference in performance that is not picked up by AUC
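
The probability interpretation of AUC can be computed directly by comparing every diseased result with every nondiseased result. The sketch below assumes that higher values are more indicative of disease and counts tied pairs as one half, which is the usual Mann-Whitney convention; the test results are hypothetical.

```python
def auc(diseased, nondiseased):
    """Estimate P(X_i > X_j | D_i = 1, D_j = 0) over all pairs; ties count 1/2."""
    wins = 0.0
    for x in diseased:
        for y in nondiseased:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(diseased) * len(nondiseased))

# Hypothetical test results (higher = more indicative of disease).
diseased = [0.9, 0.8, 0.6, 0.55]
nondiseased = [0.7, 0.5, 0.4, 0.3]

print(auc(diseased, nondiseased))
```

Here 14 of the 16 pairs are ordered correctly, so the estimate is 0.875. A value of 0.5 would mean the test orders pairs no better than chance.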

Elements of sample size calculation

- Hypothesis
- H0: New treatment = standard treatment
- Ha: New treatment is better.
- Type I and Type II errors
- α = .025 (or two-sided α = .05)
- β = .15 (Power = 85%)
- Effect size
- Δ = µ1 - µ2 (for continuous outcomes)
- Δ = π1 - π2 (for dichotomous outcomes)
- Sample variation
- σ(Δ)

Test of Proportions

- Determining the Sample Size
- What is the level of significance? (Prob. or α level)
- Rejecting a true null hypothesis
- What are the chances of detecting a real difference? (Power)
- How large a difference (Δ) is clinically important?

Determining the Sample Size

- Criteria are inter-related
- If you know 3 of 4 parameters, the other is fixed (n, α, β and Δ)
- Must keep the study feasible
- There are trade offs
- There is no one correct answer

Sample Size Calculation Is Only An Estimate

- Parameters used in calculation are estimates themselves with a level of uncertainty.
- Estimated treatment effect may be based on a different population.
- Estimated treatment effect is often overly optimistic, based on highly selected pilot studies.
- Patient eligibility criteria may be changed, thus affecting the sample population.
- Better to design a larger study with early stopping than a smaller study and then try to expand N or extend f/u during the trial.

Sample Size and Power: Why?

- Before a study: how large a sample does the study require? (in planning)
- After a study: if no association was found, could it be due to either a true lack of association in the population or low power and a small sample size?

Power and sample size

- Problem: we might fail to reject the null hypothesis when the alternative is true.
- That is, we might commit Type II error.
- Solution: Select a large enough sample so that there is an 80% chance of rejecting the null hypothesis if the alternative is true.
- Then the power to detect the alternative is 80%.

Power and sample size (cont'd)

- Problem: Sometimes the sample size required is too large.
- Solutions
- Be content to detect with less power (allow more type II error).
- Increase the level of the test (allow more type I error).
- Pick a more extreme alternative.

Sample Size

- Larger sample sizes provide more accurate estimates of the characteristics of the population
- Confidence intervals specify where the population value probably lies
- As sample size increases, there is less margin of error

Change in Sample Size: Test of Proportions

Test of Hypothesis for Phase II Trial (1 arm)
- H0: p < 0.10
- H1: p > 0.25
- Δ = 0.15
- n = 40
- Design: 1 arm, 1-sided test
- α1 = 0.04
- 1 - β = 0.82
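
The operating characteristics of this single-arm design can be checked with exact binomial probabilities. The sketch below assumes a nominal one-sided level of 0.05 and evaluates the error rates at the boundary values p = 0.10 and p = 0.25; both assumptions, and the function names, are choices made here and are not stated on the slide.

```python
from math import comb

def upper_tail(n, r, p):
    """P(X >= r) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(r, n + 1))

def critical_value(n, p0, alpha):
    """Smallest number of responses r with P(X >= r | p0) <= alpha."""
    for r in range(n + 2):             # r = n + 1 gives probability 0, so this always returns
        if upper_tail(n, r, p0) <= alpha:
            return r

n, p0, p1 = 40, 0.10, 0.25
r = critical_value(n, p0, alpha=0.05)  # nominal level is an assumption
actual_alpha = upper_tail(n, r, p0)    # attained one-sided type I error
power = upper_tail(n, r, p1)           # probability of rejecting H0 when p = p1

print(r, round(actual_alpha, 3), round(power, 3))
```

Because the binomial distribution is discrete, the attained type I error falls below the nominal 0.05, which is consistent with the α1 = 0.04 and 1 - β = 0.82 quoted for n = 40.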

Change in Sample Size: Test of Proportions

Test of Hypothesis for Phase II Trial (1 arm): H0: p < 0.10, H1: p > 0.25, Δ = 0.15

Required n by one-sided α1 (columns) and power 1 - β (rows):

1 - β    α1 = 0.05    α1 = 0.025    α1 = 0.01
0.80     40           49            62
0.90     55           64            78
0.95     70           79            103

TTP Example

Assumptions
- 1 arm
- α1 = 0.05
- Power = 0.80
- H0: Median = 30 mos.
- H1: Median = 40 mos.
- Hazard Reduction = 26%
- Accrual = 12/mo.
- Duration of Accrual = 14.7 mos.
- Follow-up = 24 mos.
- Total Sample Size = 176 pts.

Change to a 2 Arm Study

Assumptions
- 2 arm study
- α1 = 0.05
- Power = 0.80
- H0: Median = 30 mos.
- H1: Median = 40 mos.
- Hazard Reduction = 26%
- Accrual = 12/mo.
- Duration of Accrual = 43.1 mos.
- Follow-up = 24 mos.
- Total Sample Size = 518

Increase Power

Assumptions: 2 arm study, α1 = 0.05, H0: Median = 30 mos., H1: Median = 40 mos., Hazard Reduction = 26%, Accrual = 12/mo., Follow-up = 24 mos.

Power                        0.80    0.90
Duration of Accrual (mos)    43.1    55.8
Total Sample Size            518     670

Statistical Power

- 0.01 to 0.69: Unacceptable
- above 0.69 and below 0.80: Poor
- 0.80 to 0.89: Good
- 0.90 to 0.99: Excellent

Characteristics of Phase I Trials

- Small sample sizes
- Not hypothesis driven
- Toxicity (DLT and MTD) and Efficacy
- Patient safety and benefit
- Dose escalation and drug discovery
- Clinician, Patients and Drug Development

Phase I trial designs

- Conventional/Standard Method
- 3+3 Dose Escalation Design
- Sequential/Bayesian Methods
- Continual Reassessment Method (CRM)
- Random Walk Rules (RWR)
- Decision-theoretic Approaches
- Escalation with Overdose Control (EWOC)

Phase I Dose Study: Standard Method, 3+3 design

- At each predefined dose level, treat 3 patients, starting with dose level 1.
- If 0 of 3 have DLT, increase to next level
- If 2 or more have DLT, decrease to previous level
- If 1 of 3 has DLT, treat 3 more at current dose
- If 1 of 6 has DLT, increase to next level
- If 2 or more have DLT, decrease to previous level
- If a dose has de-escalated to previous level
- If only 3 had been treated, enroll 3 more for a total of 6
- If 6 have been treated, stop study and declare it as MTD.
- MTD: the largest dose for which 1 or fewer DLT occurred.
- Escalation never occurs to a dose at which 2 or more DLT have occurred.
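
The escalation rules above amount to a small decision table. The sketch below encodes them as one function; the function name, the `came_down` flag (true when the current level was reached by de-escalation), and the returned strings are hypothetical, and the function only decides the next step at a single dose level rather than simulating a whole trial.

```python
def next_step(n_treated, n_dlt, came_down=False):
    """Next action at the current dose level under the 3+3 rules.

    n_treated: patients treated at this level so far (3 or 6)
    n_dlt:     how many of them had a dose limiting toxicity
    came_down: True if this level was reached by de-escalating
    """
    if n_treated not in (3, 6) or not 0 <= n_dlt <= n_treated:
        raise ValueError("decisions are made after 3 or 6 patients")
    if n_dlt >= 2:
        return "decrease to previous level"
    if n_treated == 3:
        # 1 of 3 with DLT, or a level revisited after de-escalation: expand to 6
        if n_dlt == 1 or came_down:
            return "treat 3 more at current dose"
        return "increase to next level"
    # 6 treated with at most 1 DLT
    return "declare MTD" if came_down else "increase to next level"

print(next_step(3, 0))                  # increase to next level
print(next_step(3, 1))                  # treat 3 more at current dose
print(next_step(6, 1))                  # increase to next level
print(next_step(6, 1, came_down=True))  # declare MTD
```

Keeping the rule as a pure function of the counts makes it easy to check each bullet of the slide against the returned action.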

Sample Size for Safety Trials

- Type I Error (Alpha)
- Acceptable Safety Rate (Rho)
- Sample Size (N)

Alpha          0.10    0.05    0.05    0.10    0.10
Rho (%)        5       10      14      20      25
Sample Size    45      28      20      10      8

Characteristics of Phase II Trials

- Aim: To determine the efficacy of a new treatment (what outcomes to observe)
- Small study of one experimental treatment (E)
- Often a single-arm trial of E alone, without randomization
- Efficacy and safety are evaluated using an early outcome
- Data on E are compared to historical data on standard treatment (S)
- If E is promising, then organize a randomized phase III trial of E-vs-S based on a time-to-event outcome (T)

Primary Outcome Measure: Point Estimate

- Mean Hgb: µ
- Proportion Responding: p
- Median Nadir PSA
- Failure Rate: λ

Typical Phase II Trials

- Typical cancer phase II trials investigate the response rate
- Historical reference p0
- Desired clinically significant response p1
- Hypotheses
- H0: p ≤ p0 (the true response rate is no larger than p0, a minimum response rate of interest)
- H1: p ≥ p1 (the true response rate is at least p1, a target response rate)
- Stop the trial early if p is not sufficiently promising

Typical Phase II Trials

- One stage design: Using the Fisher's exact test to reject null
- Two stage design: First stage to have N1 patients. If not enough responses, stop the trial. Otherwise, continue to full N (> N1) patients and evaluate treatment response based on the number of responses
- The choice of N and N1 is made according to prespecified type I and II errors.
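
For a given two-stage rule, the probability of early termination (PET), the expected sample size, and the chance of declaring the treatment promising follow from binomial probabilities. The sketch below computes them. The cut-offs used in the example (stop if 1 or fewer responses in the first 10 patients; declare promising if more than 5 responses in 29) are an illustrative choice for p0 = 0.10 versus p1 = 0.30 and are not taken from the slides.

```python
from math import comb

def pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def two_stage(r1, n1, r, n, p):
    """Operating characteristics of a two-stage design at true response rate p.

    Stop after stage 1 if responses <= r1; otherwise enroll up to n patients
    and declare the treatment promising if total responses > r.
    Returns (PET, expected sample size, probability of declaring promising).
    """
    pet = sum(pmf(k, n1, p) for k in range(r1 + 1))
    expected_n = n1 + (1 - pet) * (n - n1)
    n2 = n - n1
    promising = 0.0
    for x1 in range(r1 + 1, n1 + 1):
        need = max(r + 1 - x1, 0)       # stage-2 responses still required
        promising += pmf(x1, n1, p) * sum(pmf(x2, n2, p) for x2 in range(need, n2 + 1))
    return pet, expected_n, promising

# Illustrative design (assumed for this sketch).
pet0, en0, alpha = two_stage(r1=1, n1=10, r=5, n=29, p=0.10)
pet1, en1, power = two_stage(r1=1, n1=10, r=5, n=29, p=0.30)

print(round(pet0, 3), round(en0, 1), round(alpha, 3), round(power, 3))
```

Evaluating the function at p0 gives the type I error and the expected sample size under the null; evaluating it at p1 gives the power. Searching over (r1, n1, r, n) for the design that minimizes the expected sample size under p0 would be the next step and is not shown here.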

Phase II Trial Designs

- Single sample (1 stage)
- Multiple stage design
- Gehan (2 stage), Fleming,
- Simon's Optimal, MiniMax
- Bayesian
- Multiple Outcomes Measures
- Interim Analyses
- Stop for toxicity or lack of activity
- Not rejecting null hypothesis.

Phase II Trial Designs

- Randomized Phase II (2 samples)
- Reduce bias by randomizing pts.
- Concurrent accrual/Comparative
- Control/Selection
- Randomized discontinuation
- Interim Analysis
- Stop for toxicity or lack of activity
- Not rejecting null hypothesis
- Adaptive

Simon's Optimal 2-Stage Design

- p0 = 0.6 vs p1 = 0.75
- α = 0.10, β = 0.10
- E(N | p0) = 48
- PET(p0) = 0.65

Characteristics of Phase III Trials

- Use phase 2 data to decide what to test in phase 3
- Randomize between E and S, usually multi-center
- Typically based on T = survival time or DFS time
- The scientific standard for deciding if E is effective