Statistical aspects of clinical research

- David Giltinan
- May 2006

Outline

- Why is clinical research hard?
- Key statistical concerns
- Get the correct answer to the right question, using the appropriate number of subjects
- Key components of a clinical trial
- Clear, feasible, appropriate study objective(s)
- Target patient population
- Study design: visit and evaluation schedule
- Efficacy and safety endpoints
- Sample size
- Analysis methods
- Next week
- Interim analyses: early termination?
- Subgroup analyses

Clinical research is not for sissies

- Answering even relatively simple questions under the best conditions (a controlled clinical trial) can be tricky. Possible sources of bias abound and, if appropriate safeguards are not taken, may combine to give a false or misleading conclusion
- Some of the factors which make clinical research hard
- Formulating the right scientific question can be deceptively tricky
- Logistical complexity, especially the need to use multiple sites
- Trial conduct is highly interdisciplinary, requiring sustained, well-coordinated effort from many groups
- Staggered recruitment of subjects; uncertainty about accrual pattern is unavoidable
- Patient dropout, particularly in longer trials
- Potential for the goalpost to move mid-trial: unforeseen events can destroy, or severely reduce, the relevance of the study even before it ends

Laws governing clinical trial conduct ¹

- Lasagna's Law
- The prevalence of any disease under study drops dramatically once study enrolment opens up, and returns to previous levels only once enrolment closes
- Murphy's Law
- Anything that can go wrong, will go wrong
- In particular, the most egregious breach of protocol instructions will occur at the highest-enrolling site
- Giltinan's Law
- The quality of data obtained from any site is inversely proportional to the degree of exaltation of the thought leader or principal investigator at the site (in extreme cases, the role of thought leader is so all-consuming that delays in filing the necessary paperwork result in actual enrolment levels close to zero)
- ¹ Clearly, all just different manifestations of Murphy's Law

Strategy to tactics: protocol development

- A key concern is that each individual study protocol must achieve its goals, not just on its own terms; it must also make sense within the broader picture
- A major practical issue is the ever-changing nature of the landscape: the long duration of most trials, and the uncertainty about the results, mean that the original target may have shifted by completion of a given trial
- Nonetheless, a key requirement when designing any trial is that the proposed design should give the best chance possible of enabling the development plan to proceed to the next stage, once results from the trial become available
- The previous condition should be met, even when results do not correspond to the desired answer. It is important to remember that a failed clinical trial is not one which fails to give the desired answer, but rather one which fails to give an unambiguous answer

Study objectives should be clear, specific, and relevant

- Phase III objectives are determined primarily by (i) the target product profile (think desired label claim) and (ii) norms for the given disease
- Primary and secondary objectives should map readily to corresponding statistical hypotheses
- Safety objectives are given greater emphasis in Phases I and II; Phase III focuses on efficacy and safety
- Objectives should be specified as precisely as possible. At a minimum, include information on
- What measure of efficacy/safety will be used?
- Key features of the target patient population
- Dosing regimen, i.e. amount, frequency, and route of dosing
- Preferable to use neutral language when specifying objectives (personal opinion). Phrases like "to compare (investigate) the efficacy" or "to characterize the pharmacokinetics" are preferable to, e.g., "to demonstrate efficacy" or "to establish superiority"

Protocol Tip 1: Specify clear study objectives

- Examples
- "To investigate the effect of a single 5 mg dose of rhwonderprotein, administered by transgenic snakebite, on clotting ability in Irish clergymen, as measured by the change from baseline in prothrombin time", rather than "To demonstrate the efficacy of rhwonderprotein in improving clotting ability"
- "To investigate the effect of twice daily SC injection of 40 µg/kg of rhIGF-I for 12 weeks on glycemic control, in subjects with moderate to severe Type II diabetes, as measured by the average change from baseline in HbA1c, compared to subjects in the placebo group"

Bias: sources and precautions

Each source of bias is listed with the corresponding precaution.

- Selection bias: unambiguous eligibility criteria
- Allocation bias: randomization, stratification, blinding
- Evaluation bias (observer/instrument): blinding, standardization (training, or central evaluation)
- Recall bias: appropriate data collection instruments
- Time (systematic change in patient population, treatment, or other aspect of study conduct as trial progresses): balanced treatment allocation; protocol should specify salient details of study conduct, avoiding room for differential interpretations
- Withdrawal / drop out patterns: pre-specified analysis conventions, sensitivity analyses
- Lack of compliance with study protocol: training; engaged study coordinators at site
- Unblinding (of patient, physician, or study personnel): randomized allocation; suitable precautions surrounding treatment codes and drug inventory/supply

Bias: the statistician's arch-nemesis

- Loosely speaking, bias arises as a result of
- Groups differ at baseline w.r.t. an important prognostic factor
- Groups differ w.r.t. some aspect of study conduct that could affect response
- Key statistical tools against bias are
- Randomization (allocation of subjects to treatment groups is randomized)
- Blinding
- Stratification
- Uniform implementation of study procedures across study sites is also critical. Differences may complicate interpretation, or compromise generalizability of results. Of particular concern
- Different interpretation of eligibility criteria
- Systematic differences across sites in how key variables are measured

Bias, efficiency, and generalizability

- Trial design and execution should
- Avoid bias: a wrong, or misleading, result
- Generalize to the target population of interest: avoid an irrelevant result
- Be efficient: avoid using more subjects than necessary
- Studies which are inadequately powered, or otherwise deficiently designed, may be viewed as particularly inefficient (and ethically dubious)

Randomization

- Randomization is the basis for statistical inference
- A significance level (p-value) represents the probability that differences in outcome at least as large as those observed could be the result of random fluctuations alone
- Without randomization, a statistically significant difference may be the result of non-random differences in the distribution of unknown prognostic factors
- Randomization does not ensure that groups are medically equivalent, but it distributes the unknown biasing factors randomly
- Randomization plays an important role in the generalization of the observed clinical trial data

Randomization: Practical Tips

- If prognostic factors are known, use randomization methods that can account for them
- Stratification / blocking
- Adaptive randomization
- If possible, randomize patients within a site
- Patients enrolled early may differ from patients enrolled later
- Watch out for staggered enrollment
- Temporary closing of study sites or arms can cause problems
- Protocol amendments that affect inclusion/exclusion criteria may be tricky
- Even in open label studies, randomization codes should be locked
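As a rough illustration of stratification / blocking, here is a minimal sketch of permuted-block randomization with a separate schedule per site. The arm labels, block size, site names, and seeds are all hypothetical choices for the example, not anything specified in these slides; a real trial would use a validated randomization system.

```python
import random

def permuted_block_schedule(n_blocks, arms=("A", "B"), per_arm=2, seed=None):
    """Build a treatment list from randomly permuted, balanced blocks."""
    rng = random.Random(seed)
    schedule = []
    for _ in range(n_blocks):
        block = list(arms) * per_arm   # balanced block, e.g. A B A B
        rng.shuffle(block)             # random order within the block
        schedule.extend(block)
    return schedule

# One independent schedule per stratum (here: hypothetical site names).
sites = ["site_01", "site_02", "site_03"]
schedules = {
    site: permuted_block_schedule(n_blocks=3, seed=i)
    for i, site in enumerate(sites)
}

for site, schedule in schedules.items():
    print(site, "".join(schedule))
```

Because every block of four contains two of each arm, the arms stay balanced within each site even if enrollment at that site stops early or changes character over time.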

Blinding

- Randomization does not guarantee that there will be no bias by subjective judgment in evaluating and reporting the treatment effect
- Such bias can be minimized by masking the identity of treatment (blinding)
- Types of blinding
- Challenges
- Ethical considerations
- Unblinding procedures for safety reasons
- Unblinding procedures at final analysis

Protocol Tip 2: Avoid Ambiguity

- Protection against certain types of bias is through appropriate design precautions (stratification, randomization, blinding)
- Other types of bias are prevented only by giving unambiguous instructions to the sites on the intended patient population and how all aspects of the study should be conducted
- Sites will sniff out each ambiguity in the protocol, and interpret and execute the instructions more divergently than you can imagine
- Problems arise when there is vagueness regarding key aspects of study conduct, e.g. use of con meds, evaluation schedule, endpoint definition, handling of dropouts, how key evaluations will be carried out, etc.
- Major divergence in interpretation (e.g. in deciding eligibility, or how to measure a key response variable)
- has the potential to torpedo the protocol entirely
- may not become evident until it's too late

Protocol Tip 3: Accommodating multiple sites

- As a routine precaution, it is advisable to limit the contribution to enrolment of any single site to no more than 15% of the total. Note that this limit is generally not specified explicitly in protocol text, but is communicated to sites at study initiation nonetheless
- Non-standard evaluations may require intensive training of site personnel to reduce systematic differences in evaluation among sites
- Centralized (blinded) evaluation, when feasible, is often the best option
- It is a good idea to develop a prospective publication strategy, securing upfront buy-in from key stakeholders
- A plan and timetable for disseminating study results should be developed, following existing SOPs, and communicated to sites prospectively

Protocol Tip 3: Accommodating multiple sites (continued)

- Regular, frequent communication with sites is important
- Early monitoring of key variables is advisable, to allow problems to be detected and fixed early
- Appropriate mechanisms should be in place to allow evaluation of aggregated safety data in a timely fashion (remember that individual sites may not be able to discern adverse patterns, based only on their data)
- Each team member should try to attain at least a basic understanding of the role of every other team member

Endpoints (1)

- Discussion here will focus primarily on efficacy endpoints
- What about other kinds of endpoints?
- Pharmacokinetic endpoints are generally standard parameters derived from the observed concentration-time profiles
- Safety endpoints also tend to be fairly standard; most are common across protocols, with occasional disease/drug-specific markers
- Incidence of adverse events (general, protocol-specified, by body system, etc.)
- Changes in key laboratory parameters
- Incidence of antibodies (neutralizing or not)
- Pharmacodynamic endpoints, in contrast, are measures of activity, and will vary from study to study. Recommendations for efficacy endpoints apply.

Endpoints (2): General Remarks

- No problem in Phase I, where focus is primarily on safety and PK endpoints. Limited sample sizes preclude formal evaluation of efficacy; if it must be mentioned in the protocol, it is preferable to refer to "activity", rather than "efficacy"
- Drug approval requires establishing an acceptable risk-benefit profile. It is important to bear in mind that the regulatory expectation is that of clinical benefit to the patient
- Thus, in general, the primary efficacy endpoint should be a measure of clinical effect (as opposed to, e.g., a biochemical or physiological marker)
- Taking the primary efficacy endpoint in a pivotal trial to be a biomarker which is not a direct measure of clinical benefit is something which should be done only with prior buy-in from all relevant regulatory agencies
- In general, such buy-in can be attained only in the case of an established surrogate endpoint (more on this below)

Endpoints (3): relevance should be accepted

- Ideally, there is a well-established primary efficacy endpoint, accepted as a suitable measure of patient benefit.
- This can circumvent much tedious discussion, and has the added advantage that consensus on what constitutes a meaningful treatment effect is likely already to exist.
- When such consensus exists, to ignore it would be foolhardy
- Often there may be consensus on the choice of primary efficacy variable, but secondary aspects, such as the definition of relapse or rebound, may still be under debate
- For diseases with no consensus on how best to measure efficacy, expect longer development times
- It is not recommended to launch Phase I without a reasonably clear vision of what the primary efficacy variable will be in pivotal studies; postponing difficult discussions won't necessarily make them any easier
- Agreement on conventions for handling dropouts/missing data is also important

Endpoints (4): Objective is better

- Generally speaking, endpoints which can be measured in a completely objective fashion are preferred
- This may not always be possible; some degree of subjectivity may be unavoidable (e.g. in endpoints such as physician's or patient's evaluation of improvement)
- The degree to which this kind of subjectivity may be acceptable is likely to depend on perceptions about the integrity of blinding in the study
- In evaluating quality of life, use of a validated instrument is preferable. In many cases, a disease-specific QOL questionnaire exists
- Consultation with the Health Economics group is highly recommended, to ensure that collection of QOL data supports the target product profile (don't wait until Phase III to do this)

Endpoints (5): measurement aspects

- In general, key efficacy endpoints should be straightforward to measure. Avoid measures which might still be considered experimental, which require highly complex instrumentation, or involve extremely specialized assays. Measurements which rely heavily on technician skill or judgement can also be problematic
- Centralized evaluation of key endpoints may help guard against inter-site variation
- If key variables do involve specialized assays, make sure that assay procedures are thoroughly understood, and consistently implemented

Endpoints (6): Multiple Endpoints

- Multiple secondary endpoints are common
- Multiple primary endpoints are sometimes used
- If consensus on a single 1° (primary) endpoint is impossible
- Should be a course of last resort (personal view)
- Have an associated penalty, in terms of a higher bar to declare statistical significance at a given level α
- A common approach is to require significance at level α/k, where k is the number of co-primary endpoints (Bonferroni)
- Bonferroni works reasonably, provided k is not too large, and if the constituent endpoints are uncorrelated
- For highly correlated endpoints, Bonferroni is inefficient: the true attained significance level will be < α
- Especially problematic if there is interest in multiple subsets
- Try to show some discipline regarding the number of 2° (secondary) endpoints
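To make the Bonferroni penalty concrete, the sketch below computes the chance of at least one false positive across k endpoints, with and without the α/k adjustment. It assumes the k tests are independent and that no endpoint has a true effect; both are simplifying assumptions for illustration, and the function names are made up for this example.

```python
def bonferroni_threshold(alpha, k):
    """Per-endpoint significance level under the Bonferroni correction."""
    return alpha / k

def familywise_error_independent(per_test_alpha, k):
    """P(at least one false positive) across k independent true-null tests."""
    return 1 - (1 - per_test_alpha) ** k

alpha = 0.05
for k in (1, 2, 3, 5):
    unadjusted = familywise_error_independent(alpha, k)
    adjusted = familywise_error_independent(bonferroni_threshold(alpha, k), k)
    print(f"k={k}: unadjusted={unadjusted:.4f}, Bonferroni-adjusted={adjusted:.4f}")
```

With independent endpoints the adjusted overall error rate sits just under α. With positively correlated endpoints it falls further below α, which is the inefficiency noted above.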

Endpoints (7): a statistical taxonomy

- (1) Continuous, e.g. reduction in cholesterol, HbA1c, visual acuity
- (2) Categorical
- Multiple categories with no natural ordering
- Ordered categorical, e.g. different degrees of improvement
- (3) Dichotomous, e.g. response/non-response, dead/alive at a specific time post-treatment
- (4) Time-to-event, e.g. survival, time to progression
- Different analysis methods are appropriate for each main endpoint type; sample size requirements differ as well
- (3) is obviously a special case of (2)

Endpoints (8): statistical properties

- Approximate ordering by information content (from highest to lowest) is
- Continuous > time-to-event > ordered categorical > categorical > binary
- As a result, demonstrating an effect when the primary efficacy measure is a response rate is typically most demanding, in terms of sample size
- Although continuous response variables may have preferable statistical properties, it is quite common for FDA to require the primary efficacy variable to be a response rate, defined as the proportion of subjects who reach a specified threshold of improvement on the continuous scale (Raptiva, Lucentis)

Endpoints in cancer trials

- Response rate (where response is based on change in tumor size, according to well-defined criteria; best post-treatment evaluation is counted, so response is not linked to a specific timepoint)
- Duration of response (note that the resolution with which this can be determined will depend on the frequency of scheduled evaluations)
- Survival time
- Time to disease progression, where criteria for progression are well-defined
- Progression-free survival
- One major question is the extent to which a treatment effect on response, in terms of reduction of tumor size, is predictive for treatment effect on survival. Unfortunately, this seems to vary by tumor and treatment class.

Sample Size Considerations

- In the standard hypothesis testing framework for efficacy
- Type I error: conclude an ineffective drug is effective (false positive)
- Type II error: conclude an effective drug is ineffective (false negative)
- Ideally, both error probabilities should be controlled
- Generally, sample size is chosen to give acceptable power (defined as 1 - Type II error rate, or 1 - β) for a prespecified false positive rate, α
- In Phase III efficacy trials, α is 0.05, by regulatory fiat
- Acceptable power is generally taken to be 90% for pivotal studies

Phase III Trials: Sample Sizes

- This has implications for sample size, due to tension between both types of error
- Timeline implications, as study duration = treatment duration + accrual time
- Common pitfall: exaggerating the extent of the possible treatment effect ("power for the home run"), giving over-optimistic sample sizes
- General guideline: power the study to detect the treatment effect specified in the target product profile (regular, not optimistic, scenario)
- In some cases, sample size is dictated by safety, rather than efficacy, considerations (satisfy minimum regulatory requirements)

Sample Size Considerations

- For a given value of α, power depends on
- Magnitude of the treatment effect (δ)
- Sample size (n)
- Inter-subject variability for continuous measurements (σ)
- Response rates for binary responses (π)
- For most pivotal efficacy trials, the standard approach is to calculate the sample size necessary to give adequate (90%) power to detect a clinically meaningful treatment effect, with a type I error rate of 5%
- Calculating the sample size needed for a given power requires some knowledge about variability of continuous responses (or response rates, for binary data)
- "Clinically meaningful" needs to be defined in terms of the target product profile, not as "the effect size which will give acceptable power for the sample size I'm willing/able to use"
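The dependence of sample size on effect size, variability, α, and power can be sketched with the usual normal-approximation formula for comparing two means with equal group sizes: n per group = 2 (z_{1-α/2} + z_{1-β})² σ² / δ². The code below is a minimal illustration under those assumptions (two-sided test, known common standard deviation); the function name and the example inputs are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.90):
    """Subjects per group to detect a mean difference `delta` (two-sided test).

    Normal approximation, two equal-sized groups, common SD `sigma`.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the type I error rate
    z_beta = z.inv_cdf(power)            # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

# Hypothetical scenario: effect of half a standard deviation.
print(n_per_group(delta=0.5, sigma=1.0, power=0.90))
print(n_per_group(delta=0.5, sigma=1.0, power=0.80))
# Halving the assumed effect roughly quadruples the requirement.
print(n_per_group(delta=0.25, sigma=1.0, power=0.90))
```

The last call shows why powering for an over-optimistic effect is risky: the required n scales with 1/δ², so a modest overestimate of the effect leaves the study badly underpowered.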

Sample Size: other approaches

- Sample size is not always dictated by this kind of power analysis; in some cases, safety requirements may be the deciding factor (rheumatoid arthritis, psoriasis)
- In earlier phases, it may not be practical to run trials big enough to control both Type I and Type II error rates as well as we might like
- 80% power is generally considered adequate in Phase II; on occasion we may settle for less
- Similarly, requiring significance at the 5% level may be overly stringent in Phase II
- Personal view: it is foolish to allow the hegemony of hypothesis testing to control our thinking prior to Phase III
- Instead, view the issue as an estimation problem
- Precision analysis
- Choose sample size in such a way that there is a desired precision at a fixed confidence level
- Small chance of detecting true treatment effect
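A precision analysis can be sketched the same way as a power calculation. The example below picks the sample size so that the confidence interval for a single mean has a chosen half-width, using the normal approximation n = (z σ / w)². The standard deviation and target half-width are hypothetical inputs, and a known σ is assumed for simplicity.

```python
from math import ceil
from statistics import NormalDist

def n_for_half_width(sigma, half_width, confidence=0.95):
    """Sample size so that the CI for a mean is no wider than +/- half_width."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * sigma / half_width) ** 2)

# Hypothetical: SD of 1 unit, estimate wanted to within +/- 0.25 units.
print(n_for_half_width(sigma=1.0, half_width=0.25))
print(n_for_half_width(sigma=1.0, half_width=0.25, confidence=0.90))
```

Note that this framing says nothing about power: a study sized for precision may still have only a small chance of declaring a true treatment effect statistically significant.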

Sample Size for Time to Event Endpoints

- Challenge
- Power for correctly detecting a clinically meaningful difference at a fixed type I error rate depends primarily on the number of events (deaths, progressions, etc.)
- Specifying the number of events doesn't uniquely determine the number of subjects
- For instance, suppose the required number of events is 280. If 300 subjects per group is sufficient to give the required number of events, then 250 per group must be as well; it will just take longer
- Thus, sample size calculations are a little more complex for time-to-event responses and will depend on
- calculating the number of events needed to give the desired power
- an assumption about the median time-to-event in the control group
- an assumption about the size of the difference between control and treated groups
- projected accrual patterns
- targeted study duration
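The first step, the number of events, is often approximated with Schoenfeld's formula for a logrank comparison with 1:1 allocation: d = 4 (z_{1-α/2} + z_{1-β})² / (ln HR)². The sketch below uses that approximation, which is a swapped-in standard technique rather than anything stated in these slides. The hazard ratio and the overall event probability used to convert events into subjects are hypothetical inputs; in practice the event probability would come from the assumed median, accrual pattern, and study duration.

```python
from math import ceil, log
from statistics import NormalDist

def events_required(hazard_ratio, alpha=0.05, power=0.90):
    """Schoenfeld approximation: total events for a two-arm, 1:1 logrank test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return ceil(4 * (z_alpha + z_beta) ** 2 / log(hazard_ratio) ** 2)

def subjects_required(n_events, event_probability):
    """Total subjects so that the expected number of events reaches n_events."""
    return ceil(n_events / event_probability)

events = events_required(hazard_ratio=0.75)
print("events needed:", events)
# Assumed (hypothetical) chance that an enrolled subject has an event
# before the analysis: longer follow-up raises it and lowers enrollment.
for p_event in (0.5, 0.7, 0.9):
    print(p_event, subjects_required(events, p_event))
```

The loop illustrates the trade-off described above: the same number of events can come from more subjects followed briefly or fewer subjects followed longer.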

Interim Analyses

- Interim analysis is a tool to protect the welfare of subjects
- By stopping enrollment/treatment as soon as a drug is determined to be harmful
- By stopping enrollment as soon as a drug is determined to be beneficial
- By stopping trials which will yield little additional useful information (or which have negligible chance of demonstrating efficacy if fully enrolled, given results to date)
- The associated statistical methods are generally referred to as group sequential methods

Interim analysis: Concerns

- Should preserve an overall false positive rate of α for the trial: cannot claim statistical significance at level α if the unadjusted p-value at one of the interim analyses happens to be less than α
- In general, the unadjusted p-value for testing treatment effect at any given interim analysis will be compared to a more stringent (lower) bound; to stop early (for efficacy) requires compelling evidence
- Regulatory agencies need to be convinced that interim analyses do not compromise the integrity of the blind
- Regulatory guidelines over the past 10 years have become stricter and stricter, ultimately requiring that interim analyses be conducted by an external, independent group, i.e. study team members are no longer privy to interim results
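A small simulation shows why unadjusted interim looks inflate the false positive rate. The sketch assumes five equally spaced looks, normally distributed data, and no true treatment effect; the seed and number of simulated trials are arbitrary. The stricter threshold of 2.413 is the published Pocock constant for five looks at an overall two-sided α of 0.05, used here only as one example of a group sequential boundary.

```python
import random
from math import sqrt

def false_positive_rate(n_looks, threshold, n_trials=20000, seed=1):
    """Share of simulated null trials that cross |z| > threshold at any look."""
    rng = random.Random(seed)
    crossings = 0
    for _ in range(n_trials):
        cumulative = 0.0
        for look in range(1, n_looks + 1):
            cumulative += rng.gauss(0.0, 1.0)  # standardized data from one new stage
            z = cumulative / sqrt(look)        # z-statistic on all data so far
            if abs(z) > threshold:
                crossings += 1                 # trial would stop and claim efficacy
                break
    return crossings / n_trials

naive = false_positive_rate(n_looks=5, threshold=1.96)
pocock = false_positive_rate(n_looks=5, threshold=2.413)
print(f"unadjusted 1.96 at every look: {naive:.3f}")
print(f"Pocock-type bound 2.413:       {pocock:.3f}")
```

Testing at the nominal 5% level on every look should give an overall false positive rate well above 5% (in the region of 14% for five looks), while the stricter bound brings it back to about 5%.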

Interim analysis: Concerns (continued)

- Basically, interim results should not be shared with anyone in the sponsor company, or at participating study centers
- The only feedback to the sponsor is in the form of the recommendations from the Data Monitoring Committee
- Details of any proposed interim analysis, including the sponsor's expectations of the DMC, should be laid out prospectively in a written charter
- SOPs and a charter template exist and should be followed
- Although team members do not conduct the actual analyses, scheduled interim analyses can be highly labor-intensive nonetheless. Genentech's biostatistician/statistical programmer will still need to work with the external data group to develop detailed specifications for the analyses and displays to be made available to the Data Monitoring Board

Interim analysis

- Early stopping for efficacy is not the only possibility (recent experience notwithstanding). Doing so is generally non-controversial, provided an appropriate group sequential stopping rule, and the role of the DMC, have been identified prospectively
- Early stopping for safety can range from scenarios which are very clear-cut to situations which are considerably more ambiguous. In the latter case, having an experienced DMC chair can be particularly important
- Early stopping for lack of efficacy (futility analysis) is not particularly common (with one exception, discussed on the next slide). The idea that incorporating this option can result in substantial reduction in the number of patients ("gating risk") seems slightly misleading (personal opinion)
- Stopping for futility in a controlled trial will typically happen only if the treatment appears considerably inferior to control at the interim analysis
- Enrolment continues during preparation for the interim analysis, which typically occurs at a point where accrual has gained momentum, so the number of subjects saved may not be that great

Early stopping for futility

- An exception is the case of uncontrolled oncology trials focusing on estimation of response rate
- Use of a two-stage (or multi-stage) design is common
- At a given analysis stage, if the observed response rate is so low that it essentially rules out the possibility that the true response rate is acceptable, may choose to stop
- Typically the argument is based on the upper 90% or 95% confidence limit for the true response rate: stop if this is lower than the minimum rate identified as interesting in the TPP
- Recall the "rule of 3", often invoked in the context of safety data. If a particular event (adverse reaction, response) occurs in 0 out of N subjects tested, then the 95% upper confidence limit for the true rate of occurrence is approximately 3/N.
- Thus, for instance, if no responses are observed in the first 20 subjects, this effectively rules out values of the true response rate greater than 3/20, or 15%. If the TPP requires a response rate of at least 20%, stopping for futility seems warranted
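The rule of 3 is an approximation to an exact binomial calculation: with 0 events in N subjects, the one-sided 95% upper limit p solves (1 - p)^N = 0.05. The short sketch below compares the exact limit with 3/N; the function name is made up for this example.

```python
def exact_upper_limit(n, confidence=0.95):
    """Upper confidence limit for the true rate when 0 events are seen in n subjects."""
    return 1 - (1 - confidence) ** (1 / n)

for n in (10, 20, 50, 100):
    print(f"N={n}: exact={exact_upper_limit(n):.4f}, rule of 3={3 / n:.4f}")
```

For N = 20 the exact limit is a little under 14%, so the rule of 3 (15%) is slightly conservative, and the conclusion against a 20% target response rate is unchanged.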

Statistical analysis methods for rates

- A fairly detailed exposition can be found on our website at gwiz/projects/stathelp (introductory course notes, lecture 4)
- Use of the binomial distribution
- Calculating standard errors: normal approximation for large samples
- Estimation and confidence intervals for a single rate
- Testing for difference between two rates (z-test, χ²-test, Fisher's exact test)
- Estimation and confidence intervals for the difference between two rates
- Testing for differences in rates among several groups (χ²-test, Fisher's exact test)
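As a small worked illustration of the large-sample methods listed above, the sketch below runs a two-sided z-test for the difference between two response rates and reports a 95% confidence interval for that difference. It uses the normal approximation (pooled standard error for the test, unpooled for the interval); the counts are hypothetical, and for small samples Fisher's exact test would be the safer choice.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2, confidence=0.95):
    """Return (z, two-sided p-value, CI for p1 - p2) via the normal approximation."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))  # SE under H0
    z = (p1 - p2) / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    se_diff = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)      # SE for the CI
    half = NormalDist().inv_cdf(1 - (1 - confidence) / 2) * se_diff
    return z, p_value, (p1 - p2 - half, p1 - p2 + half)

# Hypothetical data: 30/100 responders on treatment, 15/100 on control.
z, p_value, (low, high) = two_proportion_z_test(30, 100, 15, 100)
print(f"z = {z:.2f}, p = {p_value:.4f}, 95% CI = ({low:.3f}, {high:.3f})")
```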

Statistical methods for survival analysis

- If the response of interest is survival time, then specialized methods are needed, for two main reasons
- Frequency distribution of survival times is usually not well-behaved: not normal, not even symmetric
- In the context of clinical studies, cannot wait to observe all survival times; this means, for some subjects, all we know is that their survival time exceeds the observation period
- In statistical jargon, such survival times are called (right-)censored observations
- Methods for survival times are also applicable to any response of type "time-to-event", e.g. time to disease progression, etc.

Overview of survival analysis methods

- Definitions: survivor function, hazard function
- Estimation of survival curve: Kaplan-Meier
- Comparison of two or more survival curves: logrank test, Wilcoxon test
- Comparing survival curves, allowing adjustment for other factors (e.g. baseline disease status): proportional hazards regression, aka the Cox model

[Figure: Kaplan-Meier disease-free survival curves stratified by p53 mutation status (n = 542). Solid/dotted lines: without/with a p53 tumor mutation.]

Graphing survival data: Kaplan-Meier estimation

- We wish to estimate the proportion remaining disease-free at any given time, equivalently, the estimated probability that a member of the population from which the sample is drawn is alive without disease at that time
- Because of the censoring we use the Kaplan-Meier method. For each time interval we estimate the probability that those without disease at the beginning remain so throughout the interval. This is a conditional probability.
- The probability of being disease-free at any time point is calculated as the product of the conditional probabilities of surviving without disease through each interval prior to that time point.
- The calculations are simplified by ignoring times at which there were no recorded events (whether progressions or losses to censorship).
- Censorship is accommodated in the calculations by ensuring that all subjects previously lost to censoring are removed from the risk set when calculating the conditional probability for a given timepoint
- Because the overall probability of being disease-free at a particular timepoint is calculated as a product of the relevant conditional probabilities, this (Kaplan-Meier) method of estimating the survival curve is sometimes referred to as the product-limit estimate
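The product-limit calculation described above can be sketched in a few lines. This is a bare-bones illustration on made-up data, not a substitute for a statistics package: it returns the estimated survival probability at each distinct event time and does not compute standard errors.

```python
def kaplan_meier(times, events):
    """Product-limit estimate.

    times:  follow-up time for each subject
    events: 1 if the event was observed at that time, 0 if censored
    Returns a list of (time, estimated survival probability) at event times.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        tied = [e for (u, e) in data if u == t]   # everyone leaving at time t
        d = sum(tied)                             # observed events at time t
        if d > 0:
            survival *= 1 - d / at_risk           # conditional survival through t
            curve.append((t, survival))
        at_risk -= len(tied)                      # events and censored leave the risk set
        i += len(tied)
    return curve

# Hypothetical data: months of follow-up; 0 marks a censored subject.
times = [6, 7, 10, 15, 19, 25]
events = [1, 0, 1, 1, 0, 1]
for t, s in kaplan_meier(times, events):
    print(f"t={t}: S(t)={s:.3f}")
```

The censored subjects at 7 and 19 months never cause a step themselves, but each one shrinks the risk set, which makes the next step larger than it would otherwise be.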

Describing survival pattern for a single group

- Survival probabilities are usually presented as a connected "curve". The curve takes the form of a step function, with changes in the estimated probability occurring (only) when an event (progression) was observed
- Observations censored during any interval affect the number still at risk at the start of the next interval. Censoring is thus accommodated when calculating the step sizes; its effect on the curve is relatively subtle, but becomes cumulatively more important over time. Some versions of the Kaplan-Meier curve display censoring times as superimposed short vertical lines (works best for relatively small sample sizes)
- In practice, a computer is used to do these calculations.
- Standard errors and confidence intervals for estimated survival probabilities can be found by using a formula due to Greenwood
- Reporting estimated median survival with associated confidence limits is usual; estimating other percentiles is also possible

Comparing survival patterns across groups

- Two most common tests are
- Logrank test
- Wilcoxon test
- If comparison needs to allow adjustment for other covariates besides group ID (e.g. baseline disease status), the most common approach is
- Cox (proportional hazards) regression
- As the name implies, this analysis frames the comparison in terms of the effect a treatment or covariate exerts on the hazard function, rather than directly on the survival function

Comparing survival patterns: testing

- Logrank test
- Basic idea: at each new event time, figure out the survival pattern that would be expected if the null hypothesis (no difference) were true
- Quantify the difference between the observed survival pattern and that expected under the null hypothesis. This is done at each new event time.
- Obtain a cumulative measure of discrepancy from H0 by adding up the contributions across all event times
- Compare the result to appropriate tables (chi-square) to obtain a p-value
- Wilcoxon test: variation of the logrank test which gives greater weight to discrepancies occurring earlier
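The logrank steps above can be sketched for two groups as follows. At each event time the code compares the observed events in group A with the number expected under H0, accumulates the differences and their hypergeometric variances, and refers the resulting statistic to a chi-square distribution with 1 degree of freedom. The data are invented, and the sketch assumes at least one informative event time (otherwise the variance is zero).

```python
from math import erfc, sqrt

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group logrank test. Returns (chi-square statistic, p-value)."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e == 1})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)   # at risk in A
        n_b = sum(1 for u, _, g in data if u >= t and g == 1)   # at risk in B
        d_a = sum(1 for u, e, g in data if u == t and e == 1 and g == 0)
        d_b = sum(1 for u, e, g in data if u == t and e == 1 and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_minus_expected += d_a - d * n_a / n            # discrepancy from H0
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi_square = observed_minus_expected ** 2 / variance
    p_value = erfc(sqrt(chi_square / 2))   # upper tail of chi-square, 1 df
    return chi_square, p_value

# Hypothetical data: 1 = event observed, 0 = censored.
control_t, control_e = [3, 5, 6, 8, 11, 12], [1, 1, 1, 1, 0, 1]
treated_t, treated_e = [7, 9, 13, 14, 16, 18], [1, 0, 1, 1, 0, 0]
chi_square, p_value = logrank_test(control_t, control_e, treated_t, treated_e)
print(f"chi-square = {chi_square:.2f}, p = {p_value:.4f}")
```

A Wilcoxon-type variant would multiply each event time's contribution by a weight (for example, the number at risk), which emphasizes early differences.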

Comparing survival patterns: estimation

- Limitations of the logrank test
- Only addresses the question "is there a difference?" No direct quantification of the size of the difference
- Doesn't allow adjustment for other relevant prognostic factors (e.g. differences at baseline)
- These questions are usually addressed by Cox (proportional hazards) regression. Salient output is
- estimated coefficient with standard error and/or confidence interval
- Usually interested in whether or not the coefficient is zero
- Quantifies effect on hazard, rather than the survival function

Definitions of survival and hazard functions

- For completeness, here are the definitions
- Survival function
- S(t) = probability of surviving past time t
- Hazard function
- h(t) = instantaneous risk (probability per unit time) of dying at time t, given one has survived until that time
- For calculus fans, the hazard function turns out to be h(t) = -d/dt log S(t)
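For readers who want the calculus spelled out, the relationship between the hazard and survival functions follows directly from the definitions. Here T denotes the survival time and f(t) = -S'(t) its density; this notation is introduced for the derivation and is not part of the slides.

```latex
h(t) \;=\; \lim_{\Delta t \to 0}
      \frac{P\left(t \le T < t + \Delta t \,\middle|\, T \ge t\right)}{\Delta t}
     \;=\; \frac{f(t)}{S(t)}
     \;=\; -\frac{S'(t)}{S(t)}
     \;=\; -\frac{d}{dt}\,\log S(t),
\qquad\text{so}\qquad
S(t) \;=\; \exp\!\left(-\int_0^t h(u)\,du\right).
```

The last identity is why a constant multiplicative effect on the hazard, as in the Cox model, translates into raising the survival curve to a power.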

Safety analyses

- Safety and efficacy data differ in some key aspects
- Safety hypotheses are not specified a priori
- Failure to achieve statistical significance does not mean that a safety finding can be ignored
- With safety data the goal is to prove a negative
- Safety analyses are usually descriptive
- A few serious medical events can lead to the termination of a product's development; extreme value distributions are relevant to safety analyses
- Concurrent controls may not provide adequate context for interpretation

Safety Analysis - Challenges

- Phase III trials are typically sized based on efficacy: what type of safety statements are appropriate?
- Drug exposure: how to summarize, how to correlate with adverse events observed, etc.
- Dose response
- Open label trials
- Placebo-controlled trials
- Sources of bias (under-reporting, longer follow-up leads to more events)
- Adverse events: very many types, so what is an appropriate way to summarize/analyze?
- Multiplicity
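
The multiplicity concern can be illustrated with a back-of-envelope calculation. The sketch below assumes, purely for illustration, that each adverse event type is tested separately at the 5% level and that the tests are independent; neither assumption comes from the slides, and real adverse event data are usually correlated.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive among n independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests

# With many adverse event types, some spurious "signal" becomes very likely.
for k in (1, 10, 50, 100):
    rate = familywise_error_rate(k)
    print(f"{k:>3} AE types tested: P(at least one spurious signal) = {rate:.2f}")
```

This is one reason safety analyses tend to be descriptive rather than a long list of unadjusted hypothesis tests.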

Safety Analyses - Challenges

- Number of subjects and duration of exposure during development are minimal relative to the number of patients that may receive drug post-approval
- Only the most common AEs (e.g., incidence of 1% or more) are identified
- Less common AEs (1 in 1000) cannot be reliably detected
- Rare events (1 in 10,000) will almost certainly not be observed at all
- Some patient groups may have been excluded from trials entirely, or represented too sparsely to identify any risks specific to them
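
A quick calculation shows why rarer adverse events slip through. This sketch assumes each subject independently has the same chance of experiencing the event, and the program sizes (1,000 and 3,000 subjects) are hypothetical values chosen for illustration, not figures from the slides.

```python
import math

def prob_at_least_one(incidence: float, n_subjects: int) -> float:
    """Chance of seeing the event at least once among n independent subjects."""
    return 1.0 - (1.0 - incidence) ** n_subjects

def subjects_needed(incidence: float, confidence: float = 0.95) -> int:
    """Smallest n giving the stated chance of seeing at least one event."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - incidence))

for label, incidence in [("1 in 100", 0.01), ("1 in 1000", 0.001), ("1 in 10,000", 0.0001)]:
    p_small = prob_at_least_one(incidence, 1000)
    p_large = prob_at_least_one(incidence, 3000)
    n_95 = subjects_needed(incidence)
    print(f"{label:>12}: P(seen) = {p_small:.2f} with 1000 subjects, "
          f"{p_large:.2f} with 3000; about {n_95} needed for a 95% chance")
```

Seeing an event once is still far from characterizing its rate, so the numbers needed for reliable detection are larger than these.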

Regulatory Requirements

- Safety
- Applicant must demonstrate product safety (FDA has obligation to demand)
- Extent of data: there must be sufficient information to decide whether the drug is safe.
- Adequate analyses: "adequate tests by all methods reasonably applicable" must be performed to evaluate safety for labeled use.
- Reasonable results: tests should show that drug is safe as labeled
- Risks must be adequately defined.
- Extreme risks (even if rare) must be obvious.

Regulatory Requirements

- Efficacy
- Applicant must demonstrate substantial evidence of effectiveness claimed.
- Substantial evidence: "evidence consisting of adequate and well-controlled investigations, including clinical investigations, from which experts could conclude the drug will have the claimed effect."
- "Investigations" implies replication or corroboration.
- Typical: 2 Phase III trials with identical or similar designs
- In special circumstances 1 Phase III trial may be sufficient.
- E.g. life-threatening diseases with very limited therapeutic options (always a good idea to talk to regulatory agencies prior to trial initiation)

Guidelines and Regulations

- Regulatory Agencies
- FDA
- EEC (European Economic Community)
- U.S. Code of Federal Regulations for Clinical Trials
- ICH (International Conference on Harmonization)
- Initiatives undertaken by regulatory authorities and industry associations to promote international harmonization of regulatory requirements
- Good Clinical Practice (GCP)
- Structure and content of clinical studies
- Clinical safety data management: definitions and standards for expedited reporting
- Statistical principles for clinical trials

Biomarker - working definition

- "... a laboratory measurement or physical sign used as a substitute for a clinical endpoint that measures how a patient feels, functions, or survives."
- From a definition of the term "surrogate endpoint" by Temple, cited in Fleming and DeMets (1996), "Surrogate endpoints in clinical trials: are we being misled?", Annals of Internal Medicine, 125, pages 605-613

Appendix

- Some thoughts on biomarkers

Biomarkers as surrogate endpoints

- Predict clinical efficacy of treatment based on its effect on biomarker (data may be available earlier; may provide answer with fewer subjects)
- Use in Phase II is common
- Dose ranging based on biomarker
- Phase III go/no go decision based on observed treatment effect on biomarker

Common biomarker types

- Biochemical (cholesterol, HIV viral load, cytokine concentration, hemoglobin A1c)
- Immunological (lymphocyte subpopulation counts, CD4+, CD11a+ T cells, CD20+ B cells)
- Saturation of target cell surface antigen or soluble ligand
- Physiological (e.g. blood pressure, pulmonary function testing, episodes of arrhythmia)
- Imaging (angiography, tumor size, bone density by DEXA scan)

Biomarkers as surrogates - successes

- Lowering of cholesterol level by treatment with statins (survival benefit established)
- Reduction in viral RNA in peripheral blood through treatment with protease inhibitors delays HIV disease progression
- Improved glycemic control (HbA1c) predictive of delayed onset of microvascular complications (retino-, nephro-, neuropathy) in Type I diabetes
- 90-minute TIMI flow (angiography) predictive of 30-day survival following thrombolytic therapy
- Reduction in free IgE following treatment with an anti-IgE antibody correlates with symptom improvement scores in allergic rhinitis and asthma

Biomarkers as surrogates - can't win 'em all

- Experience with biomarkers is not always positive
- CD4 counts as a surrogate in AIDS trials: mixed performance as a predictor of clinical benefit
- Tumor size in cancer trials: experience runs both ways; appears to depend both on tumor type and on class of treatments
- Experience in the CAST trial demonstrated that treatment with encainide/flecainide clearly reduced the incidence of arrhythmias, but increased mortality
- Similar results in context of treating atrial fibrillation
- Blood pressure as surrogate: effect translates to clinical benefit for some drug classes, but not others

What can make biomarkers unreliable?

- Biomarker not on causal pathway of disease process
- Several pathways: intervention affects that mediated through biomarker, but not others (redundancy)
- Biomarker not on the pathway affected by the intervention, or is insensitive to treatment effect
- Intervention has mechanisms of action unrelated to the disease process (aka the law of unintended consequences)
- Failure of either type is possible: biomarker could falsely predict, or fail to predict, clinical benefit

What can make biomarkers unreliable?

- Other potential contributing factors include
- Measurement difficulties due to rater effects
- GNE experience (?-interferon in renal cell carcinoma) strongly supports advisability of blinded tumor evaluation by a single central review board (avoid bias, minimize center differences)
- Measurement difficulties arising from sample preparation, transport, storage, and handling
- Time constraints in assaying fresh blood, possible effects of activation of T-cells, lack of standardization of FACS assay protocols and reporting methods, heterogeneity of tumor samples, center differences (use of local or central labs)

What can make biomarkers unreliable?

- Other potential assay-related difficulties include
- Matrix effects
- Interference by other proteins can affect assay specificity and/or sensitivity
- Development of antibodies
- Can be hard to detect, harder to quantify reliably, extremely difficult to assess clinical significance, if any
- Inter-laboratory differences
- Can be large enough to make biomarker data uninterpretable

Biomarkers - editorial comments

- Avoid the "what we can measure is what we should measure" fallacy
- Experience with imaging-based biomarkers to date has been disappointing
- Non-targeted genomic assays (e.g. microarrays followed by data mining) have the potential for much wasted effort
- Avoid the "rearranging the deckchairs on the Titanic" fix, e.g. straining to improve assay precision from a CV of 20% to 15% when the within-subject CV for the marker is 40% and the inter-subject CV is 50%.
- Cytokines make particularly treacherous biomarkers
- Proteomics is not for sissies
- Distinguish between "must know" and "nice-to-know"
- An understanding of mechanism of action may be nice to know, but is not a requirement for drug approval
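
The deckchairs point can be checked with rough arithmetic. The sketch below assumes, beyond what the slide states, that the quoted CVs of 20, 15, 40 and 50 are percentages and that the three sources of variation are independent and combine approximately in quadrature; the function name `total_cv` is made up for the example.

```python
import math

def total_cv(assay_cv: float, within_subject_cv: float, between_subject_cv: float) -> float:
    """Approximate overall CV (%) from independent components combined in quadrature."""
    return math.sqrt(assay_cv ** 2 + within_subject_cv ** 2 + between_subject_cv ** 2)

before = total_cv(20.0, 40.0, 50.0)  # assay CV of 20%
after = total_cv(15.0, 40.0, 50.0)   # assay CV improved to 15%

print(f"overall CV before: {before:.1f}%")
print(f"overall CV after:  {after:.1f}%")
print(f"improvement:       {before - after:.1f} percentage points")
```

Under these assumptions the overall variability barely moves, because the biological components dominate the total.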

Personal opinions (tongue in cheek)

- If the word "cascade" appears in the description of the disease process, all bets are off
- The topic of biomarkers seems to drive otherwise thoughtful researchers to an irrational frenzy of wishful thinking
- The message so eloquently expounded by Jagger et al remains as relevant today as it was in 1969
- Lasagna's Law already militates against rapid accrual of eligible subjects to clinical trials
- To slow recruitment from a trickle to a complete grinding halt only two words are needed in the protocol: "serial biopsy"

Biomarkers - general conclusions

- Utility of a particular biomarker depends not only on the disease, but also on the nature of the therapeutic intervention
- Validation of any candidate biomarker must necessarily be considered on a case-by-case basis
- Validity of a marker for a given drug class may not transfer to other drug classes for the same disease
- Success is most likely when intervention clearly affects the biomarker, whose role in the disease process is well-established and clearly understood
- Validation of a putative marker cannot happen without ultimately generating the required clinical outcome data
- Regulatory conservatism is to be expected, and seems appropriate
