Why Clinical Trials Fail

Why do clinical trials fail?  That is to say, what do you need to know when you design a clinical trial so you can minimize the likelihood of failure?

Generally speaking, there are three root causes of clinical trial failures: molecule issues, logistic issues, and study design issues.  I will focus on the third, but will cover the other two briefly.

Molecule Doesn’t Work

When I say the molecule doesn’t work, I mean that it 1) doesn’t have sufficient biological activity or 2) doesn’t have manageable toxicity.  We need to distinguish between cases where there is no biological activity at all and cases where there is biological activity but it is different from what you expected.  In the latter case, a well-designed clinical program can pick up the side activity, and the program can be redirected. But if the molecule has no biological activity when it enters clinical development, a clinical trialist can do little to salvage it.

With regard to toxicity, people often focus only on the magnitude of the toxicity.  But just as important are the predictability and pervasiveness of the toxicity.  There are two types of toxicity you can’t design your clinical program around: 1) severe and generalized toxicity and 2) severe and unpredictable toxicity.  Other types, such as severe but predictable toxicity (for example, toxicity after a certain cumulative dose or in patients with reduced renal function), are undesirable, but you can often design the clinical program around them if the activity is high enough.  This is why I say unmanageable toxicity is the problem rather than just saying toxicity is the problem. Every molecule has toxicity, and the job of the clinical trialist is to design around it.

Logistic Issues

Those outside the clinical development world often underappreciate how many things can go wrong in a clinical trial, and how often they go wrong.  I suppose this may be a common theme with many fields. I commented in another post that half of the published preclinical experiments may be unreproducible. There is little formal data on how many clinical studies are irreproducible because of logistic issues, but I would venture to say the number is nontrivial.

For example, one of the first things you should do if a trial fails is to check the randomization codes and the labeling. Well, actually that is the first thing you should do when you start the trial but if you haven’t done that then at least check it afterwards. The (in)famous Intrabiotics study that failed because of mis-randomization is the most public example of randomization code gone wrong but it’s not the only one.  In the hurry to get the clinical trial started, sometimes sponsors neglect to triple-check the randomization algorithm.

Similarly, despite validation, mistakes in data analysis programming can occur.  For example, the per-protocol analysis may call for exclusion of patients who have a certain protocol violation but that specification may be missed by the programmers.  Many companies don’t double program (have two sets of independent programmers program two full sets of analysis independently) and in that case, it is almost inevitable there will be mistakes somewhere.  It is simply impossible to write any single-use, complicated program that is bug-free.
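
As a rough illustration of what double programming buys you, the sketch below compares the per-protocol population produced by two independent programs and lists the patients on which they disagree. The function name `compare_populations` and the patient IDs are hypothetical, and a real comparison would cover every derived dataset and output, not just one population flag.

```python
def compare_populations(primary_ids, qc_ids):
    """Return the patient IDs on which two independent derivations disagree."""
    primary, qc = set(primary_ids), set(qc_ids)
    return {
        "only_in_primary": sorted(primary - qc),
        "only_in_qc": sorted(qc - primary),
    }


# Hypothetical per-protocol populations from two independent programmers.
# Assume patient 104 had an excluding protocol violation that the primary
# program failed to apply.
primary_pp = [101, 102, 103, 104, 105]
qc_pp = [101, 102, 103, 105]

discrepancies = compare_populations(primary_pp, qc_pp)
print(discrepancies)
```

Any non-empty list is a prompt to go back to the specification and find out which program is wrong. Agreement does not prove correctness, but disagreement always means there is a bug somewhere.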

And of course, it is very common to find GCP issues in clinical trials. In fact, when I train people, I tell them if your monitors aren’t flagging errors and violations during a trial then they’re not doing their jobs properly. No one’s infallible, and no study is logistically perfect. Don’t get me wrong–if you do the study properly, you will catch and correct those types of errors. But that’s not a trivial task.

Inadvertent unblinding can also be difficult to prevent in certain types of studies. One of the most common causes of unblinding is a drug with a common, recognizable side effect, such as a rash.

And of course, as you already know, the most common logistic issue is inadequate sample size due to financial constraints.

Clinical Trial Design Issues

Errors in clinical trial design are perhaps the most common reason trials fail, other than the above two reasons.  There are many variables in clinical trial design, but three are the most important when it comes to ensuring a successful clinical trial: selecting the right patients, selecting the right dosing, and selecting the right endpoint.

Often, a drug has biological activity but it is tested for the wrong indication.  Or, it is tested in the right indication but in the wrong subpopulations.  In other instances, the wrong dose or dose interval is selected.

The material below is adapted (with permission) from two of my books, Principles and Practice of Clinical Trial Medicine and Global Clinical Trials Playbook.

The Right Endpoint

The first critical factor, endpoint, is the clinical or surrogate phenomenon that you are assessing to determine whether the drug is effective.  This item can be a disease characteristic, health state, symptom, sign, or test result (e.g., laboratory, radiological).

An endpoint is not the same as a measure, by the way.  The measure is the scale you use to quantify the endpoint, such as mmHg or number of days.  The endpoint is the thing being measured, such as blood pressure or length of stay in the hospital.

There are many considerations that go into designing a good endpoint and they’re listed below. But for this post, I won’t talk about issues such as how the wrong endpoint can cause regulatory problems or how it can cause commercial failure. Instead, I will concentrate on how the wrong endpoint can cause a study to be negative.

Characteristics of a Good Endpoint

  • Clinically relevant
  • Closely and comprehensively reflects overall disease being treated
  • Rich in information
  • Responsive (sensitive, discriminating, and has good distribution)
  • Reliable (precise, low variability, and is reproducible) even across studies
  • Robust to dropouts and missing data
  • Does not influence treatment response or have biological effect in and of itself
  • Practical (implementable at different sites, measurable in all patients, economical, and reasonably noninvasive)

One of the ways you can mis-design an endpoint is to include non-modifiable phenomena. One example was the pexelizumab (C5 inhibitor) study for bypass surgeries. The endpoint was myocardial infarctions (MI) and deaths.  The level of troponin necessary to qualify as an MI was set quite low, presumably in an attempt to power the study adequately. The problem is that during a bypass, you almost always get a low level of troponin leak just from the surgical trauma to the heart.  This non-modifiable low troponin elevation should have been excluded from the endpoint. Then the study would have been positive rather than negative.

Another way that an endpoint can fail is if it lacks responsiveness (i.e., sensitivity of the measure to actual changes in a phenomenon). For example, if the endpoint is remission, defined as a 30% improvement in symptoms, a drug that improves symptoms by only 10% will not generate a signal. There are several ways to address this problem, including using a continuous endpoint (discussed in the post “Reducing Sample Size”) and using a surrogate endpoint (discussed in the post “Surrogate Endpoints”). You want an endpoint that will change as much as possible when there is a drug effect.
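
To put rough numbers on the remission example, here is a minimal sketch. It assumes, purely for illustration, that percent improvement is normally distributed with a between-patient standard deviation of 20 points, that placebo patients improve by 0% on average and treated patients by 10%, and that the usual normal-approximation sample size formulas apply (two-sided alpha of 0.05, 80% power). None of these figures come from a real trial.

```python
from math import ceil
from statistics import NormalDist

STD_NORMAL = NormalDist()


def z_total(alpha=0.05, power=0.80):
    """Sum of the z-values for a two-sided alpha and the desired power."""
    return STD_NORMAL.inv_cdf(1 - alpha / 2) + STD_NORMAL.inv_cdf(power)


def responder_rate(mean_improvement, sd, threshold):
    """Fraction of patients whose improvement reaches the threshold."""
    return 1.0 - NormalDist(mean_improvement, sd).cdf(threshold)


def n_continuous(delta, sd):
    """Patients per arm to detect a mean difference `delta` (continuous endpoint)."""
    return ceil(2 * (z_total() * sd / delta) ** 2)


def n_binary(p_control, p_drug):
    """Patients per arm to detect a difference in responder proportions."""
    variance = p_control * (1 - p_control) + p_drug * (1 - p_drug)
    return ceil(z_total() ** 2 * variance / (p_drug - p_control) ** 2)


SD = 20.0          # assumed between-patient SD of percent improvement
THRESHOLD = 30.0   # "remission" = at least 30% improvement

p_placebo = responder_rate(0.0, SD, THRESHOLD)
p_drug = responder_rate(10.0, SD, THRESHOLD)

print(f"responders: placebo {p_placebo:.1%}, drug {p_drug:.1%}")
print(f"patients per arm, continuous endpoint: {n_continuous(10.0, SD)}")
print(f"patients per arm, responder endpoint:  {n_binary(p_placebo, p_drug)}")
```

Under these assumptions the responder analysis needs roughly three times as many patients per arm as the continuous analysis to detect the same drug effect, because the 30% threshold discards most of the information in the data.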

Related to responsiveness is selection of the wrong statistical analysis. The analysis must match the type of disease and the response curve you expect to get. I write more about this in the “Reducing Sample Size” post, but as an example, if you expect the drug to show an effect only after 3 months of treatment, you shouldn’t use an analysis that examines the hazard ratio throughout the duration of the study, because the early months with no effect will dilute the result. Statistical tests are based on certain assumptions, such as a normal distribution and a linear response. More often than not, drugs and populations violate these assumptions, but not enough to cause a substantial problem with the analysis. But there are times when they do.

In addition, the analysis methodology and the endpoint should be robust to dropouts and missing data.  Patients will drop out of trials.  Data will be lost.  So you will have to impute the missing data.  All-cause mortality is relatively robust to a few dropouts because you may count dropouts as deaths.  However, frequency of flare is not robust to dropouts because you cannot predict how many flares the patients who dropped out would have had during the study.
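
The mortality case can be sketched as a simple bounding exercise. The patient counts below are hypothetical, and `mortality_bounds` is an illustrative helper, not a standard function. The idea is that each dropout either died or survived, so the true mortality rate must lie between the two extremes.

```python
def mortality_bounds(n, deaths, dropouts):
    """Return (best-case, worst-case) mortality when dropouts' vital status is unknown."""
    best = deaths / n                 # every dropout survived
    worst = (deaths + dropouts) / n   # every dropout is counted as a death
    return best, worst


# Hypothetical trial: 500 patients per arm, 10 dropouts in each arm.
control_best, control_worst = mortality_bounds(n=500, deaths=100, dropouts=10)
treated_best, treated_worst = mortality_bounds(n=500, deaths=70, dropouts=10)

# Most pessimistic comparison for the drug:
# treated dropouts all died, control dropouts all survived.
pessimistic_benefit = control_best - treated_worst

print(f"control mortality: {control_best:.1%} to {control_worst:.1%}")
print(f"treated mortality: {treated_best:.1%} to {treated_worst:.1%}")
print(f"benefit in the worst case for the drug: {pessimistic_benefit:.1%}")
```

With these numbers the drug still shows a benefit even under the least favorable assumption. No comparable bound exists for flare frequency, because a patient who dropped out could have had zero flares or many.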

The fourth major way that a poor endpoint can lead to a negative trial is high variability. Ideally, repeated measurements for an endpoint should produce similar values but they often don’t.

Sometimes variability can’t be avoided, but in other cases, the variability arises from rectifiable factors. For example, if you don’t use a core lab for reading X-rays, you get inter-reader variability. If you don’t use the same rater to rate joint swelling on every visit, you get inter-rater variability. If you don’t train the ultrasound technicians and validate ultrasounds before the study, you get technique-dependent variability. If you use subjective patient symptoms rather than an objective measure such as oxygen saturation, you will also get higher variability.
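
To see why rectifiable variability matters, consider a minimal sketch that assumes measurement error is independent of true biological variability (so the variances add) and uses the standard normal-approximation formula for comparing two means. The standard deviations and the 5-unit treatment effect are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_arm(sd, delta, alpha=0.05, power=0.80):
    """Patients per arm to detect a mean difference `delta` with total SD `sd`."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return ceil(2 * (z * sd / delta) ** 2)


def total_sd(biological_sd, measurement_sd):
    """Combine independent sources of variability (variances add)."""
    return sqrt(biological_sd ** 2 + measurement_sd ** 2)


DELTA = 5.0           # assumed true treatment effect
BIOLOGICAL_SD = 10.0  # assumed patient-to-patient variability

# Hypothetical measurement SDs: local readers vs. a trained core lab.
n_local_readers = n_per_arm(total_sd(BIOLOGICAL_SD, 10.0), DELTA)
n_core_lab = n_per_arm(total_sd(BIOLOGICAL_SD, 5.0), DELTA)

print(f"patients per arm with local readers: {n_local_readers}")
print(f"patients per arm with a core lab:    {n_core_lab}")
```

Because sample size scales with the variance, halving the measurement error in this example cuts the required enrollment by more than a third without changing the drug or the patients.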

The fifth (and I would like to say the final, but I can’t because the ways a poor endpoint can lead you down the wrong path are innumerable) way the endpoint can lead to a negative study is wrong timing. This is sometimes underappreciated, but for every drug and study there is a point in time, or a period of time, during which you see the maximal effect. This is similar to the maximally effective dose. This is the time when the gap between the active and the control arm is at its greatest. If you measure before that, you may not have allowed enough time for the drug to exert its full effect. If you wait too long, the placebo group might catch up (if the disease is self-resolving, such as a cold), the drug effects may become attenuated, or patient dropouts may dilute the effect size.
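
The self-resolving case can be sketched with a toy model. It assumes, hypothetically, that patients recover at a constant daily rate (10% per day on placebo, 20% per day on drug), so the fraction recovered follows an exponential curve in each arm. Real recovery curves will look different, but the shape of the argument is the same.

```python
from math import exp


def fraction_recovered(rate_per_day, day):
    """Fraction of patients recovered by `day` under a constant recovery rate."""
    return 1.0 - exp(-rate_per_day * day)


def gap(day, control_rate=0.10, drug_rate=0.20):
    """Difference in fraction recovered between the drug and control arms."""
    return fraction_recovered(drug_rate, day) - fraction_recovered(control_rate, day)


days = range(1, 29)            # candidate assessment days in a 4-week study
best_day = max(days, key=gap)  # day on which the arms are furthest apart

for day in (2, best_day, 14, 28):
    print(f"day {day:2d}: gap between arms = {gap(day):.1%}")
```

In this toy model the gap peaks around the end of the first week and shrinks to less than a quarter of its peak by day 28, when nearly everyone in both arms has recovered. An endpoint assessed at day 28 would need a far larger study to detect the same drug.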

The Right Dose

The second critical parameter in clinical study design is dosing. By dosing, I mean all the characteristics of the dose, including the amount of an intervention administered, the route of administration (e.g., oral, IV, SC), the dosing interval, and the rate and duration of administration.

One of the most common errors in dose selection is simply selecting a dose that is too high or too low. Unfortunately, it often happens that the previous studies simply were not large or thorough enough to collect enough data to select the ideal dose.

A less common but still frequently seen error is using a dosing regimen that is too undifferentiated, such as a flat dose. When great heterogeneity in patient response or a narrow therapeutic window exists, you may have to customize the dose:

  • Dose by baseline characteristic:  Customizing doses by baseline physiological factors is the most common method.  For example, if you find that drug response varies by patient weight, you may have to give heavier patients higher doses than lighter patients.
  • Titrate to an endpoint:  An alternative method is to choose a relevant clinical endpoint (i.e., outcome) and adjust the dose for each patient until the endpoint reaches a certain value (e.g., change the medication dose until a certain blood pressure is achieved or a certain plasma level of the drug is reached).
  • Dose by subpopulation:  Another method is to identify subpopulations that may respond differently to the drug and give each subpopulation a different appropriate dose (e.g., men may receive higher doses than women, or African Americans may require different doses than Latinos).
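
The titration approach can be sketched as a simple loop. Everything here is hypothetical: the `titrate` helper, the dose steps, and the two toy patients whose systolic blood pressure falls linearly with dose. A real titration scheme would also include safety checks, waiting periods between dose changes, and rules for reducing the dose.

```python
def titrate(measure_response, target, start_dose, step, max_dose):
    """Raise the dose stepwise until the response is at or below `target`
    or the next step would exceed `max_dose`."""
    dose = start_dose
    while measure_response(dose) > target and dose + step <= max_dose:
        dose += step
    return dose


# Two hypothetical patients: systolic blood pressure (mmHg) as a function of dose (mg).
def sensitive_patient(dose_mg):
    return 160 - 0.5 * dose_mg


def resistant_patient(dose_mg):
    return 160 - 0.25 * dose_mg


dose_a = titrate(sensitive_patient, target=140, start_dose=10, step=10, max_dose=80)
dose_b = titrate(resistant_patient, target=140, start_dose=10, step=10, max_dose=80)

print(f"sensitive patient reaches target at {dose_a} mg")
print(f"resistant patient reaches target at {dose_b} mg")
```

In this toy example, a flat dose matched to the sensitive patient would leave the resistant patient well short of the target, while a flat dose matched to the resistant patient would give the sensitive patient twice the dose needed. In a heterogeneous population, that is how a flat regimen can erode either efficacy or tolerability.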

The Right Patient

The third critical parameter in clinical study design is patient selection.

One way that poor patient selection can lead to a negative study is via selection of patients who are physiologically incapable of responding to therapy. This seems obvious in theory but sometimes not so obvious in practice. For example, stroke patients typically become symptomatic hours after the initial injury. Patients with sepsis often have multi-organ failure. Patients with irreversible damage and those beyond the point of recovery generally do not respond to therapy.

Similarly, patients who are only mildly sick or symptomatic are not likely to show a substantial benefit. And it is difficult to show a benefit in a population that has only 1% per year risk of experiencing an event. If you’re designing a study to prevent asthma exacerbations, for example, excluding patients with very low risk of experiencing the outcome will be a good idea.
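
A rough calculation shows how much the baseline event rate matters. The sketch below uses the standard normal-approximation formula for comparing two proportions and assumes, hypothetically, a drug that cuts the event rate by 30% in relative terms, tested in a population with a 1% annual risk versus one with a 10% annual risk over a one-year study.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm_events(control_risk, relative_reduction, alpha=0.05, power=0.80):
    """Patients per arm to detect a given relative risk reduction in an event rate."""
    treated_risk = control_risk * (1 - relative_reduction)
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    variance = control_risk * (1 - control_risk) + treated_risk * (1 - treated_risk)
    return ceil(z ** 2 * variance / (control_risk - treated_risk) ** 2)


RELATIVE_REDUCTION = 0.30  # assumed drug effect

n_low_risk = n_per_arm_events(0.01, RELATIVE_REDUCTION)   # 1% annual risk
n_high_risk = n_per_arm_events(0.10, RELATIVE_REDUCTION)  # 10% annual risk

print(f"patients per arm at  1% baseline risk: {n_low_risk}")
print(f"patients per arm at 10% baseline risk: {n_high_risk}")
```

With these assumptions the low-risk population needs more than ten times as many patients per arm to show the same relative benefit, which is why enriching for patients who are likely to have the event is usually worth the extra screening effort.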

Another way patient selection can go awry is to blindly follow the conventional disease categories and definitions. There are many ways to define patient populations and diseases.  It is not always optimal to define the patient population by a previously recognized disease category because disease categories are intellectual constructs. Disease categories are based on a number of possible criteria including:

  • Histological changes: (e.g., Crohn’s disease and ulcerative colitis alter the intestinal lining in different manners)
  • Pathophysiologic mechanisms: (e.g., lack of insulin secretion results in Type I Diabetes while lack of response to insulin results in Type II Diabetes).
  • Causative agent: (e.g., hepatitis A is caused by the hepatitis A virus, asbestosis is caused by asbestos)
  • Physical manifestations: (e.g., rheumatologic conditions are defined by the joints they affect and how they affect them)
  • Symptoms and signs: (e.g., stable angina is the presence of chest pain during exertion and unstable angina is the presence of chest pain at rest)
  • Body part, organ, or organ system affected: (e.g., iritis is inflammation of the iris and uveitis is inflammation of the uvea)
  • Pre-disposing, preceding, or concurrent conditions: (e.g., concussions occur after head trauma, frost-bite occurs with extreme cold)
  • Prognosis and natural history (e.g., cancerous masses can spread to distant locations while benign masses do not)
  • Measurement thresholds: (e.g., hypertension has conventionally been defined as a systolic blood pressure of 140 mmHg or higher or a diastolic pressure of 90 mmHg or higher).
  • Response to treatment:  Two very similar conditions may be split into separate diseases if they have different treatments (e.g., distinguishing between ST elevation myocardial infarction and non-ST elevation myocardial infarction was not necessary before thrombolytics were shown to be effective in one but not the other; since that discovery, one disease has become two).  Conversely, several conditions with very different mechanisms and clinical manifestations may be defined as a single disease if their treatments are the same (e.g., schizophrenia has manifestations that range from catatonic to paranoid schizophrenia, and these conditions all fall under the umbrella of one disease partly because they respond to similar therapy).

Of the above, the most common way to classify disease is by response to therapy.  In fact, you may classify several distinct conditions with very different pathophysiology and clinical manifestations as the same disease if the treatment is the same or, more commonly, if no good treatments exist for any of the conditions.  Classifying diseases based on the available treatment options is often more pragmatic.

So, it is important to think carefully about the “disease” that a clinical trial is targeting.  You may have to expand the trial beyond the normal confines of a disease, for example by “lumping” together all patients with arterial atherosclerosis (including patients requiring coronary artery bypass surgery, percutaneous coronary interventions, and stroke interventions), or limit the trial to a subgroup of patients, such as those with classic or occult age-related macular degeneration.

As a general rule, if the benefit to the patients can be captured in a common endpoint, it may be possible to lump the patients together. On the other hand, if you find it difficult to define a common endpoint that would capture the benefit for all members of the target population, you may be on the wrong track.