
Saturday, May 28, 2016

Risky Business: Framing Statistics

The Gist: Medical literature frequently frames effect size using relative and absolute risks in ways that intentionally alter the appearance of an intervention, an example of framing bias. Effect sizes seem larger using relative risk and smaller using absolute risk [1,2].

Part of our 15 Minutes - 'Stats are the Dark Force?' residency lecture series
Risk from Lauren Westafer on Vimeo.

Perneger and colleagues surveyed physicians and patients about the efficacy of 4 new drugs. They were presented with the following scenarios:
While these numbers reflect the same data, both patients and physicians selected the drug that "reduced mortality by 1/3" as "clearly better" [3].  This indicates that both parties are susceptible to the cognitive tricks of statistics, particularly the way relative risk makes numbers appear larger and absolute risk makes numbers appear smaller.  Authors can switch between relative and absolute risk within a single report to maximize the appearance of benefit and minimize the appearance of risk, as seen in the following abstract:

Perhaps authors should be encouraged to report statistics consistently, using relative or absolute risk throughout, rather than switching between the two to give the appearance of maximum benefit and minimal risk.
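Both framings can be computed from the same two event rates. A minimal sketch in Python, using hypothetical mortality rates (illustrative numbers only, not figures from the studies cited here):

```python
# Hypothetical trial: mortality 4.5% with placebo vs 3.0% with the drug
# (made-up numbers for illustration, not from the cited studies).
control_risk = 0.045
treatment_risk = 0.030

rrr = (control_risk - treatment_risk) / control_risk  # relative risk reduction
arr = control_risk - treatment_risk                   # absolute risk reduction
nnt = 1 / arr                                         # number needed to treat

print(f"RRR: {rrr:.0%}")  # the "reduces mortality by a third" framing
print(f"ARR: {arr:.1%}")  # the "1.5 percentage points" framing
print(f"NNT: {nnt:.0f}")  # treat roughly 67 patients to prevent one death
```

The same trial can honestly be described as a 33% reduction or as a 1.5 percentage-point reduction; only the framing differs.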

References:
1. Barratt A. Tips for learners of evidence-based medicine: 1. Relative risk reduction, absolute risk reduction and number needed to treat. CMAJ. 2004;171(4):353-358.
2. Malenka DJ, Baron JA, Johansen S, Wahrenberger JW, Ross JM. The framing effect of relative and absolute risk. J Gen Intern Med. 1993;8(10):543-548.
3. Perneger TV, Agoritsas T. Doctors and patients' susceptibility to framing bias: a randomized trial. J Gen Intern Med. 2011;26(12):1411-1417.

Saturday, May 21, 2016

P Values: Everything You Know Is Wrong

The Gist:  P values may be the statistic clinicians feel they “understand” best, yet they are widely misunderstood.  P values should not be used alone to accept or reject something as “truth”; rather, they may be thought of as representing the strength of evidence against the null hypothesis [1].

Part of our 15 Minutes - 'Stats are the Dark Force?' residency lecture series

P: Everything You Know Is Wrong from Lauren Westafer on Vimeo.


At various times in my life I, like many others, have believed the p value to represent one of the following (none of which are true):
    • Significance
      • Problem:  Significance is a loaded term.  A value of 0.05 has become synonymous with “statistical significance.”  Yet, this value is not magical and was chosen predominantly for convenience [3].  Further, the term “significant” may be confused with clinical importance, something a statistic cannot answer.
    • The probability that the null hypothesis is true.
      • Problem: The calculation of the p value includes the assumption that the null hypothesis is true.  Thus, a calculation that assumes the null hypothesis is true cannot, in fact, tell you that the null hypothesis is false.
    • The probability of getting a Type I Error 
      • Background: A Type I Error is the incorrect rejection of a true null hypothesis (i.e. a false positive), and the probability of a Type I Error is represented by alpha. Alpha is often set at 0.05, accepting a 5% chance of rejecting the null hypothesis when it is actually true. This is a PRE-test value (set before the experiment).
      • Problem: Again, the calculation of the p value assumes the null hypothesis is true. The p value only tells us the probability of getting the data we did, it does NOT speak to the underlying truth of whatever is being tested (i.e. efficacy). The p value is also a POST test calculation.
      • The error rate associated with various p values varies depending on the assumptions in the calculations, particularly the prior probability that a real effect exists. Still, it is interesting to look at some estimates of the false positive error rates often associated with various p values:  p=0.05 - false positive error rate of 23-50%; p=0.01 - false positive error rate of 7-15% [5].
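One way to see where estimates like these come from is the calibration in reference [5], which bounds the minimum Bayes factor in favor of the null by -e·p·ln(p). A sketch, assuming 50:50 prior odds between the null and the alternative (that prior is an assumption of this example):

```python
from math import e, log

def min_false_positive_rate(p):
    """Sellke-Bayarri-Berger calibration: a lower bound on the false
    positive rate implied by a p value (valid for p < 1/e), assuming
    50:50 prior odds on the null hypothesis."""
    min_bayes_factor = -e * p * log(p)  # lower bound on evidence for the null
    return min_bayes_factor / (1 + min_bayes_factor)

print(f"{min_false_positive_rate(0.05):.0%}")  # ~29%, within the 23-50% range
print(f"{min_false_positive_rate(0.01):.0%}")  # ~11%, within the 7-15% range
```

Even a "significant" p=0.05 can correspond to a false positive roughly a quarter of the time under these assumptions.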
The p value is the probability of getting results as extreme as, or more extreme than, those observed, assuming the null hypothesis is true. Originally, this statistic was intended as a gauge to help researchers decide whether or not a finding was worth investigating further [3].
  • High P value - data are likely with a true null hypothesis [Weak evidence against the null hypothesis]
  • Low P value - data are UNlikely with a true null hypothesis [Stronger evidence against the null hypothesis]
Example:  A group is interested in evaluating needle decompression of tension pneumothorax and proposes the following:
    • Hypothesis - Longer angiocatheters are more effective than shorter catheters in decompression of tension pneumothorax.
    • Null hypothesis - There is no difference in effective decompression of tension pneumothorax using longer or shorter angiocatheters.
Aho and colleagues performed this study and found a p value of 0.01 comparing 8 cm catheters with 5 cm catheters.  How do we interpret this p value?
  • If the null hypothesis were true, we would expect the same number of effective decompressions or more in 1% of cases due to random sampling error alone.
  • The data are UNLIKELY with a true null hypothesis and this is decent strength evidence against the null hypothesis.
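The logic of "how likely are these data under a true null hypothesis?" can be made concrete with a permutation test. A sketch with made-up decompression counts (hypothetical numbers for illustration, not the actual data from Aho and colleagues):

```python
import random

# Hypothetical counts for illustration only (not Aho et al.'s data):
# 8 cm catheters: 28/30 effective decompressions; 5 cm: 20/30.
long_cath = [1] * 28 + [0] * 2
short_cath = [1] * 20 + [0] * 10
observed_diff = sum(long_cath) / 30 - sum(short_cath) / 30

# Under the null hypothesis, catheter length doesn't matter, so shuffling
# the group labels should produce differences as big as ours fairly often.
random.seed(0)
pooled = long_cath + short_cath
extreme = 0
n_iter = 20_000
for _ in range(n_iter):
    random.shuffle(pooled)
    diff = sum(pooled[:30]) / 30 - sum(pooled[30:]) / 30
    if diff >= observed_diff:
        extreme += 1

p_value = extreme / n_iter  # fraction of "null worlds" at least this extreme
```

A small p value here says only that the observed split would be rare if the group labels were interchangeable; it says nothing about how often longer catheters truly work.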
Limitations of the Letter “P”
    • Reliability.  P values depend on the statistical power of a study. A small study with little statistical power may have a p value greater than 0.05 and a large study may reveal that a trivial effect has statistical significance [2,4].  Thus, even if we are testing the same question, the p value may be "significant" or "nonsignificant" depending on the sample size.
    • P-hacking.  Definition:  "Exploiting –perhaps unconsciously - researcher degrees of freedom until p<.05" Alternatively: "Manipulation of statistics such that the desired outcome assumes "statistical significance", usually for the benefit of the study's sponsors" [7].
      • A study of biomedical abstracts from 1990-2015 showed that, of abstracts reporting p values, 96% contained at least 1 p value < 0.05 [6].  Are we that wildly successful in research? Are statistically nonsignificant results published less frequently (probably)? Or do we try to find something in the data to report as significant, i.e. p-hack (likely)?
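The sample-size dependence described under "Reliability" is easy to demonstrate: identical event rates yield wildly different p values at different sample sizes. A sketch using a standard two-proportion z-test (normal approximation):

```python
from math import sqrt
from statistics import NormalDist

def two_prop_p(events_a, n_a, events_b, n_b):
    """Two-sided p value for a two-proportion z-test (normal approximation)."""
    p_a, p_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = abs(p_a - p_b) / se
    return 2 * (1 - NormalDist().cdf(z))

# The same 12% vs 10% event rates at two sample sizes:
small = two_prop_p(12, 100, 10, 100)          # ~0.65: "nonsignificant"
large = two_prop_p(1200, 10000, 1000, 10000)  # <0.001: "significant"
```

Same effect size, same question; only the sample size moved the p value across the 0.05 line.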
P values are neither good nor bad. They serve a role that we have distorted and, according to the American Statistical Association: The widespread use of “statistical significance” (generally interpreted as “p ≤ 0.05”) as a license for making a claim of a scientific finding (or implied truth) leads to considerable distortion of the scientific process [1].  In sum, acknowledge what the p value is and is not and, by all means, do not p alone.

References:
  1. Wasserstein RL, Lazar NA. The ASA’s statement on p-values: context, process, and purpose. Am Stat. 2016;70(2):129-133. doi:10.1080/00031305.2016.1154108.
  2. Goodman S. (2008) A dirty dozen: twelve p-value misconceptions. Seminars in hematology, 45(3), 135-40. PMID: 18582619 
  3. Fisher RA. Statistical Methods for Research Workers. Edinburgh, United Kingdom: Oliver&Boyd; 1925.
  4. Sainani KL. Putting P values in perspective. PM R. 2009;1(9):873–7. doi:10.1016/j.pmrj.2009.07.003.
  5. Sellke T, Bayarri MJ, Berger JO.  Calibration of p Values for Testing Precise Null Hypotheses.  The American Statistician, February 2001, Vol. 55, No. 1
  6. Chavalarias D, Wallach JD, Li AHT, Ioannidis JPA. Evolution of Reporting P Values in the Biomedical Literature, 1990-2015. Jama. 2016;315(11):1141. doi:10.1001/jama.2016.1952.
  7. PProf.  "P-Hacking."  Urban Dictionary. Accessed May 1, 2016.  Available at: http://www.urbandictionary.com/define.php?term=p-hacking

Thursday, January 17, 2013

Would You Rather....

The Gist:  Composite endpoints, a sneaky way to skew data, are common in medical literature.  These outcomes often represent disease states across a spectrum, making it difficult to parse out what the intervention may mean for a given patient.  Check out this article by Peter Kleist, the Jan 2013 mini-JC with Dr. David Newman (if you're an EMRAP subscriber), and SMARTEM (particularly on coronary CT or stress testing). 

Composite endpoint:  multiple single end-points that are combined into a 'single outcome.'

FOAM resources such as SMARTEM, EM Lit of Note, and Twitter discussions can provide a cognitive toolkit and a skeptical lens to look beyond abstracts and evaluate papers and assertions critically.  Recently, however, I was privy to a debate on PCI versus medical management for coronary artery disease and realized that a lack of understanding about composite endpoints abounds.  Dr. Newman showed me that death, MI, and cardiac catheterization aren't the same (silly as it sounds now), so I feel inclined to spread this wisdom.

Understanding the problem with composite endpoints reminds me of a game my rugby teammates and I played when we traveled for games, "Would You Rather..."  The object is to pose a question with two very similar, difficult choices that often expose weaknesses or fears.  Questions are often gross or entertaining (and many are juvenile).  They are along the lines of:  "Would you rather travel a thousand years back and meet your ancestors or travel a thousand years forward and meet your grandchildren?" or "Would you rather sleep in a bed of cockroaches or a bed of maggots?"  This game would lose all appeal if the dilemma had a clear answer, "Would you rather get a cut on your arm or have your leg sawed in half?"  When this perspective is applied to the components of a composite endpoint, the weakness is exposed.  Like the last question, they're not equal.


The Debated Paper:  FAME 2 Trial
  • Looked at PCI versus medical management in patients who had a stenosis of at least one major coronary artery that rendered a fractional flow reserve of <0.80.
  • Primary end-point: composite death from any cause, nonfatal MI, or unplanned hospitalization leading to urgent revascularization during the first 2 years (the triple composite endpoint).  
    • Urgent revascularization:  admission to the hospital with persistent or increasing chest pain, with revascularization performed during that hospitalization.
  • Results:  more patients in the medical management group had a primary end-point compared with the PCI group, 12.7% vs 4.2%.  That's an 8.5% absolute reduction in the PCI group.  
    • The rate of death did not differ between the groups (0.2%, n=1 in PCI group; 0.7% n=3 in medical management group)
    • The rate of MI did not differ between the groups. (3.4%, n=15 in PCI group; .3% n=14 in medical management group)
    • The rate of urgent revascularization was significantly less in the PCI group at 1.6% (n=7) compared with the medical management group at 11.1% (n=49).
  • Where did the 8.5% reduction in primary end-point come from?  Revascularization, a subjective outcome.  Who gets revascularized?
    • Whomever the cardiologists choose, and this may differ between cardiologists.  It's not necessarily rooted in disease. 
    • When these patients come into the ED, is a physician more likely to take a patient to the cath lab who just had a cath?  Or, is a physician more likely to take a patient who has been trying medical management?
  • Are all of these outcomes the same, or do they represent different portions of a spectrum? I see one firm endpoint in this composite (death), and then a spectrum.  Furthermore, while MIs are unfavorable, there are varying degrees of badness even within this singular end-point.
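One way to pick the composite apart is to compute the absolute risk reduction for each component separately, using the event rates quoted above (medical management vs PCI):

```python
# Event rates reported above, as (medical management, PCI):
rates = {
    "composite endpoint":       (0.127, 0.042),
    "death":                    (0.007, 0.002),
    "urgent revascularization": (0.111, 0.016),
}

for endpoint, (medical, pci) in rates.items():
    arr = medical - pci  # absolute risk reduction with PCI
    print(f"{endpoint}: ARR {arr:.1%}, NNT {1 / arr:.0f}")

# Urgent revascularization (ARR 9.5%) accounts for essentially the entire
# composite difference (ARR 8.5%); death contributes an ARR of only 0.5%.
```

Laid out this way, the composite's headline benefit is visibly driven by the softest, most subjective component.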
That's one article.  What about composite endpoints across the literature?
  • Cordoba et al conducted a systematic review of RCTs published in 2008 that used a composite endpoint as a primary outcome (search via PubMed). 
    • The definition of the endpoint changed between the abstract, methods, and results section in 33% (n=13) of articles
    • 70% (n=28) of trials had a primary composite endpoint comprised of components of dissimilar significance (ex. death and hospital admission).
  • A meta-analysis of cardiovascular therapeutic interventions from 2002-2003 demonstrated that 32% (n=37) of studies did not report the effects of the individual components of the endpoints, obscuring the impact of components of different clinical significance. 
  • Freemantle et al (full text) looked at trials with mortality as part of a composite primary outcome in 9 high impact journals and found that only 11% (n=19) trials had both significant composite and mortality outcomes.  The below figure, demonstrates the difference in treatment effects using the composite and component portions in several trials of glycoprotein IIb/IIIa inhibitors in ACS.   
 
So, are composite endpoints rooted in malevolence?
  • Composite endpoints may allow a trial to enroll fewer patients, thereby decreasing cost and increasing feasibility.  When attempting to measure rare events, it can be difficult to power a study so that an intervention can achieve statistical significance. To borrow an example from Tomlinson et al: if an outcome is expected to occur at a 5% annual rate and the trial is planned to last five years, more than 2,500 patients are needed to establish a hazard ratio of 0.75 with p<0.05.
    • We need and like studies to evaluate interventions and cost and feasibility are important so it's likely that there's some sort of trade-off here. 
  • Some interventions may have a few important outcomes.  For example, some people argue that when evaluating a cardiovascular medication, looking at cardiovascular events such as deaths due to cardiac events, non-fatal strokes, and non-fatal MIs is theoretically reasonable.
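The Tomlinson et al figure can be roughly reproduced with Schoenfeld's approximation for the number of events a survival trial needs. This sketch assumes 90% power, exponential event times, and 1:1 allocation (all assumptions of this example, not stated in the original):

```python
from math import ceil, exp, log
from statistics import NormalDist

def patients_needed(annual_rate, years, hr, alpha=0.05, power=0.90):
    """Rough total sample size via Schoenfeld's event formula, assuming
    exponential event times and 1:1 allocation."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    events = 4 * z**2 / log(hr) ** 2  # total events required
    # Probability that a patient has an event during follow-up, per arm:
    p_control = 1 - exp(-annual_rate * years)
    p_treated = 1 - exp(-annual_rate * hr * years)
    return ceil(events / ((p_control + p_treated) / 2))

print(patients_needed(0.05, 5, 0.75))  # on the order of the ~2,500 quoted above
```

Bundling several outcomes into one composite raises the event rate, which is exactly what shrinks this number.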
So, then, what do we do with these composite endpoints?
  • Pick them apart.  The FOAM website, theNNT has the following philosophy on composite endpoints:  "Generally speaking there is no need to use composite endpoints. In many cases the use of composite endpoints (e.g. ‘death, MI, or revascularization’ in studies of coronary treatments) obfuscates the patient-oriented utility of an intervention (Tomlinson, 2010). In addition, composite endpoints are very often made up of components with considerably varying patient-interest. At TheNNT our sense is that separating these endpoints into separate outcomes helps to clarify the degree to which an intervention may be beneficial based on the value system of an individual patient. We therefore separate all composite endpoints into their constituent parts for NNT calculations."
  • Be alert for these in the literature and ask some key questions:
    • Does the composite endpoint really measure a disease?
    • Do the components of the composite have the same clinical significance (or do they cover a broad spectrum of disease)?
    • Does the composite endpoint camouflage a single negative outcome bundled in the composite?
    • Are the individual components of the composite endpoint valid? Are they of importance for patients?
    • Are the results clinically meaningful? Do they provide a basis for therapeutic decisions? Does each single endpoint support the overall result?
  •  Educate those around you.
Update:  Stolker JM et al. Re-Thinking Composite Endpoints in Clinical Trials: Insights from Patients and Trialists. Circulation. 2014. PMID: 25200210

  • Composite endpoints are commonplace, especially in the cardiology literature. It takes massive power to find mortality/major morbidity benefits for many interventions; thus, many studies are powered for a primary composite outcome, often: death, myocardial infarction (MI), and revascularization. These cardiology survey data highlight that both patients and trial researchers appreciate the inequity between death and revascularization. The shocker? Patients rated MI and stroke worse than death, whereas researchers rated MI and stroke as 1/3 to 1/2 as important as death. Both clinical trialists and patients rated revascularization as a minor event, in contradistinction to the equal weight placed on it in the composite primary outcome of many trials.