2022 AOSSM Annual Meeting Recordings with CME
How to Review a Large Database Study
Video Transcription
All right, well, thank you. I appreciate the invitation to speak again. I look back, and this is the seventh one, and I really enjoy passing on some of these skills to you. I thank Dr. Reider for the invitation and thank Donna for her assistance and help. My conflicts of interest are disclosed here.

So, the objectives for today: we'll review a definition of a large database study and then look at some of the strengths and limitations of these studies. We'll consider the influence of the numerators that authors choose to study and the denominators that authors choose to study. And we'll briefly explore some of the statistical issues, including the confidence intervals of rates and the simultaneous testing of multiple hypotheses.

So, the definition I like I chose from this review article published in 2016. I think this most closely parallels my idea of a large database study: a study of any large collection of data that encompasses records at a state, multi-state, or national level, reporting on more than one procedure or subspecialty. And this is markedly different from registries obtaining detailed information specific to a diagnosis or procedure, which in turn is markedly different from a prospective cohort study with a hypothesis and prospective longitudinal follow-up. Some common examples are listed here. These are all based on administrative, billing, or discharge records. That same article had this figure on the publication rate per journal by year. And of note, the American Journal of Sports Medicine is this light green one near the bottom, but you can still see there's an uptick and an increase in these publications.

The strengths of a large database study, you probably know: the quantity of data, right? The number of records. You can report on rare patient groups or complications that would be impossible to adequately power at a single institution. I'll show you an example of that. You can provide generalizable conclusions across multiple regions. So here's an example from the American Journal of Sports Medicine published in 2015, the incidence of manipulation under anesthesia or lysis of adhesions after arthroscopic knee surgery. And the statistical power of 330,000 knee arthroscopies is undeniable. It would take me about 1,000 years to collect all of this data. But you can see some of the valuable information you can get about the relative rates of patients undergoing these procedures following different knee arthroscopy procedures.

Limitations: the quality of the data. Again, these administrative or billing records were not intended for research or clinical purposes. One example that I like to give is, who remembers from ICD-9 the code for osteochondritis dissecans? No one? 732.7? I mean, this is like my favorite code. It's not fair. But I used it for a lot of things, like any focal cartilage defect; I couldn't find a better code, so I just used 732.7. It unlocked my iPhone and my house in Nashville as part of the alarm system. So I would use that code a lot. And I never thought it would be used for research, but if someone looked at that and took it as the incidence of OCD, they would think there was an epidemic of cases in Nashville around 2006 to 2011, when I lived there. So you just have to remember, these weren't intended for research. Some of the databases don't have any follow-up outside of an inpatient stay or outside of 30 days after a hospital discharge. And some of the databases do not record laterality.
So you have to look at that critically. The distance from the patients is what limits the details. So what you can really get out of these large database studies are counts: counts of diagnoses or counts of procedures. That's what you get, just counts. In contrast, on the other side of the spectrum, consider this prospective cohort study where the authors knew the particulars of the surgery and the details of the rehabilitation. They had direct contact with the patients to collect patient-oriented outcome measures, objective measurements, as you can see there, and radiographs. Your large database studies are not going to have goniometer measurements.

So with these counts, it's critical for us to understand numerators and denominators again. In mathematics, you know the numerator is the number above the line in a fraction, showing how many of the pieces indicated by the denominator are taken. In epidemiology, when we say the numerator, it usually reflects disease-specific variables, and the denominator reflects population size or time or both.

How does this apply to counts in database studies? Well, a rate is defined as a count in the numerator divided by some other measure, usually time-based, in the denominator. Rates have units. A rate can have any value from zero to infinity. A specific rate, the incidence rate, is defined as the number of new cases of a condition that occur during a specific period of time in a population at risk for developing the condition. So it's really important to think about that: the population at risk for developing the condition. And a common denominator is person-years, the sum of the units of time that each individual is at risk and observed.

Rates and proportions are not synonymous. A proportion is defined as a count divided by another count of the same units. A proportion is a ratio. Proportions have no units. This unitless measure varies from zero to one. Prevalence is a specific proportion, defined as the number of affected people in the population at a specific time point divided by the total number of people in the population at that time. The following slides come from this epidemiology text by Dr. Gordis at Hopkins. And you can see prevalence is depicted as this glass jar filled with marbles. Incidence is the rate of marbles going into the jar. And the way out of the jar is deaths and cures. I offer both to my patients.

Accurately classifying is the step before counting, and both of these require high-quality data. We're familiar with the International Classification of Diseases, the ICD and the ICD-CM; they're the most widely used classification schemes for coding morbid conditions. But take a look at how you can get a different answer for the prevalence of dementia depending on whether DSM criteria or one of the ICD criteria were used. It really varies a lot. That's something you really don't know in these large database studies: what was used to classify the data at each site when it was entered? The distance from the data is the problem. Here's another example, rheumatoid arthritis, using the New York criteria or the ARA criteria. You get a really different answer about what the prevalence of rheumatoid arthritis is. In orthopedics, you might think, well, we're exempt from that. Like an ACL. I mean, how many ways can you classify an ACL? It's a binary thing, a zero or a one. Or, yeah, there are a few partial ACLs.
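To make the rate-versus-proportion distinction above concrete, here is a minimal sketch in Python with made-up counts: an incidence rate is a count divided by person-time and so carries units, while prevalence is a count divided by a count, a unitless proportion between zero and one. All numbers and variable names are hypothetical.

    # Hypothetical numbers only, illustrating the definitions above.
    new_cases = 40          # new diagnoses during the observation period (numerator)
    person_years = 25_000   # sum of time each at-risk person was observed (denominator)

    incidence_rate = new_cases / person_years   # has units: cases per person-year
    print(f"Incidence rate: {incidence_rate * 10_000:.1f} per 10,000 person-years")

    affected_now = 130      # people with the condition at a single time point
    population_now = 50_000 # total population at that same time point

    prevalence = affected_now / population_now  # unitless, always between 0 and 1
    print(f"Prevalence: {prevalence:.4f} ({prevalence * 100:.2f}%)")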
But it's actually been looked at by Dr. Krych at the Mayo Clinic: the accuracy of diagnostic codes for ACL tears. His team went back to the medical record and tried to verify every case. And you may remember that the code for cruciate ligament injury in this time period covered ACL and PCL. So you would think, yeah, there are going to be some problems there, and you can see isolated PCL and combined ACL-PCL together make up about 2% of the ones classified as a cruciate tear. But 31% had no cruciate ligament injury at all. What they had was a contusion about the knee or a knee sprain, an isolated meniscus tear, an isolated collateral ligament injury. You can see here what they really had. So two-thirds of them were accurate ACLs and about a third of them were not. So we're not really exempt from this.

Now I want to go on to a little more interactive part of this. There's a lot of talk in my house about driving. My son just got his driver's license last week and had to do like 60 hours in the car. So I had a chance to talk to him about some of the stuff I learned from George Carlin about driving. Have you ever noticed when you're driving that anyone driving slower than you is an idiot and anyone driving faster than you is a maniac? And there are parallels to this, like with plica excisions. I mean, think about how many plica excisions you think you did last year, and if you heard someone did like 10 times that many, you'd be like, whoa, you're scoping for dollars, huh? You're a maniac. But if someone did zero, you'd say, oh, they're missing it. We all have this sense of the proper speed to drive and our thoughts on driving. But I want to add a little more statistics to it.

So you're at your son's baseball game. It goes into extra innings. Your daughter is at the game with you and needs to arrive at a musical rehearsal very soon. Two trustworthy families are in the same situation and simultaneously offer to drive your daughter to the rehearsal. You have option one, the mother from one family, or option two, the grandmother from the other family. Who do you choose? They both seem fine. But I had some concerns about the parent maybe being exhausted or distracted with technology, and some concerns about the grandparent maybe having some physical, cognitive, or visual impairments. So I want to look at the data.

So in the numerator, you've got the count of fatal crashes, and in the denominator, estimated person-years. And if you look at middle-aged people, they had 2.0 fatal crash events per 10,000 person-years. And elderly people had 2.2 fatal crashes per 10,000 person-years. Does anyone see any problems with looking at it this way, as fatal crashes per person-years? Well, yeah, go ahead, Arvind. Maybe the amount they drive would be different? The amount they drive is different, yeah. And do you think all the people age 75 years and older are even drivers? So I mean, that's really it: it's a flawed denominator. This chose the wrong denominator, because you can't be a driver at risk for a fatal crash if you're not a driver. And this denominator includes lots of people who might not be driving. They might never have driven, or they might have stopped driving for health reasons, but they're included in this denominator. So this is flawed. So I chose to change the denominator first to driver-years. And then you can see, if you look at licensed drivers, you have 2.0 fatal crash events per 10,000 licensed driver-years.
So it didn't really change for the middle-aged drivers, because all of those people were licensed drivers. But for elderly drivers, it did change. It went up to 3.5 fatal crash events per 10,000 licensed driver-years. But then the next point you brought up is that time spent living is not always time spent driving. Right? A lot of these people might be drivers, but they're not driving much. Maybe the elderly are not driving as much; they're just driving to the store once or twice a week or to doctor's appointments, and the middle-aged drivers are driving more. So what could you change the denominator to, to reflect that? Because you don't really know exactly. Hours driven per year? Yeah, that would be perfect. If you knew hours driven per year, that would be perfect. The airline industry does it with passenger miles; they use miles as a surrogate for time, since time spent in the air covers a certain number of miles. And in this sort of crash data, they do driver miles. So the numerator is still the count of fatal crashes, and the denominator is estimated driver miles. You can see it really did change: 2.0 fatal crash events versus 11.5 fatal crash events per 100 million driver miles.

So I feel good about the denominator. Does anyone have any concerns about the numerator? Do you think fatal crash events is looking at the right thing? These are defined as the driver dies in a car accident. That's just motor vehicle accidents? Yeah. I think it would be better if you had motor vehicle collisions instead of fatal crash events, because this also reflects the fragility of the driver and how survivable the crash is for the driver. And it actually has been looked at: why have fatality rates among older drivers declined? The relative contributions of changes in survivability and in crash involvement. And it was found that the main factor contributing to a higher incidence of fatal crashes in elderly drivers was not their higher involvement in crashes, but their lower survivability in crashes. So I was looking at the wrong thing all along. So it's really tricky. You can see you had counts, it was high-quality data, but how you chose what went in the numerator and what went in the denominator really changed your decisions. So I think they both were safe, the mother and the grandmother. And the daughter makes it to the event fine.

When you're working with counts and rates, there are some statistical considerations. A count, again, is just a number of events that occur. A rate is the count divided by a denominator, which accounts for population size or time interval or both. Counts and rates must be greater than or equal to zero. Negative rates, in this setting, don't make sense. So special considerations need to be made when calculating the confidence intervals of rates. In order to take this constraint on rates into account, it's preferable to work in the log domain, derive a confidence interval for the log rate, and then take the antilog. This is actually pretty simple for a statistician to do. It's suitable and appropriate to calculate confidence intervals that are not negative.

How about chance? We had a whole workshop on chance before, right? We passed out the pennies and flipped them. Fisher popularized the alpha standard of P less than 0.05, and this cutoff has been sanctified by many years of use. Of course, he'd want it to be zero if possible.
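Here is a minimal sketch in Python of the log-domain confidence interval for a rate described above. It assumes the event count behaves approximately like a Poisson count, so the standard error of the log rate is roughly one over the square root of the count; the crash count and driver-miles below are invented (chosen only so the point estimate matches the 11.5 per 100 million driver-miles quoted in the talk), not the actual study data.

    import math

    # Hypothetical counts, not the actual crash-study data.
    events = 23            # observed fatal crashes (numerator)
    exposure = 2.0e8       # estimated driver-miles (denominator)

    rate = events / exposure                  # events per driver-mile; cannot be negative

    # Work in the log domain: normal-approximation interval for log(rate),
    # then take the antilog so the bounds stay non-negative.
    se_log_rate = 1.0 / math.sqrt(events)     # approximate Poisson standard error of log(rate)
    z = 1.96                                  # two-sided 95%

    lower = math.exp(math.log(rate) - z * se_log_rate)
    upper = math.exp(math.log(rate) + z * se_log_rate)

    per = 1.0e8                               # report per 100 million driver-miles
    print(f"Rate: {rate * per:.1f} per 100 million driver-miles")
    print(f"Approximate 95% CI: {lower * per:.1f} to {upper * per:.1f}")

Because the interval is built on the log scale and then exponentiated back, the lower bound can approach zero but never go negative, which is exactly the constraint on rates the speaker describes.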
But what if you do a large collection of tests, perhaps as part of one of these large database studies? How do you account for that? At some level, if the researcher does not take the multiplicity of testing into account, then the probability that some true null hypotheses are rejected by chance alone may be unduly large. If you test 20 independent hypotheses, the probability is 0.64, more likely than not, that at least one hypothesis will be statistically significant by chance alone.

So we talked previously about some ways to correct for this. One way out of the dilemma is to adjust the p-values or the thresholds for interpretation. And that's the Bonferroni approach, named after the Italian mathematician. It divides alpha by the number of hypotheses tested. I just want to present this to you, along with a couple of modifications, because they're way easier than I thought they would be, and I think when you see them, you'll say, oh, that's actually pretty easy. In a study with 20 hypotheses, the level of significance would be P less than 0.0025; just take 0.05 and divide it by 20. And following this correction, if the alpha is set at 0.0025 and you test 20 independent simultaneous hypotheses, then the probability is about 0.05, just the same as usual, that at least one hypothesis will be statistically significant by chance alone.

Now it turns out, even though that's the easiest one to understand, and you can apply it in any simultaneous inference situation, the correction is considered to be a little too strict for many cases, and the price is loss of power, which you might not have to worry about as much in these large database studies. But these two modifications are also pretty easy to do. I mean, you could just read about them and do them within an hour. Holm has his method, a sequentially rejective Bonferroni procedure, slightly more complicated but with an increase in power. It still maintains the overall study error rate of alpha in almost every case. What you do is just rank your p-values, let's say the 20 p-values, from lowest to highest. You take the smallest one and compare it to 0.05 divided by 20, the next smallest one and compare it to 0.05 divided by 19, and so on, until you get a non-significant result, and then all remaining hypotheses are considered non-significant. So it's not that hard to do these things. Hochberg's is very similar; you just rank them in the other order, and a few statistical conditions need to be met in order to use that one. But you will sometimes see those when you're reviewing a manuscript. These are considered the simple Bonferroni modifications, and there are more complex methods that may be more appropriate for certain situations. In many cases, a statistician with expertise in the field may be employed. But it's just important to remember as a critical reviewer that it's appropriate, and pretty simple, to provide suitably adjusted p-values. [A brief sketch of the Bonferroni and Holm procedures appears after the transcript.]

The final thing I want to bring up is the role of chance, prospective or retrospective. Because the role of chance is different prospectively and retrospectively, but the statistical testing is the same. Richard Feynman brought this up nicely. The physicist, and Nobel Laureate for his work in QED, once illustrated this point as follows: You know, the most amazing thing happened to me tonight. I was coming here, on the way to the lecture, and I came in through the parking lot.
You won't believe what happened. I saw a car with the license plate ARW357. Can you imagine that? Of all the millions of license plates in the state, what was the chance I would see that one tonight? Amazing. So retrospectively hunting for, and of course discovering, patterns in a large database study generates skepticism in the mind of the critical reader and reviewer, whereas prospectively generating a research hypothesis and then testing it creates more confidence in the findings. That's the difference between the prospective and retrospective roles of chance.

So when reviewing a large database study, I'm going to talk about three steps. Step one, assess the numerators that the authors chose to study. This parallels the fatal crashes in our analogy. The measurement approach may not be what the investigator would prefer, for example, the presence or absence of the ICD code for osteoarthritis rather than careful classification by the investigator based on radiographs. The critical reviewer knows that accurate classification and counting require high-quality data. Step two, assess the denominators that the authors chose to study. You can see the influence of that in our analogy, looking at driver miles versus driver years and person years. The existing data may have been collected from a population that is not ideal and not stratified satisfactorily, for example, a population of all high school students is used as the denominator because it's easier and available, rather than a population of high school students stratified by sex and by sport. Content expertise and experience may guide the critical reviewer to ask: could denominator choices have influenced these findings? In the same way, with the driving analogy, it was all content expertise and experience that guided us to the right numerators and denominators; we weren't doing any statistics there. Step three, evaluate the statistics used to analyze rates and proportions. If the critical reviewer observes the reporting of negative rates, that's considered a red flag. It's possible and appropriate to obtain suitably adjusted p-values, and it's fine to request that from the authors. That's it. Thank you. Did you want to entertain any questions at this point? Yeah. Anyone have any questions?

I have a question. Sometimes we see an author will say they did a Bonferroni correction, but they seem to be using 0.05 as their value for significance. So are they being disingenuous, or is there some other way to describe a Bonferroni correction where you can still have done it and the value for significance is still 0.05? I've seen that as well, and I think there are a couple of ways to do it. One is to adjust your interpretation of the p-values, which is the way I showed it, where we weren't interpreting a result as significant unless it was under p less than 0.0025, because that's what we corrected our interpretation threshold to. And I've seen some ways where they actually adjust the p-values themselves in light of the other findings and the simultaneous testing. But in those cases, I would be most comfortable checking with the authors: how did you actually calculate the p-values? Did you adjust the p-values, or is this just an adjustment in your interpretation of the p-values? That's a good point. Any other questions?

Here's another one. Now Jim, you've evaluated a large number of large database studies. What catches your eye that makes you say, this is a good one, this is better than most of the run-of-the-mill ones we have coming through here? Yeah.
I think there have been a few that have been influential in changing my practice. Some of the ones with injections close to surgery time. And those are pretty well captured, I think, in some of these large databases, because they can capture the CPT code of the arthroscopy, the 29881, and they can capture the CPT code of the injection. And with infection rates of something like 0.06 percent in some of those studies, it would take me years to figure that out for my own patients. So I like the large database studies that I would not be able to do myself at a single institution, that would be hard to coordinate as a multi-center cohort study, and that have the capacity to change practice with some simple counts of infections and simple counts of complications. Those are sometimes easier to count.

Here's another one. What about the risk of finding things that are statistically significant but not clinically significant because of the size of the data? Oh yeah, that's a huge issue: the magnitude of the findings and what's a clinically relevant difference. These studies will often come up with a statistically significant difference, you know, P less than 0.0001, even after a Bonferroni correction, but the difference in the IKDC subjective score may be a 65 versus a 64. So a good rule of thumb is the rule of sevens: take the scale, if it's 100 points, and just divide it by seven, and that's usually about the meaningful difference to have. So on a 100-point scale, around 15 points or so would be a good difference to have. If you know the standard deviation of the data, there's also a half standard deviation rule. That's really important with these database studies, because there will be lots of statistically significant findings, but they might not have clinical relevance. And that's where the content expertise really comes in for reviewers. Other questions? Yes.

Hi, I'm Erica from Indonesia. A follow-up question to Dr. Bruce Reider's question: should we be looking at the MCID when we are deciding whether something is statistically significant or clinically significant? Yeah, that's a great question. I think there are several features of a scale, especially these patient-oriented outcome measures, that you can look at, and the MCID is one of them. If the difference is greater than the MCID, then you have more confidence that, yes, this is a clinically relevant difference. Some of the patient-oriented outcome measures that we use have published metrics, but many of them do not, and that's why I sometimes present these rules of thumb. One of the editorials from maybe six or seven years ago was on the magnitude of findings, and I went through how to assess that. But, yeah, the MCID is a good one. Thank you.
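As a companion to the multiple-testing discussion in the talk, here is a minimal sketch in Python of the Bonferroni threshold and the Holm step-down procedure as the speaker describes them; the 20 p-values are invented purely for illustration.

    # Twenty hypothetical p-values from simultaneous tests (made up for illustration).
    p_values = [0.001, 0.002, 0.0026, 0.030, 0.048] + [0.10 + 0.05 * i for i in range(15)]
    alpha = 0.05
    m = len(p_values)

    # Bonferroni: compare every p-value to alpha / m (here 0.05 / 20 = 0.0025).
    bonferroni = [p for p in p_values if p < alpha / m]
    print(f"Bonferroni threshold {alpha / m:.4f}: significant p-values {bonferroni}")

    # Holm step-down: rank p-values from smallest to largest and compare the i-th smallest
    # (i counted from zero) to alpha / (m - i): 0.05/20, then 0.05/19, and so on.
    # At the first comparison that fails, stop; all remaining hypotheses are non-significant.
    holm = []
    for i, p in enumerate(sorted(p_values)):
        if p < alpha / (m - i):
            holm.append(p)
        else:
            break
    print(f"Holm: significant p-values {holm}")

With these made-up values, Bonferroni rejects two hypotheses while Holm rejects three, illustrating the modest gain in power the speaker mentions while still controlling the overall study error rate.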
Video Summary
In this video, the speaker discusses large database studies and their strengths and limitations. They define a large database study as a study of a large collection of data that encompasses records at a state, multi-state, or national level, reporting on more than one procedure or subspecialty. The speaker highlights the strengths of large database studies, such as the quantity of data and the ability to report on rare patient groups or complications. However, they also acknowledge the limitations, such as the quality of the data and the lack of detailed information specific to a diagnosis or procedure. The speaker discusses statistical issues, including confidence intervals of rates and the simultaneous testing of multiple hypotheses. They also emphasize the importance of accurately classifying and counting data, as well as evaluating the numerators and denominators chosen by authors. The role of chance in prospective and retrospective analyses is also discussed. Lastly, the speaker suggests using adjusted p-values, such as through Bonferroni corrections, and considering clinically significant differences when interpreting the findings of large database studies.
Asset Caption
James L. Carey, MD, MPH
Keywords
large database studies
strengths and limitations
quantity of data
rare patient groups
statistical issues
data classification