AOSSM 2022 Annual Meeting Recordings - no CME
How to Review a Large Database Study
Video Transcription
All right, well, thank you. I appreciate the invitation to speak again. I look back, and this is the seventh one, and I really enjoy passing on some of these skills to you. I thank Dr. Reider for the invitation, and I thank Donna for her assistance and help. My conflicts of interest are disclosed here.

The objectives for today: we'll review a definition of a large database study and then look at some of the strengths and limitations of these studies. We'll consider the influence of the numerators that authors choose to study and the denominators that authors choose to study, and briefly explore some statistical issues, including the confidence intervals of rates and the simultaneous testing of multiple hypotheses.

The definition I like comes from a review article published in 2016. I think it most closely parallels my idea of a large database study: a study of any large collection of data that encompasses records at a state, multi-state, or national level, reporting on more than one procedure or subspecialty. That is markedly different from registries obtaining detailed information specific to a diagnosis or procedure, which in turn is markedly different from a prospective cohort study with a hypothesis and prospective longitudinal follow-up. Some common examples are listed here; these are all based on administrative, billing, or discharge records. That same article had a figure on the publication rate per journal by year. Of note, the American Journal of Sports Medicine is the light green line near the bottom, but you can still see there's an uptick and an increase in these publications.

The strengths of a large database study you probably know. The quantity of data, the number of records: you can report on rare patient groups or complications that would be impossible to adequately power at a single institution, and you can provide generalizable conclusions across multiple regions. Here's an example from the American Journal of Sports Medicine published in 2015: the incidence of manipulation under anesthesia or lysis of adhesions after arthroscopic knee surgery. The statistical power of 330,000 knee arthroscopies is undeniable; it would take me about 1,000 years to collect all of that data. And you can see some of the valuable information you can get about the relative rates of patients undergoing these procedures following different knee arthroscopy procedures.

The limitations center on the quality of the data. These administrative or billing records were not intended for research or clinical purposes. One of the examples I like to give: who remembers, from ICD-9, the code for osteochondritis dissecans? No one? 732.7. This is my favorite code, but it's not fair, because I used it for a lot of things, like any focal cartilage defect; I couldn't find a better code, so I just used 732.7. It unlocked my iPhone and my house in Nashville as part of the alarm system. I never thought it would be used for research. But if someone looked at that and took it as the incidence of OCD, they would think there was an epidemic of cases in Nashville around 2006 to 2011, when I lived there. So you just have to remember: these records weren't intended for research. Some of the databases don't have any follow-up outside of an inpatient stay or outside of 30 days after a hospital discharge, and some of the databases do not record laterality.
So you have to look at that critically. The distance from the patients is what limits the details. What you can really get out of these large database studies are counts: counts of diagnoses or counts of procedures. That's what you get, just counts. In contrast, on the other side of the spectrum, consider a prospective cohort study where the authors knew the particulars of the surgery and the details of the rehabilitation. They had direct contact with the patients to collect patient-oriented outcome measures, objective measurements, as you can see there, and radiographs. Your large database studies are not going to have goniometer measurements.

So with these counts, it's critical for us to understand numerators and denominators again. In mathematics, the numerator is the number above the line in a fraction showing how many of the pieces indicated by the denominator are taken. In epidemiology, the numerator usually reflects disease-specific variables, and the denominator reflects population size or time or both. How does this apply to counts in database studies? Well, a rate is defined as a count in the numerator divided by some other measure, usually time-based, in the denominator. Rates have units, and a rate can take any value from zero to infinity. A specific rate, the incidence rate, is defined as the number of new cases of a condition that occurred during a specific period of time in a population at risk for developing the condition. It's really important to think about that: the population at risk for developing the condition. A common denominator is person-years, the sum of the units of time that each individual is at risk and observed.

Rates and proportions are not synonymous. A proportion is defined as a count divided by another count of the same units. A proportion is a ratio; proportions have no units, and this unitless measure varies from zero to one. Prevalence is a specific proportion, defined as the number of affected people in the population at a specific time point divided by the total number of people in the population at that time.

The following slides come from the epidemiology text by Dr. Gordis at Hopkins. Prevalence is depicted as a glass jar filled with marbles. Incidence is the rate of marbles going into the jar, and the ways out of the jar are deaths and cures. I offer both to my patients.

Accurate classification is the step before counting, and both require high-quality data. We're familiar with the International Classification of Diseases, the ICD and the ICD-CM; they're the most widely used classification schemes for coding morbid conditions. Take a look at how you get a different answer for the prevalence of dementia depending on whether you use DSM criteria or one of the ICD criteria. It really varies a lot, and that's something you simply don't know in these large database studies: what was used to classify the data at each site when it was entered? The distance from the data is the problem. Here's another example, rheumatoid arthritis, using the New York criteria or the ARA criteria; you get a really different answer about the prevalence of rheumatoid arthritis. In orthopaedics, you might think we're privileged, exempt from that. Take an ACL: how many ways can you classify an ACL? It's a binary thing, a zero or a one. Yes, there are a few partial ACLs, but it's essentially binary.
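To make the rate-versus-proportion distinction above concrete, here is a minimal Python sketch. The counts and population sizes are hypothetical, chosen only for illustration; they are not taken from any study cited in the talk.

```python
# Hypothetical counts, illustrative only; not from any study cited in the talk.
new_cases = 45            # new diagnoses observed during follow-up (numerator)
person_years = 180_000    # sum of time each at-risk person was observed (denominator)

# Incidence rate: a count divided by person-time. It has units
# (cases per person-year) and can take any value from 0 to infinity.
incidence_rate = new_cases / person_years
print(f"Incidence rate: {incidence_rate * 10_000:.1f} per 10,000 person-years")

# Prevalence: affected people divided by total people at one time point.
# It is a unitless proportion between 0 and 1.
affected_now = 1_200
population_now = 50_000
prevalence = affected_now / population_now
print(f"Point prevalence: {prevalence:.1%}")
```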
The accuracy of diagnostic codes for ACL tears has actually been looked at by Dr. Krych at the Mayo Clinic. His team went back to the medical record and tried to verify every case. You may remember that in this time period the code for cruciate ligament injury covered both ACL and PCL, so you would expect some problems; you can see that isolated PCL and combined ACL-PCL together make up about 2% of the cases classified as a cruciate tear. But 31% had no cruciate ligament injury at all. What they actually had was a contusion about the knee or a knee sprain, an isolated meniscus tear, or an isolated collateral ligament injury. So about two-thirds of them were accurate ACLs and about a third were not. We're not really exempt from this.

Now I want to move to a more interactive part of this. There's a lot of talk in my house about driving. My son just got his driver's license last week and had to do about 60 hours in the car, so I had a chance to talk to him about some of the things I learned from George Carlin about driving. Have you ever noticed, when you're driving, that anyone driving slower than you is an idiot and anyone driving faster than you is a maniac? There are parallels to this in our own practice, like with plica excisions. Think about how many plica excisions you think you did last year. If you heard someone did ten times that many, you'd think, whoa, you're scoping for dollars, you're a maniac. But if someone did zero, you'd say they're missing it. We all have a sense of the proper speed to drive and our own thoughts on driving, but I want to put a little more statistics into it.

So: you're at your son's baseball game and it goes into extra innings. Your daughter is at the game with you and needs to arrive at a musical rehearsal very soon. Two trustworthy families are in the same situation and simultaneously offer to drive your daughter to the rehearsal. Option one is the mother from one family; option two is the grandmother from the other family. Who do you choose? They both seem fine, but I had some concerns about the parent maybe being exhausted or distracted by technology, and some concerns about the grandparent maybe having some physical, cognitive, or visual impairments. So I wanted to look at the data.

In the numerator you've got the count of fatal crashes, and in the denominator, estimated person-years. If you look at middle-aged people, they had 2.0 fatal crash events per 10,000 person-years, and elderly people had 2.2 fatal crashes per 10,000 person-years. Does anyone see any problems with looking at it this way, as fatal crashes per person-year? Go ahead, Arvind. Maybe the amount they drive would be different? The amount they drive is different, yes. And do you think all of the people age 75 years and older are even drivers? That's really the issue: it's a flawed denominator. This chose the wrong denominator, because you can't be a driver at risk for a fatal crash if you're not a driver. Lots of the people included here might not be driving; they might never have driven, or they might have stopped driving for health reasons, but they're included in this denominator, so it's flawed. So I chose to change the denominator, first to driver-years. If you look at licensed drivers, the middle-aged drivers still have 2.0 fatal crash events; it didn't really change, because essentially all of those people were licensed drivers. But for elderly drivers, it did change.
The rate went up to 3.5 fatal crash events per 10,000 licensed driver-years. But then there's the next point you brought up: time spent living is not always time spent driving. A lot of these people might be drivers, but they're not driving much; maybe the elderly are just driving to the store once or twice a week or to doctor's appointments, while the middle-aged drivers are driving more. So what could you change the denominator to, to reflect that? Hours driven per year? Yes, if you knew hours driven per year, that would be perfect. The airline industry does this with passenger-miles, using miles as a surrogate for time, since time spent flying covers a certain number of miles. In this sort of crash data, they use driver-miles. So the numerator is still the count of fatal crashes, and the denominator is estimated driver-miles. And you can see it really did change: 2.0 versus 11.5 fatal crash events per 100 million driver-miles.

So I feel good about the denominator. Does anyone have any concerns about the numerator? Do you think fatal crash events is looking at the right thing? These are defined as events in which the driver dies in a car crash. Yes, I think it would be better to look at motor vehicle collisions instead of fatal crash events, because fatal crashes also reflect the fragility of the driver and how survivable the crash is for the driver. This has actually been studied: why have fatality rates among older drivers declined? The relative contributions of changes in survivability and in crash involvement were examined, and the main factor contributing to the higher incidence of fatal crashes among elderly drivers was not their higher involvement in crashes but their lower survivability in crashes. So I was looking at the wrong thing all along. It's really tricky: you had counts, it was high-quality data, but what you chose to put in the numerator and what you put in the denominator really changed your decisions. I think they were both safe, the mother and the grandmother, and the daughter makes it to the event fine.

When you're working with counts and rates, there are some statistical considerations. A count is just the number of events that occur, and a rate is the count divided by a denominator, which accounts for population size or time interval or both. Counts and rates must be greater than or equal to zero; negative rates don't make sense in this setting, so special considerations need to be made when calculating the confidence intervals of rates. In order to take this constraint on rates into account, it is preferable to work in the log domain: derive a confidence interval for the log rate and then antilog it. This is actually pretty simple for a statistician to do, and it is suitable and appropriate to calculate confidence intervals that cannot be negative.
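As a sketch of that log-domain approach, here is a minimal Python example. It assumes the event count behaves approximately like a Poisson count, so the standard error of the log rate is roughly one over the square root of the count; the crash counts and mileage totals below are hypothetical, scaled only to echo the 2.0 versus 11.5 per 100 million driver-miles figures from the talk.

```python
import math

def rate_ci_log_domain(events, exposure, z=1.96):
    """Confidence interval for a rate (events / exposure), computed on the log
    scale and then antilogged so the bounds can never be negative.
    Assumes the event count is approximately Poisson, so SE[log(rate)] ~ 1/sqrt(events)."""
    rate = events / exposure
    se_log = 1.0 / math.sqrt(events)
    lower = math.exp(math.log(rate) - z * se_log)
    upper = math.exp(math.log(rate) + z * se_log)
    return rate, lower, upper

# Hypothetical counts and exposures (illustrative only).
per_100m = 100_000_000
for label, events, miles in [("middle-aged", 200, 10_000_000_000),
                             ("elderly",     115,  1_000_000_000)]:
    rate, lo, hi = rate_ci_log_domain(events, miles)
    print(f"{label}: {rate * per_100m:.1f} per 100M driver-miles "
          f"(95% CI {lo * per_100m:.1f} to {hi * per_100m:.1f})")
```

Because the interval is built on the log scale and exponentiated, the lower bound stays above zero no matter how small the count, which is exactly the constraint the talk describes.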
How about chance? We had a whole workshop on chance before; we passed out the pennies and flipped them. Fisher popularized the alpha standard of P less than 0.05, and this cutoff has been sanctified by many years of use. Of course, he'd want it to be zero if possible. But what if you perform a large collection of tests, perhaps as part of one of these large database studies? How do you account for that? At some level, if the researcher does not take the multiplicity of testing into account, then the probability that some true null hypotheses are rejected by chance alone may be unduly large. Even if you test only 20 independent hypotheses, the probability is 0.64, more likely than not, that at least one hypothesis will be statistically significant by chance alone.

We talked previously about some ways to correct for this. One way out of the dilemma is to adjust the P values or the thresholds for interpretation. That's the Bonferroni approach, named after the Italian mathematician, and it divides alpha by the number of hypotheses tested. I want to present this to you along with a couple of modifications, because they're much easier than I thought they would be; when you see them you think, oh, that's it, it's actually quite easy. In a study with 20 hypotheses, the level of significance would be P less than 0.0025; just take 0.05 divided by 20. Following this correction, with alpha set at 0.0025 and 20 independent simultaneous hypotheses being tested, the probability is 0.05, just the same as usual, that at least one hypothesis will be statistically significant by chance alone.

Now, even though that's the easiest one to understand, and you can apply it in any simultaneous-inference situation, the correction is considered a bit too strict for many cases, and the price is a loss of power, which you may not have to worry about as much in these large database studies. But these two modifications are also pretty easy to do; you could read about them and apply them within an hour. Holm has his method, a sequentially rejective Bonferroni procedure. It's slightly more complicated but increases power, and it still maintains the overall study error rate of alpha in almost every case. You just rank your P values, say your 20 P values, from lowest to highest, take the smallest and compare it to 0.05 divided by 20, take the next smallest and compare it to 0.05 divided by 19, and so on; once you get a non-significant result, all remaining hypotheses are considered non-significant. So it's not that hard to do these things. Hochberg's is very similar; it ranks them in the other order, and a few statistical conditions need to be met in order to use it, but you will sometimes see these when you're reviewing a manuscript. These are considered the simple Bonferroni modifications, and there are more complex methods that may be more appropriate for certain situations; in many cases a statistician with expertise in the field may be employed. But it's important to remember as a critical reviewer that it's appropriate, and pretty simple, to provide suitably adjusted P values.
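To show how simple these corrections are to apply, here is a short Python sketch of the plain Bonferroni threshold and the Holm step-down procedure as described above. The p-values are made up purely for illustration.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Plain Bonferroni: compare every p-value to alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm_significant(p_values, alpha=0.05):
    """Holm's sequentially rejective Bonferroni procedure: rank p-values from
    smallest to largest, compare the i-th smallest to alpha/(m - i), and once
    one comparison fails, all remaining (larger) p-values are non-significant."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    significant = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            significant[idx] = True
        else:
            break  # stop at the first non-significant result
    return significant

# Hypothetical p-values from 20 simultaneous tests (illustrative only).
p_vals = [0.0010, 0.0024, 0.0026, 0.0300, 0.0490] + [0.20] * 15
print("Bonferroni rejections:", sum(bonferroni_significant(p_vals)))  # 2
print("Holm rejections:      ", sum(holm_significant(p_vals)))        # 3
```

Equivalently, one can multiply each p-value by the number of tests (capping at 1) and compare the adjusted p-values to 0.05; that is the "adjusted p-value" presentation that comes up in the question-and-answer discussion later in the talk.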
The final thing I want to bring up is the role of chance, prospective versus retrospective, because the role of chance is different prospectively and retrospectively even though the statistical testing is the same. Richard Feynman, the physicist and Nobel laureate for his work in QED, once illustrated this point as follows: "You know, the most amazing thing happened to me tonight. I was coming here, on the way to the lecture, and I came in through the parking lot. You won't believe what happened. I saw a car with the license plate ARW 357. Can you imagine? Of all the millions of license plates in the state, what was the chance I would see that particular one tonight? Amazing!" So retrospectively hunting for, and of course discovering, patterns in a large database study generates skepticism in the mind of the critical reader and reviewer, whereas prospectively generating a research hypothesis and then testing it creates more confidence in the findings. That's the difference between the prospective and retrospective roles of chance.

So when reviewing a large database study, I suggest three steps. Step one: assess the numerators that the authors chose to study. This parallels the fatal crashes in our analogy. The measurement approach may not be what the investigator would prefer; for example, the presence or absence of the ICD code for osteoarthritis rather than careful classification by the investigator based on radiographs. The critical reviewer knows that accurate classification and counting require high-quality data.

Step two: assess the denominators that the authors chose to study. You can see the influence of that in our analogy, looking at driver-miles versus driver-years and person-years. The existing data may have been collected from a population that is not ideal and not stratified satisfactorily; for example, a population of all high school students is used as the denominator because it's easier and available, rather than a population of high school students by sex and by sport. Content expertise and experience may guide the critical reviewer to ask: could denominator choices have influenced these findings? In the same way, with the driving analogy, it was content expertise and experience that guided us to the right numerators and denominators; we weren't doing any statistics there.

Step three: evaluate the statistics used to analyze rates and proportions. If the critical reviewer observes the reporting of negative rates, that's considered a red flag. It's possible and appropriate to obtain suitably adjusted p-values, and it's fine to request that from the authors.

That's it, thank you. Did you want to entertain any questions at this point? Anyone have any questions?

I have a question. Sometimes we see an author say they did a Bonferroni correction, but they seem to be using 0.05 as their value for significance. Are they being disingenuous here, or is there some other way to describe a Bonferroni correction such that you can still have done it and the value for significance is still 0.05?

I've seen that as well, and I think there are a couple of ways to do it. One is to adjust your interpretation of the p-values, which is the way I showed it, where we weren't interpreting anything as significant unless it was under P less than 0.0025, because that's what we corrected our interpretation threshold to. And I've seen approaches where they actually adjust the p-values themselves in light of the other findings in the simultaneous testing. In those cases I would be most comfortable checking with the authors: how did you actually calculate the p-values? Did you adjust the p-values themselves, or is this just an adjustment in the interpretation of the p-values? That's a good point.

Any other questions? Here's another one. Now, Jim, you've evaluated a large number of large database studies. What catches your eye and makes you say this is a good one, better than most of the run-of-the-mill ones we have coming through here?
Yeah, I think there have been a few that have been influential in changing my practice, some of the ones about injections close to the time of surgery. Those are pretty well captured in some of these large databases, because they can capture the CPT code of the arthroscopy, the 29881, and they can capture the CPT code of the injection. And I understand the infection rates were around 0.06 percent in some of those studies; it would take me years to figure that out for my own patients. So I like the large database studies that I would not be able to do myself at a single institution, that would be hard to coordinate as a multi-center cohort study, and that have the capacity to change practice with some simple counts of infections and simple counts of complications. Those are sometimes easier to count.

Here's another one. What about the risk of finding things that are statistically significant but not clinically significant because of the size of the data?

Oh yes, that's a huge issue: the magnitude of the findings and what counts as a clinically relevant difference. These studies will often come up with a statistically significant difference, P less than 0.0001 even after a Bonferroni correction, but the difference in the IKDC subjective score may be a 65 versus a 64. A good rule of thumb is the rule of sevens: take the scale, and if it's a hundred points, just divide by seven, and that's usually about the meaningful difference to have, so on a hundred-point scale that's around 15 points or so. If you know the standard deviation of the data, there's also the half-standard-deviation rule of thumb. That's really important in these database studies, because there will be lots of statistically significant findings that might not have clinical relevance, and that's where the content expertise of the reviewers really comes in.

Other questions? Hi, I'm Erica from Indonesia, with a follow-up question to Dr. Bruce Reider's. Should we be looking at the MCID when we are deciding whether something is merely statistically significant or truly clinically significant?

You know, it's a great question. I think there are several features of a scale, especially these patient-oriented outcome measures, that you can look at, and the MCID is one of them. If the difference is greater than the MCID, then you have more confidence that, yes, this is a clinically relevant difference. Some of the patient-oriented outcome measures that we use have published metrics, but many of them do not, and that's why I sometimes present these rules of thumb. One of the editorials from maybe six or seven years ago is on the magnitude of findings, and I went through how to assess that there, but yes, the MCID is a good one. Thank you.
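As a small illustration of the two rules of thumb mentioned in that answer, here is a minimal Python sketch comparing a hypothetical one-point difference on a 100-point scale against the rule-of-sevens threshold and the half-standard-deviation threshold. The standard deviation used below is an assumption for illustration only.

```python
# Hypothetical example: a 1-point difference on a 100-point patient-reported
# outcome scale (e.g., 65 vs 64). All numbers are illustrative.
scale_range = 100.0
observed_difference = 65.0 - 64.0
assumed_sd = 18.0  # hypothetical standard deviation of the outcome measure

rule_of_sevens = scale_range / 7   # roughly 14 points on a 100-point scale
half_sd = assumed_sd / 2           # roughly 9 points with this assumed SD

print(f"Observed difference:      {observed_difference:.1f} points")
print(f"Rule-of-sevens threshold: {rule_of_sevens:.1f} points")
print(f"Half-SD threshold:        {half_sd:.1f} points")
print("Clinically relevant by either rule?",
      observed_difference >= min(rule_of_sevens, half_sd))
```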
Video Summary
In this video, the speaker discusses large database studies, their strengths, and limitations. They emphasize the importance of accurate classification and counting in these studies, as well as the influence of numerators and denominators on the findings. The speaker also talks about statistical considerations, such as confidence intervals, multiple hypothesis testing, and adjusting p-values. They explain the difference between prospective and retrospective roles of chance in data analysis. The critical reviewer is advised to assess the numerators and denominators chosen by the authors, evaluate the statistics used, and consider the clinical relevance of the findings. The speaker also addresses audience questions about Bonferroni correction, clinical significance, and minimal clinically important difference. Overall, the video provides insights into reviewing and interpreting large database studies.
Asset Caption
James L. Carey, MD, MPH
Keywords
large database studies
numerators
denominators
statistics
clinical relevance
interpreting