An Alternative to the Bland–Altman Repeated-Measures Correlation to Account for Variability of Slopes Across Persons


1. Introduction

When a data set includes repeated measures, i.e., multiple measurements taken for each person, estimating a correlation between two variables requires separation of variance within and across persons. This is necessary because standard Pearson correlations assume that each observation is independent of all others, but this assumption is violated when multiple measurements come from the same person. When we have repeated measurements from individuals, observations within each person are naturally more similar to each other than to observations from different people. This creates a hierarchical or nested data structure where measurements are clustered within individuals. If we simply calculated a Pearson correlation across all data points while ignoring this nesting, we would conflate two distinct sources of variation: the differences between people’s average levels (between-person variation) and the relationships between variables within each person over time (within-person variation). For example, imagine we are studying the relationship between exercise and mood. Some people might generally exercise more and have better moods overall, creating a positive between-person correlation. However, the relationship we are often really interested in is whether an individual person’s mood improves when they exercise more than their personal average—the within-person correlation. By using a repeated-measures correlation, we can specifically examine these within-person relationships while accounting for the fact that each person has their own baseline levels and patterns. This gives us a more accurate understanding of how variables relate to each other at the individual level, rather than mixing this information with broader patterns that exist across different people.
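To make the distinction concrete, the following minimal sketch (ours, not taken from any published analysis; all variable names are hypothetical) simulates data in which the between-person trend is positive while every within-person relationship is negative, so a pooled Pearson correlation points in the opposite direction from the individual-level relationship.

```python
# Minimal illustration (hypothetical data) of how a pooled Pearson correlation
# can conflate between-person and within-person variation.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_subjects, n_obs = 10, 20

x_parts, y_parts, id_parts = [], [], []
for s in range(n_subjects):
    mean_x = s * 2.0   # subjects with higher average x ...
    mean_y = s * 2.0   # ... also have higher average y (positive between-person trend)
    xi = mean_x + rng.normal(size=n_obs)
    # Within each subject, y decreases as x rises above that subject's own mean.
    yi = mean_y - 0.7 * (xi - mean_x) + rng.normal(scale=0.5, size=n_obs)
    x_parts.append(xi)
    y_parts.append(yi)
    id_parts.append(np.full(n_obs, s))

x, y, subject = map(np.concatenate, (x_parts, y_parts, id_parts))

pooled_r = pearsonr(x, y)[0]                      # dominated by between-person means
within_rs = [pearsonr(x[subject == s], y[subject == s])[0] for s in range(n_subjects)]

print("pooled Pearson r:", round(pooled_r, 2))            # positive
print("mean within-person r:", round(np.mean(within_rs), 2))  # negative
```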

A method for calculating the repeated-measures correlation (rrm or “rmcorr” from here) using a within-person design [1] has gained widespread use, particularly because it (1) provides a single correlation coefficient summarizing all within-person correlations, and (2) remains within the framework of the general linear model, meaning it can be calculated entirely from a table of sums of squared errors (an ANOVA table). To calculate the Bland–Altman rmcorr between X1 and X2, one can simply predict (using regression) X1 using X2 and a series of binary (“dummy”) variables representing each subject; for example, if there were 10 subjects, X1 would be predicted by X2 and nine binary variables (with one binary subject variable omitted as the reference category). The partial sums of squares (SS) of this model (usually presented in an ANOVA table) would then be broken down into portions for each independent variable, as well as for the error (residual SS). To find the rmcorr, one would simply divide the SS for the independent variable of interest (X2 in this case) by the sum of that value and the residual SS. The square root of this quotient is the rmcorr. Note the SS for the subjects (binary variables) is ignored, which is by design: the variance attributable to each person’s unique mean is considered a “nuisance” in this context. This procedure is quite similar to giving subjects random intercepts in a mixed-effects regression, though there are important differences [2]. The problem with the Bland–Altman approach is that subjects usually do not have the same slope, so estimating random intercepts while assuming the same slope for everyone is often misleading. Of course, reporting a separate correlation for each person defeats the purpose of having a single correlation coefficient to describe the within-person correlation patterns within a sample; therefore, any method summarizing within-person correlations in a single number will have to contend with varying slopes.
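For illustration, the following is a minimal sketch of the ANOVA-table calculation described above, using pandas and statsmodels; the column names x1, x2, and subject are hypothetical, the partial SS are obtained here as Type II sums of squares, and the sign of the coefficient is conventionally taken from the fitted common slope.

```python
# Sketch of the Bland-Altman-style rmcorr computed from an ANOVA table.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

def rmcorr(df: pd.DataFrame) -> float:
    """Repeated-measures correlation from partial sums of squares."""
    # Predict x1 from x2 plus dummy variables (separate intercepts) for each subject.
    model = smf.ols("x1 ~ x2 + C(subject)", data=df).fit()
    anova = sm.stats.anova_lm(model, typ=2)        # partial (Type II) sums of squares

    ss_x2 = anova.loc["x2", "sum_sq"]              # SS for the variable of interest
    ss_resid = anova.loc["Residual", "sum_sq"]     # residual SS; subject SS is ignored by design

    r = np.sqrt(ss_x2 / (ss_x2 + ss_resid))
    return float(np.sign(model.params["x2"]) * r)  # sign from the common slope
```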

Here, we propose an alternative to the Bland–Altman rmcorr. Specifically, we propose calculating all within-person correlations, taking their average weighted by the square root of the number of within-person observations, and testing this weighted average for statistical significance using a one-sample t-test, also weighted by the square root of the number of observations. We recommend that researchers use both this method and the original Bland–Altman method for comparison. As we demonstrate below, an extreme inconsistency between these two estimates could suggest methodological or data quality problems that should be addressed before further statistical analyses are conducted.
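A minimal sketch of this weighted-average procedure is given below (again with hypothetical column names): per-person Pearson correlations are averaged with square-root-of-n weights, and the weighted one-sample t-test against zero is carried out here with statsmodels' DescrStatsW as one possible implementation.

```python
# Sketch of the proposed weighted mean within-person correlation (wmcorr).
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
from statsmodels.stats.weightstats import DescrStatsW

def wmcorr(df: pd.DataFrame):
    """Weighted mean of per-subject correlations, with a weighted t-test vs. 0."""
    rs, weights = [], []
    for _, g in df.groupby("subject"):
        if len(g) < 3:                       # need at least three observations per person
            continue
        rs.append(pearsonr(g["x1"], g["x2"])[0])
        weights.append(np.sqrt(len(g)))      # weight by sqrt of the number of observations

    stats = DescrStatsW(np.asarray(rs), weights=np.asarray(weights))
    t_stat, p_value, dof = stats.ttest_mean(0)   # weighted one-sample t-test against zero
    return stats.mean, t_stat, p_value, dof
```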

4. Discussion

We propose an alternative to the Bland–Altman repeated-measures correlation (rmcorr) for estimating within-person correlations between two variables in a repeated-measures data set. Our proposed method, the weighted mean within-person correlation (wmcorr), calculates the average of all within-person correlations, weighted by the square root of the number of observations for each person. We demonstrate that in most cases, rmcorr and wmcorr will yield similar results; however, in cases where subjects have at least moderately varying slopes, wmcorr may provide a more visually intuitive estimate of the within-person correlation. Further, simulation results showed that while wmcorr had less statistical power than rmcorr (i.e., it is the more conservative test, committing fewer Type I errors), neither method exhibited systematic bias in estimating correlations; however, wmcorr demonstrated superior accuracy, particularly when sample sizes or true correlations were low, as evidenced by its smaller error variability (SD = 0.086 vs. 0.108). While we call wmcorr an “alternative” because it could be used as a standalone method, we encourage researchers to estimate both wmcorr and the original rmcorr. Conflicting significance levels or opposite directions of rmcorr and wmcorr can serve as “warning signs” for researchers, indicating potential data quality issues or the need for further data collection before drawing conclusions about within-person relationships.
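As a hypothetical usage pattern for the “estimate both” recommendation, and assuming the rmcorr() and wmcorr() sketches given in the Introduction together with a long-format DataFrame df containing columns x1, x2, and subject, the comparison might look as follows.

```python
# Hypothetical usage: compute both estimates and flag disagreement as a warning sign.
import numpy as np

r_rm = rmcorr(df)
r_wm, t_stat, p_value, dof = wmcorr(df)

# A sign disagreement (or conflicting significance) suggests that per-subject
# slopes vary enough to warrant closer inspection before interpretation.
if np.sign(r_rm) != np.sign(r_wm):
    print(f"rmcorr = {r_rm:.3f} but wmcorr = {r_wm:.3f}: inspect per-subject slopes.")
```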

The method proposed here shares some fundamental characteristics with mixed-effects regression models, but differs in important technical aspects. Both methods aim to account for the nested structure of repeated-measures data and focus on within-person relationships, although mixed-effects models also provide between-subject effects. The weighting by square root of sample size in the proposed approach parallels how mixed models naturally give more weight to subjects with more observations, as these subjects provide more reliable information about within-person patterns. However, the methods diverge in their underlying “machinery” and assumptions. Mixed-effects regression explicitly models both fixed and random effects, allowing for variation in both intercepts and slopes across individuals, while simultaneously estimating an overall population-level relationship. It achieves this through sophisticated maximum likelihood estimation that considers the entire data structure. Our method, by contrast, takes a two-stage approach, first calculating individual correlations, and then combining them through weighted averaging. This makes our method more computationally straightforward and potentially more intuitive to researchers, because it is clear how each person’s data contributes to the final estimate. It also avoids some of the complex assumptions about the distributions of random effects that mixed models require. However, this simplicity means our method might be less efficient at using all available information in the data, particularly when some subjects have very few observations or when there are missing data patterns that mixed models could handle more elegantly through their likelihood-based framework.
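For readers who prefer the mixed-effects framing, a minimal sketch of a random-intercept, random-slope model fit with statsmodels is shown below (same hypothetical column names as the earlier sketches); it is offered only as a point of comparison and is not part of the proposed method.

```python
# Sketch of a mixed-effects comparison model: random intercepts and random
# slopes for x2, grouped by subject, estimated by (restricted) maximum likelihood.
import pandas as pd
import statsmodels.api as sm

def mixed_model_slope(df: pd.DataFrame):
    model = sm.MixedLM.from_formula("x1 ~ x2", data=df,
                                    groups="subject", re_formula="~x2")
    result = model.fit()
    return result.fe_params["x2"], result   # population-level (fixed) slope and full results
```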

The method proposed here is motivated primarily by potential real-world applications of the Bland–Altman rmcorr. From a scientific perspective, the p-value obtained from the original method comes from a known distribution of a test statistic with known degrees of freedom, so it will always be “correct” in the narrow sense that it reflects the probability of obtaining the observed rmcorr by chance, exactly as described in the original Bland–Altman paper [1]. However, because statistical significance is often used as a threshold to indicate whether a model should be applied in the real world, we hope we have demonstrated how some significant rmcorr models could be applied unreasonably. Using the original data as an example, the significant unadjusted p-value might lead some to conclude that if a person’s PaCO2 increased by one standardized unit, their pH would decrease by 0.507 standardized units. While this might be true on average, the results in Figure 1 demonstrate that the variability in expected pH is enormous; indeed, two of the eight subjects show an increase in pH, one quite substantially. Our proposed wmcorr provides a reasonable alternative to the Bland–Altman [1] method. Our hope is that cases in which the original and weighted-average methods provide quite different results will alert researchers to data patterns that warrant closer inspection.

While wmcorr is a viable addition to the “correlation toolbox”, several limitations should be considered. First, our simulation conditions, though varied, do not capture all possible real-world data scenarios, particularly those with extreme outliers or highly non-linear relationships. Second, the method’s reliance on calculating individual correlations means it requires at least three observations per person to compute a correlation coefficient, potentially limiting its application in studies with sparse sampling or irregular measurement intervals. Third, the square root weighting approach, while theoretically justified, is just one possible weighting method; alternative schemes might be more appropriate in certain contexts but were not explored in this study. Fourth, while the examples used in the present study are compelling, they are likely to be quite rare in practice. Indeed, something as extreme as that shown in Figure 2 would likely only occur if there were serious problems in data entry or cleaning. However, this is also one of the strengths of the method: it will detect such problems when they might otherwise go undetected. Finally, while our method can identify potentially problematic data patterns, it does not provide specific guidance on how to address such issues beyond suggesting further data collection or inspection.


