Half-sibling Regression
Half-sibling model
Half-sibling regression is an approach to de-noising data under the following model: there is a true signal $Q$ and a measurement $Y$ of it, which is contaminated by noise $N$. Crucially, we have also measured other variables that are affected by the same noise but are independent of the true signal $Q$. We call these variables $X_1, X_2, \dots, X_k$ the half-siblings of $Y$.
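As a sketch, the model can be written as follows, assuming the additive-noise form used by Schölkopf et al. (2016); the function $f$ is notation introduced here to describe how the noise enters the measurement:

$$
Y = Q + f(N), \qquad (X_1, \dots, X_k) \perp\!\!\!\perp Q .
$$

The half-siblings may depend on $N$ in any way; all that matters is that they carry no information about $Q$.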
The idea is to regress $Y$ on its half-siblings $X_1, X_2, \dots, X_k$. Since the half-siblings are independent of the true signal, anything shared by $Y$ and any half-sibling must come from the noise $N$. This is shown in this (slightly terrible) diagram, where I have drawn $y$ as a vector that combines true signal and noise. Relative to the true signal (of $y$) and the noise, $x$ corresponds almost entirely to noise. Of course, $x$ should really be shown as a combination of noise and its own true signal, which we could show using a third axis.
When we regress $y$ on its half-sibling $x$, we obtain the fitted vector $\hat{y}$, which lies in the same direction as $x$. The residual $y-\hat{y}$ then points almost in the same direction as the signal, i.e. we have recovered the part of $y$ that is due to signal.
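In symbols, the estimate of the signal proposed by Schölkopf et al. (2016) is the measurement minus whatever the half-siblings can predict of it:

$$
\hat{Q} = Y - \mathbb{E}[Y \mid X_1, \dots, X_k] .
$$

For the picture above, assuming a single half-sibling, a linear fit and centred vectors (simplifications made here for illustration), the regression is an orthogonal projection:

$$
\hat{y} = \frac{x^\top y}{x^\top x}\, x, \qquad x^\top (y - \hat{y}) = 0 .
$$

The residual is exactly orthogonal to $x$, so the closer $x$ lies to the noise direction, the closer the residual lies to the signal direction.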
Example application - searching for exoplanets
I came across this method in a lecture series on causality by Jonas Peters, where he used it as an example of how thinking about the causal structure of data can suggest statistical methods for analysing them. In the course, Peters gives the example from (Schölkopf et al. 2016): measurements of the light intensity of various points in the sky, recorded by the Kepler space observatory. The aim is to detect troughs in a star's light intensity, since regular troughs might correspond to an exoplanet orbiting that star. Here the measurements of other stars, which are affected by the same instrument noise, play the role of half-siblings.
My example - noisy measurements of weight
To explore this idea I decided to simulate some data from the following model:
The weights of 20 individuals are measured every week for 20 weeks. We want to understand how the weights change over the study; for example, we might be interested in the impact of diet on weight. However, let’s suppose the weighing scales are calibrated inconsistently each week, so that we cannot directly compare a measurement from week 1 with one from week 2. If we considered the weight of each individual in isolation, we would have no way to remove this noise. By using half-sibling regression, however, we can remove much of it.
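A minimal sketch of how such data could be simulated in R. The baseline weights, standard deviations and seed are hypothetical values chosen here for illustration; none of them come from the original simulation.

```r
set.seed(1)                 # arbitrary seed, for reproducibility only
n_weeks  <- 20
n_people <- 20

# Hypothetical parameters (assumptions, not the values used in the post).
baseline    <- rnorm(n_people, mean = 75, sd = 10)  # typical weight per person (kg)
calibration <- rnorm(n_weeks, mean = 0, sd = 2)     # shared weekly scale error N

# True signal: each person's weight fluctuates independently around their
# baseline. The result is an n_weeks x n_people matrix.
signal <- sapply(baseline, function(b) b + rnorm(n_weeks, sd = 1))

# Measurement: everyone weighed in the same week gets the same calibration
# error (the vector is recycled down each column of the matrix).
measured <- signal + calibration

df <- data.frame(week = seq_len(n_weeks), measured)
names(df)[-1] <- paste0("X", seq_len(n_people))
```

Each column `X1`, ..., `X20` is one individual's measured weight, and every column shares the same weekly `calibration` term.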
In this plot of the measurements throughout the study we see a similar pattern for each individual, caused by the shared measurement noise. This structure is what we will exploit to recover the true signal.
We now fit the half-sibling regression model, regressing one of the variables, $X_1$, on its half-siblings, i.e. the measurements of the other individuals (so here $X_1$ plays the role of $Y$ above). In R we specify this with the formula “X1 ~ . - week”, meaning “regress X1 on all the remaining variables except for ‘week’”.
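To see what that formula expands to, here is a small sketch on toy data. The data frame, its size and its three series are made up for illustration.

```r
set.seed(42)
# Toy data, purely illustrative: 30 weeks, three series sharing one noise term.
noise <- rnorm(30)
df <- data.frame(
  week = 1:30,
  X1 = rnorm(30) + noise,
  X2 = rnorm(30) + noise,
  X3 = rnorm(30) + noise
)

# "." expands to every column other than the response; "- week" drops week.
fit <- lm(X1 ~ . - week, data = df)
names(coef(fit))  # "(Intercept)" "X2" "X3"
```

The fitted model contains an intercept and one coefficient per half-sibling, and `week` is not used as a predictor.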
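We can now construct our predictions of the true signal from the residuals of the regression model. The residuals have mean zero by construction, so they recover the signal only up to an additive constant.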
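A sketch of this step on simulated data. All parameter values are hypothetical, and adding the mean measurement back to the residuals is one possible convention for putting the predictions on the weight scale, not necessarily the one used for the plots here. Note that with exactly 20 weeks and 19 half-siblings plus an intercept, ordinary least squares would fit `X1` perfectly and leave zero residuals, so the sketch uses a longer series of 100 weeks.

```r
set.seed(1)
n_weeks  <- 100   # hypothetical: longer than the 20 weeks in the text (see above)
n_people <- 20

# Simulated data; every numeric value is an assumption for illustration.
calibration <- rnorm(n_weeks, sd = 3)                      # shared noise N
baseline    <- rnorm(n_people, mean = 75, sd = 10)
signal      <- sapply(baseline, function(b) b + rnorm(n_weeks, sd = 1))
measured    <- signal + calibration

df <- data.frame(week = seq_len(n_weeks), measured)
names(df)[-1] <- paste0("X", seq_len(n_people))

# Regress X1 on its half-siblings X2, ..., X20.
fit <- lm(X1 ~ . - week, data = df)

# Residuals estimate the signal up to a constant; add the mean measurement
# back so the prediction is on the original scale (an assumed convention).
denoised <- mean(df$X1) + residuals(fit)

# Compare how well the raw and de-noised series track the true signal.
cor_raw      <- cor(df$X1, signal[, 1])
cor_denoised <- cor(denoised, signal[, 1])
```

In this setup one would expect `cor_denoised` to be noticeably larger than `cor_raw`, since most of the shared calibration error has been regressed out.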
So we have achieved significant noise reduction, though we still do not recover the signal perfectly, even in this case where the noise has mean 0. Now we can plot the true signal, the noisy signal and our predictions from half-sibling regression. In each case the black line shows the baseline value as a reference. Since the noise had mean 0, our predictions fluctuate about this line.
Note, for example, that in this graph the noise variable has a large increase in week 6, whereas the true value decreases in week 6. The measured value shares the increase from the noise variable, but we have managed to predict a decrease, matching the pattern of the true signal.
Further questions
How can we quantify noise reduction?
Does this work when the noise doesn’t have zero mean? Or when the signal is increasing? E.g. if the weight is increasing over time, will we recover this upwards trend? What if the noise is decreasing so as to balance the increase? I think the key requirement is that noise and signal are independent. At one point I tried increasing the signal (for only one variable) at the same rate as I decreased the noise, but this effectively makes the two dependent: if the increase/decrease is strong compared to the other changes occurring, the two will be inversely correlated and thus dependent.
Note that this is only valid if all individuals are truly independent. If they weren’t all independent (e.g. if they were all following the same diet/exercise plan), we would perhaps have to regress only on the genuine half-siblings, e.g. regress a control sample only on treatment samples and a treatment sample only on control samples.
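On the first question, in a simulation the true signal is known, so one simple option is to compare the root-mean-square error of the raw and de-noised series against the truth. The sketch below does this after centring, because the residuals only recover the signal up to a constant. The sample sizes and standard deviations are hypothetical, and RMSE is only one possible choice of metric.

```r
set.seed(2)
n_weeks  <- 100   # hypothetical sizes and noise levels, for illustration only
n_people <- 20

calibration <- rnorm(n_weeks, sd = 3)                      # shared noise N
baseline    <- rnorm(n_people, mean = 75, sd = 10)
signal      <- sapply(baseline, function(b) b + rnorm(n_weeks, sd = 1))
measured    <- signal + calibration

df <- data.frame(measured)
names(df) <- paste0("X", seq_len(n_people))

centre <- function(v) v - mean(v)
rmse   <- function(a, b) sqrt(mean((a - b)^2))

truth <- centre(signal[, 1])                 # centred true signal for person 1

# Error of the raw measurement versus the half-sibling regression residuals.
rmse_raw <- rmse(centre(df$X1), truth)
rmse_hsr <- rmse(residuals(lm(X1 ~ ., data = df)), truth)

c(raw = rmse_raw, half_sibling = rmse_hsr, ratio = rmse_hsr / rmse_raw)
```

A ratio well below 1 indicates that the regression has removed most of the shared noise. With real data the truth is unknown, so a metric like this is only available in simulation.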
Sources
I used this guide to draw the DAG using LaTeX.
Bernhard Schölkopf et al., ‘Modeling Confounding by Half-Sibling Regression’, Proceedings of the National Academy of Sciences 113, no. 27 (5 July 2016): 7391–98, https://doi.org/10.1073/pnas.1511656113.