I've recently had a new paper out in EPJ Data Science concerning preferential
attachment (the 'rich get richer' effect) in social networks.
This is a relatively controversial area, and the aim of the paper was to try to
be as systematic / objective as possible. The simplest way to explain our
results, suggested by one of two pleasingly constructive reviewers, is that we
estimate around 40% of social contacts arise from existing social contacts
(e.g. your friends introduce you to their friends) rather than
from direct social activity. If this figure had been much larger, it would have
created an epidemiological paradox - no infectious disease would be
controllable.
The main innovation was to use a general finite-state-space Markov chain for
the non-preferentially attached links, which generates a mechanistically
interpretable phase-type distribution as a way to capture other sources of
heterogeneity in contact numbers.
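To give a feel for what a phase-type distribution is, here is a minimal Python sketch. It assumes a discrete-time chain for simplicity, so the sampled quantity is the number of steps the chain spends in its transient states before being absorbed. The function name `sample_phase_type` and the numbers in `INITIAL` and `TRANSITION` are made up for illustration; this is not the fitted model from the paper.

```python
import random

# Hypothetical two-state example (illustrative numbers only).
# INITIAL[k] is the probability of starting in transient state k.
INITIAL = [1.0, 0.0]
# TRANSITION[k][l] is the probability of moving from transient state k to l.
# Each row sums to less than 1; the deficit is the absorption probability.
TRANSITION = [
    [0.5, 0.3],  # from state 0: absorbed with probability 0.2
    [0.0, 0.9],  # from state 1: absorbed with probability 0.1
]


def sample_phase_type(initial, transition, rng):
    """Draw the number of steps until absorption (a discrete phase-type draw).

    Assumes absorption is eventually certain, otherwise the loop never ends.
    """
    n = len(initial)
    state = rng.choices(list(range(n)), weights=initial)[0]
    steps = 1  # the visit to the starting state counts as the first step
    while True:
        absorb = max(0.0, 1.0 - sum(transition[state]))
        # Outcome n stands for "absorbed"; outcomes 0..n-1 are transient states.
        outcome = rng.choices(
            list(range(n + 1)), weights=list(transition[state]) + [absorb]
        )[0]
        if outcome == n:
            return steps
        state = outcome
        steps += 1


if __name__ == "__main__":
    rng = random.Random(2015)
    draws = [sample_phase_type(INITIAL, TRANSITION, rng) for _ in range(10000)]
    print("sample mean:", sum(draws) / len(draws))
```

For these particular numbers the mean can be worked out by hand: the expected time to absorption from state 1 is 1/0.1 = 10, and from state 0 it is (1 + 0.3 × 10)/0.5 = 8, so the sample mean should come out close to 8. The appeal of the construction is that each state and rate has a mechanistic reading, while the resulting distribution can be quite heterogeneous.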
The journal's editor (quite a 'big name' with many followers) was
also nice enough to promote the work on social media.
A ubiquitous problem in epidemiology is confounding, which is (very loosely
speaking) where something else correlates with the actual causal mechanism for
disease. In the current context, the issue is that there may be clusters of ill
people on a social network due to spreading of the illness on the network, or
it could be that people who share a risk factor tend to be linked on the
network. So for depression (which we studied) it could be that people who drink
heavily tend to be friends, but it is the drinking that causes the clusters of
depression rather than the friendships.
There is a lot of interesting discussion of such questions in the context of
social networks in a special issue of Statistics in Medicine. Personally,
I tend to think that while it is never possible to eliminate all forms of
confounding in observational studies, such studies offer information that
cannot be obtained from experiments. Therefore the task for methodologists is to
try to come up with methods that allow information to be reliably extracted
from observational data. In the paper in question, we tried to reduce the
possible role for confounding in the following way.
Suppose we have a Markov chain \(\mathbf{X}_t = (X^i_t)\), where
\(X^i_t=1\) if individual \(i\in\{1,2,\ldots,N\}\) is
ill at time \(t\) and \(X^i_t=0\) otherwise. If there
is no social influence, then a standard modelling approach is to let
\[\mathrm{Pr}(X^i_t=1) = f(\boldsymbol{\theta}_i) \mathrm{ ,}\]
where \(\boldsymbol{\theta}_i\) is a vector of individual-level
properties / behaviours for individual \(i\), and the function
\(f\) is usually the inverse-logit (logistic) transform of
\(\boldsymbol{\beta}\cdot\boldsymbol{\theta}_i\). Here there is
a lot of room for confounding if different elements of the vectors in
\(\{\boldsymbol{\theta}_i\}_{i=1}^{N}\) are correlated with each
other.
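As a concrete reading of the model without social influence, here is a small Python sketch of \(f\) in its logistic (inverse-logit) form. The function names, the coefficients and the two covariates (think of something like alcohol use and age, both rescaled) are hypothetical and chosen only to show the calculation.

```python
import math


def inverse_logit(z):
    """Map a real-valued linear predictor to a probability in (0, 1)."""
    # Branch on the sign of z so that math.exp never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    return math.exp(z) / (1.0 + math.exp(z))


def illness_probability(beta, theta):
    """Pr(X = 1) for one individual: inverse logit of the dot product beta . theta."""
    linear_predictor = sum(b * x for b, x in zip(beta, theta))
    return inverse_logit(linear_predictor)


if __name__ == "__main__":
    beta = [0.8, -0.3]          # hypothetical coefficients
    theta_i = [1.0, 0.5]        # hypothetical covariates for individual i
    print(illness_probability(beta, theta_i))
```

If the two covariates are correlated across individuals, many different choices of `beta` give similar fitted probabilities, which is exactly where the room for confounding comes from.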
Now add the social network - a matrix with elements \[G_{i,j} =
\begin{cases} 1 & \text{ if } j \text{ is } i \text{'s friend,} \\ 0 & \text{
otherwise.}\end{cases}\] Clustering in such a network is a tendency for
friends to be in the same disease state - \(\mathrm{Pr}(X^i_t = X^j_t \vert
G_{i,j}=1) \gt \mathrm{Pr}(X^i_t = X^j_t \vert G_{i,j}=0)\). The issue of confounding
then becomes whether the links \(\{G_{i,j}\}_{i,j=1}^{N}\) (usually
thought of as Bernoulli random variables like the disease states) are
correlated with the \(\{\boldsymbol{\theta}_i\}_{i=1}^{N}\), or
whether there is evidence for "spreading" of the disease as in Figure 3 of the
paper.
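The clustering inequality can be estimated directly from a single snapshot of the network and the disease states. The Python sketch below does this by simple counting over ordered pairs \((i,j)\) with \(i \neq j\); the function name `concordance_by_link` and the tiny four-person network are invented for illustration.

```python
def concordance_by_link(G, x):
    """Estimate Pr(X^i = X^j | G_ij = 1) and Pr(X^i = X^j | G_ij = 0).

    G is an N x N list of lists of 0/1 entries (G[i][j] = 1 if j is i's friend)
    and x is a list of N disease states (0 or 1). Returns a pair of
    proportions; an entry is None if there are no pairs of that kind.
    """
    same = {0: 0, 1: 0}
    total = {0: 0, 1: 0}
    n = len(x)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # an individual is not their own friend
            link = G[i][j]
            total[link] += 1
            if x[i] == x[j]:
                same[link] += 1
    linked = same[1] / total[1] if total[1] else None
    unlinked = same[0] / total[0] if total[0] else None
    return linked, unlinked


if __name__ == "__main__":
    # Hypothetical network: individuals 0 and 1 are friends, as are 2 and 3.
    G = [
        [0, 1, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 1],
        [0, 0, 1, 0],
    ]
    x = [1, 1, 0, 0]  # the first pair of friends is ill, the second is not
    print(concordance_by_link(G, x))
```

In this extreme toy case every pair of friends shares a disease state and no pair of non-friends does. Note that the calculation says nothing about why the clustering is there, which is the confounding problem in a nutshell.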
Our way of answering this rests on the bit of stochastic process theory that
says that distinct Markov chains can share the same stationary distribution
while having different transition probabilities, meaning that observations from
this distribution alone cannot distinguish them. A direct observation of
transitions between disease states therefore allows us to determine
model probabilities like \(\mathrm{Pr}(X^i_{t+1} = 1 \vert
\mathbf{X}_t,\mathbf{G})\), and thereby distinguish between spreading over a
network, and correlation of other properties / behaviours. The full details are
given in the paper and supplement, but the basic idea is, I believe, a
relatively general one: a lot of confounding arises because the stochastic
processes behind the hypotheses to be considered share the same stationary
distribution, and it can be mitigated by fitting to temporal observations (or
other features that do distinguish those processes).
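The point about stationary distributions is easy to see in a toy case. The Python sketch below, which is a deliberately stripped-down stand-in and not the model from the paper, follows a single individual through a two-state chain (0 = well, 1 = ill) with \(\mathrm{Pr}(0 \to 1) = a\) and \(\mathrm{Pr}(1 \to 0) = b\). A "slow" chain with \(a = b = 0.1\) and a "fast" chain with \(a = b = 0.5\) are both ill half the time in the long run, but differ sharply in how likely an ill individual is to remain ill at the next step. All names and parameter values are assumptions for illustration.

```python
import random


def stationary_ill_probability(a, b):
    """Long-run probability of being ill for Pr(0->1) = a and Pr(1->0) = b."""
    return a / (a + b)


def simulate(a, b, steps, rng):
    """Simulate the two-state chain, starting from state 0, for `steps` steps."""
    state = 0
    path = [state]
    for _ in range(steps):
        if state == 0:
            state = 1 if rng.random() < a else 0
        else:
            state = 0 if rng.random() < b else 1
        path.append(state)
    return path


def summarise(path):
    """Return (prevalence, persistence) estimated from one trajectory.

    Prevalence uses only the marginal states (a cross-sectional view);
    persistence is the estimated Pr(X_{t+1} = 1 | X_t = 1), which needs
    observed transitions. Persistence is None if the path is never ill
    before its final step.
    """
    prevalence = sum(path) / len(path)
    from_ill = sum(1 for s in path[:-1] if s == 1)
    stays_ill = sum(1 for s, t in zip(path, path[1:]) if s == 1 and t == 1)
    persistence = stays_ill / from_ill if from_ill else None
    return prevalence, persistence


if __name__ == "__main__":
    rng = random.Random(42)
    for label, a, b in [("slow", 0.1, 0.1), ("fast", 0.5, 0.5)]:
        prevalence, persistence = summarise(simulate(a, b, 50000, rng))
        print(label, stationary_ill_probability(a, b), prevalence, persistence)
```

Both chains should show a prevalence near 0.5, so a cross-sectional survey cannot tell them apart, whereas the estimated persistence should be near 0.9 for the slow chain and near 0.5 for the fast one.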
This experience had a large effect on my thinking about the field, and allowed
me to start new collaborations. One of the things that we did together is write
the review paper Modeling infectious disease dynamics in the complex landscape
of global health, which has
just appeared in its final form. While the lead author, Hans Heesterbeek, did the really hard work of
pulling everything together, the paper comes remarkably close to a consensus
view of the most exciting directions in infectious disease modelling, and it's
a good starting point for someone thinking about making contributions to this
field.