Data source
The UK Biobank (UKB) is a prospective UK population-based study that enrolled approximately half a million adults aged 40–69 between 2006 and 2010 designed to investigate the genetic and lifestyle determinants for a wide range of diseases. Participants underwent genome-wide genotyping, with linkage to longitudinal hospitalization, primary care (GP), and self-report data dating back to 1940 (Fig. 2 and Supplementary Figs. 11 and 12)41. Using the ukbpheno package (version 1.0)41, we assembled detailed longitudinal data from the various sources documenting events from 1940 until December 2021 for 481,927 individuals after excluding 20,534 who lacked quality control genotyping or risk factor information (Fig. 2 and Supplementary Figs. 11 and 13). At the time of analysis, linkage to the United Kingdom General Practice (GP) Registry was available for a subset of 221,351 individuals. This assembly across data sources generated phenotypes for hypertension (Htn), diabetes mellitus (DM) (type 1 or 2), hyperlipidemia (Hld), or coronary artery disease (CAD) based on validated collections of hospitalization (HESIN), diagnostic, operation, general practice (GP) clinical and script as well as death information41. We found high overlap between these phenotypes and our own lab’s previously generated HESIN-restricted phenotypes36,44 (Supplementary Fig. 13). These phenotypes subsequently became the risk factor states in our model. Informed consent was obtained from all participants, and secondary data analyses were approved by the Mass General Brigham Institutional Review Board 2021P002228. Secondary data analysis of UKB was performed under application number 7089.
Because of the longitudinal nature of this cohort, every individual is observed at first encounter with the electronic health record (EHR) in early adulthood (median age 24.2 years). We selected UKB participants free of CAD at age 40 and followed until the occurrence of CAD, death, or loss to follow-up (median follow-up 29.9 years). We categorize individuals by their condition at entry into our cohort at age 40 years provided they have been observed in the EHR (Fig. 2). We then re-evaluate at each age the risk set as those individuals who have (1) been observed and (2) have not been censored for a given phenotype. We demonstrate the diversity of data sources and the corresponding availability of each data source over time for all considered phenotypes (Supplementary Fig. 12). In general, our model allows for the progression from CAD to death, but we report here the risk of progression to CAD on CAD-frr individuals at baseline.
Polygenic risk
An additional novelty of our model is the incorporation of the dynamic effects of genetics over time. We use CAD polygenic risk score (PRS) as released through the UKB resource45 and compute on individuals with adequate genotype information after quality control and after controlling for the principal component axes obtained from the common genotype data in the 1000 Genomes reference data set using standard methods45. Data supporting these scores were entirely from external GWAS data (the Standard PRS set) as conducted by Genomics PLC (Oxford, UK) under UKB project 965945. We demonstrate that the distribution of PRS is similar across entry age (Supplementary Fig. 14).
States and competing risk
The unique nature of our multistate model features eight mutually exclusive states and restricts one-year transitions as follows (Fig. 1), with death as the final absorbing state from which one cannot exit. At any age across the life course, cumulative one-step transitions can be assessed (Fig. 1). Possible transitions are as follows:
1.
Healthy to a single risk factor (Hypertension, Hyperlipidemia, Diabetes), CAD or death; (healthy to healthy also allowed but not displayed for clarity)
2.
Single risk factor to corresponding double risk factor, CAD or death;
3.
Double risk factor to triple risk factor, CAD or death;
4.
Triple risk factor to CAD or death;
5.
CAD to death.
Predictions with age as the time scale
Our model inferences are made per year using the individuals who are in a particular risk state at a given age (Fig. 2 and Supplementary. Fig. 9). Predictions can, therefore, be made over a requested time interval using the product of age-specific risks for which coefficients were estimated from individuals who were in the at-risk subset during a given period. While enrollment in the UK Biobank required that an individual be alive at age 40 to enroll for genotyping, it did not require that the individual be risk factor-free, and therefore we use this information to assign individuals into risk categories for inference from age 40 onward. We exclude individuals with CAD at baseline from our predictions. After deriving the model construction, we describe the computation and evaluation of state and age-specific risks below.
Statistical analysis
Let \({\pi }_{{jkia}}\) represent the annual transition probability from state j to state k for individual i during year a. We let the states j and k represent phenotypes ascertained from the electronic health record. ‘From’ states include Health; single risk factor states: Hypertension (Ht), Hyperlipidemia (Hld), Diabetes Mellitus Type 1 and Type 2 (DM), double risk factor states: Ht & Hld, Ht & Dm, Dm & Hld; Triple risk factor states: Dm & Hld & Ht; and Coronary Artery Disease (CAD). “To” states include all of the “From” states and Death. For our purposes, we report the progression to CAD or death from any of the “From” states.
For p covariates for a given individual transitioning from state j to k, we refer to the following.
$$log \frac{{\pi }_{{jkia}}}{1-{\pi }_{{jkia}}}={\hat{\beta }}_{{jka}0}+{\hat{\beta }}_{{jka}1}{{{{{{\rm{x}}}}}}}_{1}+\ldots {\hat{\beta }}_{{jkap}}{{{{{{\rm{x}}}}}}}_{p}$$
(2)
Where \({\hat{\beta }}_{{jkar}}\) represents the coefficient of variable r in the prediction of the transition probability from state j to state k at age a. Taking the inverse logit of the estimate returns the absolute risk for any individual i as a function of the age-specific coefficients and their p covariates, such that the annual risk estimate from state j to state k is given by:
$${\pi }_{{jkia}}=\frac{{exp }^{{{{{{{\boldsymbol{X}}}}}}}_{{{{{{\boldsymbol{ia}}}}}}}{{{{{{\boldsymbol{B}}}}}}}_{{{{{{\boldsymbol{jka}}}}}}}}}{1+{exp }^{{{{{{{\boldsymbol{X}}}}}}}_{{{{{{\boldsymbol{ia}}}}}}}{{{{{{\boldsymbol{B}}}}}}}_{{{{{{\boldsymbol{jka}}}}}}}}}$$
(3)
Here we let \({{{{{\boldsymbol{X}}}}}}^{{{\prime} }}\) represent the 1 × P vector of individuals and covariate profiles at a given age and \({{{{{\boldsymbol{\beta }}}}}}\) represents the P × 1 vector of age and state-state-specific smoothed coefficients. Smoothing is described in subsequent sections.
When estimating Eq. (3), state j represents the “from” state and state k represents the “to” state. To account for censoring, an individual exits the “at risk” group for transition inference when they are lost to follow-up. We use a 1-year interval over which to discretize age intervals and independently estimate the \({\pi }_{{jkia}}\) age-dependent-state to state transitions. We have both fixed and time-varying covariates. The effect of all covariates on transitions varies with age. Time-invariant covariates include sex and polygenic risk score (PRS). UKB assesses current smoker status at enrollment subsequent change in smoking status is not sufficiently reliable for our purposes. Therefore, we use it as a time-invariant covariate for model estimation. Time-dependent covariates include both antihypertensive and statin prescriptions. These are reevaluated each year using prescription data from the UKB46. Our final prediction model allows for continuous updates of smoking and medication usage in evaluating age-specific transition probabilities. We use 80% of our data as training and 20% as testing (Fig. 2) which divides our data into a training set for model fitting using 384,510 samples and a testing data set of 79,117 unique individuals. Before carrying out the analysis, we selected these covariates for compatibility with the existing Pooled Cohort and Framingham 30-year equations6,26. As a sensitivity analysis, we also report the results after removing certain covariates (Supplementary Table 1).
Predicted interval risk
Predicted risk of progressing to state k from state j for individual i over any period ranging from age A1 to A2 is (Eq. (4)):
$${Interval\; Risk}=1-{\prod }_{{A}_{1}}^{{A}_{2}}\left(1-{\pi }_{{jkia}}\right)$$
(4)
where a indexes the current age. Accordingly, we compute risk for an individual i of progressing to state k from state j where L is the maximum age of life and a is the currently observed age (Eq. (5)). For our purposes, we choose L = 80 in line with the available data by age in the UK Biobank.
$${Remaining\; Lifetime\; Risk}=1-{\prod }_{{A}_{1}}^{L}\,\left(1-{\pi }_{{jkia}}\right)$$
(5)
The remaining lifetime risk can be modified to account for treatments by applying a constant relative risk reduction to the age-specific transition probabilities in expression 4. Then the interval risk under treatment can be calculated using the per-year risk reduction RR of progressing to state k from state j over an interval from age A1 to A2 as (Eq. (6)):
$${Interval\; Risk\; under\; treatment}=1-{\prod }_{{A}_{1}}^{{A}_{2}}\left(1-{\left(1-{RR}\right)\times \pi }_{{jkia}}\right)$$
(6)
For the purposes of this manuscript, we are interested in state k = CAD. We impute the relative risk reduction of 0.20 from 24 trials of statin therapy29. Within our model, we constrain each individual’s predicted probabilities across states per year to sum to one such that for each age a, the probability of staying within the given state is the complement of the sum of transitions over K to the alternative states:
$${\pi }_{{jjia}}=1- {\sum }_{k\ne j}{\pi }_{{jkia}}$$
(7)
We choose j because it is mostly above 50% and the constraint in 6 will guarantee that for a given age, the probabilities for an individual of a particular covariate profile sum to 1. The alternative of fitting a polytomous regression is computationally much more demanding and gives approximately the same answer. Here we report the product of these conditional one-step transitions from the healthy state as the state of most interest for primary prevention.
Flexible smoothing of regression coefficients across ages
The unadjusted observed coefficients may be inherently noisy and certain transitions may have low sample sizes. Therefore, we extract the unsmoothed coefficients \({\hat{\beta }}_{{jka}}\) for each age and state transition from the logistic regressions in Eq. (2). To borrow information across ages, we fit a smoothed locally estimated polynomial regression in which for each state to state transition and each covariate27,28 (LOESS) (Supplemental Fig. 15). The loess weights are proportional to the product of the inverse variance of each estimated coefficient and the tricube distance function of nearby ages. This will smooth adjacent ages more closely together proportional to the cube of their distance d from the age in question, where:
$$D={abs}\left({age}-{ag}{e}_{i}\right)$$
(8)
We consider the neighboring unsmoothed coefficients as those within an adjusted window length, and if the age in question is within 5 years of the minimum or maximum age, we extend the adjusted window by 5 years.
$$\begin{array}{c}{neighbors}={which}\left({D}_{i}\le {adjuste}{d}_{{windo}{w}_{{width}}}\right)\\ {weight}{s}_{{tricube}}=1 – {\left(\frac{{D}_{{neighbors}}}{{windo}{w}_{{width}}}\right)}^{3}\\ {weights} \, < -{weight}{s}_{{tricube}} * \frac{1}{{\sigma }^{2}}\end{array}$$
(9)
We then use weighted least square regression to obtain the weighted sum of neighboring coefficients where the design matrix X is the “N” neighbor’ by degree +1 matrix X and y is the N × 1 vector of unsmoothed coefficients.
$$\begin{array}{c}{WX}=\sqrt{{weights}}\,X\\ {Wy}=\sqrt{\left({weights}\right)} * {coefficients}\left[{neighbors}\right]\\ \beta \, < -{\left(W{X}^{{\prime} }{WX}\right)}^{-1}W{X}^{{\prime} }{Wy}\\ {smoothe}{d}_{{coefficient}{s}_{i}}=\sum \beta+\beta * {Ag}{e}_{\left\{i\right\}}\ldots \beta * {Ag}{e}_{i}^{d}\end{array}$$
(10)
A vignette showing this process on a sample calculation is shown here https://surbut.github.io/MSGene/vignette.html. Furthermore, flexible window choices and polynomial degrees can be found here: https://surbut.shinyapps.io/testapp/. All analyses were performed with R (version 4.3.1). The underlying MSGene framework which could be adapted for other datasets is available at https://github.com/surbut/MSGene with accompanying vignettes. All plots are available at https://surbut.github.io/multistate2/index.html.
Distance weighting refers to the fact that, for each age point, neighboring data within a dynamic window is incorporated, and expanded at the age extremes to mitigate boundary bias. We then apply a tricube weight function that assigns higher weights to nearer neighbors, tapering to zero beyond the window, capturing the assumption of locality. The regression is stabilized by inverse variance weighting, emphasizing more reliable (less variable) observations, in line with the assumption that points with lower variance provide more accurate information. The design matrix, accounting for polynomial terms up to a specified degree, facilitates flexible modeling of the age-coefficient relationship, without imposing a global functional form. This approach assumes that polynomials can locally approximate the age-related trends in coefficients and that these local fits can be aggregated to represent the global trend. It also presumes the initial coefficient estimates are sensible and variances are correctly specified for accurate weighting.
Standard error of projection
We sample with replacement (“bootstrap”) our training data 1000 times and extract the corresponding means and standard errors of each projection across bootstrapping iterations. We compute the remaining lifetime risk by setting the maximum age to 80, according to the density of observations in our training data. We impute a relative risk (RR) of CAD from statins of 0.2034,47,48; notably, the RR may be larger for some groups, such as those with elevated CAD-PRS36,49, and for longer periods of time and thus this reflects a conservative estimate50. We apply this benefit only to individuals not previously on statins.
For the RMSE, we report the standard error of the mean across strata. For proportions, we report the standard error of the sample proportion as \(\surd (\hat{p}(1-\hat{p})/n)\) where \(\hat{p}\) represents the sample proportion.
Performance metrics
For each age, we compare the average predicted score by genomic (<20%, 20–80%, and >80%) and sex strata, and report the root mean squared error (RMSE) as the difference in the average empirical and predicted cumulative incidence rate for each PRS and sex group.
$${RMSE}={sqrt}\Big(\sqrt{{{Empirical}\; {Incidence}}_{{PRS}\times {sex}} – {mean}\left({{Predicted} \; {Rate}}_{{PRS}\times {sex}}\right)}$$
(11)
For the area under the receiver operator curve (AUC-ROC) and precision-recall analysis, we compute the area under each curve using each score as a predictor of cumulative case or control status computed using values for individuals at each year plotted.
Comparison to 10-year PCE and 30-year Framingham CAD risks
For comparison of 10-year risk, we use the 2018 PCE with baseline covariates (total cholesterol, HDL-cholesterol and systolic blood pressure, current smoking) obtained from UKB enrollment data and update each prediction26 with age, diabetes, and medication use according to available records. This technique was used in the Framingham 30-year risk development to validate new longer window estimates in which age was iteratively updated with all other risk factors at their baseline values26.
For comparison of calibration to 30-year risk, we used the 2009 complete (lipids, non-BMI) Framingham 30-year equation (FRS30) and update each prediction26 with age, diabetes, and antihypertensive use according to available records, consistent with detailed formulae within the FRS30. Given the differing populations, we recalibrated51 according to the mean levels of each covariate and baseline hazard in the UKB sample (FRS30RC). For fair comparison, we report our results against FRS30RC (Supplementary Fig. 16). Precision and discrimination analysis described as above. We compute and display the predicted 30-year risk for individuals from ages 40–70 according to this model.
Age-dependent model assessment
To evaluate model concordance, we first use the age and state-specific predicted risk scores for each individual which arise from our MSGene system of smoothed logistic regressions – as covariates in a time-dependent Cox model, in which an individual is featured in nonoverlapping intervals with their respective score and event status. In the evaluation stage, we conservatively left censor individuals until enrollment. The way in which the test set is defined reflects the clinical application of the model. The Cox model is never used for estimating the MSGene transition probabilities themselves: it is only used in the model assessment stage to evaluate concordance in a time-dependent matter33.
We also calculate the minimum age at which an individual would exceed pre-specified risk thresholds for MSGene, PCE, and FRS30. We divide every individual’s observed trajectory into nonoverlapping intervals, indicating when one or all thresholds are achieved and when an event occurs. For example, if an individual is observed from ages 40–70 and exceeds one risk score at age 45 and the other at age 52 and has an event at age 68, his period of study will be divided into 4 intervals: the period from age 40 to 44 in which he exceeds the threshold with neither score, the period from 45 to 51 in which he exceeds the threshold only with score 1, the period from 52 to 67 in which exceeds with both scores, and the period from 68 to 80 in which he has had an event and exceeded in both scores. We left censor in this analysis at age of enrollment. We fit independent time-dependent Cox models31 to this expanded data set, and again conservatively left censor until enrollment. For both analyses, we report the concordance index (Harrell’s C) with confidence intervals derived from 1000 bootstrapping iterations52. Concordance asks, at each time interval, whether the level of one time-varying score exceeds the level of an alternative score for individuals with an event versus those without33.
Internal and external model assessment
We internally assess the RMSE (Supplementary Table 1) of models using a finite number of covariates (here sex, polygenic risk, and time-dependent smoking and antihypertensive use) for eight state-specific transitions built on a training set and independently assessed on our testing set. In addition, external validation was performed on individuals in the Framingham Heart Study Offspring cohort (FOS)53 by comparing the model fits estimated in the UKB with 10-year and lifetime risk estimates with individuals in the FOS (Supplementary Fig. 7) for whom genetic data are available. This is a community-based Northeastern United States cohort that was recruited in 1971, median age [IQR] 33.0 years [27.0, 41.0] and followed through 2013. Clinical data and incident disease for 3836 participants, and genetic data for a subset (2611), were available through the database of Genotypes and Phenotypes (dbGaP; accession phs000007.v33.p14). We compare these with the PCE and FRS30 (original score, calibrated for this population) estimates calculated at Exam 1 and compute the RMSE and AUC over the 30-year follow-up period. Informed consent was obtained from all participants, and secondary data analyses of dbGAP-based FOS and UKB were approved by the Mass General Brigham Institutional Review Board applications 2016P002395 and 2021P002228.
Calculating net reclassification
For net reclassification indices, at each age of consideration, we defined NRIevent as the net proportion of cases correctly reclassified by MSGene Lifetime (MSGeneLT >10%) as compared to a ten-year PCE:
$${{NRI}}_{{event}}:\,\frac{{MSGen}{e}_{{LT}} \, > \, 10\%\,\cap {PCE} \, < \, 5\%\cap {CAD}-{MSGen}{e}_{{LT}} \, < \, 5\%\,\cap {PCE} \, > \, 5\%\cap {CAD}}{{Develops\; CAD}}$$
(12)
We defined NRInon-event as the net proportion of controls correctly reclassified by MSGene lifetime risk <10%:
$${{NRI}}_{{non}-{event}}$$
$$\frac{{MSGen}{e}_{{LT}} \, < \, 10\%\,\cap {PCE} \, > \, 5\%\cap {No\; CAD}-{MSGen}{e}_{{LT}} \, > \, 10\%\,\cap {PCE} \, < \, 5\%\cap {No\; CAD}}{{Does\; not\; develop\; CAD}}$$
(13)
Marginal calculation
We also allow, for the absorbing states of CAD and death, the possibility of computing the probability of progressing through any path to CAD (“marginal”). The calculation of progressing to state K from state J through any path over N years is the product of N transition matrices T in which the j,k element for matrix Tia is the probability of progressing from state j to k at age a for individual of covariate profile i:
$${Marginal\; Interval\; risk}=\,{\prod }_{A1}^{A2}{T}_{{iajk}}$$
(14)
For absorbing states, the k,k probability is 1. This vignette is available at https://surbut.github.io/MSGene/usingMarginal.html.