Phenotype Prediction

It can be useful to predict phenotypes in your data, either for comparison to assumed categories or to help impute missing values. Whole blood sample cell-type composition and sex predictors are already included in this package, and we are in the process of creating predictors for age and smoking status. Meanwhile, below are some current methods to predict these phenotypes in your data.

It is important to carry out predictions prior to probe masking, as many probes masked for analysis (e.g. those on sex chromosomes) are still useful for predicting phenotypes. Additionally, we advise imputing missing vaues before phenotype prediction, since many predictors use only a subset of available probes and missing data can seriously impact accuracy.


Predict Cell Counts

Using BIOS consortium data, which contains multiomics measures from 6 Dutch biobanks comprising ~4000 individuals, we trained a predictor for cell-type composition of whole blood samples. This was based on methylation values from 481,388 CpGs for 2,852 individuals and code for building the predictor can be found at the bottom of this page, should you wish to replicate the process in your own large dataset.

In order to use our predictor, you must first load in the coefficients. This predictor takes age, sex, and sentrix position as covariates, so it is important to create a wbcc_covar variable with this information to input into the predict_wbcc function.

##                              age            sex Sentrix_Position
## GSM3092700_9985178096_R01C01  50 gender: Female           R01C01
## GSM3092701_9985178127_R03C02  19 gender: Female           R03C02
## GSM3092702_9986360109_R02C02 110   gender: Male           R02C02
## GSM3092703_9985178087_R01C01  39   gender: Male           R01C01
## GSM3092704_9985178090_R01C01  97   gender: Male           R01C01
## GSM3092705_9985178090_R04C01 103 gender: Female           R04C01

As you can see, this information is available for us, but some changes are needed to ensure that each variable conforms to the predictor’s format. Covariates must be named with capital letters, so Age, Sex, and Sentrix_Position. Additionally, Sex needs to be coded numerically, with 0 signifying females and 1, males. Sentrix_Position also needs to be factorised and stored as a numeric variable.

##                              age sex Sentrix_Position
## GSM3092700_9985178096_R01C01  50   0                1
## GSM3092701_9985178127_R03C02  19   0                6
## GSM3092702_9986360109_R02C02 110   1                4
## GSM3092703_9985178087_R01C01  39   1                1
## GSM3092704_9985178090_R01C01  97   1                1
## GSM3092705_9985178090_R04C01 103   0                7

After this reformatting wbcc_covar is ready to be input into the predict_wbcc function. Next, we turn our attention to the methylation values in our dataset. Since our predictor was trained using 481,388 CpG sites, we want to extract the methylation values for those sites from our data. In the predictor, the CpG rows are prefixed with data, making them simple to extract for matching purposes.

## [1] 481388    138
## [1] 481392      5

Looking at the dimensions of our wbcc_betas object in comparison with the DNAmPredictorCoef one, there are 4 missing rows in the beta value matrix. These represent the intercept along with the 3 covariates (Age, Sex, and Sentrix_Position). The last reformatting needed is to ensure that the rownames of the predictor and our dataset values match.

Next, we can use the predict_wbcc function to predict cell counts for our whole blood samples. This carries out matrix multiplication between our data and the predictor coefficients, trained on the large BIOS consortium data. Since 51 cell-type percentages were predicted to be outside the 0-100% range, we also apply a post-prediction trunction to the values for these rare cell types.

## [1] "There are 89976 NA's in the data matrix. These will be median imputed."
##                              baso_perc eos_perc lymph_perc mono_perc neut_perc
## GSM3092700_9985178096_R01C01 0.8649793 2.239715   31.43147  8.553607  56.17437
## GSM3092701_9985178127_R03C02 0.3980295 2.526380   39.72257 10.733249  46.26124
## GSM3092702_9986360109_R02C02 1.2671736 2.997470   42.84254  8.761936  42.96368
## GSM3092703_9985178087_R01C01 0.5129077 2.352888   38.84856  8.719807  48.29784
## GSM3092704_9985178090_R01C01 0.7756450 4.049402   41.44949  8.370699  43.93083
## GSM3092705_9985178090_R04C01 0.6115671 1.321606   35.01724  6.532910  54.54430

Here, you see the predicted cell types for our whole blood samples. There is an estimated percentage for basophils, eosinophils, lymphocytes, monocytes, and neutrophils. These can then be added to the colData of our RGset for use in later analysis models.


Predict Sex

The DNAmArray package contains a getSex.DNAmArray() function, which predicts the sex of your observations by inspecting intensities of X-chromosome signals.

##             assumedSex
## predictedSex gender: Female gender: Male
##       Female             55            0
##       Male                0           83

As you can see, this is complete data, but the predicted and assumed sexes are identical. This means that we can feel increased confidence that no incorrect labelling or mix-ups are present.


Building a Predictor

In order to build a cell-type composition predictor, you will need data on methylation status, cell-type composition, and relevant covariates for a large number of samples. In our case, we used data from the BIOS consortium, which is stored as a RangedSummarizedExperiment class object. We first extract the methylation values, ensuring we keep only CpG sites with complete data. In our case, this was all 481,388 assayed DNA methylation sites.

Next, we extracted measured cell percentages, which will be used as a known outcome to train the predictor. The cell-types measured may vary between datasets, but we chose to predict basophils, eosinophils, neutrophils, lymphocytes, and monocytes. It could also be a viable strategy to combine basophils, eosinophils, and neutrophils into one granulocyte category if you are lacking power to detect changes in rarer cell-types.

We also store relevant covariates, such as age, sex, and sentrix position, since these can confound associations between methylation and cell-type composition.

After ensuring that IDs were consistent between all three types of data, we ensured that our training set was only composed of complete data. Missing values in either the outcome or the covariates would interfere with building the model.

Now that we have a complete data set, we want to split it into a training set and a test set. To do this we use the createDataPartition function from caret. This is superior to using sample, since it preserves the proportion of the categories in the outcome variable, which can be disturbed if you sample randomly.

It is important to set.seed when splitting your data, so that your models and results are replicable. In this instance, we chose to split the data 70:30, but assigning somewhere between 60% and 80% of your data to the training set is common and should work well.

Training the predicted can be done with the train_wbcc function including in this pipeline. This takes methylation values, covariates, and cell percentages as inputs and uses plsr from the pls package carry out multivariate regression using partial least squares. By default, the predictor is built using 50 components, but this can be modified using the ncomp option.

Using pls.options allows you to utilise multiple cores when building the predictor, and this is advised. You can determine the optimal number of components using the validationplot function and, when determined, save the coefficients of your model with coef. Sometimes, it is also useful to know which covariates have the highest prediction value and store them, as shown below.

The data provided in this package is the DNAmPredictorCoef object stored above. This can be used to predict cell-type composition of whole blood samples as shown in the above sections, or you can use a predictor you build in a similar fashion.

In order to assess the power of your predictor, you should then predict cell percentages in your test set, and use our plot_wbcc function to examine the correlation between measured and predicted cell counts.


The above plots illustrate the relationship between measured cell counts in BIOS consortium data and predicted values in our test data. The correlation for neutrophils and lymphocytes is very strong, likely due to these being highly prevalent cell-types in whole blood samples. This agreement remains strong for eosinophils and monocytes, but is considerably weaker for the rarest cell type, basophils.

We chose not to cluster granulocytes as the prediction for eosinophils and monocytes was good, despite these being less prevalent, and there was still positive correlation for the basophils. Overall, these plots show that our predictor has the power to estimate cell types and performs well in unseen samples.