We recommend performing probe-filtering by setting unreliably measured probes to NA. Unfortunately, Functional Normalization does not allow NA’s in the RGChannelSet
. Therefore, it is best to run normalization and probe-filtering separately and combine the results afterwards using our reduce()
function.
Some observations need to be removed from the dataset since they provide little to no information that could skew analyses. Probes which have zero intensity or are based on very few (default: <3) beads are removed, based on a specified cut off.
## Filtering on number of beads...
## On average 0.07 % of the probes ( 622399 ) have number of beads below 3
## Filtering on zero intensities...
## On average 0.004 % of the Type II probes ( 350036 ) have Red and/or Green intensity below 1
## On average 0.009 % of the Type I probes ( 92578 ), measured in Green channel, have intensity below 1
## On average 0.008 % of the Type I probes ( 178374 ), measured in Red channel, have intensity below 1
## Set filtered probes in Red and/or Green channels to NA...
## ... done 100 out of 138 ...
There are differences between probe types that are used in probeFiltering()
. Information on these can be extracted and stored using functions from minfi (Feinberg, 2014).
locusNames <- getManifestInfo(RGset, "locusNames")
typeII <- getProbeInfo(RGset, type="II")
typeI <- getProbeInfo(RGset, type="I")
The negative controls on the array can be used to detect outlying samples via the detection p-values, which measure how likely it is a sample’s signal differs from the background. Signals that do not differ substantially (default: p<0.01) are removed. We utilise detectionP()
from minfi inside our function to achieve this. Additionally, samples where a large proportion (default: >5%) of probes were set to NA based on the above probe-level filtering are removed.
Our function reduce()
takes the normalized GRset
from the previous section and the filtered RGset
created above and returns filtered beta or M-values, which can be passed to the probeMasking()
function. If M-values are specified as the output, these are scaled to have a range of 32 instead of to infinite values.
## Calculate and filter on detection P-value...
## On average 0.14 % of the CpGs ( 485512 ) have detection P-value above the threshold 0.01
## Transform to beta -values...
## On average 0.09 % of the probes ( 485512 ) were set to NA in the probe filtering step!
## Calculate success rates and reduce...
## Percentage of samples having success rate above 0.95 is 100 %
## Percentage of CpGs having success rate above 0.95 is 99.56 %
Methylation profiles often contain multiple missing values, which can be assumed to be missing at random (MAR), since they are likely due to random experimental and technology-related errors, rather than associated with methylation status.
Imputation of these values is beneficial for multiple algorithms, but its importance is highlighted in the use of predictors that rely on values from only a subset of CpGs. In these instances, the effect of a few missing sites could vastly alter estimated age, for example.
DNA methylation values represent a limited-range continuous variable, for which many standard imputation methods are theoretically suitable. However, they struggle with larger datasets due to a lack of computational efficiency. For this reason, specific strategies have been developed to handle missing methylation data (Mazumder, 2010; Josse, 2012; Di Lena, 2019).
The method that we advise using is MethyLImp (Di Lena, 2019), and we are in the process of developing a parallelized version of this for DNAmArray. This iterative algorithm exploits the high inter-sample correlation seen in methylation data (Zhang, 2017), capturing it using linear regression. This method is both computationally efficient and has been shown to perform better than all current alternatives, whilst not requiring post-imputation truncation (Di Lena, 2019).
Until the parallelized version is available, we advise using imputePCA()
from missMDA (Josse, 2012). This performs well for its small computation time, and does not suffer from the beta-value bias that impute.knn()
has been shown to (Di Lena, 2019).
## [1] 51758
After filtering and preprocessing, our beta-value matrix has 51,758 missing values. Using the code below, we can impute them by implementing a low-rank approximation version of the iterative EM-PCA algorithm (Dempster, 1977). Since this function can impute values outside the [0,1] range of beta-values, we apply a simple post-imputation truncation of the overflowed values. This has been shown not to significantly affect the accuracy (Di Lena, 2019).
library(missMDA)
betas <- imputePCA(betas)
betas <- betas$completeObs
betas[betas<0] <- 0
betas[betas>1] <- 1