Probe annotations can be downloaded from the Zhou lab GitHub page15 for all commonly used arrays and reference genomes. For the example data, we use the EPIC annotation (hg19).
## Rows: 865918 Columns: 57
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (21): CpG_chrm, probe_strand, probeID, channel, designType, nextBase, ne...
## dbl (24): CpG_beg, CpG_end, address_A, address_B, probeCpGcnt, context35, pr...
## lgl (12): posMatch, MASK_mapping, MASK_typeINextBaseSwitch, MASK_rmsk15, MAS...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
anno <- anno %>%
dplyr::select(
cpg = probeID,
chr = CpG_chrm,
start = CpG_beg,
end = CpG_end,
strand = probe_strand,
gene_HGNC,
MASK_general,
probeType
) %>%
mutate(
chr = substr(chr,4,5)
)
str(anno)
## tibble [865,918 × 8] (S3: tbl_df/tbl/data.frame)
## $ cpg : chr [1:865918] "cg14817997" "cg26928153" "cg16269199" "cg13869341" ...
## $ chr : chr [1:865918] "1" "1" "1" "1" ...
## $ start : num [1:865918] 10524 10847 10849 15864 18826 ...
## $ end : num [1:865918] 10526 10849 10851 15866 18828 ...
## $ strand : chr [1:865918] "-" "+" "+" "-" ...
## $ gene_HGNC : chr [1:865918] NA "DDX11L1" "DDX11L1" "WASH7P" ...
## $ MASK_general: logi [1:865918] TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ probeType : chr [1:865918] "cg" "cg" "cg" "cg" ...
We recommend performing probe-filtering by setting unreliably measured probes to NA.
Some observations need to be removed from the dataset since they provide little to no information that could skew analyses. Probes which have zero intensity or are based on very few (default: <3) beads are removed, based on a specified cut off.
## Filtering on number of beads...
## On average 0.09 % of the probes ( 1052641 ) have number of beads below 3
## Filtering on zero intensities...
## On average 0.002 % of the Type II probes ( 724574 ) have Red and/or Green intensity below 1
## On average 0.003 % of the Type I probes ( 99964 ), measured in Green channel, have intensity below 1
## On average 0.003 % of the Type I probes ( 184560 ), measured in Red channel, have intensity below 1
## Set filtered probes in Red and/or Green channels to NA...
## ... done 100 out of 496 ...
## ... done 200 out of 496 ...
## ... done 300 out of 496 ...
## ... done 400 out of 496 ...
There are differences between probe types that are used in probeFiltering()
. Information on these can be extracted and stored using functions from minfi (Feinberg, 2014).
locusNames <- getManifestInfo(RGset, "locusNames")
typeII <- getProbeInfo(RGset, type="II")
typeI <- getProbeInfo(RGset, type="I")
RGset
## class: RGChannelSet
## dim: 1052641 496
## metadata(0):
## assays(2): Green Red
## rownames(1052641): 1600101 1600111 ... 99810990 99810992
## rowData names(0):
## colnames(496): GSM3228585_200550980034_R01C01
## GSM3228586_200550980034_R02C01 ... GSM3229237_200598350015_R06C01
## GSM3229239_200598350080_R03C01
## colData names(25): sample_ID geo_accession ... Neu Mono
## Annotation
## array: IlluminaHumanMethylationEPIC
## annotation: ilm10b4.hg19
Since functional noramlization doesn’t accept NA values, the normalized GenomicRatioSet and filtered RGChannelSet need to be merged. For this reason, we provide a reduce
function in DNAmArray.
## Calculate and filter on detection P-value...
## On average 0.14 % of the CpGs ( 866836 ) have detection P-value above the threshold 0.01
## Transform to beta -values...
## On average 0.09 % of the probes ( 866836 ) were set to NA in the probe filtering step!
## Calculate success rates and reduce...
## Percentage of samples having success rate above 0.95 is 100 %
## Percentage of CpGs having success rate above 0.95 is 99.72 %
It can also be advisable to round the beta values to 4 significant figures, which both reduces the data size and provides a realistic measure of the available precision.
If required, these can be transformed to M-values using minfi’s getM()
which applies a logit2-transformation to the beta-values.
CpGs in the ENCODE Blacklist22 are provided in the package and can be loaded in:
450k
- ./data/ENCODEBlacklist-CpGomit-450K.RData
EPIC
- ./data/ENCODEBlacklist-CpGomit-EPIC.RData
EPICv2
- ./data/ENCODEBlacklist-CpGomit-EPICv2.RData
These were generated using the Blacklist .bed
file found here, which was converted to a GRanges object and overlapped with the manifest for each DNAm array. In total, this removes 7313 probes from EPICv2, 9035 from EPIC, and 6508 from the 450k array.
load("./data/ENCODEBlacklist_CpGomit-EPIC.RData")
betas <- betas[!rownames(betas) %in% cpg_blacklist,]
dim(betas)
## [1] 855423 496
Several studies15 have characterized cross-reactive and polymorphic probes on methylation arrays and made suggestions for removal. The proposed sets are actively maintained and supports all current array sizes. Additionally, different sets are specified depending on the reference genome used.
Probes are removed for a variety of reasons, from ambiguous mapping and cross-reactivity to polymorphisms that interfere with extension. In general, around 60,000 probes are suggested for removal in 450k arrays, and 100,000 for EPIC. We developed the probeMasking()
function, which removes specified probes from either beta- or M- value matrices.
## [1] 99360
## [1] 761741 496
For most EWAS, CpGs on sex chromosomes are filtered out. To perform analyses on sex-chromosomal probes, a stratified approach is required.
xy_cpgs <- (anno %>%
dplyr::filter(chr %in% c("X", "Y")))$cpg
betas <- betas[!rownames(betas) %in% xy_cpgs,]
dim(betas)
## [1] 744846 496
In many analyses, it is advisable to remove outlying DNAm values. This can be based on:
Outliers can be based on the IQR or RIN values. Based on your specification, you can then filter outlying values. Here, we present removing beta values more than 3 IQR from the nearest quartile.
## is.na(betas)
## FALSE TRUE
## 369086545 357071
iqr_dnam <- apply(betas, 1, function(x){
iqr <- IQR(x, na.rm = TRUE)
q1 <- quantile(x, na.rm=TRUE)[[2]]
q3 <- quantile(x, na.rm=TRUE)[[4]]
x <- ifelse((x <= q1 - (3*iqr) | x >= q3 + (3*iqr)), NA, x)
})
xtabs(~is.na(iqr_dnam))
## is.na(iqr_dnam)
## FALSE TRUE
## 367258212 2185404
## [1] 496 744846
We advise keeping only the CpGs where less than 5% of sample values are missing.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1694 0.2270 0.2727 0.5915 0.4517 17.5205
## [1] 744846 496
On the sample level, we keep only the samples where less than 5% of CpGs are missing.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2016 0.4032 0.5915 0.8065 28.6290
## [1] 744846 496
Now that we have our annotations, we can combine them with our data using the SummarizedExperiment package. After ensuring that any column subsetting is accounted for, we extract the colData()
from our RGset
. Finally, the beta values, probe annotations, and sample information are combined into a RangedSummarizedExperiment
object using the SummarizedExperiment()
function.
Annotation data
## [1] 744846 8
Keep annotated probes
Sample information
## [1] 496 25
## [1] 744846 496
Make summarized experiment object
methData <- SummarizedExperiment(
assays=SimpleList(beta=betas),
rowData=anno,
colData=targets)
methData
## class: SummarizedExperiment
## dim: 744846 496
## metadata(0):
## assays(1): beta
## rownames(744846): cg09499020 cg16535257 ... cg08423507 cg24129471
## rowData names(8): cpg chr ... MASK_general probeType
## colnames(496): GSM3228585_200550980034_R01C01
## GSM3228586_200550980034_R02C01 ... GSM3229237_200598350015_R06C01
## GSM3229239_200598350080_R03C01
## colData names(25): sample_ID geo_accession ... Neu Mono
You can read more about SummarizedExperiment from the following resources:
Save RGset
## class: RGChannelSet
## dim: 1052641 496
## metadata(0):
## assays(2): Green Red
## rownames(1052641): 1600101 1600111 ... 99810990 99810992
## rowData names(0):
## colnames(496): GSM3228585_200550980034_R01C01
## GSM3228586_200550980034_R02C01 ... GSM3229237_200598350015_R06C01
## GSM3229239_200598350080_R03C01
## colData names(25): sample_ID geo_accession ... Neu Mono
## Annotation
## array: IlluminaHumanMethylationEPIC
## annotation: ilm10b4.hg19