Annotation

Probe annotations can be downloaded from the Zhou lab GitHub page¹⁵ for all commonly used arrays and reference genomes. For the example data, we use the EPIC annotation (hg19).

anno <- read_tsv("./data/EPIC.hg19.manifest.tsv.gz")

## Rows: 865918 Columns: 57
## ── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (21): CpG_chrm, probe_strand, probeID, channel, designType, nextBase, nextBaseRef, probeType, orientation, ProbeSeq_A, ProbeSeq_B, ...
## dbl (24): CpG_beg, CpG_end, address_A, address_B, probeCpGcnt, context35, probeBeg, probeEnd, beg_A, flag_A, mapQ_A, NM_A, beg_B, flag_...
## lgl (12): posMatch, MASK_mapping, MASK_typeINextBaseSwitch, MASK_rmsk15, MASK_sub40_copy, MASK_sub35_copy, MASK_sub30_copy, MASK_sub25_...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

anno <- anno %>% 
  dplyr::select(
    cpg = probeID,
    chr = CpG_chrm,
    start = CpG_beg,
    end = CpG_end,
    strand = probe_strand,
    gene_HGNC,
    MASK_general,
    probeType
  ) %>% 
  mutate(
    chr = substr(chr,4,5)
  )

str(anno)

## tibble [865,918 × 8] (S3: tbl_df/tbl/data.frame)
##  $ cpg         : chr [1:865918] "cg14817997" "cg26928153" "cg16269199" "cg13869341" ...
##  $ chr         : chr [1:865918] "1" "1" "1" "1" ...
##  $ start       : num [1:865918] 10524 10847 10849 15864 18826 ...
##  $ end         : num [1:865918] 10526 10849 10851 15866 18828 ...
##  $ strand      : chr [1:865918] "-" "+" "+" "-" ...
##  $ gene_HGNC   : chr [1:865918] NA "DDX11L1" "DDX11L1" "WASH7P" ...
##  $ MASK_general: logi [1:865918] TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ probeType   : chr [1:865918] "cg" "cg" "cg" "cg" ...

Filtering on Intensities

We recommend performing probe-filtering by setting unreliably measured probes to NA.

Some observations need to be removed from the dataset since they provide little to no information that could skew analyses. Probes which have zero intensity or are based on very few (default: <3) beads are removed, based on a specified cut off.

RGset <- probeFiltering(RGset)

## Filtering on number of beads... 
## On average 0.12 % of the probes ( 1052641 ) have number of beads below 3 
## Filtering on zero intensities... 
## On average 0.019 % of the Type II probes ( 724574 ) have Red and/or Green intensity below 1 
## On average 0.016 % of the Type I probes ( 99964 ), measured in Green channel, have intensity below 1 
## On average 0.016 % of the Type I probes ( 184560 ), measured in Red channel, have intensity below 1 
## Set filtered probes in Red and/or Green channels to NA...

There are differences between probe types that are used in probeFiltering(). Information on these can be extracted and stored using functions from minfi (Feinberg, 2014).

locusNames <- getManifestInfo(RGset, "locusNames")
typeII <- getProbeInfo(RGset, type="II")
typeI <- getProbeInfo(RGset, type="I")

RGset

## class: RGChannelSet 
## dim: 1052641 57 
## metadata(0):
## assays(2): Green Red
## rownames(1052641): 1600101 1600111 ... 99810990 99810992
## rowData names(0):
## colnames(57): GSM3228563_200550980002_R02C01 GSM3228564_200550980002_R03C01 ... GSM3228620_200590490021_R07C01
##   GSM3228621_200590490021_R08C01
## colData names(25): sample_ID geo_accession ... Neu Mono
## Annotation
##   array: IlluminaHumanMethylationEPIC
##   annotation: ilm10b4.hg19

betas <- getBeta(RGset, type="Illumina")

ENCODE Blacklist

CpGs in the ENCODE Blacklist²² are provided in the package and can be loaded in:

450k - “./data/ENCODEBlacklist-CpGomit-450K.RData”
EPIC - “./data/ENCODEBlacklist-CpGomit-EPIC.RData”
EPICv2 - “./data/ENCODEBlacklist-CpGomit-EPICv2.RData”

These were generated using the Blacklist .bed file found here, which was converted to a GRanges object and overlapped with the manifest for each DNAm array. In total, this removes 7313 probes from EPICv2, 9035 from EPIC, and 6508 from the 450k array.

load("./data/ENCODEBlacklist_CpGomit-EPIC.RData")
betas <- betas[!rownames(betas) %in% cpg_blacklist,]
dim(betas)

## [1] 857749     57

Zhou probes

Several studies¹⁵ have characterized cross-reactive and polymorphic probes on methylation arrays and made suggestions for removal. The proposed sets are actively maintained and supports all current array sizes. Additionally, different sets are specified depending on the reference genome used.

Probes are removed for a variety of reasons, from ambiguous mapping and cross-reactivity to polymorphisms that interfere with extension. In general, around 60,000 probes are suggested for removal in 450k arrays, and 100,000 for EPIC. We developed the probeMasking() function, which removes specified probes from either beta- or M- value matrices.

maskProbes <- anno[anno$MASK_general == TRUE,]$cpg
length(maskProbes)

## [1] 99360

betas <- betas[!rownames(betas) %in% maskProbes,]
dim(betas)

## [1] 763466     57

Sex chromosomes

For most EWAS, CpGs on sex chromosomes are filtered out. To perform analyses on sex-chromosomal probes, a stratified approach is required.

xy_cpgs <- (anno %>% 
              dplyr::filter(chr %in% c("X", "Y")))$cpg
betas <- betas[!rownames(betas) %in% xy_cpgs,]
dim(betas)

## [1] 746345     57

Outlier removal

In many analyses, it is advisable to remove outlying DNAm values. This can be based on:

beta values - more biologically interpretable, but suffer from heteroskedasticity, or
M values - less heteroskedasticity

Outliers can be based on the IQR or RIN values. Based on your specification, you can then filter outlying values. Here, we present removing beta values more than 3 IQR from the nearest quartile.

xtabs(~is.na(betas))

## is.na(betas)
##    FALSE     TRUE 
## 42487888    53777

iqr_dnam <- apply(betas, 1, function(x){
  iqr <- IQR(x, na.rm = TRUE)
  q1 <- quantile(x, na.rm=TRUE)[[2]]
  q3 <- quantile(x, na.rm=TRUE)[[4]]
  x <- ifelse((x <= q1 - (3*iqr) | x >= q3 + (3*iqr)), NA, x)
})

xtabs(~is.na(iqr_dnam))

## is.na(iqr_dnam)
##    FALSE     TRUE 
## 42421845   119820

dim(iqr_dnam)

## [1]     57 746345

betas <- t(iqr_dnam)

Missingness

We advise keeping only the CpGs where less than 5% of sample values are missing.

perc_na <- rowSums(is.na(iqr_dnam))*100/ncol(iqr_dnam)
summary(perc_na)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1144  0.1561  0.1905  0.2817  0.2640  1.4570

betas <- betas[,perc_na <= 95]
dim(betas)

## [1] 746345     57

On the sample level, we keep only the samples where less than 5% of CpGs are missing.

perc_na <- colSums(is.na(iqr_dnam))*100/nrow(iqr_dnam)
summary(perc_na)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2817  0.0000 47.3684

betas <- betas[perc_na <= 95,]
dim(betas)

## [1] 746345     57

SummarizedExperiment

Now that we have our annotations, we can combine them with our data using the SummarizedExperiment package. After ensuring that any column subsetting is accounted for, we extract the colData() from our RGset. Finally, the beta values, probe annotations, and sample information are combined into a RangedSummarizedExperiment object using the SummarizedExperiment() function.

Annotation data

anno <- anno %>% 
  dplyr::filter(cpg %in% rownames(betas))

dim(anno)

## [1] 745368      8

Keep annotated probes

betas <- betas[anno$cpg, ]

Sample information

targets <- targets[colnames(betas), ]

dim(targets)

## [1] 57 25

dim(betas)

## [1] 745368     57

Make summarized experiment object

methData <- SummarizedExperiment(
  assays=SimpleList(beta=betas), 
  rowData=anno, 
  colData=targets)

methData

## class: SummarizedExperiment 
## dim: 745368 57 
## metadata(0):
## assays(1): beta
## rownames(745368): cg09499020 cg16535257 ... cg08423507 cg24129471
## rowData names(8): cpg chr ... MASK_general probeType
## colnames(57): GSM3228563_200550980002_R02C01 GSM3228564_200550980002_R03C01 ... GSM3228620_200590490021_R07C01
##   GSM3228621_200590490021_R08C01
## colData names(25): sample_ID geo_accession ... Neu Mono

You can read more about SummarizedExperiment from the following resources:

Save

Save RGset

RGset <- RGset[, colnames(RGset) %in% colnames(methData)]
RGset

## class: RGChannelSet 
## dim: 1052641 57 
## metadata(0):
## assays(2): Green Red
## rownames(1052641): 1600101 1600111 ... 99810990 99810992
## rowData names(0):
## colnames(57): GSM3228563_200550980002_R02C01 GSM3228564_200550980002_R03C01 ... GSM3228620_200590490021_R07C01
##   GSM3228621_200590490021_R08C01
## colData names(25): sample_ID geo_accession ... Neu Mono
## Annotation
##   array: IlluminaHumanMethylationEPIC
##   annotation: ilm10b4.hg19

Save clean data

save(targets, file = "./data/targets.Rdata")
save(methData, file="./data/methData.Rdata")
save(RGset, file="./data/RGset.Rdata")

Saving

We want to save our data as a SummarizedExperiment object for the rest of the pipeline.