Probe annotations can be downloaded from the Zhou lab GitHub page15 for all commonly used arrays and reference genomes. For the example data, we use the EPIC annotation (hg19).
## Rows: 865918 Columns: 57
## ── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (21): CpG_chrm, probe_strand, probeID, channel, designType, nextBase, nextBaseRef, probeType, orientation, ProbeSeq_A, ProbeSeq_B, ...
## dbl (24): CpG_beg, CpG_end, address_A, address_B, probeCpGcnt, context35, probeBeg, probeEnd, beg_A, flag_A, mapQ_A, NM_A, beg_B, flag_...
## lgl (12): posMatch, MASK_mapping, MASK_typeINextBaseSwitch, MASK_rmsk15, MASK_sub40_copy, MASK_sub35_copy, MASK_sub30_copy, MASK_sub25_...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
anno <- anno %>%
dplyr::select(
cpg = probeID,
chr = CpG_chrm,
start = CpG_beg,
end = CpG_end,
strand = probe_strand,
gene_HGNC,
MASK_general,
probeType
) %>%
mutate(
chr = substr(chr,4,5)
)
str(anno)
## tibble [865,918 × 8] (S3: tbl_df/tbl/data.frame)
## $ cpg : chr [1:865918] "cg14817997" "cg26928153" "cg16269199" "cg13869341" ...
## $ chr : chr [1:865918] "1" "1" "1" "1" ...
## $ start : num [1:865918] 10524 10847 10849 15864 18826 ...
## $ end : num [1:865918] 10526 10849 10851 15866 18828 ...
## $ strand : chr [1:865918] "-" "+" "+" "-" ...
## $ gene_HGNC : chr [1:865918] NA "DDX11L1" "DDX11L1" "WASH7P" ...
## $ MASK_general: logi [1:865918] TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ probeType : chr [1:865918] "cg" "cg" "cg" "cg" ...
We recommend performing probe-filtering by setting unreliably measured probes to NA.
Some observations need to be removed from the dataset since they provide little to no information that could skew analyses. Probes which have zero intensity or are based on very few (default: <3) beads are removed, based on a specified cut off.
## Filtering on number of beads...
## On average 0.12 % of the probes ( 1052641 ) have number of beads below 3
## Filtering on zero intensities...
## On average 0.019 % of the Type II probes ( 724574 ) have Red and/or Green intensity below 1
## On average 0.016 % of the Type I probes ( 99964 ), measured in Green channel, have intensity below 1
## On average 0.016 % of the Type I probes ( 184560 ), measured in Red channel, have intensity below 1
## Set filtered probes in Red and/or Green channels to NA...
There are differences between probe types that are used in
probeFiltering()
. Information on these can be extracted and
stored using functions from minfi
(Feinberg, 2014).
locusNames <- getManifestInfo(RGset, "locusNames")
typeII <- getProbeInfo(RGset, type="II")
typeI <- getProbeInfo(RGset, type="I")
RGset
## class: RGChannelSet
## dim: 1052641 57
## metadata(0):
## assays(2): Green Red
## rownames(1052641): 1600101 1600111 ... 99810990 99810992
## rowData names(0):
## colnames(57): GSM3228563_200550980002_R02C01 GSM3228564_200550980002_R03C01 ... GSM3228620_200590490021_R07C01
## GSM3228621_200590490021_R08C01
## colData names(25): sample_ID geo_accession ... Neu Mono
## Annotation
## array: IlluminaHumanMethylationEPIC
## annotation: ilm10b4.hg19
CpGs in the ENCODE Blacklist22 are provided in the package and can be loaded in:
450k
- “./data/ENCODEBlacklist-CpGomit-450K.RData”EPIC
- “./data/ENCODEBlacklist-CpGomit-EPIC.RData”EPICv2
-
“./data/ENCODEBlacklist-CpGomit-EPICv2.RData”These were generated using the Blacklist .bed
file found
here, which was
converted to a GRanges object and overlapped with the manifest for each
DNAm array. In total, this removes 7313 probes from EPICv2, 9035 from
EPIC, and 6508 from the 450k array.
load("./data/ENCODEBlacklist_CpGomit-EPIC.RData")
betas <- betas[!rownames(betas) %in% cpg_blacklist,]
dim(betas)
## [1] 857749 57
Several studies15 have characterized cross-reactive and polymorphic probes on methylation arrays and made suggestions for removal. The proposed sets are actively maintained and supports all current array sizes. Additionally, different sets are specified depending on the reference genome used.
Probes are removed for a variety of reasons, from ambiguous mapping
and cross-reactivity to polymorphisms that interfere with extension. In
general, around 60,000 probes are suggested for removal in 450k arrays,
and 100,000 for EPIC. We developed the probeMasking()
function, which removes specified probes from either beta- or M- value
matrices.
## [1] 99360
## [1] 763466 57
For most EWAS, CpGs on sex chromosomes are filtered out. To perform analyses on sex-chromosomal probes, a stratified approach is required.
xy_cpgs <- (anno %>%
dplyr::filter(chr %in% c("X", "Y")))$cpg
betas <- betas[!rownames(betas) %in% xy_cpgs,]
dim(betas)
## [1] 746345 57
In many analyses, it is advisable to remove outlying DNAm values. This can be based on:
Outliers can be based on the IQR or RIN values. Based on your specification, you can then filter outlying values. Here, we present removing beta values more than 3 IQR from the nearest quartile.
## is.na(betas)
## FALSE TRUE
## 42487888 53777
iqr_dnam <- apply(betas, 1, function(x){
iqr <- IQR(x, na.rm = TRUE)
q1 <- quantile(x, na.rm=TRUE)[[2]]
q3 <- quantile(x, na.rm=TRUE)[[4]]
x <- ifelse((x <= q1 - (3*iqr) | x >= q3 + (3*iqr)), NA, x)
})
xtabs(~is.na(iqr_dnam))
## is.na(iqr_dnam)
## FALSE TRUE
## 42421845 119820
## [1] 57 746345
We advise keeping only the CpGs where less than 5% of sample values are missing.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1144 0.1561 0.1905 0.2817 0.2640 1.4570
## [1] 746345 57
On the sample level, we keep only the samples where less than 5% of CpGs are missing.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2817 0.0000 47.3684
## [1] 746345 57
Now that we have our annotations, we can combine them with our data
using the SummarizedExperiment
package. After ensuring that any column subsetting is accounted for, we
extract the colData()
from our RGset
. Finally,
the beta values, probe annotations, and sample information are combined
into a RangedSummarizedExperiment
object using the
SummarizedExperiment()
function.
Annotation data
## [1] 745368 8
Keep annotated probes
Sample information
## [1] 57 25
## [1] 745368 57
Make summarized experiment object
methData <- SummarizedExperiment(
assays=SimpleList(beta=betas),
rowData=anno,
colData=targets)
methData
## class: SummarizedExperiment
## dim: 745368 57
## metadata(0):
## assays(1): beta
## rownames(745368): cg09499020 cg16535257 ... cg08423507 cg24129471
## rowData names(8): cpg chr ... MASK_general probeType
## colnames(57): GSM3228563_200550980002_R02C01 GSM3228564_200550980002_R03C01 ... GSM3228620_200590490021_R07C01
## GSM3228621_200590490021_R08C01
## colData names(25): sample_ID geo_accession ... Neu Mono
You can read more about SummarizedExperiment
from the following resources:
Save RGset
## class: RGChannelSet
## dim: 1052641 57
## metadata(0):
## assays(2): Green Red
## rownames(1052641): 1600101 1600111 ... 99810990 99810992
## rowData names(0):
## colnames(57): GSM3228563_200550980002_R02C01 GSM3228564_200550980002_R03C01 ... GSM3228620_200590490021_R07C01
## GSM3228621_200590490021_R08C01
## colData names(25): sample_ID geo_accession ... Neu Mono
## Annotation
## array: IlluminaHumanMethylationEPIC
## annotation: ilm10b4.hg19
Save clean data
save(targets, file = "./data/targets.Rdata")
save(methData, file="./data/methData.Rdata")
save(RGset, file="./data/RGset.Rdata")
We want to save our data as a SummarizedExperiment
object for the rest of the pipeline.