Ambiguous probes

Several studies (Zhou, 2017) have characterized cross-reactive and polymorphic probes on both the 450k and EPIC methylation arrays and made suggestions for removal. We chose to remove the probes identified by Zhou et al., since this set is actively maintained and supports both array types. Separate probe sets are also specified for each reference genome.

Zhou et al. suggest removing probes for a variety of reasons, ranging from ambiguous mapping and cross-reactivity to polymorphisms that interfere with extension. Around 60,000 probes are suggested for removal on 450k arrays, and around 100,000 on EPIC arrays.

We developed the probeMasking() function, which removes specified probes from either beta- or M-value matrices. In this instance, 59,780 CpG rows are removed.

## [probeFilterDNAmArray] Extracting probe filter... 
## [probeFilterDNAmArray] 60466 probes considered for filtering... 
## [probeFilterDNAmArray] 59780 / 483366 probes removed from the dataset
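
The row-removal step reported above can be illustrated with a minimal base R sketch. This is not the internal code of the masking function; the toy matrix, the probe IDs, and the `maskProbes` vector are all made up for illustration. It also shows one plausible reason why fewer probes are removed than considered: IDs on the mask list that are absent from the matrix have nothing to remove.

```r
# Toy beta-value matrix: 5 probes (rows) x 3 samples (columns).
set.seed(1)
betas <- matrix(runif(15), nrow = 5,
                dimnames = list(c("cg01", "cg02", "cg03", "cg04", "cg05"),
                                c("s1", "s2", "s3")))

# Hypothetical mask list; "cg99" is deliberately not present in the matrix.
maskProbes <- c("cg02", "cg05", "cg99")

# Keep only the rows whose probe ID is not on the mask list.
keep <- !(rownames(betas) %in% maskProbes)
filtered <- betas[keep, , drop = FALSE]

message(length(maskProbes), " probes considered for filtering")
message(sum(!keep), " / ", nrow(betas), " probes removed from the dataset")
```

Here three probes are considered but only two are removed, leaving a 3 x 3 matrix. The same logical indexing applies unchanged to an M-value matrix, since only the row names are used.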

RangedSummarizedExperiment

Further along the pipeline, our methylation data will need to be stored in a RangedSummarizedExperiment class object. Designed for rectangular experimental data, such as that produced by microarray and sequencing experiments, this class stores observations from multiple samples alongside relevant metadata, and ensures that both features and phenotypes are kept in sync when subsetting.

This matrix-like container is organised with rows representing features, such as genes or exons (here, CpG probes), which can be accessed using rowRanges(). This function returns a GRanges object describing the features and their attributes. The columns of the RangedSummarizedExperiment store information about each sample, and this can be accessed using colData().

Lastly, each assay forms a layer in the third dimension of this matrix-like object, and the available assays can be listed using the assays() function. Experiment-level metadata is linked to this combination of data, and can be accessed with the metadata() function.
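
To make the layout concrete, the sketch below builds a very small RangedSummarizedExperiment and calls each accessor mentioned above. All values, coordinates, and names (`betas`, `probes`, `samples`, `rse`) are invented for illustration and are unrelated to the dataset used in this workflow.

```r
library(SummarizedExperiment)
library(GenomicRanges)

# Toy assay: 3 probes (rows) x 2 samples (columns); values are made up.
betas <- matrix(c(0.1, 0.8, 0.5, 0.2, 0.9, 0.4), nrow = 3,
                dimnames = list(c("cg01", "cg02", "cg03"), c("s1", "s2")))

# Row features: one genomic range per probe (coordinates are made up).
probes <- GRanges(seqnames = c("chr1", "chr1", "chrX"),
                  ranges = IRanges(start = c(100, 200, 300), width = 2),
                  strand = "*")
names(probes) <- rownames(betas)

# Column annotations: one row per sample.
samples <- DataFrame(sex = c("F", "M"), row.names = colnames(betas))

rse <- SummarizedExperiment(assays = list(betas = betas),
                            rowRanges = probes,
                            colData = samples,
                            metadata = list(array = "toy"))

rowRanges(rse)       # GRanges describing the features
colData(rse)         # DataFrame of sample-level information
assays(rse)          # list-like collection of assays
assay(rse, "betas")  # the beta-value matrix itself
metadata(rse)        # experiment-level metadata

# Subsetting the object subsets assays, features and samples together.
sub <- rse[c("cg01", "cg03"), "s2"]
dim(sub)
```

Because `sub` is created with a single bracket operation, its assay, row ranges, and sample table cannot drift out of step with one another, which is the main practical benefit of the class.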


Creating a RangedSummarizedExperiment

The FDb.InfiniumMethylation.hg19 package contains probe annotations, which can be extracted using the getPlatform() function. To ensure that the previous probe filtering carries over, we use match() to keep only the annotations for probes that remain in our data.

## Fetching coordinates for hg19...
## GRanges object with 423586 ranges and 2 metadata columns:
##                   seqnames            ranges strand | channel platform
##                      <Rle>         <IRanges>  <Rle> |   <Rle>    <Rle>
##        cg01707559     chrY   6778695-6778696      * |     Red    HM450
##        cg03244189     chrY 21238472-21238473      * |     Grn    HM450
##        cg04792227     chrY 17568097-17568098      * |     Red    HM450
##        cg14180491     chrY 15016705-15016706      * |     Grn    HM450
##        cg25032547     chrY 14773536-14773537      * |     Red    HM450
##               ...      ...               ...    ... .     ...      ...
##   ch.22.44116734F    chr22          45738070      * |    Both    HM450
##     ch.22.909671F    chr22          46114168      * |    Both    HM450
##   ch.22.46830341F    chr22          48451677      * |    Both    HM450
##    ch.22.1008279F    chr22          48731367      * |    Both    HM450
##   ch.22.47579720R    chr22          49193714      * |    Both    HM450
##   -------
##   seqinfo: 24 sequences from hg19 genome
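
The match() step can be sketched on toy data as follows. In the workflow the annotation is the GRanges returned by getPlatform(); here a small hand-made GRanges stands in for it, and the names `annot`, `keptProbes`, and `annotKept` are hypothetical.

```r
library(GenomicRanges)

# Toy annotation covering four probes (coordinates are made up).
annot <- GRanges(seqnames = c("chr1", "chr2", "chr2", "chrY"),
                 ranges = IRanges(start = c(10, 20, 30, 40), width = 2))
names(annot) <- c("cg01", "cg02", "cg03", "cg04")

# Probe IDs that survived filtering, in the row order of the data matrix.
keptProbes <- c("cg03", "cg01", "cg04")

# match() gives, for each kept probe, its position in the annotation,
# so the result is both filtered and reordered to follow the data.
idx <- match(keptProbes, names(annot))

# A probe with no annotation would yield NA here and needs handling
# before subsetting; this toy example has none.
stopifnot(!anyNA(idx))
annotKept <- annot[idx]

names(annotKept)
```

Using match() rather than `%in%` matters because it aligns the annotation to the row order of the data, not just its membership, which is what a later row-wise combination relies on.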

Now that we have our annotations, we can combine them with our data using the SummarizedExperiment package. After ensuring that any column subsetting is accounted for, we extract the sample information from our RGset using colData(). Finally, the beta values, probe annotations, and sample information are combined into a RangedSummarizedExperiment object using the SummarizedExperiment() function.

## class: RangedSummarizedExperiment 
## dim: 423586 138 
## metadata(0):
## assays(1): betas
## rownames(423586): cg01707559 cg03244189 ... ch.22.1008279F
##   ch.22.47579720R
## rowData names(10): addressA addressB ... probeEnd probeTarget
## colnames(138): GSM3092700_9985178096_R01C01
##   GSM3092701_9985178127_R03C02 ... GSM3093566_9020331152_R05C01
##   GSM3093567_9020331152_R06C01
## colData names(16): geo_accession cohort ... mono_perc neut_perc

You can read more about SummarizedExperiment from the following resources:

Sex Chromosomal Probes

Having a RangedSummarizedExperiment class object expedites masking sex chromosomal probes.
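
As an illustration of why the class helps here, the sketch below drops probes on chrX and chrY from a toy object by querying the chromosome of each row. It is one possible approach under made-up data, not necessarily the exact code used in this workflow; the object and variable names are hypothetical.

```r
library(SummarizedExperiment)
library(GenomicRanges)

# Toy assay: 4 probes x 2 samples; values and coordinates are made up.
betas <- matrix(seq(0.1, 0.8, by = 0.1), nrow = 4,
                dimnames = list(c("cg01", "cg02", "cg03", "cg04"),
                                c("s1", "s2")))
probes <- GRanges(seqnames = c("chr1", "chrX", "chr2", "chrY"),
                  ranges = IRanges(start = c(10, 20, 30, 40), width = 2))
names(probes) <- rownames(betas)

rse <- SummarizedExperiment(assays = list(betas = betas),
                            rowRanges = probes)

# Flag rows whose feature lies on a sex chromosome.
onSexChr <- as.character(seqnames(rowRanges(rse))) %in% c("chrX", "chrY")

# Dropping those rows removes them from the assay and row ranges together.
autosomal <- rse[!onSexChr, ]
rownames(autosomal)
```

Only the two autosomal probes remain, and no separate bookkeeping is needed to keep the beta values and the annotation aligned.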