Covariates

Each column in the RGset colData should be considered as a potential covariate in our models. Both technical and biological factors should be investigated as these may introduce batch effects or be clinically relevant. In order to assess this, we convert each column to a categorical variable whose correlations can be visualised.

Homogenous columns can be excluded, as any variables that do not change by definition cannot explain variation in the data. In this instance, the description for all samples is peripheral whole blood, so there is no need to investigate a tissue effect. However, in some instances samples may be taken from different tissues, and in this type of study analysis must be done separately for each to correctly separate clinically relevant methylation changes from tissue-specific methylation differences.

##    Mode    TRUE 
## logical      16
##  [1] "geo_accession"    "cohort"           "sex"              "cc"              
##  [5] "age"              "growth"           "facial"           "cns"             
##  [9] "Basename"         "filenames"        "Sentrix_Position" "baso_perc"       
## [13] "eos_perc"         "lymph_perc"       "mono_perc"        "neut_perc"

As you can see, 16 columns need to be examined. These include sample identifiers, such as title and the Basename variable we created, as well as which cohort the participant was from. Other potential causes of variation are biological, such as age, sex, and cc_status, and some, like Sentrix_Position and the blood cell counts, represent potential batch effects or confounding.

Correlations between complete factors and the principal components is calculated, and this correlation matrix is then melted using the melt() function from the reshape2 package. This prepares it for visualization using ggplot2.

##     geo_accession cohort   sex    cc   age growth facial   cns Basename
## PC1         -0.28  -0.16  0.99  0.06  0.03  -0.08  -0.07 -0.11    -0.28
## PC2         -0.03   0.13  0.10 -0.25  0.03   0.18   0.22  0.27    -0.03
## PC3          0.70   0.89  0.00 -0.02  0.28  -0.01  -0.03  0.04     0.70
## PC4         -0.15  -0.34 -0.01  0.20 -0.03  -0.14  -0.15 -0.09    -0.15
## PC5         -0.05   0.08 -0.02 -0.36 -0.14   0.27   0.36  0.22    -0.05
## PC6         -0.02   0.07  0.02 -0.01 -0.10  -0.03   0.00  0.01    -0.02
##   Var1          Var2 value
## 1  PC1 geo_accession -0.28
## 2  PC2 geo_accession -0.03
## 3  PC3 geo_accession  0.70
## 4  PC4 geo_accession -0.15
## 5  PC5 geo_accession -0.05
## 6  PC6 geo_accession -0.02

After some aesthetic preparation of the theme(), the correlations can be visualized in a heatmap using ggplot2’s geom_tile() function.

By examining the correlations in the data, we can build our model in a more informed manner. The first PC is highly correlated with sex and, as is usual in EWAS, we intend to include this as a confounder. Cell counts appear of substantial importance and will also be adjusted for. Model specification should be informed by a combination of prior knowledge and inspection of patterns in the data.

Below we will specify a model adjusting for age, sex, sentrix position, cell counts, and cohort. It is not appropriate to adjust for sample ID type variables, such as Basename, despite their correlation with our PCs.


EWAS using cate

The package cate uses high dimensional factor analysis to adjust for measured and unmeasured confounding in multiple hypothesis testing. Unmeasured confounders, which are correlated with both methylation and case-control status, can bias the results from association tests. This makes this method superior to using limma, since it both corrects for technical variation and performs robust linear regression.

We advise using cate over PCA, since you can specify that factors should be independent of the outcome of interest. This is not the case in PCA, which can result in biological variation being removed.

The first step in running an EWAS is to make sure that you have information on all your confounders, and that these are stored in a design matrix. This matrix should be created using the model.matrix() function, so that variables are treated correctly.

The formula that will be used for our EWAS is stored inside the RangedSummarizeExperiment and called when creating the design matrix, after ensuring that none of the covariates contain missing values. We adjust for measured confounders, including age, sex, sentrix position, and estimated cell percentages.

##                              (Intercept) ccdisease state: FASD age
## GSM3092700_9985178096_R01C01           1                     0  50
## GSM3092701_9985178127_R03C02           1                     0  19
## GSM3092702_9986360109_R02C02           1                     0 110
## GSM3092703_9985178087_R01C01           1                     0  39
## GSM3092704_9985178090_R01C01           1                     0  97

To correct for unknown confounders, we must estimate their number using the est.confounder.num() function from cate. In order to correctly input the arguments of this function you need to specify the variable of interest, which in this instance is case-control status, followed by a formula indicating that all other variables except the intercept are known confounders. This function also needs the design matrix as a data frame, the transposed beta values, and an indication of what methods and number of iterations you want to use.

## [1] 10

In this instance, it is suggested we correct for 10 unmeasured confounders. To implement this, we use the cate.fit() function, as shown below. The output object stores the beta values, estimated test statistics, and relevant p-values, and top hits can be viewed after using the Bonferonni correction to adjust for multiple testing.

##   cg18420599   cg20893180   cg00121876   cg10646368   cg04613258   cg26530706 
## 0.000000e+00 9.186873e-11 9.186873e-11 6.430811e-10 2.847931e-09 3.123537e-09
## [1] 297

This analysis has identified 297 differentially methylated probes (DMPs) associated with FASD after adjusting for age, sex, and sentrix position. These can be analysed further, but first we want to outline alternatives to the cate-based EWAS, and demonstrate that results are comparable.


Alternatives

Limma is a package with excellent documentation that can be useful for smaller samples, due to the optional empirical Bayes step. Running limma is similar to running cate, except that the beta matrix does not need transposing and test statistics and p-values need to be manually derived.

##   cg18420599   cg00121876   cg20893180   cg10646368   cg04613258   cg26530706 
## 2.425952e-11 5.279661e-11 8.177351e-11 6.221082e-10 2.815284e-09 3.132579e-09
## [1] 297

As shown, results from limma are comparable with those from cate. Here, the same number of DMPs are identified and top hits are identical, albeit with slightly differing p-values. Despite this, for the reasons specified above, we still advise using cate unless your sample size is small enough to require the eBayes option.

Linear mixed models and linear models can both work well for performing EWAS and, when looking solely at top hits, are unlikely to differ greatly from results obtained using cate or limma.


bacon

We developed a package called bacon [@VanIterson2017] to estimate and correct for bias and inflation of test statistics in EWAS. This maximizes power while properly controlling the false positive rate, by estimating the empirical null distribution using Bayesian statistics. The utility of the tool was illustrated through the application of meta-analysis by performing epigenome- and transcriptome-wide association studies (EWASs/TWASs) on age and smoking which highlighted an overlap in differential methylation and expression of associated genes. It can be used to remove inflation and bias often observed in EWAS.

## Detected 2 workers!
##  Running in parallel!

Plots can be used to compare adjustment using cate with that using bacon, and bacon-adjusted limma and cate values can be compared side by side.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##                   limma         cate
## cg00121876 4.415519e-14 4.438737e-14
## cg09249404 1.025937e-11 1.030006e-11
## cg18420599 3.683593e-11 3.720331e-11
## cg20893180 8.394516e-11 8.475689e-11
## cg08532057 2.300984e-10 2.308458e-10
## cg23522872 2.550714e-10 2.558940e-10
##                   limma         cate
## cg18420599 5.863470e-17 5.863470e-17
## cg00121876 1.276082e-16 1.276082e-16
## cg20893180 1.976447e-16 1.976447e-16
## cg10646368 1.503621e-15 1.503621e-15
## cg04613258 6.804477e-15 6.804477e-15
## cg26530706 7.571372e-15 7.571372e-15

After looking at these plots, strong bias is observed, and running bacon should be seriously considered.


DMR Finder

This DNAmArray package also contains a DMRfinder (Slieker, 2013) function, which can be used to identify differentially methylated regions (DMRs). This function takes a simple data frame as input, which needs to contain a column called crit with a binary indicator of whether the adjusted p-value was significant or not. Other options include mismatches, which specifies how many DMPs are allowed to be not significant in a row before a region is not deemed to be a DMR, and the icd or Inter-CpG distance required for a region to be identified.

## 
## =============================================================================
## GRanges object with 16 ranges and 0 metadata columns:
##        seqnames              ranges strand
##           <Rle>           <IRanges>  <Rle>
##    [1]     chr1   27683345-27683386      *
##    [2]     chr2 121625577-121625735      *
##    [3]     chr2 206628553-206628692      *
##    [4]     chr4     5710761-5711301      *
##    [5]     chr5     8457548-8458392      *
##    ...      ...                 ...    ...
##   [12]    chr17   76886742-76886912      *
##   [13]    chr18     5891948-5892213      *
##   [14]    chr18   11850139-11850153      *
##   [15]    chr18   32173084-32173227      *
##   [16]    chr19   36484913-36485360      *
##   -------
##   seqinfo: 11 sequences from an unspecified genome; no seqlengths

In this instance, 16 DMRs were found to be associated with FASD. These regions can be visualised with a chromosome-based circo plot, such as using OmicCircos to get a better understanding of the regions identified.


Probe Annotation

Our package also contains a cpgInfo() function, which can be used to map CpG probes to their nearest genes. This utilises the TxDb.Hsapiens.UCSC.hg19.knownGene package, which accesses the USCS annotation databases.

##   403 genes were dropped because they have exons located on both strands
##   of the same reference sequence or on more than one reference sequence,
##   so cannot be represented by a single genomic range.
##   Use 'single.strand.genes.only=FALSE' to get all the genes in a
##   GRangesList object, or use suppressMessages() to suppress this message.
## Fetching coordinates for hg19...
## 'select()' returned many:1 mapping between keys and columns
##            seqnames     start       end  width strand gene_id  SYMBOL
## cg00000957     chr1   5922870   6052533 129664      -  261734   NPHP4
## cg00001349     chr1 166944819 166991447  46629      +   84944    MAEL
## cg00001583     chr1 199996730 200146550 149821      +    2494   NR5A2
## cg00002028     chr1  20959948  20978004  18057      +   65018   PINK1
## cg00002719     chr1 169364114 169429907  65794      -   57821 CCDC181
## cg00002837     chr1  44514030  44600809  86780      +  128209   KLF17
## cg00003202     chr1 151313116 151319769   6654      -    5993    RFX5
## cg00003287     chr1 201328136 201346828  18693      -    7139   TNNT2
## cg00007036     chr1  33722174  33766320  44147      +  149076  ZNF362
## cg00007898     chr1 205111631 205180727  69097      -   25778   DSTYK