Installation

The DNAmArray package can be installed in several ways, and has been tested with R >= 4.4.3 on various Linux builds and on Windows.
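Before installing, it can help to confirm that the running R session is at least as new as the tested version. The following is a minimal sketch using only base R; the variable names `tested_version` and `r_ok` are illustrative, and an older R version may still work even though it has not been tested.

```r
# Version this workflow reports being tested with (taken from the text above)
tested_version <- "4.4.3"

# getRversion() returns a version object that compares correctly
# against version strings such as "4.4.3"
r_ok <- getRversion() >= tested_version

if (!r_ok) {
  message("R ", as.character(getRversion()),
          " is older than the tested version ", tested_version)
}
```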

Install using devtools

First, install devtools, and then use the install_github() function to fetch the DNAmArray package.

library(devtools)
install_github("molepi/DNAmArray")

Install from source using `git/R`

Using git, you can clone our repository and then build and install the package, replacing x.y.z with the relevant version.

git clone git@github.com:molepi/DNAmArray.git
R CMD build DNAmArray
R CMD INSTALL DNAmArray_x.y.z.tar.gz

Loading packages

First, load the packages that are required for this workflow:

DNAmArray - the main package, containing in-house built functions for the preprocessing of DNAm data
MethylAid¹⁰ - for sample-level quality control
omicsPrint¹² - in-house tool used to detect sample linkage errors and resolve them
bacon¹¹ - in-house tool for reducing bias and inflation in EWAS test statistics
GEOquery¹⁴ - bridges the gap between Bioconductor tools and GEO
tidyverse - for data wrangling
reshape2 - for data wrangling
ggrepel - for data visualization
ComplexHeatmap - for creating heatmaps
circlize - designed to allow circular plots, but also used to create custom colour palettes
BiocParallel - to parallelize processing of genomic data
IlluminaHumanMethylationEPICmanifest - containing EPIC probe annotation for the example data (other packages are available for 450k or EPICv2)
minfi⁹ - functions to preprocess DNAm data
wateRmelon³⁰ - functions to preprocess DNAm data
snow - required by methyLImp2 for parallelization
limma²⁷ - for EWAS analysis
EpiDISH²¹ - for cell count predictions
sva²³⁻²⁵ - for estimating latent factors
methyLImp2²⁰ - for imputing missing values
SummarizedExperiment - container class for assay data with sample and feature annotation

library(DNAmArray)
library(MethylAid)
library(omicsPrint)
library(bacon)

library(GEOquery)
library(tidyverse)
library(ggrepel)
library(reshape2)
library(ComplexHeatmap)
library(circlize)
library(BiocParallel)
library(IlluminaHumanMethylationEPICmanifest)
library(minfi)
library(wateRmelon)
library(snow)

library(limma)
library(EpiDISH)
library(sva)
library(methyLImp2)
library(SummarizedExperiment)
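If any of the library() calls above fails, it is usually because a package has not been installed yet. The sketch below lists the packages named in this section and reports which are missing, without loading them. The names `pkgs`, `installed` and `missing_pkgs` are illustrative and not part of the workflow.

```r
# Packages named in this section
pkgs <- c("DNAmArray", "MethylAid", "omicsPrint", "bacon", "GEOquery",
          "tidyverse", "reshape2", "ggrepel", "ComplexHeatmap", "circlize",
          "BiocParallel", "IlluminaHumanMethylationEPICmanifest", "minfi",
          "wateRmelon", "snow", "limma", "EpiDISH", "sva", "methyLImp2",
          "SummarizedExperiment")

# system.file() returns "" for a package that is not installed,
# so this check does not attach or load anything
installed <- vapply(pkgs, function(p) nzchar(system.file(package = p)),
                    logical(1))

missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}
```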

Importing IDATs

The first step in analysing microarray data is importing raw intensity files into your software program. In this example, we show how to import raw IDAT files from GEO into R, but similar strategies can be employed for all Illumina DNAm array files when using R.

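Using the getGEOSuppFiles() function, supplementary data are downloaded into a folder named after the accession (here GSE116339), created by default in the current working directory. These files consist of the raw IDATs alongside relevant documentation. Adjust the paths in the untar() call to match where this folder was created.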

getGEOSuppFiles("GSE116339")
untar(tarfile="../GSE116339/GSE116339_RAW.tar", exdir="../GSE116339/IDATs")

The data can then be efficiently unpacked using the gunzip() function.

setwd("../GSE116339/IDATs")
sapply(list.files(), gunzip)
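Illumina arrays store each sample as a pair of IDAT files, one per colour channel (Grn and Red), so after unpacking it is worth checking that no sample is missing a channel. The following is a minimal sketch in base R; `find_unpaired_idats()` is a hypothetical helper, and the example file names are illustrative ones modelled on the GEO naming pattern seen in this data set.

```r
# Hypothetical helper: return the sample prefixes that do not have
# both a Grn and a Red IDAT file
find_unpaired_idats <- function(files) {
  pattern <- "_(Grn|Red)\\.idat(\\.gz)?$"
  files <- basename(files)
  files <- files[grepl(pattern, files)]
  if (length(files) == 0) {
    return(character(0))
  }
  sample_prefix <- sub(pattern, "", files)
  channel <- sub(paste0("^.*", pattern), "\\1", files)
  # Count the distinct channels found for each sample prefix
  n_channels <- tapply(channel, sample_prefix,
                       function(x) length(unique(x)))
  unname(names(n_channels)[as.vector(n_channels < 2)])
}

# Illustrative file names (assumed, not read from disk)
example_files <- c(
  "GSM3228562_200550980002_R01C01_Grn.idat",
  "GSM3228562_200550980002_R01C01_Red.idat",
  "GSM3228563_200550980002_R02C01_Grn.idat"
)

unpaired <- find_unpaired_idats(example_files)
unpaired
```

In the workflow itself, the same helper could be applied to the output of list.files() in the IDAT directory; an empty result means every sample has both channels.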

getGEO() can then be used to import the series matrix file into R as an ExpressionSet object, from which the metadata of interest is extracted.

GSE116339 <- getGEO(filename='./data/GSE116339_series_matrix.txt.gz', getGPL = FALSE)
targets <- phenoData(GSE116339)@data
rm(GSE116339)
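The sample characteristics imported from GEO typically arrive in columns whose names end in `:ch1`, and they are stored as character strings even when they hold numbers. The sketch below illustrates this on a toy data frame; `toy_targets` is a stand-in built by hand (its column names and values mirror those shown in this workflow), not the real `targets` object.

```r
# Toy stand-in for the imported phenotype table (assumed structure);
# check.names = FALSE keeps the ':' in the column names
toy_targets <- data.frame(
  geo_accession   = c("GSM3228562", "GSM3228563"),
  `sample_id:ch1` = c("3141584", "10120245"),
  `gender:ch1`    = c("Female", "Female"),
  `age:ch1`       = c("75", "31.5"),
  check.names = FALSE
)

# Columns holding sample characteristics
ch1_cols <- grep(":ch1$", colnames(toy_targets), value = TRUE)
ch1_cols

# All of them are character, so numeric variables need converting
col_classes <- vapply(toy_targets[ch1_cols], class, character(1))
col_classes
```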

Preparing `targets`

Before progressing further, take some time to get familiar with the targets data frame, removing duplicate information and converting variables to relevant classes.

targets <- targets %>% dplyr::select(sample_ID = `sample_id:ch1`,
                              geo_accession,
                              sex = `gender:ch1`,
                              age = `age:ch1`,
                              log_total_pbb = `ln(totalpbb):ch1`,
                              pbb_153 = `pbb-153:ch1`,
                              pbb_77 = `pbb-77:ch1`,
                              pbb_101 = `pbb-101:ch1`,
                              pbb_180 = `pbb-180:ch1`,
                              supplementary_file) %>% 
  separate(supplementary_file, sep = "_", remove = FALSE, into = c(NA, "plate", "row", NA)) %>% 
  mutate(age = as.numeric(age),
         log_total_pbb = as.numeric(log_total_pbb),
         pbb_153 = as.numeric(pbb_153),
         pbb_77 = as.numeric(pbb_77),
         pbb_101 = as.numeric(pbb_101),
         pbb_180 = as.numeric(pbb_180),
         row = as.numeric(substr(row, 3, 3)))

str(targets)

## 'data.frame':    679 obs. of  12 variables:
##  $ sample_ID         : chr  "3141584" "10120245" "99990005" "12130802" ...
##  $ geo_accession     : chr  "GSM3228562" "GSM3228563" "GSM3228564" "GSM3228565" ...
##  $ sex               : chr  "Female" "Female" "Male" "Male" ...
##  $ age               : num  75 31.5 46.7 46.4 53.6 ...
##  $ log_total_pbb     : num  -2.542 -2.754 -0.371 0.115 -1.548 ...
##  $ pbb_153           : num  0.069 0.054 0.68 1.107 0.203 ...
##  $ pbb_77            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ pbb_101           : num  0 0 0 0.008 0 0 0 0 0 0 ...
##  $ pbb_180           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ supplementary_file: chr  "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM3228nnn/GSM3228562/suppl/GSM3228562_200550980002_R01C01_Grn.idat.gz" "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM3228nnn/GSM3228563/suppl/GSM3228563_200550980002_R02C01_Grn.idat.gz" "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM3228nnn/GSM3228564/suppl/GSM3228564_200550980002_R03C01_Grn.idat.gz" "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM3228nnn/GSM3228565/suppl/GSM3228565_200550980002_R04C01_Grn.idat.gz" ...
##  $ plate             : chr  "200550980002" "200550980002" "200550980002" "200550980002" ...
##  $ row               : num  1 2 3 4 5 6 7 8 1 3 ...
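To make the separate() and substr() steps concrete, the sketch below reproduces them in base R for the first supplementary file shown in the output above. The variable names `pieces` and `position` are illustrative.

```r
# First supplementary_file value from the str() output above
supplementary_file <- paste0(
  "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM3228nnn/GSM3228562/suppl/",
  "GSM3228562_200550980002_R01C01_Grn.idat.gz"
)

# Splitting on "_" yields four pieces: the URL up to the GSM accession,
# the plate barcode, the array position, and the channel plus extension
pieces <- strsplit(supplementary_file, "_", fixed = TRUE)[[1]]

plate <- pieces[2]      # array (chip) barcode
position <- pieces[3]   # position on the array, e.g. "R01C01"

# The third character of the position is the last digit of the row number
row <- as.numeric(substr(position, 3, 3))

plate
row
```

Taking only the third character is sufficient here because the rows shown run from R01 to R08; an array layout with ten or more rows would need both digits, for example substr(position, 2, 3).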

After cleaning, this data frame contains 679 observations of 12 variables. These include:

sample_ID - the ID of the sample
geo_accession - the GEO accession number of the sample
sex - male or female
age - age of the individual
log_total_pbb, pbb_153, pbb_77, pbb_101, pbb_180 - exposure levels
supplementary_file - location of IDAT file
plate - EPIC array number
row - row number on the array (continuous)
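Because plate and row together describe where a sample sat on an array, a quick sanity check is that no two samples share the same position. The following is a minimal sketch on made-up data; `toy_targets` and its values are entirely hypothetical and only mimic the column names described above.

```r
# Made-up example data; sample D deliberately shares a position with C
toy_targets <- data.frame(
  sample_ID = c("A", "B", "C", "D"),
  plate     = c("plate1", "plate1", "plate2", "plate2"),
  row       = c(1, 2, 1, 1)
)

# Combine plate and row into a single position label
position <- paste(toy_targets$plate, toy_targets$row, sep = "_R")

# Keep every sample whose position occurs more than once
dup <- toy_targets[position %in% position[duplicated(position)], ]
dup
```

An empty result on the real `targets` data frame would indicate that every sample maps to a distinct array position.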