The DNAmArray package can be installed in several ways and has been tested with R >= 4.4.3 on various Linux builds and on Windows.
First, install devtools, then use the install_github() function to fetch the DNAmArray package.
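A minimal sketch of this route, assuming the package is installed from the molepi/DNAmArray GitHub repository referenced in the git instructions, and that devtools should only be installed from CRAN when it is missing:

```r
# Install devtools from CRAN only if it is not already available
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# Fetch and install DNAmArray from GitHub
# (repository path assumed from the git clone address in this document)
devtools::install_github("molepi/DNAmArray")
```

Both calls need network access, and Bioconductor dependencies may have to be installed separately if they are not resolved automatically.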
git/R
Using git, you can clone our repository and then build and install the package, replacing x.y.z with the relevant version.
git clone git@github.com:molepi/DNAmArray.git
R CMD build DNAmArray
R CMD INSTALL DNAmArray_x.y.z.tar.gz
First, load the packages that are required for this workflow:
library(DNAmArray)
library(MethylAid)
library(omicsPrint)
library(bacon)
library(GEOquery)
library(tidyverse)
library(ggrepel)
library(reshape2)
library(ComplexHeatmap)
library(circlize)
library(BiocParallel)
library(IlluminaHumanMethylationEPICmanifest)
library(minfi)
library(wateRmelon)
library(snow)
library(limma)
library(EpiDISH)
library(sva)
library(methyLImp2)
library(SummarizedExperiment)
The first step in analysing microarray data is importing raw intensity files into your software program. In this example, we show how to import raw IDAT files from GEO into R, but similar strategies can be employed for all Illumina DNAm array files when using R.
The getGEOSuppFiles() function downloads the supplementary data to the current working directory. These files consist of the raw IDAT files alongside relevant documentation.
getGEOSuppFiles("GSE116339")
untar(tarfile="../GSE116339/GSE116339_RAW.tar", exdir="../GSE116339/IDATs")
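The downloaded archive is then unpacked with untar(). The IDAT files it contains are gzip-compressed (.idat.gz) and can be decompressed with the gunzip() function if required.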
getGEO() can then be used to import the series matrix file into R as an ExpressionSet object, from which the metadata of interest is extracted.
GSE116339 <- getGEO(filename='./data/GSE116339_series_matrix.txt.gz', getGPL = FALSE)
targets <- phenoData(GSE116339)@data
rm(GSE116339)
targets
Before progressing further, take some time to get familiar with the targets data frame, removing duplicate information and converting variables to relevant classes.
targets <- targets %>%
  # Keep and rename the columns of interest
  dplyr::select(sample_ID = `sample_id:ch1`,
                geo_accession,
                sex = `gender:ch1`,
                age = `age:ch1`,
                log_total_pbb = `ln(totalpbb):ch1`,
                pbb_153 = `pbb-153:ch1`,
                pbb_77 = `pbb-77:ch1`,
                pbb_101 = `pbb-101:ch1`,
                pbb_180 = `pbb-180:ch1`,
                supplementary_file) %>%
  # Split the IDAT file name into plate and array position
  separate(supplementary_file, sep = "_", remove = FALSE,
           into = c(NA, "plate", "row", NA)) %>%
  # Convert character columns to numeric; keep only the row digit of the position
  mutate(age = as.numeric(age),
         log_total_pbb = as.numeric(log_total_pbb),
         pbb_153 = as.numeric(pbb_153),
         pbb_77 = as.numeric(pbb_77),
         pbb_101 = as.numeric(pbb_101),
         pbb_180 = as.numeric(pbb_180),
         row = as.numeric(substr(row, 3, 3)))
str(targets)
## 'data.frame': 679 obs. of 12 variables:
## $ sample_ID : chr "3141584" "10120245" "99990005" "12130802" ...
## $ geo_accession : chr "GSM3228562" "GSM3228563" "GSM3228564" "GSM3228565" ...
## $ sex : chr "Female" "Female" "Male" "Male" ...
## $ age : num 75 31.5 46.7 46.4 53.6 ...
## $ log_total_pbb : num -2.542 -2.754 -0.371 0.115 -1.548 ...
## $ pbb_153 : num 0.069 0.054 0.68 1.107 0.203 ...
## $ pbb_77 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ pbb_101 : num 0 0 0 0.008 0 0 0 0 0 0 ...
## $ pbb_180 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ supplementary_file: chr "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM3228nnn/GSM3228562/suppl/GSM3228562_200550980002_R01C01_Grn.idat.gz" "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM3228nnn/GSM3228563/suppl/GSM3228563_200550980002_R02C01_Grn.idat.gz" "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM3228nnn/GSM3228564/suppl/GSM3228564_200550980002_R03C01_Grn.idat.gz" "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM3228nnn/GSM3228565/suppl/GSM3228565_200550980002_R04C01_Grn.idat.gz" ...
## $ plate : chr "200550980002" "200550980002" "200550980002" "200550980002" ...
## $ row : num 1 2 3 4 5 6 7 8 1 3 ...
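To make the plate and row extraction concrete, the sketch below applies the same logic in base R to one supplementary_file value copied from the str() output. The variable names `url`, `parts` and `position` are illustrative only and do not appear in the workflow.

```r
# One supplementary_file value taken from the str() output
url <- "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM3228nnn/GSM3228562/suppl/GSM3228562_200550980002_R01C01_Grn.idat.gz"

# Split on underscores, mirroring separate(..., sep = "_")
parts <- strsplit(url, "_", fixed = TRUE)[[1]]

# Second piece is the plate (array barcode), third is the array position
plate <- parts[2]
position <- parts[3]

# The third character of the position is the row digit
row <- as.numeric(substr(position, 3, 3))
```

Because only a single character is taken, this approach assumes single-digit row numbers, which is consistent with the values 1 to 8 seen in the row column.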
After cleaning, this data frame consists of 679 observations of 12 variables. These include:
sample_ID - the ID of the sample
geo_accession - the GEO accession number of the sample
sex - male or female
age - age of the individual
log_total_pbb, pbb_153, pbb_77, pbb_101, pbb_180 - exposure levels
supplementary_file - location of IDAT file
plate - EPIC array number
row - row number on the array (continuous)
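As a minimal sketch of the kind of checks that help when getting familiar with targets, the snippet below builds a small stand-in data frame from the first four rows printed by str() (ages are the rounded values shown there). It then checks for duplicate IDs, inspects column classes and tabulates sex. The name `toy_targets` is hypothetical; on real data the same calls would be applied to `targets`.

```r
# Hypothetical stand-in for `targets`, using the first four rows printed by str()
toy_targets <- data.frame(
  sample_ID = c("3141584", "10120245", "99990005", "12130802"),
  sex       = c("Female", "Female", "Male", "Male"),
  age       = c(75, 31.5, 46.7, 46.4),
  row       = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# anyDuplicated() returns 0 when every sample ID is unique
n_dup <- anyDuplicated(toy_targets$sample_ID)

# Class of each column, to confirm the numeric conversions took effect
classes <- vapply(toy_targets, class, character(1))

# Tabulate a categorical variable to spot unexpected levels
sex_counts <- table(toy_targets$sex)
```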