Abstract

As expression microarrays are typically designed relative to a reference genome, any individual genetic variant that overlaps a probe’s genomic position can possibly cause a reduction in hybridization due to the probe no longer being a perfect match to a given sample’s mRNA at that locus. If the samples or groups used in a microarray study differ in terms of genetic variants, the results of the microarray experiment can be negatively impacted. The oligoMask package is an R/SQLite framework which can utilize publicly available genetic variants and works in conjunction with the oligo package to read in the expression data and remove microarray probes which are likely to impact a given microarray experiment prior to analysis. Tools are provided for creating an SQLite database containing the probe and variant annotations and for performing the commonly used RMA preprocessing procedure for Affymetrix microarrays. The oligoMask package is freely available at https://github.com/dbottomly/oligoMask.

Introduction

It has been observed that for mRNA microarrays from a given sample, genetic differences of that sample relative to the probe sequences can affect hybridization to short oligonucleotide probes resulting in false positives or negatives depending on the experimental and array design (Walter et al. 2007; Alberts et al. 2007). Several approaches currently exist to identify and flag/remove probes that have hybridization artifacts due to genetic variants. Removal of probes based on pre-defined genetic variant databases is one such approach (Benovoy, Kwan, and Majewski 2008). Software (Kumari, Verma, and Weller 2007) and databases (Duan et al. 2008) allowing the interrogation of the relationships between microarray probes and single nucleotide variants have been described. The R package CustomCDF (Dai et al. 2005) is an example of this approach that removes probes from an environment formed from the Affymetrix chip description file (CDF) prior to analysis. One potential limitation of the CDF filtering approach is that for some of the more recent arrays platforms such as the Affymetrix Gene or Exon arrays the use of such environments has been superseded (e.g. the SQLite databases in the oligo (Carvalho and Irizarry 2010) package or ROOT scheme files as in the xps (Stratowa, Vienna, and Austria 2013) package).

In addition, the actual expression data itself can be interrogated to identify and mask out variants. R packages exist to effectively deal with a two group comparison between several strains or species through procedures based mainly on the expression data such as maskBAD (Dannemann, Lachmann, and Lorenc 2012) and SNEP (Fujisawa et al. 2009). However, models based on two (genetic) groups have limited utility when analyzing more complicated experimental designs such as those found in expression-based analyses using more genetically diverse mouse lines such as Diversity Outbred (Svenson et al. 2012), Collaborative Cross (Collaborative Cross Consortium 2012) or other Heterogenous Stock (Chia et al. 2005) mice.

In order to facilitate eQTL mapping and other expression analyses in complex mouse crosses we devised an R package oligoMask based on the use of high quality publicly available genetic variant databases to screen microarray probes identifying probes impacted by variants. The key to this is the relatively recent availability of genome-wide variant databases in the variant call format (VCF) such as those from the Sanger Mouse Genomes Project (Keane et al. 2011) and the 1000 Genomes for humans (1000 Genomes Project Consortium 2012) as well as the ability to query and parse these files via the VariantAnnotation (Obenchain et al. 2014) package. The oligoMask package is designed to work in conjunction with the oligo Bioconductor package to facilitate removal of aberrant probe expression prior to the commonly used robust multi-array average (RMA) (Irizarry et al. 2003) pre-processing procedure for Affymetrix arrays. Our package works by removing potentially impacted probes from the overall expression matrix prior to the call to the RMA processing functions. This can be done before the background correction step or after the normalization step. The annotation for these impacted probes are most easily derived from VCF files with the parsed data stored in an SQLite database. This database can be optionally wrapped in an R package with appropriate metadata to facilitate sharing and reproducibility. High-level S4 classes and methods provide a convenient interface with oligoMask and oligo. In addition, users can define new database schemas, add custom data as well as create their own functionality. Below we give an overview as well as demonstrate using publicly available data the steps involved for the use of oligoMask.

Example data

The example data presented in this article and in the vignette was downloaded from the gene expression omnibus (GEO) with accession number GSE33822 (Sun et al. 2012). For demonstration purposes we use a subset (n=8) of the dataset including only those samples derived from whole brain which received the vehicle treatment and that were run on version 1 of the Mouse Gene ST array. In our example, we are looking for expression differences between the NOD/ShiLtJ (NOD) inbred strain and the C57BL/6J (B6) inbred strain, the genome of which serves as the mouse reference genome. As we can expect the microarray probe sequences to be heavily biased towards the reference genome, looking for differential expression between these two strains may be problematic as differences in expression may be due to hybridization artifacts or true gene expression differences. First we create a NOD-specific database and then filter out those probes that are impacted by at least one variant in the NOD strain but not in the B6 strain and then carry out the differential expression analysis as per standard statistical workflows.

Workflow

Creating a variant database

The first step in the use of oligoMask is the creation of an SQLite database containing the probe annotation (including alignments to a given genome), variant annotation and the overlap,if any, between probes and variants. In a general sense, the probes sequences are first realigned to the given genome using the BSgenome and Biostrings Bioconductor packages (Pages 2013; Pages, P. Aboyoun, R. Gentleman, and S. DebRoy 2013) with the probe location and mappability of the probes being recorded. The locations of the uniquely mapping probes are then used to compute overlap with the variants in the specified VCF file using import and overlap functionality in the VariantAnnotation package. The locations of the variants in the genome, type of variant and the individual/population it was observed in is also recorded along with the overlap between probe alignments and variants. A convenience function for database creation is provided (create.sanger.mouse.vcf.db) for use with the case of variants derived from VCF files from the Sanger Mouse Genomes Project and variants of the Affymetrix Mouse Gene ST arrays. The oligoMask Vignette demonstrates in the section ‘Data preparation’ how the NOD variant database package (om.NOD.mogene.1.0.st) can be created using this function.

Additional array platforms and variant genotype file types can be supported through a modification of create.sanger.mouse.vcf.db as well as specifying the database schema as a "TableSchemaList" object as returned in the pre-defined SangerTableSchemaList function. The "TableSchemaList" S4 class serves a similar role as an object-relational mapping approach (ORM) in other languages and allows the R code to interact with a given database in a general way.

Masking procedure

The masking procedure first requires the installation of the oligo package along with the appropriate platform design databases that can be downloaded from Bioconductor. In our use case of Affymetrix Gene ST arrays, the CEL files are first read in using the read.celfiles function of oligo resulting in a "GeneFeatureSet" object. Next, the oligoMask database package is loaded, the parameters for the masking procedure are defined and finally the RMA summarization is performed as is shown below starting from the "GeneFeatureSet" object distributed with oligoMaskData.

library(oligoMask)
library(oligoMaskData)
library(om.NOD.mogene.1.0.st)
library(pd.mogene.1.0.st.v1)
library(limma)
data(oligoMaskData)

var.parms <- VariantMaskParams(om.NOD.mogene.1.0.st, geno.filter = FALSE, 
  rm.unmap = FALSE, rm.mult = FALSE)

sun.gfs.mask <- maskRMA(oligoMaskData, target = "core", apply.mask = TRUE, 
  mask.params = var.parms)

The result of these commands is a summarized "GeneFeatureSet" object with all probes overlapping variants from the NOD inbred strain of mouse removed prior to the background correction step of RMA. Users can control several aspects of the masking procedure through creation of a parameter object. For instance users can additionally remove probes that map to multiple locations as well as those that do not map at all to the reference genome by supplying TRUE to rm.multi and/or rm.unmap. Similarly, masking can be performed using only those variants that passed quality filters encoded in the VCF file by setting geno.filter to TRUE.

The maskRMA method carries out the RMA procedure and provides a similar interface to the rma method from oligo. In addition it requires specification of a "VariantMaskParams" object and whether the masking procedure should be performed before the background correction function or after background correction and normalization but before summarization by setting the mask.type argument to before.rma or before.summary respectively.

Assessment of masking procedure

As a demonstration of oligoMask next we perform a basic linear-model based differential expression analysis with the Sun et al. 2012 data comparing results with and without the NOD mask applied. Below we illustrate the basic approach using the masked data.

sun.exprs.mask <- exprs(sun.gfs.mask)
phen.dta <- data.frame(t(sapply(strsplit(colnames(sun.exprs.mask), "_"), c))[, 1:3])
names(phen.dta) <- c("tissue", "strain", "exposure")
use.mod <- model.matrix(~strain, data = phen.dta)
fit <- lmFit(sun.exprs.mask, use.mod)
fit <- eBayes(fit)
sun.exprs.mask.res <- decideTests(fit)

We then repeat this procedure but this time setting apply.mask = FALSE in maskRMA to provide the baseline standard RMA values for the comparison.

sun.gfs.unmask <- maskRMA(oligoMaskData, target = "core", apply.mask = FALSE, 
    mask.params = var.parms)
    
sun.exprs.unmask <- exprs(sun.gfs.unmask)
um.phen.dta <- 
    data.frame(t(sapply(strsplit(colnames(sun.exprs.unmask), "_"), c))[, 1:3])
names(um.phen.dta) <- c("tissue" , "strain" , "exposure")
um.mod <- model.matrix(~strain , data = um.phen.dta)
um.fit <- lmFit(sun.exprs.unmask , um.mod)
um.fit <- eBayes(um.fit)
sun.exprs.unmask.res <- decideTests(um.fit)

Finally we produce a summary venn diagram of the results.

comb.mat <- cbind(sun.exprs.unmask.res[ , "strainNOD", drop = FALSE], Masked = 0)
comb.mat[rownames(sun.exprs.mask.res), "Masked"] <- 
  sun.exprs.mask.res[, "strainNOD"]
colnames(comb.mat)[1] <- "UnMasked"
vennDiagram(vennCounts(comb.mat))

The Venn diagram provides a visual comparison between the masked and unmasked results, allowing the user to assess impact of variants on expression (Figure 1). An executable version of this code is provided in the accompanying vignette accessed by:

Stangle(system.file("doc/oligoMask.Rnw" , package = "oligoMask"))

Figure 1: Concordance of differentially expressed genes between the masked and unmasked versions of the analysis

Conclusion

oligoMask is a flexible R/SQLite framework for pre-processing and QA/QC of hybridization based expression data. Not only can it remove the effect of spurious probe intensities due to genetic variants but it can additionally correct for design artifacts (probes mapping to multiple places or not mapping at all). It utilizes SQLite and works in conjunction with the oligo Bioconductor package. Currently it supports the Affymetrix Mouse Gene ST array though support could be easily added for other array types or species as described above. Approaches for removal of probes based off of the variant and mapping information in the SQLite database are already implemented. More sophisticated algorithms could be built on top of the database to provide masking additionally based off of the expression data itself or inferred haplotypes. We are currently working on allowing the masking to be further controlled by enforcing positional constraints on the variants relative to the probes as well as enabling masking to be based solely on the mapping information. Source code is freely available to all users from https://github.com/dbottomly/oligoMask. Note that om.NOD.mogene.1.0.st and oligoMaskData are available as part of release 0.99.08 on github (https://github.com/dbottomly/oligoMask/releases).

Acknowledgements

We thank the two anonymous reviewers for their comments. This work was supported in part by grants from the National Institute of Allergy and Infectious Disease (1U19AI100625), the National Center for Advancing Translational Sciences (5UL1RR024140) and the National Cancer Institute (5P30CA069533-13).

oligoMask: A Framework for Assessing and Removing the Effect of Genetic Variants on Microarray Probes

Daniel Bottomly

Beth Wilmot

Shannon K. McWeeney

2014-05-27