Introduction

MetaVolcanoR requires a named list of objects containing results from analyses of differential expression. Several approaches can be adopted in order to obtain these results. Here, limma bioconductor package will be used for differential expression analyses of microarray and RNA-seq datasets. It is also recommended to include a step in your analysis to compute variances or 95% confidence intervals for gene fold-changes between phenotypes to be contrasted.

Different open repositories/resources for gene expression data provide diverse interfaces for data acquisition. In this example, pre-processed datasets will be downloaded using GEOquery and ExpressionAtlas bioconductor packages.

Analysis

Load packages

library(MetaVolcanoR)
library(GEOquery)
library(ExpressionAtlas)

Analysis of microarray dataset

Dataset GSE13999 is composed of lesional and non-lesional skin samples taken from individuals with psoriasis. Gene expression measurements were made with GPL570 microarray platform. Processed data will be downloaded using GEOquery bioconductor package. Next, limma will be used to detect differentially expressed genes in affected samples compared with unaffected samples.

gse30999 <- GEOquery::getGEO('GSE30999', GSEMatrix = TRUE)
exp_set <- gse30999[['GSE30999_series_matrix.txt.gz']]
exp_matrix <- exprs(exp_set)
samp_annot <- pData(exp_set)

Differential expression results from datasets to be integrated must have the same gene identifiers. Therefore, probe identifiers from GPL570 platform will be converted to their corresponding Gene Symbol identifiers before analysis. Moreover, microarray platforms may contain more than one probe for the same gene. This ambiguity will be removed by selecting probes with the highest mean of expression.

# Some probes are assigned to more than one Gene Symbol identifier.
# To simplify matters, these probes will not be included.
# First: get probe annotation from eset object
annottab <- fData(gse30999[[1]])[,c('ID', 'Gene Symbol')]
colnames(annottab) <- c('Probe', 'Symbol')
# Filter out probes with more than one Gene Symbol 
annottab <- subset(annottab, !grepl('///', Symbol))
# Filter out unannotated probes
annottab <- subset(annottab, Symbol != '')
# Filter out NAs
annottab <- annottab[complete.cases(annottab),]
# Filter out probes absent in expression matrix
annotatb <- annottab[annottab$Probe %in% rownames(exp_matrix),]
genes2probes <- split(annottab$Probe, annottab$Symbol)

# Select probe with highest expression
# Compute mean of expression for each probe
mean_probes <- apply(exp_matrix, 1, mean)
selected_probes <- lapply(genes2probes, function(probes){
  expression_levels <- mean_probes[probes]                      
  sel_probe <- probes[which.max(expression_levels)]
  return(sel_probe)
})
selected_probes <- unlist(selected_probes)
# Subset expression matrix with selected probes and use Gene Symbols as identifiers
exp_mat <- exp_matrix[selected_probes,]
rownames(exp_mat) <- names(selected_probes)

Now, limma can be used to detect differentialy expressed genes. For further details on how to properly set up more complex experimental designs, users are referred to the limma user guide, available in limma Bioconductor page: http://bioconductor.org/packages/release/bioc/html/limma.html

experimentalFactor <- samp_annot$`biopsy type:ch1`
experimentalFactor <- factor(experimentalFactor, levels = c('non-lesion (NL)', 'psoriasis lesion (LS)'))
design <- model.matrix(~experimentalFactor)
colnames(design) <- c('normal', 'lesional_vs_normal')
fit <- lmFit(exp_mat, design)
fit <- eBayes(fit)
diffexp <- topTable(fit, coef = 'lesional_vs_normal', confint = T)
diffexp <- data.frame(Symbol = rownames(diffexp),
                 log2FC = diffexp$logFC,
                 pvalue = diffexp$P.Value,
                 CI.L = diffexp$CI.L,
                 CI.R = diffexp$CI.R)
# Show first few rows of results table
print(head(diffexp))
##         Symbol log2FC    pvalue  CI.L   CI.R
## 1      S100A12  9.799 5.745e-89 9.323 10.276
## 2    TMPRSS11D  7.645 1.156e-82 7.236  8.054
## 3 CTA-384D8.35  6.550 9.769e-80 6.183  6.916
## 4         OASL  7.946 3.710e-78 7.491  8.402
## 5      PLA2G4D  6.080 1.118e-77 5.729  6.432
## 6         VNN3  7.112 1.461e-77 6.700  7.524

Dataframe in “diffexp” variable can now be stored in a list. Results from other differential expression analyses from similarly designed experiments must be stored in the same list for meta-analysis.

list_of_results <- list(GSE30999 = diffexp)

Analysis of RNA-seq dataset

Dataset E-GEOD-54456 is composed of lesional skin samples taken from individuals with psoriasis and normal skin samples. Gene expression measurements were taken using Illumina Genome Analyzer sequencing platform. A table of counts and associated metadata will be downloaded using ExpressionAtlas bioconductor package. Next, limma will be used to detect differentially expressed genes in lesional samples compared with normal skin.

experiment <- ExpressionAtlas::getAtlasData('E-GEOD-54456')
count_table <- assays(experiment[['E-GEOD-54456']]$rnaseq)$counts
sannot <- colData(experiment[['E-GEOD-54456']]$rnaseq) 

A DGEList object will be constructed before differential expression analysis using edgeR package.

# Create design matrix
expfactor <- sannot$disease
expfactor <- factor(expfactor, levels = c('normal', 'psoriasis'))
des <- model.matrix(~expfactor)
colnames(des) <- c('normal', 'psoriasis_vs_normal')
# Create DGEList and filter out genes with zero or low counts
library(edgeR)
dge <- DGEList(counts = count_table)
keep <- filterByExpr(dge, des)
dge <- dge[keep,,keep.lib.sizes=FALSE]

Dataset will be normalized before differential expression analysis.

dge <- calcNormFactors(dge)
logCPM <- cpm(dge, log = TRUE, prior.count = 3)

Now, limma will be used for differential expression analysis.

fit2 <- lmFit(logCPM, des)
fit2 <- eBayes(fit2, trend = TRUE)
diffexp2 <- topTable(fit2, coef = 'psoriasis_vs_normal', confint = T)

Gene identificators will be changed from Ensembl IDs to Gene Symbol IDs to allow for comparisons between different datasets. Bioconductor biomaRt package provides functions to easily convert between identifiers.

library(biomaRt)
ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
ensembl2gs <- getBM(attributes = c('ensembl_gene_id', 'hgnc_symbol'),
                    filters = 'ensembl_gene_id',
                    values = rownames(diffexp2),
                    mart = ensembl)
geneSymbols <- ensembl2gs$hgnc_symbol[match(rownames(diffexp2), ensembl2gs$ensembl_gene_id)]
diffexp2 <- data.frame(Symbol = geneSymbols,
                       log2FC = diffexp2$logFC,
                       pvalue = diffexp2$P.Value,
                       CI.L = diffexp2$CI.L,
                       CI.L = diffexp2$CI.R)
# Show first few rows of results table
print(head(diffexp2))
##    Symbol log2FC     pvalue   CI.L CI.L.1
## 1    VNN3  5.671 9.212e-116  5.469  5.873
## 2 S100A7A 11.624 2.784e-112 11.189 12.058
## 3     PI3  9.670 2.781e-109  9.294 10.046
## 4 AKR1B10  6.279 1.205e-108  6.032  6.525
## 5  DEFB4A  9.821 8.539e-108  9.432 10.211
## 6  SPRR2A  9.842 1.116e-105  9.440 10.245

Finally, a second table with results can be added to a list for meta-analysis

list_of_results[['E-GEOD-54456']] <- diffexp2

Final considerations

Differential expression results tables must be stored in a list object before integration. Gene identifiers from different studies must be comparable. Conversion of identifiers can be executed with functions from biomaRt Bioconductor package. Users are referred to biomaRt user guide for further details: https://bioconductor.org/packages/release/bioc/html/biomaRt.html