Analysis

Load packages

library(MetaVolcanoR)
library(GEOquery)
library(ExpressionAtlas)

Analysis of microarray dataset

Dataset GSE13999 is composed of lesional and non-lesional skin samples taken from individuals with psoriasis. Gene expression measurements were made with GPL570 microarray platform. Processed data will be downloaded using GEOquery bioconductor package. Next, limma will be used to detect differentially expressed genes in affected samples compared with unaffected samples.

gse30999 <- GEOquery::getGEO('GSE30999', GSEMatrix = TRUE)
exp_set <- gse30999[['GSE30999_series_matrix.txt.gz']]
exp_matrix <- exprs(exp_set)
samp_annot <- pData(exp_set)

Differential expression results from datasets to be integrated must have the same gene identifiers. Therefore, probe identifiers from GPL570 platform will be converted to their corresponding Gene Symbol identifiers before analysis. Moreover, microarray platforms may contain more than one probe for the same gene. This ambiguity will be removed by selecting probes with the highest mean of expression.

# Some probes are assigned to more than one Gene Symbol identifier.
# To simplify matters, these probes will not be included.
# First: get probe annotation from eset object
annottab <- fData(gse30999[[1]])[,c('ID', 'Gene Symbol')]
colnames(annottab) <- c('Probe', 'Symbol')
# Filter out probes with more than one Gene Symbol 
annottab <- subset(annottab, !grepl('///', Symbol))
# Filter out unannotated probes
annottab <- subset(annottab, Symbol != '')
# Filter out NAs
annottab <- annottab[complete.cases(annottab),]
# Filter out probes absent in expression matrix
annotatb <- annottab[annottab$Probe %in% rownames(exp_matrix),]
genes2probes <- split(annottab$Probe, annottab$Symbol)

# Select probe with highest expression
# Compute mean of expression for each probe
mean_probes <- apply(exp_matrix, 1, mean)
selected_probes <- lapply(genes2probes, function(probes){
  expression_levels <- mean_probes[probes]                      
  sel_probe <- probes[which.max(expression_levels)]
  return(sel_probe)
})
selected_probes <- unlist(selected_probes)
# Subset expression matrix with selected probes and use Gene Symbols as identifiers
exp_mat <- exp_matrix[selected_probes,]
rownames(exp_mat) <- names(selected_probes)

Now, limma can be used to detect differentialy expressed genes. For further details on how to properly set up more complex experimental designs, users are referred to the limma user guide, available in limma Bioconductor page: http://bioconductor.org/packages/release/bioc/html/limma.html

experimentalFactor <- samp_annot$`biopsy type:ch1`
experimentalFactor <- factor(experimentalFactor, levels = c('non-lesion (NL)', 'psoriasis lesion (LS)'))
design <- model.matrix(~experimentalFactor)
colnames(design) <- c('normal', 'lesional_vs_normal')
fit <- lmFit(exp_mat, design)
fit <- eBayes(fit)
diffexp <- topTable(fit, coef = 'lesional_vs_normal', confint = T)
diffexp <- data.frame(Symbol = rownames(diffexp),
                 log2FC = diffexp$logFC,
                 pvalue = diffexp$P.Value,
                 CI.L = diffexp$CI.L,
                 CI.R = diffexp$CI.R)
# Show first few rows of results table
print(head(diffexp))

##         Symbol log2FC    pvalue  CI.L   CI.R
## 1      S100A12  9.799 5.745e-89 9.323 10.276
## 2    TMPRSS11D  7.645 1.156e-82 7.236  8.054
## 3 CTA-384D8.35  6.550 9.769e-80 6.183  6.916
## 4         OASL  7.946 3.710e-78 7.491  8.402
## 5      PLA2G4D  6.080 1.118e-77 5.729  6.432
## 6         VNN3  7.112 1.461e-77 6.700  7.524

Dataframe in “diffexp” variable can now be stored in a list. Results from other differential expression analyses from similarly designed experiments must be stored in the same list for meta-analysis.

list_of_results <- list(GSE30999 = diffexp)

Analysis of RNA-seq dataset

Dataset E-GEOD-54456 is composed of lesional skin samples taken from individuals with psoriasis and normal skin samples. Gene expression measurements were taken using Illumina Genome Analyzer sequencing platform. A table of counts and associated metadata will be downloaded using ExpressionAtlas bioconductor package. Next, limma will be used to detect differentially expressed genes in lesional samples compared with normal skin.

experiment <- ExpressionAtlas::getAtlasData('E-GEOD-54456')
count_table <- assays(experiment[['E-GEOD-54456']]$rnaseq)$counts
sannot <- colData(experiment[['E-GEOD-54456']]$rnaseq)

A DGEList object will be constructed before differential expression analysis using edgeR package.

# Create design matrix
expfactor <- sannot$disease
expfactor <- factor(expfactor, levels = c('normal', 'psoriasis'))
des <- model.matrix(~expfactor)
colnames(des) <- c('normal', 'psoriasis_vs_normal')
# Create DGEList and filter out genes with zero or low counts
library(edgeR)
dge <- DGEList(counts = count_table)
keep <- filterByExpr(dge, des)
dge <- dge[keep,,keep.lib.sizes=FALSE]

Dataset will be normalized before differential expression analysis.

dge <- calcNormFactors(dge)
logCPM <- cpm(dge, log = TRUE, prior.count = 3)

Now, limma will be used for differential expression analysis.

fit2 <- lmFit(logCPM, des)
fit2 <- eBayes(fit2, trend = TRUE)
diffexp2 <- topTable(fit2, coef = 'psoriasis_vs_normal', confint = T)

Gene identificators will be changed from Ensembl IDs to Gene Symbol IDs to allow for comparisons between different datasets. Bioconductor biomaRt package provides functions to easily convert between identifiers.

library(biomaRt)
ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
ensembl2gs <- getBM(attributes = c('ensembl_gene_id', 'hgnc_symbol'),
                    filters = 'ensembl_gene_id',
                    values = rownames(diffexp2),
                    mart = ensembl)
geneSymbols <- ensembl2gs$hgnc_symbol[match(rownames(diffexp2), ensembl2gs$ensembl_gene_id)]
diffexp2 <- data.frame(Symbol = geneSymbols,
                       log2FC = diffexp2$logFC,
                       pvalue = diffexp2$P.Value,
                       CI.L = diffexp2$CI.L,
                       CI.L = diffexp2$CI.R)
# Show first few rows of results table
print(head(diffexp2))

##    Symbol log2FC     pvalue   CI.L CI.L.1
## 1    VNN3  5.671 9.212e-116  5.469  5.873
## 2 S100A7A 11.624 2.784e-112 11.189 12.058
## 3     PI3  9.670 2.781e-109  9.294 10.046
## 4 AKR1B10  6.279 1.205e-108  6.032  6.525
## 5  DEFB4A  9.821 8.539e-108  9.432 10.211
## 6  SPRR2A  9.842 1.116e-105  9.440 10.245

Finally, a second table with results can be added to a list for meta-analysis

list_of_results[['E-GEOD-54456']] <- diffexp2

Final considerations

Differential expression results tables must be stored in a list object before integration. Gene identifiers from different studies must be comparable. Conversion of identifiers can be executed with functions from biomaRt Bioconductor package. Users are referred to biomaRt user guide for further details: https://bioconductor.org/packages/release/bioc/html/biomaRt.html

MetaVolcanoR: Prepare differential expression results for meta-analysis

June 27, 2019

Introduction

Analysis

Load packages

Analysis of microarray dataset

Analysis of RNA-seq dataset

Final considerations