La primera parte del presente tutorial construye gráficas para la interpretación de los resultados de la publicación del capítulo limma del curso Bioconductor for Genomic Data Science de Kasper Daniel Hansen.
En la segunda parte se presenta la (i) limpieza de datos reportados, (ii) creación de un ExpressionSet y (iii) reproducción de resultados del artículo publicado por Torres et al. 2015, en el cual aplican la tecnología del microarreglo de proteínas.
Objetivo
- Complementar el flujo de trabajo de
Bioconductor
para el análisis de microarreglos.
Nota: Todos los outputs deben verse como líneas corridas. Disminuir el zoom de ser necesario.
Dependencies
library(tidyverse)
library(ggrepel)
library(Biobase)
library(limma)
library(NMF)
DNA microarray: Leukemia
Contexto:
- This is data on different types of leukemia. The code
NoL
means not leukemia, ie. normal controls.
Problema
- Which genes are differentially expressed between the
ALL
type and normal controls?
ExpressionSet
El presente ejemplo nos brinda un ExpressionSet()
, el cual es un contenedor de datos que integra:
- Data de expresión, accesible con la función
exprs()
, y
- Data de las muestras o covariables, accesible con la función
pData()
.
leukemiasEset <- leukeset
leukemiasEset
ExpressionSet (storageMode: lockedEnvironment)
assayData: 20172 features, 60 samples
element names: exprs, se.exprs
protocolData
sampleNames: GSM330151.CEL GSM330153.CEL ... GSM331677.CEL (60 total)
varLabels: ScanDate
varMetadata: labelDescription
phenoData
sampleNames: GSM330151.CEL GSM330153.CEL ... GSM331677.CEL (60 total)
varLabels: Project Tissue ... Subtype (5 total)
varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation: genemapperhgu133plus2
Differential expression
Define the control group
- First we subset the data and clean it up.
- Then, define the control group
eset <- leukemiasEset[, leukemiasEset$LeukemiaType %in% c("ALL", "NoL")]
eset$LeukemiaType <- factor(eset$LeukemiaType)
table(eset$LeukemiaType) # debe ser: control (fct referencia) vs caso
ALL NoL
12 12
eset$LeukemiaType <- relevel(eset$LeukemiaType, ref = "NoL")
table(eset$LeukemiaType) # debe ser: control (fct referencia) vs caso
NoL ALL
12 12
Matrix + Linear model + eBayes
limma
ayuda a resolver dos principales problemas:
- presencia de varianzas artificialmente altas o bajas producto del bajo número de réplicas entre arreglos o muestras (Baldi and Long 2001); y un
- presencia de falsos positivos producto al amplio número de hipótesis puestas a prueba en forma simultánea (Kayala and Baldi 2012).
- Un test-t empírico de Bayes permite reducir (o moderar) las varianzas de todas las lecturas.
cyber-t
actualiza la varianza observada de cada gen asumiendo como la probabilidad previa (prior) que la varianza de genes vecinos con medias de expresión similar (Baldi and Long 2001).
limma
optimizó y generalizó esta estrategia a distintos diseños experimentales. Asume una varianza general como prior para la actualización de todas las varianzas observadas (Smyth and others 2004).
Nota: A diferencia de un enfoque bayesiano puro, donde se asume una probabilidad previa para la actualización de un evento (condicionado) dado un evento condicionante, en un enfoque empírico esta información es obtenida de un pool de elementos observados en el mismo experimento.
- Los métodos de corrección o ajuste de valores p permiten que el controlar la razón de falsos positivos con respecto a todos los descubrimientos o razón de falsos descubrimientos (FDR) (Brazma et al. 2001).
- El método Benjamini-Hochberg determina un valor crítico de valores p dependiente del total de hipótesis puestas a prueba y el valor de FDR deseado (Benjamini and Hochberg 1995).
Ambas propiedades son revisadas aquí.
design <- model.matrix(~ eset$LeukemiaType)
fit <- lmFit(eset, design)
eb <- eBayes(fit)
topTable(eb) %>% rownames_to_column()
Removing intercept from test coefficients
The output from topTable()
includes
logFC
: the log fold-change between cases and controls.
t
: the t-statistic used to assess differential expression.
P.Value
: the p-value for differential expression; this value is not adjusted for multiple testing.
adj.P.Val
: the p-value adjusted for multiple testing. Different adjustment methods are available, the default is Benjamini-Horchberg.
# tidy toptable
td <- topTable(eb, coef = 2, # default: slope
sort.by =NULL, resort.by =NULL,
#genelist = NULL, # no fit$gene
number = Inf, # show all the hipothesis
confint=0.95) %>% rownames_to_column() %>% as.tbl() %>%
dplyr::rename(Gene.ID=rowname)
Plots
p-value histogram
test <- topTable(eb, coef = 2,
sort.by ="P",
#sort.by ="AveExpr",
genelist = fit$genes$Gene.ID, number = Inf)
library(ggplot2)
ax <- qplot(test$P.Value, binwidth=.05)
bx <- qplot(test$adj.P.Val, binwidth=.05)
Rmisc::multiplot(ax,bx,cols = 2)
volcano plot
# (2) Volcano plot: Log Fold Change x -log10(p.value)
td %>%
mutate(significance=if_else(P.Value<1e-13,
"p.adj < 1e-13","diff rve")) %>%
ggplot(aes(logFC,-log10(P.Value))) +
geom_point(aes(colour=significance)) +
scale_color_manual(values=c("black","red")) +
ggrepel::geom_text_repel(data = filter(td,
P.Value<1e-14 | P.Value<1e-13 & logFC>0),
aes(label=Gene.ID)) +
labs(title="Leukemia DNA microarray",
subtitle="Caso vs Control")
heatmap + hierachical clustering
La inferencia estadística es comúnmente complementada con técnicas de clasificación.
La clasificación supervisada o predicción de clase, define a priori el número de categorías, el conjunto de entrenamiento y el de prueba. Ejemplos: support vector machines (SVM), k-nearest neighbors (k-NN) y validación por leave-one-out cross-validation (LOOCV).
La clasificación no-supervisada o descubrimiento de clase, agrupa objetos en base a métricas de similaridad de los objetos a lo largo de las muestras. Requiere definir 2 tipos de algoritmos:
de distancia: e.g., Euclidiana, en la que se prioriza la magnitud sobre la dispersión de las señales, y la correlación de Pearson, en la que se prioriza la dispersión sobre la magnitud.
de agrupamiento: e.g., agrupamiento jerárquico o hierarchical clustering, en la que los racimos o clusters se forman iterativamente al calcular la distancia entre elementos y racimos en formación. Más detalles, aquí. Otros son el k-mean clustering, self-organizing maps (SOM) y principal component analysis (PCA).
Las principales desventajas de la técnica están en el requerimiento de decisiones entre varios algoritmos sin consenso y su escasa reproducibilidad entre experimentos. (Allison et al. 2006)
#eset <- eset#leukemiasEset
#exprs(eset) %>% dim()
x <- td %>%
arrange(desc(t)) %>%
filter(B>0 & adj.P.Val<1e-10) %>%
select(Gene.ID) %>% as.matrix()
#x %>% dim()
#%>% dim()
aheatmap(exprs(eset)[x,],
Rowv = TRUE, Colv = TRUE,
annCol = pData(eset),
layout = "_")
Protein microarray: Malaria
Contexto:
- Pacientes con parasitemia circulante de P. falciparum y auscencia de síntomas son considerados como individuos con inmunidad clínica ante la malaria.
- Hipótesis: La respuesta de anticuerpos ante la infección en el grupo de pacientes asintomáticos reconoce un subgrupo de proteínas de P. falciparum, posiblemente asociadas al desarrollo de la inmunidad clínica.
Revisar más información sobre el experimento aquí.
Problema
- ¿Cómo limpiar la data dsponible y generar un
ExpressionSet()
desde cero?
- ¿Qué proteínas están diferencialmente reconocidas en el grupo de asintomáticos con respecto a los sintomáticos?
Tidy up data
raw<-protmicro
raw %>% View()
phenotype data
- Objetivo:
- Extraer la data fenotípica,
- Transponer su estructura,
- Asignar el nombre de muestra por columna,
- Crear un nuevo objeto capaz de ser leído por el
ExpressionSet
.
#samples <- raw[1:2,] %>% colnames()
#samples %>% as.matrix() %>% t() %>% class()
#https://stackoverflow.com/questions/34004008/transposing-in-dplyr
pheno_data <- raw[1:2,] %>%
slice(1) %>%
select(-1,-2) %>%
mutate(group=1) %>%
gather(sample_name,phenotype) %>%
filter(phenotype!="1")
pheno_data %>% arrange(sample_name)
pheno_data %>% count(phenotype)
#https://www.rdocumentation.org/packages/tibble/versions/1.3.3/topics/rownames
pheno_data_row <- pheno_data %>%
arrange(sample_name) %>%
as.data.frame() %>%
column_to_rownames(var="sample_name")
pheno_d <- new("AnnotatedDataFrame", data=pheno_data_row)
expression data
- Objetivo:
- Transformar el
data.frame
a una matriz,
- Separar la data de expresión de la data de normalización.
# limpieza inicial
exprs_data <- raw[77:nrow(raw),] %>%
rename(gene_name="X__1",
gene_ID="X__2") %>%
select(-gene_name) %>%
mutate(group=seq(n())) %>%
unite(gene_ID.n,gene_ID,group,sep = ".") %>%
mutate(gene_ID.n=stringr::str_replace(gene_ID.n,"\\s",""))
#exprs_data %>% dim() #33668
exprs_ctrl <- exprs_data %>% slice(1:31) #%>% dim() # 31 "no DNA" per sample
exprs_prot <- exprs_data %>% slice(32:n()) #%>% dim() # 33637 "no DNA" per sample
# PROTEINS
exprs_prot_m <- exprs_prot %>%
gather(sample_name, expression, -gene_ID.n) %>%
mutate(expression=as.numeric(expression)) %>%
reshape2::acast(gene_ID.n ~ sample_name,
value.var = "expression") #%>% class()
# dimensión
dim(exprs_prot_m)
[1] 855 38
# seis lecturas por proteína de las 8 primeras muestras
head(exprs_prot_m[,1:8])
PA06 PA07 PA11 PA12 PA13 PA14 PA15 PA16
MAL13P1.103.289 17058 9181 9841 20898 19586 13419 10469 29562
MAL13P1.107-s1.554 7393 4539 3886 18981 5236 9570 10345 7392
MAL13P1.107-s2.464 6822 3942 3336 32468 5201 10846 13803 9165
MAL13P1.114e3s1.339 14834 7712 9367 18653 18574 27646 26424 30289
MAL13P1.121_1o4.423 11942 4889 7618 16364 12059 9713 29187 8391
MAL13P1.124.557 8064 4424 10852 13348 4781 12231 11681 10184
# CONTROL
exprs_ctrl_m <- exprs_ctrl %>%
summarise_all(median) %>% # MEDIAN to NORMALIZE against noDNA controls
gather(sample_name, expression, -gene_ID.n) %>%
mutate(expression=as.numeric(expression)) %>%
reshape2::acast(gene_ID.n ~ sample_name,
value.var = "expression") #%>% dim() #%>% class()
# dimensión
dim(exprs_ctrl_m)
[1] 1 38
# controles noDNA de las 8 primeras muestras
exprs_ctrl_m[,1:8]
PA06 PA07 PA11 PA12 PA13 PA14 PA15 PA16
8888 5030 4080 7784 4581 7538 7850 5827
mat <- NULL # crear objeto vacío
# loop para generar una matriz de las dimensiones de `exprs_prot_m`
for(i in 1:nrow(exprs_prot_m)){
mat <- rbind(mat,exprs_ctrl_m)
}
# dimensiones de la matriz con la mediana de ctrls `noDNA` creada
dim(mat)
[1] 855 38
# matriz con la mediana de ctrls `noDNA` para las 8 primeras muestras
head(mat[,1:8]) #%>% class()
PA06 PA07 PA11 PA12 PA13 PA14 PA15 PA16
noDNA.23 8888 5030 4080 7784 4581 7538 7850 5827
noDNA.23 8888 5030 4080 7784 4581 7538 7850 5827
noDNA.23 8888 5030 4080 7784 4581 7538 7850 5827
noDNA.23 8888 5030 4080 7784 4581 7538 7850 5827
noDNA.23 8888 5030 4080 7784 4581 7538 7850 5827
noDNA.23 8888 5030 4080 7784 4581 7538 7850 5827
gene names
# generate a list (df) of gene names and gene ID's
ln <- raw[108:nrow(raw),1:2] %>%
rename(gene_name="X__1",
gene_ID="X__2") %>%
mutate(num=seq(from=32, to=31+n())) %>%
unite(gene_ID.n,gene_ID,num,sep=".") %>%
full_join(exprs_prot,by="gene_ID.n") %>%
select(1:2) %>%
rename(Gene.ID=gene_ID.n)
Create an ExpressionSet
eset <- ExpressionSet(assayData = exprs_norm,
phenoData = pheno_d)
eset
ExpressionSet (storageMode: lockedEnvironment)
assayData: 855 features, 38 samples
element names: exprs
protocolData: none
phenoData
sampleNames: PA06 PA07 ... TFS03 (38 total)
varLabels: phenotype
varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation:
Differential expression
Define the control group
- El problem defin al grupo de sintomáticos como el grupo control.
eset$phenotype <- factor(eset$phenotype)
table(eset$phenotype) # debe ser: control (fct referencia) vs caso
Asymptomatic Symptomatic
14 24
eset$phenotype <- relevel(eset$phenotype, ref = "Symptomatic")
table(eset$phenotype) # debe ser: control (fct referencia) vs caso
Symptomatic Asymptomatic
24 14
Matrix + Linear model + eBayes
design <- model.matrix(~ eset$phenotype)
fit <- lmFit(eset, design)
eb <- eBayes(fit)
topTable(eb) %>% rownames_to_column()
Removing intercept from test coefficients
# tidy toptable
td <- topTable(eb, coef = 2, # default: slope
sort.by =NULL, resort.by =NULL,
#genelist = NULL, # no fit$gene
number = Inf, # show all the hipothesis
confint=0.95) %>% rownames_to_column() %>% as.tbl() %>%
dplyr::rename(Gene.ID=rowname)
Problemas específicos
- Decisión estadística cambia:
- microarreglos de DNA -> variación w.r.t. housekeeping gene
- microarreglo de proteínas -> veces sobre control (blanco) -> intensidad final es relevante
Plots
p-value histogram
volcano plot
- OBJETIVO:
- Priorizar el Tamaño del Efecto por la significancia estadístistica.
# (2) Volcano plot: Log Fold Change x -log10(p.value)
tv <- td %>%
inner_join(biobroom::tidy.ExpressionSet(eset,addPheno = TRUE) %>% #count(gene)
group_by(gene,phenotype) %>% summarise(mean_group=mean(value)) %>%
ungroup() %>%
spread(phenotype,mean_group) %>%
#arrange(desc(Asymptomatic)) %>%
rename(Gene.ID=gene),
by="Gene.ID")
tv %>%
mutate(significance=if_else(adj.P.Val<0.05,
"adj.P.Val<0.05","diff rve")) %>%
ggplot(aes(logFC,-log10(P.Value))) +
#geom_point(aes(colour=significance)) + scale_color_manual(values=c("red","black")) +
geom_point(aes(colour=Asymptomatic)) + viridis::scale_color_viridis() +
ggrepel::geom_text_repel(data = tv %>%
filter(adj.P.Val<0.05 & logFC>1.5 & Asymptomatic>1)# %>%
#arrange(desc(Asymptomatic)) %>%
#top_n(10,Asymptomatic)
,
aes(label=Gene.ID)) +
labs(title="Malaria Protein microarray",
subtitle="Caso vs Control")+
geom_hline(aes(yintercept=-log10(0.01)), lty=3)+
geom_vline(aes(xintercept=1), lty=3)
diff reactive
boxplot
list
tv %>%
filter(adj.P.Val<0.05 & logFC>1 & Asymptomatic>1) %>%
arrange(desc(Asymptomatic)) %>%
select(Gene.ID) %>%
inner_join(ln, by="Gene.ID") %>%
inner_join(biobroom::tidy.ExpressionSet(eset,addPheno = TRUE) %>%
group_by(gene,phenotype) %>% summarise(mean_group=mean(value)) %>%
ungroup() %>%
spread(phenotype,mean_group) %>%
rename(Gene.ID=gene),
by="Gene.ID") %>%
inner_join(td %>%
arrange(desc(t)) %>%
mutate(t.order=seq(n())) %>%
select(Gene.ID,AveExpr,P.Value,adj.P.Val,t.order),
by="Gene.ID") #%>% slice(1:10)
top reactive
- El grupo de antígenos con mayor reactividad permitirá darle una contexto biológico a los valores con significancia estadística.
boxplots
- show the intensity distributions of selected features in both groups.
both
heatmap
eset <- ExpressionSet(assayData = exprs_norm,
phenoData = pheno_d)
s <- tv %>%
filter(adj.P.Val<0.05 & logFC>1 & Asymptomatic>1) %>%
arrange(desc(Asymptomatic)) %>%
select(Gene.ID) %>% as.matrix()
x <- tv %>%
#arrange(desc(t)) %>%
arrange(desc(AveExpr)) %>%
#filter(B>0 #& adj.P.Val<1e-4
# ) %>%
top_n(10,AveExpr) %>%
arrange(desc(Asymptomatic)) %>% # If diff order at boxplot is achieved, the deactivate this line.
select(Gene.ID) %>% as.matrix()
pData(eset) <- biobroom::tidy.ExpressionSet(eset,addPheno = TRUE) %>%
group_by(sample) %>% summarise(sample_median=median(value)) %>%
full_join(pData(eset) %>%
rownames_to_column() %>%
rename(sample=rowname),
by="sample") %>%
arrange(phenotype,sample_median) %>%
as.data.frame() %>%
column_to_rownames(var="sample")
y <- pData(eset) %>% rownames_to_column() %>%
select(rowname) %>% as.matrix() %>% as.character()
eset <- eset[c(x,s),y]
aheatmap(exprs(eset),
Rowv = NA, Colv = NA,
annCol = pData(eset),
layout = "_")
boxplot
#repro
biobroom::tidy.ExpressionSet(eset,addPheno = TRUE) %>% #count(gene)
full_join(biobroom::tidy.ExpressionSet(eset,addPheno = TRUE) %>% #count(gene)
group_by(gene,phenotype) %>% summarise(mean_group=mean(value)) %>%
ungroup() %>%
spread(phenotype,mean_group) %>%
arrange(desc(Asymptomatic)) %>%
select(-Symptomatic),
by="gene") %>%
full_join(x %>% as.data.frame() %>%
mutate(group="Top 10") %>%
rename(gene="Gene.ID") %>%
mutate(gene=as.character(gene)),
by="gene") %>%
full_join(s %>% as.data.frame() %>%
mutate(group="Differentially*") %>%
rename(gene="Gene.ID") %>%
mutate(gene=as.character(gene)),
by="gene") %>%
unite(group, group.x, group.y) %>%
mutate(group=stringr::str_replace(group,"NA_(.)", "\\1")) %>%
mutate(group=stringr::str_replace(group,"(.)_NA", "\\1")) %>%
mutate(group=forcats::fct_relevel(group,"Top 10")) %>%
ggplot(aes(reorder(gene,Asymptomatic),value,fill=phenotype))+
geom_boxplot()+
#geom_point() +
facet_grid(group~.,scales = "free", space = "free") +#
ylab("log2-transformed normalized MFI")+
xlab("antigens")+
labs(title="Highly and differentially reactive antigens",
subtitle="All sorted w.r.t. Asymptomatic mean Ab reactivity",
caption="*filtered by adj.P.Val<0.05 and logFC>1")+
coord_flip() # ORDENAR eje de proteínas!!!!
diff reactive (paper)
- Los autores reportan todos los antígenos con valor p ajustado menor a 0.05
- Dada la variabilidad entre las muestras por grupo, un algoritmo de clasificación no es la estrategia más adecuada.
heatmap
aheatmap(exprs(eset)[x,],
Rowv = NA, Colv = NA,
annCol = pData(eset),
layout = "_")
boxplots
- Una mejor estrategia de visualización son los diagramas de cajas, los cuales grafican la distribución total de las observaciones.
#repro
biobroom::tidy.ExpressionSet(eset,addPheno = TRUE) %>% #str()
mutate(order=seq(from=n(),to=1)) %>%
ggplot(aes(reorder(gene,order,order=TRUE),value,fill=phenotype))+
geom_boxplot()+
#geom_point() +
ylab("log2-transformed normalized MFI")+
xlab("antigens")+
labs(title="Differentially reactive antigens",
subtitle="filtered by adj.P.Value<0.05 and sorted w.r.t. Asymptomatic median Ab reactivity")+
coord_flip() # ORDENAR eje de proteínas!!!!
list
x %>% as.data.frame() %>%
mutate(Gene.ID=as.character(Gene.ID)) %>%
inner_join(ln, by="Gene.ID") %>%
inner_join(biobroom::tidy.ExpressionSet(eset,addPheno = TRUE) %>%
group_by(gene,phenotype) %>% summarise(mean_group=mean(value)) %>%
ungroup() %>%
spread(phenotype,mean_group) %>%
rename(Gene.ID=gene),
by="Gene.ID") %>%
inner_join(td %>%
arrange(desc(t)) %>%
mutate(t.order=seq(n())) %>%
select(Gene.ID,P.Value,adj.P.Val,t.order),
by="Gene.ID") #%>% slice(1:8)
B-stat >0
- Here, all features are filtered by positive log-odds that the gene is differentially expressed (B=0 ~ 50% probability a gene is differentially expressed).
- B-statistic is adjusted for multiple testing by assuming that 1% of the genes (modifiable parameter) are expected to be differentially expressed.
- Depends on normality (same as p-values), and independence between genes (same as BH control of FDR), but require in addition a prior guess for the proportion of differentially expressed probes.
- p-values may be preferred to the B-statistics because they do not require this prior knowledge.
sorted by significance
boxplots
- show the intensity distributions of selected features in both groups.
list
sorted by average
boxplots
- show the intensity distributions of selected features in both groups.
list
Conclusión: aplica el tidyverse
antes y después de Bioconductor
Al igual que en los casos presentados en los tutoriales anteriores, el dialecto del tidyverse
facilitó tanto el ingreso de data para la ejecución de modelos en Bioconductor
, como para “limpiar” sus resultados y generar en forma intuitiva visualizaciones que ayuden a su interpretación.
El análisis presentado sigue el flujo de trabajo propuesto por David Robinson:
Computer environment
devtools::session_info()
Session info ----------------------------------------------------------------------------
setting value
version R version 3.4.3 (2017-11-30)
system x86_64, linux-gnu
ui X11
language en_US
collate en_US.UTF-8
tz America/Lima
date 2018-01-07
Packages --------------------------------------------------------------------------------
package * version date source
assertthat 0.2.0 2017-04-11 CRAN (R 3.4.0)
backports 1.1.0 2017-05-22 CRAN (R 3.4.1)
base * 3.4.3 2017-12-01 local
base64enc 0.1-3 2015-07-28 CRAN (R 3.4.0)
bindr 0.1 2016-11-13 cran (@0.1)
bindrcpp * 0.2 2017-06-17 CRAN (R 3.4.1)
Biobase * 2.32.0 2017-08-02 Bioconductor
biobroom 1.4.2 2017-08-02 Bioconductor
BiocGenerics * 0.18.0 2016-06-23 Bioconductor
broom 0.4.2 2017-02-13 CRAN (R 3.4.0)
cellranger 1.1.0 2016-07-27 CRAN (R 3.4.0)
class 7.3-14 2015-08-30 CRAN (R 3.4.0)
cluster * 2.0.6 2017-03-16 CRAN (R 3.4.0)
codetools 0.2-15 2016-10-05 CRAN (R 3.3.1)
colorspace 1.3-2 2016-12-14 CRAN (R 3.4.0)
compiler 3.4.3 2017-12-01 local
datasets * 3.4.3 2017-12-01 local
dendextend 1.5.2 2017-03-28 CRAN (R 3.4.0)
DEoptimR 1.0-8 2016-11-19 CRAN (R 3.4.0)
devtools 1.13.2 2017-06-02 CRAN (R 3.4.1)
digest 0.6.12 2017-01-27 CRAN (R 3.4.0)
diptest 0.75-7 2016-12-05 CRAN (R 3.4.0)
doParallel 1.0.10 2015-10-14 CRAN (R 3.4.0)
dplyr * 0.7.2 2017-07-20 CRAN (R 3.4.1)
evaluate 0.10.1 2017-06-24 CRAN (R 3.4.1)
flexmix 2.3-14 2017-04-28 CRAN (R 3.4.0)
forcats 0.2.0 2017-01-23 CRAN (R 3.4.0)
foreach 1.4.3 2015-10-13 CRAN (R 3.4.0)
foreign 0.8-69 2017-06-21 CRAN (R 3.4.1)
fpc 2.1-10 2015-08-14 CRAN (R 3.4.0)
ggplot2 * 2.2.1.9000 2017-10-24 Github (tidyverse/ggplot2@ffb40f3)
ggrepel * 0.6.9 2017-05-05 Github (slowkow/ggrepel@d21b468)
glue 1.1.1 2017-06-21 CRAN (R 3.4.1)
graphics * 3.4.3 2017-12-01 local
grDevices * 3.4.3 2017-12-01 local
grid 3.4.3 2017-12-01 local
gridBase 0.4-7 2014-02-24 CRAN (R 3.4.0)
gridExtra 2.2.1 2016-02-29 CRAN (R 3.4.0)
gtable 0.2.0 2016-02-26 CRAN (R 3.4.0)
haven 1.1.0 2017-07-09 CRAN (R 3.4.1)
hms 0.3 2016-11-22 CRAN (R 3.4.0)
htmltools 0.3.6 2017-04-28 CRAN (R 3.4.0)
httr 1.2.1 2016-07-03 CRAN (R 3.4.0)
iterators 1.0.8 2015-10-13 CRAN (R 3.4.0)
jsonlite 1.5 2017-06-01 cran (@1.5)
kernlab 0.9-25 2016-10-03 CRAN (R 3.4.0)
knitr 1.16 2017-05-18 cran (@1.16)
labeling 0.3 2014-08-23 CRAN (R 3.4.0)
lattice 0.20-35 2017-03-25 CRAN (R 3.3.3)
lazyeval 0.2.0 2016-06-12 CRAN (R 3.4.0)
limma * 3.28.21 2017-08-02 Bioconductor
lubridate 1.6.0 2016-09-13 CRAN (R 3.4.0)
magrittr 1.5 2014-11-22 CRAN (R 3.4.0)
MASS 7.3-47 2017-04-21 CRAN (R 3.4.0)
mclust 5.3 2017-05-21 CRAN (R 3.4.1)
memoise 1.1.0 2017-04-21 CRAN (R 3.4.0)
methods * 3.4.3 2017-12-01 local
mnormt 1.5-5 2016-10-15 CRAN (R 3.4.0)
modelr 0.1.1 2017-07-24 CRAN (R 3.4.1)
modeltools 0.2-21 2013-09-02 CRAN (R 3.4.0)
munsell 0.4.3 2016-02-13 CRAN (R 3.4.0)
mvtnorm 1.0-6 2017-03-02 CRAN (R 3.4.0)
nlme 3.1-131 2017-02-06 CRAN (R 3.4.0)
NMF * 0.23.4 2017-05-05 Github (renozao/NMF@ff913c0)
nnet 7.3-12 2016-02-02 CRAN (R 3.4.0)
parallel * 3.4.3 2017-12-01 local
pkgconfig 2.0.1 2017-03-21 cran (@2.0.1)
pkgmaker * 0.26.7 2017-05-05 Github (renozao/pkgmaker@845deb6)
plyr 1.8.4 2016-06-08 CRAN (R 3.4.0)
prabclus 2.2-6 2015-01-14 CRAN (R 3.4.0)
psych 1.7.5 2017-05-03 CRAN (R 3.4.0)
purrr * 0.2.3 2017-08-02 cran (@0.2.3)
R6 2.2.2 2017-06-17 CRAN (R 3.4.1)
RColorBrewer * 1.1-2 2014-12-07 CRAN (R 3.4.0)
Rcpp 0.12.13 2017-09-28 cran (@0.12.13)
readr * 1.1.1 2017-05-16 CRAN (R 3.4.1)
readxl 1.0.0 2017-04-18 CRAN (R 3.4.0)
registry * 0.3 2015-07-08 CRAN (R 3.4.0)
reshape2 1.4.2 2016-10-22 CRAN (R 3.4.0)
rlang 0.1.2 2017-08-09 cran (@0.1.2)
rmarkdown 1.6 2017-06-15 CRAN (R 3.4.1)
Rmisc 1.5 2013-10-22 CRAN (R 3.4.0)
rngtools * 1.2.4 2014-03-06 CRAN (R 3.4.0)
robustbase 0.92-7 2016-12-09 CRAN (R 3.4.0)
rprojroot 1.2 2017-01-16 CRAN (R 3.4.0)
rvest 0.3.2 2016-06-17 CRAN (R 3.4.0)
scales 0.5.0.9000 2017-10-06 Github (hadley/scales@d767915)
stats * 3.4.3 2017-12-01 local
stats4 3.4.3 2017-12-01 local
stringi 1.1.5 2017-04-07 CRAN (R 3.4.0)
stringr 1.2.0 2017-02-18 CRAN (R 3.4.0)
tibble * 1.3.4 2017-08-22 cran (@1.3.4)
tidyr * 0.6.3 2017-05-15 CRAN (R 3.4.1)
tidyverse * 1.1.1 2017-01-27 CRAN (R 3.4.0)
tools 3.4.3 2017-12-01 local
trimcluster 0.1-2 2012-10-29 CRAN (R 3.4.0)
utils * 3.4.3 2017-12-01 local
viridis 0.4.0 2017-03-27 CRAN (R 3.4.0)
viridisLite 0.2.0 2017-03-24 CRAN (R 3.4.0)
whisker 0.3-2 2013-04-28 CRAN (R 3.4.0)
withr 2.0.0 2017-10-24 Github (jimhester/withr@a43df66)
xml2 1.1.1 2017-01-24 CRAN (R 3.4.0)
xtable 1.8-2 2016-02-05 CRAN (R 3.4.0)
yaml 2.1.14 2016-11-12 CRAN (R 3.4.0)
References
Allison, David B, Xiangqin Cui, Grier P Page, and Mahyar Sabripour. 2006. “Microarray Data Analysis: From Disarray to Consolidation and Consensus.” Nature Reviews Genetics 7 (1). Nature Publishing Group: 55–65. doi:10.1038/nrg1749.
Baldi, Pierre, and Anthony D Long. 2001. “A Bayesian Framework for the Analysis of Microarray Expression Data: Regularized T-Test and Statistical Inferences of Gene Changes.” Bioinformatics 17 (6). Oxford Univ Press: 509–19. doi:10.1093/bioinformatics/17.6.509.
Benjamini, Yoav, and Yosef Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society. Series B (Methodological). JSTOR, 289–300. doi:10.2307/2346101.
Brazma, Alvis, Pascal Hingamp, John Quackenbush, Gavin Sherlock, Paul Spellman, Chris Stoeckert, John Aach, et al. 2001. “Minimum Information About a Microarray Experiment (MIAME)—toward Standards for Microarray Data.” Nature Genetics 29 (4). Nature Publishing Group: 365–71. doi:10.1038/ng1201-365.
Huber, Wolfgang, Anja Von Heydebreck, Holger Sültmann, Annemarie Poustka, and Martin Vingron. 2002. “Variance Stabilization Applied to Microarray Data Calibration and to the Quantification of Differential Expression.” Bioinformatics 18 (suppl 1). Oxford Univ Press: S96–S104. doi:10.1093/bioinformatics/18.suppl_1.S96.
Kayala, Matthew A, and Pierre Baldi. 2012. “Cyber-T Web Server: Differential Analysis of High-Throughput Data.” Nucleic Acids Research 40 (W1). Oxford Univ Press: W553–W559. doi:10.1093/nar/gks420.
King, C. L., D. H. Davies, P. Felgner, E. Baum, A. Jain, A. Randall, K. Tetteh, C. J. Drakeley, and B. Greenhouse. 2015. “Biosignatures of Exposure/Transmission and Immunity.” American Journal of Tropical Medicine and Hygiene 93 (3 Suppl). American Society of Tropical Medicine; Hygiene: 16–27. doi:10.4269/ajtmh.15-0037.
Kreil, David P, and Roslin R Russell. 2005. “Tutorial Section: There Is No Silver Bullet—a Guide to Low-Level Data Transforms and Normalisation Methods for Microarray Data.” Briefings in Bioinformatics 6 (1). Oxford Univ Press: 86–97. doi:10.1093/bib/6.1.86.
Smyth, Gordon K, and others. 2004. “Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments.” Statistical Applications in Genetics and Molecular Biology 3 (1): 3. doi:10.2202/1544-6115.1027.
---
title: "Microarreglos: Complementos al Flujo de trabajo de `Bioconductor`"
author: "avallecam@gmail.com"
date: "2017-08-04"
output:
  #html_document:
  #pdf_document:
  html_notebook:
    toc: yes
    toc_depth: 5
    toc_float:
      collapsed: yes
    #theme: united
    #code_folding: "hide"
    #fig_caption: TRUE
    #number_sections: TRUE
bibliography: malaria.bib
link-citations: yes
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(width = 90) # expand limits of CONSOLE output
```

La primera parte del presente tutorial construye gráficas para la interpretación de los resultados de la publicación del 
capítulo [limma](https://kasperdanielhansen.github.io/genbioconductor/html/limma.html) 
del curso
[Bioconductor for Genomic Data Science](https://kasperdanielhansen.github.io/genbioconductor/) 
de [Kasper Daniel Hansen](http://www.hansenlab.org/).

En la segunda parte se presenta la (i) limpieza de datos reportados, (ii) creación de un ExpressionSet y (iii) reproducción de resultados del 
artículo publicado por [Torres et al. 2015](https://academic.oup.com/jid/article-lookup/doi/10.1093/infdis/jiu614), en el cual aplican la tecnología del microarreglo de proteínas.

## Objetivo

- Complementar el flujo de trabajo de `Bioconductor` para el análisis de microarreglos.

__Nota:__ Todos los outputs deben verse como líneas corridas. _Disminuir el zoom de ser necesario._

## Dependencies
```{r, message=FALSE}
library(tidyverse)
library(ggrepel)
library(Biobase)
library(limma)
library(NMF)
```

## Data input

La data ha sido adquirida por dos vias distintas:

- Data disponible del paquete `leukemiaEset` e
- Importación del archivo __excel__ disponible en el material suplementario del artículo referido.

```{r, eval=FALSE, echo=TRUE}
library(leukemiasEset)
data(leukemiasEset)
leukeset <- leukemiasEset

#https://academic.oup.com/jid/article-lookup/doi/10.1093/infdis/jiu614#supplementary-data
protmicro <- readxl::read_excel("data-raw/jiu614_Supplementary_Data/jiu614supp_table1.xlsx")
```

Ambas bases han sido grabadas en un formato `.rds` con la función `saveRDS()` para facilitar su distribución y acceso.
```{r,eval=FALSE, echo=TRUE}
saveRDS(leukeset, "data-raw/leukeset.rds")
saveRDS(protmicro, "data-raw/protmicro.rds")
```

Por ello, la importación de ambas bases será ejecutada por la función `readRDS()`.
```{r}
leukeset <- readRDS("data-raw/leukeset.rds")
protmicro <- readRDS("data-raw/protmicro.rds")
```

## DNA microarray: Leukemia

__Contexto:__

- This is data on different types of leukemia.  The code `NoL` means not leukemia, ie. normal controls.

### Problema

- Which genes are differentially expressed between the `ALL` type and normal controls?

### ExpressionSet

El presente ejemplo nos brinda un `ExpressionSet()`, el cual es un contenedor de datos que integra:

- Data de __expresión__, accesible con la función `exprs()`, y
- Data de las __muestras__ o covariables, accesible con la función `pData()`.

```{r}
leukemiasEset <- leukeset
leukemiasEset
```

### Differential expression

#### Define the control group

- First we subset the data and clean it up.
- Then, define the control group

```{r}
eset <- leukemiasEset[, leukemiasEset$LeukemiaType %in% c("ALL", "NoL")]
eset$LeukemiaType <- factor(eset$LeukemiaType)

table(eset$LeukemiaType) # debe ser: control (fct referencia) vs caso
eset$LeukemiaType <- relevel(eset$LeukemiaType, ref = "NoL")
table(eset$LeukemiaType) # debe ser: control (fct referencia) vs caso
```

#### Matrix + Linear model + eBayes

`limma` ayuda a resolver dos principales problemas:

- presencia de varianzas artificialmente altas o bajas producto del 
bajo número de réplicas entre arreglos o muestras [@baldi2001cybert]; y un
- presencia de falsos positivos producto al amplio número de hipótesis 
puestas a prueba en forma simultánea [@kayala2012cyber].

1. Un __test-t empírico de Bayes__ permite reducir (o moderar) las varianzas de 
todas las lecturas.
    - `cyber-t` actualiza la varianza observada de cada gen asumiendo como la probabilidad previa (_prior_) 
    que la varianza de genes vecinos con medias de expresión similar [@baldi2001cybert].
    - `limma` optimizó y generalizó esta estrategia a distintos diseños experimentales. 
    Asume una varianza general como _prior_ para la actualización de todas las varianzas 
    observadas [@smyth2004ebayes].
    
    __Nota:__ A diferencia de un enfoque bayesiano puro, donde se asume una 
    probabilidad previa para la actualización de un evento (condicionado) dado un evento condicionante,
    en un enfoque empírico esta información es obtenida de un _pool_ de
    elementos observados en el mismo experimento.
    
2. Los métodos de __corrección o ajuste de valores p__ permiten que el 
controlar la razón de falsos positivos con 
respecto a todos los descubrimientos o razón de falsos descubrimientos 
(FDR) [@brazma2001].
    - El método Benjamini-Hochberg determina un valor crítico de valores p dependiente del total 
de hipótesis puestas a prueba y el valor de FDR deseado [@benjamini1995fdr].

Ambas propiedades son revisadas [aquí](http://genomicsclass.github.io/book/pages/using_limma.html).

```{r, message=FALSE, eval=FALSE, echo=FALSE}
design <- model.matrix(~ eset$LeukemiaType)

eb <- eset %>% 
  lmFit(design) %>% 
  eBayes()

eb %>% topTable()
```

```{r}
design <- model.matrix(~ eset$LeukemiaType)
fit <- lmFit(eset, design)
eb <- eBayes(fit)
topTable(eb) %>% rownames_to_column()
```

The output from `topTable()` includes

- `logFC`: the log fold-change between cases and controls.
- `t`: the t-statistic used to assess differential expression.
- `P.Value`: the p-value for differential expression; this value is not adjusted for multiple testing.
- `adj.P.Val`: the p-value adjusted for multiple testing.   Different adjustment methods are available, the default is Benjamini-Horchberg.

```{r}
# tidy toptable
td <- topTable(eb, coef = 2,               # default: slope
         sort.by =NULL, resort.by =NULL, 
         #genelist = NULL,                 # no fit$gene
         number = Inf,                     # show all the hipothesis
         confint=0.95) %>% rownames_to_column() %>% as.tbl() %>% 
  dplyr::rename(Gene.ID=rowname)
```

```{r, eval=FALSE, echo=FALSE}
# explore the toptable

topTable(eb, coef = 2, 
         # interesting sorting: 1st: significance, then: effect
         sort.by ="P", resort.by ="logFC",
         number = Inf, 
         confint=0.95)

#topTable(eb, coef = c(1,2), number = Inf) %>% 
#  as.data.frame() %>% rownames_to_column() %>% as.tibble() %>% 
#  arrange(rowname)
#coef(eb) %>% 
#  as.data.frame() %>% rownames_to_column() %>% as.tibble() %>% 
#  arrange(rowname)

```

### Plots

```{r, eval=FALSE, echo=FALSE}

#### variance shrinkage

par(mfrow=c(1,2))
plot(eb$Amean, eb$sigma)
plot(eb$Amean, eb$s2.post)
```

```{r, fig.height=8, fig.width=8, message=FALSE, echo=FALSE, eval=FALSE}
x <- eb %>% biobroom::augment.MArrayLM() %>% ## ERROR in function!! s2.prior!!
  inner_join(
    inner_join(eb$Amean %>% 
               as.tibble() %>% 
               tibble::rownames_to_column() %>% 
               dplyr::rename(.AMean=value,
                             .gene=rowname),
               eb$s2.post %>% 
               as.tibble() %>% 
               tibble::rownames_to_column() %>% 
               dplyr::rename(s2.post=value,
                             .gene=rowname),
               by=".gene"),
             by=".gene") 

m <- x %>% 
  ggplot(aes(.AMean,.sigma))+
  geom_hex()+scale_y_continuous(limits = c(-0.2,9))
  #geom_density2d()

n <- x %>% 
  ggplot(aes(.AMean,s2.post))+
  geom_hex()+scale_y_continuous(limits = c(-0.2,9))
  #geom_density2d()#

o <- x %>% 
  ggplot(aes(.AMean,.sigma))+
  #geom_hex()+scale_y_continuous(limits = c(-0.2,9))
  geom_density2d()#+
  #scale_y_continuous(limits = c(-0.05,0.7))+
  #scale_x_continuous(limits = c(-0.05,15))

p <- x %>% 
  ggplot(aes(.AMean,s2.post))+
  #geom_hex()
  geom_density2d()#+
  #scale_y_continuous(limits = c(-0.05,0.7))+
  #scale_x_continuous(limits = c(-0.05,15))

Rmisc::multiplot(m,n,o,p,cols = 2)
```

#### p-value histogram

- useful to check prior to execution of a multiple comparison correction.

- El comportamiento de esta corrección puede inferirse preliminarmente al observar la ditribución de valores p en un histograma. 
- Corroborar que la proporción de valores p ajustados menores al nivel de FDR deseado será igual a la proporción de valores p menores al nivel de significancia que estén sobre la distribución de las hipótesis nulas no rechazadas.
- __Comparar con el siguiente caso__

```{r}
test <- topTable(eb, coef = 2,
                 sort.by ="P", 
                 #sort.by ="AveExpr", 
                 genelist = fit$genes$Gene.ID, number = Inf)
library(ggplot2)
ax <- qplot(test$P.Value, binwidth=.05)
bx <- qplot(test$adj.P.Val, binwidth=.05)
Rmisc::multiplot(ax,bx,cols = 2)
```


```{r, eval=FALSE, fig.align="center", echo=FALSE}

#### base R volcano plot

# (2) Volcano plot: Log Fold Change x -log10(p.value)
#source: http://genomicsclass.github.io/book/pages/using_limma.html

#
par(mfrow=c(1,2))
with(td, plot(logFC, -log10(P.Value),cex=.7, pch=20,
                    main = "caso vs control",
                    #col=cols,
                    #xlim=c(-2,2), ylim=c(0,15),
                    xlab="log Fold Change"))
abline(h=c(-log10(0.05),-log10(1e-13)),
       v=c(-1,1),
       lty=2, col= "grey55")
```

<!--
h<-topTable(eb, coef = 2, 
            sort.by =NULL, resort.by =NULL, 
            genelist = NULL, 
            number = Inf, 
            confint=0.95)
h<-h[rownames(coef(eb)),]
td <- data.frame(Gene.ID=dimnames(eb$coefficients)[[1]],logFC=coef(eb)[,2],AveExpr=h$AveExpr, p.value=eb$p.value[,2], padj=h$adj.P.Val, B=h$B)
-->

#### volcano plot

- Visualizacióń del efecto del caso sobre el control (logFC) por su significancia estadística.

- tutorial [aquí](http://www.gettinggeneticsdone.com/2016/01/repel-overlapping-text-labels-in-ggplot2.html)

```{r, echo=TRUE}
# (2) Volcano plot: Log Fold Change x -log10(p.value)

td %>% 
  mutate(significance=if_else(P.Value<1e-13,
                              "p.adj < 1e-13","diff rve")) %>% 
  ggplot(aes(logFC,-log10(P.Value))) +
  geom_point(aes(colour=significance)) +
  scale_color_manual(values=c("black","red")) +
  ggrepel::geom_text_repel(data = filter(td, 
                                         P.Value<1e-14 | P.Value<1e-13 & logFC>0),
                           aes(label=Gene.ID)) +
  labs(title="Leukemia DNA microarray",
       subtitle="Caso vs Control")
```

#### heatmap + hierachical clustering

La inferencia estadística es comúnmente complementada con técnicas de clasificación.

1. La clasificación __supervisada__ o predicción de clase, define _a priori_ 
el número de categorías, el conjunto de entrenamiento y el de prueba. Ejemplos:
_support vector machines_ (SVM), _k-nearest neighbors_ (k-NN) y validación
por _leave-one-out cross-validation_ (LOOCV).

2. La clasificación __no-supervisada__ o descubrimiento de clase, agrupa objetos 
en base a métricas de similaridad de los objetos a lo largo de las muestras.
Requiere definir 2 tipos de algoritmos: 
    
    - de distancia: e.g., Euclidiana, en la que se prioriza la 
    magnitud sobre la dispersión de las señales, y la correlación de 
    Pearson, en la que se prioriza la dispersión sobre la magnitud.
    
    - de agrupamiento: e.g., agrupamiento jerárquico o _hierarchical clustering_,
    en la que los racimos o _clusters_ se forman iterativamente al calcular 
    la distancia entre elementos y racimos en formación. 
    Más detalles, [aquí](bonsai.hgc.jp/~mdehoon/software/cluster/cluster3.pdf). 
    Otros son el 
    _k-mean clustering_, _self-organizing maps_ (SOM) y 
    _principal component analysis_ (PCA).
    
    > Las principales desventajas de la técnica están en el requerimiento 
    de decisiones entre varios algoritmos sin consenso y su escasa reproducibilidad
    entre experimentos. [@allison2006]

```{r, fig.height=7, fig.width=10, fig.align="center"}
#eset <- eset#leukemiasEset
#exprs(eset) %>% dim()
x <- td %>% 
  arrange(desc(t)) %>% 
  filter(B>0 & adj.P.Val<1e-10) %>% 
  select(Gene.ID) %>% as.matrix() 
#x %>% dim()
#%>% dim()
aheatmap(exprs(eset)[x,], 
         Rowv = TRUE, Colv = TRUE, 
         annCol = pData(eset), 
         layout = "_")
```

```{r, eval=FALSE, echo=FALSE}
#### boxplots
biobroom::tidy.ExpressionSet(eset,addPheno = TRUE) %>% 
  ggplot(aes(reorder(gene,value,order = TRUE),value,fill=LeukemiaType))+
  geom_boxplot()+
  #geom_point() +
  ylab("log2-transformed normalized MFI")+
  xlab("antigens")+
  labs(title="Differentian reactivity of antibodies",
       subtitle="filtered by B>0 (~adj.P.Val<0.01) and sorted w.r.t. median Ab reactivity per gene")+
  coord_flip() # ORDENAR eje de proteínas!!!!
```


## Protein microarray: Malaria

__Contexto:__

- Pacientes __con__ parasitemia circulante de _P. falciparum_ y __auscencia__ de síntomas son considerados como individuos con inmunidad clínica ante la malaria.
- __Hipótesis:__ La respuesta de anticuerpos ante la infección en el grupo de __pacientes asintomáticos__ reconoce un subgrupo de proteínas de _P. falciparum_, posiblemente asociadas al desarrollo de la inmunidad clínica.

Revisar más información sobre el experimento [aquí](https://academic.oup.com/jid/article-lookup/doi/10.1093/infdis/jiu614).

### Problema

- ¿Cómo limpiar la data dsponible y generar un `ExpressionSet()` desde cero?
- ¿Qué proteínas están diferencialmente reconocidas en el grupo de asintomáticos con respecto a los sintomáticos?

### Tidy up data

```{r}
raw<-protmicro
```

```{r, eval=FALSE}
raw %>% View()
```

#### phenotype data

- Objetivo:
    + Extraer la data fenotípica,
    + Transponer su estructura,
    + Asignar el nombre de muestra por columna,
    + Crear un nuevo objeto capaz de ser leído por el `ExpressionSet`.

```{r}
#samples <- raw[1:2,] %>% colnames()
#samples %>% as.matrix() %>% t() %>% class()
#https://stackoverflow.com/questions/34004008/transposing-in-dplyr

pheno_data <- raw[1:2,] %>% 
  slice(1) %>% 
  select(-1,-2) %>% 
  mutate(group=1) %>% 
  gather(sample_name,phenotype) %>% 
  filter(phenotype!="1") 

pheno_data %>% arrange(sample_name)
pheno_data %>% count(phenotype)
```

```{r}
#https://www.rdocumentation.org/packages/tibble/versions/1.3.3/topics/rownames
pheno_data_row <- pheno_data %>% 
  arrange(sample_name) %>% 
  as.data.frame() %>% 
  column_to_rownames(var="sample_name")

pheno_d <- new("AnnotatedDataFrame", data=pheno_data_row)
```

#### expression data

- Objetivo:
    + Transformar el `data.frame` a una matriz,
    + Separar la data de expresión de la data de normalización.

```{r, echo=FALSE, eval=FALSE}
raw[77:nrow(raw),] %>% 
  rename(gene_name="X__1",
         gene_ID="X__2") %>% 
  select(1:10,-gene_name) %>% #solo primeras 8 muestras con fin demostrativo
  mutate(group=seq(n())) %>% 
  unite(gene_ID.n,gene_ID,group,sep = ".") %>% 
  gather(sample_name, expression, -gene_ID.n) %>% 
  mutate(expression=as.numeric(expression)) %>% 
  reshape2::acast(gene_ID.n ~ sample_name,
                  value.var = "expression") %>% head() #%>% class()
```

```{r}
# limpieza inicial
exprs_data <- raw[77:nrow(raw),] %>% 
  rename(gene_name="X__1",
         gene_ID="X__2") %>% 
  select(-gene_name) %>% 
  mutate(group=seq(n())) %>% 
  unite(gene_ID.n,gene_ID,group,sep = ".") %>% 
  mutate(gene_ID.n=stringr::str_replace(gene_ID.n,"\\s",""))

#exprs_data %>% dim() #33668
exprs_ctrl <- exprs_data %>% slice(1:31) #%>% dim() # 31 "no DNA" per sample
exprs_prot <- exprs_data %>% slice(32:n()) #%>% dim() # 33637 "no DNA" per sample

# PROTEINS
exprs_prot_m <- exprs_prot %>% 
  gather(sample_name, expression, -gene_ID.n) %>% 
  mutate(expression=as.numeric(expression)) %>% 
  reshape2::acast(gene_ID.n ~ sample_name,
                  value.var = "expression") #%>% class()
# dimensión
dim(exprs_prot_m)
# seis lecturas por proteína de las 8 primeras muestras
head(exprs_prot_m[,1:8])

# CONTROL
exprs_ctrl_m <- exprs_ctrl %>%
  summarise_all(median) %>% # MEDIAN to NORMALIZE against noDNA controls
  gather(sample_name, expression, -gene_ID.n) %>% 
  mutate(expression=as.numeric(expression)) %>% 
  reshape2::acast(gene_ID.n ~ sample_name,
                  value.var = "expression") #%>% dim() #%>% class()
# dimensión
dim(exprs_ctrl_m)
# controles noDNA de las 8 primeras muestras
exprs_ctrl_m[,1:8]
```

```{r}
mat <- NULL # crear objeto vacío
# loop para generar una matriz de las dimensiones de `exprs_prot_m`
for(i in 1:nrow(exprs_prot_m)){
  mat <- rbind(mat,exprs_ctrl_m)
}
# dimensiones de la matriz con la mediana de ctrls `noDNA` creada
dim(mat)
# matriz con la mediana de ctrls `noDNA` para las 8 primeras muestras
head(mat[,1:8]) #%>% class()
```

#### gene names

```{r}
# generate a list (df) of gene names and gene ID's
ln <- raw[108:nrow(raw),1:2] %>% 
  rename(gene_name="X__1",
         gene_ID="X__2") %>% 
  mutate(num=seq(from=32, to=31+n())) %>% 
  unite(gene_ID.n,gene_ID,num,sep=".") %>%
  full_join(exprs_prot,by="gene_ID.n") %>% 
  select(1:2) %>% 
  rename(Gene.ID=gene_ID.n)
```

### Normalization + Transformation

Objetivo de la __normalización__: homogeneizar la variabilidad entre arreglos
en base a controles, los cuales se asume que son invariantes entre muestras. 
El método depende del diseño de arreglos:

- sustracción de controles o _fold over control_ (__FOC__),
aplicado en microarreglos de proteínas con respecto a la mediana del control blanco
por muestra [@King2015FOC];
- estabilización de varianzas o __VSN__, 
aplicado en microarreglos de DNA con respecto a controles de 
intensidad invariante entre muestras, e.g. genes _housekeeping_ [@huber2002vsn]

Objetivo de la __transformación__: reducir la
heterocedasticidad o heterogeneidad de varianzas entre las observaciones 
por errores aleatorios en el proceso de hibridación [@kreil2005bullet].
Dos principales métodos:

- __logarítmica__,
la cual asume un error multiplicativo que corrige las varianzas de 
alta intensidad pero capaz dispersar a las de baja intensidad; y
- __asinh__,
la cual asume un error multiplicativo y aditivo que corrige esta dispersión, 
pero solo relevante en contextos donde los valores normalizados menores a la unidad 
tienen significado biológico.

En el contexto de microarreglos de DNA, ambas transformaciones permiten homogenizar la sobreexpresión o represión génica con respecto a el control elegido. Por ejemplo, una sobreexpresión de 4 sobre el control se homogeniza con una represión de la misma magnitud:

$$log_2\left(\frac{gen_x}{ctrl}\right) \qquad \Rightarrow \qquad log_2\left(\frac{4a}{a}\right)=+2 \qquad or \qquad log_2\left(\frac{b}{4b}\right)=-2$$

> **NOTA**: Los autores del paper reportan el método `vsn`. A diferencia de la transformación logarítimica, este logra corregir errores en lecturas de intensidad baja (e.g., cercanas y menores a la unidad posterior a su normalización). Es idóneo en experimentos donde la intensidades positivas y negativas con respecto al control (generalmente, genes _housekeeping_) son biológicamente relevantes. Sin embargo, el contexto de microarreglo de proteínas es particularmente asimétrico, es decir, las lecturas con relevancia biológica serán todas las que poseean intensidades mayores al control negativo (noDNA) usado en la normalización. 
Más detalles, [aquí](https://academic.oup.com/bib/article/6/1/86/288923/Tutorial-section-There-is-no-silver-bullet-a-guide)

```{r}
# custom function
# definir resultado para los valores menores a cero, 
# donde el log no está definido
log2.NA = function(x) {log2(ifelse(x>0, x, NA))}

# normalización con respecto a los controles `noDNA` c/ transformación log2
exprs_norm <- log2.NA(exprs_prot_m/mat)
head(exprs_norm[,1:6])
```

```{r, fig.height=4, fig.width=10, eval=FALSE, echo=FALSE}
# compare between normalization methods
z <- exprs_prot_m %>% 
  as.data.frame() %>% as.tbl() %>% 
  gather(sample,expression) %>% 
  arrange(expression) #%>% summary()
a <- exprs_prot_m/mat
#a %>% dim()
b <- a %>% 
  as.data.frame() %>% as.tbl() %>% 
  gather(sample,expression) %>% 
  arrange(expression) #%>% summary()
c <- b %>% 
  ggplot(aes(expression,log2(expression))) +
  geom_hex() +
  #geom_density2d()+
  geom_vline(aes(xintercept=1), lty=3)
d <- b %>% 
  ggplot(aes(expression,log2(expression))) +
  #geom_hex() +
  geom_density2d()+
  geom_vline(aes(xintercept=1), lty=3)
Rmisc::multiplot(c,d,cols = 2)

e <- b %>% group_by(sample) %>% 
  summarise(mean=mean(expression),sd=sd(expression)) %>% 
  ggplot(aes(mean,sd))+
  geom_point()+geom_smooth()
f <- b %>% 
  mutate_at(vars(expression), funs(log2)) %>% #arrange(sample)
  group_by(sample) %>%
  summarise(mean=mean(expression),sd=sd(expression)) %>% 
  ggplot(aes(mean,sd))+
  geom_point()+geom_smooth()
Rmisc::multiplot(e,f,cols = 2)
```


<!--

__En el pasado...__

```r
all.raw_rnames <- raw[,2]
all.raw_data <- data.matrix(raw[,3:ncol(raw)])
rownames(all.raw_data) <- all.raw_rnames
ctrl.raw_rnames <- raw[1:31,2]
ctrl.raw_data <- data.matrix(raw[1:31,3:ncol(raw)])
rownames(ctrl.raw_data) <- ctrl.raw_rnames                
median.ctrl.raw_data<-t(as.matrix(apply(ctrl.raw_data, 2, median)))
rownames(median.ctrl.raw_data) <- c("median.ctrl")                
anti.raw_rnames <- raw[32:nrow(raw),2]                           
anti.raw_data <- data.matrix(raw[32:nrow(raw),3:ncol(raw)])     
rownames(anti.raw_data) <- anti.raw_rnames                
median.ctrl.raw_data.1<-rep(median.ctrl.raw_data[1],nrow(anti.raw_data))
median.ctrl.raw_data.mat<-matrix(median.ctrl.raw_data.1,length(median.ctrl.raw_data.1),1)
for (i in 2:ncol(median.ctrl.raw_data)) {
  median.ctrl.raw_data.mat<-cbind(median.ctrl.raw_data.mat,rep(median.ctrl.raw_data[i],nrow(anti.raw_data)))
}
log2.zero = function(x) {log2(ifelse(x>0, x, 1))}
log2.NA = function(x) {log2(ifelse(x>0, x, NA))}   dsitribution
log2.NA.norm.ANTIGENS<-log2.NA(anti.raw_data/median.ctrl.raw_data.mat)
```
-->


### Create an ExpressionSet

```{r}
eset <- ExpressionSet(assayData = exprs_norm, 
                      phenoData = pheno_d)
eset
```

```{r qualityX, echo=FALSE, eval=FALSE}
### Pre-plots

#### Heatmap

########################### SAMPLE ORDER per group

# [START FROM THE BEGINNNING]
#eset
#eset$Group <- factor(eset$Group)
#eset$Group

#pData(eset) <- cbind(pData(eset),cat_2,t(col.mean.m_dataX))#ncol(pData(eset))
#rownames(sampleSEVcovariates)
#cat_2 <- data.frame(as.factor(matrix(sampleSEVcovariates$edad_CAT)))
#rownames(cat_2) <- rownames(sampleSEVcovariates)
#colnames(cat_2) <- "edad_CAT_2"
#pData(eset)[23] <- sampleSEVcovariates$edad_CAT
#pData(eset) <- cbind(pData(eset),cat_2)#ncol(pData(eset))

# starts
#class(exprs(eset)) #matrix
m_data <- exprs(eset)
m_data<-as.data.frame(m_data)

# [MODIFICATION 28jul2016]
col.mean.m_dataX<-t(as.matrix(apply(m_data, 2, mean, na.rm=TRUE)))
rownames(col.mean.m_dataX) <- c("col.mean.m")                 # assign row names 
#col.mean.m_dataX

row.mean.m_data<-as.matrix(apply(m_data, 1, mean, na.rm=TRUE))
colnames(row.mean.m_data) <- c("row.mean.m")                 # assign row names 
#row.mean.m_data


m_data_row.mean<-cbind(m_data,row.mean.m_data)
m_data_row.mean<-as.matrix(m_data_row.mean)

m_data_row.mean<-as.data.frame(m_data_row.mean)
m_data_row.mean<-m_data_row.mean[order(-m_data_row.mean$row.mean.m),]
m_data_row.mean<-as.matrix(m_data_row.mean)

m_data_row.mean<-as.data.frame(m_data_row.mean)
m_data_row.mean$row.mean.m<-NULL
#dim(m_data_row.mean)
#m_data_row.mean
m_data_row.mean<-as.matrix(m_data_row.mean)
#

m_data_col.row.mean<-rbind(m_data_row.mean,col.mean.m_dataX)

m_data_col.row.mean<-t(m_data_col.row.mean)
m_data_col.row.mean<-as.data.frame(m_data_col.row.mean)

m_data_col.row.mean<-m_data_col.row.mean[order(m_data_col.row.mean$col.mean.m),]
m_data_col.row.mean$col.mean.m<-NULL
m_data_col.row.mean<-t(m_data_col.row.mean)

m_data_col.row.mean<-as.data.frame(m_data_col.row.mean)
#dim(m_data_col.row.mean)
#m_data_col.row.mean
m_data_col.row.mean<-as.matrix(m_data_col.row.mean)
# ends


##### TOP10 OF HIGHER REACTIVITY in BOTH GROUPS -> the same as sort.by AveExp
#head(rownames(m_data_col.row.mean), 10)

# [ADDED 28jul2016]
feature_sort_mean<-rownames(m_data_col.row.mean)
sample_sort_mean_FIRST<-colnames(m_data_col.row.mean)
#
pData(eset) <- cbind(pData(eset),t(col.mean.m_dataX))#ncol(pData(eset))
#
eset <- eset[feature_sort_mean,sample_sort_mean_FIRST] ## REORDER samples and features inside the ExpressionSet

#eset$Group
```

```{r heatmap11, eval=FALSE, echo=FALSE}
#head(anti.raw_data)[,1:5]
#head(log2.NA.norm.ANTIGENS)[,1:5]
#head(exprs(normal.log2.raw.eSet.VIVAX.SEV.FILTER))[,1:5]
#anti.raw_data
rawhm <- exprs_prot_m[feature_sort_mean,sample_sort_mean_FIRST] ## REORDER samples and features inside the ExpressionSet
#head(rawhm)[,1:5]

#log2hp <- exprs(normal.log2.raw.eSet.VIVAX.SEV.FILTER)[feature_sort_mean,sample_sort_mean_FIRST] ## REORDER samples and features inside the ExpressionSet
#sum(log2hp==exprs(eset))
#head(log2hp)[,1:5]
#head(exprs(eset))[,1:5]
#ctrl.raw_data
ctrlhm <- exprs_ctrl_m[,sample_sort_mean_FIRST] ## REORDER samples and features inside the ExpressionSet
#head(ctrlhm)[,1:5]

#head(median.ctrl.raw_data)[,1:5]
#median.ctrl.raw_data
```

```{r heatmap0, eval=FALSE, fig.height=8, fig.width=8, fig.align="center", echo=FALSE}

#__Sort all by Mean Sample Intensity__

#eset<-normal.log2.raw.eSet.VIVAX.SEV.FILTER
#pData(eset) <- pData(eset)[,9:ncol(pData(eset))]
#library(NMF)
aheatmap(exprs(eset), Rowv = NA, Colv = NA, annCol = pData(eset))
```

```{r bckgrd1, fig.height=8, fig.width=8, fig.align='center', warning=FALSE, message=FALSE, echo=FALSE, eval=FALSE}
#### Backgroud effect
#rownames(row.mean.m_data)
RAW_data <- exprs_prot_m[rownames(row.mean.m_data),]
row.mean.RAW_data<-as.matrix(apply(RAW_data, 1, median, na.rm=TRUE))
colnames(row.mean.RAW_data) <- c("row.mean.RAW")                 # assign row names 
#
col.mean.RAW_data<-as.matrix(apply(RAW_data[,colnames(col.mean.m_dataX)], 2, median, na.rm=TRUE)) ### ONLY FOR 60 samples
colnames(col.mean.RAW_data) <- c("col.mean.RAW")                 # assign row names 
#
#head(row.mean.RAW_data)
#head(row.mean.m_data)
totint <- cbind(row.mean.m_data,row.mean.RAW_data)
totint <- as.data.frame(totint)
colnames(totint) <- c("LOG2.feature.mean.int", "RAW.feature.median.int")
t1 <- ggplot(totint, aes(x=LOG2.feature.mean.int,y=RAW.feature.median.int)) + geom_point() + geom_smooth()
#
sampleint <- cbind(t(col.mean.m_dataX),col.mean.RAW_data)
sampleint <- as.data.frame(sampleint)
colnames(sampleint) <- c("LOG2.sample.mean.int", "RAW.sample.median.int")
t2 <- ggplot(sampleint, aes(x=LOG2.sample.mean.int,y=RAW.sample.median.int)) + geom_point() + geom_smooth()
#
#class(col.mean.m_dataX)
#class(median.ctrl.raw_data) #exprs_ctrl_m
No.DNA.int <- subset(exprs_ctrl_m, select= colnames(col.mean.m_dataX))
#No.DNA.int
#col.mean.m_dataX
back <- rbind(col.mean.m_dataX,No.DNA.int)
back <- as.data.frame(t(back))
colnames(back) <- c("LOG2.sample.mean.int", "RAW.No.DNA.median.int")
b1 <- ggplot(back, aes(x=LOG2.sample.mean.int,y=RAW.No.DNA.median.int)) + geom_point() + geom_smooth()
#
rawk <- cbind(col.mean.RAW_data,t(No.DNA.int))
rawk <- as.data.frame(rawk)
colnames(rawk) <- c("RAW.sample.median.int", "RAW.No.DNA.median.int")
b2 <- ggplot(rawk, aes(x=RAW.sample.median.int,y=RAW.No.DNA.median.int)) + geom_point() + geom_smooth()
#
#Rmisc::multiplot(t1,b2,t2,b1, cols=2)
#
```

```{r bckgrnd2, fig.width=14, fig.height=7, fig.align='center', echo=FALSE, eval=FALSE}

# RAW vs LOG2 HEATMAP

par(mfrow=c(1,2))

# [START FROM THE BEGINNNING]
#eset#<-normal.log2.raw.eSet
#
eset <- eset[feature_sort_mean,sample_sort_mean_FIRST] ## REORDER samples and features inside the ExpressionSet

aheatmap(exprs(eset), Rowv = NA, Colv = NA, main = "LOG2")
aheatmap(rbind(rawhm,ctrlhm), Rowv = NA, Colv = NA, main = "RAW")
```

```{r homosk1, fig.height=4, fig.width=8, fig.align='center', warning=FALSE, message=FALSE, echo=FALSE, eval=FALSE}

#### Heteroskedasticity 
#*pre- & post- transformation/normalization*

#
RAW_data <- exprs_prot_m[rownames(row.mean.m_data),]
row.sd.RAW_data<-as.matrix(apply(RAW_data, 1, sd, na.rm=TRUE))
colnames(row.sd.RAW_data) <- c("row.mean.RAW")                 # assign row names 
#
homoskRAW <- cbind(row.mean.RAW_data,row.sd.RAW_data)
homoskRAW <- as.data.frame(homoskRAW)
colnames(homoskRAW) <- c("RAW.feature.median.int","RAW.feature.sd.int")
h2 <- ggplot(homoskRAW, aes(x=RAW.feature.median.int,y=RAW.feature.sd.int)) + geom_point() + geom_smooth()
#
row.sd.m_data<-as.matrix(apply(m_data, 1, sd, na.rm=TRUE))
colnames(row.sd.m_data) <- c("row.sd.m")                 # assign row names 
#
homosk <- cbind(row.mean.m_data,row.sd.m_data)
homosk <- as.data.frame(homosk)
colnames(homosk) <- c("LOG2.feature.mean.int", "LOG2.feature.sd.int")
h1 <- ggplot(homosk, aes(x=LOG2.feature.mean.int,y=LOG2.feature.sd.int)) + geom_point() + geom_smooth()#method="lm",se=FALSE
#
Rmisc::multiplot(h2,h1,cols=2)
```

### Differential expression

#### Define the control group

- El problem defin al grupo de sintomáticos como el grupo control.

```{r}
eset$phenotype <- factor(eset$phenotype)

table(eset$phenotype) # debe ser: control (fct referencia) vs caso
eset$phenotype <- relevel(eset$phenotype, ref = "Symptomatic")
table(eset$phenotype) # debe ser: control (fct referencia) vs caso
```

#### Matrix + Linear model + eBayes

```{r}
design <- model.matrix(~ eset$phenotype)
fit <- lmFit(eset, design)
eb <- eBayes(fit)
topTable(eb) %>% rownames_to_column()
```

```{r}
# tidy toptable
td <- topTable(eb, coef = 2,               # default: slope
         sort.by =NULL, resort.by =NULL, 
         #genelist = NULL,                 # no fit$gene
         number = Inf,                     # show all the hipothesis
         confint=0.95) %>% rownames_to_column() %>% as.tbl() %>% 
  dplyr::rename(Gene.ID=rowname)
```

### Problemas específicos

- Decisión estadística cambia:
    + microarreglos de DNA -> variación w.r.t. _housekeeping_ gene
    + microarreglo de proteínas -> veces sobre control (blanco) -> intensidad final es relevante

### Plots

```{r, echo=FALSE, eval=FALSE, echo=FALSE}

#### variance shrinkage

par(mfrow=c(1,2))
plot(eb$Amean, eb$sigma)
plot(eb$Amean, eb$s2.post)
```

```{r, fig.height=8, fig.width=8, message=FALSE, echo=FALSE, eval=FALSE}
x <- eb %>% biobroom::augment.MArrayLM() %>% ## ERROR in function!! s2.prior!!
  inner_join(
    inner_join(eb$Amean %>% 
               as.tibble() %>% 
               tibble::rownames_to_column() %>% 
               dplyr::rename(.AMean=value,
                             .gene=rowname),
               eb$s2.post %>% 
               as.tibble() %>% 
               tibble::rownames_to_column() %>% 
               dplyr::rename(s2.post=value,
                             .gene=rowname),
               by=".gene"),
             by=".gene") 

m <- x %>% 
  ggplot(aes(.AMean,.sigma))+
  geom_hex()+scale_y_continuous(limits = c(-0.2,3))
  #geom_density2d()

n <- x %>% 
  ggplot(aes(.AMean,s2.post))+
  geom_hex()+scale_y_continuous(limits = c(-0.2,3))
  #geom_density2d()#

o <- x %>% 
  ggplot(aes(.AMean,.sigma))+
  #geom_hex()+scale_y_continuous(limits = c(-0.2,9))
  geom_density2d()#+
  #scale_y_continuous(limits = c(-0.05,0.7))+
  #scale_x_continuous(limits = c(-0.05,15))

p <- x %>% 
  ggplot(aes(.AMean,s2.post))+
  #geom_hex()
  geom_density2d()#+
  #scale_y_continuous(limits = c(-0.05,0.7))+
  #scale_x_continuous(limits = c(-0.05,15))

Rmisc::multiplot(m,n,o,p,cols = 2)
```



#### p-value histogram

```{r, echo=FALSE}
test <- topTable(eb, coef = 2,
                 sort.by ="P", 
                 #sort.by ="AveExpr", 
                 genelist = fit$genes$Gene.ID, number = Inf)
library(ggplot2)
ax <- qplot(test$P.Value, binwidth=.05)
bx <- qplot(test$adj.P.Val, binwidth=.05)
Rmisc::multiplot(ax,bx,cols = 2)
```

```{r, eval=FALSE, fig.align="center", echo=FALSE}

#### base R volcano plot

# (2) Volcano plot: Log Fold Change x -log10(p.value)
#source: http://genomicsclass.github.io/book/pages/using_limma.html

#
par(mfrow=c(1,2))
with(td, plot(logFC, -log10(P.Value),cex=.7, pch=20,
                    main = "caso vs control",
                    #col=cols,
                    #xlim=c(-2,2), ylim=c(0,15),
                    xlab="log Fold Change"))
abline(h=c(-log10(0.05),-log10(1e-13)),
       v=c(-1,1),
       lty=2, col= "grey55")
```


#### volcano plot

- OBJETIVO:
    + Priorizar el __Tamaño del Efecto__ por la __significancia estadístistica__.

```{r, eval=TRUE, echo=FALSE}
#### heatmap + hierachical clustering
eset <- ExpressionSet(assayData = exprs_norm, 
                      phenoData = pheno_d)
```

```{r, fig.height=5, fig.width=6.5, echo=TRUE}
# (2) Volcano plot: Log Fold Change x -log10(p.value)

tv <- td %>% 
  inner_join(biobroom::tidy.ExpressionSet(eset,addPheno = TRUE) %>% #count(gene)
               group_by(gene,phenotype) %>% summarise(mean_group=mean(value)) %>% 
               ungroup() %>% 
               spread(phenotype,mean_group) %>% 
               #arrange(desc(Asymptomatic)) %>% 
               rename(Gene.ID=gene),
             by="Gene.ID") 

tv %>% 
  mutate(significance=if_else(adj.P.Val<0.05,
                              "adj.P.Val<0.05","diff rve")) %>% 
  ggplot(aes(logFC,-log10(P.Value))) +
  #geom_point(aes(colour=significance)) + scale_color_manual(values=c("red","black")) +
  geom_point(aes(colour=Asymptomatic)) + viridis::scale_color_viridis() +
  ggrepel::geom_text_repel(data = tv %>% 
                             filter(adj.P.Val<0.05 & logFC>1.5 & Asymptomatic>1)# %>% 
                             #arrange(desc(Asymptomatic)) %>% 
                             #top_n(10,Asymptomatic)
                           ,
                           aes(label=Gene.ID)) +
  labs(title="Malaria Protein microarray",
       subtitle="Caso vs Control")+
  geom_hline(aes(yintercept=-log10(0.01)), lty=3)+
  geom_vline(aes(xintercept=1), lty=3)
```

### diff reactive

#### boxplot

```{r, fig.width=8,fig.height=6, eval=FALSE, echo=FALSE}
#repro
# #str()
#  mutate(order=seq(from=n(),to=1)) 

biobroom::tidy.ExpressionSet(eset,addPheno = TRUE) %>%
  inner_join(tv %>% 
               filter(adj.P.Val<0.05 & logFC>1 & Asymptomatic>1) %>% 
               rename(gene="Gene.ID")
    ,by="gene") %>% 
  ggplot(aes(reorder(gene,logFC),value,fill=phenotype))+
  geom_boxplot()+
  #geom_point() +
  ylab("log2-transformed normalized MFI")+
  xlab("antigens")+
  labs(title="Differentially reactive antigens",
       subtitle="filtered by adj.P.Value<0.05 and logFC>1 sorted w.r.t. Fold Change")+
  coord_flip() # ORDENAR eje de proteínas!!!!
```

```{r, fig.width=8,fig.height=6, eval=FALSE, echo=FALSE}
#repro
# #str()
#  mutate(order=seq(from=n(),to=1)) 

biobroom::tidy.ExpressionSet(eset,addPheno = TRUE) %>%
  inner_join(tv %>% 
               filter(adj.P.Val<0.05 & logFC>1 & Asymptomatic>1) %>% 
               rename(gene="Gene.ID")
    ,by="gene") %>% 
  ggplot(aes(reorder(gene,Asymptomatic),value,fill=phenotype))+
  geom_boxplot()+
  #geom_point() +
  ylab("log2-transformed normalized MFI")+
  xlab("antigens")+
  labs(title="Differentially reactive antigens",
       subtitle="filtered by adj.P.Value<0.05 and logFC>1 sorted w.r.t. Asymptomatic median Ab reactivity")+
  coord_flip() # ORDENAR eje de proteínas!!!!
```

#### list

```{r}
tv %>% 
  filter(adj.P.Val<0.05 & logFC>1 & Asymptomatic>1) %>% 
  arrange(desc(Asymptomatic)) %>% 
  select(Gene.ID) %>% 
  inner_join(ln, by="Gene.ID") %>% 
  inner_join(biobroom::tidy.ExpressionSet(eset,addPheno = TRUE) %>% 
               group_by(gene,phenotype) %>% summarise(mean_group=mean(value)) %>% 
               ungroup() %>% 
               spread(phenotype,mean_group) %>% 
               rename(Gene.ID=gene),
             by="Gene.ID") %>% 
  inner_join(td %>% 
               arrange(desc(t)) %>% 
               mutate(t.order=seq(n())) %>% 
               select(Gene.ID,AveExpr,P.Value,adj.P.Val,t.order),
             by="Gene.ID") #%>% slice(1:10)
```



### top reactive

- El grupo de antígenos con mayor reactividad permitirá darle una contexto biológico a los valores con significancia estadística.

```{r, eval=TRUE, echo=FALSE}
#### heatmap
eset <- ExpressionSet(assayData = exprs_norm, 
                      phenoData = pheno_d)
```

```{r, fig.height=7, fig.width=10, fig.align="center", echo=FALSE}
#eset <- eset#leukemiasEset
#exprs(eset) %>% dim()
x <- td %>% 
  #arrange(desc(t)) %>% 
  arrange(desc(AveExpr)) %>% 
  #filter(B>0 #& adj.P.Val<1e-4
  #       ) %>% 
  top_n(10,AveExpr) %>% 
  select(Gene.ID) %>% as.matrix() 
#x %>% dim()

pData(eset) <- biobroom::tidy.ExpressionSet(eset,addPheno = TRUE) %>% 
  group_by(sample) %>% summarise(sample_median=median(value)) %>% 
  full_join(pData(eset) %>% 
              rownames_to_column() %>% 
              rename(sample=rowname),
            by="sample") %>% 
  arrange(phenotype,sample_median) %>%
  as.data.frame() %>% 
  column_to_rownames(var="sample")

y <- pData(eset) %>% rownames_to_column() %>% 
  select(rowname) %>% as.matrix() %>% as.character()

eset <- eset[x,y]

#aheatmap(exprs(eset)[x,], 
#         Rowv = NA, Colv = NA, 
#         annCol = pData(eset), 
#         layout = "_")
```

#### boxplots

- show the intensity distributions of selected features in both groups.

```{r, fig.width=8,fig.height=2.5, echo=FALSE, eval=FALSE}
#repro
biobroom::tidy.ExpressionSet(eset,addPheno = TRUE) %>% #str()
  ggplot(aes(reorder(gene,value,order = TRUE),value,fill=phenotype))+
  geom_boxplot()+
  #geom_point() +
  ylab("log2-transformed normalized MFI")+
  xlab("antigens")+
  labs(title="Highly reactive antigens",
       subtitle="top 10 sorted w.r.t. median Ab reactivity")+
  coord_flip() # ORDENAR eje de proteínas!!!!
```

#### list

```{r, echo=FALSE}
x %>% as.data.frame() %>% 
  mutate(Gene.ID=as.character(Gene.ID)) %>% 
  inner_join(ln, by="Gene.ID") %>% 
  inner_join(biobroom::tidy.ExpressionSet(eset,addPheno = TRUE) %>% 
               group_by(gene,phenotype) %>% summarise(mean_group=mean(value)) %>% 
               ungroup() %>% 
               spread(phenotype,mean_group) %>% 
               rename(Gene.ID=gene),
             by="Gene.ID") %>% 
  inner_join(td %>% 
               arrange(desc(t)) %>% 
               mutate(t.order=seq(n())) %>% 
               select(Gene.ID,AveExpr,P.Value,adj.P.Val,t.order),
             by="Gene.ID") #%>% slice(1:10)
```

### both

#### heatmap

```{r, fig.height=7, fig.width=6, fig.align="center"}
eset <- ExpressionSet(assayData = exprs_norm, 
                      phenoData = pheno_d)

s <- tv %>% 
  filter(adj.P.Val<0.05 & logFC>1 & Asymptomatic>1) %>% 
  arrange(desc(Asymptomatic)) %>% 
  select(Gene.ID) %>% as.matrix() 

x <- tv %>% 
  #arrange(desc(t)) %>% 
  arrange(desc(AveExpr)) %>% 
  #filter(B>0 #& adj.P.Val<1e-4
  #       ) %>% 
  top_n(10,AveExpr) %>% 
  arrange(desc(Asymptomatic)) %>% # If diff order at boxplot is achieved, the deactivate this line.
  select(Gene.ID) %>% as.matrix() 

pData(eset) <- biobroom::tidy.ExpressionSet(eset,addPheno = TRUE) %>% 
  group_by(sample) %>% summarise(sample_median=median(value)) %>% 
  full_join(pData(eset) %>% 
              rownames_to_column() %>% 
              rename(sample=rowname),
            by="sample") %>% 
  arrange(phenotype,sample_median) %>%
  as.data.frame() %>% 
  column_to_rownames(var="sample")

y <- pData(eset) %>% rownames_to_column() %>% 
  select(rowname) %>% as.matrix() %>% as.character()

eset <- eset[c(x,s),y]

aheatmap(exprs(eset), 
         Rowv = NA, Colv = NA, 
         annCol = pData(eset), 
         layout = "_")
```

#### boxplot

```{r, fig.width=6,fig.height=7.5, echo=TRUE}
#repro
biobroom::tidy.ExpressionSet(eset,addPheno = TRUE) %>% #count(gene)
  full_join(biobroom::tidy.ExpressionSet(eset,addPheno = TRUE) %>% #count(gene)
              group_by(gene,phenotype) %>% summarise(mean_group=mean(value)) %>% 
              ungroup() %>% 
              spread(phenotype,mean_group) %>% 
              arrange(desc(Asymptomatic)) %>% 
              select(-Symptomatic),
            by="gene") %>% 
  full_join(x %>% as.data.frame() %>% 
              mutate(group="Top 10") %>% 
              rename(gene="Gene.ID") %>% 
              mutate(gene=as.character(gene)),
            by="gene") %>% 
  full_join(s %>% as.data.frame() %>% 
              mutate(group="Differentially*") %>% 
              rename(gene="Gene.ID") %>% 
              mutate(gene=as.character(gene)),
            by="gene") %>% 
  unite(group, group.x, group.y) %>% 
  mutate(group=stringr::str_replace(group,"NA_(.)", "\\1")) %>% 
  mutate(group=stringr::str_replace(group,"(.)_NA", "\\1")) %>% 
  mutate(group=forcats::fct_relevel(group,"Top 10")) %>% 
  ggplot(aes(reorder(gene,Asymptomatic),value,fill=phenotype))+
  geom_boxplot()+
  #geom_point() +
  facet_grid(group~.,scales = "free", space = "free") +#
  ylab("log2-transformed normalized MFI")+
  xlab("antigens")+
  labs(title="Highly and differentially reactive antigens",
       subtitle="All sorted w.r.t. Asymptomatic mean Ab reactivity",
       caption="*filtered by adj.P.Val<0.05 and logFC>1")+
  coord_flip() # ORDENAR eje de proteínas!!!!
```

### diff reactive (paper)

- Los autores reportan todos los antígenos con __valor p ajustado menor a 0.05__
- Dada la variabilidad entre las muestras por grupo, un algoritmo de clasificación no es la estrategia más adecuada.

#### heatmap

```{r, eval=TRUE, echo=FALSE}
#### heatmap + hierachical clustering
eset <- ExpressionSet(assayData = exprs_norm, 
                      phenoData = pheno_d)
```

```{r, fig.height=7, fig.width=10, fig.align="center", eval=TRUE, echo=FALSE}
x <- td %>% 
  #arrange(desc(t)) %>% 
  #arrange(desc(AveExpr)) %>% 
  filter(adj.P.Val<0.05#B>0 #& adj.P.Val<1e-4
         ) %>% 
  inner_join(biobroom::tidy.ExpressionSet(eset,addPheno = TRUE) %>% 
               group_by(gene,phenotype) %>% summarise(mean_group=mean(value)) %>% 
               ungroup() %>% 
               spread(phenotype,mean_group) %>% 
               arrange(desc(Asymptomatic)) %>% 
               rename(Gene.ID=gene),
             by="Gene.ID") %>% 
  arrange(desc(Asymptomatic)) %>% 
  filter(Symptomatic>0) %>% #top_n(30,Asymptomatic)
  select(Gene.ID) %>% as.matrix() 
#x %>% dim()

pData(eset) <- biobroom::tidy.ExpressionSet(eset,addPheno = TRUE) %>% 
  group_by(sample) %>% summarise(sample_median=median(value)) %>% 
  full_join(pData(eset) %>% 
              rownames_to_column() %>% 
              rename(sample=rowname),
            by="sample") %>% 
  arrange(phenotype,sample_median) %>%
  as.data.frame() %>% 
  column_to_rownames(var="sample")

y <- pData(eset) %>% rownames_to_column() %>% 
  select(rowname) %>% as.matrix() %>% as.character()

eset <- eset[x,y]

#aheatmap(exprs(eset)[x,], 
#         Rowv = TRUE, Colv = NA, 
#         annCol = pData(eset), 
#         layout = "_")
```

```{r, fig.width=8,fig.height=5, eval=FALSE, echo=FALSE}
#### boxplots
#- show the intensity distributions of selected features in both groups.

# ¿HOW TO EXTRACT HE CLUSTERING ORDER?

#repro
biobroom::tidy.ExpressionSet(eset,addPheno = TRUE) %>% #str()
  ggplot(aes(reorder(gene,value,order = TRUE),value,fill=phenotype))+
  geom_boxplot()+
  #geom_point() +
  ylab("log2-transformed normalized MFI")+
  xlab("antigens")+
  labs(title="Differentially reactive antigens",
       subtitle="filtered by B>0 (~adj.P.Val<0.01), and sorted w.r.t. median Ab reactivity")+
  coord_flip() # ORDENAR eje de proteínas!!!!
```


```{r, fig.height=12, fig.width=8, fig.align="center"}
aheatmap(exprs(eset)[x,], 
         Rowv = NA, Colv = NA, 
         annCol = pData(eset), 
         layout = "_")
```

#### boxplots

- Una mejor estrategia de visualización son los diagramas de cajas, los cuales grafican la distribución total de las observaciones.

```{r, fig.width=8,fig.height=11}
#repro
biobroom::tidy.ExpressionSet(eset,addPheno = TRUE) %>% #str()
  mutate(order=seq(from=n(),to=1)) %>% 
  ggplot(aes(reorder(gene,order,order=TRUE),value,fill=phenotype))+
  geom_boxplot()+
  #geom_point() +
  ylab("log2-transformed normalized MFI")+
  xlab("antigens")+
  labs(title="Differentially reactive antigens",
       subtitle="filtered by adj.P.Value<0.05 and sorted w.r.t. Asymptomatic median Ab reactivity")+
  coord_flip() # ORDENAR eje de proteínas!!!!
```

#### list

```{r}
x %>% as.data.frame() %>% 
  mutate(Gene.ID=as.character(Gene.ID)) %>% 
  inner_join(ln, by="Gene.ID") %>% 
  inner_join(biobroom::tidy.ExpressionSet(eset,addPheno = TRUE) %>% 
               group_by(gene,phenotype) %>% summarise(mean_group=mean(value)) %>% 
               ungroup() %>% 
               spread(phenotype,mean_group) %>% 
               rename(Gene.ID=gene),
             by="Gene.ID") %>% 
    inner_join(td %>% 
               arrange(desc(t)) %>% 
               mutate(t.order=seq(n())) %>% 
               select(Gene.ID,P.Value,adj.P.Val,t.order),
             by="Gene.ID") #%>% slice(1:8)
```

### B-stat >0

- Here, all features are filtered by __positive log-odds__ that the gene is differentially expressed (B=0 ~ 50% probability a gene is differentially expressed).
- __B-statistic__ is adjusted for multiple testing by assuming that 1% of the genes (modifiable parameter) are expected to be differentially expressed.
- Depends on normality (same as p-values), and independence between genes (same as BH control of FDR), but require in addition a prior guess for the proportion of differentially expressed probes.
- __p-values__ may be preferred to the B-statistics because they do not require this prior knowledge.

#### sorted by significance

```{r, eval=TRUE, echo=FALSE}
##### heatmap
eset <- ExpressionSet(assayData = exprs_norm, 
                      phenoData = pheno_d)
```

```{r, fig.height=7, fig.width=10, fig.align="center", echo=FALSE}
#eset <- eset#leukemiasEset
#exprs(eset) %>% dim()
x <- td %>% 
  arrange(desc(t)) %>% 
  #arrange(desc(AveExpr)) %>% 
  filter(B>0 #& adj.P.Val<1e-4
         ) %>% 
  select(Gene.ID) %>% as.matrix() 
#x %>% dim()

pData(eset) <- biobroom::tidy.ExpressionSet(eset,addPheno = TRUE) %>% 
  group_by(sample) %>% summarise(sample_median=median(value)) %>% 
  full_join(pData(eset) %>% 
              rownames_to_column() %>% 
              rename(sample=rowname),
            by="sample") %>% 
  arrange(phenotype,sample_median) %>%
  as.data.frame() %>% 
  column_to_rownames(var="sample")

y <- pData(eset) %>% rownames_to_column() %>% 
  select(rowname) %>% as.matrix() %>% as.character()

eset <- eset[x,y]

#aheatmap(exprs(eset)[x,], 
#         Rowv = NA, Colv = NA, 
#         annCol = pData(eset), 
#         layout = "_")
```

##### boxplots

- show the intensity distributions of selected features in both groups.

```{r, fig.width=8,fig.height=5, echo=FALSE, eval=FALSE}
#repro
biobroom::tidy.ExpressionSet(eset,addPheno = TRUE) %>% #str()
  mutate(order=seq(from=n(),to=1)) %>% 
  ggplot(aes(reorder(gene,order,order=TRUE),value,fill=phenotype))+
  geom_boxplot()+
  #geom_point() +
  ylab("log2-transformed normalized MFI")+
  xlab("antigens")+
  labs(title="Differentially reactive antigens",
       subtitle="filtered by B>0 (~adj.P.Val<0.01), and sorted w.r.t. P.Value")+
  coord_flip() # ORDENAR eje de proteínas!!!!
```

##### list

```{r, echo=FALSE, eval=FALSE}
x %>% as.data.frame() %>% 
  mutate(Gene.ID=as.character(Gene.ID)) %>% 
  inner_join(ln, by="Gene.ID") %>% 
  inner_join(biobroom::tidy.ExpressionSet(eset,addPheno = TRUE) %>% 
               group_by(gene,phenotype) %>% summarise(mean_group=mean(value)) %>% 
               ungroup() %>% 
               spread(phenotype,mean_group) %>% 
               rename(Gene.ID=gene),
             by="Gene.ID") %>% 
    inner_join(td %>% 
               arrange(desc(t)) %>% 
               mutate(t.order=seq(n())) %>% 
               select(Gene.ID,P.Value,adj.P.Val,t.order),
             by="Gene.ID") #%>% slice(1:10)
```


#### sorted by average

```{r, eval=TRUE, echo=FALSE}
##### heatmap
eset <- ExpressionSet(assayData = exprs_norm, 
                      phenoData = pheno_d)
```

```{r, fig.height=7, fig.width=10, fig.align="center", echo=FALSE}
#eset <- eset#leukemiasEset
#exprs(eset) %>% dim()
x <- td %>% 
  #arrange(desc(t)) %>% 
  arrange(desc(AveExpr)) %>% 
  filter(B>0 #& adj.P.Val<1e-4
         ) %>% 
  select(Gene.ID) %>% as.matrix() 
#x %>% dim()

pData(eset) <- biobroom::tidy.ExpressionSet(eset,addPheno = TRUE) %>% 
  group_by(sample) %>% summarise(sample_median=median(value)) %>% 
  full_join(pData(eset) %>% 
              rownames_to_column() %>% 
              rename(sample=rowname),
            by="sample") %>% 
  arrange(phenotype,sample_median) %>%
  as.data.frame() %>% 
  column_to_rownames(var="sample")

y <- pData(eset) %>% rownames_to_column() %>% 
  select(rowname) %>% as.matrix() %>% as.character()

eset <- eset[x,y]

#aheatmap(exprs(eset)[x,], 
#         Rowv = NA, Colv = NA, 
#         annCol = pData(eset), 
#         layout = "_")
```

##### boxplots

- show the intensity distributions of selected features in both groups.

```{r, fig.width=8,fig.height=5, echo=FALSE, eval=FALSE}
#repro
biobroom::tidy.ExpressionSet(eset,addPheno = TRUE) %>% #str()
  ggplot(aes(reorder(gene,value,order = TRUE),value,fill=phenotype))+
  geom_boxplot()+
  #geom_point() +
  ylab("log2-transformed normalized MFI")+
  xlab("antigens")+
  labs(title="Differentially reactive antigens",
       subtitle="filtered by B>0 (~adj.P.Val<0.01), and sorted w.r.t. median Ab reactivity")+
  coord_flip() # ORDENAR eje de proteínas!!!!
```

##### list

```{r, echo=FALSE, eval=FALSE}
x %>% as.data.frame() %>% 
  mutate(Gene.ID=as.character(Gene.ID)) %>% 
  inner_join(ln, by="Gene.ID") %>% 
  inner_join(biobroom::tidy.ExpressionSet(eset,addPheno = TRUE) %>% 
               group_by(gene,phenotype) %>% summarise(mean_group=mean(value)) %>% 
               ungroup() %>% 
               spread(phenotype,mean_group) %>% 
               rename(Gene.ID=gene),
             by="Gene.ID") %>% 
    inner_join(td %>% 
               arrange(desc(t)) %>% 
               mutate(t.order=seq(n())) %>% 
               select(Gene.ID,P.Value,adj.P.Val,t.order),
             by="Gene.ID") #%>% slice(1:10)
```

## Conclusión: aplica el `tidyverse` antes y después de `Bioconductor`

- Al igual que en los casos presentados en los tutoriales anteriores, el dialecto del `tidyverse` facilitó tanto el ingreso de data para la ejecución de modelos en `Bioconductor`, como para "limpiar" sus resultados y generar en forma intuitiva visualizaciones que ayuden a su interpretación.

- El análisis presentado sigue el flujo de trabajo propuesto por David Robinson:

![workflow](figure/tidy_01.jpg)

## EXTRA TOOL: `biobroom` y `ExpressionSet_tidiers`

- `biobroom` incluye "tidiers" para varias estructuras de datos de Bioconductor. Revísalo [aquí](http://bioconductor.org/packages/release/bioc/vignettes/biobroom/inst/doc/biobroom_vignette.html).

- ¿Tarea? Ponerlo en práctica con sus ejemplos de RNA-seq :)

```{r}
#exprs(eset)
#pData(eset) %>% rownames_to_column() %>% arrange(phenotype)
biobroom::tidy.ExpressionSet(eset,addPheno = TRUE)
```

## Computer environment
```{r}
devtools::session_info()
```

## References