El presente tutorial está basado en la publicación del blog variance explained de David Robinson. El material ha sido actualizado y adaptado al objetivo del curso.
tidyverse
en la limpieza, exploración y visualización de datos genómicos.Nota: Todos los outputs deben verse como líneas corridas. Disminuir el zoom de ser necesario.
library(tidyverse)
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages -------------------------------------------------------------
filter(): dplyr, stats
lag(): dplyr, stats
Through the process of gene regulation, a cell can control which genes are transcribed from DNA to RNA- what we call being “expressed”. (If a gene is never turned into RNA, it may as well not be there at all). This provides a sort of “cellular switchboard” that can activate some systems and deactivate others, which can speed up or slow down growth, switch what nutrients are transported into or out of the cell, and respond to other stimuli. A gene expression microarray lets us measure how much of each gene is expressed in a particular condition. We can use this to figure out the function of a specific gene (based on when it turns on and off), or to get an overall picture of the cell’s activity.
Brauer 2008 used microarrays to test the effect of starvation and growth rate on baker’s yeast (S. cerevisiae, a popular model organism for studying molecular genomics because of its simplicity)1. Basically, if you give yeast plenty of nutrients (a rich media), except that you sharply restrict its supply of one nutrient, you can control the growth rate to whatever level you desire (we do this with a tool called a chemostat). For example, you could limit the yeast’s supply of glucose (sugar, which the cell metabolizes to get energy and carbon), of leucine (an essential amino acid), or of ammonium (a source of nitrogen).
“Starving” the yeast of these nutrients lets us find genes that:
Sounds pretty cool, right? So let’s get started!
original_data <- readRDS("data-raw/tidymicro.rds")
Each of those columns like G0.05
, N0.3
and so on represents gene expression values for that sample, as measured by the microarray. The column titles show the condition: G0.05
, for instance, means the limiting nutrient was glucose and the growth rate was .05. A higher value means the gene was more expressed in that sample, lower means the gene was less expressed. In total the yeast was grown with six limiting nutrients and six growth rates, which makes 36 samples, and therefore 36 columns, of gene expression data.
Diseño experimental:
Columnas:
dim(original_data)
[1] 5537 40
head(original_data)
glimpse(original_data)
Observations: 5,537
Variables: 40
$ GID <chr> "GENE1331X", "GENE4924X", "GENE4690X", "GENE1177X", "GENE511X", "GENE...
$ YORF <chr> "A_06_P5820", "A_06_P5866", "A_06_P1834", "A_06_P4928", "A_06_P5620",...
$ NAME <chr> "SFB2 || ER to Golgi transport || molecular function unknown ||...
$ GWEIGHT <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
$ G0.05 <dbl> -0.24, 0.28, -0.02, -0.33, 0.05, -0.69, -0.55, -0.75, -0.24, -0.16, -...
$ G0.1 <dbl> -0.13, 0.13, -0.27, -0.41, 0.02, -0.03, -0.30, -0.12, -0.22, -0.38, -...
$ G0.15 <dbl> -0.21, -0.40, -0.27, -0.24, 0.40, 0.23, -0.12, -0.07, 0.14, 0.05, 0.2...
$ G0.2 <dbl> -0.15, -0.48, -0.02, -0.03, 0.34, 0.20, -0.03, 0.02, 0.06, 0.14, 0.18...
$ G0.25 <dbl> -0.05, -0.11, 0.24, -0.03, -0.13, 0.00, -0.16, -0.32, 0.00, -0.04, 0....
$ G0.3 <dbl> -0.05, 0.17, 0.25, 0.00, -0.14, -0.27, -0.11, -0.41, -0.13, -0.01, -0...
$ N0.05 <dbl> 0.20, 0.31, 0.23, 0.20, -0.35, 0.17, 0.04, 0.11, 0.30, 0.39, 0.26, -0...
$ N0.1 <dbl> 0.24, 0.00, 0.06, -0.25, -0.09, -0.40, 0.00, -0.16, 0.07, 0.20, 0.15,...
$ N0.15 <dbl> -0.20, -0.63, -0.66, -0.49, -0.08, -0.54, -0.63, -0.26, -0.30, 0.27, ...
$ N0.2 <dbl> -0.42, -0.44, -0.40, -0.49, -0.58, -1.19, -0.51, -0.42, -0.01, 0.19, ...
$ N0.25 <dbl> -0.14, -0.26, -0.46, -0.43, -0.14, -0.42, -0.37, 0.18, 0.15, 0.20, 0....
$ N0.3 <dbl> 0.09, 0.21, -0.43, -0.26, -0.12, 1.89, -0.24, 0.13, 0.13, 0.06, -0.08...
$ P0.05 <dbl> -0.26, -0.09, 0.18, 0.05, -0.16, -0.32, -0.35, -0.19, -0.26, -0.23, -...
$ P0.1 <dbl> -0.20, -0.04, 0.22, 0.04, 0.18, -0.06, -0.32, -0.25, -0.20, -0.20, -0...
$ P0.15 <dbl> -0.22, -0.10, 0.33, 0.03, 0.21, -0.62, -0.39, -0.25, -0.22, -0.07, -0...
$ P0.2 <dbl> -0.31, 0.15, 0.34, -0.04, 0.08, -0.50, -0.60, -0.47, -0.17, -0.13, -0...
$ P0.25 <dbl> 0.04, 0.20, 0.13, 0.08, 0.23, -0.37, -0.29, -0.24, -0.23, -0.14, -0.4...
$ P0.3 <dbl> 0.34, 0.63, 0.44, 0.21, -0.29, NA, -0.25, -0.49, -0.38, -0.42, -0.63,...
$ S0.05 <dbl> -0.51, 0.53, 1.29, 0.41, -0.70, NA, -0.14, 0.09, -0.35, -0.38, 0.25, ...
$ S0.1 <dbl> -0.12, 0.15, -0.32, -0.43, 0.05, -0.20, -0.50, 0.13, -0.14, -0.14, 0....
$ S0.15 <dbl> 0.09, -0.01, -0.47, -0.21, 0.10, -0.09, -0.19, 0.15, 0.10, 0.00, 0.25...
$ S0.2 <dbl> 0.09, 0.12, -0.50, -0.33, -0.07, 0.06, -0.13, -0.02, -0.04, -0.06, 0....
$ S0.25 <dbl> 0.20, -0.15, -0.42, -0.05, -0.10, -0.19, -0.01, 0.24, 0.22, 0.16, 0.2...
$ S0.3 <dbl> 0.08, 0.32, -0.33, -0.24, -0.32, -0.14, -0.04, -0.08, 0.02, -0.15, -0...
$ L0.05 <dbl> 0.18, 0.16, -0.30, -0.27, -0.59, -0.17, -0.02, -0.11, 0.12, -0.20, -0...
$ L0.1 <dbl> 0.18, 0.09, 0.02, -0.28, -0.13, -0.07, -0.05, -0.01, -0.01, -0.18, 0....
$ L0.15 <dbl> 0.13, 0.02, -0.07, -0.05, 0.00, 0.25, 0.27, 0.15, 0.17, 0.11, -0.04, ...
$ L0.2 <dbl> 0.20, 0.04, -0.05, 0.02, -0.11, -0.21, 0.24, 0.15, 0.07, 0.00, -0.13,...
$ L0.25 <dbl> 0.17, 0.03, -0.13, 0.00, 0.04, 0.12, 0.05, 0.00, 0.10, 0.02, -0.08, -...
$ L0.3 <dbl> 0.11, 0.01, -0.04, 0.08, 0.01, -0.11, 0.19, 0.03, 0.11, 0.09, 0.10, -...
$ U0.05 <dbl> -0.06, -1.02, -0.91, -0.53, -0.45, NA, 0.07, -0.40, 0.01, -0.26, -0.0...
$ U0.1 <dbl> -0.26, -0.91, -0.94, -0.51, -0.09, -0.65, -0.31, -0.02, -0.16, -0.13,...
$ U0.15 <dbl> -0.05, -0.59, -0.42, -0.26, -0.13, 0.09, -0.08, 0.26, 0.07, -0.10, 0....
$ U0.2 <dbl> -0.28, -0.61, -0.36, 0.05, 0.02, 0.06, 0.12, 0.31, 0.20, 0.07, 0.02, ...
$ U0.25 <dbl> -0.19, -0.17, -0.49, -0.14, -0.09, -0.07, 0.05, 0.14, 0.02, -0.04, -0...
$ U0.3 <dbl> 0.09, 0.18, -0.47, -0.01, -0.03, -0.10, 0.06, 0.11, 0.10, -0.12, -0.2...
Usando dplyr
y tidyr
Una data “limpia” o “Tidy data” o sigue las siguientes reglas:
¿Qué no está limpio (“untidy”) en la data?
Column headers are values, not variable names. Our column names contain the values of two variables: nutrient (G, N, P, etc) and growth rate (0.05-0.3). For this reason, we end up with not one observation per row, but 36! This is a very common issue in biological datasets: you often see one-row-per-gene and one-column-per-sample, rather than one-row-per-gene-per-sample.
Multiple variables are stored in one column. The NAME column contains lots of information, split up by ||
’s. If we examine one of the names, it looks like
SFB2 || ER to Golgi transport || molecular function unknown || YNL049C || 1082129
which seems to have both some systematic IDs and some biological information about the gene. If we’re going to use this programmatically, we need to split up the information into multiple columns.
original_data$NAME[1:2]
[1] "SFB2 || ER to Golgi transport || molecular function unknown || YNL049C || 1082129"
[2] " || biological process unknown || molecular function unknown || YNL095C || 1086222"
SFB2 || ER to Golgi transport || molecular function unknown || YNL049C || 1082129
|| biological process unknown || molecular function unknown || YNL095C || 1086222
The details of each of these fields isn’t annotated in the paper, but we can figure out most of it. It contains:
Having all give of these in the same column is very inconvenient. For example, if I have another dataset with information about each gene, I can’t merge the two. Luckily, the tidyr
package provides the separate
function for exactly this case.
library(dplyr)
library(tidyr)
cleaned_data <- original_data %>%
separate(NAME, c("name", "BP", "MF", "systematic_name", "number"), sep = "\\|\\|")
head(cleaned_data)
Two more things. First, when we split by ||, we ended up with whitespace at the start and end of some of the columns, which is inconvenient:
head(cleaned_data$BP)
[1] " ER to Golgi transport " " biological process unknown "
[3] " proteolysis and peptidolysis " " mRNA polyadenylylation* "
[5] " vesicle fusion* " " biological process unknown "
ER to Golgi transport
biological process unknown
proteolysis and peptidolysis
mRNA polyadenylylation*
vesicle fusion*
biological process unknown
We’ll solve that with dplyr’s mutate_each
, along with the built-in trimws
(“trim whitespace”) function.
cleaned_data <- original_data %>%
separate(NAME, c("name", "BP", "MF", "systematic_name", "number"), sep = "\\|\\|") %>%
mutate_at(vars(name:systematic_name), funs(trimws))
head(cleaned_data$BP)
[1] "ER to Golgi transport" "biological process unknown"
[3] "proteolysis and peptidolysis" "mRNA polyadenylylation*"
[5] "vesicle fusion*" "biological process unknown"
ER to Golgi transport
biological process unknown
proteolysis and peptidolysis
mRNA polyadenylylation*
vesicle fusion*
biological process unknown
Finally, we don’t even know what the number column represents (if you can figure it out, let me know!) And while we’re at it, we’re not going to use the GID, YORF or GWEIGHT columns in this analysis either. We may as well drop them, which we can do with dplyr’s select
.
cleaned_data <- original_data %>%
separate(NAME, c("name", "BP", "MF", "systematic_name", "number"), sep = "\\|\\|") %>%
mutate_at(vars(name:systematic_name), funs(trimws)) %>%
select(-number, -GID, -YORF, -GWEIGHT)
head(cleaned_data)
Let’s take a closer look at all those column headers like G0.05
, N0.2
and P0.15
.
The rules of tidy data specify that each variable forms one column, and this is not even remotely the case- we have 36 columns when we should have 3. That means our data is trapped in our column names. If you don’t see why this is a problem, consider: how would you put growth rate on the x-axis of a graph? How would you filter to look only the glucose condition?
Luckily, the tidyr
package has a solution ready. The documentation for gather
notes (emphasis mine):
Gather takes multiple columns and collapses into key-value pairs, duplicating all other columns as needed. You use gather() when you notice that you have columns that are not variables.
Hey, that’s us! So let’s apply gather as our next step:
OJO: Experimentar con el comando dplyr::count(sample)
luego de aplicar gather()
para confirmar la frecuencia de datos por observación.
cleaned_data <- original_data %>%
separate(NAME, c("name", "BP", "MF", "systematic_name", "number"), sep = "\\|\\|") %>%
mutate_at(vars(name:systematic_name), funs(trimws)) %>%
select(-number, -GID, -YORF, -GWEIGHT) %>%
gather(sample, expression, G0.05:U0.3)
#cleaned_data
cleaned_data %>% dplyr::count(sample)
Notice that the dataset no longer consists of one-row-per-gene: it’s one-row-per-gene-per-sample. This has previously been called “melting” a dataset, or turning it into “long” format. But I like the term “gather”: it shows that we’re taking these 36 columns and pulling them together.
One last problem. That sample column really contains two variables, nutrient and rate. We already learned what to do when we have two variables in one column: use separate
:
library(dplyr)
library(tidyr)
cleaned_data <- original_data %>%
separate(NAME, c("name", "BP", "MF", "systematic_name", "number"), sep = "\\|\\|") %>%
mutate_at(vars(name:systematic_name), funs(trimws)) %>%
select(-number, -GID, -YORF, -GWEIGHT) %>%
gather(sample, expression, G0.05:U0.3) %>% #dplyr::count(sample)
separate(sample, c("nutrient", "rate"), sep = 1, convert = TRUE)
head(cleaned_data)
This time, instead of telling separate to split the strings based on a particular delimiter, we told it to separate it after the first character (that is, after G/P/S/N/L/U). We also told it convert = TRUE
to tell it that it should notice the 0.05/0.1/etc value is a number and convert it.
Take a look at those six lines of code, a mini-sonnet of data cleaning. Doesn’t it read less like code and more like instructions? (“First we separated the NAME column into its five parts, and trimmed each. We selected out columns we didn’t need…”) That’s the beauty of the %>%
operator and the dplyr/tidyr verbs.
¿Por qué limpiar la data?
So we went through this effort to get this dataset into this structure, and you’re probably wondering why. In particular, why did we have to bother gathering those expression columns into one-row-per-gene-per-sample?
Well, suppose we have a single yeast gene we’re interested in. Let’s say LEU1, a gene involved in the leucine synthesis pathway.
cleaned_data %>%
filter(name == "LEU1") %>%
glimpse()
Observations: 36
Variables: 7
$ name <chr> "LEU1", "LEU1", "LEU1", "LEU1", "LEU1", "LEU1", "LEU1", "LEU1...
$ BP <chr> "leucine biosynthesis", "leucine biosynthesis", "leucine bios...
$ MF <chr> "3-isopropylmalate dehydratase activity", "3-isopropylmalate ...
$ systematic_name <chr> "YGL009C", "YGL009C", "YGL009C", "YGL009C", "YGL009C", "YGL00...
$ nutrient <chr> "G", "G", "G", "G", "G", "G", "N", "N", "N", "N", "N", "N", "...
$ rate <dbl> 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.05, 0.10, 0.15, 0.20, 0...
$ expression <dbl> -1.12, -0.77, -0.67, -0.59, -0.20, 0.03, -0.76, -1.17, -1.20,...
We now have 36 data points (six conditions, six growth rates), and for each we have a limiting nutrient, a growth rate, and the resulting expresion. We’re probably interested in how both the growth rate and the limiting nutrient affect the gene’s expression.
36 points is too many to look at manually. So it’s time to bring in some visualization. To do that, we simply pipe the results of our filtering right into ggplot2
:
cleaned_data %>%
filter(name=="LEU1") %>%
ggplot(aes(rate,expression,colour=nutrient))+
geom_line()
What a story this single gene tells! The gene’s expression is far higher (more “turned on”) when the cell is being starved of leucine than in any other condition, because in that case the cell has to synthesize its own leucine. And as the amount of leucine in the environment (the growth rate) increases, the cell can focus less on leucine production, and the expression of those genes go down. We’ve just gotten one snapshot of our gene’s regulatory network, and how it responds to external stimuli.
We don’t have to choose one gene to visualize- LEU1 is just one gene in the leucine biosynthesis process. Recall that we have that information in the BP column, so we can filter for all genes in that process, and then facet to create sub-plots for each.
cleaned_data %>%
filter(BP == "leucine biosynthesis") %>%
ggplot(aes(rate, expression, color = nutrient)) +
geom_line() +
facet_wrap(~name)
LEU1, LEU2, and LEU4 all show a similar pattern, where starvation of leucine causes higher gene expression. (Interestingly, LEU4 responds to glucose starvation as well. Any geneticists have some idea why?). LEU9 is a little more ambiguous but is still highest expressed under leucine starvation. We already know what these genes do, but this hints at how we might be able to find other genes that are involved in leucine synthesis, including ones we don’t yet know.
Let’s play with graph a little more. These trends look vaguely linear. Maybe we should show best points with best fit lines instead:
cleaned_data %>%
filter(BP == "leucine biosynthesis") %>%
ggplot(aes(rate, expression, color = nutrient)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
facet_wrap(~name)
cleaned_data %>%
filter(BP == "sulfur metabolism") %>%
ggplot(aes(rate, expression, color = nutrient)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
facet_wrap(~name + systematic_name, scales = "free_y")
Warning: Removed 3 rows containing non-finite values (stat_smooth).
Warning: Removed 3 rows containing missing values (geom_point).
(Notice that we have to facet by the systematic_name here, since not all genes in this process have traditional names).
I earlier pointed to a list of available workflows from Bioconductor
, which teach ways to analyze many kinds of genomic data using packages specialized for those purposes. These kinds of guides are incredibly valuable, and Bioconductor
has built up an excellent set of tools for analyzing genomic data, many of which come with their own data processing and visualization methods.
So why bother teaching a dplyr/ggplot2 approach? Because these tools are useful everywhere. Consider the dozen lines of code we used to clean and visualize our data:
library(dplyr)
library(tidyr)
cleaned_data <- original_data %>%
separate(NAME, c("name", "BP", "MF", "systematic_name", "number"), sep = "\\|\\|") %>%
mutate_at(vars(name:systematic_name), funs(trimws)) %>%
select(-number, -GID, -YORF, -GWEIGHT) %>%
gather(sample, expression, G0.05:U0.3) %>% #dplyr::count(sample)
separate(sample, c("nutrient", "rate"), sep = 1, convert = TRUE)
cleaned_data %>%
filter(BP == "leucine biosynthesis") %>%
ggplot(aes(rate, expression, color = nutrient)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
facet_wrap(~name)
With this code we’re able to go from published data to a new conclusion about leucine biosynthesis (or sulfate metabolism, or whatever biological process or genes we’re interested in). But the functions we use aren’t specific to our dataset or data format: in fact, they’re not related to biology at all. Instead, these tools are building blocks, or “atoms,” in a grammar of data manipulation and visualization that applies to nearly every kind of data.
This isn’t meant to disparage Bioconductor in any way: scientists should use whatever tool gets their job done. But educators can and should focus on teaching tools that can be universally applied. In turn, students can take these tools and build new packages, for Bioconductor and elsewhere, that analyze novel forms of data.
Tres pasos más de limpieza:
¿Cómo revisar la presencia de NA?
# tipo de variable: character <chr>
m <- cleaned_data %>%
filter(systematic_name=="") %>%
dplyr::count()
# tipo de variable: double <dbl>
n <- cleaned_data %>%
filter(is.na(expression)) %>%
dplyr::count(expression)
systematic_name
posee 36 NA’sexpression
posee 866 NA’sEjecutar la limpieza
nutrient
systematic_names
con NA’s (porque no habría cómo identificar al gen)expression
con NA’slibrary(forcats)
cleaned_data <- original_data %>%
separate(NAME, c("name", "BP", "MF", "systematic_name", "number"), sep = "\\|\\|") %>%
mutate_at(vars(name:systematic_name), funs(trimws)) %>%
select(-number, -GID, -YORF, -GWEIGHT) %>%
gather(sample, expression, G0.05:U0.3) %>% #dplyr::count(sample)
separate(sample, c("nutrient", "rate"), sep = 1, convert = TRUE) %>%
mutate(nutrient=forcats::fct_recode(nutrient,
"Glucose" = "G", "Leucine" = "L",
"Phosphate" = "P", "Sulfate" = "S",
"Ammonia" = "N", "Uracil" = "U")) %>%
filter(!is.na(expression),systematic_name!="") %>%
glimpse()
Observations: 198,430
Variables: 7
$ name <chr> "SFB2", "", "QRI7", "CFT2", "SSO2", "PSP2", "RIB2", "VMA13", ...
$ BP <chr> "ER to Golgi transport", "biological process unknown", "prote...
$ MF <chr> "molecular function unknown", "molecular function unknown", "...
$ systematic_name <chr> "YNL049C", "YNL095C", "YDL104C", "YLR115W", "YMR183C", "YML01...
$ nutrient <fctr> Glucose, Glucose, Glucose, Glucose, Glucose, Glucose, Glucos...
$ rate <dbl> 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0...
$ expression <dbl> -0.24, 0.28, -0.02, -0.33, 0.05, -0.69, -0.55, -0.75, -0.24, ...
¿Todos los genes tienen valores de expresión?
y <- original_data %>%
separate(NAME, c("name", "BP", "MF", "systematic_name", "number"), sep = "\\|\\|") %>%
mutate_at(vars(name:systematic_name), funs(trimws)) %>%
select(-number, -GID, -YORF, -GWEIGHT) %>%
gather(sample, expression, G0.05:U0.3) %>% #dplyr::count(sample)
separate(sample, c("nutrient", "rate"), sep = 1, convert = TRUE) %>%
mutate(nutrient=forcats::fct_recode(nutrient,
"Glucose" = "G", "Leucine" = "L",
"Phosphate" = "P", "Sulfate" = "S",
"Ammonia" = "N", "Uracil" = "U")) %>%
filter(!is.na(expression),systematic_name!="") %>%
#filter(systematic_name=="Q0140", nutrient=="Phosphate") # ejemplo con solo dos lecturas de gen x nutriente
group_by(systematic_name, nutrient) %>% dplyr::count() %>%
ungroup() %>% count(n) %>% filter(n<6) #%>% summarise(sum(nn))
rate
para las combinaciones entre systematic_name
(gen) y nutrient
(nutriente limitante).Limpieza final
cleaned_data <- original_data %>%
separate(NAME, c("name", "BP", "MF", "systematic_name", "number"), sep = "\\|\\|") %>%
mutate_at(vars(name:systematic_name), funs(trimws)) %>%
select(-number, -GID, -YORF, -GWEIGHT) %>%
gather(sample, expression, G0.05:U0.3) %>% #dplyr::count(sample)
separate(sample, c("nutrient", "rate"), sep = 1, convert = TRUE) %>%
mutate(nutrient=forcats::fct_recode(nutrient,
"Glucose" = "G", "Leucine" = "L",
"Phosphate" = "P", "Sulfate" = "S",
"Ammonia" = "N", "Uracil" = "U")) %>%
filter(!is.na(expression),systematic_name!="") %>%
group_by(systematic_name, nutrient) %>%
filter(n()==6) %>%
ungroup() %>%
glimpse()
Observations: 195,222
Variables: 7
$ name <chr> "SFB2", "", "QRI7", "CFT2", "SSO2", "PSP2", "RIB2", "VMA13", ...
$ BP <chr> "ER to Golgi transport", "biological process unknown", "prote...
$ MF <chr> "molecular function unknown", "molecular function unknown", "...
$ systematic_name <chr> "YNL049C", "YNL095C", "YDL104C", "YLR115W", "YMR183C", "YML01...
$ nutrient <fctr> Glucose, Glucose, Glucose, Glucose, Glucose, Glucose, Glucos...
$ rate <dbl> 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0...
$ expression <dbl> -0.24, 0.28, -0.02, -0.33, 0.05, -0.69, -0.55, -0.75, -0.24, ...
Tidying the data in this way lets us make graphs like this
cleaned_data %>%
filter(BP == "leucine biosynthesis") %>%
ggplot(aes(rate, expression, color = nutrient)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
facet_wrap(~name + systematic_name)
For starters, let’s wrap this useful graph into a function, so that we can make it easily in the rest of the post.
plot_expression_data <- function(expression_data) {
ggplot(expression_data, aes(rate, expression, color = nutrient)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
facet_wrap(~name + systematic_name, scales = "free_y")
}
At which point we would rewrite the above graph like:
cleaned_data %>%
filter(BP == "leucine biosynthesis") %>%
plot_expression_data()
This is a great way to visualize a few genes at a time. But there are so many genes in the dataset. For example, let’s instead filter by the biological process cell wall organization and biogenesis.
cleaned_data %>%
filter(BP == "cell wall organization and biogenesis") %>%
plot_expression_data()
OK, that’s 36 genes and it’s already getting a lot harder to understand these plots. And we have 5500 genes in this dataset: no way can we visually interpret all those genes at once. This is where we introduce modeling.
devtools::session_info()
Session info ----------------------------------------------------------------------------
setting value
version R version 3.4.1 (2017-06-30)
system x86_64, linux-gnu
ui X11
language en_US
collate en_US.UTF-8
tz America/Lima
date 2017-08-03
Packages --------------------------------------------------------------------------------
package * version date source
assertthat 0.2.0 2017-04-11 CRAN (R 3.4.0)
backports 1.1.0 2017-05-22 CRAN (R 3.4.1)
base * 3.4.1 2017-07-08 local
base64enc 0.1-3 2015-07-28 CRAN (R 3.4.0)
bindr 0.1 2016-11-13 cran (@0.1)
bindrcpp * 0.2 2017-06-17 CRAN (R 3.4.1)
broom 0.4.2 2017-02-13 CRAN (R 3.4.0)
cellranger 1.1.0 2016-07-27 CRAN (R 3.4.0)
colorspace 1.3-2 2016-12-14 CRAN (R 3.4.0)
compiler 3.4.1 2017-07-08 local
datasets * 3.4.1 2017-07-08 local
devtools 1.13.2 2017-06-02 CRAN (R 3.4.1)
digest 0.6.12 2017-01-27 CRAN (R 3.4.0)
dplyr * 0.7.2 2017-07-20 CRAN (R 3.4.1)
evaluate 0.10.1 2017-06-24 CRAN (R 3.4.1)
forcats * 0.2.0 2017-01-23 CRAN (R 3.4.0)
foreign 0.8-69 2017-06-21 CRAN (R 3.4.1)
ggplot2 * 2.2.1 2016-12-30 CRAN (R 3.4.0)
glue 1.1.1 2017-06-21 CRAN (R 3.4.1)
graphics * 3.4.1 2017-07-08 local
grDevices * 3.4.1 2017-07-08 local
grid 3.4.1 2017-07-08 local
gtable 0.2.0 2016-02-26 CRAN (R 3.4.0)
haven 1.1.0 2017-07-09 CRAN (R 3.4.1)
hms 0.3 2016-11-22 CRAN (R 3.4.0)
htmltools 0.3.6 2017-04-28 CRAN (R 3.4.0)
httr 1.2.1 2016-07-03 CRAN (R 3.4.0)
jsonlite 1.5 2017-06-01 cran (@1.5)
knitr 1.16 2017-05-18 cran (@1.16)
labeling 0.3 2014-08-23 CRAN (R 3.4.0)
lattice 0.20-35 2017-03-25 CRAN (R 3.3.3)
lazyeval 0.2.0 2016-06-12 CRAN (R 3.4.0)
lubridate 1.6.0 2016-09-13 CRAN (R 3.4.0)
magrittr 1.5 2014-11-22 CRAN (R 3.4.0)
memoise 1.1.0 2017-04-21 CRAN (R 3.4.0)
methods * 3.4.1 2017-07-08 local
mnormt 1.5-5 2016-10-15 CRAN (R 3.4.0)
modelr 0.1.1 2017-07-24 CRAN (R 3.4.1)
munsell 0.4.3 2016-02-13 CRAN (R 3.4.0)
nlme 3.1-131 2017-02-06 CRAN (R 3.4.0)
parallel 3.4.1 2017-07-08 local
pkgconfig 2.0.1 2017-03-21 cran (@2.0.1)
plyr 1.8.4 2016-06-08 CRAN (R 3.4.0)
psych 1.7.5 2017-05-03 CRAN (R 3.4.0)
purrr * 0.2.2.2 2017-05-11 cran (@0.2.2.2)
R6 2.2.2 2017-06-17 CRAN (R 3.4.1)
Rcpp 0.12.12 2017-07-15 CRAN (R 3.4.1)
readr * 1.1.1 2017-05-16 CRAN (R 3.4.1)
readxl 1.0.0 2017-04-18 CRAN (R 3.4.0)
reshape2 1.4.2 2016-10-22 CRAN (R 3.4.0)
rlang 0.1.1 2017-05-18 cran (@0.1.1)
rmarkdown 1.6 2017-06-15 CRAN (R 3.4.1)
rprojroot 1.2 2017-01-16 CRAN (R 3.4.0)
rvest 0.3.2 2016-06-17 CRAN (R 3.4.0)
scales 0.4.1 2016-11-09 CRAN (R 3.4.0)
stats * 3.4.1 2017-07-08 local
stringi 1.1.5 2017-04-07 CRAN (R 3.4.0)
stringr 1.2.0 2017-02-18 CRAN (R 3.4.0)
tibble * 1.3.3 2017-05-28 cran (@1.3.3)
tidyr * 0.6.3 2017-05-15 CRAN (R 3.4.1)
tidyverse * 1.1.1 2017-01-27 CRAN (R 3.4.0)
tools 3.4.1 2017-07-08 local
utils * 3.4.1 2017-07-08 local
withr 2.0.0 2017-07-28 CRAN (R 3.4.1)
xml2 1.1.1 2017-01-24 CRAN (R 3.4.0)
yaml 2.1.14 2016-11-12 CRAN (R 3.4.0)