Data cleaning and visualization: tidyverse

This tutorial is based on a post by Michael Levy and a post by Bradley Boehmke. The material has been adapted to fit the goals of this course.

For more detail, see Hadley Wickham's book.

Objective

Introduce the tools and style of the tidyverse for data cleaning.

Data analysis

Analysts tend to follow four fundamental processes to turn data into understanding, knowledge, and insight:

1. Data manipulation
2. Data visualization
3. Statistical analysis/modeling
4. Deployment of results

This tutorial will focus on data manipulation.

Data manipulation

It is often said that 80% of data analysis is spent on cleaning and preparing the data (Dasu and Johnson, 2003).

Well-structured data serves two purposes:

- Makes data suitable for software processing, whether that is mathematical functions, visualization, etc.
- Reveals information and insights

[Figure: data wrangling]

Tidy data

Put data in data frames:

- Each variable gets a column
- Each observation gets a row
- Each type of observation gets a data frame

[Figure: tidy data]

What is the tidyverse?

The tidyverse is a suite of R tools that follow a tidy philosophy.

Tidy APIs

Functions should be consistent and easily (human) readable:

- Take one step at a time
- Connect simple steps with the pipe
- Referential transparency

Okay but really, what is it?

A suite of ~20 packages that provide consistent, user-friendly, smart-default tools to do most of what most people do in R.

- Core packages: ggplot2, dplyr, tidyr, readr, purrr, tibble
- Specialized data manipulation: hms, stringr, lubridate, forcats
- Data import: DBI, haven, httr, jsonlite, readxl, rvest, xml2
- Modeling: modelr, broom

`install.packages("tidyverse")` installs all of the above packages.

`library(tidyverse)` attaches only the core packages.

[Figures: tidyverse; tidyverse functions]

Why tidyverse?

- Consistency
  - e.g. many functions take a data frame as their first argument, which makes piping possible
    - Faster to write
    - Easier to read
    - Easier to remember
  - Tidy data: imposes good practices
- Simple solutions to common problems (e.g. `tidyr::separate`)
- Runs fast (thanks to Rcpp)
- It is modular (in the "spirit" of the UNIX pipe `|`)

`tibble`

A modern reimagining of data frames.

tdf <- tibble(x = 1:1e4, y = rnorm(1e4))  # data_frame() is an older, now-deprecated alias
class(tdf)

## [1] "tbl_df"     "tbl"        "data.frame"

Tibbles print politely.

tdf

## # A tibble: 10,000 x 2
##        x          y
##    <int>      <dbl>
##  1     1  1.5769043
##  2     2 -0.7643222
##  3     3 -1.5141980
##  4     4 -2.2969305
##  5     5  1.1883936
##  6     6 -0.8345075
##  7     7  0.2408071
##  8     8 -0.4245211
##  9     9 -1.0505558
## 10    10  0.6696043
## # ... with 9,990 more rows

Tibbles have some convenient and consistent defaults that are different from base R data.frames.
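As a rough illustration of those defaults, the sketch below contrasts a base `data.frame` with a tibble on two behaviors: single-column subsetting with `[` and partial name matching with `$`. The object names `df` and `tbl` and the toy columns are made up for this example.

```r
library(tibble)

# Hypothetical toy data, built once as a data.frame and once as a tibble
df  <- data.frame(letter = c("a", "b"), number = 1:2)
tbl <- tibble(letter = c("a", "b"), number = 1:2)

# `[` with one column: a data.frame drops to a vector, a tibble stays a tibble
class(df[, "number"])
class(tbl[, "number"])

# `$` with an abbreviated name: a data.frame partially matches "number",
# a tibble returns NULL (with a warning) instead of guessing
df$num
suppressWarnings(tbl$num)
```

The stricter tibble behavior tends to surface typos in column names early instead of silently returning a different column.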

The pipe `%>%`

Sends the value of the left-hand side (LHS) expression to the first argument of the right-hand side (RHS) function.

sum(1:8) %>%
  sqrt()

## [1] 6

In RStudio, `%>%` can be inserted automatically with the shortcut Ctrl+Shift+M.

The advantage of the pipe becomes obvious when you need to apply several functions in a row.
For instance, suppose we want to
- filter some data,
- summarize it, and then
- order the summarized results.

We could write this in three ways.

Nested Option:

arrange(
  summarize(
    filter(data, variable == numeric_value),
    Total = sum(variable)
  ),
  desc(Total)
)

Multiple Object Option:

 a <- filter(data, variable == numeric_value)
 b <- summarise(a, Total = sum(variable))
 c <- arrange(b, desc(Total))

%>% Option:

data %>%
  filter(variable == numeric_value) %>%
  summarise(Total = sum(variable)) %>%
  arrange(desc(Total))

As your chain of operations gets longer, the `%>%` operator saves more typing and keeps the code legible.
In addition, the `%>%` operator lets you flow from data manipulation tasks straight into visualization functions (via ggplot2 and ggvis) and into many analytic functions.
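To make the comparison concrete, here is a minimal runnable sketch of the nested and piped styles on the built-in `mtcars` data. The choice of `mtcars`, the `cyl > 4` filter, and the `total_hp` column are illustrative assumptions, and a `group_by()` step is added so that the final sort has more than one row to order.

```r
library(dplyr)

# Nested form: read from the inside out
nested <- arrange(
  summarise(
    group_by(filter(mtcars, cyl > 4), cyl),
    total_hp = sum(hp)
  ),
  desc(total_hp)
)

# Piped form: the same steps, read top to bottom
piped <- mtcars %>%
  filter(cyl > 4) %>%
  group_by(cyl) %>%
  summarise(total_hp = sum(hp)) %>%
  arrange(desc(total_hp))

# Both forms should describe the same table
all.equal(nested, piped)
```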

`tidyr`

There are four fundamental functions of data tidying:

- `gather()` takes multiple columns and gathers them into key-value pairs: it makes "wide" data longer.
- `spread()` takes two columns (key and value) and spreads them into multiple columns: it makes "long" data wider.
- `separate()` splits a single column into multiple columns.
- `unite()` combines multiple columns into a single column.

`gather` and `spread`

Use `gather()` to make wide tables long and `spread()` to make long tables wide.
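In tidyr 1.0.0 and later, `gather()` and `spread()` are superseded by `pivot_longer()` and `pivot_wider()`; the older functions still work but are no longer developed. The sketch below shows the same round trip with the newer pair, assuming tidyr >= 1.0.0 is installed. The small `wide` table is typed in by hand from two rows of the `cases` output shown in this section, so it does not need EDAWR.

```r
library(tidyr)
library(tibble)

# Hand-typed subset of the `cases` table (two countries, two years)
wide <- tibble(
  country = c("DE", "FR"),
  `2011`  = c(5800, 7000),
  `2012`  = c(6000, 6900)
)

# wide -> long: counterpart of gather(key = year, value = n, -country)
long <- pivot_longer(wide, cols = -country, names_to = "year", values_to = "n")

# long -> wide: counterpart of spread(year, n)
back <- pivot_wider(long, names_from = year, values_from = n)

long
back
```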

[Figures: tidyr::gather; gather; spread]

Mini example

# EDAWR was installed from GitHub: devtools::install_github("rstudio/EDAWR")
library(EDAWR)
cases %>%
  tbl_df() %>%
  gather(key = year, value = n, -country) %>%
  spread(year, n)

## # A tibble: 3 x 4
##   country `2011` `2012` `2013`
## *   <chr>  <dbl>  <dbl>  <dbl>
## 1      DE   5800   6000   6200
## 2      FR   7000   6900   7000
## 3      US  15000  14000  13000

stocks <- data.frame(
  time = as.Date('2009-01-01') + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
)
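# long format: one row per (time, stock) pair
stocksm <- stocks %>% gather(stock, price, -time)
# optional check: stocksm %>% count(stock) should show 10 observations per stock
stocksm %>% spread(stock, price)  # back to one column per stock
stocksm %>% spread(time, price)   # or one column per date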

Large example

who  # Tuberculosis data from the WHO

## # A tibble: 7,240 x 60
##        country  iso2  iso3  year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544
##          <chr> <chr> <chr> <int>       <int>        <int>        <int>        <int>
##  1 Afghanistan    AF   AFG  1980          NA           NA           NA           NA
##  2 Afghanistan    AF   AFG  1981          NA           NA           NA           NA
##  3 Afghanistan    AF   AFG  1982          NA           NA           NA           NA
##  4 Afghanistan    AF   AFG  1983          NA           NA           NA           NA
##  5 Afghanistan    AF   AFG  1984          NA           NA           NA           NA
##  6 Afghanistan    AF   AFG  1985          NA           NA           NA           NA
##  7 Afghanistan    AF   AFG  1986          NA           NA           NA           NA
##  8 Afghanistan    AF   AFG  1987          NA           NA           NA           NA
##  9 Afghanistan    AF   AFG  1988          NA           NA           NA           NA
## 10 Afghanistan    AF   AFG  1989          NA           NA           NA           NA
## # ... with 7,230 more rows, and 52 more variables: new_sp_m4554 <int>,
## #   new_sp_m5564 <int>, new_sp_m65 <int>, new_sp_f014 <int>, new_sp_f1524 <int>,
## #   new_sp_f2534 <int>, new_sp_f3544 <int>, new_sp_f4554 <int>, new_sp_f5564 <int>,
## #   new_sp_f65 <int>, new_sn_m014 <int>, new_sn_m1524 <int>, new_sn_m2534 <int>,
## #   new_sn_m3544 <int>, new_sn_m4554 <int>, new_sn_m5564 <int>, new_sn_m65 <int>,
## #   new_sn_f014 <int>, new_sn_f1524 <int>, new_sn_f2534 <int>, new_sn_f3544 <int>,
## #   new_sn_f4554 <int>, new_sn_f5564 <int>, new_sn_f65 <int>, new_ep_m014 <int>,
## #   new_ep_m1524 <int>, new_ep_m2534 <int>, new_ep_m3544 <int>, new_ep_m4554 <int>,
## #   new_ep_m5564 <int>, new_ep_m65 <int>, new_ep_f014 <int>, new_ep_f1524 <int>,
## #   new_ep_f2534 <int>, new_ep_f3544 <int>, new_ep_f4554 <int>, new_ep_f5564 <int>,
## #   new_ep_f65 <int>, new_rel_m014 <int>, new_rel_m1524 <int>, new_rel_m2534 <int>,
## #   new_rel_m3544 <int>, new_rel_m4554 <int>, new_rel_m5564 <int>, new_rel_m65 <int>,
## #   new_rel_f014 <int>, new_rel_f1524 <int>, new_rel_f2534 <int>, new_rel_f3544 <int>,
## #   new_rel_f4554 <int>, new_rel_f5564 <int>, new_rel_f65 <int>

who %>%
  gather(group, cases, -country, -iso2, -iso3, -year)

## # A tibble: 405,440 x 6
##        country  iso2  iso3  year       group cases
##          <chr> <chr> <chr> <int>       <chr> <int>
##  1 Afghanistan    AF   AFG  1980 new_sp_m014    NA
##  2 Afghanistan    AF   AFG  1981 new_sp_m014    NA
##  3 Afghanistan    AF   AFG  1982 new_sp_m014    NA
##  4 Afghanistan    AF   AFG  1983 new_sp_m014    NA
##  5 Afghanistan    AF   AFG  1984 new_sp_m014    NA
##  6 Afghanistan    AF   AFG  1985 new_sp_m014    NA
##  7 Afghanistan    AF   AFG  1986 new_sp_m014    NA
##  8 Afghanistan    AF   AFG  1987 new_sp_m014    NA
##  9 Afghanistan    AF   AFG  1988 new_sp_m014    NA
## 10 Afghanistan    AF   AFG  1989 new_sp_m014    NA
## # ... with 405,430 more rows

`separate` and `unite`

[Figure: separate and unite]

Mini example

df <- data.frame(x = c(NA, "a.b", "a.d", "b.c"))
df %>% 
  tidyr::separate(x, c("A", "B")) %>%
  tidyr::unite(x, A, B, sep=".")

##       x
## 1 NA.NA
## 2   a.b
## 3   a.d
## 4   b.c

mtcars %>%
  tbl_df() %>%
  select(7:9) %>% 
  tidyr::unite(vs_am, vs, am) %>%
  tidyr::separate(vs_am, c("vs", "am"))

Large example

library(EDAWR)
storms %>%
  top_n(2,date) %>%
  separate(date, c("y", "m", "d")) %>%
  unite(date, y,m,d, sep="-")

## # A tibble: 2 x 4
##     storm  wind pressure       date
## *   <chr> <int>    <int>      <chr>
## 1 Alberto   110     1007 2000-08-03
## 2  Arlene    50     1010 1999-06-11

# extra
library(EDAWR)
pollution %>%
  tbl_df() %>%
  spread(size, amount) %>%
  gather(size, amount, -city) %>%
  arrange(desc(city))

## # A tibble: 6 x 3
##       city  size amount
##      <chr> <chr>  <dbl>
## 1 New York large     23
## 2 New York small     14
## 3   London large     22
## 4   London small     16
## 5  Beijing large    121
## 6  Beijing small     56

`dplyr`

Common data(frame) manipulation tasks.

There are seven fundamental functions of data transformation:

- `select()` selects variables
- `mutate()` creates new variables
- `filter()` filters observations
- `arrange()` reorders observations
- `group_by()` groups observations by categorical levels
- `summarise()` summarises observations with functions of your choice
- the join family (`left_join()`, `inner_join()`, `full_join()`, ...) joins separate data frames

`select`

select variables

iris %>%
  tbl_df() %>%
  select(Petal.Length, Petal.Width)

## # A tibble: 150 x 2
##    Petal.Length Petal.Width
##           <dbl>       <dbl>
##  1          1.4         0.2
##  2          1.4         0.2
##  3          1.3         0.2
##  4          1.5         0.2
##  5          1.4         0.2
##  6          1.7         0.4
##  7          1.4         0.3
##  8          1.5         0.2
##  9          1.4         0.2
## 10          1.5         0.1
## # ... with 140 more rows

# equivalent
iris %>%
  tbl_df() %>%
  select(3,4)

iris %>%
  tbl_df() %>%
  select(-Species)

iris %>%
  tbl_df() %>%
  select_if(is.factor)

Use the select helpers!

# ?select_helpers
iris %>%
  tbl_df() %>%
  select(starts_with("Petal"))
iris %>%
  tbl_df() %>%
  select(ends_with("Width"))
iris %>%
  tbl_df() %>%
  select(contains("etal"))
iris %>%
  tbl_df() %>%
  select(-matches(".t."))  # a leading minus negates the helper: drop columns matching the regex
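As a hedged complement to `select_if()`, the sketch below assumes dplyr >= 1.0.0, where the `where()` helper lets a predicate such as `is.numeric` be used directly inside `select()` and combined with the name-based helpers. The result names `numeric_cols` and `petal_or_factor` are made up for this example.

```r
library(dplyr)

# Keep every column for which the predicate returns TRUE
numeric_cols <- iris %>%
  as_tibble() %>%
  select(where(is.numeric))

# Helpers can be mixed: name-based and predicate-based selection together
petal_or_factor <- iris %>%
  as_tibble() %>%
  select(starts_with("Petal"), where(is.factor))

names(numeric_cols)
names(petal_or_factor)
```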

`mutate`

create new variables

mtcars %>%
  tbl_df() %>%
  select(1:3) %>% 
  mutate(gpm= 1/mpg)

## # A tibble: 32 x 4
##      mpg   cyl  disp        gpm
##    <dbl> <dbl> <dbl>      <dbl>
##  1  21.0     6 160.0 0.04761905
##  2  21.0     6 160.0 0.04761905
##  3  22.8     4 108.0 0.04385965
##  4  21.4     6 258.0 0.04672897
##  5  18.7     8 360.0 0.05347594
##  6  18.1     6 225.0 0.05524862
##  7  14.3     8 360.0 0.06993007
##  8  24.4     4 146.7 0.04098361
##  9  22.8     4 140.8 0.04385965
## 10  19.2     6 167.6 0.05208333
## # ... with 22 more rows

iris %>%
  tbl_df() %>%
  mutate_at(vars(-Species), funs(log))  # vars() selects the columns, funs() lists the functions to apply

## # A tibble: 150 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
##  1     1.629241    1.252763    0.3364722  -1.6094379  setosa
##  2     1.589235    1.098612    0.3364722  -1.6094379  setosa
##  3     1.547563    1.163151    0.2623643  -1.6094379  setosa
##  4     1.526056    1.131402    0.4054651  -1.6094379  setosa
##  5     1.609438    1.280934    0.3364722  -1.6094379  setosa
##  6     1.686399    1.360977    0.5306283  -0.9162907  setosa
##  7     1.526056    1.223775    0.3364722  -1.2039728  setosa
##  8     1.609438    1.223775    0.4054651  -1.6094379  setosa
##  9     1.481605    1.064711    0.3364722  -1.6094379  setosa
## 10     1.589235    1.131402    0.4054651  -2.3025851  setosa
## # ... with 140 more rows
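The `mutate_at()` / `funs()` call above reflects the dplyr version used when this tutorial was written. A minimal sketch of the same transformation with `across()` follows, assuming dplyr >= 1.0.0, where `across()` supersedes the scoped verbs and `funs()` is deprecated. The name `iris_log` is illustrative.

```r
library(dplyr)

iris_log <- iris %>%
  as_tibble() %>%
  # apply log() to every column except Species
  mutate(across(-Species, log))

iris_log
```

The selection inside `across()` accepts the same helpers as `select()`, so `across(starts_with("Petal"), log)` would restrict the transformation to the petal measurements.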

`filter`

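filter observations

Prefer the explicit form `dplyr::filter()`: other packages, including `stats` from base R, also define a function called `filter()`, and the namespace prefix avoids calling the wrong one.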

iris %>%
  tbl_df() %>%
  # logical criteria
  dplyr::filter(Sepal.Length > 7)

## # A tibble: 12 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
##           <dbl>       <dbl>        <dbl>       <dbl>    <fctr>
##  1          7.1         3.0          5.9         2.1 virginica
##  2          7.6         3.0          6.6         2.1 virginica
##  3          7.3         2.9          6.3         1.8 virginica
##  4          7.2         3.6          6.1         2.5 virginica
##  5          7.7         3.8          6.7         2.2 virginica
##  6          7.7         2.6          6.9         2.3 virginica
##  7          7.7         2.8          6.7         2.0 virginica
##  8          7.2         3.2          6.0         1.8 virginica
##  9          7.2         3.0          5.8         1.6 virginica
## 10          7.4         2.8          6.1         1.9 virginica
## 11          7.9         3.8          6.4         2.0 virginica
## 12          7.7         3.0          6.1         2.3 virginica

`arrange`

reorder observations

mtcars %>%
  tbl_df() %>%
  select(1:3) %>% 
  # order rows in ascending order of mpg
  dplyr::arrange(mpg) %>%
  # re-order in descending order (this replaces the previous ordering)
  dplyr::arrange(desc(mpg))

## # A tibble: 32 x 3
##      mpg   cyl  disp
##    <dbl> <dbl> <dbl>
##  1  33.9     4  71.1
##  2  32.4     4  78.7
##  3  30.4     4  75.7
##  4  30.4     4  95.1
##  5  27.3     4  79.0
##  6  26.0     4 120.3
##  7  24.4     4 146.7
##  8  22.8     4 108.0
##  9  22.8     4 140.8
## 10  21.5     4 120.1
## # ... with 22 more rows

`group_by` + `summarise`

group_by() groups observations by categorical levels
summarise() summarise observations by functions of choice

iris %>%
  tbl_df() %>%
  # compute separate summary row for each group
  dplyr::group_by(Species) %>%
  summarise(avg= mean(Sepal.Length)) %>%
  dplyr::ungroup()

## # A tibble: 3 x 2
##      Species   avg
##       <fctr> <dbl>
## 1     setosa 5.006
## 2 versicolor 5.936
## 3  virginica 6.588
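`summarise()` can compute several statistics per group in one call. The sketch below extends the example above with a group size and a standard deviation; the column names `n_obs`, `avg`, and `std_dev` are illustrative choices.

```r
library(dplyr)

iris_summary <- iris %>%
  group_by(Species) %>%
  summarise(
    n_obs   = n(),                  # number of rows in each group
    avg     = mean(Sepal.Length),   # same statistic as in the example above
    std_dev = sd(Sepal.Length)      # within-group standard deviation
  )

iris_summary
```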

`joins`

dplyr also does multi-table joins and can connect to various types of databases.

t1 <- tibble(alpha = letters[1:6], num = 1:6)
t2 <- tibble(alpha = letters[4:10], num = 4:10)
full_join(t1, t2, by = "alpha", suffix = c("_t1", "_t2"))

## # A tibble: 10 x 3
##    alpha num_t1 num_t2
##    <chr>  <int>  <int>
##  1     a      1     NA
##  2     b      2     NA
##  3     c      3     NA
##  4     d      4      4
##  5     e      5      5
##  6     f      6      6
##  7     g     NA      7
##  8     h     NA      8
##  9     i     NA      9
## 10     j     NA     10
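`full_join()` keeps every row from both tables. The other join verbs differ mainly in which rows survive, as the sketch below shows on the same two tables. The tables are rebuilt with `tibble()` so the block runs on its own, and the result names `inner`, `left`, and `anti` are made up for this example.

```r
library(dplyr)
library(tibble)

t1 <- tibble(alpha = letters[1:6],  num = 1:6)
t2 <- tibble(alpha = letters[4:10], num = 4:10)

# inner_join: only keys present in both tables (d, e, f)
inner <- inner_join(t1, t2, by = "alpha", suffix = c("_t1", "_t2"))

# left_join: every row of t1, with NA where t2 has no match
left <- left_join(t1, t2, by = "alpha", suffix = c("_t1", "_t2"))

# anti_join: rows of t1 whose key does not appear in t2 (a, b, c)
anti <- anti_join(t1, t2, by = "alpha")

inner
left
anti
```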

Super-secret pro-tip: `group_by() %>% mutate()` accomplishes the equivalent of a `summarise()` followed by a join.

tibble(group = sample(letters[1:3], 10, replace = TRUE),
       value = rnorm(10)) %>%
  group_by(group) %>%
  mutate(group_average = mean(value))
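To check that claim, the sketch below computes the group average both ways and compares the results. The seed, the name `df`, and the intermediate names `via_mutate`, `group_means`, and `via_join` are assumptions introduced for this example.

```r
library(dplyr)
library(tibble)

set.seed(1)  # arbitrary seed so the toy data is reproducible
df <- tibble(group = sample(letters[1:3], 10, replace = TRUE),
             value = rnorm(10))

# One step: grouped mutate keeps every row and adds the group mean
via_mutate <- df %>%
  group_by(group) %>%
  mutate(group_average = mean(value)) %>%
  ungroup()

# Two steps: summarise to one row per group, then join back onto the rows
group_means <- df %>%
  group_by(group) %>%
  summarise(group_average = mean(value))
via_join <- left_join(df, group_means, by = "group")

all.equal(via_mutate, via_join)
```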

`ggplot2`

Visualization package

Note that the pipe and consistent API make it easy to combine functions from different packages, and the whole thing is quite readable.

# cumulative sum and empirical cumulative distribution, per attribute and species
z <- iris %>%
  tbl_df() %>%
  gather(key = attrib, value = attrib_m, -Species) %>%
  group_by(attrib, Species) %>%
  arrange(attrib, Species, attrib_m) %>%
  # adds the columns `cumsum` and `cume_dist`, computed within each group
  dplyr::mutate_if(is.numeric, funs(cumsum, cume_dist))

b <- z %>%
  ggplot(aes(attrib_m,cumsum)) + 
  geom_line(aes(colour= Species)) +
  facet_grid(. ~ attrib)

c <- iris %>%
  gather(key=attrib, value= attrib_m, -Species) %>%
  ggplot(aes(attrib_m)) + 
  geom_density(aes(colour= Species)) + 
  facet_grid(. ~ attrib)

Rmisc::multiplot(b, c, cols = 1)

who %>%
  select(-iso2, -iso3) %>%
  gather(group, cases, -country, -year) %>%
  count(country, year, wt = cases) %>%
  ggplot(aes(x = year, y = n, group = country)) +
  geom_line(size = .2)

APPENDIX: RStudio

Make the most of its features!

Shortcuts:
- Ctrl+Shift+K: knit the document (knitr)
- Alt+Shift+K: show all keyboard shortcuts

Shortcuts with Ctrl+ followed by a number:
1. script
2. console
3. help
4. ~~history search~~
5. files
6. plots
7. packages
8. ~~environment~~
9. ~~Viewer~~

Remember the pipe `%>%`:
- Ctrl+Shift+M

stats with `broom`

# 1. summary stats
iris %>% 
  tbl_df() %>% 
  gather(key=attrib, value= attrib_m, -Species) %>%
  group_by(Species, attrib) %>%
  summarise_if(is.numeric,c("mean", "median", #location
                            "IQR", "mad", "sd", "var")) %>% #spread
  filter(attrib=="Sepal.Length")
  #glimpse()

# 2. distribution visualization
iris %>%
  ggplot(aes(Sepal.Length)) + 
  geom_density(aes(colour= Species))

# 3. hypothesis tests (pairwise t-tests; t.test() defaults to Welch's unequal-variance test)
iris %>%
  filter(Species!="setosa") %>%
  t.test(Sepal.Length ~ Species, data=.) %>%
  broom::tidy()

iris %>%
  filter(Species!="versicolor") %>%
  t.test(Sepal.Length ~ Species, data=.) %>%
  broom::tidy()

iris %>%
  filter(Species!="virginica") %>%
  t.test(Sepal.Length ~ Species, data=.) %>%
  broom::tidy()

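# 4. one-way ANOVA across the three species
iris %>%
  #filter(Species!="setosa") %>%
  aov(Sepal.Length ~ Species, data=.) %>%
  broom::tidy()
  # alternatives to tidy(), swapped in as the last step of the pipe:
  # broom::glance()   model-level statistics
  # broom::augment()  observation-level fitted values and residuals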

MORE EXAMPLES

library(tidyverse)
library(stringr)
library(forcats)
library(broom)
#library(EDAWR)

# mean of every numeric column for Peru
tidyr::who %>%
  filter(iso3 == "PER") %>%
  summarise_if(is.numeric, mean, na.rm = TRUE) %>%
  glimpse()

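# one -------------------------------------
# reshape to long format, then split the key into type, sex and age
who1 <- tidyr::who %>%
  gather(new_sp_m014:newrel_f65,
         key = "key",
         value = "cases",
         na.rm = TRUE) %>%
  # make the "newrel" prefix consistent with the other keys ("new_rel")
  mutate(key = stringr::str_replace(key, "newrel", "new_rel")) %>%
  separate(key,
           c("new", "type", "sexage"),
           sep = "_") %>%
  select(-new, -iso2, -iso3) %>%
  # split after the first character: "m014" becomes sex "m" and age "014"
  separate(sexage,
           c("sex", "age"),
           sep = 1)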

who1 %>%
  filter(country=="Peru") %>% 
  mutate(age= forcats::fct_reorder(age, desc(cases))) %>% 
  ggplot(aes(year, cases)) + 
  geom_line(aes(colour=age))
  #count(age)
  #View()

# two -------------------------------------
# total cases per country, year and sex
who2 <- who1 %>%
  group_by(country, year, sex) %>%
  summarise_at(vars(cases), sum, na.rm = TRUE)

who2 %>%
  filter(country=="Peru") %>%
  ggplot(aes(year, cases)) + 
  geom_line(aes(colour=sex)) +
  facet_wrap(~ country)

# three -------------------------------------
# keep the 20 countries with the most cases overall, plus Peru
who3 <- who2 %>%
  group_by(country) %>%
  summarise_at(vars(cases), sum, na.rm = TRUE) %>%
  top_n(20, wt = cases) %>%
  select(country) %>%
  inner_join(who1, by = "country") %>%
  bind_rows(who1 %>%
              filter(country == "Peru"))

who3 %>% 
  group_by(country) %>% 
  mutate(age= forcats::fct_reorder(age, desc(cases))) %>%
  ggplot(aes(year, log10(cases))) + 
  geom_line(aes(colour=age)) +
  facet_wrap(~ country)

Computer environment

devtools::session_info()

## Session info ----------------------------------------------------------------------------

##  setting  value                       
##  version  R version 3.4.1 (2017-06-30)
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language en_US                       
##  collate  en_US.UTF-8                 
##  tz       America/Lima                
##  date     2017-08-04

## Packages --------------------------------------------------------------------------------

##  package    * version date       source                        
##  assertthat   0.2.0   2017-04-11 CRAN (R 3.4.0)                
##  backports    1.1.0   2017-05-22 CRAN (R 3.4.1)                
##  base       * 3.4.1   2017-07-08 local                         
##  bindr        0.1     2016-11-13 cran (@0.1)                   
##  bindrcpp   * 0.2     2017-06-17 CRAN (R 3.4.1)                
##  broom        0.4.2   2017-02-13 CRAN (R 3.4.0)                
##  cellranger   1.1.0   2016-07-27 CRAN (R 3.4.0)                
##  colorspace   1.3-2   2016-12-14 CRAN (R 3.4.0)                
##  compiler     3.4.1   2017-07-08 local                         
##  datasets   * 3.4.1   2017-07-08 local                         
##  devtools     1.13.2  2017-06-02 CRAN (R 3.4.1)                
##  digest       0.6.12  2017-01-27 CRAN (R 3.4.0)                
##  dplyr      * 0.7.2   2017-07-20 CRAN (R 3.4.1)                
##  EDAWR      * 0.1     2017-02-24 Github (rstudio/EDAWR@2652ea6)
##  evaluate     0.10.1  2017-06-24 CRAN (R 3.4.1)                
##  forcats      0.2.0   2017-01-23 CRAN (R 3.4.0)                
##  foreign      0.8-69  2017-06-21 CRAN (R 3.4.1)                
##  ggplot2    * 2.2.1   2016-12-30 CRAN (R 3.4.0)                
##  glue         1.1.1   2017-06-21 CRAN (R 3.4.1)                
##  graphics   * 3.4.1   2017-07-08 local                         
##  grDevices  * 3.4.1   2017-07-08 local                         
##  grid         3.4.1   2017-07-08 local                         
##  gtable       0.2.0   2016-02-26 CRAN (R 3.4.0)                
##  haven        1.1.0   2017-07-09 CRAN (R 3.4.1)                
##  hms          0.3     2016-11-22 CRAN (R 3.4.0)                
##  htmltools    0.3.6   2017-04-28 CRAN (R 3.4.0)                
##  httr         1.2.1   2016-07-03 CRAN (R 3.4.0)                
##  jsonlite     1.5     2017-06-01 cran (@1.5)                   
##  knitr        1.16    2017-05-18 cran (@1.16)                  
##  labeling     0.3     2014-08-23 CRAN (R 3.4.0)                
##  lattice      0.20-35 2017-03-25 CRAN (R 3.3.3)                
##  lazyeval     0.2.0   2016-06-12 CRAN (R 3.4.0)                
##  lubridate    1.6.0   2016-09-13 CRAN (R 3.4.0)                
##  magrittr     1.5     2014-11-22 CRAN (R 3.4.0)                
##  memoise      1.1.0   2017-04-21 CRAN (R 3.4.0)                
##  methods    * 3.4.1   2017-07-08 local                         
##  mnormt       1.5-5   2016-10-15 CRAN (R 3.4.0)                
##  modelr       0.1.1   2017-07-24 CRAN (R 3.4.1)                
##  munsell      0.4.3   2016-02-13 CRAN (R 3.4.0)                
##  nlme         3.1-131 2017-02-06 CRAN (R 3.4.0)                
##  parallel     3.4.1   2017-07-08 local                         
##  pkgconfig    2.0.1   2017-03-21 cran (@2.0.1)                 
##  plyr         1.8.4   2016-06-08 CRAN (R 3.4.0)                
##  psych        1.7.5   2017-05-03 CRAN (R 3.4.0)                
##  purrr      * 0.2.2.2 2017-05-11 cran (@0.2.2.2)               
##  R6           2.2.2   2017-06-17 CRAN (R 3.4.1)                
##  Rcpp         0.12.12 2017-07-15 CRAN (R 3.4.1)                
##  readr      * 1.1.1   2017-05-16 CRAN (R 3.4.1)                
##  readxl       1.0.0   2017-04-18 CRAN (R 3.4.0)                
##  reshape2     1.4.2   2016-10-22 CRAN (R 3.4.0)                
##  rlang        0.1.1   2017-05-18 cran (@0.1.1)                 
##  rmarkdown    1.6     2017-06-15 CRAN (R 3.4.1)                
##  Rmisc        1.5     2013-10-22 CRAN (R 3.4.0)                
##  rprojroot    1.2     2017-01-16 CRAN (R 3.4.0)                
##  rvest        0.3.2   2016-06-17 CRAN (R 3.4.0)                
##  scales       0.4.1   2016-11-09 CRAN (R 3.4.0)                
##  stats      * 3.4.1   2017-07-08 local                         
##  stringi      1.1.5   2017-04-07 CRAN (R 3.4.0)                
##  stringr      1.2.0   2017-02-18 CRAN (R 3.4.0)                
##  tibble     * 1.3.3   2017-05-28 cran (@1.3.3)                 
##  tidyr      * 0.6.3   2017-05-15 CRAN (R 3.4.1)                
##  tidyverse  * 1.1.1   2017-01-27 CRAN (R 3.4.0)                
##  tools        3.4.1   2017-07-08 local                         
##  utils      * 3.4.1   2017-07-08 local                         
##  withr        2.0.0   2017-07-28 CRAN (R 3.4.1)                
##  xml2         1.1.1   2017-01-24 CRAN (R 3.4.0)                
##  yaml         2.1.14  2016-11-12 CRAN (R 3.4.0)

avallecam

2017-08-04
