tidyverse
This tutorial is based on the post by Michael Levy and the post by Bradley Boehmke. The material has been adapted to fit the objectives of the course.
More detail is available in Hadley Wickham's book.
tidyverse for data cleaning.
Analysts tend to follow 4 fundamental processes to turn data into understanding, knowledge & insight:
This tutorial will focus on data manipulation.
It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data. (Dasu and Johnson, 2003)
Well structured data serves two purposes:
Put data in data frames
The tidyverse is a suite of R tools that follow a tidy philosophy.
Functions should be consistent and easily (human) readable
pipe
Suite of ~20 packages that provide consistent, user-friendly, smart-default tools to do most of what most people do in R.
install.packages("tidyverse") installs all of the above packages.
library(tidyverse) attaches only the core packages.
tibble
A modern reimagining of data frames.
tdf = tibble(x = 1:1e4, y = rnorm(1e4)) # == data_frame(x = 1:1e4, y = rnorm(1e4))
class(tdf)
## [1] "tbl_df" "tbl" "data.frame"
Tibbles print politely.
tdf
## # A tibble: 10,000 x 2
## x y
## <int> <dbl>
## 1 1 1.5769043
## 2 2 -0.7643222
## 3 3 -1.5141980
## 4 4 -2.2969305
## 5 5 1.1883936
## 6 6 -0.8345075
## 7 7 0.2408071
## 8 8 -0.4245211
## 9 9 -1.0505558
## 10 10 0.6696043
## # ... with 9,990 more rows
Tibbles have some convenient and consistent defaults that are different from base R data.frames.
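A few of those defaults can be seen in a short sketch. The column names and values below are made up for illustration; the behaviours shown (single-bracket subsetting that always returns a tibble, no partial matching with $, and strings kept as character) are the ones documented for the tibble package.

```r
library(tibble)

# Hypothetical toy data, used only for illustration
df  <- data.frame(name = c("a", "b"), value = 1:2)
tbl <- tibble(name = c("a", "b"), value = 1:2)

# 1. Selecting one column with [ keeps a tibble; a data.frame drops to a vector
class(df[, "value"])
class(tbl[, "value"])

# 2. $ does no partial matching on a tibble
df$val                      # partial match on "value" succeeds
suppressWarnings(tbl$val)   # NULL: the column "val" does not exist

# 3. Character columns are never converted to factors
class(tbl$name)
```

Use [[ or $ with the full column name when a plain vector is needed from a tibble.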
%>%
Sends the output of the LHS function to the first argument of the RHS function.
sum(1:8) %>%
sqrt()
## [1] 6
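To make the rule concrete, the sketch below checks that a piped call and the equivalent nested call give the same result. It uses only base R functions plus %>% from magrittr; the digits = 2 argument is an illustrative choice that shows how extra arguments are written after the piped value.

```r
library(magrittr)

# x %>% f() is evaluated as f(x)
piped  <- sum(1:8) %>% sqrt()
nested <- sqrt(sum(1:8))
identical(piped, nested)

# x %>% f(y) is evaluated as f(x, y): the piped value fills the first argument
rounded <- pi %>% round(digits = 2)
identical(rounded, round(pi, digits = 2))
```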
%>% can be inserted automatically with the keyboard shortcut Ctrl+Shift+M (the RStudio default).
When you want to apply several functions in sequence, the advantage of the pipe becomes obvious.
Nested Option:
arrange(
  summarize(
    filter(data, variable == numeric_value),
    Total = sum(variable)
  ),
  desc(Total)
)
or
Multiple Object Option:
a <- filter(data, variable == numeric_value)
b <- summarise(a, Total = sum(variable))
c <- arrange(b, desc(Total))
or
%>% Option:
data %>%
  filter(variable == numeric_value) %>%
  summarise(Total = sum(variable)) %>%
  arrange(desc(Total))
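The three templates above use placeholder names (data, variable, numeric_value). As a runnable sketch, the same comparison can be made on the built-in mtcars data. The filter am == 1 and the extra group_by(cyl) step are assumptions added here so that the summary has more than one row to sort.

```r
library(dplyr)

# Nested option: read from the inside out
nested <- arrange(
  summarise(
    group_by(filter(mtcars, am == 1), cyl),
    Total = sum(mpg)
  ),
  desc(Total)
)

# Multiple object option: one intermediate object per step
a <- filter(mtcars, am == 1)
b <- summarise(group_by(a, cyl), Total = sum(mpg))
multi <- arrange(b, desc(Total))

# %>% option: read from top to bottom
piped <- mtcars %>%
  filter(am == 1) %>%
  group_by(cyl) %>%
  summarise(Total = sum(mpg)) %>%
  arrange(desc(Total))

# All three styles produce the same table
all.equal(as.data.frame(nested), as.data.frame(piped))
all.equal(as.data.frame(multi), as.data.frame(piped))
```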
As a chain of operations grows, the %>% operator becomes more efficient and makes your code more legible.
The %>% operator also allows you to flow from data manipulation tasks straight into visualization functions (via ggplot and ggvis) and into many analytic functions.
tidyr
There are four fundamental functions of data tidying:
gather() takes multiple columns and gathers them into key-value pairs: it makes “wide” data longer.
spread() takes two columns (key & value) and spreads them into multiple columns: it makes “long” data wider.
separate() splits a single column into multiple columns.
unite() combines multiple columns into a single column.
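The four functions can be tried together on a tiny table that needs no extra packages. The table below is a hypothetical two-country extract; the object names (wide, long, united, split_again) are illustrative only.

```r
library(tidyr)
library(tibble)

# Hypothetical wide table: one row per country, one column per year
wide <- tibble(
  country = c("DE", "FR"),
  `2011`  = c(5800, 7000),
  `2012`  = c(6000, 6900)
)

# gather(): wide -> long, one row per country-year pair
long <- gather(wide, key = "year", value = "n", -country)

# spread(): long -> wide again
wide_again <- spread(long, key = year, value = n)

# unite(): paste two columns into one
united <- unite(long, col = "country_year", country, year, sep = "_")

# separate(): split that column back into its two parts
split_again <- separate(united, col = country_year,
                        into = c("country", "year"), sep = "_")
```

Newer tidyr releases (1.0.0 and later) also provide pivot_longer() and pivot_wider() for the same reshaping; gather() and spread() remain available.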
gather and spread
Use gather to make wide tables long and spread to make long tables wide.
library(EDAWR)
cases %>%
tbl_df() %>%
gather(key= year, value=n, -country) %>%
spread(year, n)
## # A tibble: 3 x 4
## country `2011` `2012` `2013`
## * <chr> <dbl> <dbl> <dbl>
## 1 DE 5800 6000 6200
## 2 FR 7000 6900 7000
## 3 US 15000 14000 13000
stocks <- data.frame(
time = as.Date('2009-01-01') + 0:9,
X = rnorm(10, 0, 1),
Y = rnorm(10, 0, 2),
Z = rnorm(10, 0, 4)
)
stocksm <- stocks %>% gather(stock, price, -time) #%>% count(stock) #use gather()+count()
stocksm %>% spread(stock, price)
stocksm %>% spread(time, price)
who # Tuberculosis data from the WHO
## # A tibble: 7,240 x 60
## country iso2 iso3 year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544
## <chr> <chr> <chr> <int> <int> <int> <int> <int>
## 1 Afghanistan AF AFG 1980 NA NA NA NA
## 2 Afghanistan AF AFG 1981 NA NA NA NA
## 3 Afghanistan AF AFG 1982 NA NA NA NA
## 4 Afghanistan AF AFG 1983 NA NA NA NA
## 5 Afghanistan AF AFG 1984 NA NA NA NA
## 6 Afghanistan AF AFG 1985 NA NA NA NA
## 7 Afghanistan AF AFG 1986 NA NA NA NA
## 8 Afghanistan AF AFG 1987 NA NA NA NA
## 9 Afghanistan AF AFG 1988 NA NA NA NA
## 10 Afghanistan AF AFG 1989 NA NA NA NA
## # ... with 7,230 more rows, and 52 more variables: new_sp_m4554 <int>,
## # new_sp_m5564 <int>, new_sp_m65 <int>, new_sp_f014 <int>, new_sp_f1524 <int>,
## # new_sp_f2534 <int>, new_sp_f3544 <int>, new_sp_f4554 <int>, new_sp_f5564 <int>,
## # new_sp_f65 <int>, new_sn_m014 <int>, new_sn_m1524 <int>, new_sn_m2534 <int>,
## # new_sn_m3544 <int>, new_sn_m4554 <int>, new_sn_m5564 <int>, new_sn_m65 <int>,
## # new_sn_f014 <int>, new_sn_f1524 <int>, new_sn_f2534 <int>, new_sn_f3544 <int>,
## # new_sn_f4554 <int>, new_sn_f5564 <int>, new_sn_f65 <int>, new_ep_m014 <int>,
## # new_ep_m1524 <int>, new_ep_m2534 <int>, new_ep_m3544 <int>, new_ep_m4554 <int>,
## # new_ep_m5564 <int>, new_ep_m65 <int>, new_ep_f014 <int>, new_ep_f1524 <int>,
## # new_ep_f2534 <int>, new_ep_f3544 <int>, new_ep_f4554 <int>, new_ep_f5564 <int>,
## # new_ep_f65 <int>, new_rel_m014 <int>, new_rel_m1524 <int>, new_rel_m2534 <int>,
## # new_rel_m3544 <int>, new_rel_m4554 <int>, new_rel_m5564 <int>, new_rel_m65 <int>,
## # new_rel_f014 <int>, new_rel_f1524 <int>, new_rel_f2534 <int>, new_rel_f3544 <int>,
## # new_rel_f4554 <int>, new_rel_f5564 <int>, new_rel_f65 <int>
who %>%
gather(group, cases, -country, -iso2, -iso3, -year)
## # A tibble: 405,440 x 6
## country iso2 iso3 year group cases
## <chr> <chr> <chr> <int> <chr> <int>
## 1 Afghanistan AF AFG 1980 new_sp_m014 NA
## 2 Afghanistan AF AFG 1981 new_sp_m014 NA
## 3 Afghanistan AF AFG 1982 new_sp_m014 NA
## 4 Afghanistan AF AFG 1983 new_sp_m014 NA
## 5 Afghanistan AF AFG 1984 new_sp_m014 NA
## 6 Afghanistan AF AFG 1985 new_sp_m014 NA
## 7 Afghanistan AF AFG 1986 new_sp_m014 NA
## 8 Afghanistan AF AFG 1987 new_sp_m014 NA
## 9 Afghanistan AF AFG 1988 new_sp_m014 NA
## 10 Afghanistan AF AFG 1989 new_sp_m014 NA
## # ... with 405,430 more rows
separate
and unite
df <- data.frame(x = c(NA, "a.b", "a.d", "b.c"))
df %>%
tidyr::separate(x, c("A", "B")) %>%
tidyr::unite(x, A, B, sep=".")
## x
## 1 NA.NA
## 2 a.b
## 3 a.d
## 4 b.c
mtcars %>%
tbl_df() %>%
select(7:9) %>%
tidyr::unite(vs_am, vs, am) %>%
tidyr::separate(vs_am, c("vs", "am"))
library(EDAWR)
storms %>%
top_n(2,date) %>%
separate(date, c("y", "m", "d")) %>%
unite(date, y,m,d, sep="-")
## # A tibble: 2 x 4
## storm wind pressure date
## * <chr> <int> <int> <chr>
## 1 Alberto 110 1007 2000-08-03
## 2 Arlene 50 1010 1999-06-11
# extra
library(EDAWR)
pollution %>%
tbl_df() %>%
spread(size, amount) %>%
gather(size, amount, -city) %>%
arrange(desc(city))
## # A tibble: 6 x 3
## city size amount
## <chr> <chr> <dbl>
## 1 New York large 23
## 2 New York small 14
## 3 London large 22
## 4 London small 16
## 5 Beijing large 121
## 6 Beijing small 56
dplyr
Common data(frame) manipulation tasks.
There are seven fundamental functions of data transformation:
select() select variables
mutate() create new variables
filter() filter observations
arrange() reorder observations
group_by() groups observations by categorical levels
summarise() summarise observations by functions of choice
join() joins separate data frames
select
iris %>%
tbl_df() %>%
select(Petal.Length, Petal.Width)
## # A tibble: 150 x 2
## Petal.Length Petal.Width
## <dbl> <dbl>
## 1 1.4 0.2
## 2 1.4 0.2
## 3 1.3 0.2
## 4 1.5 0.2
## 5 1.4 0.2
## 6 1.7 0.4
## 7 1.4 0.3
## 8 1.5 0.2
## 9 1.4 0.2
## 10 1.5 0.1
## # ... with 140 more rows
# equivalent
iris %>%
tbl_df() %>%
select(3,4)
iris %>%
tbl_df() %>%
select(-Species)
iris %>%
tbl_df() %>%
select_if(is.factor)
Use the select helpers!
# ?select_helpers
iris %>%
tbl_df() %>%
select(starts_with("Petal"))
iris %>%
tbl_df() %>%
select(ends_with("Width"))
iris %>%
tbl_df() %>%
select(contains("etal"))
iris %>%
tbl_df() %>%
select(-matches(".t.")) # accepts 'NOT' condition
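In more recent dplyr releases (1.0.0 and later, newer than the version this tutorial was run with), the select_if() pattern shown earlier can also be written with the where() helper inside select(). A minimal sketch, assuming such a version is installed:

```r
library(dplyr)

# Keep only factor columns (same columns as select_if(is.factor))
factors_only <- iris %>%
  select(where(is.factor))

# Helpers can be combined: numeric columns whose name starts with "Petal"
petal_numeric <- iris %>%
  select(where(is.numeric) & starts_with("Petal"))

names(factors_only)
names(petal_numeric)
```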
mutate
mtcars %>%
tbl_df() %>%
select(1:3) %>%
mutate(gpm= 1/mpg)
## # A tibble: 32 x 4
## mpg cyl disp gpm
## <dbl> <dbl> <dbl> <dbl>
## 1 21.0 6 160.0 0.04761905
## 2 21.0 6 160.0 0.04761905
## 3 22.8 4 108.0 0.04385965
## 4 21.4 6 258.0 0.04672897
## 5 18.7 8 360.0 0.05347594
## 6 18.1 6 225.0 0.05524862
## 7 14.3 8 360.0 0.06993007
## 8 24.4 4 146.7 0.04098361
## 9 22.8 4 140.8 0.04385965
## 10 19.2 6 167.6 0.05208333
## # ... with 22 more rows
iris %>%
tbl_df() %>%
mutate_at(vars(-Species), funs(log))# %>% # vars() funs()
## # A tibble: 150 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fctr>
## 1 1.629241 1.252763 0.3364722 -1.6094379 setosa
## 2 1.589235 1.098612 0.3364722 -1.6094379 setosa
## 3 1.547563 1.163151 0.2623643 -1.6094379 setosa
## 4 1.526056 1.131402 0.4054651 -1.6094379 setosa
## 5 1.609438 1.280934 0.3364722 -1.6094379 setosa
## 6 1.686399 1.360977 0.5306283 -0.9162907 setosa
## 7 1.526056 1.223775 0.3364722 -1.2039728 setosa
## 8 1.609438 1.223775 0.4054651 -1.6094379 setosa
## 9 1.481605 1.064711 0.3364722 -1.6094379 setosa
## 10 1.589235 1.131402 0.4054651 -2.3025851 setosa
## # ... with 140 more rows
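The mutate_at(vars(), funs()) form above matches the dplyr version listed in the session info. funs() was deprecated in later releases, and from dplyr 1.0.0 onward the same transformation is usually written with across(). A sketch of that alternative, assuming a recent dplyr:

```r
library(dplyr)

# Log-transform every column except Species
iris_log <- iris %>%
  as_tibble() %>%
  mutate(across(-Species, log))

head(iris_log, 3)
```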
filter
dplyr::filter
iris %>%
tbl_df() %>%
# logical criteria
dplyr::filter(Sepal.Length > 7)
## # A tibble: 12 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fctr>
## 1 7.1 3.0 5.9 2.1 virginica
## 2 7.6 3.0 6.6 2.1 virginica
## 3 7.3 2.9 6.3 1.8 virginica
## 4 7.2 3.6 6.1 2.5 virginica
## 5 7.7 3.8 6.7 2.2 virginica
## 6 7.7 2.6 6.9 2.3 virginica
## 7 7.7 2.8 6.7 2.0 virginica
## 8 7.2 3.2 6.0 1.8 virginica
## 9 7.2 3.0 5.8 1.6 virginica
## 10 7.4 2.8 6.1 1.9 virginica
## 11 7.9 3.8 6.4 2.0 virginica
## 12 7.7 3.0 6.1 2.3 virginica
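filter() also accepts several conditions at once. The sketch below, with thresholds chosen only for illustration, shows that comma-separated conditions are combined with AND, while | expresses OR.

```r
library(dplyr)

# Conditions separated by commas must all hold (AND)
both <- iris %>%
  filter(Sepal.Length > 7, Petal.Width < 2)

# Use | when either condition is enough (OR)
either <- iris %>%
  filter(Sepal.Length > 7 | Species == "setosa")

nrow(both)
nrow(either)
```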
arrange
mtcars %>%
tbl_df() %>%
select(1:3) %>%
# order rows
dplyr::arrange(mpg) %>%
dplyr::arrange(desc(mpg))
## # A tibble: 32 x 3
## mpg cyl disp
## <dbl> <dbl> <dbl>
## 1 33.9 4 71.1
## 2 32.4 4 78.7
## 3 30.4 4 75.7
## 4 30.4 4 95.1
## 5 27.3 4 79.0
## 6 26.0 4 120.3
## 7 24.4 4 146.7
## 8 22.8 4 108.0
## 9 22.8 4 140.8
## 10 21.5 4 120.1
## # ... with 22 more rows
group_by + summarise
group_by() groups observations by categorical levels
summarise() summarise observations by functions of choice
iris %>%
tbl_df() %>%
# compute separate summary row for each group
dplyr::group_by(Species) %>%
summarise(avg= mean(Sepal.Length)) %>%
dplyr::ungroup()
## # A tibble: 3 x 2
## Species avg
## <fctr> <dbl>
## 1 setosa 5.006
## 2 versicolor 5.936
## 3 virginica 6.588
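summarise() can compute several statistics per group in one call. The sketch below extends the example above with a row count and a standard deviation; the column names n, avg and sd are illustrative choices.

```r
library(dplyr)

by_species <- iris %>%
  group_by(Species) %>%
  summarise(
    n   = n(),                  # number of rows in each group
    avg = mean(Sepal.Length),   # same summary as in the example above
    sd  = sd(Sepal.Length)
  )

by_species
```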
joins
dplyr also does multi-table joins and can connect to various types of databases.
t1 = data_frame(alpha = letters[1:6], num = 1:6)
t2 = data_frame(alpha = letters[4:10], num = 4:10)
full_join(t1, t2, by = "alpha", suffix = c("_t1", "_t2"))
## # A tibble: 10 x 3
## alpha num_t1 num_t2
## <chr> <int> <int>
## 1 a 1 NA
## 2 b 2 NA
## 3 c 3 NA
## 4 d 4 4
## 5 e 5 5
## 6 f 6 6
## 7 g NA 7
## 8 h NA 8
## 9 i NA 9
## 10 j NA 10
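full_join() keeps every key from both tables. For comparison, the sketch below applies three other dplyr join functions to the same two tables, rebuilt here with tibble() so that the block runs on its own.

```r
library(dplyr)
library(tibble)

t1 <- tibble(alpha = letters[1:6],  num = 1:6)
t2 <- tibble(alpha = letters[4:10], num = 4:10)

# inner_join(): only keys present in both tables (d, e, f)
inner <- inner_join(t1, t2, by = "alpha", suffix = c("_t1", "_t2"))

# left_join(): every row of t1, with NA where t2 has no match
left <- left_join(t1, t2, by = "alpha", suffix = c("_t1", "_t2"))

# anti_join(): rows of t1 that have no match in t2 (a, b, c)
anti <- anti_join(t1, t2, by = "alpha")
```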
Super-secret pro-tip: You can group_by %>% mutate to accomplish a summarize + join.
data_frame(group = sample(letters[1:3], 10, replace = TRUE),
value = rnorm(10)) %>%
group_by(group) %>%
mutate(group_average = mean(value))
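The equivalence behind this tip can be checked directly. In the sketch below, the seed and the object names (d, one_step, two_step) are arbitrary illustrative choices; the two pipelines should give the same table.

```r
library(dplyr)
library(tibble)

set.seed(1)  # arbitrary seed so the example is reproducible
d <- tibble(group = sample(letters[1:3], 10, replace = TRUE),
            value = rnorm(10))

# One step: a grouped mutate keeps every row and adds the group mean
one_step <- d %>%
  group_by(group) %>%
  mutate(group_average = mean(value)) %>%
  ungroup()

# Two steps: summarise per group, then join the result back onto the data
group_means <- d %>%
  group_by(group) %>%
  summarise(group_average = mean(value))
two_step <- left_join(d, group_means, by = "group")

all.equal(as.data.frame(one_step), as.data.frame(two_step))
```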
ggplot2
Visualization package
# density, cumsum, cume_dist + facet
z <- iris %>%
tbl_df() %>%
gather(key=attrib, value= attrib_m, -Species) %>%
group_by(attrib, Species) %>%
arrange(attrib, Species, attrib_m) %>%
dplyr::mutate_if(is.numeric,funs(cumsum, cume_dist))
#dplyr::mutate_each(funs(cumsum, cume_dist), -Species)
b <- z %>%
ggplot(aes(attrib_m,cumsum)) +
geom_line(aes(colour= Species)) +
facet_grid(. ~ attrib)
c <- iris %>%
gather(key=attrib, value= attrib_m, -Species) %>%
ggplot(aes(attrib_m)) +
geom_density(aes(colour= Species)) +
facet_grid(. ~ attrib)
Rmisc::multiplot(b, c, cols = 1)
who %>%
select(-iso2, -iso3) %>%
gather(group, cases, -country, -year) %>%
count(country, year, wt = cases) %>%
ggplot(aes(x = year, y = n, group = country)) +
geom_line(size = .2)
Make the most of its advantages!
%>%
broom
# 1. summary stats
iris %>%
tbl_df() %>%
gather(key=attrib, value= attrib_m, -Species) %>%
group_by(Species, attrib) %>%
summarise_if(is.numeric,c("mean", "median", #location
"IQR", "mad", "sd", "var")) %>% #spread
filter(attrib=="Sepal.Length")
#glimpse()
# 2. distribution visualization
iris %>%
ggplot(aes(Sepal.Length)) +
geom_density(aes(colour= Species))
# 3. test hypothesis
iris %>%
filter(Species!="setosa") %>%
t.test(Sepal.Length ~ Species, data=.) %>%
broom::tidy()
iris %>%
filter(Species!="versicolor") %>%
t.test(Sepal.Length ~ Species, data=.) %>%
broom::tidy()
iris %>%
filter(Species!="virginica") %>%
t.test(Sepal.Length ~ Species, data=.) %>%
broom::tidy()
iris %>%
#filter(Species!="setosa") %>%
aov(Sepal.Length ~ Species, data=.) %>%
broom::tidy()
# alternatives for the last step of the pipe:
# broom::glance()
# broom::augment()
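The three broom verbs return tidy data frames at different levels of detail. The sketch below applies them to the same formula, fitted with lm() rather than aov() purely for illustration.

```r
library(broom)

fit <- lm(Sepal.Length ~ Species, data = iris)

coefs   <- tidy(fit)     # one row per model coefficient
summary <- glance(fit)   # one row for the whole model
per_obs <- augment(fit)  # one row per observation, with fitted values and residuals

coefs
summary
head(per_obs, 3)
```

Because each result is a data frame, it can be piped straight into dplyr or ggplot2.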
library(tidyverse)
library(stringr)
library(forcats)
library(broom)
#library(EDAWR)
#
tidyr::who %>%
filter(iso3=="PER") %>%
summarise_if(is.numeric,mean, na.rm=T) %>%
glimpse()
# one -------------------------------------
who1 <- tidyr::who %>%
gather(new_sp_m014:newrel_f65,
key= "key",
value= "cases",
na.rm=T) %>%
mutate(key= stringr::str_replace(key,
"newrel","new_rel")) %>%
separate(key,
c("new", "type", "sexage"),
sep="_") %>%
select(-new, -iso2, -iso3) %>%
separate(sexage,
c("sex", "age"),
sep=1)
who1 %>%
filter(country=="Peru") %>%
mutate(age= forcats::fct_reorder(age, desc(cases))) %>%
ggplot(aes(year, cases)) +
geom_line(aes(colour=age))
#count(age)
#View()
# two -------------------------------------
who2 <- who1 %>%
group_by(country, year, sex) %>%
summarise_at(vars(cases), sum, na.rm=T)
who2 %>%
filter(country=="Peru") %>%
ggplot(aes(year, cases)) +
geom_line(aes(colour=sex)) +
facet_wrap(~ country)
# three -------------------------------------
who3 <- who2 %>%
group_by(country) %>%
summarise_at(vars(cases), sum, na.rm=T) %>%
top_n(20, wt=cases) %>%
select(country) %>%
inner_join(who1) %>%
bind_rows(who1 %>%
filter(country=="Peru"))
who3 %>%
group_by(country) %>%
mutate(age= forcats::fct_reorder(age, desc(cases))) %>%
ggplot(aes(year, log10(cases))) +
geom_line(aes(colour=age)) +
facet_wrap(~ country)
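The who1 pipeline above relies on separate() accepting a numeric sep, which splits at a character position instead of at a delimiter. A minimal sketch with made-up codes in the same format (one letter for sex followed by an age range):

```r
library(tidyr)
library(tibble)

# Hypothetical codes that mimic the sexage column
codes <- tibble(sexage = c("m014", "f1524", "m65"))

# sep = 1 splits after the first character
parts <- separate(codes, sexage, into = c("sex", "age"), sep = 1)
parts
```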
devtools::session_info()
## Session info ----------------------------------------------------------------------------
## setting value
## version R version 3.4.1 (2017-06-30)
## system x86_64, linux-gnu
## ui X11
## language en_US
## collate en_US.UTF-8
## tz America/Lima
## date 2017-08-04
## Packages --------------------------------------------------------------------------------
## package * version date source
## assertthat 0.2.0 2017-04-11 CRAN (R 3.4.0)
## backports 1.1.0 2017-05-22 CRAN (R 3.4.1)
## base * 3.4.1 2017-07-08 local
## bindr 0.1 2016-11-13 cran (@0.1)
## bindrcpp * 0.2 2017-06-17 CRAN (R 3.4.1)
## broom 0.4.2 2017-02-13 CRAN (R 3.4.0)
## cellranger 1.1.0 2016-07-27 CRAN (R 3.4.0)
## colorspace 1.3-2 2016-12-14 CRAN (R 3.4.0)
## compiler 3.4.1 2017-07-08 local
## datasets * 3.4.1 2017-07-08 local
## devtools 1.13.2 2017-06-02 CRAN (R 3.4.1)
## digest 0.6.12 2017-01-27 CRAN (R 3.4.0)
## dplyr * 0.7.2 2017-07-20 CRAN (R 3.4.1)
## EDAWR * 0.1 2017-02-24 Github (rstudio/EDAWR@2652ea6)
## evaluate 0.10.1 2017-06-24 CRAN (R 3.4.1)
## forcats 0.2.0 2017-01-23 CRAN (R 3.4.0)
## foreign 0.8-69 2017-06-21 CRAN (R 3.4.1)
## ggplot2 * 2.2.1 2016-12-30 CRAN (R 3.4.0)
## glue 1.1.1 2017-06-21 CRAN (R 3.4.1)
## graphics * 3.4.1 2017-07-08 local
## grDevices * 3.4.1 2017-07-08 local
## grid 3.4.1 2017-07-08 local
## gtable 0.2.0 2016-02-26 CRAN (R 3.4.0)
## haven 1.1.0 2017-07-09 CRAN (R 3.4.1)
## hms 0.3 2016-11-22 CRAN (R 3.4.0)
## htmltools 0.3.6 2017-04-28 CRAN (R 3.4.0)
## httr 1.2.1 2016-07-03 CRAN (R 3.4.0)
## jsonlite 1.5 2017-06-01 cran (@1.5)
## knitr 1.16 2017-05-18 cran (@1.16)
## labeling 0.3 2014-08-23 CRAN (R 3.4.0)
## lattice 0.20-35 2017-03-25 CRAN (R 3.3.3)
## lazyeval 0.2.0 2016-06-12 CRAN (R 3.4.0)
## lubridate 1.6.0 2016-09-13 CRAN (R 3.4.0)
## magrittr 1.5 2014-11-22 CRAN (R 3.4.0)
## memoise 1.1.0 2017-04-21 CRAN (R 3.4.0)
## methods * 3.4.1 2017-07-08 local
## mnormt 1.5-5 2016-10-15 CRAN (R 3.4.0)
## modelr 0.1.1 2017-07-24 CRAN (R 3.4.1)
## munsell 0.4.3 2016-02-13 CRAN (R 3.4.0)
## nlme 3.1-131 2017-02-06 CRAN (R 3.4.0)
## parallel 3.4.1 2017-07-08 local
## pkgconfig 2.0.1 2017-03-21 cran (@2.0.1)
## plyr 1.8.4 2016-06-08 CRAN (R 3.4.0)
## psych 1.7.5 2017-05-03 CRAN (R 3.4.0)
## purrr * 0.2.2.2 2017-05-11 cran (@0.2.2.2)
## R6 2.2.2 2017-06-17 CRAN (R 3.4.1)
## Rcpp 0.12.12 2017-07-15 CRAN (R 3.4.1)
## readr * 1.1.1 2017-05-16 CRAN (R 3.4.1)
## readxl 1.0.0 2017-04-18 CRAN (R 3.4.0)
## reshape2 1.4.2 2016-10-22 CRAN (R 3.4.0)
## rlang 0.1.1 2017-05-18 cran (@0.1.1)
## rmarkdown 1.6 2017-06-15 CRAN (R 3.4.1)
## Rmisc 1.5 2013-10-22 CRAN (R 3.4.0)
## rprojroot 1.2 2017-01-16 CRAN (R 3.4.0)
## rvest 0.3.2 2016-06-17 CRAN (R 3.4.0)
## scales 0.4.1 2016-11-09 CRAN (R 3.4.0)
## stats * 3.4.1 2017-07-08 local
## stringi 1.1.5 2017-04-07 CRAN (R 3.4.0)
## stringr 1.2.0 2017-02-18 CRAN (R 3.4.0)
## tibble * 1.3.3 2017-05-28 cran (@1.3.3)
## tidyr * 0.6.3 2017-05-15 CRAN (R 3.4.1)
## tidyverse * 1.1.1 2017-01-27 CRAN (R 3.4.0)
## tools 3.4.1 2017-07-08 local
## utils * 3.4.1 2017-07-08 local
## withr 2.0.0 2017-07-28 CRAN (R 3.4.1)
## xml2 1.1.1 2017-01-24 CRAN (R 3.4.0)
## yaml 2.1.14 2016-11-12 CRAN (R 3.4.0)