Data Carpentry Workshop - Day 2 -Data Manipulation using dplyr and tidyr

IMG_7436.JPG

Main Contents

  • Describe the purpose of the dplyr and tidyr packages.
  • Select certain columns in a data frame with the dplyr function select.
  • Select certain rows in a data frame according to filtering conditions with the dplyr function filter .
  • Link the output of one dplyr function to the input of another function with the ‘pipe’ operator %>%.
  • Add new columns to a data frame that are functions of existing columns with mutate.
  • Use the split-apply-combine concept for data analysis.
  • Use summarize, group_by, and count to split a data frame into groups of observations, apply a summary statistics for each group, and then combine the results.
  • Describe the concept of a wide and a long table format and for which purpose those formats are useful.
  • Describe what key-value pairs are.
  • Reshape a data frame from long to wide format and back with the spread and gather commands from the tidyr package.
  • Export a data frame to a .csv file.

1. Install and load tidyverse

install.packages("tidyverse")
library("tidyverse")
  • tidyverse including tidyr,dplyr,ggplot2,tibble et.al.

2. What are dplyr and tidyr packages?

  • The package dplyr provides easy tools for the most common data manipulation tasks. It is built to work directly with data frames, with many common tasks optimized by being written in a compiled language (C++).
  • The package tidyr addresses the common problem of wanting to reshape your data for plotting and use by different R functions.
  • We’ll read in our data using the read_csv() function, from the tidyverse package readr, instead of read.csv().
Characteristic of tibble:
  • In addition to displaying the data type of each column under its name, it only prints the first few rows of data and only as many columns as fit on one screen.
  • Columns of class character are never converted into factors.

3. Functions of dplyr.

  • select(): subset columns
  • filter(): subset rows on conditions
  • mutate(): create new columns by using information from other columns
  • group_by() and summarize(): create summary statisitcs on grouped data
  • arrange(): sort results
  • count(): count discrete value
3.1 Selecting columns and filtering rows
select(surveys,plot_id,species_id,weight) #select columns
filter(surveys,year == 1995)  # filter rows
3.2 Pipes
Three ways to do select and filter at the same time:
(1) using intermediate steps
surveys2 <- filter(surveys,weight <5)
surveys_sml1 <- select(surveys2,species_id,sex,weight)
view(surveys_sml)
(2) nest functions
surveys_sml2 <- select(filter(surveys,weight <5),species_id,sex,weight)
surveys_sml2
(3) pipes
surveys_sml3 <- surveys %>% 
filter(weight <5) %>% 
select(species_id,sex,weight)
surveys_sml3
Challenge
surveys_select <- surveys %>% 
filter(year < 1995) %>% 
select(year,sex,weight)
surveys_select
3.3 Split-apply-combine data analysis and the summarize() function
split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results.
(1) group_by() & summarize()
# group by single columns
surveys %>% 
group_by(sex) %>% 
summarize(mean_weight = mean(weight,na.rm = TRUE))
# group by multiple columns
surveys_test <- surveys %>% 
filter(!is.na(weight)) %>% 
group_by(sex,species_id) %>% 
summarize(mean_weight = mean(weight)) %>% 
print(n =15)
View(surveys_test)
# summarize multiple variables at the same time
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight),
          min_weight = min(weight)) %>% 
print(n =10)
(2) arrange
# rearrange by a column content
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight),
          min_weight = min(weight)) %>%
arrange(min_weight) %>% 
print(n=10)
#sort in descending order
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight),
          min_weight = min(weight)) %>%
arrange(desc(mean_weight)) %>% 
print(n=10)
(3) counting
 # count a single factor
surveys %>%
count(sex)
surveys %>%
count(sex,sort =TRUE)
# count combination of factors
surveys %>%
count(sex,species)
# count and arrange
surveys %>%
count(sex,species) %>% 
arrange(species,desc(n))
(4) challenge
#counting
surveys %>%
count(plot_id)
# group_by & summarize
surveys %>%
filter(!is.na(hindfoot_length)) %>% 
group_by(species_id) %>% 
summarize(mean_hindfoot_length = mean(hindfoot_length),
        min_hindfoot_length = min(hindfoot_length),
        max_hindfoot_length = max(hindfoot_length),
        n =n())
# arrange
surveys %>%
filter(!is.na(weight)) %>%
group_by(year) %>% 
filter(weight == max(weight)) %>% 
select(year,genus,species_id,weight) %>%
arrange(year)

4. tidyr functions: reshaping with spread and gather

4.1 spreading
three principal arguments:
  • (1) the data
  • (2) the key column variable whose value will become new column names;
  • (3) the value column whose values will fill the new column variables.
  • others :fill.
spread code
# step1. prepare for data
surveys_gw <- surveys %>%
filter(!is.na(weight)) %>% 
group_by(genus,plot_id) %>% 
summarize(mean_weight = mean(weight))
str(surveys_gw)
# step2. spread
surveys_spread <- surveys_gw %>% 
spread(key = genus,value = mean_weight)
str(surveys_spread)
# step3. fill
surveys_spread %>%
spread(genus, mean_weight, fill = 0) %>%
head()
spread
4.2 Gathering
four principal arguments:
  • (1) the data
  • (2) the key column variable we wish to create from column names.
  • (3) values column variable we wish to create and fill with values associated with the key.
  • (4) the names of the columns we use to fill the key variable (or to drop).
gather code
surveys_gather <- surveys_spread %>%
gather(key = genus, value = mean_weight, -plot_id)
str(surveys_gather)
gather
challenge

5. Export Data

(1)filter
surveys_complete <- surveys %>%
filter(!is.na(weight),           # remove missing weight
     !is.na(hindfoot_length),  # remove missing hindfoot_length
     !is.na(sex))                # remove missing sex
(2) extract
# Extract the most common species_id
species_counts <- surveys_complete %>%
count(species_id) %>% 
filter(n >= 50)
(3) Only keep the most common species
surveys_complete <- surveys_complete %>%
filter(species_id %in% species_counts$species_id)
dim(surveys_complete)
(4) save
write_csv(surveys_complete, path = "C:/Users/home/Desktop/Rcourse/DataCarpentry33/Outputs/surveys_complete-data.csv")
read_csv("C:/Users/home/Desktop/Rcourse/DataCarpentry33/Outputs/surveys_complete-data.csv")

下期預告

Data visualization with ggplot2

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時請結(jié)合常識與多方信息審慎甄別。
平臺聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點,簡書系信息發(fā)布平臺,僅提供信息存儲服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

  • rljs by sennchi Timeline of History Part One The Cognitiv...
    sennchi閱讀 7,855評論 0 10
  • 心靈自由 30天寫作 認真的從網(wǎng)上搜索了有關(guān)裸貸的材料,翻看了兩篇,沒有定力再往下看了。一個個年輕的女孩子,裸著自...
    我就是云錦閱讀 488評論 0 0
  • 燕地風殺人,今日始方知。 匆匆出門去,性命后悔遲。 北極有寒渦,凌侵如鐵甲。 昨夜入關(guān)來,風讒更威加。 寒隨大風來...
    柏舟BAEJOU閱讀 240評論 0 0

友情鏈接更多精彩內(nèi)容