An R error with gather() and spread(), and the newer functions that replace them

I was working through a practice exercise that looked simple, but I ran into a problem.

1. Gather the first 4 columns of the iris data frame, then restore it

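library(tidyr)  # gather(), spread(), pivot_longer(), pivot_wider()
library(dplyr)  # group_by(), mutate(), %>%

iris_gather <- gather(data = iris,
       key = LW,
       value = S,
       -Species)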

Restore it:

iris_spread <- spread(data = iris_gather,
                      key = LW,
                      value = S)

Error: Each row of output must be identified by a unique combination of keys.

Keys are shared for 600 rows:
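
To see where the 600 comes from, one can count how many rows share each combination of Species and LW in the gathered data. This is a small check added here, assuming the tidyr and dplyr packages are installed; dup_counts is just an illustrative name.

```r
library(tidyr)
library(dplyr)

# Same gather() call as above
iris_gather <- gather(data = iris, key = LW, value = S, -Species)

# Count the rows behind each Species/LW combination
dup_counts <- count(iris_gather, Species, LW)
print(dup_counts)
```

Each of the 12 combinations covers 50 rows (12 * 50 = 600), so spread() has no way to tell which value belongs to which output row.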

I looked it up and found this explanation:
The error in spread() occurs when a combination of key values identifies more than one row. With pivot_wider(), the error is replaced by a warning, and the function returns list-columns where there are duplicates, which we can then unnest. Another way is to create a sequence column, grouped by the identifier column that has duplicates, so that every row gets a unique identifier.

So some extra processing is needed:
iris_gather <- gather(data = iris,
       key = LW,
       value = SW,
       -Species)
# number the rows within each LW group so that LW + id is unique (needs dplyr)
iris_gather %>% group_by(LW) %>% mutate(id = 1:n()) %>% spread(LW, SW)
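
An alternative sketch, not taken from the quoted answer: add the row identifier before gathering, so that each original flower keeps its own id through the round trip. The names id and restored are my own, and this assumes tidyr and dplyr are installed.

```r
library(tidyr)
library(dplyr)

restored <- iris %>%
  mutate(id = row_number()) %>%                  # one id per original row
  gather(key = LW, value = SW, -Species, -id) %>%
  spread(key = LW, value = SW)                   # Species + id + LW is now unique

dim(restored)
```

With the id in place, restored has 150 rows again, one per row of iris, plus the extra id column.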

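That solved the problem, but as a beginner I still did not know what exactly had gone wrong, and I could not explain the claim that "there is something fundamentally wrong with the design of spread() and gather()". Those are the package author's own words: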

For some time, it’s been obvious that there is something fundamentally wrong with the design of spread() and gather(). Many people don’t find the names intuitive and find it hard to remember which direction corresponds to spreading and which to gathering. It also seems surprisingly hard to remember the arguments to these functions, meaning that many people (including me!) have to consult the documentation every time.
(I thought I was the only one who could not keep the arguments of these two functions straight, so this resonated with me.)
In short, the advice is to stop using spread() and gather().
The author provides new, improved replacements with state-of-the-art features:

There are two important new features inspired by other R packages that have been advancing reshaping in R:

  1. pivot_longer() can work with multiple value variables that may have different types, inspired by the enhanced melt() and dcast() functions provided by the data.table package by Matt Dowle and Arun Srinivasan.

  2. pivot_longer() and pivot_wider() can take a data frame that specifies precisely how metadata stored in column names becomes data variables (and vice versa), inspired by the cdata package by John Mount and Nina Zumel.

1 pivot_longer, the replacement for gather

The number of columns decreases and the number of rows increases.

relig_income

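The dataset holds 3 variables in total: V1 = religion (18 values), V2 = income (10 income ranges, stored in the column names), and V3 = count (the number of people).
Reshaping the table so that each row gives the count for one religion and one income range means the 10 V2 columns are repeated once for each of the 18 values of unique(relig_income$religion), so we expect 18 * 10 = 180 rows in total.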
test <- relig_income %>% 
  pivot_longer(!religion, names_to = "income", values_to = "count")
dim(test)
[1] 180   3
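
The 18 * 10 estimate can also be computed from the data instead of by hand. A minimal sketch, assuming tidyr (which ships relig_income) is installed; n_income_cols, expected_rows and long are illustrative names.

```r
library(tidyr)

# every column except religion is an income range
n_income_cols <- ncol(relig_income) - 1
expected_rows <- nrow(relig_income) * n_income_cols

long <- pivot_longer(relig_income, !religion,
                     names_to = "income", values_to = "count")

# compare the estimate with the actual number of rows
expected_rows == nrow(long)
```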

The first argument is the dataset to reshape, relig_income.

The second argument describes which columns need to be reshaped. In this case, it’s every column apart from religion.
More concretely, !religion excludes that column; all the other columns (the V2 income ranges, which hold the count values) take part in the reshape.
The third argument, names_to, gives the name of the variable that will be created from the data stored in the column names, i.e. income. This is the new column that holds V2, and you are free to choose its name.
The fourth argument, values_to, gives the name of the variable that will be created from the data stored in the cell value, i.e. count. This is the new column that holds V3.
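
To make the four arguments concrete, here is a tiny made-up table run through the same call. The religions "A" and "B" and their counts are invented for illustration; toy and toy_long are illustrative names.

```r
library(tidyr)

# made-up data: 2 religions, 2 income columns
toy <- tribble(
  ~religion, ~`<$10k`, ~`$10-20k`,
  "A",       1,        2,
  "B",       3,        4
)

toy_long <- pivot_longer(toy, !religion,
                         names_to = "income", values_to = "count")
toy_long
```

Each of the 2 rows turns into 2 rows, one per income column, so toy_long has 4 rows and the columns religion, income and count.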

The result has 180 rows × 3 columns, and the names of the last two columns are the ones we chose ourselves.

The above is the "String data in column names" case: the column names being gathered are plain strings.

In the next dataset, the column names follow a regular pattern:

billboard %>% 
  pivot_longer(
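    cols = starts_with("wk"), # only reshape the columns whose names start with "wk"
    names_to = "week", # name of the new column that holds the old column names (wk1, wk2, ...)
    values_to = "rank", # name of the new column that holds the cell values (the chart rank)
    values_drop_na = FALSE) # the default; set it to TRUE to drop the rows whose rank is NA
# expected: 76 * 317 = 24092 rows
The week columns run from wk1 to wk76.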

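The result has 24092 rows in total.
However, the week column holds strings such as "wk1", and we only want the number. To strip the prefix and convert the rest, the author provides the names_prefix and names_transform arguments: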
billboard %>% 
  pivot_longer(
    cols = starts_with("wk"), 
    names_to = "week", 
    names_prefix = "wk", # strip the "wk" prefix from the values of week
    names_transform = list(week = as.integer), # convert what is left after stripping to integer
    # alternative: names_transform = list(week = readr::parse_number),
    values_to = "rank",
    values_drop_na = TRUE,
  )
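The author also gives an alternative: a single argument, names_transform = list(week = readr::parse_number), does the job on its own, because readr::parse_number() automatically strips non-numeric components.

This example covers column names that mix a string with a number, where we want the number as a numeric value that is easy to read and count.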
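
As a sketch of the parse_number() alternative, assuming the readr package is installed: parse_number() both strips the "wk" text and converts the rest, so names_prefix is no longer needed. Note that it returns doubles rather than integers. bb_long is an illustrative name.

```r
library(tidyr)

bb_long <- pivot_longer(
  billboard,
  cols = starts_with("wk"),
  names_to = "week",
  # parse_number() drops the non-numeric "wk" and returns a double
  names_transform = list(week = readr::parse_number),
  values_to = "rank",
  values_drop_na = TRUE
)

class(bb_long$week)
```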

What if the column names encode several variables at once?

Many variables in column names
The dataset used here is who.

who
The column names of this dataset also encode 3 variables.

country, iso2, iso3, and year are already variables, so they can be left as is. But the columns from new_sp_m014 to newrel_f65 encode four variables in their names:
One of them, new, can be ignored:
The new_/new prefix indicates these are counts of new cases. This dataset only contains new cases, so we’ll ignore it here because it’s constant.
The other 3 are the following:
V1. sp/rel/ep describe how the case was diagnosed (the diagnosis method).

V2. m/f gives the gender.

V3. 014/1524/2534/3544/4554/5564/65 supplies the age range.

Before you can tidy a dataset, you have to understand how it is structured.
The names_pattern argument is particularly powerful here:
who %>% pivot_longer(
  cols = new_sp_m014:newrel_f65,
  names_to = c("diagnosis", "gender", "age"), # names for V1 to V3 above
  names_pattern = "new_?(.*)_(.)(.*)",
  values_to = "count" # name of the column that holds the values
)

We can break these variables up by specifying multiple column names in names_to, and then either providing names_sep or names_pattern. Here names_pattern is the most natural fit. It has a similar interface to extract: you give it a regular expression containing groups (defined by ()) and it puts each group in a column
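Take new_sp_m2534 as an example: splitting it by hand gives sp / m / 2534, i.e. three columns.
With names_pattern = "new_?(.*)_(.)(.*)"
the split works like this: the three pairs of parentheses () () () mark the three pieces to capture.
The pattern first matches new followed by an optional underscore (_?). The first group (.*) then captures everything up to the last underscore; the second group (.) captures exactly one character after that underscore (f or m); the third group (.*) captures whatever is left.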
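
The pattern can be tried on a few column names outside of pivot_longer(). This sketch uses base R's regexec() and regmatches() rather than tidyr's internals, so it only illustrates what the three groups capture; nms, parts and pieces are illustrative names.

```r
# three column names from who, including one without the underscore after "new"
nms <- c("new_sp_m2534", "new_ep_f014", "newrel_f65")
pattern <- "new_?(.*)_(.)(.*)"

# regmatches() returns the full match followed by the three captured groups
parts <- regmatches(nms, regexec(pattern, nms))

# drop the full match (first column) and keep diagnosis / gender / age
pieces <- do.call(rbind, parts)[, -1]
colnames(pieces) <- c("diagnosis", "gender", "age")
pieces
```

The optional underscore is what lets the same pattern handle both new_sp_... and newrel_... names.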

The data now looks much tidier.

The author then goes a step further and converts the new gender and age columns to factors:
who %>% pivot_longer(
  cols = new_sp_m014:newrel_f65,
  names_to = c("diagnosis", "gender", "age"), 
  names_pattern = "new_?(.*)_(.)(.*)",
  names_transform = list(
    gender = ~ readr::parse_factor(.x, levels = c("f", "m")),
    age = ~ readr::parse_factor(
      .x,
      levels = c("014", "1524", "2534", "3544", "4554", "5564", "65"), 
      ordered = TRUE
    )
  ),
  values_to = "count",
)
     
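Another possibility is that each row holds several observations, with each column name combining a variable (dob or gender) and a child identifier:
library(tidyr)  # tribble(), pivot_longer()
library(dplyr)  # mutate(), across()
library(readr)  # parse_date()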
family <- tribble(
  ~family,  ~dob_child1,  ~dob_child2, ~gender_child1, ~gender_child2,
  1L, "1998-11-26", "2000-01-29",             1L,             2L,
  2L, "1996-06-22",           NA,             2L,             NA,
  3L, "2002-07-11", "2004-04-05",             2L,             2L,
  4L, "2004-10-10", "2009-08-27",             1L,             1L,
  5L, "2000-12-05", "2005-02-28",             2L,             1L,
)
# parse the dob columns as dates; across() replaces the superseded mutate_at()
family <- family %>% mutate(across(starts_with("dob"), parse_date))
family %>% 
  pivot_longer(
    !family, 
    names_to = c(".value", "child"), 
    names_sep = "_", 
    values_drop_na = FALSE
  )

Note that we have two pieces of information (or values) for each child: their gender and their dob (date of birth). These need to go into separate columns in the result. Again we supply multiple variables to names_to, using names_sep to split up each variable name. Note the special name .value: this tells pivot_longer() that that part of the column name specifies the “value” being measured (which will become a variable in the output).

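In other words, with names_to = c(".value", "child"), the part of each column name before the separator (dob or gender) is what .value picks up, and each of those becomes its own column in the output; the part after the separator (child1 or child2) goes into the child column.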
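
A smaller sketch of the same .value mechanism, on made-up data: the column names height_t1, weight_t2 and so on, and all the numbers, are invented for illustration.

```r
library(tidyr)

# made-up data: two measurements (height, weight) at two time points (t1, t2)
scores <- tribble(
  ~id, ~height_t1, ~height_t2, ~weight_t1, ~weight_t2,
  1,   100,        110,        20,         22,
  2,   95,         104,        18,         21
)

scores_long <- pivot_longer(
  scores,
  !id,
  names_to = c(".value", "time"),  # part before "_" becomes a column, part after goes to time
  names_sep = "_"
)
scores_long
```

The output has one row per id and time point, with height and weight as separate columns, instead of a single name/value pair per row.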

There are far too many usages to cover here; if you are interested, see the documentation for these functions:


vignette("pivot") # opens the pivoting vignette

Summary: following the hints above, this is the solution I arrived at.

## 1 use pivot_longer as the replacement for gather
test2 <- iris %>% pivot_longer(!Species, names_to = "LW", values_to = "size")
## 2 use pivot_wider as the replacement for spread; it warns but does not raise an error
test3 <- test2 %>% 
  pivot_wider(names_from = LW, values_from = size) %>% 
  unnest()
1: Values are not uniquely identified; output will contain list-cols.
* Use `values_fn = list` to suppress this warning.
* Use `values_fn = length` to identify where the duplicates arise
* Use `values_fn = {summary_fun}` to summarise duplicates 
2: `cols` is now required when using unnest().
Please use `cols = c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)`
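
The second suggestion in the first warning can be tried directly. A sketch, assuming a tidyr version recent enough to print that message; dup_check is an illustrative name.

```r
library(tidyr)

test2 <- pivot_longer(iris, !Species, names_to = "LW", values_to = "size")

# values_fn = length reports how many values fall into each output cell
dup_check <- pivot_wider(test2, names_from = LW, values_from = size,
                         values_fn = length)
dup_check
```

Every cell holds 50, which shows that Species alone cannot identify a row: each species contributes 50 values per measurement.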

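The inverse operation, adding columns back to restore the original layout, is more awkward: it produces warnings but no error. Following the two messages above, passing values_fn = list to pivot_wider() and cols = c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) to unnest() should silence them.

unnest() is required to expand the list-columns.

Without unnest(), the result has one row per Species, and each of the 4 measurement columns holds all of that species' values as a single vector.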
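
One way to avoid the list-columns altogether is to carry a row identifier through the round trip, in the same spirit as the id column used with spread() earlier. This is a sketch under that assumption, not the author's solution; row, iris_long and iris_wide are illustrative names.

```r
library(tidyr)
library(dplyr)

iris_long <- iris %>%
  mutate(row = row_number()) %>%                 # unique id for each original row
  pivot_longer(!c(Species, row), names_to = "LW", values_to = "size")

iris_wide <- iris_long %>%
  pivot_wider(names_from = LW, values_from = size) %>%  # Species + row identify each output row
  select(-row)

dim(iris_wide)
```

Because Species and row together identify every output row, pivot_wider() has no duplicates to pack into lists, so no unnest() is needed.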

