Regular Expressions (the stringr Package)

Preface

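The stringr package for R was developed by Hadley Wickham as a friendlier wrapper around stringi. Wickham received the John Chambers Award, an award dedicated to statistical computing, for his contributions to data-processing and visualization tools, which reportedly annoyed a fair number of statisticians at the time. His ggplot2 package showed people that plotting in R could be simple and good-looking, and it won R a lot of users. Finding data wrangling in R not convenient enough, he then gathered his packages into tidyverse, a coherent data-processing toolkit that includes ggplot2, readr, purrr, dplyr, tidyr, stringr and forcats (reshape2 is also his, but it is an older package and not part of tidyverse). He also wrote the book R for Data Science (Chinese title: 《R数据科学》), which covers tidyverse in detail. He is chief scientist at RStudio, where Yihui Xie, a well-known Chinese R developer, has also worked. This post focuses on stringr.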

Definition and an Example

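What is a regular expression? It is a way of describing the features of a piece of text, so that text with those features can be matched and extracted.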

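  • Suppose I want to extract "Chlamydomonas" and "IFT80" from the string "Homo sapiens intraflagellar transport 80 homolog (Chlamydomonas) (IFT80), mRNA".
  • Looking for what the two have in common, I notice that both sit inside parentheses.
  • "Chlamydomonas" consists of uppercase and lowercase English letters.
  • "IFT80" consists of uppercase English letters and digits.
  • To summarize, the two strings share two features: 1. both are inside parentheses; 2. both are made up only of uppercase letters, lowercase letters and digits.
  • So we can describe both strings with \\([A-Za-z0-9]*\\).
  • Why \\([A-Za-z0-9]*\\)?
  • You may want to read the rest of this post first and then come back; it will probably make more sense.
  1. First, both strings we are looking for sit inside parentheses.
  • We use the parentheses as part of the pattern, which lets us locate the target strings more precisely. Parentheses are metacharacters, so each one needs a backslash to escape it; and because the backslash is itself the escape character inside an R string, it needs a second backslash. A pair of literal parentheses is therefore written like this:
pattern = "\\(\\)"
  2. Having found the parentheses, we describe what is inside them: uppercase letters, lowercase letters and digits, written as [A-Za-z0-9]. A-Z is the set of uppercase letters from A to Z, a-z is the set of lowercase letters from a to z, and 0-9 is the set of digits from 0 to 9.
  3. The content can be any mix of these characters, of any length, so we add the * quantifier. * means zero or more.
  • The full pattern, including the content inside the parentheses, is therefore:
pattern = "\\([A-Za-z0-9]*\\)"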
  • Now let's try it out with the str_match function
> library(stringr)
> string <- "Homo sapiens intraflagellar transport 80 homolog (Chlamydomonas) (IFT80), mRNA"
> string
[1] "Homo sapiens intraflagellar transport 80 homolog (Chlamydomonas) (IFT80), mRNA"
> pattern <- "\\([A-Za-z0-9]*\\)"
> pattern
[1] "\\([A-Za-z0-9]*\\)"
> str_match(string,pattern)
     [,1]             
[1,] "(Chlamydomonas)"
> str_match_all(string,pattern)
[[1]]
     [,1]             
[1,] "(Chlamydomonas)"
[2,] "(IFT80)"        
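  • As we can see, str_match returns only the first match, while str_match_all returns every match.
  • Even with str_match_all, though, both returned strings still carry their parentheses. We do not want the parentheses in the final result, so what should we do?
  • The usual approach is to strip the parentheses with str_remove_all, as shown here: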
> string <- "Homo sapiens intraflagellar transport 80 homolog (Chlamydomonas) (IFT80), mRNA"
> pattern <- "\\([A-Za-z0-9]*\\)"
> a <- str_match_all(string,pattern) # str_match_all returns a list
> pattern <- "\\(|\\)"
> str_remove_all(a[[1]], pattern)
[1] "Chlamydomonas" "IFT80"        
  • Looking at ?str_match, however, we find a simpler and more efficient way:

For str_match, a character matrix. First column is the complete match, followed by one column for each capture group. For str_match_all, a list of character matrices.

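  • str_match returns the complete match in the first column, followed by one column for each capture group in the pattern.
  • What is a capture group? Regex groups come in two kinds: capturing groups and non-capturing groups. Wrapping part of a regular expression in a pair of parentheses creates a capturing group: (\d) has one group, (\d)(\d) has two, (\d)(\d)(\d) has three, and so on. The number of such parenthesis pairs is the number of groups.
  • In our example we want the content between the literal parentheses, [A-Za-z0-9]*, on its own, so we wrap it in an extra pair of parentheses to make it a capturing group.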
> pattern = "\\([A-Za-z0-9]*\\)"
> pattern2 = "\\(([A-Za-z0-9]*)\\)"
> str_match_all(string,pattern2)
[[1]]
     [,1]              [,2]           
[1,] "(Chlamydomonas)" "Chlamydomonas"
[2,] "(IFT80)"         "IFT80"        
> str_match_all(string,pattern2)[[1]][,2]
[1] "Chlamydomonas" "IFT80"
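
As an aside, the same result can also be reached without a capture group, by using lookarounds, which the ICU regex engine behind stringr supports. This is an alternative technique, not the approach used in this post; the sketch only reuses the example string from above.

```r
library(stringr)

string <- "Homo sapiens intraflagellar transport 80 homolog (Chlamydomonas) (IFT80), mRNA"

# (?<=\\() : the match must be preceded by a literal "("
# (?=\\))  : the match must be followed by a literal ")"
# The parentheses are checked but are not part of the match itself.
inside <- str_extract_all(string, "(?<=\\()[A-Za-z0-9]*(?=\\))")[[1]]
inside
```

Whether this reads better than the capture-group version is a matter of taste; the capture group is usually easier for beginners to follow.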

Regular Expressions

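①. Base R functions such as grep and gsub support three regular-expression modes:

1. Extended regular expressions: the default, and the most commonly used;
2. Perl-style regular expressions: set perl = TRUE;
3. Literal (fixed) matching: set fixed = TRUE.
stringr does not take these arguments. It uses the ICU regular-expression engine provided by stringi, and a pattern is wrapped in fixed() when it should be matched literally.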
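
The difference between regex matching and literal matching shows up as soon as the pattern contains a dot. The following is a minimal sketch with a made-up vector x; fixed() is stringr's counterpart of base R's fixed = TRUE.

```r
library(stringr)

x <- c("a.b", "axb")

# Base R: "." is a metacharacter unless fixed = TRUE is set
base_regex <- grepl("a.b", x)                  # "." matches any character
base_fixed <- grepl("a.b", x, fixed = TRUE)    # "." is a literal dot
base_perl  <- grepl("a\\.b", x, perl = TRUE)   # escaped dot, Perl-style engine

# stringr: patterns are regular expressions by default;
# wrap the pattern in fixed() to match it literally
str_regex <- str_detect(x, "a.b")
str_fixed <- str_detect(x, fixed("a.b"))
```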

②. The basic metacharacters in R are listed below (they mean much the same as in Python):

. \ | ( ) [ ] ^ $ * + ?

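.     Matches any character. In stringr a newline is excluded by default; use regex(pattern, dotall = TRUE) to include it;
\     Escapes the next character, i.e. removes a metacharacter's special meaning so that it is matched literally. Because \ is also the escape character inside R strings, an escape has to be typed as \\ in the pattern string;
|     Alternation. For example, A|B matches either A or B; B is only tried if A does not match;
()    Group: the pattern inside the parentheses is matched as a single unit
[]    Character set: matches any one of the characters inside the brackets
^     Matches the start of the string. For example, ^MT- matches strings that start with the literal text "MT-" (commonly used to pick out mitochondrial genes in single-cell sequencing). As the first character inside [], it negates the set: [^6] matches any character that is not 6
$     Matches the end of the string. Inside [] it loses its special meaning: [akm$] matches 'a', 'k', 'm' or '$'.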

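Quantifiers: * + ? {m} {m,n} {m,}

*      the preceding item is matched zero or more times
+      the preceding item is matched one or more times
?      the preceding item is matched zero or one time; it is also placed after another quantifier to make it non-greedy
{m}    the preceding item is matched exactly m times
{m,n}  the preceding item is matched between m and n times, as many as possible
{m,}   the preceding item is matched at least m times, as many as possible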
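
The quantifiers are easiest to understand by trying them on a small string. The sketch below uses made-up inputs and str_extract, which returns the first match, to show what each quantifier picks up, including the difference between greedy and non-greedy matching.

```r
library(stringr)

x <- "aaa<b><c>"

exact_two    <- str_extract(x, "a{2}")    # exactly two "a"
at_least_two <- str_extract(x, "a{2,}")   # two or more, as many as possible
one_to_two   <- str_extract(x, "a{1,2}")  # between one and two, as many as possible

greedy <- str_extract(x, "<.+>")    # "+" grabs as much as it can
lazy   <- str_extract(x, "<.+?>")   # "+?" stops at the first possible ">"

# "?" on its own makes the preceding character optional
optional <- str_extract(c("color", "colour"), "colou?r")
```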
③. Escaping in R

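To match a metacharacter itself, such as ? or ., we have to tell the regex engine beforehand to drop its special meaning by putting the escape character \ in front of it. The regular expressions are \? and \., which are typed as "\\?" and "\\." in an R string. To match a literal backslash, the regular expression is \\, which is typed as "\\\\" in an R string.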
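
The two layers of escaping (one for the R string, one for the regex engine) can be checked directly. A minimal sketch with made-up strings:

```r
library(stringr)

x <- c("what?", "a.b", "C:\\temp")   # the last element contains ONE backslash

# writeLines shows what the regex engine actually receives
writeLines("\\?")    # prints \?

has_qmark     <- str_detect(x, "\\?")     # literal question mark
has_dot       <- str_detect(x, "\\.")     # literal dot
has_backslash <- str_detect(x, "\\\\")    # literal backslash: four in the R string
any_char      <- str_detect(x, ".")       # unescaped dot matches any character
```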

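④. Predefined character classes in R
In base R these class names must be placed inside a bracket expression, e.g. [[:digit:]]; stringr also accepts the bare form.
Class      Meaning
[:digit:]  Digits: 0-9
[:lower:]  Lowercase letters: a-z
[:upper:]  Uppercase letters: A-Z
[:alpha:]  Letters: a-z and A-Z
[:alnum:]  All letters and digits
[:punct:]  Punctuation, such as . , ;
[:graph:]  Graphical characters, i.e. [:alnum:] and [:punct:]
[:blank:]  Blank characters, i.e. space and tab
[:space:]  Space, tab, newline and other space characters
[:print:]  Printable characters, i.e. [:alnum:], [:punct:] and the space character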
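
A short sketch of these classes in use, with a made-up example string. The bracketed form [[:digit:]] is used here because it works in both base R and stringr.

```r
library(stringr)

x <- "IFT80 maps to chr3, band q25.33"

digits   <- str_extract_all(x, "[[:digit:]]+")[[1]]   # every run of digits
upper    <- str_extract_all(x, "[[:upper:]]+")[[1]]   # every run of uppercase letters
no_punct <- str_remove_all(x, "[[:punct:]]")          # drop the comma and the dot
```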
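⑤. Shorthand escapes for character classes in R
Code  Meaning
\w    Word character (letters, digits and underscore), equivalent to [[:alnum:]_]
\W    Non-word character, equivalent to [^[:alnum:]_]
\s    Whitespace character, equivalent to [[:space:]]
\S    Non-whitespace character, equivalent to [^[:space:]]
\d    Digit, equivalent to [[:digit:]]
\D    Non-digit, equivalent to [^[:digit:]]
\b    Word edge (the position where a word starts or ends)
\B    Not a word edge (any position that is not the start or end of a word)
\<    Word beginning (base R's default regex engine; with stringr use \b instead)
\>    Word end (base R's default regex engine; with stringr use \b instead)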
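
The shorthand escapes need a doubled backslash inside an R string. A minimal sketch with a made-up string that contains a space and a tab:

```r
library(stringr)

x <- "gene_1 IFT80\tmRNA"   # "\t" is a tab character

words   <- str_extract_all(x, "\\w+")[[1]]   # runs of letters, digits, underscore
numbers <- str_extract_all(x, "\\d+")[[1]]   # runs of digits
n_space <- str_count(x, "\\s")               # the space and the tab

# \\b anchors the pattern at word edges, so "cat" inside "concat" is rejected
whole_word <- str_detect(c("cat", "concat"), "\\bcat\\b")
```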

The stringr Package

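①. Installing stringr
# Option 1: install the released version from CRAN:
install.packages("stringr")
# Option 2: install the latest development version from GitHub to try the newest features:
install.packages("devtools")
devtools::install_github("tidyverse/stringr")
# Option 3: install the tidyverse package, which installs stringr along with it
install.packages("tidyverse")
②. stringr functions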

The functions in stringr fall into six main groups:

  1. String matching functions: str_detect, str_which, str_count, str_locate, str_locate_all, str_view, str_view_all
  2. String extraction functions: str_sub, str_subset, str_extract, str_extract_all, str_match, str_match_all
  3. String length functions: str_length, str_pad, str_trunc, str_trim, str_squish
  4. String transformation functions: str_replace, str_replace_all, str_replace_na, str_to_lower, str_to_upper, str_remove, str_remove_all
  5. String joining/splitting functions: str_c, str_dup, str_split, str_split_fixed
  6. String sorting functions: str_sort, str_order

Next, we go through these functions one by one.

1. String matching functions

str_detect checks whether pattern occurs in each string and returns TRUE or FALSE

> x <- c("apple","banana","pear")
> str_detect(x,"a")
[1] TRUE TRUE TRUE

str_count counts how many times pattern occurs in each string

> x <- c("apple","banana","pear")
> str_count(x,"a")
[1] 1 3 1

str_which returns the indices of the strings that contain pattern

> x <- c("apple","banana","pear")
> str_which(x,"a")
[1] 1 2 3
> str_which(x,"ar")
[1] 3
> str_which(x,"an")
[1] 2
> str_which(x,"ap")
[1] 1
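
These three functions are often combined for filtering. The sketch below revisits the ^MT- pattern mentioned earlier for mitochondrial genes; the gene vector is made up for illustration.

```r
library(stringr)

genes <- c("MT-CO1", "ACTB", "MT-ND1", "GAPDH")

is_mt    <- str_detect(genes, "^MT-")   # logical vector, one value per gene
mt_genes <- genes[is_mt]                # use it as a logical index
mt_index <- str_which(genes, "^MT-")    # positions of the matching genes
n_mt     <- sum(is_mt)                  # number of matching genes

# str_which(x, p) gives the same positions as which(str_detect(x, p))
same_index <- all(mt_index == which(is_mt))
```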

str_locate and str_locate_all return the start and end positions of pattern;
the difference is that str_locate returns only the first match in each string;
str_locate_all returns every match in each string;

> x <- c("apple","banana","pear")
> str_locate(x,"a")
     start end
[1,]     1   1
[2,]     2   2
[3,]     3   3
> str_locate(x,"an")
     start end
[1,]    NA  NA
[2,]     2   3
[3,]    NA  NA
> str_locate_all(x,"a")
[[1]]
     start end
[1,]     1   1

[[2]]
     start end
[1,]     2   2
[2,]     4   4
[3,]     6   6

[[3]]
     start end
[1,]     3   3

> str_locate_all(x,"an")
[[1]]
     start end

[[2]]
     start end
[1,]     2   3
[2,]     4   5

[[3]]
     start end

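str_view and str_view_all both show the matched pattern in a string visually;

  • str_view highlights only the first match in each string;
  • str_view_all highlights every match in each string;
  • (In stringr 1.5.0 and later, str_view highlights every match itself and str_view_all is deprecated.)
  • Both functions are well worth learning: when you write your own regular expression, they show clearly whether it matches the string the way you intended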
> x <- c("apple","banana","pear")
> str_view(x,"a")
> x <- c("apple","banana","pear")
> str_view_all(x,"a")
> string <- "Homo sapiens intraflagellar transport 80 homolog (Chlamydomonas) (IFT80), mRNA"
> pattern <- "\\([A-Za-z0-9]*\\)"
> str_view_all(string,pattern)
2. String extraction functions

str_sub extracts or replaces part of a string, given start and end positions

> x <- c("apple","banana","pear")
> str_sub(x,1,3)
[1] "app" "ban" "pea"
> # A negative index counts from the end of the string
> str_sub(x,-3,-1)
[1] "ple" "ana" "ear"
> # Replace the extracted part with "a"; this modifies x itself
> str_sub(x,1,3) <- "a"
> x
[1] "ale"  "aana" "ar"

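str_subset returns the strings that contain pattern

  • This differs from the string matching functions above, which return either TRUE/FALSE (e.g. str_detect) or numbers (e.g. str_count); str_subset returns the matching strings themselves.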
> x <- c("apple","banana","pear")
> str_subset(x,"ap")
[1] "apple"
> str_subset(x,"an")
[1] "banana"
> str_subset(x,"a")
[1] "apple"  "banana" "pear"  
> # With negate = TRUE, the strings that do not match are returned
> str_subset(x,"ap",negate = T)
[1] "banana" "pear" 

str_extract returns the first match of pattern in each string
str_extract_all returns every match of pattern in each string
In str_extract_all, simplify defaults to FALSE and a list is returned; with simplify = TRUE a matrix is returned

> x <- c("apple","banana","pear")
> str_extract(x,"a")
[1] "a" "a" "a"
> str_extract_all(x,"a")
[[1]]
[1] "a"

[[2]]
[1] "a" "a" "a"

[[3]]
[1] "a"

> str_extract_all(x,"a",simplify = T)
     [,1] [,2] [,3]
[1,] "a"  ""   ""  
[2,] "a"  "a"  "a" 
[3,] "a"  ""   ""  
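
With a real pattern instead of a single letter, the return shapes matter more, especially when some strings have no match. A minimal sketch with made-up strings:

```r
library(stringr)

x <- c("IFT80", "chr3 q25", "none")

first_num <- str_extract(x, "\\d+")        # character vector; NA where nothing matches
all_num   <- str_extract_all(x, "\\d+")    # list; character(0) where nothing matches
as_matrix <- str_extract_all(x, "\\d+", simplify = TRUE)   # matrix padded with ""
```

The vector form is convenient for adding a column to a data frame, while the list form keeps all matches when their number varies between strings.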

str_match returns the first match of pattern in each string, as a matrix
str_match_all returns every match of pattern in each string, as a list of matrices

> x <- c("apple","banana","pear")
> str_match(x,"a")
     [,1]
[1,] "a" 
[2,] "a" 
[3,] "a" 
> str_match_all(x,"a")
[[1]]
     [,1]
[1,] "a" 

[[2]]
     [,1]
[1,] "a" 
[2,] "a" 
[3,] "a" 

[[3]]
     [,1]
[1,] "a" 
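
The matrix form becomes useful once the pattern contains capture groups: each group gets its own column. A minimal sketch with made-up "key=value" strings (the third element deliberately does not match):

```r
library(stringr)

x <- c("gene=IFT80", "species=Homo", "malformed")

# Column 1: full match; column 2: first group; column 3: second group
m <- str_match(x, "(\\w+)=(\\w+)")

keys   <- m[, 2]
values <- m[, 3]
```

Strings that do not match produce a row of NA, so the result always has one row per input string.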
3. String length functions

str_length returns the length of each string

> x <- c("apple","banana","pear")
> str_length(x)
[1] 5 6 4

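str_pad pads a string with a character

  • width is the total length of the string after padding; if width is shorter than the string itself, nothing is added. Remember that str_pad never makes a string shorter;
  • side is the side to pad on; the default is "left";
  • pad is the character to pad with; it must be a single character;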
> str_pad(c("a", "abc", "abcdef"), 10,pad="a")
[1] "aaaaaaaaaa" "aaaaaaaabc" "aaaaabcdef"
> str_pad(c("a", "abc", "abcdef"), 10,pad="k")
[1] "kkkkkkkkka" "kkkkkkkabc" "kkkkabcdef"
> str_pad(c("a", "abc", "abcdef"), 10,side="left",pad="k")
[1] "kkkkkkkkka" "kkkkkkkabc" "kkkkabcdef"
> str_pad(c("a", "abc", "abcdef"), 10,side="right",pad="k")
[1] "akkkkkkkkk" "abckkkkkkk" "abcdefkkkk"
> str_pad(c("a", "abc", "abcdef"), 10,side="both",pad="k")
[1] "kkkkakkkkk" "kkkabckkkk" "kkabcdefkk"
> str_pad(c("a", "abc", "abcdef"), 5,pad="k")
[1] "kkkka"  "kkabc"  "abcdef"
> str_pad(c("a", "abc", "abcdef"), 1,pad="k")
[1] "a"      "abc"    "abcdef"
> str_pad(c("aa", "abc", "abcdef"), 1,pad="k")
[1] "aa"     "abc"    "abcdef"
> str_pad(c("a", "abc", "abcdef"), c(1, 2, 3),pad="k")
[1] "a"      "abc"    "abcdef"
> str_pad(c("a", "abc", "abcdef"), c(2, 4, 7),pad="k")
[1] "ka"      "kabc"    "kabcdef"
> str_pad(c("a", "abc", "abcdef"), c(2, 4, 7),pad=c("k","l","m"))
[1] "ka"      "labc"    "mabcdef"
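
A common practical use of str_pad is zero-padding identifiers so that they all have the same width. A minimal sketch with made-up IDs:

```r
library(stringr)

ids <- c("1", "25", "300")

# Pad on the left with "0" up to a total width of 5
padded <- str_pad(ids, width = 5, side = "left", pad = "0")

# Pad on the right, e.g. to align labels in plain-text output
labels <- str_pad(c("a", "bb"), width = 4, side = "right", pad = ".")
```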

str_trim removes whitespace from the ends of a string

  • side can be "both", "left" or "right"; the default is "both"
> str_trim("  String with trailing and leading white space\t")
[1] "String with trailing and leading white space"
> str_trim("\n\nString with trailing and leading white space\n\n")
[1] "String with trailing and leading white space"

str_squish does the same job as str_trim, but besides removing whitespace at the start and end of the string it also collapses repeated whitespace inside the string into a single space, which str_trim cannot do.

> str_trim("\n\nString with excess,  trailing and leading white   space\n\n")
[1] "String with excess,  trailing and leading white   space"
> str_squish("\n\nString with excess,  trailing and leading white   space\n\n")
[1] "String with excess, trailing and leading white space"

str_trunc truncates a string to a given width

> x <- "This string is moderately long"
> str_trunc(x, 20, "right")
[1] "This string is mo..."
> str_trunc(x, 20, "left")
[1] "...s moderately long"
> str_trunc(x, 20, "center")
[1] "This stri...ely long"
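4. String transformation functions

str_replace replaces pattern with a new string, first match only
str_replace_all replaces every match of pattern
str_replace_na turns missing values into the string "NA", so that na.omit no longer removes them

  • These functions are very handy and worth learning well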
> x <- c("apple","banana","pear")
> # Replace a with k
> str_replace(x,"a","k")
[1] "kpple"  "bknana" "pekr"  
> str_replace_all(x,"a","k")
[1] "kpple"  "bknknk" "pekr" 

> x <- c(NA, "abc", "def")
> x
[1] NA    "abc" "def"
> is.na(x)
[1]  TRUE FALSE FALSE
> table(is.na(x))
FALSE  TRUE 
    2     1
> na.omit(x)
[1] "abc" "def"
attr(,"na.action")
[1] 1
attr(,"class")
[1] "omit"
> str_replace_na(x)
[1] "NA"  "abc" "def"
> x <- str_replace_na(x)
> x
[1] "NA"  "abc" "def"
> na.omit(x)
[1] "NA"  "abc" "def"
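
One place where str_replace_na helps is string joining, because str_c propagates missing values. A minimal sketch; the "id-" prefix and the "missing" label are made up for illustration.

```r
library(stringr)

x <- c(NA, "abc", "def")

with_na    <- str_c("id-", x)                   # NA stays NA after joining
without_na <- str_c("id-", str_replace_na(x))   # NA becomes the text "NA" first

# The replacement text can be chosen with the replacement argument
custom <- str_replace_na(x, replacement = "missing")
```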

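In str_replace and str_replace_all, the replacement can use \\1, \\2 and so on to refer to the capture groups of the pattern

  • Note that the second element of the data does not match pattern, so it is returned unchanged, with no replacement made.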
> str_replace_all(c("123,456", "011"), 
+                 "([[:digit:]]+),([[:digit:]]+)", "\\2,\\1")
[1] "456,123" "011"
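
The same backreference technique can reorder the parts of a date. A minimal sketch with made-up input; the second element does not match and is returned unchanged.

```r
library(stringr)

dates <- c("2024-01-31", "not a date")

# Three capture groups: year, month, day; the replacement reorders them
reordered <- str_replace(dates, "(\\d{4})-(\\d{2})-(\\d{2})", "\\3/\\2/\\1")
```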

str_to_upper converts lowercase letters to uppercase
str_to_lower converts uppercase letters to lowercase

> x <- c("apple","banana","pear")
> str_to_upper(x)
[1] "APPLE"  "BANANA" "PEAR"  
> str_to_lower(x)
[1] "apple"  "banana" "pear"

str_remove removes the first match of pattern from each string
str_remove_all removes every match of pattern from each string

> fruits <- c("one apple", "two pears", "three bananas")
> str_remove(fruits, "[aeiou]")
[1] "ne apple"     "tw pears"     "thre bananas"
> str_remove_all(fruits, "[aeiou]")
[1] "n ppl"    "tw prs"   "thr bnns"
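5. String joining/splitting functions

str_c joins several strings together

  • sep: joins the input vectors element by element, giving a vector of longer strings; sep is the separator placed between the joined pieces
  • collapse: joins everything into one single string; collapse is the separator placed between the elements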
> letters
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u"
[22] "v" "w" "x" "y" "z"
> str_c(letters,letters,sep = "-")
 [1] "a-a" "b-b" "c-c" "d-d" "e-e" "f-f" "g-g" "h-h" "i-i" "j-j" "k-k" "l-l" "m-m" "n-n"
[15] "o-o" "p-p" "q-q" "r-r" "s-s" "t-t" "u-u" "v-v" "w-w" "x-x" "y-y" "z-z"
> str_c(letters,letters,collapse = "-")
[1] "aa-bb-cc-dd-ee-ff-gg-hh-ii-jj-kk-ll-mm-nn-oo-pp-qq-rr-ss-tt-uu-vv-ww-xx-yy-zz"
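
sep and collapse can also be used together: sep is applied first, within each element, and collapse then joins the resulting elements. A minimal sketch with made-up vectors:

```r
library(stringr)

a <- c("a", "b", "c")
b <- c("x", "y", "z")

joined   <- str_c(a, b, sep = "_")                   # element-wise, still length 3
one      <- str_c(a, b, sep = "_", collapse = "|")   # then collapsed to one string
recycled <- str_c("gene", a, sep = "-")              # a length-1 input is recycled
```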

str_dup repeats strings

> fruit <- c("apple", "pear", "banana")
> str_dup(fruit, 2)
[1] "appleapple"   "pearpear"     "bananabanana"
> str_dup(fruit, 1:3)
[1] "apple"              "pearpear"           "bananabananabanana"
> str_c("ba", str_dup("na", 0:5))
[1] "ba"           "bana"         "banana"       "bananana"     "banananana"  
[6] "bananananana"

str_split splits strings at pattern

  • With simplify = TRUE it returns a matrix

str_split_fixed splits each string at pattern into a fixed number of pieces

> fruits <- c(
+     "apples and oranges and pears and bananas",
+     "pineapples and mangos and guavas"
+ )
> 
> str_split(fruits, " and ")
[[1]]
[1] "apples"  "oranges" "pears"   "bananas"

[[2]]
[1] "pineapples" "mangos"     "guavas"    

> str_split(fruits, " and ", simplify = TRUE)
     [,1]         [,2]      [,3]     [,4]     
[1,] "apples"     "oranges" "pears"  "bananas"
[2,] "pineapples" "mangos"  "guavas" ""       
> 
> # Specify n to restrict the number of possible matches
> str_split(fruits, " and ", n = 3)
[[1]]
[1] "apples"            "oranges"           "pears and bananas"

[[2]]
[1] "pineapples" "mangos"     "guavas"    

> str_split(fruits, " and ", n = 2)
[[1]]
[1] "apples"                        "oranges and pears and bananas"

[[2]]
[1] "pineapples"        "mangos and guavas"

> # If n greater than number of pieces, no padding occurs
> str_split(fruits, " and ", n = 5)
[[1]]
[1] "apples"  "oranges" "pears"   "bananas"

[[2]]
[1] "pineapples" "mangos"     "guavas"    

> # Use fixed to return a character matrix
> str_split_fixed(fruits, " and ", 3)
     [,1]         [,2]      [,3]               
[1,] "apples"     "oranges" "pears and bananas"
[2,] "pineapples" "mangos"  "guavas"           
> str_split_fixed(fruits, " and ", 4)
     [,1]         [,2]      [,3]     [,4]     
[1,] "apples"     "oranges" "pears"  "bananas"
[2,] "pineapples" "mangos"  "guavas" ""       
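6. String sorting functions

str_order and str_sort both sort strings; the difference between them is that the former returns the indices of the sorted order, while the latter returns the sorted values themselves.

  • decreasing: the sort direction; the default is FALSE, i.e. ascending;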
> x <- c("a", "cc", "bbb", "dddd")
> str_sort(x)
[1] "a"    "bbb"  "cc"   "dddd"
> str_order(x)
[1] 1 3 2 4
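
A few more sorting options, sketched with made-up labels: decreasing reverses the order, and numeric = TRUE sorts digits inside the strings by their numeric value rather than character by character.

```r
library(stringr)

x <- c("x10", "x2", "x1")

alphabetical <- str_sort(x)                      # "x10" sorts before "x2"
natural      <- str_sort(x, numeric = TRUE)      # digits compared as numbers
reversed     <- str_sort(x, decreasing = TRUE)   # descending order

# str_order gives the permutation that str_sort applies
same_result <- identical(x[str_order(x)], str_sort(x))
```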
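Summary of the Most Important stringr Functions
  • stringr contains many functions, but a lot of them are not used very often in practice
  • To get the most out of your study time, I have listed the stringr functions I use most frequently. Once you have mastered these, you should be able to handle most everyday string-processing needs in R comfortably.
String matching functions: str_detect, str_which, str_view, str_view_all
String extraction functions: str_subset, str_extract, str_extract_all, str_match, str_match_all
String transformation functions: str_replace, str_replace_all, str_remove, str_remove_all
String joining/splitting functions: str_c, str_split, str_split_fixed
  • Summarized as a table:
Function | What it does | Base R equivalent | Accepts a regular expression
str_detect | Checks whether pattern occurs in each string | grepl | yes
str_which | Returns the indices of the strings that contain pattern | - | yes
str_view | Visualizes the first match of pattern in each string | - | yes
str_view_all | Visualizes every match of pattern in each string | - | yes
str_subset | Returns the strings that contain pattern | - | yes
str_extract | Returns the first match of pattern in each string | regmatches | yes
str_extract_all | Returns every match of pattern in each string | regmatches | yes
str_match | Returns the first match of pattern in each string, with capture groups | - | yes
str_match_all | Returns every match of pattern in each string, with capture groups | - | yes
str_replace | Replaces the first match of pattern in each string | sub | yes
str_replace_all | Replaces every match of pattern in each string | gsub | yes
str_remove | Removes the first match of pattern from each string | - | yes
str_remove_all | Removes every match of pattern from each string | - | yes
str_c | Joins several strings together | paste or paste0 | not applicable (no pattern argument)
str_split | Splits strings at pattern | strsplit | yes
str_split_fixed | Splits each string at pattern into a fixed number of pieces | - | yes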
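  • Finally, let's return to the example from the very beginning of this post.
  • Suppose I want to extract "Chlamydomonas" and "IFT80" from the string "Homo sapiens intraflagellar transport 80 homolog (Chlamydomonas) (IFT80), mRNA".
  • At the beginning of the post we used str_match_all.
  • Having worked through the material above, you probably have other solutions in mind by now:
  1. For example, str_split can pull out the text inside the parentheses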
> string <- "Homo sapiens intraflagellar transport 80 homolog (Chlamydomonas) (IFT80), mRNA"
> str_split(string,"\\(|\\)",simplify = T)
     [,1]                                                [,2]            [,3] [,4]   
[1,] "Homo sapiens intraflagellar transport 80 homolog " "Chlamydomonas" " "  "IFT80"
     [,5]    
[1,] ", mRNA"
> str_split(string,"\\(|\\)",simplify = T)[,c(2,4)]
[1] "Chlamydomonas" "IFT80"        
  2. Or str_extract_all combined with str_remove_all can pull out the text inside the parentheses
> string <- "Homo sapiens intraflagellar transport 80 homolog (Chlamydomonas) (IFT80), mRNA"
> str_extract_all(string,"\\([A-Za-z0-9]*\\)")
[[1]]
[1] "(Chlamydomonas)" "(IFT80)"
> a <- str_extract_all(string,"\\([A-Za-z0-9]*\\)")
> str_remove_all(a[[1]], "\\(|\\)")
[1] "Chlamydomonas" "IFT80"        
  3. You can even use str_replace_all to extract the text inside the parentheses
> string <- "Homo sapiens intraflagellar transport 80 homolog (Chlamydomonas) (IFT80), mRNA"
> pattern <- "([:print:]+)(\\()([:print:]+)(\\))([:print:]+)(\\()([:print:]+)(\\))([:print:]+)"
> str_replace_all(string, pattern,"\\3")
[1] "Chlamydomonas"
> str_replace_all(string, pattern,"\\7")
[1] "IFT80"
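  • My personal advice, though, is not to make regular expressions too fancy.
  • For beginners, the longer a regular expression is, the easier it is to get wrong.
  • I suggest beginners start as simply as possible. As long as you reach your goal in the end, such as extracting a particular string, it is perfectly fine to use a few extra stringr functions along the way.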
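
In that spirit, the capture-group solution from the beginning of the post can be wrapped in a small function so that it works on a whole vector of strings. This is only a sketch: extract_parens is a hypothetical helper name, and it assumes, as in the example, that the parentheses contain only letters and digits.

```r
library(stringr)

# Hypothetical helper: for each input string, return the text found inside
# each pair of parentheses (letters and digits only, as in the example).
extract_parens <- function(x) {
  matches <- str_match_all(x, "\\(([A-Za-z0-9]*)\\)")
  # Column 2 of each matrix holds the capture group, i.e. the inner text
  lapply(matches, function(m) m[, 2])
}

string <- "Homo sapiens intraflagellar transport 80 homolog (Chlamydomonas) (IFT80), mRNA"
res <- extract_parens(c(string, "no parentheses here"))
res
```

A string without parentheses yields an empty character vector, so the result always has one list element per input string.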
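References:

R regular expressions
The R language and regular expressions
So that's the culprit! Using regular expressions to catch a hidden error that raised no warning in a bioinformatics analysis
R language tutorial
R for Data Science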
