Preface
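The stringr package for R was developed by the well-known Hadley Wickham as a higher-level wrapper around stringi. For his contributions to data-processing and visualization tools, Wickham received the John Chambers Award, which was established specifically for statistical computing, and the recognition reportedly drew complaints from a number of statisticians at the time. His ggplot2 package showed that plotting in R could be both simple and attractive, and it won R a large number of users. Finding data wrangling in R not convenient enough, he went on to assemble the tidyverse, a coherent collection of his packages for data processing that includes ggplot2, readr, purrr, dplyr, tidyr, tibble, stringr, and forcats. (reshape2 is an older Wickham package and is not part of the tidyverse.) He also wrote the book R for Data Science (published in Chinese as 《R數據科學》), which covers the tidyverse in detail. At the time of writing he is Chief Scientist at RStudio, where Yihui Xie, one of the best-known R developers from China, also works. This article focuses on the stringr package.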
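Definition and an example
What is a regular expression? It is a way of describing the features of a piece of text so that text with those features can be found and extracted.
- Suppose we want to extract "Chlamydomonas" and "IFT80" from the string "Homo sapiens intraflagellar transport 80 homolog (Chlamydomonas) (IFT80), mRNA".
- Looking at what the two substrings have in common, both sit inside parentheses.
- "Chlamydomonas" consists of uppercase and lowercase English letters.
- "IFT80" consists of uppercase English letters and digits.
- To summarize the shared features: 1. both are enclosed in parentheses; 2. both are built from two of the three character types (uppercase letters, lowercase letters, and digits).
- We can therefore describe both substrings with the pattern \\([A-Za-z0-9]*\\).
- Why \\([A-Za-z0-9]*\\)?
- It may help to read the rest of the article first and then come back to this explanation.
- First, both target substrings sit inside parentheses.
- We use the parentheses themselves as a feature to match, which pins down the target more precisely. Parentheses are regex metacharacters, so each one must be escaped with a backslash; and because the backslash is itself the escape character inside an R string literal, it has to be doubled. A pair of literal parentheses is therefore written as:
pattern = "\\(\\)"
- The parentheses must contain something, namely uppercase letters, lowercase letters, and digits, which we write as [A-Za-z0-9]. A-Z is the set of uppercase letters, a-z the set of lowercase letters, and 0-9 the set of digits.
- The content may be any mix of those characters in any quantity, so we append *, which means zero or more repetitions.
- The full pattern is therefore:
pattern = "\\([A-Za-z0-9]*\\)"
- Now let us try it with the str_match function: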
> library(stringr)
> string <- "Homo sapiens intraflagellar transport 80 homolog (Chlamydomonas) (IFT80), mRNA"
> string
[1] "Homo sapiens intraflagellar transport 80 homolog (Chlamydomonas) (IFT80), mRNA"
> pattern <- "\\([A-Za-z0-9]*\\)"
> pattern
[1] "\\([A-Za-z0-9]*\\)"
> str_match(string,pattern)
[,1]
[1,] "(Chlamydomonas)"
> str_match_all(string,pattern)
[[1]]
[,1]
[1,] "(Chlamydomonas)"
[2,] "(IFT80)"
- str_match returns only the first match in the string, whereas str_match_all returns every match.
- Even with str_match_all, both returned strings still include the parentheses, which we do not want in the final result. What can we do?
- The usual approach is to strip the parentheses with str_remove_all:
> string <- "Homo sapiens intraflagellar transport 80 homolog (Chlamydomonas) (IFT80), mRNA"
> pattern <- "\\([A-Za-z0-9]*\\)"
> a <- str_match_all(string,pattern) # str_match_all returns a list
> pattern <- "\\(|\\)"
> str_remove_all(a[[1]], pattern)
[1] "Chlamydomonas" "IFT80"
- Reading ?str_match, however, reveals a simpler and more efficient way:
For str_match, a character matrix. First column is the complete match, followed by one column for each capture group. For str_match_all, a list of character matrices.
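- str_match returns the complete match in the first column, followed by one column for each capture group in the pattern.
- What is a capture group? Regex groups are either capturing groups or non-capturing groups. Wrapping part of a regular expression in a pair of parentheses creates a capture group: (\d) has one group, (\d)(\d) has two, (\d)(\d)(\d) has three, and so on, one group per pair of unescaped parentheses.
- In our example we want to pull out the content between the parentheses, [A-Za-z0-9]*, so we wrap it in an extra pair of parentheses to form a capture group: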
> pattern = "\\([A-Za-z0-9]*\\)"
> pattern2 = "\\(([A-Za-z0-9]*)\\)"
> str_match_all(string,pattern2)
[[1]]
[,1] [,2]
[1,] "(Chlamydomonas)" "Chlamydomonas"
[2,] "(IFT80)" "IFT80"
> str_match_all(string,pattern2)[[1]][,2]
[1] "Chlamydomonas" "IFT80"
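As a complement to the capture-group approach, the same extraction can also be sketched with lookarounds, which require the parentheses to be present without making them part of the match. This is an illustrative alternative, not the method used in the article; it assumes stringr's default ICU regex engine, and it uses + instead of * so that an empty pair of parentheses would not produce an empty match.

```r
library(stringr)

string <- "Homo sapiens intraflagellar transport 80 homolog (Chlamydomonas) (IFT80), mRNA"

# (?<=\\() : the match must be preceded by a literal "("
# (?=\\))  : the match must be followed by a literal ")"
inside <- str_extract_all(string, "(?<=\\()[A-Za-z0-9]+(?=\\))")[[1]]
print(inside)
# [1] "Chlamydomonas" "IFT80"
```

Because the lookarounds consume no characters, no second clean-up step is needed.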
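Regular expressions
①. Base R functions such as grep, sub, and regexpr support three regular-expression modes:
1. Extended regular expressions: the default, and the most commonly used;
2. Perl-style regular expressions: set perl = TRUE;
3. Literal (fixed) strings: set fixed = TRUE.
(stringr itself uses the ICU regex engine through stringi; its counterparts are the regex() and fixed() pattern modifiers.)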
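A small sketch of how the three base R modes differ on the same input. The test vector and patterns are made up for illustration.

```r
x <- c("abc", "a.c")

# Default (extended regex): "." is a wildcard, so both strings match
ere_match <- grepl("a.c", x)
print(ere_match)    # TRUE TRUE

# fixed = TRUE: the pattern is a literal string, so only "a.c" matches
fixed_match <- grepl("a.c", x, fixed = TRUE)
print(fixed_match)  # FALSE TRUE

# perl = TRUE: enables Perl-only syntax such as lookbehind
perl_match <- grepl("(?<=a)\\.", x, perl = TRUE)
print(perl_match)   # FALSE TRUE
```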
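②. The basic metacharacters in R (they mean the same as in Python):
. \ | ( ) [ ] ^ $ * + ?
. matches any single character. Base R's default engine also matches a newline, whereas stringr does not unless regex(dotall = TRUE) is used;
\ escapes a metacharacter so that it is matched literally. Because \ is also the escape character in R string literals, a regex backslash must be written as \\ in R code;
| alternation: A|B matches either A or B, and B is tried only if A fails;
() grouping: the pattern inside the parentheses is treated as a single unit
[] character class: matches any one of the characters listed inside the brackets
^ matches the start of the string. For example, ^MT- matches strings that begin with "MT-" (commonly used to pick out mitochondrial genes in single-cell sequencing). As the first character inside [], it negates the class instead: [^6] matches any character other than 6
$ matches the end of the string. Inside [] it loses its special meaning: [akm$] matches 'a', 'k', 'm', or '$'.
Quantifiers: * + ? {m} {m,n} {m,}
* the preceding item matches zero or more times
+ the preceding item matches one or more times
? the preceding item matches zero or one time; it is also appended to another quantifier to make it non-greedy (lazy)
{m} the preceding item matches exactly m times
{m,n} the preceding item matches between m and n times, as many as possible
{m,} the preceding item matches at least m times, as many as possible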
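The quantifiers above can be tried out with a short sketch. The input strings are invented for illustration; the anchors ^ and $ are added so that the counts apply to the whole string.

```r
library(stringr)

html <- "<b>bold</b>"
# Greedy: .+ grabs as much as possible, up to the last ">"
greedy <- str_extract(html, "<.+>")
# Lazy: .+? stops at the first ">"
lazy <- str_extract(html, "<.+?>")
print(c(greedy, lazy))   # "<b>bold</b>" "<b>"

x <- c("a", "aa", "aaa", "aaaa")
exactly_two  <- str_detect(x, "^a{2}$")    # FALSE TRUE FALSE FALSE
two_to_three <- str_detect(x, "^a{2,3}$")  # FALSE TRUE TRUE FALSE
three_plus   <- str_detect(x, "^a{3,}$")   # FALSE FALSE TRUE TRUE
print(rbind(exactly_two, two_to_three, three_plus))
```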
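③. Escaping in R
To match a metacharacter itself, such as ? or *, its special meaning has to be switched off with the escape character \. The regexes are \? and \*, which are written "\\?" and "\\*" in R code. To match a literal backslash, the regex is \\, which is written "\\\\" in R code.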
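A minimal sketch of these escaping rules with stringr. The example strings are made up; fixed() is shown as an alternative that treats the whole pattern literally.

```r
library(stringr)

x <- c("what?", "what", "a*b", "C:\\temp")   # the last element is C:\temp

has_question  <- str_detect(x, "\\?")     # regex \?  : a literal "?"
has_star      <- str_detect(x, "\\*")     # regex \*  : a literal "*"
has_backslash <- str_detect(x, "\\\\")    # regex \\  : a literal backslash
# fixed() switches the regex engine off, so no escaping is needed
has_question_fixed <- str_detect(x, fixed("?"))

print(has_question)    # TRUE FALSE FALSE FALSE
print(has_star)        # FALSE FALSE TRUE FALSE
print(has_backslash)   # FALSE FALSE FALSE TRUE
```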
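④. Predefined character classes in R (in base R these names are used inside a bracket expression, for example [[:digit:]])
| Code | Meaning |
|---|---|
| [:digit:] | Digits: 0-9 |
| [:lower:] | Lowercase letters: a-z |
| [:upper:] | Uppercase letters: A-Z |
| [:alpha:] | Letters: a-z and A-Z |
| [:alnum:] | All letters and digits |
| [:punct:] | Punctuation, such as . , ; |
| [:graph:] | Graphical characters, i.e. [:alnum:] and [:punct:] |
| [:blank:] | Blank characters, i.e. space and tab |
| [:space:] | Space, tab, newline, and other space characters |
| [:print:] | Printable characters, i.e. [:alnum:], [:punct:], and [:space:] |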
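The predefined classes can be exercised with a short sketch like the one below. The input string is invented for illustration, and the classes are written in the bracketed form [[:digit:]], which works in both base R and stringr.

```r
library(stringr)

s <- "IFT80, mRNA; chr3!"

# Runs of digits
digits <- str_extract_all(s, "[[:digit:]]+")[[1]]
# Strip punctuation
no_punct <- str_remove_all(s, "[[:punct:]]")
# Count uppercase letters (I, F, T, R, N, A)
n_upper <- str_count(s, "[[:upper:]]")

print(digits)     # "80" "3"
print(no_punct)   # "IFT80 mRNA chr3"
print(n_upper)    # 6
```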
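⑤. Shorthand escapes for character classes in R
| Code | Meaning |
|---|---|
| \w | Word characters, equivalent to [[:alnum:]_] |
| \W | Non-word characters, equivalent to [^[:alnum:]_] |
| \s | Whitespace characters, equivalent to [[:space:]] |
| \S | Non-whitespace characters, equivalent to [^[:space:]] |
| \d | Digits, equivalent to [[:digit:]] |
| \D | Non-digits, equivalent to [^[:digit:]] |
| \b | Word edge (the position where a word starts or ends) |
| \B | Not a word edge (any position that is not the start or end of a word) |
| \< | Word beginning (base R's default engine) |
| \> | Word end (base R's default engine) |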
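A brief sketch of the shorthand escapes in stringr. The strings are made up; note that each backslash is doubled in R code, and that \w also matches the underscore.

```r
library(stringr)

# \w+ : runs of word characters; "_" counts, "-" does not
words <- str_extract_all("gene_1 IFT-80", "\\w+")[[1]]
# \s : whitespace (here one space, one tab, one newline)
n_space <- str_count("a b\tc\nd", "\\s")
# \b\d+\b : digit runs that form a whole word ("53" in "TP53" does not)
whole_numbers <- str_extract_all("TP53 has 11 exons", "\\b\\d+\\b")[[1]]

print(words)          # "gene_1" "IFT" "80"
print(n_space)        # 3
print(whole_numbers)  # "11"
```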
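The stringr package
①. Installing stringr
# Option 1: install the released version from CRAN:
install.packages("stringr")
# Option 2: install the development version from GitHub to try the newest features:
install.packages("devtools")
devtools::install_github("tidyverse/stringr")
# Option 3: install the tidyverse, which brings stringr along with it
install.packages("tidyverse")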
②. stringr functions
The functions in stringr fall into six main groups:
- Matching functions: str_detect, str_which, str_count, str_locate, str_locate_all, str_view, str_view_all
- Subsetting functions: str_sub, str_subset, str_extract, str_extract_all, str_match, str_match_all
- Length-control functions: str_length, str_pad, str_trunc, str_trim, str_squish
- Mutation functions: str_replace, str_replace_all, str_replace_na, str_to_lower, str_to_upper, str_remove, str_remove_all
- Joining/splitting functions: str_c, str_dup, str_split, str_split_fixed
- Sorting functions: str_sort, str_order
Next, we walk through these functions one by one.
1. String matching functions
str_detect checks whether the pattern occurs in each string and returns TRUE or FALSE
> x <- c("apple","banana","pear")
> str_detect(x,"a")
[1] TRUE TRUE TRUE
str_count counts how many times the pattern occurs in each string
> x <- c("apple","banana","pear")
> str_count(x,"a")
[1] 1 3 1
str_which returns the indices of the strings that contain the pattern
> x <- c("apple","banana","pear")
> str_which(x,"a")
[1] 1 2 3
> str_which(x,"ar")
[1] 3
> str_which(x,"an")
[1] 2
> str_which(x,"ap")
[1] 1
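To tie these functions back to the ^MT- pattern mentioned earlier, here is a small sketch that picks out mitochondrial genes. The gene vector is a made-up example.

```r
library(stringr)

genes <- c("MT-CO1", "ACTB", "MT-ND1", "GAPDH", "MTOR")

# Logical vector: does the name start with "MT-"?
is_mito <- str_detect(genes, "^MT-")
# Integer positions of the matching names
mito_idx <- str_which(genes, "^MT-")

print(is_mito)          # TRUE FALSE TRUE FALSE FALSE
print(mito_idx)         # 1 3
print(genes[is_mito])   # "MT-CO1" "MT-ND1"
```

"MTOR" is not selected because the pattern requires the hyphen after "MT".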
str_locate and str_locate_all return the start and end positions of the pattern;
str_locate reports only the first match in each string;
str_locate_all reports every match in each string;
> x <- c("apple","banana","pear")
> str_locate(x,"a")
start end
[1,] 1 1
[2,] 2 2
[3,] 3 3
> str_locate(x,"an")
start end
[1,] NA NA
[2,] 2 3
[3,] NA NA
> str_locate_all(x,"a")
[[1]]
start end
[1,] 1 1
[[2]]
start end
[1,] 2 2
[2,] 4 4
[3,] 6 6
[[3]]
start end
[1,] 3 3
> str_locate_all(x,"an")
[[1]]
start end
[[2]]
start end
[1,] 2 3
[2,] 4 5
[[3]]
start end
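str_view and str_view_all display the matches in each string visually;
- str_view highlights only the first match in each string;
- str_view_all highlights every match in each string;
- In stringr 1.5.0 and later, str_view highlights every match and str_view_all is deprecated;
- Both are well worth learning: when writing a regular expression, they show at a glance whether the pattern matches the strings as intended.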
> x <- c("apple","banana","pear")
> str_view(x,"a")

> x <- c("apple","banana","pear")
> str_view_all(x,"a")

> string <- "Homo sapiens intraflagellar transport 80 homolog (Chlamydomonas) (IFT80), mRNA"
> pattern <- "\\([A-Za-z0-9]*\\)"
> str_view_all(string,pattern)

2. String subsetting functions
str_sub extracts or replaces a substring given start and end positions
> x <- c("apple","banana","pear")
> str_sub(x,1,3)
[1] "app" "ban" "pea"
> # negative positions count from the end of the string
> str_sub(x,-3,-1)
[1] "ple" "ana" "ear"
> # assigning "a" replaces the extracted substring, so x itself is modified
> str_sub(x,1,3) <- "a"
> x
[1] "ale" "aana" "ar"
str_subset returns the strings that contain the pattern
- Unlike the matching functions above, which return either TRUE/FALSE (for example str_detect) or numbers (for example str_count), it returns the matching strings themselves.
> x <- c("apple","banana","pear")
> str_subset(x,"ap")
[1] "apple"
> str_subset(x,"an")
[1] "banana"
> str_subset(x,"a")
[1] "apple" "banana" "pear"
> # with negate = TRUE, the strings that do not match are returned
> str_subset(x,"ap",negate = T)
[1] "banana" "pear"
str_extract returns the first match in each string
str_extract_all returns every match in each string
In str_extract_all, simplify defaults to FALSE and a list is returned; with simplify = TRUE a matrix is returned
> x <- c("apple","banana","pear")
> str_extract(x,"a")
[1] "a" "a" "a"
> str_extract_all(x,"a")
[[1]]
[1] "a"
[[2]]
[1] "a" "a" "a"
[[3]]
[1] "a"
> str_extract_all(x,"a",simplify = T)
[,1] [,2] [,3]
[1,] "a" "" ""
[2,] "a" "a" "a"
[3,] "a" "" ""
str_match returns the first match in each string, as a matrix
str_match_all returns every match in each string, as a list of matrices
> x <- c("apple","banana","pear")
> str_match(x,"a")
[,1]
[1,] "a"
[2,] "a"
[3,] "a"
> str_match_all(x,"a")
[[1]]
[,1]
[1,] "a"
[[2]]
[,1]
[1,] "a"
[2,] "a"
[3,] "a"
[[3]]
[,1]
[1,] "a"
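With a plain pattern such as "a", str_match looks much like str_extract; the difference shows once the pattern contains capture groups. The sketch below parses made-up genomic region strings; the "chr:start-end" format is an assumption chosen for illustration.

```r
library(stringr)

regions <- c("chr1:100-200", "chrX:5000-6000", "not a region")

# Column 1 is the full match; columns 2-4 are the three capture groups
m <- str_match(regions, "(chr\\w+):(\\d+)-(\\d+)")
chrom <- m[, 2]
start <- as.integer(m[, 3])
end   <- as.integer(m[, 4])

print(dim(m))   # 3 4
print(chrom)    # "chr1" "chrX" NA
print(start)    # 100 5000 NA
print(end)      # 200 6000 NA
```

Strings that do not match produce a row of NA values, so the result keeps one row per input.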
3. String length-control functions
str_length returns the number of characters in each string
> x <- c("apple","banana","pear")
> str_length(x)
[1] 5 6 4
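str_pad pads a string with extra characters
- width is the total length of the string after padding. If width is shorter than the string itself, nothing is added. Note that str_pad never makes a string shorter;
- side is the side on which the padding is added; the default is "left";
- pad is the padding character; it must be a single character;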
> str_pad(c("a", "abc", "abcdef"), 10,pad="a")
[1] "aaaaaaaaaa" "aaaaaaaabc" "aaaaabcdef"
> str_pad(c("a", "abc", "abcdef"), 10,pad="k")
[1] "kkkkkkkkka" "kkkkkkkabc" "kkkkabcdef"
> str_pad(c("a", "abc", "abcdef"), 10,side="left",pad="k")
[1] "kkkkkkkkka" "kkkkkkkabc" "kkkkabcdef"
> str_pad(c("a", "abc", "abcdef"), 10,side="right",pad="k")
[1] "akkkkkkkkk" "abckkkkkkk" "abcdefkkkk"
> str_pad(c("a", "abc", "abcdef"), 10,side="both",pad="k")
[1] "kkkkakkkkk" "kkkabckkkk" "kkabcdefkk"
> str_pad(c("a", "abc", "abcdef"), 5,pad="k")
[1] "kkkka" "kkabc" "abcdef"
> str_pad(c("a", "abc", "abcdef"), 1,pad="k")
[1] "a" "abc" "abcdef"
> str_pad(c("aa", "abc", "abcdef"), 1,pad="k")
[1] "aa" "abc" "abcdef"
> str_pad(c("a", "abc", "abcdef"), c(1, 2, 3),pad="k")
[1] "a" "abc" "abcdef"
> str_pad(c("a", "abc", "abcdef"), c(2, 4, 7),pad="k")
[1] "ka" "kabc" "kabcdef"
> str_pad(c("a", "abc", "abcdef"), c(2, 4, 7),pad=c("k","l","m"))
[1] "ka" "labc" "mabcdef"
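A common practical use of str_pad is zero-padding identifiers so that they sort and align consistently. A minimal sketch, with made-up IDs and a hypothetical "sample_" prefix:

```r
library(stringr)

ids <- c("1", "10", "100")

# Pad on the left with "0" to a total width of 3
padded <- str_pad(ids, width = 3, side = "left", pad = "0")
labels <- str_c("sample_", padded)

print(padded)   # "001" "010" "100"
print(labels)   # "sample_001" "sample_010" "sample_100"
```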
str_trim removes whitespace from the ends of a string
- side can be "both", "left", or "right"; the default is "both"
> str_trim(" String with trailing and leading white space\t")
[1] "String with trailing and leading white space"
> str_trim("\n\nString with trailing and leading white space\n\n")
[1] "String with trailing and leading white space"
str_squish does what str_trim does and, in addition to removing leading and trailing whitespace, collapses repeated whitespace inside the string into a single space, which str_trim cannot do.
> str_trim("\n\nString with excess,  trailing and   leading white   space\n\n")
[1] "String with excess,  trailing and   leading white   space"
> str_squish("\n\nString with excess,  trailing and   leading white   space\n\n")
[1] "String with excess, trailing and leading white space"
str_trunc truncates a string to a given width
> x <- "This string is moderately long"
> str_trunc(x, 20, "right")
[1] "This string is mo..."
> str_trunc(x, 20, "left")
[1] "...s moderately long"
> str_trunc(x, 20, "center")
[1] "This stri...ely long"
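4. String mutation functions
str_replace replaces the first match of the pattern in each string with a new string
str_replace_all replaces every match of the pattern
str_replace_na turns missing values into the string "NA", so that na.omit no longer drops them
- This function is very handy and worth learning well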
> x <- c("apple","banana","pear")
> # replace a with k
> str_replace(x,"a","k")
[1] "kpple" "bknana" "pekr"
> str_replace_all(x,"a","k")
[1] "kpple" "bknknk" "pekr"
> x <- c(NA, "abc", "def")
> x
[1] NA "abc" "def"
> is.na(x)
[1] TRUE FALSE FALSE
> table(is.na(x))
FALSE TRUE
2 1
> na.omit(x)
[1] "abc" "def"
attr(,"na.action")
[1] 1
attr(,"class")
[1] "omit"
> str_replace_na(x)
[1] "NA" "abc" "def"
> x <- str_replace_na(x)
> x
[1] "NA" "abc" "def"
> na.omit(x)
[1] "NA" "abc" "def"
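In str_replace and str_replace_all, the replacement can refer to the capture groups of the pattern as \1, \2, and so on (written "\\1", "\\2" in R code)
- Note that the second element of the data does not match the pattern, so it is returned unchanged.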
> str_replace_all(c("123,456", "011"),
+ "([[:digit:]]+),([[:digit:]]+)", "\\2,\\1")
[1] "456,123" "011"
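The same backreference technique can reorder text fields. The sketch below swaps "Last, First" into "First Last"; the input vector is made up for illustration.

```r
library(stringr)

authors <- c("Wickham, Hadley", "Xie, Yihui", "no comma here")

# \\1 is the first capture group (last name), \\2 the second (first name)
swapped <- str_replace(authors, "(\\w+), (\\w+)", "\\2 \\1")
print(swapped)
# [1] "Hadley Wickham" "Yihui Xie"      "no comma here"
```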
str_to_upper converts lowercase letters to uppercase
str_to_lower converts uppercase letters to lowercase
> x <- c("apple","banana","pear")
> str_to_upper(x)
[1] "APPLE" "BANANA" "PEAR"
> str_to_lower(x)
[1] "apple" "banana" "pear"
str_remove removes the first match of the pattern in each string
str_remove_all removes every match of the pattern in each string
> fruits <- c("one apple", "two pears", "three bananas")
> str_remove(fruits, "[aeiou]")
[1] "ne apple" "tw pears" "thre bananas"
> str_remove_all(fruits, "[aeiou]")
[1] "n ppl" "tw prs" "thr bnns"
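5. String joining/splitting functions
str_c joins multiple strings
- sep: the separator placed between the input vectors when they are joined element by element, giving one string per element
- collapse: the separator used to combine the resulting vector into a single string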
> letters
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u"
[22] "v" "w" "x" "y" "z"
> str_c(letters,letters,sep = "-")
[1] "a-a" "b-b" "c-c" "d-d" "e-e" "f-f" "g-g" "h-h" "i-i" "j-j" "k-k" "l-l" "m-m" "n-n"
[15] "o-o" "p-p" "q-q" "r-r" "s-s" "t-t" "u-u" "v-v" "w-w" "x-x" "y-y" "z-z"
> str_c(letters,letters,collapse = "-")
[1] "aa-bb-cc-dd-ee-ff-gg-hh-ii-jj-kk-ll-mm-nn-oo-pp-qq-rr-ss-tt-uu-vv-ww-xx-yy-zz"
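sep and collapse can also be combined, and it helps to know how str_c treats missing values. A small sketch with made-up vectors:

```r
library(stringr)

a <- c("x", "y", "z")
b <- c("1", "2", "3")

pairs  <- str_c(a, b, sep = "_")                   # element-wise join
joined <- str_c(a, b, sep = "_", collapse = "|")   # then one single string

# A missing value makes the joined result missing as well ...
with_na <- str_c("id", c("1", NA))
# ... unless it is first converted to the string "NA"
safe <- str_c("id", str_replace_na(c("1", NA)))

print(pairs)     # "x_1" "y_2" "z_3"
print(joined)    # "x_1|y_2|z_3"
print(with_na)   # "id1" NA
print(safe)      # "id1" "idNA"
```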
str_dup repeats strings
> fruit <- c("apple", "pear", "banana")
> str_dup(fruit, 2)
[1] "appleapple" "pearpear" "bananabanana"
> str_dup(fruit, 1:3)
[1] "apple" "pearpear" "bananabananabanana"
> str_c("ba", str_dup("na", 0:5))
[1] "ba" "bana" "banana" "bananana" "banananana"
[6] "bananananana"
str_split splits strings on the pattern
- With simplify = TRUE, a matrix is returned
str_split_fixed splits each string into a fixed number of pieces on the pattern
> fruits <- c(
+ "apples and oranges and pears and bananas",
+ "pineapples and mangos and guavas"
+ )
>
> str_split(fruits, " and ")
[[1]]
[1] "apples" "oranges" "pears" "bananas"
[[2]]
[1] "pineapples" "mangos" "guavas"
> str_split(fruits, " and ", simplify = TRUE)
[,1] [,2] [,3] [,4]
[1,] "apples" "oranges" "pears" "bananas"
[2,] "pineapples" "mangos" "guavas" ""
>
> # Specify n to restrict the number of possible matches
> str_split(fruits, " and ", n = 3)
[[1]]
[1] "apples" "oranges" "pears and bananas"
[[2]]
[1] "pineapples" "mangos" "guavas"
> str_split(fruits, " and ", n = 2)
[[1]]
[1] "apples" "oranges and pears and bananas"
[[2]]
[1] "pineapples" "mangos and guavas"
> # If n greater than number of pieces, no padding occurs
> str_split(fruits, " and ", n = 5)
[[1]]
[1] "apples" "oranges" "pears" "bananas"
[[2]]
[1] "pineapples" "mangos" "guavas"
> # Use fixed to return a character matrix
> str_split_fixed(fruits, " and ", 3)
[,1] [,2] [,3]
[1,] "apples" "oranges" "pears and bananas"
[2,] "pineapples" "mangos" "guavas"
> str_split_fixed(fruits, " and ", 4)
[,1] [,2] [,3] [,4]
[1,] "apples" "oranges" "pears" "bananas"
[2,] "pineapples" "mangos" "guavas" ""
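Because str_split_fixed always returns a matrix, a single field can be taken by column index. The sketch below assumes sample names of the hypothetical form "ID_condition_replicate".

```r
library(stringr)

samples <- c("S1_ctrl_rep1", "S2_treat_rep1", "S3_treat_rep2")

# One row per sample, one column per field
parts <- str_split_fixed(samples, "_", 3)
condition <- parts[, 2]

print(dim(parts))   # 3 3
print(condition)    # "ctrl" "treat" "treat"
```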
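6. String sorting functions
str_order and str_sort both sort strings; the difference between them is that the former returns the sorted indices and the latter returns the sorted values.
- decreasing: the sort direction; the default is FALSE, i.e. ascending;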
> x <- c("a", "cc", "bbb", "dddd")
> str_sort(x)
[1] "a" "bbb" "cc" "dddd"
> str_order(x)
[1] 1 3 2 4
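A short sketch of the decreasing argument, plus the numeric argument of str_sort, which sorts embedded numbers by value rather than character by character. The file-like names are made up.

```r
library(stringr)

x <- c("a", "cc", "bbb", "dddd")
desc     <- str_sort(x, decreasing = TRUE)
desc_idx <- str_order(x, decreasing = TRUE)
# Indexing with the order reproduces the sorted vector
same <- identical(x[desc_idx], desc)

files <- c("x10", "x2", "x1")
alpha_sorted   <- str_sort(files)                  # character by character
numeric_sorted <- str_sort(files, numeric = TRUE)  # digits compared as numbers

print(desc)             # "dddd" "cc" "bbb" "a"
print(desc_idx)         # 4 2 3 1
print(alpha_sorted)     # "x1" "x10" "x2"
print(numeric_sorted)   # "x1" "x2" "x10"
```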
Summary of the key stringr functions
- stringr has many functions, but quite a few of them are rarely used day to day
- To make learning more efficient, I have listed the stringr functions I use most often. Once you have mastered these, you should be able to handle most everyday string-processing needs in R with ease.
Matching functions: str_detect, str_which, str_view, str_view_all
Subsetting functions: str_subset, str_extract, str_extract_all, str_match, str_match_all
Mutation functions: str_replace, str_replace_all, str_remove, str_remove_all
Joining/splitting functions: str_c, str_split, str_split_fixed
- Summarized as a table:
| Function | Description | Base R equivalent | Accepts regex |
|---|---|---|---|
| str_detect | Checks whether the pattern occurs in a string | grepl | Yes |
| str_which | Returns the indices of the strings that match the pattern | | Yes |
| str_view | Visualizes the first match of the pattern in each string | | Yes |
| str_view_all | Visualizes every match of the pattern in each string | | Yes |
| str_subset | Returns the strings that contain the pattern | | Yes |
| str_extract | Returns the first match in each string | regmatches | Yes |
| str_extract_all | Returns every match in each string | regmatches | Yes |
| str_match | Returns the first match in each string | | Yes |
| str_match_all | Returns every match in each string | | Yes |
| str_replace | Replaces the first match in each string | sub | Yes |
| str_replace_all | Replaces every match in each string | gsub | Yes |
| str_remove | Removes the first match in each string | | Yes |
| str_remove_all | Removes every match in each string | | Yes |
| str_c | Joins multiple strings | paste or paste0 | No |
| str_split | Splits strings on the pattern | strsplit | Yes |
| str_split_fixed | Splits each string into a fixed number of pieces on the pattern | | Yes |
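The base R correspondences listed in the table can be checked with a quick sketch. The comparisons use simple ASCII inputs; the results need not coincide for every input (for example with missing values or special patterns), so treat this as an illustration, not a general guarantee.

```r
library(stringr)

x <- c("apple", "banana", "pear")

same_detect      <- identical(str_detect(x, "an"), grepl("an", x))
same_replace     <- identical(str_replace(x, "a", "k"), sub("a", "k", x))
same_replace_all <- identical(str_replace_all(x, "a", "k"), gsub("a", "k", x))
same_c           <- identical(str_c(x, collapse = "+"), paste0(x, collapse = "+"))
same_split       <- identical(unlist(str_split("a-b-c", "-")),
                              unlist(strsplit("a-b-c", "-")))

print(c(same_detect, same_replace, same_replace_all, same_c, same_split))
```

Note the argument order: stringr functions take the string first and the pattern second, while grepl, sub, and gsub take the pattern first.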
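- Finally, let us return to the example from the beginning of the article.
- Suppose we want to extract "Chlamydomonas" and "IFT80" from the string "Homo sapiens intraflagellar transport 80 homolog (Chlamydomonas) (IFT80), mRNA".
- At the start of the article we used str_match_all.
- Having read the material above, you can probably think of other solutions:
- For example, str_split can pull out the content of the parentheses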
> string <- "Homo sapiens intraflagellar transport 80 homolog (Chlamydomonas) (IFT80), mRNA"
> str_split(string,"\\(|\\)",simplify = T)
[,1] [,2] [,3] [,4]
[1,] "Homo sapiens intraflagellar transport 80 homolog " "Chlamydomonas" " " "IFT80"
[,5]
[1,] ", mRNA"
> str_split(string,"\\(|\\)",simplify = T)[,c(2,4)]
[1] "Chlamydomonas" "IFT80"
- Alternatively, str_extract_all combined with str_remove_all can pull out the content of the parentheses
> string <- "Homo sapiens intraflagellar transport 80 homolog (Chlamydomonas) (IFT80), mRNA"
> str_extract_all(string,"\\([A-Za-z0-9]*\\)")
[[1]]
[1] "(Chlamydomonas)" "(IFT80)"
> a <- str_extract_all(string,"\\([A-Za-z0-9]*\\)")
> str_remove_all(a[[1]], "\\(|\\)")
[1] "Chlamydomonas" "IFT80"
- You can even use str_replace_all to extract the text inside the parentheses
> string <- "Homo sapiens intraflagellar transport 80 homolog (Chlamydomonas) (IFT80), mRNA"
> pattern <- "([:print:]+)(\\()([:print:]+)(\\))([:print:]+)(\\()([:print:]+)(\\))([:print:]+)"
> str_replace_all(string, pattern,"\\3")
[1] "Chlamydomonas"
> str_replace_all(string, pattern,"\\7")
[1] "IFT80"
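- My personal advice, though, is not to make regular expressions too fancy.
- For beginners, the longer a regular expression is, the more likely it is to contain a mistake.
- Beginners should start simple: as long as you reach your goal in the end, such as extracting a particular substring, using a few extra stringr functions along the way is perfectly fine.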
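References:
R regular expressions
The R language and regular expressions
So that was the culprit! Using regular expressions to catch a silent error in a bioinformatics analysis
R language tutorial
R for Data Science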