Continuing from last time with the stringr package, today we cover these functions: str_detect, str_subset, str_extract, str_replace, and str_split.
1.str_detect
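This function does pattern matching: it tells you whether each element of a vector contains the pattern you are looking for. It returns a logical value (TRUE or FALSE) per element, and it combines well with base functions such as sum and mean.
library(tidyverse)  # loads stringr, dplyr, tibble and the %>% pipe used below
x <- c("apple", "banana", "pear")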
str_detect(x, "e")
#> [1] TRUE FALSE TRUE
# Clearly the second element contains no letter e, so FALSE is returned for it
# Count how many words start with the letter a
sum(str_detect(x, "^a"))
#> [1] 1
mean(str_detect(x, "^a"))
#> [1] 0.333
A common use of str_detect() is to select the elements that match a pattern. You can do this with logical subsetting, or with the more convenient str_subset:
x[str_detect(x, "^a")]
#> [1] "apple"
or
str_subset(x, "^a")
#> [1] "apple"
Typically, however, your strings will be one column of a data frame, and you'll want to use filter instead:
df <- tibble(
  word = words,
  i = seq_along(word)
)
# Filter on the word column of df (not the global words vector)
df %>%
  filter(str_detect(word, "x$"))
#> # A tibble: 4 x 2
#>   word      i
#>   <chr> <int>
#> 1 box     108
#> 2 sex     747
#> 3 six     772
#> 4 tax     841
2.str_subset
str_subset returns the original elements that match the pattern, that is, the elements of the first argument that contain a match.
colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|")
has_colour <- str_subset(sentences, colour_match)
has_colour
[1] "Glue the sheet to the dark blue background." "Two blue fish swam in the tank."
[3] "The colt reared and threw the tall rider." "The wide road shimmered in the hot sun."
[5] "See the cat glaring at the scared mouse." "A wisp of cloud hung in the blue air."
[7] "Leaves turn brown and yellow in the fall." "He ordered peach pie with ice cream."
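OK, we can see that what comes back is the matching elements of sentences themselves. Note that the pattern also matches inside longer words, for example "red" in "reared", "shimmered", "scared" and "ordered".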
3.str_extract
To extract the actual text of a match, use str_extract:
matches <- str_extract(has_colour, colour_match)
head(matches)
#> [1] "blue" "blue" "red" "red" "red" "blue"
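# Note that str_extract() only extracts the first match, i.e. the first matching value in each sentence. We can see this most easily by first selecting all the sentences that have more than one match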
more <- sentences[str_count(sentences, colour_match) > 1]
str_extract(more, colour_match)
#> [1] "blue" "green" "orange"
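Since str_extract() keeps only the first match per string, a natural follow-up is str_extract_all(), which returns every match. The sketch below uses two made-up sentences (they are an assumption for illustration, not taken from the sentences dataset) so that it runs on its own:

```r
library(stringr)

colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|")

# Two illustrative sentences (assumed for this example only)
demo <- c("It is hard to erase blue or red ink.", "The sky is blue.")

# Returns a list with one character vector of matches per input string
all_matches <- str_extract_all(demo, colour_match)

# simplify = TRUE returns a character matrix instead, padded with ""
match_matrix <- str_extract_all(demo, colour_match, simplify = TRUE)
```

Here all_matches[[1]] holds both "blue" and "red", while the second row of match_matrix is padded with an empty string because that sentence has only one match.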
4.str_replace
str_replace() and str_replace_all() let you replace matches with new strings. The simplest use is to replace a pattern with a fixed string.
In other words, plain string substitution.
x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
#> [1] "-pple" "p-ar" "b-nana"
str_replace_all(x, "[aeiou]", "-")
#> [1] "-ppl-" "p--r" "b-n-n-"
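Clearly, str_replace() replaces only the first match in each element, whereas str_replace_all() replaces every match.
With str_replace_all() you can also perform multiple replacements by supplying a named vector: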
x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
#> [1] "one house" "two cars" "three people"
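The named-vector form is not limited to one replacement per element. As a minimal sketch, the string y below is a hypothetical input that contains two of the lookup keys at once:

```r
library(stringr)

# Hypothetical input containing two of the keys from the lookup vector
y <- "1 house and 2 cars"

# Every key found in the string is replaced by its named value
replaced <- str_replace_all(y, c("1" = "one", "2" = "two", "3" = "three"))
replaced
#> [1] "one house and two cars"
```

Keys that do not occur in the string (here "3") are simply left unused.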
5.str_split
Use str_split() to split a string into pieces. For example, we can split sentences into words, but note that the result is a list:
sentences %>%
  head(5) %>%
  str_split(" ")
#> [[1]]
#> [1] "The" "birch" "canoe" "slid" "on" "the" "smooth"
#> [8] "planks."
#>
#> [[2]]
#> [1] "Glue" "the" "sheet" "to" "the"
#> [6] "dark" "blue" "background."
#>
#> [[3]]
#> [1] "It's" "easy" "to" "tell" "the" "depth" "of" "a" "well."
#>
#> [[4]]
#> [1] "These" "days" "a" "chicken" "leg" "is" "a"
#> [8] "rare" "dish."
#>
#> [[5]]
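#> [1] "Rice" "is" "often" "served" "in" "round" "bowls."
# Each element may split into a different number of pieces, hence the list.
# For a length-1 vector, the simplest approach is to extract the first element of the list:
"a|b|c|d" %>%
  str_split("\\|") %>%
  .[[1]]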
#> [1] "a" "b" "c" "d"
Otherwise, as with the other stringr functions that return a list, you can use simplify = TRUE to return a matrix:
sentences %>%
  head(5) %>%
  str_split(" ", simplify = TRUE)
#>      [,1]    [,2]    [,3]    [,4]      [,5]  [,6]    [,7]
#> [1,] "The"   "birch" "canoe" "slid"    "on"  "the"   "smooth"
#> [2,] "Glue"  "the"   "sheet" "to"      "the" "dark"  "blue"
#> [3,] "It's"  "easy"  "to"    "tell"    "the" "depth" "of"
#> [4,] "These" "days"  "a"     "chicken" "leg" "is"    "a"
#> [5,] "Rice"  "is"    "often" "served"  "in"  "round" "bowls."
#>      [,8]          [,9]
#> [1,] "planks."     ""
#> [2,] "background." ""
#> [3,] "a"           "well."
#> [4,] "rare"        "dish."
#> [5,] ""            ""
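Two further options of str_split() are worth a quick look: the n argument, which caps the number of pieces, and boundary("word"), which splits on word boundaries rather than on a literal space. The inputs below are illustrative assumptions, not part of the sentences dataset:

```r
library(stringr)

# n = 2 stops splitting after two pieces; the remainder stays in the last piece
capped <- str_split("a|b|c|d", "\\|", n = 2)[[1]]
capped
#> [1] "a"     "b|c|d"

# boundary("word") splits into words and drops punctuation and extra spaces
x <- "This is a sentence.  This is another sentence."
words_only <- str_split(x, boundary("word"))[[1]]
words_only
#> [1] "This"     "is"       "a"        "sentence" "This"     "is"       "another"
#> [8] "sentence"
```

Compared with splitting on " ", the boundary("word") version avoids leftovers such as "planks." with its trailing full stop.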