
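1. Overview

In this article, we'll discuss several ways to remove stopwords from a string. This is useful when we need to strip unwanted or disallowed words from a text, such as comments posted by users.
We'll use a simple loop, Collection.removeAll(), and regular expressions. Finally, we'll compare the performance of these approaches using the Java Microbenchmark Harness (JMH).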

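2. Loading Stopwords

First, we'll load our stopwords from a text file.
We have a file, english_stopwords.txt, that contains the words we want to filter out, such as I, she, he, and the.
We'll load them into a List using Files.readAllLines(). The snippet assumes that stopwords is a static List<String> field of the test class: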

@BeforeClass
public static void loadStopwords() throws IOException {
    stopwords = Files.readAllLines(Paths.get("english_stopwords.txt"));
}
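
To try the loading step outside a test class, a minimal self-contained sketch might look like the following. The class name StopwordLoader and the method load() are hypothetical, and the trimming, lowercasing, and blank-line filtering are assumptions added here for robustness; the snippet above does none of that.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper; the names are illustrative only.
public class StopwordLoader {

    // Reads one stopword per line, trims it, lowercases it, and skips blank lines.
    // (The normalization is an assumption, not part of the original snippet.)
    public static List<String> load(Path file) throws IOException {
        return Files.readAllLines(file).stream()
            .map(String::trim)
            .filter(line -> !line.isEmpty())
            .map(String::toLowerCase)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // Write a small temporary file so the sketch runs without english_stopwords.txt.
        Path tmp = Files.createTempFile("stopwords", ".txt");
        Files.write(tmp, Arrays.asList("I", "she", "he", "the", ""));
        System.out.println(load(tmp));
        Files.delete(tmp);
    }
}
```

Lowercasing the entries keeps them comparable with the lowercased input text used in the following sections.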

3. Removing Stopwords Manually

In our first solution, we'll iterate over each word and check whether it is a stopword:

@Test
public void whenRemoveStopwordsManually_thenSuccess() {
    String original = "The quick brown fox jumps over the lazy dog"; 
    String target = "quick brown fox jumps lazy dog";
    String[] allWords = original.toLowerCase().split(" ");
 
    StringBuilder builder = new StringBuilder();
    for(String word : allWords) {
        if(!stopwords.contains(word)) {
            builder.append(word);
            builder.append(' ');
        }
    }
     
    String result = builder.toString().trim();
    assertEquals(target, result); // JUnit convention: expected value first
}
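
The same loop can be sketched as a standalone method. Note one deliberate difference: this version takes a HashSet rather than the List used above, on the assumption that constant-time lookups are preferable when the stopword list is long. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical standalone version of the manual loop; names are illustrative.
public class ManualStopwordRemoval {

    // Keeps every word that is not in the stopword set, preserving word order.
    public static String removeStopwords(String text, Set<String> stopwords) {
        StringBuilder builder = new StringBuilder();
        for (String word : text.toLowerCase().split(" ")) {
            if (!stopwords.contains(word)) {
                builder.append(word).append(' ');
            }
        }
        // Drop the trailing space appended after the last kept word.
        return builder.toString().trim();
    }

    public static void main(String[] args) {
        // A tiny stand-in for the contents of english_stopwords.txt.
        Set<String> stopwords = new HashSet<>(Arrays.asList("the", "over"));
        System.out.println(
            removeStopwords("The quick brown fox jumps over the lazy dog", stopwords));
    }
}
```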

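4. Using Collection.removeAll()

For our second solution, we'll use Collection.removeAll() to remove all the stopwords in a single call. The original and target strings are the same as in the previous example: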

@Test
public void whenRemoveStopwordsUsingRemoveAll_thenSuccess() {
    ArrayList<String> allWords = 
      Stream.of(original.toLowerCase().split(" "))
            .collect(Collectors.toCollection(ArrayList<String>::new));
    allWords.removeAll(stopwords);
 
    String result = allWords.stream().collect(Collectors.joining(" "));
    assertEquals(target, result); // JUnit convention: expected value first
}
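
One detail worth spelling out is why the words are collected into an ArrayList: removeAll() needs a list that supports element removal. Here is a minimal standalone sketch of the same idea, with hypothetical class and method names, using String.join() in place of the stream collector.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical standalone version of the removeAll() approach; names are illustrative.
public class RemoveAllStopwords {

    public static String removeStopwords(String text, List<String> stopwords) {
        // Copy into an ArrayList: the list returned by Arrays.asList() is fixed-size
        // and does not support removing elements.
        List<String> words = new ArrayList<>(Arrays.asList(text.toLowerCase().split(" ")));
        // Removes every occurrence of every stopword, not just the first one.
        words.removeAll(stopwords);
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        // A tiny stand-in for the contents of english_stopwords.txt.
        List<String> stopwords = Arrays.asList("the", "over");
        System.out.println(
            removeStopwords("The quick brown fox jumps over the lazy dog", stopwords));
    }
}
```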

5. Using Regular Expressions

Finally, we'll build a regular expression from the stopwords list and use it to replace the stopwords in the string:

@Test
public void whenRemoveStopwordsUsingRegex_thenSuccess() {
    String stopwordsRegex = stopwords.stream()
      .collect(Collectors.joining("|", "\\b(", ")\\b\\s?"));
 
    String result = original.toLowerCase().replaceAll(stopwordsRegex, "");
    assertEquals(target, result); // JUnit convention: expected value first
}

The resulting stopwordsRegex has the form \b(he|she|the|…)\b\s?. In this regex, \b matches a word boundary, so that, for example, the "he" in "heat" is not replaced, while \s? matches zero or one whitespace character, so the space that follows a stopword is removed along with it.
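
To see the word-boundary behavior in isolation, here is a small standalone sketch. The class and method names are hypothetical. Two things are additions not found in the test above and should be read as assumptions: each stopword is passed through Pattern.quote() in case the list ever contains regex metacharacters, and the result is trimmed because a stopword at the very end of the text leaves the preceding space behind.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical standalone version of the regex approach; names are illustrative.
public class RegexStopwordRemoval {

    // Builds a pattern of the form \b(w1|w2|...)\b\s?
    // Pattern.quote() is an added safeguard, not part of the original snippet.
    public static String buildRegex(List<String> stopwords) {
        return stopwords.stream()
            .map(Pattern::quote)
            .collect(Collectors.joining("|", "\\b(", ")\\b\\s?"));
    }

    public static String removeStopwords(String text, List<String> stopwords) {
        // trim() handles a stopword at the end of the text (also an addition).
        return text.toLowerCase().replaceAll(buildRegex(stopwords), "").trim();
    }

    public static void main(String[] args) {
        List<String> stopwords = Arrays.asList("he", "the");
        // "heat" is kept because there is no word boundary between "he" and "at".
        System.out.println(removeStopwords("The heat he felt", stopwords));
    }
}
```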

6. Performance Comparison

Now, let's see which method performs best.
First, we'll set up our benchmark. As the source of the string to clean, we'll use a reasonably large text file, shakespeare-hamlet.txt:

@Setup
public void setup() throws IOException {
    data = new String(Files.readAllBytes(Paths.get("shakespeare-hamlet.txt")));
    data = data.toLowerCase();
    stopwords = Files.readAllLines(Paths.get("english_stopwords.txt"));
    stopwordsRegex = stopwords.stream().collect(Collectors.joining("|", "\\b(", ")\\b\\s?"));
}

Then we'll write the benchmark methods, starting with removeManually():

@Benchmark
public String removeManually() {
    String[] allWords = data.split(" ");
    StringBuilder builder = new StringBuilder();
    for(String word : allWords) {
        if(!stopwords.contains(word)) {
            builder.append(word);
            builder.append(' ');
        }
    }
    return builder.toString().trim();
}

Next, the removeAll() benchmark:

@Benchmark
public String removeAll() {
    ArrayList<String> allWords = 
      Stream.of(data.split(" "))
            .collect(Collectors.toCollection(ArrayList<String>::new));
    allWords.removeAll(stopwords);
    return allWords.stream().collect(Collectors.joining(" "));
}

And finally, the replaceRegex() benchmark:

@Benchmark
public String replaceRegex() {
    return data.replaceAll(stopwordsRegex, "");
}

Here are the results:

Benchmark                           Mode  Cnt   Score    Error  Units
removeAll                           avgt   60   7.782 ±  0.076  ms/op
removeManually                      avgt   60   8.186 ±  0.348  ms/op
replaceRegex                        avgt   60  42.035 ±  1.098  ms/op

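Collection.removeAll() has the lowest average time, although the manual loop is close behind (7.782 vs. 8.186 ms/op), while the regular-expression approach is by far the slowest at 42.035 ms/op.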

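7. Conclusion

In this article, we looked at three ways to remove stopwords from a string in Java and compared their performance.
The sample code is available on GitHub.

Translated from: https://www.baeldung.com/java-string-remove-stopwords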
