A Look at Spring AI Alibaba's SentenceSplitter

This article examines SentenceSplitter in Spring AI Alibaba.

SentenceSplitter

spring-ai-alibaba-core/src/main/java/com/alibaba/cloud/ai/transformer/splitter/SentenceSplitter.java

public class SentenceSplitter extends TextSplitter {

    private final EncodingRegistry registry = Encodings.newLazyEncodingRegistry();

    private final Encoding encoding = registry.getEncoding(EncodingType.CL100K_BASE);

    private static final int DEFAULT_CHUNK_SIZE = 1024;

    private final SentenceModel sentenceModel;

    private final int chunkSize;

    public SentenceSplitter() {
        this(DEFAULT_CHUNK_SIZE);
    }

    public SentenceSplitter(int chunkSize) {
        this.chunkSize = chunkSize;
        this.sentenceModel = getSentenceModel();
    }

    @Override
    protected List<String> splitText(String text) {
        SentenceDetectorME sentenceDetector = new SentenceDetectorME(sentenceModel);
        String[] texts = sentenceDetector.sentDetect(text);
        if (texts == null || texts.length == 0) {
            return Collections.emptyList();
        }

        List<String> chunks = new ArrayList<>();
        StringBuilder chunk = new StringBuilder();
        for (int i = 0; i < texts.length; i++) {
            int currentChunkSize = getEncodedTokens(chunk.toString()).size();
            int textTokenSize = getEncodedTokens(texts[i]).size();
            if (currentChunkSize + textTokenSize > chunkSize) {
                chunks.add(chunk.toString());
                chunk = new StringBuilder(texts[i]);
            }
            else {
                chunk.append(texts[i]);
            }

            if (i == texts.length - 1) {
                chunks.add(chunk.toString());
            }
        }

        return chunks;
    }

    private SentenceModel getSentenceModel() {
        try (InputStream is = getClass().getResourceAsStream("/opennlp/opennlp-en-ud-ewt-sentence-1.2-2.5.0.bin")) {
            if (is == null) {
                throw new RuntimeException("sentence model is invalid");
            }

            return new SentenceModel(is);
        }
        catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    private List<Integer> getEncodedTokens(String text) {
        Assert.notNull(text, "Text must not be null");
        return this.encoding.encode(text).boxed();
    }

}

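SentenceSplitter extends TextSplitter. Its constructor calls getSentenceModel() to load the SentenceModel from the classpath resource /opennlp/opennlp-en-ud-ewt-sentence-1.2-2.5.0.bin. The splitText method creates a SentenceDetectorME and uses its sentDetect to split the text into sentences, then merges consecutive sentences into chunks. A sentence is appended to the current chunk as long as the combined token count (measured with the CL100K_BASE encoding) stays within chunkSize, which defaults to 1024; otherwise the current chunk is emitted and a new chunk starts with that sentence. A single sentence is never subdivided, so a sentence longer than chunkSize ends up in a chunk of its own that exceeds the limit.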
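The merge loop in splitText can be tried in isolation with the sketch below. It is a minimal illustration, not the library's code: the class GreedySentenceMerger and its merge method are hypothetical names, the sentences are passed in directly instead of coming from OpenNLP, and the token counter is pluggable so that String::length (character count) can stand in for the CL100K_BASE token count. The loop body mirrors the quoted splitText, including the fact that sentences are appended without a separator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

// Hypothetical helper, not part of Spring AI Alibaba: reproduces the greedy
// merge loop of splitText with a pluggable token counter.
class GreedySentenceMerger {

    // Packs sentences into chunks so that a chunk is closed as soon as adding
    // the next sentence would push its size above chunkSize.
    static List<String> merge(List<String> sentences, int chunkSize, ToIntFunction<String> tokenCounter) {
        List<String> chunks = new ArrayList<>();
        StringBuilder chunk = new StringBuilder();
        for (int i = 0; i < sentences.size(); i++) {
            String sentence = sentences.get(i);
            int currentSize = tokenCounter.applyAsInt(chunk.toString());
            int sentenceSize = tokenCounter.applyAsInt(sentence);
            if (currentSize + sentenceSize > chunkSize) {
                // Close the current chunk and start a new one with this sentence.
                chunks.add(chunk.toString());
                chunk = new StringBuilder(sentence);
            }
            else {
                // Same as the quoted code: no separator is inserted.
                chunk.append(sentence);
            }
            // Flush whatever is pending after the last sentence.
            if (i == sentences.size() - 1) {
                chunks.add(chunk.toString());
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Assumption: character count stands in for the real token count.
        List<String> merged = merge(List.of("Hi there.", "Bye now.", "Ok."), 17, String::length);
        System.out.println(merged); // prints [Hi there.Bye now., Ok.]

        // Edge case: the very first sentence already exceeds the limit.
        List<String> oversized = merge(List.of("This sentence is long."), 10, String::length);
        System.out.println(oversized.size()); // prints 2
    }
}
```

The second call shows a consequence of the loop as written: when the first sentence alone exceeds the limit, the still-empty builder is emitted first, so the result starts with an empty string followed by the oversized sentence. Whether the surrounding TextSplitter filters such empty chunks is not covered by this sketch.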

Example

spring-ai-alibaba-core/src/test/java/com/alibaba/cloud/ai/transformer/splitter/SentenceSplitterTests.java

class SentenceSplitterTests {

    private SentenceSplitter splitter;

    private static final int CUSTOM_CHUNK_SIZE = 100;

    @BeforeEach
    void setUp() {
        // Initialize with default chunk size
        splitter = new SentenceSplitter();
    }

    /**
     * Test default constructor. Verifies that splitter can be created with default chunk
     * size.
     */
    @Test
    void testDefaultConstructor() {
        SentenceSplitter defaultSplitter = new SentenceSplitter();
        assertThat(defaultSplitter).isNotNull();
    }

    /**
     * Test constructor with custom chunk size. Verifies that splitter can be created with
     * specified chunk size.
     */
    @Test
    void testCustomChunkSizeConstructor() {
        SentenceSplitter customSplitter = new SentenceSplitter(CUSTOM_CHUNK_SIZE);
        assertThat(customSplitter).isNotNull();
    }

    /**
     * Test splitting simple sentences. Verifies basic sentence splitting functionality.
     */
    @Test
    void testSplitSimpleSentences() {
        String text = "This is a test. This is another test. And this is a third test.";
        Document doc = new Document(text);
        List<Document> documents = splitter.apply(Collections.singletonList(doc));

        assertThat(documents).isNotNull();
        assertThat(documents).hasSize(1);
        assertThat(documents.get(0).getText()).contains("This is a test", "This is another test",
                "And this is a third test");
    }

    /**
     * Test splitting empty text. Verifies handling of empty input.
     */
    @Test
    void testSplitEmptyText() {
        Document doc = new Document("");
        List<Document> documents = splitter.apply(Collections.singletonList(doc));
        assertThat(documents).isEmpty();
    }

    /**
     * Test splitting text with special characters. Verifies handling of text with various
     * punctuation and special characters.
     */
    @Test
    void testSplitTextWithSpecialCharacters() {
        String text = "Hello, world! How are you? I'm doing great... This is a test; with various punctuation.";
        Document doc = new Document(text);
        List<Document> documents = splitter.apply(Collections.singletonList(doc));

        assertThat(documents).isNotNull();
        assertThat(documents).hasSize(1);
        assertThat(documents.get(0).getText()).contains("Hello, world", "How are you", "I'm doing great",
                "This is a test");
    }

    /**
     * Test splitting long text. Verifies handling of text that exceeds default chunk
     * size.
     */
    @Test
    void testSplitLongText() {
        // Generate a very long text that will exceed the default chunk size (1024
        // tokens)
        StringBuilder longText = new StringBuilder();
        String longSentence = "This is a very long sentence with many words that will contribute to the total token count and eventually force the text to be split into multiple chunks because it exceeds the default chunk size limit of 1024 tokens. ";
        // Repeat the sentence enough times to ensure we exceed the chunk size
        for (int i = 0; i < 50; i++) {
            longText.append(longSentence);
        }
        Document doc = new Document(longText.toString());

        List<Document> documents = splitter.apply(Collections.singletonList(doc));

        // Verify that the text was split into multiple documents
        assertThat(documents).isNotNull();
        assertThat(documents).hasSizeGreaterThan(1);
        // Verify that each document contains part of the original text
        documents.forEach(document -> assertThat(document.getText()).contains("This is a very long sentence"));
    }

    /**
     * Test splitting text with multiple line breaks. Verifies handling of text with
     * various types of line breaks.
     */
    @Test
    void testSplitTextWithLineBreaks() {
        String text = "First sentence.\nSecond sentence.\r\nThird sentence.\rFourth sentence.";
        Document doc = new Document(text);
        List<Document> documents = splitter.apply(Collections.singletonList(doc));

        assertThat(documents).isNotNull();
        assertThat(documents.get(0).getText()).contains("First sentence", "Second sentence", "Third sentence",
                "Fourth sentence");
    }

    /**
     * Test splitting text with single character sentences. Verifies handling of very
     * short sentences.
     */
    @Test
    void testSplitSingleCharacterSentences() {
        String text = "A. B. C. D.";
        Document doc = new Document(text);
        List<Document> documents = splitter.apply(Collections.singletonList(doc));

        assertThat(documents).isNotNull();
        assertThat(documents).hasSize(1);
        assertThat(documents.get(0).getText()).contains("A", "B", "C", "D");
    }

    /**
     * Test splitting multiple documents. Verifies handling of multiple input documents.
     */
    @Test
    void testSplitMultipleDocuments() {
        List<Document> inputDocs = new ArrayList<>();
        inputDocs.add(new Document("First document. With multiple sentences."));
        inputDocs.add(new Document("Second document. Also with multiple sentences."));

        List<Document> documents = splitter.apply(inputDocs);
        assertThat(documents).isNotNull();
        assertThat(documents).hasSizeGreaterThan(1);
    }

}

Summary

Spring AI Alibaba provides SentenceSplitter, which uses OpenNLP's SentenceDetectorME to split text into sentences; its constructor loads the SentenceModel from /opennlp/opennlp-en-ud-ewt-sentence-1.2-2.5.0.bin.

