Apache Lucene - Index File Formats V7.3.0


Introduction

This document defines the index file formats used in this version of Lucene. If you are using a different version of Lucene, please consult the copy of docs/ that was distributed with the version you are using.

This document attempts to provide a high-level definition of the Apache Lucene file formats.

Definitions

The fundamental concepts in Lucene are index, document, field and term.

An index contains a sequence of documents.

  • A document is a sequence of fields.
  • A field is a named sequence of terms.
  • A term is a sequence of bytes.

The same sequence of bytes in two different fields is considered a different term. Thus terms are represented as a pair: the string naming the field, and the bytes within the field.
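The pair representation can be illustrated with a minimal sketch. The `Term` class and the field names below are hypothetical and chosen only for illustration; this is not Lucene's own API.

```python
from typing import NamedTuple


class Term(NamedTuple):
    """A term modelled as a (field name, bytes) pair. Hypothetical helper."""
    field: str
    value: bytes


# The same byte sequence placed in two different fields.
title_term = Term("title", "lucene".encode("utf-8"))
body_term = Term("body", "lucene".encode("utf-8"))

# Because the field name is part of the term, a set keeps both entries.
distinct_terms = {title_term, body_term}
```

Here `title_term` and `body_term` share the same bytes but compare as unequal, which mirrors the rule that identical bytes in different fields are different terms.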

Inverted Indexing

The index stores statistics about terms in order to make term-based search more efficient. Lucene's index belongs to the family of indexes known as inverted indexes. This is because it can list, for a term, the documents that contain it. This is the inverse of the natural relationship, in which documents list terms.
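As a rough illustration of the inversion, the sketch below turns a list of documents (each listing its terms) into a mapping from each term to the document numbers that contain it. The `invert` function and the sample data are assumptions made for this example and say nothing about how Lucene stores the data on disk.

```python
from collections import defaultdict


def invert(docs):
    """Map each term to the sorted list of document numbers containing it."""
    postings = defaultdict(set)
    for doc_number, terms in enumerate(docs):
        for term in terms:
            postings[term].add(doc_number)
    return {term: sorted(numbers) for term, numbers in postings.items()}


# Natural relationship: each document lists its terms.
docs = [
    ["apache", "lucene"],
    ["lucene", "index"],
    ["apache", "index", "format"],
]

# Inverted relationship: each term lists its documents.
inverted = invert(docs)
```

With this layout, answering "which documents contain `lucene`?" is a single lookup rather than a scan over every document.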

Types of Fields

In Lucene, fields may be stored, in which case their text is stored in the index literally, in a non-inverted manner. Fields that are inverted are called indexed. A field may be both stored and indexed.

The text of a field may be tokenized into terms to be indexed, or the text of a field may be used literally as a term to be indexed. Most fields are tokenized, but sometimes it is useful for certain identifier fields to be indexed literally.
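The difference between tokenized and literal indexing can be sketched as follows. The whitespace-and-lowercase `tokenize` function is a stand-in for a real analyzer, and `terms_for` is a hypothetical helper; neither corresponds to a Lucene class.

```python
def tokenize(text):
    """Hypothetical analyzer: lower-case the text and split on whitespace."""
    return text.lower().split()


def terms_for(field_name, text, tokenized):
    """Return the (field, bytes) terms produced for one field value."""
    tokens = tokenize(text) if tokenized else [text]
    return [(field_name, token.encode("utf-8")) for token in tokens]


# A tokenized field yields one term per token.
body_terms = terms_for("body", "Index File Formats", tokenized=True)

# An identifier field indexed literally yields exactly one term.
id_terms = terms_for("id", "DOC-2018 A", tokenized=False)
```

The identifier keeps its case and its embedded space, so it can only be matched as a whole, which is usually what is wanted for keys.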

See the Field Javadocs for more information on fields.

Segments

Lucene indexes may be composed of multiple sub-indexes, or segments. Each segment is a fully independent index, which could be searched separately. Indexes evolve by:

  1. Creating new segments for newly added documents.
  2. Merging existing segments.

Searches may involve multiple segments and/or multiple indexes, each index potentially composed of a set of segments.

Document Numbers

Internally, Lucene refers to documents by an integer document number. The first document added to an index is numbered zero, and each subsequent document added gets a number one greater than the previous.

Note that a document's number may change, so caution should be taken when storing these numbers outside of Lucene. In particular, numbers may change in the following situations:

  • The numbers stored in each segment are unique only within the segment, and must be converted before they can be used in a larger context. The standard technique is to allocate each segment a range of values, based on the range of numbers used in that segment. To convert a document number from a segment to an external value, the segment's base document number is added. To convert an external value back to a segment-specific value, the segment is identified by the range that the external value is in, and the segment's base value is subtracted. For example, two five-document segments might be combined, so that the first segment has a base value of zero and the second a base value of five. Document three from the second segment would then have an external value of eight (5 + 3).

  • When documents are deleted, gaps are created in the numbering. These are eventually removed as the index evolves through merging. Deleted documents are dropped when segments are merged. A freshly-merged segment thus has no gaps in its numbering.
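Both situations can be sketched in a few lines. The function names below are hypothetical, and the sketch assumes segment bases are simply the running total of the sizes of the preceding segments, as in the two-segment example above.

```python
from bisect import bisect_right
from itertools import accumulate


def segment_bases(segment_sizes):
    """Base number of each segment: total size of the segments before it."""
    return [0] + list(accumulate(segment_sizes))[:-1]


def to_external(bases, segment_index, local_doc):
    """Segment-local number -> external number: add the segment's base."""
    return bases[segment_index] + local_doc


def to_segment(bases, external_doc):
    """External number -> (segment index, local number).

    The segment is the last one whose base is <= the external value.
    No upper-bound check is made against the final segment's size.
    """
    segment_index = bisect_right(bases, external_doc) - 1
    return segment_index, external_doc - bases[segment_index]


def merged_numbers(live_flags):
    """Old -> new numbers after a merge that drops deleted documents."""
    mapping, next_number = {}, 0
    for old_number, live in enumerate(live_flags):
        if live:
            mapping[old_number] = next_number
            next_number += 1
    return mapping


bases = segment_bases([5, 5])  # two five-document segments
```

With these bases, document three of the second segment maps to external value eight, and the reverse lookup recovers segment index 1 and local number 3. `merged_numbers` shows why stored numbers go stale: surviving documents are renumbered densely once deleted ones are dropped.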

Index Structure Overview

Each segment index maintains the following:

  • Segment info. This contains metadata about a segment, such as the number of documents and the files it uses.

  • Field names. This contains the set of field names used in the index.

  • Stored Field values. This contains, for each document, a list of attribute-value pairs, where the attributes are field names. These are used to store auxiliary information about the document, such as its title, url, or an identifier to access a database. The set of stored fields is what is returned for each hit when searching. This is keyed by document number.

  • Term dictionary. A dictionary containing all of the terms used in all of the indexed fields of all of the documents. The dictionary also contains the number of documents which contain the term, and pointers to the term's frequency and proximity data.

  • Term Frequency data. For each term in the dictionary, the numbers of all the documents that contain that term, and the frequency of the term in that document, unless frequencies are omitted (IndexOptions.DOCS).

  • Term Proximity data. For each term in the dictionary, the positions at which the term occurs in each document. Note that this will not exist if all fields in all documents omit position data.

  • Normalization factors. For each field in each document, a value is stored that is multiplied into the score for hits on that field.

  • Term Vectors. For each field in each document, the term vector (sometimes called document vector) may be stored. A term vector consists of term text and term frequency. To add Term Vectors to your index see the Field constructors.

  • Per-document values. Like stored values, these are also keyed by document number, but are generally intended to be loaded into main memory for fast access. Whereas stored values are generally intended for summary results from searches, per-document values are useful for things like scoring factors.

  • Live documents. An optional file indicating which documents are live.

  • Point values. Optional pair of files, recording dimensionally indexed fields, to enable fast numeric range filtering and large numeric values like BigInteger and BigDecimal (1D) and geographic shape intersection (2D, 3D).

Details on each of these are provided in their linked pages.
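To make the relationship between a few of these structures concrete, here is a minimal in-memory sketch covering stored field values, the term dictionary with document frequency, term frequencies, and positions. The `build_segment` function, the dictionary layout, and the whitespace tokenization are all assumptions for illustration; the real on-disk structures are considerably more involved.

```python
from collections import defaultdict


def build_segment(docs, indexed_field):
    """Build toy stored-field and term-dictionary structures for one segment."""
    stored = {}  # document number -> stored field values
    positions = defaultdict(lambda: defaultdict(list))  # term -> doc -> positions
    for doc_number, fields in enumerate(docs):
        stored[doc_number] = dict(fields)
        tokens = fields[indexed_field].lower().split()
        for position, token in enumerate(tokens):
            positions[token][doc_number].append(position)

    dictionary = {}
    for term, per_doc in positions.items():
        dictionary[term] = {
            # Number of documents that contain the term.
            "doc_freq": len(per_doc),
            # Frequency of the term within each of those documents.
            "freqs": {doc: len(pos) for doc, pos in per_doc.items()},
            # Positions at which the term occurs in each document.
            "positions": {doc: list(pos) for doc, pos in per_doc.items()},
        }
    return stored, dictionary


docs = [
    {"title": "Formats", "body": "index file index"},
    {"title": "Segments", "body": "segment file"},
]
stored, dictionary = build_segment(docs, "body")
```

A search for `file` would consult `dictionary["file"]` to find the matching document numbers, then use those numbers as keys into `stored` to fetch the values returned for each hit.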

File Naming

All files belonging to a segment have the same name with varying extensions. The extensions correspond to the different file formats described below. When using the Compound File format (default for small segments) these files (except for the Segment info file, the Lock file, and Deleted documents file) are collapsed into a single .cfs file (see below for details).

Typically, all segments in an index are stored in a single directory, although this is not required.

File names are never re-used. That is, when any file is saved to the Directory it is given a never-before-used filename. This is achieved using a simple generations approach. For example, the first segments file is segments_1, then segments_2, etc. The generation is a sequential long integer represented in alpha-numeric (base 36) form.
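The generation-to-name mapping can be sketched as below. The helper names are hypothetical, and the sketch assumes lower-case digits `0-9a-z` for the base 36 form.

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base36(value):
    """Render a non-negative integer in base 36 using digits 0-9 and a-z."""
    if value < 0:
        raise ValueError("generation must be non-negative")
    if value == 0:
        return "0"
    chars = []
    while value:
        value, remainder = divmod(value, 36)
        chars.append(DIGITS[remainder])
    return "".join(reversed(chars))


def segments_file_name(generation):
    """Name of the segments file for a given generation."""
    return "segments_" + to_base36(generation)
```

Under these assumptions, generations 1 and 2 give segments_1 and segments_2, generation 10 gives segments_a, and generation 36 rolls over to segments_10.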

Summary of File Extensions

The following table summarizes the names and extensions of the files in Lucene:

| Name | Extension | Brief Description |
|------|-----------|-------------------|
| Segments File | segments_N | Stores information about a commit point. |
| Lock File | write.lock | The write lock prevents multiple IndexWriters from writing to the same file. |
| Segment Info | .si | Stores metadata about a segment. |
| Compound File | .cfs, .cfe | An optional "virtual" file consisting of all the other index files, for systems that frequently run out of file handles. |
| Fields | .fnm | Stores information about the fields. |
| Field Index | .fdx | Contains pointers to field data. |
| Field Data | .fdt | The stored fields for documents. |
| Term Dictionary | .tim | The term dictionary; stores term info. |
| Term Index | .tip | The index into the Term Dictionary. |
| Frequencies | .doc | Contains the list of docs which contain each term, along with frequency. |
| Positions | .pos | Stores position information about where a term occurs in the index. |
| Payloads | .pay | Stores additional per-position metadata information such as character offsets and user payloads. |
| Norms | .nvd, .nvm | Encodes length and boost factors for docs and fields. |
| Per-Document Values | .dvd, .dvm | Encodes additional scoring factors or other per-document information. |
| Term Vector Index | .tvx | Stores offset into the document data file. |
| Term Vector Data | .tvd | Contains term vector data. |
| Live Documents | .liv | Info about what documents are live. |
| Point values | .dii, .dim | Holds indexed points, if any. |

Lock File

The write lock, which is stored in the index directory by default, is named "write.lock". If the lock directory is different from the index directory then the write lock will be named "XXXX-write.lock" where XXXX is a unique prefix derived from the full path to the index directory. When this file is present, a writer is currently modifying the index (adding or removing documents). This lock file ensures that only one writer is modifying the index at a time.
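The presence-based behaviour described here can be modelled with an atomic "create only if absent" file operation. This is a simplified sketch with hypothetical function names; it is an assumption that it resembles what a writer does, and Lucene's actual lock implementations may rely on other mechanisms such as operating-system file locks.

```python
import os
import tempfile


def obtain_write_lock(index_dir):
    """Atomically create write.lock; return its path, or None if already held."""
    lock_path = os.path.join(index_dir, "write.lock")
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return None
    os.close(fd)
    return lock_path


def release_write_lock(lock_path):
    """Remove the lock file so that another writer may obtain the lock."""
    os.remove(lock_path)


index_dir = tempfile.mkdtemp()           # stand-in for an index directory
first = obtain_write_lock(index_dir)     # first writer obtains the lock
second = obtain_write_lock(index_dir)    # second writer is refused
release_write_lock(first)
third = obtain_write_lock(index_dir)     # lock is available again
```

The second attempt fails while the file exists, which is the "only one writer at a time" guarantee in miniature.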

History

Compatibility notes are provided in this document, describing how file formats have changed from prior versions:

  • In version 2.1, the file format was changed to allow lock-less commits (i.e., no more commit lock). The change is fully backwards compatible: you can open a pre-2.1 index for searching or adding/deleting of docs. When the new segments file is saved (committed), it will be written in the new file format (meaning no specific "upgrade" process is needed). But note that once a commit has occurred, pre-2.1 Lucene will not be able to read the index.

  • In version 2.3, the file format was changed to allow segments to share a single set of doc store (vectors & stored fields) files. This allows for faster indexing in certain cases. The change is fully backwards compatible (in the same way as the lock-less commits change in 2.1).

  • In version 2.4, Strings are now written as true UTF-8 byte sequence, not Java's modified UTF-8. See LUCENE-510 for details.

  • In version 2.9, an optional opaque Map<String,String> CommitUserData may be passed to IndexWriter's commit methods (and later retrieved), which is recorded in the segments_N file. See LUCENE-1382 for details. Also, diagnostics were added to each segment written, recording details about why it was written (due to flush, merge; which OS/JRE was used; etc.). See issue LUCENE-1654 for details.

  • In version 3.0, compressed fields are no longer written to the index (they can still be read, but on merge the new segment will write them uncompressed). See issue LUCENE-1960 for details.

  • In version 3.1, segments record the code version that created them. See LUCENE-2720 for details. Additionally, segments track explicitly whether or not they have term vectors. See LUCENE-2811 for details.

  • In version 3.2, numeric fields are written natively to the stored fields file; previously they were stored in text format only.

  • In version 3.4, fields can omit position data while still indexing term frequencies.

  • In version 4.0, the format of the inverted index became extensible via the Codec API. Fast per-document storage (DocValues) was introduced. Normalization factors need no longer be a single byte; they can be any NumericDocValues. Terms need not be unicode strings; they can be any byte sequence. Term offsets can optionally be indexed into the postings lists. Payloads can be stored in the term vectors.

  • In version 4.1, the format of the postings list changed to use either FOR compression or variable-byte encoding, depending upon the frequency of the term. Terms appearing only once were changed to inline directly into the term dictionary. Stored fields are compressed by default.

  • In version 4.2, term vectors are compressed by default. DocValues has a new multi-valued type (SortedSet) that can be used for faceting/grouping/joining on multi-valued fields.

  • In version 4.5, DocValues were extended to explicitly represent missing values.

  • In version 4.6, FieldInfos were extended to support per-field DocValues generation, to allow updating NumericDocValues fields.

  • In version 4.8, checksum footers were added to the end of each index file for improved data integrity. Specifically, the last 8 bytes of every index file contain the zlib-crc32 checksum of the file.

  • In version 4.9, DocValues has a new multi-valued numeric type (SortedNumeric) that is suitable for faceting/sorting/analytics.

  • In version 5.4, DocValues have been improved to store more information on disk: addresses for binary fields and ord indexes for multi-valued fields.

  • In version 6.0, Points were added, for multi-dimensional range/distance search.

  • In version 6.2, a new Segment info format was added that reads/writes the index sort, to support index sorting.

  • In version 7.0, DocValues have been improved to better support sparse doc values thanks to an iterator API.
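The checksum footer introduced in version 4.8 can be illustrated with a simplified sketch. It assumes the checksum is stored as an 8-byte big-endian integer covering every byte that precedes it; the exact footer layout and byte order are assumptions here, and the function names are hypothetical.

```python
import struct
import zlib


def with_checksum_footer(payload):
    """Append the CRC32 of the payload as an 8-byte big-endian value."""
    checksum = zlib.crc32(payload) & 0xFFFFFFFF
    return payload + struct.pack(">Q", checksum)


def verify_checksum_footer(data):
    """Recompute the CRC32 of everything before the last 8 bytes and compare."""
    if len(data) < 8:
        return False
    payload, footer = data[:-8], data[-8:]
    (stored,) = struct.unpack(">Q", footer)
    return stored == (zlib.crc32(payload) & 0xFFFFFFFF)


data = with_checksum_footer(b"postings")
corrupted = b"x" + data[1:]  # flip the first payload byte
```

Verifying `data` succeeds, while verifying `corrupted` fails, which is how a reader can detect a damaged file before trusting its contents.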

Limitations

Lucene uses a Java int to refer to document numbers, and the index file format uses an Int32 on-disk to store document numbers. This is a limitation of both the index file format and the current implementation. Eventually these should be replaced with either UInt64 values, or better yet, VInt values which have no limit.
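To show why a variable-length integer has no built-in upper limit, here is a minimal sketch of a VInt-style encoding. It assumes the common scheme of seven data bits per byte, least significant group first, with the high bit of each byte signalling that more bytes follow; the function names are hypothetical.

```python
def encode_vint(value):
    """Encode a non-negative integer using 7 data bits per byte."""
    if value < 0:
        raise ValueError("VInt values must be non-negative")
    out = bytearray()
    while value >= 0x80:
        # Low seven bits, with the high bit set to mark a continuation.
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)  # final byte has the high bit clear
    return bytes(out)


def decode_vint(data):
    """Decode one value from the start of a byte sequence."""
    value, shift = 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value
        shift += 7
    raise ValueError("truncated VInt")
```

Small values take a single byte, and larger values simply take more bytes, so a number beyond the Int32 range such as 2**40 still round-trips.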
