基本概念
Roaring BitMap 以下簡稱 RBM,中文翻譯為咆哮位圖,它本質(zhì)上是定義了一個很大的 bit 數(shù)組,每個元素對應(yīng)到 bit 數(shù)組的其中一位,一個Integer是32-bit, 一共有Integer.MAX_VALUE = 2 ^ 32個值,32-bit的unsigned integer的集合(共2 ^ 32 = 42 9496 7296個)
這個數(shù)足夠覆蓋一款產(chǎn)品的user數(shù)或item數(shù)(item 泛指是新聞,商品等)
由定義可知,其去重是針對int 型數(shù)據(jù)進(jìn)行操作,對于非Integer類型的數(shù)據(jù)(比如String類型)可以通過數(shù)據(jù)字典映射成Integer,比如數(shù)據(jù)庫的ID
基本位圖實現(xiàn)存在問題
- 即使只存一個數(shù)(最大的),存儲空間也是512MB
對于原始的Bitmap來說,這就需要2 ^ 32長度的bit數(shù)組
通過計算可以發(fā)現(xiàn)(2 ^ 32 / 8 bytes = 512MB), 一個普通的Bitmap需要耗費512MB的存儲空間
不管業(yè)務(wù)值的基數(shù)有多大,這個存儲空間的消耗都是恒定不變的,這顯然是不能接受的。
redis 的基本的位圖實現(xiàn)就存在這個問題
REDIS BITMAP 測試
localhost:0>flushdb
localhost:0>info
# Memory
used_memory:690288
used_memory_human:674.11K
used_memory_rss:652376
used_memory_rss_human:637.09K #表示該進(jìn)程所占物理內(nèi)存的大小
used_memory_peak:667584400
used_memory_peak_human:637.09K #是過去Redis內(nèi)存使用的峰值
localhost:0> setbit a 4294967295 1 #將最大位置1,此時基數(shù)為1
# Memory
used_memory:541755832
used_memory_human:516.66M #set 一個值,直接占512m內(nèi)存
used_memory_rss:541717944
used_memory_rss_human:516.62M
used_memory_peak:667584400
used_memory_peak_human:636.66M
附redis raoring-bitmap 實現(xiàn) https://github.com/aviggiano/redis-roaring
Roaring Bitmap實現(xiàn)
- RBM實現(xiàn)源理
將32位無符號整數(shù)按照高16位分桶,即最多可能有216=65536個桶,論文內(nèi)稱為container。存儲數(shù)據(jù)時,按照數(shù)據(jù)的高16位找到container(找不到就會新建一個),再將低16位放入container中。也就是說,一個RBM就是很多container的集合。
RoaringBitmap 構(gòu)造源碼
RoaringArray highLowContainer = null;
/**
* Create an empty bitmap
*/
public RoaringBitmap() {
highLowContainer = new RoaringArray();
}
RoaringArray
static final int INITIAL_CAPACITY = 4;
short[] keys = null;//高16位數(shù)組 有序數(shù)組,方便二分查找
Container[] values = null;//低16位容器數(shù)組
int size = 0;
protected RoaringArray() {
this(INITIAL_CAPACITY);
}
-
低16位容器數(shù)組存儲示意圖
低16位容器數(shù)組存儲示意圖 - 前1000個62的倍數(shù)
- 高16位為1,低16位為0到99 0x00010000-0x00010063
- 高16位為2, 低16位的所有偶數(shù)
Container 實現(xiàn)
Container只需要處理低16位的數(shù)據(jù)。
- ArrayContainer
public final class ArrayContainer extends Container implements Cloneable {
private static final int DEFAULT_INIT_SIZE = 4;
private static final int ARRAY_LAZY_LOWERBOUND = 1024;
static final int DEFAULT_MAX_SIZE = 4096;// containers with DEFAULT_MAX_SZE or less integers
// should be ArrayContainers
private static final long serialVersionUID = 1L;
protected int cardinality = 0;
short[] content;
/**
* Create an array container with default capacity
*/
public ArrayContainer() {
this(DEFAULT_INIT_SIZE);
}
short[] content,將16位value直接存儲。
short[] content始終保持有序且不重,方便使用二分查找。
根據(jù)源碼可以看出,常量DEFAULT_MAX_SIZE值為4096,當(dāng)容量超過這個值的時候會將當(dāng)前Container替換為BitmapContainer。
/**
* running time is in O(n) time if insert is not in order.
*/
@Override
public Container add(final short x) {
//基數(shù)為0或當(dāng)前值大于容器中的最大值(因為有序,最后一個即最大值)
if (cardinality == 0 || (cardinality > 0
&& toIntUnsigned(x) > toIntUnsigned(content[cardinality - 1]))) {
//基數(shù)>=4096則轉(zhuǎn)為BitmapContainer
if (cardinality >= DEFAULT_MAX_SIZE) {
return toBitmapContainer().add(x);
}
//如果容器空間不路,則擴(kuò)容[見下方擴(kuò)容源代碼]
if (cardinality >= this.content.length) {
increaseCapacity();
}
content[cardinality++] = x;
} else {
//如果當(dāng)前值在容器范圍內(nèi),則找到該值對應(yīng)的位置[見下二分查找源碼]
int loc = Util.unsignedBinarySearch(content, 0, cardinality, x);
//未找到
if (loc < 0) {
// Transform the ArrayContainer to a BitmapContainer
// when cardinality = DEFAULT_MAX_SIZE
//雖然在容器范圍內(nèi)也可能會涉及到容器升級和擴(kuò)容
if (cardinality >= DEFAULT_MAX_SIZE) {
return toBitmapContainer().add(x);
}
if (cardinality >= this.content.length) {
increaseCapacity();
}
// insertion : shift the elements > x by one position to
// the right
// and put x in it's appropriate place
System.arraycopy(content, -loc - 1, content, -loc, cardinality + loc + 1);
content[-loc - 1] = x;
++cardinality;
}
}
return this;
}
數(shù)組擴(kuò)容邏輯
private void increaseCapacity(boolean allowIllegalSize) {
int newCapacity = (this.content.length == 0) ? DEFAULT_INIT_SIZE
: this.content.length < 64 ? this.content.length * 2
: this.content.length < 1067 ? this.content.length * 3 / 2
: this.content.length * 5 / 4;
// never allocate more than we will ever need
if (newCapacity > ArrayContainer.DEFAULT_MAX_SIZE && !allowIllegalSize) {
newCapacity = ArrayContainer.DEFAULT_MAX_SIZE;
}
// if we are within 1/16th of the max, go to max
if (newCapacity > ArrayContainer.DEFAULT_MAX_SIZE - ArrayContainer.DEFAULT_MAX_SIZE / 16
&& !allowIllegalSize) {
newCapacity = ArrayContainer.DEFAULT_MAX_SIZE;
}
this.content = Arrays.copyOf(this.content, newCapacity);
}
數(shù)組的二分查找(找到則返回value的位置,未找到則返回當(dāng)前值將要insert 的位置,為與找到值的位置區(qū)分,這里返回負(fù)值)
/**
* Look for value k in array in the range [begin,end). If the value is found, return its index. If
* not, return -(i+1) where i is the index where the value would be inserted. The array is assumed
* to contain sorted values where shorts are interpreted as unsigned integers.
*
* @param array array where we search
* @param begin first index (inclusive)
* @param end last index (exclusive)
* @param k value we search for
* @return count
*/
public static int unsignedBinarySearch(final short[] array, final int begin, final int end,
final short k) {
if (USE_HYBRID_BINSEARCH) {
return hybridUnsignedBinarySearch(array, begin, end, k);
} else {
return branchyUnsignedBinarySearch(array, begin, end, k);
}
}
為什么是4096的時侯升級容器?

- bitmap存儲空間恒定為8K,最大的基數(shù)可達(dá)到8*1024*8=65536個
- array的基數(shù)與存儲空間成正比,即基數(shù)越大,占用空占越多
通過以上兩點我們?nèi)烧呓幌嘟坏狞c為平衡點,即小于該點array更省空間,大于該點bitmap更省空間。
上圖中的前兩個container基數(shù)都沒超過4096,所以均為ArrayContainer。
- BitmapContainer
/**
* Simple bitset-like container.
*/
public final class BitmapContainer extends Container implements Cloneable {
protected static final int MAX_CAPACITY = 1 << 16;
//保存低16位,所以最大值為216
// the parameter is for overloading and symmetry with ArrayContainer
protected static int serializedSizeInBytes(int unusedCardinality) {
return MAX_CAPACITY / 8;
}
final long[] bitmap;
int cardinality;
// nruns value for which RunContainer.serializedSizeInBytes ==
// BitmapContainer.getArraySizeInBytes()
private final int MAXRUNS = (getArraySizeInBytes() - 2) / 4;
/**
* Create a bitmap container with all bits set to false
*構(gòu)造最大的long 值數(shù)組,每個long 64位
*/
public BitmapContainer() {
this.cardinality = 0;
this.bitmap = new long[MAX_CAPACITY / 64];//1024個long
}
這種Container使用long[]存儲位圖數(shù)據(jù)。我們知道,每個Container處理16位整形的數(shù)據(jù),也就是0~65535,因此根據(jù)位圖的原理,需要65536個比特來存儲數(shù)據(jù),每個比特位用1來表示有,0來表示無。每個long有64位,因此需要1024個long來提供65536個bit。
因此,每個BitmapContainer在構(gòu)建時就會初始化長度為1024的long[]。這就意味著,不管一個BitmapContainer中只存儲了1個數(shù)據(jù)還是存儲了65536個數(shù)據(jù),占用的空間都是同樣的8kb。
上圖中的第三個container基數(shù)遠(yuǎn)遠(yuǎn)大于4096,所以要用BitmapContainer存儲。
bit map container add方法源碼分析
@Override
public Container add(final short i) {
final int x = Util.toIntUnsigned(i);
//當(dāng)前值/64取整找到long數(shù)組的索引
final long previous = bitmap[x / 64];
//找到當(dāng)前值所在long 的第幾位,并將該位置1,1L<<x 等價于1x<<(x%64)
關(guān)于位移操作詳情官方解釋
https://docs.oracle.com/javase/specs/jls/se10/html/jls-15.html#jls-15.19
The operators << (left shift), >> (signed right shift), and >>> (unsigned right shift) are called the shift operators. The left-hand operand of a shift operator is the value to be shifted; the right-hand operand specifies the shift distance.
The type of the shift expression is the promoted type of the left-hand operand.
If the promoted type of the left-hand operand isint, then only the five lowest-order bits of the right-hand operand are used as the shift distance.(int 只有低5位有效)It is as if the right-hand operand were subjected to a bitwise logical AND operator & (§15.22.1) with the mask value 0x1f (0b11111). The shift distance actually used is therefore always in the range 0 to 31, inclusive.
If the promoted type of the left-hand operand islong, then only the six lowest-order bits of the right-hand operand are used as the shift distance.(long 只有低6位有效)It is as if the right-hand operand were subjected to a bitwise logical AND operator & (§15.22.1) with the mask value 0x3f (0b111111). The shift distance actually used is therefore always in the range 0 to 63, inclusive.
long newval = previous | (1L << x);
//賦新值
bitmap[x / 64] = newval;
if (USE_BRANCHLESS) {
cardinality += (previous ^ newval) >>> x;
} else if (previous != newval) {
++cardinality;
}
return this;
}
- RunContainer
該容器由RLE實現(xiàn)
/**
* This container takes the form of runs of consecutive values (effectively, run-length encoding).
*
* Adding and removing content from this container might make it wasteful so regular calls to
* "runOptimize" might be warranted.
*/
public final class RunContainer extends Container implements Cloneable {
private static final int DEFAULT_INIT_SIZE = 4;
private static final boolean ENABLE_GALLOPING_AND = false;
private short[] valueslength;//主要存儲結(jié)構(gòu) we interleave values and lengths, so
// that if you have the values 11,12,13,14,15, you store that as 11,4 where 4 means that beyond 11
// itself, there are
// 4 contiguous values that follows.
// Other example: e.g., 1, 10, 20,0, 31,2 would be a concise representation of 1, 2, ..., 11, 20,
// 31, 32, 33
int nbrruns = 0;// how many runs, this number should fit in 16 bits.
- Run-Length Encoding(RLE)
Run-length encoding是被許多bitmap文件格式支持的數(shù)據(jù)壓縮算法
RLE工作方式是減少重復(fù)字符的物理尺寸,被編碼成兩個字節(jié),第一個字節(jié)表示字符數(shù)量,第二個字節(jié)表示本身字符值。
字符串包含4個不同字符
例如:
AAAAAAbbbXXXXXt
使用RLE,可以形成4個2字節(jié)包
6A3b5X1t
當(dāng)手動執(zhí)行runOptimize 方法時會觸發(fā)優(yōu)化
/**
* Use a run-length encoding where it is more space efficient
*
* @return whether a change was applied
*/
public boolean runOptimize() {
boolean answer = false;
for (int i = 0; i < this.highLowContainer.size(); i++) {
Container c = this.highLowContainer.getContainerAtIndex(i).runOptimize();
if (c instanceof RunContainer) {
answer = true;
}
this.highLowContainer.setContainerAtIndex(i, c);
}
return answer;
}
ArrayContainer 優(yōu)化邏輯
@Override
public Container runOptimize() {
// TODO: consider borrowing the BitmapContainer idea of early
// abandonment
// with ArrayContainers, when the number of runs in the arrayContainer
// passes some threshold based on the cardinality.
int numRuns = numberOfRuns();
int sizeAsRunContainer = RunContainer.serializedSizeInBytes(numRuns);
//如果RunContainer 比當(dāng)前的容器省空間,則升級為 RunContainer。BitMap 同理
if (getArraySizeInBytes() > sizeAsRunContainer) {
return new RunContainer(this, numRuns); // this could be maybe
// faster if initial
// container is a bitmap
} else {
return this;
}
}
numberOfRuns 計算run 的個數(shù)
@Override
int numberOfRuns() {
if (cardinality == 0) {
return 0; // should never happen
}
int numRuns = 1;
int oldv = toIntUnsigned(content[0]);
for (int i = 1; i < cardinality; i++) {
int newv = toIntUnsigned(content[i]);
if (oldv + 1 != newv) {
++numRuns;
}
oldv = newv;
}
return numRuns;
}
計算RunContainer 可能占用的容量
protected static int serializedSizeInBytes(int numberOfRuns) {
return 2 + 2 * 2 * numberOfRuns; // each run requires 2 2-byte entries.
//值和長度成對出現(xiàn),分別兩個字節(jié)
}
容器轉(zhuǎn)換總結(jié)
- ArrayContainer:
如果插入值后容量超過4096,則自動轉(zhuǎn)換為BitmapContainer。因此正常使用的情況下不會出現(xiàn)容量超過4096的ArrayContainer。
調(diào)用runOptimize()方法時,會比較和RunContainer的空間占用大小,選擇是否轉(zhuǎn)換為RunContainer。 - BitmapContainer:
如果刪除某值后容量低至4096,則會自動轉(zhuǎn)換為ArrayContainer。因此正常使用的情況下不會出現(xiàn)容量小于4096的BitmapContainer。
調(diào)用runOptimize()方法時,會比較和RunContainer的空間占用大小,選擇是否轉(zhuǎn)換為RunContainer。 - RunContainer:
只有在調(diào)用runOptimize()方法才會發(fā)生轉(zhuǎn)換,會分別和ArrayContainer、BitmapContainer比較空間占用大小,然后選擇是否轉(zhuǎn)換。
參考
《Better bitmap performance with Roaring bitmaps》
《Consistently faster and smaller compressed bitmaps with Roaring》
The Java? Language Specification
http://www.itdecent.cn/p/818ac4e90daf
https://blog.csdn.net/luanpeng825485697/article/details/101110798
