295 Find Median from Data Stream
題目
輸入一個(gè)數(shù)據(jù)流, 求這個(gè)數(shù)據(jù)流的中位數(shù)median。中位數(shù)的意思就是對數(shù)據(jù)進(jìn)行排序,如果總數(shù)是奇數(shù)個(gè),那么就是最中間的數(shù),如果是偶數(shù)個(gè),那么就是中間兩個(gè)數(shù)的平均值。
例子
addNum(1)
addNum(2)
findMedian() -> 1.5
addNum(3)
findMedian() -> 2
方法
Simple Sorting
- 最直接的方法就是有序的存住目前為止的數(shù),用一個(gè)PriorityQueue可以實(shí)現(xiàn)。
- 查找的時(shí)候,掃描PriorityQueue取中間值。
- 時(shí)間復(fù)雜度:O(NlogN)
- 空間復(fù)雜度:O(N)
Insertion Sort
- 當(dāng)一個(gè)序列是有序的,新加入一個(gè)數(shù),如果還想保持有序,可以用Insertion Sort。
- Insertion Sort需要我們找到新數(shù)需要插入的位置,我們可以用Binary Search實(shí)現(xiàn)(時(shí)間復(fù)雜度:O(logN))。
- 查找的時(shí)候,在有序序列中取中間值。
- 時(shí)間復(fù)雜度:O(logN)+O(N) = O(N)
- 空間復(fù)雜度:O(N)
在Insertion Sort的實(shí)現(xiàn)中,如果用LinkedList,那么插入可為O(1),查找位置是O(N)。如果用ArrayList,那么插入可為O(N) (因?yàn)樾枰巡迦胩幍脑赝笠苿?dòng)),查找位置是O(logN)。所以總的時(shí)間復(fù)雜度為O(N)。
[Best] Two Heaps
前兩種方法都保證了整體序列有序,但其實(shí)我們只想要Median,不一定要整體有序。
- 維護(hù)兩個(gè)PriorityQueue:一個(gè)存小數(shù)maxHeap,一個(gè)存大數(shù)minHeap,并且盡量維持左右PriorityQueue的size大小相同,或者小的那個(gè)size比大的大1。
- 當(dāng)新的元素x進(jìn)來,比較maxHeap里面的最大值,minHeap里面的最小值,和x。再次平衡兩個(gè)heap。
- 時(shí)間復(fù)雜度:O(logN)
- 空間復(fù)雜度:O(N)
Multiset and Two Pointers
方法4本質(zhì)上和3是一樣的。只不過是用了Self-balancing Binary Search Trees(比如AVL Tree)去實(shí)現(xiàn)。
實(shí)現(xiàn)起來有點(diǎn)復(fù)雜,在這里就不展開了。個(gè)人覺得方法3簡單容易理解。
- 時(shí)間復(fù)雜度:O(logN)
- 空間復(fù)雜度:O(N)
Others
Buckets 如果stream里面的數(shù)是 statistically distributed,那么很容易通過bucket去獲得中位數(shù)。一旦知道了正確的bucket,sort這個(gè)bucket就能知道中位數(shù)。如果bucket大小 << stream的大小,就大大節(jié)省了時(shí)間。
Reservoir Sampling(蓄水池). Following along the lines of using buckets: if the stream is statistically distributed, you can rely on Reservoir Sampling. Basically, if you could maintain just one good bucket (or reservoir) which could hold a representative sample of the entire stream, you could estimate the median of the entire stream from just this one bucket. This means good time and memory performance. Reservoir Sampling lets you do just that. Determining a "good" size for your reservoir? Now, that's a whole other challenge. A good explanation for this can be found in this StackOverflow answer.
Segment Trees 如果想對有限范圍內(nèi)的數(shù)據(jù)進(jìn)行大量的讀取或者大量的插入操作,Segment Tree是個(gè)很好的數(shù)據(jù)結(jié)構(gòu)。They allow us to do all such operations fast and in roughly the same amount of time, always. 缺點(diǎn)是不容易實(shí)現(xiàn)(寫起來麻煩)。參考 introductory article on Segment Trees。
Order Statistic Trees are data structures which seem to be tailor-made for this problem. They have all the nice features of a BST, but also let you find the kth order element stored in the tree. 缺點(diǎn)是不容易實(shí)現(xiàn)(寫起來麻煩),面試時(shí)幾乎不會(huì)考到。但如果這個(gè)結(jié)構(gòu)已經(jīng)提供,用起來會(huì)很有趣。