一、Physical partitioning
- 在一個transformation之后,F(xiàn)link也提供了底層API以允許用戶在必要時精確控制流分區(qū)。
- 所謂的Physical partitioning(或operator partition)就是operator parallel instance即SubTask。
- Flink allows us to perform physical partitioning of the stream data. You have an option to
provide custom partitioning. Let us have a look at the different types of partitioning.
1、Custom partitioning
- As mentioned earlier, you can provide custom implementation of a partitioner.
- While writing a custom partitioner you need make sure you implement an efficient hash function.
2、Shuffle (Random partitioning 隨機(jī))
- Random partitioning randomly partitions data streams in an evenly(均勻) manner.
3、Rebalancing partitioning(均勻 via round robin method)
- This type of partitioning helps distribute the data evenly. It uses a round robin method for distribution. This type of partitioning is good when data is skewed.
rebalance.jpg
4、Rescaling
Rescaling is used to distribute the data across operations, perform transformations on sub-sets of data and combine them together. This rebalancing happens over a single node only, hence it does not require any data transfer across networks.
5、Broadcasting(動態(tài)規(guī)則更新)
- Broadcasting distributes all records to each partition. This fans out each and every element to all partitions.
二、比較
1、shuffle VS rebalance(來自stackoverflow)
正如文檔所述,shuffle將隨機(jī)分發(fā)數(shù)據(jù),而rebalance將以循環(huán)方式分發(fā)數(shù)據(jù)。后者更有效,因為您不必計算隨機(jī)數(shù)。而且,根據(jù)隨機(jī)性,你最終可能會得到某種不那么均勻的分布。
另一方面,rebalance將始終開始將第一個元素發(fā)送到第一個通道。因此,如果您只有少量元素(元素少于子任務(wù)),那么只有部分子任務(wù)將接收元素,因為您總是開始將第一個元素發(fā)送到第一個子任務(wù)。在流式傳輸?shù)那闆r下,這最終無關(guān)緊要,因為你通常有一個無界的輸入流。
兩種方法存在的實際原因是歷史原因。shuffle首先介紹。為了使批處理的流API更加相似,rebalance然后介紹了。
2、Rescaling(低配版Rebalance, 無需網(wǎng)絡(luò)傳輸)
以round-robin方式對元素分區(qū)到下游operations。如果你想從source的每個并行實例分散到若干個mappers以負(fù)載均衡,但是你不期望rebalacne()那樣進(jìn)行全局負(fù)載均衡,這將會有用。這將僅需要本地數(shù)據(jù)傳輸,而不是通過網(wǎng)絡(luò)傳輸數(shù)據(jù),具體取決于其他配置值,例如TaskManager的插槽數(shù)。
上游operation所發(fā)送的元素被分區(qū)到下游operation的哪些子集,取決于上游和下游操作的并發(fā)度。例如,如果上游operation并發(fā)度為2,而下游operation并發(fā)度為6,則其中1個上游operation會將元素分發(fā)到3個下游operation,另1個上游operation會將元素分發(fā)到另外3個下游operation。相反地,如果上游operation并發(fā)度為6,而下游operation并發(fā)度為2,則其中3個上游operation會將元素分發(fā)到1個下游operation,另1個上游operation會將元素分發(fā)到另外1個下游operation。
在上下游operation的并行度不是彼此的倍數(shù)的情況下,下游operation對應(yīng)的上游的operation輸入數(shù)量不同。

