Java NIO: Non-blocking Server
- Non-blocking Server - GitHub Repository
- Non-blocking IO Pipelines
- Non-blocking vs. Blocking IO Pipelines
- Basic Non-blocking IO Pipeline Design
- Reading Partial Messages
- Storing Partial Messages
- Writing Partial Messages
- Putting it All Together
- Server Thread Model
Even if you understand how the Java NIO non-blocking features work (Selector, Channel, Buffer etc.), designing a non-blocking server is still hard. Non-blocking IO contains several challenges compared to blocking IO. This non-blocking server tutorial will discuss the major challenges of non-blocking servers, and describe some potential solutions for them.
Finding good information about designing non-blocking servers is hard. Therefore the solutions provided in this tutorial are based on my own work and ideas. If you have some alternative or even better ideas, I would be happy to hear about them! You can write a comment under the article or send me an email (see our About page), or catch me on Twitter.
The ideas described in this tutorial are designed around Java NIO. However, I believe that the ideas can be reused in other languages as long as they have some kind of Selector-like construct. As far as I know, such constructs are provided by the underlying OS, so there is a good chance that you can get access to this in other languages too.
Non-blocking Server - GitHub Repository
I have created a simple proof-of-concept of the ideas presented in this tutorial and put it in a GitHub repository for you to look at. Here is the GitHub repository:
https://github.com/jjenkov/java-nio-server
Non-blocking IO Pipelines
A non-blocking IO pipeline is a chain of components that process non-blocking IO. This includes both reading and writing IO in a non-blocking fashion.
Here is an illustration of a simplified non-blocking IO pipeline:

A component uses a Selector to check when a Channel has data to read. Then the component reads the input data and generates some output based on the input. The output is written to a Channel again.
A non-blocking IO pipeline does not need to both read and write data. Some pipelines may only read data, and some pipelines may only write data.
The above diagram only shows a single component. A non-blocking IO pipeline may have more than one component process incoming data. The length of a non-blocking IO pipeline depends on what the pipeline needs to do.
A non-blocking IO pipeline may also be reading from multiple Channels at the same time. For instance, reading data from multiple SocketChannels.
The flow of control in the above diagram is also simplified. It is the component that initiates the reading of data from the Channel via the Selector. It is not the Channel that pushes the data into the Selector and from there into the component, even if that is what the above diagram suggests.
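This pull-based control flow can be sketched as follows. The component polls the Selector and reads from whichever channels actually have data; the class and method names here are illustrative, not taken from the tutorial's repository:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.util.Iterator;

public class ReadComponent {

    // Polls the Selector once and reads from every channel that has data.
    // Returns the total number of bytes read into the buffer.
    public static int pollOnce(Selector selector, ByteBuffer buffer) throws IOException {
        int totalRead = 0;
        selector.selectNow(); // returns immediately; never blocks
        Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
        while (keys.hasNext()) {
            SelectionKey key = keys.next();
            keys.remove();
            if (key.isReadable()) {
                ReadableByteChannel channel = (ReadableByteChannel) key.channel();
                int bytesRead = channel.read(buffer); // the component pulls the data
                if (bytesRead > 0) {
                    totalRead += bytesRead;
                    // ... generate output from the input and write it to a Channel ...
                }
            }
        }
        return totalRead;
    }
}
```

Note that it is the component calling selectNow() and read() - the Channel never pushes anything on its own.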
Non-blocking vs. Blocking IO Pipelines
The biggest difference between a non-blocking and a blocking IO pipeline is how data is read from the underlying Channel (socket or file).
IO pipelines typically read data from some stream (from a socket or file) and split that data into coherent messages. This is similar to breaking a stream of data into tokens for parsing with a tokenizer, except that you break the stream of data into larger messages instead. I will call the component that breaks the stream into messages a Message Reader.
Here is an illustration of a Message Reader breaking a stream into messages:

A blocking IO pipeline can use an InputStream-like interface where one byte at a time can be read from the underlying Channel, and where the InputStream-like interface blocks until there is data ready to read. This results in a blocking Message Reader implementation.
Using a blocking IO interface to a stream simplifies the implementation of a Message Reader a lot. A blocking Message Reader never has to handle situations where no data was read from the stream, or where only a partial message was read from the stream and message parsing needs to be resumed later.
Similarly, a blocking Message Writer (a component that writes messages to a stream) never has to handle the situation where only part of a message was written, and where message writing has to be resumed later.
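As a sketch, a blocking Message Reader for a hypothetical length-prefixed message format could look like this. The format (a 4-byte length followed by the body) is an assumption for illustration, not a format the tutorial prescribes:

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BlockingMessageReader {

    // Reads one length-prefixed message. Every read blocks until the
    // requested bytes have arrived, so there is no partial-message state.
    public static byte[] readMessage(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        int length = data.readInt();   // blocks until the 4 length bytes are available
        byte[] message = new byte[length];
        data.readFully(message);       // blocks until the whole message body has arrived
        return message;
    }
}
```

The simplicity comes precisely from the blocking: readFully() simply waits, so the reader never has to remember where it got to.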
Blocking IO Pipeline Drawbacks
While a blocking Message Reader is easier to implement, it has the unfortunate drawback of requiring a separate thread for each stream that needs to be split into messages. The reason this is necessary is that the IO interface of each stream blocks until there is some data to read from it. That means that a single thread cannot attempt to read from one stream, and if there is no data, read from another stream. As soon as a thread attempts to read data from a stream, the thread blocks until there is actually some data to read.
If the IO pipeline is part of a server which has to handle lots of concurrent connections, the server will need one thread per active incoming connection. This may not be a problem if the server only has a few hundred concurrent connections at any time. But, if the server has millions of concurrent connections, this type of design does not scale so well. Each thread will take between 320K (32 bit JVM) and 1024K (64 bit JVM) memory for its stack (the exact size can be set with the -Xss JVM option). So, 1.000.000 threads will take 1 TB memory! And that is before the server has used any memory for processing the incoming messages (e.g. memory allocated for objects used during message processing).
To keep the number of threads down, many servers use a design where the server keeps a pool of threads (e.g. 100) which reads messages from the inbound connections one at a time. The inbound connections are kept in a queue, and the threads process messages from each inbound connection in the sequence the inbound connections are put into the queue.
This design is illustrated here:

However, this design requires that the inbound connections send data reasonably often. If the inbound connections may be inactive for longer periods, a high number of inactive connections may actually block all the threads in the thread pool. That means that the server becomes slow to respond or even unresponsive.
Some server designs try to mitigate this problem by having some elasticity in the number of threads in the thread pool. For instance, if the thread pool runs out of threads, the thread pool might start more threads to handle the load. This solution means that it takes a higher number of slow connections to make the server unresponsive. But remember, there is still an upper limit to how many threads you can have running. So, this would not scale well with 1.000.000 slow connections.
Basic Non-blocking IO Pipeline Design
A non-blocking IO pipeline can use a single thread to read messages from multiple streams. This requires that the streams can be switched to non-blocking mode. When in non-blocking mode, a stream may return zero or more bytes when you attempt to read data from it. Zero bytes are returned if the stream has no data to read. One or more bytes (up to the capacity of the ByteBuffer you read into) are returned when the stream actually has some data to read.
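The difference is easy to see with an in-process Pipe, whose source channel can be switched to non-blocking mode. This is a small standalone demo, not part of the tutorial's code:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;

public class NonBlockingReadDemo {

    // A single non-blocking read: returns 0 when no data is available,
    // and 1..capacity bytes when data has arrived. It never blocks.
    public static int readOnce(Pipe.SourceChannel source, ByteBuffer buffer) throws IOException {
        return source.read(buffer);
    }

    public static void main(String[] args) throws IOException {
        Pipe pipe = Pipe.open();
        pipe.source().configureBlocking(false);
        ByteBuffer buffer = ByteBuffer.allocate(64);

        System.out.println(readOnce(pipe.source(), buffer)); // nothing written yet: 0

        pipe.sink().write(ByteBuffer.wrap(new byte[]{1, 2, 3}));
        System.out.println(readOnce(pipe.source(), buffer)); // data available now
    }
}
```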
To avoid checking streams that have 0 bytes to read, we use a Java NIO Selector. One or more SelectableChannel instances can be registered with a Selector. When you call select() or selectNow() on the Selector, it gives you only the SelectableChannel instances that actually have data to read.
This design is illustrated here:

Reading Partial Messages
When we read a block of data from a SelectableChannel we do not know if that data block contains less or more than a message. A data block could potentially contain a partial message (less than a message), a full message, or more than a message, for instance 1.5 or 2.5 messages. The various partial message possibilities are illustrated here:

There are two challenges in handling partial messages:
- Detecting if you have a full message in the data block.
- What to do with partial messages until the rest of the message arrives.
Detecting full messages requires that the Message Reader looks at the data in the data block to see if the data contains at least one full message. If the data block contains one or more full messages, these messages can be sent down the pipeline for processing. The process of looking for full messages will be repeated a lot, so this process has to be as fast as possible.
Whenever there is a partial message in a data block, either by itself or after one or more full messages, that partial message needs to be stored until the rest of that message arrives from the Channel.
Both detecting full messages and storing partial messages are the responsibility of the Message Reader. To avoid mixing message data from different Channel instances we will use one Message Reader per Channel. The design looks like this:

After retrieving a Channel instance which has data to read from the Selector, the Message Reader associated with that Channel reads data and attempts to break it into messages. If that results in any full messages being read, these messages can be passed down the read pipeline to whatever component needs to process them.
A Message Reader is of course protocol specific. A Message Reader needs to know the message format of the messages it is trying to read. If our server implementation is to be reusable across protocols, it needs to have the Message Reader implementation plugged in - possibly by accepting a Message Reader factory as a configuration parameter.
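A minimal sketch of such a reader for a hypothetical length-prefixed protocol is shown below. It both detects full messages and keeps partial message bytes until the rest arrives; the class name, framing, and fixed buffer size are assumptions for illustration, not the repository's actual API:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class LengthPrefixedMessageReader {

    // Partial message bytes are kept here until the rest arrives.
    private final ByteBuffer partial = ByteBuffer.allocate(1024 * 1024);

    // Appends newly read bytes and returns every complete message found.
    // Messages are framed as: 4-byte big-endian length, then the body.
    public List<byte[]> feed(ByteBuffer newData) {
        partial.put(newData);
        partial.flip(); // switch to reading mode
        List<byte[]> messages = new ArrayList<>();
        while (partial.remaining() >= 4) {
            partial.mark();
            int length = partial.getInt();
            if (partial.remaining() < length) {
                partial.reset(); // only a partial message: wait for more data
                break;
            }
            byte[] message = new byte[length];
            partial.get(message);
            messages.add(message);
        }
        partial.compact(); // keep any leftover partial message at the front
        return messages;
    }
}
```

A real implementation would sit behind a protocol-neutral interface so a different Message Reader could be plugged in per protocol.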
Storing Partial Messages
Now that we have established that it is the responsibility of the Message Reader to store partial messages until a full message has been received, we need to figure out how this partial message storage should be implemented.
There are two design considerations we should take into account:
- We want to copy message data around as little as possible. The more copying, the lower the performance.
- We want full messages to be stored in consecutive byte sequences to make parsing messages easier.
A Buffer Per Message Reader
Obviously the partial messages need to be stored in some kind of buffer. The straightforward implementation would be to simply have one buffer internally in each Message Reader. However, how big should that buffer be? It would need to be big enough to store even the biggest allowed message. So, if the biggest allowed message is 1MB, then the internal buffer in each Message Reader would need to be at least 1MB.
Using 1MB per connection doesn't really work when we reach millions of connections. 1.000.000 x 1MB is still 1TB memory! And what if the maximum message size is 16MB? Or 128MB ?
Resizable Buffers
Another option would be to implement a resizable buffer for use inside each Message Reader. A resizable buffer will start small, and if a message gets too big for the buffer, the buffer is expanded. This way each connection does not necessarily require an e.g. 1MB buffer. Each connection only takes as much memory as it needs to hold the next message.
There are several ways to implement a resizable buffer. All of them have advantages and disadvantages, so I will discuss them all in the following sections.
Resize by Copy
The first way to implement a resizable buffer is to start with a small buffer of e.g. 4KB. If a message cannot fit into the 4KB buffer, a larger buffer of e.g. 8KB could be allocated, and the data from the 4KB buffer copied into the bigger buffer.
The advantage of the resize-by-copy buffer implementation is that all data for a message is kept together in a single consecutive byte array. This makes parsing the message much easier.
The disadvantage of the resize-by-copy buffer implementation is that it will lead to a lot of data copying for bigger messages.
To reduce data copying you could analyze the size of the messages flowing through your system to find some buffer sizes that would reduce the amount of copying. For instance, you might see that most messages are less than 4KB because they only contain very small requests / responses. That means that the first buffer size should be 4KB.
Then you might see that if a message is larger than 4KB, it is often because it contains a file. You might then notice that most of the files flowing through the system are less than 128KB. Then it makes sense to make the second buffer size 128KB.
Finally you might see that once a message is above 128KB there is no real pattern in how large the message is, so perhaps the final buffer size should just be the maximum message size.
With these 3 buffer sizes based on the size of messages flowing through your system, you will have reduced data copying somewhat. Messages below 4KB will never be copied. For 1.000.000 concurrent connections that results in 1.000.000 x 4KB = 4GB which is possible in most servers today (2015). Messages between 4KB and 128KB will get copied once, and only 4KB data will need to be copied into the 128KB buffer. Messages between 128KB and maximum message size will be copied twice. First time 4KB will get copied, second time 128KB will get copied, so a total of 132KB copying for the biggest messages. Assuming that there are not that many messages above 128KB this might be acceptable.
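The three-tier scheme above can be sketched like this. The 4KB / 128KB / maximum tiers follow the example sizes in the text; the class, the 1MB maximum, and the append API are illustrative assumptions:

```java
import java.util.Arrays;

public class ResizeByCopyBuffer {

    static final int SMALL  = 4 * 1024;    // most requests / responses fit here
    static final int MEDIUM = 128 * 1024;  // most files fit here
    static final int LARGE  = 1024 * 1024; // maximum allowed message size (assumed)

    private byte[] data = new byte[SMALL];
    private int used = 0;

    // Appends bytes, growing to the next tier by copying when needed.
    public void append(byte[] bytes) {
        if (used + bytes.length > data.length) {
            int newCapacity = nextTier(used + bytes.length);
            data = Arrays.copyOf(data, newCapacity); // the copy this scheme pays for
        }
        System.arraycopy(bytes, 0, data, used, bytes.length);
        used += bytes.length;
    }

    // Picks the smallest tier that can hold the needed number of bytes.
    static int nextTier(int needed) {
        if (needed <= SMALL)  return SMALL;
        if (needed <= MEDIUM) return MEDIUM;
        if (needed <= LARGE)  return LARGE;
        throw new IllegalArgumentException("message exceeds maximum size");
    }

    public int capacity() { return data.length; }
    public int size()     { return used; }
}
```

Messages that fit in the 4KB tier are never copied; larger messages pay at most two copies, matching the analysis above.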
Once a message has been fully processed the allocated memory should be freed again. That way the next message received from the same connection starts with the smallest buffer size again. This is necessary to make sure that the memory can be shared more efficiently between connections. Most likely not all connections will need big buffers at the same time.
I have a complete tutorial about how to implement such a memory buffer that supports resizable arrays here: Resizable Arrays . The tutorial also contains a link to a GitHub repository with code showing a working implementation.
Resize by Append
Another way to resize a buffer is to make the buffer consist of multiple arrays. When you need to resize the buffer you simply allocate another byte array and write the data into that.
There are two ways to grow such a buffer. One way is to allocate separate byte arrays and keep a list of these byte arrays. Another way is to allocate slices of a larger, shared byte array, and then keep a list of the slices allocated to the buffer. Personally, I feel the slices approach is slightly better, but the difference is small.
The advantage of growing a buffer by appending separate arrays or slices to it is that no data needs to be copied during writing. All data can be copied directly from a socket (Channel) into an array or slice.
The disadvantage of growing a buffer this way is that the data is not stored in a single, consecutive array. This makes message parsing harder, since the parsers need to look out for both the end of every individual array and the end of all arrays at the same time. Since you need to look for the end of a message in the written data, this model is not too easy to work with.
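A sketch of the append-style buffer using a list of separately allocated arrays (the slice-of-a-shared-array variant would hand out regions of one big array instead; the class and its API are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class ResizeByAppendBuffer {

    private final List<byte[]> blocks = new ArrayList<>();

    // Appends a block without copying any previously written data.
    public void append(byte[] block) {
        blocks.add(block);
    }

    public int size() {
        int total = 0;
        for (byte[] block : blocks) total += block.length;
        return total;
    }

    // Parsing is harder: reading byte i means locating the right block first.
    public byte byteAt(int index) {
        for (byte[] block : blocks) {
            if (index < block.length) return block[index];
            index -= block.length;
        }
        throw new IndexOutOfBoundsException();
    }
}
```

The byteAt() walk illustrates the parsing cost: every access must account for block boundaries, which is exactly the drawback described above.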
TLV Encoded Messages
Some protocol message formats are encoded using a TLV format (Type, Length, Value). That means that when a message arrives, the total length of the message is stored in the beginning of the message. That way you know immediately how much memory to allocate for the whole message.
TLV encodings make memory management much easier. You know immediately how much memory to allocate for the message. No memory is wasted at the end of a buffer that is only partially used.
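For instance, parsing a single TLV field might look like the sketch below. The field sizes (a 1-byte type and a 4-byte length) are assumptions; as the text notes later, each TLV encoding chooses its own sizes and ordering:

```java
import java.nio.ByteBuffer;

public class TlvField {

    final byte type;
    final byte[] value;

    TlvField(byte type, byte[] value) {
        this.type = type;
        this.value = value;
    }

    // Parses one TLV field: 1-byte type, 4-byte length, then the value.
    // The length is known up front, so exactly value-sized memory is allocated.
    static TlvField parse(ByteBuffer buffer) {
        byte type = buffer.get();
        int length = buffer.getInt();
        byte[] value = new byte[length]; // allocate exactly what the field needs
        buffer.get(value);
        return new TlvField(type, value);
    }
}
```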
One disadvantage of TLV encodings is that you allocate all the memory for a message before all the data of the message has arrived. A few, slow connections sending big messages can thus allocate all the memory you have available, making your server unresponsive.
A workaround for this problem would be to use a message format that contains multiple TLV fields inside. Thus, memory is allocated for each field, not for the whole message, and memory is only allocated as the fields arrive. Still, a large field can have the same effect on your memory management as a large message.
Another workaround is to time out messages which have not been received within e.g. 10-15 seconds. This can make your server recover from a coincidental, simultaneous arrival of many big messages, but it will still make the server unresponsive for a while. Additionally, an intentional DoS (Denial of Service) attack can still result in full allocation of the memory for your server.
TLV encodings exist in different variations. Exactly how many bytes are used to specify the type and length of a field depends on each individual TLV encoding. There are also TLV encodings that put the length of the field first, then the type, and then the value (an LTV encoding). While the sequence of the fields is different, it is still a TLV variation.
The fact that TLV encodings make memory management easier is one of the reasons why HTTP 1.1 is such a terrible protocol. That is one of the problems they are trying to fix in HTTP 2.0, where data is transported in LTV encoded frames. This is also why we have designed our own network protocol for our VStack.co project that uses a TLV encoding.
Writing Partial Messages
In a non-blocking IO pipeline, writing data is also a challenge. When you call write(ByteBuffer) on a Channel in non-blocking mode, there is no guarantee about how many of the bytes in the ByteBuffer are written. The write(ByteBuffer) method returns how many bytes were written, so it is possible to keep track of the number of written bytes. And that is the challenge: keeping track of partially written messages so that in the end all bytes of a message have been sent.
To manage the writing of partial messages to a Channel we will create a Message Writer. Just like with the Message Reader, we will need a Message Writer per Channel we write messages to. Inside each Message Writer we keep track of exactly how many bytes have been written of the message it is currently writing.
If more messages arrive for a Message Writer than it can write directly out to the Channel, the messages need to be queued up internally in the Message Writer. The Message Writer then writes the messages as fast as it can to the Channel.
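A minimal Message Writer along these lines could look as follows. The queue-of-ByteBuffers design and method names are illustrative, not the repository's exact API; the key point is that the head buffer's position records exactly how far the current message got:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;
import java.util.ArrayDeque;
import java.util.Queue;

public class MessageWriter {

    // Messages waiting to be written; the head may be partially written,
    // and its ByteBuffer position tracks exactly how many bytes went out.
    private final Queue<ByteBuffer> queue = new ArrayDeque<>();

    public void enqueue(byte[] message) {
        queue.add(ByteBuffer.wrap(message));
    }

    // Writes as much as the Channel accepts right now; returns true
    // when the queue is empty (no partially written message remains).
    public boolean write(WritableByteChannel channel) throws IOException {
        while (!queue.isEmpty()) {
            ByteBuffer head = queue.peek();
            channel.write(head);        // may write only part of the buffer
            if (head.hasRemaining()) {
                return false;           // channel full: resume from here later
            }
            queue.remove();             // head fully written, move on
        }
        return true;
    }

    public boolean isEmpty() { return queue.isEmpty(); }
}
```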
Here is a diagram showing how the partial message writing is designed so far:

For the Message Writer to be able to send messages that were only partially sent earlier, the Message Writer needs to be called from time to time, so it can send more data.
If you have a lot of connections you will have a lot of Message Writer instances. Checking e.g. a million Message Writer instances to see if they can write any data is slow. First of all, many of the Message Writer instances may not have any messages to send. We don't want to check those Message Writer instances. Second, not all Channel instances may be ready to write data to. We don't want to waste time trying to write data to a Channel that cannot accept any data anyway.
To check if a Channel is ready for writing you can register the channel with a Selector. However, we do not want to register all Channel instances with the Selector. Imagine if you have 1.000.000 connections which are mostly idle and all 1.000.000 connections were registered with the Selector. Then, when you call select() most of these Channel instances would be write-ready (they are mostly idle, remember?). You would then have to check the Message Writer of all those connections to see if they had any data to write.
To avoid checking all Message Writer instances for messages, and all Channel instances which anyway do not have any messages to be sent to them, we use this two-step approach:
- When a message is written to a Message Writer, the Message Writer registers its associated Channel with the Selector (if it is not already registered).
- When your server has time, it checks the Selector to see which of the registered Channel instances are ready for writing. For each write-ready Channel, its associated Message Writer is requested to write data to the Channel. If a Message Writer writes all its messages to its Channel, the Channel is unregistered from the Selector again.
This little two-step approach makes sure that only Channel instances that have messages to be written to them are actually registered with the Selector.
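In Java NIO, registering and unregistering for write-readiness amounts to toggling the OP_WRITE bit of the channel's SelectionKey, sketched here (the helper class is illustrative):

```java
import java.nio.channels.SelectionKey;

public class WriteInterestToggle {

    // Step 1: a message was queued, so ask the Selector to watch this
    // channel for write-readiness (a no-op if the bit is already set).
    public static void registerForWrites(SelectionKey key) {
        key.interestOps(key.interestOps() | SelectionKey.OP_WRITE);
    }

    // Step 2: the Message Writer drained its queue, so stop watching;
    // an idle channel is almost always write-ready and would otherwise
    // make select() return it on every call.
    public static void unregisterForWrites(SelectionKey key) {
        key.interestOps(key.interestOps() & ~SelectionKey.OP_WRITE);
    }
}
```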
Putting it All Together
As you can see, a non-blocking server needs to check for incoming data from time to time to see if there are any new full messages received. The server might need to check multiple times until one or more full messages have been received. Checking one time is not enough.
Similarly, a non-blocking server needs to check from time to time if there is any data to write. If yes, the server needs to check if any of the corresponding connections are ready to have that data written to them. Checking only when a message is queued up the first time is not enough, since the message might be written partially.
All in all a non-blocking server ends up with three "pipelines" it needs to execute regularly:
- The read pipeline which checks for new incoming data from the open connections.
- The process pipeline which processes any full messages received.
- The write pipeline which checks if it can write any outgoing messages to any of the open connections.
These three pipelines are executed repeatedly in a loop. You might be able to optimize the execution somewhat. For instance, if there are no messages queued up you can skip the write pipeline. Or, if there were no new, full messages received, perhaps you can skip the process pipeline.
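The loop itself can be reduced to a small skeleton. Representing the pipelines as plain Runnables is a simplification for this sketch; the repository's actual classes differ:

```java
public class ServerLoop {

    // The three "pipelines", represented here as plain Runnables.
    private final Runnable readPipeline;
    private final Runnable processPipeline;
    private final Runnable writePipeline;

    public ServerLoop(Runnable read, Runnable process, Runnable write) {
        this.readPipeline = read;
        this.processPipeline = process;
        this.writePipeline = write;
    }

    // One full cycle; a real server runs this in a loop until shutdown.
    public void executeCycle() {
        readPipeline.run();     // check open connections for new incoming data
        processPipeline.run();  // process any full messages received
        writePipeline.run();    // flush queued outgoing messages where possible
    }
}
```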
Here is a diagram illustrating the full server loop:

If you still find this a bit complicated, remember to check out the GitHub repository:
https://github.com/jjenkov/java-nio-server
Seeing the code in action might help you understand how to implement this.
Server Thread Model
The non-blocking server implementation in the GitHub repository uses a thread model with 2 threads. The first thread accepts incoming connections from a ServerSocketChannel. The second thread processes the accepted connections, meaning reading messages, processing the messages and writing responses back to the connections. This 2 thread model is illustrated here:

The server processing loop explained in the previous section is executed by the processing thread.