最近我們的應(yīng)用搞了一次線上大促,導(dǎo)致流量暴增,應(yīng)用瞬時(shí)并發(fā)達(dá)到了平時(shí)幾十上百倍,結(jié)果出現(xiàn)了一個(gè)熱點(diǎn)接口大量超時(shí)的情況。這里記錄下排查過(guò)程,以備后查,也希望能給大家一些啟發(fā)。
在排查超時(shí)接口的時(shí)候,發(fā)現(xiàn)了一個(gè)很經(jīng)典的現(xiàn)象,就是超時(shí)接口本身只做了轉(zhuǎn)發(fā)邏輯,其調(diào)用的遠(yuǎn)程接口,幾乎都是瞬間返回的:通過(guò)nginx日志和應(yīng)用日志可以看到,在高并發(fā)的時(shí)刻,超時(shí)接口從接收到請(qǐng)求到返回響應(yīng)花了7、8秒甚至更長(zhǎng)(下圖請(qǐng)求1),而其內(nèi)部依賴的遠(yuǎn)程接口的響應(yīng)則是在毫秒級(jí)的(下圖請(qǐng)求2),并且響應(yīng)5和6幾乎是同時(shí)發(fā)生的。

那么要解決這個(gè)問(wèn)題其實(shí)就變成了去找到gateway接受到請(qǐng)求1后為什么沒(méi)有及時(shí)發(fā)出http請(qǐng)求到serviceA,這里我首先就把restTemplate定為了主要排查目標(biāo)對(duì)象,是不是RestTemplate的問(wèn)題導(dǎo)致了請(qǐng)求積壓。為了證明猜想,我們需要收集更多的信息,通過(guò)在測(cè)試環(huán)境壓測(cè)復(fù)現(xiàn)接口超時(shí)的場(chǎng)景,并且通過(guò)jstack命令獲取到高并發(fā)下的gateway應(yīng)用的堆棧信息,然后將thread stack dump文件丟到gceasy(fastThread)這個(gè)網(wǎng)站分析一波,看看當(dāng)時(shí)的活躍線程情況:
好家伙,人家的ML直接分析出來(lái)了應(yīng)用存在無(wú)法響應(yīng)的情況,還真是給他說(shuō)對(duì)了。

接著看幾個(gè)關(guān)鍵的指標(biāo),首先是線程狀態(tài)分布,可以看到大量的線程都處于WAITING狀態(tài):

而根據(jù)線程組聚合統(tǒng)計(jì),其中最多的線程都來(lái)自http請(qǐng)求,然而575個(gè)http線程中,真正RUNNABLE的數(shù)量只有少得可憐的幾個(gè),大量的都是WAITING和TIMED_WAITING的狀態(tài),這也和我們應(yīng)用的表象一致:

其中的堆棧完全一致的WAITING線程達(dá)到了497個(gè):

那么思路很明確了,就是看下這些WAITING的http線程都在做什么。是什么導(dǎo)致了 RestTemplate不能及時(shí)創(chuàng)建http請(qǐng)求。
通過(guò)查看這些線程的具體stacktrace,發(fā)現(xiàn)這些WAITING的線程正是執(zhí)行了RestTemplate的線程,它們內(nèi)部都是使用了apache httpClient創(chuàng)建連接,通過(guò)PoolingHttpClientConnectionManager.leaseConnection這條信息可以看出,大量的線程正嘗試獲取連接池中的連接,而連接資源是通過(guò)AQS獲取,目前處于阻塞等待的狀態(tài):
http-nio-8080-exec-571
PRIORITY : 5
THREAD ID : 0X00007F2A20263000
NATIVE ID : 0X52AA
NATIVE ID (DECIMAL) : 21162
STATE : WAITING
stackTrace:
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x0000000662488168> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at org.apache.http.pool.AbstractConnPool.getPoolEntryBlocking(AbstractConnPool.java:380)
at org.apache.http.pool.AbstractConnPool.access$200(AbstractConnPool.java:69)
at org.apache.http.pool.AbstractConnPool$2.get(AbstractConnPool.java:246)
- locked <0x00000007595108c0> (a org.apache.http.pool.AbstractConnPool$2)
at org.apache.http.pool.AbstractConnPool$2.get(AbstractConnPool.java:193)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:240)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:227)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:173)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:195)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:86)
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:108)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:186)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
at org.springframework.http.client.HttpComponentsClientHttpRequest.executeInternal(HttpComponentsClientHttpRequest.java:89)
at org.springframework.http.client.AbstractBufferingClientHttpRequest.executeInternal(AbstractBufferingClientHttpRequest.java:48)
at org.springframework.http.client.AbstractClientHttpRequest.execute(AbstractClientHttpRequest.java:53)
at org.springframework.web.client.RestTemplate.doExecute(RestTemplate.java:659)
at org.springframework.web.client.RestTemplate.execute(RestTemplate.java:620)
at org.springframework.web.client.RestTemplate.postForEntity(RestTemplate.java:414)
下略...
那么問(wèn)題的原因基本就已定位到了,一定是跟httpClient的連接池有關(guān),我們看到代碼里是這么定義restTemplate的:
@Bean
public RestTemplate restTemplate() {
HttpComponentsClientHttpRequestFactory httpFactory = new
HttpComponentsClientHttpRequestFactory();
requestFactory.setConnectTimeout(3000);
requestFactory.setReadTimeout(5000);
RestTemplate restTemplate = new RestTemplate(httpFactory);
restTemplate.getMessageConverters().add(0, new StringHttpMessageConverter(StandardCharsets.UTF_8));
return restTemplate;
}
其中HttpComponentsClientHttpRequestFactory類的作用是生成httpClient,通過(guò)查看源碼發(fā)現(xiàn),如果沒(méi)有顯示的指定,則會(huì)使用系統(tǒng)配置
if (systemProperties) {
String s = System.getProperty("http.keepAlive", "true");
if ("true".equalsIgnoreCase(s)) {
s = System.getProperty("http.maxConnections", "5");
final int max = Integer.parseInt(s);
poolingmgr.setDefaultMaxPerRoute(max);
poolingmgr.setMaxTotal(2 * max);
}
}
其中maxTotal表示連接池的總大小,而maxPerRoute表示每個(gè)路由的最大連接大小,也就是說(shuō)對(duì)于同一個(gè)域名的http請(qǐng)求,httpClient最多同時(shí)只會(huì)建立5個(gè)連接,怪不得有那么多線程處于等待狀態(tài)了。我們這里可以回過(guò)頭再次查看堆棧信息,確認(rèn)是否只有5個(gè)連接被建立了。
通過(guò)在所有的575個(gè)http線程中搜索關(guān)鍵堆棧信息org.apache.http.impl.io.SessionInputBufferImpl.streamRead,發(fā)現(xiàn)的確剛好只有5個(gè)RUNNABLE的線程正在建立scoket連接。

那么解決方案也就很清晰了,只需要將RestTemplate所使用的httpClient進(jìn)行合理的連接池配置,即可完成優(yōu)化:
@Bean
public RestTemplate restTemplate(HttpClient httpClient) {
HttpComponentsClientHttpRequestFactory requestFactory = new
HttpComponentsClientHttpRequestFactory();
requestFactory.setHttpClient(httpClient);
RestTemplate restTemplate = new RestTemplate(requestFactory);
restTemplate.getMessageConverters().add(0, new StringHttpMessageConverter(StandardCharsets.UTF_8));
return restTemplate;
}
@Bean
public HttpClient httpClient(PoolingHttpClientConnectionManager connectionManager){
RequestConfig requestConfig = RequestConfig.custom().setConnectTimeout(3000)
.setSocketTimeout(5000).setConnectionRequestTimeout(1000).build();
return HttpClientBuilder.create()
.setConnectionManager(connectionManager)
.setDefaultRequestConfig(requestConfig)
.build();
}
@Bean(destroyMethod = "close")
public PoolingHttpClientConnectionManager poolingHttpClientConnectionManager() {
PoolingHttpClientConnectionManager connectionManager = new PoolingHttpClientConnectionManager();
//連接池大小
connectionManager.setMaxTotal(1000);
//每個(gè)路由連接大小
connectionManager.setDefaultMaxPerRoute(200);
return connectionManager;
}
改完之后再壓測(cè)下看下效果,發(fā)現(xiàn)現(xiàn)在gateway幾乎所有的http線程都是RUNNABLE的狀態(tài)了:

如果點(diǎn)開(kāi)查看具體的堆棧信息,就會(huì)看到這些線程都是處于執(zhí)行socketRead的操作,現(xiàn)在壓力已經(jīng)到了下游的seviceA中:
http-nio-8080-exec-102
PRIORITY : 5
THREAD ID : 0X00007F7EE40E1000
NATIVE ID : 0X4B17
NATIVE ID (DECIMAL) : 19223
STATE : RUNNABLE
stackTrace:
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153)
at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:282)
下略...
但從壓測(cè)結(jié)果來(lái)看,該接口還是存在大量超時(shí),同時(shí)從日志來(lái)看,這個(gè)時(shí)候超時(shí)的已經(jīng)是ServiceA提供的接口了:

ServiceA->ServiceB和gateway->ServiceA唯一的區(qū)別只是使用的http客戶端不同,由于是Spring Cloud框架,所以ServiceA使用的是feign,那么我們也有理由懷疑是feign導(dǎo)致了請(qǐng)求的積壓。使用排查gateway一樣的方法,我們可以看到,ServiceA在高并發(fā)場(chǎng)景下大量的http線程處于WAITING狀態(tài),只有5個(gè)線程處于RUNNABLE,到這里答案以及呼之欲出了。

我們確認(rèn)下這些WAITING狀態(tài)的線程堆棧信息,果然也是正在等待從連接池中的獲取連接:
http-nio-8080-exec-294
PRIORITY : 5
THREAD ID : 0X00007F5F3B119800
NATIVE ID : 0X1C9B
NATIVE ID (DECIMAL) : 7323
STATE : WAITING
stackTrace:
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000007a8617d80> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at org.apache.http.pool.PoolEntryFuture.await(PoolEntryFuture.java:138)
at org.apache.http.pool.AbstractConnPool.getPoolEntryBlocking(AbstractConnPool.java:306)
at org.apache.http.pool.AbstractConnPool.access$000(AbstractConnPool.java:64)
at org.apache.http.pool.AbstractConnPool$2.getPoolEntry(AbstractConnPool.java:192)
at org.apache.http.pool.AbstractConnPool$2.getPoolEntry(AbstractConnPool.java:185)
at org.apache.http.pool.PoolEntryFuture.get(PoolEntryFuture.java:107)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:276)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:263)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:190)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
at feign.httpclient.ApacheHttpClient.execute(ApacheHttpClient.java:87)
at org.springframework.cloud.netflix.feign.ribbon.RetryableFeignLoadBalancer$1.doWithRetry(RetryableFeignLoadBalancer.java:92)
at org.springframework.cloud.netflix.feign.ribbon.RetryableFeignLoadBalancer$1.doWithRetry(RetryableFeignLoadBalancer.java:77)
at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:287)
at org.springframework.retry.support.RetryTemplate.execute(RetryTemplate.java:164)
at org.springframework.cloud.netflix.feign.ribbon.RetryableFeignLoadBalancer.execute(RetryableFeignLoadBalancer.java:77)
at org.springframework.cloud.netflix.feign.ribbon.RetryableFeignLoadBalancer.execute(RetryableFeignLoadBalancer.java:48)
at com.netflix.client.AbstractLoadBalancerAwareClient$1.call(AbstractLoadBalancerAwareClient.java:109)
at com.netflix.loadbalancer.reactive.LoadBalancerCommand$3$1.call(LoadBalancerCommand.java:303)
at com.netflix.loadbalancer.reactive.LoadBalancerCommand$3$1.call(LoadBalancerCommand.java:287)
...
中略
...
com.netflix.client.AbstractLoadBalancerAwareClient.executeWithLoadBalancer(AbstractLoadBalancerAwareClient.java:117)
at org.springframework.cloud.netflix.feign.ribbon.LoadBalancerFeignClient.execute(LoadBalancerFeignClient.java:63)
at feign.SynchronousMethodHandler.executeAndDecode(SynchronousMethodHandler.java:97)
at feign.SynchronousMethodHandler.invoke(SynchronousMethodHandler.java:76)
at feign.ReflectiveFeign$FeignInvocationHandler.invoke(ReflectiveFeign.java:103)
at com.sun.proxy.$Proxy199.xxx微服務(wù)接口名(Unknown Source)
下略...
那么解決方案也是和RestTemplate一致,將Feign所使用的httpClient的連接池進(jìn)行顯示的配置:
@Bean
public HttpClient httpClient(PoolingHttpClientConnectionManager connectionManager){
RequestConfig requestConfig = RequestConfig.custom().setConnectTimeout(3000)
.setSocketTimeout(5000).setConnectionRequestTimeout(1000).build();
return HttpClientBuilder.create()
.setConnectionManager(connectionManager)
.setDefaultRequestConfig(requestConfig)
.build();
}
@Bean(destroyMethod = "close")
public PoolingHttpClientConnectionManager poolingHttpClientConnectionManager() {
PoolingHttpClientConnectionManager connectionManager = new PoolingHttpClientConnectionManager();
//連接池大小
connectionManager.setMaxTotal(1000);
//每個(gè)路由連接大小
connectionManager.setDefaultMaxPerRoute(200);
return connectionManager;
}
至此為止,問(wèn)題全部解決,該接口在100線程的壓測(cè)下,從5~6QPS,增長(zhǎng)到了100QPS。
最后補(bǔ)充說(shuō)明下,RestTemplate和Feign作為http調(diào)用工具,如果什么都不設(shè)置實(shí)際上默認(rèn)使用的jdk自帶的UrlConnection作為http客戶端的,而這個(gè)UrlConnection是壓根沒(méi)有連接池的,不能支持高并發(fā)的場(chǎng)景。一般我們會(huì)選擇apache httpClient或者是okHttp來(lái)替換。本文中的應(yīng)用都是使用了apache httpClient的,對(duì)于使用了okHttp的應(yīng)用,在高并發(fā)下連接池的配置也是必須的。
另外,對(duì)于Feign來(lái)說(shuō),如果要使用apache httpClient,只需要引入相應(yīng)依賴就會(huì)自動(dòng)替換UrlConnection:
<dependency>
<groupId>io.github.openfeign</groupId>
<artifactId>feign-httpclient</artifactId>
<version>9.5.0</version>
</dependency>
如果要使用okHttp,除了引入相應(yīng)依賴外,還需將相關(guān)的配置項(xiàng)開(kāi)啟。
<dependency>
<groupId>io.github.openfeign</groupId>
<artifactId>feign-okhttp</artifactId>
<version>10.1.0</version>
</dependency>
feign:
httpclient:
enabled:false
okhttp:
enabled:true