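Some time ago we had an incident. The git data source that our config center service depends on became unreachable, and the health check timeout configured in the K8s deployment was short (had the timeout been set to 10s, this failure would not have been triggered), so the config center's health check failed. The gateway has a hard dependency on the config center by default, so the gateway's health check endpoint failed as well. From the load balancer's point of view the gateway was therefore unavailable too, and the whole service went down.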
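To make the service highly available, we will make the following 2 improvements:
- Remove the gateway's hard dependency on the config center
- Remove the config center's hard dependency on the git service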
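One way to picture "removing a hard dependency" is to keep serving the last configuration that loaded successfully whenever the upstream source is unreachable. The following is only a minimal sketch of that idea in plain Java, not how Spring Cloud Config is implemented; `FallbackConfigSource`, `FallbackDemo`, and the `timeout` key are hypothetical names made up for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical helper: wraps a remote config fetch and falls back to the
// last successfully loaded snapshot when the remote source is unreachable.
final class FallbackConfigSource {
    private final Supplier<Map<String, String>> remote; // e.g. a git-backed fetch
    private final AtomicReference<Map<String, String>> lastGood = new AtomicReference<>();

    FallbackConfigSource(Supplier<Map<String, String>> remote) {
        this.remote = remote;
    }

    // Returns fresh config when the remote call succeeds, otherwise the cached
    // snapshot. The result is empty only if the remote has never succeeded.
    Optional<Map<String, String>> load() {
        try {
            Map<String, String> fresh =
                    Collections.unmodifiableMap(new HashMap<>(remote.get()));
            lastGood.set(fresh);
            return Optional.of(fresh);
        } catch (RuntimeException e) {
            // Remote is down: degrade to the last known good config.
            return Optional.ofNullable(lastGood.get());
        }
    }
}

final class FallbackDemo {
    public static void main(String[] args) {
        boolean[] gitUp = {true}; // simulated availability of the git source
        FallbackConfigSource source = new FallbackConfigSource(() -> {
            if (!gitUp[0]) {
                throw new IllegalStateException("git unreachable");
            }
            return Collections.singletonMap("timeout", "10s");
        });

        System.out.println(source.load()); // loaded from the simulated remote
        gitUp[0] = false;
        System.out.println(source.load()); // served from the cached snapshot
    }
}
```

With this shape, an outage of the upstream source degrades the service to possibly stale configuration instead of failing its health check outright. Whether stale configuration is acceptable depends on the system.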
disable config client health indicator
https://github.com/spring-cloud/spring-cloud-config/issues/435
The Config Client supplies a Spring Boot Health Indicator that attempts to load configuration from Config Server. The health indicator can be disabled by setting health.config.enabled=false. The response is also cached for performance reasons. The default cache time to live is 5 minutes. To change that value set the health.config.time-to-live property (in milliseconds).
management.health.hystrix.enabled: false
health.config.enabled: false
Of the two settings above, the first follows the Spring Boot configuration convention (https://docs.spring.io/spring-boot/docs/current/reference/html/common-application-properties.html); the second is the Spring Cloud one.
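The quoted documentation mentions that the health response is cached with a time to live. As a rough illustration of that behavior only (this is not Spring's actual implementation), here is a small plain-Java sketch; `CachedHealthCheck`, its constructor parameters, and the injected clock are all hypothetical.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

// Hypothetical sketch: caches the result of an expensive health probe and
// re-runs the probe only after the time to live has elapsed.
final class CachedHealthCheck {
    private final BooleanSupplier probe; // the real check, e.g. a remote call
    private final long ttlMillis;        // how long a cached result stays valid
    private final LongSupplier clock;    // injected so the sketch is testable

    private boolean hasValue;
    private boolean lastResult;
    private long lastCheckedAt;

    CachedHealthCheck(BooleanSupplier probe, long ttlMillis, LongSupplier clock) {
        this.probe = probe;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    synchronized boolean isHealthy() {
        long now = clock.getAsLong();
        // Probe on the first call, or when the cached result has expired.
        if (!hasValue || now - lastCheckedAt >= ttlMillis) {
            lastResult = probe.getAsBoolean();
            lastCheckedAt = now;
            hasValue = true;
        }
        return lastResult;
    }
}

final class CachedHealthDemo {
    public static void main(String[] args) {
        // 300000 ms corresponds to the 5-minute default quoted above.
        CachedHealthCheck check =
                new CachedHealthCheck(() -> true, 300_000L, System::currentTimeMillis);
        System.out.println(check.isHealthy());
    }
}
```

One consequence of such caching is that a cached result can lag behind the real state of the dependency by up to the time to live.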
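Reflections
- We are not yet practiced enough at handling cascading failures
- Information was not communicated well enough during the incident
- The system contains a single point of failure in its design, which caused a system-wide problem
- Disaster drills need to be scheduled regularly; interested readers can look into Chaos Monkey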