2019.6
arXiv:https://arxiv.org/abs/1903.02740
github:https://github.com/Guzaiwang/CE-Net

Medical image segmentation is an important step in medical image analysis. With the rapid development of convolutional neural network in image processing, deep learning has been used for medical image segmentation, such as optic disc segmentation, blood vessel detection, lung segmentation, cell segmentation, etc. Previously, U-net based approaches have been proposed. However, the consecutive pooling and strided convolutional operations lead to the loss of some spatial information. In this paper, we propose a context encoder network (referred to as CE-Net) to capture more high-level information and preserve spatial information for 2D medical image segmentation. CENet mainly contains three major components: a feature encoder module, a context extractor and a feature decoder module. We use pretrained ResNet block as the fixed feature extractor. The context extractor module is formed by a newly proposed dense atrous convolution (DAC) block and residual multi-kernel pooling (RMP) block. We applied the proposed CE-Net to different 2D medical image segmentation tasks. Comprehensive results show that the proposed method outperforms the original U-Net method and other state-of-the-art methods for optic disc segmentation, vessel detection, lung segmentation, cell contour segmentation and retinal optical coherence tomography layer segmentation.
醫(yī)學(xué)圖像分割是醫(yī)學(xué)圖像分析的重要環(huán)節(jié)。隨著卷積神經(jīng)網(wǎng)絡(luò)在圖像處理中的迅速發(fā)展,深度學(xué)習(xí)已應(yīng)用于醫(yī)學(xué)圖像分割,如視盤分割、血管檢測、肺分割、細(xì)胞分割等。此前,已提出基于U-net的方法。然而,連續(xù)的池和大步卷積操作會(huì)導(dǎo)致一些空間信息的丟失。本文提出了一種上下文編碼器網(wǎng)絡(luò)(簡稱CE-Net),用于捕獲更高級(jí)的信息,并保存二維醫(yī)學(xué)圖像分割的空間信息。CENet 主要包含三個(gè)主要組件:功能編碼器模塊、上下文提取器和功能解碼器模塊。我們使用預(yù)先訓(xùn)練的 ResNet 塊作為固定功能提取器。上下文提取器模塊由新提出的密集卷積 (DAC) 塊和殘余多內(nèi)核池 (RMP) 塊組成。我們將建議的CE-Net應(yīng)用于不同的2D醫(yī)學(xué)圖像分割任務(wù)。綜合結(jié)果表明,該方法優(yōu)于原有的U-Net方法和其他最先進(jìn)的光學(xué)視盤分割、血管檢測、肺分割、細(xì)胞輪廓分割和視網(wǎng)膜光學(xué)相干的方法。斷層掃描層分割。
INTRODUCTION
Medical image segmentation is often an important step in medical image analysis, such as optic disc segmentation [1], [2], [3] and blood vessel detection [4], [5], [6], [7], [8] in retinal images, cell segmentation [9], [10], [11] in electron microscopic (EM) recordings, and lung segmentation [12], [13], [14], [15], [16] and brain segmentation [17], [18], [19], [20], [21], [22] in computed tomography (CT) and magnetic resonance imaging (MRI). Previous approaches to medical image segmentation are often based on edge detection and template matching [15]. For example, circular or elliptical Hough transforms are used in optic disc segmentation [23], [3]. Template matching is also used for spleen segmentation in MRI sequence images [24] and ventricular segmentation in brain CT images [22].
Deformable models have also been proposed for medical image segmentation. A shape-based method using level sets [25] has been proposed for two-dimensional segmentation of cardiac MRI images and three-dimensional segmentation of prostate MRI images. In addition, a level set-based deformable model is adopted for kidney segmentation from abdominal CT images [26]. The deformable model has also been integrated with Gibbs prior models to segment the boundaries of organs [27], and with an evolutionary algorithm and a statistical shape model to segment the liver from CT volumes [16]. In optic disc segmentation, different deformable models have also been proposed and adopted, such as mathematical morphology, the global elliptical model, the local deformable model [28], and the modified active shape model [29].
Learning based approaches have been proposed to segment medical images as well. Aganj et al. [30] proposed a local center-of-mass based method for unsupervised learning based image segmentation in X-ray and MRI images. Kanimozhi et al. [31] applied the stationary wavelet transform to obtain feature vectors, and adopted a self-organizing map to handle these feature vectors for unsupervised MRI image segmentation. Tong et al. [32] combined dictionary learning and sparse coding to segment multiple organs in abdominal CT images. Pixel classification based approaches [33], [1] are also learning based approaches, which train classifiers on pixels using pre-annotated data. However, it is not easy to select the pixels and extract features to train the classifier from the large number of pixels. Cheng et al. [1] used a superpixel strategy to reduce the number of pixels and performed optic disc and cup segmentation using superpixel classification. Tian et al. [34] adopted a superpixel-based graph cut method to segment 3D prostate MRI images. In [35], a superpixel learning based method is integrated with restricted regions of shape constraints to segment the lung from CT images.
The drawback of these methods lies in the use of hand-crafted features to obtain the segmentation results. On the one hand, it is difficult to design representative features for different applications. On the other hand, features designed to work well for one type of image often fail on another type. Therefore, there is a lack of a general approach to feature extraction.
With the development of convolutional neural network (CNN) in image and video processing [36] and medical image analysis [37], [38], automatic feature learning algorithms using deep learning have emerged as feasible approaches for medical image segmentation. Deep learning based segmentation methods are pixel-classification based learning approaches. Different from traditional pixel or superpixel classification approaches which often use hand-crafted features, deep learning approaches learn the features and overcome the limitation of hand-crafted features.
Earlier deep learning approaches for medical image segmentation are mostly based on image patches. Ciresan et al. [39] proposed to segment neuronal membranes in microscopy images based on patches and a sliding window strategy. Then, Kamnitsas et al. [40] employed a multi-scale 3D CNN architecture with a fully connected conditional random field (CRF) for boosting patch based brain lesion segmentation. Obviously, this solution introduces two main drawbacks: redundant computation caused by the sliding window, and the inability to learn global features.
With the emergence of the end-to-end fully convolutional network (FCN) [41], Ronneberger et al. [10] proposed the U-shape Net (U-Net) framework for biomedical image segmentation. U-Net has shown promising results on neuronal structure segmentation in electron microscopic recordings and cell segmentation in light microscopic images. It has become a popular neural network architecture for biomedical image segmentation tasks [42], [43], [44], [45]. Sevastopolsky et al. [43] applied U-Net to directly segment the optic disc and optic cup in retinal fundus images for glaucoma diagnosis. Roy et al. [44] used a similar network for retinal layer segmentation in optical coherence tomography (OCT) images. Norman et al. [42] used U-Net to segment cartilage and meniscus from knee MRI data. U-Net is also applied to directly segment the lung from CT images [45].
Many variations of U-Net have been made for different medical image segmentation tasks. Fu et al. [4] adopted a CRF to gather the multi-stage feature maps to boost vessel detection performance. Later, a modified U-Net framework (called M-Net) [2] was proposed for joint optic disc and cup segmentation by adding multi-scale inputs and deep supervision into the U-Net architecture. Deep supervision mainly introduces extra loss functions associated with the middle-stage features. Based on deep supervision, Chen et al. [46] proposed VoxResNet to segment volumetric brain images, and Dou et al. [47] proposed a 3D deeply supervised network (3D DSN) to automatically segment the liver in CT volumes.
[46] H. Chen, Q. Dou, L. Yu, and P.-A. Heng, "VoxResNet: Deep voxelwise residual networks for volumetric brain segmentation," arXiv preprint arXiv:1608.05895, 2016.
[47] Q. Dou, H. Chen, Y. Jin, L. Yu, J. Qin, and P.-A. Heng, "3D deeply supervised network for automatic liver segmentation from CT volumes," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2016, pp. 149–157.
To enhance the feature learning ability of U-Net, some new modules have been proposed to replace the original blocks. Stefanos et al. [48] proposed a branch residual U-network (BRU-net) to segment pathological OCT retinal layers for age-related macular degeneration diagnosis. BRU-net relies on residual connections and dilated convolutions to enhance the final OCT retinal layer segmentation. Gibson et al. [49] introduced dense connections in each encoder block to automatically segment multiple organs on abdominal CT. Kumar et al. [21] proposed InfiNet for infant brain MRI segmentation. Besides the above achievements in U-Net based medical image segmentation, some researchers have also made progress in modifying U-Net for general image segmentation. Peng et al. [50] proposed a novel global convolutional network to improve semantic segmentation. Lin et al. [51] proposed a multi-path refinement network, which contains residual convolution units, multi-resolution fusion and chained residual pooling. Zhao et al. [52] adopted spatial pyramid pooling to gather the extracted feature maps to improve semantic segmentation performance.
A common limitation of U-Net and its variations is that the consecutive pooling operations or convolution striding reduce the feature resolution to learn increasingly abstract feature representations. Although this invariance is beneficial for classification or object detection tasks, it often impedes dense prediction tasks which require detailed spatial information. Intuitively, maintaining high-resolution feature maps at the middle stages can boost segmentation performance. However, it increases the size of the feature maps, which is not optimal for accelerating training or easing the difficulty of optimization. Therefore, there is a trade-off between accelerating the training and maintaining high resolution. Generally, the U-Net structures can be considered as an Encoder-Decoder architecture. The Encoder aims to gradually reduce the spatial dimension of the feature maps and capture more high-level semantic features. The Decoder aims to recover the object details and spatial dimension. Therefore, it is natural to capture more high-level features in the encoder and preserve more spatial information in the decoder to improve the performance of image segmentation.
Motivated by the above discussions and also by the Inception-ResNet structures [53], [54], which make the neural network wider and deeper, we propose a novel dense atrous convolution (DAC) block employing atrous convolution. The original U-Net architecture captures multi-scale features in a limited scaling range by adopting consecutive 3×3 convolution and pooling operations in the encoding path. Our proposed DAC block can capture wider and deeper semantic features by infusing four cascade branches with multi-scale atrous convolutions. In this module, the residual connection is utilized to prevent gradient vanishing. In addition, we also propose a residual multi-kernel pooling (RMP) motivated by spatial pyramid pooling [55]. The RMP block further encodes the multi-scale context features of the object extracted from the DAC module by employing pooling operations of various sizes, without extra learning weights. In summary, the DAC block is proposed to extract enriched feature representations with multi-scale atrous convolutions, followed by the RMP block for further context information with multi-scale pooling operations. Integrating the newly proposed DAC block and RMP block with the backbone encoder-decoder structure, we propose a novel context encoder network named CE-Net. It relies on the DAC block and the RMP block to get more abstract features and preserve more spatial information to boost the performance of medical image segmentation.
The main contributions of this work are summarized as follows:
1) We propose a DAC block and RMP block to capture more high-level features and preserve more spatial information.
2) We integrate the proposed DAC block and RMP block with encoder-decoder structure for medical image segmentation.
3) We apply the proposed method in different tasks including optic disc segmentation, retinal vessel detection, lung segmentation, cell contour segmentation and retinal OCT layer segmentation. Results show that the proposed method outperforms the state-of-the-art methods in these different tasks.
The remainder of this paper is organized as follows. Section II introduces the proposed method in details. Section III presents the experimental results and discussions. In Section IV, we draw some conclusions.
METHOD

The proposed CE-Net consists of three major parts: the feature encoder module, the context extractor module, and the feature decoder module, as shown in Fig. 1.
A. Feature Encoder Module
In the U-Net architecture, each encoder block contains two convolution layers and one max pooling layer. In the proposed method, we replace it with the pretrained ResNet-34 [53] in the feature encoder module, retaining the first four feature extraction blocks and discarding the average pooling layer and the fully connected layers. Compared with the original block, ResNet adds a shortcut mechanism to avoid gradient vanishing and accelerate network convergence, as shown in Fig. 1(b). For convenience, we use the modified U-Net with pretrained ResNet as the backbone approach.
B. Context Extractor Module
The context extractor module is a newly proposed module, consisting of the DAC block and the RMP block. This module extracts context semantic information and generates more high-level feature maps.

1) Atrous convolution: In semantic segmentation tasks and object detection tasks, deep convolutional layers have been shown to be effective in extracting feature representations from images. However, the pooling layers lead to the loss of semantic information in images. In order to overcome this limitation, atrous convolution is adopted for dense segmentation [56].

The atrous convolution was originally proposed for the efficient computation of the wavelet transform. Mathematically, the atrous convolution over two-dimensional signals is computed as follows:
y[i] = Σ_k x[i + r·k] w[k]        (1)
where the convolution of the input feature map x and a filter w yields the output y, and the atrous rate r corresponds to the stride with which we sample the input signal. It is equivalent to convolving the input x with upsampled filters produced by inserting r − 1 zeros between two consecutive filter values along each spatial dimension (hence the name atrous convolution, from the French "à trous", meaning "with holes"). Standard convolution is the special case of rate r = 1, and atrous convolution allows us to adaptively modify the filter's field-of-view by changing the rate value. See Fig. 2 for an illustration.
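As a sanity check on the definition above, the following NumPy sketch (a 1-D analogue; `atrous_conv1d` and `dilate_filter` are illustrative helpers, not from the paper) shows that atrous convolution at rate r equals ordinary convolution with a filter upsampled by inserting r − 1 zeros between taps:

```python
import numpy as np

def atrous_conv1d(x, w, r):
    """Atrous convolution of a 1-D signal: y[i] = sum_k x[i + r*k] * w[k]."""
    n, m = len(x), len(w)
    span = r * (m - 1) + 1  # effective extent of the dilated filter
    return np.array([sum(x[i + r * k] * w[k] for k in range(m))
                     for i in range(n - span + 1)])

def dilate_filter(w, r):
    """Upsample a filter by inserting r - 1 zeros between consecutive taps."""
    up = np.zeros(r * (len(w) - 1) + 1)
    up[::r] = w
    return up

x = np.arange(10, dtype=float)
w = np.array([1.0, 2.0, 1.0])

y = atrous_conv1d(x, w, 2)                                 # rate r = 2
y_ref = np.correlate(x, dilate_filter(w, 2), mode="valid")  # zero-inserted filter
print(np.allclose(y, y_ref))  # True: the two views are equivalent
```

Setting r = 1 recovers standard convolution, as stated above.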
2) Dense Atrous Convolution module: Inception [54] and ResNet [53] are two classical and representative architectures in deep learning. Inception-series structures adopt different receptive fields to widen the architecture. In contrast, ResNet employs a shortcut connection mechanism to avoid exploding and vanishing gradients, allowing neural networks to reach thousands of layers for the first time. The Inception-ResNet [54] block, which combines Inception and ResNet, inherits the advantages of both approaches and has become a baseline approach in the field of deep CNNs.
Motivated by the Inception-ResNet-V2 block and atrous convolution, we propose the dense atrous convolution (DAC) block to encode high-level semantic feature maps. As shown in Fig. 3, the atrous convolutions are stacked in cascade mode. The DAC block has four cascade branches with a gradual increment of the number of atrous convolutions, with rates from 1 to 1, 3, and 5, so that the receptive fields of the branches are 3, 7, 9, and 19, respectively. It thus employs different receptive fields, similar to Inception structures. In each atrous branch, we apply one 1×1 convolution for rectified linear activation. Finally, we directly add the original features to the other features, like the shortcut mechanism in ResNet. Since the proposed block resembles a densely connected block, we name it the dense atrous convolution block. Very often, a convolution with a large receptive field can extract and generate more abstract features for large objects, while a convolution with a small receptive field is better for small objects. By combining atrous convolutions with different atrous rates, the DAC block is able to extract features for objects of various sizes.
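The quoted receptive fields can be checked with a short script. The branch configurations below are an assumption consistent with the numbers above: stacks of 3×3 convolutions with dilation rates (1), (3), (1, 3) and (1, 3, 5):

```python
def receptive_field(dilations, kernel=3):
    """Receptive field of stacked stride-1 dilated convolutions:
    each k x k conv with dilation d adds (k - 1) * d to the field."""
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf

# Hypothetical branch configurations matching the receptive fields in the text.
branches = [[1], [3], [1, 3], [1, 3, 5]]
print([receptive_field(b) for b in branches])  # -> [3, 7, 9, 19]
```

The trailing 1×1 convolutions in each branch do not change the receptive field, so they are omitted from the calculation.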
3) Residual Multi-kernel pooling: A challenge in segmentation is the large variation of object size in medical images. For example, a tumor in its middle or late stage can be much larger than one in its early stage. In this paper, we propose residual multi-kernel pooling to address this problem, relying mainly on multiple effective fields-of-view to detect objects of different sizes.
The size of the receptive field roughly determines how much context information we can use. The general max pooling operation employs just a single pooling kernel, such as 2×2. As illustrated in Fig. 4, the proposed RMP encodes global context information with four receptive fields of different sizes: 2×2, 3×3, 5×5 and 6×6. The four-level outputs contain feature maps of various sizes. To reduce the dimension of the weights and the computational cost, we use a 1×1 convolution after each level of pooling. It reduces the dimension of the feature maps to 1/N of the original dimension, where N represents the number of channels in the original feature maps. Then we upsample each low-dimension feature map to the same size as the original feature map via bilinear interpolation. Finally, we concatenate the original features with the upsampled feature maps.
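The RMP data flow can be sketched minimally in NumPy on a toy 4×30×30 feature map. Nearest-neighbour upsampling stands in for the bilinear interpolation, the 1×1 convolutions use random weights, and all names here are illustrative rather than the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def max_pool(x, k):
    """Non-overlapping k x k max pooling on a (C, H, W) map; H, W divisible by k."""
    c, h, w = x.shape
    return x.reshape(c, h // k, k, w // k, k).max(axis=(2, 4))

def conv1x1(x, weights):
    """1x1 convolution collapsing the C channels to a single channel."""
    return np.tensordot(weights, x, axes=([0], [0]))[None]

def upsample_nn(x, factor):
    """Nearest-neighbour upsampling (bilinear in the paper; simplified here)."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

feat = rng.standard_normal((4, 30, 30))  # toy feature map from the DAC block
branches = [feat]
for k in (2, 3, 5, 6):                   # the four pooling kernels of RMP
    pooled = max_pool(feat, k)                          # (4, 30//k, 30//k)
    reduced = conv1x1(pooled, rng.standard_normal(4))   # (1, 30//k, 30//k)
    branches.append(upsample_nn(reduced, k))            # back to (1, 30, 30)

out = np.concatenate(branches, axis=0)   # original channels + one per branch
print(out.shape)  # (8, 30, 30)
```

The concatenation keeps the original features intact (the "residual" aspect) while appending one context channel per pooling scale.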

C. Feature Decoder Module
The feature decoder module is adopted to restore the high-level semantic features extracted by the feature encoder module and the context extractor module. The skip connection passes some detailed information from the encoder to the decoder to remedy the information loss due to consecutive pooling and strided convolution operations. Similar to [48], we adopted an efficient block to enhance the decoding performance. Simple upscaling and deconvolution are two common operations for the decoder in U-shape networks. The upscaling operation increases the image size with linear interpolation, while deconvolution (also called transposed convolution) employs a convolution operation to enlarge the image. Intuitively, the transposed convolution can learn a self-adaptive mapping to restore features with more detailed information. Therefore, we choose to use the transposed convolution to restore the higher resolution features in the decoder. As illustrated in Fig. 1(c), it mainly includes a 1×1 convolution, a 3×3 transposed convolution and a 1×1 convolution, consecutively. Based on the skip connections and the decoder blocks, the feature decoder module outputs a mask the same size as the original input.
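A 1-D analogue may help illustrate why transposed convolution restores resolution: each input value is spread over the output through a learnable filter, moving several output steps per input. This is a sketch, not the paper's implementation; the filter values are arbitrary placeholders for learned weights:

```python
import numpy as np

def transposed_conv1d(x, w, stride=2):
    """1-D transposed convolution: each input value is distributed over the
    output through the filter w, advancing `stride` output steps per input."""
    m = len(w)
    y = np.zeros(stride * (len(x) - 1) + m)
    for i, v in enumerate(x):
        y[i * stride : i * stride + m] += v * w
    return y

x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, 1.0, 0.5])  # in CE-Net the 3x3 filters are learned
y = transposed_conv1d(x, w)
print(len(x), "->", len(y))    # 3 -> 7: the resolution is roughly doubled
```

Unlike fixed linear interpolation, the filter w is learned, which is the "self-adaptive mapping" mentioned above.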
D. Loss Function
Our framework is an end-to-end deep learning system. As illustrated in Fig. 1, we need to train the proposed method to predict each pixel to be foreground or background, which is a pixel-wise classification problem. The most common loss function is the cross entropy loss function.
However, objects in medical images, such as the optic disc and retinal vessels, often occupy only a small region of the image. The cross entropy loss is not optimal for such tasks. In this paper, we use the Dice coefficient loss function [57], [58] to replace the common cross entropy loss. Comparison experiments and discussions are also presented in the following section. The Dice coefficient is a measure of overlap widely used to assess segmentation performance when the ground truth is available, as in Equation (2):
D = Σ_{k=1}^{K} w_k · ( 2 Σ_{i=1}^{N} p(k,i) g(k,i) ) / ( Σ_{i=1}^{N} p(k,i)² + Σ_{i=1}^{N} g(k,i)² )        (2)
where N is the pixel number, p(k,i) ∈ [0, 1] and g(k,i) ∈ {0, 1} denote the predicted probability and the ground truth label for class k, respectively. K is the class number, and w_k with Σ_k w_k = 1 are the class weights. In our paper, we set w_k = 1/K empirically. The final loss function is defined as:
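Under the stated assumptions (class weights w_k = 1/K and the squared-denominator form of the Dice overlap from [57]), the Dice loss can be sketched in NumPy; `dice_loss` is an illustrative name, not the paper's code:

```python
import numpy as np

def dice_loss(p, g, eps=1e-7):
    """Class-weighted Dice loss; p, g have shape (K, N): K classes, N pixels.
    Uses w_k = 1/K and the squared-denominator Dice overlap of [57]."""
    k = p.shape[0]
    inter = (p * g).sum(axis=1)
    denom = (p ** 2).sum(axis=1) + (g ** 2).sum(axis=1)
    dice = (2.0 * inter + eps) / (denom + eps)  # per-class overlap in [0, 1]
    return 1.0 - (dice / k).sum()               # 1 - weighted Dice coefficient

g = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
print(round(dice_loss(g, g), 6))      # 0.0: perfect prediction, zero loss
print(dice_loss(1.0 - g, g) > 0.99)   # True: disjoint prediction, loss near 1
```

Because the loss is driven by overlap rather than per-pixel counts, small foreground objects are not dominated by the background class, which motivates replacing cross entropy here.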
L = 1 − D + L_reg        (3)
where L_reg represents the regularization loss (also called weight decay) [59], used to avoid overfitting.
To evaluate the performance of CE-Net, we apply the proposed method to five different medical image segmentation tasks: optic disc segmentation, retinal vessel detection, lung segmentation, cell contour segmentation and retinal OCT layer segmentation.