      Axial Assembled Correspondence Network for Few-Shot Semantic Segmentation

IEEE/CAA Journal of Automatica Sinica, 2023, Issue 3

Yu Liu, Bin Jiang, and Jiaming Xu

Abstract—Few-shot semantic segmentation aims at training a model that can segment novel classes in a query image with only a few densely annotated support exemplars. It remains a challenge because of large intra-class variations between the support and query images. Existing approaches utilize 4D convolutions to mine semantic correspondence between the support and query images. However, they still suffer from heavy computation, sparse correspondence, and large memory. We propose the axial assembled correspondence network (AACNet) to alleviate these issues. The key point of AACNet is the proposed axial assembled 4D kernel, which constructs the basic block for the semantic correspondence encoder (SCE). Furthermore, we propose deblurring equations to provide more robust correspondences for the aforementioned SCE and design a novel fusion module to mix correspondences in a learnable manner. Experiments on PASCAL-5i reveal that our AACNet achieves a mean intersection-over-union score of 65.9% for 1-shot segmentation and 70.6% for 5-shot segmentation, surpassing the state-of-the-art method by 5.8% and 5.0% respectively.

I. INTRODUCTION

RECENT years have witnessed increasingly significant breakthroughs in many traditional vision tasks such as object detection [1]–[3] and semantic segmentation [4]–[6] due to the development of deep convolutional neural networks (DCNNs) [7], [8]. However, these frameworks are restricted by the availability of sufficient annotated samples [9], which require substantial effort, time, and even practical experience to annotate, particularly for dense prediction tasks, e.g., semantic segmentation. In comparison, humans perform well at recognizing a novel concept in an image after seeing only a few exemplars.

Motivated by this large gap between human and computer vision, few-shot learning focuses on how to make machine learning algorithms learn a new unseen class quickly from only several exemplars, or even a single densely labeled one. By now, a surge of works [10]–[14] suggests that the main difference may stem from whether the support-to-query correspondences established by machines are reliable enough to relate different instances of the same class, compared with those captured by humans.

In this paper, we propose the axial assembled correspondence network (AACNet) to address this issue for few-shot semantic segmentation. As done in previous works [10], [13]–[15], we focus on middle-layer features, which prove to be notably effective for accurate correspondence capture. Our network includes a correspondence encoder, in which a weight-shared feature extractor generates feature maps to cast 4D correspondence representations for the subsequent encoding. Because the separability hypotheses in [16], [17] factorize a 4D kernel into separable 2D components at the cost of sufficient information communication, neither the separable 4D kernel [16] nor the center-pivot 4D kernel [17] is implemented in our correspondence encoder. Instead, we provide an alternative, i.e., axial assembly, to guarantee sufficient information communication. The key point of axial assembly lies in an appropriate weight-sparsification. Following this idea, we design the axial assembled 4D kernel (AA-Conv4d), which consists of a 3D axial kernel and a 1D assembling kernel. Compared with existing 4D kernels [16], [17], AA-Conv4d focuses on maintaining a balance between weight-sparsification and information communication. We elaborate on its rationality in Section IV-C.

Furthermore, in comparison with directly building the semantic correspondences from masked support and query features [17], we find it a better choice to modify the statistical distribution of these generated semantic correspondences by our deblurring equations (DEs). Our DEs consist of a normalization function and a sigmoid stretch function. Owing to a fairer normalization and fewer ambiguous similarity scores, they effectively improve segmentation performance.

In addition, to mix pyramid correspondences effectively, we develop an effective fusion module (FM) based on our proposed 4D kernel. It adopts a top-down form to propagate information associated with the target area across the different levels of correspondence tensors. Based on the proposed DEs, AA-Conv4d, and FM, we build a new network: the axial assembled correspondence network (AACNet). We prove its effectiveness in 1-shot and 5-shot settings with comprehensive experiments on both PASCAL-5i [18] and COCO-20i [19].

Compared with previous works, the main contributions of this paper are summarized as follows:

1) We develop a novel 4D kernel (AA-Conv4d). It conducts an appropriate weight-sparsification while keeping sufficient communication between the support and query subspaces.

      2) We propose a simple but effective preprocessing module to modify the statistical distribution of the semantic correspondences, which can effectively improve segmentation performance.

      3) By mixing pyramid correspondences with a learnable concatenation operation, our FM helps adaptively refine the squeezed correspondences for query segmentation.

4) This work achieves mean intersection-over-union scores of 65.9% and 70.6% on PASCAL-5i for the 1-shot and 5-shot settings respectively, outperforming state-of-the-art results by 5.8% and 5.0% respectively.

The rest of this paper is organized as follows. In Section II, we introduce related works on few-shot semantic segmentation. In Section III, we briefly describe the few-shot semantic segmentation task. The proposed method, including the DEs, AA-Conv4d, and FM, is explained in Section IV. We report the experimental settings, comparison results, ablation studies, and model efficiency in Section V. A summary of this paper is given in Section VI.

II. RELATED WORK

A. Semantic Segmentation

Semantic segmentation aims at dense pixel labeling. The success of the fully convolutional network (FCN) [5] drove the removal of fully-connected layers in semantic segmentation. Owing to its good performance on dense segmentation, most later approaches [4], [20]–[23] follow and develop the FCN structure. PSPNet [20] introduces a pyramid pooling module to assemble long-range contextual cues, which proves very effective. Reference [4] utilizes atrous convolution to both maintain high feature resolution and enlarge the receptive field. In addition, the effectiveness of the encoder-decoder architecture is widely demonstrated in [5], [21]. In this structure, the encoder compresses the hierarchical contextual cues contained in low-dimensional, high-resolution images into high-dimensional but low-resolution feature maps, and the decoder then decompresses these feature maps to produce the segmentation. Although these methods succeed at semantic segmentation, they still face the challenge of reduced generalizability when labeled training samples are insufficient.

B. Few-shot Learning

Few-shot learning (FSL) aims to learn generic classifiers when very scarce annotated exemplars are available. Most current state-of-the-art approaches for FSL concentrate on metric-learning [24], [25] and meta-learning [26], [27]. Recent works [28]–[31] on few-shot segmentation are mostly inspired by the prototypical network [25]. The prototypical network learns to classify input data in a learnable embedding space, where the Euclidean distance measures the similarities between the embeddings of the query image and the representative embeddings extracted from all support images, i.e., class prototypes. In this way, much of the computational budget is reduced without compromising classification performance. For instance, PANet [29] proposes prototype alignment regularization to align the prototypes extracted from the supports and queries bidirectionally. In SG-One [28], cosine similarity between the individual prototype generated from supports and all pixels in the query features is leveraged to generate a semantic guidance map for prediction.

However, since the spatial structure of features is destroyed by global average pooling [32] or masked average pooling [28]–[30], pairwise correspondence is instead built via the graph attention mechanism in [33], [34]. Note that the crux of both prototypical and graph-based methods is whether reliable correspondences across semantically similar images can be generated for accurate query segmentation. In this work, we also follow this paradigm to ensure more reliable correspondence learning.

C. Semantic Correspondence Learning

Witnessing the limitations of hand-engineered descriptors, e.g., the failure of SIFT [35] or HOG [36] descriptors to discover complex semantic patterns with vast intra-class variations, convolutional neural networks have been employed as alternatives for semantic correspondence learning. Long et al. [14] confirm the effectiveness of convolutional features pre-trained on classification for accurate correspondence generation. SCNet [12] introduces region proposals as matching primitives to learn a CNN end-to-end with a sparse geometry kernel. Kim et al. [37] introduce a CNN-based descriptor, called fully convolutional self-similarity, which is formulated from local self-similarity for dense semantic correspondence. Furthermore, since recent methods [14], [15] demonstrate that semantic correspondences built upon middle-level features yield better matching performance, we leverage dense intermediate features [17], [33] to generate correspondences and then process them with a widely adopted technique [10], [13] called 4D convolution.

III. TASK DESCRIPTION

Fig. 1. Our AACNet for 1-shot semantic segmentation. The green arrows indicate the data stream of the support image $I_s$, while the red arrows indicate the data stream of the query image $I_q$. The semantic correspondence module (SCM) is composed of semantic correspondence generation (SCG) and the semantic correspondence encoder (SCE). For convenience, we here flatten $C \in \mathbb{R}^{h_q \times w_q \times h_s \times w_s}$ to $\mathbb{R}^{h_q w_q \times h_s w_s}$.

IV. METHOD

A. Overview

As illustrated in Fig. 1, our framework is composed of a semantic correspondence module (SCM) and a decoder. The process for 1-shot segmentation can be divided into several steps as follows:

1) Semantic Correspondence Generation: The SCM consists of a backbone network to extract features, semantic correspondence generation (SCG) to post-process these features via the proposed DEs, and a semantic correspondence encoder (SCE) to learn the generated semantic correspondences. Both support and query images are first fed into the backbone network and then post-processed by SCG to obtain deblurred semantic correspondences for the SCE (a minimal sketch of this generation step is given after these steps).

2) Semantic Correspondence Learning: In the SCE, the first 2D subspace $(H_q, W_q)$ and the last 2D subspace $(H_s, W_s)$ of the correspondences represent the spatial information of the query and support images respectively, so all correspondence tensors are learned by squeezing the $(H_s, W_s)$ subspace while maintaining the $(H_q, W_q)$ subspace via the proposed AA-Conv4d.

3) Semantic Correspondence Decoding: We implement a decoder that discriminates pixel-wise labels over the outputs of the SCM. Note that the decoder here is constructed from 2D convolutions.
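To make step 1 concrete, below is a minimal sketch of dense correspondence generation via cosine similarity between intermediate features, in the style of correlation-based methods [13], [17]. The function name and tensor layout are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn.functional as F

def build_correspondence(feat_q, feat_s):
    """Sketch: dense cosine-similarity correspondence between query and
    support features (layout assumed; not the paper's exact code).
    feat_q: (B, C, Hq, Wq) query features.
    feat_s: (B, C, Hs, Ws) support features (e.g., masked by the support mask).
    Returns a 4D correspondence tensor of shape (B, Hq, Wq, Hs, Ws)."""
    b, c, hq, wq = feat_q.shape
    hs, ws = feat_s.shape[-2:]
    q = F.normalize(feat_q.reshape(b, c, hq * wq), dim=1)  # unit-norm channels
    s = F.normalize(feat_s.reshape(b, c, hs * ws), dim=1)
    corr = torch.einsum('bcq,bcs->bqs', q, s)              # cosine similarities
    return corr.reshape(b, hq, wq, hs, ws)
```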

B. Semantic Correspondence Generation

Most conventional semantic correspondence learning approaches [10], [13]–[17] focus on the pairwise similarity between the supports and queries. We propose the deblurring equations to provide better statistical properties for the generated correspondences.

For each entry $c_{ij} \in C_j$, where $C_j$ is built from the $j$-th support-query pair in a batch of samples, we set irrelevant matching scores to zero and normalize each score in a batch of correspondences as

$$\bar{c}_{ij} = \frac{\mathrm{ReLU}(c_{ij})}{m_j}, \qquad m_j = \max_i \mathrm{ReLU}(c_{ij})$$

where $i \in \{1 \times 1, \ldots, h'w' \times h'w'\}$ indexes the locations in the correspondence $C$, and $j$ ranges from 1 to the batch size. Furthermore, we utilize the coefficient of variation of $m_j$, i.e., $cv(m_j)$, to control a fairer batch normalization, considering that for $C_1$ with $m_1 = 0.9$ and $C_2$ with $m_2 = 0.5$, the similarity scores should not be mapped to the same interval [0, 1).

Fig. 2. Visual illustration of axial assembly. The three regions with different colors are 3D axial kernels (right), which are then merged along the vertical axis to simulate a full 4D kernel (left). The kernel size used here is 3×3×3×3.

Then, the flattened deblurring semantic correspondence $C_{fd}$ is produced by using a sigmoid stretch function to modify the statistical distribution of all similarity scores in each correspondence tensor

$$c_{fd} = \frac{1}{1 + e^{-\alpha(\bar{c}_{ij} - \beta)}} \qquad (6)$$

In (6), $\alpha$ is employed to control the number of matching scores $c_{fd}$ around 0.5, and $\beta$ is the threshold that leads to ambiguous inference. In general, a matching score much less than 0.5 is deemed weak similarity, and a matching score much greater than 0.5 is regarded as strong similarity. However, for matching scores close to 0.5, it is difficult to determine whether they represent strong or weak similarity. With (6), we can locate the ambiguous similarities by $\beta$ and control the deblurring level by $\alpha$. In our subsequent experiments, we find this quite helpful for improving prediction performance.
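The following is a minimal sketch of the two deblurring equations, assuming the normalization divides by the per-correspondence maximum $m_j$ as reconstructed above and that the stretch is the sigmoid $\sigma(\alpha(\bar{c} - \beta))$ of (6); the batch-wise $cv(m_j)$ correction is omitted because its exact form is not reproduced here.

```python
import torch

def deblur(corr, alpha=10.0, beta=0.5, eps=1e-8):
    """Sketch of the DEs (assumptions noted in the text above).
    corr: (B, Hq*Wq, Hs*Ws) raw similarity scores for a batch of pairs."""
    corr = torch.relu(corr)                         # zero out irrelevant (negative) scores
    m = corr.flatten(1).amax(dim=1).view(-1, 1, 1)  # per-correspondence maximum m_j
    c_bar = corr / (m + eps)                        # normalize each C_j into [0, 1)
    # sigmoid stretch (6): alpha sets the deblurring level,
    # beta locates the ambiguous-similarity threshold (e.g., 0.5 or MOC)
    return torch.sigmoid(alpha * (c_bar - beta))
```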

C. Axial Assembled 4D Convolution

      In this section, we revisit existing 4D convolutional implementation schemes and then introduce our AA-Conv4d for comparison.

1) Previous 4D Convolution Kernels: Full 4D convolution [13] is formulated as

$$(C * k)(x, x') = \sum_{(p,\,p') \in \mathcal{P}(x) \times \mathcal{P}(x')} C(p, p')\, k(p - x, p' - x')$$

where $C$ is a 4D correspondence tensor, $k$ is a 4D kernel, and $\mathcal{P}(x)$ denotes the set of neighborhood positions centered on the 2D coordinate $x$.
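Since the exact AA-Conv4d formulation is not reproduced above, the following is a hedged sketch of one way to pair a 3D axial kernel with a 1D assembling kernel over a batched 4D correspondence tensor, in the spirit of Fig. 2; the choice of the assembling axis and the module layout are assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class AxialAssembled4dConv(nn.Module):
    """Sketch: a 3D 'axial' convolution over (Wq, Hs, Ws) followed by a
    1D 'assembling' convolution over Hq (stride 1 assumed for brevity)."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.axial3d = nn.Conv3d(in_ch, out_ch, kernel_size=k, padding=k // 2)
        self.assemble1d = nn.Conv1d(out_ch, out_ch, kernel_size=k, padding=k // 2)

    def forward(self, c):
        # c: (B, C, Hq, Wq, Hs, Ws) batched 4D correspondence tensor
        b, ch, hq, wq, hs, ws = c.shape
        # 3D conv over (Wq, Hs, Ws), folding Hq into the batch dimension
        x = c.permute(0, 2, 1, 3, 4, 5).reshape(b * hq, ch, wq, hs, ws)
        x = self.axial3d(x)
        oc = x.shape[1]
        x = x.reshape(b, hq, oc, wq, hs, ws)
        # 1D conv over Hq, folding the remaining spatial dims into the batch
        x = x.permute(0, 3, 4, 5, 2, 1).reshape(b * wq * hs * ws, oc, hq)
        x = self.assemble1d(x)
        x = x.reshape(b, wq, hs, ws, oc, hq).permute(0, 4, 5, 1, 2, 3)
        return x  # (B, out_ch, Hq, Wq, Hs, Ws)
```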

D. Encoder-Decoder Architecture

Fig. 3. Visual illustration of the SCE. Generated from different-level features, the deblurring semantic correspondences $\{C_i^d\}$ are learned by three sequences of axial assembled 4D kernel, group normalization, and ReLU activation (AGR), and then merged by the fusion module (FM) to generate the squeezed correspondence $Z \in \mathbb{R}^{h'' \times w'' \times c''}$ for decoding. The numbers 16, 64, and 128 are the output correspondence tensor channels, and representations like 1×1×1×1 and $s = (1,1,1,1)$ are the kernel size and stride of AA-Conv4d respectively.

Our decoder network is composed of a series of 2D convolutions and ReLU activations, as illustrated in Fig. 4. We input the condensed representation $Z$ into it and make it predict the binary query mask $\hat{M}_q \in \{0, 1\}^{H \times W}$. During training, the parameters of the whole network are learned with the cross-entropy loss

$$\mathcal{L} = -\frac{1}{HW} \sum_{(x,y)} \Big[ M_q(x,y) \log \hat{M}_q(x,y) + \big(1 - M_q(x,y)\big) \log\big(1 - \hat{M}_q(x,y)\big) \Big]$$

where $M_q$ and $\hat{M}_q$ denote the ground truth and the prediction over all pixel locations $(x, y)$ respectively. At test time, we compare $\hat{M}_q$ with $M_q$ for evaluation.
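A one-line sketch of this objective, assuming the decoder emits two-channel logits per pixel; the function and argument names are illustrative.

```python
import torch.nn.functional as F

def segmentation_loss(logits, mask_q):
    """Pixel-wise cross-entropy between decoder logits and the ground truth.
    logits: (B, 2, H, W) decoder outputs; mask_q: (B, H, W) values in {0, 1}."""
    return F.cross_entropy(logits, mask_q.long())
```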

Fig. 4. Visual illustration of the decoder. $\hat{M}_q \in \{0, 1\}^{H \times W}$ is the predicted binary mask of the query image $I_q$.

E. Extension to K-Shot Scenario

V. EXPERIMENT

A. Setup

1) Datasets: For consistency of comparison with previous works, we choose two commonly-used datasets, namely PASCAL-5i [18] and COCO-20i [19], [29], to evaluate our model for few-shot semantic segmentation. PASCAL-5i [18] is made up of PASCAL VOC 2012 [39] and extra mask annotations from the SBD [40] dataset. With the cross-validation strategy, its 20 categories are separated into 4 folds, where 3 folds containing 15 classes are sampled for training and the remaining fold containing 5 classes is used for testing. During each test, 1000 episodes are randomly sampled for evaluation [18]. COCO-20i, consisting of 80 object classes, is a more challenging dataset. With the same cross-validation protocol, the 80 classes are also split into 4 folds [19], [29]. Note that the 40 137 images in the COCO validation set far outnumber those in PASCAL-5i, so we instead randomly sample 20 000 episodes for a more stable evaluation [31]. We also conduct extensive experiments on FSS-1000 [41]. Following common practices [17], [33], [41], the 1000 object classes are partitioned into three splits, i.e., 520, 240, and 240 classes for training, validation, and test respectively.
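For reference, below is a minimal sketch of the per-episode foreground IoU underlying the evaluation protocols in this section; averaging episode IoUs per class and then over a fold's test classes gives mIoU, and averaging foreground and background IoUs gives FB-IoU. This is the standard metric, not code from the paper.

```python
import torch

def binary_iou(pred, gt):
    """Foreground IoU for one episode; pred and gt are (H, W) {0, 1} masks."""
    inter = ((pred == 1) & (gt == 1)).sum().float()
    union = ((pred == 1) | (gt == 1)).sum().float()
    return (inter / union.clamp(min=1.0)).item()
```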

2) Implementation Details: We employ a ResNet50 [30], [34] with weights pre-trained on ImageNet [9] as the backbone of our network. It generates multi-level features, and we choose those distributed over all intermediate layers, i.e., those from conv3_x to conv5_x, as the input items of SCG. For each level of correspondence, the encoder consists of three 4D convolutional layers formed by AA-Conv4d with a size of 3×3×3×3 or 5×5×5×5, while the decoder is established by ordinary 2D convolutions. The whole model is trained end-to-end using Adam [42] with a batch size of 8 for 300 epochs on a GeForce RTX 3080 GPU. We fix the learning rate to $10^{-3}$ during training.
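A sketch of this training configuration; `AACNet` and `episode_loader` are hypothetical names standing in for the full model and an episodic data loader, and `segmentation_loss` is the sketch from Section IV-D.

```python
from torch.optim import Adam

# Hypothetical names: AACNet (the full model) and episode_loader
# (a DataLoader yielding support/query episodes with batch size 8).
model = AACNet(backbone='resnet50').cuda()
optimizer = Adam(model.parameters(), lr=1e-3)  # learning rate fixed at 1e-3

for epoch in range(300):                       # 300 epochs, trained end-to-end
    for support, support_mask, query, query_mask in episode_loader:
        logits = model(support, support_mask, query)
        loss = segmentation_loss(logits, query_mask)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```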

TABLE I COMPARISON WITH STATE-OF-THE-ARTS ON PASCAL-5I [18] IN MIOU AND PARAMS. PARAMS: THE NUMBER OF LEARNABLE PARAMETERS

B. Comparison to State-of-the-Art

1) PASCAL-5i: We report performance comparisons with state-of-the-arts on PASCAL-5i in Tables I and II. For both mIoU and FB-IoU evaluation, our AACNet achieves the best performance under both 1-shot and 5-shot settings. Specifically, for ResNet50-based methods in Table I, AACNet achieves 64.1% and 69.5% mIoU for the 1-shot and 5-shot scenarios respectively, outperforming the state-of-the-art by 3.3% and 2.9% respectively. For ResNet101-based methods, the mIoU improvement increases to 5.8% and 5.0% respectively. Furthermore, AACNet also achieves the best FB-IoU, i.e., 76.7% and 80.5% for ResNet50-based methods and 77.9% and 81.5% for ResNet101-based methods, while requiring the fewest learnable parameters. This demonstrates the effectiveness of AACNet for few-shot semantic segmentation.

TABLE II COMPARISON WITH STATE-OF-THE-ARTS ON PASCAL-5I [18] IN FB-IOU AND PARAMS

2) COCO-20i: In Table III, we present the comparison results in mIoU and FB-IoU on COCO-20i, which is created from a more challenging dataset [49]. Our AACNet also sets a new state-of-the-art performance whether evaluated in the 1-shot setting or the 5-shot setting. For example, for ResNet101-based methods, gains of 4.5% and 7.2% occur in mIoU and FB-IoU respectively in the 1-shot setting, and of 8.0% and 7.8% in the 5-shot setting, largely outperforming state-of-the-arts on the COCO-20i dataset. The significant performance improvement on COCO-20i signifies that our AACNet has a remarkable capability to handle complex scenes.

TABLE III COMPARISON WITH STATE-OF-THE-ARTS ON COCO-20I [19] IN MIOU AND FB-IOU

3) FSS-1000: We also extend our model to FSS-1000 [41] for comprehensive evaluation. As shown in Table IV, with F-IoUs of 85.0% in 1-shot and 87.6% in 5-shot, our AACNet also outperforms the previous best performance on this dataset. Although for ResNet101-based methods it falls behind DAN [33] by 0.2% in the 5-shot scenario, AACNet achieves better performance by 0.7% in the 1-shot scenario. Note that, following most experiments [31], [33], [41], [50] on FSS-1000 [41], only the foreground IoU (F-IoU) is used as the evaluation metric.

      TABLE IV COMPARISON WITH STATE-OF-THE-ARTS ON FSS-1000 [41] IN FOREGROUND IOU (F-IOU)

C. Ablation Study

In this section, we conduct extensive ablation studies on PASCAL-5i [18], COCO-20i [19], and FSS-1000 [41] to confirm the effectiveness of the proposed modules, i.e., DEs, AA-Conv4d, and FM, in our AACNet. As shown in Tables V–VIII, the results show that each of them contributes to better performance.

1) Deblurring Equations: As shown in Table V, to verify the effectiveness of the DEs, we conduct experiments with different DEs setups on PASCAL-5i. The entry w/o DEs indicates that the experiment is carried out with our model trained without DEs. We carefully set $\alpha$ to 5, 10, and 20 because the sigmoid function behaves like a step function when $\alpha$ is 20 or over. For $\beta$, we focus on which threshold is the most plausible value to locate ambiguous inferences, e.g., 0.5 and the mean value of the semantic correspondence, i.e., MOC in Table V. As shown in Table V, when $\alpha$ equals 10 and $\beta$ equals 0.5, AACNet achieves its best performances of 64.1% in 1-shot and 69.5% in 5-shot. In comparison, the mIoU scores without DEs are 62.8% and 68.0% respectively, only the 3rd best performance in 1-shot and the 4th best in 5-shot on PASCAL-5i. This confirms the effectiveness of DEs in improving segmentation performance. Furthermore, $\alpha$ is utilized to reduce the number of similarity scores close to 0.5, but setting $\alpha$ to 20 or over is not a better choice for reducing ambiguous inferences than 10. This is because if the sigmoid function behaves like a step function, the ability of 4D convolutions to learn complex semantic correspondences is limited by the sparsification of correspondences. Finally, we have conducted further experiments on COCO-20i and FSS-1000 to verify the effectiveness of DEs; the corresponding results are reported in Tables V and VIII respectively.

2) AA-Conv4d: To investigate the effectiveness of AA-Conv4d, we also perform experiments with different 4D kernels in the building blocks of our model. Table VI reports the 1-shot and 5-shot segmentation results on PASCAL-5i, COCO-20i, and FSS-1000. In these experiments, the mIoU score and the number of learnable parameters are employed to evaluate the segmentation performance and computational cost of our model implemented with different kernels. As shown in Table VI, Full-4D [13] requires a large number of parameters (13.7 M) due to its $O(d^4)$ complexity. In comparison, only 3.4 M learnable parameters are required by the model implemented with AA-Conv4d. Furthermore, although the $O(d^4)$ complexity is reduced to $O(d^2)$ by Sep-4D and CP-4D, the best mIoU on all datasets is still recorded by AA-Conv4d. This suggests that for complex correspondence tensor learning, an appropriate sparsification is necessary, but over-sparsification is not a good choice. Our conclusion is further confirmed by the visual results from different datasets in Fig. 5. As we can see in Fig. 5, although both Sep-4D [16] and CP-4D [17] achieve comparable mIoU performance, they suffer from many scattered discriminating blocks. In comparison, AA-Conv4d discriminates more convergently. This is why we claim that AA-Conv4d keeps sufficient information communication between the support and query subspaces, in contrast to factorizing a 4D kernel into separable 2D components.

TABLE V CLASS MIOU FOR THE ABLATION STUDY OF THE DES SETUP ON PASCAL-5I [18] AND COCO-20I [19]. MOC: MEAN OF CORRESPONDENCE. W/O DES: WITHOUT DES

TABLE VI CLASS MIOU FOR THE ABLATION STUDY OF AA-CONV4D ON PASCAL-5I [18] AND COCO-20I [19]. PARAMS: THE NUMBER OF LEARNABLE PARAMETERS

TABLE VII CLASS MIOU FOR THE ABLATION STUDY OF FM ON PASCAL-5I [18] AND COCO-20I [19]. PARAMS: THE NUMBER OF LEARNABLE PARAMETERS. UM: UNLEARNABLE MIXING

TABLE VIII CLASS MIOU FOR ABLATION STUDIES ON FSS-1000 [41]. PARAMS: THE NUMBER OF LEARNABLE PARAMETERS. SPEED: AVERAGE FRAMES PER SECOND (FPS) OF 1-SHOT SEGMENTATION. UM: UNLEARNABLE MIXING

Fig. 5. Qualitative results in the 1-shot scenario. Rows 1 to 4 are from PASCAL-5i, rows 5 to 8 are from COCO-20i, and rows 9 to 10 are from FSS-1000. From left to right: Support: support image with binary mask; UM: unlearnable mixing; Sep-4D: separable 4D kernel; CP-4D: center-pivot 4D kernel; GT: ground truth.

3) FM: We conduct an ablation study on pyramid correspondence concatenation by replacing the proposed fusion module with unlearnable mixing (UM) [17], which is performed by element-wise addition. As shown in Table VII, compared with the unlearnable mixing operation over squeezed correspondences, our FM achieves better performance in both inference accuracy and model size. Furthermore, as illustrated in Fig. 5, although equipped with the same 4D kernel as the proposed FM, UM ultimately fails to make better use of the semantic information learned by 4D convolutions. The contrast between learnable concatenation and element-wise addition is sketched below.
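Here is a hedged sketch of the difference being ablated; for brevity it operates on 2D squeezed correspondence maps, whereas the paper's FM is built upon the proposed 4D kernel, so the layer layout here is purely an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableFusion(nn.Module):
    """Sketch: top-down fusion of two pyramid levels by learnable
    concatenation; the UM baseline would simply add the two tensors."""
    def __init__(self, ch):
        super().__init__()
        self.mix = nn.Conv2d(2 * ch, ch, kernel_size=3, padding=1)

    def forward(self, coarse, fine):
        # coarse, fine: (B, C, H, W) squeezed correspondences from two levels
        up = F.interpolate(coarse, size=fine.shape[-2:], mode='bilinear',
                           align_corners=True)
        return self.mix(torch.cat([up, fine], dim=1))  # learnable concatenation
        # unlearnable mixing (UM) would instead return: up + fine
```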

D. Model Efficiency

1) Parameters: The parameters of our backbone network are fixed, as done in previous works [29]–[31]. Among our proposed methods, the DEs are training-free. For AA-Conv4d and FM, we report the parameter requirements in Tables VI–VIII. As shown in these tables, only 3.4 M trainable parameters are required by our best model, much fewer than the other approaches shown in Tables I and II.

2) Speed: Based on the ResNet50 backbone, AACNet achieves its best performance with 12.6 FPS for 1-shot prediction and 3.2 FPS for 5-shot prediction on a single GeForce RTX 3080 GPU. During evaluation, all test images are of size 400 × 400. As shown in Table VIII, the DEs do not significantly influence inference speed (from 13.4 to 12.6 FPS). Although AA-Conv4d slows the model down compared with CP-4D [17] or Sep-4D [16] (from 17.2 or 12.9 to 12.6 FPS), its inference speed is still 10 times faster than the original 4D kernel [13]. Furthermore, the proposed pyramid 4D tensor fusion module (FM) contributes to increased inference speed (from 10.6 to 12.6 FPS).

VI. CONCLUSION

In this paper, we have proposed AACNet, a novel framework built from pseudo-dense 4D convolutions, to analyze complex semantic correspondences in a fully-convolutional manner. Despite the limited supervision, extensive experiments and ablation studies on standard benchmarks have demonstrated the effectiveness of the proposed deblurring equations (DEs), axial assembled 4D convolution (AA-Conv4d), and fusion module (FM) for semantic segmentation. We comprehensively incorporate them into our model, evaluate it under different settings and by replacing each of them with alternatives or variants, and set a new state-of-the-art on the benchmarks. Possible future work includes extending our work from the few-shot scenario to zero-shot.
