TANG Siqi, HAN Congying, GUO Tiande
(School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China; Key Laboratory of Big Data Mining and Knowledge Management of Chinese Academy of Sciences, Beijing 100190, China) (Received 24 November 2018; Revised 15 January 2019) Tang S Q, Han C Y, Guo T D. Point matching algorithm based on machine learning method[J]. Journal of University of Chinese Academy of Sciences, 2020, 37(4): 450-457.
Abstract Point matching is an important issue in computer vision and pattern recognition, with wide applications in target recognition, medical imaging, pose estimation, etc. In this study, we propose a novel end-to-end model, the multi-pointer network, based on machine learning to solve this problem. We capitalize on the idea of multi-label classification to improve the pointer network: instead of outputting a single member of the input sequence, our model selects a set of input elements as output at each step. Treating matching as a sequential problem, our model takes the coordinates of points as input and outputs correspondences directly. With this new method, we can effectively handle translations of the whole space and other large-scale rigid transformations. Furthermore, experimental results show that our model generalizes to other combinatorial optimization problems in which the output is a subset of the input, such as Delaunay triangulation.
Keywords multi-pointer network; point matching; recurrent neural network (RNN); long short-term memory (LSTM) network; multi-label classification
Point matching plays a central role in computer vision tasks such as action recognition[1], multi-target tracking[2], and image retrieval[3]. The purpose of point matching is to determine the correspondence between two point sets. There are traditional algorithms to handle it, such as random sample consensus (RANSAC)[4] and graph-based point pattern matching[5]. Most traditional algorithms alternate between estimating the parameters of the transformation and updating the correspondence to obtain the optimal solution. This means that different models need to learn different transformation parameters. Meanwhile, as the number of points increases, the execution time of such systems increases notably. If we can obtain an accurate correspondence directly, we can greatly improve the efficiency of matching. In fact, a point set is a sequence of feature vectors, and the correspondence can be seen as a special form of permutation. Inspired by these facts, we propose a generic model based on recurrent neural networks (RNNs)[6] to solve the problem.
RNNs have been a popular choice for modeling variable-length sequences for more than three decades. Long short-term memory (LSTM)[7], with its ability to capture long-range temporal dependencies, makes it possible to handle long sentences with RNNs. RNNs have recently shown impressive performance on several sequence prediction problems, including machine translation[8], contextual parsing[9], image captioning[10], and even video description[11]. In particular, Vinyals et al.[12], Milan et al.[13], and Bello et al.[14] used RNNs to tackle combinatorial optimization problems.
Most models for sequential problems belong to the encoder-decoder family[8,15], which consists of two RNNs. One RNN encodes the source sentence into a fixed-length vector, and the other decodes the target sequence from that vector. The whole encoder-decoder system jointly maximizes the conditional probability of the target sequence given the source sequence. However, compressing all the necessary information of the input sequence into a fixed-length vector is a bottleneck that limits the performance of the encoder-decoder architecture. To address this issue, Bahdanau et al.[16] proposed a content-based attention mechanism that encodes the input sentence into a sequence of vectors and adaptively chooses a subset of these vectors while decoding the translation. Nonetheless, the size of the output dictionary must be fixed before the model is designed, so we cannot directly apply this approach to combinatorial problems where the size of the output dictionary depends on the length of the input sequence. The pointer network (Ptr-Net)[12] is a new architecture that addresses this limitation by repurposing the attention mechanism to create pointers to input elements.
Point matching is also a special kind of combinatorial optimization problem, whose goal is to obtain the optimal correspondence, and it can be modeled by Ptr-Net. However, Ptr-Net does not take full advantage of the correspondences between the two point sets, because its output at each time step is a single member of the input elements. We propose the multi-pointer network (MPN), which draws on the idea of multi-label classification[17-18], to address this limitation by pointing to a set of input elements at each step.
To our knowledge, the present study is the first to empirically evaluate the ability of RNNs to output correspondences between two sets in a purely data-driven fashion (i.e., when we only have examples of inputs and desired outputs). Experiments show that MPN can solve matching problems with large-scale transformations efficiently and simply. Moreover, MPN can also be extended to other combinatorial optimization problems.
This paper is organized as follows. In section 2, we briefly introduce the encoder-decoder framework, Ptr-Net, and multi-label classification. Our model, based on the combination of multi-label classification and Ptr-Net, is described in detail in section 3. To test the effectiveness of our model, we run and analyze experiments, including extensive comparisons over multiple rigid transformations and the computation of Delaunay triangulations, in section 4. Finally, we summarize the main contributions of our work and discuss future directions.
Given a training set of pairs (X^i, Y^i), where X^i = (x_1, …, x_{T_x}) is an input sequence and Y^i = (y_1, …, y_{T_y}) is the corresponding target sequence, the model parameters θ are learned by maximizing the conditional probabilities over the training set:

θ* = argmax_θ Σ_i log P(Y^i | X^i; θ).   (1)
In this case, it is reasonable to model each example using the chain rule and decompose it as follows (for the sake of brevity, we omit the index i):
P(Y | X; θ) = ∏_{t=1}^{T_y} P(y_t | y_1, …, y_{t-1}, X; θ).   (2)
The strategy of the encoder-decoder model is to map the input sequence to a fixed-length vector c with one RNN, and then to map this vector to the target sequence with another RNN. We call the former RNN the encoder and the latter the decoder. The most common encoder approach is to use one RNN such that
e_t = f_1(x_t, e_{t-1}),   (3)
where f_1 is the activation function of the encoder RNN, and e_{t-1} and e_t are the states of the encoder at times t-1 and t, respectively. Generally, we choose the LSTM architecture as the activation function f_1, since it is suitable for long-range temporal dependencies. We then obtain the vector
c = q(e_1, e_2, …, e_{T_x}),   (4)
where q is another nonlinear function. In Ref.[8], the fixed-dimensional representation c is the last hidden state of the LSTM. We then use another standard LSTM formulation f_2, whose initial hidden state is c, to obtain the hidden state d_t of the decoder at time t,
d_t = f_2(d_{t-1}, y_{t-1}, c).   (5)
The conditional distribution of the next symbol is
P(y_t | c, y_1, …, y_{t-1}; θ) = g(d_t, y_{t-1}, c),   (6)
where g is an activation function. Generally, this function is represented by a softmax over all words in the dictionary. Thus, we need to train different models for different dictionary sizes, and we cannot straightforwardly extend this model to problems in which the size of the output dictionary depends on the length of the input sequence.
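As a concrete illustration of Eqs. (3) and (4), the following minimal Python sketch encodes a sequence into the fixed vector c, taken as the last hidden state. A plain tanh cell stands in for the LSTM used in the paper, and the weight matrices W_x and W_e are illustrative assumptions:

```python
import numpy as np

def encode(X, W_x, W_e):
    """Eq. (3): e_t = f_1(x_t, e_{t-1}), run over a sequence X of shape
    (T_x, d_in); a tanh cell stands in for the LSTM."""
    e = np.zeros(W_e.shape[0])
    for x_t in X:
        e = np.tanh(W_x @ x_t + W_e @ e)  # Eq. (3)
    return e                              # Eq. (4): c is the last hidden state

rng = np.random.default_rng(0)
X = rng.random((5, 2))                    # five 2-D points as input tokens
c = encode(X, 0.08 * rng.normal(size=(8, 2)), 0.08 * rng.normal(size=(8, 8)))
print(c.shape)                            # (8,): the fixed-length vector c
```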
The pointer network (Ptr-Net) is an effective model that repurposes the recently proposed neural attention mechanism[16] to solve combinatorial optimization problems in which the output dictionary size equals the length of the input sequence. It differs from previous attention attempts in that it uses attention as a pointer to select a member of the source sequence as the target, instead of using attention to compute a weighted sum of the encoder hidden states at each decoder step.
u_i^t = v^T tanh(W_1 e_i + W_2 d_t),  i = 1, …, T_x,   (7)
P(y_t | y_1, y_2, …, y_{t-1}, X; θ) = softmax(u^t),   (8)
where softmax normalizes the vector u^t (of length T_x, with components u_i^t) to be an output distribution over the dictionary of inputs, and v, W_1, and W_2 are learnable parameters of the output model.
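To make Eqs. (7) and (8) concrete, here is a minimal sketch of the pointer attention step, assuming randomly initialized parameters v, W_1, W_2 and toy dimensions:

```python
import numpy as np

def softmax(u):
    z = np.exp(u - u.max())
    return z / z.sum()

def pointer_distribution(E, d_t, W1, W2, v):
    """Eqs. (7)-(8): score every encoder state e_i against the decoder
    state d_t, then normalize the scores into a pointer distribution."""
    u = np.array([v @ np.tanh(W1 @ e_i + W2 @ d_t) for e_i in E])  # Eq. (7)
    return softmax(u)                                              # Eq. (8)

rng = np.random.default_rng(0)
E = rng.random((6, 8))                 # T_x = 6 encoder states of dimension 8
d_t = rng.random(8)
W1, W2 = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
v = rng.normal(size=8)
p = pointer_distribution(E, d_t, W1, W2, v)
print(p.sum(), p.argmax())             # sums to 1; argmax is the pointed index
```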
In Ref.[12], the authors trained Ptr-Net to output satisfactory solutions to three challenging geometric problems: computing planar convex hulls, Delaunay triangulation, and the symmetric planar travelling salesman problem (TSP). Ptr-Net provides a novel architecture for complex combinatorial optimization problems in which the output sequence corresponds to positions in the input sequence. For the point matching problem, which requires assigning labels to elements of the input, it is necessary to output the corresponding points simultaneously at decoding time.
Multi-label classification is the supervised learning problem in which each object is represented by a single instance but associated with a set of labels instead of a single label. The task is to predict the label sets of unseen instances by analyzing training instances with known label sets. It helps to address problems in which there is a certain degree of correlation between labels. This characteristic provides an important theoretical basis on which to improve Ptr-Net for point matching. At present, research applying LSTMs to multi-label classification tasks is quite limited. Liu and Ramakrishnan[19] formulated music composition as a multi-label classification task, using sigmoidal output units. Yeung et al.[20] used an LSTM network with a multi-label method to recognize actions in videos. Recently, Lipton et al.[21] recognized patterns in multivariate time series of clinical measurements in the medical domain, using replicated targets at each sequence step. However, we could not locate any reported work using LSTMs with multi-label classification in the combinatorial optimization domain, especially for the point matching problem.
The matching problem is to determine the correspondence between the reference set A = {p_i = (x_i, y_i): i = 1, 2, …, m} and the sensed set B = {q_i = (x_i, y_i): i = 1, 2, …, n}. We simply add a special end-of-set symbol at the end of the reference set, which connects the two sets into a single sequence. An ordered set of vertices then represents the correspondences between the point sets. In this way, we can make use of the encoder-decoder framework to solve the point matching problem. Figure 1 shows a simple example of how we turn the matching problem into a sequential problem. Set A = {p_1, p_2, p_3} and set B = {q_1, q_2, q_3} are shown on the left of Fig.1. Let the correspondence between the two sets be (p_1, q_2), (p_2, q_3), (p_3, q_1). The right part shows the target vector of each time step. The dotted circle in each vector represents the end of set A.
Fig.1 Schematic diagram showing how to turn the matching problem into a sequential problem
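As a sketch of the construction in Fig.1, the following hypothetical helper builds the per-step multi-hot target vectors for a joint input sequence laid out as A, end-of-set symbol, B (the index layout is our assumption):

```python
import numpy as np

def matching_targets(pairs, m, n):
    """Build the per-step target vectors of Fig.1 for |A| = m, |B| = n.
    The joint input is A, then one end-of-set symbol, then B, so q_j sits
    at index m + 1 + j; each row marks one matching point-pair (p_i, q_j)."""
    T = m + 1 + n                        # length of the joint input sequence
    targets = np.zeros((len(pairs), T))
    for t, (i, j) in enumerate(pairs):   # 0-based indices into A and B
        targets[t, i] = 1.0              # position of p_i
        targets[t, m + 1 + j] = 1.0      # position of q_j
    return targets

# the correspondence of Fig.1, 0-based: (p1,q2), (p2,q3), (p3,q1)
print(matching_targets([(0, 1), (1, 2), (2, 0)], m=3, n=3))
```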
In this paper, we propose a novel model named the multi-pointer network. The model can generate a set of elements simultaneously at each time step by introducing the idea of multi-label classification into Ptr-Net. Figure 2 is a schematic of the multi-pointer network. We use one special symbol to represent the end of set A, another to represent the end of the encoder, and the symbol 〈g〉 to represent the input of the first decoder step. In the remainder of this section, we describe the model in detail.
Fig.2 A depiction of the multi-pointer network
A regular RNN reads the sequence in left-to-right order. However, information from future inputs is also important for prediction. To overcome this limitation, we use the bidirectional recurrent neural network (Bi-RNN)[22], which can be trained using all available input information in the past and future of a specific time frame.
e_t^f = f_1(x_t, e_{t-1}^f),  e_t^b = f_1(x_t, e_{t+1}^b),   (9)

where e_t^f and e_t^b denote the forward and backward encoder states at time t.
Note that there are no interactions between the two types of state neurons.
With both time directions handled in the same network, the input sequence can directly incorporate future information without the need for delays, so we can summarize not only the preceding points but also the following points, which suits the point matching problem. In our experiments, we use LSTMs as the activation functions of the forward and backward RNNs. In Fig.2, the two dashed rectangles inside the larger boxes represent the process of the bidirectional RNN.
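The sketch below illustrates Eq. (9), again with a tanh cell standing in for the LSTM; as noted above, the forward and backward passes share no state:

```python
import numpy as np

def bi_rnn(X, cell_f, cell_b, d_hidden):
    """Eq. (9): one pass left-to-right, one right-to-left; the two chains
    of states never interact and are concatenated at each position."""
    T = len(X)
    fwd, bwd = np.zeros((T, d_hidden)), np.zeros((T, d_hidden))
    e = np.zeros(d_hidden)
    for t in range(T):                    # forward direction
        e = cell_f(X[t], e); fwd[t] = e
    e = np.zeros(d_hidden)
    for t in reversed(range(T)):          # backward direction
        e = cell_b(X[t], e); bwd[t] = e
    return np.concatenate([fwd, bwd], axis=1)

rng = np.random.default_rng(0)
Wx, We = 0.08 * rng.normal(size=(4, 2)), 0.08 * rng.normal(size=(4, 4))
cell = lambda x, e: np.tanh(Wx @ x + We @ e)   # tanh cell in place of LSTM
print(bi_rnn(rng.random((5, 2)), cell, cell, 4).shape)   # (5, 8)
```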
For point matching, we define the matching point-pair as the target output of each time step in the training process. After sorting the points of the two sets, we construct a series of target vectors whose elements are zero except at the positions of the matching point-pair (see Fig.1). Hence, we need to output a set of input elements simultaneously at the decoder. Drawing on the idea of multi-label classification, we present a simple modification of Ptr-Net. We now go through the details of the new model.
First, we still use equation (7) to calculate the vector u^t that serves as an "attention" mask over the inputs. Then a different approach is adopted to obtain the conditional distribution at each time step:
P(y_t | y_1, y_2, …, y_{t-1}, X; θ) = sigmoid(u^t).   (10)
What makes our model different from Ptr-Net is that we use the sigmoid function instead of the softmax function, to fit the multi-label classification loss function. This allows the output at every time step to be a matching point-pair, or the points of some other geometric structure.
We generate outputs in chronological order at each sequence step. We then adopt the cross-entropy method to calculate the loss function,
loss_t = -Σ_{i=1}^{T_x} [y_{t,i} log p_{t,i} + (1 - y_{t,i}) log(1 - p_{t,i})],   (11)

L(θ) = Σ_{t=1}^{T_y} loss_t,   (12)

where p_{t,i} = sigmoid(u_i^t) is the predicted probability that input element i is pointed to at step t, and y_{t,i} ∈ {0,1} is the corresponding element of the target vector.
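A minimal sketch of Eqs. (10)-(12), assuming the standard multi-label binary cross-entropy form; the scores and targets below are toy values:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def step_loss(u_t, y_t):
    """Eqs. (10)-(11): element-wise sigmoid over the scores of Eq. (7),
    then binary cross entropy against the multi-hot target y_t."""
    p = sigmoid(u_t)                                             # Eq. (10)
    return -np.sum(y_t * np.log(p) + (1 - y_t) * np.log(1 - p))  # Eq. (11)

U = np.array([[2.0, -1.0, 1.5], [-0.5, 2.2, -2.0]])  # scores, 2 decoder steps
Y = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])     # multi-hot targets
print(sum(step_loss(u, y) for u, y in zip(U, Y)))    # Eq. (12): sum over steps
```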
Our model specifically targets problems in which the outputs at every time step are highly correlated, whereas Ptr-Net tends to solve problems in which the outputs are discrete and depend only on their locations in the input sequence. Our model mines the underlying information between the two point sets. In this manner, information can be spread throughout the two point sets, which greatly improves the accuracy on matching problems. Besides, the multi-pointer network can also be applied to combinatorial structure optimization problems, like Delaunay triangulation.
In this section, we briefly introduce the experimental details and analyze the results. First, a contrast experiment is conducted to test the impact of using the Bi-RNN. Then, to verify whether our model can handle point matching more efficiently than Ptr-Net, we run experiments on artificial data. Finally, to confirm the model's ability to generalize, we compare our model with the standard Ptr-Net on Delaunay triangulation.
Across all experiments, we use mini-batches of 128 point-set sequences and embed the two coordinates of each point in a 256-dimensional space. At the encoder step, the Bi-RNN consists of forward and backward LSTM cells with 128 hidden units each. At the decoder step, we use LSTM cells with 256 hidden units and one attention glimpse[23] to aggregate the contributions of different parts of the input sequence. Following the experimental results of Ref.[14], we adopt only one glimpse to reach a trade-off between performance and latency. We initialize all of the LSTM parameters from the uniform distribution between -0.08 and 0.08, and we clip the L2 norm of the gradients to 2.0 to avoid the exploding gradient problem. We train our model with the Adam optimizer, with an initial learning rate of 10^-3 decayed every 5 000 steps by a factor of 0.96. Both the encoder and decoder LSTMs have a single layer. Even though higher accuracy could likely be obtained by tuning the model, we consider that using the same hyperparameters for all rigid transformations makes the results stronger. Our model is implemented in the TensorFlow framework[24].
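The optimizer settings above can be expressed in TensorFlow 1.x roughly as follows; this is a hedged sketch in which a dummy loss stands in for the MPN loss of Eq. (12):

```python
import tensorflow as tf  # TensorFlow 1.x API, as used at the time

w = tf.Variable([0.5, -0.3])        # dummy variable standing in for MPN weights
loss = tf.reduce_sum(tf.square(w))  # placeholder for the Eq. (12) loss

global_step = tf.Variable(0, trainable=False)
# initial learning rate 1e-3, decayed by a factor of 0.96 every 5 000 steps
lr = tf.train.exponential_decay(1e-3, global_step, 5000, 0.96, staircase=True)

optimizer = tf.train.AdamOptimizer(lr)
grads, tvars = zip(*optimizer.compute_gradients(loss))
grads, _ = tf.clip_by_global_norm(grads, 2.0)   # clip gradient L2 norm to 2.0
train_op = optimizer.apply_gradients(zip(grads, tvars), global_step=global_step)
```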
To fully explore the performance of the multi-pointer network, we set up different types of rigid transformation and different sizes of point sets for comparative experiments. First, a random point set A is sampled from a uniform distribution on [1,2]×[1,2]. Then, another point set B is generated by applying a rigid transformation to A. To keep point coordinates from becoming negative, a correction is applied to set B. The impact of the network on the point matching problem is discussed below in more detail, for different set sizes and different kinds of transformation.
3.2.1 Database
We use two fixed set sizes, 6-to-6 (set A has 6 points and set B has 6 points) and 10-to-10, and one random size, 5-10-to-5-10 (the sizes of sets A and B are each a random number between 5 and 10, with |A|=|B|).
3.2.2 Translation
We adopt a translation approach different from the traditional one. First, we sample dx and dy from the uniform distribution on [0,1]. Then, we rescale linearly so that the translation vector has length 1 while its orientation is unchanged. In this way, the translation is no longer at the pixel level, which gives the method broader applicability.
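A sketch of the translation data generation described above; the helper name is an assumption, and the correction step for negative coordinates is unnecessary here since translating points in [1,2]×[1,2] by a unit vector with nonnegative components keeps all coordinates positive:

```python
import numpy as np

def make_translation_pair(m, rng):
    """Sample set A uniformly in [1,2]x[1,2] and translate it by a random
    unit-length vector with unchanged orientation (Secs. 3.2.1-3.2.2)."""
    A = rng.uniform(1.0, 2.0, size=(m, 2))
    d = rng.uniform(0.0, 1.0, size=2)
    d = d / np.linalg.norm(d)          # rescale to length 1, keep direction
    B = A + d
    perm = rng.permutation(m)          # shuffle B to hide the correspondence
    return A, B[perm], perm            # point k of shuffled B matches A[perm[k]]

A, B, perm = make_translation_pair(6, np.random.default_rng(0))
print(perm)                            # ground-truth correspondence
```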
3.2.3 Rotation
Three rotation-angle intervals, [0°,45°], [0°,90°], and [0°,135°], are used to evaluate the effect of the interval on the rotation performance of the new model. We also verify that the size of the point sets has a great impact on the experimental results.
3.2.4 Scaling
To verify similarity transformations, we adopt scaling factors ranging from 50% to 150%.
To measure accuracy, we use the average correct point pair ratio (ACPPR). The correct point pair ratio is defined as follows:
CPPR = N_correct / N_total,   (13)

where N_correct is the number of correctly matched point pairs and N_total is the total number of point pairs.
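A direct implementation of Eq. (13), treating correspondences as index pairs (a sketch; the function name is our assumption):

```python
def cppr(predicted_pairs, true_pairs):
    """Correct point pair ratio, Eq. (13): the fraction of point pairs
    that are matched correctly."""
    true_set = set(true_pairs)
    correct = sum(1 for pair in predicted_pairs if pair in true_set)
    return correct / len(true_pairs)

# one pair out of three is predicted correctly
print(cppr([(0, 1), (1, 2), (2, 0)], [(0, 1), (1, 0), (2, 2)]))  # 0.333...
```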
In this paper, all ACPPR values are calculated from at least 500 experiments. This section is divided into two parts. In the first part we verify the effectiveness of the bidirectional LSTM; the second part is an analytical and comparative study of the different transformations.
3.3.1 Bidirectional RNN
We test the impact of using the bidirectional RNN (Bi-RNN) on translation and rotation. The database adopted is the random-size set 5-10-to-5-10, and the interval of rotation angles is [0°,45°]. Figure 3 shows the importance of the Bi-RNN during training: the model with the Bi-RNN converges faster and reaches a smaller loss than the model without it.
Fig.3 Comparison of the experimental results in translation and rotation with and without Bi-RNN
3.3.2 Various transformations
The experimental results verify that our new method can effectively solve matching problems under various transformations and with different data-set sizes. First, we verify the effect of the rotation-angle interval; the results are presented in Table 1. Then, a comparison between Ptr-Net (P) and our multi-pointer network (M) is carried out. Table 2 shows the comparative results for different transformations with the fixed angle θ=30°. By leveraging the constraint of multi-label classification, the network better reinforces the connection between the two sets, and the accuracy is thus improved significantly.
Table 1 Matching results with various rotation angles θ
Table 2 Comparison between Ptr-Net (P) and our MPN (M) on different transformations/%
In this section, we consider Delaunay triangulation, another intensively studied problem in computer science and mathematics. Given a set P of points in a plane, a Delaunay triangulation is a triangulation such that no point of P lies inside the circumcircle of any triangle. During the training phase, the label of the output is S_P = {S_1, S_2, …, S_{m(P)}}, the sequence representing the triangulation of the point set. Each S_i is a triple of integers between 1 and n that represents the vertices of the i-th triangle by their positions in set P.
In the experiments, we sample from a uniform distribution on [0,1]×[0,1] to create the training data. Ref.[23] shows that the order of the output sequence affects performance, so we order the triangles by their incenter coordinates. Compared with Ptr-Net, we do not have to treat each triple as a sequence, since any permutation of its elements represents the same triangle. On sets of 5 points, our experiments achieve on average a 0.45% improvement.
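For reference, training labels of the kind described here can be generated with SciPy; the incenter-based ordering below is a sketch of the ordering mentioned above, and the helper name and 1-based indexing are our assumptions:

```python
import numpy as np
from scipy.spatial import Delaunay

def delaunay_labels(P):
    """Produce target triples for point set P, ordered by the (x, y)
    coordinates of each triangle's incenter."""
    tri = Delaunay(P)
    def incenter(t):
        a, b, c = P[t[0]], P[t[1]], P[t[2]]
        la = np.linalg.norm(b - c)   # side length opposite vertex a
        lb = np.linalg.norm(c - a)
        lc = np.linalg.norm(a - b)
        return (la * a + lb * b + lc * c) / (la + lb + lc)
    triples = sorted(tri.simplices.tolist(), key=lambda t: tuple(incenter(t)))
    return [[i + 1 for i in t] for t in triples]  # 1-based positions in P

P = np.random.default_rng(0).random((5, 2))       # 5 points in [0,1]^2
print(delaunay_labels(P))
```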
In this paper we presented an empirical study of using RNNs to solve the point matching problem. By analyzing the matching problem mathematically, we transformed it into a sequential form. We proposed a new end-to-end network, based on multi-label classification, that improves on Ptr-Net. Experimental results show that the proposed method effectively solves translation and other large-scale rigid transformations. In this way, we can rapidly obtain the correspondence between point sets at inference time. Moreover, our method can be extended to solve other combinatorial structure optimization problems.
In the future, we will try to design a more effective model to raise accuracy. With the trained model we can obtain correspondences directly, and on this foundation we can calculate the parameters of the transformation much more quickly with traditional methods like RANSAC. Meanwhile, the network model proposed in this paper can be generalized to the matching of multi-dimensional point sets. We are also excited about the possibility of applying this method to other combinatorial optimization problems that require assigning labels to elements of the input.