YU Jianqiao,LIANG Hui,SUN Yi
(The School of Electronic and Information Engineering, Dalian University of Technology, Dalian 116023, China)
Abstract: Computed tomography (CT) has enjoyed widespread applications, especially in the assistance of clinical diagnosis and treatment. However, fast CT imaging is not available for guiding adaptive precise radiotherapy in the current radiation treatment process because conventional CT reconstruction requires numerous projections and rich computing resources. This paper mainly studies the challenging task of 3D CT reconstruction from a single 2D X-ray image of a particular patient, which enables fast CT imaging during radiotherapy. It is widely known that the transformation from a 2D projection to a 3D volumetric CT image is a highly nonlinear mapping problem. In this paper, we propose a progressive learning framework to facilitate 2D-to-3D mapping. The proposed network starts training from low resolution and then adds new layers to learn increasingly high-resolution details as the training progresses. In addition, by bridging the distribution gap between an X-ray image and a CT image with a novel attention-based 2D-to-3D feature transform module and an adaptive instance normalization layer, our network obtains enhanced performance in recovering a 3D CT volume from a single X-ray image. We demonstrate the effectiveness of our approach on a ten-phase 4D CT dataset including 20 different patients created from a public medical database and show that it outperforms several baseline methods in image quality and structure preservation, achieving a PSNR value of 22.76±0.708 dB and an FSIM value of 0.871±0.012 with the ground truth as a reference. This method may promote the application of CT imaging in adaptive radiotherapy and provide image guidance for interventional surgery.
Key words: single view tomography, deep neural networks, progressive learning
In practical radiation treatment, ensuring the correct patient position through 2D or 3D image registration[1-2] is important in radiotherapy. However, the registered position probably changes due to patient motion during treatment, and this leads to inaccurate radiotherapy. Therefore, high-speed CT imaging by collecting X-ray projection images only over a limited angular range is necessary to track the change of patient position in real time. However, reconstructing CT images from limited projections is a seriously ill-conditioned inverse problem. Many scholars have been devoted to developing appropriate image reconstruction methods with sparse or limited projections[3-9] and seeking low-dose CT reconstruction[10-13]. For example, an iterative algorithm with a prior constraint on images[8] as a regularization term has been investigated to alleviate artifacts in degraded CT images. Some recent studies exploit data-driven deep learning approaches[14-22] to excavate more prior knowledge than regularization approaches and have shown success in sparse-view and limited-angle CT imaging. Deep learning approaches can predict 3D CT images from limited 2D X-ray projections with the help of a prior model trained on 2D-3D data pairs. If the dataset contains sufficient pairs of 2D X-ray images and their corresponding 3D volumes, it is possible for the trained model to infer the representation from a single 2D projection to 3D volumetric CT images within a second. This single-view CT reconstruction has potential for fast CT imaging in adaptive radiotherapy.
Since recovering the 3D structure of an organ from a single X-ray image is a highly ambiguous problem, some early works require prior knowledge from reference 3D scans of the organ and assume that its model is known. These methods generally adopt optimization schemes to iteratively establish 2D-to-3D correspondences through the points[24], lines and contours of interest[23]. However, it is difficult to achieve ideal results due to the limited features extracted by conventional methods. Recently, deep learning has become very promising at discovering intricate structures in high-dimensional data and is therefore applicable to single-view CT reconstruction[26-27]. Henzler et al[26] devised a deep CNN to reconstruct 3D volumes from a single 2D X-ray image and introduced a new dataset of mammalian species consisting of corresponding 2D-3D data pairs to train the CNN model. Shen et al[27] proposed a patient-specific reconstruction of volumetric computed tomography from a single projection view via deep learning. Both methods formulate the problem of 3D image reconstruction from a 2D projection as an encoder-decoder framework that learns the underlying relationship between feature representations across different dimensionalities. These two pioneering studies demonstrate the possibility of single-view CT reconstruction based on deep learning. However, the end-to-end training strategy they employ aggravates the problems of slow convergence and local optima as the network architectures grow in complexity. Consequently, the reconstructed images still suffer from blurred and inaccurate anatomic structures, and considerable effort is needed to improve the quality of predicted CT images. As the transformation from a 2D projection to a 3D volumetric CT image is a highly nonlinear mapping problem, it is well known to be a challenging task even for deep learning. Furthermore, the anatomic structures of different patients vary greatly, which also makes it difficult for a model to accurately learn the 2D-to-3D mapping from the data of a group of patients. Therefore, to facilitate fast convergence and stable learning for improved CT image quality, we specially design a progressive learning model for a particular patient. The main idea is to train the model with the particular patient's prior knowledge mapping a 2D radiograph to a 3D CT volume, which enables the model to predict CT images from the patient's subsequent single X-ray projection. The proposed learning model has the following advantages. ① It starts training with a low-resolution input and then adds new layers that model increasingly high-resolution details as training progresses, which allows the network first to discover the large-scale structure and then shift attention to increasingly higher-resolution details, instead of having to learn all scales simultaneously as a general encoder-decoder framework does. This leads to increased accuracy and stable convergence for the highly nonlinear mapping problem. ② We propose a 2D-to-3D feature transform module based on an attention mechanism, which guides the network to shift attention to representative features. Moreover, an adaptive instance normalization layer is embedded into the framework to align the mean and variance of the transformed features with those of the 3D volumetric CT image. These two steps facilitate the 2D-to-3D mapping and improve the quality of the reconstructed CT images. ③ We evaluate the proposed model on a public medical image database covering 20 different patients. Both quantitative and
qualitative experimental results demonstrate that the proposed progressive learning model can enhance the performance of the tomographic reconstruction from a single 2D X-ray image.
CT reconstruction technology has achieved great progress in the past few decades, developing from full angular-range reconstruction to sparse or limited-angle reconstruction. Until recent years, there have been few studies on single-view CT reconstruction based on deep learning. In this section, we briefly review the CT reconstruction methods that employ deep learning.
Learning 3D CT volume from limited-angular range data. To solve the prominent data-insufficiency problem caused by limited projections and angular range, current deep learning methodology opens up a new approach for CT techniques, promoting numerous successes in learning-based CT reconstruction from sparse[14-16] and limited-angle projections[17-22]. Compared with conventional regularization or statistical models, the deep learning model is more powerful in extracting prior knowledge for CT prediction by training an end-to-end network[22] or by combining with conventional techniques, e.g., wavelet decomposition[20]. For the case in which the angular range is limited to 90°, Anirudh et al[17] proposed a CTNet that completes the sinograms from a limited angle to a full view, followed by CT reconstruction with existing techniques (FBP). Ghani et al[22] completed CT reconstruction with a conditional GAN that integrated prior information from both the data and image domains through a consensus equilibrium framework. Recently, we also noticed that Ying et al[23] further cultivated the ability of deep learning to reconstruct CT from two orthogonal X-ray images with an X2CT-GAN. However, similar to the aforementioned limited-angle approaches, in severely limited data cases or even the extreme case of a single X-ray image, the generalization performance of the X2CT-GAN model may decrease due to the severely ill-conditioned problem.
Learning 3D CT volume from a single X-ray image. Although 3D object reconstruction from a single image is very challenging, some scholars still actively investigate this problem to recover a 3D CT volume from a single 2D X-ray image by leveraging deep learning. For example, a pioneering work by Henzler et al[26] devised simple CNNs, achieving single-image tomography of mammalian skulls with impressive performance. A more recent work by Shen et al[27] explored single-image tomography for a specific patient by using a 2D encoder and a 3D decoder to generate volumetric CT images. Since transforming a 2D X-ray image into a 3D CT volume is a highly nonlinear mapping problem, tremendous effort is still required to stabilize network training and improve CT image quality. In this paper, we propose a progressive learning framework to facilitate the learning process. Furthermore, we notice that the X-ray image has a different distribution from that of the CT images. Therefore, we bridge this gap via a feature transform module and an adaptive instance normalization layer, expecting a further improvement of the reconstructed CT images.
Recovering 3D CT structures from a single 2D X-ray image is a well-known, challenging task. Unlike 2D-to-2D translation, the 2D-to-3D problem requires learning a highly nonlinear mapping between features across different dimensions. This makes it unstable and difficult for the network to learn the 3D CT directly from the 2D image. The idea of making the network gradually accumulate learned experience shows an advantage in improving stability and quality[28]. Therefore, we propose a progressive learning methodology to address our 2D-to-3D mapping problem. Furthermore, we specifically design a feature transform module (FTM) combined with the attention mechanism and apply the adaptive instance normalization method to improve the CT reconstruction. We show the overall architecture of our model in Fig.1 and introduce each component in this section.
Fig.1 Overview of the progressive learning model trained with input from low resolution to high resolution. The proposed feature transform module makes the network redistribute attention across channels and convert DR features to CT-like features.
We focus on learning the nonlinear mapping of 2D-to-3D images, which is a more complex and well-known ill-posed problem compared with 2D-to-2D image translation. Although the deep learning model is a powerful candidate for solving this problem, simultaneously learning large-scale structural information and small-scale detail information is difficult. In this work, we believe that the network can benefit from inheriting the learned knowledge into a subsequent stage of learning, which we call progressive learning. This strategy is expected to make the learning of the complex mapping from a 2D image to a 3D image volume easier and to improve CT prediction performance. Therefore, we propose a progressive learning framework that trains the network starting from low-resolution input and then adds new layers combined with previously learned experience to model increasingly high-resolution images, as displayed in Fig.1. The early low-resolution inputs are downscaled from the corresponding high-resolution images.
As the training progresses, we take two steps to stabilize the training process and enhance the representation ability of the network. The first step is to transfer the well-trained parameters of the model for the low-resolution input to the same network layers in the model for the high-resolution input, which helps avoid sudden shocks to the network training. In addition, the previously trained parameters can also make the learning of high-resolution images easier and accelerate the convergence process. The second step is to add some new layers to the network, which facilitates the network's extraction of more features from the high-resolution image. The detailed architecture will be described in Section 3.4.
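As an illustration only (not the authors' released code), the following minimal PyTorch sketch shows one way the stage transition described above could be realized: parameters shared between the low-resolution and higher-resolution models are inherited, and only the newly added layers start from scratch. The function name and the assumption that shared layers keep the same names and shapes are ours.

```python
# Hypothetical sketch of the stage transition in progressive training:
# layers shared between the low-resolution model and the grown high-resolution
# model inherit their weights, while newly added layers keep their random
# initialization.
import torch

def grow_to_next_stage(prev_model: torch.nn.Module,
                       next_model: torch.nn.Module) -> torch.nn.Module:
    """Copy parameters from the previous-stage model into the next-stage model
    wherever the parameter names and tensor shapes match."""
    prev_state = prev_model.state_dict()
    next_state = next_model.state_dict()
    for name, tensor in prev_state.items():
        if name in next_state and next_state[name].shape == tensor.shape:
            next_state[name] = tensor.clone()  # inherit the learned experience
    next_model.load_state_dict(next_state)
    return next_model
```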
Our proposed model has a similar architecture to the popular U-Net[29], but the difference is that we rethink the usage of skip connections and introduce a novel feature transform module. Generally, the skip connection fuses information of the same-scale features from the encoder and decoder networks, and these same-scale features are semantically similar in biomedical image segmentation. However, in our case, we notice quite different semantics or content between the input X-ray image (or DR image) and the output CT volume. This content mismatch makes the direct skip connection between the low-level DR features and the high-level CT features intuitively unreasonable and probably causes inaccurate results. Therefore, we specifically design a feature transform module and add it into the skip connection as shown in Fig.1, which helps to convert DR features into CT features and improve the prediction performance. Moreover, since the traditional convolution kernel extracting information from multiscale images easily produces a large number of repetitive and redundant features, we introduce the attention mechanism to make the network shift attention to informative features.
The proposed feature transform module comprises two components named channel attention (CA) and feature transform (FT). The channel-wise attention mechanism embedded in the CA can selectively emphasize informative features and suppress less useful ones, which helps to predict CT images. We construct the channel attention layers as described by Hu et al[30], which squeeze the global information of the input features with an adaptive average pooling layer followed by an excitation operation that adaptively redistributes attention across channel-wise dependencies. The attention-aware DR features of the CA are then fed to the feature transform layer to convert the valuable DR features into CT-like features with max-pooling, convolution and deconvolution operations. The parameters of our feature transform module are listed in Table 1.
Table 1 Parameters of the proposed feature transform module. C denotes the number of channels; k and p denote the kernel size and padding size in the two-dimensional convolutions
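For concreteness, the sketch below outlines a feature transform module of the kind described above: a squeeze-and-excitation style channel attention block in the spirit of Hu et al[30], followed by a max-pooling/convolution/deconvolution transform. The channel counts, reduction ratio and kernel sizes are placeholders and do not reproduce the values in Table 1.

```python
# Illustrative feature transform module (FTM): squeeze-and-excitation style
# channel attention followed by a max-pool / conv / deconv transform that maps
# DR features to CT-like features. Channel counts and kernel sizes are
# placeholders, not the values from Table 1.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)        # squeeze global information
        self.excite = nn.Sequential(                  # redistribute channel attention
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                  # emphasize informative channels

class FeatureTransformModule(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.attention = ChannelAttention(channels)
        self.transform = nn.Sequential(               # convert DR features to CT-like features
            nn.MaxPool2d(2),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(channels, channels, kernel_size=2, stride=2))

    def forward(self, x):
        return self.transform(self.attention(x))
```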
As discussed in Section 3.2, we notice the content difference between DR features and CT features and narrow it by employing the proposed feature transform module. However, the CT-like features generated by the feature transform module still cannot match those of the volumetric CT image in statistical distribution. Thus, the direct fusion of both CT features may have side effects on the predicted CT image quality. To match the statistical distribution of the CT features, we add an adaptive instance normalization (AdaIN)[31] layer after the skip connections. Unlike AdaIN as commonly used for style transfer, which fuses content and style from two different image domains, in our method both the input CT-like features x obtained from the feature transform module and the CT features y generated from the latent feature space have similar content. However, they differ with respect to the mean and variance. Therefore, we align the channel-wise mean and variance of x to those of y, defined as:
AdaIN(x, y) = σ(y)·(x − μ(x))/σ(x) + μ(y)    (1)
where the CT-like features x are first normalized and then scaled and shifted with the variance σ(y) and mean μ(y) of y.
In our case, the two inputs of the AdaIN layer are in a similar domain. We believe that the alignment of the mean and variance by AdaIN can complement and enhance the similar parts of the two inputs generated from the skip connection and the latent space, which helps to improve the quality of the predicted CT image.
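Eq.(1) can be written compactly as a few lines of PyTorch; the sketch below is a generic AdaIN implementation over (N, C, H, W) feature maps and is not taken from the authors' code.

```python
# Generic AdaIN following Eq.(1): the channel-wise statistics of the CT-like
# skip features x are aligned to those of the decoder features y.
import torch

def adain(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """x, y: (N, C, H, W) feature maps of the same shape."""
    mu_x = x.mean(dim=(2, 3), keepdim=True)
    sigma_x = x.std(dim=(2, 3), keepdim=True) + eps
    mu_y = y.mean(dim=(2, 3), keepdim=True)
    sigma_y = y.std(dim=(2, 3), keepdim=True) + eps
    return sigma_y * (x - mu_x) / sigma_x + mu_y
```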
We design a CNN (Fig.2) to solve the 2D-to-3D mapping problem by regarding the third dimension of the output 3D CT volume as the channel dimension in the decoding network. As displayed in Fig.2, the proposed network first encodes the input 2D image features into the hidden layer space through a series of residual blocks and down-sampling layers and then transforms the hidden layer features into the CT image volume by cascading residual blocks, up-sampling layers and 1×1 convolution layers. Each Resblock (light gray block) in the network comprises three residual blocks, each of which combines three basic convolution layers (white blocks). We use the residual block because it can alleviate the vanishing and exploding gradient problems in deep networks[30]. In the process of feature encoding, we use max-pooling to down-sample the feature maps and extract the most active features. Although max-pooling causes information loss due to down-sampling, the skip connection combined with the proposed feature transform module can bring back some useful information from the un-pooled features. This information is then fused with the outputs of the decoding network by the AdaIN module. Since our goal is to find a 2D-to-3D mapping, we regard the depth dimension of the 3D CT volume as feature channels and perform a 1×1 convolution along the channels of the output layer to obtain the final CT results. The corresponding change of the feature map size in the down-sampling process is 1×128×128 → 320×128×128 → 1280×128×128 → 1280×64×64 → 1280×32×32 → 1280×16×16 → 1280×8×8, while the change in the up-sampling to the CT volume is 1280×8×8 → 1280×16×16 → 1280×32×32 → 1280×64×64 → 1280×128×128 → 128×128×128, where each '→' denotes 2D convolution residual blocks as described in Fig.2.
Fig.2 Architecture of the proposed network learning the mapping from a single 2D X-ray image to a 3D CT volume
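To make the depth-as-channels design concrete, the snippet below sketches the final decoding step suggested by the feature-map sizes listed above: a 1×1 2D convolution maps the 1280-channel decoder output at 128×128 to 128 output channels, which are then interpreted as the slices of the CT volume. The tensor shapes follow the text; the variable names are ours.

```python
# Sketch of the depth-as-channels output head: a 1x1 2D convolution maps the
# 1280-channel, 128x128 decoder output to 128 channels, which are interpreted
# as the 128 slices of the reconstructed CT volume.
import torch
import torch.nn as nn

head = nn.Conv2d(in_channels=1280, out_channels=128, kernel_size=1)

decoder_features = torch.randn(1, 1280, 128, 128)  # output of the last up-sampling stage
ct_volume = head(decoder_features).squeeze(0)      # (128, 128, 128): depth x height x width
```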
We trained the network with paired X-ray images (DRs) and 3D CT volumes. Before feeding the DR-CT pairs into the network, we first resize the data samples to the same size. An input X-ray image is resized to 128×128, and the original CT image volume (D×512×512) is resized to D×128×128, where the variable D denotes the number of slices in the CT volume. Moreover, for numerical stability in training, we normalize each DR-CT pair to a standard Gaussian distribution N(0,1) by calculating their statistical mean and deviation in both training and testing.
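A hedged preprocessing sketch consistent with this description is given below; the resizing utility (scipy's zoom) and the linear interpolation order are our assumptions, and here each image is standardized with its own mean and deviation.

```python
# Hedged preprocessing sketch: resize the DR to 128x128 and the CT volume to
# D x 128 x 128, then standardize each image to roughly zero mean and unit
# variance. The interpolation choice (linear) is an assumption.
import numpy as np
from scipy.ndimage import zoom

def preprocess_pair(dr: np.ndarray, ct: np.ndarray):
    """dr: (H, W) X-ray image; ct: (D, 512, 512) CT volume."""
    dr_small = zoom(dr, (128 / dr.shape[0], 128 / dr.shape[1]), order=1)
    ct_small = zoom(ct, (1, 128 / ct.shape[1], 128 / ct.shape[2]), order=1)
    dr_small = (dr_small - dr_small.mean()) / (dr_small.std() + 1e-8)  # ~N(0, 1)
    ct_small = (ct_small - ct_small.mean()) / (ct_small.std() + 1e-8)
    return dr_small, ct_small
```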
With the normalized DR-CT pairs, we train the network step by step in four stages, while the input image resolution grows from low (16×16) to high (128×128) as the training proceeds. The Adam optimizer is used to minimize the MSE loss function, which evaluates the consistency between the predicted 3D CT and the ground truth. The initial learning rate of Adam is 2e-5, and the momentum parameters are β1 = 0.5 and β2 = 0.99, respectively. The batch size is 6 in the first three training stages and 2 in the last stage because of memory limitations. The network is trained for 150 epochs for stable convergence in each of the first three stages and for 230 epochs in the last stage on one NVIDIA 3090 GPU. Training typically takes approximately 19 h on the lung CT dataset, and the inference time is approximately 0.1 s for a CT prediction. We implement our network in PyTorch and will make it publicly available upon publication of the paper.
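The following rough sketch illustrates the four-stage schedule with the stated optimizer settings; build_model_for_resolution, make_loader and the reuse of the grow_to_next_stage helper from the earlier sketch are hypothetical stand-ins, not the released implementation.

```python
# Rough sketch of the four-stage schedule (resolution, batch size, epochs) with
# Adam (lr = 2e-5, betas = (0.5, 0.99)) and an MSE loss. build_model_for_resolution
# and make_loader are hypothetical helpers; grow_to_next_stage is the weight
# transfer sketched earlier.
import torch
import torch.nn.functional as F

stages = [(16, 6, 150), (32, 6, 150), (64, 6, 150), (128, 2, 230)]

prev_model = None
for resolution, batch_size, epochs in stages:
    model = build_model_for_resolution(resolution)      # hypothetical constructor
    if prev_model is not None:
        model = grow_to_next_stage(prev_model, model)    # inherit the previous stage
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-5, betas=(0.5, 0.99))
    loader = make_loader(resolution, batch_size)         # hypothetical DR-CT data loader
    for _ in range(epochs):
        for dr, ct in loader:
            loss = F.mse_loss(model(dr), ct)             # consistency with ground truth
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    prev_model = model
```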
In this section, we present the experimental evaluation criteria, show the quantitative and qualitative results compared with two baseline methods[26-27], and demonstrate the contribution of each component of our model in an ablation study.
Dataset: To evaluate the effectiveness of our proposed model, we conduct experiments on a public lung 4D CT dataset available in the Cancer Imaging Archive (TCIA, http://www.cancerimagingarchive.net/). The dataset contains 20 patient cases, each comprising respiration-correlated 4D CTs with ten breathing phases. All information about the 4D CT protocol and the patients in this database can be found in reference[33]. The data need to be preprocessed for training. We first resize all CT images of the 20 patients to the same size of 128×128. Then, we augment the CT data of each patient by imposing a series of random translations (range: ±5 pixels) and rotations (range: ±5 degrees) on the 3D CT volumes. Each 3D CT volume is projected in the anterior-posterior direction to obtain the corresponding 2D projection. The above processing finally produces paired 2D X-ray images and corresponding 3D image volumes for training and testing. We evaluate the feasibility of our approach in two cases. ① Case 1: Ten-phase lung 4D CT scans of patients are selected; the first six phases of the 4D CT scan are used to generate 900 DR-CT pairs for training and the remaining four phases to generate 600 DR-CT pairs for testing. This ensures that the test samples have not been seen in training so as to verify the prediction ability and effectiveness of the proposed approach. ② Case 2: Two ten-phase 4D CT scans acquired at an interval of two weeks are selected. We augment the first 4D CT data to generate 1500 samples (DR-CT pairs) for training and the second to generate 200 samples for testing. This verifies the performance of the learning model, trained with previous CT scan data of a specific patient, in generating CT images from the same patient's single X-ray projection collected in subsequent scans. This procedure is valuable for fast CT imaging in radiation therapy.
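The augmentation and projection step can be sketched as follows, assuming a parallel anterior-posterior projection obtained by summing the CT volume along one axis as a surrogate DRR; the authors' exact projection geometry and interpolation settings are not specified here.

```python
# Sketch of the augmentation and projection step: random in-plane translations
# (+/- 5 px) and rotations (+/- 5 degrees) of the CT volume, followed by a
# parallel anterior-posterior projection (a simple sum along one axis) used as
# a surrogate DRR.
import numpy as np
from scipy.ndimage import rotate, shift

def augment_and_project(ct: np.ndarray, rng: np.random.Generator):
    """ct: (D, H, W) CT volume; returns the augmented volume and its 2D AP projection."""
    dy, dx = rng.uniform(-5, 5, size=2)
    angle = rng.uniform(-5, 5)
    ct_aug = shift(ct, (0, dy, dx), order=1)                    # in-plane translation
    ct_aug = rotate(ct_aug, angle, axes=(1, 2), reshape=False)  # in-plane rotation
    dr = ct_aug.sum(axis=1)                                     # AP projection (assumed parallel geometry)
    return ct_aug, dr
```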
Metrics: To evaluate the performance of the proposed single-view CT reconstruction, we report the RMSE, SSIM, PSNR, visual information fidelity (VIF) and feature similarity index (FSIM) scores of the predicted 3D images. RMSE is the L2-norm error between the prediction and the ground-truth images, SSIM represents the overall similarity between two images, and PSNR is the ratio between the maximum signal power and the noise power, which has been widely used to measure the quality of image reconstruction. VIF quantifies visual quality based on the mutual information between the test and reference images, and FSIM specifically characterizes local image quality. Both VIF and FSIM are considered to be consistent with the radiologist's evaluation of medical images[34].
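For reference, RMSE and PSNR can be computed directly as below; SSIM, VIF and FSIM are typically taken from image-quality toolboxes and are omitted from this sketch.

```python
# RMSE and PSNR computed directly on a predicted volume and its ground truth.
import numpy as np

def rmse(pred: np.ndarray, gt: np.ndarray) -> float:
    return float(np.sqrt(np.mean((pred - gt) ** 2)))

def psnr(pred: np.ndarray, gt: np.ndarray) -> float:
    data_range = float(gt.max() - gt.min())   # maximum signal range of the reference
    return float(20 * np.log10(data_range / rmse(pred, gt)))
```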
We conduct fair comparative experiments with the two baseline methods[26-27] in the above two cases. The experimental results of each compared method are obtained by running their released code with the training parameters described in their papers.
Case 1: A ten-phase lung 4D CT scan. In this case, we use augmented data from the first six phases of the 4D CT to train the model and test its prediction ability on the remaining four phases. We show the quantitative results for each method in Table 2. The mean values and standard deviations of the five metrics over all 20 patients are averaged to obtain the final results. Comparing each metric in Table 2, it can be seen that our method outperforms the other methods in all five metrics. Taking PSNR as an example, ours yields an improvement of 3.52 dB over Henzler et al and 11.67 dB over Shen et al. The FSIM index, which characterizes local image quality, also shows a consistent improvement. These results demonstrate the effectiveness of the progressive training strategy combined with the feature transform module and the AdaIN module. In other words, our method, which progressively guides the network to first learn the large-scale structure and then shift attention to high-resolution details, can effectively improve prediction performance.
Table 2 Quantitative comparison with state-of-the-art algorithms on ten-phase lung 4D CT data of a particular patient. The bold numbers indicate the best results
For a qualitative comparison (Fig.3), we present a CT slice for each of three individuals as an example, indicated by the dashed line in the X-ray image. The ground truth ('GT') images are shown in the second column, and the predicted results of Shen, Henzler and the proposed approach are displayed in the third to fifth columns, respectively. We can observe in the enlarged images that the restored anatomic structures (e.g., the calcified coronary artery in individual 1, the tumor in individual 2 and the pulmonary bronchioles in individual 3) in Shen's result are blurred, while the result of Henzler is clearer. Compared with the results of Shen and Henzler, ours is the most accurate. This can be seen from the difference images in the sixth through eighth columns. Moreover, the comparison of the SSIM values shown in the upper left corner of the predicted images also quantitatively proves the advantages of the proposed approach.
Fig.3 Example of reconstruction results for a ten-phase lung 4D CT case, together with the difference images between the predicted CT image and the ground truth. The model is trained with 900 samples from the first six phases of the 4D CT and tested with 600 samples from the remaining four phases, which are not seen during training. The ground truth is denoted as 'GT'. The SSIM shows the overall similarity between the predicted image and the ground truth.
Case 2: Two ten-phase lung 4D CT scans. In this case, we evaluate whether the 2D-to-3D model trained with ten-phase 4D CT data acquired at a previous time is helpful for fast CT prediction of the same patient from a single 2D X-ray image at a later time, which is valuable for image-guided radiation therapy. We train the learning model with the patient's previous 4D CT data and test the model with an X-ray image acquired two weeks later as input to reconstruct volumetric CT images. The quantitative results of all patients for the different methods are averaged and reported in Table 3. In terms of PSNR, ours outperforms the Henzler method by 4.7 dB and the Shen method by 6.18 dB. Consistent performance of our method can also be seen on the other metrics. For instance, the mutual-information-based index VIF of our method (0.521) is higher than that of Shen (0.476) and Henzler (0.506). These improvements can be visually seen in the reconstructed CT slices shown in Fig.4. Compared with the predicted results of Shen and Henzler, ours are more accurate and clearer in shape and anatomic structure (e.g., the pulmonary bronchioles of the three individuals indicated by the yellow arrows), which can be seen from the enlarged and difference images.
Table 3 Generalization performance of each model on two ten-phase lung 4D CT datasets. The bold numbers indicate the best results
Fig.4 Example of predicted results for two ten-phase lung 4D CT cases, together with the difference images between the predicted CT image and the ground truth. In this case, the model is trained with 1500 samples from the first 4D CT data and tested with 200 samples from the second 4D CT scanned at another time. The ground truth is denoted as 'GT'. The SSIM shows the overall similarity between the predicted image and the ground truth.
We calculate the SSIM for each predicted CT image, and the comparison of the SSIM indicates that our results are closer to the ground truth. These experiments demonstrate that our progressive learning model, trained successfully at one time, has a certain predictive ability for the same individual at another time. In particular, the bone structure is clear, which potentially helps the position registration in radiotherapy. In addition, once the model has been trained, the CT reconstruction time is about 0.1 s, which allows for fast image-guided 3D registration and radiotherapy.
Accurate image registration plays an important role in precise radiotherapy, aiming at correcting the patient's offset at the time of treatment from the planned position. It ensures that a high radiation dose can be accurately delivered to the target volume and reduces side effects on healthy tissue. Generally, the registration methods in common image-guided radiation therapy (IGRT) fall into 2D/2D registration[35-36] and 3D/3D registration[37]. The 2D/2D registration requires paired orthogonal 2D digital radiographs (DR) and digitally reconstructed radiographs (DRR) generated from the planned CT, and compares intensity or features (e.g., bony structures) between the DR and DRR images to estimate the patient's translations and rotations. The 3D/3D registration compares two volume data sets acquired from the planning and treatment CTs, and computes the geometric transformation by minimizing a similarity metric such as the intensity difference between the two volumetric images[35]. Compared to 2D/2D registration, 3D/3D registration has been demonstrated to provide improved registration accuracy and stability[6], due to the 3D nature of the data and the better contrast of the anatomic structures in CT images. However, 3D volume reconstruction on a cone beam CT system requires gantry rotation and numerous projections, which is time-consuming. Thus, we cannot obtain real-time CT images to track the patient's motion during treatment for adaptive radiotherapy. The scheme of CT reconstruction from a single X-ray image studied in this paper potentially provides an alternative for fast adaptive registration and precise radiotherapy.
Our proposed learning model, which does not require gantry rotation, allows for fast (0.1 s) 3D CT volume reconstruction from a single image. Fig.5 displays the CT reconstruction from a radiograph, where Fig.5(a) is the input 2D X-ray image and the corresponding reconstructed 3D CT volume in coronal and sagittal views is shown in Fig.5(f) and Fig.5(g), respectively. We can see that the shape and structure restored by our method are consistent with the ground truth displayed in Fig.5(b) and Fig.5(c). This lays the foundation for fast 3D registration and further exploration of CT image-guided adaptive radiotherapy. To validate the accuracy of our reconstruction method, as in conventional medical image registration, we extract the rigid bone structure and then compare our reconstructed bone in two orthogonal views with that obtained from the commercial CBCT system through annotated points on the bone. The bone extraction is performed by making the tissue with a CT value < 65 not transparent, as shown in Fig.5(d) and Fig.5(e). To complete the annotation of the chest bone, we invite three participants aged between 25 and 30 for assistance. For the spine, as shown in Fig.5(d), the junctions of the thoracic spine are annotated with several points. Centered on the spine, we annotate the ribs with some symmetrically distributed points. For the sternum, as shown on the left in Fig.5(e), we mark it with seven points. Each participant independently marks the points in the same way. To ensure accuracy and consistency, we annotate and check using the 'markup editor' in 3D Slicer software. Finally, the annotated points of all the participants are averaged to obtain the final results to reduce the error of manual annotation. The landmarks on the ground-truth bone structure are shown with green points, which are taken as a reference for registration. In Fig.5(h) and 5(i), we annotate the restored bone structure with red crosses and overlap it on the reference green landmarks. The red crosses are close to the green marks, and the average landmark registration error is 0.57 mm. This means that our model, using a particular patient's single X-ray image, can provide valuable 3D anatomic information for the same patient's position registration in radiotherapy. In addition, the rapid CT reconstruction ability (within a second) of our model is helpful for fast registration.
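The reported landmark registration error corresponds to a simple average point-to-point distance, which could be computed as in the small sketch below (coordinates assumed to be in millimetres and in corresponding order).

```python
# Average landmark registration error: mean Euclidean distance between the
# annotated points on the reconstructed bone and the reference points on the
# ground-truth bone (corresponding order, coordinates in millimetres).
import numpy as np

def mean_landmark_error(pred_pts: np.ndarray, ref_pts: np.ndarray) -> float:
    """pred_pts, ref_pts: (N, 3) arrays of corresponding landmark coordinates."""
    return float(np.linalg.norm(pred_pts - ref_pts, axis=1).mean())
```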
Our approach applies three strategies to achieve promising CT reconstruction from a single X-ray image. The first strategy is training the network progressively from a low-resolution image to its high-resolution counterpart. The second strategy employs the attention-based feature transform module in the skip connection. The third is using the adaptive instance normalization module to align the mean and variance of the generated features with those of the volumetric CT images. We denote the three strategies as PG., Att. and Ada., respectively, in Table 4 and conduct ablation experiments to evaluate the contribution of each strategy. When all strategies are applied, our approach achieves the best overall performance, as shown in the first row.
Ablation study on progressive learning. When the PG. strategy is not used, as shown in the second row of Table 4, the performance degrades. The PSNR in case 1 decreases from 35.23 dB to 32.25 dB, and the SSIM drops from 0.936 to 0.893. Consistent results can also be observed in case 2. These experiments prove that the progressive learning strategy can effectively guide a better mapping from the 2D X-ray image to the 3D CT volume and improve the performance of CT prediction.
Ablation study on the attention-based feature transform module. To verify the benefit of the proposed feature transform module with the attention mechanism, we remove it from the network. As expected, the overall performance decreases, as shown in the third row of Table 4. The PSNR in case 1 decreases by 1.11 dB compared with that of the full model in the first row. In addition, we show the input and output feature maps of the feature transform module in Fig.6. It can be seen that the input DR-like features are transformed into CT-like features as expected, which is helpful for the generation of CT images. These results validate that making the network focus on informative features in the latent space by applying our attention-based feature transform module can further improve the network's prediction performance.
Fig.6 Feature maps before and after the feature transform module
Ablation study on adaptive instance normalization. We apply the AdaIN module to align the distribution of the features generated by the proposed feature transform module with that of the CT features. When the AdaIN module is eliminated, we notice a drop in performance in both cases, as shown in the 4th row of Table 4. For instance, the PSNR drops from 25.38 dB to 23.87 dB in case 2. This in turn demonstrates the effectiveness of adaptive instance normalization in improving the performance. Moreover, when both the Ada. and Att. strategies are removed, as displayed in the 5th row, the overall performance degrades compared with that of the full model. For example, in case 1, the PSNR decreases from 35.23 dB to 32.35 dB. This indicates that the combination of Ada. and Att. can further improve the performance compared with the PG. strategy alone. Furthermore, we compare with the result of the model with all strategies removed (the last row), i.e., UNet. We can see that the performance of this model decreases by a large margin. For instance, in case 2, the PSNR and SSIM drop by 8.12 dB and 0.265, respectively, compared with the results in the 5th row. These experiments validate the effectiveness of the PG. strategy once again.
We compare the loss curves of the network trained with and without the PG. strategy (i.e., UNet) for case 1 in Fig.7. We can observe that the network with progressive learning fits the training data better and stably converges to a better optimum in testing, while the convergence of UNet in both training and testing is neither as good nor as stable as that of our model. This indicates that progressive learning can help stabilize training and improve CT prediction performance. Moreover, we can see that applying the Att. and Ada. strategies together with the PG. strategy further reduces the training and testing loss, and the model with all strategies applied converges to the best optimum in both training and testing, which verifies the effectiveness of our attention-based feature transform module and feature distribution alignment.
Fig.7 Training and testing loss of the network with and without the PG. strategy (i.e., UNet) on ten-phase 4D CT data
This paper investigates the challenging problem of generating 3D tomographic images from a patient's single X-ray image, which is valuable for fast CT imaging and registration in image-guided radiation therapy. We propose a progressive learning model to improve the quality of the restored CT images, which first learns the large-scale structure from a low-resolution input and then shifts attention to high-resolution details. In addition, to bridge the gap between the X-ray image and the CT image, we propose a novel attention-based feature transform module and apply an adaptive instance normalization layer for feature distribution alignment. These contributions are validated through experiments on a public 4D CT dataset including 20 patients, indicating that the proposed network can learn the prior knowledge of the 2D-to-3D mapping from a patient's previous CT scan data and use a single radiograph acquired during the patient's subsequent treatment to predict valuable CT information for 3D registration and radiotherapy.
Limitation. CT reconstruction from a single 2D X-ray image is a complex and challenging problem because the anatomic structure and its changes vary greatly between different people. Even for a specific individual, organ movement and body deformation make the problem difficult. In this paper, we only study the problem for a particular patient and establish a progressive learning model. Since the public dataset does not provide X-ray images, we train the model with projections generated from volumetric CT images. For practical application, the proposed model can be trained with the X-ray images collected by CT equipment. Note that we choose anteroposterior chest X-ray images for training, which retain the most anatomic structure information of the lung, thereby helping to reduce the difficulty and ambiguity associated with CT reconstruction from a single X-ray image.