      Deep residual learning for denoising Monte Carlo renderings

Computational Visual Media, March 2019

Kin-Ming Wong, Tien-Tsin Wong

Abstract  Learning-based techniques have recently been shown to be effective for denoising Monte Carlo renderings. However, there remains a quality gap to state-of-the-art handcrafted denoisers. In this paper, we propose a deep residual learning based method that outperforms both state-of-the-art handcrafted denoisers and learning-based denoisers. Unlike the indirect nature of existing learning-based methods (which, e.g., estimate the parameters and kernel weights of an explicit feature-based filter), we directly map the noisy input pixels to the smoothed output. Using this direct mapping formulation, we demonstrate that even a simple-and-standard ResNet and three common auxiliary features (depth, normal, and albedo) are sufficient to achieve high-quality denoising. This minimal requirement on auxiliary data simplifies both training and integration of our method into most production rendering pipelines. We have evaluated our method on unseen images created by a different renderer. Consistently superior quality denoising is obtained in all cases.

Keywords  Monte Carlo rendering; denoising; deep learning; deep residual learning; filter-free denoising

      1 Introduction

Monte Carlo rendering methods have become the mainstream photo-realistic image synthesis technique because of their generality, fast start-up, and progressive nature. Unfortunately, such methods take a prohibitive amount of time to obtain a noise-free image. While light transport techniques [1-3] can accelerate the integration process, noise-free images remain computationally expensive. Image-based denoising techniques have matured quickly in recent years. They take less processing time and are often easy to integrate into existing rendering pipelines. Several post-processing image denoisers have been proposed which achieve high-quality results [4-6].

Recently, learning-based approaches [7, 8] have been demonstrated to provide an effective means of denoising. However, their current results do not show a significant improvement in quality over state-of-the-art handcrafted denoisers. We believe that their mildly incremental improvements are due to the joint filtering model commonly found in mainstream image-based denoisers. The passive roles of their neural networks in filter kernel estimation fail to unleash the full power of deep learning. While the state-of-the-art deep learning method of Ref. [7] requires a large number of auxiliary features, their true benefit is dubious.

In this paper, we propose a filter-free direct denoising method based on supervised learning using a standard-and-simple deep residual network (ResNet) [9]. Unlike previous learning-based methods which require a large number of auxiliary features, ours requires only three: depth, view-space normal, and albedo. We train our simple network to map the noisy inputs directly to high-quality noise-free outputs using our own dataset. The training takes less than 36 hours. Nevertheless, our network outperforms both state-of-the-art learning-based denoisers and carefully handcrafted image denoisers in terms of visual quality. Figure 1 compares our denoising results with those from two leading denoisers, NFOR [5] and KPCN [7].

Fig. 1  We propose a filter-free direct denoising method based on supervised learning with a deep residual network [9, 10]. Our network takes the noisy image together with depth, view-space normal, and albedo as input (9 channels total), and it directly outputs a noise-free result with no intermediate filtering step. Both our network and KPCN [7] are trained with our own dataset (rendered using RenderMan), which covers diverse shading and distributed effects found in modern rendering methods. The above images compare the denoising performance of our network with other leading methods on a noisy image with a depth of field effect, rendered at 8 samples per pixel using the Tungsten renderer (The Wooden Staircase scene by Wig42, from Ref. [11]).

The key to achieving such high quality using a simple ResNet and just three auxiliary features is the notion of deep residual learning; its unique design forces the network to learn the difference between its input and the expected output, i.e., the residual. In a supervised learning setting, the network learns to map the differences between the noisy input and the corresponding ground truth. Furthermore, the shortcut connections of ResNet allow reuse of upstream features to establish a multi-scale-like mapping capability. All these features make ResNet a perfect candidate for our denoising task. In addition, the batch normalization [12] layers in ResNet make it resilient to high dynamic range data (typical of our noisy color inputs), and allow it to scale well in depth.

To validate our method, we have tested it on a rich variety of scenes (not included in our training data) rendered by a different renderer. Extensive experiments and quantitative evaluation show that our method consistently outperforms other state-of-the-art denoising methods. In short, our contributions are:

· A deep learning based single-image denoising method for Monte Carlo rendering. Our simple ResNet generalizes well, and utilizes only three standard auxiliary features as additional input. It integrates transparently into existing rendering pipelines without need for adaptation.

· We identify and demonstrate the benefits of residual learning for high-quality denoising of the output of Monte Carlo rendering methods.

· We demonstrate the advantage of direct denoising using a deep neural network, and identify the importance of auxiliary feature selection and its balance with network capacity.

      2 Related work

In this section, we review selected image-based denoising methods and deep learning techniques closely related to our work. A thorough review of denoising Monte Carlo rendering output, including a priori methods which study origins of sampling noise in the rendering process, is available in Ref. [13]. Given the wide spectrum of deep learning techniques and applications, even a brief review is beyond the scope of this paper, and we refer to Refs. [14-16] for comprehensive reviews.

2.1 Joint filtering methods

Auxiliary features, such as depth, normal, and albedo, computed during most Monte Carlo rendering methods, possess strong correlations with image structures and details seen in the rendered color image. More importantly, such feature data are often considerably less noisy than the rendered image itself, even when the sampling rate is low. Many successful denoising techniques adapt various edge-aware non-linear image filters to leverage this correlation to produce powerful joint filtering methods.

McCool [17] leverages the feature data to produce a coherence map which controls an anisotropic diffusion [18] filtering process. Dammertz et al. [19] adapt auxiliary features as edge-stopping functions in their edge-avoiding wavelet [20] framework. Sen and Darabi [21] adapt the kernel weights of a cross-bilateral filter [22-24] using mutual information, which helps to suppress the influence of random inputs. Li et al. [25] propose use of a SURE estimator [26, 27] to select the best per-pixel result among a set of cross-bilateral filtered candidates created with different bandwidths. Rousselle et al. [6] use both cross-bilateral and non-local means filtering [28] in their framework to produce a set of improved candidates for selection via SURE estimation. First-order regression-based methods using local regression and linear models have been proposed in Refs. [4, 29]. Bitterli et al. [5] use a holistic first-order regression approach, which is considered to be the state-of-the-art joint filtering method.

2.2 Learning-based filtering methods

Regression-based joint filters aim to produce smooth results from the noisy inputs; there is always a risk of over-fitting to the noise. It is known that they do not handle highly noisy inputs well. Kalantari et al. [8] propose the first supervised machine learning method to estimate the ideal parameters of their cross-bilateral filter model. Unlike traditional regression-based approaches, a supervised learning model is trained with a large number of noisy and ground truth image pairs. A neural network, such as the multilayer perceptron used in Ref. [8], can learn the complex relationship between the noisy inputs and the ground truth. Bako et al. [7] recognize the potential benefits of deep convolutional neural networks, and further delegate the task of determining the ideal filter kernel (bandwidth is preassigned) to the neural network. They also report the difficulties faced by their direct CNN denoising attempt, including slow training convergence and potential color shifts.

      2.3 Deep learning for inverse problems

Deep convolutional neural networks have demonstrated their great feature extraction power in many difficult image classification problems [30-32]. Supervised learning methods using CNNs have also shown impressive results in image denoising [33], and in many inverse problems such as inpainting [34], deconvolution [35], and super-resolution [36, 37]. These inverse problems share a common challenge: to reconstruct an output based on inputs with incomplete information. The capability of a CNN is known to depend directly on its depth [38], but it is not a simple matter of stacking more layers to improve capability. The various training difficulties related to deep neural networks have been studied, and several practical means [12, 39] exist to tackle them. As a result, denoising Monte Carlo rendering outputs with a deep neural network is likely to meet the challenges of training convergence, the high dynamic range of the image data (both reported in Ref. [7]), and selection of auxiliary features as additional inputs to the network.

      3 Direct denoising using a deep residual network

In the following, we present the details of our deep learning based direct denoising approach, and the key design considerations that govern our network architecture and selection of auxiliary features.

      3.1 Filter-free direct denoising model

Most regression-based joint filtering methods for denoising Monte Carlo rendering outputs share a generic model which reconstructs the noise-free image by filtering the noisy input colors. They compute the filtered color $\hat{c}_i$ of pixel $i$ as a weighted sum of the colors of pixels in a neighborhood $N(i)$ centered at pixel $i$:

$$\hat{c}_i = \sum_{j \in N(i)} \omega_{i,j}\, c_j$$

where $\omega_{i,j}$ is the normalized contribution weight of color $c_j$ of pixel $j$ to the result; the exact expression of this weight is determined by the filter model used in a specific method.
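As a concrete illustration of this generic model, the following minimal NumPy sketch evaluates the weighted sum for one pixel. The Gaussian feature kernel used for $\omega_{i,j}$ is an illustrative assumption (it is not any specific published filter); all names and the `sigma_f` parameter are hypothetical.

```python
import numpy as np

def joint_filter_pixel(colors, features, i, neighborhood, sigma_f=0.1):
    """Generic joint-filtering model: the filtered color of pixel i is a
    normalized weighted sum of neighboring colors.
    colors       : (N, 3) array of noisy pixel colors
    features     : (N, K) array of auxiliary features (e.g., depth, normal)
    neighborhood : index array of the pixels in N(i)
    The Gaussian feature kernel below is one illustrative choice of
    omega_{i,j}; each concrete method defines its own weight."""
    diffs = features[neighborhood] - features[i]           # feature distances
    w = np.exp(-np.sum(diffs**2, axis=1) / (2 * sigma_f**2))
    w /= w.sum()                                           # normalize weights
    return w @ colors[neighborhood]                        # weighted color sum
```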

Xu and Pattanaik [41] propose one of the earliest joint filters for denoising Monte Carlo rendering output. Their method augments the bilateral filter using pre-filtered pixels for range filtering. The general strategy of joint filtering is to exploit the correlations between various auxiliary features and the color input [42, 43]. In all cases, both in the latest techniques [6, 25, 44] which apply error estimation to select filter parameters, and in the state-of-the-art regression-based approaches [4, 5], the joint filter variants can be expressed in the following form:

$$\hat{c}_i = \sum_{j \in N(i)} F(x_i, x_j; \theta_{i,j})\, c_j$$

where $x_i$ and $x_j$ are the inputs based on the selected auxiliary features, and $c_j$ is the color of pixel $j$; $\theta_{i,j}$ represents the filter-specific parameters of the kernel function $F(\cdot)$ defined by the method. Recent supervised learning techniques estimate the joint filter parameters [8] or even predict the per-pixel kernel function [7]. However, no matter how sophisticated the joint filter design is, its fundamental filtering formulation stays unchanged and is ultimately dependent on the noisy input colors.

This joint filtering approach unfortunately limits the solution space. One obvious potential consequence is the difficulty of producing good results when the color inputs are very noisy. This is especially common in high dynamic range Monte Carlo renderings, as the non-converged samples often exhibit high variance.

In order to take full advantage of deep learning (especially its unparalleled non-linear mapping capability), we propose to solve the Monte Carlo denoising problem by learning a direct mapping from the noisy rendered images to the corresponding high-quality noise-free results. Our filter-free direct denoising model is expressed as follows:

$$\hat{c}_i = G(\{x_j \mid j \in N(i)\})$$

where $G(\cdot)$ is the mapping to be learned by our method, $N(i)$ is the neighborhood centered at pixel $i$, and the input is $x_i$. In our method, the feature vector comprises:

$$x_i = (c_i, z_i, n_i, a_i)$$

with $c_i$ the color, $z_i$ the depth, $n_i(x, y)$ the view-space normal, and $a_i(r, g, b)$ the albedo of a noisy input pixel $i$. In order to learn such a challenging mapping, we need a neural network which is easy to train, and has the capacity to realize that highly non-linear mapping. In the next section, we present a network architecture which has a proven record of dealing with natural image inverse problems of a similar nature to our denoising problem. We then discuss the rationale behind the selection of features in Section 3.3.
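The per-pixel feature vector stacks to a 9-channel image. A minimal sketch of that stacking follows; the function and argument names are hypothetical, and all array shapes are assumptions consistent with the channel counts stated above.

```python
import numpy as np

def build_input(color, depth, normal_xy, albedo):
    """Stack the per-pixel feature vector x_i = (c_i, z_i, n_i, a_i) into a
    9-channel image: color (3), depth (1), view-space normal x/y (2),
    albedo (3)."""
    return np.concatenate([
        color,                 # c_i: noisy RGB radiance, H x W x 3
        depth[..., None],      # z_i: scene depth, H x W -> H x W x 1
        normal_xy,             # n_i(x, y): view-space normal, H x W x 2
        albedo,                # a_i(r, g, b): surface albedo, H x W x 3
    ], axis=-1)                # -> H x W x 9
```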

      3.2 Network architecture

We consider the key characteristics of our problem relevant to network architecture and training. The two primary concerns identified are as follows:

1. The network is expected to map from the noisy inputs directly to the corresponding noise-free results while exploiting the correlations between the auxiliary features and noisy color inputs.

2. Monte Carlo rendered image data have high dynamic range, which has the potential to cause instability during training.

The first concern indicates the need for a network capable of learning a complex mapping, and good at denoising-like tasks. The second concern, over stability during training, is due to the potentially large changes in inputs during training. Overall, we need a network which scales well with depth [31, 38] in terms of both learning capacity and stability of training.

We propose to use a deep residual learning (ResNet) [9] based architecture for our method, as shown in Fig. 2. This type of network has shown good performance on several inverse problems [37, 40, 45]. Deep residual learning, pioneered by He et al. [9], set several records in image recognition challenges. The depth of a ResNet can reach over one hundred convolutional layers with continued improvement in performance.

Fig. 2  Our network is based on the ResNet architecture [9]. We use 16 ResBlocks (basic ResNet building blocks). Various ResBlock variants exist [10, 37, 40]; experiments showed that our current choice (see Fig. 3) performs best for our application in terms of both training efficiency and quality of output. There are a total of 35 convolutional layers and 13,847,296 trainable parameters in our network.

The basic building block of ResNet is a small unit of stacked convolutional layers with a shortcut connection, shown as a ResBlock in Fig. 3. If we expect a ResBlock to learn a mapping $H(x)$ for an input $x$, the stacked layers inside are effectively learning a residual mapping $F(x) = H(x) - x$. This recasting makes certain mappings easier to learn. For example, for an identity mapping, i.e., $H(x) = x$, we have $F(x) = 0$, so the ResBlock has to learn nothing. An identity mapping can be hard to learn for an ordinary convolutional network.

In practice, this shortcut connection allows an upstream block to share its input data with any downstream block via a coordinated identity mapping if desired. This unique feature is the key which makes ResNet a powerful mapping learner, because data are free to flow across the network instead of strictly in a layer-by-layer fashion.

Fig. 3  Inside our ResBlock there are 2 convolutional layers, each having 128 filters of size 3×3. Each convolutional layer is followed by a batch normalization layer [12] to form a sub-unit, and a parametric rectifier [32] is sandwiched between these two sub-units.

The residual learning capability also makes ResNet an ideal candidate for the denoising task in a supervised learning setting, as it can focus on learning and mapping the differences between the noisy input and the corresponding ground truth at different scales. Each ResBlock includes a batch normalization layer [12] which improves training stability by suppressing the internal covariate shift [12] caused by large changes in inputs between layers. Although He et al. [10] propose an improved ResBlock, we found that the one proposed in Ref. [37], which includes the use of a parametric rectifier [32] as an activation unit, performed best in our experiments. We also include a network-wide skip connection [10, 37, 46] for added mapping flexibility.
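The ResBlock of Fig. 3 and the overall layout of Fig. 2 can be sketched as follows. This is a minimal PyTorch sketch under stated assumptions (the authors implemented their network in CNTK); the class names are hypothetical, and the exact layer count of the sketch may differ slightly from the paper's 35 convolutional layers.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """ResBlock of Fig. 3: two (conv -> batch-norm) sub-units with a
    parametric rectifier sandwiched between them, plus the identity
    shortcut connection."""
    def __init__(self, channels=128):
        super().__init__()
        # Bias is omitted in the convolutions: batch norm supplies its own.
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.prelu = nn.PReLU()
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        # Residual branch learns F(x) = H(x) - x; the shortcut adds x back.
        r = self.bn2(self.conv2(self.prelu(self.bn1(self.conv1(x)))))
        return x + r

class DirectDenoiser(nn.Module):
    """Sketch of the 16-ResBlock network of Fig. 2 with a network-wide
    skip connection; not a re-implementation of the authors' CNTK model."""
    def __init__(self, in_ch=9, width=128, blocks=16):
        super().__init__()
        self.head = nn.Conv2d(in_ch, width, 3, padding=1, bias=False)
        self.body = nn.Sequential(*[ResBlock(width) for _ in range(blocks)])
        self.tail = nn.Conv2d(width, 3, 3, padding=1)

    def forward(self, x):
        h = self.head(x)
        return self.tail(self.body(h) + h)   # network-wide skip connection
```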

      3.3 Auxiliary feature selection and preprocessing

In our filter-free model, there is no predefined filter which governs the choice of auxiliary features to be learned by the neural network, unlike in Ref. [8]. Bako et al. [7] use a 27-channel input to each of their filter pipelines, presumably to maximize the potential benefits of using more auxiliary features. However, the overall benefit of auxiliary feature inputs naturally depends on the learning capacity of the network, and we suspect it may be incapable of learning such a complicated mapping. We performed a simple experiment using the diffuse filter pipeline of KPCN [7]: the color, depth, and normal related auxiliary features were selected to form an 11-channel subset of their original input to train the same network. Figure 4 shows that the KPCN network trained with the subset input achieves a lower training loss; the 11-channel trained network delivers similar, and in a few cases better, denoising results, but we did not pursue this further. This simple experiment reflects the importance of evaluating a network's capacity relative to its input.

Fig. 4  Comparison of the KPCN (diffuse pipeline) [7] training loss for an 11-channel subset of the input (color, depth, and normal related only) and the original 27-channel input.

In order to verify the candidate auxiliary features we planned to use, we relied on a smaller 8-ResBlock ResNet (2 convolutional layers of 64 filters of size 3×3 per block) to evaluate their usefulness. Figure 5 shows the impact of selected combinations of auxiliary features on the L1 training loss. Our candidate auxiliary features, depth, view-space normals, and albedo, are shown to be useful, but albedo provides only marginal improvement to the loss. Furthermore, the inclusion of view-space normals and albedo seems to accelerate training convergence.

Fig. 5  Impact of auxiliary features on a smaller ResNet similar to our network.

From a practical point of view, the selection of auxiliary features should also consider ease of adoption, as mentioned in Ref. [5], so we chose the most common ones, which are readily available from most renderers. In a supervised learning setting, the ground truth images already provide much information, and we believe per-pixel statistical information may not be as useful to us as it is in regression-based methods. In addition, the feature extraction power of CNNs is well recognized [30, 47], so we chose to include only primary auxiliary features.

Lastly, we follow Ref. [7] in applying range compression via a logarithmic transform as a pre-processing step to our high dynamic range inputs, i.e., color and depth. This compression step improves the results in terms of smoothness. In short, our 9-channel input for training is as follows (a preprocessing sketch follows the list):

· noisy color (3 channels, range compressed),

· depth (1 channel, range compressed),

· view-space normals (2 channels),

· albedo (3 channels).
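The range compression step can be sketched in a few lines. Using log(1 + x) is an assumption on our part; the paper specifies only a logarithmic transform following Ref. [7], and the function name is hypothetical.

```python
import numpy as np

def range_compress(hdr):
    """Logarithmic range compression for the HDR channels (color and
    depth) before they are stacked into the 9-channel network input."""
    return np.log1p(np.clip(hdr, 0.0, None))   # log(1 + x), x clamped >= 0
```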

      4 Implementation and network training

In this section, we present details of training data preparation and implementation of our network model. We also discuss the impact of loss functions and concerns about the capacity of our network.

      4.1 Training data preparation

To the best of our knowledge, there is no publicly available training dataset dedicated to the Monte Carlo rendering denoising task. Knowing that the quality of the data has an important impact on the trained denoiser's performance, we invested considerable time in carefully preparing a reasonably large dataset which covers a wide range of object scales, shading, and distributed effects seen in most modern renderings. Figure 6 shows selected ground truth images from our dataset.

We imported assets at different scales, collected from various public resource archives (full credits are included in our Electronic Supplementary Material (ESM)), into Autodesk Maya to create our scenes. By applying different lighting (both analytical lights and image-based lighting), materials (mostly physically based materials), and cameras with different angles, aperture sizes, and focal lengths, the whole dataset was rendered with Pixar's non-commercial RenderMan RIS renderer. We authored over 50 different scenes covering a wide range of genres and scales, including natural and procedural objects, interiors, sci-fi, automobiles, street scenes, cityscapes, etc. As our resources were very limited, and this dataset was to be used to train both our own network and KPCN for comparison, we were cautious and attempted to produce a dataset diverse enough for supervised learning of neural networks with good generalization.

Fig. 6  Selected ground truth images from our dataset.

We rendered our ground truth images at 1024×1024 resolution with 2048 samples per pixel (spp) to achieve perceptually noise-free quality, with a few exceptions rendered at 512 and 8192 spp. The noisy counterparts were rendered at 8-32 spp according to the level of perceived noise, which often depended on the material response and lighting conditions rather than on a particular sampling rate. Figure 7 shows a few noisy training samples from our dataset.

There were in total 256 multichannel high resolution images in the final dataset. For training purposes, we further extracted small patches from them. We relied on the color variance channel and blue noise sampling for patch selection, so as to collect noisy data rather than the unhelpful smooth data which might be collected by sampling uniformly. We extracted 256 unique patches of size 64×64 from each image, giving a dataset of 65,536 multichannel image patches. For training efficiency, we also created network-model-specific datasets with the unused image channels removed and simple statistics precomputed, for overall improved I/O performance.
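The sketch below illustrates variance-guided patch selection. It is a simplified stand-in for the paper's color-variance plus blue-noise scheme (it draws patch centres with probability proportional to local variance, without the blue-noise spacing), and all names are hypothetical.

```python
import numpy as np

def sample_patch_centres(variance, n_patches=256, patch=64, rng=None):
    """Draw patch centres with probability proportional to the per-pixel
    color variance, so noisy regions are favoured over smooth ones.
    `variance` is an H x W map; returns centres of `patch`-sized patches."""
    rng = rng or np.random.default_rng()
    h, w = variance.shape
    half = patch // 2
    valid = variance[half:h - half, half:w - half]   # keep patches in-bounds
    p = valid.ravel() / valid.sum()                  # sampling probabilities
    idx = rng.choice(valid.size, size=n_patches, replace=False, p=p)
    ys, xs = np.unravel_index(idx, valid.shape)
    return ys + half, xs + half                      # patch centre coordinates
```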

Fig. 7  Selected noisy images rendered at different sample rates from our dataset.

      4.2 Model implementation and training

We implemented our 16-ResBlock ResNet (2 convolutional layers of 128 filters of size 3×3 per block; see Fig. 2) using the Python API of the Cognitive Toolkit (CNTK) [48, 49]. Training data were stored in OpenEXR high dynamic range image files, and we used the OpenImageIO [50] library to read and serve the image patches as NumPy [51] arrays to our CNTK network. Image patches were pre-shuffled to ensure a good mix of patches from different scenes in each mini-batch. In addition, 660 image patches were randomly selected and reserved for in-training validation use.

All weight parameters were initialized with the He initializer [32], which is designed to be used together with parametric ReLU activation units. As the batch normalization [12] units have built-in bias parameters, there was no need to include bias in the convolutional layers. Our network was optimized using the ADAM [52] optimizer available in CNTK, with momentum set at 0.9 and the gradient clipping threshold set at 1.0. L1 loss was used as the loss function for our final network (the choice of loss function is discussed in the next section). Training used a mini-batch size of 10, and ran for 10⁶ iterations. The corresponding learning rate schedule was as follows (a training-loop sketch follows the list):


· 0.01 for the first 1000 iterations,

· 0.001 for the second 1000 iterations,

· 10⁻⁴ for the rest.
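The training setup above can be sketched as a loop. This is a PyTorch sketch under stated assumptions (the authors trained in CNTK): it reuses the hypothetical `DirectDenoiser` from Section 3.2, `loader` stands for an assumed iterator over mini-batches of 10 patch pairs, and gradient-norm clipping is one way to realize the stated clip threshold of 1.0.

```python
import torch

def lr_at(iteration):
    """Learning-rate schedule from Section 4.2."""
    if iteration < 1000:
        return 1e-2        # first 1000 iterations
    if iteration < 2000:
        return 1e-3        # second 1000 iterations
    return 1e-4            # remainder of the 10^6 iterations

model = DirectDenoiser()                    # sketch network from Section 3.2
opt = torch.optim.Adam(model.parameters(), lr=1e-2, betas=(0.9, 0.999))
loss_fn = torch.nn.L1Loss()                 # L1 loss, as used for the final network

for it, (noisy9ch, reference) in enumerate(loader):   # `loader` is assumed
    for g in opt.param_groups:
        g["lr"] = lr_at(it)                 # apply the piecewise schedule
    opt.zero_grad()
    loss = loss_fn(model(noisy9ch), reference)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip at 1.0
    opt.step()
```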

Our network has a total of 13,847,296 trainable parameters, and training took less than 36 hours to complete on an NVIDIA GeForce GTX 1080 Ti GPU (98% GPU load). Figure 8 shows the L1 loss and errors during the training session. The training loss converged quickly, with stable progression. The validation error also decreased steadily, without any sign of over-fitting.

      4.3 Loss functions

Fig. 8  Training loss (top), and training and validation errors (bottom), for our 16-ResBlock network during a training session with 10⁶ iterations.

In many recent CNN-based denoising applications [7, 53, 54], the L1 loss function has been found to be a consistently good performer. It is inexpensive to compute, and often surpasses metric-specific loss functions such as MSE loss. Before training the final network, we evaluated combinations of potentially useful loss functions for our denoising task: the L1 and L2 loss functions, and also the VGG-network [38] based perceptual loss [55]. The VGG perceptual loss is expensive, as it requires inference through the VGG network and the averaged sum of multiple feature maps. Although this perceptual loss function is known to improve sharpness in some inverse problems when coupled with MSE loss, our experiments showed that L1 loss remains the best loss function, especially if cost effectiveness or fast convergence is a concern.

      4.4 Capacity of our network

One common phenomenon related to deep neural networks is called diminishing feature reuse [56]: some parts of a deep network end up not learning anything, which can be understood as a result of overcapacity. ResNet is especially prone to this issue because the shortcut connections let gradient information flow freely to any block during training. Huang et al. [57] propose a stochastic training strategy which randomly shuts down different layers to form a virtually shallower network during training. Zagoruyko and Komodakis [58] propose a widened (more filters per layer) version of ResNet with reduced depth. Some wide residual networks have been reported to perform better than deep ones for some applications [58]. As a result, we built a shallower but wider 8-ResBlock version of our network (with 256 filters of size 3×3 in each convolutional layer). We trained this wide version using the same training setup as for the 16-ResBlock network, and Fig. 9 shows the corresponding progression of training loss and errors.

This wide version has a total of 19,592,704 trainable parameters, and its 10⁶ training iterations took approximately 52 hours to complete on an NVIDIA GeForce GTX 1080 Ti GPU. This wide ResNet has considerably more trainable parameters, but only achieved similar training loss and error to our proposed 16-ResBlock ResNet, while its denoising performance was consistently inferior to the 16-ResBlock version in our tests. This suggests that the proposed 16-ResBlock ResNet makes proper use of its capacity, and that, at least empirically, greater depth allows a more sophisticated mapping.

Fig. 9  Training loss (top) and training error (bottom) of an alternative wide 8-ResBlock network during a training session with 10⁶ iterations.

      5 Results and evaluation

      5.1 Outline

We now compare the results of our filter-free direct denoising neural network with current state-of-the-art denoisers based on joint filtering and learning-based approaches. We compare with the denoisers NFOR [5] and KPCN [7] using low sample rate noisy images rendered from scenes curated by Ref. [11]; they have diverse lighting, materials, and levels of detail.

For conciseness, we report only the SSIM [59] and relative MSE [44] values in the figures. A fuller report with additional quality metrics is available in the ESM, and we recommend close inspection of the full-resolution images available there.
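For reference, the relative MSE metric of Ref. [44] normalizes the squared error by the squared reference, de-emphasizing errors in bright HDR regions. A minimal sketch follows; the epsilon guard value is an assumption, and the function name is hypothetical.

```python
import numpy as np

def relative_mse(img, ref, eps=1e-2):
    """Relative MSE [44]: per-pixel squared error normalized by the
    squared reference intensity (eps avoids division by zero).
    Tables 1 and 2 report this value scaled by 10^3."""
    return np.mean((img - ref) ** 2 / (ref ** 2 + eps))
```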

The learning-based denoiser KPCN [7] requires some additional preparation. We followed the details in the authors' paper and source code to re-implement and train their network model using CNTK [49]. We pre-processed image patches from our dataset to produce the 27-channel data inputs required by each of the diffuse and specular KPCN filter pipelines, in order to minimize data processing during training. We trained the networks with an NVIDIA GeForce GTX 1080 Ti; training each filter pipeline took approximately 14 hours over 750k iterations, while the 250k iterations of joint fine-tuning took another 10 hours. For the NFOR denoiser, we used the publicly available implementation provided by the Tungsten software package. We now discuss the results.

5.2 Kitchen (close-up) scene comparison (16 spp, dynamic range 0.0-4.0)

Figure 10 shows the denoising results for a scene populated with objects of different glossiness. NFOR (see Fig. 10(b)) denoises the kitchen countertop reasonably but leaves some splotches. There are also very subtle splotches left on the stainless steel wall panel, but the overall denoising result is clean. KPCN (see Fig. 10(c)) removes most noise on the countertop, but with noticeable smear marks, and the texture of the countertop is not well recovered. Its denoising result for the stainless steel panel is unsatisfactory, and the silhouette of the kettle has room for improvement. Our method (see Fig. 10(d)) denoises both the countertop and the stainless steel panel with good results. The silhouette of the kettle is very sharp, and the shading on the stainless steel panel closely matches the reference.

Fig. 10  Kitchen (close-up). Input (16 spp) and reference (2048 spp) images rendered by Tungsten. Quality metrics refer to the top row of images. RelMSE (×10⁻³).

5.3 Bedroom scene comparison (16 spp, dynamic range 0.0-16.49)

Figure 11 shows denoising results for a rather low sampling rate input. The high intensity and directional light setup is challenging for a regular path tracer, and the input is seemingly under-sampled. NFOR (see Fig. 11(b)) recovers the corrupted ceiling lamp nicely, but it leaves visually distracting splotches on most diffuse surfaces, and the black details on the decorative plant are also softened. The splotches could be a consequence of the relatively high local sensitivity of a first-order method. KPCN (see Fig. 11(c)) successfully denoises most diffuse surfaces, but it has difficulty in recovering the ceiling lamp, with some artifacts on the silhouette. We suspect this could be caused by the diffuse/specular decomposed pipeline, or by the occasional inability of KPCN to generalize, as reported in the original paper (retraining is required for improvement); we return to this in detail later. Our method (see Fig. 11(d)) denoises most diffuse surfaces properly, and recovers the ceiling lamp with the best results of any denoiser.

Fig. 11  Bedroom scene. Input (16 spp) and reference (2048 spp) images rendered by Tungsten. Quality metrics refer to the top row images. RelMSE (×10⁻³).

5.4 Car scene comparison (32 spp, dynamic range 0.0-3.60)

Figure 12 shows denoising results for a scene with a depth of field effect, which requires a high sample rate to obtain noise-free results. NFOR (see Fig. 12(b)) denoises the out-of-focus area, but the reconstruction looks somewhat splotchy and the smoothness could be improved. We note that although the floor should not be difficult to denoise, NFOR leaves some visually distracting splotches on it, and it seems that all image metrics fail to capture these artifacts. KPCN (see Fig. 12(c)) performs unsatisfactorily in this test even though it was trained on the same dataset; we suspect this could be related to the choice of providing feature information such as depth in gradient form, which requires the network to learn to re-integrate the gradients in order to extract the correct relative difference in depth between distant pixels. Ours (Fig. 12(d)) shows that the network has successfully learned from the dataset how to map noisy depth-of-field pixels to their noise-free counterparts, and its result is rated highest quantitatively.

Fig. 12  Car scene. Input (32 spp) and reference (8192 spp) images rendered by Tungsten. Quality metrics refer to the top row images. RelMSE (×10⁻³).

5.5 Hair scene comparison (32 spp, dynamic range 0.0-1.06)

Figure 13 shows denoising results for a fairly challenging scene. Hair and fur are difficult objects to sample and render, as their naturally high frequency details exhibit complicated noise patterns when under-sampled spatially or in terms of shading. NFOR (see Fig. 13(b)) seems to smooth everything aggressively. KPCN (see Fig. 13(c)) attempts to maintain fine features, but the residual shading noise gives an impression of incomplete filtering. Our method (see Fig. 13(d)) removes the shading noise more successfully while maintaining a reasonable amount of fine detail, but the result is not particularly impressive even though it is judged best by the quality metrics. We have only a few hair-related images in our training dataset, and it seems denoising such fine objects might require more specialized training. Such images remain a challenge for most denoisers.

5.6 Classroom scene comparison (32 spp, dynamic range 0.0-36.34)

Figure 14 shows denoising results for another challenging scene. The lighting conditions are similar to those for the bedroom scene (see Fig. 11), but it has an even higher dynamic range, more details in the dark areas, and glossy materials on thin objects (the chair frames). NFOR (see Fig. 14(b)) follows its own pattern of leaving distracting splotches on the diffuse walls and ceiling. For this scene, it fails to handle dark areas corrupted by outliers, and leaves unpleasant artifacts in those areas. NFOR handles the noise on the chair frames properly, but there are still subtle splotches on them. KPCN (see Fig. 14(c)) denoises the diffuse areas properly, but in the dark areas corrupted by high intensity outliers it leaves some unexpected edges. As for the ceiling lamp in the bedroom scene (see Fig. 11), KPCN has difficulty in maintaining a smooth boundary between glossy and diffuse areas, and leaves an impression of aliasing. Our method (see Fig. 14(d)) denoises consistently well in both bright and dark areas, and the glossy shading on the chair frames is smoother than the reference, which still exhibits residual noise. Our denoiser works remarkably well in this scene.

Fig. 13  Hair scene. Input (32 spp) and reference (2048 spp) images rendered by Tungsten. Quality metrics refer to the top row images. RelMSE (×10⁻³).

Fig. 14  Classroom scene. Input (32 spp) and reference (2048 spp) images rendered by Tungsten. Quality metrics refer to the top row images. RelMSE (×10⁻³).

      5.7 Overall evaluation

In all the tests shown, our direct denoising network consistently outperformed the other two state-of-the-art solutions quantitatively. Additional test results can be found in the ESM; Tables 1 and 2 respectively summarize the SSIM and relative MSE results for the complete set of tests. NFOR performs better in a few scenes when judged by the pixel-space MSE metric, but in those cases its results have obvious splotches on diffuse surfaces, and such artifacts are not captured by most pixel-space metrics. We have included all results at full resolution in the ESM, allowing closer visual inspection.

      Table 1 SSIM results

Table 2  Relative MSE results (×10⁻³)

      6 Discussion

6.1 Key findings

The denoising results in the previous section prove the competitiveness of our filter-free direct denoising network as a practical solution for denoising Monte Carlo rendering output. We must emphasize that the quality of most supervised learning methods relies heavily on the quality of the training dataset, and we are pleased to see that our dataset helps to achieve very competitive denoising results. The unique architecture of ResNet [9] enables sophisticated mapping possibilities through the identity mapping. The freedom to reuse upstream features repeatedly is very similar to many multi-scale algorithms, e.g., multigrid [60], as mentioned in the original paper [9]. This is also the main reason why we chose ResNet as our network solution. In addition, the choice of using a small set of primary auxiliary features and letting the network explore the solutions itself, without overloading it, seems to be a good strategy.

      6.2 Limitations

A fundamental limitation of all supervised learning methods is connected to the coverage of the training set. Our approach relies on the samples in the training set to establish its non-linear mapping. For any cases that are not included in the training set, there is a potential that our network may fail to deliver the expected results. Our training set has no samples of fine objects such as hair on a blank background (zeros in all inputs). We used such an untrained case to test both our network and KPCN (both trained with our dataset). Figure 15 shows the denoising artifacts arising in both methods: KPCN (see Fig. 15(b)) blurs all fine hairs on an empty background, while ours (see Fig. 15(c)) shows sparsely colored pixels. A common solution is to retrain the networks with additional desired training samples.

6.3 Joint filtering versus direct denoising

Fig. 15  Denoising images not covered by our training set. Our training set has no example pairs for fine objects against a background of blank pixels (zeros in color and auxiliary features). Both KPCN and our method show artifacts, which highlight the underlying differences between these methods.

Carefully handcrafted joint filters, such as the state-of-the-art denoiser NFOR [5], can handle many denoising cases with outstanding results, but the underlying rigid regression-based formulation cannot adapt well to extremely noisy inputs. In order to improve these handcrafted models, researchers need to explore further potential causes of noise or hidden correlations. In contrast, a direct denoising method based on a neural network relies on learning from training samples to establish powerful non-linear mappings. There is no difficulty in obtaining many noise-free images for training in most production studios, and a deep learning method seems to be a more natural choice. If a direct denoising network faces a new challenge, it can be improved quickly by learning from additional examples in a matter of hours. KPCN [7] can be classified as a hybrid method, and its current formulation seems to have inherited the disadvantages of both joint filters and direct denoising, i.e., the solution space is limited to the noisy input colors, and there is a dependency on training set coverage. We observe that the choice of separating diffuse and specular components in KPCN, as in Ref. [61], might not be a good decision. The original idea of such decomposition was to facilitate an analytical approach to handling specular paths in a light transport setting. The specular-only solution space for joint filtering can sometimes be very sparse, making good filtering even more difficult.

      6.4 Runtime performance

To process a 1024×1024 image, a non-optimized implementation of our network running on an NVIDIA GTX 1080 Ti takes approximately 18 s. Our implementation of KPCN is not optimal, and we believe a competent implementation should take a similar or lower time than our method. The open source implementation of NFOR takes 89 s to denoise the same image on an Intel Xeon E5-2683 v3 CPU.

      7 Conclusions and future work

We have presented a filter-free direct denoising network solution for processing noisy Monte Carlo rendering output. Our ResNet [9] based network is able to establish a sophisticated mapping through supervised learning. Our network generalizes very well, and is able to deliver high-quality denoising results for noisy images rendered by a different renderer.

Temporal stability is the first subject we want to explore next, along with the possibility of including denoising level control, which is often desirable in a production environment.

      Acknowledgements

This work was supported by the Research Grants Council of the Hong Kong Special Administrative Region, under RGC General Research Fund (Project No. CUHK14217516).

Electronic Supplementary Material  Supplementary material is available in the online version of this article at https://doi.org/10.1007/s41095-019-0142-3.

Open Access  This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.
