      A Local Quadratic Embedding Learning Algorithm and Applications for Soft Sensing

Engineering, 2022, Issue 11

      Yaoyao Bao, Yuanming Zhu*, Feng Qian*

      Key Laboratory of Smart Manufacturing in Energy Chemical Process, Ministry of Education, East China University of Science and Technology, Shanghai 200237, China

Keywords: Local quadratic embedding; Metric learning; Regression machine; Soft sensor

Abstract: Inspired by the tremendous achievements of meta-learning in various fields, this paper proposes the local quadratic embedding learning (LQEL) algorithm for regression problems based on metric learning and neural networks (NNs). First, Mahalanobis metric learning is improved by optimizing the global consistency of the metrics between instances in the input and output space. Then, we further prove that the improved metric learning problem is equivalent to a convex programming problem by relaxing the constraints. Based on the hypothesis of local quadratic interpolation, the algorithm introduces two lightweight NNs; one is used to learn the coefficient matrix in the local quadratic model, and the other is implemented for weight assignment for the prediction results obtained from different local neighbors. Finally, the two sub-models are embedded in a unified regression framework, and the parameters are learned by means of a stochastic gradient descent (SGD) algorithm. The proposed algorithm can make full use of the information implied in target labels to find more reliable reference instances. Moreover, it prevents the model degradation caused by sensor drift and unmeasurable variables by modeling variable differences with the LQEL algorithm. Simulation results on multiple benchmark datasets and two practical industrial applications show that the proposed method outperforms several popular regression methods.

      1. Introduction

In the cement production process, it is essential to monitor the quality of products, such as the fineness of raw meal, the free calcium oxide content of clinkers, and so forth. However, online instrumentation for these indicators is costly and requires frequent regular maintenance. In industrial practice, off-line analysis in the lab is often implemented for these indexes every 2 h or more, which results in untimely feedback for real-time control systems. These problems can be solved by soft-sensing techniques [1,2].

Soft-sensing models originate from multivariate statistical regression models, including linear regression (LR), principal component regression (PCR), partial least squares (PLS), and some variants with regularization strategies to balance the empirical error and complexity of the model, such as the least absolute shrinkage and selection operator (LASSO) and ridge regression [7]. Kernel strategies have been extensively studied and combined with the aforementioned algorithms to solve nonlinear regression problems [8,9]. Subsequently, machine learning methods such as k-nearest neighbor regression (k-NNR) [10], classification and regression trees (CARTs) [11,12], and support vector regression (SVR) [13,14] have been proposed for knowledge mining in massive data. To improve the performance of a single tree model, bagging strategies are implemented in random forest (RF) algorithms [15,16]. Similarly, the prediction accuracy of boosting algorithms can be increased by combining a series of iteratively learned weak machines [17,18], such as gradient boosting machines (GBMs) and extreme gradient boosting (XGBoost). Furthermore, breakthroughs in deep learning in image and speech recognition have made neural networks (NNs) [19,20] one of the most popular methods in the field of machine learning, especially when data samples are plentiful. This popularity can be attributed to NNs' powerful feature extraction capabilities with specially designed structures [21].

Among these algorithms, k-NNR is the simplest and one of the most prevalent regression methods. It is widely used in machine learning problems because it does not require an explicit model structure or any prior knowledge of the data distribution. However, the strategy of using the average output of its k-nearest neighbors (k-NNs) as the prediction result also leads to this method's greatest disadvantages. Initially, the k-NNR algorithm employed the Euclidean distance metric for the measurement of sample similarities. However, the magnitudes of the input features can vary greatly; redundancies and correlations between variables can also be misleading, resulting in an impractical distance metric. To cope with this problem, a generalization of the Mahalanobis distance [22] was proposed, which is equivalent to a weighted Euclidean distance between two linearly projected images. However, in practical applications, the input features tend to have distinct contributions to the output variables. The key is to develop a reliable feature extraction model and apply classical metrics, such as the Euclidean distance and cosine similarity, to the mapped features. Locally linear embedding (LLE) reconstructs the samples in a low-dimensional space using the locally linear weighting method and achieves dimension reduction by minimizing the reconstruction error [23]. Nevertheless, the adjacency relation constructed by the classical Euclidean metric in a high-dimensional space cannot meet the needs of all classification tasks. Thus, researchers usually try to transform the input features into a scaled space [24,25] and to obtain the weight coefficients to predict the label by means of local reconstruction in that space. However, this method is very dependent on an elegant design of the transformation model. For example, in a fuzzy transformation, the basis function and the division of fuzzy intervals may have a great influence on the prediction result, because the meaningful information contained in the output labels is not made full use of. To address this issue, Weinberger and Saul [26] introduced the concept of Mahalanobis distance metric learning, which allows the inverse covariance matrix in the Mahalanobis distance to be any positive semidefinite matrix. Similar to the idea of linear discriminant analysis (LDA) [27], the Mahalanobis distance metric is learned by maximizing the ratio of the average between-class distance to the average within-class distance. Xing et al. [28] constructed a convex optimization problem for metric learning by taking the average between-class distance as the optimization target and the average within-class distance as the constraint. This method has been applied to semi-supervised data clustering problems.

The above methods are mainly designed for classification problems. For regression problems, Nguyen et al. [27] established a convex optimization problem by maximizing the consistency of the input and output distances over a set of constraint triplets in the neighborhood of each instance. However, the researchers did not elaborate the solution for the transformation matrix A in metric learning; the weight matrix W is optimized only under the condition of a given transformation matrix A. Moreover, the tradeoff parameter C tends to have a significant impact on the performance of the algorithm. Linear metric learning (LML) has limited power in feature representation, especially for high-dimensional samples such as image and text data. Deep metric learning (DML) uses deep neural network (DNN) models instead of linear transformations to extract features in order to achieve metric learning [29-31]. One of the greatest differences between LML and DML lies in the form of the loss function. For example, Song et al. [30] minimized the distances between samples from the same class and maximized the distances, with a margin, between samples from different classes. In general, these methods involve the construction of triplet sets, which consist of an anchor, a positive point, and a negative point. This implies that the methods cannot be directly applied to regression problems.

In addition, using the average of the k-NNs as the output prediction often yields conservative results. Take the wine quality assessment dataset in the University of California, Irvine (UCI) Machine Learning Repository as an example: The k-NNR algorithm does not distinguish well between particularly high-grade and inferior wines. So, how does a human operator predict the label? First, the operator identifies the cases most similar to the current sample in the historical data as references, and then modifies the reference labels according to the change in the input features. We summarize this process and propose the local quadratic embedding learning (LQEL) algorithm. However, the coefficient matrix of the quadratic embedding function is difficult to obtain. Fortunately, the matrix depends on the location of the expansion point, that is, the current sample mentioned above. Thus, the coefficient matrix can be estimated by NNs, taking the current sample as the input. However, an appropriate network scale must be determined; otherwise, the model becomes over-fitted. To this end, ensemble methods that integrate multiple NNs have been utilized to improve the generalization ability of NN models [20,32]. The literature shows that standardizing the output of the hidden layer in the network by batch normalization (BN) can prevent distribution changes during the training process [33], which accelerates the convergence of networks. It has also been pointed out that the dropout strategy can improve the generalization ability of an NN [34]. Moreover, superimposing Gaussian noise of a certain intensity on sample data can increase the number of training samples and thus improve the robustness of the model [35]. In general, these approaches improve the generalization of NNs in two ways. First, they increase the number of training samples; second, they add constraints to the network structure, reduce its complexity, and thus improve the network's predictive ability. This paper follows the latter route.

In this paper, metric learning is first accomplished to determine the neighborhood of a certain instance by maximizing the consistency of the distances between the input and output spaces. This makes full use of the information contained in the target labels and achieves the first step of the operators' strategy. Then, a local quadratic coefficient matrix is generated by a well-trained NN to make predictions based on neighboring references; this prevents the model degradation caused by sensor drift and unmeasured variables by means of the differential compensation method. Furthermore, another NN assigns weights to the predictions provided by different neighbors according to their confidence, which achieves a balance between the prediction errors and measurement noises, thereby minimizing the prediction errors. The parameters of these two networks can be optimized by end-to-end training with stochastic gradient descent (SGD) algorithms. Empirical studies on several regression datasets, including two practical industrial datasets from the cement production process and the hydrocracking process, show that, in most cases, the proposed method outperforms popular regression methods.

The rest of this paper is organized as follows. In Section 2, a metric learning model is introduced, and the optimization problem is proved to be equivalent to a convex optimization problem. In Section 3, the framework of the proposed LQEL is presented. In Section 4, several empirical studies, including a validation using actual industrial cases, are reported. The conclusions and contributions of this paper are summarized in Section 5.

      2. Metric learning

A metric distance is a function $d: X \times X \to \mathbb{R}_0^+$ that satisfies the following, for any $x^{(i)}, x^{(j)}, x^{(k)} \in X$: (1) non-negativity, $d(x^{(i)}, x^{(j)}) \geq 0$, with equality if and only if $x^{(i)} = x^{(j)}$; (2) symmetry, $d(x^{(i)}, x^{(j)}) = d(x^{(j)}, x^{(i)})$; and (3) the triangle inequality, $d(x^{(i)}, x^{(k)}) \leq d(x^{(i)}, x^{(j)}) + d(x^{(j)}, x^{(k)})$.

Mahalanobis metric learning (MML) considers distances of the form

$d_M(u, v) = \sqrt{(u - v)^{\top} M (u - v)}$

where M is a positive definite metric matrix to be learned, and u and v are two different instances. The objective of MML is to obtain the optimal matrix M that meets the purpose of metric learning.
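As a concrete illustration (a minimal sketch, not the authors' code), factorizing $M = L^{\top} L$ guarantees positive semidefiniteness and shows that the learned metric is simply a Euclidean distance between the linear projections $Lu$ and $Lv$:

```python
import numpy as np

def mahalanobis_distance(u, v, L):
    """d_M(u, v) = sqrt((u - v)^T M (u - v)) with M = L^T L.

    Because M = L^T L, the metric equals the ordinary Euclidean
    distance between the projected points L @ u and L @ v.
    """
    diff = L @ (u - v)
    return float(np.sqrt(diff @ diff))

# Toy usage: a metric that doubles the weight of the first coordinate.
L = np.array([[2.0, 0.0],
              [0.0, 1.0]])
u = np.array([1.0, 0.0])
v = np.array([0.0, 0.0])
print(mahalanobis_distance(u, v, L))  # 2.0 (plain Euclidean gives 1.0)
```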

We hope to use the information implied in the output labels to guide the direction of metric learning. The basic principle is that similar input samples lead to similar target labels. The consistency of the distances between the input and output spaces can, from a statistical point of view, be described with the Pearson correlation coefficient. Therefore, the optimization problem is formed as follows:
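With $d_M^{(ij)} = d_M(x^{(i)}, x^{(j)})$ denoting the learned input-space distance and $d_Y^{(ij)} = \lVert y^{(i)} - y^{(j)} \rVert$ the output-space distance, a formulation in this spirit (a sketch consistent with the description above; the published problem additionally relaxes the constraints to obtain convexity) is

$$\max_{M \succeq 0} \; \rho\left(d_M, d_Y\right) = \frac{\sum_{i<j} \left(d_M^{(ij)} - \bar{d}_M\right)\left(d_Y^{(ij)} - \bar{d}_Y\right)}{\sqrt{\sum_{i<j} \left(d_M^{(ij)} - \bar{d}_M\right)^2}\,\sqrt{\sum_{i<j} \left(d_Y^{(ij)} - \bar{d}_Y\right)^2}}$$

where $\bar{d}_M$ and $\bar{d}_Y$ are the means of the respective pairwise distances.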

      3. Local quadratic embedding learning

The scheme of the LQEL algorithm is shown in Fig. 1. To obtain the output label corresponding to sample x, the k-NNs are first determined using the conclusion of the metric learning in Section 2 (the ellipse on the left of the figure). Suppose a function $F: \delta x \to \delta y$ is learned to describe the mapping from the difference of inputs to the difference of outputs in the two spaces. Then, for each sample in the neighborhood of a reference point $x_0$, the local quadratic hypothesis yields

      Fig. 1. The scheme of LQEL.

$\delta y \approx W \delta x, \quad \forall x \in U_{\delta}(x_0) \qquad (10)$

where $U_{\delta}(x_0)$ represents the δ neighborhood of $x_0$ in the metric space defined in Section 2, and $W \equiv (x_0)^{\top} A_{x_0} + B_{x_0}$ is the weight coefficient matrix of the linear mapping function.

The result of Eq. (10) implies that a linear model can be designed for prediction in $x_0$'s δ neighborhood. The matrix W expanded at different reference points can be estimated by an independent NN, for example, using an NN $N: X \to X$ to approximate the matrix as $N(x_0) = W$. Considering that the parameter matrices $\nabla^2 g(x_0)$ and $\nabla g(x_0)$ tend to be more stable than $g(x_0)$ in most practical circumstances, the NN required here can be much simpler than one used to estimate the output label directly. In particular, when g is a quadratic function, the matrices $A_{x_0}$ and $B_{x_0}$ do not change with the reference point. In this case, a simple linear NN works well. In general, these procedures can effectively reduce the complexity of the model and improve the generalization.
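A minimal PyTorch sketch of this first sub-network follows; the layer sizes and the choice of the query sample as the network input are assumptions for illustration, since the paper does not publish its exact architecture:

```python
import torch
import torch.nn as nn

class CoefficientNet(nn.Module):
    """First LQEL sub-network: maps the current sample x to the
    coefficient vector W of the local linear model delta_y ~= W . delta_x."""

    def __init__(self, n_features, n_hidden=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, n_features),  # one coefficient per feature
        )

    def forward(self, x):
        return self.net(x)

def local_prediction(coef_net, x, x_nb, y_nb):
    """Predict y at x from a single neighbor (x_nb, y_nb) via Eq. (10)."""
    w = coef_net(x)                          # (batch, n_features)
    return y_nb + (w * (x - x_nb)).sum(-1)   # (batch,)
```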

This strategy provides k estimation results for each instance, one from each nearest neighbor, but their reliabilities can vary considerably. From an intuitive perspective, the predictions given by distant neighbors tend to have high uncertainty. This implies that different weights should be assigned to each of the predictions. Prediction uncertainties caused by the presence of measurement noise can be restrained by the averaging method. Inspired by this idea, we intend to design a machine that generates different weights according to the relative location of the instance, which minimizes the expectation of the mean square error (MSE).
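The weighting machine can be sketched as a second small network that scores each neighbor from the displacement $x - x_{nb}$ and normalizes the scores; the softmax normalization and layer sizes here are assumptions, not the published design:

```python
import torch
import torch.nn as nn

class WeightNet(nn.Module):
    """Second LQEL sub-network: assigns a confidence weight to each
    neighbor's prediction based on the relative location x - x_nb."""

    def __init__(self, n_features, n_hidden=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, 1),
        )

    def forward(self, deltas):
        # deltas: (batch, k, n_features), one row per nearest neighbor
        scores = self.net(deltas).squeeze(-1)   # (batch, k)
        return torch.softmax(scores, dim=-1)    # weights sum to 1

def lqel_predict(coef_net, weight_net, x, x_nbs, y_nbs):
    """Combine the k per-neighbor predictions with the learned weights."""
    deltas = x.unsqueeze(1) - x_nbs             # (batch, k, n_features)
    w = coef_net(x).unsqueeze(1)                # (batch, 1, n_features)
    preds = y_nbs + (w * deltas).sum(-1)        # (batch, k)
    weights = weight_net(deltas)                # (batch, k)
    return (weights * preds).sum(-1)            # (batch,)
```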

In this paper, we introduce state-of-the-art strategies for NNs, such as BN and dropout. The MSE is employed as the loss function. The parameters of the proposed model, including the weights and biases in the two NNs, are optimized by the SGD algorithm.
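Building on the two sketches above, the end-to-end training can be outlined roughly as follows (full-batch gradient descent is used for brevity; the neighbor indices nbr_idx are assumed to be precomputed with the Section 2 metric):

```python
import torch
import torch.nn as nn
import torch.optim as optim

def train_lqel(coef_net, weight_net, X, Y, nbr_idx, epochs=100, lr=1e-2):
    """Joint SGD training of both sub-networks with an MSE loss.

    X: (n, d) float tensor; Y: (n,) float tensor;
    nbr_idx: (n, k) long tensor of nearest-neighbor indices.
    """
    params = list(coef_net.parameters()) + list(weight_net.parameters())
    opt = optim.SGD(params, lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        y_hat = lqel_predict(coef_net, weight_net,
                             X, X[nbr_idx], Y[nbr_idx])
        loss = loss_fn(y_hat, Y)
        loss.backward()   # gradients flow through both sub-networks
        opt.step()
    return coef_net, weight_net
```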

4. Empirical studies

      Fig. 2. The model structure of the proposed method.

To evaluate how well the proposed algorithm works, we use real-world benchmark regression datasets along with two practical industrial datasets for verification. A series of classical approaches are briefly introduced for the purpose of comparison with the proposed method. Finally, the experimental results are reported in tables and figures.

      4.1. Descriptions of datasets

      4.1.1. Benchmark datasets

The details of the datasets [37-39] are shown in Table 1. For example, the red wine dataset shown in the first line contains 1599 samples. Each record contains 12 feature variables and a target label to be predicted. The objective is to establish a mathematical model to evaluate red wine quality through color, composition, and so forth. In this case, the quality of red wine is divided into nine grades from high to low, and only samples between the third and eighth grades are included in the dataset.

      4.1.2. Powder fineness dataset

The aim of the first practical industrial application is to make online predictions of powder fineness in the raw meal preparation process. The details of this technological process are presented in Fig. 3. In the raw meal preparation process, raw materials that consist of three or four minerals are transported onto the center of the grinding table. The materials are continuously pushed outward across the rotating grinding table by centrifugal force. Rocks are crushed into small particles by the squeezing of the grinding rollers and the grinding table before leaving the grinding disk. When high-speed hot wind enters the mill from the bottom, finer particles are blown into the chamber, while larger particles fall to the bottom and are transported back to the entrance of the mill by a bucket elevator. High-speed airflow driven by an induced draft fan brings those finer particles into a high-efficiency dynamic classifier, where unqualified particles fall back to the mill table along the cone and are reground. Fine products gathered from the cyclones and the electric dust collector are finally transported into a homogenization silo for storage.

      Table 1 Details of the datasets used in this paper.

      Fig. 3. Process flow chart of the raw meal preparation process.

The most important indicator of this process is the fineness of the product, which further influences the product quality and energy consumption of the subsequent calcination process. However, samples are collected and analyzed only every 2 h due to the limited capacity for manual analysis in the lab, resulting in time lags for real-time process control and, in turn, fluctuations in raw meal fineness. Therefore, the aim is to estimate the powder fineness in real time with other available and relevant online variables, that is, to achieve soft sensing for raw meal fineness.

All of the variables that may affect or represent the fineness are considered to be auxiliary variables. These include the current of the draft fan, the current of the classifier, the current of the driven motor, the current of the bucket elevator that transports the product, the current of the bucket elevator that transports the rejected slags, the differential pressure, the inlet temperature, the outlet temperature, the feed quantity, and so forth. In general, an 80 μm sieve residue and a 200 μm sieve residue are considered to be the indicators of raw meal fineness, with the former being more sensitive. Therefore, the dataset is constructed with 14 auxiliary variables and one output label, with a total of 959 instances (about 4 months).

      4.1.3. Hydrocracking process dataset

The simplified flow diagram of a typical hydrocracking process is shown in Fig. 4. The feedstock is mixed with externally supplied hydrogen; the mixture is heated to a specified temperature and then enters the two cascade reactors. The first reactor is loaded with a hydrotreating catalyst to remove most of the sulfur and nitrogen, as well as some heavy metal compounds. The second reactor, where the cracking reaction is completed, is loaded with a hydrocracking catalyst. In these reactors, low-temperature hydrogen is directly added to absorb the heat released by the exothermic reaction and maintain a stable temperature. The reaction product passes through a high-pressure separator to recycle unreacted hydrogen and then through a low-pressure separator to remove some light gases. Finally, the separation of the different components is achieved by a fractionation tower. Six kinds of products are collected: light end (LE), light naphtha (LN), heavy naphtha (HN), kerosene (KE), diesel (DI), and bottom oil (BO).

Due to fluctuations in product prices and changes in market supply and demand, the yields of the different products must be reallocated accordingly in order to maximize the total profit. Therefore, it is essential to accurately predict the yield of each product in time to guide operation optimization. In this paper, we take the yield of DI as an example to establish a prediction model. In this problem, the sampling period is 4 h and the dataset covers a total of 15 months. Finally, 2052 samples with 55 related input variables, including the feed mass flow rate, the volume flow of the fresh hydrogen gas, and so forth, are collected.

      4.2. User-specified parameters

      Seven typical regression algorithms are involved in this work:

(1) MML-based k-NNR adopts the MML approach proposed in Ref. [27]. The model first defines the constraints based on triplets and then formulates the optimization problem as a convex quadratic programming problem. In this algorithm, the number of nearest neighbors $K_k$ is to be determined.

(2) SVR achieves a tradeoff between structural risks and empirical risks by means of the regularization coefficient C and achieves nonlinear mapping by introducing kernel methods. In this paper, different kernels, such as the linear kernel, the Gaussian kernel, and the polynomial kernel, are compared with each other, and the Gaussian kernel is demonstrated to be better for these regression problems. Thus, the regularization coefficient C and the kernel parameter γ are to be optimized (a tuning sketch is given after item (5) below).

      Fig. 4. Process flowchart of the hydrocracking process.

(5) NNs are effective tools for solving regression problems. We implemented strategies including BN and dropout, which have been demonstrated to be the state of the art in various fields [35]. To be specific, the batch size is chosen to be 30, the proportion of dropout is 0.3, and the number of hidden neurons $N_{nh}$ is chosen by fivefold cross-validation; see the sketches below.
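The following two sketches illustrate how baselines (2) and (5) might be set up; the search grid, layer sizes, and helper names are illustrative assumptions, since the paper specifies only the hyper-parameters quoted above:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# (2) SVR with a Gaussian (RBF) kernel: tune C and gamma by
# fivefold cross-validation; the grid values below are placeholders.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}
svr_search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5,
                          scoring="neg_mean_squared_error")
# svr_search.fit(X_train, y_train)  # X_train, y_train are placeholders
```

```python
import torch.nn as nn

# (5) Baseline NN with BN and dropout; only the dropout rate of 0.3
# comes from the text, and the single hidden layer is an assumption.
class BaselineNN(nn.Module):
    def __init__(self, n_features, n_hidden):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, n_hidden),
            nn.BatchNorm1d(n_hidden),
            nn.ReLU(),
            nn.Dropout(p=0.3),
            nn.Linear(n_hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)
```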

The results in the table show that the number of nearest neighbors in the LQEL varies with different datasets. First, it depends on the scale of the dataset, which determines the density of the samples in the space. For example, in the critical assessment of protein structure prediction (CASP) dataset, the instances are sufficient for the neighbors to serve as good references for prediction. This implies that a large number of nearest neighbors can effectively improve the prediction ability of the model. However, for the industrial fineness dataset, limited samples are available for modeling. In addition, it is difficult to use the values of the instrumental variables for state representation. For example, the quantity of slag rejection in a vertical roller mill (VRM) is often evaluated by the current of the bucket elevator, but current drift occurs when regular maintenance is carried out (approximately once every 2 days), especially when lubricating oil is added. Therefore, it is necessary to pay more attention to the changes in the current. Under these circumstances, the nearest neighbors in the space may not be as instructive as those of the CASP dataset. Therefore, the model chooses a small number of neighboring samples for prediction. The table also implies that the proposed LQEL model with simple forward NNs can perform well in regression problems. Compared with the forward NN model, there are fewer hidden neurons in the LQEL model (no more than four), and a smaller set of parameters must be estimated. This reduces the model complexity, thereby improving the model generalization.

      Table 2 Hyper-parameters employed in case study.

      4.3. Performance comparison of different datasets

To compare the performance of the proposed method with the abovementioned classical methods, a total of nine regression problems on seven datasets were used. Each experiment was repeated 30 times, and the MSE and mean absolute error (MAE) on the test sets were recorded. Then, statistical analyses were carried out on these indexes to validate the robustness of the algorithm.
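A sketch of this evaluation protocol is given below; the 80/20 split and the helper name model_factory are assumptions, as the paper states only the 30 repetitions and the two indexes:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

def repeated_evaluation(model_factory, X, y, n_repeats=30):
    """Repeat a train/test experiment and collect the MSE and MAE."""
    mses, maes = [], []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed)
        model = model_factory()          # a fresh, unfitted regressor
        model.fit(X_tr, y_tr)
        y_hat = model.predict(X_te)
        mses.append(mean_squared_error(y_te, y_hat))
        maes.append(mean_absolute_error(y_te, y_hat))
    return np.array(mses), np.array(maes)
```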

Table 3 shows the average indexes of each algorithm on the different datasets. The best performance in each line is marked in bold. It can be seen that, across the nine verification tests, the LQEL algorithm proposed in this paper achieves the best performance on most of the datasets. On the remaining datasets, the LQEL algorithm achieves a performance comparable to those of the best-performing LightGBM and RF algorithms, and it has clear advantages over the other algorithms.

Moreover, to evaluate the robustness of the algorithm, it is necessary to compare the distributions of the obtained indexes. The MSE and MAE distributions of the repeated tests are shown as box plots in Figs. 5 and 6, respectively. The figures imply that the LQEL has the most remarkable stability on most datasets, except for the wine quality, CASP, and fineness datasets. Although its performance on these datasets fluctuates slightly more than that of some of the other algorithms, its overall MSE and MAE are significantly lower; in other words, the algorithms with more stable performance often pay for that stability with precision. In particular, strategies such as dropout, batch learning, and BN are implemented in both the NN and LQEL algorithms, but the latter outperforms the former.

Figs. 7 and 8 show scatter plots of the prediction results for different algorithms on the two industrial datasets, in which the abscissa is the ground-truth value and the ordinate is the prediction result. The coefficient of determination (R²) marked in the top left corner indicates that the LQEL algorithm shows advantages over the other algorithms in these two soft sensing applications. This can be attributed to two aspects:

(1) The absolute values of the variables in these industrial datasets cannot describe the process state well. The method proposed in this paper makes corrections to the nearest neighbors according to the changes in the auxiliary variables, which puts greater emphasis on the differences and thus reduces the risk of the above problem.

(2) This method employs two extremely simple NNs to achieve LQEL. One NN finds the coefficients of the local quadratic functions, and the other realizes the weight assignment for the predictions given by the nearest neighbors. With these designs, the generalization ability of the proposed algorithm can be effectively improved.

      Table 3 Performance comparison of different algorithms.

      Fig. 5. MSE box plots of algorithms tested on different datasets. MML: MML-based k-NNR; LGB: LightGBM; DML: DML-based k-NNR.

      Fig. 6. MAE box plots of algorithms tested on different datasets.

      Fig. 8. Scatter plots of the prediction results for different algorithms on the hydrocracking dataset.

      5. Conclusions

This paper proposed an LQEL algorithm for regression problems. MML is first improved by optimizing the consistency of the distances between samples in the input and output spaces. By relaxing the constraints, the modified problem is proved to be a convex optimization problem while retaining the same solution as the original problem. Based on this, a locally quadratic embedding model is developed, and different weights are assigned to the prediction results to minimize the expectation of the prediction error. In this framework, two extremely simple NNs are implemented to learn the quadratic embedding matrix and the weight assignments of the neighboring predictions. This yields a unified end-to-end model that prevents an independent two-layer optimization from getting stuck in a local optimum. The proposed LQEL model has the following advantages:

      ●A global consistency for distances in the input and output space is achieved via improved metric learning.

      ●The information contained in output labels is better exploited,which leads to a better determination of the neighborhood for a certain instance.

●An LQEL framework is proposed based on the local quadratic embedding hypothesis. Two specially designed networks improve generalization by simplifying the model structure from either a global or a local perspective.

●The experimental results show that the LQEL can achieve more precise predictions with comparable robustness, even though only lightweight NNs are employed.

      Acknowledgments

This work was supported by the National Key Research and Development Program of China (2016YFB0303401), the International (Regional) Cooperation and Exchange Project (61720106008), the National Science Fund for Distinguished Young Scholars (61725301), and the Shanghai AI Lab.

      Compliance with ethics guidelines

      Yaoyao Bao, Yuanming Zhu, and Feng Qian declare that they have no conflict of interest or financial conflicts to disclose.
