Regression

Data fusion of spectroscopic data for enhancing machine learning model performance

Developing accurate prediction models for complex industrial and geological applications remains a significant challenge, particularly when relying on limited and disparate spectroscopic data. Traditional data fusion methods often fail to integrate complementary information across different spectral sources, which limits predictive performance. Complex-level ensemble fusion (CLF) is presented as a two-layer chemometric algorithm that jointly selects variables from concatenated mid-infrared (MIR) and Raman spectra with a genetic algorithm, projects them with partial least squares, and stacks the latent variables into an XGBoost regressor, thereby capturing feature- and model-level complementarities in a single workflow. Benchmarked against single-source models and classical low-, mid-, and high-level data-fusion schemes on paired MIR and Raman datasets from industrial lubricant additives and RRUFF minerals, CLF consistently achieved significantly better predictive accuracy by leveraging complementary spectral information. Mid-level fusion yielded no improvement, underscoring the need for supervised integration. These results constitute the first evidence that a stacked, complex-level scheme can surpass all established fusion levels on real-world spectroscopic regressions comprising fewer than one hundred samples, and they provide a transferable recipe for building more accurate and resilient soft sensors in quality-control and geochemical applications.

Post Date: 28 October 2025

Hanzelik, P. P., Gergely, S., Abonyi, J., & Kummer, A. (2025). Data fusion of spectroscopic data for enhancing machine learning model performance. Digital Chemical Engineering, 100271.
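The first layer of the workflow described in the abstract above (genetic-algorithm variable selection over concatenated spectral blocks, scored by a partial least squares model) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic, the block sizes, GA settings, and helper names (pls1_fit, fitness, ga_select) are hypothetical, and the second layer (stacking the PLS latent variables into an XGBoost regressor) is deliberately left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for paired spectra (hypothetical sizes, not real data).
n, n_mir, n_raman = 60, 8, 6
X = rng.normal(size=(n, n_mir + n_raman))        # concatenated MIR + Raman blocks
y = 2.0 * X[:, 1] - 1.5 * X[:, 10] + 0.1 * rng.normal(size=n)
train, val = np.arange(0, 40), np.arange(40, 60)

def pls1_fit(X, y, k):
    """PLS1 via NIPALS; returns means, the score projection R, and loadings q."""
    xm, ym = X.mean(axis=0), y.mean()
    Xr, yr = X - xm, y - ym
    W, P, q = [], [], []
    for _ in range(k):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)
        t = Xr @ w                                # latent variable (score vector)
        tt = t @ t
        p = Xr.T @ t / tt
        c = yr @ t / tt
        Xr = Xr - np.outer(t, p)                  # deflate X and y
        yr = yr - c * t
        W.append(w); P.append(p); q.append(c)
    W, P = np.array(W).T, np.array(P).T
    R = W @ np.linalg.inv(P.T @ W)                # scores = (X - xm) @ R
    return xm, ym, R, np.array(q)

def pls1_predict(model, X):
    xm, ym, R, q = model
    return ym + (X - xm) @ R @ q

def fitness(mask):
    """Negative validation RMSE of a 2-component PLS model on selected columns."""
    if mask.sum() < 2:
        return -np.inf
    cols = np.flatnonzero(mask)
    model = pls1_fit(X[train][:, cols], y[train], k=2)
    err = y[val] - pls1_predict(model, X[val][:, cols])
    return -np.sqrt(np.mean(err ** 2))

def ga_select(n_vars, pop_size=20, generations=30, p_mut=0.1):
    """Tiny GA: keep the best half, refill by one-point crossover and mutation."""
    pop = rng.random((pop_size, n_vars)) < 0.5
    for _ in range(generations):
        scores = np.array([fitness(m) for m in pop])
        parents = pop[np.argsort(scores)[::-1][: pop_size // 2]]
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_vars)
            child = np.concatenate([a[:cut], b[cut:]])
            child ^= rng.random(n_vars) < p_mut   # bit-flip mutation
            children.append(child)
        pop = np.vstack([parents, children])
    scores = np.array([fitness(m) for m in pop])
    return pop[np.argmax(scores)], scores.max()

best_mask, best_score = ga_select(X.shape[1])
print("selected variables:", np.flatnonzero(best_mask))
print("validation RMSE:", -best_score)
```

In a fuller sketch, the score matrix (X - xm) @ R of the selected variables would be passed to a second-layer regressor; a single hold-out split is used here only to keep the example short.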

Network science and explainable AI-based life cycle management of sustainability models

Model-based assessment of the potential impacts of variables on the Sustainable Development Goals (SDGs) can reveal valuable information about possible policy intervention points. In the context of sustainability planning, machine learning techniques can provide data-driven solutions throughout the modeling life cycle. In a changing environment, existing models must be continuously reviewed and developed for effective decision support. Thus, we propose to use the Machine Learning Operations (MLOps) life cycle framework. A novel approach for model identification and development is introduced, which uses the Shapley value to determine the individual direct and indirect contributions of each variable to the output, together with network analysis to identify key drivers and to support the identification and validation of possible policy intervention points. The applicability of the methods is demonstrated through a case study of the Hungarian water model developed by the Global Green Growth Institute. Model exploration for the water efficiency and water stress indicators (SDG 6.4.1 and 6.4.2) in the examined period indicates that water reuse and water circularity offer more effective intervention options than pricing and the use of internal or external renewable water resources.

Post Date: 17 June 2024

Ipkovich Á, Czvetkó T, A. Acosta L, Lee S, Nzimenyera I, et al. (2024) Network science and explainable AI-based life cycle management of sustainability models. PLOS ONE 19(6): e0300531.
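To make the role of the Shapley value in the abstract above more concrete, the sketch below computes exact Shapley values for a tiny model by averaging marginal contributions over all variable orderings. It is an illustration only: the toy model, its variable names, and the all-zero baseline are hypothetical and are not taken from the Hungarian water model, and the paper's split into direct and indirect contributions is not reproduced.

```python
from itertools import permutations

def shapley_values(model, x, baseline):
    """Exact Shapley values: average marginal contribution over all orderings."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        z = list(baseline)              # start from the baseline input
        prev = model(z)
        for i in order:
            z[i] = x[i]                 # switch variable i to its actual value
            cur = model(z)
            phi[i] += cur - prev        # marginal contribution in this ordering
            prev = cur
    return [p / len(orderings) for p in phi]

# Hypothetical toy model with one interaction term (not from the paper).
def toy_model(z):
    reuse, price, renewables = z
    return 3.0 * reuse + 1.0 * price + 2.0 * reuse * renewables

phi = shapley_values(toy_model, x=[1.0, 1.0, 1.0], baseline=[0.0, 0.0, 0.0])
print(phi)
```

The interaction term is shared equally between the two variables involved, and the values sum to the difference between the model output at x and at the baseline. Exact enumeration grows factorially with the number of variables, so practical tools rely on sampling or model-specific approximations.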

Mixtures of QSAR Models – Learning Application Domains of pKa Predictors

Quantitative structure-activity relationship models (QSAR models) predict physical properties or biological effects based on the physicochemical properties or molecular descriptors of chemical structures. Our work focuses on the construction of optimal linear and nonlinear weighted mixtures of individual QSAR models to improve prediction performance. The paper highlights how splitting the application domain with a nonlinear gating network in a "mixture of experts" model structure helps determine the optimal domain-specific QSAR model, and how the optimal QSAR model for certain chemical groups can be identified. The input of the gating network can be formed freely from the various molecular structure descriptors and/or the predictions of the individual QSAR models. The applicability of the method is demonstrated on the pKa values of the OASIS database (1912 chemicals) by combining four acidic pKa predictions of the OECD QSAR Toolbox. According to the results, the prediction performance (RMSE) improved by more than 15% compared to the best individual QSAR model.

J. Abonyi, T. Varga, O. P. Hamadi, Gy. Dorgo, Mixtures of QSAR Models – Learning Application Domains of pKa Predictors, Journal of Chemometrics, 2020
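The "mixture of experts" structure mentioned in the abstract above can be illustrated with a small sketch: a gating function turns molecular descriptors into weights that sum to one, and the final prediction is the weighted sum of the individual model outputs. The softmax-of-linear-scores gate, the descriptor values, and the gate weights below are assumptions chosen for illustration; the paper's gating network and its training procedure are not reproduced.

```python
import math

def softmax(scores):
    """Numerically stable softmax; the outputs are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gate(descriptors, gate_weights):
    """One linear score per expert, turned into mixing weights by softmax."""
    scores = [sum(w * d for w, d in zip(row, descriptors)) for row in gate_weights]
    return softmax(scores)

def mixture_predict(descriptors, expert_preds, gate_weights):
    """Weighted combination of the individual expert predictions."""
    g = gate(descriptors, gate_weights)
    return sum(gi * p for gi, p in zip(g, expert_preds)), g

# Hypothetical numbers: two descriptors, two experts (e.g. two pKa predictors).
descriptors = [1.0, 0.0]
gate_weights = [[5.0, 0.0],     # expert 0 is trusted when descriptor 0 is large
                [0.0, 5.0]]     # expert 1 is trusted when descriptor 1 is large
pred, g = mixture_predict(descriptors, expert_preds=[4.0, 9.0],
                          gate_weights=gate_weights)
print(pred, g)
```

Because the weights depend on the descriptors, each expert effectively receives its own region of the application domain, and the dominant weight shows which individual model is preferred for a given chemical.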

Fuzzy Model Identification for Control

Fuzzy Model Identification

Fuzzy model identification is an effective tool for the approximation of uncertain nonlinear systems on the basis of measured data. The identification of a fuzzy model using input-output data can be divided into two tasks: structure identification, which determines the type and number of the rules and membership functions, and parameter identification. For both structural and parametric adjustment, prior knowledge plays an important role. Hence, in this book the rules of the fuzzy system are designed based on the available a priori knowledge, and the parameters of the membership and consequent functions are adapted in a learning process based on the available input-output data. Accordingly, this chapter is devoted mainly to the parameter identification of the proposed fuzzy models, but certain structure identification tools are also discussed.

Fuzzy Model Based Control

This chapter discusses how the proposed fuzzy models can be used in model-based control. The developed Takagi-Sugeno, Hybrid Fuzzy Convolution and Fuzzy Hammerstein dynamic fuzzy models will be applied in several inversion and linearization-based control schemes. Taking the identification of the Takagi-Sugeno fuzzy models into account, guidelines will be given as to which control configuration is most advantageous.

J. Abonyi, Fuzzy model identification for control, Birkhauser Boston, 2003, 310 pages

Incorporating Prior Knowledge in Fuzzy Model Identification

This paper presents an algorithm for incorporating a priori knowledge into data-driven identification of dynamic fuzzy models of the Takagi-Sugeno type. Knowledge about the modelled process such as its stability, minimal or maximal static gain, or the settling time of its step response can be translated into inequality constraints on the consequent parameters. By using input-output data, optimal parameter values are then found by means of quadratic programming. The proposed approach has been applied to the identification of a laboratory liquid level process. The obtained fuzzy model has been used in model-based predictive control. Real-time control results show that, when the proposed identification algorithm is applied, not only are physically justified models obtained but also the performance of the model-based controller improves with regard to the case where no prior knowledge is involved.

J. Abonyi, R. Babuska, H. B. Verbruggen, F. Szeifert, Incorporating prior knowledge in fuzzy model identification, Int. Journal of Systems Science, 31(5), 657-667, 2000, IF 0.268
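The constrained estimation described in the abstract above can be written as a standard quadratic program. The formulation below is a sketch under assumed notation (regressor matrix \Phi, output vector y, consequent parameter vector \theta, constraint matrix \Lambda and bound \omega are symbols introduced here, not taken from the paper). The example assumes a first-order local model y(k+1) = a_i y(k) + b_i u(k) + c_i, for which bounds on the static gain become linear inequalities once stability (a_i < 1) is imposed.

```latex
% Least-squares cost rewritten as a QP in the consequent parameters \theta
\min_{\theta}\; \|y - \Phi\theta\|^2
  \;=\; \min_{\theta}\; \tfrac{1}{2}\,\theta^{T} H\,\theta + c^{T}\theta + y^{T}y,
\qquad H = 2\,\Phi^{T}\Phi, \quad c = -2\,\Phi^{T}y,
\qquad \text{subject to } \Lambda\theta \le \omega .

% Example: static gain K_i = b_i / (1 - a_i) of local model i, with 1 - a_i > 0
K_{\min} \le \frac{b_i}{1 - a_i} \le K_{\max}
\;\;\Longleftrightarrow\;\;
\begin{cases}
  -b_i - K_{\min}\, a_i \le -K_{\min}, \\
  \phantom{-}b_i + K_{\max}\, a_i \le K_{\max}.
\end{cases}
```

Each such inequality contributes one row to \Lambda and one entry to \omega, so prior knowledge enters the identification problem without changing the quadratic cost.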

Fuzzy modeling with multivariate membership functions: gray-box identification and control design

A novel framework for fuzzy modeling and model-based control design is described. The fuzzy model is of the Takagi-Sugeno (TS) type with constant consequents. It uses multivariate antecedent membership functions obtained by Delaunay triangulation of their characteristic points. The number and position of these points are determined by an iterative insertion algorithm. Constrained optimization is used to estimate the consequent parameters, where the constraints are based on control-relevant a priori knowledge about the modeled process. Finally, methods for control design through linearization and inversion of this model are developed. The proposed techniques are demonstrated by means of two benchmark examples: identification of the well-known Box-Jenkins gas furnace and inverse model-based control of a pH process. The obtained results are compared with results from the literature.

J. Abonyi, R. Babuska, F. Szeifert, Fuzzy modeling with multivariate membership functions: Gray-box identification and control design, IEEE Systems, Man and Cybernetics, Part B: Cybernetics, 31 (5), pp. 755-767 IF 0.789, 2001

Identification and Control of Nonlinear Systems Using Fuzzy Hammerstein Models

This paper addresses the identification and control of nonlinear systems by means of Fuzzy Hammerstein (FH) models, which consist of a static fuzzy model connected in series with a linear dynamic model. For the identification of nonlinear dynamic systems with the proposed FH models, two methods are proposed. The first one is an alternating optimization algorithm that iteratively refines the estimate of the linear dynamics and the parameters of the static fuzzy model. The second method estimates the parameters of the nonlinear static model and of the linear dynamic model simultaneously by using a constrained recursive least-squares algorithm. The obtained FH model is incorporated in a model-based predictive control scheme and a new constraint-handling method is presented. A simulated water-heater process is used as an illustrative example. A comparison with an affine neural network and a linear model is given. Simulation results show that the proposed FH modeling approach is useful for modular parsimonious modeling and model-based control of nonlinear systems.

J. Abonyi, R. Babuska, M. Ayala Botto, F. Szeifert, L. Nagy, Identification and control of nonlinear systems using fuzzy Hammerstein models, Industrial and Engineering Chemistry Research, 39, 4302-4314, 2000, IF 1.294
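The Hammerstein structure described in the abstract above (a static fuzzy model in series with a linear dynamic model) can be simulated in a few lines. The sketch below is an assumption-laden illustration: it uses a zero-order fuzzy model with a triangular partition as the static part and a first-order linear model as the dynamic part, and all centers, consequents, and coefficients are hypothetical. Neither identification method from the paper is implemented.

```python
def triangular_memberships(u, centers):
    """Membership degrees for a triangular partition; they always sum to 1."""
    n = len(centers)
    if u <= centers[0]:
        return [1.0] + [0.0] * (n - 1)
    if u >= centers[-1]:
        return [0.0] * (n - 1) + [1.0]
    mu = [0.0] * n
    for j in range(n - 1):
        lo, hi = centers[j], centers[j + 1]
        if lo <= u <= hi:
            mu[j + 1] = (u - lo) / (hi - lo)
            mu[j] = 1.0 - mu[j + 1]
            break
    return mu

def static_fuzzy(u, centers, consequents):
    """Zero-order fuzzy model: membership-weighted sum of constant consequents."""
    mu = triangular_memberships(u, centers)
    return sum(m * c for m, c in zip(mu, consequents))

def simulate_fh(inputs, centers, consequents, a, b, y0=0.0):
    """Static nonlinearity v = f(u), then y(k+1) = a*y(k) + b*v(k)."""
    y = [y0]
    for u in inputs:
        v = static_fuzzy(u, centers, consequents)
        y.append(a * y[-1] + b * v)
    return y

# Hypothetical parameters: the static part roughly follows u**2 on [0, 2],
# and the linear part has unit static gain (b / (1 - a) = 1).
centers, consequents = [0.0, 1.0, 2.0], [0.0, 1.0, 4.0]
y = simulate_fh([2.0] * 20, centers, consequents, a=0.5, b=0.5)
print(y[-1])
```

With unit gain in the linear block, the steady-state output equals the output of the static fuzzy model, which is what makes the structure modular: the nonlinearity and the dynamics can be inspected separately.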

Modified Gath-Geva Fuzzy Clustering for Identification of Takagi-Sugeno Fuzzy Models

The construction of interpretable Takagi-Sugeno (TS) fuzzy models by means of clustering is addressed. First, it is shown how the antecedent fuzzy sets and the corresponding consequent parameters of the TS model can be derived from clusters obtained by the Gath-Geva algorithm. To preserve the partitioning of the antecedent space, linearly transformed input variables can be used in the model. This may, however, complicate the interpretation of the rules. To form an easily interpretable model that does not use the transformed input variables, a new clustering algorithm is proposed, based on the Expectation Maximization (EM) identification of Gaussian mixture models. The most relevant consequent variables of the TS model are selected by an orthogonal least squares method based on the obtained clusters. For the selection of the relevant antecedent (scheduling) variables a new method has been developed based on Fisher's interclass separability criterion. This new technique is applied to two well-known benchmark problems: the MPG (miles per gallon) prediction and a simulated second-order nonlinear process. The obtained results are compared with results from the literature.

J. Abonyi, R. Babuska, F. Szeifert, Modified Gath-Geva fuzzy clustering for identification of Takagi-Sugeno fuzzy models, IEEE Trans. on Systems, Man and Cybernetics, Part B, 612-621, Oct, 2002
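For readers unfamiliar with the model class in the abstract above, the following sketch shows how a Takagi-Sugeno model with axis-parallel Gaussian antecedent sets produces an output: each rule fires to a degree given by its membership functions, and the prediction is the firing-strength-weighted average of the local linear models. The rule parameters are hypothetical and hand-picked; in the paper they would come from clustering, which is not shown here.

```python
import math

def ts_predict(x, rules):
    """Takagi-Sugeno inference with Gaussian antecedents and linear consequents.

    Each rule is (centers, widths, coeffs, offset); all values are assumed given.
    """
    weights, outputs = [], []
    for centers, widths, coeffs, offset in rules:
        w = 1.0
        for xi, c, s in zip(x, centers, widths):
            w *= math.exp(-0.5 * ((xi - c) / s) ** 2)   # product of memberships
        weights.append(w)
        outputs.append(sum(a * xi for a, xi in zip(coeffs, x)) + offset)
    total = sum(weights)   # can underflow to 0 for inputs far from every rule
    return sum(w * o for w, o in zip(weights, outputs)) / total

# Hypothetical two-rule model of y = |x|: "x is negative" -> y = -x,
# "x is positive" -> y = x.
rules = [
    ([-1.0], [1.0], [-1.0], 0.0),
    ([1.0], [1.0], [1.0], 0.0),
]
print(ts_predict([0.0], rules), ts_predict([1.0], rules))
```

Because the antecedent sets are defined on the original input variables, each rule can be read directly as "if x is near this center, then this local linear model applies", which is the interpretability the abstract refers to.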

Compact TS-Fuzzy Models through Clustering and OLS plus FIS Model Reduction

Identification of uncertain and nonlinear systems is an important and challenging problem. Fuzzy models of the Takagi-Sugeno (TS) type may be a good choice to describe such systems; however, in many cases these soon become complex. We propose a three-step method to obtain compact TS-models that can be effectively used to represent complex systems: 1) a new fuzzy clustering method is proposed for identification of compact TS-models; 2) the most relevant consequent variables of the TS-model are selected by an orthogonal least squares (OLS) method based on the obtained clusters; and 3) for selection of relevant antecedent variables, a new method is proposed based on Fisher's interclass separability (FIS) criterion. The overall approach is demonstrated by means of the MPG (miles per gallon) nonlinear regression benchmark. Results are compared with those obtained by standard linear, neuro-fuzzy and advanced fuzzy clustering-based identification tools.

J. Abonyi, J.A. Roubos, M. Oosterom, F. Szeifert, Compact TS-Fuzzy models through clustering and OLS plus FIS model reduction, FUZZ-IEEE'01 Conference, Sydney, Australia, 1420-1423, 2001, (MATLAB implementation)

Visualization and Complexity Reduction of Neural Networks

The identification of the proper structure of nonlinear neural networks (NNs) is a difficult problem, since these black-box models are not interpretable. The aim of the paper is to propose a new approach that can be used for the analysis and the reduction of these models. It is shown that NNs with sigmoid transfer functions can be transformed into fuzzy systems. Hence, with the use of this transformation, NNs can be analyzed by human experts based on the extracted linguistic rules. Moreover, based on the similarity of the resulting membership functions, the hidden neurons of the NNs can be mapped into a two-dimensional space. The resulting map provides an easily interpretable figure about the redundancy of the neurons. Furthermore, the contribution of these neurons can be measured by an orthogonal least squares technique, which can be used for ordering the extracted fuzzy rules based on their importance. A practical example related to the dynamic modeling of a chemical process system is used to show that the synergistic combination of model transformation, visualization and reduction of NNs is an effective technique that can be used for the structural and parametrical analysis of NNs.

Kenesei, Tamas, Balazs Feil, Janos Abonyi, Visualization and Complexity Reduction of Neural Networks, Applications of Soft Computing. Springer Berlin Heidelberg, 2009. 43-52.

Hinging hyperplane based regression tree identified by fuzzy clustering and its application

Hierarchical fuzzy modeling techniques have a great advantage: model accuracy and complexity can be easily controlled thanks to the transparent model structures. A novel tool for regression tree identification is proposed based on the synergistic combination of fuzzy c-regression clustering and the concept of hierarchical modeling. In a special case (c = 2), fuzzy c-regression clustering can be used for the identification of hinging hyperplane models. The proposed method recursively identifies a hinging hyperplane model that contains two linear submodels by partitioning the operating region of one local linear model, resulting in a binary regression tree. Novel measures of model performance and complexity are developed to support the analysis and building of the proposed special model structure. The effectiveness of the proposed model is demonstrated on benchmark regression datasets. Examples also demonstrate that the proposed model can effectively represent nonlinear dynamical systems. Thanks to the piecewise linear model structure, the resulting regression tree can be easily utilized in model predictive control. A detailed application example related to the model predictive control of a water heater demonstrates that the proposed framework can be effectively used in the modeling and control of dynamical systems.

Tamás Kenesei, János Abonyi, Hinging hyperplane based regression tree identified by fuzzy clustering and its application, Applied Soft Computing, 13, 782-792, 2013, (MATLAB implementation)
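The hinging hyperplane structure named in the abstract above consists of two linear submodels joined along a hinge; the output is the maximum (convex case) or minimum (concave case) of the two. The sketch below shows only this model structure and the binary split it induces, with hypothetical parameter values and function names; the fuzzy c-regression clustering used in the paper to identify the parameters is not shown.

```python
def linear(theta, x):
    """Affine submodel: theta holds the slopes followed by the offset."""
    xe = list(x) + [1.0]
    return sum(t * v for t, v in zip(theta, xe))

def hinge_predict(x, theta_plus, theta_minus, convex=True):
    """Hinging hyperplane: max (convex) or min (concave) of two hyperplanes."""
    a, b = linear(theta_plus, x), linear(theta_minus, x)
    return max(a, b) if convex else min(a, b)

def hinge_side(x, theta_plus, theta_minus):
    """Which submodel's region x falls in; the hinge is where both are equal."""
    diff = linear(theta_plus, x) - linear(theta_minus, x)
    return "plus" if diff >= 0 else "minus"

# Hypothetical parameters: y = max(x, -x) = |x|, hinge at x = 0.
theta_plus, theta_minus = [1.0, 0.0], [-1.0, 0.0]
print(hinge_predict([-2.0], theta_plus, theta_minus))   # value of the active submodel
print(hinge_side([-2.0], theta_plus, theta_minus))      # branch of the binary split
```

Applying the same split recursively inside each region yields a binary tree whose leaves are local linear models, which matches the tree structure the abstract describes.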

A priori knowledge based spline smoothing is useful for the data-driven identification of kinetic parameters

In many practical situations, laboratory and industrial experiments are expensive and time consuming, and accurate measurements cannot be made. This problem results in a small number of data points that are often noisy and obtained at irregular time intervals. Hence, data smoothing and re-sampling are often required to reduce the effect of measurement noise and irregular time intervals. Typically, an interpolation method is used for this purpose, e.g. cubic spline interpolation, but the disadvantage of the common interpolation methods is that they cannot utilize any a priori information. Hence, we developed a new cubic spline interpolation approach which utilizes a priori knowledge, e.g. material balance, or prior information about the measured properties. The methodology is demonstrated on a simulated and an industrial chemical reactor, showing that the new method improves the accuracy of the data-driven estimation of kinetic parameters.

J. Madár, J. Abonyi, H. Roubos, F. Szeifert, Incorporating prior knowledge in cubic spline approximation - Application to the Identification of Reaction Kinetic Models, Industrial and Engineering Chemistry Research, 42, 4043-4049, 2003, IF: 1.252

Interpretable Support Vector Regression

This paper deals with transforming support vector regression (SVR) models into fuzzy inference systems (FIS). It is highlighted that trained support vector based models can be used for the construction of fuzzy rule-based regression models. However, the transformed support vector model does not automatically result in an interpretable fuzzy model. Training a support vector model results in a complex rule base, where the number of rules is approximately 40-60% of the number of training data; therefore, reduction of the support vector model-initialized fuzzy model is an essential task. For this purpose, a three-step reduction algorithm is used based on the combination of previously published model reduction techniques: the reduced set method decreases the number of kernel functions; then, after the reduced support vector model is transformed into a fuzzy rule base, similarity measure based merging and orthogonal least-squares methods are applied. The proposed approach is applied to nonlinear system identification; the identification of a Hammerstein system demonstrates the accuracy of the technique while fulfilling the criteria of interpretability.

Tamas Kenesei, Janos Abonyi, Interpretable Support Vector Regression, Artificial Intelligence Research, 1, 11-21, 2012.

Interpretable Support Vector Machines in Regression and Classification – Application in Process Engineering

Tools from the armoury of soft computing have recently been in the focus of research, since soft computing techniques are used for fault detection (classification techniques), forecasting of time-series data, inference, hypothesis testing, and modelling of causal relationships (regression techniques) in process engineering. These techniques solve two cardinal problems: learning from experimental data by neural networks and support vector based techniques, and embedding existing structured human knowledge into fuzzy models. Support vector based models are among the most commonly used soft computing techniques. Support vector based models are strong in feature selection and in achieving robust models, and fuzzy logic helps to improve the interpretability of models. This paper deals with combining these existing soft computing techniques to obtain interpretable but accurate models for industrial purposes. The paper describes how trained support vector based models can be used for the construction of fuzzy rule-based classifier or regression models. However, the transformed support vector model does not automatically result in an interpretable fuzzy model because the support vector model results in a complex rule base, where the number of rules is approximately 40-60% of the number of training data. Hence, reduction of the support vector model-initialized fuzzy model is an essential task. For this purpose, a three-step reduction algorithm is used based on the combination of previously published model reduction techniques. In the first step, the identification of the SV model is followed by the application of the Reduced Set method to decrease the number of kernel functions. The reduced SV model is then transformed into a fuzzy rule-based model. The interpretability of a fuzzy model highly depends on the distribution of the membership functions. Hence, the second reduction step is achieved by merging similar fuzzy sets based on a similarity measure. Finally, in the third step, an orthogonal least-squares method is used to reduce the number of rules and re-estimate the consequent parameters of the fuzzy rule-based model. The proposed approach is applied to classification problems and to Hammerstein system identification to illustrate the effectiveness of the technique.

Tamas Kenesei, Janos Abonyi, Interpretable Support Vector Machines in Regression and Classification – Application in Process Engineering, Hungarian Journal of Industrial Chemistry, 35, 101-108, 2007.
