# Programs and data

### Repositories

The links below lead to files that can help in understanding some of Dr. János Abonyi's articles.

### Data describing the relationship between world news and sustainable development goals

The data article presents a dataset and a tool for news-based monitoring of sustainable development goals defined by the United Nations. The presented dataset was created by structured queries of the GDELT database based on the categories of the World Bank taxonomy matched to sustainable development goals. The Google BigQuery SQL scripts and the results of the related network analysis are attached to the data to provide a toolset for the strategic management of sustainability issues. The article demonstrates the dataset on the 6th sustainable development goal (Clean Water and Sanitation). The network formed based on how countries appear in the same news can be used to explore potential international cooperation. The network formed based on how topics of the World Bank taxonomy appear in the same news can be used to explore how the problems and solutions of sustainability issues are interlinked.
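
The co-occurrence networks described above can be illustrated with a minimal sketch. It assumes each news item has already been reduced to a list of labels (country codes here, but World Bank taxonomy topics would work the same way); the function name and the sample records are hypothetical and are not taken from the published BigQuery scripts.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(records):
    """Map each unordered label pair to the number of records containing both."""
    edges = Counter()
    for labels in records:
        # sorted(set(...)) drops duplicate labels and fixes the order inside each pair
        for a, b in combinations(sorted(set(labels)), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical news items, each tagged with the countries it mentions.
news = [
    ["HU", "AT", "DE"],
    ["HU", "AT"],
    ["DE", "FR"],
    ["HU"],
]
network = cooccurrence_network(news)
for pair, weight in sorted(network.items()):
    print(pair, weight)
```

The edge weights form a weighted, undirected network; items that mention a single label contribute no edges.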

### Data describing the regional Industry 4.0 readiness index

The data article presents a dataset suitable for measuring regional Industry 4.0 (I4.0+) readiness. The I4.0+ dataset includes 101 indicators with 248 958 observations, aggregated to the NUTS 2 statistical level, based on open data in the fields of education (ETER, Erasmus), science (USPTO, MA-Graph, GRID), government (Eurostat) and media coverage (GDELT). The indicators cover the I4.0-specific domains of higher education and lifelong learning, innovation, technological investment, labour market and technological readiness. A composite indicator, the I4.0+ index, was constructed with the PROMETHEE method to rank regions by their I4.0 performance. The index is validated against economic (GDP) and innovation (Regional Innovation Index) indexes.
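
As a rough illustration of how a PROMETHEE-style composite ranking works, the sketch below computes PROMETHEE II net outranking flows. It assumes the simplest ("usual") preference function and equal weights; the variant, preference functions and weights used for the I4.0+ index are not stated here, and the region names and scores are made up.

```python
def promethee_net_flows(scores, weights):
    """PROMETHEE II net flows with the 'usual' preference function.

    scores[i][j] is the value of alternative i on criterion j (higher is better);
    weights[j] are criterion weights that sum to 1.
    """
    n = len(scores)

    def preference(a, b):
        # Total weight of the criteria on which a is strictly better than b
        return sum(w for w, xa, xb in zip(weights, scores[a], scores[b]) if xa > xb)

    flows = []
    for a in range(n):
        positive = sum(preference(a, b) for b in range(n) if b != a) / (n - 1)
        negative = sum(preference(b, a) for b in range(n) if b != a) / (n - 1)
        flows.append(positive - negative)
    return flows


# Hypothetical regions scored on two indicators.
regions = ["R1", "R2", "R3"]
scores = [[0.9, 0.2], [0.5, 0.5], [0.1, 0.4]]
weights = [0.5, 0.5]
flows = promethee_net_flows(scores, weights)
ranking = [regions[i] for i in sorted(range(len(regions)), key=lambda i: -flows[i])]
print(ranking)
```

Net flows always sum to zero, so the index expresses relative rather than absolute performance.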

### A multilayer and spatial description of the Erasmus mobility network

The Erasmus Programme is the biggest collaboration network consisting of European Higher Education Institutions (HEIs). The flows of students, teachers and staff form directed and weighted networks that connect institutions, regions and countries. Here, we present a linked and manually verified dataset of this multiplex, multipartite, multi-labelled, spatial network. We enriched the network with institutional socio-economic data from the European Tertiary Education Register (ETER) and the Global Research Identifier Database (GRID). We geocoded the headquarters of institutions and characterised the attractiveness and quality of their environments based on Points of Interest (POI) data. The linked datasets provide relevant information to grasp a more comprehensive understanding of the mobility patterns and attractiveness of the institutions.

Machine-accessible metadata file describing the reported data

### Data-driven multilayer complex networks of Sustainable Development Goals

This data article presents the formulation of a multilayer network for modelling the interconnections among the sustainable development goals (SDGs) and targets, and includes the correlation-based linking of the sustainable development indicators with the available long-term datasets of The World Bank, 2018. The spatial distribution of the time series data allows creating country-specific sustainability assessments. In the related research article “Network Model-Based Analysis of the Goals, Targets and Indicators of Sustainable Development for Strategic Environmental Assessment” the similarities of SDGs for ten regions have been modelled in order to improve the quality of strategic environmental assessments. The datasets of the multilayer networks are available on Mendeley.

### Network analysis dataset of System Dynamics models

This paper presents a tool developed for the analysis of networks extracted from system dynamics models. The developed tool and the collected models were used and analyzed in the research paper “Review and structural analysis of system dynamics models in sustainability science” [1]. The models developed in Vensim, Stella, and InsightMaker are converted into networks of state variables, flows, and parameters by the developed Python program, which also performs model reduction and modularity analysis and calculates the structural properties of the models and their main variables. The dataset covers the results of the analysis of nine models in sustainability science used for policy testing, prediction and simulation.

### Graph Configuration Model based Evaluation of the Education-Occupation Match

To study education-occupation matching, we developed a bipartite network model of the education-to-work transition and a metric based on the graph configuration model. We studied the career paths of 15 thousand Hungarian students based on the integrated database of the National Tax Administration, the National Health Insurance Fund, and the higher education information system of the Hungarian Government. A brief analysis of the gender pay gap and the spatial distribution of over-education is presented to demonstrate the background of the research and the resulting open dataset. We highlighted the hierarchical and clustered structure of the career paths based on the multi-resolution analysis of the graph modularity. The results of the cluster analysis can help policymakers fine-tune the fragmented program structure of higher education.
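
One common way to use a configuration model as a baseline is to compare each observed edge weight with the weight expected from the node degrees alone. The sketch below does this for a bipartite programme-occupation network; the exact metric of the article may differ, and the function name and sample data are hypothetical.

```python
from collections import Counter


def configuration_model_ratio(edges):
    """Ratio of observed to expected edge weight in a bipartite network.

    edges: list of (programme, occupation) pairs, one per graduate.
    """
    m = len(edges)
    observed = Counter(edges)
    degree_programme = Counter(p for p, _ in edges)
    degree_occupation = Counter(o for _, o in edges)
    # Under the configuration model the expected weight of (p, o) is k_p * k_o / m
    return {
        (p, o): w / (degree_programme[p] * degree_occupation[o] / m)
        for (p, o), w in observed.items()
    }


# Hypothetical graduates: programme attended and occupation taken.
graduates = (
    [("IT", "developer")] * 3
    + [("IT", "teacher")]
    + [("Education", "teacher")] * 3
    + [("Education", "developer")]
)
ratios = configuration_model_ratio(graduates)
print(ratios[("IT", "developer")])
```

A ratio above 1 marks a programme-occupation pair that occurs more often than the degrees alone would suggest, and a ratio below 1 marks one that occurs less often.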

### NOCAD Network based Observability and Controllability Analysis of Dynamical Systems toolbox

This Matlab package is written specifically for the paper: Dániel Leitold, Ágnes Vathy-Fogarassy & János Abonyi, Controllability and observability in complex networks – the effect of connection types, Scientific Reports 7, Article number: 151 (2017), doi:10.1038/s41598-017-00160-5

The purpose of the package is to analyse the state space models of dynamical systems and to design the placement of controllers and sensors.

### Correlation based dynamic time warping of multivariate time series

In recent years, dynamic time warping (DTW) has begun to become the most widely used technique for comparison of time series data where extensive a priori knowledge is not available. However, a multivariate comparison method is often expected to consider the correlation between the variables, as this correlation carries the real information in many cases. Thus, principal component analysis (PCA) based similarity measures, such as the PCA similarity factor (SPCA), are used in many industrial applications.

In this paper, we present a novel algorithm called correlation based dynamic time warping (CBDTW), which combines DTW and PCA based similarity measures. To preserve correlation, multivariate time series are segmented and the local dissimilarity function of DTW is derived from SPCA. The segments are obtained by bottom-up segmentation using special, PCA related costs. Our novel technique was qualified on two databases, the database of the Signature Verification Competition 2004 and the commonly used AUSLAN dataset. We show that CBDTW outperforms the standard SPCA and the most commonly used, Euclidean distance based multivariate DTW in the case of datasets with complex correlation structure.
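
For orientation, the sketch below shows the Euclidean distance based multivariate DTW that serves as the baseline in the comparison above. It is not CBDTW: as described, CBDTW would warp over segments and replace the local cost with an SPCA-derived dissimilarity, which the hypothetical `local_cost` parameter only hints at.

```python
import math


def dtw_distance(series_a, series_b, local_cost=None):
    """Classic dynamic time warping between two multivariate series."""
    if local_cost is None:
        local_cost = math.dist  # Euclidean distance between two samples
    n, m = len(series_a), len(series_b)
    inf = float("inf")
    # acc[i][j] = minimal accumulated cost of aligning the first i and j samples
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = local_cost(series_a[i - 1], series_b[j - 1])
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


# Two-variable toy series: `stretched` repeats one sample, `shifted` changes the last one.
reference = [(0, 0), (1, 1), (2, 2)]
stretched = [(0, 0), (1, 1), (1, 1), (2, 2)]
shifted = [(0, 0), (1, 1), (3, 2)]
print(dtw_distance(reference, stretched), dtw_distance(reference, shifted))
```

Because the local cost is computed sample by sample, this baseline ignores how the variables co-vary over time, which is the gap the correlation based variant is meant to address.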

Matlab® files

### Graph-based clustering and data visualization algorithms

This Matlab package is written specifically for the book Ágnes Vathy–Fogarassy and János Abonyi: Graph-based clustering and data visualization algorithms. The purpose of the package is to demonstrate a wide range of graph-based clustering and visualization algorithms presented in the book. The package contains graph-based algorithms for vector quantization (e.g. k-means, Neural Gas method, Topology Representing Networks, etc.), for clustering (e.g. Hybrid Minimal Spanning Tree – Gath-Geva algorithm, improved Jarvis-Patrick algorithm, etc.) and for low-dimensional visualization of high-dimensional data sets (e.g. Isomap, Curvilinear Component Analysis, Topology Representing Network Map – TRNMap, etc.).

### Fuzzy Clustering and Data Analysis Toolbox

The toolbox is a collection of Matlab functions that can be used for clustering data with the fuzzy c-means, Gustafson-Kessel and Gath-Geva clustering algorithms. The validity function provides cluster validity measures for each partition. The visualization part of this toolbox provides the modified Sammon mapping of the data. The toolbox is supported by a 77-page manual.
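
The simplest of the three algorithms, fuzzy c-means, can be sketched in a few lines. This is an independent Python illustration with hypothetical names, not the toolbox's Matlab code; the Gustafson-Kessel and Gath-Geva variants, which adapt the distance norm of each cluster, are not shown.

```python
import math


def fuzzy_c_means(points, centers, m=2.0, iterations=50):
    """Plain fuzzy c-means; returns (centers, memberships).

    memberships[i][k] is the degree to which points[i] belongs to cluster k,
    as computed in the last iteration.
    """
    centers = [tuple(c) for c in centers]
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(iterations):
        memberships = []
        for p in points:
            # A tiny floor avoids division by zero when a point sits on a centre
            d = [max(math.dist(p, c), 1e-12) for c in centers]
            memberships.append([
                1.0 / sum((d[k] / d[j]) ** exponent for j in range(len(centers)))
                for k in range(len(centers))
            ])
        # Each centre becomes the mean of all points weighted by u^m
        new_centers = []
        for k in range(len(centers)):
            weights = [u[k] ** m for u in memberships]
            total = sum(weights)
            new_centers.append(tuple(
                sum(w * p[dim] for w, p in zip(weights, points)) / total
                for dim in range(len(points[0]))
            ))
        centers = new_centers
    return centers, memberships


# Two well-separated one-dimensional groups and rough initial centres.
data = [(0.0,), (0.2,), (0.1,), (5.0,), (5.2,), (5.1,)]
centers, memberships = fuzzy_c_means(data, centers=[(1.0,), (4.0,)])
print(centers)
```

Each row of `memberships` sums to one, which is the property that distinguishes a fuzzy partition from a hard assignment.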

### Node Similarity-based Graph Clustering and Visualization

The basis of the presented methods for the visualization and clustering of graphs is a novel similarity and distance metric, and the matrix describing the similarity of the nodes in the graph. This matrix represents the type of connections between the nodes in the graph in a compact form, and thus provides a very good starting point for both the clustering and visualization algorithms. Visualization is therefore performed with the MDS (Multidimensional Scaling) dimensionality reduction technique, which relies on the spectral decomposition of this matrix, while the partitioning builds on the results of this step to generate a hierarchical representation. A detailed example is shown to justify the capability of the described algorithms for clustering and visualization of the link structure of Web sites.

### Feedback Linearizing Control Using Hybrid Neural Networks Identified by Sensitivity Approach

Globally Linearizing Control (GLC) is a control algorithm capable of using a non-linear process model directly. In GLC, first-principles models derived from dynamic mass, energy and momentum balances are mostly used. When the process is not perfectly known, the unknown parts of the first-principles model should be represented by black-box models, e.g. by neural networks. This paper is devoted to the identification and application of such hybrid models for GLC. It is shown that the first-principles part of the model determines the dominant structure of the controller, while the black-box elements of the hybrid model are used as state and/or disturbance estimators. For the identification of the neural network elements of the hybrid model, a sensitivity approach based algorithm has been developed. The underlying framework is illustrated by the temperature control of a continuous stirred tank reactor (CSTR) where a neural network is used to model the heat released by an exothermic chemical reaction.

J. Madár, J. Abonyi, F. Szeifert, Feedback linearizing control using hybrid neural networks identified by sensitivity approach, Engineering Applications of Artificial Intelligence, 343-351, 2005 (MATLAB implementation)

### Fuzzy clustering based time-series segmentation

The changes of the variables of a multivariate time-series are usually vague and do not focus on any particular time point. Therefore, it is not practical to define crisp bounds of the segments. Although fuzzy clustering algorithms are widely used to group overlapping and vague objects, they cannot be directly applied to time-series segmentation, because the clusters need to be contiguous in time. This paper proposes a clustering algorithm for the simultaneous identification of local Probabilistic Principal Component Analysis (PPCA) models used to measure the homogeneity of the segments and fuzzy sets used to represent the segments in time. The algorithm favors contiguous clusters in time and is able to detect changes in the hidden structure of multivariate time-series. A fuzzy decision making algorithm based on a compatibility criterion of the clusters has been worked out to determine the required number of segments, while the required number of principal components is determined by the scree plots of the eigenvalues of the fuzzy covariance matrices. The application example shows that this new technique is a useful tool for the analysis of historical process data.

### Fuzzy Model Identification for Control

This book presents new approaches to the construction of fuzzy models for model-based control. New model structures and identification algorithms are described for the effective use of heterogeneous information in the form of numerical data, qualitative knowledge, and first principle models. The main methods and techniques are illustrated through several simulated examples and real-world applications from chemical and process engineering practice.

### Supervised Fuzzy Clustering for the Identification of Fuzzy Classifiers

The classical fuzzy classifier consists of rules, each one describing one of the classes. In this paper a new fuzzy model structure is proposed where each rule can represent more than one class with different probabilities. The obtained classifier can be considered as an extension of the quadratic Bayes classifier that utilizes a mixture of models for estimating the class conditional densities. A supervised clustering algorithm has been worked out for the identification of this fuzzy model. The relevant input variables of the fuzzy classifier have been selected based on the analysis of the clusters by Fisher's interclass separability criterion. This new approach is applied to the well-known wine and Wisconsin Breast Cancer classification problems.

J. Abonyi, F. Szeifert, Supervised fuzzy clustering for the identification of fuzzy classifiers, Pattern Recognition Letters, 24(14) 2195-2207, October 2003 (MATLAB implementation)

### Incorporating Prior Knowledge in Cubic Spline Approximation - Application to the Identification of Reaction Kinetic Models

Data smoothing and re-sampling are often necessary to handle data obtained from laboratory and industrial experiments. This paper presents a new algorithm for incorporating prior knowledge into spline-smoothing of interrelated multivariate data. Prior knowledge based on the visual inspection of the variables and/or knowledge about the assumed balance equations can be transformed into linear equality and inequality constraints on the parameters of the splines. The splines can then be simultaneously identified from the available data by solving one quadratic programming problem. To demonstrate the applicability of the method, two examples are given. In the first example, the proposed approach has been applied to the identification of kinetic parameters of a simulated reaction network, while in the second example data taken from an industrial batch reactor is analyzed. The results show that, when the proposed constrained spline-smoothing algorithm is applied, not only is a better fit to the data points achieved, but the estimation of the kinetic parameters also improves compared with the case where no prior knowledge is involved.

J. Madár, J. Abonyi, H. Roubos, F. Szeifert, Incorporating prior knowledge in cubic spline approximation - Application to the identification of reaction kinetic models, Industrial and Engineering Chemistry Research, 1-6, 2003 (MATLAB implementation)

### Star plots - MATLAB files for Graphical Representation of trace elements of clinkers

The trace element content of clinkers (and possibly of cements) can be used for qualitative identification (i.e. of the manufacturing factory). In this program a graphical method is presented to facilitate the visualisation of the trace element content.

Ferenc D. Tamás, János Abonyi, Trace elements in clinker I. - A graphical representation, Cement and Concrete Research, 32/8, 1319-1323, 2002 (MATLAB implementation)

### Qualitative identification of clinkers by fuzzy clustering - Fuzzy clustering of trace elements of clinkers

With this code, a fuzzy classifier is identified by unsupervised fuzzy clustering. The most relevant trace elements were selected based on the obtained clusters by a modified version of the Fisher interclass separability method. The classification of Portuguese and South African clinkers is used as an illustrative example. The results show that the proposed method is useful for identifying compact classifiers that are able to determine the origin of the clinker, and that the obtained classifier is easy to use and interpret for engineers and researchers, even when they are not familiar with the concept of fuzzy logic.

Ferenc D. Tamás, János Abonyi, Trace elements in clinker II. – Qualitative identification by fuzzy clustering, Cement and Concrete Research, 32/8, 1325-1330, 2002 (MATLAB implementation)

### Compact TS-Fuzzy Models through Clustering and OLS plus FIS Model Reduction

The construction of interpretable Takagi-Sugeno (TS) fuzzy models by means of clustering is addressed. First, it is shown how the antecedent fuzzy sets and the corresponding consequent parameters of the TS model can be derived from clusters obtained by the Gath-Geva algorithm. To preserve the partitioning of the antecedent space, linearly transformed input variables can be used in the model. This may, however, complicate the interpretation of the rules. To form an easily interpretable model that does not use the transformed input variables, a new clustering algorithm is proposed, based on the Expectation Maximization (EM) identification of Gaussian mixture models. The most relevant consequent variables of the TS model are selected by an orthogonal least squares method based on the obtained clusters. For the selection of the relevant antecedent (scheduling) variables a new method has been developed based on Fisher's interclass separability criterion. This new technique is applied to two well-known benchmark problems: the MPG (miles per gallon) prediction and a simulated second-order nonlinear process. The obtained results are compared with results from the literature.

J. Abonyi, J.A. Roubos, M. Oosterom, F. Szeifert, Compact TS-Fuzzy models through clustering and OLS plus FIS model reduction, FUZZ-IEEE'01 Conference, Sydney, Australia, 1420-1423, 2001, (MATLAB implementation)

### Fuzzy Modeling with Multidimensional Membership Functions: Grey-Box Identification and Control Design

A novel framework for fuzzy modeling and model-based control design is worked out. The fuzzy model is of the Takagi-Sugeno type with constant consequents. It uses multidimensional antecedent membership functions obtained by Delaunay triangulation of their characteristic points. The number and position of these points are determined by an iterative insertion algorithm. Constrained optimization is used to estimate the consequent parameters, where the constraints are based on control-relevant *a priori* knowledge about the modeled process. Finally, methods for control design through linearization and inversion of this model are developed. The proposed techniques are demonstrated by means of two benchmark examples: identification of the well-known Box-Jenkins gas furnace and inverse model-based control of a pH process. The obtained results are compared with results from the literature.

J. Abonyi, R. Babuska, F. Szeifert, Fuzzy modeling with multidimensional membership functions: Gray box identification and control design, IEEE Systems, Man and Cybernetics, Part B, 755-767, Oct, 2001 (MATLAB implementation)

### Constrained Fuzzy Model Identification - Files for Fuzzy Modeling and Identification Toolbox

The Fuzzy Modeling and Identification (FMID) toolbox is a collection of Matlab functions for the construction of Takagi-Sugeno (TS) fuzzy models from data. The proposed extension is a collection of functions that can be used for estimation of the consequent model parameters. As only the weights of the fuzzy rules, the consequent input variables, and the desired model output are needed for the determination of the consequent parameters, the presented functions can be used independently from the FMID toolbox. Moreover, by setting the number of rules to one, the toolbox is suitable for constrained identification of linear models.

J. Abonyi, R. Babuska, H. B. Verbruggen, F. Szeifert, Incorporating prior knowledge in fuzzy model identification, Int. Journal of Systems Science, 31(5), 657-667, 2000 (MATLAB implementation, Constrained Fuzzy Model Identification Toolbox manual)

### Identification and Control of Nonlinear Systems Using Fuzzy Hammerstein Models

This software addresses the identification and control of nonlinear systems by means of Fuzzy Hammerstein (FH) models, which consist of a static fuzzy model connected in series with a linear dynamic model. For the identification of nonlinear dynamic systems with the proposed FH models, two methods are proposed. The first one is an alternating optimization algorithm that iteratively refines the estimate of the linear dynamics and the parameters of the static fuzzy model. The second method estimates the parameters of the nonlinear static model and of the linear dynamic model simultaneously by using a constrained recursive least-squares algorithm. The obtained FH model is incorporated in a model-based predictive control scheme and a new constraint-handling method is presented. A simulated water-heater process is used as an illustrative example. A comparison with an affine neural network and a linear model is given. Simulation results show that the proposed FH modeling approach is useful for modular parsimonious modeling and model-based control of nonlinear systems.
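
The series structure of a Fuzzy Hammerstein model can be sketched as a static fuzzy map followed by linear dynamics. The sketch assumes triangular membership functions with constant rule consequents and first-order linear dynamics; all names and parameter values are hypothetical, and neither the identification algorithms nor the predictive controller are shown.

```python
def triangular_memberships(u, knots):
    """Membership degrees of u in triangular fuzzy sets centred on sorted knots."""
    u = min(max(u, knots[0]), knots[-1])  # clamp to the covered input range
    degrees = [0.0] * len(knots)
    for i in range(len(knots) - 1):
        left, right = knots[i], knots[i + 1]
        if left <= u <= right:
            degrees[i + 1] = (u - left) / (right - left)
            degrees[i] = 1.0 - degrees[i + 1]
            break
    return degrees


def simulate_hammerstein(inputs, knots, consequents, a, b, y0=0.0):
    """Static fuzzy map v = f(u) followed by y[k+1] = a*y[k] + b*v[k]."""
    y = [y0]
    for u in inputs:
        degrees = triangular_memberships(u, knots)
        v = sum(d * c for d, c in zip(degrees, consequents))  # weighted rule outputs
        y.append(a * y[-1] + b * v)
    return y


# Three rules roughly approximating v = u^2 on [0, 2], then a stable first-order lag.
knots = [0.0, 1.0, 2.0]
consequents = [0.0, 1.0, 4.0]
response = simulate_hammerstein([1.0] * 50, knots, consequents, a=0.5, b=0.5)
print(response[-1])
```

Separating the static nonlinearity from the linear dynamics is what makes the structure modular: either block can be re-estimated while the other is held fixed, as in the alternating optimization described above.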

J. Abonyi, R. Babuska, M. Ayala Botto, F. Szeifert, L. Nagy, Identification and control of nonlinear systems using fuzzy Hammerstein models, Industrial and Engineering Chemistry Research, 39, 4302-4314, 2000 (MATLAB implementation)

### Fisher information matrix based time-series segmentation of process data

Advanced chemical process engineering tools, like model predictive control or soft sensor solutions, require proper process models. Parameter identification of these models needs input–output data with high information content. When model based optimal experimental design techniques cannot be applied, the extraction of informative segments from historical data can also support system identification. We developed a goal-oriented Fisher information based time-series segmentation algorithm, aimed at selecting informative segments from historical process data. The utilized standard bottom-up algorithm is widely used in off-line analysis of process data. Different segments can support the identification of different parameter sets. Hence, instead of using either D- or E-optimality as the criterion for comparing the information content of two input sequences (neighbouring segments), we propose the use of Krzanowski's similarity coefficient between the eigenvectors of the Fisher information matrices obtained from the sequences. The efficiency of the proposed methodology is demonstrated by two application examples. The algorithm is capable of extracting segments with parameter-set specific information content from historical process data.
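
Krzanowski's similarity coefficient compares two subspaces through the squared inner products of their basis vectors. The sketch below assumes the leading eigenvectors of the two Fisher information matrices have already been computed and are orthonormal; dividing by the subspace dimension, so that the result lies between 0 and 1, is an assumption of this sketch.

```python
def krzanowski_similarity(basis_a, basis_b):
    """Similarity of two subspaces given by orthonormal basis vectors.

    Equals trace(A^T B B^T A) / k, where the columns of A and B are the k
    basis vectors of each subspace.
    """
    if len(basis_a) != len(basis_b):
        raise ValueError("both subspaces must have the same dimension")
    total = sum(
        sum(x * y for x, y in zip(u, v)) ** 2
        for u in basis_a
        for v in basis_b
    )
    return total / len(basis_a)


# Hypothetical two-dimensional eigenvector subspaces in a three-parameter space.
basis_xy = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
basis_yz = [(0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
print(krzanowski_similarity(basis_xy, basis_xy), krzanowski_similarity(basis_xy, basis_yz))
```

In a bottom-up segmentation, a value near 1 for two neighbouring segments suggests that they carry information about the same parameter directions and are candidates for merging.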

### Genetic Programming MATLAB Toolbox

Linear-in-parameters models are quite widespread in process engineering, e.g. NAARX, polynomial ARMA models, etc. Genetic Programming (GP) is able to generate nonlinear input-output models of dynamical systems that are represented in a tree structure. This GP-OLS toolbox applies the Orthogonal Least Squares (OLS) algorithm to estimate the contribution of the branches of the tree to the accuracy of the model. This method results in more robust and interpretable models than the classical GP method.
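
The contribution of each term in a linear-in-parameters model is commonly measured by its error reduction ratio, obtained by orthogonalizing the regressors one after another. The sketch below uses classical Gram-Schmidt orthogonalization on regressor columns that stand in for evaluated tree branches; it is an independent illustration with hypothetical names, not the toolbox's implementation.

```python
def error_reduction_ratios(regressors, target):
    """Error reduction ratio of each regressor via classical Gram-Schmidt OLS.

    regressors: list of columns (lists of floats) in a fixed order; the columns
    must be linearly independent, otherwise a division by zero occurs.
    """
    def dot(x, y):
        return sum(a * b for a, b in zip(x, y))

    orthogonal, ratios = [], []
    target_energy = dot(target, target)
    for column in regressors:
        # Remove the components already explained by earlier regressors
        w = list(column)
        for q in orthogonal:
            coeff = dot(q, column) / dot(q, q)
            w = [wi - coeff * qi for wi, qi in zip(w, q)]
        g = dot(w, target) / dot(w, w)
        ratios.append(g * g * dot(w, w) / target_energy)
        orthogonal.append(w)
    return ratios


# Hypothetical branch outputs; the target depends only on the first one.
branch_1 = [1.0, -1.0, 1.0, -1.0]
branch_2 = [1.0, 1.0, -1.0, -1.0]
output = [2.0 * value for value in branch_1]
ratios = error_reduction_ratios([branch_1, branch_2], output)
print(ratios)
```

A branch with a ratio close to zero adds little to the accuracy of the model and is a natural candidate for pruning.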

### Interactive Evolutionary Computing (EASY-IEC) MATLAB Toolbox

In some real-life optimization problems the objectives are often non-commensurable and are explicitly/mathematically not available. Interactive Evolutionary Computation (IEC) can effectively handle these problems.

### A Simple Fuzzy Classifier based on Inconsistency Analysis of Labeled Data

An extremely simple fuzzy classifier is identified based on the inconsistency analysis of labelled training data. The method was applied to the COIL challenge 2000 Direct Mail problem and resulted in 121 selected caravan policies within the first 800 selected customers. As this result is identical to the result of the winner of the competition, the presented method is an example of how the "try the simplest first" approach can be effective in real-life problems.

J. Abonyi, H. Roubos, Simple fuzzy classifier based on inconsistency analysis of labeled data, Chapter 12 in: CoIL Challenge 2000: The Insurance Company Case, Peter van der Putten and Maarten van Someren (eds), Sentient Machine Research, Amsterdam and Leiden Institute of Advanced Computer Science, Leiden, LIACS Technical Report, 1-10, 2000 (MATLAB implementation)