# Clustering

## Scalable co-clustering

Production flow analysis includes various families of components and groups of machines. Machine-part cell formation means the optimal design of manufacturing cells consisting of similar machines producing similar products from a similar set of components. Most of the algorithms reorders of the machine-part incidence matrix. We generalize this classical concept to handle more than two elements of the production process (e.g. machine - part - product - resource - operator). The application of this extended concept requires an efficient optimization algorithm for the simultaneous grouping these elements. For this purpose, we propose a novel co-clustering technique based on crossing minimization of layered bipartite graphs. The present method has been implemented as a MATLAB toolbox. The efficiency of the proposed approach and developed tools is demonstrated by realistic case studies. The log-linear scalability of the algorithm is proven theoretically and experimentally.

Scalable co-Clustering using a Crossing Minimization‒Application to Production Flow Analysis

C Pigler, Á Fogarassy-Vathy, J Abonyi, Acta Polytechnica Hungarica 13 (2)

## Fuzzy clustering for nonlinear regression

Takagi-Sugeno (TS) models formed by logical rules consist of a fuzzy antecedent and a mathematical function as consequent part. The construction of a TS model is usually done in two steps. In the first step, the fuzzy sets in the rule antecedents are determined that partition the input space into a number of fuzzy regions. In the second step, the rule consequents are determined which means identification of (usually linear) models. TS fuzzy model identification is a complex task; there are non-trivial problems as follows: (1) how to automatically partition the input space, (2) how many fuzzy rules are really needed for properly approximating an unknown nonlinear system, and (3) how to construct a fuzzy system from data examples automatically. These problems can be partially solved by the recent developments of fuzzy systems.

We recognized that fuzzy clustering algorithms are able to automatically divide the input space, and developed clustering algorithm that fits a local models beside the clustering simultaneously. Transparency of the model is enchanced by the tree representation. The number of rules can be given by the user or it can be identified by validation. Comparing with well-known methods it can be determined that the developed method gives the most transparent results at similar accuracy to these methods.

## Improvement of Jarvis-Patrick clustering

Jarvis-Patrick clustering utilizes the nearest neighbor approach to cluster the objects. The main disadvantages of the Jarvis-Patrick clustering are that it utilizes a very rigid decision criterion to classify the objects, and this decision criterion is only confined to the k nearest neighbors. To solve these problems a new similarity measure based on the nearest neighbors of the objects has been defined. This similarity measure is not restricted to the direct neighbors, but it can also take into account objects that are further away. Furthermore, the suggested similarity measure fuzzifies the crisp decision criterion of the Jarvis-Patrick algorithm. The combination of the proposed similarity measure with hierarchical clustering methods provides an effective tool for exploring groups of data.

A. Vathy-Fogarassy, A. Kiss, J. Abonyi. Improvement of Jarvis-Patrick clustering based on fuzzy similarity. In Lecture Notes in Computer Science: Applications of Fuzzy Sets Theory, volume 4578/2007, pages 195–202. Springer Berlin / Heidelberg, 2007.

## Minimal spanning tree and Gath-Geva clustering

A novel clustering algorithm based on the minimal spanning tree of the objects and the Gath-Geva clustering method has been proposed. Graph based clustering algorithms find groups of objects by eliminating inconsistent edges of the graph representing the data set to be analyzed. The resulting subgraphs yield clusters. However, due to the huge variety of problems and data, it is a difficult challenge to identify the inconsistent edges of graphs. To solve this problem I have suggested a new cutting process of graphs, that iteratively finds the best partitions based on the measure of the fuzzy hyper volume of clusters. Based on this cutting criterion I have also suggested a new graph based clustering algorithm, which is an effective combination of the minimal spanning tree based clustering and the partitional Gath-Geva algorithm. The suggested algorithm is able to solve the typical problem of the graph based clustering algorithms (chaining effect) and it also solves the initialization problem of the Gath-Geva algorithm. The resulting clusters of the proposed algorithm are easily interpretable with a compact parametric description.

A. Vathy-Fogarassy, A. Kiss, J. Abonyi. Improvement of Jarvis-Patrick clustering based on fuzzy similarity. In Lecture Notes in Computer Science: Applications of Fuzzy Sets Theory, volume 4578/2007, pages 195–202. Springer Berlin / Heidelberg, 2007.

## Fuzzy clustering based time-series segmentation

Changes of the variables of a multivariate time-series are usually vague and do not focus on any particular time point. Therefore, it is not practical to define crisp bounds of the segments. Although fuzzy clustering algorithms are widely used to group overlapping and vague objects, they cannot be directly applied to time-series segmentation, because the clusters need to be contiguous in time. This paper proposes a clustering algorithm for the simultaneous identification of local Probabilistic Principal Component Analysis (PPCA) models used to measure the homogeneity of the segments and fuzzy sets used to represent the segments in time. The algorithm favors contiguous clusters in time and able to detect changes in the hidden structure of multivariate time-series. A fuzzy decision making algorithm based on a compatibility criteria of the clusters have been worked out to determine the required number of segments, while the required number of principal components are determined by the screeplots of the eigenvalues of the fuzzy covariance matrices. The application example shows that this new technique is a useful tool for the analysis of historical process data.

J. Abonyi, B. Feil, Cluster analysis for data mining and system identification, Birkhauser, 2007, 300 pages

J. Abonyi, R. Babuska, F. Szeifert, Modified gath-geva fuzzy clustering for identification of Takagi-Sugeno fuzzy models, IEEE Trans. on Systems, Man and Cybernetics, Part B, 612-621, Oct, 2002, IF 0.63

J. Abonyi, B. Feil, S. Nemeth, P. Arva, Modiﬁed gath–geva clustering for fuzzy segmentation of multivariate time-series, Fuzzy Sets and Systems 149 (2005) 39–56

## Clustering based model order selection of input-output models

In the first step of data-driven modeling the structure of the model should be identified or selected. One of the most often used techniques is input-output models. If the input and output order of the model is not known beforehand, it should be identified. We recognized that the false nearest neighbor method (FNN), which is usually used for this purpose, can be improved using clustering because time demanding model identification steps can be omitted. We recognized that model orders can be estimated accurately without FNN on the only basis of clustering results with the analysis of cluster covariance matrices, so other time consuming computational steps, searching for nearest neighbors, can be avoided. The developed algorithms gave accurate results linear as well as nonlinear cases (polymerization and Van der Vusse reactors).

The performance of the proposed method is comparable to the method based on advanced optimization algorithms (combination of Genetic Programming and Orthogonal Least Squares techniques).

There is an analogous problem related to autonomous systems as well. The proper selection of the state space dimension is a challenge by chaotic systems. As a solution, a novel method has been worked out with its help three tasks can be solved simultaneously utilizing clustering results: (1) selection of the dimension of the state space, (2) estimation of the dimension of the manifold that data extend, and (3) identification of a model that can be used for prediction. It can be determined from the examples that the developed method gave accurate results by the autonomous systems with known dimensions.

J. Madar, J. Abonyi, F. Szeifert, Genetic programming for the identification of nonlinear input-output models, Industrial and Engineering Chemistry Research, 44, 3178-3186, 2005, IF: 1.29

B. Feil, J. Abonyi, F. Szeifert, Model order selection of nonlinear input-output models – A clustering based approach, Journal of Process Control, Volume 14, Issue 6, 593-602, IF: 1.241 2004.

## Visualization of fuzzy clustering results

Since in practical data mining problems high-dimensional data are clustered, the resulting clusters are high-dimensional geometrical objects, which are difﬁcult to analyze and interpret. Cluster validity measures try to solve this problem by providing a single numerical value. As a low dimensional graphical representation of the clusters could be much more informative than such a single value, this paper proposes a new tool for the visualization of fuzzy clustering results. By using the basic properties of fuzzy clustering algorithms, this new tool maps the cluster centers and the data such that the distances between the clusters and the data-points are preserved. During the iterative mapping process, the algorithm uses the membership values of the data and minimizes an objective function similar to the original clustering algorithm. Comparing to the original Sammon mapping not only reliable cluster shapes are obtained but the numerical complexity of the algorithm is also drastically reduced. The developed tool has been applied for visualization of reconstructed phase space trajectories of chaotic systems. The case study demonstrates that proposed FUZZSAMM algorithm is a useful tool in user-guided clustering.

B. Feil, B. Balasko, J. Abonyi, Visualization of fuzzy clusters by fuzzy sammon mapping projection – Application to the analysis of phase space trajectories, special issue of "Soft Computing" on "Soft Computing for Information Mining", 11, (5), 479-488,. 2006 (IF: 0.33)

J. Abonyi, R. Babuska, FUZZSAM - Visualization of fuzzy clustering results by modified Sammon mapping, 13th International Conference on Fuzzy Systems, Budapest, Hungary, 2004.07.25-2004.07.29. New York: IEEE Press, 2004. pp. 365-370.

## Fuzzy clustering for selecting structure of nonlinear models with mixed discrete and continuous inputs

A method for selecting regressors in nonlinear models with mixed discrete (categorical) and continuous inputs is proposed. Given a set of input-output data and an initial superset of potential inputs, the relevant inputs are selected by a model-free search algorithm. Fuzzy clustering is used to quantize continuous data into subsets that can be handled in a similar way as discrete data. Two simulation examples and one real-world data set are included to illustrate the performance of the proposed method and compare it with the performance of regression trees. For small to medium size problems (up to 15 candidate inputs), the proposed method works effectively. For larger problems, the computational load becomes too high.

Girimonte D, Babuska R, Abonyi J, Fuzzy clustering for selecting structure of nonlinear models with mixed discrete and continuous inputs, 13th International Conference on Fuzzy Systems, Budapest, Hungary, 2004.07.25-2004.07.29. New York: IEEE Press, 2004. pp. 383-387. (ISBN:0-7803-8353-2).

## Supervised Fuzzy Clustering for the Identification of Fuzzy Classifiers

The classical fuzzy classifier consists of rules each one describing one of the classes. In this paper a new fuzzy model structure is proposed where each rule can represent more than one classes with different probabilities. The obtained classifier can be considered as an extension of the quadratic Bayes classifier that utilizes mixture of models for estimating the class conditional densities. A supervised clustering algorithm has been worked out for the identification of this fuzzy model. The relevant input variables of the fuzzy classifier have been selected based on the analysis of the clusters by Fisher's interclass separability criteria. This new approach was applied to the well-known wine and Wisconsin Breast Cancer classification problems.

J. Abonyi, F. Szeifert, Supervised fuzzy clustering for the identification of fuzzy classifiers, Pattern recognition letters, 24(14) 2195-2207, October 2003 (MATLAB implementation)

## Application of fuzzy clustering and piezoelectric chemical sensor array for investigation on organic compounds

The fuzzy c-means (FCM) clustering models were used for the discrimination of organic compounds using piezoelectric chemical sensor array data of 14 analytes. Appropriate clusters are found by the sum of the weighted quadratic distances between data points and cluster prototypes. A priori known information can be integrated into the clustering algorithm by using constrained prototypes. A sensor array was built using piezoelectric quartz crystal sensors. Four AT-cut quartz crystals with 9MHz fundamental frequencies were applied. Sensing materials were OV1, OV275, ASI50, and polyphenil-ether. The appropriate coating materials were found by principal component analysis. The application of the fuzzy clustering method has been proved to be a reliable way of identifying similar, pure organic compounds.

Gy. Barkó, J. Abonyi, J. Hlavay, Application of fuzzy clustering and piezoelectric chemical sensor array for investigation on organic compounds, Analytica Chimica Acta 398 (1999) 219–226

## Fuzzy clustering and data analysis toolbox

The toolbox is a collection of Matlab functions can be used to culstering of data by fuzzy c-means, Gustafson - Kessel, Gath-Geva clustering algorithms.

The validity function provides cluster validity measures for each partition. The Visualization part of this toolbox provides the modified Sammon mapping of the data.The toolbox is supported by a 77 pages manual.