Clustering

Geodesic distance based fuzzy c-medoid clustering – searching for central points in graphs and high dimensional data

Clustering high dimensional data and identifying central nodes in a graph are complex and computationally expensive tasks. We utilize k-nn graph of high dimensional data as efficient representation of the hidden structure of the clustering problem. Initial cluster centers are determined by graph centrality measures. Cluster centers are fine-tuned by minimizing fuzzy-weighted geodesic distances. The shortest-path based representation is parallel to the concept of transitive closure. Therefore, our algorithm is capable to cluster networks or even more complex and abstract objects based on their partially known pairwise similarities. The algorithm is proven to be effective to identify senior researchers in a co-author network, central cities in topographical data, and clusters of documents represented by high dimensional feature vectors.

Király, András, Ágnes Vathy-Fogarassy, and János Abonyi. "Geodesic distance based fuzzy c-medoid clustering–searching for central points in graphs and high dimensional data." Fuzzy Sets and Systems 286 (2016): 157-172.

Scalable co-clustering

Production flow analysis includes various families of components and groups of machines. Machine-part cell formation means the optimal design of manufacturing cells consisting of similar machines producing similar products from a similar set of components. Most of the algorithms reorders of the machine-part incidence matrix. We generalize this classical concept to handle more than two elements of the production process (e.g. machine - part - product - resource - operator). The application of this extended concept requires an efficient optimization algorithm for the simultaneous grouping these elements. For this purpose, we propose a novel co-clustering technique based on crossing minimization of layered bipartite graphs. The present method has been implemented as a MATLAB toolbox. The efficiency of the proposed approach and developed tools is demonstrated by realistic case studies. The log-linear scalability of the algorithm is proven theoretically and experimentally.

Fuzzy clustering for nonlinear regression

Takagi-Sugeno (TS) models formed by logical rules consist of a fuzzy antecedent and a mathematical function as consequent part. The construction of a TS model is usually done in two steps. In the first step, the fuzzy sets in the rule antecedents are determined that partition the input space into a number of fuzzy regions. In the second step, the rule consequents are determined which means identification of (usually linear) models. TS fuzzy model identification is a complex task; there are non-trivial problems as follows: (1) how to automatically partition the input space, (2) how many fuzzy rules are really needed for properly approximating an unknown nonlinear system, and (3) how to construct a fuzzy system from data examples automatically. These problems can be partially solved by the recent developments of fuzzy systems.

We recognized that fuzzy clustering algorithms are able to automatically divide the input space, and developed clustering algorithm that fits a local models beside the clustering simultaneously. Transparency of the model is enchanced by the tree representation. The number of rules can be given by the user or it can be identified by validation. Comparing with well-known methods it can be determined that the developed method gives the most transparent results at similar accuracy to these methods.

Improvement of Jarvis-Patrick clustering

Jarvis-Patrick clustering utilizes the nearest neighbor approach to cluster the objects. The main disadvantages of the Jarvis-Patrick clustering are that it utilizes a very rigid decision criterion to classify the objects, and this decision criterion is only confined to the k nearest neighbors. To solve these problems a new similarity measure based on the nearest neighbors of the objects has been defined. This similarity measure is not restricted to the direct neighbors, but it can also take into account objects that are further away. Furthermore, the suggested similarity measure fuzzifies the crisp decision criterion of the Jarvis-Patrick algorithm. The combination of the proposed similarity measure with hierarchical clustering methods provides an effective tool for exploring groups of data.

A. Vathy-Fogarassy, A. Kiss, J. Abonyi. Improvement of Jarvis-Patrick clustering based on fuzzy similarity. In Lecture Notes in Computer Science: Applications of Fuzzy Sets Theory, volume 4578/2007, pages 195–202. Springer Berlin / Heidelberg, 2007.

Minimal spanning tree and Gath-Geva clustering

A novel clustering algorithm based on the minimal spanning tree of the objects and the Gath-Geva clustering method has been proposed. Graph based clustering algorithms find groups of objects by eliminating inconsistent edges of the graph representing the data set to be analyzed. The resulting subgraphs yield clusters. However, due to the huge variety of problems and data, it is a difficult challenge to identify the inconsistent edges of graphs. To solve this problem I have suggested a new cutting process of graphs, that iteratively finds the best partitions based on the measure of the fuzzy hyper volume of clusters. Based on this cutting criterion I have also suggested a new graph based clustering algorithm, which is an effective combination of the minimal spanning tree based clustering and the partitional Gath-Geva algorithm. The suggested algorithm is able to solve the typical problem of the graph based clustering algorithms (chaining effect) and it also solves the initialization problem of the Gath-Geva algorithm. The resulting clusters of the proposed algorithm are easily interpretable with a compact parametric description.

Fuzzy clustering based time-series segmentation

Changes of the variables of a multivariate time-series are usually vague and do not focus on any particular time point. Therefore, it is not practical to define crisp bounds of the segments. Although fuzzy clustering algorithms are widely used to group overlapping and vague objects, they cannot be directly applied to time-series segmentation, because the clusters need to be contiguous in time. This paper proposes a clustering algorithm for the simultaneous identification of local Probabilistic Principal Component Analysis (PPCA) models used to measure the homogeneity of the segments and fuzzy sets used to represent the segments in time. The algorithm favors contiguous clusters in time and able to detect changes in the hidden structure of multivariate time-series. A fuzzy decision making algorithm based on a compatibility criteria of the clusters have been worked out to determine the required number of segments, while the required number of principal components are determined by the screeplots of the eigenvalues of the fuzzy covariance matrices. The application example shows that this new technique is a useful tool for the analysis of historical process data.

Cluster Analysis for Data Mining and System Identification

2007th Edition

The aim of this book is to illustrate that advanced fuzzy clustering algorithms can be used not only for partitioning of the data. It can also be used for visualization, regression, classification and time-series analysis, hence fuzzy cluster analysis is a good approach to solve complex data mining and system identification problems. This book is oriented to undergraduate and postgraduate and is well suited for teaching purposes.

J. Abonyi, B. Feil, Cluster analysis for data mining and system identification, Birkhauser, 2007, 300 pages

Modified gath-geva fuzzy clustering for identification of Takagi-Sugeno fuzzy models

Abstract—The construction of interpretable Takagi-Sugeno (TS) fuzzy models by means of clustering is addressed. First, it is shown how the antecedent fuzzy sets and the corresponding consequent parameters of the TS model can be derived from clusters obtained by the Gath-Geva (GG) algorithm. To preserve the partitioning of the antecedent space, linearly transformed input variables can be used in the model. This may, however, complicate the interpretation of the rules. To form an easily interpretable model that does not use the transformed input variables, a new clustering algorithm is proposed, based on the expectation-maximization (EM) identification of Gaussian mixture models. This new technique is applied to two well-known benchmark problems: the MPG (miles per gallon) prediction and a simulated second-order nonlinear process. The obtained results are compared with results from the literature.

Modified Gath–Geva clustering for fuzzy segmentation of multivariate time-series

Partitioning a time-series into internally homogeneous segments is an important data-mining problem. The changes of the variables of a multivariate time-series are usually vague and do not focus on any particular time point. Therefore, it is not practical to define crisp bounds of the segments. Although fuzzy clustering algorithms are widely used to group overlapping and vague objects, they cannot be directly applied to time-series segmentation, because the clusters need to be contiguous in time. This paper proposes a clustering algorithm for the simultaneous identification of local probabilistic principal component analysis (PPCA) models used to measure the homogeneity of the segments and fuzzy sets used to represent the segments in time. The algorithm favors contiguous clusters in time and is able to detect changes in the hidden structure of multivariate time-series. A fuzzy decision making algorithm based on a compatibility criteria of the clusters has been worked out to determine the required number of segments, while the required number of principal components are determined by the screeplots of the eigenvalues of the fuzzy covariance matrices. The application example shows that this new technique is a useful tool for the analysis of historical process data.

Genetic programming for the identification of nonlinear input-output models

Linear-in-parameters models are quite widespread in process engineering, e.g., nonlinear additive autoregressive models, polynomial ARMA models, etc. This paper proposes a new method for the structure selection of these models. The method uses genetic programming to generate nonlinear input-output models of dynamical systems that are represented in a tree structure. The main idea of the paper is to apply the orthogonal least squares (OLS) algorithm to estimate the contribution of the branches of the tree to the accuracy of the model. This method results in more robust and interpretable models. The proposed approach has been implemented as a freely available MATLAB Toolbox, www.fmt.veim.hu/softcomp. The simulation results show that the developed tool provides an efficient and fast method for determining the order and structure for nonlinear input-output models.

Model order selection of nonlinear input–output models––a clustering based approach

Selecting the order of an input–output model of a dynamical system is a key step toward the goal of system identification. The false nearest neighbors algorithm (FNN) is a useful tool for the estimation of the order of linear and nonlinear systems. While advanced FNN uses nonlinear input–output data-based models for the model-based selection of the threshold constant that is used to compute the percentage of false neighbors, the computational effort of the method increases along with the number of data and the dimension of the model. To increase the efficiency of this method, in this paper we propose a clustering-based algorithm. Clustering is applied to the product space of the input and output variables. The model structure is then estimated on the basis of the cluster covariance matrix eigenvalues. The main advantage of the proposed solution is that it is model-free. This means that no particular model needs to be constructed in order to select the order of the model, while most other techniques are wrapped’ around a particular model construction method. This saves the computational effort and avoids a possible bias due to the particular construction method used. Three simulation examples are given to illustrate the proposed technique: estimation of the model structure for a linear system, a polymerization reactor and the van der Vusse reactor.

Visualization of fuzzy clusters by fuzzy sammon mapping projection – Application to the analysis of phase space trajectories

Since in practical data mining problems high-dimensional data are clustered, the resulting clusters are high-dimensional geometrical objects, which are difficult to analyze and interpret. Cluster validity measures try to solve this problem by providing a single numerical value. As a low dimensional graphical representation of the clusters could be much more informative than such a single value, this paper proposes a new tool for the visualization of fuzzy clustering results. By using the basic properties of fuzzy clustering algorithms, this new tool maps the cluster centers and the data such that the distances between the clusters and the data-points are preserved. During the iterative mapping process, the algorithm uses the membership values of the data and minimizes an objective function similar to the original clustering algorithm. Comparing to the original Sammon mapping not only reliable cluster shapes are obtained but the numerical complexity of the algorithm is also drastically reduced. The developed tool has been applied for visualization of reconstructed phase space trajectories of chaotic systems. The case study demonstrates that proposed FUZZSAMM algorithm is a useful tool in user-guided clustering.

FUZZSAM – Visualization of fuzzy clusters by fuzzy sammon mapping projection – Application to the analysis of phase space trajectories

Since in practical data mining problems high-dimensional data are clustered, the resulting clusters are high-dimensional geometrical objects, which are difficult to analyze and interpret. Cluster validity measures try to solve this problem by providing a single numerical value. As a low dimensional graphical representation of the clusters could be much more informative than such a single value, this paper proposes a new tool for the visualization of fuzzy clustering results. By using the basic properties of fuzzy clustering algorithms, this new tool maps the cluster centers and the data such that the distances between the clusters and the data-points are preserved. During the iterative mapping process, the algorithm uses the membership values of the data and minimizes an objective function similar to the original clustering algorithm. Comparing to the original Sammon mapping not only reliable cluster shapes are obtained but the numerical complexity of the algorithm is also drastically reduced. The algorithm has been applied to several data sets and the numerical results show performance superior to Principal Component Analysis and the classical Sammon mapping based projection. The examples demonstrate that proposed FUZZSAMM algorithm is a useful tool in user-guided clustering.

Fuzzy clustering for selecting structure of nonlinear models with mixed discrete and continuous inputs

A method for selecting regressors in nonlinear models with mixed discrete (categorical) and continuous inputs is proposed. Given a set of input-output data and an initial superset of potential inputs, the relevant inputs are selected by a model-free search algorithm. Fuzzy clustering is used to quantize continuous data into subsets that can be handled in a similar way as discrete data. Two simulation examples and one real-world data set are included to illustrate the performance of the proposed method and compare it with the performance of regression trees. For small to medium size problems (up to 15 candidate inputs), the proposed method works effectively. For larger problems, the computational load becomes too high.

Girimonte D, Babuska R, Abonyi J, Fuzzy clustering for selecting structure of nonlinear models with mixed discrete and continuous inputs, 13th International Conference on Fuzzy Systems, Budapest, Hungary, 2004.07.25-2004.07.29. New York: IEEE Press, 2004. pp. 383-387. (ISBN:0-7803-8353-2).

Supervised Fuzzy Clustering for the Identification of Fuzzy Classifiers

The classical fuzzy classifier consists of rules each one describing one of the classes. In this paper a new fuzzy model structure is proposed where each rule can represent more than one classes with different probabilities. The obtained classifier can be considered as an extension of the quadratic Bayes classifier that utilizes mixture of models for estimating the class conditional densities. A supervised clustering algorithm has been worked out for the identification of this fuzzy model. The relevant input variables of the fuzzy classifier have been selected based on the analysis of the clusters by Fisher's interclass separability criteria. This new approach was applied to the well-known wine and Wisconsin Breast Cancer classification problems.

Application of fuzzy clustering and piezoelectric chemical sensor array for investigation on organic compounds

The fuzzy c-means (FCM) clustering models were used for the discrimination of organic compounds using piezoelectric chemical sensor array data of 14 analytes. Appropriate clusters are found by the sum of the weighted quadratic distances between data points and cluster prototypes. A priori known information can be integrated into the clustering algorithm by using constrained prototypes. A sensor array was built using piezoelectric quartz crystal sensors. Four AT-cut quartz crystals with 9MHz fundamental frequencies were applied. Sensing materials were OV1, OV275, ASI50, and polyphenil-ether. The appropriate coating materials were found by principal component analysis. The application of the fuzzy clustering method has been proved to be a reliable way of identifying similar, pure organic compounds.

Gy. Barkó, J. Abonyi, J. Hlavay, Application of fuzzy clustering and piezoelectric chemical sensor array for investigation on organic compounds, Analytica Chimica Acta 398 (1999) 219–226

Fuzzy clustering and data analysis toolbox

The toolbox is a collection of Matlab functions can be used to culstering of data by fuzzy c-means, Gustafson - Kessel, Gath-Geva clustering algorithms.

The validity function provides cluster validity measures for each partition. The Visualization part of this toolbox provides the modified Sammon mapping of the data.The toolbox is supported by a 77 pages manual.