Clustering

3D Scanner-Based Identification of Welding Defects—Clustering the Results of Point Cloud Alignment

This paper describes a framework for detecting welding errors using 3D scanner data. The proposed approach employs density-based clustering to compare point clouds and identify deviations. The discovered clusters are then classified according to standard welding fault classes. Six welding deviations defined in the ISO 5817:2014 standard were evaluated. All defects were represented through CAD models, and the method was able to detect five of these deviations. The results demonstrate that the errors can be effectively identified and grouped according to the location of the different points in the error clusters. However, the method cannot separate crack-related defects as a distinct cluster. 

Post date: 01 March 2023

Multi-objective hierarchical clustering for tool assignment

Due to the limited tool magazine capacities of CNC machines, time-consuming tool changeovers result in inefficient equipment utilization. This study provides a method to minimize the changeovers by optimizing the allocation of the tools to the machines. The proposed algorithm is efficient as it approaches the tool assignment task as a multi-objective hierarchical clustering problem where the products are grouped based on the similarity of the tool demands. The novelty of the goal-oriented agglomerative clustering algorithm is that it is based on the Pareto optimal selection of the merged clusters. The applicability of the method is demonstrated through an industrial case study. The tool assignment problem has also been formulated as a bin-packing optimization task, and the results of the related linear programming were used as a benchmark reference. The comparison highlighted that the proposed method provides a feasible solution for large real-life problems with low computation time. 


Post date: 18 February 2023

Goal-oriented possibilistic fuzzy C-Medoid clustering of human mobility patterns—Illustrative application for the Taxicab trips-based enrichment of public transport services

The discovery of human mobility patterns of cities provides invaluable information for decision-makers who are responsible for redesign of community spaces, traffic, and public transportation systems and building more sustainable cities. The present article proposes a possibilistic fuzzy c-medoid clustering algorithm to study human mobility. The proposed medoid-based clustering approach groups the typical mobility patterns within walking distance to the stations of the public transportation system. The departure times of the clustered trips are also taken into account to obtain recommendations for the scheduling of the designed public transportation lines. The effectiveness of the proposed methodology is revealed in an illustrative case study based on the analysis of the GPS data of Taxicabs recorded during nights over a one-year-long period in Budapest.


Post date: 10 October 2022 

A multi-block clustering algorithm for high dimensional binarized sparse data

We introduce a multidimensional multiblock clustering (MDMBC) algorithm in this paper. MDMBC can generate overlapping clusters with similar values along clusters of dimensions. The parsimonious binary vector representation of multidimensional clusters lends itself to the application of efficient meta-heuristic optimization algorithms. In this paper, a hill-climbing (HC) greedy search algorithm has been presented that can be extended by several stochastic and population-based meta-heuristic frameworks. The benefits of the algorithm are demonstrated in a bi-clustering benchmark problem and in the analysis of the Leiden higher education ranking system, which measures the scientific performance of 903 institutions along four dimensions of 20 indicators representing publication output and collaboration in different scientific fields and time periods.


Post date:  01 April 2022

Processing indoor positioning data by goal-oriented supervised fuzzy clustering for tool management

Indoor positioning systems allow real-time tracking of tool locations. Tool utilization can be calculated based on positional data of the storage and manufacturing areas. Due to the uncertainty of the position measurements, estimation of the state of the tools is problematic when the distance urvival Indoor positioning systems allow real-time tracking of tool locations. Tool utilization can be calculated based on positional data of the storage and manufacturing areas. Due to the uncertainty of the position measurements, estimation of the state of the tools is problematic when the distance between the examined zones is less than the estimation error. We propose a goal-oriented supervised fuzzy clustering algorithm that utilizes the activity state of the tool, as the algorithm simultaneously maximizes the spatial distribution probability and the probability of a specific activity state occurring in a cluster. By weighting data points according to the time spent in the related states and positions, the resulting cluster weights can be interpreted as tool utilizations. The applicability of the developed method is presented through the processing of position data from crimping tools used by a wire harness manufacturer.


Mixture of Survival Analysis Models-Cluster-Weighted Weibull Distributions

Survival analysis is a widely used method to establish a connection between a time to event outcome and a set of variables. The goal of this work is to improve the accuracy of the widely applied parametric survival models. This work highlights that accurate and interpretable survival analysis models can be identified by clustering-based exploration of the operating regions of local survival models. The key idea is that when operating regions of local Weibull distributions are represented by Gaussian mixture models, the parameters of the mixture-of-Weibull model can be identified by a clustering algorithm. The proposed method is utilised in three case studies. The examples cover studying the dropout rate of university students, calculating the remaining useful life of lithium-ion batteries, and determining the chances of survival of prostate cancer patients. The results demonstrate the wide applicability of the method and the benefits of clustering-based identification of local Weibull models. 

Geodesic distance based fuzzy c-medoid clustering – searching for central points in graphs and high dimensional data 

Clustering high dimensional data and identifying central nodes in a graph are complex and computationally expensive tasks. We utilize k-nn graph of high dimensional data as efficient representation of the hidden structure of the clustering problem. Initial cluster centers are determined by graph centrality measures. Cluster centers are fine-tuned by minimizing fuzzy-weighted geodesic distances. The shortest-path based representation is parallel to the concept of transitive closure. Therefore, our algorithm is capable to cluster networks or even more complex and abstract objects based on their partially known pairwise similarities. The algorithm is proven to be effective to identify senior researchers in a co-author network, central cities in topographical data, and clusters of documents represented by high dimensional feature vectors. 

Király, András, Ágnes Vathy-Fogarassy, and János Abonyi. "Geodesic distance based fuzzy c-medoid clustering–searching for central points in graphs and high dimensional data." Fuzzy Sets and Systems 286 (2016): 157-172. 

Scalable co-clustering 

Production flow analysis includes various families of components and groups of machines. Machine-part cell formation means the optimal design of manufacturing cells consisting of similar machines producing similar products from a similar set of components. Most of the algorithms reorders of the machine-part incidence matrix. We generalize this classical concept to handle more than two elements of the production process (e.g. machine - part - product - resource - operator). The application of this extended concept requires an efficient optimization algorithm for the simultaneous grouping these elements. For this purpose, we propose a novel co-clustering technique based on crossing minimization of layered bipartite graphs. The present method has been implemented as a MATLAB toolbox. The efficiency of the proposed approach and developed tools is demonstrated by realistic case studies. The log-linear scalability of the algorithm is proven theoretically and experimentally. 

Fuzzy clustering for nonlinear regression

Takagi-Sugeno (TS) models formed by logical rules consist of a fuzzy antecedent and a mathematical function as consequent part. The construction of a TS model is usually done in two steps. In the first step, the fuzzy sets in the rule antecedents are determined that partition the input space into a number of fuzzy regions. In the second step, the rule consequents are determined which means identification of (usually linear) models. TS fuzzy model identification is a complex task; there are non-trivial problems as follows: (1) how to automatically partition the input space, (2) how many fuzzy rules are really needed for properly approximating an unknown nonlinear system, and (3) how to construct a fuzzy system from data examples automatically. These problems can be partially solved by the recent developments of fuzzy systems.

We recognized that fuzzy clustering algorithms are able to automatically divide the input space, and developed clustering algorithm that fits a local models beside the clustering simultaneously. Transparency of the model is enchanced by the tree representation. The number of rules can be given by the user or it can be identified by validation. Comparing with well-known methods it can be determined that the developed method gives the most transparent results at similar accuracy to these methods.

Improvement of Jarvis-Patrick clustering

Jarvis-Patrick clustering utilizes the nearest neighbor approach to cluster the objects. The main disadvantages of the Jarvis-Patrick clustering are that it utilizes a very rigid decision criterion to classify the objects, and this decision criterion is only confined to the k nearest neighbors. To solve these problems  a new similarity measure based on the nearest neighbors of the objects has been defined. This similarity measure is not restricted to the direct neighbors, but it can also take into account objects that are further away. Furthermore, the suggested similarity measure fuzzifies the crisp decision criterion of the Jarvis-Patrick algorithm. The combination of the proposed similarity measure with hierarchical clustering methods provides an effective tool for exploring groups of data.

A. Vathy-Fogarassy, A. Kiss, J. Abonyi. Improvement of Jarvis-Patrick clustering based on fuzzy similarity. In Lecture Notes in Computer Science: Applications of Fuzzy Sets Theory, volume 4578/2007, pages 195–202. Springer Berlin / Heidelberg, 2007. 

Minimal spanning tree and Gath-Geva clustering 

A novel clustering algorithm based on the minimal spanning tree of the objects and the Gath-Geva clustering method has been proposed. Graph based clustering algorithms find groups of objects by eliminating inconsistent edges of the graph representing the data set to be analyzed. The resulting subgraphs yield clusters. However, due to the huge variety of problems and data, it is a difficult challenge to identify the inconsistent edges of graphs. To solve this problem I have suggested a new cutting process of graphs, that iteratively finds the best partitions based on the measure of the fuzzy hyper volume of clusters. Based on this cutting criterion I have also suggested a new graph based clustering algorithm, which is an effective combination of the minimal spanning tree based clustering and the partitional Gath-Geva algorithm. The suggested algorithm is able to solve the typical problem of the graph based clustering algorithms (chaining effect) and it also solves the initialization problem of the Gath-Geva algorithm. The resulting clusters of the proposed algorithm are easily interpretable with a compact parametric description.

Fuzzy clustering based time-series segmentation

Changes of the variables of a multivariate time-series are usually vague and do not focus on any particular time point. Therefore, it is not practical to define crisp bounds of the segments. Although fuzzy clustering algorithms are widely used to group overlapping and vague objects, they cannot be directly applied to time-series segmentation, because the clusters need to be contiguous in time. This paper proposes a clustering algorithm for the simultaneous identification of local Probabilistic Principal Component Analysis (PPCA) models used to measure the homogeneity of the segments and fuzzy sets used to represent the segments in time. The algorithm favors contiguous clusters in time and able to detect changes in the hidden structure of multivariate time-series. A fuzzy decision making algorithm based on a compatibility criteria of the clusters have been worked out to determine the required number of segments, while the required number of principal components are determined by the screeplots of the eigenvalues of the fuzzy covariance matrices. The application example shows that this new technique is a useful tool for the analysis of historical process data.

Cluster Analysis for Data Mining and System Identification 

2007th Edition

The aim of this book is to illustrate that advanced fuzzy clustering algorithms can be used not only for partitioning of the data. It can also be used for visualization, regression, classification and time-series analysis, hence fuzzy cluster analysis is a good approach to solve complex data mining and system identification problems. This book is oriented to undergraduate and postgraduate and is well suited for teaching purposes. 

J. Abonyi, B. Feil, Cluster analysis for data mining and system identification, Birkhauser, 2007, 300 pages 

Modified gath-geva fuzzy clustering for identification of Takagi-Sugeno fuzzy models

Abstract—The construction of interpretable Takagi-Sugeno (TS) fuzzy models by means of clustering is addressed. First, it is shown how the antecedent fuzzy sets and the corresponding consequent parameters of the TS model can be derived from clusters obtained by the Gath-Geva (GG) algorithm. To preserve the partitioning of the antecedent space, linearly transformed input variables can be used in the model. This may, however, complicate the interpretation of the rules. To form an easily interpretable model that does not use the transformed input variables, a new clustering algorithm is proposed, based on the expectation-maximization (EM) identification of Gaussian mixture models. This new technique is applied to two well-known benchmark problems: the MPG (miles per gallon) prediction and a simulated second-order nonlinear process. The obtained results are compared with results from the literature.

Modified Gath–Geva clustering for fuzzy segmentation of multivariate time-series

Partitioning a time-series into internally homogeneous segments is an important data-mining problem. The changes of the variables of a multivariate time-series are usually vague and do not focus on any particular time point. Therefore, it is not practical to define crisp bounds of the segments. Although fuzzy clustering algorithms are widely used to group overlapping and vague objects, they cannot be directly applied to time-series segmentation, because the clusters need to be contiguous in time. This paper proposes a clustering algorithm for the simultaneous identification of local probabilistic principal component analysis (PPCA) models used to measure the homogeneity of the segments and fuzzy sets used to represent the segments in time. The algorithm favors contiguous clusters in time and is able to detect changes in the hidden structure of multivariate time-series. A fuzzy decision making algorithm based on a compatibility criteria of the clusters has been worked out to determine the required number of segments, while the required number of principal components are determined by the screeplots of the eigenvalues of the fuzzy covariance matrices. The application example shows that this new technique is a useful tool for the analysis of historical process data.

Genetic programming for the identification of nonlinear input-output models

Linear-in-parameters models are quite widespread in process engineering, e.g., nonlinear additive autoregressive models, polynomial ARMA models, etc. This paper proposes a new method for the structure selection of these models. The method uses genetic programming to generate nonlinear input-output models of dynamical systems that are represented in a tree structure. The main idea of the paper is to apply the orthogonal least squares (OLS) algorithm to estimate the contribution of the branches of the tree to the accuracy of the model. This method results in more robust and interpretable models. The proposed approach has been implemented as a freely available MATLAB Toolbox, www.fmt.veim.hu/softcomp. The simulation results show that the developed tool provides an efficient and fast method for determining the order and structure for nonlinear input-output models.

Model order selection of nonlinear input–output models––a clustering based approach

Selecting the order of an input–output model of a dynamical system is a key step toward the goal of system identification. The false nearest neighbors algorithm (FNN) is a useful tool for the estimation of the order of linear and nonlinear systems. While advanced FNN uses nonlinear input–output data-based models for the model-based selection of the threshold constant that is used to compute the percentage of false neighbors, the computational effort of the method increases along with the number of data and the dimension of the model. To increase the efficiency of this method, in this paper we propose a clustering-based algorithm. Clustering is applied to the product space of the input and output variables. The model structure is then estimated on the basis of the cluster covariance matrix eigenvalues. The main advantage of the proposed solution is that it is model-free. This means that no particular model needs to be constructed in order to select the order of the model, while most other techniques are wrapped’ around a particular model construction method. This saves the computational effort and avoids a possible bias due to the particular construction method used. Three simulation examples are given to illustrate the proposed technique: estimation of the model structure for a linear system, a polymerization reactor and the van der Vusse reactor.

Visualization of fuzzy clusters by fuzzy sammon mapping projection – Application to the analysis of phase space trajectories

Since in practical data mining problems high-dimensional data are clustered, the resulting clusters are high-dimensional geometrical objects, which are difficult to analyze and interpret. Cluster validity measures try to solve this problem by providing a single numerical value. As a low dimensional graphical representation of the clusters could be much more informative than such a single value, this paper proposes a new tool for the visualization of fuzzy clustering results. By using the basic properties of fuzzy clustering algorithms, this new tool maps the cluster centers and the data such that the distances between the clusters and the data-points are preserved. During the iterative mapping process, the algorithm uses the membership values of the data and minimizes an objective function similar to the original clustering algorithm. Comparing to the original Sammon mapping not only reliable cluster shapes are obtained but the numerical complexity of the algorithm is also drastically reduced. The developed tool has been applied for visualization of reconstructed phase space trajectories of chaotic systems. The case study demonstrates that proposed FUZZSAMM algorithm is a useful tool in user-guided clustering. 

FUZZSAM – Visualization of fuzzy clusters by fuzzy sammon mapping projection – Application to the analysis of phase space trajectories

Since in practical data mining problems high-dimensional data are clustered, the resulting clusters are high-dimensional geometrical objects, which are difficult to analyze and interpret. Cluster validity measures try to solve this problem by providing a single numerical value. As a low dimensional graphical representation of the clusters could be much more informative than such a single value, this paper proposes a new tool for the visualization of fuzzy clustering results. By using the basic properties of fuzzy clustering algorithms, this new tool maps the cluster centers and the data such that the distances between the clusters and the data-points are preserved. During the iterative mapping process, the algorithm uses the membership values of the data and minimizes an objective function similar to the original clustering algorithm. Comparing to the original Sammon mapping not only reliable cluster shapes are obtained but the numerical complexity of the algorithm is also drastically reduced. The algorithm has been applied to several data sets and the numerical results show performance superior to Principal Component Analysis and the classical Sammon mapping based projection. The examples demonstrate that proposed FUZZSAMM algorithm is a useful tool in user-guided clustering.

Fuzzy clustering for selecting structure of nonlinear models with mixed discrete and continuous inputs

A method for selecting regressors in nonlinear models with mixed discrete (categorical) and continuous inputs is proposed. Given a set of input-output data and an initial superset of potential inputs, the relevant inputs are selected by a model-free search algorithm. Fuzzy clustering is used to quantize continuous data into subsets that can be handled in a similar way as discrete data. Two simulation examples and one real-world data set are included to illustrate the performance of the proposed method and compare it with the performance of regression trees. For small to medium size problems (up to 15 candidate inputs), the proposed method works effectively. For larger problems, the computational load becomes too high.

Girimonte D, Babuska R, Abonyi J, Fuzzy clustering for selecting structure of nonlinear models with mixed discrete and continuous inputs, 13th International Conference on Fuzzy Systems, Budapest, Hungary, 2004.07.25-2004.07.29. New York: IEEE Press, 2004. pp. 383-387. (ISBN:0-7803-8353-2).

Supervised Fuzzy Clustering for the Identification of Fuzzy Classifiers

The classical fuzzy classifier consists of rules each one describing one of the classes. In this paper a new fuzzy model structure is proposed where each rule can represent more than one classes with different probabilities. The obtained classifier can be considered as an extension of the quadratic Bayes classifier that utilizes mixture of models for estimating the class conditional densities. A supervised clustering algorithm has been worked out for the identification of this fuzzy model. The relevant input variables of the fuzzy classifier have been selected based on the analysis of the clusters by Fisher's interclass separability criteria. This new approach was applied to the well-known wine and Wisconsin Breast Cancer classification problems.

Application of fuzzy clustering and piezoelectric chemical sensor array for investigation on organic compounds

The fuzzy c-means (FCM) clustering models were used for the discrimination of organic compounds using piezoelectric chemical sensor array data of 14 analytes. Appropriate clusters are found by the sum of the weighted quadratic distances between data points and cluster prototypes. A priori known information can be integrated into the clustering algorithm by using constrained prototypes. A sensor array was built using piezoelectric quartz crystal sensors. Four AT-cut quartz crystals with 9MHz fundamental frequencies were applied. Sensing materials were OV1, OV275, ASI50, and polyphenil-ether. The appropriate coating materials were found by principal component analysis. The application of the fuzzy clustering method has been proved to be a reliable way of identifying similar, pure organic compounds.

Gy. Barkó, J. Abonyi, J. Hlavay, Application of fuzzy clustering and piezoelectric chemical sensor array for investigation on organic compounds, Analytica Chimica Acta 398 (1999) 219–226

Fuzzy clustering and data analysis toolbox

The toolbox is a collection of Matlab functions can be used to culstering of data by fuzzy c-means, Gustafson - Kessel, Gath-Geva clustering algorithms.

The validity function provides cluster validity measures for each partition. The Visualization part of this toolbox provides the modified Sammon mapping of the data.The toolbox is supported by a 77 pages manual.