Genetic programming-based symbolic regression for goal-oriented dimension reduction
The majority of dimension reduction techniques are built upon the optimization of an objective function aiming to retain certain characteristics of the projected datapoints: the variance of the original dataset, the distance between the datapoints or their neighbourhood characteristics, etc. Building upon the optimization-based formalization of dimension reduction techniques, the goal-oriented formulation of projection cost functions is proposed. For the optimization of the application-oriented data visualization cost function, a multi-gene genetic programming (GP)-based algorithm is introduced to optimize the structures of the equations used for mapping high-dimensional data into a two-dimensional space and to select the variables that are needed to explore the internal structure of the data for data-driven software sensor development or classifier design. The main benefit of the approach is that the evolved equations are interpretable and can be utilized in surrogate models. The applicability of the approach is demonstrated on the benchmark wine dataset and in the estimation of the product quality in a diesel oil blending technology based on an online near-infrared (NIR) analyzer. The results illustrate that the algorithm is capable of generating goal-oriented and interpretable features, and that the resulting simple algebraic equations can be implemented directly in applications that need computationally cost-effective projections of high-dimensional data, since they are computationally simpler than alternatives such as neural networks.
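As a rough illustration of what interpretable mapping equations of this kind can look like, the sketch below represents each of the two projection coordinates as a weighted sum of small expression trees (genes) and evaluates them on one sample. The tree encoding, the operator set, the variable indices and the weights are hypothetical and are not taken from the paper; the evolutionary search that would produce such trees, and the goal-oriented cost function it would minimize, are omitted.

```python
import operator

# Hypothetical operator set; a real GP run would define its own.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(tree, x):
    """Evaluate an expression tree on one sample x (a list of floats).

    A tree is a numeric constant, ("x", i) for variable i, or (op, left, right).
    """
    if isinstance(tree, (int, float)):
        return float(tree)
    if tree[0] == "x":
        return x[tree[1]]
    op, left, right = tree
    return OPS[op](evaluate(left, x), evaluate(right, x))

def project(genes, weights, bias, x):
    """One output coordinate: bias plus the weighted sum of the gene outputs."""
    return bias + sum(w * evaluate(g, x) for g, w in zip(genes, weights))

def used_variables(tree):
    """Indices of the input variables that appear in a tree."""
    if isinstance(tree, (int, float)):
        return set()
    if tree[0] == "x":
        return {tree[1]}
    return used_variables(tree[1]) | used_variables(tree[2])

# Made-up mapping of a 5-dimensional sample to 2 dimensions.
z1_genes = [("*", ("x", 0), ("x", 2)), ("x", 1)]
z2_genes = [("-", ("x", 3), 1.0)]

sample = [2.0, 3.0, 0.5, 4.0, 9.0]
z1 = project(z1_genes, [0.5, -1.0], 0.0, sample)
z2 = project(z2_genes, [2.0], 1.0, sample)

# Variables that never occur in any gene are implicitly deselected.
selected = sorted(set().union(*(used_variables(g) for g in z1_genes + z2_genes)))
```

In this toy setting the fifth variable never appears in a gene, which is how variable selection falls out of the structure of the equations themselves.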
Local and global mappings of topology representing networks
As data analysis tasks often have to deal with complex data structures, nonlinear dimensionality reduction methods play an important role in exploratory data analysis. In the literature a number of nonlinear dimensionality reduction techniques have been proposed (e.g. Sammon mapping, Locally Linear Embedding). These techniques attempt to preserve either the local or the global geometry of the original data, and they perform metric or non-metric dimensionality reduction. Nevertheless, it is difficult to apply most of them to large data sets. There is a need for new algorithms that are able to combine vector quantisation and mapping methods in order to visualise the data structure in a low-dimensional vector space. In this paper we define a new class of algorithms that quantify and disclose the data structure, are based on topology representing networks, and apply different mapping methods for the low-dimensional visualisation. Not only are existing methods combined for that purpose, but a novel group of mapping methods (Topology Representing Network Map) is also introduced as a part of this class. Topology Representing Network Maps utilise the main benefits of topology representing networks and of multidimensional scaling methods to disclose the real structure of the data set under study. To determine the main properties of the topology representing network based mapping methods, a detailed analysis of classical benchmark examples (the Wine and the Optical Recognition of Handwritten Digits data sets) is presented.
Visualization and Complexity Reduction of Neural Networks
The identification of the proper structure of nonlinear neural networks (NNs) is a difficult problem, since these black-box models are not interpretable. The aim of the paper is to propose a new approach that can be used for the analysis and the reduction of these models. It is shown that NNs with a sigmoid transfer function can be transformed into fuzzy systems. Hence, with the use of this transformation NNs can be analyzed by human experts based on the extracted linguistic rules. Moreover, based on the similarity of the resulting membership functions, the hidden neurons of the NNs can be mapped into a two-dimensional space. The resulting map provides an easily interpretable picture of the redundancy of the neurons. Furthermore, the contribution of these neurons can be measured by the orthogonal least squares technique, which can be used to order the extracted fuzzy rules by their importance. A practical example related to the dynamic modeling of a chemical process system is used to demonstrate that the synergistic combination of model transformation, visualization and reduction of NNs is an effective technique that can be used for the structural and parametric analysis of NNs.
Topology Representing Network Map – A New Tool for Visualization of High-Dimensional Data
In practical data mining problems high-dimensional data has to be analyzed. In most of these cases it is very informative to map and visualize the hidden structure of a complex data set in a low-dimensional space. The aim of this paper is to propose a new mapping algorithm based on both the topology and the metric of the data.
The utilized Topology Representing Network (TRN) combines neural gas vector quantization and the competitive Hebbian learning rule in such a way that the hidden data structure is approximated by a compact graph representation. TRN is able to define a low-dimensional manifold in the high-dimensional feature space. If such a manifold exists, multidimensional scaling and/or Sammon mapping of the graph distances can be used to form the map of the TRN (TRNMap).
The systematic analysis of the algorithms that can be used for data visualization and the numerical examples presented in this paper demonstrate that the resulting map gives a good representation of the topology and the metric of complex data sets, and the component plane representation of TRNMap is useful to explore the hidden relations among the features.
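The graph-building stage described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the codebook vectors are assumed to be given (the neural gas quantization step is omitted), edge ageing is left out, and the function names are made up for this example. Competitive Hebbian learning is reduced here to its core rule of connecting, for each sample, the two nearest codebook vectors; the shortest-path lengths along the resulting graph are the graph distances that would then be passed to multidimensional scaling or Sammon mapping.

```python
import heapq
import math

def competitive_hebbian_edges(data, codebook):
    """Connect, for every sample, its two nearest codebook vectors."""
    edges = set()
    for x in data:
        ranked = sorted(range(len(codebook)),
                        key=lambda j: math.dist(x, codebook[j]))
        first, second = ranked[0], ranked[1]
        edges.add((min(first, second), max(first, second)))
    return edges

def graph_distances(codebook, edges):
    """All-pairs shortest-path lengths along the edges (Dijkstra per node)."""
    n = len(codebook)
    neighbours = {i: [] for i in range(n)}
    for i, j in edges:
        w = math.dist(codebook[i], codebook[j])
        neighbours[i].append((j, w))
        neighbours[j].append((i, w))
    result = []
    for source in range(n):
        dist = [math.inf] * n
        dist[source] = 0.0
        queue = [(0.0, source)]
        while queue:
            d, u = heapq.heappop(queue)
            if d > dist[u]:
                continue  # stale queue entry
            for v, w in neighbours[u]:
                if d + w < dist[v]:
                    dist[v] = d + w
                    heapq.heappush(queue, (d + w, v))
        result.append(dist)
    return result

# Toy example: three codebook vectors along an L-shaped path in the plane.
codebook = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
data = [(0.4, 0.0), (0.6, 0.1), (1.0, 0.4), (1.0, 0.7)]
edges = competitive_hebbian_edges(data, codebook)
D = graph_distances(codebook, edges)
```

In the toy example the two end points of the L-shaped path are connected only through the corner, so their graph distance is longer than their straight-line distance. This is the property that lets a mapping of graph distances follow a curved manifold.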
Topology Representing Networks Based Visualizations of Manifolds
In practical data mining tasks high-dimensional data has to be analyzed. In most cases it is very informative to map and visualize the hidden structure of a complex data set in a low-dimensional space. In this paper a new class of mapping algorithms is defined. These algorithms combine topology representing networks and different nonlinear mapping algorithms. While the former methods aim to quantize the data and disclose the real structure of the objects, the nonlinear mapping algorithms are able to visualize the quantized data in the low-dimensional vector space. In this paper we review the techniques based on these methods and show the results of a detailed analysis performed on them. The primary aim of this analysis was to examine the preservation of distances and neighborhood relations of the objects. Preservation of neighborhood relations was analyzed both in local and global environments. To evaluate the main properties of the examined methods, we show the outcome of the analysis based on a synthetic and a real benchmark example.
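One simple way to put a number on the preservation of neighborhood relations is the fraction of each point's k nearest neighbours that remain among its k nearest neighbours after mapping. The sketch below implements this overlap score; it is a stand-in chosen for brevity and is not claimed to be the measure used in the paper, and the function names and toy data are illustrative.

```python
import math

def k_nearest(points, i, k):
    """Indices of the k nearest neighbours of point i (excluding i itself)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return set(others[:k])

def neighbourhood_overlap(high, low, k):
    """Mean fraction of each point's k nearest neighbours kept by the mapping."""
    total = 0.0
    for i in range(len(high)):
        kept = k_nearest(high, i, k) & k_nearest(low, i, k)
        total += len(kept) / k
    return total / len(high)

# Toy data: four points on a line in 3-D and two candidate 1-D mappings.
high = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0)]
good = [(0.0,), (1.0,), (3.0,), (6.0,)]   # keeps the ordering and spacing
bad = [(0.0,), (6.0,), (1.0,), (3.0,)]    # shuffles the points

score_good = neighbourhood_overlap(high, good, k=1)
score_bad = neighbourhood_overlap(high, bad, k=1)
```

Small values of k probe the local environment, while values of k close to the number of points probe the global one, which mirrors the local and global analysis mentioned above.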
Visualization of fuzzy clusters by fuzzy Sammon mapping projection: application to the analysis of phase space trajectories
Since in practical data mining problems high-dimensional data are clustered, the resulting clusters are high-dimensional geometrical objects, which are difficult to analyze and interpret. Cluster validity measures try to solve this problem by providing a single numerical value. As a low-dimensional graphical representation of the clusters could be much more informative than such a single value, this paper proposes a new tool for the visualization of fuzzy clustering results. By using the basic properties of fuzzy clustering algorithms, this new tool maps the cluster centers and the data such that the distances between the clusters and the data points are preserved. During the iterative mapping process, the algorithm uses the membership values of the data and minimizes an objective function similar to that of the original clustering algorithm. Compared with the original Sammon mapping, not only are reliable cluster shapes obtained, but the numerical complexity of the algorithm is also drastically reduced. The developed tool has been applied to the visualization of reconstructed phase space trajectories of chaotic systems. The case study demonstrates that the proposed FUZZSAMM algorithm is a useful tool in user-guided clustering.
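The source of the complexity reduction can be seen by comparing the two objective functions. The classical Sammon stress compares all N(N-1)/2 pairwise distances, whereas a mapping that only preserves distances between cluster centers and data points compares c·N distances, which is far fewer when the number of clusters c is much smaller than N. The sketch below is an assumption based on the description above: the exact form of the membership-weighted objective, the fuzzifier `m` and the function names are not taken from the paper.

```python
import math
from itertools import combinations

def sammon_stress(high, low):
    """Classical Sammon stress over all N(N-1)/2 point pairs.

    Assumes all points in `high` are distinct (no zero distances).
    """
    pairs = list(combinations(range(len(high)), 2))
    scale = sum(math.dist(high[i], high[j]) for i, j in pairs)
    error = sum(
        (math.dist(high[i], high[j]) - math.dist(low[i], low[j])) ** 2
        / math.dist(high[i], high[j])
        for i, j in pairs
    )
    return error / scale

def weighted_center_stress(high, low, centers_high, centers_low,
                           memberships, m=2.0):
    """Assumed fuzzy variant: only the c*N center-to-point distances are
    compared, each weighted by memberships[i][k] ** m (cluster i, point k)."""
    error = 0.0
    for i, (v, z) in enumerate(zip(centers_high, centers_low)):
        for k, (x, y) in enumerate(zip(high, low)):
            gap = math.dist(x, v) - math.dist(y, z)
            error += memberships[i][k] ** m * gap ** 2
    return error

# Toy data: four 3-D points lying in the plane z = 0, two clusters.
high = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
low_exact = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
low_squashed = [(0.0, 0.0), (1.0, 0.0), (0.0, 0.5), (1.0, 0.5)]
centers_high = [(0.0, 0.5, 0.0), (1.0, 0.5, 0.0)]
centers_low = [(0.0, 0.5), (1.0, 0.5)]
memberships = [[0.9, 0.1, 0.9, 0.1], [0.1, 0.9, 0.1, 0.9]]

stress_exact = sammon_stress(high, low_exact)
stress_squashed = sammon_stress(high, low_squashed)
fuzzy_exact = weighted_center_stress(high, low_exact, centers_high,
                                     centers_low, memberships)
fuzzy_squashed = weighted_center_stress(high, low_squashed, centers_high,
                                        [(0.0, 0.25), (1.0, 0.25)], memberships)
```

Both measures vanish for a distance-preserving mapping and grow when the mapping distorts the data; the weighting by membership makes points count mainly toward the clusters they belong to.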
Node Similarity Based Graph Clustering and Visualization
The basis of the presented methods for the visualization and clustering of graphs is a novel similarity and distance metric, and the matrix describing the similarity of the nodes in the graph. This matrix represents the type of connections between the nodes in the graph in a compact form, and thus provides a very good starting point for both the clustering and the visualization algorithms. Visualization is performed with the multidimensional scaling (MDS) dimensionality reduction technique, which relies on the spectral decomposition of this matrix, while the partitioning builds on the results of this step and generates a hierarchical representation. A detailed example is shown to demonstrate the capability of the described algorithms for clustering and visualization of the link structure of Web sites.
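For readers unfamiliar with the spectral step, the sketch below shows classical (Torgerson) MDS on a plain distance matrix: the squared distances are double-centered and the leading eigenpairs give the coordinates. It is a generic illustration under stated assumptions. The node similarity metric of the paper is not reproduced, Euclidean distances between toy points stand in for it, and power iteration with deflation is used only to keep the example free of external libraries.

```python
import math

def double_center(sq_dist):
    """B = -1/2 * J * D2 * J, where J is the centering matrix."""
    n = len(sq_dist)
    row = [sum(r) / n for r in sq_dist]
    total = sum(row) / n
    return [[-0.5 * (sq_dist[i][j] - row[i] - row[j] + total)
             for j in range(n)] for i in range(n)]

def top_eigenpairs(matrix, count, iterations=500):
    """Leading eigenpairs of a symmetric positive semidefinite matrix,
    found by power iteration with deflation."""
    n = len(matrix)
    work = [r[:] for r in matrix]
    pairs = []
    for c in range(count):
        vec = [1.0 + 0.1 * (i + c) for i in range(n)]  # deterministic start
        for _ in range(iterations):
            nxt = [sum(work[i][j] * vec[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(v * v for v in nxt))
            if norm == 0.0:
                break
            vec = [v / norm for v in nxt]
        value = sum(vec[i] * sum(work[i][j] * vec[j] for j in range(n))
                    for i in range(n))
        pairs.append((value, vec))
        for i in range(n):  # remove the component just found
            for j in range(n):
                work[i][j] -= value * vec[i] * vec[j]
    return pairs

def classical_mds(dist, dims=2):
    """Coordinates whose Euclidean distances approximate `dist`."""
    sq = [[d * d for d in r] for r in dist]
    pairs = top_eigenpairs(double_center(sq), dims)
    return [tuple(math.sqrt(max(val, 0.0)) * vec[i] for val, vec in pairs)
            for i in range(len(dist))]

# Toy example: corners of a 2-by-1 rectangle, recovered from distances only.
points = [(0.0, 0.0), (2.0, 0.0), (0.0, 1.0), (2.0, 1.0)]
dist = [[math.dist(p, q) for q in points] for p in points]
embedding = classical_mds(dist)
```

The recovered coordinates match the original rectangle only up to rotation and reflection, which is expected, since the distances carry no information about orientation.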
FUZZSAM – Visualization of Fuzzy Clustering Results by Modified Sammon Mapping
Since in practical data mining problems high-dimensional data are clustered, the resulting clusters are high-dimensional geometrical objects, which are difficult to analyze and interpret. Cluster validity measures try to solve this problem by providing a single numerical value. As a low-dimensional graphical representation of the clusters could be much more informative than such a single value, this paper proposes a new tool for the visualization of fuzzy clustering results. By using the basic properties of fuzzy clustering algorithms, this new tool maps the cluster centers and the data such that the distances between the clusters and the data points are preserved. During the iterative mapping process, the algorithm uses the membership values of the data and minimizes an objective function similar to that of the original clustering algorithm. Compared with the original Sammon mapping, not only are reliable cluster shapes obtained, but the numerical complexity of the algorithm is also drastically reduced. The algorithm has been applied to several data sets, and the numerical results show performance superior to principal component analysis and the classical Sammon mapping based projection. The examples demonstrate that the proposed FUZZSAMM algorithm is a useful tool in user-guided clustering.