Visualization

Matrix factorization-based multi-objective ranking–What makes a good university?

Non-negative matrix factorization (NMF) efficiently reduces high dimensionality for many-objective ranking problems. In multi-objective optimization, as long as only three or four conflicting viewpoints are present, an optimal solution can be determined by finding the Pareto front. When the number of objectives increases, the multi-objective problem evolves into a many-objective optimization task, where the Pareto front becomes oversaturated. The key idea is that NMF aggregates the objectives so that the Pareto front can be applied, while the Sum of Ranking Differences (SRD) method selects the objectives that have a detrimental effect on the aggregation and validates the findings. The applicability of the method is illustrated by the ranking of 1176 universities based on 46 variables of the CWTS Leiden Ranking 2020 database. The performance of NMF is compared to principal component analysis (PCA) and sparse non-negative matrix factorization-based solutions. The results illustrate that PCA incorporates negatively correlated objectives into the same principal component. In contrast, NMF only allows non-negative correlations, which enables the proper use of the Pareto front. With the combination of NMF and SRD, a non-biased ranking of the universities based on 46 criteria is established, in which Harvard, Rockefeller and Stanford Universities take the first three places. To evaluate the ranking capabilities of the methods, measures based on Relative Entropy (RE) and Hypervolume (HV) are proposed. The results confirm that the sparse NMF method provides the most informative ranking. The results highlight that academic excellence can be improved by decreasing the proportion of unknown open-access publications and short-distance collaborations. The gender-proportion indicators barely correlate with scientific impact. More authors, long-distance collaborations, and publications with higher average scientific impact and citation counts strongly influence the university ranking in a positive direction.
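
As a rough illustration of the Pareto front that the abstract relies on, the following sketch selects the non-dominated alternatives from a small table of scores and estimates how the front saturates as the number of objectives grows. It is not the authors' code: the function names, the toy scores, and the choice to treat every objective as "larger is better" are assumptions made for this example.

```python
import random
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b on every objective and strictly
    better on at least one (assumption: all objectives are maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: List[Sequence[float]]) -> List[int]:
    """Return the indices of the points that no other point dominates."""
    return [
        i
        for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


def front_share(n_points: int, n_objectives: int, seed: int = 0) -> float:
    """Share of uniformly random points that end up on the Pareto front."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(n_objectives)] for _ in range(n_points)]
    return len(pareto_front(pts)) / n_points


# Hypothetical aggregated scores of five alternatives on two factors.
scores = [(0.9, 0.2), (0.6, 0.6), (0.2, 0.9), (0.5, 0.5), (0.1, 0.1)]
front = pareto_front(scores)  # indices of the non-dominated alternatives

if __name__ == "__main__":
    print("front:", front)
    # With many objectives, most random points tend to be non-dominated.
    for d in (2, 5, 10):
        print(d, "objectives ->", front_share(200, d))
```

Running the loop at the end with random data typically shows the share of non-dominated points rising sharply with the number of objectives, which is the oversaturation that motivates aggregating the objectives before applying the Pareto front.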


Post date: 13 April 2023 

Factor analysis, sparse PCA, and Sum of Ranking Differences-based improvements of the Promethee-GAIA multicriteria decision support technique

The Promethee-GAIA method is a multicriteria decision support technique that defines the aggregated ranks of multiple criteria and visualizes them based on Principal Component Analysis (PCA). When the number of criteria is large, the PCA biplot-based visualization does not reveal how a criterion influences the decision problem. The central question is how the Promethee-GAIA-based decision-making process can be improved to gain more interpretable results that reveal more characteristic inner relationships between the criteria. To improve the Promethee-GAIA method, we suggest three techniques that eliminate redundant criteria, clearly outline which criterion belongs to which factor, and explore the similarities between criteria. These methods are the following: A) principal factoring with rotation and communality analysis (P-PFA), B) the integration of sparse PCA into the Promethee II method (P-sPCA), and C) the Sum of Ranking Differences method (P-SRD). The suggested methods are presented through an I4.0+ dataset that measures the Industry 4.0 readiness of NUTS2-classified regions. The proposed methods are useful tools for handling multicriteria ranking problems when the number of criteria is large.
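
To make the "aggregated ranks" concrete, here is a minimal sketch of how Promethee II net outranking flows can be computed. It assumes the simplest ("usual") preference function, where an alternative is preferred on a criterion whenever its value is strictly larger, and it assumes all criteria are maximized. The function name, the weights, and the toy data are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def promethee_net_flows(
    scores: List[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Net outranking flow of each alternative (Promethee II).

    Assumptions for this sketch: every criterion is maximized and the
    'usual' preference function is used (preference is 1 if strictly
    better on a criterion, otherwise 0).
    """
    n = len(scores)
    total_weight = sum(weights)

    def preference(a: Sequence[float], b: Sequence[float]) -> float:
        # Weighted share of criteria on which a is strictly better than b.
        return sum(w for w, x, y in zip(weights, a, b) if x > y) / total_weight

    flows = []
    for i, a in enumerate(scores):
        others = [b for j, b in enumerate(scores) if j != i]
        positive = sum(preference(a, b) for b in others) / (n - 1)
        negative = sum(preference(b, a) for b in others) / (n - 1)
        flows.append(positive - negative)
    return flows


# Three hypothetical alternatives scored on two criteria; the first
# criterion is given twice the weight of the second.
flows = promethee_net_flows([(3, 1), (2, 2), (1, 3)], weights=[2, 1])
ranking = sorted(range(len(flows)), key=lambda i: -flows[i])
```

Sorting the alternatives by decreasing net flow gives the complete Promethee II ranking. The GAIA plane is then obtained by applying PCA to the per-criterion flows, which is the step that the three proposed techniques modify.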

Comprehensible Visualization of Multidimensional Data: Ranking Differences-Based Parallel Coordinates

A novel visualization technique based on parallel coordinates is proposed for the sum of ranking differences (SRD) method. An axis is defined for each variable, on which the data are depicted row-wise. When the data points are connected, the lines may intersect. The fewer intersections there are between the variables, the more similar they are and the clearer the figure becomes. Therefore, the visualization depends on which techniques are used to order the variables. The key idea is to employ the SRD method to measure the degree of similarity of the variables, establishing a distance-based order. The distances between the axes are not uniformly distributed in the proposed visualization; their closeness reflects similarity, according to their SRD value. The proposed algorithm identifies false similarities through an iterative approach, where the angles between the SRD values determine on which side a variable is plotted. The algorithm is illustrated with MATLAB/Octave source code. The proposed tool is applied to study how the sources of greenhouse gas emissions can be grouped based on the statistical data of the countries. A comparison to multidimensional scaling (MDS)-based ordering is also given. The use case demonstrates the applicability of the method and the synergies of incorporating the SRD method into parallel coordinates.
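
The SRD value that drives the axis ordering can be sketched as follows. Each variable (column) and a reference are converted to ranks over the rows, and the absolute rank differences are summed. The original source code is MATLAB/Octave; this Python sketch is only an illustration, and the function names, the average-rank tie handling, and the default row-mean reference are assumptions. The example assumes the columns are already on comparable scales.

```python
from typing import List, Optional, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def srd(
    data: List[Sequence[float]], reference: Optional[Sequence[float]] = None
) -> List[float]:
    """Sum of ranking differences of every column against a reference.

    Assumption: if no reference is given, the row mean is used as the
    consensus reference.
    """
    n_cols = len(data[0])
    if reference is None:
        reference = [sum(row) / n_cols for row in data]
    ref_ranks = average_ranks(reference)
    return [
        sum(
            abs(r - q)
            for r, q in zip(average_ranks([row[j] for row in data]), ref_ranks)
        )
        for j in range(n_cols)
    ]


# Hypothetical data: four rows (objects) and three variables.
data = [[1, 2, 4], [2, 1, 3], [3, 4, 2], [4, 3, 1]]
srd_vs_mean = srd(data)
# Using the first variable as the reference gives variable-to-variable SRDs.
srd_vs_first = srd(data, reference=[row[0] for row in data])
```

A smaller SRD means that a variable orders the rows more like the reference does. Passing another variable as the reference yields a pairwise dissimilarity, which is the kind of quantity that can be used to space the parallel-coordinate axes.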

Genetic programming-based symbolic regression for goal-oriented dimension reduction

The majority of dimension reduction techniques are built upon the optimization of an objective function aiming to retain certain characteristics of the projected datapoints: the variance of the original dataset, the distance between the datapoints or their neighbourhood characteristics, etc. Building upon the optimization-based formalization of dimension reduction techniques, the goal-oriented formulation of projection cost functions is proposed. For the optimization of the application-oriented data visualization cost function, a multi-gene genetic programming (GP)-based algorithm is introduced to optimize the structures of the equations used for mapping high-dimensional data into a two-dimensional space and to select variables that are needed to explore the internal structure of the data for data-driven software sensor development or classifier design. The main benefit of the approach is that the evolved equations are interpretable and can be utilized in surrogate models. The applicability of the approach is demonstrated on the benchmark wine dataset and in the estimation of the product quality in a diesel oil blending technology based on an online near-infrared (NIR) analyzer. The results illustrate that the algorithm is capable of generating goal-oriented and interpretable features, and the resultant simple algebraic equations can be directly implemented into applications when there is a need for computationally cost-effective projections of high-dimensional data, as the resultant algebraic equations are computationally simpler than other solutions such as neural networks.

Local and global mappings of topology representing networks

As data analysis tasks often have to deal with complex data structures, nonlinear dimensionality reduction methods play an important role in exploratory data analysis. A number of nonlinear dimensionality reduction techniques have been proposed in the literature (e.g. Sammon mapping, Locally Linear Embedding). These techniques attempt to preserve either the local or the global geometry of the original data, and they perform metric or non-metric dimensionality reduction. Nevertheless, it is difficult to apply most of them to large data sets. There is a need for new algorithms that are able to combine vector quantisation and mapping methods in order to visualise the data structure in a low-dimensional vector space. In this paper we define a new class of algorithms that quantify and disclose the data structure; they are based on topology representing networks and apply different mapping methods for the low-dimensional visualisation. Not only are existing methods combined for that purpose, but a novel group of mapping methods (Topology Representing Network Map) is also introduced as a part of this class. Topology Representing Network Maps utilise the main benefits of topology representing networks and of multidimensional scaling methods to disclose the real structure of the data set under study. To determine the main properties of the topology representing network based mapping methods, a detailed analysis of classical benchmark examples (the Wine and the Optical Recognition of Handwritten Digits data sets) is presented.

Visualization and Complexity Reduction of Neural Networks

The identification of the proper structure of nonlinear neural networks (NNs) is a difficult problem, since these black-box models are not interpretable. The aim of the paper is to propose a new approach that can be used for the analysis and the reduction of these models. It is shown that NNs with a sigmoid transfer function can be transformed into fuzzy systems. Hence, with the use of this transformation, NNs can be analyzed by human experts based on the extracted linguistic rules. Moreover, based on the similarity of the resulting membership functions, the hidden neurons of the NNs can be mapped into a two-dimensional space. The resulting map provides an easily interpretable picture of the redundancy of the neurons. Furthermore, the contribution of these neurons can be measured by the orthogonal least squares technique, which can be used to order the extracted fuzzy rules based on their importance. A practical example related to the dynamic modeling of a chemical process system is used to show that the synergistic combination of model transformation, visualization and reduction of NNs is an effective technique that can be used for the structural and parametrical analysis of NNs.

Topology Representing Network Map – A New Tool for Visualization of High-Dimensional Data

In practical data mining problems high-dimensional data has to be analyzed. In most of these cases it is very informative to map and visualize the hidden structure of a complex data set in a low-dimensional space. The aim of this paper is to propose a new mapping algorithm based on both the topology and the metric of the data.

The utilized Topology Representing Network (TRN) combines neural gas vector quantization and the competitive Hebbian learning rule in such a way that the hidden data structure is approximated by a compact graph representation. TRN is able to define a low-dimensional manifold in the high-dimensional feature space. If such a manifold exists, multidimensional scaling and/or Sammon mapping of the graph distances can be used to form the map of the TRN (TRNMap).

The systematic analysis of the algorithms that can be used for data visualization and the numerical examples presented in this paper demonstrate that the resulting map gives a good representation of the topology and the metric of complex data sets, and the component plane representation of TRNMap is useful to explore the hidden relations among the features.
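
The graph distances mentioned in the abstract can be illustrated with a short sketch. It assumes that the TRN has already produced a set of codebook vectors and a list of edges between them, that each edge is weighted by the Euclidean distance of its end points, and that the graph distance is the shortest-path length. These choices, the function name, and the toy graph are assumptions made for this example, not details confirmed by the abstract.

```python
import math
from typing import List, Sequence, Tuple


def graph_distances(
    nodes: List[Sequence[float]], edges: List[Tuple[int, int]]
) -> List[List[float]]:
    """All-pairs shortest-path distances (Floyd-Warshall algorithm).

    Assumption: every edge is weighted by the Euclidean distance between
    the two codebook vectors it connects. Unreachable pairs stay at inf.
    """
    n = len(nodes)
    dist = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for i, j in edges:
        weight = math.dist(nodes[i], nodes[j])
        dist[i][j] = dist[j][i] = weight
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = dist[i][k] + dist[k][j]
                if via_k < dist[i][j]:
                    dist[i][j] = via_k
    return dist


# Hypothetical codebook vectors lying on a U-shaped curve, with edges
# only between neighbours along the curve.
nodes = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
edges = [(0, 1), (1, 2), (2, 3)]
d = graph_distances(nodes, edges)
```

In this toy graph the two end points are 1.0 apart in Euclidean terms, but their graph distance follows the curve and is 3.0. Feeding such a distance matrix to multidimensional scaling or Sammon mapping is what allows the map to unfold the manifold instead of collapsing it.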

Topology Representing Networks Based Visualizations of Manifolds

In practical data mining tasks high-dimensional data has to be analyzed. In most cases it is very informative to map and visualize the hidden structure of a complex data set in a low-dimensional space. In this paper a new class of mapping algorithms is defined. These algorithms combine topology representing networks and different nonlinear mapping algorithms. While the former methods aim to quantize the data and disclose the real structure of the objects, the nonlinear mapping algorithms are able to visualize the quantized data in the low-dimensional vector space. In this paper we review the techniques based on these methods and show the results of a detailed analysis performed on them. The primary aim of this analysis was to examine the preservation of distances and neighborhood relations of the objects. Preservation of neighborhood relations was analyzed in both local and global environments. To evaluate the main properties of the examined methods, we show the outcome of the analysis based on a synthetic and a real benchmark example.

Visualization of fuzzy clusters by fuzzy Sammon mapping projection: application to the analysis of phase space trajectories

Since in practical data mining problems high-dimensional data are clustered, the resulting clusters are high-dimensional geometrical objects, which are difficult to analyze and interpret. Cluster validity measures try to solve this problem by providing a single numerical value. As a low-dimensional graphical representation of the clusters could be much more informative than such a single value, this paper proposes a new tool for the visualization of fuzzy clustering results. By using the basic properties of fuzzy clustering algorithms, this new tool maps the cluster centers and the data such that the distances between the clusters and the data points are preserved. During the iterative mapping process, the algorithm uses the membership values of the data and minimizes an objective function similar to that of the original clustering algorithm. Compared to the original Sammon mapping, not only are reliable cluster shapes obtained, but the numerical complexity of the algorithm is also drastically reduced. The developed tool has been applied to the visualization of reconstructed phase space trajectories of chaotic systems. The case study demonstrates that the proposed FUZZSAMM algorithm is a useful tool in user-guided clustering.
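
One way to read the objective described above is as a membership-weighted stress between data points and cluster centers. The sketch below evaluates such a stress for a given projection. The exact form (squared distance errors weighted by memberships raised to a fuzzifier m), the function name, and the toy data are assumptions made for illustration. The sketch only evaluates the objective and does not perform the iterative minimization.

```python
import math
from typing import List, Sequence


def fuzzy_stress(
    data: List[Sequence[float]],
    centers: List[Sequence[float]],
    proj_data: List[Sequence[float]],
    proj_centers: List[Sequence[float]],
    memberships: List[List[float]],
    m: float = 2.0,
) -> float:
    """Membership-weighted mismatch of point-to-center distances.

    memberships[i][k] is the membership of data point k in cluster i.
    Assumed form: sum over i, k of memberships[i][k]**m * (d - d_proj)**2.
    """
    total = 0.0
    for i, (v, z) in enumerate(zip(centers, proj_centers)):
        for k, (x, y) in enumerate(zip(data, proj_data)):
            d_high = math.dist(x, v)  # distance in the original space
            d_low = math.dist(y, z)   # distance in the projection
            total += memberships[i][k] ** m * (d_high - d_low) ** 2
    return total


# Hypothetical 3-D data that actually lie in the z = 0 plane.
data = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (4.0, 3.0, 0.0)]
centers = [(0.5, 0.0, 0.0), (4.0, 3.0, 0.0)]
memberships = [[0.9, 0.9, 0.1], [0.1, 0.1, 0.9]]

# Dropping the z coordinate preserves every distance, so the stress is zero.
flat_data = [p[:2] for p in data]
flat_centers = [c[:2] for c in centers]
perfect = fuzzy_stress(data, centers, flat_data, flat_centers, memberships)

# Collapsing both centers onto the origin distorts the distances.
distorted = fuzzy_stress(data, centers, flat_data, [(0.0, 0.0)] * 2, memberships)
```

Under this reading, only c × N point-to-center distances are needed for c clusters and N data points, instead of the N(N − 1)/2 pairwise distances of the classical Sammon stress. For N = 1000 and c = 3 this is 3000 terms rather than 499500, which is consistent with the reduced numerical complexity the abstract reports.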

Node Similarity Based Graph Clustering and Visualization

The presented methods for the visualization and clustering of graphs are based on a novel similarity and distance metric and on the matrix describing the similarity of the nodes in the graph. This matrix represents the type of connections between the nodes in the graph in a compact form, and thus provides a very good starting point for both the clustering and the visualization algorithms. Visualization is performed with the multidimensional scaling (MDS) dimensionality reduction technique, which obtains the spectral decomposition of this matrix, while the partitioning builds on the results of this step and generates a hierarchical representation. A detailed example is shown to demonstrate the capability of the described algorithms for clustering and visualizing the link structure of Web sites.

FUZZSAM – Visualization of Fuzzy Clustering Results by Modified Sammon Mapping

Since in practical data mining problems high-dimensional data are clustered, the resulting clusters are high-dimensional geometrical objects, which are difficult to analyze and interpret. Cluster validity measures try to solve this problem by providing a single numerical value. As a low-dimensional graphical representation of the clusters could be much more informative than such a single value, this paper proposes a new tool for the visualization of fuzzy clustering results. By using the basic properties of fuzzy clustering algorithms, this new tool maps the cluster centers and the data such that the distances between the clusters and the data points are preserved. During the iterative mapping process, the algorithm uses the membership values of the data and minimizes an objective function similar to that of the original clustering algorithm. Compared to the original Sammon mapping, not only are reliable cluster shapes obtained, but the numerical complexity of the algorithm is also drastically reduced. The algorithm has been applied to several data sets, and the numerical results show performance superior to principal component analysis and the classical Sammon mapping based projection. The examples demonstrate that the proposed FUZZSAMM algorithm is a useful tool in user-guided clustering.