Machine Learning Based Analysis of Human Serum N-glycome Alterations to Follow up Lung Tumor Surgery

The human serum N-glycome is a valuable source of biomarkers for malignant diseases, already utilized in multiple studies. In this paper, the N-glycosylation changes in human serum proteins were analyzed after surgical lung tumor resection. Seventeen lung cancer patients were involved in this study and the N-glycosylation pattern of their serum samples was analyzed before and after the surgery using capillary electrophoresis separation with laser-induced fluorescent detection. The relative peak areas of 21 N-glycans were evaluated from the acquired electropherograms using machine learning-based data analysis. Individual glycans as well as their subclasses were taken into account during the course of evaluation. For the data analysis, both discrete (e.g., smoker or not) and continuous (e.g., age of the patient) clinical parameters were compared against the alterations in these 21 N-linked carbohydrate structures. The classification tree analysis resulted in a panel of N-glycans, which could be used to follow up on the effects of lung tumor surgical resection.

Decision trees for informative process alarm definition and alarm-based fault classification

Alarm messages in industrial processes are designed to draw attention to abnormalities that require timely assessment or intervention. In practice, however, alarms are arbitrarily and excessively defined by process operators, resulting in numerous nuisance and chattering alarms that are simply a source of distraction. Countless techniques are available for the retrospective filtering of alarm data, e.g., adding time delays and deadbands to existing alarm settings. As an alternative, instead of filtering or modifying existing alarms, the present paper proposes a method for designing alarm messages that are informative for fault detection, on the premise that alarm messages should be optimal for fault detection and identification from the outset. This methodology utilizes a machine learning technique, the decision tree classifier, which provides linguistically well-interpretable models without modifying the measured process variables. Furthermore, an online application of the defined alarm messages for fault identification is presented using a sliding window-based data preprocessing approach. The effectiveness of the proposed methodology is demonstrated through the analysis of a well-known benchmark simulator of a vinyl-acetate production technology, where the complexity of the simulator is considered to be sufficient for the testing of alarm systems.

Note to practitioners: Process-specific knowledge can be used to label historical process data as normal operating and fault-specific periods. Alarm generation should be designed to be able to detect and isolate faulty states. Using decision trees, optimal "cuts" or alarm limits for the purpose of fault classification can be defined utilizing a labelled dataset. The results apply to a variety of industries operating with online control systems, and are especially timely in the chemical industry.
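The idea of deriving an alarm limit from a decision tree "cut" can be illustrated with a minimal sketch. It assumes a single process variable and a single split chosen by Gini impurity; the variable name, the readings, and the function `best_alarm_limit` are hypothetical and are not taken from the paper, which works with full decision trees on a vinyl-acetate benchmark.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure set)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_alarm_limit(values, labels):
    """Return (limit, weighted_gini) of the single cut that best separates the labels."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_limit, best_score = None, float("inf")
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no cut can be placed between identical measurements
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            # place the candidate limit halfway between neighbouring readings
            best_limit, best_score = (low + high) / 2.0, score
    return best_limit, best_score


# Hypothetical labelled history: temperature readings and the operating state.
temperature = [150.2, 151.0, 149.8, 150.5, 158.9, 160.1, 159.4, 161.0]
state = ["normal"] * 4 + ["fault"] * 4

limit, impurity = best_alarm_limit(temperature, state)
print(f"alarm limit: {limit:.2f}, weighted Gini after the cut: {impurity:.3f}")
```

A full decision tree repeats this search recursively over all variables, so each path from root to leaf yields a combination of limits that characterises one fault class.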

Supervised fuzzy clustering for the identification of fuzzy classifiers

The classical fuzzy classifier consists of rules, each one describing one of the classes. In this paper a new fuzzy model structure is proposed where each rule can represent more than one class with different probabilities. The obtained classifier can be considered as an extension of the quadratic Bayes classifier that utilizes a mixture of models for estimating the class conditional densities. A supervised clustering algorithm has been worked out for the identification of this fuzzy model. The relevant input variables of the fuzzy classifier have been selected based on the analysis of the clusters by Fisher’s interclass separability criteria. This new approach is applied to the well-known wine and Wisconsin breast cancer classification problems.

Abonyi, Janos, and Ferenc Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers." Pattern Recognition Letters 24.14 (2003): 2195-2207. (MATLAB implementation)
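As a rough illustration of how a Fisher-type separability measure can rank input variables, the sketch below computes a simplified per-feature score: the between-class scatter divided by the within-class scatter. This is an assumption made for illustration only. The paper applies Fisher's interclass separability criterion to the clusters obtained by supervised clustering, not to raw labelled samples, and the data below are made up.

```python
def fisher_scores(samples, labels):
    """Per-feature ratio of between-class scatter to within-class scatter."""
    n_features = len(samples[0])
    classes = sorted(set(labels))
    scores = []
    for j in range(n_features):
        column = [sample[j] for sample in samples]
        overall_mean = sum(column) / len(column)
        between, within = 0.0, 0.0
        for c in classes:
            values = [v for v, lab in zip(column, labels) if lab == c]
            class_mean = sum(values) / len(values)
            between += len(values) * (class_mean - overall_mean) ** 2
            within += sum((v - class_mean) ** 2 for v in values)
        scores.append(between / within if within > 0 else float("inf"))
    return scores


# Made-up two-class data: feature 0 separates the classes, feature 1 does not.
samples = [(1.0, 5.0), (1.2, 3.0), (0.8, 4.0), (3.0, 4.5), (3.2, 3.5), (2.8, 4.0)]
labels = [0, 0, 0, 1, 1, 1]

scores = fisher_scores(samples, labels)
# feature indices ordered from most to least discriminative
ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
print(scores, ranked)
```

Features with a low score contribute little to separating the classes and are candidates for removal from the rule antecedents.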

Data-driven generation of compact, accurate, and linguistically sound fuzzy classifiers based on a decision tree initialization

The data-driven identification of fuzzy rule-based classifiers for high-dimensional problems is addressed. A binary decision-tree-based initialization of fuzzy classifiers is proposed for the selection of the relevant features and effective initial partitioning of the input domains of the fuzzy system. Fuzzy classifiers have more flexible decision boundaries than decision trees (DTs) and can therefore be more parsimonious. Hence, the decision tree initialized fuzzy classifier is reduced in an iterative scheme by means of similarity-driven rule-reduction. To improve classification performance of the reduced fuzzy system, a genetic algorithm with a multi-objective criterion searching for both redundancy and accuracy is applied. The proposed approach is studied for (i) an artificial problem, (ii) the Wisconsin Breast Cancer classification problem, and (iii) a summary of results is given for a set of well-known classification problems available from the Internet: Iris, Ionosphere, Glass, Pima, and Wine data.

Learning Fuzzy Classification Rules from Data

Automatic design of fuzzy rule-based classification systems based on labeled data is considered. It is recognized that both classification performance and interpretability are of major importance and effort is made to keep the resulting rule bases small and comprehensible. An iterative approach for developing fuzzy classifiers is proposed. The initial model is derived from the data and subsequently, feature selection and rule base simplification are applied to reduce the model, and a GA is used for model tuning. An application to the Wine data classification problem is shown.

Compact TS-fuzzy models through clustering and OLS plus FIS model reduction

Identification of uncertain and nonlinear systems is an important and challenging problem. Fuzzy models of the Takagi-Sugeno (TS) type may be a good choice to describe such systems; however, in many cases these soon become complex. We propose a three-step method to obtain compact TS-models that can be effectively used to represent complex systems: 1) a new fuzzy clustering method is proposed for identification of compact TS-models; 2) the most relevant consequent variables of the TS-model are selected by an orthogonal least squares (OLS) method based on the obtained clusters; and 3) for selection of relevant antecedent variables, a new method is proposed based on Fisher's interclass separability (FIS) criterion. The overall approach is demonstrated by means of the MPG (miles per gallon) nonlinear regression benchmark. Results are compared with those obtained by standard linear, neuro-fuzzy and advanced fuzzy clustering-based identification tools.

J. Abonyi, J.A. Roubos, M. Oosterom, F. Szeifert, Compact TS-fuzzy models through clustering and OLS plus FIS model reduction, FUZZ-IEEE'01 Conference, Sydney, Australia, 2001, (MATLAB implementation) (presentation)

Association rule and decision tree based methods for fuzzy rule base generation

This paper focuses on the data-driven generation of fuzzy IF...THEN rules. The resulting fuzzy rule base can be applied to build a classifier, a model used for prediction, or it can be applied to form a decision support system. Among the wide range of possible approaches, the decision tree and the association rule based algorithms are overviewed, and two new approaches are presented based on the a priori fuzzy clustering based partitioning of the continuous input variables. An application study is also presented, where the developed methods are tested on the well-known Wisconsin Breast Cancer classification problem.

F. P. Pach, J. Abonyi, Association rule and decision tree based methods for fuzzy rule base generation, Enformatika (Transactions on Engineering, Computing and Technology), Volume 13, 2006, 45-50

Modified Gath-Geva fuzzy clustering for identification of Takagi-Sugeno fuzzy models

The construction of interpretable Takagi-Sugeno (TS) fuzzy models by means of clustering is addressed. First, it is shown how the antecedent fuzzy sets and the corresponding consequent parameters of the TS model can be derived from clusters obtained by the Gath-Geva (GG) algorithm. To preserve the partitioning of the antecedent space, linearly transformed input variables can be used in the model. This may, however, complicate the interpretation of the rules. To form an easily interpretable model that does not use the transformed input variables, a new clustering algorithm is proposed, based on the expectation-maximization (EM) identification of Gaussian mixture models. This new technique is applied to two well-known benchmark problems: the MPG (miles per gallon) prediction and a simulated second-order nonlinear process. The obtained results are compared with results from the literature.

J. Abonyi, R. Babuska, F. Szeifert, Modified Gath-Geva fuzzy clustering for identification of Takagi-Sugeno fuzzy models, IEEE Trans. on Systems, Man and Cybernetics, Part B, Oct, 2002

Supervised clustering based decision tree induction

A new method based on supervised clustering was developed for the discretization of continuous features to form efficient fuzzy decision tree based classifiers. A proper classification rule structure is obtained by the feature discretization, rule-induction and rule-optimization procedures. The resulting fuzzy classifiers are very compact and well interpretable while the accuracy is still comparable to the best results reported in the literature.

J. Abonyi, Supervised Fuzzy Clustering Based Initialization of Fuzzy Partitions for Decision Tree Induction, Advances in Intelligent and Soft Computing, Soft Computing in Industrial Applications (75), 31-39

Application of Fuzzy Clustering and Piezoelectric Chemical Sensor Array for Investigation on Organic Compounds

The Fuzzy c-Means (FCM) clustering models were used for the discrimination of organic compounds using piezoelectric chemical sensor array data of 14 analytes. Appropriate clusters are found by minimizing the sum of the weighted quadratic distances between data points and cluster prototypes. A priori known information can be integrated into the clustering algorithm by using constrained prototypes. A sensor array was built using piezoelectric quartz crystal sensors. Four AT-cut quartz crystals with 9 MHz fundamental frequencies were applied. Sensing materials were OV1, OV275, ASI50, and polyphenyl ether. The appropriate coating materials were found by a principal component analysis. The application of the fuzzy clustering method has been proved to be a reliable way of identifying similar, pure organic compounds.

G. Barkó, J. Abonyi, J. Hlavay, Application of fuzzy clustering and piezoelectric chemical sensor array for investigation on organic compounds, Analytica Chimica Acta, 398 (2-3), 219-22, 1999
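The fuzzy c-means procedure referred to in the abstract above can be sketched as follows. This is a minimal, unconstrained version that alternates the standard membership and prototype updates; the fuzziness exponent m = 2, the deterministic initialisation, and the two-dimensional points are assumptions for illustration and do not reproduce the sensor array data or the constrained prototypes of the paper.

```python
def memberships(points, centers, m):
    """Fuzzy membership of every point in every cluster (each row sums to 1)."""
    exponent = 1.0 / (m - 1.0)
    result = []
    for p in points:
        # squared Euclidean distances, guarded against an exact zero
        d2 = [max(sum((a - b) ** 2 for a, b in zip(p, c)), 1e-12) for c in centers]
        result.append([1.0 / sum((d2[i] / d2[l]) ** exponent for l in range(len(centers)))
                       for i in range(len(centers))])
    return result


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=50):
    """Alternate membership and prototype updates for a fixed number of iterations."""
    dim = len(points[0])
    # Assumption: initialise the prototypes from evenly spaced data points.
    step = len(points) // n_clusters
    centers = [list(points[k * step]) for k in range(n_clusters)]
    for _ in range(n_iter):
        u = memberships(points, centers, m)
        centers = []
        for i in range(n_clusters):
            weights = [u[k][i] ** m for k in range(len(points))]
            total = sum(weights)
            # each prototype is the mean of the data weighted by membership^m
            centers.append([sum(w * p[j] for w, p in zip(weights, points)) / total
                            for j in range(dim)])
    return centers, memberships(points, centers, m)


# Made-up points forming two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
centers, u = fuzzy_c_means(points, n_clusters=2)
for point, row in zip(points, u):
    print(point, [round(value, 3) for value in row])
```

These two updates are the alternating optimisation steps for the weighted sum of squared distances between data points and prototypes, with the memberships raised to the power m acting as the weights.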

A Simple Fuzzy Classifier based on Inconsistency Analysis of Labeled Data

An extremely simple fuzzy classifier is identified based on the inconsistency analysis of labelled training data. The method was applied to the COIL challenge 2000 Direct Mail problem and resulted in 121 selected caravan policies within the first 800 selected customers. As this result is identical to the result of the winner of the competition, the presented method is an example of how the "try the simplest first" approach can be effective in real-life problems.

J. Abonyi, H. Roubos, Simple fuzzy classifier based on inconsistency analysis of labeled data, Chapter 12 in: CoIL Challenge 2000: The Insurance Company Case, Peter van der Putten and Maarten van Someren (eds), Sentient Machine Research, Amsterdam and Leiden Institute of Advanced Computer Science, Leiden, LIACS Technical Report, 1-10, 2000 (MATLAB implementation)

World map of clinkers - Visualization of trace element content of clinkers by self-organizing map

The analysis of data taken from the measurements of trace elements in clinkers may lead to extremely valuable insights into the properties of raw materials and can also be used to solve practical problems, such as determining the origin of the clinker (i.e. the manufacturing works). For this purpose, several hundred clinker sorts have been analysed by replicated quarterly samples, collected from factories of nine countries to determine their Mg, Sr, Ba, Mn, Ti, Zr, Zn, and V content. This paper describes a soft-computing based approach where the Self-Organizing Map (SOM) is used for the extraction of knowledge from this database related to the trace element content of clinkers. The SOM is a vector quantization method which places prototype vectors (cluster centers) on a regular low-dimensional grid in an ordered fashion. Since SOM provides a compact representation of the data distribution, the typical clinkers are detected by the SOM via clustering of the data. As the typical trace element contents are arranged on a two-dimensional projection of the concentration variables, the model can be effectively used to analyze the relationships between different factories and different trace elements.

F. D. Tamás, J. Abonyi, World map of clinkers - Visualization of trace element content of clinkers by self-organizing map, 11th International Congress on the Chemistry of Cement (ICCC), South Africa, 2003
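How a SOM places prototype vectors on a regular grid can be sketched with a small online training loop. The grid size, the linearly decaying learning rate and neighbourhood radius, the random initialisation, and the three-component vectors below are all assumptions for illustration; they are synthetic and unrelated to the clinker measurements.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the prototype closest to x in squared Euclidean distance."""
    return min(range(len(weights)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(weights[k], x)))


def train_som(data, rows, cols, n_epochs=30, lr0=0.5, seed=0):
    """Train a rows x cols SOM online; return grid coordinates and prototypes."""
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Assumption: prototypes start as randomly chosen data vectors.
    weights = [list(rng.choice(data)) for _ in grid]
    sigma0 = max(rows, cols) / 2.0
    total_steps = n_epochs * len(data)
    step = 0
    for _ in range(n_epochs):
        order = list(data)
        rng.shuffle(order)
        for x in order:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                     # learning rate decays to zero
            sigma = sigma0 * (1.0 - frac) + 0.3 * frac  # neighbourhood radius shrinks
            b_row, b_col = grid[best_matching_unit(weights, x)]
            for k, (r, c) in enumerate(grid):
                grid_dist2 = (r - b_row) ** 2 + (c - b_col) ** 2
                h = math.exp(-grid_dist2 / (2.0 * sigma ** 2))
                # pull the prototype towards x, more strongly near the winning unit
                weights[k] = [w + lr * h * (xi - w) for w, xi in zip(weights[k], x)]
            step += 1
    return grid, weights


# Synthetic three-component vectors (made-up values).
data = [(0.1, 0.2, 0.1), (0.2, 0.1, 0.2), (0.9, 0.8, 0.9),
        (0.8, 0.9, 0.8), (0.5, 0.1, 0.9), (0.4, 0.2, 0.8)]
grid, weights = train_som(data, rows=2, cols=3)
for x in data:
    print(x, "->", grid[best_matching_unit(weights, x)])
```

Because neighbouring grid units are updated together, samples with similar composition tend to be mapped to nearby units, which is what makes the two-dimensional map readable as a picture of the data.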

Trace elements in clinker I. - A graphical representation

The trace element content of clinkers (and possibly of cements) can be used for the qualitative identification of their origin (i.e. the manufacturing works). Several hundred clinker sorts have been analysed (by replicated quarterly samples, collected from all Hungarian cement factories, as well as from factories in eight foreign countries) to determine their Mg, Sr, Ba, Mn, Ti, Zr, Zn and V content. The first six elements come from the main raw materials and are of dactylogrammatic value, while the last two elements mainly come from fuel (used tires and heavy fuel oil, respectively) and cannot be used for identification. In this paper, a graphical method is presented to facilitate the visualisation of the trace element content.

F. D. Tamás, J. Abonyi, Trace elements in clinker I. - A graphical representation, Cement and Concrete Research, 32/8, 1319-1323, 2002

Trace elements in clinker II. – Qualitative identification by fuzzy clustering

The trace element content of clinkers (and possibly of cements) can be used for the qualitative identification (i.e., manufacturing factory). This paper proposes a fuzzy classifier for the discrimination of clinkers produced in different factories based on their Mg, Sr, Ba, Mn, Ti, Zr, Zn and V content. The fuzzy classifier is identified by unsupervised fuzzy clustering. The most relevant trace elements were selected based on the obtained clusters by the modified version of the Fisher interclass separability method. The classification of a country from the European Community and South African clinkers is used as an illustrative example. The results show that the proposed method is useful to identify compact classifiers that are able to determine the origin of the clinker; the obtained classifier is easy to use and interpret for engineers and researchers, even when they are not familiar with the concept of fuzzy logic.

F. D. Tamás, J. Abonyi, Trace elements in clinker II. – Qualitative identification by fuzzy clustering, Cement and Concrete Research, 32/8, 1325-1330, 2002

Computational intelligence in data mining

This paper aims to give a comprehensive view of the links between computational intelligence and data mining. Further, a case study is also given in which the extracted knowledge is represented by fuzzy rule-based expert systems obtained by soft computing based data mining algorithms. It is recognized that both model performance and interpretability are of major importance, and effort is required to keep the resulting rule bases small and comprehensible. Therefore, CI technique based data mining algorithms have been developed for feature selection, feature extraction, model optimization and model reduction (rule base simplification). Application of these techniques is illustrated using the Wine data classification problem. The results illustrate that CI based tools can be applied in a synergistic manner through the nine steps of knowledge discovery.