Frequent Sequence, Itemset and Association Rule Mining
Integrated Survival Analysis and Frequent Pattern Mining for Course Failure-Based Prediction of Student Dropout
A data-driven method is proposed to identify frequent sets of course failures that students should avoid in order to minimize the likelihood of dropping out of their university training. The overall probability distribution of dropout is determined by survival analysis, but this result can only describe the mean dropout rate of the undergraduates. Because the chance of dropout varies widely depending on which courses are failed, the traditional survival model should be extended with event analysis. The study paths of students are represented as events that record the required subjects left uncompleted in each semester. Frequent patterns of backlogs are discovered by mining frequent sets of these events. The prediction of dropout is personalised by classifying the success of the transitions between semesters. Based on the explored frequent itemsets and classifiers, association rules are formed that estimate the success of continuing the studies in the form of confidence metrics. The results can be used to identify critical study paths and courses. Furthermore, based on the patterns of individual uncompleted subjects, the method can predict the chance of continuation in every semester. The analysis of the critical study paths can be used to design personalised actions that minimize the risk of dropout, or to redesign the curriculum to reduce the dropout rate. The applicability of the method is demonstrated by analysing the progress of chemical engineering students at the University of Pannonia in Hungary. The method is also suitable for examining more general problems in which combinations of a set of events may trigger a set of critical events.
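As a rough illustration of how a rule of the form "failed courses imply dropout" can be scored, the following Python sketch computes the support and confidence of such rules. The course names, the records, and the thresholds are hypothetical, and the brute-force enumeration stands in for a proper frequent itemset miner; it is not the implementation used in the study.

```python
from itertools import combinations

# Hypothetical records: (set of failed courses, continued to next semester?)
records = [
    ({"Math1", "Chem1"}, False),
    ({"Math1"}, True),
    ({"Math1", "Chem1"}, False),
    ({"Phys1"}, True),
    ({"Math1", "Chem1", "Phys1"}, False),
    (set(), True),
    ({"Chem1"}, True),
    ({"Math1", "Phys1"}, True),
]


def dropout_rules(records, min_support=0.25, max_size=2):
    """Return {itemset: (support, confidence)} for rules 'itemset -> dropout'.

    support    = share of students who failed every course in the itemset
    confidence = share of those students who did not continue
    """
    n = len(records)
    courses = sorted(set().union(*(failed for failed, _ in records)))
    rules = {}
    for size in range(1, max_size + 1):
        # Brute-force enumeration; fine for a toy example only.
        for itemset in combinations(courses, size):
            items = set(itemset)
            matching = [cont for failed, cont in records if items <= failed]
            support = len(matching) / n
            if support >= min_support:
                dropped = sum(1 for cont in matching if not cont)
                rules[itemset] = (support, dropped / len(matching))
    return rules


rules = dropout_rules(records)
for itemset, (support, confidence) in sorted(rules.items(), key=lambda kv: -kv[1][1]):
    print(itemset, f"support={support:.2f}", f"confidence={confidence:.2f}")
```

In this toy data, failing both Math1 and Chem1 has a higher dropout confidence than failing either course alone, which is the kind of critical combination the method is meant to surface.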
Directions of membrane separator development for microbial fuel cells: A retrospective analysis using frequent itemset mining and descriptive statistical approach
To increase the efficiency of microbial fuel cells (MFCs), the separator (which is mostly a membrane) placed between the electrodes or their compartments is considered highly important, alongside several other biotic and abiotic factors (e.g. configuration, mode of operation, types of inoculum and substrate). Nafion-based proton exchange membranes (PEMs) are the most widespread, although these materials are often criticized on various technological and economic grounds. Therefore, to find alternatives to Nafion, the synthesis, development and testing of novel/commercialized membrane separators with enhanced characteristics have been hot topics. In this study, the goals were to assess membrane-installed MFCs retrospectively and to reveal the trends, the applied practices, frequent setups, etc. via Bayesian classification and frequent itemset mining algorithms. Thereafter, a separate discussion examines the current standing of research on the major membrane groups used in MFCs and evaluates, in light of the big picture, how the various systems behave in comparison with each other, especially compared to those applying Nafion PEM. It was concluded that some membrane types seem to be competitive with Nafion; however, standardizing the experiments would allow a less ambiguous comparison of studies.
Koók, L., Dorgo, G., Bakonyi, P., Rózsabenerszki, T., Nemestóthy, N., Varga, Bélafiné-Bakó Katalin & Abonyi, J. (2020). Directions of membrane separator development for microbial fuel cells: A retrospective analysis using frequent itemset mining and descriptive statistical approach. Journal of Power Sources, 481, 229014.
Frequent pattern mining in multidimensional organizational networks
Network analysis can be applied to understand organizations based on patterns of communication, knowledge flows, trust, and the proximity of employees. A multidimensional organizational network was designed, and association rule mining of the edge labels was applied to reveal how relationships, motivations, and perceptions determine each other in different scopes of activities and types of organizations. Frequent itemset-based similarity analysis of the nodes provides the opportunity to characterize typical roles in organizations and clusters of co-workers. A survey was designed to define 15 layers of the organizational network and demonstrate the applicability of the method in three companies. The novelty of our approach resides in the evaluation of people in organizations as frequent multidimensional patterns of multilayer networks. The results illustrate that the overlapping edges of the proposed multilayer network can be used to highlight the motivation and managerial capabilities of the leaders and to find similarly perceived key persons.
Towards Operator 4.0, Increasing Production Efficiency and Reducing Operator Workload by Process Mining of Alarm Data
A methodology is proposed to extract temporal patterns of alarm sequences and operator actions from the log files of alarm management systems. First, the algorithm identifies time segments that are informative from the viewpoint of operator interventions. These segments include the series of alarms that initiate operator actions, the sets of operator actions, and a period that potentially covers the effects of the corrective actions of the operators. In the second step of the methodology, the sets of operator actions that are frequently applied in the same situations are determined. For this purpose, the FP-Growth algorithm is used, which is one of the fastest tools of frequent itemset mining. It generates well-structured action trees that are suitable not only for visualizing interventions but also for building association rules that could be directly applied in decision support systems. Finally, multi-temporal sequence mining is applied to reveal which alarms led to the sets of operator actions and what the effects of these interventions were. The applicability of the methodology is illustrated by presenting results from the analysis of the delayed coker plant at the Danube Refinery of the MOL Group.
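To make the idea of an action tree concrete, the following Python sketch builds the prefix tree (FP-tree) that forms the first stage of FP-Growth. It covers only the tree construction, not the recursive mining of conditional trees, and the operator action names and the minimum count are hypothetical.

```python
from collections import Counter


class Node:
    """One node of the FP-tree: an item and how many transactions pass through it."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_count):
    """Insert each transaction into a prefix tree, most frequent items first."""
    freq = Counter(item for t in transactions for item in set(t))
    keep = {item for item, c in freq.items() if c >= min_count}
    root = Node(None)
    for t in transactions:
        # Descending frequency, ties broken alphabetically, so shared
        # prefixes collapse into shared branches.
        ordered = sorted((i for i in set(t) if i in keep), key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = Node(item, node)
            node = node.children[item]
            node.count += 1
    return root


def show(node, depth=0):
    """Return the tree as indented 'item:count' lines."""
    lines = []
    for child in sorted(node.children.values(), key=lambda c: (-c.count, c.item)):
        lines.append("  " * depth + f"{child.item}:{child.count}")
        lines.extend(show(child, depth + 1))
    return lines


# Hypothetical sets of operator actions, one per intervention segment.
actions = [
    ["open_valve", "reduce_feed"],
    ["open_valve", "reduce_feed", "start_pump"],
    ["open_valve", "start_pump"],
    ["reduce_feed"],
    ["open_valve", "reduce_feed", "ack_alarm"],
]
tree = build_fp_tree(actions, min_count=2)
print("\n".join(show(tree)))
```

Each branch of the printed tree is a set of actions that occurred together, with the count showing how often; infrequent actions (here `ack_alarm`) are pruned before insertion.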
Sequence Mining based Alarm Suppression
Despite the rapid improvement of industrial process automation, the management of abnormal events still requires human actions. Alarm systems are becoming crucial in providing situation-specific information to the decreasing number of operators. The key role of an alarm management system is to ensure that only the currently significant alarms are annunciated. The design of alarm suppression rules requires the systematic analysis of the process and its control system. We give an overview of the recently developed data-driven techniques and show that the widely applied correlation-based methods utilize a static view of the system. To provide more insight into the process dynamics and represent the temporal relationships among faults, control actions, and process variables, we propose a multi-temporal sequence mining-based algorithm. The methodology starts with the generation of frequent temporal patterns of the alarm signals. We transform the multi-temporal sequences into Bayes classifiers. The obtained association rules can be used to define the alarm suppression rules. We analyze the data set of a laboratory-scale water treatment testbed to illustrate that multi-temporal sequences are applicable for the description of operation patterns. We extended the benchmark simulator of a vinyl acetate production technology to generate easily reproducible results and stimulate the development of alarm management algorithms. The results of detailed sensitivity analyses confirm the benefits of applying temporal alarm suppression rules, which reflect the dynamic behavior of the process.
For the extended simulator of the vinyl acetate production technology and the source codes of the Bayes’ theorem-based evaluation of sequences see: HTTPS://GITHUB.COM/ABONYILAB/VACSIMULATOR
The MATLAB implementation of the sequence mining algorithm is available at: HTTPS://GITHUB.COM/ABONYILAB/MULTI-TEMPORAL-SEQUENCE-MINING
Multi-temporal sequential pattern mining based improvement of alarm management systems
Even in the case of a simple failure, modern process control systems can generate a vast number of alarms. Because they overload the operators, these alarm floods may result in tragic accidents. Alarm management systems can suppress correlated and predictable alarms to reduce the workload of the operators. Since the process units of complex production systems are strongly interconnected, the signals defined on different process variables generate complex multi-temporal patterns. We propose a multi-temporal sequence mining based approach to extract these patterns and form alarm suppression rules. We demonstrate the applicability of the concept in a vinyl acetate production technology. The results illustrate that the multi-temporal analysis of events defined on process variables can detect the causes of alarms and prevent alarm floods by proactively suppressing alarms based on the extracted sequences of events.
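The intuition behind a temporal suppression rule can be sketched in Python as follows: if one alarm is almost always followed by another within a short time window, the second alarm is predictable and is a candidate for suppression. This is a deliberately simplified stand-in for multi-temporal sequence mining, treating alarms as instantaneous events; the alarm tags, timestamps, and window length are hypothetical.

```python
def follow_confidence(events, antecedent, consequent, window):
    """Share of `antecedent` occurrences followed by `consequent` within `window`.

    events: list of (timestamp, tag) pairs.
    """
    starts = [t for t, tag in events if tag == antecedent]
    targets = [t for t, tag in events if tag == consequent]
    if not starts:
        return 0.0
    hits = sum(1 for t0 in starts if any(t0 < t <= t0 + window for t in targets))
    return hits / len(starts)


# Hypothetical alarm log: (time in seconds, alarm tag)
log = [
    (0, "TI101_HIGH"), (20, "PI205_HIGH"),
    (100, "TI101_HIGH"), (115, "PI205_HIGH"),
    (200, "TI101_HIGH"), (300, "PI205_HIGH"),
    (400, "TI101_HIGH"), (410, "PI205_HIGH"),
]

forward = follow_confidence(log, "TI101_HIGH", "PI205_HIGH", window=30)
backward = follow_confidence(log, "PI205_HIGH", "TI101_HIGH", window=30)
print(forward, backward)
```

The asymmetry between the two directions is what a static correlation measure would miss: the temperature alarm predicts the pressure alarm, but not the other way round.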
Fuzzy association rule mining for feature and model structure selection
Effective methods for feature and model structure selection are very important for data-driven modeling and system identification tasks. A new method was developed for selecting important variables in nonlinear (dynamic) models with mixed discrete (categorical, fuzzy) and continuous inputs and outputs. The method applies fuzzy association rule mining, and the selection of important variables (the model structure) is based on two rule interestingness measures. The method is able to select the most relevant variables in nonlinear feature selection problems. Moreover, it selects the right model order of strongly nonlinear dynamical systems, so it can be a very efficient tool for process modeling.
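The following Python sketch shows how fuzzy rules can rank candidate input variables. The source does not name the two interestingness measures, so fuzzy support and confidence are assumed here; the two-set partition, the product t-norm, and the data are likewise illustrative assumptions rather than the published method.

```python
def memberships(x):
    """Two fuzzy sets on [0, 1], 'low' and 'high', whose degrees sum to 1."""
    x = min(max(x, 0.0), 1.0)
    return {"low": 1.0 - x, "high": x}


def fuzzy_support(samples, items):
    """Mean firing degree of a fuzzy itemset, using the product t-norm.

    items: list of (variable, label) pairs, e.g. [("x1", "high")].
    """
    total = 0.0
    for sample in samples:
        degree = 1.0
        for var, label in items:
            degree *= memberships(sample[var])[label]
        total += degree
    return total / len(samples)


def fuzzy_confidence(samples, antecedent, consequent):
    """supp(antecedent and consequent) / supp(antecedent)."""
    supp_a = fuzzy_support(samples, antecedent)
    if supp_a == 0.0:
        return 0.0
    return fuzzy_support(samples, antecedent + consequent) / supp_a


# Hypothetical normalized data in which y follows x1 more closely than x2.
samples = [
    {"x1": 0.25, "x2": 0.75, "y": 0.25},
    {"x1": 0.75, "x2": 0.25, "y": 0.75},
    {"x1": 1.0, "x2": 1.0, "y": 1.0},
    {"x1": 0.0, "x2": 0.0, "y": 0.0},
]

conf_x1 = fuzzy_confidence(samples, [("x1", "high")], [("y", "high")])
conf_x2 = fuzzy_confidence(samples, [("x2", "high")], [("y", "high")])
print(conf_x1, conf_x2)
```

A variable whose rules reach higher confidence at comparable support (here `x1`) would be preferred as a model input under these assumptions.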
Compact and accurate fuzzy classifiers can be constructed by fuzzy association rule mining
Interpretability and accuracy are critical issues in many classification applications. Associative classifier methods can achieve high accuracy, but their predictions are based on excessively large sets of rules. In contrast, a new method was developed that produces fuzzy classifier systems that are both very compact and accurate. It therefore helps to understand the relationships in the data and the prediction mechanism in several types of classification problems.
Bit-Table Based Biclustering and Frequent Closed Itemset Mining in High-Dimensional Binary Data
During the last decade various algorithms have been developed and proposed for discovering overlapping clusters in high-dimensional data. The two most prominent application fields in this research, proposed independently, are frequent itemset mining (developed for market basket data) and biclustering (applied to gene expression data analysis). The common limitation of both methodologies is the limited applicability for very large binary data sets. In this paper we propose a novel and efficient method to find both frequent closed itemsets and biclusters in high-dimensional binary data. The method is based on simple but very powerful matrix and vector multiplication approaches that ensure that all patterns can be discovered in a fast manner. The proposed algorithm has been implemented in the commonly used MATLAB environment.
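The role of matrix and vector products on a bit-table can be illustrated with a small Python sketch. For a binary matrix B (rows are transactions, columns are items), the product of B transposed with B gives the co-occurrence count of every item pair, and an elementwise product of columns gives the rows that support a larger itemset. This shows the counting idea only; it does not check closedness and is not the published Bittable_TID algorithm. The data and the minimum count are hypothetical.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain-Python matrix product, adequate for a toy bit-table."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def supporting_rows(bittable, items):
    """Indices of rows containing every item (elementwise product of columns)."""
    return [r for r, row in enumerate(bittable) if all(row[i] for i in items)]


# Hypothetical bit-table: 5 transactions (rows) x 4 items (columns).
B = [
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
]

# counts[i][j] = number of rows containing both item i and item j;
# the diagonal holds the single-item supports.
counts = matmul(transpose(B), B)

min_count = 2
frequent_pairs = {
    (i, j): counts[i][j]
    for i in range(len(counts))
    for j in range(i + 1, len(counts))
    if counts[i][j] >= min_count
}
print(frequent_pairs)
print(supporting_rows(B, [0, 1]))
```

A frequent itemset together with its supporting rows is a submatrix of ones, which is why the same computation serves both itemset mining and biclustering.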
Bittable_TID - a Bit-Table based biclustering tool
Bittable_TID is a biclustering tool written in MATLAB. It provides a fast solution for finding all biclusters within a binary data matrix.
Quick Download and running guide
You can download the MATLAB source code, the other software tools used for comparison and data sets from here:
Bittable_TID program package: Bittable_TID
BiMAX software package: BiMAX
Supplementary data for the related publication (input data sets): Bittable_TID datasets
After downloading and unpacking the program package, the program can be run by opening the file bittable_TID.m in MATLAB. The resulting closed itemsets are stored in the variable itemsCell. Each row represents a closed itemset, where the first column contains the involved rows and the second column contains the involved columns.