Frequent sequence, itemset and association rule mining

Time-dependent sequential association rule-based survival analysis: A healthcare application

The analysis of event sequences with temporal dependencies is important across many domains, including healthcare. This study introduces a novel approach that combines sequential rule mining and survival analysis to uncover significant associations and temporal patterns within event sequences. By integrating these techniques, we address the loss of temporal information that limits traditional sequential rule mining. The methodology extends traditional sequential rule mining by introducing time-dependent confidence functions, providing a comprehensive understanding of relationships between antecedent and consequent events. The Kaplan-Meier estimator of survival analysis is used to calculate the distribution of the time elapsed between events, which yields the time-dependent confidence functions. These confidence functions describe the probability of specific event occurrences as a function of the temporal context. We demonstrate the method in the healthcare domain: by analyzing ICD-10 codes and laboratory events, we identified relevant sequential rules and their time-dependent confidence functions. This empirical validation underscores the potential of the methodology to uncover clinically significant associations within intricate medical data.


Post date: 05 January 2024 
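
As a rough illustration of the time-dependent confidence idea described above, the following Python sketch computes a Kaplan-Meier curve for the time elapsed between an antecedent and a consequent event and reads the confidence at time t as 1 - S(t). This is only one plausible reading of the abstract, not the authors' implementation; the function names and the sample durations are hypothetical, and a censored case (False) is assumed to mean that the consequent was not seen before observation ended.

```python
from collections import Counter

def kaplan_meier(durations, observed):
    """Kaplan-Meier curve as [(t, S(t))] at each distinct event time."""
    events = Counter(t for t, seen in zip(durations, observed) if seen)
    leaving = Counter(durations)  # events and censored cases both leave the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = events.get(t, 0)
        if d:
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve

def confidence_at(curve, t):
    """Estimated probability that the consequent has occurred within time t."""
    survival = 1.0
    for event_time, s in curve:
        if event_time > t:
            break
        survival = s
    return 1.0 - survival

# Hypothetical data: days from antecedent to consequent; False marks a censored case
durations = [5, 8, 8, 12, 20, 30]
observed = [True, True, False, True, False, True]
curve = kaplan_meier(durations, observed)
for t in (0, 10, 30):
    print(t, round(confidence_at(curve, t), 3))
```

Treating sequences without the consequent as censored rather than discarding them keeps them in the risk set, which is what lets the estimate account for limited follow-up.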

Frequent pattern mining-based log file partition for process mining

Process mining is a technique for exploring models based on event sequences, and it is growing in popularity in the process industry. Process mining algorithms assume that the processed log files contain events generated by only one unknown process, which can lead to extremely complex and inaccurate models when this assumption is not met. To address this issue, this article proposes a frequent pattern mining-based method for log file partitioning, allowing for the exploration of parallel processes. The key idea is that frequent pattern mining can identify grouped events and generate sub-logs of overlapping sub-processes. Thanks to this pre-processing of the log files, more compact and interpretable process models can be identified. We developed a set of goal-oriented metrics to evaluate the complexity of process mining problems and the resulting models. The applicability and effectiveness of the method are demonstrated in the analysis of process alarms of an industrial plant. The results confirm that the proposed method enables the discovery of targeted sub-process models by partitioning the log file using frequent pattern mining, and that the effectiveness of the method increases with the number of parallel processes stored in the same log file. We recommend applying the method whenever the logged events have no clear start and end, so that a single log file may describe several different processes.


Post date: 03 April 2023 
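
The partitioning idea can be sketched in a few lines of Python. The example below is a simplified illustration under stated assumptions, not the published algorithm: the log is assumed to be pre-cut into windows, frequent itemsets are counted by brute force, and each maximal frequent group of event types defines one sub-log. The function names, the `min_count` threshold and the sample log are all hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_count, max_size=3):
    """Brute-force counting of itemsets with 2..max_size items (small logs only)."""
    counts = Counter()
    for items in transactions:
        for k in range(2, max_size + 1):
            for combo in combinations(sorted(items), k):
                counts[frozenset(combo)] += 1
    return {s: c for s, c in counts.items() if c >= min_count}

def maximal(itemsets):
    """Drop itemsets that are strict subsets of another frequent itemset."""
    return [s for s in itemsets if not any(s < other for other in itemsets)]

def partition_log(log, groups):
    """Project the log onto each group of event types; sub-logs may overlap."""
    return [[event for event in log if event[1] in group] for group in groups]

# Hypothetical log of (window id, event type) pairs from two interleaved processes
log = [(1, "A"), (1, "B"), (1, "X"), (2, "X"), (2, "Y"), (3, "A"), (3, "B"),
       (4, "X"), (4, "Y"), (5, "A"), (5, "B"), (6, "X"), (6, "Y"), (6, "A")]

windows = {}
for window_id, event_type in log:
    windows.setdefault(window_id, set()).add(event_type)

groups = maximal(frequent_itemsets(list(windows.values()), min_count=3))
sub_logs = partition_log(log, groups)
for group, sub_log in zip(groups, sub_logs):
    print(sorted(group), len(sub_log))
```

Each sub-log could then be passed to a process discovery algorithm separately, which is where the more compact models would come from.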

Integrated Survival Analysis and Frequent Pattern Mining for Course Failure-Based Prediction of Student Dropout

We propose a data-driven method to identify frequent sets of course failures that students should avoid in order to minimize the likelihood of dropping out of their university training. The overall probability distribution of dropout is determined by survival analysis, but this result describes only the mean dropout rate of the undergraduates. Because the chance of dropout varies widely with the courses failed, the traditional survival model should be extended with event analysis. The study paths of students are represented as events that record the required subjects left uncompleted in each semester. Frequent patterns of backlogs are discovered by mining frequent sets of these events. The prediction of dropout is personalised by classifying the success of the transitions between semesters. Based on the explored frequent itemsets and classifiers, association rules are formed whose confidence metrics estimate the success of continuing the studies. The results can be used to identify critical study paths and courses. Furthermore, based on the patterns of individual uncompleted subjects, the method can predict the chance of continuation in every semester. The analysis of the critical study paths can be used to design personalised actions that minimize the risk of dropout, or to redesign the curriculum to reduce the dropout rate. The applicability of the method is demonstrated on the progress of chemical engineering students at the University of Pannonia in Hungary. The method is also suitable for more general problems in which combinations of events from one set may trigger a set of critical events.
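
To make the role of the confidence metric concrete, here is a minimal Python sketch that estimates the confidence of a rule of the form {uncompleted courses} -> continued. It assumes each record pairs the set of a student's uncompleted courses in a semester with a flag saying whether the student continued; the course names and records are invented for illustration and the classifier step of the method is not reproduced.

```python
def rule_confidence(records, antecedent):
    """Confidence of `antecedent -> continued` over (failed_courses, continued) records."""
    antecedent = set(antecedent)
    outcomes = [continued for failed, continued in records if antecedent <= failed]
    if not outcomes:
        return None  # the antecedent never occurs, so the rule cannot be evaluated
    return sum(outcomes) / len(outcomes)

# Hypothetical records: (set of uncompleted courses, continued to next semester?)
records = [
    ({"Math1"}, True),
    ({"Math1", "Phys1"}, False),
    ({"Math1", "Phys1"}, False),
    ({"Phys1"}, True),
    ({"Math1", "Phys1", "Chem1"}, False),
    (set(), True),
    ({"Math1"}, True),
    ({"Math1", "Phys1"}, True),
]

print(rule_confidence(records, {"Math1"}))
print(rule_confidence(records, {"Math1", "Phys1"}))
```

In this toy data the combination of two failed courses gives a lower continuation confidence than either course alone would suggest, which is the kind of critical pattern the abstract refers to.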

Directions of membrane separator development for microbial fuel cells: A retrospective analysis using frequent itemset mining and descriptive statistical approach

To increase the efficiency of microbial fuel cells (MFCs), the separator (which is mostly a membrane) placed between the electrodes or their compartments is considered highly important, alongside several other biotic and abiotic factors (e.g. configuration, mode of operation, types of inoculum and substrate). Nafion-based proton exchange membranes (PEMs) are the most widespread, although these materials are often criticized on various technological and economic grounds. Therefore, to find alternatives to Nafion, the synthesis, development and testing of novel/commercialized membrane separators with enhanced characteristics have been hot topics. In this study, the goals were to assess membrane-installed MFCs retrospectively and to reveal trends, applied practices, frequent setups, etc. via Bayesian classification and frequent itemset mining algorithms. A separate discussion then examines the current standing of research on the major membrane groups used in MFCs and evaluates, in light of the big picture, how the various systems compare with each other, especially with those applying Nafion PEM. It was concluded that some membrane types seem to be competitive with Nafion; however, standardizing the experiments would allow a less ambiguous comparison of studies.


Frequent pattern mining in multidimensional organizational networks

Network analysis can be applied to understand organizations based on patterns of communication, knowledge flows, trust, and the proximity of employees. A multidimensional organizational network was designed, and association rule mining of the edge labels was applied to reveal how relationships, motivations, and perceptions determine each other in different scopes of activities and types of organizations. Frequent itemset-based similarity analysis of the nodes provides the opportunity to characterize typical roles in organizations and clusters of co-workers. A survey was designed to define 15 layers of the organizational network and demonstrate the applicability of the method in three companies. The novelty of our approach resides in the evaluation of people in organizations as frequent multidimensional patterns of multilayer networks. The results illustrate that the overlapping edges of the proposed multilayer network can be used to highlight the motivation and managerial capabilities of the leaders and to find similarly perceived key persons.

https://www.nature.com/articles/s41598-019-39705-1

Towards Operator 4.0, Increasing Production Efficiency and Reducing Operator Workload by Process Mining of Alarm Data 

A methodology to extract temporal patterns of alarm sequences and operator actions from the log files of alarm management systems is proposed. First, the algorithm identifies time segments that are informative from the viewpoint of operator interventions. These segments include a series of alarms that initiate operator actions, the sets of operator actions, and a period that potentially covers the effects of the operators' corrective actions. In the second step of the methodology, the sets of operator actions that are frequently applied in the same situations are determined. For this purpose, the FP-Growth algorithm is used, which is one of the fastest tools of frequent itemset mining. It generates well-structured action trees that are suitable for visualizing the interventions and also lend themselves to building association rules that could be directly applied in decision support systems. Finally, multi-temporal sequence mining is applied to reveal which alarms led to the sets of operator actions and what the effects of these interventions were. The applicability of the methodology is illustrated by presenting results connected to the analysis of the delayed coker plant at the Danube Refinery of the MOL Group.
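
The first two steps can be sketched in Python as follows. This is a deliberately crude illustration, not the published algorithm: a new segment is assumed to start at the first alarm that follows any operator action, the effect period is not modelled, and plain counting of whole action sets stands in for FP-Growth. The event tuples and tag names are hypothetical.

```python
from collections import Counter

def split_segments(events):
    """Group (time, kind, tag) events into alarm series followed by operator actions."""
    segments = []
    current = None
    for _time, kind, tag in sorted(events):
        if kind == "alarm":
            if current is None or current["actions"]:
                # an alarm arriving after operator actions opens a new segment
                current = {"alarms": [], "actions": []}
                segments.append(current)
            current["alarms"].append(tag)
        elif kind == "action" and current is not None:
            current["actions"].append(tag)
    return segments

# Hypothetical alarm log
events = [
    (1, "alarm", "A1"), (2, "alarm", "A2"),
    (3, "action", "close_valve"), (4, "action", "reduce_feed"),
    (10, "alarm", "A1"),
    (11, "action", "reduce_feed"), (12, "action", "close_valve"),
    (20, "alarm", "A3"), (21, "action", "start_pump"),
]

segments = split_segments(events)
# Count how often each complete set of operator actions recurs across segments
action_sets = Counter(frozenset(seg["actions"]) for seg in segments)
for actions, count in action_sets.most_common():
    print(sorted(actions), count)
```

A real implementation would also mine frequent subsets of the action sets, which is the part FP-Growth handles efficiently.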

Sequence Mining based Alarm Suppression

Despite the rapid improvement of industrial process automation, the management of abnormal events still requires human actions. Alarm systems are becoming crucial in providing situation-specific information to the decreasing number of operators. The key role of an alarm management system is to ensure that only the currently significant alarms are annunciated. The design of alarm suppression rules requires the systematic analysis of the process and its control system. We give an overview of the recently developed data-driven techniques and show that the widely applied correlation-based methods utilize a static view of the system. To provide more insight into the process dynamics and represent the temporal relationships among faults, control actions, and process variables, we propose a multi-temporal sequence mining-based algorithm. The methodology starts with the generation of frequent temporal patterns of the alarm signals. We transform the multi-temporal sequences into Bayes classifiers. The obtained association rules can be used to define the alarm suppression rules. We analyze the data set of a laboratory-scale water treatment testbed to illustrate that multi-temporal sequences are applicable for the description of operation patterns. We extended the benchmark simulator of a vinyl acetate production technology to generate easily reproducible results and stimulate the development of alarm management algorithms. The results of detailed sensitivity analyses confirm the benefits of applying temporal alarm suppression rules, which reflect the dynamic behavior of the process.

For the extended simulator of the vinyl acetate production technology and the source code of the Bayes' theorem-based evaluation of sequences, see: HTTPS://GITHUB.COM/ABONYILAB/VACSIMULATOR

The MATLAB implementation of the sequence mining algorithm is available at: HTTPS://GITHUB.COM/ABONYILAB/MULTI-TEMPORAL-SEQUENCE-MINING
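
For readers who want a feel for how a temporal suppression rule might be evaluated, the Python sketch below estimates how often one alarm is followed by another within a time window. It is a simplification under stated assumptions and is unrelated to the MATLAB code linked above: alarms are treated as instantaneous events, only the "follows within a window" relation is considered, and the tag names, the window length and the idea of thresholding the result are hypothetical.

```python
from bisect import bisect_right

def follow_confidence(events, first, second, window):
    """Estimate P(`second` occurs within `window` after `first`) from (time, tag) events."""
    first_times = sorted(t for t, tag in events if tag == first)
    second_times = sorted(t for t, tag in events if tag == second)
    if not first_times:
        return None
    hits = 0
    for t in first_times:
        i = bisect_right(second_times, t)  # first `second` strictly after t
        if i < len(second_times) and second_times[i] - t <= window:
            hits += 1
    return hits / len(first_times)

# Hypothetical alarm log of (time, tag) pairs
events = [
    (0, "HI_TEMP"), (3, "HI_PRESS"),
    (20, "HI_TEMP"), (22, "HI_PRESS"),
    (40, "HI_TEMP"), (70, "HI_PRESS"),
    (90, "HI_TEMP"), (94, "HI_PRESS"),
]

print(follow_confidence(events, "HI_TEMP", "HI_PRESS", window=5))
print(follow_confidence(events, "HI_PRESS", "HI_TEMP", window=5))
```

The asymmetry between the two directions is the kind of temporal information that a static, correlation-based view of the alarms would not show.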

Multi-temporal sequential pattern mining based improvement of alarm management systems

Even in the case of a simple failure, modern process control systems can generate a vast number of alarms. Because they overload the operators, these alarm floods may result in tragic accidents. Alarm management systems can suppress correlated and predictable alarms to reduce the workload of the operators. Since the process units of complex production systems are strongly interconnected, the signals defined on different process variables generate complex multi-temporal patterns. We propose a multi-temporal sequence mining-based approach to extract these patterns and form alarm suppression rules. We demonstrate the applicability of the concept in a vinyl acetate production technology. The results illustrate that the multi-temporal analysis of events defined on process variables can detect the causes of alarms and prevent alarm floods by proactively suppressing alarms based on the extracted sequences of events.

Fuzzy association rule mining for feature and model structure selection

Effective methods for feature and model structure selection are very important for data-driven modeling and system identification tasks. A new method was developed for selecting important variables in nonlinear (dynamic) models with mixed discrete (categorical, fuzzy) and continuous inputs and outputs. The method applies fuzzy association rule mining, and the selection of important variables (the model structure) is based on two rule interestingness measures. The method is able to select the most relevant variables in nonlinear feature selection problems. Moreover, it selects the right model order of strongly nonlinear dynamical systems, so it can be a very efficient tool for process modeling.

F. P. Pach, A. Gyenesei and J. Abonyi, MOSSFARM: Model structure selection by fuzzy association rule mining, Journal of Intelligent and Fuzzy Systems, pp. 399-407 (2008)
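
The abstract does not name the two interestingness measures, so the following Python sketch only shows the commonly used fuzzy support and fuzzy confidence as an assumed example of how a fuzzy association rule can be scored. The triangular membership functions, the product used to combine memberships, the variable names `u` and `y`, and the data are all hypothetical choices made for illustration.

```python
def triangular(a, b, c):
    """Triangular membership function peaking at b and zero outside (a, c)."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)
    return mu

def fuzzy_support(records, items):
    """Mean over records of the product of memberships for (variable, mu) pairs."""
    total = 0.0
    for record in records:
        degree = 1.0
        for variable, mu in items:
            degree *= mu(record[variable])
        total += degree
    return total / len(records)

def fuzzy_confidence(records, antecedent, consequent):
    """Fuzzy support of antecedent and consequent together, divided by that of the antecedent."""
    supp_a = fuzzy_support(records, antecedent)
    if supp_a == 0:
        return None
    return fuzzy_support(records, antecedent + consequent) / supp_a

low = triangular(-1.0, 0.0, 1.0)   # hypothetical fuzzy set "low" on [0, 1]
high = triangular(0.0, 1.0, 2.0)   # hypothetical fuzzy set "high" on [0, 1]

# Hypothetical input u and output y, both scaled to [0, 1]
records = [{"u": 0.9, "y": 0.8}, {"u": 0.1, "y": 0.2},
           {"u": 0.5, "y": 0.5}, {"u": 1.0, "y": 1.0}]

print(fuzzy_confidence(records, [("u", high)], [("y", high)]))
print(fuzzy_confidence(records, [("u", high)], [("y", low)]))
```

A variable whose rules reach clearly higher scores than the alternatives would be a candidate for inclusion in the model structure, although the actual selection criterion of the published method may differ.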

Compact and accurate fuzzy classifiers can be constructed by fuzzy association rule mining

Interpretability and accuracy are critical issues in many classification applications. Associative classifier methods can achieve high accuracy, but their predictions are based on overly large sets of rules. In contrast, a new method was developed that produces fuzzy classifier systems that are both very compact and accurate. It therefore helps to understand the relationships in the data and the prediction mechanism in several types of classification problems.

Bit-Table Based Biclustering and Frequent Closed Itemset Mining in High-Dimensional Binary Data

During the last decade various algorithms have been developed and proposed for discovering overlapping clusters in high-dimensional data. The two most prominent application fields in this research, proposed independently, are frequent itemset mining (developed for market basket data) and biclustering (applied to gene expression data analysis). The common limitation of both methodologies is the limited applicability for very large binary data sets. In this paper we propose a novel and efficient method to find both frequent closed itemsets and biclusters in high-dimensional binary data. The method is based on simple but very powerful matrix and vector multiplication approaches that ensure that all patterns can be discovered in a fast manner. The proposed algorithm has been implemented in the commonly used MATLAB environment.

A. Király, A. Gyenesei, J. Abonyi, Bit-Table Based Biclustering and Frequent Closed Itemset Mining in High-Dimensional Binary Data, The Scientific World Journal, vol. 2014, Article ID 870406, 7 pages
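
The link between a bit-table, closed itemsets and biclusters can be illustrated with a small Python sketch. It is not a port of the MATLAB algorithm from the paper: it only shows, on a hypothetical 4 x 4 binary table, how elementwise products of columns give the supporting rows of an itemset and how dot products with that row mask give its closure. All function names are assumptions.

```python
def row_mask(bit_table, itemset):
    """Elementwise product of the selected columns: 1 where a row holds every item."""
    mask = [1] * len(bit_table)
    for j in itemset:
        mask = [m * row[j] for m, row in zip(mask, bit_table)]
    return mask

def support(bit_table, itemset):
    """Number of rows (transactions) that contain every item of the itemset."""
    return sum(row_mask(bit_table, itemset))

def bicluster(bit_table, itemset):
    """Return (rows, columns): the supporting rows and the closure of the itemset."""
    mask = row_mask(bit_table, itemset)
    n_rows = sum(mask)
    n_cols = len(bit_table[0])
    # a column is in the closure if its dot product with the mask equals the support
    cols = [j for j in range(n_cols)
            if sum(m * row[j] for m, row in zip(mask, bit_table)) == n_rows]
    rows = [i for i, m in enumerate(mask) if m]
    return rows, cols

# Hypothetical bit-table: rows are transactions, columns are items
bit_table = [
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]

print(support(bit_table, [0, 1]))
print(bicluster(bit_table, [0]))
```

An itemset is closed when its closure equals itself, so in this table {0} is not closed while {0, 1} is; the returned pair of row and column indices is the corresponding all-ones bicluster.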

Bittable_TID - a Bit-Table based biclustering tool

Bittable_TID is a biclustering tool written in MATLAB. It provides a fast solution for finding all biclusters within a binary data matrix.

Quick Download and running guide

You can download the MATLAB source code, the other software tools used for comparison and data sets from here:

After downloading and unpacking the program package, the program can be run by opening the file bittable_TID.m in MATLAB. The resulting closed itemsets are stored in the variable itemsCell. Each row represents a closed itemset: the first column contains the involved rows and the second column contains the involved columns.

Download Bittable_TID

Bit-table representation of market basket data.