Frequent sequence, itemset and association rule mining
Machine Learning-Supported Designing of Human–Machine Interfaces
The design and functionality of the human–machine interface (HMI) significantly affect operational efficiency and safety in process control. Alarm management techniques consider the cognitive model of operators, but mainly from a signal perception point of view. To develop a human-centric alarm management system, the construction of an easy-to-use and supportive HMI is essential. This work suggests a development method that uses machine learning (ML) tools. The key idea is that more supportive higher-level HMI displays can be developed by analysing operator-related events in the process log file. The obtained process model contains relevant data on the relationship of the process events, enabling a network-like visualisation. Attributes of the network allow us to solve the minimisation problem of the ideal workflow–display relation. The suggested approach allows a targeted process pattern exploration to design higher-level HMI displays with respect to content and hierarchy. The method was applied in a real-life hydrofluoric acid alkylation plant, where a proposal was made about the content of an overview display.
Post Date: 13 May 2024
Network-based visualisation of frequent sequences
Frequent sequence pattern mining is an excellent tool to discover patterns in event chains. In complex systems, events from parallel processes are present, often without proper labelling. To identify the groups of events related to a subprocess, frequent sequential pattern mining can be applied. Since most algorithms return so many frequent sequences that the results are difficult to interpret, it is necessary to post-process the resulting frequent patterns. The available visualisation techniques do not allow easy access to multiple properties that support a faster and better understanding of the event scenarios. To address this issue, our work proposes an intuitive and interactive solution to support this task, introducing three novel network-based sequence visualisation methods that can reduce the time of information processing from a cognitive perspective. The proposed visualisation methods offer a more information-rich and easily understandable interpretation of sequential pattern mining results compared to the usual text-like outcome of pattern mining algorithms. The first uses the confidence values of the transitions to create a weighted network, while the second enriches the adjacency matrix based on the confidence values with similarities of the transitive nodes. The enriched matrix enables a similarity-based Multidimensional Scaling (MDS) projection of the sequences. The third method uses a similarity measure based on the overlap of the occurrences of the supporting events of the sequences. The applicability of the method is presented in an industrial alarm management problem and in the analysis of clickstreams of a website. The method was fully implemented in a Python environment. The results show that the proposed methods are highly applicable for the interactive processing of frequent sequences, supporting the exploration of the inner mechanisms of complex systems.
Post Date: 13 May 2024
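The first visualisation method in the abstract above builds a weighted network from the confidence values of transitions. The following is a minimal Python sketch of that idea, not the published implementation. It assumes that the support of an ordered pair is the number of sequences in which the first event occurs somewhere before the second, and that confidence is this support divided by the support of the first event. The alarm tags and the `min_support` threshold are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def transition_confidences(sequences, min_support=2):
    """Return {(a, b): confidence} for frequent ordered pairs a -> b.

    support(a)    = number of sequences containing a
    support(a, b) = number of sequences where some a occurs before some b
    confidence    = support(a, b) / support(a)
    """
    item_support = defaultdict(int)
    pair_support = defaultdict(int)
    for seq in sequences:
        for item in set(seq):
            item_support[item] += 1
        # Collect each ordered pair at most once per sequence.
        pairs = {
            (seq[i], seq[j])
            for i, j in combinations(range(len(seq)), 2)
            if seq[i] != seq[j]
        }
        for pair in pairs:
            pair_support[pair] += 1
    return {
        (a, b): count / item_support[a]
        for (a, b), count in pair_support.items()
        if count >= min_support
    }


# Toy alarm sequences with hypothetical tags.
sequences = [
    ["A1", "A2", "A3"],
    ["A1", "A2", "A3"],
    ["A1", "A2"],
    ["A3", "A1"],
]
edges = transition_confidences(sequences)

# Weighted adjacency structure: adjacency[source][target] = confidence.
adjacency = defaultdict(dict)
for (a, b), conf in edges.items():
    adjacency[a][b] = conf
```

The resulting `adjacency` mapping can be passed to any graph drawing library as a weighted directed network. The pair (A3, A1) occurs in only one sequence, so it falls below the threshold and produces no edge.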
Time-dependent sequential association rule-based survival analysis: A healthcare application
The analysis of event sequences with temporal dependencies holds substantial importance across various domains, including healthcare. This study introduces a novel approach that combines sequential rule mining and survival analysis to uncover significant associations and temporal patterns within event sequences. By integrating these techniques, we address the limitations linked to the loss of temporal information. The methodology extends traditional sequential rule mining by introducing time-dependent confidence functions, providing a comprehensive understanding of relationships between antecedent and consequent events. The incorporation of the Kaplan-Meier estimator of survival analysis enables the calculation of temporal distributions between events, resulting in time-dependent confidence functions. These confidence functions describe the probability of specific event occurrences given their temporal context. We demonstrate the method within the healthcare domain. Analyzing ICD-10 codes and laboratory events, we successfully identified relevant sequential rules and their time-dependent confidence functions. This empirical validation underscores the potential of the methodology to uncover clinically significant associations within intricate medical data.
The study presents a unique methodology that integrates sequential rule mining and survival analysis.
The methodology extends traditional sequential rule mining by introducing time-dependent confidence functions.
The application of the method is demonstrated within the healthcare domain.
Post date: 05 January 2024
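To make the idea of a time-dependent confidence function more concrete, the sketch below estimates a Kaplan-Meier curve for the waiting time between an antecedent and a consequent event. It rests on an assumption that is not taken from the paper: the confidence at time t is read as 1 - S(t), the estimated probability that the consequent has occurred within t time units of the antecedent. The durations and censoring flags are made-up illustration data.

```python
def kaplan_meier(durations, observed):
    """Kaplan-Meier estimate as a list of (time, survival) steps.

    durations[i]: time from the antecedent to the consequent,
                  or to the end of follow-up if censored
    observed[i]:  True if the consequent occurred, False if censored
    """
    at_risk = len(durations)
    survival = 1.0
    steps = []
    for t in sorted(set(durations)):
        events = sum(1 for d, o in zip(durations, observed) if d == t and o)
        leaving = sum(1 for d in durations if d == t)
        if events:
            survival *= 1.0 - events / at_risk
            steps.append((t, survival))
        # Both events and censored cases leave the risk set after time t.
        at_risk -= leaving
    return steps


def confidence_at(steps, t):
    """Assumed time-dependent confidence: 1 - S(t)."""
    survival = 1.0
    for time, s in steps:
        if time > t:
            break
        survival = s
    return 1.0 - survival


# Hypothetical waiting times (days) for a rule "diagnosis -> lab event".
durations = [5, 10, 10, 20, 30]
observed = [True, True, False, True, False]

steps = kaplan_meier(durations, observed)
conf_15 = confidence_at(steps, 15)  # confidence within 15 days
```

With these toy data the survival estimate drops at days 5, 10 and 20, so the confidence grows stepwise with the allowed time window instead of being a single number as in classical rule mining.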
Frequent pattern mining-based log file partition for process mining
Process mining is a technique for exploring models based on event sequences, growing in popularity in the process industry. Process mining algorithms assume that the processed log files contain events generated by only one unknown process, which can lead to extremely complex and inaccurate models when this assumption is not met. To address this issue, this article proposes a frequent pattern mining-based method for log file partitioning, allowing for the exploration of parallel processes. The key idea is that frequent pattern mining can identify grouped events and generate sub-logs of overlapping sub-processes. Thanks to the pre-processing of the log files, more compact and interpretable process models can be identified. We developed a set of goal-oriented metrics to evaluate the complexity of process mining problems and the resulting models. The applicability and effectiveness of the method are demonstrated in the analysis of process alarms of an industrial plant. The results confirm that the proposed method enables the discovery of targeted sub-process models by partitioning the log file using frequent pattern mining, and the effectiveness of the method increases with the number of parallel processes stored in the same log file. We recommend applying the method in every case where there is no clear start and end of the logged events so that the log file can describe different processes.
Post date: 03 April 2023
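The partitioning idea in the abstract above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the published method: frequent itemsets are found by brute force (workable only for toy data), the maximal frequent itemsets are taken as event groups, and every trace is projected onto each group to form a sub-log. The event names, the `min_support` value and all function names are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(traces, min_support):
    """Brute-force frequent itemsets of size >= 2 (toy data only)."""
    items = sorted({event for trace in traces for event in trace})
    trace_sets = [set(trace) for trace in traces]
    frequent = []
    for size in range(2, len(items) + 1):
        for candidate in combinations(items, size):
            support = sum(1 for ts in trace_sets if ts.issuperset(candidate))
            if support >= min_support:
                frequent.append(frozenset(candidate))
    return frequent


def maximal(itemsets):
    """Keep itemsets that are not a proper subset of another frequent itemset."""
    return [s for s in itemsets if not any(s < other for other in itemsets)]


def partition_log(traces, groups):
    """Project every trace onto each event group, preserving event order."""
    sub_logs = {}
    for group in groups:
        projected = [[e for e in trace if e in group] for trace in traces]
        sub_logs[group] = [p for p in projected if p]  # drop empty traces
    return sub_logs


# Two interleaved toy processes: (a, b, c) and (x, y).
traces = [
    ["a", "x", "b", "y", "c"],
    ["a", "b", "c"],
    ["x", "a", "y", "b", "c"],
    ["x", "y"],
]
groups = maximal(frequent_itemsets(traces, min_support=3))
sub_logs = partition_log(traces, groups)
```

Each sub-log can then be handed to a process discovery algorithm separately. Because an event may belong to several groups, the sub-logs are allowed to overlap.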
Integrated Survival Analysis and Frequent Pattern Mining for Course Failure-Based Prediction of Student Dropout
A data-driven method is proposed to identify frequent sets of course failures that students should avoid in order to minimize the likelihood of dropping out of their university training. The overall probability distribution of the dropout is determined by survival analysis. This result can only describe the mean dropout rate of the undergraduates. However, the chances of dropout can vary widely depending on which courses are failed, so the traditional survival model should be extended with event analysis. The study paths of students are represented as events that record the required subjects left uncompleted in every semester. Frequent patterns of backlogs are discovered by mining frequent sets of these events. The prediction of dropout is personalised by classifying the success of the transitions between the semesters. Based on the explored frequent item sets and classifiers, association rules are formed that estimate the success of the continuation of the studies in the form of confidence metrics. The results can be used to identify critical study paths and courses. Furthermore, based on the patterns of individual uncompleted subjects, the method can predict the chance of continuation in every semester. The analysis of the critical study paths can be used to design personalised actions that minimize the risk of dropout, or to redesign the curriculum with the aim of reducing the dropout rate. The applicability of the method is demonstrated based on the analysis of the progress of chemical engineering students at the University of Pannonia in Hungary. The method is suitable for the examination of more general problems assuming the occurrence of a set of events whose combinations may trigger a set of critical events.
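As a small illustration of how a confidence metric can attach a dropout estimate to a set of failed courses, the sketch below computes the confidence of a rule of the form "failed courses -> dropout". It covers only the plain association rule step; the survival analysis and the classifiers of the method are not reproduced. The course names and student records are invented for the example.

```python
def rule_confidence(records, antecedent):
    """Confidence of the rule antecedent -> dropout.

    records:    list of (failed_courses, dropped_out) pairs
    antecedent: set of course failures forming the left side of the rule
    """
    antecedent = set(antecedent)
    outcomes = [
        dropped for failed, dropped in records if antecedent <= set(failed)
    ]
    if not outcomes:
        return None  # the rule has no support in the data
    return sum(outcomes) / len(outcomes)


# Hypothetical student records: failed courses and dropout flag.
records = [
    ({"Math1", "Physics1"}, True),
    ({"Math1", "Physics1"}, True),
    ({"Math1"}, False),
    ({"Math1", "Physics1"}, False),
    ({"Chem1"}, False),
    (set(), False),
]

conf_math = rule_confidence(records, {"Math1"})
conf_both = rule_confidence(records, {"Math1", "Physics1"})
```

In this toy data the confidence rises from one half to two thirds when the second failure is added to the antecedent, which is the kind of difference used to flag a critical combination of courses.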
Directions of membrane separator development for microbial fuel cells: A retrospective analysis using frequent itemset mining and descriptive statistical approach
To increase the efficiency of microbial fuel cells (MFCs), the separator (which is mostly a membrane) placed between the electrodes or their compartments is considered of high importance besides several other biotic and abiotic factors (e.g. configuration, mode of operation, types of inoculum and substrate). Nafion-based proton exchange membranes (PEMs) are the most widespread, although these materials are often criticized on various technological and economic grounds. Therefore, to find alternatives to Nafion, the synthesis, development and testing of novel/commercialized membrane separators with enhanced characteristics have been hot topics. In this study, the goals were to assess the membrane-installed MFCs in a retrospective manner and reveal the trends, the applied practices, frequent setups, etc. via Bayesian classification and frequent itemset mining algorithms. Thereafter, a separate discussion was devoted to examining the current standing of research related to major membrane groups used in MFCs and to evaluating, in accordance with the big picture, how the various systems behave in comparison with each other, especially compared to those applying Nafion PEM. It was concluded that some membrane types seem to be competitive with Nafion; however, the standardization of experiments would allow a less ambiguous comparison of studies.
Frequent pattern mining in multidimensional organizational networks
Network analysis can be applied to understand organizations based on patterns of communication, knowledge flows, trust, and the proximity of employees. A multidimensional organizational network was designed, and association rule mining of the edge labels was applied to reveal how relationships, motivations, and perceptions determine each other in different scopes of activities and types of organizations. Frequent itemset-based similarity analysis of the nodes provides the opportunity to characterize typical roles in organizations and clusters of co-workers. A survey was designed to define 15 layers of the organizational network and demonstrate the applicability of the method in three companies. The novelty of our approach resides in the evaluation of people in organizations as frequent multidimensional patterns of multilayer networks. The results illustrate that the overlapping edges of the proposed multilayer network can be used to highlight the motivation and managerial capabilities of the leaders and to find similarly perceived key persons.
Towards Operator 4.0, Increasing Production Efficiency and Reducing Operator Workload by Process Mining of Alarm Data
A methodology to extract temporal patterns of alarm sequences and operator actions from the log files of alarm management systems is proposed. First, the algorithm identifies time segments that are informative from the viewpoint of operator interventions. These segments include series of alarms that initiate operator actions, sets of operator actions, and a period that potentially covers the effects of the corrective actions of the operators. In the second step of the methodology, the sets of operator actions that are frequently applied in the same situations are determined. For this purpose, the FP-Growth algorithm, one of the fastest tools of frequent itemset mining, is utilized. It generates well-structured action trees that are not only suitable for the visualization of interventions but also lend themselves to building association rules that could be directly applied in decision support systems. Finally, multi-temporal sequence mining is applied to reveal what alarms led to the sets of operator actions and what the effects of these interventions were. The applicability of the methodology is illustrated by presenting results connected to the analysis of the delayed coker plant at the Danube Refinery of the MOL Group.
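The action trees mentioned above come from the prefix tree (FP-tree) that FP-Growth builds before mining. The sketch below shows only this tree-building step, under the usual convention of ordering the frequent items of each transaction by descending global frequency; the recursive mining step of FP-Growth is omitted. The operator actions, the tie-breaking rule and the `min_support` value are hypothetical.

```python
from collections import Counter


class Node:
    """One node of the prefix tree: an action and a visit count."""

    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, min_support):
    """Insert each transaction into a prefix tree of frequent items."""
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item for item, c in counts.items() if c >= min_support}
    root = Node(None)
    for t in transactions:
        # Descending global frequency, ties broken alphabetically.
        ordered = sorted(
            (item for item in set(t) if item in frequent),
            key=lambda item: (-counts[item], item),
        )
        node = root
        for item in ordered:
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root


def paths(node, prefix=()):
    """Yield (path, count) for every node below the given node."""
    for child in node.children.values():
        path = prefix + (child.item,)
        yield path, child.count
        yield from paths(child, path)


# Hypothetical sets of operator actions taken in similar alarm situations.
transactions = [
    ["open_valve", "reduce_feed"],
    ["open_valve", "reduce_feed", "start_pump"],
    ["open_valve", "start_pump"],
    ["open_valve", "reduce_feed"],
    ["ack_only"],
]
tree = dict(paths(build_fp_tree(transactions, min_support=2)))
```

In this toy tree the branch open_valve, reduce_feed carries a count of 3 under a parent count of 4, which here equals a confidence of 0.75 for the rule open_valve -> reduce_feed. In general the supports of an itemset are spread over several branches and have to be aggregated, which is the job of the full mining step.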
Sequence Mining based Alarm Suppression
Despite the fast-paced improvement of industrial process automation, the management of abnormal events still requires human actions. Alarm systems are becoming crucial in providing situation-specific information to the decreasing number of operators. The key role of an alarm management system is to ensure that only the currently significant alarms are annunciated. The design of alarm suppression rules requires the systematic analysis of the process and its control system. We give an overview of the recently developed data-driven techniques and show that the widely applied correlation-based methods utilize a static view of the system. To provide more insight into the process dynamics and represent the temporal relationships among faults, control actions, and process variables, we propose a multi-temporal sequence mining-based algorithm. The methodology starts with the generation of frequent temporal patterns of the alarm signals. We transform the multi-temporal sequences into Bayes classifiers. The obtained association rules can be used to define the alarm suppression rules. We analyze the data set of a laboratory-scale water treatment testbed to illustrate that multi-temporal sequences are applicable for the description of operation patterns. We extended the benchmark simulator of a vinyl acetate production technology to generate easily reproducible results and stimulate the development of alarm management algorithms. The results of detailed sensitivity analyses confirm the benefits of applying temporal alarm suppression rules, which reflect the dynamic behavior of the process.
For the extended simulator of the vinyl acetate production technology and the source codes of the Bayes’ theorem-based evaluation of sequences see: HTTPS://GITHUB.COM/ABONYILAB/VACSIMULATOR
The MATLAB implementation of the sequence mining algorithm is available at: HTTPS://GITHUB.COM/ABONYILAB/MULTI-TEMPORAL-SEQUENCE-MINING
Multi-temporal sequential pattern mining based improvement of alarm management systems
Even in the case of a simple failure, modern process control systems can generate a vast number of alarms. Because they overload the operators, these alarm floods may result in tragic accidents. Alarm management systems can suppress correlated and predictable alarms to reduce the workload of the operators. Since the process units of complex production systems are strongly interconnected, the signals defined on different process variables generate complex multi-temporal patterns. We propose a multi-temporal sequence mining-based approach to extract these patterns and form alarm suppression rules. We demonstrate the applicability of the concept in a vinyl acetate production technology. The results illustrate that the multi-temporal analysis of events defined on process variables can detect the causes of alarms and prevent alarm floods by proactively suppressing alarms based on the extracted sequences of events.
Fuzzy association rule mining for feature and model structure selection
Effective methods for feature and model structure selection are very important for data-driven modeling and system identification tasks. A new method for selecting important variables in nonlinear (dynamic) models with mixed discrete (categorical, fuzzy) and continuous inputs and outputs was developed. The method applies fuzzy association rule mining, and the selection of important variables (model structure) is based on two rule interestingness measures. The method is able to select the most relevant variables in nonlinear feature selection problems. Moreover, it selects the right model order of strongly nonlinear dynamical systems; therefore, it can be a very efficient tool for process modeling.
Compact and accurate fuzzy classifiers can be constructed by fuzzy association rule mining
Interpretability and accuracy are critical issues in many classification applications. Associative classifier methods can achieve high accuracy, but their predictions are based on excessively large rule sets. In contrast, a new method was developed that produces fuzzy classifier systems that are both very compact and accurate. It therefore helps to understand the relationships in the data and the prediction mechanism in several types of classification problems.
Bit-Table Based Biclustering and Frequent Closed Itemset Mining in High-Dimensional Binary Data
During the last decade, various algorithms have been developed and proposed for discovering overlapping clusters in high-dimensional data. The two most prominent application fields in this research, proposed independently, are frequent itemset mining (developed for market basket data) and biclustering (applied to gene expression data analysis). The common limitation of both methodologies is their limited applicability to very large binary data sets. In this paper we propose a novel and efficient method to find both frequent closed itemsets and biclusters in high-dimensional binary data. The method is based on simple but very powerful matrix and vector multiplication approaches that ensure that all patterns can be discovered quickly. The proposed algorithm has been implemented in the commonly used MATLAB environment.
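The link between binary matrices, closed itemsets and biclusters can be illustrated with a few vector products. The sketch below is written in plain Python rather than MATLAB and is not the Bittable_TID algorithm. It assumes a binary matrix with rows as transactions and columns as items, computes pairwise supports as the product of the transposed matrix with the matrix, and obtains the closure of an itemset by comparing dot products with its support. All function names and the toy matrix are hypothetical.

```python
def column(matrix, j):
    """Return column j of a row-major binary matrix."""
    return [row[j] for row in matrix]


def dot(u, v):
    """Dot product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def tid_vector(matrix, itemset):
    """Binary vector with 1 for every row that contains all items."""
    return [int(all(row[j] for j in itemset)) for row in matrix]


def closure(matrix, itemset):
    """Items shared by every row that supports the itemset."""
    tids = tid_vector(matrix, itemset)
    support = sum(tids)
    n_items = len(matrix[0])
    return [j for j in range(n_items) if dot(tids, column(matrix, j)) == support]


def pair_supports(matrix):
    """B^T B: entry (i, j) counts the rows containing both item i and item j."""
    n_items = len(matrix[0])
    cols = [column(matrix, j) for j in range(n_items)]
    return [[dot(cols[i], cols[j]) for j in range(n_items)] for i in range(n_items)]


# Toy binary data: 4 rows (transactions) and 4 columns (items).
B = [
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
]

supports = pair_supports(B)
closed = closure(B, [0])       # closure of the itemset {0}
rows = tid_vector(B, closed)   # rows that support the closed itemset
```

An itemset is closed when it equals its own closure. Here item 0 always appears together with item 1, so the closure of {0} is {0, 1}, and the rows marked in `rows` together with these two columns form an all-ones submatrix, that is, a bicluster.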
Bittable_TID - a Bit-Table based biclustering tool
Bittable_TID is a biclustering tool written in MATLAB. It provides a fast solution for finding all biclusters within a binary data matrix.
Quick Download and running guide
You can download the MATLAB source code, the other software tools used for comparison and data sets from here:
Bittable_TID program package: Bittable_TID
BiMAX software package: BiMAX
Supplementary data for the related publication (input data sets): Bittable_TID datasets
After downloading and unpacking the program package, the program can be run by opening the file bittable_TID.m in MATLAB. The resulting closed itemsets are stored in the variable itemsCell. Each row represents a closed itemset: the first column contains the involved rows and the second column contains the involved columns.
Download Bittable_TID