# Time series mining

## Mixed dissimilarity measure for piecewise linear approximation based time series applications

In recent years, expert systems built around time series-based methods have been enthusiastically adopted in engineering applications, thanks to their ease of use and effectiveness. This effectiveness depends on how precisely the raw data can be approximated and how precisely these approximations can be compared. When the performance of a time series-based system needs to be improved, it is desirable to consider other time series representations and comparison methods. The approximation, however, is often generated by a non-replaceable element, so ultimately the only way to obtain a more advanced comparison method is either to create a new dissimilarity measure or to improve an existing one.

In this paper, it is shown how a mixture of different comparison methods can be utilized to improve the effectiveness of a system without modifying the time series representation itself. For this purpose, a novel, mixed comparison method is presented for the widely used piecewise linear approximation (PLA), called mixed dissimilarity measure for PLA (MDPLA). It combines one of the most popular dissimilarity measures, which utilizes the means of PLA segments, with the authors’ previously presented approach, which replaces the mean of a segment with its slope.

On the basis of empirical studies, three advantages of such combined dissimilarity measures are presented. First, it is shown that the mixture ensures that MDPLA outperforms the most popular dissimilarity measures created for PLA segments. Moreover, in many cases, MDPLA provides results that make the application of dynamic time warping (DTW) unnecessary, yielding improvement not only in accuracy but also in speed. Finally, it is demonstrated that a mixed measure, such as MDPLA, shortens the warping path of DTW and thus helps to avoid pathological warpings, i.e. the unwanted alignments of DTW. This way, DTW can be applied without penalizing or constraining the warping path itself, while the chance of unwanted alignments is significantly lowered.
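The idea of mixing a mean-based and a slope-based comparison of PLA segments can be illustrated with a small Python sketch. This is not the MDPLA formula from the paper, which the abstract does not give. The equal-length segmentation, the length-weighted distances, the linear mixing weight `alpha`, and the function names are all assumptions made for illustration.

```python
import numpy as np

def pla_features(x, n_segments):
    """Split x into (nearly) equal-length segments and return, per segment,
    the mean, the least-squares slope, and the segment length.
    Equal-length segmentation is an assumption of this sketch."""
    segments = np.array_split(np.asarray(x, dtype=float), n_segments)
    means = np.array([seg.mean() for seg in segments])
    slopes = np.array([np.polyfit(np.arange(len(seg)), seg, 1)[0] for seg in segments])
    lengths = np.array([len(seg) for seg in segments])
    return means, slopes, lengths

def mixed_pla_dissimilarity(x, y, n_segments=8, alpha=0.5):
    """Hypothetical mixed measure: alpha weights the mean-based distance,
    (1 - alpha) weights the slope-based distance."""
    if len(x) != len(y):
        raise ValueError("this sketch requires series of equal length")
    mx, sx, lx = pla_features(x, n_segments)
    my, sy, _ = pla_features(y, n_segments)
    d_mean = np.sqrt(np.sum(lx * (mx - my) ** 2))    # compares segment levels
    d_slope = np.sqrt(np.sum(lx * (sx - sy) ** 2))   # compares segment trends
    return alpha * d_mean + (1.0 - alpha) * d_slope

t = np.linspace(0, 2 * np.pi, 64)
x = np.sin(t)
print(mixed_pla_dissimilarity(x, x + 2.0, alpha=1.0))  # mean part sees the offset
print(mixed_pla_dissimilarity(x, x + 2.0, alpha=0.0))  # slope part ignores it
```

A vertical offset changes only the mean-based term, while a change in trend changes the slope-based term, which is why a mixture can separate cases that either measure alone would confuse.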

## Energy monitoring of process systems: time-series segmentation-based targeting models

Energy monitoring systems calculate actual energy use, estimate energy needs at normal operation, track energy metrics, and highlight issues related to the energy efficiency of process plants. Analysis of key energy indicators (KEIs) allows the comparison of process efficiency at different operating regimes. Based on the extracted knowledge, realistic targets for KEIs can be determined. The performance of data-driven targeting models depends on how effectively the operating regimes are characterized. Until now, this modeling task has been performed manually, based on heuristic and subjective evaluation of the operation. A goal-oriented time-series segmentation technique has been developed to automate the selection of proper data used for the identification of targeting models. With the proposed novel segmentation algorithm, targeting models for different operating regions can be determined automatically. The concept of the resulting energy monitoring system is demonstrated at the Heavy Naphtha Hydrotreater and CCR Reforming Units of MOL Hungarian Oil and Gas Company.

## Fisher information matrix based time-series segmentation of process data

Advanced chemical process engineering tools, like model predictive control or soft sensor solutions, require proper process models. Parameter identification of these models needs input–output data with high information content. When model-based optimal experimental design techniques cannot be applied, the extraction of informative segments from historical data can also support system identification. We developed a goal-oriented Fisher information based time-series segmentation algorithm, aimed at selecting informative segments from historical process data. The utilized standard bottom-up algorithm is widely used in off-line analysis of process data. Different segments can support the identification of different parameter sets. Hence, instead of using either D- or E-optimality as the criterion for comparing the information content of two input sequences (neighbouring segments), we propose the use of Krzanowski's similarity coefficient between the eigenvectors of the Fisher information matrices obtained from the sequences. The efficiency of the proposed methodology is demonstrated by two application examples. The algorithm is capable of extracting segments with parameter-set specific information content from historical process data.

The Matlab® implementation of the proposed method can be downloaded from here: Fisher_MATLAB_sources
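Independently of the Matlab sources, the eigenvector comparison described above can be sketched in Python. The sketch assumes a linear-in-parameters model with unit noise variance, so that the Fisher information matrix of a segment reduces to `X.T @ X`. The function names and the choice of `k` leading eigenvectors are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def fisher_information(regressors, noise_var=1.0):
    """Fisher information of the assumed model y = regressors @ theta + noise."""
    X = np.asarray(regressors, dtype=float)
    return X.T @ X / noise_var

def leading_eigenvectors(sym_matrix, k):
    """Eigenvectors belonging to the k largest eigenvalues of a symmetric matrix."""
    _, eigvecs = np.linalg.eigh(sym_matrix)   # eigh returns ascending eigenvalues
    return eigvecs[:, ::-1][:, :k]

def krzanowski_similarity(F1, F2, k):
    """Krzanowski's coefficient: 1 if the two k-dimensional subspaces coincide,
    0 if they are orthogonal."""
    U1 = leading_eigenvectors(F1, k)
    U2 = leading_eigenvectors(F2, k)
    return np.trace(U1.T @ U2 @ U2.T @ U1) / k

rng = np.random.default_rng(0)
# Two hypothetical neighbouring segments that excite different parameters.
seg_a = rng.normal(size=(50, 3)) * np.array([10.0, 1.0, 0.1])
seg_b = rng.normal(size=(50, 3)) * np.array([0.1, 1.0, 10.0])
Fa, Fb = fisher_information(seg_a), fisher_information(seg_b)
print(krzanowski_similarity(Fa, Fa, k=1))  # same information direction
print(krzanowski_similarity(Fa, Fb, k=1))  # nearly orthogonal directions
```

In a bottom-up segmentation, a low similarity between neighbouring segments would argue against merging them, since they carry information about different parameter sets.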

## On-line detection of homogeneous operation ranges by dynamic principal component analysis based time-series segmentation

Development of chemical process technologies should be based on the analysis of process data. In the field of process monitoring, recursive Principal Component Analysis (PCA) is widely applied to detect any misbehavior of the technology. The investigation of transient states needs dynamic PCA to describe the dynamic behavior more accurately. By integrating recursive and dynamic PCA into time series segmentation techniques, efficient multivariate segmentation methods were obtained for detecting homogeneous operation ranges based on process data. The similarity of time series segments is evaluated based on the Krzanowski similarity factor, which compares the hyperplanes determined by the PCA models. With the help of the developed time series segmentation framework, the separation of operation regimes becomes possible, supporting process monitoring and control. The performance of the proposed methodology is demonstrated on a linear process and on the commonly applied Tennessee Eastman process.
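Dynamic PCA is commonly formulated as ordinary PCA applied to a data matrix extended with time-lagged copies of the variables. The following Python sketch shows that construction. The function names and the number of lags are assumptions for illustration, and the recursive update used in the paper is not reproduced here.

```python
import numpy as np

def lagged_matrix(data, n_lags):
    """Stack each sample with its predecessors: [x(t), x(t-1), ..., x(t-n_lags)].
    Input shape (n_samples, n_vars); output shape
    (n_samples - n_lags, n_vars * (n_lags + 1))."""
    X = np.asarray(data, dtype=float)
    n_samples = X.shape[0]
    blocks = [X[n_lags - lag : n_samples - lag] for lag in range(n_lags + 1)]
    return np.hstack(blocks)

def pca_loadings(data, n_components):
    """PCA via SVD of the mean-centered data; returns the loadings and
    the variances of all components in descending order."""
    X = np.asarray(data, dtype=float)
    Xc = X - X.mean(axis=0)
    _, singular_values, Vt = np.linalg.svd(Xc, full_matrices=False)
    variances = singular_values ** 2 / (X.shape[0] - 1)
    return Vt[:n_components].T, variances

rng = np.random.default_rng(1)
process_data = rng.normal(size=(200, 3))      # placeholder for real process data
extended = lagged_matrix(process_data, n_lags=2)
loadings, variances = pca_loadings(extended, n_components=4)
print(extended.shape, loadings.shape)
```

The loadings obtained for two segments of the extended matrix span the hyperplanes that a similarity factor such as Krzanowski's would then compare.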

## Correlation based dynamic time warping of multivariate time series

In recent years, dynamic time warping (DTW) has been becoming the most widely used technique for comparing time series data where extensive a priori knowledge is not available. However, a multivariate comparison method is often expected to consider the correlation between the variables, as this correlation carries the real information in many cases. Thus, principal component analysis (PCA) based similarity measures, such as the PCA similarity factor (SPCA), are used in many industrial applications.

In this paper, we present a novel algorithm called correlation based dynamic time warping (CBDTW), which combines DTW and PCA based similarity measures. To preserve correlation, multivariate time series are segmented and the local dissimilarity function of DTW is derived from SPCA. The segments are obtained by bottom-up segmentation using special, PCA related costs. Our novel technique is evaluated on two databases, the database of the Signature Verification Competition 2004 and the commonly used AUSLAN dataset. We show that CBDTW outperforms the standard SPCA and the most commonly used, Euclidean distance based multivariate DTW in the case of datasets with complex correlation structure.

The Matlab® implementation of the proposed method can be downloaded from here: CbDTW_MATLAB_sources
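Independently of the Matlab sources, the combination of DTW with a PCA based local dissimilarity can be sketched in Python. The sketch makes several simplifying assumptions that are not from the paper: fixed-length segments replace the bottom-up segmentation, the local cost is taken as `1 - SPCA`, and all function names are hypothetical.

```python
import numpy as np

def fixed_length_segments(series, length):
    """Assumption of this sketch: cut the series into non-overlapping windows."""
    return [series[i:i + length] for i in range(0, len(series) - length + 1, length)]

def pca_basis(segment, k):
    """First k principal directions of a mean-centered segment."""
    Xc = segment - segment.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k].T

def spca(U1, U2):
    """PCA similarity factor between two k-dimensional PCA subspaces."""
    return np.trace(U1.T @ U2 @ U2.T @ U1) / U1.shape[1]

def segment_dtw(segs_a, segs_b, k=1):
    """Classic DTW recursion over segment sequences with local cost 1 - SPCA."""
    bases_a = [pca_basis(s, k) for s in segs_a]
    bases_b = [pca_basis(s, k) for s in segs_b]
    n, m = len(bases_a), len(bases_b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 1.0 - spca(bases_a[i - 1], bases_b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
noise = 0.01 * rng.normal(size=100)
a = np.column_stack([x1, x1 + noise])    # positively correlated variables
b = np.column_stack([x1, -x1 + noise])   # negatively correlated variables
segs_a, segs_b = fixed_length_segments(a, 25), fixed_length_segments(b, 25)
print(segment_dtw(segs_a, segs_a), segment_dtw(segs_a, segs_b))
```

Because the local cost depends only on the principal directions of each segment, two series sharing the same first variable are still judged dissimilar when the correlation between their variables differs.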

## Dynamic Principal Component Analysis in Multivariate Time-Series Segmentation

Principal Component Analysis (PCA) based time-series analysis methods have become basic tools of every process engineer in the past few years, thanks to their efficiency and solid statistical basis. However, these methods have two drawbacks that have to be taken into account. First, linear relationships are assumed between the process variables, and second, process dynamics are not considered. The authors presented a PCA based multivariate time-series segmentation method which addressed the first problem. The nonlinear processes were split into locally linear segments by using T² and Q statistics as cost functions. Based on this solution, we demonstrate how the homogeneous operation ranges and changes in process dynamics can also be detected in dynamic processes. Our approach is examined in detail on simple, theoretical processes and on the well-known pH process.
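The T² and Q statistics mentioned above have standard definitions for a PCA model, which the following Python sketch computes. How the paper turns them into segmentation costs is not stated in the abstract, so only the statistics themselves are shown. The function names are assumptions.

```python
import numpy as np

def fit_pca(X, k):
    """Fit a k-component PCA model; returns mean, loadings and component variances."""
    mean = X.mean(axis=0)
    Xc = X - mean
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:k].T
    eigvals = s[:k] ** 2 / (X.shape[0] - 1)
    return mean, P, eigvals

def t2_and_q(X, mean, P, eigvals):
    """Hotelling's T2 (variation inside the PCA subspace, scaled by the component
    variances) and Q (squared reconstruction error outside the subspace)."""
    Xc = X - mean
    scores = Xc @ P
    t2 = np.sum(scores ** 2 / eigvals, axis=1)
    residual = Xc - scores @ P.T
    q = np.sum(residual ** 2, axis=1)
    return t2, q

# Toy data lying exactly on the line x2 = 2 * x1, modeled with one component.
x1 = np.linspace(-1.0, 1.0, 21)
X = np.column_stack([x1, 2.0 * x1])
mean, P, eigvals = fit_pca(X, k=1)

# A sample displaced perpendicular to that line leaves the model plane,
# so it shows up in Q rather than in T2.
off_model = mean + np.array([[2.0, -1.0]])
print(t2_and_q(off_model, mean, P, eigvals))
```

A segment that a single linear PCA model describes well keeps Q small for all of its samples, which is what makes these statistics usable as homogeneity costs.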

## Quantitative emotional response analysis

Nowadays, employers, television companies and others are increasingly interested in the quantitative evaluation of emotional reactions induced by working conditions, organizational change, training or leisure time. A good instrument for following these reactions is the multiparametric monitoring of some essential physiological parameters. The results of medical multiparametric observations can be handled as multivariate time series. This well-known and widely examined representation is suitable for revealing new information, e.g. by data mining methods. The quality of the information obtained from these methods mainly depends on the applied similarity measure. Traditional similarity methods are based on the direct one-to-one comparison of individual variables, hence these approaches may ignore hidden complex interactions within simultaneous biological processes. In this paper, Principal Component Analysis (PCA) is used to analyze the emotional response of the participants while they were watching different annotated television programs. Simultaneously measured physiological signals were monitored, including bioelectrical and biomechanical signals, respiration, body movement and body temperature channels. It has been shown by clustering that this approach, in which the multidimensionality of the signals is considered and the medical parameters are treated as a whole, is able to detect signs of emotional response and can contribute to the objective analysis of human reactions.

## Fuzzy clustering based time-series segmentation

Partitioning a time-series into internally homogeneous segments is an important data mining problem. The proposed method can effectively solve this problem. The changes of the variables of a multivariate time-series are usually vague and do not focus on any particular time point. Therefore, it is not practical to define crisp bounds of the segments. Although fuzzy clustering algorithms are widely used to group overlapping and vague objects, they cannot be directly applied to time-series segmentation, because the clusters need to be contiguous in time. This paper proposes a clustering algorithm for the simultaneous identification of local Probabilistic Principal Component Analysis (PPCA) models used to measure the homogeneity of the segments and fuzzy sets used to represent the segments in time. The algorithm favors clusters that are contiguous in time and is able to detect changes in the hidden structure of multivariate time-series. A fuzzy decision making algorithm based on a compatibility criterion of the clusters has been worked out to determine the required number of segments, while the required number of principal components is determined by the scree plots of the eigenvalues of the fuzzy covariance matrices. The application example shows that this new technique is a useful tool for the analysis of historical process data.

The Matlab® implementation of the proposed method can be downloaded from here: ppcatss_MATLAB_sources.

The help page of PPCA-TSS software is available here: Html help for PPCA-TSS in Matlab

## Monitoring process transitions by Kalman filtering and time-series segmentation

The analysis of historical process data of technological systems plays an important role in process monitoring, modelling and control. Time-series segmentation algorithms are often used to detect homogeneous periods of operation based on input–output process data. However, historical process data alone may not be sufficient for the monitoring of complex processes. This paper incorporates the first-principle model of the process into the segmentation algorithm. The key idea is to use a model-based non-linear state-estimation algorithm to detect the changes in the correlation among the state variables. The homogeneity of the time-series segments is measured using a PCA similarity factor calculated from the covariance matrices given by the state-estimation algorithm. The whole approach is applied to the monitoring of an industrial high-density polyethylene plant.