Data cleaning and knowledge discovery in process data

Xu, Ph. D., Shu

This dissertation presents several methods for overcoming Big Data challenges, with an emphasis on data cleaning and knowledge discovery in process data. Data cleaning and knowledge discovery are chosen as the main research area because of their importance from both theoretical and practical points of view.

The theoretical background and recent developments of data cleaning methods are reviewed from four aspects: missing data imputation, outlier detection, noise removal and time delay estimation. Moreover, the impact of contaminated data on model performance and the corresponding improvement obtained by data cleaning methods are analyzed through both simulated and industrial case studies. The results provide a starting point for further advanced methodology development.

It is hard to find a universally applicable method for data cleaning, since every data set may have its own distinctive features. Available methods therefore have to be customized to the data set at hand to ensure its quality. An integrated data cleaning scheme is proposed, which incorporates model building and performance evaluation, to provide guidance in tuning the parameters of data cleaning methods and to prevent over-cleaning. A case study based on industrial data was used to verify the feasibility and effectiveness of the proposed scheme, during which a partial least squares (PLS) model was built and three univariate data cleaning procedures were tested.
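The idea of letting model performance guide the tuning of a cleaning parameter can be illustrated with a minimal sketch. It is not the scheme from the dissertation: it assumes a Hampel filter as the univariate cleaning step, an ordinary least-squares line in place of the PLS model, and simulated data. All names (`hampel_clean`, `fit_line`, `evaluate`, the candidate thresholds) are hypothetical. An infinite threshold stands for "no cleaning", so the comparison also shows whether cleaning helps at all.

```python
import math
import random
import statistics


def hampel_clean(x, window=5, n_sigmas=3.0):
    """Replace points farther than n_sigmas scaled MADs from the local median."""
    cleaned = list(x)
    for i in range(len(x)):
        lo, hi = max(0, i - window), min(len(x), i + window + 1)
        neighbourhood = x[lo:hi]
        med = statistics.median(neighbourhood)
        # 1.4826 makes the MAD consistent with the standard deviation for Gaussian data.
        mad = 1.4826 * statistics.median(abs(v - med) for v in neighbourhood)
        if abs(x[i] - med) > n_sigmas * mad:
            cleaned[i] = med
    return cleaned


def fit_line(x, y):
    """Ordinary least-squares slope and intercept (stand-in for the PLS model)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


# Simulated data (assumption): a slow oscillation with spikes added to the input.
rng = random.Random(1)
n = 400
x_true = [math.sin(2 * math.pi * t / 50) + rng.gauss(0, 0.05) for t in range(n)]
y = [2.0 * v + 1.0 + rng.gauss(0, 0.05) for v in x_true]
x_meas = list(x_true)
for t in rng.sample(range(n), 20):
    x_meas[t] += rng.choice([-1, 1]) * 5.0


def evaluate(threshold, split=300):
    """Clean the input, fit on the first part, return RMSE on the held-out part."""
    x_clean = hampel_clean(x_meas, window=5, n_sigmas=threshold)
    slope, intercept = fit_line(x_clean[:split], y[:split])
    errors = [y[t] - (slope * x_clean[t] + intercept) for t in range(split, n)]
    return math.sqrt(statistics.fmean(e * e for e in errors))


# float("inf") disables cleaning; very small thresholds risk over-cleaning.
candidates = [float("inf"), 5.0, 3.0, 2.0, 1.0, 0.5]
scores = {thr: evaluate(thr) for thr in candidates}
best_threshold = min(scores, key=scores.get)
```

Inspecting `scores` shows how prediction error changes with the aggressiveness of the filter. A threshold that is too small replaces legitimate process variation with local medians, which is the over-cleaning effect the evaluation step is meant to expose.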

A time series Kalman filter (TSKF) is proposed that successfully handles outlier detection in dynamic systems, where normal process changes often mask the existence of outliers. The TSKF method combines a time series model fitting procedure with a modified Kalman filter to deal with additive outlier (AO) and innovational outlier (IO) detection problems in dynamic process data sets. A comparative analysis of TSKF and available methods is performed on simulated and real chemical plant data.
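To make the combination of model fitting and filtering concrete, the following is a minimal sketch of a generic innovation-based outlier test. It is not the TSKF algorithm itself, whose details are not given here. It assumes an AR(1) process observed with noise, noise variances `q` and `r` that are known in advance, and simulated data with two additive outliers. The function names and the threshold of 4 standard deviations are hypothetical choices.

```python
import random


def fit_ar1(y):
    """Least-squares estimate of phi in y[t] = phi * y[t-1] + e[t]."""
    num = sum(a * b for a, b in zip(y[1:], y[:-1]))
    den = sum(b * b for b in y[:-1])
    return num / den


def kalman_outlier_scan(y, phi, q, r, threshold=4.0):
    """Scalar Kalman filter that flags large normalized innovations.

    State model:       x[t] = phi * x[t-1] + w[t],  var(w) = q
    Observation model: y[t] = x[t] + v[t],          var(v) = r
    A flagged observation is not used in the measurement update.
    """
    x_hat, p = y[0], r
    flagged, filtered = [], [y[0]]
    for t in range(1, len(y)):
        # Prediction step from the fitted time series model.
        x_pred = phi * x_hat
        p_pred = phi * phi * p + q
        innovation = y[t] - x_pred
        s = p_pred + r  # innovation variance
        if abs(innovation) > threshold * s ** 0.5:
            # Suspected outlier: keep the prediction, skip the update.
            flagged.append(t)
            x_hat, p = x_pred, p_pred
        else:
            gain = p_pred / s
            x_hat = x_pred + gain * innovation
            p = (1.0 - gain) * p_pred
        filtered.append(x_hat)
    return flagged, filtered


# Simulated AR(1) data (assumption) with additive outliers at t = 100 and t = 200.
rng = random.Random(0)
true_phi, q, r = 0.9, 0.1, 0.01
state, y = 0.0, []
for t in range(300):
    state = true_phi * state + rng.gauss(0.0, q ** 0.5)
    y.append(state + rng.gauss(0.0, r ** 0.5))
for t in (100, 200):
    y[t] += 6.0

phi_hat = fit_ar1(y)
flagged, filtered = kalman_outlier_scan(y, phi_hat, q, r)
```

Because the test compares each observation with a model-based prediction rather than with a fixed limit, a spike can be flagged even while the process itself is moving. The sketch does not distinguish AO from IO, and the AR(1) fit is itself biased by the outliers; a more careful version might refit the model after excluding the flagged points.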

Root cause diagnosis of plant-wide oscillations is provided as a concrete example of data cleaning and knowledge discovery in process data. Plant-wide oscillations can degrade the overall control performance of the process, and detection results are often affected by noise in different frequency ranges. To address this problem, an information transfer method combining the spectral envelope algorithm with spectral transfer entropy is proposed to detect and diagnose such oscillations within a specific frequency range, mitigating the effects of measurement noise. The feasibility and effectiveness of the proposed method are verified and compared with available methods through both simulated and industrial case studies.