Supervised and unsupervised PRIDIT for active insurance fraud detection
This dissertation develops statistical and data-mining-based methods for insurance fraud detection. Insurance fraud is very costly and has become a worldwide concern in recent years. Considerable effort has gone into developing models that identify potentially fraudulent claims for special investigation. More broadly, insurance fraud detection is a classification task. Both supervised learning methods (where a dependent variable is available for training the model) and unsupervised learning methods (where no prior information on the dependent variable is available) can potentially be employed to solve this problem.

First, an unsupervised method is developed to improve detection effectiveness. Unsupervised methods are especially pertinent to insurance fraud detection because the true nature of a claim (i.e., fraudulent or not) is very costly to determine, if it can be determined at all. In addition, few unsupervised methods are available; some are computationally intensive, and their results can be difficult to interpret. The proposed method is demonstrated empirically on a large, widely used dataset in which labels are known for the dependent variable. It is also evaluated empirically against prevalent supervised methods as a form of external validation. This method can be used in other applications as well.

Second, another set of learning methods is developed, based on the proposed unsupervised method, to further improve performance. These methods are developed in the context of a special class of data mining methods, active learning. Their performance is also evaluated empirically using insurance fraud datasets.

Finally, a method is proposed to estimate the fraud rate (i.e., the percentage of fraudulent claims in the entire claims set). Because the true nature of insurance claims (and hence the level of fraud) is unknown in most cases, there has been no consensus on estimates of the fraud rate. The proposed estimation method builds on the proposed unsupervised method. Applied to insurance fraud datasets in which the nature of each claim (i.e., fraudulent or not) is known, it yields accurate estimates that are superior to those generated by a benchmark naïve estimation method.
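The abstract does not spell out how a PRIDIT score is computed. For orientation only, the classic unsupervised PRIDIT idea from the fraud detection literature (RIDIT scoring of ordinal indicators, followed by the first principal component of the scored matrix) might be sketched as follows. This is a minimal illustration of that general baseline, not the method proposed in the dissertation. The function names, the assumption that indicators are coded so that larger values are more suspicious, and the sign convention on the weights are all hypothetical choices made for the example.

```python
import numpy as np

def ridit_scores(column):
    """RIDIT-style score per observation: P(lower category) - P(higher category).

    Assumes an ordinal indicator whose categories sort in a meaningful order.
    """
    column = np.asarray(column)
    cats, counts = np.unique(column, return_counts=True)  # cats is sorted
    p = counts / counts.sum()
    below = np.cumsum(p) - p      # proportion strictly below each category
    above = 1.0 - np.cumsum(p)    # proportion strictly above each category
    score_by_cat = below - above
    return score_by_cat[np.searchsorted(cats, column)]

def pridit(X):
    """Return (claim scores, indicator weights) from a claims-by-indicators array.

    Weights are the leading eigenvector of F'F, where F holds the RIDIT scores.
    """
    X = np.asarray(X)
    F = np.column_stack([ridit_scores(X[:, j]) for j in range(X.shape[1])])
    _, eigvecs = np.linalg.eigh(F.T @ F)   # eigenvalues in ascending order
    w = eigvecs[:, -1]                     # eigenvector of the largest eigenvalue
    if w.sum() < 0:                        # arbitrary sign convention (assumption)
        w = -w
    return F @ w, w

# Hypothetical toy data: 4 claims, 2 binary red-flag indicators.
claims = [[0, 0], [0, 0], [0, 0], [1, 1]]
scores, weights = pridit(claims)
```

Under the coding assumed here, the claim with the most red flags receives the highest score; claims could then be ranked by score for investigation, with no fraud labels required.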