KKBox subscription prediction : an application of machine learning methods




Zheng, Hanyue




This report uses datasets from a Kaggle competition whose aim is to develop machine learning models that predict whether users of the music app KKBox will renew their membership after it expires. Four classification models were built: logistic regression, random forest, Naïve Bayes, and gradient boosting (XGBoost). Exploratory data analysis was performed to understand the data distributions and the relationships between features. For models that cannot handle missing data or multicollinearity, data imputation and principal component analysis were performed. The results show that the variable importances derived from the models differ considerably, which suggests caution when selecting a model. The random forest model achieved the highest AUC (0.9727), followed by XGBoost (AUC = 0.0921), logistic regression (AUC = 0.8500), and Naïve Bayes (AUC = 0.7962). However, model performance cannot be judged realistically without considering the actual business case. The results of this report are intended as guidance for further business decision-making.
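
Since the models are ranked by AUC, a minimal sketch of how that metric can be computed may help when reading the comparison. It uses the pairwise definition: the fraction of (renewing, non-renewing) user pairs in which the renewing user receives the higher predicted score, with ties counted as half. The function name `roc_auc` and the toy labels and scores are hypothetical and are not taken from the report or the KKBox data.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0/1 (1 = positive class, e.g. "renewed").
    scores: iterable of predicted scores, higher = more likely positive.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, for illustration only.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
toy_auc = roc_auc(toy_labels, toy_scores)
print(f"AUC = {toy_auc:.4f}")
```

The double loop is quadratic in the number of examples, so it only suits small illustrations; on a dataset of realistic size a rank-based implementation from an established library would be the practical choice. An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking.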


