Wrapper boxes for increasing model interpretability via example-based explanations
We propose wrapper boxes, a model-, training-, and dataset-agnostic approach to interpretability in deep learning. The prediction model is trained as usual on some dataset(s), typically optimizing a predetermined loss function. At inference time, the prediction model is augmented by a simpler model that makes predictions by leveraging the learned representations of the former. Hence, any black-box model, such as a deep neural network, can be made more interpretable by "wrapping" it with white-box auxiliaries that are explainable by design. We demonstrate the effectiveness of wrapper boxes across two datasets and three large pre-trained language models, showing that performance is not noticeably different from that of the original model across various configurations, even for simple augmentations such as k-nearest neighbors, support vector machines, decision trees, and k-means. In particular, we present quantitative evidence that representations retrieved from the penultimate layer alone are sufficient for white boxes to achieve not noticeably different performance. Finally, we illustrate the added explainability of white-box augmentations by showcasing intuitive and faithful example-based explanations. We hypothesize that any minor degradation in predictive performance is justified by the enhanced interpretability afforded to human users, enabling the combined human-AI partnership to be more performant than a black-box model alone.
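To make the pipeline concrete, the following is a minimal sketch of the wrapper-box idea using scikit-learn, not the paper's actual implementation: a small MLP stands in for the pre-trained language model, its penultimate-layer (hidden) activations serve as the learned representations, and a k-nearest-neighbors white box makes predictions from those frozen representations, with the retrieved neighbors acting as example-based explanations. All names here (`penultimate`, the dataset, hyperparameters) are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# "Black box": a one-hidden-layer MLP stands in for a large pre-trained model.
mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                    random_state=0).fit(X_tr, y_tr)

def penultimate(X):
    # Penultimate-layer representation: the MLP's ReLU hidden activations,
    # computed from its learned weights (illustrative helper, not an API).
    return np.maximum(0.0, X @ mlp.coefs_[0] + mlp.intercepts_[0])

# "White box" wrapper: k-NN fit on the frozen learned representations.
knn = KNeighborsClassifier(n_neighbors=5).fit(penultimate(X_tr), y_tr)

# Predictions now come from the wrapper; the nearest training examples
# retrieved for each test point are its example-based explanation.
H_te = penultimate(X_te)
pred = knn.predict(H_te)
dist, idx = knn.kneighbors(H_te[:1])  # indices into X_tr explain pred[0]
```

In this sketch the black box is trained once and left untouched; only the cheap white box is fit on top, which is what makes the approach agnostic to the underlying model, training procedure, and dataset.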