To tackle interpretability in deep learning, we presented a novel framework that jointly learns a predictive model and its associated interpretation model. The interpreter provides both local and global interpretability of the predictive model in terms of human-understandable, high-level attribute functions, with minimal loss of accuracy. This is achieved through a dedicated architecture and well-chosen regularization penalties. We also developed a detailed pipeline to visualize the learnt features. Moreover, besides generating models that are interpretable by design, our approach can be specialized to provide post-hoc interpretations of a pre-trained neural network. We validated our approach against several state-of-the-art methods on multiple datasets and showed its efficacy on both kinds of tasks.