Abstract:
Performance evaluation is a critical part of deep learning (DL) and must be conducted carefully if its results are to be trusted and relied upon. Several metrics exist for evaluating DL models; however, choosing one for a given model is not trivial, since no single metric is a one-size-fits-all solution. In practice, accuracy is the most widely used evaluation metric for capsule networks (CapsNets). This is problematic for sensitive applications (e.g., healthcare), since accuracy is overly optimistic in the presence of class imbalance and does not permit exact reporting of a model’s risk of bias and potential usefulness. This paper therefore aims to demonstrate the usefulness of other metrics for performance evaluation and interpretability through the implementation of a custom capsule model. The metrics are effective in measuring the real performance of the models in terms of accuracy (93.03% for the proposed model), number of parameters (≈ 4 million fewer for the proposed model), ability to scale and to fail safely, and the effectiveness of the routing process when evaluated on the datasets. Evaluating a CapsNet model with all these metrics has the potential to enhance the practitioner’s confidence and to improve model understandability and reliability.
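The claim that accuracy is overly optimistic under class imbalance can be illustrated with a minimal sketch. The label counts and the always-negative classifier below are hypothetical and are not taken from the paper's experiments. Balanced accuracy (mean per-class recall) is used only as one familiar example of an imbalance-aware metric, not necessarily one the paper adopts.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    support = Counter(y_true)  # number of true samples per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per class
    return sum(hits[c] / support[c] for c in support) / len(support)


# Hypothetical imbalanced test set: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
print(f"accuracy = {acc:.2f}, balanced accuracy = {bal_acc:.2f}")
# prints: accuracy = 0.95, balanced accuracy = 0.50
```

In this toy setting the classifier never detects a positive case, yet its accuracy is 95%. The balanced score of 50% exposes that it performs no better than chance on a per-class basis.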