Image representations are commonly learned from class labels, which are a simplistic approximation of human image understanding. In this paper we demonstrate that transferable representations of images can be learned without manual annotations by modeling human visual attention. The basis of our analyses is a unique gaze tracking dataset of sonographers performing routine clinical fetal anomaly screenings. Models of sonographer visual attention are learned by training a convolutional neural network (CNN) to predict gaze on ultrasound video frames through visual saliency prediction or gaze-point regression. We evaluate the transferability of the learned representations to the task of ultrasound standard plane detection in two contexts. Firstly, we perform transfer learning by fine-tuning the CNN with a limited number of labeled standard plane images. We find that fine-tuning the saliency predictor is superior to training from random initialization, with an average F1-score improvement of 9.6% overall and 15.3% for the cardiac planes. Secondly, we train a simple softmax regression on the feature activations of each CNN layer in order to evaluate the representations independently of transfer learning hyper-parameters. We find that the attention models derive strong representations, approaching the precision of a fully-supervised baseline model for all but the last layer.
Introduction
Humans direct their visual attention towards semantically informative regions when interpreting images (Wu et al., 2017). The task of predicting the distribution of gaze points on images or video frames is referred to as visual saliency prediction, and CNNs are currently the most effective method for this task (Borji, 2018). While there has been extensive research on designing ever more accurate saliency predictors for public benchmarks, little work has been devoted to making them useful for other computer vision tasks (exceptions include Cornia et al. (2017) and Cai et al. (2018)). Here, we ask: to what extent can representations learned purely from gaze data be transferred to a challenging classification task?
Specifically, we train a CNN to predict the gaze of sonographers while they perform routine fetal ultrasound scans, and evaluate that model on the task of detecting standard planes in the corresponding videos. We implement two methods for predicting sonographer gaze: (i) visual saliency prediction, i.e., predicting the spatial distribution of gaze points on each frame as a 2D saliency map, and (ii) direct regression of the gaze-point coordinates.
We refer to these models as visual attention models (VAMs), i.e., Saliency-VAM and Gaze-VAM. We pose the task of standard plane detection analogously to Baumgartner et al. (2016), and use their trained SonoNet model as a baseline.
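For concreteness, the following is a minimal PyTorch-style sketch of the two gaze-prediction objectives. The specific loss formulations (a KL divergence for saliency prediction and a soft-argmax coordinate regression for gaze points), tensor shapes, and names are illustrative assumptions rather than the exact choices made in the paper.

```python
import torch
import torch.nn.functional as F

def saliency_loss(pred_map, gaze_map, eps=1e-8):
    """KL divergence between predicted and ground-truth gaze distributions.
    pred_map, gaze_map: (B, 1, H, W); gaze_map is e.g. a blurred gaze-point
    histogram. The loss choice is illustrative."""
    b = pred_map.size(0)
    log_p = F.log_softmax(pred_map.view(b, -1), dim=1)   # predicted log-distribution
    q = gaze_map.view(b, -1)
    q = q / (q.sum(dim=1, keepdim=True) + eps)           # normalise target to sum to 1
    return F.kl_div(log_p, q, reduction="batchmean")

def gaze_point_loss(pred_map, gaze_xy):
    """Regress a single normalised (x, y) gaze point via a soft-argmax over the
    network output. pred_map: (B, 1, H, W); gaze_xy: (B, 2) with values in [0, 1]."""
    b, _, h, w = pred_map.shape
    probs = F.softmax(pred_map.view(b, -1), dim=1).view(b, h, w)
    ys = torch.linspace(0.0, 1.0, h, device=pred_map.device)
    xs = torch.linspace(0.0, 1.0, w, device=pred_map.device)
    exp_y = (probs.sum(dim=2) * ys).sum(dim=1)           # expected row coordinate
    exp_x = (probs.sum(dim=1) * xs).sum(dim=1)           # expected column coordinate
    return F.mse_loss(torch.stack([exp_x, exp_y], dim=1), gaze_xy)

# Example with random tensors in place of real video frames and gaze data:
pred = torch.randn(4, 1, 27, 36)
print(saliency_loss(pred, torch.rand(4, 1, 27, 36)).item())
print(gaze_point_loss(pred, torch.rand(4, 2)).item())
```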
Since the gaze data is acquired automatically, our work is related to self-supervised learning, which aims at learning representations from data without manual annotations by training on auxiliary prediction tasks. A good example of self-supervised learning is the work of Doersch et al. (2017), who combine multiple auxiliary tasks such as colorization. To the best of our knowledge, this work is the first attempt to study human visual attention modeling in the context of self-supervised representation learning.
Representation Learning by Modeling Visual Attention

Fig. 1 a) illustrates our framework for training and evaluating the visual attention models (VAMs). On random fetal ultrasound video frames, a dilated CNN is trained either to regress sonographer gaze points or to predict 2D saliency maps of the gaze distribution (see Fig. 2). Next, the dilations are removed (see Fig. 1 b)) and the network is evaluated on standard plane detection. We evaluate two methods of transferring the learned representations: (1) transfer learning, i.e., fine-tuning the network on a limited number of labeled standard plane images, and (2) using the network as a fixed feature extractor and training a softmax regression on the feature activations of each layer.
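The sketch below illustrates the weight-sharing idea behind this conversion: the same 3x3 convolution weights are used in a dilated, full-resolution variant for gaze prediction and in a strided, subsampling variant for classification. The layer sizes, class count, and pooling arrangement are assumptions for illustration only and do not reproduce the paper's exact architecture.

```python
import torch.nn as nn

def make_backbone(dilated: bool) -> nn.Sequential:
    """Toy VGG-style backbone. With dilated=True the later stages keep full
    resolution for dense gaze/saliency prediction; with dilated=False they
    subsample instead, reusing the same 3x3 convolution weights.
    (The 1-channel gaze/saliency output head is omitted for brevity.)"""
    def block(cin, cout, dilation, downsample):
        return [nn.Conv2d(cin, cout, 3, padding=dilation, dilation=dilation),
                nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
                nn.MaxPool2d(2) if downsample else nn.Identity()]

    d = 2 if dilated else 1
    layers = (block(1, 64, 1, True) + block(64, 128, 1, True) +
              block(128, 256, d, not dilated) + block(256, 512, d, not dilated))
    return nn.Sequential(*layers)

# Pre-train the dilated variant on gaze data (see the losses above), then copy
# the weights into the strided variant and attach a classification head.
dense_net = make_backbone(dilated=True)        # gaze / saliency pre-training
clf_backbone = make_backbone(dilated=False)    # identical weight shapes
clf_backbone.load_state_dict(dense_net.state_dict())
classifier = nn.Sequential(clf_backbone,
                           nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                           nn.Linear(512, 14))  # 14 classes is an assumed count
```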
Transfer Learning

Table 3: Standard plane detection results after fine-tuning (mean ± standard deviation [%]).

| Metric    | Rand. Init. | Gaze-FT    | Saliency-FT | SonoNet-FT      |
|-----------|-------------|------------|-------------|-----------------|
| Precision | 70.4 ± 2.3  | 67.2 ± 3.4 | 79.5 ± 1.7  | 82.3 ± 1.3 (81) |
| Recall    | 64.9 ± 1.6  | 57.3 ± 4.5 | 75.1 ± 3.4  | 87.3 ± 1.1 (86) |
| F1-score  | 67.0 ± 1.3  | 60.7 ± 3.9 | 76.6 ± 2.6  | 84.5 ± 0.9 (83) |
The model that learned to regress gaze points (Gaze-FT) does not improve over the model trained from random initialization (Rand. Init.). The saliency predictor (Saliency-FT), in contrast, performs significantly better than Rand. Init.; in fact, its performance approaches that of SonoNet, which was pre-trained on over 22k labeled standard plane images (compared to the 753 used for fine-tuning). The literature SonoNet scores are given in parentheses.
Fixed Feature Extractor

The results of the regression analysis in Fig. 3 a) show that, even without fine-tuning, the high-level features of the attention models are predictive of fetal anomaly standard plane detection. This supports our hypothesis, motivated by Wu et al. (2017), that gaze is a strong prior for semantic information. At the last layer, the attention models fall behind SonoNet, indicating the task-specificity of that layer. Rand. Feat. denotes a model with random weights.
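As an illustration of this layer-wise evaluation, the sketch below fits a softmax (multinomial logistic) regression on frozen, spatially pooled activations of a single layer. The feature dimensionality, optimiser settings, and class count are placeholder assumptions; only the overall protocol (frozen features, linear classifier per layer) follows the text.

```python
import torch
import torch.nn as nn

def fit_softmax_probe(features, labels, num_classes, epochs=200, lr=0.1):
    """Train a linear softmax classifier on frozen feature activations.
    features: (N, C) tensor of spatially pooled activations from one CNN layer."""
    probe = nn.Linear(features.size(1), num_classes)
    opt = torch.optim.SGD(probe.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = criterion(probe(features), labels)
        loss.backward()
        opt.step()
    return probe

# Example with random stand-ins for pooled layer activations and plane labels:
feats = torch.randn(512, 256)            # e.g. 512 frames, 256-d activations
labels = torch.randint(0, 14, (512,))    # assumed 14 plane classes
probe = fit_softmax_probe(feats, labels, num_classes=14)
train_acc = (probe(feats).argmax(dim=1) == labels).float().mean()
```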
The t-SNE plots in Fig. 3 b) confirm that some standard plane classes are well separated in the feature spaces of the visual attention models (VAMs). Compared to the fully-supervised model (SonoNet), however, significant overlap remains between standard planes with similar appearance, such as the brain views and the cardiac views (4CH, 3VT, LVOT, RVOT).
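Plots like Fig. 3 b) can be produced with an off-the-shelf t-SNE embedding of the frozen feature activations; the sketch below uses scikit-learn with random placeholder features and labels, since the actual activations and embedding hyper-parameters are not specified here.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder features/labels; in practice these would be penultimate-layer
# activations of the frozen VAM for each labelled standard plane frame.
rng = np.random.default_rng(0)
features = rng.normal(size=(600, 256))
labels = rng.integers(0, 14, size=600)

embedding = TSNE(n_components=2, perplexity=30, init="pca",
                 random_state=0).fit_transform(features)

plt.figure(figsize=(5, 5))
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap="tab20", s=8)
plt.title("t-SNE of frozen VAM features (illustrative)")
plt.show()
```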
@InProceedings{droste2019ultrasound,
  author    = {Droste, Richard and Cai, Yifan and Sharma, Harshita and Chatelain, Pierre and Drukker, Lior and Papageorghiou, Aris T. and Noble, J. Alison},
  title     = {Ultrasound Image Representation Learning by Modeling Sonographer Visual Attention},
  booktitle = {Information Processing in Medical Imaging (IPMI)},
  series    = {LNCS},
  volume    = {11492},
  publisher = {Springer},
  year      = {2019}
}
This work is supported by the ERC (ERC-ADG-2015 694581, project PULSE) and the EPSRC (EP/G036861/1 and EP/M013774/1). AP is funded by the NIHR Oxford Biomedical Research Centre.