Ultrasound Image Representation Learning by Modeling Sonographer Visual Attention


Image representations are commonly learned from class labels, which are a simplistic approximation of human image understanding. In this paper we demonstrate that transferable representations of images can be learned without manual annotations by modeling human visual attention. The basis of our analyses is a unique gaze tracking dataset of sonographers performing routine clinical fetal anomaly screenings. Models of sonographer visual attention are learned by training a convolutional neural network (CNN) to predict gaze on ultrasound video frames through visual saliency prediction or gaze-point regression. We evaluate the transferability of the learned representations to the task of ultrasound standard plane detection in two contexts. Firstly, we perform transfer learning by fine-tuning the CNN with a limited number of labeled standard plane images. We find that fine-tuning the saliency predictor is superior to training from random initialization, with an average F1-score improvement of 9.6% overall and 15.3% for the cardiac planes. Secondly, we train a simple softmax regression on the feature activations of each CNN layer in order to evaluate the representations independently of transfer learning hyper-parameters. We find that the attention models derive strong representations, approaching the precision of a fully-supervised baseline model for all but the last layer.

Information Processing in Medical Imaging (IPMI) 2019

Paper Summary

Table of Contents

Representation Learning by Modeling Visual Attention
Transfer Learning
Fixed Feature Extractor


Humans direct their visual attention towards semantically informative regions when interpreting images [Wu et al. 2017]. The task of predicting the distribution of gaze points on images or video frames is referred to as visual saliency prediction, and CNNs are currently the most effective method for doing so [Borji 2018]. While there has been extensive research on designing ever more accurate saliency predictors (benchmarks), little work has been devoted to making them useful for other computer vision tasks (exceptions include Cornia et al. (2017) and Cai et al. (2018)). Here, we ask the question: To what extent can representations learned purely from gaze data be transferred to a challenging classification task?

Specifically, we train a CNN to predict the gaze of sonographers while they perform routine fetal ultrasound scans, and evaluate that model on the task of detecting certain standard planes in the corresponding videos. We implement two methods for predicting the sonographer gaze:

  1. The model predicts the 2D scalar gaze point heat maps, termed saliency maps (usual approach)
  2. The model directly regresses the gaze point on each video frame.
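As a rough illustration of the two prediction targets listed above, the sketch below builds a saliency map and a gaze-point regression target from the same set of gaze points. It is not the authors' implementation: the Gaussian kernel, its width `sigma`, the normalization to unit sum, the use of the mean gaze point as regression target, and the function names are all assumptions made for illustration.

```python
import math


def saliency_map(gaze_points, height, width, sigma=2.0):
    """Place an isotropic Gaussian at each (x, y) gaze point and normalize.

    Assumption: a Gaussian heat map normalized to sum to 1. The source only
    states that the target is a 2D scalar gaze point heat map.
    Requires at least one gaze point.
    """
    heat = [[0.0] * width for _ in range(height)]
    for gx, gy in gaze_points:
        for row in range(height):
            for col in range(width):
                dist_sq = (col - gx) ** 2 + (row - gy) ** 2
                heat[row][col] += math.exp(-dist_sq / (2.0 * sigma ** 2))
    total = sum(sum(row) for row in heat)
    return [[value / total for value in row] for row in heat]


def regression_target(gaze_points):
    """Single (x, y) target per frame; here simply the mean gaze point (assumption)."""
    n = len(gaze_points)
    mean_x = sum(x for x, _ in gaze_points) / n
    mean_y = sum(y for _, y in gaze_points) / n
    return (mean_x, mean_y)


# Two hypothetical gaze points recorded on one 12 x 8 pixel frame.
points = [(5.0, 3.0), (6.0, 3.0)]
heat = saliency_map(points, height=8, width=12)
target = regression_target(points)
```

The saliency map keeps the spatial spread of attention, whereas the regression target collapses it to one coordinate pair per frame.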

We refer to these models as visual attention models (VAMs), i.e., Saliency-VAM and Gaze-VAM. We pose the task of standard plane detection analogously to Baumgartner et al. (2016), and use their trained SonoNet model as a baseline.

Since the gaze data is acquired automatically, our work is related to self-supervised learning, which aims at learning representations from data without manual annotations by training on auxiliary prediction tasks. A good example of self-supervised learning is the work of Doersch et al. (2017), who combine multiple auxiliary tasks such as colorization. To the best of our knowledge, this work is the first attempt to study human visual attention modeling in the context of self-supervised representation learning.

Representation Learning by Modeling Visual Attention

Fig. 1 a): Illustration of our framework for learning and evaluating visual attention models (VAMs)

Fig. 1 a) illustrates our framework for training and evaluating the visual attention models (VAMs). On random fetal ultrasound video frames, a dilated CNN is trained to either regress sonographer gaze points or to predict the 2D scalar saliency maps (see Fig. 2). Next, the dilations are removed (see Fig. 1 b)) and the network is evaluated on standard plane detection. We evaluate two methods of transferring the learned representations:

  1. The model is fine-tuned with a small set of a few hundred labeled standard plane images (transfer learning)
  2. Simple logistic regressions are fitted to different layers of the models in order to evaluate the features independently of transfer learning hyper-parameters (fixed feature extractor)

Transfer Learning

Table 3: Standard plane detection results after fine-tuning (mean ± standard deviation [%])

| Metrics | Rand. Init. | Gaze-FT | Saliency-FT | SonoNet-FT |
|---|---|---|---|---|
| Precision | 70.4 ± 2.3 | 67.2 ± 3.4 | 79.5 ± 1.7 | 82.3 ± 1.3 (81) |
| Recall | 64.9 ± 1.6 | 57.3 ± 4.5 | 75.1 ± 3.4 | 87.3 ± 1.1 (86) |
| F1-Score | 67.0 ± 1.3 | 60.7 ± 3.9 | 76.6 ± 2.6 | 84.5 ± 0.9 (83) |

The model that learned to regress gaze points (Gaze-FT) does not improve over the model trained from random initialization (Rand. Init.). The saliency predictor (Saliency-FT), in contrast, performs significantly better than Rand. Init. In fact, its performance is closer to that of SonoNet, which had been pre-trained on over 22k labeled standard plane images (compared to 753 for fine-tuning). The SonoNet scores reported in the literature are given in parentheses.
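The comparison can be checked directly against the mean F1-scores in Table 3. The short calculation below is only a reading aid: it treats the reported improvement as an absolute difference in percentage points, and the "gap closed" ratio is an illustrative quantity that does not appear in the paper.

```python
# Mean F1-scores (in %) copied from Table 3.
f1 = {
    "Rand. Init.": 67.0,
    "Gaze-FT": 60.7,
    "Saliency-FT": 76.6,
    "SonoNet-FT": 84.5,
}

baseline = f1["Rand. Init."]

# Absolute change in percentage points relative to random initialization.
gain = {
    name: round(score - baseline, 1)
    for name, score in f1.items()
    if name != "Rand. Init."
}

# Illustrative only: share of the Rand. Init. to SonoNet-FT gap
# that the saliency-pretrained model closes.
gap_closed = (f1["Saliency-FT"] - baseline) / (f1["SonoNet-FT"] - baseline)
```

Here `gain["Saliency-FT"]` reproduces the 9.6-point F1 improvement quoted in the abstract, and a `gap_closed` value above one half is what "closer to SonoNet" means numerically.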

Fixed Feature Extractor

Fig. 3 a): Results of the regression analysis of the fixed-weight attention models, and baselines.

The results of the regression analysis in Fig. 3 a) show that, even without fine-tuning, the high-level features of the attention models are predictive of the standard plane class. This supports our hypothesis, motivated by Wu et al. (2017), that gaze is a strong prior for semantic information. At the last layer, the attention models fall behind SonoNet, indicating the task-specificity of that layer. Rand. Feat. denotes a model with random weights.
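The fixed-feature evaluation amounts to fitting a softmax (multinomial logistic) regression on frozen activations while the CNN weights stay untouched. The sketch below shows that idea in plain Python on toy 2D vectors that stand in for pooled layer activations. The data, learning rate, number of epochs, and function names are hypothetical, and batch gradient descent is an assumed optimizer, not necessarily the one used in the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def class_logits(weights, biases, x):
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def fit_softmax_regression(features, labels, num_classes, lr=0.5, epochs=100):
    """Batch gradient descent on the mean cross-entropy loss."""
    dim, n = len(features[0]), len(features)
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    for _ in range(epochs):
        grad_w = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        for x, y in zip(features, labels):
            probs = softmax(class_logits(weights, biases, x))
            for c in range(num_classes):
                err = probs[c] - (1.0 if c == y else 0.0)
                grad_b[c] += err
                for j in range(dim):
                    grad_w[c][j] += err * x[j]
        for c in range(num_classes):
            biases[c] -= lr * grad_b[c] / n
            for j in range(dim):
                weights[c][j] -= lr * grad_w[c][j] / n
    return weights, biases


def predict(weights, biases, x):
    logits = class_logits(weights, biases, x)
    return max(range(len(logits)), key=lambda c: logits[c])


# Toy stand-in for frozen features: three well-separated clusters.
centers = [(2.0, 0.0), (-1.0, 1.7), (-1.0, -1.7)]
offsets = [(0.0, 0.0), (0.2, 0.2), (-0.2, 0.2), (0.2, -0.2), (-0.2, -0.2)]
features = [[cx + dx, cy + dy] for cx, cy in centers for dx, dy in offsets]
labels = [c for c in range(len(centers)) for _ in offsets]

weights, biases = fit_softmax_regression(features, labels, num_classes=3)
accuracy = sum(
    predict(weights, biases, x) == y for x, y in zip(features, labels)
) / len(labels)
```

Because only the linear classifier is trained, the resulting score reflects how linearly separable the classes already are in a given layer's feature space.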

Fig. 3 b): t-SNE visualization of the feature embeddings at the respective layers with the highest F1-score (Background class omitted for legibility).

The t-SNE plots in Fig. 3 b) confirm that some standard plane classes are separated in the respective feature spaces of the visual attention models (VAMs). Compared to the fully-supervised model (SonoNet), however, a significant overlap remains between standard planes of similar appearance, namely among the brain views and among the cardiac views (4CH, 3VT, LVOT, RVOT).

More Results

Fig. 1 b): An illustration that a dilated convolution results in the same receptive field as downsampling followed by a non-dilated convolution. We use this fact to train a dilated network for saliency prediction (higher-resolution output), which is then used as a classifier by introducing downsampling and removing the dilations (faster, lower memory requirements).
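The equivalence described in the caption can be checked on a 1D toy signal. In the sketch below, "downsampling" is modeled as plain subsampling by a factor of 2, which is an assumption: with that choice the outputs match exactly, whereas with a pooling layer only the receptive field would match, not necessarily the values. The signal, kernel, and function name are arbitrary illustrations.

```python
def conv1d_valid(signal, kernel, dilation=1):
    """1D cross-correlation (the CNN convention) without padding."""
    span = (len(kernel) - 1) * dilation
    return [
        sum(w * signal[n + k * dilation] for k, w in enumerate(kernel))
        for n in range(len(signal) - span)
    ]


signal = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]
kernel = [1, -2, 1]

# Training-time view: dilation 2 on the full-resolution signal.
dilated_full_res = conv1d_valid(signal, kernel, dilation=2)

# Classification-time view: subsample by 2, then use the same weights undilated.
downsampled_then_conv = conv1d_valid(signal[::2], kernel, dilation=1)

# Every second output of the dilated path equals the downsampled path.
outputs_match = dilated_full_res[::2] == downsampled_then_conv

# In both cases one output depends on a window of 5 input samples.
receptive_field = (len(kernel) - 1) * 2 + 1
```

The dilated path yields twice as many outputs from the same weights, which is why it suits dense saliency prediction, while the downsampled path computes fewer activations for classification.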

Fig. 2: Visual saliency and gaze point predictions with corresponding ground truths for representative validation set frames.

Appendix A: Confusion matrices of a) the fine-tuned saliency model (Saliency-FT) and b) the baseline SonoNet model. The Saliency-FT model is pre-trained on random video frames for saliency prediction (no manual annotations) and the SonoNet model is pre-trained with over 22k labeled standard plane images. Then, both models are fine-tuned with 753 standard plane images.

Appendix B: Nearest neighbors in the respective feature spaces. The first column shows various query images and the subsequent columns show the two nearest neighbors, measured by l2 distance, in the (average-pooled) feature spaces of the last layer of each model.
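A minimal sketch of such a retrieval step is given below, assuming that each image yields a channels x height x width feature map that is averaged over its spatial dimensions before l2 distances are computed. The toy feature maps and function names are hypothetical and do not come from the paper.

```python
import math


def average_pool(feature_map):
    """Collapse a C x H x W nested list into a length-C vector of channel means."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def nearest_neighbors(query_vec, gallery_vecs, k=2):
    """Indices of the k gallery vectors with the smallest l2 distance to the query."""
    ranked = sorted(
        (math.dist(query_vec, vec), index) for index, vec in enumerate(gallery_vecs)
    )
    return [index for _, index in ranked[:k]]


# Toy feature maps with 2 channels of size 2 x 2 each.
query_map = [[[1, 1], [1, 1]], [[0, 0], [0, 0]]]
gallery_maps = [
    [[[1, 1], [1, 3]], [[0, 0], [0, 0]]],
    [[[0, 0], [0, 0]], [[4, 4], [4, 4]]],
    [[[1, 1], [1, 1]], [[0, 0], [0, 4]]],
    [[[5, 5], [5, 5]], [[5, 5], [5, 5]]],
]

query_vec = average_pool(query_map)
gallery_vecs = [average_pool(m) for m in gallery_maps]
neighbors = nearest_neighbors(query_vec, gallery_vecs, k=2)
```

Average pooling discards spatial layout, so two images are considered close when the same channels respond with similar overall strength.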


  author={Droste, Richard and Cai, Yifan and Sharma, Harshita and Chatelain, Pierre and Drukker, Lior and Papageorghiou, Aris T. and Noble, J. Alison},
  title={Ultrasound Image Representation Learning by Modeling Sonographer Visual Attention},
  booktitle={Information Processing in Medical Imaging (IPMI)},


This work is supported by the ERC (ERC-ADG-2015 694581, project PULSE) and the EPSRC (EP/GO36861/1 and EP/MO13774/1). AP is funded by the NIHR Oxford Biomedical Research Centre.