Updated: May 22
Performing high resolution tasks that demand contextual understanding
Networks are trained to perform a wide variety of vision-based tasks. The success of some requires not only a large amount of data, but also high-resolution images or volumes. Handling this heavy data can be done on powerful servers and requires time and money, but are there other solutions as well?
The UNet architecture, for example, tackles the conflict between context and resolution by using a pooling operator, usually max pooling. The idea is to retain only the most prominent feature (the maximum-valued pixel) in each region and discard the rest. This filters the information down to what best describes the context of the image, at the cost of spatial resolution.
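To make the pooling step concrete, here is a minimal NumPy sketch of non-overlapping max pooling on a single 2D image. The function name `max_pool2d` and the toy input are illustrative assumptions; a real UNet applies the same operation to multi-channel feature maps through a deep learning framework.

```python
import numpy as np

def max_pool2d(image: np.ndarray, k: int = 2) -> np.ndarray:
    """Non-overlapping k x k max pooling on a 2D array."""
    h, w = image.shape
    h_out, w_out = h // k, w // k
    # Crop so both sides divide evenly by k, then view the image as k x k blocks.
    blocks = image[:h_out * k, :w_out * k].reshape(h_out, k, w_out, k)
    # Keep only the largest value of each block.
    return blocks.max(axis=(1, 3))

image = np.array([[1, 3, 2, 0],
                  [4, 2, 1, 1],
                  [0, 0, 5, 6],
                  [1, 2, 7, 8]])
pooled = max_pool2d(image)
print(pooled)
# [[4 2]
#  [2 8]]
```

Each 2 x 2 region is reduced to one value, so the output carries a quarter of the pixels: the strongest responses survive, but their exact position inside each region is lost.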
Using a UNet, in many cases, might not be enough. Consider, for example, the task of segmenting a large object, where the object edges should be fully accurate. In applications like augmented reality, for example, cutting precisely around the implanted character is key for a realistic end-result of the superimposition. Similarly, precise segmentation of anatomical organs is often critical for a medical team or a robot in guiding a procedure. In both examples, high image resolution is crucial for fine object edge delineation, while looking at the bigger picture is needed to gain contextual understanding.
In a paper by Lessmann et al., the authors suggest an iterative method to tackle the resolution–context tradeoff, for the task of precise segmentation and anatomical identification of vertebrae. An iterative instance segmentation is achieved by composing a network with a memory component that retains information about already-segmented vertebrae, and labels the vertebrae one after the other.
A common approach to the context-resolution question is going from coarse to fine resolution, in which the first step would be to decrease the resolution and roughly locate the object, based on its context, and the second step would be the fine tuning of the edge delineation.
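The coarse step can be sketched as follows. This is only an illustration under stated assumptions: the function `coarse_locate` is hypothetical, the object is synthetic, and a plain threshold stands in for the low-resolution network that would normally locate the object.

```python
import numpy as np

def coarse_locate(image, factor=8, threshold=0.5):
    """Downsample by block averaging and return a rough bounding box
    (row0, row1, col0, col1) expressed in full-resolution coordinates."""
    h, w = image.shape
    hs, ws = h // factor, w // factor
    small = image[:hs * factor, :ws * factor]
    small = small.reshape(hs, factor, ws, factor).mean(axis=(1, 3))
    # Assumption: a threshold replaces the low-resolution localization network.
    rows, cols = np.nonzero(small > threshold)
    return (int(rows.min()) * factor, int(rows.max() + 1) * factor,
            int(cols.min()) * factor, int(cols.max() + 1) * factor)

image = np.zeros((512, 512))
image[100:300, 200:420] = 1.0          # synthetic object
r0, r1, c0, c1 = coarse_locate(image)
roi = image[r0:r1, c0:c1]              # the fine step would work near this box only
print((r0, r1, c0, c1))                # (104, 296, 200, 416)
```

The box found at low resolution is off by a few pixels from the true extent (100, 300, 200, 420), which is exactly why a second, high-resolution step around the estimated border is needed.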
The edge fine tuning can be performed by working in patches: feeding (another) network with the patches along the estimated object edge, sequentially. While working with patches is simple to implement, two issues arise:
Each patch is a zoomed-in part of the image and may not contain all the information necessary for accurate segmentation.
Stitching the patches back together, with or without overlaps, requires heuristics to ensure continuous edges.
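A minimal sketch of the patch extraction itself, assuming the coarse step has already produced a list of (row, col) points along the estimated edge. The function `extract_edge_patches`, the patch size, and the circular edge are illustrative assumptions.

```python
import numpy as np

def extract_edge_patches(image, edge_points, size=32):
    """Cut a size x size patch around each (row, col) edge point."""
    half = size // 2
    # Reflect-pad so that patches near the image border keep the full size.
    padded = np.pad(image, half, mode="reflect")
    patches = []
    for r, c in edge_points:
        # Point (r, c) sits at (r + half, c + half) in the padded image,
        # so the window starting at (r, c) has it at index (half, half).
        patches.append(padded[r:r + size, c:c + size])
    return np.stack(patches)

rng = np.random.default_rng(0)
image = rng.random((128, 128))
# Assumed coarse edge estimate: 12 points on a circle of radius 40.
angles = np.linspace(0.0, 2.0 * np.pi, 12, endpoint=False)
edge_points = np.stack([64 + 40 * np.sin(angles),
                        64 + 40 * np.cos(angles)], axis=1).round().astype(int)
patches = extract_edge_patches(image, edge_points)
print(patches.shape)  # (12, 32, 32)
```

Each patch would then be fed to the refinement network in turn. Note that the sketch stops before the harder part named above: merging the per-patch predictions into one continuous edge.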
In many cases, a preferable method for the second step, the fine-tuning of the edges, is to use a transformation that decreases the dimension of the data, for example by squeezing the image data around the estimated object border into a small image. This way, we crop out the irrelevant information and keep only the data needed to perform the task, while handling far less data. Both requirements, context and resolution, can then be fulfilled.
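One possible form of such a transformation is sketched below, assuming the coarse estimate of the border is a circle with a known center and radius. The function `unwrap_band`, the band width, and the nearest-neighbor sampling are illustrative choices, not a prescribed implementation.

```python
import numpy as np

def unwrap_band(image, center, radius, half_width=8, n_angles=180):
    """Resample the ring radius +/- half_width around `center` into a
    (2 * half_width + 1, n_angles) image using nearest-neighbor lookup."""
    cy, cx = center
    angles = np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False)
    radii = radius + np.arange(-half_width, half_width + 1)
    # Sampling grid: rows index the radius, columns index the angle.
    rows = cy + radii[:, None] * np.sin(angles)[None, :]
    cols = cx + radii[:, None] * np.cos(angles)[None, :]
    rows = np.clip(np.rint(rows).astype(int), 0, image.shape[0] - 1)
    cols = np.clip(np.rint(cols).astype(int), 0, image.shape[1] - 1)
    return image[rows, cols]

# Synthetic 512 x 512 image: a disc whose true border lies at radius 150,
# while the (assumed) coarse estimate puts it at radius 146.
yy, xx = np.mgrid[:512, :512]
image = (np.hypot(yy - 256, xx - 256) <= 150).astype(float)
band = unwrap_band(image, center=(256, 256), radius=146)
print(image.size, "->", band.size)  # 262144 -> 3060
```

The true border still lies inside the small band image at full resolution, but the refinement network now sees roughly 1% of the original pixels, and the edge appears as a nearly horizontal line rather than a closed curve.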
In the process of accurate vessel segmentation, for instance, one could use a network that is fed by low resolution images and locates the vessel, followed by a transformation of sections along the estimated centerline to a narrow image, covering only the potential vessel edge area.
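The vessel case can be sketched in the same spirit: given an estimated centerline, sample short profiles perpendicular to it and stack them into a narrow image. The function `straighten_along_centerline`, the synthetic vessel, and the nearest-neighbor sampling are assumptions made for illustration.

```python
import numpy as np

def straighten_along_centerline(image, centerline, half_width=10):
    """Sample profiles perpendicular to `centerline` (N x 2 array of
    (row, col) points) into an (N, 2 * half_width + 1) image."""
    centerline = np.asarray(centerline, dtype=float)
    # Finite-difference tangents along the curve, normalized to unit length.
    tangents = np.gradient(centerline, axis=0)
    tangents /= np.linalg.norm(tangents, axis=1, keepdims=True)
    # Rotate each tangent by 90 degrees to get the normal direction.
    normals = np.stack([-tangents[:, 1], tangents[:, 0]], axis=1)
    offsets = np.arange(-half_width, half_width + 1)
    samples = (centerline[:, None, :]
               + offsets[None, :, None] * normals[:, None, :])
    rows = np.clip(np.rint(samples[..., 0]).astype(int), 0, image.shape[0] - 1)
    cols = np.clip(np.rint(samples[..., 1]).astype(int), 0, image.shape[1] - 1)
    return image[rows, cols]

# Synthetic curved "vessel": pixels within 3 rows of a sine-shaped curve.
yy, xx = np.mgrid[:256, :256]
image = (np.abs(yy - (128 + 40 * np.sin(xx / 40))) <= 3).astype(float)
t = np.arange(20, 236)
centerline = np.stack([128 + 40 * np.sin(t / 40), t], axis=1)
strip = straighten_along_centerline(image, centerline)
print(strip.shape)  # (216, 21)
```

In the straightened strip the curved vessel becomes a band running down the middle column, so a network only has to delineate two nearly straight walls in a 216 x 21 image instead of searching the full 256 x 256 frame.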
Bernier et al. tackle the task of segmenting the endocardial wall of the left ventricle (LV) given 3D echocardiographic images. To do so, they assume that the LV has a U-shape and represent the volume in a spherical-cylindrical coordinate system, as can be seen in the figure below. This example shows how a transformation of a 3D volume to a 2D image can be used to decrease data size and focus only on the significant parts.
As we have seen, tasks that demand contextual understanding do not have to compromise on resolution. Various methods make it possible to achieve both goals: contextual understanding and fine-grained accuracy.