Thoughts on Convolutional Kernels in Volumetric Images
Updated: Feb 6
When designing a neural network for volumetric data, such as CT or MRI, one of the first questions is whether to use 3D convolutional kernels or to apply 2D convolutions to each slice separately. If there are no computational or memory limitations, a 3D network is preferable, since we would rather not privilege any two of the three axes over the third.
However, it is often impossible to feed the whole volume to the network because of memory limitations, and even when it can be done, training and inference can take very long. So, in most cases, we need to choose between working on two-dimensional slices that include the full width and height, and working on volumetric patches that are cropped from the data and whose outputs are later stitched together to produce the full volume. As is often the case in deep learning, there is no single answer that fits every case, and the choice depends on the specifics of the data, the task and the computational resources. But there are some considerations that can guide us towards a judicious choice.
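As a rough illustration of the patch-and-stitch option, here is a minimal NumPy sketch. The function names, the overlap-averaging rule and the `predict_fn` callback (a stand-in for a trained network) are assumptions made for this example, not a description of any particular pipeline. It also assumes the volume is at least as large as the patch along every axis.

```python
import numpy as np

def patch_starts(size, patch, stride):
    """Start indices along one axis so that the patches cover [0, size)."""
    starts = list(range(0, size - patch + 1, stride))
    if starts[-1] != size - patch:
        starts.append(size - patch)  # extra patch flush with the far edge
    return starts

def predict_by_patches(volume, patch_shape, stride, predict_fn):
    """Apply predict_fn to overlapping 3D patches and average the overlaps.

    predict_fn is a hypothetical stand-in for the network: it takes a patch
    and returns an array of the same shape (e.g. a probability map).
    """
    out = np.zeros(volume.shape, dtype=np.float32)
    count = np.zeros(volume.shape, dtype=np.float32)
    grids = [patch_starts(s, p, st)
             for s, p, st in zip(volume.shape, patch_shape, stride)]
    for z in grids[0]:
        for y in grids[1]:
            for x in grids[2]:
                sl = (slice(z, z + patch_shape[0]),
                      slice(y, y + patch_shape[1]),
                      slice(x, x + patch_shape[2]))
                out[sl] += predict_fn(volume[sl])
                count[sl] += 1
    return out / count  # every voxel is covered at least once

# Usage with an identity "network": stitching should reproduce the input.
volume = np.random.rand(40, 64, 64)
stitched = predict_by_patches(volume, (16, 32, 32), (8, 16, 16), lambda p: p)
```

Averaging the overlapping regions is only one possible stitching rule; weighting the patch centers more heavily than the borders is another common choice.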
The key question is whether the target can be well identified from a single 2D slice, perhaps with some additional post-processing, or whether the slices are highly correlated and the 3D information is crucial for detection and/or segmentation. An example of the latter is the segmentation and separation of vertebrae in the spine: the vertebrae are quite complex objects, and their processes overlap, making it difficult, even for an expert observer, to separate two vertebrae in a 2D slice, especially if they include deformities such as fractures, scoliosis or degenerative discs. In this case, a 3D approach gives better results. On the other hand, when detecting brain tumors in an MRI scan, it might be easier to detect the anomaly from a 2D slice that covers a wide field of view than from a small volumetric patch that might not include enough healthy and abnormal tissue to tell them apart.
If working in 3D, the size of the volumetric patch should be selected based on the characteristic size of the target, taking into account the computational limitations. In the case of the vertebrae, for example, it’s preferable if the thickness of the patch is at least the height of a vertebra.
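To make this concrete, the patch thickness in slices can be derived from the physical size of the target and the slice spacing of the scan. The sketch below is only an illustration: the function name, the safety margin and the rounding to a multiple of 8 (convenient for a network with three 2x downsampling stages) are assumptions, and the 25 mm height is a placeholder rather than an anatomical reference value.

```python
import math

def patch_depth_in_slices(target_height_mm, slice_spacing_mm,
                          margin=1.5, multiple_of=8):
    """Number of slices needed to cover the target plus a margin,
    rounded up to a multiple that suits the network's downsampling."""
    needed = target_height_mm * margin / slice_spacing_mm
    return int(math.ceil(needed / multiple_of) * multiple_of)

# Placeholder target height of 25 mm at two different slice spacings.
depth_thin = patch_depth_in_slices(25.0, 1.0)   # 40 slices
depth_thick = patch_depth_in_slices(25.0, 3.0)  # 16 slices
```

The same target therefore needs far fewer slices in a scan with thick slices, which changes how expensive the 3D option is.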
We should also take into account the computational resources and time limitations on the one hand, and the convergence of the learning on the other. Training a network with 3D kernels takes longer and requires more memory, which often limits the batch size and might affect the stability of the learning process. The additional dimension also adds parameters and complexity to the network, which might slow down its convergence. Depending on the patch size, a single GPU might not even be sufficient to run the training. The inference time depends on the number of patches or slices required to build the volume. Sometimes a pre-processing algorithm can be used to locate the target area and reduce the number of patches processed during inference, thus reducing the time difference between the 2D and 3D implementations. For the spine segmentation example, the spinal canal can be detected with “traditional” computer vision algorithms.
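A back-of-the-envelope calculation shows where the extra cost comes from. The sketch below counts the weights of a single convolution layer (ignoring biases) and the size of a single float32 feature map; the channel counts and input shapes are arbitrary examples, and real training memory is higher because gradients, optimizer state and the activations of every layer are kept as well.

```python
def conv_weights(kernel, c_in, c_out, dims):
    """Weights in one convolution layer with a cubic/square kernel (no bias)."""
    return (kernel ** dims) * c_in * c_out

def activation_bytes(spatial_shape, channels, batch=1, bytes_per_value=4):
    """Memory of one feature map, assuming float32 values by default."""
    n = batch * channels
    for s in spatial_shape:
        n *= s
    return n * bytes_per_value

# A 3x3 versus a 3x3x3 kernel with 32 input and 64 output channels.
w2d = conv_weights(3, 32, 64, dims=2)   # 18432 weights
w3d = conv_weights(3, 32, 64, dims=3)   # 55296 weights, 3x more

# One 32-channel feature map: a 512x512 slice versus a 128^3 patch.
a2d = activation_bytes((512, 512), 32)        # 32 MiB
a3d = activation_bytes((128, 128, 128), 32)   # 256 MiB
```

For these example shapes, a single 3D sample costs as much activation memory as eight 2D slices, which is why the batch size is usually the first thing to shrink.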
Another consideration is the task itself and the nature of the data annotation. For a classification task, where the whole volume is given a single label, it is problematic to split the volume into slices or patches: a scan can be classified as abnormal, for example, while many of its slices are normal. In that case we should attempt to feed the whole volume to the network, even if it means reducing the resolution. However, if the annotation includes a label per slice, or a semantic segmentation of the volume, there is more freedom in cropping or splitting the volume.
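One simple way to reduce the resolution so that a whole volume fits in memory is block averaging. The NumPy sketch below is an illustrative assumption, not a recommendation of a specific resampling method; interpolation-based resampling to a fixed voxel spacing is a common alternative. Voxels that do not fill a complete block at the far edges are cropped.

```python
import numpy as np

def downsample_mean(volume, factors):
    """Reduce a 3D volume by averaging non-overlapping blocks of size `factors`."""
    fz, fy, fx = factors
    # Crop so that every axis is divisible by its factor.
    z, y, x = (s - s % f for s, f in zip(volume.shape, factors))
    v = volume[:z, :y, :x]
    v = v.reshape(z // fz, fz, y // fy, fy, x // fx, fx)
    return v.mean(axis=(1, 3, 5))

# Usage: a 4x4x4 volume becomes 2x2x2; each output voxel averages 8 inputs.
small = downsample_mean(np.arange(64, dtype=np.float64).reshape(4, 4, 4), (2, 2, 2))
```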
In summary, though working with 3D kernels makes sense when dealing with volumetric data, we should decide between 2D and 3D based on our task, our data and our resources. If we have the time, it might be best to just try both options and compare.
A slice showing the transition between two adjacent vertebrae (on the right – with semantic segmentation). Separating the vertebrae based on a single slice can be very challenging even in a healthy spine.