This post is going to be fairly long compared to the others I have written so far, so kindly bear with me. Semantic image segmentation is one of the toughest problems in computer vision. It requires a vision system that can capture the pose and location of an object to a high degree of accuracy. Typical deep learning solutions for automatic image segmentation use max pooling layers as part of the vision system, which causes the system to lose the property of equivariance.
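To make the max pooling point concrete, here is a tiny plain-Python sketch (a toy illustration of my own, not code from any of the networks discussed here). The helper `max_pool_1d` is a hypothetical name, and the 1-D "image" stands in for a feature map. The same feature is placed at two neighbouring positions, and pooling produces the same output for both, so the shift is no longer visible to later layers.

```python
def max_pool_1d(signal, window):
    """Non-overlapping max pooling over a 1-D list (toy stand-in for a 2-D layer)."""
    return [max(signal[i:i + window]) for i in range(0, len(signal), window)]


# One strong feature at position 0, then the same feature shifted by one pixel.
original = [9, 0, 0, 0]
shifted = [0, 9, 0, 0]

pooled_original = max_pool_1d(original, 2)
pooled_shifted = max_pool_1d(shifted, 2)

# Both inputs pool to the same values: the output is invariant to the shift,
# which means the information about where the feature sat has been discarded.
print(pooled_original, pooled_shifted)
```

An equivariant representation would instead change in a predictable way when the input shifts, which is the behaviour capsules aim to preserve.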
So far, I have been using the popular U-Net for most of my medical image segmentation tasks. While it has a few limitations, the U-Net worked well for the most part. Then, last year (2017), Geoffrey Hinton and his team at Google Brain published a paper on Capsule Networks. It took me some time to wrap my head around the paper because it describes so many components (including a voting system within a network!). Understanding why each component is needed and how it helps train the network is crucial for being able to build on this work. Since I was already into image segmentation, the first thing that came to my mind was: why not use it for image segmentation? I had already started working on this project when I came across Prof. Rodney L. and his team's paper on "SegCaps". Since that paper seemed to implement the network in exactly the way I had planned, I decided to use it as my base instead. This saved me a lot of time on model creation and let me focus on applying it to a particular problem: pediatric bone image segmentation.
We use the state-of-the-art transforming auto-encoder and decoder network, which is known for being equivariant, to segment pediatric bone radiographs. The dataset consists of about 12,600 images. Contrast Limited Adaptive Histogram Equalization (CLAHE) is applied to every image before it is fed as input to the trained transforming auto-encoder. After that, morphological operations fill the holes in the network's output, and the contours of the result are drawn to generate the final mask.
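The hole-filling part of the post-processing can be sketched roughly as below. This is a plain-Python illustration under my own assumptions, not the code used in the project: `fill_holes` is a hypothetical helper, the mask is a small list of 0/1 rows, and in practice an image library would do the same job on full-size radiographs. The idea is to flood-fill the background from the image border and treat anything the fill cannot reach as a hole.

```python
from collections import deque


def fill_holes(mask):
    """Fill background regions that are fully enclosed by foreground.

    mask is a list of rows of 0/1 values. Background pixels reachable from
    the image border (4-connectivity) stay 0; every other pixel becomes 1.
    """
    rows, cols = len(mask), len(mask[0])
    outside = [[False] * cols for _ in range(rows)]
    queue = deque()

    # Seed the flood fill with every background pixel on the border.
    for r in range(rows):
        for c in range(cols):
            on_border = r in (0, rows - 1) or c in (0, cols - 1)
            if on_border and mask[r][c] == 0:
                outside[r][c] = True
                queue.append((r, c))

    # Spread through connected background pixels.
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and mask[nr][nc] == 0 and not outside[nr][nc]):
                outside[nr][nc] = True
                queue.append((nr, nc))

    # Anything not reached from the border is foreground or an enclosed hole.
    return [[0 if outside[r][c] else 1 for c in range(cols)] for r in range(rows)]


# Toy predicted mask: a ring of foreground with a one-pixel hole in the middle.
predicted = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
filled = fill_holes(predicted)
```

After this step the ring becomes a solid 3x3 block, while the background around it is left untouched.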
A comparison between the segmentations produced by the U-Net and by the Capsule Network can be seen below. I have tried to pick an image that helps us appreciate the property of equivariance that the Capsule Network preserves. This is the first paper that utilizes transforming auto-encoders for pediatric bone image segmentation.
| | U-Net | Capsule Network |
|----------|----------|----------|
| No. of parameters | 34,632,753 | 1,416,912 (about 24 times fewer) |
| Property | Translation invariant | Equivariant |
While Capsule Networks do show a lot of promise, we have a long way to go before we can use them on a day-to-day basis. I hope we get there soon. Happy reading!
References: https://arxiv.org/abs/1804.04241