Finally, I have managed to get some time off and thought of posting about my latest research updates. This post describes the first phase, which is now complete. So, getting to the discussion:
Since the images were obtained from multiple sources, they contain different backgrounds and, in some cases, a lot of noise. Some images are too dark while others are too light. All these variations hamper the accuracy of the predictions.
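One common way to reduce the impact of over- and under-exposed images before training is per-image intensity standardization. The post does not specify which normalization was used, so the function below is only a minimal NumPy sketch of the general idea, not the actual preprocessing pipeline:

```python
import numpy as np

def standardize_image(img: np.ndarray) -> np.ndarray:
    """Rescale pixel intensities to zero mean and unit variance.

    This maps both very dark and very bright images onto a
    comparable intensity range before they reach the CNN.
    """
    img = img.astype(np.float32)
    mean = img.mean()
    std = img.std()
    if std < 1e-8:              # guard against flat (constant) images
        return img - mean
    return (img - mean) / std
```

Applied independently to each image, this removes the global brightness offset while leaving the spatial structure untouched.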
Since the dataset is sufficiently large (> 10 GB), I have enough data to train the model from scratch.
Before I trained the CNN, the first task was to segment the images and extract the hands from them. This helped reduce the amount of unnecessary background present in each image. The ideal method would be to first draw a mask for each image and then use image-processing techniques to extract the hand using that mask. Unfortunately, the dataset contains over 10,000 images, so drawing masks for each of them individually would be a painstaking and costly affair.
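The extraction step itself is straightforward once a mask exists: every pixel outside the mask is zeroed out, leaving only the hand. A minimal NumPy sketch (the post does not name the library actually used):

```python
import numpy as np

def apply_mask(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Keep only the masked region of an image.

    `image` is an H x W x 3 array; `mask` is an H x W array whose
    non-zero entries mark the hand region. Pixels outside the mask
    are set to zero, removing the background.
    """
    binary = (mask > 0)
    return image * binary[..., np.newaxis]
```

The hard part, as described above, is obtaining the mask in the first place, which is what motivates the U-Net below.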
To overcome this barrier, I made use of a U-Net [1], a model designed specifically for segmenting biomedical images (it has reportedly been used for other segmentation tasks as well, with equally satisfactory results). The U-Net takes a small set of images along with their masks as input. The network is based on a VGG-style architecture. (In the future, I plan to modify the U-Net's architecture to a more recent one, such as Inception V3, and test its usability, but due to time and resource constraints I won't be able to do that for now.) Since the paper was published in early 2015, optimization techniques like batch normalization were not yet part of the original design. My implementation of the U-Net, however, uses batch normalization and other recent optimization techniques, which make the network train considerably faster.

The whole model was trained on an AWS g3.4x instance with a Tesla M60 16 GB GPU. Initially, the output was not very satisfactory, so I had to retrain the model with more data drawn from the previous predictions. I repeated this about four times until satisfactory results were obtained. After training, the model takes an image from the dataset as input and outputs a mask for it; the final result is then generated automatically by applying a few image-processing operations to that mask.
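For readers unfamiliar with the batch normalization mentioned above: after each convolution, the layer's activations are normalized across the batch to zero mean and unit variance, then rescaled by learnable parameters. This is the step the original 2015 U-Net lacked. The following is only an illustrative NumPy sketch of the forward pass (the real implementation would use a framework layer, and `gamma`/`beta` would be learned):

```python
import numpy as np

def batch_norm(x: np.ndarray, gamma: float = 1.0, beta: float = 0.0,
               eps: float = 1e-5) -> np.ndarray:
    """Batch-normalization forward pass over the batch axis.

    `x` has shape (batch, features). Each feature column is
    normalized to zero mean and unit variance across the batch,
    then scaled by `gamma` and shifted by `beta`.
    """
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)   # normalize per feature
    return gamma * x_hat + beta
```

Keeping activations in this well-scaled range at every layer is what allows higher learning rates and the faster training noted above.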
References:
[1] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional Networks for Biomedical Image Segmentation," 2015. https://arxiv.org/abs/1505.04597