Replacing a bespoke image segmentation workflow using classical computer vision tasks with a simple, fully convolutional neural network isn’t too hard with modern compute and software libraries, at least not for the first part of the learning curve. The conv-net alleviates your fine-tuning overhead, decreases the total curation requirement (time spent correcting human-obvious mistakes), and it even expands the flexibility of your segmentations so that you can simultaneously identify the pixel locations of multiple different classes. Even if the model occasionally makes mistakes, it seems to do so in a way that makes it obvious what the net was “thinking,” and the mistakes are still pretty close. If this is so easy, why do we still even have humans?
In some ways conv-nets work almost too well for many computer vision tasks. Getting a reasonably good result and declaring it “good enough” is very tempting. It’s easy to get lackadaisical about a task that you wouldn’t even approach for automation a decade ago, leaving it to undergraduates to manually assess images for “research experience” like focused zipheads. But we can do better, and it’s important that we do so if we are to live in a desirable future. Biased algorithms are nothing new, and the ramifications of a misbehaving model remain the responsibility of its creators
Take a 4 layer CNN trained to segment mitochondria from electron micrographs of brain tissue (training on an electron microscopy dataset from EPFL here. On a scale from Loch Stenness to Loch Ness, the depth of this network is the Bonneville Salt Flats. Nonetheless this puddle of neurons manages to get a reasonably good result after only a few hundred epochs.
I don’t think it would take too much in the way of post-processing to clean those segmentation results: a closing operator to get rid of the erroneous spots and smooth a few artifacts. But isn’t that defeating the point? The ease of getting good results gained early can be a bit misleading. Getting to 90% or even 95% effectiveness on a task can seem pretty easy thanks to the impressive learning capacity of conv-nets, but closing the gap of the last few percent, building a model that generalizes to new datasets, or better yet, transfers what it has learned to largely different tasks is much more difficult. With all the accelerated hardware and improved software libraries we have available today you may be only 30 minutes away from a perfect cat classifier, but you’re at least a few months of diligent work away from a conv-net that can match the image analysis efficacy of an undergrad for a new project.
Pooling operations are often touted as a principal contributor to conv-net classifier invariance, but this is controversial, and in any case most people who can afford the hardware for memory-intensive models are leaving them behind. It seems that pooling is probably more important for regularization than for feature invariance, but we’ll leave that discussion for another time. One side effect of pooling operations is that images are blurred as the x/y dimensions are reduced in deeper layers.
U-Net architectures and atrous convolutions are two strategies that have lately been shown to be effective elements of image segmentation models. The assumed effect for both strategies is better retention of high frequency details (as compared to fully convolutional networks). These counteract some of the blurring effect that comes from using pooling layers.
In this post, we’ll compare the frequency content retained in the output from different models. The training data is EM data from brain slices like the example above. I’m using the dataset from the 2012 ISBI 2D EM segmentation challenge for training and validation (published by Cardona et al., and we’ll compare the results using the EPFL dataset mentioned above as a test set.
To examine how these elements contribute to a vision model, we’ll train them on EM data as autoencoders. I’ve built one model for each strategy, constrained to have the same number of weights. The training process looks something like this (in the case of the fully convolutional model):
Dilated convolutions are an old concept revitalized to address problems associated with details lost to pooling operations by making them optional. This is accomplished by using dilated convolutional kernels (spacing the weights with zeros, or holes) to achieve long-distance context without pooling. In the image below, the dark squares are the active weights while the light gray ones are the “holes” (i.e. in French atrous). Where these kernels are convolved with a layer, they act like a larger kernel without having to learn/store additional weights.
U-Net architectures, on the other hand, utilize skip connections to bring information from the early, less-pooled layers to later layers. The main risk I see in using U-Net architectures is that for a particularly deep model the network may develop an over-reliance on the skip connections. This would mean the very early layers will train faster and have a bigger influence on the model, losing out on the capacity for more abstract feature representations in the layers at the bottom of the “U”.
Using atrous convolutions makes for noticeably better autoencoding fidelity compared to a simple fully convolutional network:
While training with the UNet architecture produces images that are hardly discernible from the originals. Note that the images here are from the validation set, they aren’t seen by the model during training steps.
If you compare the results qualitatively, the U-Net architecture is a clear winner in terms of the sharpness of the decoded output. By the looks of it the U-Net is probably more susceptible to fitting noise as well, at least in this configuration. Using dilated convolutions also offers improved detail reconstruction compared to the fully convolutional network, but it does eat up more memory and trains more slowly due to the wide interior layers.
This seemed like a good opportunity to bring out wavelet analysis to quantify the differences in autoencoder output. We’ll use wavelet image decomposition to investigate which frequency levels are most prevalent in the decoded output from each model. Image decomposition with wavelets looks something like this:
The top-left image has been downsized 2x from the original by removing the details with a wavelet transform (using Daubechies 1). The details left over in the other quadrants correspond to the high frequency content oriented to the vertical, horizontal, and diagonal directions. By computing wavelet decompositions of the conv-net outputs and comparing the normalized sums at each level, we should be able to get a good idea of where the information of the image resides. You can get an impression of the first level of wavelet decomposition for output images from the various models in the examples below:
And finally, if we calculate the normalized power for each level of wavelet decomposition we can see where the majority of the information of the corresponding image resides. The metrics below are the average of 100 autoencoded images from the test dataset.
In the plot, spatial frequencies increase with decreasing levels from left to right. Level 8 refers to the 8th level of the wavelet decomposition, aka the average gray level in this case. The model using a U-Net architecture is the closest to recapitulating all the spatial frequencies of the original image, with the noticeable exception of an about 60% decrease in image intensity at the very highest spatial frequencies.
I’d say the difference between the U-Net output and the original image is mostly down to a denoising effect. The atrous conv-net is not too far behind the U-Net in terms of spatial frequency fidelity, and the choice of model variant probably would depend on the end use. For example, there are some very small sub-organellar dot features that are resolved in the U-Net reconstruction but not the atrous model. If we wanted to segment those features, we’d definitely choose the U-Net. On the other hand, the atrous net would probably suffer less from over-fitting if we wanted to train for segmenting the larger mitochondria and only have a small dataset to train on. Finally, if all we want is to coarsely identify the cellular boundaries, that’s basically what we see in the autoencoder output from the fully convolutional network.
Hopefully this has been a helpful exercise in examining conv-net capabilities in a simple example. Open questions for this set of models remain. Which model performs the best on an actual semantic segmentation task? Does the U-Net rely too much on the skip connections?
I’m working with these models in a repository where I plan to keep notes and code for experimenting with ideas from the machine learning literature and you’re welcome to use the models therein for your own experiments.
A. Lucchi, K.Smith, R. Achanta, G. Knott, P. Fua, Supervoxel-Based Segmentation of Mitochondria in EM Image Stacks with Learned Shape Features, IEEE Transactions on Medical Imaging, Vol. 30, Nr. 11, October 2011.
Albert Cardona, Stephan Saalfeld, Stephan Preibisch, Benjamin Schmid, Anchi Cheng, Jim Pulokas, Pavel Tomancak, Volker Hartenstein. An Integrated Micro- and Macroarchitectural Analysis of the Drosophila Brain by Computer-Assisted Serial Section Electron Microscopy. PLOS 2010
Olaf Ronneberger, Philipp Fischer, Thomas Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. Arxiv. https://arxiv.org/abs/1505.04597
Liang-Chieh Chen, George Papandreou, Florian Schroff, Hartwig Adam. Rethinking Atrous Convolution for Semantic Image Segmentation. Arxiv. https://arxiv.org/abs/1706.05587
 My first job in a research laboratory was to dig through soil samples with fine tweezers to remove roots. We don’t have robots to do this (yet) but I can’t imagine a bored undergraduate producing replicable results in this scenario, and the same goes for manual image segmenation or assessment. On the other hand the undergrad will probably give the best results, albeit with a high standard deviation, as they are likely to have the most ambiguous understanding of the professor’s hypothesis and desired results of anyone in the lab.
 I am indeed reading A Deepness in the Sky.
 (o_o) / (^_^) / (*~*)