For the past few months I’ve been following the Fast.AI Deep Learning for Coders course: an online series of lectures accompanied by Jupyter notebooks and a Python library built around PyTorch. The course itself is split into two halves. The first uses a top-down approach to teach state-of-the-art techniques and best practices for deep learning, in order to achieve top results on well-established problems and datasets, with later lessons delving deeper into the code and mathematics. The second half deals more with the cutting edge of deep learning, focusing on less-well-founded problems, such as generative modelling, and recent experimental techniques which are still being developed.
A recurring theme in many lessons is the concept of transfer learning: taking a pretrained model and adapting it to the problem at hand, rather than starting from scratch. This is initially presented in the domain of image recognition, in which a relatively deep model, such as ResNet34, pretrained to classify ImageNet, is used as the backbone of a new model, which could be classifying dog breeds, labelling satellite images, or regressing to some property of the image.
This idea is later taken to natural language processing by initially training a model to predict upcoming text from Wikipedia articles, and then using it to perform sentiment analysis of film reviews.
This is certainly a really interesting idea which I will be testing in my own work in physics, where sample sizes are often a limiting factor, but cheaper, less-accurate simulations could be used to pretrain models.
Aside from picking up a vast array of new techniques, which I’m already using extensively for my research, the course was an opportunity for me to become familiar with other forms of data, namely images and text. I guess high-energy physics data counts as columnar data, and there were a few lessons dedicated to more traditional columnar data as well. And, of course, it was a chance to get to grips with the different network architectures suited to these new data formats; my previous experience had only been with basic, fully-connected networks.
My general approach to following the course was to run through a lesson and then try and find similar data to apply the ideas to, normally searching on Kaggle for current or old competitions. Over the next few posts, I’ll try and give you a better idea of the techniques by focussing on an example from each of the categories of data. Beginning with images.
The first lesson is a bombshell of information, but the core idea is this: take a pretrained image classifier, add some extra layers on the end, train only those on the problem-specific data, then unfreeze the backbone model and perform a final round of training, often using different learning rates at different layers (discriminative learning rates) to avoid destroying the fine-tuned weights of the backbone. An extension of this is to follow a training curriculum: train very quickly on small images and gradually step up the resolution (whilst dropping down the batch size if needed).
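As a rough sketch of that recipe in plain Python (the three layer groups, the 1/9, 1/3, 1 learning-rate spacing, and the size/batch pairs are illustrative choices of mine, not the course's exact values):

```python
# Sketch of discriminative learning rates plus a progressive-resizing
# curriculum. Numbers here are illustrative, not the course's values.

def discriminative_lrs(max_lr, n_groups=3, factor=3.0):
    """Give earlier (more general) layer groups smaller learning rates."""
    return [max_lr / factor ** (n_groups - 1 - i) for i in range(n_groups)]

lrs = discriminative_lrs(1e-3)  # smallest step for the backbone's base
assert lrs[0] < lrs[1] < lrs[2]

# Progressive resizing: a quick pass at low resolution first, stepping
# the image size up (and the batch size down) as training continues.
curriculum = [(32, 256), (64, 128), (128, 64)]  # (image size, batch size)
for size, batch_size in curriculum:
    pass  # a call like train(model, size, batch_size) would go here
```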
These techniques need not be limited to simply identifying objects in images; they can also be extended to inferring other properties of an image. Looking through Kaggle I found a very interesting competition running, in which the task was to mask, at pixel level, areas of salt deposits in seismic imaging scans (TGS Salt). Lesson 14 of the course presents a solution to a similar problem, in which masks for cars must be produced.
The current best approach (the competition leaderboard eventually reconfirmed this) is to use the U-Net architecture. Originally this had been presented in the domain of biomedical image segmentation, allowing pictures of cells to be accurately highlighted. Essentially, the architecture allows many computations to be performed on an image at varying levels of resolution. In the downward path, convolutional kernels and pooling layers perform calculations on the image whilst reducing the number of pixels; at lower resolution, more channels may be created without incurring a heavy increase in computation time. The catch is that the output mask must have the same resolution as the input image, so that the two can be overlaid.
Rather than simply rescaling the eventual low-resolution image, the second half of the “U” consists of an upwards path of transposed convolutions, which can learn to reconstruct a higher-resolution image better than rescaling. The key point is that since the final image should broadly resemble the input image, the features from the downward path just before each pooling step (when the resolution is halved) can be brought across to the upwards path and concatenated to help produce the rescaled image.
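To make the shape bookkeeping concrete, here is a little plain-Python sketch of how resolutions and channel counts evolve down and up the “U”; the starting channel count and depth are illustrative, not those of any particular implementation:

```python
# Track (channels, resolution) through a toy U-Net: each downward step
# halves the resolution and doubles the channels; each upward step
# doubles the resolution and concatenates the matching skip features.

def unet_shapes(size=128, channels=64, depth=4):
    down = [(channels * 2 ** d, size // 2 ** d) for d in range(depth)]
    skips = down[:-1]  # features saved just before each pooling step
    up = []
    c, s = down[-1]    # start the upwards path from the bottom of the U
    for skip_c, skip_s in reversed(skips):
        s *= 2                # transposed conv doubles the resolution
        c = c // 2 + skip_c   # upsampled channels + concatenated skip
        up.append((c, s))
    return down, up

down, up = unet_shapes()
assert up[-1][1] == 128  # output mask matches the input resolution
```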
The lesson sticks to its core ideal of transfer learning, and replaces the downwards path with a pretrained ResNet, cut just before the final pooling layer. This means that in the initial training only the upwards path needs tuning, before the whole network is unfrozen for a final round. The fact that the ResNet works for a range of image sizes also means that curriculum learning can be used, gradually increasing the resolution of the input and target images.
Adapting the method for salt masking worked pretty well, and after a bit of playing around I landed pretty high on the leaderboard at the time. Eventually, better solutions came out, based mostly on combining different loss functions according to the specifics of the problem, and I dropped down, finishing about halfway down the leaderboard.
Below is an example response from my model:
From left to right: normalised input image, true salt mask, and predicted mask. It certainly seems to be getting the right idea, but is a bit jagged. I eventually moved to using an ensemble which helped to smooth things out a bit.
The winning solution looks really cool and I’m hoping to have the time to read through it and try to reimplement it. From a quick glance, it still seems to be based on U-Net, with a slightly more complex downwards path (ResNeXt50); I’d tried this but hadn’t found much improvement over ResNet34. Additionally, they used two extra loss functions and switched between them during training.
I’d used Smith’s 1-cycle learning-rate schedule for all training, whereas they had started with constant learning rates and then switched to cosine annealing with restarts to create a snapshot ensemble. Additionally, since ResNet requires input sizes according to powers of 2, they had padded the supplied 101×101 images to 128×128 (or padded them to 202×202 and rescaled to 256×256), whereas I had simply rescaled to 128×128, and then downscaled to progressively train on increasing resolutions (32×32 → 64×64 → 128×128). They also used a bit more data augmentation at train time than I had, and eventually used test-time augmentation as well; I’d had some trouble getting that to work, since the output masks need to be de-augmented to build the average response correctly.
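The de-augmentation step that caught me out is easy to sketch for a horizontal flip; here `predict` is a stand-in for a real model, and the masks are plain lists of lists rather than tensors:

```python
# Flip-based test-time augmentation for masks: the prediction made on
# the flipped input must itself be flipped back before averaging,
# otherwise the two masks don't line up.

def hflip(mask):
    """Horizontally flip a 2-D list-of-lists."""
    return [row[::-1] for row in mask]

def tta_predict(predict, image):
    plain = predict(image)
    # Predict on the flipped image, then de-augment the *output*.
    deaug = hflip(predict(hflip(image)))
    return [[(a + b) / 2 for a, b in zip(r1, r2)]
            for r1, r2 in zip(plain, deaug)]
```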
The key difference seemed to be that they trained three different sets of models, and used the output of each as training data for the next stage; a process called ‘pseudolabelling’. I have seen this technique discussed on forums and in Kaggle discussion threads, but haven’t quite understood it yet. It seems similar to ‘stacking’, which I’ve also never tried. Given that they apply the process twice, it looks to be pretty powerful, so I’ll definitely check it out. Their final models also use some more complex centre and upwards paths, and they linked to the relevant papers; more reading!
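My current understanding of the idea, sketched with a toy 1-D threshold “model” (this is my guess at the general recipe, not the winners’ exact procedure):

```python
# Pseudolabelling sketch: train on the labelled data, label the
# unlabelled data with the model's confident predictions, then retrain
# on the combined set. The threshold model is a stand-in for a real one.

def train_threshold(data):
    """Toy 1-D 'model': the midpoint between the two class means."""
    lo = [x for x, y in data if y == 0]
    hi = [x for x, y in data if y == 1]
    return (sum(lo) / len(lo) + sum(hi) / len(hi)) / 2

def pseudolabel_round(labelled, unlabelled, margin=0.5):
    t = train_threshold(labelled)
    # Keep only unlabelled points the model is confident about,
    # labelled with its own predictions...
    pseudo = [(x, int(x > t)) for x in unlabelled if abs(x - t) > margin]
    # ...then retrain on the original plus pseudolabelled data.
    return train_threshold(labelled + pseudo)
```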