Cathie So, PhD

The Power of Ensemble Learning and Data Augmentation

Full code examples on the MNIST dataset with VGG16 | ResNet50 | FG-UNET | Majority Voting | Single Transferable Vote | Instant-Runoff Voting

Table of Contents

  1. Introduction (the MNIST dataset and the objective)
  2. The Models included in the Ensemble
     i. VGG16
     ii. ResNet50
     iii. FG-UNET
  3. Data Augmentation
  4. Ensemble Learning Methods
     i. Majority Voting
     ii. Ridge Regression Ensemble
     iii. Single transferable vote (STV)
     iv. Instant-runoff voting (IRV)
  5. Key Takeaway


As the first article on this blog, I decided to start with something simple: writing about an old digit recognition project I did on Kaggle back in the early days. The task was simple (recognizing the digits 0–9 in handwritten images) and already well solved by the ML community, but it was a good toy dataset to kickstart my Kaggle portfolio.

To learn more about the MNIST dataset, you can visit its page on TensorFlow Datasets. In short, the MNIST dataset contains images of handwritten digits that need to be classified into one of 10 classes (0 through 9).

Kaggle has a permanently running toy competition that provides a convenient platform for anyone to test their data science skills with the programming environment and model evaluation all set up. The metric used is the categorization accuracy, i.e. just the percentage of correct image predictions.

Below is the data description:

The data files train.csv and test.csv contain gray-scale images of hand-drawn digits, from zero through nine.
Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255, inclusive.
Handwritten numbers for illustrative purposes only (Photo by Pop & Zebra on Unsplash)

Note that the image dimensions are just 28 by 28 pixels, which is too small for a lot of models available in tf.keras.applications. We will assume that the data have already been preprocessed into arrays of shape (sample size, 32, 32, 1) by padding 2 rows and 2 columns of zeros on each side of the image for the rest of this article. The MNIST dataset on Kaggle has 42,000 training samples and 28,000 testing samples.
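The padding step described above can be sketched with NumPy as follows. This is a minimal illustration rather than the notebook's actual preprocessing: the helper name pad_to_32 is hypothetical, and it assumes the input is an array of flattened 784-pixel rows.

```python
import numpy as np

def pad_to_32(flat_pixels):
    """Reshape flat 784-pixel rows to 28x28x1 and zero-pad to 32x32x1."""
    images = np.asarray(flat_pixels, dtype=np.float32).reshape(-1, 28, 28, 1)
    # Pad 2 rows/columns of zeros on each side of the height and width axes only.
    return np.pad(images, ((0, 0), (2, 2), (2, 2), (0, 0)),
                  mode="constant", constant_values=0)

# Stand-in data: 5 "images" whose pixels are all 1.
dummy = np.ones((5, 784))
padded = pad_to_32(dummy)
print(padded.shape)
```

The border stays at zero, so the total pixel mass of each image is unchanged by the padding.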

The Models included in the Ensemble

i. VGG16 (98.80% accuracy)

Here is the complete Kaggle notebook implementing VGG16 (with data augmentation) on the MNIST dataset.

VGG16 was proposed by Simonyan and Zisserman (2014) as a submission to ILSVRC2014, achieving 92.7% top-5 test accuracy on ImageNet. The number 16 stands for the number of weight layers in the network (13 convolutional and 3 fully connected). There is also a variant of this model, VGG19, which has 19 weight layers instead.

The Keras API of TensorFlow ships a pre-trained VGG16 whose full model (including the classifier on top) only accepts an input size of 224x224. For the MNIST dataset, we are going to use the Keras API to create a VGG16 network with input size 32x32 and train it from scratch, as demonstrated with the code below.

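from tensorflow.keras import Model
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense

# Untrained VGG16 backbone without the ImageNet classifier head.
# Note: input_shape has 3 channels, so single-channel images must be expanded to 3 channels first (e.g. by repeating the channel).
vgg = VGG16(include_top=False, weights=None, input_shape=(32, 32, 3), pooling="max")
# New 10-way softmax head on top of the global max-pooled features.
x = Dense(10, activation='softmax', name='predictions')(vgg.output)
model = Model(inputs=vgg.input, outputs=x)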

10% of the samples were reserved for validation. The model was trained with the Adam optimizer on the categorical cross-entropy loss for up to 100 epochs, with early stopping on the validation loss (patience of 10 epochs).
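To make the stopping rule concrete, here is a minimal pure-Python sketch of patience-based early stopping. It only illustrates the idea behind an early stopping callback that monitors the validation loss; the helper name early_stopping_epoch and the loss values are made up for this example and do not come from the notebook.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return the 0-based epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss   # new best validation loss, reset the counter
            wait = 0
        else:
            wait += 1     # one more epoch without improvement
            if wait >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: ran all epochs

# Hypothetical loss curve: improves for 3 epochs, then plateaus.
losses = [1.0, 0.5, 0.4] + [0.45] * 20
stop = early_stopping_epoch(losses, patience=10)
print(stop)
```

With a cap of 100 epochs, this rule means training usually ends well before the cap once the validation loss plateaus.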

Along with data augmentation that will be described below, VGG16 achieved a categorical accuracy of 98.80% on the test set.

ii. ResNet50 (99.17% accuracy)

Here is the complete Kaggle notebook implementing ResNet50 (with data augmentation) on the MNIST dataset.

ResNet was proposed by He et al. (2016) and one of its variants won 1st place in ILSVRC2015. ResNet is characterized by skip connections, shortcuts in the network that skip over several layers, which were intended to tackle the issue of vanishing gradients. The number 50, again, represents the number of layers in the network.

The Keras API of TensorFlow also ships a pre-trained ResNet50 whose full model again only accepts an input size of 224x224. Similar to VGG16, we are going to train the model from scratch on the MNIST dataset with the code below.

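from tensorflow.keras import Model
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import Dense

# Untrained ResNet50 backbone without the ImageNet classifier head.
res = ResNet50(include_top=False, weights=None, input_shape=(32, 32, 3), pooling="max")
# New 10-way softmax head on top of the global max-pooled features.
x = Dense(10, activation='softmax', name='predictions')(res.output)
model = Model(inputs=res.input, outputs=x)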

With the same training parameters above, ResNet50 achieved a categorical accuracy of 99.17% on the test set.

iii. FG-UNET (97.93% accuracy)

Here is the complete Kaggle notebook implementing FG-UNET on the MNIST dataset.

I came across FG-UNET when working on a client’s project on satellite images and found it particularly interesting. The model was based on the U-Net architecture, which consists of several downsampling and upsampling paths to produce pixel-wise classification on biomedical images. FG-UNET attempts to borrow this architecture while retaining fine-grain information by adding a fully connected path without any downsampling or upsampling.

It was really just a fun experiment to modify FG-UNET to classify the MNIST dataset. In particular, we no longer need a pixel-wise classification, so the final layers were modified to flatten the 2D image and output to a dense layer of 10 classes. The specific kernel sizes and layers used can be found in the notebook mentioned above.

Without data augmentation, FG-UNET achieved a categorical accuracy of 97.93% on the test set.

Data Augmentation

Data augmentation is a proven technique to enhance model performance when the dataset is limited. By rotating, shifting, and shearing the input images, the network is pushed to learn a better internal representation of the handwritten digits, instead of overfitting to the training images.

The Keras API has an ImageDataGenerator class that conveniently feeds augmented input images to the model.

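from tensorflow.keras.preprocessing.image import ImageDataGenerator

val_rate = 0.1  # 10% of the samples are reserved for validation
# Random rotations (up to 30 degrees), shifts (up to 2 pixels) and shears (up to 30 degrees).
gen = ImageDataGenerator(rotation_range=30, width_shift_range=2, height_shift_range=2, shear_range=30, validation_split=val_rate)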
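To give a feel for what one of these transformations does, here is a small NumPy sketch of a random shift of up to 2 pixels. It is an illustration only and not how ImageDataGenerator is implemented: the function name random_shift is hypothetical, and the sketch fills the vacated border with zeros, which is an assumption made for simplicity.

```python
import numpy as np

def random_shift(image, max_shift=2, rng=None):
    """Shift a (H, W, C) image by a random whole number of pixels, zero-filling the gap."""
    rng = np.random.default_rng() if rng is None else rng
    dy, dx = (int(v) for v in rng.integers(-max_shift, max_shift + 1, size=2))
    h, w = image.shape[:2]
    shifted = np.zeros_like(image)
    # Copy the part of the image that stays inside the frame after shifting.
    src_y, dst_y = slice(max(0, -dy), min(h, h - dy)), slice(max(0, dy), min(h, h + dy))
    src_x, dst_x = slice(max(0, -dx), min(w, w - dx)), slice(max(0, dx), min(w, w + dx))
    shifted[dst_y, dst_x] = image[src_y, src_x]
    return shifted

# A 32x32 image whose content sits inside the inner 28x28 region.
image = np.zeros((32, 32, 1), dtype=np.float32)
image[2:30, 2:30, 0] = 1.0
augmented = random_shift(image, max_shift=2, rng=np.random.default_rng(0))
print(augmented.shape, augmented.sum())
```

Because the digits were padded with a 2-pixel border of zeros, a shift of at most 2 pixels never pushes any part of the digit out of the frame.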

Ensemble Learning Methods

Why ensemble learning? Well, two heads are better than one (in this case, three heads).

Say we have the three models from above, which have an error rate of 1.20%, 0.83%, and 2.07% respectively. The set of images that they fail to predict correctly may not overlap completely. One model could have difficulty distinguishing between 1 and 7, while another model may have difficulty telling between 3 and 8. If we can somehow combine the learning from all three models, we should be getting a better model than any of the three individually!

How do we combine these models? It turns out that there are many ways out there to employ ensemble learning, and we are going to discuss a few of them below. Here is the full ensemble learning notebook on Kaggle.

i. Majority Voting (99.35% accuracy)

The most straightforward ensemble method is just to let the models “vote”. If two or more out of the three models predict the same digit, there is a good chance that they are correct.

The issue with majority voting arises when there is no majority, especially given that we have a 10-way classification. There are a lot of possible ways to handle this case, but here we go with a simple one: if there is no majority, we go with ResNet50, since it has the highest single-model accuracy.

Out of the 28,000 test samples, only 842 (3.01%) did not have a unanimous prediction and required a vote. Of those 842 “elections”, only 37 (0.13% of all test samples) did not have a majority. This means that the fallback to ResNet50 applied to only a very small number of samples.
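The voting rule above can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions: the function name majority_vote is hypothetical, the predicted labels are made up, and the columns are assumed to be ordered VGG16, ResNet50, FG-UNET so that column 1 is the ResNet50 fallback.

```python
import numpy as np

def majority_vote(labels, fallback_index):
    """labels: (n_samples, n_models) array of predicted digits.

    Returns the digit predicted by more than half of the models, or the
    prediction of the fallback model when no such digit exists.
    """
    labels = np.asarray(labels)
    n_samples, n_models = labels.shape
    result = np.empty(n_samples, dtype=int)
    for s in range(n_samples):
        counts = np.bincount(labels[s], minlength=10)
        top = counts.argmax()
        if counts[top] * 2 > n_models:      # strict majority
            result[s] = top
        else:                               # no majority: trust the fallback model
            result[s] = labels[s, fallback_index]
    return result

# Made-up predictions for three samples (columns: VGG16, ResNet50, FG-UNET).
predicted = [[1, 1, 7],   # two models agree on 1
             [3, 8, 5],   # no majority, fall back to ResNet50
             [2, 2, 2]]   # unanimous
votes = majority_vote(predicted, fallback_index=1)
print(votes)
```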

With Majority Voting, the categorical accuracy on the MNIST dataset was boosted to 99.35%, higher than any of the three models alone.

ii. Ridge Regression Ensemble (99.33% accuracy)

Ridge Regression is more commonly known as a technique to regularize ordinary linear regression. However, it is also possible to create an ensemble with Ridge Regression by taking the individual model predictions as input and the class labels as output.

Each of the three models gives output as a probability distribution of the 10 digits. By concatenating the 3x10 probabilities, we get a vector X of 30 dimensions. For the output vector y, we simply encode the class label into one-hot vectors. By fitting a Ridge Regression to input X and output y, the regressor is going to predict a pseudo-probability vector of the 10 classes (pseudo because the vector may not be normalized).
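As a rough sketch of this setup, the snippet below fits a ridge regression in closed form on synthetic data. Everything here is an assumption made for illustration: the notebook may well use a library implementation instead, the helper names fit_ridge and fake_model_probs are hypothetical, the regularization strength alpha=1.0 is arbitrary, and the "model outputs" are noisy one-hot vectors rather than real predictions.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression without intercept: W = (X^T X + alpha*I)^-1 X^T Y."""
    gram = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ Y)

def fake_model_probs(labels, rng):
    """Synthetic stand-in for one model's softmax output (each row sums to 1)."""
    one_hot = np.eye(10)[labels]
    noise = rng.dirichlet(np.ones(10), size=len(labels))
    return 0.8 * one_hot + 0.2 * noise

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=500)

# Concatenate the three models' 10 probabilities into one 30-dimensional input.
X = np.concatenate([fake_model_probs(labels, rng) for _ in range(3)], axis=1)
Y = np.eye(10)[labels]                      # one-hot targets

W = fit_ridge(X[:400], Y[:400], alpha=1.0)  # fit on the first 400 samples
scores = X[400:] @ W                        # pseudo-probabilities, not normalized
ensemble_pred = scores.argmax(axis=1)       # predicted digit = highest score
accuracy = (ensemble_pred == labels[400:]).mean()
print(X.shape, W.shape, accuracy)
```

The final prediction is simply the class with the largest pseudo-probability, so the lack of normalization does not matter for classification.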

A Ridge Regression Ensemble boosted the categorical accuracy on the MNIST dataset to 99.33%.

iii. Single transferable vote (STV, 99.36% accuracy)

For the next two ensemble methods, we are going to use a handy module called PyRankVote. PyRankVote was created by Jon Tingvold in June 2019 and implements Single transferable vote (STV), Instant-runoff voting (IRV), and Preferential block voting (PBV).

In STV, each model marks a “ballot” for their ranked choices, in descending order of predicted probabilities. At each round, the least popular “candidate” is eliminated. The vote goes to the model’s top preference whenever possible, but if their first preference is eliminated, then the vote will go to their second, third, fourth choice, etc. The procedure runs recursively until one winner is left.

Using PyRankVote, we can implement STV in just a few lines of code:

import numpy as np
import pyrankvote
from pyrankvote import Candidate, Ballot

# X_pred[i, :, j] is the predicted probability distribution of model j for test sample i
candidates = [Candidate(d) for d in range(10)]
ballots = []
for j in range(3):
    ballot = np.argsort(X_pred[i, :, j])  # digits sorted by ascending probability
    ballot = np.flip(ballot)  # flip to descending order (most likely digit first)
    ballots.append(Ballot(ranked_candidates=[candidates[d] for d in ballot]))
election_result = pyrankvote.single_transferable_vote(candidates, ballots, number_of_seats=1)
winners = election_result.get_winners()

STV achieved a categorical accuracy of 99.36% on the Kaggle test set of the MNIST dataset.

iv. Instant-runoff voting (IRV, 99.35% accuracy)

Similar to STV, each model also votes for a list of ranked choices in IRV. If a “candidate” obtains a simple majority (over half) of votes in the first round, that candidate wins. Otherwise, the least popular “candidate” is eliminated and those who voted for the least popular one first will have their votes counted toward their next choice instead.
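The elimination procedure can be sketched in plain Python as below. This is an illustration of the IRV idea, not PyRankVote's implementation: the function name instant_runoff and the probabilities are made up, and ties for last place are broken by eliminating the digit with the lowest summed probability, which is an assumption of this sketch (PyRankVote has its own tie-breaking rules).

```python
import numpy as np

def instant_runoff(probs):
    """probs: (n_models, 10) array; each row is one model's probability distribution."""
    probs = np.asarray(probs)
    # Each model's ballot ranks the digits from most to least likely.
    ballots = [[int(d) for d in np.argsort(-p)] for p in probs]
    total = probs.sum(axis=0)
    remaining = set(range(probs.shape[1]))
    while True:
        # Each ballot counts toward its highest-ranked digit still in the race.
        tally = {d: 0 for d in remaining}
        for ballot in ballots:
            top = next(d for d in ballot if d in remaining)
            tally[top] += 1
        leader = max(sorted(tally), key=lambda d: tally[d])
        if tally[leader] * 2 > len(ballots) or len(remaining) == 1:
            return leader
        # Eliminate the least popular digit (tie-break: lowest summed probability).
        fewest = min(tally.values())
        tied = [d for d in sorted(tally) if tally[d] == fewest]
        remaining.remove(min(tied, key=lambda d: total[d]))

def make_probs(first, p1, second, p2):
    """Hypothetical distribution with two likely digits and the rest spread evenly."""
    p = np.full(10, (1.0 - p1 - p2) / 8)
    p[first], p[second] = p1, p2
    return p

# Three models with three different first choices (3, 8 and 5).
example = [make_probs(3, 0.6, 8, 0.3),
           make_probs(8, 0.5, 3, 0.4),
           make_probs(5, 0.5, 8, 0.4)]
winner = instant_runoff(example)
print(winner)
```

In this made-up example no digit has a first-round majority. Digit 5 is eliminated, its vote transfers to that model's second choice, 8, and 8 wins with two of the three votes.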

From our experience with the Majority Voting method, we know that only 37 out of 28,000 test samples do not have a simple majority winner. Hence, only these 37 samples will be affected by IRV.

IRV achieved the same categorical accuracy of 99.35% as the Majority Voting method.

Key Takeaway

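Here is a table summarizing all the model performances:

Model                                   Test accuracy
VGG16 (with data augmentation)          98.80%
ResNet50 (with data augmentation)       99.17%
FG-UNET (without data augmentation)     97.93%
Majority Voting                         99.35%
Ridge Regression Ensemble               99.33%
Single transferable vote (STV)          99.36%
Instant-runoff voting (IRV)             99.35%

We observe that the ensemble model performances are all similar, ranging between 99.3% and 99.4%. The key takeaway of this article is not whether one ensemble method is superior to another, but that all ensemble models can achieve higher accuracy than any of the three individual models alone.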

In academic research, we often focus on finding the best model for a specific machine learning task because of the pressure to publish, and the elegance of a standalone best-performing model would be more easily accepted by journals. In the industry, previous iterations of models that are not ultimately deployed in the final product are often considered sunk costs. If resources permit, deploying an ensemble of several well-performing models could give a better performance than a single best-performing model, and help a company better achieve its business objectives.


VGG16: Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

ResNet50: He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778).

FG-UNET: Stoian, A., Poulain, V., Inglada, J., Poughon, V., & Derksen, D. (2019). Land cover maps production with high resolution satellite image time series and convolutional neural networks: Adaptations and limits for operational systems. Remote Sensing, 11(17), 1986.

Data Augmentation: Shorten, C., & Khoshgoftaar, T. M. (2019). A survey on image data augmentation for deep learning. Journal of Big Data, 6(1), 1–48.

About the author: I am a freelance data scientist and ML engineer. I provide custom solutions for your company’s data science and machine learning needs. For more info, please visit


CC BY-NC-ND 2.0 copyright notice