For a recent assignment in the Machine Learning and Big Data course of my Master’s program, I was asked to add bagging, or bootstrap aggregating, to a digit recognition neural network. We were provided an example digit recognition network written in Python with Keras and Theano, and asked to enhance it to improve on the the default 0.984 accuracy.
I created a Juypter notebook on the deep learning docker image we are using for the course, and began tweaking the existing network to see what improvements I could achieve. I eventually arrived at a model I liked which was still largely similar to the initial example network with some adjustments to dropout and the application of max normalization to weights.
Using my “preferred” network, an ensemble of 10 networks were trained over 40 epochs with mini-batches
of 256. To create the ensemble, a dictionary of Keras models was created (
function creates my “preferred” network … it was my 8th experiemental network).
createBaggingModels helper function created our 10 network ensemble.
Note that it is limited to 26 submodels due to the use of the
ascii_lowercase array for naming.
This was sufficient for my use, but should probably be converted to integer for extensibility.
The ensemble was then run through the training logic. Each models’ weights are saved to the file system for re-use. The
function helps us sample with replacements for the training set specific to the iteration’s model.
Initially, only the averaging technique for aggregation was applied, but voting was added so
a comparison could be done. The code for the
BaggingVote aggregation functions
are shown below. They each follow a similar structure.
Numpy’s argmax function
provides a convenient method of extracting the predicted label but depends on the
result structure being an Numpy
ndarray of size 10 (0-9), where the index into the array cooresponds
to the digit label. Print statements
are still visible from my debugging phases, but these could be removed or converted
to “verbose” output.
results list passed into both functions was populated similar to the following code segment. Keras’
predict_proba function, which returns the raw network probabilities,
is used to preserve the pre-classified results for use later in the aggregating functions.
for loop provides each submodel in the ensemble an opportunity to generate it’s output (the ensemble model
is loaded into the
remodel dictionary with each submodel being an entry in the dictionary).
Finally, the helper functions are called to determine the overall performance of the ensemble model:
The above code segment produces output as shown below with 10 submodels in the ensemble:
Bagging w/ Averaging -------------------- Misses: 141.0 0.9859 Bagging w/ Voting ----------------- Misses: 143.0 0.9857
Somewhat as expected, when additional submodels were included in the ensemble, the accuracy improved. At 15 submodels, everything else equal, the following results were observed (out of 10,000 test samples). Although maybe not statistically significant, the averaging technique seems to do slightly better than voting for this classification use case.
Bagging w/ Averaging -------------------- Misses: 134.0 0.9866 Bagging w/ Voting ----------------- Misses: 136.0 0.9864
A 20 submodel ensemble was tested as well, and the results of the 3-way test are shown below. In this limited investigation, additional submodels quickly began to show diminishing marginal returns.