Deep Neural Networks for Image Classification

Author: Paul A. Beata
GitHub: pbeata


The data used for this project is from the MNIST set of handwritten digits available in the TensorFlow Datasets module.

This project uses the MNIST data set for a deep learning application: handwritten digit recognition. The data set provides 70,000 grayscale images (each of size 28x28 pixels) of handwritten digits, with one digit per image.

The goal of this project is to write an algorithm that detects which digit is written in each input image. Since there are only 10 possible digits {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, this is a classification task with 10 classes.

[Image: sample handwritten digit from the MNIST data set (number_image.JPG)]

Python Modules

For this deep learning application, we will employ TensorFlow to make use of the Keras neural networks methods.

The data set for this application comes from the "tensorflow-datasets" module; therefore, if you do not have it installed on your system, you must first run one of the following install commands:
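The original install commands are not preserved here; a typical way to install the module (assuming a standard pip setup, not taken verbatim from the original) is:

```shell
# Install TensorFlow and the TensorFlow Datasets module via pip
pip install tensorflow tensorflow-datasets
```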

These datasets will be stored in the path "C:\Users\USERNAME\tensorflow_datasets\" (on Windows). The first time you download a dataset, it is stored in its respective folder; on every subsequent run it will be loaded automatically from the local copy on your machine.

Part 1: Data Preprocessing

Here we will perform four main tasks on the MNIST number recognition dataset:

  1. Load the data from the tensorflow-datasets module
  2. Split the data into training, validation, and testing subsets
  3. Scale the data (more specifically, normalize each image's pixels to the range [0, 1])
  4. Shuffle the split and scaled data

1. Load MNIST Data

The method tfds.load actually loads a dataset (or downloads and then loads it if it is the first time we are using this data) from the tensorflow-datasets module. In this case, we are interested in the MNIST data set.

Using the optional input "as_supervised=True" will load the data set in a 2-tuple structure: (input, target). This separates the input data from the target data for us automatically and is simply a matter of convenience.

2. Train-Validation-Test Proportions

The data set comes pre-split into TRAIN and TEST subsets. First, we can check the size of each split using the "mnist_info" data structure, which holds metadata about the loaded data.

By default, the MNIST data set in TensorFlow has pre-labeled "train" and "test" datasets, but there is no validation subset. Therefore, we will split it on our own here in order to generate a validation subset from the training data.

Note: we will perform the actual split after scaling, in the next step. Here we only compute the relative proportions of each data subset.
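The proportion calculation can be sketched as follows. The split sizes (60,000 train / 10,000 test) come from the MNIST metadata; in the notebook they would be read from `mnist_info` rather than hard-coded as they are here:

```python
import tensorflow as tf

# Split sizes for MNIST (in practice, read from mnist_info.splits).
num_train_examples = 60_000
num_test_examples = 10_000

# Reserve 10% of the training data for validation; cast to an integer count.
num_validation_samples = tf.cast(0.1 * num_train_examples, tf.int64)
num_test_samples = tf.cast(num_test_examples, tf.int64)
```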

3. Scaling the Data

Here we will scale the data in order to make the results more numerically stable. In this case, using image data with pixel values between [0, 255], we will simply normalize each image's pixels such that all the inputs are between 0 and 1 instead.
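A minimal sketch of such a scaling function (the function name `scale` is an assumption):

```python
import tensorflow as tf

def scale(image, label):
    # Cast pixel values to float and normalize from [0, 255] to [0, 1].
    image = tf.cast(image, tf.float32)
    image = image / 255.0
    return image, label
```

Applied to a dataset of (image, label) pairs with `dataset.map(scale)`, this transforms every sample lazily as it is read.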

4. Shuffle and Perform Split

We use the BUFFER_SIZE parameter here for cases when we are dealing with very large data sets. In those cases, we cannot shuffle the whole data set at once because it does not all fit in memory. Instead, TensorFlow keeps only BUFFER_SIZE samples in memory at a time and shuffles within that buffer. Note: if BUFFER_SIZE=1, then no shuffling actually happens, while if BUFFER_SIZE >= the number of samples, shuffling is uniform across the whole data set. Choosing a BUFFER_SIZE in between is a computational optimization that approximates uniform shuffling. TensorFlow already provides a shuffle method; we just need to specify the buffer size to use it here.

Shuffle

Split

Our validation data will be equal to 10% of the training set, which we've already calculated above.

We can also take this opportunity to batch the training data. This is helpful during training because we will be able to iterate over the different batches in the training data set.
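The shuffle, split, and batch steps above can be sketched as follows. The synthetic 100-sample dataset stands in for the scaled MNIST training data, and the variable names and constants are illustrative assumptions:

```python
import tensorflow as tf

# Illustrative stand-in for the scaled MNIST training set (100 fake images).
images = tf.random.uniform([100, 28, 28, 1])
labels = tf.random.uniform([100], maxval=10, dtype=tf.int64)
scaled_train_data = tf.data.Dataset.from_tensor_slices((images, labels))

BUFFER_SIZE = 10_000   # larger than this data set, so shuffling is uniform
BATCH_SIZE = 10
num_validation_samples = 10   # 10% of the 100 samples

# Shuffle first, then carve the validation subset off the front.
shuffled_data = scaled_train_data.shuffle(BUFFER_SIZE)
validation_data = shuffled_data.take(num_validation_samples)
train_data = shuffled_data.skip(num_validation_samples)

# Batch the training data; the validation set goes through as one batch.
train_data = train_data.batch(BATCH_SIZE)
validation_data = validation_data.batch(num_validation_samples)
```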

Part 2: Deep Learning Model

First, we will outline the model to set up the machine learning algorithm for training.

The MNIST data set contains thousands of images whose pixels define each handwritten digit. We can see here that the "image" object is a 28x28 tensor and the output ("label") has 10 classes. In other words, the input for each image in the set is a collection of 28*28 = 784 pixels, and the output size is 10 for the 10 possible digits to be predicted by the algorithm: {0, 1, ..., 9}.

One of the hyperparameters that we can optimize for the neural network is the size of the hidden layers. In this application, we will use a single size for all the hidden layers. However, since we don't know the optimal hidden layer size, we will perform an optimization to try to find the best one.

Model Definition

Choose the optimizer and the loss function for the model:
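A sketch of the initial model and its compilation step, assuming the depth-2, width-50 architecture described later in the text; the exact optimizer and loss here (Adam, sparse categorical cross-entropy) are a common choice for this setup rather than a confirmed detail of the original:

```python
import tensorflow as tf

input_size = 784      # 28 * 28 pixels per image
output_size = 10      # digits 0 through 9
hidden_layer_size = 50

# Two hidden layers of width 50 (the "depth = 2" model from the text).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Flatten(),                 # 28x28x1 image -> 784 vector
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'),
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'),
    tf.keras.layers.Dense(output_size, activation='softmax'),
])

# Sparse categorical cross-entropy fits integer class labels (not one-hot).
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```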

Training

This is where we train the first model we have built.
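The training call can be sketched as below. A tiny synthetic dataset stands in for the real batched MNIST pipeline, and the epoch count is shortened; on the real data the text trains for roughly 20 epochs with `validation_data` passed to `fit` as well:

```python
import tensorflow as tf

# Synthetic stand-in for the batched MNIST training pipeline (shapes only).
images = tf.random.uniform([64, 28, 28, 1])
labels = tf.random.uniform([64], maxval=10, dtype=tf.int64)
train_data = tf.data.Dataset.from_tensor_slices((images, labels)).batch(16)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(50, activation='relu'),
    tf.keras.layers.Dense(50, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

NUM_EPOCHS = 2   # shortened for the sketch
history = model.fit(train_data, epochs=NUM_EPOCHS, verbose=0)
```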

Testing the Initial Model

For the validation data set, we achieved a classification accuracy of 98.4% by the 18th epoch of training. The resulting model produced a testing accuracy of 97%. While these results are good, they were made possible by pre-existing knowledge that these hyperparameters (e.g., a NN depth of 2 and a hidden layer size of 50) would lead to good results. In the next section, we will try various combinations of hyperparameters, assuming no prior knowledge, to find the best combination.
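Evaluation on the held-out test set is a single call; the sketch below uses a synthetic test batch and an untrained model purely to show the shape of the API (on the real pipeline, `test_data` is the batched MNIST test set and `model` is the trained network):

```python
import tensorflow as tf

# Synthetic stand-in for the batched MNIST test set.
test_images = tf.random.uniform([32, 28, 28, 1])
test_labels = tf.random.uniform([32], maxval=10, dtype=tf.int64)
test_data = tf.data.Dataset.from_tensor_slices(
    (test_images, test_labels)).batch(32)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(50, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Returns the loss plus each compiled metric, in order.
test_loss, test_accuracy = model.evaluate(test_data, verbose=0)
print(f'Test accuracy: {test_accuracy:.1%}')
```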

Part 3: Optimization

Even though we achieved 97% accuracy on the original NN model above, we had some insight already from previous studies to choose hyperparameters that would lead to this performance. Now we will optimize assuming we know nothing about the "best" hyperparameters to use.

Depth = 2

Perform the training and testing using a NN with a depth of 2 again, but consider all the possible combinations of activation functions and hidden layer size (model "width").
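One way to sketch this combinatorial search (the grid values here are hypothetical; the text reports results for widths up to 256 and the relu/tanh/sigmoid activations). The fit/evaluate calls are left as comments since running the full grid on MNIST is expensive:

```python
import itertools
import tensorflow as tf

# Hypothetical search grid over activation functions and layer widths.
activations = ['relu', 'tanh', 'sigmoid']
widths = [16, 64, 256]

results = {}
for act, width in itertools.product(activations, widths):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28, 28, 1)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(width, activation=act),   # hidden layer 1
        tf.keras.layers.Dense(width, activation=act),   # hidden layer 2
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    # On the real data: model.fit(train_data, ...) followed by
    # model.evaluate(test_data) to record the accuracy per combination.
    results[(act, width)] = model
```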

With the two hidden layers of size 256 and using "relu" activation functions for each one, we achieve a testing accuracy of 98.1%.

Depth = 4

Increase the depth of the NN to 4 and only consider widths from 4 to 64 this time.

We achieved a max classification accuracy of 97.7% this time.

The activation functions of "relu" and "tanh" are better than "sigmoid" for every test. We can increase the width again and use one of these functions to see if the accuracy improves.

Testing Accuracy: 97.5%

Testing Accuracy: 98.1%

Depth = 8

From the results above, it seems that we can drop the sigmoid activation function, exclude the hidden layer size of 4, and add a width of 128.

Randomize the Activation Functions

By randomly sampling the possible options for the activation functions and using a depth of 8, the best test accuracy was 97.5% for a hidden layer width of 128. We will train one more model, both with and without random sampling of the activation functions, this time with a width of 256.
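The random sampling of activation functions can be sketched as drawing one activation per hidden layer; the option list reflects the text's conclusion that relu and tanh outperform sigmoid (variable names are illustrative):

```python
import random

random.seed(0)   # fixed seed so the sketch is reproducible

activation_options = ['relu', 'tanh']
depth = 8        # number of hidden layers
width = 256      # hidden layer size for the final run

# Independently sample one activation function for each hidden layer;
# these would then be passed to the Dense layers when building the model.
layer_activations = [random.choice(activation_options) for _ in range(depth)]
print(layer_activations)
```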