Here we will introduce convolutional neural networks (CNNs) and present an architecture of a CNN that we will train to recognize coins.
What is a CNN? As we mentioned in the previous article of the series, a CNN is a class of neural network (NN) frequently used for image classification tasks, such as object and face recognition, and more generally for problems where the input has a grid-like topology. In a CNN, not every node is connected to all nodes of the next layer. This partial connectivity helps prevent the overfitting issues that come up in fully connected NNs, and it speeds up the convergence of the NN.
The core concept behind CNNs is a mathematical operation known as convolution, which is very common in the field of digital signal processing. Convolution combines two functions into a third one: one function is flipped and shifted across the other, and at every shift the products of the overlapping values are summed (or integrated), so the result measures how much the two overlap at that shift. In CNNs, convolution is achieved by sliding a filter, known as a kernel, across the image and computing this sum of products at every position.
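To make the sliding-kernel idea concrete, here is a minimal sketch in plain C# (no Keras.NET involved). The class and method names are hypothetical and chosen only for illustration. It assumes a single-channel image, no padding, and a stride of 1. Like most CNN libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```csharp
public static class ConvolutionSketch
{
    // Slides `kernel` over `image` (no padding, stride 1) and, at each position,
    // sums the element-wise products of the kernel and the image patch under it.
    public static float[,] Convolve(float[,] image, float[,] kernel)
    {
        int kh = kernel.GetLength(0), kw = kernel.GetLength(1);
        int outH = image.GetLength(0) - kh + 1;
        int outW = image.GetLength(1) - kw + 1;
        var output = new float[outH, outW];

        for (int y = 0; y < outH; y++)
        {
            for (int x = 0; x < outW; x++)
            {
                float sum = 0f;
                for (int i = 0; i < kh; i++)
                    for (int j = 0; j < kw; j++)
                        sum += image[y + i, x + j] * kernel[i, j];
                output[y, x] = sum;
            }
        }
        return output;
    }
}
```

For example, a 3×3 kernel whose rows are all (-1, 0, 1) responds strongly where pixel values increase from left to right, which is how a vertical edge shows up, and produces 0 over uniform regions.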
In object recognition, the convolution operation allows us to detect different features in the image, such as vertical and horizontal edges, textures, and curves. This is why one of the first layers in any CNN is a convolutional layer.
Another layer common in CNNs is the pooling layer. Pooling is used to reduce the size of the image representation, which translates to a reduction in the number of parameters and, ultimately, the computational effort. The most common type of pooling is max pooling, which uses a sliding window similar to the one in the convolution operations to harvest, in every location, the maximum value from the group of cells being matched. At the end, it builds a new representation of the image from the harvested maximum values.
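The max pooling step described above can be sketched as follows. This is an illustrative, hand-written version with hypothetical names, not the Keras.NET implementation. It assumes a single-channel input and non-overlapping windows, and it drops any rows or columns that do not fill a complete window.

```csharp
using System;

public static class PoolingSketch
{
    // Non-overlapping max pooling: each `size` x `size` window of the input
    // contributes its maximum value to the output.
    public static float[,] MaxPool(float[,] input, int size = 2)
    {
        int outH = input.GetLength(0) / size;
        int outW = input.GetLength(1) / size;
        var output = new float[outH, outW];

        for (int y = 0; y < outH; y++)
        {
            for (int x = 0; x < outW; x++)
            {
                float max = float.NegativeInfinity;
                for (int i = 0; i < size; i++)
                    for (int j = 0; j < size; j++)
                        max = Math.Max(max, input[y * size + i, x * size + j]);
                output[y, x] = max;
            }
        }
        return output;
    }
}
```

With a (2, 2) window, a 4×4 input becomes a 2×2 output, so the next layer has a quarter as many values to process.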
Another concept related to convolution is padding. Padding ensures that the convolution is applied across the entire image, including the border pixels. It does so by adding a border of zero-valued pixels around the input of the convolutional layer (the original image, or the downscaled representation produced by pooling) so that the sliding window can be centered on border pixels as well. With "same" padding and a stride of 1, the output of the convolution keeps the width and height of its input.
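As a rough illustration of zero padding, the sketch below (hypothetical names, single-channel input) surrounds a 2D array with a border of zeros. For a kernel of size k and a stride of 1, an input of size n padded by p pixels per side gives an output of size n + 2p - k + 1, so a 3×3 kernel with p = 1 preserves the input size.

```csharp
public static class PaddingSketch
{
    // Returns a copy of `input` surrounded by `pad` rows and columns of zeros.
    public static float[,] ZeroPad(float[,] input, int pad)
    {
        int h = input.GetLength(0), w = input.GetLength(1);

        // New arrays are zero-initialized, so only the interior needs to be copied.
        var padded = new float[h + 2 * pad, w + 2 * pad];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                padded[y + pad, x + pad] = input[y, x];
        return padded;
    }
}
```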
The most common CNN architectures typically start with a convolutional layer, followed by an activation layer, then a pooling layer, and end with a traditional fully connected network such as a multilayer NN. This type of model, where layers are placed one after the other, is known as a sequential model. Why a fully connected network at the end? To learn a non-linear combination of features in the transformed image (after convolution and pooling).
Here is the architecture we’ll implement in our CNN:
- Conv2D layer – 32 filters, filter size of 3
- Activation layer using the ReLU function
- Conv2D layer – 32 filters, filter size of 3
- Activation layer using the ReLU function
- MaxPooling2D layer – Applies the (2, 2) pooling window
- Dropout layer, at 25% – Prevents overfitting by randomly dropping some of the values from the previous layer (setting them to 0); a.k.a. the dilution technique
- Conv2D layer – 64 filters, filter size of 3
- Activation layer using the ReLU function
- Conv2D layer – 64 filters, filter size of 3, and stride of 3
- Activation layer using the ReLU function
- MaxPooling2D layer – Applies the (2, 2) pooling window
- Dropout layer, at 25%
- Flatten layer – Transforms the data to be used in the next layer
- Dense layer – Represents a fully connected traditional NN with 512 nodes
- Activation layer using the ReLU function
- Dropout layer, at 50%
- Dense layer, with the number of nodes matching the number of classes in the problem – 60 for the coin image dataset used
- Softmax layer
The proposed architecture follows a common pattern for object recognition CNNs; the layer parameters were fine-tuned experimentally.
Some of the parameter values that resulted from this fine-tuning are stored in the Settings class shown here:
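To get a feel for how the layers listed above reshape the data, the following sketch traces the width and height of the representation through the stack. It is only an estimate under stated assumptions: the helper names are hypothetical, every Conv2D layer is assumed to use "same" padding, pooling windows are assumed not to overlap, and activation and dropout layers are skipped because they do not change the shape.

```csharp
public static class ShapeTrace
{
    // Output size along one dimension for "same" padding: ceil(size / stride).
    public static int SameConv(int size, int stride = 1) => (size + stride - 1) / stride;

    // Output size for a non-overlapping pooling window: floor(size / window).
    public static int Pool(int size, int window = 2) => size / window;

    // Number of values produced by the Flatten layer for a square input of side `inputSize`.
    public static int FlattenedSize(int inputSize)
    {
        int s = SameConv(inputSize);   // Conv2D, 32 filters
        s = SameConv(s);               // Conv2D, 32 filters
        s = Pool(s);                   // MaxPooling2D (2, 2)
        s = SameConv(s);               // Conv2D, 64 filters
        s = SameConv(s, stride: 3);    // Conv2D, 64 filters, stride of 3
        s = Pool(s);                   // MaxPooling2D (2, 2)
        return s * s * 64;             // 64 feature maps of s x s values each
    }
}
```

Under these assumptions, a 64×64 input shrinks to 5×5 before flattening, so the first Dense layer would receive 5 × 5 × 64 = 1600 values.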
using Keras;            // StringOrInstance (assumed Keras.NET namespace)
using Keras.Optimizers; // RMSprop (assumed Keras.NET namespace)

public class Settings
{
    // Input image size and pixel value range
    public const int ImgWidth = 64;
    public const int ImgHeight = 64;
    public const int MaxValue = 255;
    public const int MinValue = 0;
    public const int Channels = 3;

    // Training parameters
    public const int BatchSize = 12;
    public const int Epochs = 10;
    public const int FullyConnectedNodes = 512;
    public const string LossFunction = "categorical_crossentropy";
    public const string Accuracy = "accuracy";
    public const string ActivationFunction = "relu";
    public const string PaddingMode = "same";

    // Optimizer: RMSprop with a small learning rate and decay
    private const float Lr = 0.0001f;
    private const float Decay = 1e-6f;
    public static StringOrInstance Optimizer = new RMSprop(lr: Lr, decay: Decay);
}
We now have an architecture for our CNN. In the next article, we will examine the CNN we implemented for coin recognition using Keras.NET.