Convolutional Neural Network and Keras Mixed Data Input (Part 1) — by Sharath Manjunath

6 min read · Jun 16, 2021

Artificial Intelligence has made significant progress in closing the gap between human and computer capabilities. Researchers work on many aspects of this field to achieve remarkable results, and computer vision is one of them.

This discipline aims to enable machines to see and interpret the world the way people do, and to use that understanding for tasks such as image and video recognition, image analysis, and classification. Advances in computer vision with deep learning have been built and refined over time, largely through a single algorithm: the Convolutional Neural Network (CNN).

1. Introduction

The CNN Architecture (Courtesy: EDX)

A CNN is a deep learning architecture that takes an image as input, assigns importance to various aspects/objects in the image, and learns to distinguish between them. Compared with other classification algorithms, a CNN requires significantly less pre-processing. While filters in primitive methods are hand-engineered, a CNN can learn these filters on its own given enough training.

2. Feed Forward Network

Flattening of a 3x3 image matrix into a 9x1 vector

Isn’t a picture just a matrix of pixel values? Why not simply flatten the image (for example, a 3x3 image matrix into a 9x1 vector) and pass it to a Multi-Level Perceptron for classification? Not quite.

For class prediction on very simple binary images, this approach may achieve an average precision score, but for complex images with pixel dependencies throughout, it will have little to no accuracy.

Through the application of appropriate filters, a CNN can successfully capture the spatial and temporal dependencies in an image. Thanks to the reduced number of parameters and the reusability of weights, the architecture fits the image dataset better. In other words, the network can be trained to better grasp the image’s complexity.

3. Convolutional Layer

Example:

Koala bear

Image Size = 1920 x 1080 x 3

Here, 3 refers to the RGB channels.

a. First-layer neurons = 1920 x 1080 x 3 ≈ 6 million

b. Suppose the hidden layer has ≈ 4 million neurons

c. Just between the input and hidden layers, that is 6 million × 4 million ≈ 24 trillion weights to compute.

Note: Deep neural networks have many hidden layers, and the total number of weights grows even larger. Doesn’t this consume an enormous amount of computational power?

The disadvantages of feed-forward networks:

  1. Too Much Computation
  2. Sensitive to the location of the object in an image.
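The computation problem above can be made concrete with a quick back-of-the-envelope calculation (a sketch; the layer sizes are the example figures from this article):

```python
# Back-of-the-envelope parameter count for a flattened-image MLP.
# The layer sizes below are the article's example figures.
inputs = 1920 * 1080 * 3        # first-layer neurons: ~6.2 million
hidden = 4_000_000              # assumed hidden-layer size
weights = inputs * hidden       # weights between input and hidden layers

print(f"{weights:,}")           # ~2.5e13: tens of trillions of weights
```

Even before adding more hidden layers, a single dense connection at this image size is computationally hopeless, which is exactly the problem convolution solves.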

How Do Humans Recognize These Images?

Different Parts of Koala bear image

When we look at the koala image, we notice small features such as round eyes, a black nose, and fluffy ears. Different neurons in the brain fire on these features, and their results are aggregated to recognize the koala’s head. Similarly, we look for the hands and legs and recognize the koala’s body, and from these features our brain finally concludes that this is an image of a koala.

But how can we make computers recognize these tiny features?

Extracting different positions

There are small edges in the first three rows from the top that form a loopy circle pattern, which is the “head” of the digit 9.

In the middle we have a vertical line, and at the bottom a diagonal line.

If there is a loopy circle at the top, a vertical line in the middle, and a diagonal line at the bottom, the digit is said to be a 9. But how can we make computers recognize these patterns?

In order to do this, we use the concept of filters, which act as feature detectors.

Applying filters

In the case of 9, we use three filters: the loopy circle filter for the head, the vertical line filter for the middle, and the diagonal line filter for the bottom.

In the image below, we take our original image, a 7x7 matrix, and perform a convolution (filter) operation with the head filter, i.e., the loopy circle pattern, in order to get a feature map.

Multiply a 3x3 subset of the original image element-wise with the filter, record the average in the feature map, and repeat while traversing the entire image. Here I have used a stride of 1 and a 3x3 filter.

Note: the stride can be any number and the filter can be any size.

So, wherever the feature map contains a 1 (or a number close to 1), a loopy circle pattern exists at that location. In the case of the koala, the eye, nose, and ears are the filters.

Original image * Filter = Feature Map
How feature space gets updated
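The filter operation described above can be sketched in NumPy (an illustrative implementation using the article’s averaging convention, a stride of 1, and a 3x3 filter; the loopy filter values are made up for demonstration):

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image; each output cell is the average
    of the element-wise product of the kernel and the image patch."""
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.mean(patch * kernel)
    return out

# A 7x7 binary image convolved with a 3x3 filter gives a 5x5 feature map.
image = np.random.randint(0, 2, size=(7, 7))
loopy_filter = np.array([[0, 1, 0],
                         [1, 0, 1],
                         [0, 1, 0]])
feature_map = convolve2d(image, loopy_filter)
```

With a 7x7 input, a 3x3 filter, and stride 1, the output is (7 − 3)/1 + 1 = 5 cells per side, matching the 5x5 feature map in the figure.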

In general, the loopy circle is detected at the top for the number 9, at the bottom for 6, and at both the top and bottom for 8.

Different position of eye

In the case of the koala image, the eye detector can fire at different locations; detection is location-invariant because the filter is moved across the entire image. In the last image, it detected an eye at three different locations because there are three koala bears.

Different filters for different features

In the same way, there can be a nose detector and an ear detector, and we can apply the convolution operation again to aggregate all these filters into a head detector.

The filter can be of any dimension; it is mapped over all the features to form a feature space that detects the head of the koala.

Similarly, there can be a koala body detector, and we then combine the head detector and body detector in order to classify the image as koala or not.

Once the features have been extracted, they are flattened into a single vector and connected to a fully connected layer that identifies whether the image is a koala or not.

Remember: the vector produced by the flatten layer changes based on the location of the features.
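To see why the flattened vector depends on feature location, compare two toy feature maps in NumPy (the maps below are illustrative values, not from the article):

```python
import numpy as np

# The same vertical-line feature, detected at two different columns.
map_left  = np.array([[1, 0, 0],
                      [1, 0, 0],
                      [1, 0, 0]])
map_right = np.array([[0, 0, 1],
                      [0, 0, 1],
                      [0, 0, 1]])

# Flattening each 3x3 map yields a 9-element vector, but the vectors
# differ because the feature sits at a different position.
v1 = map_left.flatten()
v2 = map_right.flatten()
```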

Aggregating the features

4. ReLU Activation Function

Applying ReLU activation function

This is not the complete convolutional neural network; we also perform a ReLU operation to introduce non-linearity into the model. It takes the feature map and changes every negative value to zero.

But we still have not addressed the computation problem!

If we process the same image, we are still calculating millions of weights, and we have not reduced the size of the image.

Note: ReLU outputs 0 if the value is negative and returns the value unchanged otherwise. For hidden layers, if you don't know which activation function to use, just use ReLU.
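ReLU is simple to express in NumPy (a minimal sketch; the feature-map values are made up for illustration):

```python
import numpy as np

def relu(x):
    # Negative values become 0; non-negative values pass through unchanged.
    return np.maximum(0, x)

feature_map = np.array([[ 0.5, -0.2],
                        [-1.0,  0.8]])
activated = relu(feature_map)   # [[0.5, 0.0], [0.0, 0.8]]
```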

5. Pooling Layer

Pooling after Convolution

The pooling layer is used to reduce the size of the image.

Two types of pooling can be done:

  1. Max Pooling
  2. Average Pooling

For example, take the feature map, apply max/average pooling, and generate the new feature map as shown in the image on the left.

Therefore, for detecting the number 9, we get the following feature map after it has been fed through convolution and pooling. Pooling combined with convolution helps with position-invariant feature detection.
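Max pooling as described can be sketched in NumPy (illustrative; a 2x2 window with stride 2, which is the most common choice):

```python
import numpy as np

def max_pool(fmap, size=2, stride=2):
    """Keep the maximum of each window, shrinking the feature map."""
    oh = (fmap.shape[0] - size) // stride + 1
    ow = (fmap.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = fmap[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = window.max()
    return out

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 7, 8],
                 [3, 2, 1, 0],
                 [1, 2, 3, 4]])
pooled = max_pool(fmap)         # 4x4 -> 2x2: [[6, 8], [3, 4]]
```

Average pooling is identical except that `window.max()` becomes `window.mean()`.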

Benefits of Pooling:

  1. Reduce dimension and computation
  2. Reduce Overfitting as there are less parameters
  3. Model is tolerant to variations and distortions

Overall Architecture of Koala Feature extraction and Classification
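The overall pipeline of convolution + ReLU, pooling, flattening, and a fully connected classifier can be assembled in Keras. This is a minimal sketch, not the article's exact model: the input size (64x64x3) and filter count (16) are assumed toy values.

```python
from tensorflow.keras import layers, models

# Minimal sketch of the pipeline described above.
# Input size (64x64x3) and filter count (16) are illustrative assumptions.
model = models.Sequential([
    layers.Input(shape=(64, 64, 3)),
    layers.Conv2D(16, (3, 3), activation="relu"),  # feature extraction + ReLU
    layers.MaxPooling2D((2, 2)),                   # shrink the feature maps
    layers.Flatten(),                              # 31x31x16 -> 15376-element vector
    layers.Dense(1, activation="sigmoid"),         # koala or not
])
```

Note how few parameters the convolutional layer has (448: sixteen 3x3x3 filters plus biases) compared with the dense layer at the end; this weight reuse is what makes CNNs tractable on images.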

Thank you for Reading…!!!

Click for Part 2:

https://sharathmanjunath.medium.com/convolutional-neural-network-and-keras-mixed-data-input-part-2-by-sharath-manjunath-bb9529d76f89
