
Series:
Basic Intuitions of Machine Learning & Deep Learning for beginners


Chapter 3: Deep Learning Connectionism Architecture
Philosophy of Connectionism

Originally published 16 February, 2021
By Michio Suginoo

The architecture of Deep Learning is inspired by neuroscience and cognitive science, and particularly by the philosophy of Connectionism.

A biological brain is a massive network of many neurons.

The chart below compares the number of neurons across different types of creatures. The human ranks at the top, just above the octopus. (Goodfellow, Bengio, & Courville, 2016, p. 23)
[Figure: comparison of the number of neurons across different creatures]
And Connectionism speculates that:

While each single neuron can only execute simple operations as an individual unit, when many neurons are connected together as a massive neural network, they can collectively exhibit very intelligent behaviours.

Deep Learning incorporates this notion into its architecture.

Simplicity of a single Neuron (Node)

Now, in the next figure, I want to show you how simple a single neuron can be. The circle in the middle represents one single neuron, which is also called a Node. It is the fundamental building block of Deep Learning.
[Figure: a single neuron (node): a linear transformation followed by a non-linear activation]
It performs only two simple operations: first, it receives an input from the previous neurons and executes a linear transformation on the input (on the left in the chart); then it passes the result through a non-linear activation (on the right in the chart). Finally, it passes the result of the activation on to the next neuron(s).
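For readers who want to see this in code, below is a minimal sketch of a single node in Python with NumPy. The input, weight, and bias values are hypothetical, and ReLU is used here only as an example of a non-linear activation.

import numpy as np

def node(x, w, b):
    # One neuron: a linear (affine) transformation followed by a non-linear activation.
    z = np.dot(w, x) + b        # linear transformation: scale (weight) and shift (bias)
    a = np.maximum(0.0, z)      # non-linear activation (ReLU, as an example)
    return a                    # passed on to the next neuron(s)

x = np.array([0.5, -1.2, 3.0])  # input from the previous neurons (hypothetical)
w = np.array([0.8, 0.1, -0.4])  # weight (hypothetical)
b = 0.2                         # bias (hypothetical)
print(node(x, w, b))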

Linear Transformation (Affine Transformation)

The linear transformation applies two quantities to the input, namely a weight and a bias. They only scale and shift the input: no more, no less. Nevertheless, they play an essential role in storing and updating what the neuron learns. They are the neuron's parameters, which are explained shortly.

Here are a couple of paragraphs for a slightly deeper understanding of the linear transformation, if you like: the linear transformation is an affine transformation. It does not change certain characteristics of the input, such as ratios of distances and the alignment of points on a line.

Affine Transformation:
“An affine transformation is any transformation that preserves collinearity (i.e., all points lying on a line initially still lie on a line after transformation) and ratios of distances (e.g., the midpoint of a line segment remains the midpoint after transformation). In this sense, affine indicates a special class of projective transformations that do not move any objects from the affine space R^3 to the plane at infinity or conversely. An affine transformation is also called an affinity.”
So, an affine transformation preserves:
  1. collinearity (i.e., all points lying on a line initially still lie on a line after transformation) and
  2. ratios of distances (e.g., the midpoint of a line segment remains the midpoint after transformation)
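As a small worked example of these two properties, write the neuron's linear step as an affine map (the symbols below are generic, not taken from any specific model):

f(x) = Wx + b

f((x_1 + x_2)/2) = W(x_1 + x_2)/2 + b = ((Wx_1 + b) + (Wx_2 + b))/2 = (f(x_1) + f(x_2))/2

In words, the image of a midpoint is the midpoint of the images, which is exactly the "ratios of distances" property quoted above; replacing the factor 1/2 with any t shows that points on a line stay on a line, which is the collinearity property.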

If this additional note about the linear transformation (affine transformation) does not appeal to your intuition, do not worry. It was just an extra insight; you can skip it and move forward.

Non-linear Activation


Now, in the box on the right, we have the graphical representations of three typical non-linear activations. As you see, each of these charts represents a very simple non-linear operation.

Here are some intuitions about these three activation functions (a minimal code sketch follows the list):
  1. ReLU (Rectified Linear Unit): At first, it was hard for me to believe that ReLU is one of the most popular activation functions used in the hidden layers, for two reasons: it is almost linear, and it is not differentiable at the kink at zero. Nevertheless, the notion of ReLU is inspired by the behaviour of biological neurons. Neuroscientists observed that a single neuron responds only to signals that exceed a certain threshold. The threshold itself is effectively captured by the linear transformation; the activation then reacts only to a positive result of the linear transformation and outputs zero otherwise.
  2. Sigmoid Function: Sigmoid takes any real value as its input and maps it onto the range between 0 and 1. So it is used when we expect the output to serve as a probability proxy.
  3. TanH Function (Hyperbolic Tangent Function): TanH takes any real value and maps it onto the range between -1 and 1. When we have inputs of different types and scales, it helps standardize/normalize the scale of the signal.
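Here is a minimal NumPy sketch of the three activation functions, with a few illustrative input values, just to make their shapes concrete:

import numpy as np

def relu(z):
    return np.maximum(0.0, z)        # 0 for negative inputs, identity for positive inputs

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # squashes any real value into (0, 1)

def tanh(z):
    return np.tanh(z)                # squashes any real value into (-1, 1), zero-centred

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))     # [0.  0.  0.  0.5 2. ]
print(sigmoid(z))  # values strictly between 0 and 1
print(tanh(z))     # values strictly between -1 and 1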

Overall, a single neuron (node) executes only simple operations; on its own, it is far from an intelligent entity.

Now, regarding a single neuron, I would like to make one more remark: it is about its parameters.

Parameters (Weights and Biases)

A parameter, such as a weight or a bias, is a quantity that is initialized outside of the model.

Once initialized, parameters are continuously updated through training iterations. (By contrast, settings that are fixed outside of training and not updated by it, such as the number of layers, are called hyperparameters.)
Since parameters are embedded within the architecture of the neuron, a neuron is often called a "parameterized module". Typically, a random value is assigned as the initial value of each parameter.
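As a minimal sketch of this idea (the learning rule itself is the topic of the next chapter), the weight and bias below are initialized with random values outside the model and then updated repeatedly inside a loop. The input, target, toy squared-error loss, and learning rate are illustrative assumptions, not the author's setup.

import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)            # weight: random initial value assigned outside the model
b = rng.normal()                  # bias: random initial value assigned outside the model

x = np.array([0.5, -1.2, 3.0])    # hypothetical input
y = 1.0                           # hypothetical target
lr = 0.05                         # learning rate, fixed outside of training (a hyperparameter)

for step in range(100):           # parameters are updated through iterations
    error = (np.dot(w, x) + b) - y   # prediction error under a toy squared-error loss
    w -= lr * error * x              # update the weight
    b -= lr * error                  # update the bias

print(np.dot(w, x) + b)           # after training, the prediction approaches the target y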

That's enough about a single neuron for now. Next, let's take a look at the neural network as a whole. Remember, Connectionism speculates: while a single neuron executes only simple operations, when multiple neurons are connected, they collectively exhibit intelligent behaviour as a neural network.

What Makes a Neural Network an Intelligent Entity?

What you see in the next figure is the morphology of a Deep Learning algorithm in the form of a computation graph, which is a graphical representation of computation. The most important takeaway here is that Deep Learning combines two kinds of computing: parallel computing and sequential computing.
[Figure: computation graph of a feedforward neural network: a single layer of parallel neurons (left) and a stack of layers (right)]
On the left, we have one single layer containing multiple neurons. Within a single layer, the neurons operate side by side without any connections among themselves. So a layer represents parallel computing, which demands intensive real-time processing. A layer constitutes a building block of Deep Learning.

On the right, we have a stack of layers. The name Deep Learning is derived from this particular morphology, the stack of layers, in its architecture. Connections run from one layer to the next through the neurons in sequence. The stack of layers represents sequential computing.

Overall, Deep Learning processes information in parallel within each layer while it learns in sequence across the stack of layers.
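Here is a minimal sketch of this morphology, assuming NumPy and random illustrative weights: within each layer, the outputs of all neurons are computed together as one matrix-vector product (parallel computing), while the layers themselves are applied one after another in a loop (sequential computing).

import numpy as np

rng = np.random.default_rng(42)
layer_sizes = [4, 5, 5, 2]        # input size, two hidden layers, output size (illustrative)

# One weight matrix and one bias vector per layer, randomly initialized.
weights = [rng.normal(size=(m, n)) for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [rng.normal(size=m) for m in layer_sizes[1:]]

def feedforward(x):
    a = x
    for W, b in zip(weights, biases):      # sequential computing: one layer after another
        a = np.maximum(0.0, W @ a + b)     # parallel computing: every neuron in the layer at once (ReLU)
    return a

print(feedforward(rng.normal(size=4)))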

As a caveat, this illustration represents one of the most basic prototypes of a Deep Learning model, called a 'Feedforward Neural Network'. Viable Deep Learning models usually have far more complex structures. As new models sought better performance, their architectures grew deeper. The next historical chart illustrates this trend.
[Figure: depth (# of layers) of the top models in the ILSVRC over the years]
This is the historical evolution of the depth of layers in the top competing models of a prominent computer vision contest, the ImageNet Large Scale Visual Recognition Challenge, a.k.a. ILSVRC.

The # of layers appears at the top of each bar in the chart.

Deep Learning entered this contest with a model called AlexNet in 2012. Until then, the top-performing entries had been shallow, non-deep architectures.

AlexNet, however, was not the first Deep Learning model. The earliest generation of Deep Learning appeared as early as the 1960s, but it failed to perform well because the hardware constraints of the time prevented it from adding enough layers. This paints a symbolic picture: the evolution of Deep Learning has been significantly shaped by the evolution of hardware.

All that said, it turns out that simply adding layers does not necessarily improve performance, because deeper models are often harder to optimize.
Simply put, determining the right depth of layers is a tricky business and requires engineering expertise.

Complexity and Simplicity of Deep Learning


In a nutshell, Deep Learning is a complex network of simple neurons. Complexity and simplicity coexist in the architecture of Deep Learning.
Its complexity can be captured by the size of its four operating components (a small counting sketch follows the list):
  • # of neurons or nodes
  • # of parameters
  • # of layers
  • # of connections.
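As a rough illustration, here is how these four measures can be counted for a small, plain fully connected network; the layer sizes are hypothetical.

layer_sizes = [4, 5, 5, 2]                   # illustrative fully connected network

n_layers = len(layer_sizes) - 1              # of layers (excluding the input)
n_neurons = sum(layer_sizes[1:])             # of neurons (nodes)
n_connections = sum(n * m for n, m in zip(layer_sizes[:-1], layer_sizes[1:]))  # of connections (weights)
n_parameters = n_connections + n_neurons     # of parameters (weights plus biases)

print(n_layers, n_neurons, n_connections, n_parameters)   # prints: 3 12 55 67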
Today, the complexity of Deep Learning Models is exploding.
 
In Chapter 5, we will see some implications of the exploding complexity of Deep Learning today.

In this chapter, we saw the simplicity and the complexity of the neural network architecture of Deep Learning. In the next chapter, let's look into the learning mechanism of Deep Learning: how does it learn?


Copyright © by Michio Suginoo. All rights reserved.
