Understanding the Core Concepts of Deep Learning with MIT 6.S191

Breye Foka

🧠 The Building Block: Perceptron

The perceptron is the simplest type of artificial neuron. It's modeled after biological neurons and takes weighted inputs, adds a bias, and passes the result through an activation function.

Forward propagation for a single perceptron can be written as:

$$z = \sum_{i=1}^{n} w_i x_i + b, \qquad a = \phi(z)$$

Where:

  • $x_i$ are the input features
  • $w_i$ are the weights
  • $b$ is the bias
  • $\phi$ is the activation function
  • $a$ is the output of the neuron

Activation functions introduce non-linearity, such as:

  • Sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$
  • ReLU: $\mathrm{ReLU}(z) = \max(0, z)$

ReLU and sigmoid activation functions
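The formulas above can be sketched in a few lines of plain Python (a toy illustration, no framework; the function names are my own):

```python
import math

def sigmoid(z):
    # Sigmoid squashes any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    # ReLU passes positive values through and zeroes out negatives
    return max(0.0, z)

def perceptron(x, w, b, phi):
    # z = sum_i w_i * x_i + b, then a = phi(z)
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return phi(z)

# Two inputs, two weights, a bias, and a sigmoid activation
a = perceptron([1.0, 2.0], [0.5, -0.25], 0.1, sigmoid)
```

Swapping `sigmoid` for `relu` changes only the final non-linearity; the weighted sum is the same.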

πŸ•ΈοΈ Expanding the Web: Neural Networks

A neural network consists of layers of neurons. Each hidden layer performs a transformation on its inputs using dense connections (fully connected layers).

$$a^{(l)} = \phi\left(W^{(l)} a^{(l-1)} + b^{(l)}\right)$$

Where:

  • $l$ is the layer index
  • $a^{(l)}$ is the activation of layer $l$
  • $W^{(l)}$ and $b^{(l)}$ are the layer's weights and bias
  • $\phi$ is the activation function
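A dense layer is just the perceptron computation applied to every neuron in the layer; here is a minimal sketch (the helper name `dense_forward` is my own):

```python
def dense_forward(a_prev, W, b, phi):
    # One fully connected layer: a = phi(W a_prev + b).
    # W is a list of rows, one row of weights per output neuron.
    z = [sum(wij * aj for wij, aj in zip(row, a_prev)) + bi
         for row, bi in zip(W, b)]
    return [phi(zi) for zi in z]

relu = lambda z: max(0.0, z)

# Two stacked layers: the output of layer l-1 feeds layer l
x = [1.0, 2.0]
hidden = dense_forward(x, [[0.5, -0.5], [0.25, 0.25]], [0.0, 0.1], relu)
output = dense_forward(hidden, [[1.0, 1.0]], [0.0], relu)
```

Stacking calls like this is exactly the recurrence $a^{(l)} = \phi(W^{(l)} a^{(l-1)} + b^{(l)})$.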

πŸ“‰ Loss Functions: Measuring the Error

The loss function quantifies the difference between predicted and true values. Common types include:

Mean Squared Error (MSE):

$$L = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Cross-Entropy Loss (for classification):

$$L = -\sum_i y_i \log(\hat{y}_i)$$

πŸ”„ Backpropagation: The Learning Engine

Backpropagation is the process that allows the network to learn by updating weights based on the loss gradient.

It works by applying the chain rule from calculus, moving layer by layer from output to input.

🧠 Error propagation:

For layer ll:

$$\delta^{(l)} = \left(W^{(l+1)}\right)^{T} \delta^{(l+1)} \odot \phi'\left(z^{(l)}\right)$$

The weights are then updated:

$$W^{(l)} := W^{(l)} - \eta \cdot \frac{\partial L}{\partial W^{(l)}}$$
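To make the chain rule concrete, here is a toy sketch for a single linear neuron with a squared-error loss (the function name `backprop_step` and the parameter values are my own, not from the lecture):

```python
def backprop_step(x, y, w, b, eta):
    # Single linear neuron: z = w*x + b, loss L = (z - y)^2
    z = w * x + b
    # Chain rule: dL/dw = dL/dz * dz/dw = 2*(z - y) * x
    dw = 2 * (z - y) * x
    # dz/db = 1, so dL/db = 2*(z - y)
    db = 2 * (z - y)
    # Gradient descent update: parameter := parameter - eta * gradient
    return w - eta * dw, b - eta * db

w, b = 0.0, 0.0
w, b = backprop_step(x=1.0, y=1.0, w=w, b=b, eta=0.1)
```

In a real network the same two steps, differentiate via the chain rule and step against the gradient, are repeated layer by layer from the output back to the input.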

🎯 Optimization: Gradient Descent

We minimize the loss using gradient descent, which updates weights to reduce the error:

$$w := w - \eta \frac{\partial L}{\partial w}$$

Where Ξ·\eta is the learning rate.

Variants include:

  • Stochastic Gradient Descent (SGD): updates using a single randomly chosen example
  • Mini-Batch Gradient Descent: updates using a small random subset of the data, trading gradient accuracy for speed
  • Adaptive methods: e.g., the Adam optimizer, which adapts the learning rate per parameter
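The basic update rule is a short loop; this sketch minimizes a simple quadratic loss (the function name `gradient_descent` and the example loss are my own):

```python
def gradient_descent(grad, w0, eta, steps):
    # Repeatedly step against the gradient: w := w - eta * dL/dw
    w = w0
    for _ in range(steps):
        w -= eta * grad(w)
    return w

# Minimize L(w) = (w - 3)^2, whose gradient is 2*(w - 3);
# the minimum is at w = 3
w_star = gradient_descent(lambda w: 2 * (w - 3), w0=0.0, eta=0.1, steps=100)
```

With a learning rate that is too large the iterates overshoot and diverge; too small and convergence crawls, which is exactly the learning-rate effect the lecture highlights.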

πŸ“‰ Visualizing Gradient Descent

Two figures illustrate this: one shows gradient descent minimizing a loss function, and the other gives a 3D surface view of finding the minima.

🧩 Generalization and Overfitting

A network that memorizes training data but performs poorly on new data is said to overfit.

πŸ”§ Regularization Techniques:

  • Dropout: randomly deactivates neurons during training
  • Early stopping: halts training when validation loss stops improving

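Early stopping is simple to implement with a patience counter; here is a minimal sketch (the function name `early_stopping` and the patience default are my own):

```python
def early_stopping(val_losses, patience=2):
    # Stop once validation loss has not improved for `patience`
    # consecutive epochs; return the epoch index at which to stop.
    best, waited = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses) - 1
```

In practice you would also keep a checkpoint of the weights from the best epoch and restore them when training halts.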

🧭 Final Thoughts

This MIT 6.S191 session was a deep dive into the fundamental mechanics of deep learning. From perceptrons to adaptive learning and generalization, it laid the groundwork for building intelligent systems. Stay tuned for more as I continue my journey through this series and dive into more advanced topics like convolutional networks, sequence models, and more!

πŸ‘¨β€πŸ’» If you want to learn along, check out: MIT 6.S191 Intro to Deep Learning


Written by Breye.