Breye Foka

🧠 Building a Neural Network from Scratch for MNIST Digit Classification

🏁 Introduction: The Problem We Want to Solve

The MNIST dataset is a classic benchmark in machine learning, consisting of 70,000 grayscale images of handwritten digits (0 through 9). Each image is 28x28 pixels, which we unroll into a 784-dimensional vector. Our task is to build a neural network from scratch, using only NumPy, that classifies these digits correctly. This project is a deep dive into how deep learning works under the hood, without relying on high-level libraries like TensorFlow or PyTorch.

🧱 Network Architecture

We build a three-layer network (input, hidden, and output layers) with the following structure:

  • Input Layer: 784 neurons (one per pixel)
  • Hidden Layer: 10 neurons (using ReLU activation)
  • Output Layer: 10 neurons (one for each digit, using softmax activation)

Mathematically:

  • Input vector $X \in \mathbb{R}^{784}$
  • Hidden layer weights $W_1 \in \mathbb{R}^{10 \times 784}$, biases $b_1 \in \mathbb{R}^{10 \times 1}$
  • Output layer weights $W_2 \in \mathbb{R}^{10 \times 10}$, biases $b_2 \in \mathbb{R}^{10 \times 1}$
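
The training loop further down assumes these parameters already exist, so they need to be initialized somewhere. Here is a minimal sketch; the exact initialization scheme is a choice, and small random weights with zero biases are just one common option for a network this shallow:

import numpy as np

def init_params():
    # Shapes follow the architecture above; small random weights, zero biases (one possible scheme)
    W1 = np.random.randn(10, 784) * 0.01
    b1 = np.zeros((10, 1))
    W2 = np.random.randn(10, 10) * 0.01
    b2 = np.zeros((10, 1))
    return W1, b1, W2, b2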

🧹 Data Preparation

We begin by loading and preprocessing the MNIST dataset:

import numpy as np
import pandas as pd

# Load the MNIST CSV (label in column 0, 784 pixel columns after it)
data = pd.read_csv("./train.csv").to_numpy()
np.random.shuffle(data)

data_dev = data[0:1000].T
Y_dev = data_dev[0]
X_dev = data_dev[1:] / 255.

data_train = data[1000:].T
Y_train = data_train[0]
X_train = data_train[1:] / 255.

Here we normalize pixel values to [0, 1] and transpose the data for matrix-based operations, so that each column is one training example. The first 1,000 shuffled examples are held out as a development set.
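
A quick shape check helps catch orientation mistakes early (the second dimension of the training arrays depends on how many rows your train.csv contains):

print(X_train.shape)  # (784, number of training examples)
print(Y_train.shape)  # (number of training examples,)
print(X_dev.shape)    # (784, 1000)
print(Y_dev.shape)    # (1000,)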

πŸ”’ One-Hot Encoding

To make our labels usable for softmax and cross-entropy loss, we convert them to one-hot vectors:

def one_hot(Y):
    # One row per example, one column per class
    # (assumes all labels 0-9 appear, so Y.max() + 1 == 10)
    one_hot_Y = np.zeros((Y.size, Y.max() + 1))
    # Put a 1 at each example's label index
    one_hot_Y[np.arange(Y.size), Y] = 1
    # Transpose so each column is one example, matching X
    return one_hot_Y.T

This allows us to compute a meaningful difference between predicted probabilities and actual class labels.
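
For example, on a small hypothetical label array:

labels = np.array([2, 0, 1])
print(one_hot(labels))
# [[0. 1. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]]
# shape (3, 3): one column per example, one row per class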

πŸ” Forward Propagation

In the forward pass, we compute activations layer by layer:

  1. Hidden layer (ReLU activation):
     $Z_1 = W_1 X + b_1$, $\quad A_1 = \text{ReLU}(Z_1) = \max(0, Z_1)$
  2. Output layer (Softmax activation):
     $Z_2 = W_2 A_1 + b_2$, $\quad \text{Softmax}(Z)_i = \frac{e^{Z_i - \max(Z)}}{\sum_{j} e^{Z_j - \max(Z)}}$
# ReLU activation: elementwise max(0, Z)
def ReLU(Z):
    return np.maximum(0, Z)

# Softmax activation: subtract the per-column max for numerical stability,
# then normalize each column so it sums to 1
def softmax(Z):
    A = np.exp(Z - Z.max(axis=0, keepdims=True))
    return A / A.sum(axis=0, keepdims=True)
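
Putting the two steps together, here is a small helper used in the sketches below; the training loop later inlines the same computation:

def forward_prop(W1, b1, W2, b2, X):
    # Hidden layer: affine transform followed by ReLU
    Z1 = W1 @ X + b1
    A1 = ReLU(Z1)
    # Output layer: affine transform followed by softmax
    Z2 = W2 @ A1 + b2
    A2 = softmax(Z2)
    return Z1, A1, Z2, A2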

🎯 Loss Function (Cross-Entropy)

To evaluate our predictions, we use cross-entropy loss:

$L = -\frac{1}{m} \sum_{i=1}^{m} Y_i \cdot \log(\hat{Y}_i)$
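
The loss value is not strictly needed to train (the backprop formulas below use only its gradient), but a small sketch is handy for monitoring. Here A2 holds the softmax outputs and one_hot_Y the one-hot labels, one column per example:

def cross_entropy_loss(A2, one_hot_Y):
    m = one_hot_Y.shape[1]
    # Clip with a tiny epsilon to avoid log(0), then average over examples
    eps = 1e-12
    return -np.sum(one_hot_Y * np.log(A2 + eps)) / m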

πŸ” Backpropagation

We compute gradients to update weights:

Output layer:

$dZ^{[2]} = A^{[2]} - Y$
$dW^{[2]} = \frac{1}{m} dZ^{[2]} (A^{[1]})^T$
$db^{[2]} = \frac{1}{m} \sum dZ^{[2]}$

Hidden layer:

$dZ^{[1]} = (W^{[2]})^T dZ^{[2]} \circ g^{[1]\prime}(Z^{[1]}) = (W^{[2]})^T dZ^{[2]} \circ \text{ReLU}'(Z^{[1]})$
$dW^{[1]} = \frac{1}{m} dZ^{[1]} X^T$
$db^{[1]} = \frac{1}{m} \sum dZ^{[1]}$

In Python, the only new piece of code we need is the ReLU derivative, which is simply a mask of the positions where the pre-activation was positive:

def ReLU_deriv(Z):
    # Boolean mask; it is cast to 0/1 when multiplied with the upstream gradient
    return Z > 0
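
For reference, here is a sketch that packages these equations into one function; the training loop below performs the same computation inline. X, Y, and the cached activations are assumed to come from the data-preparation and forward-pass steps above:

def backward_prop(Z1, A1, A2, W2, X, Y):
    m = X.shape[1]
    one_hot_Y = one_hot(Y)
    # Output layer gradients
    dZ2 = A2 - one_hot_Y
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    # Hidden layer gradients
    dZ1 = (W2.T @ dZ2) * ReLU_deriv(Z1)
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return dW1, db1, dW2, db2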

πŸ“¦ Parameter Update (Gradient Descent)

Using a learning rate $\alpha$, we update the weights and biases:

$W := W - \alpha \cdot dW$
$b := b - \alpha \cdot db$

πŸ” Training Loop


# Training data and hyperparameters (example values; tune as needed)
X, Y = X_train, Y_train
m = X.shape[1]          # number of training examples
alpha = 0.10            # learning rate
iterations = 500        # number of gradient descent steps

# Initialize parameters (see the init_params sketch above)
W1, b1, W2, b2 = init_params()

for i in range(iterations):
    # Forward pass
    Z1 = W1 @ X + b1
    A1 = ReLU(Z1)
    Z2 = W2 @ A1 + b2
    A2 = softmax(Z2)

    # Backprop
    one_hot_Y = one_hot(Y)
    dZ2 = A2 - one_hot_Y
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m

    dZ1 = (W2.T @ dZ2) * ReLU_deriv(Z1)
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m

    # Gradient descent update
    W1 -= alpha * dW1
    b1 -= alpha * db1
    W2 -= alpha * dW2
    b2 -= alpha * db2
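
If you want to watch training progress, you can add a few lines at the end of the loop body (an optional addition, not part of the minimal algorithm):

    if i % 50 == 0:
        preds = np.argmax(A2, axis=0)
        print(f"iteration {i}: train accuracy = {np.mean(preds == Y) * 100:.2f}%")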

πŸ§ͺ Evaluation and Prediction

def predict(X):
    # Run a forward pass with the trained parameters
    Z1 = W1 @ X + b1
    A1 = ReLU(Z1)
    Z2 = W2 @ A1 + b2
    A2 = softmax(Z2)
    # The predicted digit is the class with the highest probability in each column
    return np.argmax(A2, axis=0)

def accuracy(preds, Y):
    # Percentage of examples where the predicted digit matches the label
    return np.mean(preds == Y) * 100

This gives us the model’s accuracy on development data.
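
For example, evaluating on the held-out development set from the data-preparation step:

dev_preds = predict(X_dev)
print(f"Dev accuracy: {accuracy(dev_preds, Y_dev):.2f}%")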

🧠 Final Thoughts

This project demonstrates how a neural network works from the ground up:

  • We built each component from scratch
  • No frameworks, just math and NumPy
  • You now understand how forward and backpropagation drive the learning process

πŸ“Œ Next Steps

  • Add more layers or try different activation functions
  • Implement regularization
  • Build a version with PyTorch or TensorFlow and compare

Thanks for reading! Have feedback, improvements, or want to collaborate? Let’s connect!