Building a Neural Network from Scratch for MNIST Digit Classification



📌 Introduction: The Problem We Want to Solve
The MNIST dataset is a classic benchmark in machine learning, consisting of 70,000 grayscale images of handwritten digits (0 through 9). Each image is 28x28 pixels, unrolled into a 784-dimensional vector. Our task is to build a neural network from scratch using only NumPy to classify these digits correctly. This project is a deep dive into how deep learning works under the hood, without relying on high-level libraries like TensorFlow or PyTorch.
🧱 Network Architecture
We build a 3-layer neural network with the following structure:
- Input Layer: 784 neurons (one per pixel)
- Hidden Layer: 10 neurons (using ReLU activation)
- Output Layer: 10 neurons (one for each digit, using softmax activation)
Mathematically:
- Input vector $x \in \mathbb{R}^{784}$ (a full batch is a matrix $X \in \mathbb{R}^{784 \times m}$, one column per example)
- Hidden layer weights $W_1 \in \mathbb{R}^{10 \times 784}$, biases $b_1 \in \mathbb{R}^{10 \times 1}$
- Output layer weights $W_2 \in \mathbb{R}^{10 \times 10}$, biases $b_2 \in \mathbb{R}^{10 \times 1}$
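The initialization step is not shown in the article, but a minimal sketch matching these shapes could look like the following (the helper name init_params and the 0.5 offset are illustrative choices, not the author's exact code):

import numpy as np

def init_params():
    # Small random weights and biases centered on zero, shaped to match the architecture above
    W1 = np.random.rand(10, 784) - 0.5   # hidden layer weights
    b1 = np.random.rand(10, 1) - 0.5     # hidden layer biases
    W2 = np.random.rand(10, 10) - 0.5    # output layer weights
    b2 = np.random.rand(10, 1) - 0.5     # output layer biases
    return W1, b1, W2, b2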
🧹 Data Preparation
We begin by loading and preprocessing the MNIST dataset:
import numpy as np
import pandas as pd

# Load the labeled digits (label in column 0, 784 pixel values per row) and shuffle the rows
data = pd.read_csv("./train.csv").to_numpy()
np.random.shuffle(data)
# Hold out the first 1,000 shuffled rows as a development set
data_dev = data[0:1000].T
Y_dev = data_dev[0]
X_dev = data_dev[1:] / 255.
# The remaining rows form the training set
data_train = data[1000:].T
Y_train = data_train[0]
X_train = data_train[1:] / 255.
Here we normalize the pixel values to [0, 1] and transpose the data so that each column is one training example, which keeps all later operations matrix-based. The first 1,000 shuffled rows are held out as a small development set for evaluation.
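To double-check this layout, we can reshape one column of X_train back into a 28x28 image and display it (matplotlib is an extra dependency used only for this check, not part of the project's code):

import matplotlib.pyplot as plt

# Column 0 of X_train is one flattened 784-pixel image; its label is Y_train[0]
plt.imshow(X_train[:, 0].reshape(28, 28), cmap="gray")
plt.title(f"Label: {Y_train[0]}")
plt.show()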
🔢 One-Hot Encoding
To make our labels usable for softmax and cross-entropy loss, we convert them to one-hot vectors:
def one_hot(Y):
    # Build an (m, num_classes) matrix of zeros, set a 1 at each label's position,
    # then transpose so that each column is the one-hot vector for one example
    one_hot_Y = np.zeros((Y.size, Y.max() + 1))
    one_hot_Y[np.arange(Y.size), Y] = 1
    return one_hot_Y.T
This allows us to compute a meaningful difference between predicted probabilities and actual class labels.
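For example, with three illustrative labels (note that the number of classes is inferred from Y.max() + 1, so the real labels 0 through 9 yield 10 rows):

Y_example = np.array([2, 0, 1])
print(one_hot(Y_example))
# [[0. 1. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]]
# Each column is the one-hot vector for one label.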
🔁 Forward Propagation
In the forward pass, we compute activations layer by layer:
- Hidden layer (ReLU activation): $Z_1 = W_1 X + b_1,\; A_1 = \mathrm{ReLU}(Z_1) = \max(0, Z_1)$
- Output layer (Softmax activation): $Z_2 = W_2 A_1 + b_2,\; A_2 = \mathrm{softmax}(Z_2)$ where $\mathrm{softmax}(Z)_k = \frac{e^{Z_k}}{\sum_j e^{Z_j}}$
# ReLU activation: element-wise max(0, z)
def ReLU(Z): return np.maximum(0, Z)

# Softmax activation: turn each column of scores into a probability distribution
# (subtracting the column-wise max first for numerical stability)
def softmax(Z):
    A = np.exp(Z - np.max(Z, axis=0, keepdims=True))
    return A / A.sum(axis=0, keepdims=True)
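A quick sanity check with made-up scores: every column of the softmax output should be a valid probability distribution:

Z = np.array([[1.0, -2.0],
              [0.5,  0.0],
              [3.0,  1.5]])
A = softmax(Z)
print(A.sum(axis=0))     # [1. 1.]  (each column sums to 1)
print(A.argmax(axis=0))  # [2 2]    (most probable class per column)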
🎯 Loss Function (Cross-Entropy)
To evaluate our predictions, we use the cross-entropy loss, averaged over the $m$ examples in a batch:

$L = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=0}^{9} y_k^{(i)} \log a_k^{(i)}$

where $y^{(i)}$ is the one-hot label vector and $a^{(i)}$ is the softmax output for example $i$. Minimizing this loss pushes the predicted probability of the correct digit towards 1.
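The training loop below never computes the loss explicitly (it only needs its gradient), but a minimal implementation of this formula, reusing the one_hot helper above, could look like:

def cross_entropy_loss(A2, Y):
    # A2: (10, m) softmax probabilities, Y: (m,) integer labels
    one_hot_Y = one_hot(Y)
    # Add a small epsilon so log(0) never occurs
    return -np.sum(one_hot_Y * np.log(A2 + 1e-9)) / Y.size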
🔁 Backpropagation
We compute the gradient of the loss with respect to each parameter, working backwards from the output. Because the gradients of softmax and cross-entropy combine into a single term, the output error is simply $A_2$ minus the one-hot labels.

Output layer:

$dZ_2 = A_2 - Y_{\text{one-hot}}, \quad dW_2 = \tfrac{1}{m}\, dZ_2 A_1^T, \quad db_2 = \tfrac{1}{m} \textstyle\sum dZ_2$

Hidden layer:

$dZ_1 = W_2^T dZ_2 \odot \mathrm{ReLU}'(Z_1), \quad dW_1 = \tfrac{1}{m}\, dZ_1 X^T, \quad db_1 = \tfrac{1}{m} \textstyle\sum dZ_1$
Implementing the ReLU derivative in Python:
# Derivative of ReLU: 1 where Z > 0, 0 elsewhere (the boolean array acts as 0s and 1s in arithmetic)
def ReLU_deriv(Z): return Z > 0
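As a quick check with made-up values, the boolean mask returned by ReLU_deriv behaves like a matrix of 0s and 1s when multiplied with another array:

Z = np.array([[-2.0,  3.0],
              [ 0.5, -1.0]])
print(ReLU_deriv(Z))        # [[False  True]
                            #  [ True False]]
print(ReLU_deriv(Z) * 5.0)  # [[0. 5.]
                            #  [5. 0.]]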
📦 Parameter Update (Gradient Descent)
Using a learning rate $\alpha$, we take a gradient-descent step on every parameter:

$W_1 \leftarrow W_1 - \alpha\, dW_1, \quad b_1 \leftarrow b_1 - \alpha\, db_1, \quad W_2 \leftarrow W_2 - \alpha\, dW_2, \quad b_2 \leftarrow b_2 - \alpha\, db_2$
🔁 Training Loop
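For the loop below to run, the data, parameters, and hyperparameters have to be in scope. A minimal setup, assuming the init_params sketch from the architecture section (the values of alpha and iterations are illustrative, not tuned):

X, Y = X_train, Y_train
m = X.shape[1]                  # number of training examples (one per column)
W1, b1, W2, b2 = init_params()  # hypothetical helper sketched earlier
alpha = 0.1                     # learning rate (illustrative value)
iterations = 500                # number of gradient-descent steps (illustrative value)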
for i in range(iterations):
    # Forward pass
    Z1 = W1 @ X + b1
    A1 = ReLU(Z1)
    Z2 = W2 @ A1 + b2
    A2 = softmax(Z2)
    # Backprop
    one_hot_Y = one_hot(Y)
    dZ2 = A2 - one_hot_Y
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * ReLU_deriv(Z1)
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    # Gradient-descent update
    W1 -= alpha * dW1
    b1 -= alpha * db1
    W2 -= alpha * dW2
    b2 -= alpha * db2
🧪 Evaluation and Prediction
def predict(X):
    # Run a forward pass and take the most probable class for each column
    Z1 = W1 @ X + b1
    A1 = ReLU(Z1)
    Z2 = W2 @ A1 + b2
    A2 = softmax(Z2)
    return np.argmax(A2, axis=0)

def accuracy(preds, Y):
    # Percentage of examples whose predicted digit matches the label
    return np.mean(preds == Y) * 100
This gives us the model's accuracy on the development data.
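For example, on the development split held out during data preparation:

dev_preds = predict(X_dev)
print(f"Dev accuracy: {accuracy(dev_preds, Y_dev):.2f}%")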
🧠 Final Thoughts
This project demonstrates how a neural network works from the ground up:
- We built each component from scratch
- No frameworks, just math and NumPy
- You now understand how forward and backpropagation drive the learning process
🚀 Next Steps
- Add more layers or try different activation functions
- Implement regularization
- Build a version with PyTorch or TensorFlow and compare
Thanks for reading! Have feedback, improvements, or want to collaborate? Let's connect!