Multilayer Perceptron (MLP) with PyTorch on MNIST#
After implementing an MLP from scratch, it is useful to reproduce the same model in PyTorch. This gives you (1) a correctness check against a widely used framework and (2) a baseline for future experiments (regularization, better optimizers, GPUs, etc.). This notebook demonstrates how to train an MLP on the MNIST dataset using PyTorch.
Prerequisites#
Install the required packages:
# pip install torch torchvision
1. Imports and Device Setup#
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device
device(type='cpu')
2. Load the MNIST Dataset with torchvision#
MNIST images are 28×28 grayscale. For an MLP, we flatten each image into a 784-dimensional vector.
We will use datasets from torchvision to load the MNIST handwritten digits dataset. You can find the full list of available datasets in the torchvision documentation. Now let’s take a look at the parameters we set:
- root sets the directory we store and load our data from.
- train indicates whether we want the training dataset or the test dataset.
- transform allows us to apply transformations to our data. Here we only convert the data to tensors so that they work with PyTorch; in future notebooks you will see more complicated transformations.
transform = transforms.Compose([
    transforms.ToTensor()
])
train_dataset = datasets.MNIST(
    root='data', train=True, download=True, transform=transform
)
test_dataset = datasets.MNIST(
    root='data', train=False, download=True, transform=transform
)
print(f"Training data: {train_dataset}\n")
print(f"Test data: {test_dataset}")
Training data: Dataset MNIST
    Number of datapoints: 60000
    Root location: data
    Split: Train
    StandardTransform
Transform: Compose(
               ToTensor()
           )

Test data: Dataset MNIST
    Number of datapoints: 10000
    Root location: data
    Split: Test
    StandardTransform
Transform: Compose(
               ToTensor()
           )
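ToTensor converts each PIL image to a float tensor of shape (1, 28, 28) with pixel values scaled to [0, 1]. A quick check on a single sample (a minimal sketch, not part of the training pipeline):
img, label = train_dataset[0]
print(img.shape, img.dtype)                # torch.Size([1, 28, 28]) torch.float32
print(img.min().item(), img.max().item())  # 0.0 1.0
print(label)                               # an int class label (5 for the first training image)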
Data Loaders#
To make loading and batching the data easier, we use DataLoader from torch.utils.data. A DataLoader wraps a dataset, takes a batch_size parameter, and lets us iterate over the data in mini-batches. After creating the loaders, we do one iteration just to see the data shapes:
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=256, shuffle=False)
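One iteration over train_loader, with the resulting shapes shown in the comments:
x, y = next(iter(train_loader))     # grab a single batch
print(x.shape)                      # torch.Size([128, 1, 28, 28]): batch of images
print(y.shape)                      # torch.Size([128]): integer class labels
print(x.view(x.size(0), -1).shape)  # torch.Size([128, 784]): flattened, as the MLP sees them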
3. Define the MLP Model#
This is a standard fully connected network: 784 → hidden → hidden → 10. We do not apply softmax inside the model because CrossEntropyLoss expects raw logits.
class MLP(nn.Module):
    def __init__(self, input_dim=28*28, hidden1=256, hidden2=128, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden1),
            nn.ReLU(),
            nn.Linear(hidden1, hidden2),
            nn.ReLU(),
            nn.Linear(hidden2, num_classes)  # raw logits, no softmax here
        )

    def forward(self, x):
        x = x.view(x.size(0), -1)  # flatten (B, 1, 28, 28) -> (B, 784)
        return self.net(x)
model = MLP().to(device)
model
MLP(
  (net): Sequential(
    (0): Linear(in_features=784, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=128, bias=True)
    (3): ReLU()
    (4): Linear(in_features=128, out_features=10, bias=True)
  )
)
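As a quick sanity check, the parameter count should match the layer sizes: 784·256 + 256 weights and biases in the first layer, 256·128 + 128 in the second, and 128·10 + 10 in the output layer, for 235,146 trainable parameters in total:
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(n_params)  # 235146 = (784*256 + 256) + (256*128 + 128) + (128*10 + 10)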
4. Loss Function and Optimizer#
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
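CrossEntropyLoss applies log-softmax internally and then computes the negative log-likelihood, which is why the model outputs raw logits. A minimal check of that equivalence on random data:
import torch.nn.functional as F
fake_logits = torch.randn(4, 10)           # a fake batch of raw model outputs
fake_targets = torch.randint(0, 10, (4,))  # fake class labels
a = criterion(fake_logits, fake_targets)
b = F.nll_loss(F.log_softmax(fake_logits, dim=1), fake_targets)
print(torch.allclose(a, b))  # True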
5. Training and Evaluation Functions#
def train_one_epoch(model, loader, criterion, optimizer, device):
    model.train()
    total_loss, correct, total = 0.0, 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        logits = model(x)
        loss = criterion(logits, y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * x.size(0)  # weight by batch size for a correct mean
        correct += (logits.argmax(1) == y).sum().item()
        total += y.size(0)
    return total_loss / total, correct / total
@torch.no_grad()  # disable autograd: faster and less memory during evaluation
def evaluate(model, loader, criterion, device):
    model.eval()
    total_loss, correct, total = 0.0, 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        logits = model(x)
        loss = criterion(logits, y)
        total_loss += loss.item() * x.size(0)
        correct += (logits.argmax(1) == y).sum().item()
        total += y.size(0)
    return total_loss / total, correct / total
6. Train the Model#
epochs = 5
for epoch in range(1, epochs + 1):
    train_loss, train_acc = train_one_epoch(model, train_loader, criterion, optimizer, device)
    test_loss, test_acc = evaluate(model, test_loader, criterion, device)
    print(f'Epoch {epoch:02d} | Train Acc: {train_acc:.4f} | Test Acc: {test_acc:.4f}')
Epoch 01 | Train Acc: 0.9032 | Test Acc: 0.9498
Epoch 02 | Train Acc: 0.9601 | Test Acc: 0.9679
Epoch 03 | Train Acc: 0.9733 | Test Acc: 0.9736
Epoch 04 | Train Acc: 0.9802 | Test Acc: 0.9751
Epoch 05 | Train Acc: 0.9853 | Test Acc: 0.9745
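To use the trained model for inference, take the argmax of the logits. A minimal sketch on one test batch:
model.eval()
with torch.no_grad():
    x, y = next(iter(test_loader))
    preds = model(x.to(device)).argmax(1).cpu()
print(preds[:10])  # predicted digits for the first ten test images
print(y[:10])      # ground-truth digits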