Tensors in PyTorch: What Changes Compared to the From-Scratch Implementation?#
In the previous section, we implemented a Multilayer Perceptron (MLP) from scratch using basic Python and NumPy arrays. All computations were expressed in terms of scalars, vectors, and matrices, and we explicitly managed:

- the forward pass,
- gradient derivations and the backward pass,
- parameter updates.
PyTorch introduces a new core data type: the tensor. While tensors may look similar to NumPy arrays, they add capabilities that are central to modern deep learning systems: automatic differentiation, hardware acceleration, and a library of optimized deep learning operators.
1. What Is a Tensor?#
In numerical computing, a tensor is a multi-dimensional array. The term emphasizes that we may work with data of arbitrary order (number of axes).
- Scalars are 0D tensors
- Vectors are 1D tensors
- Matrices are 2D tensors
- Higher-dimensional arrays are 3D+ tensors

Mathematically, one can view a tensor of order \(k\) as an element of a tensor product space:

$$
\mathbf{T} \in V_1 \otimes V_2 \otimes \cdots \otimes V_k.
$$
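In PyTorch, these orders map directly onto a tensor's number of axes, exposed as `.ndim`. A minimal illustration:

```python
import torch

# Tensors of increasing order (number of axes)
s = torch.tensor(3.14)        # 0D: scalar
v = torch.tensor([1.0, 2.0])  # 1D: vector
M = torch.zeros(2, 3)         # 2D: matrix
T = torch.zeros(4, 2, 3)      # 3D: e.g., a batch of matrices

print(s.ndim, v.ndim, M.ndim, T.ndim)  # 0 1 2 3
```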
In the MLP context, tensors represent the same objects you used earlier—only the container and execution model change.
| Mathematical object | From-scratch code | PyTorch |
|---|---|---|
| Scalar | Python float | 0D tensor |
| Vector | 1D NumPy array | 1D tensor |
| Matrix | 2D NumPy array | 2D tensor |
| Batch of matrices | 3D array | 3D tensor |
2. Why Not Just Use NumPy Arrays? (A More Convincing Answer)#
NumPy arrays are excellent numerical containers and are sufficient for forward computation. However, deep learning workloads require additional system-level guarantees and capabilities that NumPy does not provide out of the box:
1. Automatic differentiation (autograd)

   - Deep networks require gradients such as \(\nabla_\theta L(\theta)\) for millions of parameters \(\theta\).
   - With NumPy, gradients must be derived and coded manually or via external tools.

2. Hardware acceleration and device abstraction

   - Training modern models efficiently depends on GPUs (and sometimes other accelerators).
   - NumPy operations run on the CPU; GPU support requires switching libraries (e.g., CuPy) and re-auditing the pipeline.

3. A differentiable operator ecosystem

   - Deep learning uses specialized ops (convolutions, normalization, embedding lookups, fused kernels).
   - PyTorch provides these operators together with correct gradient rules and optimized kernels.

A useful summary is:

- NumPy: array computing (values only)
- PyTorch tensor: array computing plus gradient tracking plus device-aware execution plus deep-learning primitives
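The two containers also interoperate: `torch.from_numpy` wraps a NumPy array without copying, which makes the "values only" vs "values plus tracking" distinction concrete. A small sketch:

```python
import numpy as np
import torch

a = np.ones(3, dtype=np.float32)
t = torch.from_numpy(a)  # tensor and array share the same memory

t += 1                   # an in-place update on the tensor...
print(a)                 # ...is visible through the NumPy array: [2. 2. 2.]
```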
3. Side-by-Side: NumPy Arrays vs PyTorch Tensors (Values)#
Consider a linear layer (affine map):

$$
\mathbf{Y} = \mathbf{X}\mathbf{W} + \mathbf{b},
$$

with \(\mathbf{X} \in \mathbb{R}^{N \times d}\), \(\mathbf{W} \in \mathbb{R}^{d \times m}\), and \(\mathbf{b} \in \mathbb{R}^{m}\).
Both NumPy and PyTorch can compute \(\mathbf{Y}\) as a forward pass.
# NumPy: forward computation (values only)
import numpy as np
np.random.seed(0)
X = np.random.randn(4, 3) # N=4, d=3
W = np.random.randn(3, 2) # d=3, m=2
b = np.random.randn(2,) # m=2
Y_np = X @ W + b
Y_np
array([[ 3.2955051 , -0.70672864],
[ 1.38728186, 0.24221777],
[ 0.8147218 , -0.76782153],
[ 2.86228391, -1.05442875]])
# PyTorch: forward computation (values only)
import torch
torch.manual_seed(0)
X_t = torch.randn(4, 3)
W_t = torch.randn(3, 2)
b_t = torch.randn(2)
Y_t = X_t @ W_t + b_t
Y_t
tensor([[-0.6639, -0.6620],
[ 0.5748, -1.5384],
[-1.7279, -1.2307],
[-0.0104, -1.9583]])
At this point, the two libraries look similar. The crucial differences appear when we need gradients, devices, and training loops.
4. Tensors and Automatic Differentiation (Computation Graphs)#
In gradient-based learning, we minimize a loss \(L(\theta)\) over parameters \(\theta\) (weights and biases). Training requires the update

$$
\theta \leftarrow \theta - \eta \nabla_\theta L(\theta),
$$

where \(\eta\) is the learning rate.

In the from-scratch section, you explicitly coded partial derivatives such as:

$$
\frac{\partial L}{\partial \mathbf{W}}, \quad \frac{\partial L}{\partial \mathbf{b}}.
$$
PyTorch tensors can track computation graphs. If a tensor is created with requires_grad=True, PyTorch records the sequence of differentiable operations. Calling backward() applies the chain rule automatically.
The chain rule in backpropagation has the generic form:

$$
\frac{\partial L}{\partial \mathbf{W}} = \frac{\partial L}{\partial \mathbf{Y}}\,\frac{\partial \mathbf{Y}}{\partial \mathbf{W}}.
$$

For \(\mathbf{Y}=\mathbf{X}\mathbf{W}+\mathbf{b}\), this becomes:

$$
\frac{\partial L}{\partial \mathbf{W}} = \mathbf{X}^\top\frac{\partial L}{\partial \mathbf{Y}}, \qquad \frac{\partial L}{\partial \mathbf{b}} = \sum_{i=1}^{N} \frac{\partial L}{\partial \mathbf{Y}_{i,:}}.
$$
Side-by-Side: Manual Gradients (NumPy) vs Autograd (PyTorch)#
We will use a simple scalar loss:

$$
L = \sum_{i,j} Y_{ij}.
$$

Then \(\frac{\partial L}{\partial Y_{ij}} = 1\) for all entries, so \(\frac{\partial L}{\partial \mathbf{Y}}\) is a matrix of ones.
# NumPy: manual gradients for L = sum(Y)
grad_Y = np.ones_like(Y_np) # dL/dY
grad_W = X.T @ grad_Y # dL/dW = X^T dL/dY
grad_b = grad_Y.sum(axis=0) # dL/db = sum over batch
grad_W, grad_b
(array([[5.36563246, 5.36563246],
[2.26040156, 2.26040156],
[1.35251476, 1.35251476]]),
array([4., 4.]))
# PyTorch: autograd for the same computation
X_t = torch.randn(4, 3, requires_grad=True)
W_t = torch.randn(3, 2, requires_grad=True)
b_t = torch.randn(2, requires_grad=True)
Y = X_t @ W_t + b_t
L = Y.sum()
L.backward()
W_t.grad, b_t.grad
(tensor([[-0.8637, -0.8637],
[ 1.3759, 1.3759],
[ 0.8702, 0.8702]]),
tensor([4., 4.]))
Key takeaway: Autograd does not change the mathematics of backpropagation; it changes who writes the gradient code. You still conceptually start from the loss and propagate backward—PyTorch simply performs the bookkeeping consistently and efficiently.
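As a sanity check, the manual formulas and autograd can be run on the same data. A small sketch (fresh tensors, same shapes as above) comparing the two:

```python
import torch

# Sanity check: autograd should reproduce the manual chain-rule formulas
# dL/dW = X^T (dL/dY) and dL/db = sum over the batch, for L = sum(Y).
torch.manual_seed(0)
X = torch.randn(4, 3)
W = torch.randn(3, 2, requires_grad=True)
b = torch.randn(2, requires_grad=True)

Y = X @ W + b
L = Y.sum()
L.backward()                              # autograd gradients land in .grad

grad_Y = torch.ones_like(Y)               # dL/dY for L = sum(Y)
manual_grad_W = X.T @ grad_Y              # manual dL/dW
manual_grad_b = grad_Y.sum(dim=0)         # manual dL/db

print(torch.allclose(W.grad, manual_grad_W))  # True
print(torch.allclose(b.grad, manual_grad_b))  # True
```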
5. Tensor Data Types (dtype) and Why They Matter#
Every tensor has a dtype that controls numerical precision and valid operations. Common choices include:
- torch.float32 (default for neural network weights and activations)
- torch.float64 (higher precision; typically slower and rarely needed for standard training)
- torch.int64 (commonly used for class labels)
This becomes important in classification. For example, CrossEntropyLoss expects labels as integer class indices:

$$
y \in \{0, 1, \dots, C-1\},
$$

not one-hot vectors.
In the MNIST workflow:

- Inputs x are floating-point tensors (e.g., float32)
- Labels y are integer tensors (typically int64)
NumPy will often silently cast types in mixed operations, which can hide bugs. PyTorch is stricter in many training-critical paths.
# dtype illustration
x = torch.randn(2, 3) # float32 by default
y = torch.tensor([1, 0]) # int64 by default for integer literals
x.dtype, y.dtype
(torch.float32, torch.int64)
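One place this strictness shows up is inside layers themselves: a float64 input fed to a default float32 nn.Linear is rejected rather than silently cast. A small sketch:

```python
import torch

layer = torch.nn.Linear(3, 2)               # weights and bias are float32 by default
x64 = torch.randn(4, 3, dtype=torch.float64)

try:
    layer(x64)                               # float64 input meets float32 weights
except RuntimeError:
    print("dtype mismatch rejected")         # PyTorch refuses to silently cast

y = layer(x64.float())                       # an explicit cast makes the intent visible
print(y.dtype)                               # torch.float32
```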
6. Tensor Shape and Batching#
A major practical difference between educational “from-scratch” code and production deep learning code is batching.
For MNIST, a batch of images typically has shape

$$
(\text{batch}, \text{channels}, \text{height}, \text{width}) = (B, 1, 28, 28).
$$

An MLP expects a matrix of shape \((B, 784)\), so we reshape (flatten) each image:

$$
\mathbf{X} \in \mathbb{R}^{B \times 784}.
$$
In PyTorch, flattening is often written as:

x = x.view(x.size(0), -1)
# shape and flattening example
B = 128
x_batch = torch.randn(B, 1, 28, 28)
x_flat = x_batch.view(x_batch.size(0), -1)
x_batch.shape, x_flat.shape
(torch.Size([128, 1, 28, 28]), torch.Size([128, 784]))
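Equivalently, torch.flatten(x, start_dim=1) produces the same \((B, 784)\) matrix; a quick check:

```python
import torch

x = torch.randn(8, 1, 28, 28)
a = x.view(x.size(0), -1)           # flatten everything after the batch axis
b = torch.flatten(x, start_dim=1)   # equivalent built-in

print(a.shape, torch.equal(a, b))   # torch.Size([8, 784]) True
```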
7. Device Awareness (CPU vs GPU)#
PyTorch tensors are device-aware: each tensor lives on a specific device (CPU or GPU). The same code can run on a GPU by moving tensors and models to that device:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = x.to(device)
model = model.to(device)
NumPy arrays do not have this concept. To use a GPU in a NumPy-like workflow, you must typically switch libraries (and sometimes APIs), which increases complexity and maintenance cost.
# device illustration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(3, 4)
x_device = x.to(device)
x.device, x_device.device
(device(type='cpu'), device(type='cpu'))
8. Summary: Connecting Both Worlds#
The mathematics of the MLP is identical in both approaches.
The from-scratch implementation emphasizes understanding:

- explicit forward/backward derivations,
- explicit parameter updates.

PyTorch tensors emphasize scalability and correctness:

- automatic differentiation,
- standardized batching,
- device-aware execution,
- and a large library of optimized differentiable operators.
Learning tensors effectively does not replace understanding backpropagation—it operationalizes it for real training workloads.
At a high level, a PyTorch tensor is indeed a multi-dimensional array, similar in structure to a NumPy array. If we restrict attention only to numerical storage and basic linear algebra on the CPU, then NumPy and PyTorch tensors may appear interchangeable.
9. How Does a Tensor Store Information for Differentiation?#
In PyTorch, a tensor is not just a numerical array. In addition to storing values, a tensor carries metadata that enables automatic differentiation (autograd).
This section explains what information is stored, where it is stored, and how it is used during backpropagation.
10. Conceptual Structure of a Tensor#
A tensor participating in differentiation can be abstracted as:
- data: numerical values in CPU or GPU memory
- requires_grad: whether gradients should be tracked
- grad: stores \(\frac{\partial L}{\partial \text{tensor}}\) after backpropagation
- grad_fn: reference to the operation that created the tensor
This metadata differentiates PyTorch tensors from NumPy arrays.
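A quick sketch inspecting these attributes on a user-created tensor and the loss it produces:

```python
import torch

W = torch.randn(3, 2, requires_grad=True)
L = (W * 2).sum()
L.backward()

print(W.requires_grad)  # True: gradients are tracked for W
print(W.grad.shape)     # torch.Size([3, 2]): dL/dW is stored in .grad
print(W.grad_fn)        # None: W was created by the user, not by an operation
print(L.grad_fn)        # the backward function of the op that produced L
```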
11. Leaf Tensors vs Non-Leaf Tensors#
Leaf tensors#
- Created directly by the user
- Have requires_grad=True
- Store gradients in .grad
Example:
import torch
X = torch.randn(4, 3)
W = torch.randn(3, 2, requires_grad=True)
b = torch.randn(2, requires_grad=True)
Y = X @ W + b
L = Y.sum()
L.backward()
Non-leaf tensors#
- Results of operations
- Possess a grad_fn
- Do not store gradients by default
Intermediate activations in neural networks are typically non-leaf tensors.
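These distinctions are directly visible through the .is_leaf and .grad_fn attributes; a small sketch:

```python
import torch

X = torch.randn(4, 3)
W = torch.randn(3, 2, requires_grad=True)
Y = X @ W                      # produced by an operation

print(W.is_leaf, W.grad_fn)    # True None  (leaf: created by the user)
print(Y.is_leaf)               # False      (non-leaf: result of an op)
print(Y.grad_fn is not None)   # True       (records the matrix multiply)
```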
12. Computation Graph#
Each tensor operation adds a node to a directed acyclic graph (DAG):
For:
Y = X @ W + b
L = Y.sum()
The graph conceptually follows:

X ----\
       MatMul ---- Add ---- Sum ---- L
W ----/             ^
                    |
                    b
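This graph can be inspected programmatically: each grad_fn links to the backward functions of its inputs via next_functions. A small sketch walking one level back from the loss:

```python
import torch

X = torch.randn(4, 3, requires_grad=True)
W = torch.randn(3, 2, requires_grad=True)
b = torch.randn(2, requires_grad=True)
L = (X @ W + b).sum()

# Walk one level back from the loss node of the DAG
print(type(L.grad_fn).__name__)                       # SumBackward0
print(type(L.grad_fn.next_functions[0][0]).__name__)  # AddBackward0
```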
13. Backward Functions (grad_fn)#
The attribute grad_fn references a backward-function object generated by the forward operation.
For example:

Y = X @ W
print(Y.grad_fn)

<MmBackward0 object at 0x107b7ecb0>

This object encodes how to compute:

$$
\frac{\partial Y}{\partial X}, \quad \frac{\partial Y}{\partial W}.
$$

Thus, each forward operation implicitly defines its backward rule.
14. Backward Pass and the Chain Rule#
Calling:
L.backward()
initiates reverse-mode automatic differentiation:
1. Initialize \(\frac{\partial L}{\partial L} = 1\)
2. Traverse the computation graph in reverse
3. Apply the chain rule:

   $$
   \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial x}
   $$

4. Accumulate gradients for leaf tensors
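A minimal end-to-end illustration of these steps, using \(L = \sum_i x_i^2\) so that \(\frac{\partial L}{\partial x_i} = 2x_i\):

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
L = (x ** 2).sum()   # L = sum(x_i^2)
L.backward()         # reverse-mode AD applies dL/dx_i = 2 * x_i
print(x.grad)        # tensor([2., 4., 6.])
```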
15. Gradient Accumulation#

Gradients accumulate in .grad by default: every backward() call adds its result to whatever is already stored there. There is one caveat: the intermediate values saved in the graph are freed after the first backward() call, so naively calling it twice on the same loss fails:

L.backward()
L.backward()

RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.

If the first call passes retain_graph=True, the second backward() succeeds and the gradients add up:

$$
\text{grad} = \frac{\partial L}{\partial \theta} + \frac{\partial L}{\partial \theta}.
$$

This accumulation behavior is what enables gradient accumulation over several mini-batches, and it is also why training loops must clear gradients explicitly with optimizer.zero_grad().
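A minimal sketch of accumulation in action, using retain_graph=True so the graph survives the first call, and zeroing the gradient manually afterwards (mirroring what optimizer.zero_grad() does per parameter):

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
L = (3.0 * w).sum()            # dL/dw = [3., 3.]

L.backward(retain_graph=True)  # keep the graph alive for a second pass
print(w.grad)                  # tensor([3., 3.])

L.backward()                   # gradients ADD to the existing .grad
print(w.grad)                  # tensor([6., 6.])

w.grad.zero_()                 # clear accumulated gradients in place
print(w.grad)                  # tensor([0., 0.])
```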
16. Why NumPy Arrays Cannot Do This#

NumPy arrays:

- store only values
- do not record operation history
- have no backward rules
- lack gradient storage
Therefore, gradient-based learning in NumPy requires manual implementation of backpropagation.
17. Disabling Autograd#
To avoid graph construction during inference:
with torch.no_grad():
y = model(x)
This makes tensors behave more like NumPy arrays while preserving API consistency.
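A quick check that no graph is recorded inside the context:

```python
import torch

w = torch.randn(3, requires_grad=True)

with torch.no_grad():
    y = w * 2            # no graph node is recorded here

print(y.requires_grad)   # False
print(y.grad_fn)         # None
```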
Summary#
- PyTorch tensors extend arrays with differentiation metadata
- Computation graphs are built dynamically
- Gradients are computed by reverse traversal using the chain rule
- Autograd scales manual backpropagation reliably to large models