Tensors in PyTorch: What Changes Compared to the From-Scratch Implementation?#
In the previous section, we implemented a Multilayer Perceptron (MLP) from scratch using basic Python and NumPy arrays. All computations were expressed in terms of scalars, vectors, and matrices, and we explicitly managed:

- the forward pass,
- gradient derivations and the backward pass,
- parameter updates.
PyTorch introduces a new core data type: the tensor. While tensors may look similar to NumPy arrays, they add capabilities that are central to modern deep learning systems: automatic differentiation, hardware acceleration, and a library of optimized deep learning operators.
1. What Is a Tensor?#
In numerical computing, a tensor is a multi-dimensional array. The term emphasizes that we may work with data of arbitrary order (number of axes).
- Scalars are 0D tensors
- Vectors are 1D tensors
- Matrices are 2D tensors
- Higher-dimensional arrays are 3D+ tensors

Mathematically, one can view a tensor of order \(k\) as an element of a tensor product space:

$$
\mathbf{T} \in V_1 \otimes V_2 \otimes \cdots \otimes V_k.
$$
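In PyTorch, these orders map directly onto a tensor's number of axes, exposed as `.ndim`. A minimal illustration:

```python
import torch

# Tensors of increasing order (number of axes)
s = torch.tensor(3.14)        # 0D: scalar
v = torch.tensor([1.0, 2.0])  # 1D: vector
M = torch.zeros(2, 3)         # 2D: matrix
T = torch.zeros(4, 2, 3)      # 3D: e.g., a batch of matrices

print(s.ndim, v.ndim, M.ndim, T.ndim)  # 0 1 2 3
```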
In the MLP context, tensors represent the same objects you used earlier—only the container and execution model change.
| Mathematical object | From-scratch code | PyTorch |
|---|---|---|
| Scalar | Python float | 0D tensor |
| Vector | 1D NumPy array | 1D tensor |
| Matrix | 2D NumPy array | 2D tensor |
| Batch of matrices | 3D array | 3D tensor |
2. Why Not Just Use NumPy Arrays? (A More Convincing Answer)#
NumPy arrays are excellent numerical containers and are sufficient for forward computation. However, deep learning workloads require additional system-level guarantees and capabilities that NumPy does not provide out of the box:
1. Automatic differentiation (autograd)

   - Deep networks require gradients such as \(\nabla_\theta L(\theta)\) for millions of parameters \(\theta\).
   - With NumPy, gradients must be derived and coded manually or via external tools.

2. Hardware acceleration and device abstraction

   - Training modern models efficiently depends on GPUs (and sometimes other accelerators).
   - NumPy operations run on the CPU; GPU support requires switching libraries (e.g., CuPy) and re-auditing the pipeline.

3. A differentiable operator ecosystem

   - Deep learning uses specialized ops (convolutions, normalization, embedding lookups, fused kernels).
   - PyTorch provides these operators together with correct gradient rules and optimized kernels.

A useful summary is:

- NumPy: array computing (values only)
- PyTorch tensor: array computing plus gradient tracking plus device-aware execution plus deep-learning primitives
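The two containers also interoperate: `torch.from_numpy` wraps a NumPy array without copying, which makes the "values only" vs "values plus tracking" distinction concrete. A small sketch:

```python
import numpy as np
import torch

a = np.ones(3, dtype=np.float32)
t = torch.from_numpy(a)  # tensor and array share the same memory

t += 1                   # an in-place update on the tensor...
print(a)                 # ...is visible through the NumPy array: [2. 2. 2.]
```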
3. Side-by-Side: NumPy Arrays vs PyTorch Tensors (Values)#
Consider a linear layer (affine map):

$$
\mathbf{Y} = \mathbf{X}\mathbf{W} + \mathbf{b},
$$

with \(\mathbf{X} \in \mathbb{R}^{N \times d}\), \(\mathbf{W} \in \mathbb{R}^{d \times m}\), and \(\mathbf{b} \in \mathbb{R}^{m}\).
Both NumPy and PyTorch can compute \(\mathbf{Y}\) as a forward pass.
# NumPy: forward computation (values only)
import numpy as np
np.random.seed(0)
X = np.random.randn(4, 3) # N=4, d=3
W = np.random.randn(3, 2) # d=3, m=2
b = np.random.randn(2,) # m=2
Y_np = X @ W + b
Y_np
array([[ 3.2955051 , -0.70672864],
[ 1.38728186, 0.24221777],
[ 0.8147218 , -0.76782153],
[ 2.86228391, -1.05442875]])
# PyTorch: forward computation (values only)
import torch
torch.manual_seed(0)
X_t = torch.randn(4, 3)
W_t = torch.randn(3, 2)
b_t = torch.randn(2)
Y_t = X_t @ W_t + b_t
Y_t
tensor([[-0.6639, -0.6620],
[ 0.5748, -1.5384],
[-1.7279, -1.2307],
[-0.0104, -1.9583]])
At this point, the two libraries look similar. The crucial differences appear when we need gradients, devices, and training loops.
4. Tensors and Automatic Differentiation (Computation Graphs)#
In gradient-based learning, we minimize a loss \(L(\theta)\) over parameters \(\theta\) (weights and biases). Training requires the update

$$
\theta \leftarrow \theta - \eta \nabla_\theta L(\theta),
$$

where \(\eta\) is the learning rate.

In the from-scratch section, you explicitly coded partial derivatives such as:

$$
\frac{\partial L}{\partial \mathbf{W}}, \quad \frac{\partial L}{\partial \mathbf{b}}.
$$
PyTorch tensors can track computation graphs. If a tensor is created with requires_grad=True, PyTorch records the sequence of differentiable operations. Calling backward() applies the chain rule automatically.
The chain rule in backpropagation has the generic form:

$$
\frac{\partial L}{\partial \mathbf{W}} = \frac{\partial L}{\partial \mathbf{Y}}\,\frac{\partial \mathbf{Y}}{\partial \mathbf{W}}.
$$

For \(\mathbf{Y}=\mathbf{X}\mathbf{W}+\mathbf{b}\), this becomes:

$$
\frac{\partial L}{\partial \mathbf{W}} = \mathbf{X}^\top\frac{\partial L}{\partial \mathbf{Y}}, \qquad \frac{\partial L}{\partial \mathbf{b}} = \sum_{i=1}^{N} \frac{\partial L}{\partial \mathbf{Y}_{i,:}}.
$$
Side-by-Side: Manual Gradients (NumPy) vs Autograd (PyTorch)#
We will use a simple scalar loss:

$$
L = \sum_{i,j} Y_{ij}.
$$

Then \(\frac{\partial L}{\partial Y_{ij}} = 1\) for all entries, so \(\frac{\partial L}{\partial \mathbf{Y}}\) is a matrix of ones.
# NumPy: manual gradients for L = sum(Y)
grad_Y = np.ones_like(Y_np) # dL/dY
grad_W = X.T @ grad_Y # dL/dW = X^T dL/dY
grad_b = grad_Y.sum(axis=0) # dL/db = sum over batch
grad_W, grad_b
(array([[5.36563246, 5.36563246],
[2.26040156, 2.26040156],
[1.35251476, 1.35251476]]),
array([4., 4.]))
# PyTorch: autograd for the same computation
X_t = torch.randn(4, 3, requires_grad=True)
W_t = torch.randn(3, 2, requires_grad=True)
b_t = torch.randn(2, requires_grad=True)
Y = X_t @ W_t + b_t
L = Y.sum()
L.backward()
W_t.grad, b_t.grad
(tensor([[-0.8637, -0.8637],
[ 1.3759, 1.3759],
[ 0.8702, 0.8702]]),
tensor([4., 4.]))
Key takeaway: Autograd does not change the mathematics of backpropagation; it changes who writes the gradient code. You still conceptually start from the loss and propagate backward—PyTorch simply performs the bookkeeping consistently and efficiently.
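As a sanity check, the manual formulas and autograd can be run on the same data. A small sketch (fresh tensors, same shapes as above) comparing the two:

```python
import torch

# Sanity check: autograd should reproduce the manual chain-rule formulas
# dL/dW = X^T (dL/dY) and dL/db = sum over the batch, for L = sum(Y).
torch.manual_seed(0)
X = torch.randn(4, 3)
W = torch.randn(3, 2, requires_grad=True)
b = torch.randn(2, requires_grad=True)

Y = X @ W + b
L = Y.sum()
L.backward()                              # autograd gradients land in .grad

grad_Y = torch.ones_like(Y)               # dL/dY for L = sum(Y)
manual_grad_W = X.T @ grad_Y              # manual dL/dW
manual_grad_b = grad_Y.sum(dim=0)         # manual dL/db

print(torch.allclose(W.grad, manual_grad_W))  # True
print(torch.allclose(b.grad, manual_grad_b))  # True
```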
5. Tensor Data Types (dtype) and Why They Matter#
Every tensor has a dtype that controls numerical precision and valid operations. Common choices include:
- torch.float32 (default for neural network weights and activations)
- torch.float64 (higher precision; typically slower and rarely needed for standard training)
- torch.int64 (commonly used for class labels)
This becomes important in classification. For example, CrossEntropyLoss expects labels as integer class indices:

$$
y \in \{0, 1, \dots, C-1\},
$$

not one-hot vectors.
In the MNIST workflow:

- Inputs x are floating-point tensors (e.g., float32)
- Labels y are integer tensors (typically int64)
NumPy will often silently cast types in mixed operations, which can hide bugs. PyTorch is stricter in many training-critical paths.
# dtype illustration
x = torch.randn(2, 3) # float32 by default
y = torch.tensor([1, 0]) # int64 by default for integer literals
x.dtype, y.dtype
(torch.float32, torch.int64)
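One place this strictness shows up is inside layers themselves: a float64 input fed to a default float32 nn.Linear is rejected rather than silently cast. A small sketch:

```python
import torch

layer = torch.nn.Linear(3, 2)               # weights and bias are float32 by default
x64 = torch.randn(4, 3, dtype=torch.float64)

try:
    layer(x64)                               # float64 input meets float32 weights
except RuntimeError:
    print("dtype mismatch rejected")         # PyTorch refuses to silently cast

y = layer(x64.float())                       # an explicit cast makes the intent visible
print(y.dtype)                               # torch.float32
```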
6. Tensor Shape and Batching#
A major practical difference between educational “from-scratch” code and production deep learning code is batching.
For MNIST, a batch of images typically has shape

$$
(\text{batch}, \text{channels}, \text{height}, \text{width}) = (B, 1, 28, 28).
$$

An MLP expects a matrix of shape \((B, 784)\), so we reshape (flatten) each image:

$$
\mathbf{X} \in \mathbb{R}^{B \times 784}.
$$
In PyTorch, flattening is often written as:

x = x.view(x.size(0), -1)
# shape and flattening example
B = 128
x_batch = torch.randn(B, 1, 28, 28)
x_flat = x_batch.view(x_batch.size(0), -1)
x_batch.shape, x_flat.shape
(torch.Size([128, 1, 28, 28]), torch.Size([128, 784]))
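Equivalently, torch.flatten(x, start_dim=1) produces the same \((B, 784)\) matrix; a quick check:

```python
import torch

x = torch.randn(8, 1, 28, 28)
a = x.view(x.size(0), -1)           # flatten everything after the batch axis
b = torch.flatten(x, start_dim=1)   # equivalent built-in

print(a.shape, torch.equal(a, b))   # torch.Size([8, 784]) True
```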
7. Device Awareness (CPU vs GPU)#
PyTorch tensors are device-aware: each tensor lives on a specific device (CPU or GPU). The same code can run on a GPU by moving tensors and models to that device:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = x.to(device)
model = model.to(device)
NumPy arrays do not have this concept. To use a GPU in a NumPy-like workflow, you must typically switch libraries (and sometimes APIs), which increases complexity and maintenance cost.
# device illustration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(3, 4)
x_device = x.to(device)
x.device, x_device.device
(device(type='cpu'), device(type='cpu'))
8. Summary: Connecting Both Worlds#
The mathematics of the MLP is identical in both approaches.
The from-scratch implementation emphasizes understanding:

- explicit forward/backward derivations,
- explicit parameter updates.

PyTorch tensors emphasize scalability and correctness:

- automatic differentiation,
- standardized batching,
- device-aware execution,
- and a large library of optimized differentiable operators.
Learning tensors effectively does not replace understanding backpropagation—it operationalizes it for real training workloads.
At a high level, a PyTorch tensor is indeed a multi-dimensional array, similar in structure to a NumPy array. If we restrict attention only to numerical storage and basic linear algebra on the CPU, then NumPy and PyTorch tensors may appear interchangeable.
9. How Does a Tensor Store Information for Differentiation?#
In PyTorch, a tensor is not just a numerical array. In addition to storing values, a tensor carries metadata that enables automatic differentiation (autograd).
This section explains what information is stored, where it is stored, and how it is used during backpropagation.
10. Conceptual Structure of a Tensor#
A tensor participating in differentiation can be abstracted as:
- data: numerical values in CPU or GPU memory
- requires_grad: whether gradients should be tracked
- grad: stores \(\frac{\partial L}{\partial \text{tensor}}\) after backpropagation
- grad_fn: reference to the operation that created the tensor
This metadata differentiates PyTorch tensors from NumPy arrays.
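A quick sketch inspecting these attributes on a user-created tensor and the loss it produces:

```python
import torch

W = torch.randn(3, 2, requires_grad=True)
L = (W * 2).sum()
L.backward()

print(W.requires_grad)  # True: gradients are tracked for W
print(W.grad.shape)     # torch.Size([3, 2]): dL/dW is stored in .grad
print(W.grad_fn)        # None: W was created by the user, not by an operation
print(L.grad_fn)        # the backward function of the op that produced L
```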
11. Leaf Tensors vs Non-Leaf Tensors#
Leaf tensors#
- Created directly by the user
- Have requires_grad=True
- Store gradients in .grad
Example:
import torch
X = torch.randn(4, 3)
W = torch.randn(3, 2, requires_grad=True)
b = torch.randn(2, requires_grad=True)
Y = X @ W + b
L = Y.sum()
L.backward()
Non-leaf tensors#
- Results of operations
- Possess a grad_fn
- Do not store gradients by default
Intermediate activations in neural networks are typically non-leaf tensors.
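These distinctions are directly visible through the .is_leaf and .grad_fn attributes; a small sketch:

```python
import torch

X = torch.randn(4, 3)
W = torch.randn(3, 2, requires_grad=True)
Y = X @ W                      # produced by an operation

print(W.is_leaf, W.grad_fn)    # True None  (leaf: created by the user)
print(Y.is_leaf)               # False      (non-leaf: result of an op)
print(Y.grad_fn is not None)   # True       (records the matrix multiply)
```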
12. Computation Graph#
Each tensor operation adds a node to a directed acyclic graph (DAG):
For:
Y = X @ W + b
L = Y.sum()
The graph conceptually follows:

X ----\
       MatMul ---- Add ---- Sum ---- L
W ----/             ^
                    |
                    b
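This graph can be inspected programmatically: each grad_fn links to the backward functions of its inputs via next_functions. A small sketch walking one level back from the loss:

```python
import torch

X = torch.randn(4, 3, requires_grad=True)
W = torch.randn(3, 2, requires_grad=True)
b = torch.randn(2, requires_grad=True)
L = (X @ W + b).sum()

# Walk one level back from the loss node of the DAG
print(type(L.grad_fn).__name__)                       # SumBackward0
print(type(L.grad_fn.next_functions[0][0]).__name__)  # AddBackward0
```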
13. Backward Functions (grad_fn)#
The attribute grad_fn references a backward-function object generated by the forward operation.
For example:

Y = X @ W
print(Y.grad_fn)

<MmBackward0 object at 0x107b7ecb0>

This object encodes how to compute:

$$
\frac{\partial Y}{\partial X}, \quad \frac{\partial Y}{\partial W}.
$$

Thus, each forward operation implicitly defines its backward rule.
14. Backward Pass and the Chain Rule#
Calling:
L.backward()
initiates reverse-mode automatic differentiation:
1. Initialize \(\frac{\partial L}{\partial L} = 1\)
2. Traverse the computation graph in reverse
3. Apply the chain rule:

   $$
   \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial x}
   $$

4. Accumulate gradients for leaf tensors
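A minimal end-to-end illustration of these steps, using \(L = \sum_i x_i^2\) so that \(\frac{\partial L}{\partial x_i} = 2x_i\):

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
L = (x ** 2).sum()   # L = sum(x_i^2)
L.backward()         # reverse-mode AD applies dL/dx_i = 2 * x_i
print(x.grad)        # tensor([2., 4., 6.])
```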
15. Gradient Accumulation#

Gradients accumulate in .grad by default: every backward() call adds its result to whatever is already stored there. There is one caveat: the intermediate values saved in the graph are freed after the first backward() call, so naively calling it twice on the same loss fails:

L.backward()
L.backward()

RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.

If the first call passes retain_graph=True, the second backward() succeeds and the gradients add up:

$$
\text{grad} = \frac{\partial L}{\partial \theta} + \frac{\partial L}{\partial \theta}.
$$

This accumulation behavior is what enables gradient accumulation over several mini-batches, and it is also why training loops must clear gradients explicitly with optimizer.zero_grad().
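A minimal sketch of accumulation in action, using retain_graph=True so the graph survives the first call, and zeroing the gradient manually afterwards (mirroring what optimizer.zero_grad() does per parameter):

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
L = (3.0 * w).sum()            # dL/dw = [3., 3.]

L.backward(retain_graph=True)  # keep the graph alive for a second pass
print(w.grad)                  # tensor([3., 3.])

L.backward()                   # gradients ADD to the existing .grad
print(w.grad)                  # tensor([6., 6.])

w.grad.zero_()                 # clear accumulated gradients in place
print(w.grad)                  # tensor([0., 0.])
```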
16. Why NumPy Arrays Cannot Do This#

NumPy arrays:

- store only values
- do not record operation history
- have no backward rules
- lack gradient storage
Therefore, gradient-based learning in NumPy requires manual implementation of backpropagation.
17. Disabling Autograd#
To avoid graph construction during inference:
with torch.no_grad():
y = model(x)
This makes tensors behave more like NumPy arrays while preserving API consistency.
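A quick check that no graph is recorded inside the context:

```python
import torch

w = torch.randn(3, requires_grad=True)

with torch.no_grad():
    y = w * 2            # no graph node is recorded here

print(y.requires_grad)   # False
print(y.grad_fn)         # None
```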
Summary#
- PyTorch tensors extend arrays with differentiation metadata
- Computation graphs are built dynamically
- Gradients are computed by reverse traversal using the chain rule
- Autograd scales manual backpropagation reliably to large models