## PyTorch basics
PyTorch is one of the most popular libraries for numerical computation and is currently among the most widely used libraries for machine learning research. In many ways PyTorch is similar to NumPy, with the additional benefit that it lets you run your computations on CPUs, GPUs, and TPUs without any material change to your code. PyTorch also makes it easy to distribute your computation across multiple devices or machines. One of its most important features is automatic differentiation. It computes the gradients of your functions analytically and efficiently, which is crucial for training machine learning models with gradient descent. Our goal here is to provide a gentle introduction to PyTorch and discuss best practices for using it.
The first thing to learn about PyTorch is the concept of Tensors. Tensors are simply multidimensional arrays. A PyTorch Tensor is very similar to a NumPy array with some ~~magical~~ additional functionality.
A tensor can store a scalar value:
import torch
a = torch.tensor(3)
print(a) # tensor(3)
or an array:
b = torch.tensor([1, 2])
print(b) # tensor([1, 2])
c = torch.zeros([2, 2])
print(c) # tensor([[0., 0.], [0., 0.]])
or any arbitrary dimensional tensor:
d = torch.rand([2, 2, 2])
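As a small sketch of what these objects carry, each tensor exposes its shape, its number of dimensions, and its element type. The variable names below are illustrative only and are not part of the original examples.

```python
import torch

scalar = torch.tensor(3)        # zero-dimensional tensor holding one integer
vector = torch.tensor([1, 2])   # one-dimensional tensor
cube = torch.rand([2, 2, 2])    # three-dimensional tensor, uniform in [0, 1)

# Print the shape, number of dimensions, and dtype of each tensor.
for name, t in [("scalar", scalar), ("vector", vector), ("cube", cube)]:
    print(name, tuple(t.shape), t.ndim, t.dtype)
```

Tensors built from Python integers default to an integer dtype, while `torch.rand` produces floating-point values.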
Tensors can be used to perform algebraic operations efficiently. One of the most commonly used operations in machine learning applications is matrix multiplication. Say you want to multiply two random matrices of size 3x5 and 5x4. This can be done with the matrix multiplication (`@`) operator:
x = torch.randn([3, 5])
y = torch.randn([5, 4])
z = x @ y
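A quick way to check the result, sketched here under the same 3x5 and 5x4 shapes, is to look at the output shape and compare the operator form with the equivalent `torch.matmul` call.

```python
import torch

x = torch.randn([3, 5])
y = torch.randn([5, 4])

z = x @ y                    # operator form
z_fn = torch.matmul(x, y)    # equivalent function form

print(z.shape)               # torch.Size([3, 4])
print(torch.allclose(z, z_fn))
```

The inner dimensions (5 and 5) must agree; the result takes the outer dimensions, 3x4.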
Similarly, to add two vectors of the same length, you can do:
p = torch.randn(5)
q = torch.randn(5)
z = p + q
To convert a tensor into a NumPy array you can call the tensor's `numpy()` method:
print(z.numpy())
And you can always convert a NumPy array into a tensor by:
import numpy as np
x = torch.tensor(np.random.normal(size=[3, 5]))
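One detail worth knowing when converting, shown here as a hedged sketch: `torch.tensor` copies the array's data, whereas `torch.from_numpy` returns a tensor that shares memory with the array. The names `arr`, `copied`, and `shared` are illustrative.

```python
import numpy as np
import torch

arr = np.random.normal(size=[3, 5])   # float64 NumPy array of shape (3, 5)
copied = torch.tensor(arr)            # copies the data into a new tensor
shared = torch.from_numpy(arr)        # shares memory with arr

arr[0, 0] = 42.0                      # modify the NumPy array in place

print(copied[0, 0].item() == 42.0)    # False: the copy is unaffected
print(shared[0, 0].item() == 42.0)    # True: the shared tensor sees the change
print(shared.dtype)                   # torch.float64, inherited from NumPy
```

Note that NumPy defaults to 64-bit floats while PyTorch defaults to 32-bit, so converted tensors may need a cast before being mixed with tensors created by `torch.randn`.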
### Automatic differentiation
The most important advantage of PyTorch over NumPy is its automatic differentiation functionality which is very useful in optimization applications such as optimizing parameters of a neural network. Let's try to understand it with an example.
Say you have a composite function which is a chain of two functions: `g(u(x))`.
To compute the derivative of `g` with respect to `x` we can use the chain rule which states that: `dg/dx = dg/du * du/dx`. PyTorch can analytically compute the derivatives for us.
To compute the derivatives in PyTorch first we create a tensor and set its `requires_grad` to true. We can use tensor operations to define our functions. We assume `u` is a quadratic function and `g` is a simple linear function:
x = torch.tensor(1.0, requires_grad=True)
def u(x):
    return x * x
def g(u):
    return -u
In this case our composite function is `g(u(x)) = -x*x`. So its derivative with respect to `x` is `-2x`. At point `x=1`, this is equal to `-2`.
Let's verify this. This can be done using the `grad` function in PyTorch, which returns a tuple with one gradient per input:
dgdx = torch.autograd.grad(g(u(x)), x)[0]
print(dgdx) # tensor(-2.)
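The same check can be sketched with `backward()`, which stores the gradient on the input tensor, and cross-checked against a central finite difference. The definitions of `u` and `g` follow the description above (quadratic and linear); the step size `eps` is an arbitrary choice for this illustration.

```python
import torch

def u(x):
    return x * x      # quadratic function

def g(u):
    return -u         # linear function

x = torch.tensor(1.0, requires_grad=True)
out = g(u(x))
out.backward()        # accumulates d(out)/dx into x.grad
print(x.grad)         # tensor(-2.)

# Numerical cross-check with plain Python floats.
eps = 1e-3
numeric = (g(u(1.0 + eps)) - g(u(1.0 - eps))) / (2 * eps)
print(numeric)        # approximately -2
```

The finite difference only approximates the derivative, while autograd applies the chain rule to the recorded operations.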
### Curve fitting
To see how powerful automatic differentiation can be, let's look at another example. Assume that we have samples from a curve (say `f(x) = 5x^2 + 3`) and we want to estimate `f(x)` from these samples. We define a parametric function `g(x, w) = w0 x^2 + w1 x + w2`, which is a function of the input `x` and latent parameters `w`. Our goal is then to find the latent parameters such that `g(x, w) ≈ f(x)`. This can be done by minimizing the following loss function: `L(w) = Σ (f(x) - g(x, w))^2`. Although there's a closed-form solution for this simple problem, we opt for a more general approach that can be applied to any differentiable function: stochastic gradient descent. We compute the average gradient of `L(w)` with respect to `w` over a set of sample points and move in the opposite direction.
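To make "move in the opposite direction" concrete, a single hand-written update might look like the sketch below. The step size, the zero initialization, and the fixed seed are assumptions made for this illustration; the loss is averaged over the samples rather than summed.

```python
import torch

torch.manual_seed(0)                      # fixed seed, only for repeatability

x = torch.rand(100) * 20 - 10             # sample points in [-10, 10)
y = 5 * x * x + 3                         # targets from f(x) = 5x^2 + 3
w = torch.zeros(3, requires_grad=True)    # w0, w1, w2 start at zero

def g(x, w):
    return w[0] * x * x + w[1] * x + w[2]

def loss_fn(w):
    # Mean squared difference between the targets and the model output.
    return torch.mean((y - g(x, w)) ** 2)

loss_before = loss_fn(w)
(grad,) = torch.autograd.grad(loss_before, w)

step_size = 1e-5                          # hypothetical value chosen for this sketch
with torch.no_grad():
    w -= step_size * grad                 # step against the gradient

loss_after = loss_fn(w)
print(loss_before.item(), loss_after.item())
```

With a sufficiently small step the loss decreases after the update. Repeating this step many times, usually through an optimizer rather than by hand, is the whole training procedure.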
Here's how it can be done in PyTorch:
import torch

# Assuming we know that the desired function is a polynomial of 2nd degree, we
# allocate a vector of size 3 to hold the coefficients and initialize it with
# random noise.
w = torch.randn([3, 1], requires_grad=True)

# We use the Adam optimizer with learning rate set to 0.1 to minimize the loss.
opt = torch.optim.Adam([w], 0.1)

def model(x):
    # We define yhat to be our estimate of y.
    f = torch.stack([x * x, x, torch.ones_like(x)], 1)
    yhat = torch.squeeze(f @ w, 1)
    return yhat

def compute_loss(y, yhat):
    # The loss is defined to be the mean squared error distance between our
    # estimate of y and its true value.
    loss = torch.nn.functional.mse_loss(yhat, y)
    return loss

def generate_data():
    # Generate some training data based on the true function
    x = torch.rand(100) * 20 - 10
    y = 5 * x * x + 3
    return x, y

def train_step():
    # Draw a fresh batch, compute the loss, and take one optimizer step.
    x, y = generate_data()
    yhat = model(x)
    loss = compute_loss(y, yhat)
    opt.zero_grad()
    loss.backward()
    opt.step()

for _ in range(1000):
    train_step()

print(w.detach().numpy())
By running this piece of code you should see a result close to this:
[4.9924135, 0.00040895029, 3.4504161]
This is a relatively close approximation of the true coefficients `[5, 0, 3]`.
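For comparison, the closed-form solution mentioned earlier can be sketched with `torch.linalg.lstsq`, which solves the least-squares problem directly. This is an illustrative alternative under the same data-generating assumptions, not part of the training code; 64-bit floats and a fixed seed are used here only to keep the result stable.

```python
import torch

torch.manual_seed(0)   # fixed seed, only for repeatability

x = torch.rand(100, dtype=torch.float64) * 20 - 10
y = 5 * x * x + 3

# Design matrix with columns x^2, x, and 1, matching the features of the model.
features = torch.stack([x * x, x, torch.ones_like(x)], 1)

# Least-squares solution of features @ w = y; the result has shape (3, 1).
w_closed = torch.linalg.lstsq(features, y.unsqueeze(1)).solution
print(w_closed.squeeze(1))
```

Because the data here is noise-free, the direct solve recovers the coefficients almost exactly, while the gradient-based approach generalizes to models with no closed-form solution.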
This is just the tip of the iceberg of what PyTorch can do. Many problems, such as optimizing large neural networks with millions of parameters, can be implemented efficiently in PyTorch in just a few lines of code. PyTorch takes care of scaling across multiple devices and threads, and supports a variety of platforms.