## Broadcasting the good and the ugly

PyTorch supports broadcasting of elementwise operations. Normally, when you want to perform operations like addition and multiplication, you need to make sure that the shapes of the operands match; e.g. you can't add a tensor of shape `[3, 2]` to a tensor of shape `[3, 4]`. But there's a special case: singleton dimensions (dimensions of size 1). PyTorch implicitly tiles a tensor across its singleton dimensions to match the shape of the other operand, so it is valid to add a tensor of shape `[3, 2]` to a tensor of shape `[3, 1]`:

```python
import torch

a = torch.tensor([[1., 2.], [3., 4.]])
b = torch.tensor([[1.], [2.]])

# Equivalent to c = a + b.repeat([1, 2]), but without materializing the tile.
c = a + b

print(c)
```

Broadcasting performs implicit tiling, which makes the code shorter and more memory efficient, since we don't need to store the result of the tiling operation.

One neat place where this can be used is combining features of varying length. To concatenate features of varying length, we commonly tile the input tensors, concatenate the result, and apply some nonlinearity. This is a common pattern across a variety of neural network architectures:

```python
a = torch.rand([5, 3, 5])
b = torch.rand([5, 1, 6])

linear = torch.nn.Linear(11, 10)

# Tile b along dim 1, concat with a, and apply a nonlinearity.
tiled_b = b.repeat([1, 3, 1])        # [5, 3, 6]
c = torch.cat([a, tiled_b], 2)       # [5, 3, 11]
d = torch.nn.functional.relu(linear(c))

print(d.shape)  # torch.Size([5, 3, 10])
```

But this can be done more efficiently with broadcasting. We use the fact that a linear layer applied to a concatenation equals the sum of two smaller linear layers applied to the parts: `linear([x, y]) = linear1(x) + linear2(y)`, where the weight matrix of `linear` is the concatenation (along the input dimension) of the weights of `linear1` and `linear2`.
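As a quick sanity check of this identity (a sketch, not part of the original pipeline), we can split the weight matrix of a single `Linear` into the two blocks acting on each input and verify that the results match up to floating-point error:

```python
import torch

torch.manual_seed(0)
a = torch.rand([5, 3, 5])
b = torch.rand([5, 3, 6])  # already tiled to match a along dim 1

linear = torch.nn.Linear(11, 10)
full = linear(torch.cat([a, b], 2))

# Split the [10, 11] weight into the parts acting on a (5 cols) and b (6 cols).
w_a, w_b = linear.weight[:, :5], linear.weight[:, 5:]
split = a @ w_a.t() + b @ w_b.t() + linear.bias

print(torch.allclose(full, split, atol=1e-5))  # True
```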
So we can do the linear operations separately and use broadcasting to do implicit concatenation:

```python
a = torch.rand([5, 3, 5])
b = torch.rand([5, 1, 6])

linear1 = torch.nn.Linear(5, 10)
linear2 = torch.nn.Linear(6, 10)

pa = linear1(a)  # [5, 3, 10]
pb = linear2(b)  # [5, 1, 10], broadcast across dim 1 in the addition below
d = torch.nn.functional.relu(pa + pb)

print(d.shape)  # torch.Size([5, 3, 10])
```

In fact this piece of code is pretty general and can be applied to tensors of arbitrary shape, as long as broadcasting between the tensors is possible:

```python
class Merge(torch.nn.Module):
    def __init__(self, in_features1, in_features2, out_features, activation=None):
        super().__init__()
        self.linear1 = torch.nn.Linear(in_features1, out_features)
        self.linear2 = torch.nn.Linear(in_features2, out_features)
        self.activation = activation

    def forward(self, a, b):
        pa = self.linear1(a)
        pb = self.linear2(b)
        c = pa + pb
        if self.activation is not None:
            c = self.activation(c)
        return c
```

So far we have discussed the good part of broadcasting. But what's the ugly part, you may ask? Implicit assumptions almost always make debugging harder. Consider the following example:

```python
a = torch.tensor([[1.], [2.]])
b = torch.tensor([1., 2.])

c = torch.sum(a + b)

print(c)
```

What do you think the value of `c` will be after evaluation? If you guessed 6, that's wrong. It's going to be 12. This is because when the ranks of two tensors don't match, PyTorch prepends singleton dimensions to the tensor with the lower rank before the elementwise operation. So `b`, of shape `[2]`, is treated as shape `[1, 2]` and broadcast against `a`, of shape `[2, 1]`; the result of the addition is `[[2, 3], [3, 4]]`, and reducing over all elements gives 12.

The way to avoid this problem is to be as explicit as possible.
Had we specified which dimension to reduce across, catching this bug would have been much easier:

```python
a = torch.tensor([[1.], [2.]])
b = torch.tensor([1., 2.])

c = torch.sum(a + b, 0)

print(c)
```

Here the value of `c` is `[5., 7.]`, and the shape of the result alone immediately suggests that something is wrong. A general rule of thumb is to always specify the dimension in reduction operations and when using `torch.squeeze`.
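To illustrate why this also matters for `torch.squeeze`, here is a minimal sketch (the tensor below is a made-up example): when a batch happens to contain a single element, a dimension-less `squeeze` silently removes the batch dimension too, while `squeeze(dim)` removes only the dimension you intended.

```python
import torch

x = torch.rand([1, 1, 5])  # e.g. batch size 1, one channel, 5 features

print(x.squeeze().shape)   # torch.Size([5])    -- both size-1 dims removed
print(x.squeeze(1).shape)  # torch.Size([1, 5]) -- only the intended dim removed
```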