PyTorch 1.0 shines for rapid prototyping with dynamic neural networks, auto-differentiation, deep Python integration, and strong support for GPUs
Deep learning is an important part of the business of Google, Amazon, Microsoft, and Facebook, as well as countless smaller companies. It has been responsible for many of the recent advances in areas such as automatic language translation, image classification, and conversational interfaces.
We haven’t gotten to the point where there is a single dominant deep learning framework. TensorFlow (Google) is very good, but has been hard to learn and use. Also TensorFlow’s dataflow graphs have been difficult to debug, which is why the TensorFlow project has been working on eager execution and the TensorFlow debugger. TensorFlow used to lack a decent high-level API for creating models; now it has three of them, including a bespoke version of Keras.
CNTK (Microsoft) and Apache MXNet (Amazon) have been the principal competitors to TensorFlow, but there are other framework lineages to consider. Caffe (Berkeley Artificial Intelligence Research Lab), originally for image classification, was expanded and updated to Caffe2 (Facebook and others) and given strong production capabilities. Torch (Facebook, Twitter, Google, and others) uses Lua scripting and a CUDA (Compute Unified Device Architecture) C/C++ back end to efficiently solve problems in machine learning, computer vision, signal processing, and other fields. Despite its strengths as a scripting language, Lua became a liability to Torch when the bulk of the deep learning community adopted Python.
CUDA is Nvidia’s API for its general purpose GPUs. GPUs are much faster than CPUs for training and making predictions from deep neural networks; so are Google’s TPUs (tensor processing units) and FPGAs (field programmable gate arrays), which are available for use on AWS, Microsoft Azure, and elsewhere. In some cases, the use of advanced chips (GPUs, TPUs, or FPGAs) can speed up computations over CPUs by 50x per chip used, reducing training times from weeks to hours or from hours to minutes.
PyTorch (Facebook, Twitter, Salesforce, and others) builds on Torch and Caffe2, using Python as its scripting language and an evolved Torch CUDA back end. The production features of Caffe2 – highly scalable execution engine, accelerated hardware support, support for mobile devices, etc. – are being incorporated into the PyTorch project.
Tensors and neural networks in Python
PyTorch is billed by its developers as “Tensors and dynamic neural networks in Python with strong GPU acceleration.” What does that mean?
Tensors are a mathematical construct that is used heavily in physics and engineering. A tensor of rank two is a special kind of matrix; taking the inner product of a vector with the tensor yields another vector with a new magnitude and a new direction. TensorFlow takes its name from the way tensors (of synaptic weight, or the strength of connection between nodes) flow around its network model. NumPy also uses tensors, but calls them n-dimensional arrays (`ndarray`).
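As a small illustration of the points above, the following sketch builds a rank-two tensor from a NumPy `ndarray` and multiplies it by a vector to get another vector. The values and variable names are illustrative assumptions, not taken from the PyTorch documentation.

```python
import numpy as np
import torch

# A rank-two tensor (here a 2x2 diagonal matrix) created as a NumPy ndarray.
matrix_np = np.array([[2.0, 0.0],
                      [0.0, 3.0]])

# torch.from_numpy wraps the same memory as a torch tensor (no copy is made).
matrix = torch.from_numpy(matrix_np)

# A vector with a matching dtype (NumPy defaults to float64).
vector = torch.tensor([1.0, 1.0], dtype=torch.float64)

# The product of the rank-two tensor with a vector is another vector,
# generally with a new magnitude and direction.
result = matrix @ vector

# Convert back to a NumPy array when another Python library needs one.
result_np = result.numpy()
print(result_np)
```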
We’ve already discussed GPU acceleration. A dynamic neural network is one that can change from iteration to iteration. For example, a dynamic neural network model in PyTorch may add and remove hidden layers during training to improve its accuracy and generality. PyTorch recreates the graph on the fly at each iteration step. In contrast, TensorFlow by default creates a single dataflow graph, optimizes the graph code for performance, and then trains the model.
While eager execution mode is a fairly new option in TensorFlow, it’s the only way PyTorch runs: API calls execute when invoked, rather than being added to a graph to be run later. That might seem like it would be less computationally efficient, but PyTorch was designed to work that way, and it is no slouch when it comes to training or prediction speed.
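To make the idea of a graph that changes from iteration to iteration concrete, here is a minimal sketch. The class name `DynamicNet`, the layer sizes, and the random choice of depth are all hypothetical; the point is only that ordinary Python control flow decides the shape of the network on each forward pass.

```python
import random

import torch
import torch.nn as nn


class DynamicNet(nn.Module):
    """Hypothetical model that reuses a hidden layer a random number of times."""

    def __init__(self, d_in, hidden, d_out):
        super().__init__()
        self.input_linear = nn.Linear(d_in, hidden)
        self.middle_linear = nn.Linear(hidden, hidden)
        self.output_linear = nn.Linear(hidden, d_out)

    def forward(self, x):
        h = torch.relu(self.input_linear(x))
        # Plain Python control flow: the graph is rebuilt on every call,
        # so the number of hidden layers can differ between iterations.
        for _ in range(random.randint(0, 3)):
            h = torch.relu(self.middle_linear(h))
        return self.output_linear(h)


model = DynamicNet(d_in=8, hidden=16, d_out=2)
x = torch.randn(4, 8)

# Each call executes immediately (eagerly) and may trace a different graph.
out = model(x)
out.sum().backward()
print(out.shape)
```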
PyTorch architecture
At a high level, the PyTorch library contains the following components:
Package | Description |
---|---|
torch | A tensor library like NumPy, with strong GPU support. |
torch.autograd | A tape-based automatic differentiation library that supports all differentiable tensor operations in torch. |
torch.nn | A neural networks library deeply integrated with autograd and designed for maximum flexibility. |
torch.multiprocessing | Python multiprocessing, but with magical memory sharing of torch tensors across processes. Useful for data loading and Hogwild training. |
torch.utils | A data loader, trainer, and other utility functions for convenience. |
torch.legacy(.nn/.optim) | Legacy code that has been ported over from torch for backward compatibility reasons. |
PyTorch integrates acceleration libraries such as Intel MKL (Math Kernel Library) and the Nvidia cuDNN (CUDA Deep Neural Network) and NCCL (Nvidia Collective Communications) libraries to maximize speed. Its core CPU and GPU tensor and neural network back ends—TH (Torch), THC (Torch CUDA), THNN (Torch Neural Network), and THCUNN (Torch CUDA Neural Network)—are written as independent libraries with a C99 API. At the same time, PyTorch is not a Python binding into a monolithic C++ framework, but designed to be deeply integrated with Python and to allow the use of other Python libraries.
The memory usage in PyTorch is efficient compared to Torch and some of the alternatives. One of the optimizations is a set of custom memory allocators for the GPU, since available GPU memory can often limit the size of deep learning models that can be solved at GPU speeds.
CUDA GPU support in PyTorch goes down to the most fundamental level. In the example below, you see the code detecting a CUDA device, creating a tensor on the GPU, copying a tensor from CPU to GPU, adding the two tensors on the GPU, printing the result, and finally copying the result from GPU to CPU with a different data type and printing that result.
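A minimal sketch of that sequence of steps follows. The tensor shapes and variable names are illustrative assumptions, and the code falls back to the CPU when no CUDA device is detected so that it runs anywhere.

```python
import torch

# Detect a CUDA device, falling back to the CPU if none is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create one tensor on the CPU and one directly on the chosen device.
x = torch.ones(2, 3)
y = torch.ones_like(x, device=device)

# Copy the CPU tensor to the device, then add the two tensors there.
x = x.to(device)
z = x + y
print(z)

# Copy the result back to the CPU, changing the data type at the same time.
z_cpu = z.to("cpu", torch.double)
print(z_cpu)
```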
What about using multiple GPUs? `DataParallel`, a wrapper class in the `torch.nn` package, splits your data automatically and sends job orders to multiple copies of the model on several GPUs. After each copy finishes its job, `DataParallel` collects and merges the results before returning them to you.
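As a rough sketch of how that looks in practice, the code below wraps a small model in `nn.DataParallel` only when more than one GPU is visible. The model and batch size are hypothetical placeholders.

```python
import torch
import torch.nn as nn

# Hypothetical small model; any nn.Module can be wrapped the same way.
model = nn.Linear(10, 2)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Wrap the model only when several GPUs are present; DataParallel then
# splits each input batch along dimension 0 across the devices.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.to(device)

batch = torch.randn(32, 10, device=device)

# The outputs from all replicas are gathered back into one tensor.
output = model(batch)
print(output.shape)
```

The calling code is the same with one GPU, several GPUs, or none, which is the main appeal of this wrapper.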
PyTorch has the ability to snapshot a tensor whenever it changes, allowing you to record the history of operations on a tensor and automatically compute the gradients later.
Subsequent operations with a tensor that requires gradients may create new tensors, and those will inherit the `requires_grad` flag. You can change the `requires_grad` flag in place on a tensor at any time with the `requires_grad_(…)` method. In PyTorch, a trailing underscore on a method name such as `requires_grad_` means that it updates the tensor in place; methods without the trailing underscore generate a new tensor.
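The following short sketch shows the flag being inherited and the trailing-underscore convention. The tensors and values are illustrative only.

```python
import torch

# Trailing-underscore convention: add() returns a new tensor,
# add_() modifies the tensor it is called on.
x = torch.zeros(3)
y = x.add(1)   # x is unchanged, y is a new tensor
x.add_(1)      # x itself is updated in place

# Tensors do not track gradients unless asked to.
a = torch.ones(2, 2)
flag_before = a.requires_grad

# requires_grad_() flips the flag in place on the existing tensor.
a.requires_grad_(True)
flag_after = a.requires_grad

# A tensor computed from `a` inherits the flag and records how it was made.
b = (a * 3).sum()
print(flag_before, flag_after, b.requires_grad)
```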
How do those snapshots help compute gradients? The recorded history forms a graph of the operations that produced each tensor, and the framework walks that graph backward, applying the chain rule at each operation. The resulting gradients are exact up to floating-point rounding rather than finite-difference approximations. A backward pass also costs only a small constant multiple of the forward pass regardless of the number of parameters, whereas estimating derivatives by evaluating deltas around each state requires at least one extra forward evaluation per parameter.
In PyTorch, you compute the gradient using backpropagation (backprop) by calling the tensor’s `backward()` method, as shown in this animation, after clearing out any existing gradients from the neural network’s buffers. Then you can use that to update the weight tensor. In short, PyTorch programs create a graph on the fly. Then backpropagation uses the dynamically created graph, automatically calculating the gradients from the saved tensor states.
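A minimal sketch of that cycle on a toy problem follows. The weight values, the loss function, and the learning rate of 0.1 are assumptions chosen so the arithmetic is easy to check by hand.

```python
import torch

# A toy "weight" tensor that tracks gradients.
w = torch.tensor([1.0, 2.0], requires_grad=True)

# Forward pass: the graph for loss = w0^2 + w1^2 is built on the fly.
loss = (w ** 2).sum()

# Backward pass: autograd fills w.grad with d(loss)/dw = 2 * w.
loss.backward()
grad = w.grad.clone()

# Update the weights by hand, outside of gradient tracking.
learning_rate = 0.1  # assumed value for illustration
with torch.no_grad():
    w -= learning_rate * w.grad

# Clear the gradient buffer so the next backward() starts from zero;
# otherwise gradients from successive calls accumulate.
w.grad.zero_()

print(grad, w)
```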
PyTorch optimizers
Most of the weight update rules (optimizers) used to find the minimum error start from the gradient of the loss function: the next step moves the values in the direction opposite to the gradient, multiplied by a small learning rate to reduce the magnitude of the step. The basic algorithm is called steepest descent. For machine learning, the usual variant is stochastic gradient descent, or SGD, which computes each step from a batch of data points rather than the whole data set and often goes through the data multiple times (epochs).
More sophisticated versions of stochastic gradient descent, for example Adam and RMSprop, may compensate for biases, fold in momentum and velocity with the gradient, average gradients, or use adaptive learning rates. PyTorch currently supports 10 optimization methods.
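To show how interchangeable the optimizers are, this sketch minimizes a simple quadratic with both `torch.optim.SGD` and `torch.optim.Adam`. The helper function `minimize`, the target value of 3.0, the learning rates, and the step counts are hypothetical choices for illustration.

```python
import torch


def minimize(make_optimizer, steps):
    """Hypothetical helper: minimize (w - 3)^2 starting from w = 5."""
    w = torch.tensor([5.0], requires_grad=True)
    optimizer = make_optimizer([w])
    for _ in range(steps):
        optimizer.zero_grad()            # clear old gradients
        loss = ((w - 3.0) ** 2).sum()    # loss has its minimum at w = 3
        loss.backward()                  # compute d(loss)/dw
        optimizer.step()                 # apply the update rule
    return w.item()


# Plain gradient descent with a fixed learning rate.
w_sgd = minimize(lambda params: torch.optim.SGD(params, lr=0.1), steps=100)

# Adam adapts the step size per parameter; only the constructor changes.
w_adam = minimize(lambda params: torch.optim.Adam(params, lr=0.1), steps=200)

print(w_sgd, w_adam)
```

Only the optimizer constructor differs between the two runs; the training loop itself is unchanged.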
PyTorch neural networks
The `torch.nn` package defines modules and other containers, module parameters, 11 kinds of layers, 17 loss functions, 20 activation functions, and two kinds of distance functions. Each kind of layer has many variants, for example six convolution layers and 18 pooling layers.
The `torch.nn.functional` module defines 11 categories of functions. Somewhat confusingly, both `torch.nn` and `torch.nn.functional` contain loss and activation members. In many cases, however, the `torch.nn` member is little more than a wrapper for the corresponding `torch.nn.functional` member.
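The relationship between the two is easiest to see side by side. In this sketch, with illustrative input values, the module form and the functional form of ReLU and of mean squared error give identical results.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([-1.0, 0.0, 2.0])
target = torch.zeros(3)

# Activation: nn.ReLU is a module object, F.relu is a plain function.
relu_module = nn.ReLU()
same_activation = torch.equal(relu_module(x), F.relu(x))

# Loss: nn.MSELoss is a module object, F.mse_loss is a plain function.
loss_module = nn.MSELoss()
same_loss = torch.equal(loss_module(x, target), F.mse_loss(x, target))

print(same_activation, same_loss)
```

The module form is convenient inside containers, while the functional form is handy inside a custom `forward` method.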
This very simple model has two 2D convolution layers, and uses a rectified linear unit (ReLU) activation function for both layers. The three parameters to `nn.Conv2d` are the number of input channels, the number of output channels, and the size of the convolving kernel.
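A model matching that description might look like the sketch below. The class name `Model`, the channel counts (1, 20, 20), the kernel size of 5, and the 28x28 input are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Model(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.Conv2d(input channels, output channels, kernel size)
        self.conv1 = nn.Conv2d(1, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        # ReLU activation after each convolution layer.
        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))


model = Model()

# One single-channel 28x28 image; each 5x5 convolution without padding
# shrinks the height and width by 4.
image = torch.randn(1, 1, 28, 28)
features = model(image)
print(features.shape)
```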
You can also use one of the container modules to group your layers into a model.
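For instance, `nn.Sequential` chains layers in the order they are listed. The following sketch uses assumed channel counts and an assumed 28x28 input.

```python
import torch
import torch.nn as nn

# nn.Sequential applies its child modules one after another,
# so no custom forward() method is needed.
model = nn.Sequential(
    nn.Conv2d(1, 20, 5),
    nn.ReLU(),
    nn.Conv2d(20, 64, 5),
    nn.ReLU(),
)

image = torch.randn(1, 1, 28, 28)
features = model(image)
print(features.shape)
```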
PyTorch examples
The pytorch/examples repo contains worked-out models for MNIST digit classification using convolutional neural networks; word-level language modeling using LSTM RNNs; ImageNet image classification using residual networks; LSUN scene understanding using deep convolutional generative adversarial networks (DCGAN); variational auto-encoder networks; image super-resolution using an efficient sub-pixel convolutional neural network; artistic style transfer using perceptual loss functions; training a CartPole to balance in OpenAI Gym with actor-critic models; SNLI natural language inference with global vectors for word representation (GloVe), LSTMs, and torchtext; and time sequence prediction (sine wave signal values) using LSTMs.
The “Learning PyTorch with Examples” tutorial walks you through different ways of implementing machine learning with Python frameworks, before coming to the example below, which uses `torch.nn` and `torch.optim` to implement learning in a three-layer neural network model. In this case the loss function is mean squared error (`MSELoss`), and the optimizer chosen is Adam.
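A condensed sketch in the spirit of that tutorial example follows. It is not the tutorial's exact listing: the dimensions, random data, seed, learning rate, and number of iterations are assumptions picked to keep the run short.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Assumed sizes: batch, input features, hidden units, output features.
N, D_in, H, D_out = 64, 100, 50, 10

# Random inputs and targets stand in for a real data set.
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Input layer -> hidden layer with ReLU -> output layer.
model = nn.Sequential(
    nn.Linear(D_in, H),
    nn.ReLU(),
    nn.Linear(H, D_out),
)

loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

losses = []
for step in range(200):
    y_pred = model(x)            # forward pass
    loss = loss_fn(y_pred, y)    # mean squared error
    losses.append(loss.item())

    optimizer.zero_grad()        # clear gradients from the previous step
    loss.backward()              # backpropagate
    optimizer.step()             # Adam update of all model parameters

print(losses[0], losses[-1])
```

Swapping in a different loss or optimizer changes only the two lines that construct `loss_fn` and `optimizer`.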