Koila - Prevent `CUDA error: out of memory` errors in 1 line
Features
- Prevents `CUDA error: out of memory` errors with one single line of code.
- Automatically accumulates gradients when batch sizes are too large.
- Lazily evaluates PyTorch code to save computing power.
- Automatically splits along the batch dimension to more GPU-friendly numbers (powers of 2) to speed up the execution.
- Minimal API (wrapping all inputs will be enough).
Why Koila?
Ever encountered `RuntimeError: CUDA error: out of memory`?
We all love PyTorch because of its speed, efficiency, and transparency, but that also means it doesn't do extra things for you. Things like preventing a very common error that has been bothering many users since 2017.
This library aims to prevent that by being a lightweight wrapper over native PyTorch. When a tensor is wrapped, the library automatically computes the amount of remaining GPU memory and uses the right batch size, saving everyone from having to manually fine-tune the batch size whenever a model is used.
The library also automatically picks a batch size that plays well with the GPU. Did you know that using bigger batches doesn't always speed up processing? That's handled automatically in this library too.
Because Koila code is PyTorch code, and it runs PyTorch under the hood, you can use both together without worrying about compatibility.
Oh, and all that in 1 line of code!
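To make that concrete, here is a rough sketch of what that one line might look like. The `lazy` wrapper and the `batch=0` keyword below are assumptions, not something confirmed by this post, so check the Koila README for the exact API:

```python
import torch

from koila import lazy  # assumed import; see the Koila README for the real API

# Plain PyTorch tensors, with the batch dimension first.
inputs = torch.randn(128, 28 * 28)
labels = torch.randint(0, 10, (128,))

# The "one line": wrap the tensors and tell the library which dimension is the batch.
(inputs, labels) = lazy(inputs, labels, batch=0)

# Everything downstream stays ordinary PyTorch code; the wrapped tensors decide
# how large a slice of the batch actually fits in GPU memory when values are needed.
```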
How does it work under the hood?
`CUDA error: out of memory` generally happens in the forward pass, because temporary variables (intermediate results that will be needed for the backward pass) have to be kept in memory.
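As a quick, illustrative check (this snippet is my own sketch, not part of Koila, and assumes a CUDA device is available), you can compare the memory held by a model's parameters with the memory held after a single forward pass:

```python
import torch
import torch.nn as nn

# Memory held by the parameters alone.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).cuda()
params_only = torch.cuda.memory_allocated()

# Memory after a forward pass: the input and the intermediate activations
# are kept alive so that backward() can use them later.
x = torch.randn(4096, 1024, device="cuda")
out = model(x)
after_forward = torch.cuda.memory_allocated()

print(f"parameters only: {params_only / 2**20:.1f} MiB")
print(f"after forward:   {after_forward / 2**20:.1f} MiB")
```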
Koila is a thin wrapper around PyTorch. It is inspired by TensorFlow's static/lazy evaluation. By building the graph first and running the model only when necessary, the library has access to all the information it needs to determine how many resources are really required to compute the model.
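As an illustration only (this toy class is not Koila's implementation), a lazily evaluated tensor can record what to compute and what shape the result will have, and execute nothing until explicitly asked:

```python
import torch

class LazyTensor:
    """Toy lazy tensor: stores a thunk plus the known output shape."""

    def __init__(self, compute, shape):
        self._compute = compute      # zero-argument function producing the real tensor
        self.shape = shape           # shape is known without running anything

    @staticmethod
    def wrap(tensor):
        return LazyTensor(lambda: tensor, tuple(tensor.shape))

    def __add__(self, other):
        # Output shape equals the input shape (ignoring broadcasting for simplicity).
        return LazyTensor(lambda: self.run() + other.run(), self.shape)

    def log(self):
        return LazyTensor(lambda: torch.log(self.run()), self.shape)

    def run(self):
        return self._compute()

a = LazyTensor.wrap(torch.rand(8, 3))
b = LazyTensor.wrap(torch.rand(8, 3))
c = (a + b).log()      # no arithmetic has happened yet
print(c.shape)         # (8, 3): the shape is available before execution
print(c.run())         # only now is the actual computation performed
```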
In terms of memory usage, only the shapes of temporary variables are required to calculate how much memory those variables take up in the model. For example, `+` takes two tensors of equal size and outputs a tensor of that same size, and `log` takes one tensor and outputs another with the same shape. Broadcasting makes it a little more complicated than that, but the general idea is the same. By tracking all these shapes, one can easily tell how much memory is used in a forward pass, and select the optimal batch size accordingly.
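A back-of-the-envelope version of that bookkeeping (illustrative only, with made-up shapes and float32 assumed) needs nothing but the shapes:

```python
from math import prod

def tensor_bytes(shape, dtype_bytes=4):
    """Bytes needed for a tensor of the given shape (float32 assumed)."""
    return prod(shape) * dtype_bytes

batch, features = 64, 1024
x_shape = (batch, features)

# y = x + x has the same shape as x, and z = log(y) has that shape again,
# so the temporaries of this tiny graph are two (batch, features) tensors.
temporaries = [x_shape, x_shape]
total = sum(tensor_bytes(s) for s in temporaries)
print(f"batch size {batch}: ~{total / 2**20:.2f} MiB of temporaries")
```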