In the previous post I explained that, after a certain point, improving the efficiency of the data loaders or increasing their number has only a marginal impact on deep learning training speed. Today I will explain a method that can further speed up your training, provided that you have already achieved sufficient data loading efficiency.

Figure: Comparison of training pipelines with and without a prefetcher.

In a typical deep learning pipeline, the batch data must be loaded from CPU to GPU before the model can be trained on that batch. This is typically done in sequence, and this device-transfer time is added to your overall training time. As you can see, the CPU tensor is loaded to GPU memory and then processed by the model in sequence. The mean loading time of one batch on the CPU for each loader process is 80 ms, the mean transfer time from CPU to GPU is 15 ms, and the mean model time (forward + backward pass) is 30 ms. This pipeline processed 20 batches during the first second: with enough loader processes to keep up, the GPU-side work is 15 ms of transfer plus 30 ms of compute per batch, so the sequential pipeline tops out at roughly 1000 / 45 ≈ 22 batches per second.

It is possible to further parallelize this pipeline. The data for the next batch can be loaded to the GPU while the model is working on the current batch. The component that fetches the data in parallel is called a data prefetcher.
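Below is a minimal sketch of such a prefetcher in PyTorch, assuming each batch is a sequence of CPU tensors; the class name DataPrefetcher and its interface are my own illustration, not code from the original post. The idea is to issue the host-to-device copy for the next batch on a side CUDA stream so it overlaps with the model's compute on the current batch.

```python
import torch

class DataPrefetcher:
    """Wraps a DataLoader so the next batch is copied to the GPU on a
    side CUDA stream while the model processes the current batch.
    (Illustrative sketch, not the original post's implementation.)"""

    def __init__(self, loader, device="cuda"):
        self.loader = iter(loader)
        self.device = device
        self.stream = torch.cuda.Stream()  # side stream for host-to-device copies
        self._preload()

    def _preload(self):
        try:
            batch = next(self.loader)
        except StopIteration:
            self.next_batch = None
            return
        # Issue the copy on the side stream; non_blocking=True only
        # overlaps with compute if the source tensors are in pinned memory.
        with torch.cuda.stream(self.stream):
            self.next_batch = [t.to(self.device, non_blocking=True) for t in batch]

    def __iter__(self):
        return self

    def __next__(self):
        if self.next_batch is None:
            raise StopIteration
        # Make the compute stream wait until the prefetch copy has finished,
        # then tie the tensors' lifetime to the compute stream so the caching
        # allocator does not recycle their memory too early.
        torch.cuda.current_stream().wait_stream(self.stream)
        batch = self.next_batch
        for t in batch:
            t.record_stream(torch.cuda.current_stream())
        self._preload()  # start copying the following batch right away
        return batch
```

To get the overlap in practice, construct the DataLoader with pin_memory=True (otherwise the non_blocking copy silently falls back to a synchronous one) and iterate over DataPrefetcher(loader) instead of loader in the training loop. With the timings above, the 15 ms transfer then hides behind the 30 ms model step, so throughput is bounded by compute alone.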