Memory-mapped files are an underused tool in machine learning projects. They offer very fast I/O, which makes them well suited for storing datasets that do not fit into memory during training. In this talk, we will discuss the benefits of memory maps, their downsides, and how to address them.
When working on a machine learning project, one of the most time-consuming parts is training the model. A large share of training time is often spent on filesystem I/O, which is very slow, especially in the context of computer vision. In this talk, we will focus on using memory maps to store datasets during training, which can significantly reduce training time. We will also compare memory maps with other ways of storing a dataset during training, such as in-memory datasets, one image per file, and HDF5 files, and describe the strengths and weaknesses of each approach. Colab notebooks will be provided, and we will show practical examples of significant performance improvements to popular online tutorials. We will also show how to address common shortcomings and pain points of using memory maps in machine learning projects.
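As a taste of the approach, here is a minimal sketch of a memory-mapped image dataset built with numpy.memmap; the file name, shapes, and dtype are illustrative assumptions, not the exact code presented in the talk.

```python
import numpy as np

# Illustrative dimensions for an image dataset.
n_images, height, width, channels = 10_000, 224, 224, 3

# Create the memory-mapped file once, writing the images into it.
train_data = np.memmap(
    "train.dat", dtype=np.uint8, mode="w+",
    shape=(n_images, height, width, channels),
)
# ... fill train_data[i] with each image array, then flush to disk.
train_data.flush()

# During training, open the same file read-only. The OS loads pages
# lazily, so the whole dataset never has to fit into RAM at once.
train_data = np.memmap(
    "train.dat", dtype=np.uint8, mode="r",
    shape=(n_images, height, width, channels),
)
batch = train_data[:32]  # reads only the pages backing these rows
```

Because reads go through the page cache rather than per-file open/read calls, random access to individual samples is typically much cheaper than loading one image file at a time.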
Speakers: Hristo Vrigazov