Memory is now a key component to understand and control in order to extract the best performance from High Performance Computing architectures. With the increasing gap between CPU and memory speed, memory accesses now represent one of the largest performance factors, which magnifies the impact of the memory layout on performance. At the same time, the available memory size keeps growing, with terabyte systems now easily reachable: a huge space to manage, whose management cost can become non-negligible if handled badly.
In order to better address these questions, I will present two tools I developed during my post-docs in the HPC field: MALT (MALloc Tracker) and NUMAPROF.
MALT is dedicated to memory management analysis: it tracks calls to malloc and annotates the source code of a C/C++/Fortran application. It provides many metrics about memory management, such as allocation sizes, chunk lifetimes, and time charts, to better understand the memory behavior of your application. The tool comes with a nice web-based graphical interface.
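To make this concrete, here is a minimal, hypothetical C example of the kind of pattern MALT's per-call-site metrics (allocation count, size, lifetime) would expose: many small, short-lived allocations inside a hot loop.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical hot loop: the buffer is allocated and freed on every
 * iteration, so a call-site view showing millions of tiny, short-lived
 * chunks on this malloc line suggests hoisting it out of the loop. */
void process(const char *input, size_t n, size_t iterations)
{
    for (size_t i = 0; i < iterations; i++) {
        char *scratch = malloc(n);
        memcpy(scratch, input, n);
        /* ... work on scratch ... */
        free(scratch);
    }
}

int main(void)
{
    static char input[4096];
    process(input, sizeof(input), 1000000);
    return 0;
}
```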
NUMAPROF is built around the same code base and provides NUMA (Non Uniform Memory Access) performance analysis by tracking the memory location of every access in a NUMA system through binary instrumentation. It makes it easy to track remote or unbound memory accesses, which can be a source of large performance losses on such systems. The tool has so far mostly been tested on the recent Intel Knights Landing architecture.
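As an illustration of what a remote access looks like, here is a small sketch using libnuma (my choice for the example; NUMAPROF itself instruments any binary and needs no such code): memory is bound to node 0 but touched by a thread running on node 1, so every access is remote.

```c
/* build: gcc remote.c -o remote -lnuma -pthread
 * (requires libnuma and a machine with at least 2 NUMA nodes) */
#include <numa.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define SIZE (64UL * 1024 * 1024)

static char *buffer;

/* Thread running on node 1 touching pages bound to node 0:
 * every access is remote, exactly the pattern NUMAPROF flags. */
static void *worker(void *arg)
{
    (void)arg;
    numa_run_on_node(1);
    memset(buffer, 0xff, SIZE);
    return NULL;
}

int main(void)
{
    if (numa_available() < 0 || numa_max_node() < 1) {
        fprintf(stderr, "need a NUMA machine with at least 2 nodes\n");
        return 1;
    }
    buffer = numa_alloc_onnode(SIZE, 0);  /* pages placed on node 0 */
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);
    numa_free(buffer, SIZE);
    return 0;
}
```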
With terabytes of memory now available on large HPC servers and NUMA architectures, memory management can be one of the biggest and hardest-to-handle bottlenecks when aiming for performance at scale on large systems.
During my PhD, I studied and developed a parallel memory allocator specifically designed for NUMA architectures and tuned for large-scale applications. On this topic I quickly gained a 2x speedup on a multi-million-line C++ numerical simulation at CEA just by changing the memory allocator (without recompiling the application), running on a 16-processor machine (128 cores) and making 75 million allocations in 5 minutes.
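Swapping the allocator without recompiling relies on the standard LD_PRELOAD mechanism; as a minimal sketch (not the actual allocator, which implements NUMA-aware policies), a shared library only has to export malloc and free:

```c
/* build: gcc -shared -fPIC -o myalloc.so myalloc.c -ldl
 * run:   LD_PRELOAD=./myalloc.so ./app */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>

static void *(*real_malloc)(size_t);
static void  (*real_free)(void *);

void *malloc(size_t size)
{
    if (!real_malloc)  /* resolve the libc symbol lazily */
        real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");
    /* a real allocator applies its own policy here
     * (thread-local pools, NUMA binding, ...) */
    return real_malloc(size);
}

void free(void *ptr)
{
    if (!real_free)
        real_free = (void (*)(void *))dlsym(RTLD_NEXT, "free");
    real_free(ptr);
}
```

In practice a production allocator also overrides calloc, realloc and friends, but the mechanism is the same.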
The gains came from better handling of the memory management overhead than the OS (Linux), which did not scale well on this platform, combined with explicit handling of the NUMA topology; together these provided the 2x performance gap over the best production memory allocators of the day.
For this development I would have been really happy to have a memory profiler to see how the memory was placed and what the allocation profile of the application was. This is what MALT and NUMAPROF provide by projecting dedicated memory metrics onto the source code, annotating it with many metrics (allocation sizes, chunk lifetimes, NUMA remote memory accesses...). The tools also provide global views of the application through charts and computed metrics.
The two tools also take an approach that is uncommon for profilers in HPC: they provide a web-based interface built with D3JS/Bootstrap/jQuery and served by a small Python web server. This fixes a big issue when working remotely, where the GUI of a profiler otherwise has to be X-forwarded, making it slow and badly themed, or run locally without the source code at hand. The web server makes it easy to ssh-port-forward the interface (e.g. ssh -L 8080:localhost:8080 server) and even lets several people look at the same profile remotely. It also quickly provides a nicer rendering with less development overhead.
MALT was developed and financed during a post-doc at the Exascale Computing Research Lab. NUMAPROF was then derived from this work as a side research project at CERN.
I will briefly present both tools and some motivating examples showing the potentially huge impact of memory management on HPC applications, one of them with roughly one million lines of code.
Speakers: Sébastien Valat