Would snmalloc (https://news.ycombinator.com/item?id=37851210) help with scalable threading or not? It claims to be better at allocating memory on a producer and freeing on a consumer thread, and "Freeing memory in a different thread to initially allocated it, does not take any locks and instead uses a novel message passing scheme to return the memory to the original allocator, where it is recycled. This enables 1000s of remote deallocations to be performed with only a single atomic operation enabling great scaling with core count."
If you have a long-lived thread, consider giving it a few big blocks up front and a “pool” it can grab memory from. In C++ you can use a shared pointer with a custom deleter to hand each block back to the pool.
This avoids allocator contention in user space and reduces fragmentation. It also lets you bound memory usage by blocking until a block is free.
If memory serves, Boost has some code to help there, though I did it myself.
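A minimal sketch of the pool idea, assuming fixed-size blocks carved from one big arena. This is not the code I used and not any Boost API; `BlockPool`, `create`, `acquire`, `release` and `available` are made-up names for illustration. The shared pointer's custom deleter returns the block to the pool, and `acquire` blocks when the pool is empty, which is what bounds memory usage.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical fixed-size block pool; every name here is illustrative.
class BlockPool : public std::enable_shared_from_this<BlockPool> {
public:
    static std::shared_ptr<BlockPool> create(std::size_t block_size,
                                             std::size_t block_count) {
        return std::shared_ptr<BlockPool>(new BlockPool(block_size, block_count));
    }

    // Waits until a block is free, so total memory stays bounded.
    std::shared_ptr<char> acquire() {
        char* block = nullptr;
        {
            std::unique_lock<std::mutex> lock(mutex_);
            cv_.wait(lock, [this] { return !free_.empty(); });
            block = free_.back();
            free_.pop_back();
        }
        // The deleter keeps the pool alive and returns the block instead of
        // freeing it. It can run on any thread, e.g. a consumer thread.
        auto self = shared_from_this();
        return std::shared_ptr<char>(block, [self](char* p) { self->release(p); });
    }

    // Number of blocks currently free (mainly useful for tests and metrics).
    std::size_t available() {
        std::lock_guard<std::mutex> lock(mutex_);
        return free_.size();
    }

private:
    BlockPool(std::size_t block_size, std::size_t block_count)
        : arena_(block_size * block_count) {
        // Reserve up front so returning a block never allocates.
        free_.reserve(block_count);
        for (std::size_t i = 0; i < block_count; ++i) {
            free_.push_back(arena_.data() + i * block_size);
        }
    }

    void release(char* block) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            free_.push_back(block);
        }
        cv_.notify_one();  // wake one thread waiting in acquire()
    }

    std::vector<char> arena_;   // one big allocation made at startup
    std::vector<char*> free_;   // stack of free blocks inside the arena
    std::mutex mutex_;
    std::condition_variable cv_;
};
```

Usage would look like `auto pool = BlockPool::create(4096, 64); auto buf = pool->acquire();`, with `buf` handed to whichever thread consumes it. Two caveats: the pool still takes a lock of its own, so this only moves contention off the general-purpose allocator, and `std::shared_ptr` allocates a small control block per `acquire`, which a more serious version would want to avoid.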
It’s a shame that they don’t compare against mimalloc, which is another Microsoft project, and it’s unclear which tcmalloc they are comparing against (the gperftools one is stale and performs worse than the current standalone release).