That's cool but I think the proper solution is to write a Linux kernel module that can reserve GPU RAM via DRM to create ramdisks, not create a userspace filesystem using OpenCL.
That would give proper caching, direct mmap support if desired, a reliable, correct and concurrent filesystem (as opposed to this author's "all of the FUSE callbacks share a mutex to ensure that only one thread is mutating the file system at a time"), etc.
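To make the concurrency criticism concrete: even staying in userspace, the single global mutex could be replaced by per-file locks. A minimal hypothetical sketch (names are mine, not the author's code):

```python
import threading

class LockTable:
    """Hypothetical sketch of per-file locking, as an alternative to one
    global mutex around every FUSE callback. Not vramfs code."""
    def __init__(self):
        self._meta = threading.Lock()  # guards the lock table itself
        self._locks = {}               # path -> threading.Lock

    def lock_for(self, path):
        # Only the table lookup is globally serialized,
        # not the actual file operation.
        with self._meta:
            if path not in self._locks:
                self._locks[path] = threading.Lock()
            return self._locks[path]

table = LockTable()
with table.lock_for("/a"):
    # A writer to a different file is not blocked by the lock on /a:
    got_b = table.lock_for("/b").acquire(blocking=False)
if got_b:
    table.lock_for("/b").release()
```

A real filesystem also has to lock directory and allocation metadata, of course, but the point stands that per-callback serialization is a design shortcut, not a requirement.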
I'd *HIGHLY* recommend this video to anyone here. It is exactly the kind of fun, silly computer science stuff where you also learn a shit ton. His channel is full of this stuff.
"Don't ask why, ask why not" is essentially the motto of his channel, and it is the best. It leads to lots of innovation, and I think we should all encourage more of this kind of thing.
So that's a Gen 2 CPU, with DDR3 RAM and a PCIe 3.0 GPU.
On a modern system, with a recent kernel+FUSE, I expect the results would be much better.
But we also now have the phram kernel module, which lets you create a block device that bypasses FUSE entirely, so phram should give even better performance than vramfs.
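For reference, a phram setup looks roughly like this. The physical address and size below are placeholders (find your GPU's large prefetchable BAR with `lspci -v`), and since phram registers an MTD device, mtdblock supplies the block-device layer:

```shell
# Addresses are placeholders -- use your GPU's BAR address from lspci -v
modprobe phram phram=vram,0xd0000000,0x10000000   # name,start,length
modprobe mtdblock                                 # exposes /dev/mtdblock0
mkfs.ext2 /dev/mtdblock0                          # ext2: no journal churn
mount /dev/mtdblock0 /mnt/vram
```

This requires root and depends on the GPU actually exposing its VRAM through a BAR (resizable BAR helps here), so treat it as a sketch rather than a recipe.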
It is not precious if you don't run LLMs or play games. For many people like myself, the video card is idle most of the time.
Using its RAM to speed up compilation or similar is not a bad idea.
What is the overhead on a FUSE filesystem compared to being implemented in the kernel? Could something like eBPF be used to make a faster FUSE-like filesystem driver?
> What is the overhead on a FUSE filesystem compared to being implemented in the kernel?
The overhead is quite high, because of the additional context switching and copying of data between user and kernel space.
> Could something like eBPF be used to make a faster FUSE-like filesystem driver?
eBPF can't really address any of the problems I noted above. To improve performance, one would need to change how the interface between the kernel and the userspace part of a FUSE filesystem works to make it more efficient.
That said, FUSE support for io_uring, which was merged recently in Linux 6.14, has potential there; see:
There is considerable overhead from the userspace <> kernel <> userspace switches; you can see something similar with WireGuard if you compare the performance of its Go client vs. the kernel driver.
Some FUSE drivers can avoid the overhead by letting the kernel know that the backing resource of a FUSE filesystem can be handled by the kernel directly (e.g. for a FUSE-based overlay FS where the backing storage is XFS or something), but that probably isn't applicable here.
If you're in kernel space though I don't think you'd have access to OpenCL so easily, you'd need to reimplement it based on kernel primitives.
> What is the overhead on a FUSE filesystem compared to being implemented in the kernel?
It depends on your use case.
If you serve most of your requests from kernel caches, then FUSE doesn't add any overhead, because cached requests never reach the userspace daemon. That was the case for me, when I had a FUSE service running to serve every commit from every branch (across all of history) simultaneously as directories, directly from the data in a .git folder.
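For workloads like that it helps to explicitly turn up the kernel-side caching; with a standard libfuse filesystem the mount options look something like this (`myfs` is a placeholder binary):

```shell
# kernel_cache keeps page-cache contents across open();
# entry/attr timeouts let lookups be answered without hitting the daemon
myfs /mnt/repo -o kernel_cache,entry_timeout=60,attr_timeout=60
```

Long timeouts are safe here because git objects are immutable, which is exactly why this use case caches so well.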
Somewhat related, there is NVIDIA GPUDirect Storage[0], which provides an API for efficient “file transfer” between the GPU and the local filesystem. I've always wanted to give it a try but haven't yet.
> I have 192GB of CPU VRAM in my desktop and that was cheap to obtain.
How? Or what's "cheap" here? (Because I wouldn't call 192G of just regular RAM that's plugged into the motherboard cheap, I think everything else is more expensive, and if there's some hack here that I haven't caught I very much would like to know about it)
Which is pretty cheap compared to the cost of my whole build and whatever other things I've spent on. Cheap is relative, but I'm just saying that if you're going to spend $3000+ on a build, and you love to work with massive datasets, VMs, and things, $500 for a metric fuckton of RAM so that your system is never, ever swapping, is a very worthwhile thing to spend on.
192GB worth of GPU will cost you about $40000, for reference, and will be less performant if your goal is just a vramfs for CPU tasks.
* Beware that using 4 DDR5 slots will cut your memory bandwidth in half on consumer motherboards and CPUs. But I willingly made that tradeoff. Maybe at some point I'll upgrade to a server motherboard and CPU.
Couple of reasons.
1. You can use VRAM when you don't have massive amounts of RAM for a ramdisk (or /dev/shm)
2. Depending on the implementation, you might get faster random seeks/writes than normal RAM.
3. You could presumably run certain GPU kernels directly on the vramfs.
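For comparison with point 1, the ordinary system-RAM ramdisk is a one-liner (the size is an example):

```shell
# tmpfs lives in RAM (and can spill to swap under memory pressure)
mount -t tmpfs -o size=8G tmpfs /mnt/ramdisk
```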
Hands down the latter. Good M.2 drives can generally get pretty close to saturating the bus, and you can fit literally a thousand times more stuff on 4 NVMe drives than you can on any old GPU.
It has been tried in each generation of motherboard design, but in an era where GPUs had a custom motherboard slot that normal cards could not occupy, it made a sort of sense. And I know there have been times when the northbridge could not drive as many PCIe devices at full speed as the motherboard had slots, so even leaving the slot intended for GPUs empty, or populating it with a daughter card, might be leaving performance on the table. But I suspect a riser card would fit handily into a 16x slot without blocking more than one or two 2x slots.
Why? VRAM has to be powered as long as you're scanning out of it, and any competent design is going to support powering down most of the GPU while keeping RAM alive; otherwise an idle desktop is going to suck way more power than necessary.
GPUs will drop memory clocks dynamically, with at least one supported clock speed that's intended to be just fast enough to support scanning out the framebuffer. I haven't seen any indication that anybody is dynamically offlining VRAM capacity.
You can validate this yourself: if you have access to an A100/H100, allocate a 30 GB tensor and do nothing. You'll see nvidia-smi's reported wattage go up by a watt or so.
> Warning: Multiple users have reported this to cause system freezes, even with the fix in #Complete system freeze under high memory pressure. Other GPU management processes or libraries may be swapped out, leading to nonrecoverable page faults.
and in general you have to be really careful swapping to anything that uses a driver that could itself be swapped (which FUSE is especially prone to, but IIRC even ZFS and NFS did(?) have caveats with swap).
OTOH that same page documents a way to swap to vram without going through userspace, so don't take this as opposition to the general idea:)