Man, I know this is HN and we have a certain decorum to maintain, but with the recent activity in this field the most appropriate response to these posts is "4bit when?" or "fp16 when?". Not sure which one applies here. I'm having no luck running it on a 6GB VRAM GPU, so I guess it's the 16-bit floating-point one.
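For what it's worth, if the checkpoint were ever published in a transformers-compatible format, 4-bit loading via bitsandbytes would look roughly like this (a sketch only; the model id is a placeholder, and I don't know whether this model actually ships that way):

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    model_id = "some-org/some-model"  # placeholder, not the real checkpoint

    # NF4 4-bit quantization cuts weight memory to roughly a quarter of fp16.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",  # spills layers to CPU if VRAM runs short
    )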
Related to this: to those releasing models, it would be great if you could share how much VRAM is required (it seems very common for this key piece of info to be missing).
I'm successfully running it on a 12GB GPU (it downloads some 12.1GB of model data on first run; the highest GPU memory usage I saw was ~6.5GB, settling back down to around 5GB). However, the results are nothing like the samples on the GitHub page -- using the exact code given, the runs I've tried have been rather terrible.
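If anyone wants to reproduce those memory numbers, torch's allocator counters give a peak figure directly (a sketch; generate() here is a placeholder for whatever the model's actual entry point is):

    import torch

    torch.cuda.reset_peak_memory_stats()

    audio = generate("some text")  # placeholder for the model's entry point

    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"peak GPU memory allocated: {peak_gb:.2f} GB")

    # This only counts PyTorch allocations; nvidia-smi will report more
    # because of the CUDA context and the allocator's cache.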
I'm not being negative -- some of the samples on their page are really neat -- and I know there's some idiosyncrasy of my setup causing issues, though it's a pretty typical conda + PyTorch environment with CUDA 11.8.
Playing with the text and waveform temps from their defaults of 0.7 yields some semi-decent results, but it feels essentially random.
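In case it's useful to anyone else poking at this -- assuming this is the Bark-style API, where generate_audio takes text_temp and waveform_temp with 0.7 defaults, a small grid sweep makes the randomness easier to judge:

    import itertools
    from scipy.io.wavfile import write as write_wav
    from bark import SAMPLE_RATE, generate_audio, preload_models

    preload_models()  # downloads/loads the checkpoints on first call

    text = "Hello, this is a test sentence for the model."

    # Sweep both temperatures around the 0.7 defaults; lower is more
    # deterministic, higher is more varied (and more chaotic).
    for text_temp, waveform_temp in itertools.product([0.5, 0.7, 0.9], repeat=2):
        audio = generate_audio(text, text_temp=text_temp, waveform_temp=waveform_temp)
        write_wav(f"out_t{text_temp}_w{waveform_temp}.wav", SAMPLE_RATE, audio)

Listening to the nine outputs side by side at least tells you whether a "good" run was the settings or just the seed.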