
Thanks :). It's just a lot of prompting and string parsing. There are models like "Hermes-2-Pro-Mistral" (the one from the video) that are trained to work with function signatures and to output structured text. But in the end it's just strings in > strings out, haha. Still, it's fun (and sometimes frustrating) to use LLMs for flow control (conditions, loops...) inside your programs.
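
To give a rough idea, here's a minimal sketch of that pattern (assuming a local OpenAI-compatible server, e.g. llama.cpp's llama-server or Ollama; `ask_llm` is just a hypothetical helper, swap in whatever client your backend uses):

    import requests

    # Assumes a local OpenAI-compatible endpoint; adjust URL/payload
    # to whatever backend you actually run.
    API_URL = "http://localhost:8080/v1/chat/completions"

    def ask_llm(prompt: str) -> str:
        resp = requests.post(API_URL, json={
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0,
        })
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]

    def llm_condition(question: str) -> bool:
        # Strings in -> bool out: parse the model's text into a branch.
        answer = ask_llm(question + "\nAnswer with exactly one word: YES or NO.")
        return answer.strip().upper().startswith("YES")

    for subject in ["Meeting moved to 3pm", "You WON a FREE cruise!!!"]:
        if llm_condition("Is this email subject spam? " + subject):
            print("spam:", subject)
        else:
            print("ok:", subject)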


Got a link for that one? I have found a few with Hermes-2-Mistral in the name.


Wow, I didn't know about "Hermes 2 Pro - Mistral 7B", cheers!


It's my go-to "structured text model" atm. Try "Starling-LM-7B-beta" for some very impressive chat capabilities. I honestly think it outperforms GPT-3 half the time.


Sorry to repeat the same question I just asked the other commenter in this thread, but could you link the model page and recommend a specific level of quantization for the models you've referenced? I'd love to play with these models and see what you're talking about.



Thank you — at the bottom of that page I found this link to what I think are the quantized versions:

https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B-...

If you have the time, could you explain what you mean by "Q5 is minimum"? Did you determine that by trying the different models and finding this one is best, or did someone else do that evaluation, or is that just generally accepted knowledge? Sorry, I find this whole ecosystem quite confusing still, but I'm very new and that's not your problem.


Talking GGUF: usually, the higher you can afford to go with quantization (e.g. Q5 is better than Q4, etc.), the better. A Q6_K has minimal performance loss compared to the Q8, so in most cases, if you can fit a Q6_K, it's recommended to just use that. TheBloke's READMEs[0] usually have a good table summarizing each quantization level.

If you're RAM-constrained, you'll also have to make trade-offs on context length: e.g. with 8 GB of RAM you could run a Q5 quant with a shorter context vs. a Q3 with a longer one, etc. (rough numbers in the sketch below).

[0]: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF
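
For a back-of-envelope feel of that trade-off (all figures assumed: ~7B params, Mistral-style GQA with 32 layers, 8 KV heads, head dim 128, fp16 KV cache, and approximate bits-per-weight for the quants; real usage adds runtime overhead):

    PARAMS = 7e9

    def weight_gb(bits_per_weight):
        return PARAMS * bits_per_weight / 8 / 1e9

    def kv_cache_gb(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128):
        # one K and one V per layer per token, 2 bytes each (fp16)
        return 2 * n_layers * n_kv_heads * head_dim * 2 * n_ctx / 1e9

    for label, bpw, ctx in [("Q5_K_M", 5.7, 4096), ("Q3_K_M", 3.9, 16384)]:
        print(f"{label} @ {ctx} ctx: ~{weight_gb(bpw) + kv_cache_gb(ctx):.1f} GB")

Both configurations land around ~5.5 GB here, which is exactly the kind of trade you end up making on an 8 GB machine.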


Thank you!


It's the best balance if you have limited compute.


Thank you


Have you considered using grammar sampling?
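
E.g. with llama-cpp-python you can pass a GBNF grammar so decoding can only emit tokens the grammar allows, which removes most of the string-parsing guesswork. A minimal sketch (the model path is a placeholder):

    from llama_cpp import Llama, LlamaGrammar

    llm = Llama(model_path="hermes-2-pro-mistral-7b.Q5_K_M.gguf")
    grammar = LlamaGrammar.from_string('root ::= "YES" | "NO"')

    out = llm(
        "Is Paris the capital of France? Answer YES or NO.\n",
        grammar=grammar,
        max_tokens=4,
    )
    print(out["choices"][0]["text"])  # constrained to exactly YES or NO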



