• 1 Post
  • 42 Comments
Joined 7 months ago
cake
Cake day: March 22nd, 2024

help-circle
  • To go into more detail:

    • Exllama is faster than llama.cpp with all other things being equal.

    • exllama’s quantized KV cache implementation is also far superior, and nearly lossless at Q4 while llama.cpp is nearly unusable at Q4 (and needs to be turned up to Q5_1/Q4_0 or Q8_0/Q4_1 for good quality)

    • With ollama specifically, you get locked out of a lot of knobs like this enhanced llama.cpp KV cache quantization, more advanced quantization (like iMatrix IQ quantizations or the ARM/AVX optimized Q4_0_4_4/Q4_0_8_8 quantizations), advanced sampling like DRY, batched inference and such.

    It’s not evidence or options… it’s missing features, thats my big issue with ollama. I simply get far worse, and far slower, LLM responses out of ollama than tabbyAPI/EXUI on the same hardware, and there’s no way around it.

    Also, I’ve been frustrated with implementation bugs in llama.cpp specifically, like how llama 3.1 (for instance) was bugged past 8K at launch because it doesn’t properly support its rope scaling. Ollama inherits all these quirks.

    I don’t want to go into the issues I have with the ollama devs behavior though, as that’s way more subjective.


  • It’s less optimal.

    On a 3090, I simply can’t run Command-R or Qwen 2.5 34B well at 64K-80K context with ollama. Its slow even at lower context, the lack of DRY sampling and some other things majorly hit quality.

    Ollama is meant to be turnkey, and thats fine, but LLMs are extremely resource intense. Sometimes the manual setup/configuration is worth it to squeeze out every ounce of extra performance and quantization quality.

    Even on CPU-only setups, you are missing out on (for instance) the CPU-optimized quantizations llama.cpp offers now, or the more advanced sampling kobold.cpp offers, or more fine grained tuning of flash attention configs, or batched inference, just to start.

    And as I hinted at, I don’t like some other aspects of ollama, like how they “leech” off llama.cpp and kinda hide the association without contributing upstream, some hype and controversies in the past, and hints that they may be cooking up something commercial.



  • Your post is suggesting that the same models with the same parameters generate different result when run on different backends

    Yes… sort of. Different backends support different quantization schemes, for both the weights and the KV cache (the context). There are all sorts of tradeoffs.

    There are even more exotic weight quantization schemes (ALQM, VPTQ) that are much more VRAM efficient than llama.cpp or exllama, but I skipped mentioning them (unless somedone asked) because they’re so clunky to setup.

    Different backends also support different samplers. exllama and kobold.cpp tend to be at the cutting edge of this, with things like DRY for better long-form generation or grammar.


  • So there are multiple ways to split models across GPUs, (layer splitting, which uses one GPU then another, expert parallelism, which puts different experts on different GPUs), but the way you’re interested in is “tensor parallelism”

    This requires a lot of communication between the GPUs, and NVLink speeds that up dramatically.

    It comes down to this: If you’re more interested in raw generation speed, especially with parallel calls of smaller models, and/or you don’t care about long context (with 4K being plenty), use Aphrodite. It will ultimately be faster.

    But if you simply want to stuff the best/highest quality model you can at VRAM, especially at longer context (>4K), use TabbyAPI. Its tensor parallelism only works over PCIe, so it will be a bit slower, but it will still stream text much faster than you can read. It can simply hold bigger, better models at higher quality in the same 48GB VRAM pool.














  • Depends what you mean by censored. I never have a problem with Qwen or llama as long as I give them the right prompt and system prompt. Its not like an API model, they have to continue whatever response you give them.

    And… For what? If you are just looking for like ERP, check out drummer’s finetunes. Otherwise I tend to avoid “uncensored” finetunes as they dumb the model down a bit, but take your pick: https://huggingface.co/models?sort=modified&search=14B

    But you are going to struggle if you can’t get rocm working beyond very small context, as that means no flash attention anywhere.

    Also, assuming you end up using kobold.cpp-rocm instead, I would use a IQ3_M or IQ3_XS GGUF quantization of a 14B model.



  • I don’t like Fedora because its CUDA support is third party, and AFAIK they dont natively package ROCm. And its too complex to use through something like distrobox… I don’t want to tell you to switch OSes, but you’d have a much better time with CachyOS, which is also optimized for Steam gaming.

    Alternatively you could try installing rocm images through docker, but you have to make sure GPU passthrough is working).

    It also depends on your GPU. If you are on an RX 580, you can basically kiss rocm support goodbye, and might want to investigate mlc-llm’s vulkan backend.

    Fimbulvetr is ancient now, your go to models are Qwen 2.5 14B at short context or llama 3.1 8B/Qwen 2.5 7B at longer context.