Faster Ollama alternative

RandomlyRight@sh.itjust.works · 4 months ago

Faster Ollama alternative

RandomlyRight@sh.itjust.works · 4 months ago

Yeah, but there are many open issues on GitHub related to these settings not working right. I’m using the API, and just couldn’t get it to work. I used a request to generate a json file, and it never generated one longer than about 500 lines. With the same model on vllm, it worked instantly and generated about 2000 lines

theunknownmuncher@lemmy.world · edit-2 4 months ago

Are you using a tiny model (1.5B-7B parameters)? ollama pulls 4bit quant by default. It looks like vllm does not used quantized models by default so this is likely the difference. Tiny models are impacted more by quantization

I have no problems with changing num_ctx or num_predict

RandomlyRight@sh.itjust.works · 4 months ago

It was multiple models, mainly 32-70B

theunknownmuncher@lemmy.world · edit-2 4 months ago

Can you try setting the num_ctx and num_predict using a Modelfile with ollama? https://github.com/ollama/ollama/blob/main/docs/modelfile.md#parameter

RandomlyRight@sh.itjust.works · 4 months ago

I’ve read about this method in the GitHub issues, but to me it seemed impractical to have different models just to change the context size, and that was the point I started looking for alternatives

theunknownmuncher@lemmy.world · 4 months ago

You can overwrite the model by using the same name instead of creating one with a new name if it bothers you. Either way there is no duplication of the llm model file