
Audalin (edited)

Mostly via terminal, yeah. It's convenient when you're used to it - I am.

Let's see, my inference speeds now are:

  • ~60-65 tok/s for an 8B model in Q5_K/Q6_K (entirely in VRAM);
  • ~36 tok/s for a 14B model in Q6_K (entirely in VRAM);
  • ~4.5 tok/s for a 35B model in Q5_K_M (16/41 layers in VRAM);
  • ~12.5 tok/s for an 8x7B model in Q4_K_M (18/33 layers in VRAM);
  • ~4.5 tok/s for a 70B model in Q2_K (44/81 layers in VRAM);
  • ~2.5 tok/s for a 70B model in Q3_K_L (28/81 layers in VRAM) - see the example command after the list.
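For context, a partial-offload run like that last 70B one looks roughly like this. This is a minimal sketch assuming llama.cpp and its llama-cli binary; the model path and prompt are hypothetical, and -ngl is the flag that controls how many layers go to VRAM:

```sh
# A minimal sketch, not my exact setup - the model path is hypothetical.
# -m   : quantised GGUF model file
# -ngl : number of layers to offload to VRAM (28 of 81 here, as in the list above)
# -c   : context window size
# -p   : prompt
./llama-cli -m ./models/70b.Q3_K_L.gguf -ngl 28 -c 4096 \
    -p "Summarise quantisation trade-offs in one paragraph."
```

More layers resident in VRAM generally means more tok/s, which lines up with the gap between the two 70B quants above.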

As for quality, I try to avoid quantisation below Q5, or Q4 at the very least. I also don't see any point in using Q8/f16/f32 - the difference from Q6 is minimal. Other than that, it really depends on the model; for instance, llama-3 8B is smarter than many older 30B+ models.
