Loading LLM Models
Why are people drawn to Macs for running LLMs locally instead of NVIDIA GPUs with CUDA support?
- Macs are more portable and often cheaper than NVIDIA GPU setups with comparable performance.
- Unified memory lets the CPU and GPU work on the same data without copying it back and forth (see the MLX sketch after this list).
- Apple's Metal and MLX support keeps improving.
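As a rough illustration of the unified-memory point, here is a minimal sketch using Apple's MLX (assumes the `mlx` Python package is installed and you are on Apple silicon): the same arrays feed operations scheduled on either the CPU or the GPU, with no explicit host-to-device copy.

```python
import mlx.core as mx

# Allocate once in unified memory.
a = mx.random.normal((4096, 4096))
b = mx.random.normal((4096, 4096))

# The same buffers are consumed by ops on either device; no copy step.
on_gpu = mx.matmul(a, b, stream=mx.gpu)
on_cpu = mx.matmul(a, b, stream=mx.cpu)

# MLX is lazy, so force both computations to run.
mx.eval(on_gpu, on_cpu)
```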
How LLM models are loaded:
- The model weights (safetensors, GGUF, etc.) are read from disk into RAM (see the sketches after this list).
- During inference, the weights are transferred from RAM to VRAM. RAM is still needed for system overhead, and if VRAM fills up, data can spill back into RAM when GPU memory usage spikes.
- When inference finishes, the weights stay in VRAM.
- With unified memory there is no separate offload step, because RAM and VRAM come from the same pool, which makes the flow more efficient.
- VRAM is freed only when the model is explicitly unloaded.
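A sketch of that lifecycle using llama-cpp-python as one example runtime (the model path is a placeholder, and the exact flags are assumptions about your build): loading reads the GGUF from disk into RAM, `n_gpu_layers` controls how many layers get offloaded to VRAM/Metal, the weights stay resident between calls, and memory is released only when the model object is dropped.

```python
import gc
from llama_cpp import Llama

MODEL_PATH = "models/llama-3-8b-instruct.Q4_K_M.gguf"  # placeholder path

# Load: weights are read from disk into RAM (memory-mapped by default),
# then up to n_gpu_layers layers are offloaded to VRAM / Metal.
llm = Llama(
    model_path=MODEL_PATH,
    n_gpu_layers=-1,   # -1 = offload every layer the backend can take
    n_ctx=4096,
)

# Inference: weights are already resident; only activations and the KV cache grow.
out = llm("Explain unified memory in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])

# Weights stay loaded between calls; memory is released only when the
# model object is explicitly dropped (or the process exits).
del llm
gc.collect()
```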
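For contrast, a unified-memory sketch with the mlx-lm package (the repo id is a placeholder, and the exact `generate` signature may differ across versions): the weights land directly in the shared memory pool, so there is no separate offload-to-VRAM step before the GPU can use them.

```python
from mlx_lm import load, generate

# Load: weights go straight into unified memory; the GPU reads the same buffers.
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")  # placeholder repo id

# Inference runs without any RAM-to-VRAM transfer step.
print(generate(model, tokenizer, prompt="Hello", max_tokens=32))
```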