mlx-lm Model Bringup Process

How new model architectures get added to mlx-lm.

Model Loading Flow

ModelArgs (dataclass)

Subclass of BaseModelArgs (provides from_dict for parsing config.json)
All architecture hyperparameters: hidden size, layers, heads, vocab size, RoPE config
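The two points above can be sketched as follows. This is a minimal illustration, not the library's code: the BaseModelArgs here is a stand-in written so the example runs without mlx-lm installed, and the field list is a hypothetical subset of what a Llama-like model file would declare.

```python
import inspect
from dataclasses import dataclass
from typing import Optional


@dataclass
class BaseModelArgs:
    """Stand-in for the mlx-lm base class (assumption, for illustration)."""

    @classmethod
    def from_dict(cls, params: dict):
        # Keep only keys that match declared dataclass fields; config.json
        # usually carries extra entries the model code does not need.
        accepted = inspect.signature(cls).parameters
        return cls(**{k: v for k, v in params.items() if k in accepted})


@dataclass
class ModelArgs(BaseModelArgs):
    # Hypothetical field set for a Llama-like model; real files declare more.
    model_type: str
    hidden_size: int
    num_hidden_layers: int
    num_attention_heads: int
    vocab_size: int
    rms_norm_eps: float = 1e-5
    rope_theta: float = 10000.0
    num_key_value_heads: Optional[int] = None

    def __post_init__(self):
        # Fall back to plain multi-head attention if the config omits KV heads.
        if self.num_key_value_heads is None:
            self.num_key_value_heads = self.num_attention_heads


config = {
    "model_type": "llama",
    "hidden_size": 4096,
    "num_hidden_layers": 32,
    "num_attention_heads": 32,
    "vocab_size": 32000,
    "torch_dtype": "bfloat16",  # not a declared field, so from_dict drops it
}
args = ModelArgs.from_dict(config)
```

Filtering unknown keys keeps the dataclass strict about what the model uses while tolerating the extra entries a checkpoint's config.json tends to contain.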

Model (nn.Module)

Internal pattern:

Embedding -> [TransformerBlock x N] -> RMSNorm -> LM Head

Each block (pre-norm): Input -> RMSNorm -> Attention -> Residual add -> RMSNorm -> MLP -> Residual add
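The data flow of the pattern above can be sketched in plain Python on lists of floats. This is a structural sketch only: the attention and MLP sublayers are placeholder callables, the learned norm gain is omitted, and RMSNorm is assumed for both in-block norms, matching the top-level pattern. A real model file would use mlx.nn modules instead.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Sublayer = Callable[[Vector], Vector]


def rms_norm(x: Vector, eps: float = 1e-5) -> Vector:
    # Scale by the root mean square; the learned gain is omitted for brevity.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def add(a: Vector, b: Vector) -> Vector:
    return [u + v for u, v in zip(a, b)]


def transformer_block(x: Vector, attention: Sublayer, mlp: Sublayer) -> Vector:
    # Pre-norm: normalize each sublayer's input, then add the sublayer's
    # output back onto the residual stream.
    h = add(x, attention(rms_norm(x)))
    return add(h, mlp(rms_norm(h)))


def forward(x: Vector, blocks: Sequence[Tuple[Sublayer, Sublayer]]) -> Vector:
    for attention, mlp in blocks:
        x = transformer_block(x, attention, mlp)
    return rms_norm(x)  # final norm; the LM head projection would follow


def zero_sublayer(v: Vector) -> Vector:
    # Placeholder sublayer that contributes nothing to the residual stream.
    return [0.0] * len(v)


x = [3.0, 4.0]
same = transformer_block(x, zero_sublayer, zero_sublayer)
out = forward(x, [(zero_sublayer, zero_sublayer)] * 2)
```

With sublayers that return zeros, each block reduces to the identity, which is a quick way to check that the residual wiring is right before real layers are plugged in.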

Architecture	Lines of code	What drives the size
Llama	~274	Standard dense transformer, baseline
Qwen3.5	~524	Hybrid attention, MoE routing, vision, gated delta updates
DeepSeek V3	~600+	MoE with shared experts, multi-head latent attention (MLA)

Llama-like architectures (Mistral, Yi) can reuse existing components or be thin wrappers. Novel architectures need a full forward pass written from scratch.

Weight mapping: HF weight names don't always match the MLX module structure. sanitize() handles renames, drops, and reshapes. A wrong mapping that still loads gives silently wrong outputs.
Attention variants: sliding window, linear, and sparse attention each need their own implementation. mx.fast.scaled_dot_product_attention covers softmax SDPA, including GQA and MQA without tiling the keys and values.
RoPE variants: standard, NTK-aware, YaRN, dynamic. rope_utils.py handles the common ones.
KV cache types: KVCache (standard), RotatingKVCache (sliding window), ArraysCache (SSM state). Hybrid models use different types per layer.
Quantization: the model must load and run with MLX's quantized layers. Quantized SDPA has its own code path that requires specific tensor layouts.
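As an illustration of the weight-mapping item, here is a minimal sketch of a sanitize-style pass plus a name check. It works on a plain dict so it runs without MLX. The "transformer." prefix, the helper mapping_errors, and the shape tuples used as stand-in values are all hypothetical; the drop rules (rotary inv_freq buffers, a tied lm_head) are typical of Llama-style checkpoints but should be checked against the actual model file.

```python
from typing import Dict, Set


def sanitize(weights: Dict[str, object], tie_word_embeddings: bool) -> Dict[str, object]:
    cleaned = {}
    for name, value in weights.items():
        # Drop: precomputed rotary buffers are recomputed at runtime.
        if "rotary_emb.inv_freq" in name:
            continue
        # Drop: with tied embeddings the output head reuses embed_tokens.
        if tie_word_embeddings and name == "lm_head.weight":
            continue
        # Rename: hypothetical checkpoint prefix mapped to the module tree.
        if name.startswith("transformer."):
            name = "model." + name[len("transformer."):]
        cleaned[name] = value
    return cleaned


def mapping_errors(cleaned: Dict[str, object], expected: Set[str]) -> Dict[str, list]:
    # Compare names only. A tensor that lands under the wrong name with a
    # compatible shape would pass this check, so it is not a full guarantee.
    names = set(cleaned)
    return {
        "missing": sorted(expected - names),
        "unexpected": sorted(names - expected),
    }


hf_weights = {
    "transformer.embed_tokens.weight": (32000, 4096),
    "transformer.layers.0.self_attn.q_proj.weight": (4096, 4096),
    "transformer.layers.0.self_attn.rotary_emb.inv_freq": (64,),
    "lm_head.weight": (32000, 4096),
}
expected_names = {
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
}
cleaned = sanitize(hf_weights, tie_word_embeddings=True)
report = mapping_errors(cleaned, expected_names)
```

A name check like this catches missing and stray parameters early. Comparing logits against the reference implementation on a short prompt is still needed to catch the silent cases.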

Straightforward (follows existing pattern):

Non-trivial (new concepts):

Follow-up fixes (bringup isn't done at merge):

Known architecture (Llama/Mistral/Qwen-family) -> likely already supported or trivial to add
New mechanism (novel attention, novel MoE, hybrid SSM) -> 300-600 lines of new MLX code + weight mapping
~117 architectures currently supported — check mlx_lm/models/ before assuming unsupported
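One way to run that check from Python is sketched below. It assumes each supported architecture is a module named after its config.json model_type under mlx_lm.models. The helper name has_model_module is hypothetical, and the library may alias some model_type values to another module, so a False result is a prompt to look in the directory, not proof that the model is unsupported.

```python
import importlib.util


def has_model_module(model_type: str) -> bool:
    """Return True if a module mlx_lm.models.<model_type> can be located."""
    try:
        spec = importlib.util.find_spec(f"mlx_lm.models.{model_type}")
    except (ImportError, ValueError):
        # mlx-lm is not installed, or the name is not a usable module path.
        return False
    return spec is not None


if __name__ == "__main__":
    for candidate in ("llama", "no_such_architecture_xyz"):
        print(candidate, has_model_module(candidate))
```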