An Efficient Large-scale Training Codebase for MLLMs

Mar 2025 · Engineering Notes
Training Infra · Triton · vLLM

Draft notes for a scalable training + serving stack. Replace the placeholder diagram with your own visual.

[Placeholder: scalable training diagram]

Goals

1) Data pipeline
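
There is no pipeline code in these notes yet; as a starting point, here is a minimal streaming sketch that assumes image-text samples stored as JSONL shards. The ShardedJsonlDataset class, the "data/shards" path, and the sample schema are all hypothetical, not part of the codebase.

import json
from pathlib import Path
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class ShardedJsonlDataset(IterableDataset):
    """Streams samples from JSONL shards, splitting shards across DataLoader workers."""

    def __init__(self, shard_dir: str):
        self.shards = sorted(Path(shard_dir).glob("*.jsonl"))

    def __iter__(self):
        info = get_worker_info()
        # Each worker reads a disjoint subset of shards so samples are not duplicated.
        shards = self.shards if info is None else self.shards[info.id :: info.num_workers]
        for shard in shards:
            with open(shard) as f:
                for line in f:
                    yield json.loads(line)  # e.g. {"image": "...", "text": "..."}

# batch_size=None yields raw samples; batching/collation of image tensors happens later.
loader = DataLoader(ShardedJsonlDataset("data/shards"), batch_size=None, num_workers=4)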

2) Model parallel (4D)
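
A sketch of how the four parallel dimensions could be declared and validated, assuming the common data / tensor / pipeline / context split; the ParallelDims dataclass is our own illustration, not an existing API.

from dataclasses import dataclass

@dataclass
class ParallelDims:
    dp: int = 1   # data parallel: replicas of the full model
    tp: int = 1   # tensor parallel: shards within a layer
    pp: int = 1   # pipeline parallel: stages of layers
    cp: int = 1   # context parallel: shards along the sequence dimension

    def validate(self, world_size: int) -> None:
        product = self.dp * self.tp * self.pp * self.cp
        if product != world_size:
            raise ValueError(f"dp*tp*pp*cp = {product} does not match world size {world_size}")

dims = ParallelDims(dp=4, tp=8, pp=2, cp=2)
dims.validate(world_size=128)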

3) Fused kernels and Triton tips

import triton
import triton.language as tl

@triton.jit
def fused_bias_gelu(X_ptr, B_ptr, Out_ptr, n_elements, n_cols, BLOCK: tl.constexpr):
    # Fused bias-add + tanh-approximate GELU over a flattened (rows, n_cols) tensor.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(X_ptr + offsets, mask=mask, other=0.0)
    # The bias is per-column, so broadcast it across rows with a modulo index.
    b = tl.load(B_ptr + (offsets % n_cols), mask=mask, other=0.0)
    y = x + b
    # Tanh-approximate GELU: 0.5 * y * (1 + tanh(sqrt(2/pi) * (y + 0.044715 * y^3)))
    inner = 0.79788456 * (y + 0.044715 * y * y * y)
    # tanh(z) = 2 * sigmoid(2z) - 1, written with tl.exp so it works across Triton versions
    t = 2.0 / (1.0 + tl.exp(-2.0 * inner)) - 1.0
    y = 0.5 * y * (1.0 + t)
    tl.store(Out_ptr + offsets, y, mask=mask)
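
A host-side launch sketch for the kernel above, assuming a contiguous 2-D CUDA activation tensor and a per-column bias; the bias_gelu wrapper name and the BLOCK size of 1024 are our choices.

import torch

def bias_gelu(x: torch.Tensor, bias: torch.Tensor, BLOCK: int = 1024) -> torch.Tensor:
    # x: contiguous (rows, n_cols) CUDA tensor; bias: (n_cols,) CUDA tensor.
    assert x.is_contiguous() and bias.numel() == x.shape[-1]
    out = torch.empty_like(x)
    n_elements = x.numel()
    grid = (triton.cdiv(n_elements, BLOCK),)
    fused_bias_gelu[grid](x, bias, out, n_elements, x.shape[-1], BLOCK=BLOCK)
    return out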

4) Optimizer and scheduling
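
A minimal sketch of the usual AdamW + linear-warmup + cosine-decay recipe. The hyperparameters, betas, and the decay/no-decay parameter split are assumptions for illustration, not settings from these notes.

import math
import torch

def build_optimizer(model, lr=3e-4, weight_decay=0.1, warmup_steps=2000, total_steps=100_000):
    # Common convention: keep 1-D params (norms, biases) out of weight decay.
    decay, no_decay = [], []
    for p in model.parameters():
        if not p.requires_grad:
            continue
        (no_decay if p.ndim < 2 else decay).append(p)
    opt = torch.optim.AdamW(
        [{"params": decay, "weight_decay": weight_decay},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=lr, betas=(0.9, 0.95),
    )

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)            # linear warmup
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))  # cosine decay

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched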

5) Logging and observability
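
A small rank-zero throughput meter as one possible starting point for observability; the ThroughputMeter name is ours, and wiring it to a real backend (TensorBoard, W&B, etc.) is left open.

import time

class ThroughputMeter:
    """Tracks tokens/sec over a rolling window of optimizer steps."""

    def __init__(self, log_every: int = 50, is_rank_zero: bool = True):
        self.log_every = log_every
        self.is_rank_zero = is_rank_zero
        self.tokens = 0
        self.steps = 0
        self.start = time.perf_counter()

    def update(self, tokens_in_step: int) -> None:
        self.tokens += tokens_in_step
        self.steps += 1
        if self.is_rank_zero and self.steps % self.log_every == 0:
            elapsed = time.perf_counter() - self.start
            print(f"step {self.steps}: {self.tokens / elapsed:,.0f} tokens/sec")
            # Reset the window so the next report reflects recent throughput only.
            self.tokens, self.start = 0, time.perf_counter()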

6) Inference/serving (vLLM-style)
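
For offline batch inference, vLLM's Python API looks roughly like the sketch below. The model id is a placeholder, the sampling settings are illustrative, and this is the text-only path; multimodal inputs go through the model-specific multimodal input format.

from vllm import LLM, SamplingParams

llm = LLM(model="your-org/your-mllm")  # placeholder model id
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

outputs = llm.generate(["Describe the attached image."], params)
for out in outputs:
    print(out.outputs[0].text)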

7) Suggested repo layout
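
One possible layout to start from, purely a suggestion to adjust as the codebase grows:

mllm/
  data/      # dataset shards, tokenizers, preprocessing scripts
  models/    # model definitions and parallel layers
  kernels/   # Triton kernels (e.g. fused_bias_gelu)
  train/     # training loop, optimizer, schedules, checkpointing
  serve/     # vLLM-style inference entry points
  configs/   # experiment and parallelism configs
  tools/     # logging, profiling, observability helpers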

Next steps