Compute-Skipping Policies for Diffusion LLMs (dLLM-v2) (2026)
Built on the Fast‑dLLM v2 codebase, this project implements and analyzes two ways to skip redundant computation across denoising steps:
- Layer-level skipping: cache each decoder layer's input hidden states and output, then reuse the cached output on the next step when the new input is highly cosine-similar to the cached input.
- Token-level skipping: a finer-grained, stability-aware policy that, in each layer, recomputes only the tokens least similar to their cached inputs and reuses cached outputs for the rest. The fraction reused is an adaptive ratio driven by the batch-average cosine similarity.
The write-up studies the resulting accuracy-vs-FLOPs trade-off: how much computation can be safely skipped before generation quality degrades.
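The two policies can be illustrated with a minimal, framework-free sketch. This is not the project's code: the `SkippingLayer` wrapper, the `threshold` value, the use of the mean token similarity as the layer-level trigger, and the direct mapping from mean similarity to reuse ratio are all assumptions made for illustration. Plain Python lists stand in for hidden-state tensors, and the sequence length is assumed constant across denoising steps.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SkippingLayer:
    """Hypothetical wrapper that caches one layer's per-token inputs and outputs."""

    def __init__(self, layer_fn, threshold=0.99):
        self.layer_fn = layer_fn    # maps one token vector to one output vector
        self.threshold = threshold  # assumed cutoff for layer-level reuse
        self.cached_in = None
        self.cached_out = None
        self.tokens_computed = 0    # crude stand-in for FLOPs spent

    def _recompute(self, hidden, indices):
        """Run the layer on the given token indices and refresh their cache entries."""
        for i in indices:
            self.cached_in[i] = list(hidden[i])
            self.cached_out[i] = self.layer_fn(hidden[i])
        self.tokens_computed += len(indices)

    def _full(self, hidden):
        """Compute every token and (re)build the cache."""
        self.cached_in = [None] * len(hidden)
        self.cached_out = [None] * len(hidden)
        self._recompute(hidden, range(len(hidden)))
        return list(self.cached_out)

    def _similarities(self, hidden):
        """Per-token cosine similarity between new inputs and cached inputs."""
        return [cosine(h, c) for h, c in zip(hidden, self.cached_in)]

    def forward_layer_level(self, hidden):
        """Reuse the whole cached output when the inputs are similar enough."""
        if self.cached_in is None:
            return self._full(hidden)
        sims = self._similarities(hidden)
        # Assumption: "highly similar" means mean token similarity >= threshold.
        if sum(sims) / len(sims) >= self.threshold:
            return list(self.cached_out)  # layer skipped, nothing recomputed
        return self._full(hidden)

    def forward_token_level(self, hidden):
        """Recompute only the least-similar tokens; reuse the cache for the rest."""
        if self.cached_in is None:
            return self._full(hidden)
        sims = self._similarities(hidden)
        mean_sim = sum(sims) / len(sims)
        # Assumption: the reuse ratio is the mean similarity clamped to [0, 1].
        reuse_ratio = min(1.0, max(0.0, mean_sim))
        n_reuse = round(reuse_ratio * len(hidden))
        least_similar_first = sorted(range(len(hidden)), key=lambda i: sims[i])
        self._recompute(hidden, least_similar_first[: len(hidden) - n_reuse])
        return list(self.cached_out)
```

In this sketch `tokens_computed` gives a rough view of the compute side of the trade-off: comparing it against the number of tokens a no-skipping baseline would process shows how much work each policy avoided on a given sequence of steps.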
Report: Implementation of Two Compute-Skipping Policies in dLLM-v2 (PDF)
