Together AI Open-Sources OSCAR: An Attention-Aware 2-Bit KV Cache Quantization System for Long-Context LLM Serving


Long-context inference makes the KV cache one of the main costs of serving LLMs. During autoregressive decoding, the cache grows with context length, batch size, and model depth. At high batch sizes and long contexts, for example 100K tokens across dozens of concurrent requests, the KV cache consumes a large fraction of GPU memory. Compressing it is a direct way to increase batch size and reduce memory traffic.

The obvious approach is quantization. But pushing KV caches to INT2 (2-bit) precision has been largely impractical. Prior methods either collapse in accuracy or require custom serving layouts incompatible with paged KV-cache systems. Together AI’s OSCAR (Offline Spectral Covariance-Aware Rotation) addresses both problems.

Why INT2 KV Cache Quantization is Hard

KV activations contain channel-wise outliers. A small subset of channels holds extremely large values, while most channels are well-behaved. When you apply INT2 quantization, which has only four representable levels, those outliers dominate the scale factor. The quantizer wastes most of its range on rare spikes. Normal values get compressed into just one or two effective levels. This degrades attention quality significantly.
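The effect is easy to reproduce on synthetic data. The sketch below is a minimal illustration, not OSCAR code: it assumes a plain min/max asymmetric quantizer with four levels and a made-up 64-channel group with one spiked channel (`quantize_int2` and the spike value of 100 are hypothetical).

```python
import numpy as np

def quantize_int2(x):
    """Asymmetric 4-level quantization of one group; returns dequantized values."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 3.0 if hi > lo else 1.0   # 4 levels span 3 steps
    codes = np.clip(np.round((x - lo) / scale), 0, 3)
    return codes * scale + lo

def rms_error_on_normal_channels(x):
    """RMS quantization error over every channel except channel 0."""
    x_hat = quantize_int2(x)
    return float(np.sqrt(np.mean((x[1:] - x_hat[1:]) ** 2)))

rng = np.random.default_rng(0)
group = rng.normal(0.0, 1.0, size=64)   # well-behaved channels
spiked = group.copy()
spiked[0] = 100.0                        # a single outlier channel

err_clean = rms_error_on_normal_channels(group)
err_spiked = rms_error_on_normal_channels(spiked)
# Number of distinct quantization levels the normal channels actually land on.
levels_used = len(np.unique(quantize_int2(spiked)[1:]))

print(f"clean RMS error:  {err_clean:.3f}")
print(f"spiked RMS error: {err_spiked:.3f}")
print(f"levels used by normal channels: {levels_used}")
```

With the spike present, the step size is set almost entirely by the outlier, so the remaining channels collapse onto very few levels and their error grows accordingly.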

Rotation-based quantization addresses this by applying a fixed orthogonal transform, typically a Hadamard transform, to redistribute outlier energy across all channels. This approach works reasonably well at INT4. At INT2, a deeper problem remains: the rotation is data-oblivious. It can smooth activation ranges, but it does not know which directions the attention mechanism actually reads. Spreading quantization error uniformly is not the same as pushing it into low-importance directions. At INT2, with only four levels, that distinction determines whether the model works at all.
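As a rough illustration of the smoothing step only, the sketch below builds a normalized Sylvester-Hadamard matrix and applies it to a synthetic vector with one outlier. The helper `hadamard` and the data are assumptions made for this example; real kernels would use a fast Walsh-Hadamard transform rather than a dense matrix.

```python
import numpy as np

def hadamard(d):
    """Normalized Sylvester-Hadamard matrix; d must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(d)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=64)
x[0] = 100.0                      # one outlier channel

H = hadamard(64)
y = H @ x                         # rotated activations

peak_before = float(np.max(np.abs(x)))
peak_after = float(np.max(np.abs(y)))
print(f"peak magnitude before: {peak_before:.1f}, after: {peak_after:.1f}")
# The rotation is orthogonal, so the total energy is unchanged.
print(np.isclose(np.linalg.norm(x), np.linalg.norm(y)))
```

The peak shrinks because the outlier is shared across all 64 channels, but nothing in `H` depends on the data or on what attention reads, which is the limitation described above.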

https://arxiv.org/pdf/2605.17757v1

What OSCAR Does Differently

OSCAR’s key observation is that the rotation applied before quantization should be derived from attention statistics themselves, not from the raw distribution of KV activations.

For keys, the downstream error that matters is not the Euclidean reconstruction error of $K$. It is the error in attention logits. The research team showed this error is: $\lVert QK^{\top} - Q\hat{K}^{\top} \rVert_F^2 = \operatorname{tr}\big((K - \hat{K})\, Q^{\top} Q\, (K - \hat{K})^{\top}\big)$. The weighting matrix is the query covariance $Q^{\top} Q$, not $K^{\top} K$. Directions where queries have large energy amplify quantization errors in logits. OSCAR estimates the empirical query covariance $C_Q = \frac{1}{N} \sum_n q_n q_n^{\top}$ from a calibration set, eigen-decomposes it, and uses the eigenvectors $U_Q$ as the key rotation basis.

For values, the relevant error is in the attention output $SV$. This depends on how the attention score matrix $S$ weights each value row. The research team defines the score-weighted value covariance $C_S = \frac{1}{N} V^{\top} S^{\top} S V$. Directions that remain large after aggregation by $S$ are the ones quantization error propagates through. OSCAR uses the eigenvectors $U_S$ of $C_S$ as the value rotation basis.
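The two covariance definitions can be checked numerically. The following sketch uses random matrices as stand-ins for calibration activations (shapes, scaling, and the noise used as "quantization error" are all assumptions for illustration). It verifies the logit-error identity for keys and computes both rotation bases with `numpy.linalg.eigh`.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, d = 256, 128, 16            # queries, cached tokens, head dimension

# Hypothetical calibration data: queries with unequal per-channel energy.
Q = rng.normal(size=(N, d)) * np.linspace(0.1, 3.0, d)
K = rng.normal(size=(M, d))
V = rng.normal(size=(M, d))
K_hat = K + 0.05 * rng.normal(size=(M, d))   # stand-in for quantized keys
E = K - K_hat

# Logit error two ways: directly, and through the query-weighted trace.
logit_err = np.linalg.norm(Q @ K.T - Q @ K_hat.T, "fro") ** 2
trace_form = np.trace(E @ Q.T @ Q @ E.T)
print(np.isclose(logit_err, trace_form))

# Key basis: eigenvectors of the empirical query covariance C_Q.
C_Q = (Q.T @ Q) / N
eig_q, U_Q = np.linalg.eigh(C_Q)             # eigh returns ascending eigenvalues
U_Q = U_Q[:, ::-1]                           # reorder to descending importance
eig_q = eig_q[::-1]

# Value basis: eigenvectors of the score-weighted value covariance C_S.
logits = Q @ K.T / np.sqrt(d)
S = np.exp(logits - logits.max(axis=1, keepdims=True))
S /= S.sum(axis=1, keepdims=True)            # row-wise softmax
C_S = (V.T @ S.T @ S @ V) / N
eig_s, U_S = np.linalg.eigh(C_S)
U_S = U_S[:, ::-1]
```

In the rotated basis `U_Q.T @ C_Q @ U_Q` is diagonal, so each channel carries a known share of the logit error weight.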

The ultimate composed rotations are:

$R_K = U_Q \cdot H_{\mathrm{Had}} \cdot P_{\mathrm{br}}$
$R_V = U_S \cdot H_{\mathrm{Had}} \cdot P_{\mathrm{br}}$

Each of the three factors addresses a distinct failure mode of per-group low-bit quantization:

  • $U_Q$ / $U_S$ aligns channels with attention-importance directions. This diagonalizes the error-weighting matrix so the most important directions are identifiable.
  • $H_{\mathrm{Had}}$ (Walsh-Hadamard transform) then equalizes channel importance exactly. Lemma 1 in the research paper proves every diagonal entry of $H_{\mathrm{Had}}^{\top} \Lambda H_{\mathrm{Had}}$ equals $\operatorname{tr}(\Lambda)/d$: the peaky eigenspectrum exposed by $U_Q$ is compressed to a uniform value across all channels.
  • $P_{\mathrm{br}}$ (permuted bit-reversal) reorders channels so that for any power-of-two quantization group size, each group receives one representative from each level of the importance hierarchy.

The research team provides Theorem 1 proving $U_Q$ and $U_S$ are optimal under a frozen-error surrogate objective with diagonal residual assumptions.
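The equalization property and the composed rotation can be sketched in a few lines. This is an illustration under stated assumptions: the covariance is synthetic, `hadamard` builds a dense Sylvester-Hadamard matrix, and `bit_reversal_permutation` is a plain bit-reversal ordering used as a stand-in, since the exact permuted variant from the paper is not specified here.

```python
import numpy as np

def hadamard(d):
    """Normalized Sylvester-Hadamard matrix; d must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(d)

def bit_reversal_permutation(d):
    """Indices 0..d-1 with their log2(d)-bit binary representation reversed."""
    bits = d.bit_length() - 1
    return np.array([int(format(i, f"0{bits}b")[::-1], 2) for i in range(d)])

d = 8
rng = np.random.default_rng(0)
# Synthetic activations with a peaky spectrum, standing in for calibration data.
A = rng.normal(size=(64, d)) * np.array([6.0, 3.0, 2.0, 1.5, 1.0, 0.7, 0.5, 0.4])
C = A.T @ A / A.shape[0]
eigvals, U = np.linalg.eigh(C)

H = hadamard(d)
perm = bit_reversal_permutation(d)    # [0, 4, 2, 6, 1, 5, 3, 7] for d = 8
P = np.eye(d)[:, perm]                # permutation matrix

# Lemma-1-style check: the Hadamard factor flattens the diagonal to tr(Lambda)/d.
equalized = np.diag(H.T @ np.diag(eigvals) @ H)

# Composed rotation and the per-channel error weights it induces.
R = U @ H @ P
weights = np.diag(R.T @ C @ R)
print(equalized, eigvals.sum() / d)
```

Every entry of `weights` equals `np.trace(C) / d`, and `R` remains orthogonal because it is a product of orthogonal factors.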

The Serving System: Mixed-Precision Cache Layout

OSCAR integrates into SGLang’s production serving stack as an INT2 KV-cache mode with full compatibility with paged attention.

The KV cache layout uses three regions per request:

  • Sink tokens (first $S_0 = 64$ tokens): stored in BF16. These function as attention sinks.
  • Recent tokens (last $W = 256$ tokens before the current position): stored in BF16.
  • History tokens (everything in between): stored as INT2 after OSCAR rotation and clipping.

At 128K context length, the BF16 sink and recent windows represent only 0.24% of total tokens. The ablation (Table 5 in the research paper) shows ($S_0 = 64$, $W = 256$) is the accuracy-efficiency knee: smaller windows noticeably hurt accuracy; larger windows give negligible additional benefit at higher BF16 memory cost.
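The three-region rule and the 0.24% figure follow directly from the window sizes. The helper names below (`region_of`, `bf16_fraction`) are hypothetical and only illustrate the bookkeeping, assuming 128K means 131,072 tokens.

```python
def region_of(position, context_len, sink=64, recent=256):
    """Classify a cached token position as 'sink', 'recent', or 'history'."""
    if position < sink:
        return "sink"
    if position >= context_len - recent:
        return "recent"
    return "history"

def bf16_fraction(context_len, sink=64, recent=256):
    """Fraction of cached tokens kept in BF16 rather than INT2."""
    return min(context_len, sink + recent) / context_len

context_len = 128 * 1024
print(region_of(10, context_len))                    # sink
print(region_of(50_000, context_len))                # history
print(region_of(context_len - 1, context_len))       # recent
print(f"{100 * bf16_fraction(context_len):.2f}%")    # 0.24%
```

Because the BF16 windows have a fixed size, their share of the cache shrinks as the context grows.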


Write and read paths use fused Triton kernels. On the write path, each token is rotated, clipped to a calibration-derived percentile threshold (typical values: $c_K = 0.96$, $c_V = 0.92$), then quantized with per-token asymmetric INT2 at a default group size of $G_K = 64$ channels per group. On the read path, the INT2 kernel unpacks bytes, dequantizes, inverse-rotates, and passes results to the attention kernel, all in one fused pass without extra memory traffic. The value rotation $R_V$ is absorbed into the model’s projection weights offline, eliminating its online compute cost.
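The sequence of steps on each path can be sketched in NumPy. This is a reference-style illustration, not the fused Triton kernels: `write_path` and `read_path` are hypothetical names, the rotation is a random orthogonal matrix, and the clip ratio is assumed to scale each group’s maximum magnitude, which may differ from how the calibrated thresholds are actually applied.

```python
import numpy as np

GROUP = 64  # channels per quantization group, matching the default above

def write_path(x, R, clip_ratio):
    """Rotate, clip, quantize to 2-bit codes, and pack four codes per byte."""
    y = x @ R                                       # rotate into the calibrated basis
    g = y.reshape(-1, GROUP)                        # one row per quantization group
    # Assumption: clip each group at clip_ratio times its largest magnitude.
    t = clip_ratio * np.abs(g).max(axis=1, keepdims=True)
    g = np.clip(g, -t, t)
    lo = g.min(axis=1, keepdims=True)               # asymmetric: per-group zero point
    scale = np.maximum(g.max(axis=1, keepdims=True) - lo, 1e-8) / 3.0
    codes = np.clip(np.round((g - lo) / scale), 0, 3).astype(np.uint8)
    c = codes.reshape(-1, 4)
    packed = c[:, 0] | (c[:, 1] << 2) | (c[:, 2] << 4) | (c[:, 3] << 6)
    return packed.astype(np.uint8), scale, lo

def read_path(packed, scale, lo, R):
    """Unpack bytes, dequantize, and undo the rotation."""
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = (packed[:, None] >> shifts) & 3         # four 2-bit codes per byte
    g = codes.reshape(-1, GROUP).astype(np.float64) * scale + lo
    return g.reshape(-1) @ R.T                      # R is orthogonal, so R.T inverts it

rng = np.random.default_rng(0)
d = 128
R, _ = np.linalg.qr(rng.normal(size=(d, d)))        # stand-in orthogonal rotation
x = rng.normal(size=d)

packed, scale, lo = write_path(x, R, clip_ratio=0.96)
x_hat = read_path(packed, scale, lo, R)
print(packed.nbytes, "bytes for", d, "channels")
print("relative error:", np.linalg.norm(x_hat - x) / np.linalg.norm(x))
```

A 128-channel vector packs into 32 bytes of codes plus one scale and one zero point per group, which is where the small overhead above 2 bits per element comes from.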

Results

The research team evaluated OSCAR on four model configurations: Qwen3-4B-Thinking-2507, Qwen3-8B, Qwen3-32B, and GLM-4.7-FP8 (358B parameters). Benchmarks include AIME25, GPQA-Diamond, HumanEval, LiveCodeBench v6, and MATH500, all at 32K maximum generation length.

Accuracy (at 2.28 bits per KV element):

Model                    BF16 Mean   OSCAR Mean   Gap to BF16
Qwen3-4B-Thinking-2507   75.64       71.86        −3.78
Qwen3-8B                 70.84       69.42        −1.42
Qwen3-32B                74.19       74.17        −0.02
GLM-4.7-FP8 (358B)       77.89       78.16        +0.27

For context on how competing methods compare: naive INT2 (no rotation) scores 0.00 on both Qwen3-4B and Qwen3-8B. QuaRot-INT2 (Hadamard-only rotation) scores 1.40 on Qwen3-4B and 10.14 on Qwen3-8B. TurboQuant at 3.25 bits drops 43.90 points on Qwen3-4B-Thinking. SAW-INT4 at 4.25 bits reaches 73.11 on Qwen3-4B, while OSCAR at 2.28 bits reaches 71.86.


The research team also compared against channel-wise methods on AIME25 (Table 1). On Qwen3-8B, OSCAR at 2.38 bits per element (BPE) achieves 66.67±3.33, above KIVI-KV2* at 57.67 (2.26 BPE) and Kitty at 59.67 (2.39 BPE). Note that channel-wise methods require residual buffers or custom page layouts that do not fit standard paged-attention serving, so this comparison is limited to the single shared benchmark where results were available.

Long-context robustness (RULER-NIAH):

Model               Method        16K    32K    64K    128K
Qwen3-4B-Thinking   BF16          99.7   99.3   85.3   81.0
Qwen3-4B-Thinking   QuaRot-INT2   0.0    0.0    15.6   0.0
Qwen3-4B-Thinking   OSCAR         97.8   87.6   61.9   39.5
Qwen3-8B            BF16          98.9   97.3   79.2   78.2
Qwen3-8B            QuaRot-INT2   19.0   9.8    0.0    0.0
Qwen3-8B            OSCAR         93.9   86.3   61.9   45.0

On GLM-4.7-FP8, OSCAR matches the BF16 curve through 128K.

Throughput (H100, batch size 1, context up to 100K):

Decode throughput speedup relative to BF16, at increasing context lengths:

Model               30K     60K     100K
Qwen3-4B-Thinking   1.98×   2.52×   3.08×
Qwen3-8B            1.84×   2.29×   2.88×
GLM-4.7-FP8         1.98×   2.49×   2.83×

At batch size 32, job-level throughput at 100K context reaches 6.17× over BF16 on Qwen3-4B-Thinking and 7.83× on GLM-4.7-FP8. The speedup increases with context length because decoding becomes increasingly KV-bandwidth-bound. Reducing KV memory by roughly 8× directly reduces that bottleneck. The online rotation overhead is absorbed into the decode kernels.
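A back-of-the-envelope estimate shows how much memory is at stake per request. The sketch below assumes a Qwen3-8B-like shape (36 layers, 8 KV heads, head dimension 128); these numbers are assumptions for illustration and should be checked against the actual model config. It also ignores the small BF16 sink and recent windows.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bits_per_element):
    """Bytes of KV cache one token occupies: keys and values across all layers."""
    elements = 2 * num_layers * num_kv_heads * head_dim
    return elements * bits_per_element / 8

# Assumed model shape; verify against the model's config before relying on it.
shape = dict(num_layers=36, num_kv_heads=8, head_dim=128)
context = 100_000

bf16_gb = kv_bytes_per_token(**shape, bits_per_element=16) * context / 1e9
oscar_gb = kv_bytes_per_token(**shape, bits_per_element=2.28) * context / 1e9
ratio = bf16_gb / oscar_gb

print(f"BF16: {bf16_gb:.2f} GB per 100K-token request")
print(f"2.28-bit: {oscar_gb:.2f} GB per 100K-token request")
print(f"ratio: {ratio:.2f}x")
```

At 2.28 bits per element the ratio works out to 16 / 2.28 ≈ 7.0×, close to the nominal 16 / 2 = 8× for pure 2-bit storage.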

Marktechpost’s Visual Explainer

OSCAR: How-To Guide

Overview

What is OSCAR?

OSCAR (Offline Spectral Covariance-Aware Rotation) is a 2-bit KV cache quantization system from Together AI for long-context LLM serving.

Instead of applying a generic Hadamard rotation, OSCAR derives attention-aware rotations from a one-time offline calibration pass, aligning quantization noise with directions that attention is least sensitive to.

The result: INT2 precision with near-BF16 accuracy and full compatibility with paged KV-cache serving.


  • KV memory reduction: roughly 8×
  • Decode speedup: up to 3× at 100K context
  • Bits per KV element: 2.28

Setup

Prerequisites

Before getting started, make sure you have the following in place:

  • Hardware: NVIDIA H100 GPU (80 GB) recommended. A100 may work for smaller models.
  • SGLang installed: OSCAR is integrated into the SGLang serving framework. Install the latest version from source.
  • Triton: Custom fused kernels are written in Triton. Triton ships with most recent PyTorch / SGLang installs.
  • A supported model: Qwen3-4B, Qwen3-8B, Qwen3-32B, GLM-4.7-FP8, or MiniMax-M2.7. Pre-computed rotations are available for all of these.

pip install "sglang[all]" --upgrade
pip install triton

Step 1

Download Pre-Computed Rotations via RotationZoo

Together AI publishes pre-computed rotation matrices and clip thresholds for supported models in RotationZoo on ModelScope. No recalibration needed.

from modelscope import snapshot_download

# Download RotationZoo for your model
rotation_path = snapshot_download(
    'togethercomputer/OSCAR-RotationZoo'
)

The downloaded artifact contains per-layer $R_K$, $R_V$ rotation matrices and clip thresholds $c_K$, $c_V$ for each supported model. These are fixed offline parameters; they are not updated at runtime.

  • Qwen3-4B / 8B / 32B: 2.28 BPE
  • GLM-4.7-FP8 (358B): 2.28 BPE
  • MiniMax-M2.7: 2.28 BPE
  • Custom (run calibration): any model

Step 2 (Optional)

Run Offline Calibration for a Customized Mannequin

If your model is not in RotationZoo, run the one-time calibration pass. OSCAR dumps Q, K, V activations from a small dataset, estimates attention-aware covariance, and writes out rotation matrices and clip thresholds.

python calibrate_oscar.py \
  --model-path /path/to/your-model \
  --calib-data gpqa_diamond \
  --calib-tokens 8192 \
  --output-dir ./oscar_rotations/

Calibration is not task-specific. The paper shows that results have low sensitivity to domain (MMLU, WikiText, GPQA-Diamond all produce similar accuracy). Run it once and reuse across all tasks.

Typical values produced: $c_K \approx 0.96$, $c_V \approx 0.92$ per layer.
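Conceptually, the covariance estimate in a calibration pass can be accumulated batch by batch so that activations never need to be held in memory all at once. The class below is a hypothetical sketch of that idea with synthetic data; it is not the implementation inside `calibrate_oscar.py`, and the name `CovarianceAccumulator` is invented for this example.

```python
import numpy as np

class CovarianceAccumulator:
    """Running estimate of C = (1/N) * sum_n x_n x_n^T over batches of row vectors."""

    def __init__(self, dim):
        self.sum_outer = np.zeros((dim, dim))
        self.count = 0

    def update(self, batch):
        """Add a (batch_size, dim) array of activation vectors."""
        self.sum_outer += batch.T @ batch
        self.count += batch.shape[0]

    def covariance(self):
        return self.sum_outer / self.count

    def rotation(self):
        """Eigenvectors of the covariance, columns sorted by descending eigenvalue."""
        _, eigvecs = np.linalg.eigh(self.covariance())
        return eigvecs[:, ::-1]

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))           # stand-in for dumped query activations
acc = CovarianceAccumulator(dim=8)
for chunk in np.array_split(X, 7):       # simulate streaming over calibration batches
    acc.update(chunk)

U = acc.rotation()
print(np.allclose(acc.covariance(), X.T @ X / len(X)))
```

Streaming the sum of outer products gives the same matrix as computing the covariance over the full activation dump.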

Step 3

Launch SGLang with INT2 KV Cache Enabled

Pass the rotation path and enable INT2 KV mode when launching the SGLang server.

python -m sglang.launch_server \
  --model-path Qwen/Qwen3-8B \
  --kv-cache-dtype int2 \
  --oscar-rotation-path ./oscar_rotations/ \
  --oscar-sink-size 64 \
  --oscar-recent-size 256 \
  --tp 1 \
  --port 30000

Tensor parallelism is supported. For Qwen3-32B use --tp 2 (2×H100). For GLM-4.7-FP8 use --tp 8 (8×H100).

The server exposes a standard OpenAI-compatible API. No client-side changes are needed.

Step 4

Key Configuration Parameters

Parameter             Default   What it controls
--oscar-sink-size     64        First N tokens stored in BF16 as attention sinks
--oscar-recent-size   256       Last N tokens stored in BF16 before current position
$c_K$ (clip ratio)    0.96      Percentile clip for rotated key activations
$c_V$ (clip ratio)    0.92      Percentile clip for rotated value activations
Group size $G_K$      64        Channels per INT2 quantization group (head dim)

The paper identifies (sink=64, recent=256) as the accuracy-efficiency knee. Smaller windows reduce accuracy noticeably; larger windows add BF16 memory overhead with negligible gain.

Step 5

Run Inference and Verify

Once the server is running, query it with the standard OpenAI client:

from openai import OpenAI

# Point the standard OpenAI client at the local SGLang server.
client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="none"
)

# The model name matches the --model-path used at launch.
response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user",
               "content": "Your long-context prompt here"}],
    max_tokens=1024
)
print(response.choices[0].message.content)

Prefix caching works out of the box. OSCAR preserves the standard paged KV-cache abstraction, so SGLang’s radix cache and prefix reuse function normally. No application-level changes are needed.

Results

Accuracy vs BF16 Baseline

Averaged across AIME25, GPQA-Diamond, HumanEval, LiveCodeBench v6, and MATH500 at 32K generation length.

  • Qwen3-4B-Thinking: −3.78

Paper: arXiv:2605.17757   RotationZoo: modelscope.cn/models/togethercomputer/OSCAR-RotationZoo

Key Takeaways

  • OSCAR quantizes LLM KV caches to 2-bit precision by rotating activations using attention-aware covariance matrices, not generic Hadamard transforms.
  • At 2.28 bits per KV element, OSCAR stays within 3.78 points of BF16 accuracy on Qwen3-4B-Thinking while naive INT2 collapses to zero.
  • KV cache memory drops roughly 8×, decode speed improves up to 3× at 100K context, and job-level throughput reaches up to 7.83× at large batch sizes.
  • Pre-computed rotation matrices for Qwen3-4B/8B/32B, GLM-4.7-FP8, and MiniMax-M2.7 are available in RotationZoo; no recalibration is needed.
  • OSCAR integrates directly into SGLang with full paged KV-cache and prefix cache compatibility, requiring no changes to the inference client.
