RYS on small language models

Qwen3 family · four scales · zero training

RYS — Repeat Your Self — duplicates a contiguous block of transformer layers so hidden states pass through the same reasoning circuit twice. David Ng discovered it on Qwen2-72B in March.

I ran it across Qwen3 at four sizes — 0.6B, 1.7B, 8B, 32B — using the same probe methodology at each, so the scaling shape of the result space is observable. Per-configuration sweep results ship alongside each model.

Method

Take a published model. Pick a contiguous run of transformer layers. Duplicate them in place, preserving order. The hidden state now passes through that block twice on its way to the next stage. The weights are unchanged. No retraining, no fine-tuning, no distillation. The model loads slightly larger and runs slightly slower.

Ng's framing is that certain blocks of layers act as indivisible cognitive circuits — routing them twice strengthens whatever capability that circuit carries. alainnothere's contribution was the search tool (llm-circuit-finder) that lets you sweep configurations and find the productive blocks per model.

What this run adds

Two things. First, breadth. Prior published runs covered specific single models. This run applies the same probe methodology across the Qwen3 family at four scales (0.6B, 1.7B, 8B, 32B), so the scaling shape of the result space — how the productive block, the metrics that move, and the trade-offs change as parameter count moves through three orders of magnitude — is directly comparable across sizes.

Second, the sweeps. Each model ships with its full per-configuration sweep results — not just the winning block, but every block tested with all three metrics. The shape of the result space matters as much as the peak.

Results

Qwen3-0.6B-RYS-10-13

28 layers expanded to 31. Layers 10–13 duplicated.

+6.3%
math
~0
EQ
~0
reasoning
28→31
layers

Smallest scale tested. Math improves; EQ and reasoning hold.

Qwen3-1.7B-RYS-7-10

28 layers expanded to 31. Layers 7–10 duplicated. 51 configurations swept.

+9.1%
math
+0.94
EQ
−6%
reasoning
51
configs

Largest math gain in the family. Reasoning drops — the trade is real at this scale.

Qwen3-8B-RYS-16-19

36 layers expanded to 39. Layers 16–19 duplicated. 117 configurations swept.

+6.7%
math
−1.17
EQ
+23.5%
reasoning
117
configs

Largest reasoning gain in the family. Baseline reasoning was weakest at this scale; the duplication heals it.

Qwen3-32B-RYS-20-28

64 layers expanded to 72. Layers 20–28 duplicated (nine layers). 63 configurations swept.

+4.5%
math
+0.04
EQ
+18%
reasoning
63
configs

The only configuration that improves all three metrics simultaneously out of 63 tested.

Three benchmarks measured per configuration: math, EQ (emotional reasoning), and reasoning (multi-step). Numbers come straight off the sweep results in each model's HF repo — see the JSONLs for the exact suites and raw scores.

Models

0.6B 1.7B 8B 32B

All four are GGUF Q4_K_M quantizations — small enough for consumer GPUs. The 0.6B is 424 MB; the 32B is 22 GB. Run with llama.cpp or llama-cpp-python; no special inference path needed. The model card on each repo has a copy-paste invocation.

License & cost

Apache 2.0 across the family. Training cost: zero — there is no training. The artifacts are GGUF Q4_K_M; the 0.6B and 1.7B run on integrated GPUs, the 8B fits on most consumer cards, the 32B wants 24 GB of VRAM.

Datasets

Per-configuration sweep results — every block tried, every metric collected — ship as JSONLs inside each model's HF repo.

Sources

RYSSLMQwen3Layer DuplicationZero Training