RYS on small language models

Take a published model. Pick a contiguous run of transformer layers. Duplicate them in place, preserving order. The hidden state now passes through that block twice on its way to the next stage. The weights are unchanged. No retraining, no fine-tuning, no distillation. The model loads slightly larger and runs slightly slower.

Ng's framing is that certain blocks of layers act as indivisible cognitive circuits — routing them twice strengthens whatever capability that circuit carries. alainnothere's contribution was the search tool (llm-circuit-finder) that lets you sweep configurations and find the productive blocks per model.

What this run adds

Two things. First, breadth. Prior published runs covered specific single models. This run applies the same probe methodology across the Qwen3 family at four scales (0.6B, 1.7B, 8B, 32B), so the scaling shape of the result space — how the productive block, the metrics that move, and the trade-offs change as parameter count moves through three orders of magnitude — is directly comparable across sizes.

Second, the sweeps. Each model ships with its full per-configuration sweep results — not just the winning block, but every block tested with all three metrics. The shape of the result space matters as much as the peak.

Since this run, the same probe methodology has gone wider still — across 10 architecture families (Qwen2.5, Llama, Mistral, Gemma, Yi, Granite, SmolLM, TinyLlama) from 135M to 32B, 22 model repos gathered in two Sovereign Collections. The Qwen3 study below stays the headline: cleanest scaling shape, strongest single result. Full collection on Hugging Face.

Results

Qwen3-0.6B-RYS-10-13

28 layers expanded to 31. Layers 10–13 duplicated.

+6.3%

math

reasoning

28→31

layers

Smallest scale tested. Math improves; EQ and reasoning hold.

Qwen3-1.7B-RYS-7-10

28 layers expanded to 31. Layers 7–10 duplicated. 51 configurations swept.

+9.1%

math

+0.94

−6%

reasoning

configs

Largest math gain in the family. Reasoning drops — the trade is real at this scale.

Qwen3-8B-RYS-16-19

36 layers expanded to 39. Layers 16–19 duplicated. 117 configurations swept.

+6.7%

math

−1.17

+23.5%

reasoning

117

configs

Largest reasoning gain in the family. Baseline reasoning was weakest at this scale; the duplication heals it.

Qwen3-32B-RYS-20-28

64 layers expanded to 72. Layers 20–28 duplicated (nine layers). 63 configurations swept.

+4.5%

math

+0.04

+18%

reasoning

configs

The only configuration that improves all three metrics simultaneously out of 63 tested.

Three benchmarks measured per configuration: math, EQ (emotional reasoning), and reasoning (multi-step). Numbers come straight off the sweep results in each model's HF repo — see the JSONLs for the exact suites and raw scores.

A caveat earned since: these scores come from thin search probes (16 math / 16 EQ / 17 reasoning, coarsely quantized). They are a good heuristic for finding productive duplication blocks, but a poor capability benchmark — and at small scale (≤~1.5B) their failure modes can manufacture apparent gains that don't reproduce on a full benchmark. Read the small-model deltas as search signal, not capability claims; the evidence is most trustworthy at the largest scales, where RYS was first demonstrated.

Models

All four are GGUF Q4_K_M quantizations — small enough for consumer GPUs. The 0.6B is 424 MB; the 32B is 22 GB. Run with llama.cpp or llama-cpp-python; no special inference path needed. The model card on each repo has a copy-paste invocation.

License & cost

The Qwen3 models here are Apache 2.0; across the wider sweep, each model inherits its base model's license. Training cost: zero — there is no training. The artifacts are GGUF Q4_K_M; the 0.6B and 1.7B run on integrated GPUs, the 8B fits on most consumer cards, the 32B wants 24 GB of VRAM.

Datasets

Per-configuration sweep results — every block tried, every metric collected — ship as JSONLs inside each model's HF repo.

RYS on small language models

Method

What this run adds

Results

Qwen3-0.6B-RYS-10-13

Qwen3-1.7B-RYS-7-10

Qwen3-8B-RYS-16-19

Qwen3-32B-RYS-20-28

Models

License & cost

Datasets

Sources