Audio note: this article contains 48 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.

I've been researching tensor networks as a more interpretable architecture, but whenever I tell people this, they always ask "But is it any good?"

So I trained multiple 500M-parameter LLMs on FineWeb and found that the tensor variant needed ~4% more batches of data to match CE loss.

There are a few caveats, so my personal estimate is somewhere between 15% worse and 10% better. Details below.

The Architecture

Replacing MLP w/ a Bilinear Layer

An MLP is a linear encoder, ReLU, then linear decoder.

A bilinear layer asks "what's better than one encoder? Two!"

$\text{Bilinear}(x) = D(Lx \odot Rx)$

Where $\odot$ means "element-wise multiply", e.g. [...]
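To make the formula concrete, here is a minimal sketch in plain Python (no dependencies) of an MLP next to a bilinear layer. The helper names (`matvec`, `hadamard`, `mlp`, `bilinear`) and the tiny hand-picked weights are assumptions for illustration only, not the post's training code; real layers use learned matrices and a tensor library.

```python
# Toy weights and helper names are hypothetical, chosen only for illustration.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def hadamard(a, b):
    """Element-wise multiply: the 'odot' in the formula."""
    return [ai * bi for ai, bi in zip(a, b)]

def relu(v):
    return [max(0.0, vi) for vi in v]

def mlp(x, E, D):
    """Linear encoder E, ReLU, then linear decoder D."""
    return matvec(D, relu(matvec(E, x)))

def bilinear(x, L, R, D):
    """Two linear encoders L and R, multiplied element-wise, then decoder D."""
    return matvec(D, hadamard(matvec(L, x), matvec(R, x)))

L = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # first encoder  (3 hidden x 2 input)
R = [[1.0, 1.0], [1.0, -1.0], [0.0, 2.0]]  # second encoder (3 hidden x 2 input)
D = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]    # decoder        (2 output x 3 hidden)
x = [1.0, 2.0]

print(hadamard([1, 2, 3], [4, 5, 6]))  # [4, 10, 18]

y = bilinear(x, L, R, D)
print(y)  # [15.0, -14.0]

# With no ReLU in the way, the layer is purely quadratic in x:
# doubling the input multiplies the output by exactly 4.
y_scaled = bilinear([2.0 * xi for xi in x], L, R, D)
print(y_scaled)  # [60.0, -56.0]
```

The scaling check at the end is the practical difference from the MLP: the bilinear layer has no non-polynomial activation, so its output is a degree-2 polynomial of its input.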

A SwiGLU Layer (Swish Gated Linear Unit) says "Let's add in nonlinearities"

$\text{SwiGLU}(x) = D(\text{swish}(Lx) \odot Rx)$
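As a rough illustration of what the added nonlinearity does, here is a plain-Python sketch of a SwiGLU layer. It assumes the common definition swish(z) = z * sigmoid(z); the helper names and toy weights are hypothetical and not taken from the post.

```python
import math

# Toy weights and helper names are hypothetical, chosen only for illustration.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def swish(z):
    """swish(z) = z * sigmoid(z), written as z / (1 + exp(-z))."""
    return z / (1.0 + math.exp(-z))

def swiglu(x, L, R, D):
    """Same shape as a bilinear layer, but the L branch passes through swish."""
    gate = [swish(z) for z in matvec(L, x)]
    value = matvec(R, x)
    return matvec(D, [g * v for g, v in zip(gate, value)])

L = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
R = [[1.0, 1.0], [1.0, -1.0], [0.0, 2.0]]
D = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]
x = [1.0, 2.0]

y = swiglu(x, L, R, D)
y2 = swiglu([2.0 * xi for xi in x], L, R, D)

# Because swish is not a polynomial, doubling the input no longer
# scales the output by exactly 4, unlike the bilinear layer.
print(y, y2)
```

Dropping the `swish` call from `swiglu` recovers the bilinear layer, which is why the two are natural to compare head to head.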

SwiGLU is a SOTA architecture & Bilinear is a tensor network.
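One way to see why the bilinear layer counts as a tensor network: its three weight matrices can be folded into a single third-order tensor that is contracted with the input twice. The index names and the symbol $B$ below are my own notation, assumed for illustration.

$$\text{Bilinear}(x)_k = \sum_h D_{kh}\,(Lx)_h\,(Rx)_h = \sum_{i,j} B_{kij}\,x_i\,x_j, \qquad B_{kij} = \sum_h D_{kh}\,L_{hi}\,R_{hj}$$

SwiGLU has no such form, because swish is not a polynomial in its input.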

Replacing Softmax Attn w/ Bilinear Attn

For a tensor network, we are only allowed polynomial nonlinearities. For attention, this means we need to replace softmax w/ [...]

---

Outline:

(00:48) The Architecture

(00:51) Replacing MLP w/ a Bilinear Layer

(01:45) Replacing Softmax Attn w/ Bilinear Attn

(02:24) Experiment & Results

(03:48) Caveats:

(03:52) (1) Softmax attention ran faster cause it has a CUDA kernel

(04:19) (2) Bilinear Attention can run much faster than Softmax Attn

(05:20) (3) Bilinear Attention has more Parameters

(05:52) (4) This was the 2nd-Dumbest Tensor-Attn Variant

(06:19) Replication & Trained Models

(06:31) Future Work

(06:56) Path to Impact

(07:59) Interp w/ Tensor Networks

(10:29) Appendix A: Noam Shazeer's 2020 paper:

(11:03) Appendix B: Scaling of Bilinear Attention

(13:46) Appendix C: Bilinear Attention Expressivity

(14:22) Appendix D: But what about Flash Attention?

The original text contained 2 footnotes which were omitted from this narration.

---

First published:
January 12th, 2026

Source:
https://www.lesswrong.com/posts/hp9bvkiN3RzHgP9cq/tensor-transformer-variants-are-surprisingly-performant

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Three neural network architecture diagrams: MLP, Bilinear, and SwiGLU models.

Table comparing Softmax and Bi_attn performance percentages for Bilinear and Swiglu methods.

Table comparing Softmax and Bi_attn model performance percentages for Bilinear and Swiglu methods.

Table comparing FFN model variants with training steps and performance metrics across two training conditions.

“Tensor-Transformer Variants are Surprisingly Performant” by Logan Riggs

LessWrong (30+ Karma)