“If you can’t fit the table in memory, throw the table away.” — A sensible engineer, probably
Overview
Multi-kilobyte word embeddings and multi-gigabyte language models have become the status quo for NLP, yet there is an alternative lineage whose intellectual roots run through cognitive science, control theory, and even the mathematics of random projection.
In this post we walk, line by line, through a four-component text classifier I built that:
- Extracts features with an LMU, a Legendre Memory Unit derived from control-theoretic systems that yields provably optimal continuous-time memory kernels (sketched below).
- Mixes temporal context with a micro-RWKV stack, a recurrent form of the popular RWKV architecture that keeps sequence-length scaling at O(T) instead of O(T²) (sketched below).
- Hashes every token into a binary ±1 hyper-vector on the fly, avoiding the V×D lookup table entirely.
- Combines the dense LMU/RWKV features with a bundled hyper-vector in a bind-and-bundle head to yield a single log-odds scalar (hashing and bundling are sketched together below).
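To make the first bullet concrete, here is a minimal NumPy sketch of the memory cell, assuming the standard Legendre delay system; the `order`, `theta`, and forward-Euler step are illustrative choices, not necessarily what the full model uses:

```python
import numpy as np

def lmu_matrices(order: int, theta: float):
    """Continuous-time (A, B) of the Legendre delay system, scaled by 1/theta."""
    i = np.arange(order)[:, None]
    j = np.arange(order)[None, :]
    A = (2 * i + 1) * np.where(i < j, -1.0, (-1.0) ** (i - j + 1))
    B = (2 * i + 1) * (-1.0) ** i              # column vector, shape (order, 1)
    return A / theta, B / theta

def lmu_scan(u, order=16, theta=64.0, dt=1.0):
    """Feed a scalar sequence u[t] through the memory with a forward-Euler update."""
    A, B = lmu_matrices(order, theta)          # theta is already folded into A and B
    m = np.zeros((order, 1))
    states = []
    for u_t in u:
        m = m + dt * (A @ m + B * u_t)         # discretized m' = A m + B u
        states.append(m.ravel().copy())
    return np.stack(states)                    # (T, order) matrix of memory features
```

Euler is the simplest possible discretization; a more careful cell might use zero-order hold, which stays stable for larger `order`/`dt` ratios.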
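The micro-RWKV bullet rests on the fact that RWKV's attention-like mixing can be written as a recurrence, so cost grows linearly with sequence length. A hedged single-channel sketch (scalar keys and values, no numerical-stability tricks; the decay `w` and current-step bonus `u` are assumed parameter names):

```python
import numpy as np

def rwkv_time_mix(k, v, w, u):
    """One WKV channel in recurrent form: O(T) in sequence length.

    k, v : arrays of per-step key / value scalars (length T)
    w    : positive time-decay, u : bonus applied to the current step
    """
    a = 0.0   # running numerator   (decayed sum of exp(k_i) * v_i)
    b = 0.0   # running denominator (decayed sum of exp(k_i))
    out = []
    for k_t, v_t in zip(k, v):
        # the current token gets the extra bonus u before entering the state
        out.append((a + np.exp(u + k_t) * v_t) / (b + np.exp(u + k_t)))
        a = np.exp(-w) * a + np.exp(k_t) * v_t
        b = np.exp(-w) * b + np.exp(k_t)
    return np.array(out)

y = rwkv_time_mix(np.random.randn(10), np.random.randn(10), w=0.5, u=0.1)
```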
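The last two bullets can be sketched together: tokens are hashed into ±1 hyper-vectors instead of being looked up in a V×D table, then bound and bundled. The dimensionality `D` and the `blake2b` seeding below are illustrative assumptions, not the post's exact choices:

```python
import hashlib
import numpy as np

D = 8192  # hyper-vector dimensionality (illustrative)

def token_hv(token: str, dim: int = D) -> np.ndarray:
    """Deterministically hash a token into a ±1 hyper-vector: no V×D embedding table."""
    seed = int.from_bytes(hashlib.blake2b(token.encode(), digest_size=8).digest(), "little")
    rng = np.random.default_rng(seed)
    return rng.integers(0, 2, size=dim) * 2 - 1   # entries in {-1, +1}

def bind(a, b):
    return a * b                                  # element-wise product: binding

def bundle(hvs):
    return np.sign(np.sum(hvs, axis=0))           # majority vote: bundling

# bundle the hashed tokens of a short document into one hyper-vector
doc_hv = bundle([token_hv(t) for t in "throw the table away".split()])
```

Because the vector is derived from the token string itself, any out-of-vocabulary word still gets a stable representation for free.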
On commodity Colab hardware, the entire model, including vocabulary building, training on…