How we trained a wake word at 0.8% EER with 25K parameters

The strongest ViolaWake reference model is intentionally small. The documented production recipe uses a TemporalCNN over OpenWakeWord embeddings: 96-dimensional embedding frames, a 9-frame window, two 1D convolution layers, batch normalization, dropout, adaptive max pooling, and a compact MLP head.

The result is a 25,409-parameter wake head that exports to about 102 KB as ONNX. That wake head pairs with the shared OpenWakeWord backbone, which the repository documents as about 1.33 MB, for roughly 1.43 MB total runtime footprint for wake detection.
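
The size figures above can be sanity-checked with simple arithmetic. The sketch below assumes the head's weights are stored as 32-bit floats, uses decimal units (1 KB = 1,000 bytes), and ignores ONNX graph metadata; none of these details is stated in the source.

```python
# Back-of-the-envelope footprint check.
# Assumptions (not confirmed by the source): float32 weights,
# decimal KB/MB, and negligible ONNX graph overhead.
PARAMS = 25_409
BYTES_PER_FLOAT32 = 4

head_bytes = PARAMS * BYTES_PER_FLOAT32   # raw weight storage for the wake head
head_kb = head_bytes / 1_000              # decimal kilobytes

backbone_mb = 1.33                        # backbone size as documented in the repository
total_mb = backbone_mb + head_kb / 1_000  # combined runtime footprint

print(f"head: {head_kb:.1f} KB, total: {total_mb:.2f} MB")
```

Under those assumptions the raw weights come to about 101.6 KB, which is consistent with the roughly 102 KB export and the roughly 1.43 MB total.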

Architecture

The TemporalCNN preserves the order of frames across the wake word. Instead of flattening or mean-pooling all frames immediately, the model applies convolution across time so it can learn local temporal patterns in the phrase. The architecture is documented in src/violawake_sdk/training/temporal_model.py.
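
To make the parameter budget concrete, the sketch below counts trainable parameters for one possible layer configuration. The channel widths (64 and 32), the kernel size (3), and the hidden size of the head (16) are assumptions, chosen only because they reproduce the stated 25,409 total with 96-dimensional inputs; the actual values in temporal_model.py may differ.

```python
# Parameter count for a hypothetical TemporalCNN configuration.
# Layer widths and kernel size are assumptions, not taken from the source.

def conv1d_params(c_in, c_out, kernel):
    # One weight per (input channel, output channel, kernel tap), plus biases.
    return c_in * c_out * kernel + c_out

def batchnorm_params(channels):
    # Learnable scale and shift per channel (running statistics are not trainable).
    return 2 * channels

def linear_params(n_in, n_out):
    return n_in * n_out + n_out

layers = {
    "conv1 (96 -> 64, k=3)": conv1d_params(96, 64, 3),
    "bn1 (64)": batchnorm_params(64),
    "conv2 (64 -> 32, k=3)": conv1d_params(64, 32, 3),
    "bn2 (32)": batchnorm_params(32),
    # Dropout, activations, and adaptive max pooling add no parameters.
    "fc1 (32 -> 16)": linear_params(32, 16),
    "fc2 (16 -> 1)": linear_params(16, 1),
}

total = sum(layers.values())
for name, count in layers.items():
    print(f"{name:24s} {count:6d}")
print(f"{'total':24s} {total:6d}")
```

In this configuration the first convolution accounts for most of the budget (18,496 parameters), because it maps the 96-dimensional embedding down to a narrower channel count.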

Training data

The proven recipe combines user positives, TTS positives, confusable negatives, speech negatives, and universal negative corpora where available. The goal is not simply to separate "wake word" from silence. The goal is to separate the wake word from normal speech, music, noise, and words that sound close enough to trigger a naive detector.

Why confusables matter

False activations often come from near phrases. A wake word model for "viola" has to learn that "violin", "violent", "villa", and other similar sounds are not the wake word. ViolaWake's recipe uses two rounds of confusable negative mining: a broad round and a tighter hard-negative round.
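
The source does not describe how the two mining rounds are implemented. As a rough illustration only, the sketch below uses spelling similarity for the broad round and a model score for the tighter round. The function names, the thresholds, and the `score_fn` callback are hypothetical, and a real pipeline would more likely compare phonetic or acoustic similarity than spelling.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Spelling similarity in [0, 1]; a crude stand-in for phonetic closeness.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def broad_round(wake_word, vocabulary, min_similarity=0.5):
    # Round 1 (broad): keep any word that looks reasonably close to the wake word.
    return [
        word for word in vocabulary
        if word.lower() != wake_word.lower()
        and similarity(wake_word, word) >= min_similarity
    ]

def hard_round(candidates, score_fn, min_score=0.4):
    # Round 2 (tight): keep only candidates the current model still scores highly.
    # score_fn is a hypothetical callback returning the detector score for a word.
    return [word for word in candidates if score_fn(word) >= min_score]

vocabulary = ["violin", "violent", "villa", "table", "viola"]
candidates = broad_round("viola", vocabulary)

# Made-up scores standing in for a trained detector's outputs.
fake_scores = {"violin": 0.6, "violent": 0.2, "villa": 0.45}
hard_negatives = hard_round(candidates, fake_scores.get)
print(candidates, hard_negatives)
```

The second round is what makes the negatives "hard": it selects examples by how much they currently fool the model, not by how similar they look on paper.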

Metrics

The documented reference recipe reports d-prime 8.577, EER 0.8%, and AUC 0.9993. The public benchmark v2 is harsher and reports 5.49% EER on a shared adversarial corpus. Both numbers are useful: the first describes the reference recipe, and the second sets expectations for a more challenging comparison.
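
For readers unfamiliar with these metrics, the sketch below shows common textbook definitions computed from lists of detector scores. The source does not state which exact definitions or evaluation code produced the reported numbers, so treat this as an illustration and not as the project's evaluation script.

```python
from statistics import mean, pvariance

def d_prime(pos, neg):
    # Separation of the two score distributions in pooled standard deviations.
    pooled_sd = (0.5 * (pvariance(pos) + pvariance(neg))) ** 0.5
    return (mean(pos) - mean(neg)) / pooled_sd

def equal_error_rate(pos, neg):
    # Sweep thresholds over observed scores and return the error rate at the
    # threshold where false accepts and false rejects are closest to equal.
    best_gap, best_eer = None, None
    for threshold in sorted(set(pos) | set(neg)):
        frr = sum(p < threshold for p in pos) / len(pos)   # missed wake words
        far = sum(n >= threshold for n in neg) / len(neg)  # false activations
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer

def auc(pos, neg):
    # Probability that a random positive outscores a random negative (ties count half).
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy scores for illustration only.
pos_scores = [0.9, 0.8, 0.3, 0.7]
neg_scores = [0.1, 0.2, 0.75, 0.4]
print(d_prime(pos_scores, neg_scores),
      equal_error_rate(pos_scores, neg_scores),
      auc(pos_scores, neg_scores))
```

EER depends heavily on which negatives are in the evaluation set, which is why the same kind of metric can read 0.8% on one corpus and 5.49% on an adversarial one.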

Practical lesson

Small models can work when the embedding backbone is strong and the negative set is serious. The training process matters as much as the head architecture. If you only train on positives and easy silence, the model will probably fail in a living room.
