Recently I've been enjoying building instruments with Web Audio where constraints, such as a fixed number of voices, are at the core of the design. Voice allocation is tricky to get right, so in this article I'd like to show how I've been approaching working with a set of voices while maintaining scheduling accuracy.
This article will cover the following ground:
- Timing in Web Audio, why simultaneously reading and writing the future is tricky
- Cross-thread communication, how two threads share a donut
- Constrained polyphony, aka why do we have constraints?
- Voice stealing, when it’s your turn to shine
By the end of this article we will have explored how we might implement voice stealing algorithms, and we can leave with some ideas about when we might reach for one or the other. This may provide some building blocks or starting ideas for creating digital instruments on the web with a fixed voice count that can rely on precise timing when managing their voices.
Simultaneously reading and writing the future is tricky
Why scheduling note events while managing voice allocation is difficult with Web Audio primitives.

Let's begin by walking through the simple path to playing a synth in Web Audio. When I press down a key on my keyboard, I wish to hear a synth at a pitch corresponding to the pressed key. When I release the key, the synth should fade out. When I play multiple keys, I expect to hear multiple pitches simultaneously. This is wonderfully straightforward to do in Web Audio. On a keypress I can create and connect an OscillatorNode with a GainNode, controlling its amplitude using parameter ramps over the GainNode's gain value. This unconstrained polyphony is pretty performant and can get us a long way.
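To make that concrete, here's roughly what the unconstrained approach looks like. The midiToFrequency helper and the note bookkeeping are just for illustration, not code from the demos below.

// A minimal sketch of unconstrained polyphony: one oscillator and gain per note
const audioContext = new AudioContext();
const activeNotes = new Map();

const midiToFrequency = (pitch) => 440 * Math.pow(2, (pitch - 69) / 12);

const noteOn = (pitch) => {
  const oscillatorNode = new OscillatorNode(audioContext, {
    frequency: midiToFrequency(pitch),
  });
  const gainNode = new GainNode(audioContext, { gain: 0 });
  oscillatorNode.connect(gainNode);
  gainNode.connect(audioContext.destination);
  oscillatorNode.start(audioContext.currentTime);
  // Ramp the amplitude up for the attack
  gainNode.gain.linearRampToValueAtTime(1, audioContext.currentTime + 0.01);
  activeNotes.set(pitch, { oscillatorNode, gainNode });
};

const noteOff = (pitch) => {
  const note = activeNotes.get(pitch);
  if (!note) return;
  // Fade out, then stop the oscillator once the release has had time to finish
  note.gainNode.gain.setTargetAtTime(0, audioContext.currentTime, 0.1);
  note.oscillatorNode.stop(audioContext.currentTime + 1);
  activeNotes.delete(pitch);
};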
Imagine, however, I want to have a fixed number of voices. I could start by keeping track of the voices in an array, and if I send a note to an index whose voice is already running, that voice could fade out quickly and a new one could start up afterward.
The graphic below demonstrates this idea. Along the horizontal timeline, the first note triggers voice one. This voice is not playing any note, so it can start instantly. While it is playing out, we launch a new note which triggers the second voice, which is free. When we play yet another note, this triggers the voice holding the oldest note, which is voice one. That voice is instructed to move from its existing note to the next one, indicated by the green colour.
const audioContext = new AudioContext();
const voices = [];

const startVoice = (index) => {
  // If this slot already holds a voice, fade it out before starting a new one
  if (voices[index]) {
    endVoice(index);
  }
  const oscillatorNode = new OscillatorNode(audioContext);
  const gainNode = new GainNode(audioContext, { gain: 0 });
  oscillatorNode.connect(gainNode);
  gainNode.connect(audioContext.destination);
  oscillatorNode.start(audioContext.currentTime);
  gainNode.gain.linearRampToValueAtTime(1, audioContext.currentTime + 1);
  voices[index] = { oscillatorNode, gainNode };
};

const endVoice = (index) => {
  const voice = voices[index];
  // Fade out, schedule the oscillator to stop, and free the slot immediately
  voice.gainNode.gain.setTargetAtTime(0, audioContext.currentTime, 0.1);
  voice.oscillatorNode.stop(audioContext.currentTime + 1);
  voices[index] = null;
};
This is fine but we can do better! There are a couple of things that don't fully meet my aim. Firstly, I'm being loose with the voice count: I start new voices while others might still be fading out, meaning I overflow the fixed voice count. If I were emulating an analogue synthesizer with a strict number of voice circuits, this would not be a faithful emulation - I need to reuse the nodes I've created. Secondly, I'm removing the voice from the voices array when I call endVoice, but I should be waiting until that voice has finished releasing and its gain amplitude is at 0. This would allow me to determine whether a voice is "active" or "free".
Unfortunately we can't easily achieve that second point with timing reliability [1]. If we were being incredibly slack, we might be tempted to use a setTimeout to clear the voice when we think it will be free.
const msDelay = 1000;
voice.gainNode.gain.setTargetAtTime(0, audioContext.currentTime, 0.1);
// Guess how long the release will take and free the voice slot after that
setTimeout(() => {
  voices[index] = null;
}, msDelay);
This is a bad idea because we can't predict what other tasks the main thread will be handling when our setTimeout callback is due to run, so the callback might execute later than we expect. Also, consider how finicky the code would get if we wanted the release ramp time to remain adjustable while a voice is releasing.
So we have some difficulties setting voice states at a point in the future on the main thread. Instead we need access to the audio sample data - a perfect use for the AudioWorkletProcessor. The limitation of this approach is that we can't use Web Audio API primitives such as OscillatorNode - we're now responsible for writing our own voice DSP.
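If you haven't set up an AudioWorkletProcessor before, the wiring looks roughly like this. The module path is a placeholder, but the processor name matches the "synth" name used later in this article.

// main thread: load the processor module before constructing any nodes
await audioContext.audioWorklet.addModule("synth-processor.js"); // placeholder path

// synth-processor.js (audio thread): a skeleton processor
class Synth extends AudioWorkletProcessor {
  process(inputs, outputs, parameters) {
    const output = outputs[0];
    // ...write your own voice DSP into output[channel][frame] here
    return true; // keep the processor alive
  }
}
registerProcessor("synth", Synth);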
Two threads share a donut
If we want to use an AudioWorkletProcessor (AWP) we need to find a way to send note event data to it from the main thread. There are a number of ways to communicate with an AWP. First, you can pass data by posting a message. Even though this data will have a small memory footprint and we might hope for this to be fast [2], we don't have precise timing control here, and we should also aspire to avoid unnecessary memory allocations that can impact the audio [3]. So posting data by message isn't ideal for us. Another method for communicating with AWPs is via parameters, but these are not exactly suitable for our use either.
A solution to this problem can be found in sharing memory between the two threads thanks to SharedArrayBuffer. Each note event only needs to be written to memory once and read once, so by using a ring buffer we can queue new events into memory on the main thread and dequeue them for processing on the audio thread. This is performant and helps us on our way.
The diagram below shows an overview of the architecture of the main and audio threads and how they communicate.
So let’s step through the necessary code on the main thread.
Our note event data comprises id, pitch, stage and time properties. These are all numbers, varying in their size requirements. The stage property indicates whether a note is in the "on" or "off" stage, and the id property helps us track notes across their stages.
/**
 * @typedef {Object} NoteEvent
 * @property {number} pitch - A MIDI pitch value (0-127) - Uint8
 * @property {number} id - An unsigned integer ID - Uint32
 * @property {number} stage - A binary flag (1 or 0) - Uint8
 * @property {number} time - AudioContext time - Float64
 */
This data packs into 14 bytes, so we can create some shared memory sized to a multiple of that. I choose 256 events because it is double the range of MIDI note values, which is plenty for now and should allow for pretty dense playing styles.
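For reference, the NOTE_EVENT_SIZE constant used throughout the following snippets is just that byte count:

// 1 byte pitch + 4 bytes id + 1 byte stage + 8 bytes time = 14 bytes
const NOTE_EVENT_SIZE = 1 + 4 + 1 + 8;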
const sab = new SharedArrayBuffer(NOTE_EVENT_SIZE * 256);
Next we create a ring buffer on the main thread and pass this to a NoteEventWriter class, which we’ll look at shortly.
const ringBuffer = new RingBuffer(sab, Uint8Array);
const noteWriter = new NoteEventWriter(ringBuffer);
I’m using the ringbuf.js package by Paul Adenot. This NoteEventWriter class is very closely adapted from one of the excellent examples provided in that package’s documentation. You can see in the enqueueChange method how the note can be packed into available buffer space.
export class NoteEventWriter {
  constructor(ringbuf) {
    if (ringbuf.type() !== 'Uint8Array') {
      throw TypeError('This class requires a ring buffer of Uint8Array');
    }
    this.ringbuf = ringbuf;
    this.mem = new ArrayBuffer(NOTE_EVENT_SIZE);
    this.array = new Uint8Array(this.mem);
    this.view = new DataView(this.mem);
  }

  /**
   * Enqueue a single note event change.
   * @param {NoteEvent} noteEvent
   * @returns {boolean} True if enqueued successfully.
   */
  enqueueChange(noteEvent) {
    if (this.ringbuf.available_write() < NOTE_EVENT_SIZE) {
      return false;
    }
    this.view.setUint8(0, Math.min(noteEvent.pitch, 127));
    this.view.setUint32(1, noteEvent.id, true); // little-endian
    this.view.setUint8(5, noteEvent.stage ? 1 : 0);
    this.view.setFloat64(6, noteEvent.time, true);
    return this.ringbuf.push(this.array) === NOTE_EVENT_SIZE;
  }
}
Now that we have a mechanism for putting appropriately sized data into the ring buffer, let's put it to work. The SynthPlayer class instantiates an AWP, passing it the SharedArrayBuffer, and connects it to the audio graph. By calling its scheduleEvent method the SynthPlayer adds note events into the ring buffer and thereby communicates data to the audio thread.
class SynthPlayer {
  /**
   * @param {number} polyphony
   */
  constructor(polyphony) {
    const sab = new SharedArrayBuffer(NOTE_EVENT_SIZE * 256);
    const ringBuffer = new RingBuffer(sab, Uint8Array);
    this.writer = new NoteEventWriter(ringBuffer);
    this.synthWorklet = new AudioWorkletNode(audioContext, "synth", {
      processorOptions: {
        polyphony,
        sab,
      },
    });
    this.gainNode = new GainNode(audioContext, { gain: 1 });
    this.synthWorklet.connect(this.gainNode);
    this.gainNode.connect(audioContext.destination);
    // destroy and onMessage are defined elsewhere in the class (not shown here)
    this.destroy = this.destroy.bind(this);
    this.onMessage = this.onMessage.bind(this);
    this.synthWorklet.port.onmessage = this.onMessage;
  }

  /**
   * @param {NoteEvent} noteEvent
   */
  scheduleEvent(noteEvent) {
    this.writer.enqueueChange(noteEvent);
  }
}
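On the main thread, a keyboard handler then only needs to enqueue note events. The key bookkeeping below is an illustrative sketch rather than the demo's actual input handling:

// A sketch of driving the SynthPlayer from key presses. nextNoteId and
// heldNotes are illustrative; the id just needs to match across on/off stages.
const synth = new SynthPlayer(4);
let nextNoteId = 1;
const heldNotes = new Map(); // pitch -> note id

const onKeyDown = (pitch) => {
  const id = nextNoteId++;
  heldNotes.set(pitch, id);
  synth.scheduleEvent({ pitch, id, stage: 1, time: audioContext.currentTime });
};

const onKeyUp = (pitch) => {
  const id = heldNotes.get(pitch);
  if (id === undefined) return;
  heldNotes.delete(pitch);
  synth.scheduleEvent({ pitch, id, stage: 0, time: audioContext.currentTime });
};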
This feels like a good time to jump a few steps forward and hear something. Start the interactive widget below, then play some notes using your keyboard (QWERTY - A, W, S, E, D, F, T, G, Y, H, U, J, K, O, L for a chromatic octave, Z and X jump octaves) or use the buttons to play sequences of notes. The oscillator is derived from Mutable Instruments Plaits. In the graphic the voices correspond to the circles, and while a voice is playing its MIDI note number is displayed above it.
If all went well, you may have noticed during playback that when 4 notes are held down, new notes will steal an existing voice. This logic is all handled in the AudioWorkletProcessor.
In its constructor we retrieve our shared array buffer, instantiate a new RingBuffer and a NoteEventReader class.
class Synth extends AudioWorkletProcessor {
  // etc...
  constructor(options) {
    super(options);
    const sab = options.processorOptions.sab;
    const ringBuffer = new RingBuffer(sab, Uint8Array);
    this.notesReader = new NoteEventReader(ringBuffer);
    // etc...
  }
}
This NoteEventReader class does the inverse of the NoteEventWriter, dequeuing data from the ring buffer.
class NoteEventReader {
  constructor(ringbuf) {
    this.ringbuf = ringbuf;
    this.mem = new ArrayBuffer(NOTE_EVENT_SIZE);
    this.array = new Uint8Array(this.mem);
    this.view = new DataView(this.mem);
  }

  /**
   * Attempt to dequeue a single note event change.
   * @param {Object} o An object that receives four properties: `pitch`, `id`, `stage`, `time`
   * @returns {boolean} True if a note change has been dequeued, false otherwise.
   */
  dequeueChange(o) {
    if (this.ringbuf.empty()) {
      return false;
    }
    const rv = this.ringbuf.pop(this.array);
    o.pitch = this.view.getUint8(0);
    o.id = this.view.getUint32(1, true);
    o.stage = this.view.getUint8(5);
    o.time = this.view.getFloat64(6, true);
    return rv === this.array.length;
  }
}
Once dequeued, that data will no longer exist in the ring buffer. Next comes a VoiceAllocator class to distribute notes to voices. When processing, we allocate notes that fall within the current rendering quantum and queue future ones for later use. In the process function we first handle note events from the NoteEventReader and then the queued events.
class VoiceAllocator {
  // Used to avoid allocations in realtime
  noteEventPool = Array(MAX_NOTES_IN_POOL)
    .fill()
    .map(() => ({
      pitch: 0,
      id: 0,
      stage: 0,
      time: 0,
    }));
  noteEventPoolIndex = 0;
  // etc...

  process() {
    // sampleDuration is 1 / sampleRate (defined elsewhere in the processor)
    const blockStartTime = currentTime;
    const blockEndTime = currentTime + RENDER_QUANTUM * sampleDuration;
    for (let frame = 0; frame < RENDER_QUANTUM; frame++) {
      let noteEvent;
      while (true) {
        noteEvent = this.noteEventPool[this.noteEventPoolIndex];
        if (!this.noteReader.dequeueChange(noteEvent)) {
          break;
        }
        this.noteEventPoolIndex =
          (this.noteEventPoolIndex + 1) % MAX_NOTES_IN_POOL;
        if (noteEvent.time >= blockEndTime) {
          this.addNoteToQueue(noteEvent);
        } else {
          this.allocateNote(noteEvent, frame);
        }
      }
    }
    while (this.noteQueue.length > 0 && this.noteQueue[0].time < blockEndTime) {
      const note = this.noteQueue.shift();
      if (note.time < currentTime) {
        console.log("Missed task", note);
      }
      const frame = Math.floor((note.time - blockStartTime) / sampleDuration);
      this.allocateNote(note, frame);
    }
  }
}
To avoid allocating new memory in the process method, we dequeue into the pre-allocated note event objects found in the noteEventPool array stored on the VoiceAllocator class.
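The addNoteToQueue method isn't shown above. A minimal sketch - copying the pooled event so it can't be overwritten before it's due, and keeping the queue sorted by time - might look like this:

// A sketch: copy the event (the pooled object will be reused) and keep the
// queue ordered by time so process() only needs to inspect the front of it.
// A stricter implementation would draw these copies from a second pool to
// avoid allocating on the audio thread.
addNoteToQueue(noteEvent) {
  const copy = {
    pitch: noteEvent.pitch,
    id: noteEvent.id,
    stage: noteEvent.stage,
    time: noteEvent.time,
  };
  let i = this.noteQueue.length;
  while (i > 0 && this.noteQueue[i - 1].time > copy.time) {
    i--;
  }
  this.noteQueue.splice(i, 0, copy);
}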
Why do we have constraints?
Instruments nearly always have some form of constraint. In the real world a piano's polyphony is the total number of hammers within it, which corresponds to the number of keys available for playing it. An analogue polyphonic synthesizer will typically allow for fewer voices, often between 2 and 64. This constraint is defined by the number of parallel voice circuits found inside the synthesizer. Increasing polyphony requires more components, which means more expensive and heavier instruments. In the digital world the constraints are less clear, but there are still barriers to infinite polyphony. These may be computational power - we can only do so much on our CPUs - or creative and aesthetic choices. When emulating a polyphonic analogue synthesizer, an important part of the work comes in making sure that our synthesizer's voicing behaviour matches these constraints.
If we are playing at maximum polyphony and look to play another note, we will need to steal. How you approach stealing and sound generation will be determined by your needs; the remaining code examples in this article are included to provide context rather than direction. In my case I quickly fade out an existing voice on my synth and cue up the pending note, using a countdown in frames to determine when to retrigger.
class Voice {
  // etc...

  /**
   * @param {number} pitch
   * @param {number} time
   * @param {number} noteId
   * @param {number} frame
   */
  steal(pitch, time, noteId, frame) {
    const sampleDuration = 1 / sampleRate;
    this.gateTrigger = 0;
    this.gateBuffer.fill(0, frame);
    this.pendingPitch = pitch;
    this.pendingTime = time + this.stealTimeInFrames * sampleDuration;
    this.pendingNoteId = noteId;
    this.retriggerCountdown = this.stealTimeInFrames;
  }

  process() {
    // etc...
    for (let i = 0; i < RENDER_QUANTUM; i++) {
      if (this.retriggerCountdown === 0) {
        this.trigger(this.pendingPitch, this.pendingTime, this.pendingNoteId, i);
      }
      // etc...
      if (this.retriggerCountdown >= 0) {
        this.retriggerCountdown--;
      }
    }
  }
}
It's your turn to shine
The following interactive element is a polysynth. Try playing with different styles and explore the different algorithms. I encourage you to think about when different playing styles might call for each one. Changing the amplitude envelope shape might help guide you to one or the other.
The algorithms are as follows:
- Oldest - steal the oldest voice
- Quietest - steal the quietest voice
- Round Robin - steal the next index unless active
- Simple Round Robin - steal the next index regardless of active state
The first algorithm is to steal the oldest voice. We keep track of the time at which a voice is triggered, then for incoming notes we find the voice with the oldest note event and steal it, prioritising voices that are being released. While it is not implemented here, an interesting take on this can be found in JUCE's MPE instrument, which looks for the oldest note that isn't a pitch extremity (the lowest or highest note). This would be helpful when seeking to maintain melodic structures.
/**
 * @param {Voice[]} voices
 * @returns {Voice}
 */
function findOldestVoice(voices) {
  let bestVoice = null;
  // First pass: prefer voices that are releasing but not already being stolen
  for (let i = 0; i < voices.length; i++) {
    const voice = voices[i];
    if (voice.isInReleasePhase && !voice.isInStealPhase) {
      if (bestVoice === null) {
        bestVoice = voice;
        continue;
      }
      if (compareVoiceAge(bestVoice, voice) > 0) {
        bestVoice = voice;
      }
    }
  }
  // Second pass: otherwise just take the oldest voice overall
  if (!bestVoice) {
    for (let i = 0; i < voices.length; i++) {
      const voice = voices[i];
      if (bestVoice === null) {
        bestVoice = voice;
        continue;
      }
      if (compareVoiceAge(bestVoice, voice) > 0) {
        bestVoice = voice;
      }
    }
  }
  return bestVoice;
}

/**
 * @param {Voice} a
 * @param {Voice} b
 * @returns {number}
 */
function compareVoiceAge(a, b) {
  const aTime = a.pendingTime > -1 ? a.pendingTime : a.timeOfNoteStart;
  const bTime = b.pendingTime > -1 ? b.pendingTime : b.timeOfNoteStart;
  return aTime - bTime;
}
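As mentioned above, JUCE's MPE take skips the pitch extremities when looking for the oldest note. It isn't implemented in the demos, but a rough sketch, assuming each voice exposes its current pitch, could build on the functions above:

// A rough sketch of the "avoid pitch extremities" variant: prefer the oldest
// voice that is neither the lowest nor the highest currently sounding pitch.
// Assumes each voice exposes a `pitch` property; falls back to the plain
// oldest-voice search when every candidate is an extremity.
function findOldestNonExtremeVoice(voices) {
  let lowest = Infinity;
  let highest = -Infinity;
  for (const voice of voices) {
    lowest = Math.min(lowest, voice.pitch);
    highest = Math.max(highest, voice.pitch);
  }
  let bestVoice = null;
  for (const voice of voices) {
    if (voice.pitch === lowest || voice.pitch === highest) {
      continue;
    }
    if (bestVoice === null || compareVoiceAge(bestVoice, voice) > 0) {
      bestVoice = voice;
    }
  }
  return bestVoice ?? findOldestVoice(voices);
}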
Opting for an algorithm that prioritises stealing the quietest voice might offer some more subtle voice management. This is a fairly simple solution, calculating the energy (RMS) of each voice's audio samples for comparison. It has advantages over an "oldest" algorithm in situations where lengthy sustained notes should not be stolen in favour of more staccato playing, for example. We need to be careful that we do not steal from voices that are already in the process of being stolen, so we prioritise voices that are solely in the release phase, then those that aren't yet in the steal phase, and lastly those that are being stolen. The release phase is defined by whether the voice envelope's "gate trigger" is off, and the steal phase is determined by whether there's a positive retrigger countdown.
/**
 * Seeks the quietest voice with a preference
 * for voices that are in the release phase,
 * then for voices that are not yet in the steal phase,
 * and lowest priority is the voices that are being stolen.
 *
 * @param {Voice[]} voices
 * @returns {Voice}
 */
function findQuietestVoice(voices) {
  let quietestIndex = -1;
  let minRMS = 1;
  for (let i = 0; i < voices.length; i++) {
    if (voices[i].isInReleasePhase && !voices[i].isInStealPhase) {
      const rms = computeRMS(voices[i].samples);
      if (rms < minRMS) {
        quietestIndex = i;
        minRMS = rms;
      }
    }
  }
  if (quietestIndex > -1) {
    return voices[quietestIndex];
  }
  minRMS = 1;
  for (let i = 0; i < voices.length; i++) {
    if (!voices[i].isInStealPhase) {
      const rms = computeRMS(voices[i].samples);
      if (rms < minRMS) {
        quietestIndex = i;
        minRMS = rms;
      }
    }
  }
  if (quietestIndex > -1) {
    return voices[quietestIndex];
  }
  minRMS = 1;
  for (let i = 0; i < voices.length; i++) {
    if (voices[i].isInStealPhase) {
      const rms = computeRMS(voices[i].samples);
      if (rms < minRMS) {
        quietestIndex = i;
        minRMS = rms;
      }
    }
  }
  return voices[quietestIndex];
}
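The computeRMS helper isn't shown above; a minimal version over a voice's most recent block of output samples could be:

// Root mean square of the voice's most recent block of output samples
function computeRMS(samples) {
  let sumOfSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    sumOfSquares += samples[i] * samples[i];
  }
  return Math.sqrt(sumOfSquares / samples.length);
}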
Another approach is Round Robin, which can be simple or smarter. Round Robin increments over your voices and selects the next voice in turn. If the next voice in the round is still holding a note, we can either take the simple approach and steal it anyway, or look for the next free voice and jump to that one. It can be especially interesting to explore timbral variances between voices when working with this type of voice allocation.
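Neither variant appears in code elsewhere in this article, so here is a rough sketch of both. The isFree check stands in for however your voices report that nothing is currently sounding.

// A sketch of both Round Robin variants. `voice.isFree` is an assumption
// standing in for whatever "no note is sounding" check your voices expose.
class RoundRobinAllocator {
  constructor(voices) {
    this.voices = voices;
    this.nextIndex = 0;
  }

  // Simple Round Robin: take the next voice whether or not it is still active
  nextSimple() {
    const voice = this.voices[this.nextIndex];
    this.nextIndex = (this.nextIndex + 1) % this.voices.length;
    return voice;
  }

  // Round Robin: skip over active voices, stealing only when every voice is busy
  next() {
    for (let i = 0; i < this.voices.length; i++) {
      const index = (this.nextIndex + i) % this.voices.length;
      if (this.voices[index].isFree) {
        this.nextIndex = (index + 1) % this.voices.length;
        return this.voices[index];
      }
    }
    return this.nextSimple();
  }
}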
Alternatively we might split the voices based on pitch regions, which is common on synthesizers that can run two or more patches at a time - allowing performers to play a bass patch in one pitch region and then higher up the keyboard play something else on another patch. I haven’t implemented the split algorithm, but it should be reasonably straightforward after following this article.
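Purely as an illustration of the idea - this is not implemented in the demos - a split could be as simple as routing a note event to one of two voice pools based on a split point:

// A sketch of a keyboard split: notes below the split point go to one pool of
// voices, notes at or above it go to another. The splitPitch value and the
// two pools are illustrative assumptions.
function allocateWithSplit(noteEvent, lowerVoices, upperVoices, splitPitch = 60) {
  const voices = noteEvent.pitch < splitPitch ? lowerVoices : upperVoices;
  // Within each region, use whichever allocation or stealing strategy you like
  return findOldestVoice(voices);
}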
Lastly, as mentioned earlier, it is important that we know whether a voice is free or active. I handle this by determining whether its amp envelope is active. In the voice processing method I keep track of the envelope's active state before rendering its next sample. If the envelope has reached 0 from a non-zero value then it is no longer active. If the active state has changed I can run a callback that notifies the voice allocator that the voice is free.
class Voice {
  // etc...
  process() {
    for (let i = 0; i < RENDER_QUANTUM; i++) {
      // etc...
      const prevEnvActive = this.ampEnvelope.active;
      this.samples[i] *= this.ampEnvelope.getNextSample(
        this.gateBuffer[i],
        attackTimeInFrames,
        decayTimeInFrames,
        sustainLevel,
        this.isInStealPhase ? this.stealTimeInFrames : releaseTimeInFrames,
      );
      if (prevEnvActive && !this.ampEnvelope.active) {
        this.onFreeCallback();
      }
    }
  }
}
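On the allocator side, that callback could be wired up when the voices are created so the allocator always knows which voices are free. The voiceIsFree array, the markFree method and a Voice constructor that accepts a callback are assumptions for illustration:

// A sketch of how the allocator might consume onFreeCallback; these are
// fragments of an allocator, not the article's actual VoiceAllocator class.
constructor(polyphony) {
  this.voices = [];
  this.voiceIsFree = [];
  for (let i = 0; i < polyphony; i++) {
    // Each voice tells the allocator when its envelope has fully closed
    const voice = new Voice(() => this.markFree(i));
    this.voices.push(voice);
    this.voiceIsFree.push(true);
  }
}

markFree(index) {
  this.voiceIsFree[index] = true;
}

findFreeVoice() {
  // allocateNote would flip the flag back to false when it triggers a voice
  const index = this.voiceIsFree.indexOf(true);
  return index === -1 ? null : this.voices[index];
}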
Wrapping up
This technique of sharing memory between the main and audio threads allows for flexible approaches to assigning note events to voices, interrupting them when necessary. This gets us into the territory of emulating some of the classic behaviours found in other instruments. Remember, sometimes you don't need the complexity introduced by voice allocation - a well-implemented synth voice that is used just once and cleaned up afterwards can be more performant or easier to get going with. Finally, if you have some thoughts or feedback on this article, I'd love to receive an email from you.
Footnotes
1. This topic is far better covered by the excellent article A Tale of Two Clocks by Chris Wilson. If you are serious about making things with the Web Audio API, you'll want to refer to that.
2. An interesting exploration of the speed of postMessage() can be found in Is postMessage slow? by Surma.
3. Paul Adenot explains the limitations of postMessage() well in the introduction to his ringbuf.js library, a wait-free SPSC ring buffer for the web.