Tag Archive Digital Signal Processing

By Alexander Nguyen

HOA Encoder using OM-SoX

Abstract: OpenMusic and the OM-SoX library were used to create a way to encode mono audio files as a 3D Ambisonics signal up to the third order.

Responsible: Alexander Nguyen (WS 2023/24)

Main text:

Ambisonics

Ambisonics is a method for describing a two- or three-dimensional sound field (in the following I shall restrict myself to 3D Ambisonics). Ambisonics uses a basis of orthogonal functions and the spherical coordinate system to describe the sound field along a spherical surface resulting from a sound source. The simplest case is “Zeroth Order Ambisonics”, which resembles an ideal omnidirectional microphone: exactly one audio channel is used (also called the “W” channel, according to Furse-Malham naming). With “First Order Ambisonics” (FOA), three additional channels (three basis functions) are added: these are the three “directional” components (also called the X, Y and Z channels). If an ideal point sound source is placed at the end of one of these axes, then among the channels of that order only the channel belonging to this axis will contain the signal. In Ambisonics, the channels of lower orders are always included, i.e. the FOA signal consists of a total of four audio channels. In general, the number of channels for a 3D Ambisonics signal of the $n$th order can be calculated using the formula $(n+1)^2$ (i.e. for $n=0$: 1; for $n=1$: 4; for $n=2$: 9; for $n=3$: 16). Ambisonics signals with ‘higher’ order numbers (…, 2, 3, 4, …) are also referred to as Higher Order Ambisonics (HOA).

Channel Numbering

An HOA signal therefore consists of several components. There are several approaches to sorting the components in a multi-channel audio file. The sorting chosen for this project is “Ambisonic Channel Numbering” (ACN), in which each channel is assigned an integer number starting at zero (0). The first channel is therefore labeled “0”, the second channel “1”, the third channel “2” and so on. This numerical designation can be used to determine the ‘order’ ($l$) and the ‘degree’ ($m$) to which the component belongs: $l = \lfloor\sqrt{\mathrm{ACN}}\rfloor$ and $m = \mathrm{ACN} - l^2 - l$. See Table 1 for an overview of all components of 3rd Order Ambisonics (3OA), and a collation with an alternative labeling, “Furse-Malham” (FuMa).

Table 1. Evaluation of the formulas for l and m based on the ACN values. In addition, the alternative designation according to Furse-Malham (FuMa)
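
Since the table itself is not reproduced here, the mapping can be sketched in a few lines of Python. This is only an illustration of the usual ACN relations ($l = \lfloor\sqrt{\mathrm{ACN}}\rfloor$, $m = \mathrm{ACN} - l^2 - l$); the function names are made up for this example and are not part of OM-SoX.

```python
import math

def acn_to_lm(acn: int) -> tuple[int, int]:
    """Return (order l, degree m) for an ACN channel index (counting from 0)."""
    l = math.isqrt(acn)       # l = floor(sqrt(ACN))
    m = acn - l * (l + 1)     # m = ACN - l^2 - l, so that -l <= m <= l
    return l, m

def lm_to_acn(l: int, m: int) -> int:
    """Inverse mapping: ACN = l^2 + l + m."""
    return l * (l + 1) + m

def channel_count(order: int) -> int:
    """Number of channels of a 3D Ambisonics signal of the given order."""
    return (order + 1) ** 2

if __name__ == "__main__":
    # List all components of a third-order signal.
    for acn in range(channel_count(3)):
        print(acn, acn_to_lm(acn))
```

For first order this lists the pairs (0, 0), (1, -1), (1, 0) and (1, 1), i.e. the W, Y, Z and X components in ACN order.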

Normalization

The values $l$ (order) and $m$ (degree) are used to calculate a normalization factor for each audio channel. The normalization used here is called “Semi-Normalized 3D” (SN3D). See Table 2 for an overview of the normalization factors for all components of 3rd Order Ambisonics.

ACN together with SN3D normalization reflect a currently common convention called ambiX (Nachbar et al., 2011).

Table 2. SN3D factors for order $l$ and degree $m$, i.e. $N_{lm}^{(SN3D)}$. Note: If $m=0$, then $N_{l0}^{(SN3D)}=1$, and the table is symmetric with respect to $±m$.
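
As a rough sketch, the factors can also be computed directly. The snippet below assumes the common SN3D definition $N_{lm} = \sqrt{(2-\delta_{m,0})\,(l-|m|)!/(l+|m|)!}$ without the $1/(4\pi)$ term, which is consistent with the note that $N_{l0}=1$; the function name `sn3d` is chosen only for this example.

```python
import math

def sn3d(l: int, m: int) -> float:
    """SN3D normalization factor for order l and degree m (assumed definition, see text)."""
    am = abs(m)
    delta = 1 if m == 0 else 0   # Kronecker delta for m = 0
    return math.sqrt((2 - delta) * math.factorial(l - am) / math.factorial(l + am))

if __name__ == "__main__":
    # Print the factors up to third order, one row per order.
    for l in range(4):
        print(l, [round(sn3d(l, m), 4) for m in range(-l, l + 1)])
```

Because only $|m|$ enters the formula, the symmetry with respect to $±m$ follows directly.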

Encoding

To map a point sound source in Ambisonics, its audio signal is added to each of the audio channels, weighted using the normalization factor just described and an attenuation factor. The attenuation factor, which will be defined below, depends on the angle of incidence (described in the spherical coordinate system) and the ACN number (i.e. order and degree). An intuition (w.r.t. FOA): The attenuation is minimum (0 dB or multiplication factor 1, respectively) if the angle of incidence coincides with one of the axes in an ordinary 3-dimensional coordinate system ($x$, $y$ or $z$), maximum (-∞ dB or factor 0) if it is perpendicular to it.

In Ambisonics, the 3D coordinate system is usually defined as follows: The “front” (relative to the listener’s point of view) is defined as the positive x-axis. Being a right-handed system, this implies that the positive y-axis points “left” and the positive z-axis points “up”. For the transformation to the spherical coordinate system, 0° azimuth (θ) is defined to coincide with the positive x-axis, with the angle increasing counterclockwise in the xy-plane. 0° elevation (ϕ) lies in the xy-plane, and the elevation reaches its positive maximum on the positive z-axis (see Figure 1), with:

$0≤θ≤2π$

$-π/2≤ϕ≤π/2$

Figure 1. Visualization of the coordinate system and the reference points. Positive-x = front, positive-y = left, positive-z = top. θ (theta) left-turning (0° = front), ϕ “up-turning” (0° = in the xy plane).

In order to encode a time $t$-dependent signal $S(t)$ of a point sound source with angles of incidence $θ, ϕ$ in Ambisonics, the eventual Ambisonics signal component is calculated separately for each channel $B_l^m$. To do this, the signal is multiplied by the attenuation factor $Y_l^m$ :

$B_l^m (t) := S(t)\cdot Y_l^m (\theta, \phi)$

The formula for the attenuation factor is (see Nachbar et al., 2011):

\[
Y_l^m(\theta, \phi) :=N_l^{|m|} \cdot P_l^{|m|}(sin(\phi)) \cdot \begin{cases}
sin(|m|\theta) & \text{if } m < 0\\
cos(|m|\theta) & \text{if } m > 0\\
1 & \text{if } m=0
\end{cases}
\]

where $P_l^m$ is the “associated Legendre polynomial” of $l$-th order and $m$-th degree, and $P_l$ is the (unassociated) Legendre polynomial of $l$-th order (in the Rodrigues representation). These are defined as follows:

\[\begin{eqnarray*}
P_l(x) &:=& \frac{1}{2^l\cdot l!}\cdot \frac{d^l}{dx^l} \left[ (x^2-1)^l \right] \\
P_l^m(x) &:=& (1-x^2)^{\frac{m}{2}}\cdot \frac{d^m}{dx^m} \left[ P_l(x) \right] \\
&=& \frac{1}{2^l\cdot l!}\cdot (1-x^2)^\frac{m}{2}\cdot\frac{d^{l+m}}{dx^{l+m}} \left[ (x^2-1)^l \right]
\end{eqnarray*}\]

For example:

\[\begin{eqnarray*}
P_0^0(x) &=& (1-x^2)^\frac{0}{2}\cdot \frac{d^0}{dx^0} \left[ P_0(x) \right] \\
&=& 1\cdot P_0(x) = 1 \cdot 1 = 1
\end{eqnarray*}\]

\[\begin{align*}
P_2^1(x) &= (1-x^2)^\frac{1}{2}\cdot \frac{d^1}{dx^1} \left[ P_2(x) \right] \\
&= (1-x^2)^\frac{1}{2}\cdot \frac{d}{dx} \left[ \frac{1}{2^2\cdot 2!}\cdot \frac{d^2}{dx^2} [ (x^2-1)^2 ] \right] \\
&= (1-x^2)^\frac{1}{2}\cdot \frac{d}{dx} \left[ \frac{1}{8}\cdot \frac{d^2}{dx^2} [ x^4-2x^2+1 ] \right] \\
&= (1-x^2)^\frac{1}{2}\cdot \frac{d}{dx} \left[ \frac{1}{8}\cdot \frac{d}{dx} [ 4x^3-4x ] \right] \\
&= (1-x^2)^\frac{1}{2}\cdot \frac{d}{dx} \left[ \frac{1}{8}\cdot [ 12x^2-4 ] \right] \\
&= (1-x^2)^\frac{1}{2}\cdot \frac{d}{dx} \left[ \frac{3}{2} x^2 -\frac{1}{2} \right] \\
&= (1-x^2)^\frac{1}{2}\cdot \left[ \frac{3\cdot 2}{2} x \right] \\
&= (1-x^2)^\frac{1}{2}\cdot \frac{6}{2}x \\
&= 3x\cdot (1-x^2)^\frac{1}{2} \\
\end{align*}\]

 

Let $x≡sin(ϕ)$, then we obtain the elevation-dependent part of one of the spherical harmonics (see Table 3 for further examples):

\[\begin{align*}
P_2^1(sin(\phi)) &= 3\cdot sin(\phi)\cdot \sqrt{1-sin^2(\phi)} \\
&= 3\cdot sin(\phi)\cdot \sqrt{cos^2(\phi)} \\
&= 3\cdot sin(\phi)\cdot cos(\phi) \\
&= \frac{3\cdot sin(2\phi)}{2} \\
\end{align*}\]

 

The formulas for FOA are thus:

\[\begin{align*}
\text{ACN 0 / W:}\qquad &B_0^0(t) =S(t)\cdot Y_0^0(\theta, \phi)= S(t) \\
\text{ACN 1 / Y:}\qquad &B_1^{-1}(t) =S(t)\cdot Y_1^{-1}(\theta, \phi)= S(t)\cdot cos(\phi) \cdot sin(\theta) \\
\text{ACN 2 / Z:}\qquad &B_1^0(t) =S(t)\cdot Y_1^0(\theta, \phi)= S(t)\cdot sin(\phi) \\
\text{ACN 3 / X:}\qquad &B_1^1(t) =S(t)\cdot Y_1^1(\theta, \phi)= S(t) \cdot cos(\phi) \cdot cos(\theta) \\
\end{align*}\]

Table 3. Ambisonics formulas up to third order (ACN counting, SN3D normalization, $0≤θ≤2π$ azimuth (0° = forward, counterclockwise), $-π/2≤ϕ≤π/2$ elevation (0° = on the xy plane, upward rotation))
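
The encoding formulas above can be condensed into a short, self-contained Python sketch. It is not the OpenMusic/OM-SoX patch itself, only an illustration of the same math under the stated conventions (ACN ordering, SN3D without the $1/(4\pi)$ term, associated Legendre functions as defined above, i.e. without the Condon-Shortley phase). All function names are hypothetical.

```python
import math

def legendre_coeffs(l):
    """Coefficients (lowest power first) of the Legendre polynomial P_l."""
    prev, cur = [1.0], [0.0, 1.0]            # P_0 = 1, P_1 = x
    if l == 0:
        return prev
    for n in range(1, l):
        # Bonnet recurrence: (n+1) P_{n+1} = (2n+1) x P_n - n P_{n-1}
        nxt = [0.0] + [(2 * n + 1) * c for c in cur]
        for i, c in enumerate(prev):
            nxt[i] -= n * c
        prev, cur = cur, [c / (n + 1) for c in nxt]
    return cur

def assoc_legendre(l, m, x):
    """P_l^m(x) = (1 - x^2)^(m/2) * d^m/dx^m P_l(x), for 0 <= m <= l."""
    coeffs = legendre_coeffs(l)
    for _ in range(m):                        # differentiate m times
        coeffs = [i * c for i, c in enumerate(coeffs)][1:]
    value = sum(c * x ** i for i, c in enumerate(coeffs))
    return max(0.0, 1.0 - x * x) ** (m / 2) * value

def sn3d(l, m):
    """SN3D factor (assumed definition without the 1/(4*pi) term)."""
    am = abs(m)
    delta = 1 if m == 0 else 0
    return math.sqrt((2 - delta) * math.factorial(l - am) / math.factorial(l + am))

def encode_gains(azimuth, elevation, order=3):
    """Gains Y_l^m(theta, phi) in ACN order; angles in radians."""
    gains = []
    for l in range(order + 1):
        for m in range(-l, l + 1):            # iterating l, then m, yields ACN order
            am = abs(m)
            if m < 0:
                trig = math.sin(am * azimuth)
            elif m > 0:
                trig = math.cos(am * azimuth)
            else:
                trig = 1.0
            gains.append(sn3d(l, m) * assoc_legendre(l, am, math.sin(elevation)) * trig)
    return gains

def encode(signal, azimuth, elevation, order=3):
    """Return one list of samples per Ambisonics channel: B_l^m(t) = S(t) * Y_l^m."""
    return [[g * s for s in signal] for g in encode_gains(azimuth, elevation, order)]

if __name__ == "__main__":
    # A source straight ahead (azimuth 0, elevation 0), first order only.
    print([round(g, 3) for g in encode_gains(0.0, 0.0, order=1)])
```

For a source straight ahead only the W and X gains are non-zero, which matches the intuition given in the Encoding section; a source at 90° azimuth moves the energy from X to Y.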

By Laura Peter

Whitney Music Box with OMChroma/OMPrisma in OpenMusic

The Whitney Music Box is a sonified and/or visual representation of a series of interrelated sound elements. From a musical point of view, these elements can be related chromatically or harmonically, for example. In the visual representation, each of these elements is represented by a circle or dot (see Figure 1). These dots circle around a common center point at a speed that depends on their assigned frequency. The higher the frequency, the smaller the radius of the orbit and the higher the orbital speed. Each sound element represents a multiple of a fixed fundamental frequency in a harmonic series. As soon as an element has completed a revolution around the center point, the sound is triggered with the frequency it represents. Due to the mathematical relationship between the individual elements, there are moments during the performance of the Whitney Music Box in which certain elements are triggered simultaneously and phases in which the elements can be perceived consecutively. At the beginning and at the end, all elements are triggered simultaneously.

Figure 1: Whitney Music Box – visual representation

In this project, OMChroma is used to synthesize the individual sound elements (see Figure 2). The synthesis classes of OMChroma inherit from OpenMusic’s class-array object. The columns in the array describe the individual components within the synthesis. The rows represent parameters that can be assigned locally to the individual components or globally to the entire process. For the Whitney Music Box, elements are needed that implement the individual pitch gradations and the temporal offset of the individual pitch gradations. An OMChroma matrix is regarded as an event. Such an event represents a pitch and the sound repetitions within the global duration of the Whitney Music Box. The global duration is defined at the beginning and also describes the round trip time of the lowest frequency or the previously defined start frequency. Each matrix represents a frequency that is a multiple of the start frequency. The round trip time of a sound element is calculated using the formula

duration(global) / n

where n is the index of the individual sound element or matrix. The higher the index, the higher the frequency and the shorter the round trip time. The repetitions of the sound elements are defined by the parameter e-dels. Each component of a matrix is given a different entry delay. These entry delays are spaced at regular intervals of duration(global) / n.
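
The timing logic can be illustrated outside of OpenMusic with a short Python sketch. It assumes that element n sounds at the start of each of its n revolutions (so every element also sounds at time 0) and that its frequency is n times the start frequency; the function and parameter names are hypothetical and do not correspond to OMChroma slots.

```python
def whitney_events(global_duration: float, start_freq: float, n_elements: int):
    """Return {n: (frequency, entry_delays)} for the elements n = 1..n_elements."""
    events = {}
    for n in range(1, n_elements + 1):
        period = global_duration / n          # round trip time of element n
        # Assumption: one trigger at the start of each revolution.
        entry_delays = [k * period for k in range(n)]
        events[n] = (n * start_freq, entry_delays)
    return events

if __name__ == "__main__":
    for n, (freq, delays) in whitney_events(60.0, 110.0, 4).items():
        print(n, freq, delays)
```

With a global duration of 60 seconds, element 3 would be triggered at 0, 20 and 40 seconds, and all elements coincide again when the global duration has elapsed.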

Figure 2: Application of OMChroma

Without spatialization, the Whitney Music Box with OMChroma sounds like this:


Figure 3 shows how the collected matrices or sound events are spatialized with the OMPrisma library. This was based on the visual representation of the Whitney Music Box. Sound elements with a low frequency are further away from the center and sound elements with a high frequency circle closer to the center. With OMPrisma, this representation is to be implemented in spatial sound. This means that sounds with a low frequency should sound further away and sounds with a high frequency should sound closer to the listener. In the OpenMusic patch, elements with an even index were also positioned further to the front and further to the right and, similarly, elements with an odd index were positioned further to the left and back in order to distribute the sounds evenly in the room. The OMPrisma classes also offer presets for the attenuation function, air-absorption function and time-of-flight function. These were used to create an even greater sense of spatiality in addition to the positioning in the room.

Figure 3: Application of OMPrisma

In stereo, for example, the Whitney Music Box sounds like this:


Figure 4 shows how the collected OMChroma and OMPrisma matrices are merged using the chroma-prisma function. The list of all collected matrices is returned via an om-loop and rendered as a sound using the synthesize function (see Figure 5).

Figure 4: chroma-prisma

Figure 5: loop and synthesize

The OpenMusic patch and sound samples can be downloaded from the following link: https://github.com/lauraptrcodes/Whitney-music-box

By Andres Kaufmes

Transient Processor

SKAS symbolic sound processing and analysis/synthesis

Prof. Dr. Marlon Schumacher

Intermediate project by Andres Kaufmes

HfM Karlsruhe – IMWI (Institute for Music Informatics and Musicology)

Winter semester 2022/23

_____________

For this interim project, I worked on the implementation of a transient processor in OpenMusic with the help of the OM-SoX library.
A transient processor (also known as a transient designer or transient shaper) can be used to influence the attack/release behavior of the transients of an audio signal.

The first hardware transient designer was the SPL TD4, introduced by SPL in 1998 as a 19″ rack device; an updated version is still available today.

Transient Designer from SPL. (c) SPL

Transient Designers are particularly suitable for processing percussive sounds or speech. First, the transients must be isolated from the desired audio signal; this can be done using a compressor, for example. With a short attack time, the compressor “ducks” the transients; subtracting the compressed signal from the original then leaves the transients. The audio signal can then be processed with further effects in the course of the signal chain.

Transient processor patch. FX chain of the two signal paths (left “Transient”, right “Residual”).

At the top of the patch you can see the audio file to be processed, from which, as just described, the transients are isolated using a compressor and the resulting signal is subtracted from the original. Now two signal paths are created: The isolated transients are processed in the left-hand “chain”, the residual signal in the right-hand one. After both signal paths have been processed with audio effects, they are mixed together, whereby the mixing ratio (dry/wet) of both signal paths can be adjusted as desired. At the end of the signal processing there is a global reverb effect.
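
The isolation step can be sketched in plain Python. This is a simplified stand-in for the compressor used in the patch, not the OM-SoX implementation: a one-pole envelope follower with a fast attack drives a static compression curve, and the difference between the original and the compressed signal is taken as the transient path. All parameter names and default values are assumptions made for illustration.

```python
import math

def split_transients(signal, sample_rate, attack_ms=1.0, release_ms=50.0,
                     threshold=0.1, ratio=8.0):
    """Split a mono signal (list of floats) into (transients, residual)."""
    att = math.exp(-1.0 / (sample_rate * attack_ms / 1000.0))
    rel = math.exp(-1.0 / (sample_rate * release_ms / 1000.0))
    env = 0.0
    transients, residual = [], []
    for x in signal:
        level = abs(x)
        coeff = att if level > env else rel
        env = coeff * env + (1.0 - coeff) * level   # one-pole envelope follower
        if env > threshold:
            # Static curve: the level above the threshold is reduced by `ratio`.
            gain = (threshold / env) ** (1.0 - 1.0 / ratio)
        else:
            gain = 1.0
        compressed = gain * x                       # "ducked" signal = residual path
        residual.append(compressed)
        transients.append(x - compressed)           # original minus compressed
    return transients, residual

def mix(transients, residual, wet=0.5):
    """Recombine both paths; `wet` weights the transient path."""
    return [wet * t + (1.0 - wet) * r for t, r in zip(transients, residual)]
```

By construction the two paths sum back to the original signal, so any audible change comes from the effects applied to each path before mixing.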

“Scope” view of the two signal paths. Sketches of the possible signal path and processing.

Sound examples:

Isolated signal:

Residual signal:

By admin

BAD GUY: An acousmatic study

Abstract:

Inspired by the “Infinite Bad Guy” project, and all the very different versions of how some people have fueled their imaginations on that song, I thought maybe I could also experiment with creating a very loose, instrumental cover version of Billie Eilish’s “Bad Guy”.

Supervisor: Prof. Dr. Marlon Schumacher

A study by: Kaspars Jaudzems

Winter semester 2021/22
University of Music, Karlsruhe

To the study:

Originally, I wanted to work with 2 audio files, perform an FFT analysis on the original and “replace” its sound content with content from the second file, based only on the fundamental frequency. However, after doing some tests with a few files, I came to the conclusion that this kind of technique is not as accurate as I would like it to be. So I decided to use a MIDI file as a starting point instead.

Both the first and second versions of my piece only used 4 samples. The MIDI file has 2 channels, so 2 files were randomly selected for each note of each channel. The sample was then sped up or slowed down to match the correct pitch interval and stretched in time to match the note length.
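
The pitch and duration matching can be expressed with two small formulas. The sketch below assumes equal temperament and that the pitch of the source sample is known as a MIDI note number; the function names are hypothetical and not OM-SoX functions.

```python
def playback_ratio(sample_pitch: float, target_pitch: float) -> float:
    """Speed factor that transposes a sample from sample_pitch to target_pitch (MIDI numbers)."""
    return 2.0 ** ((target_pitch - sample_pitch) / 12.0)

def stretch_factor(sample_duration: float, note_duration: float, ratio: float) -> float:
    """Time-stretch factor still needed after the speed change to reach the note length."""
    duration_after_speed_change = sample_duration / ratio
    return note_duration / duration_after_speed_change

if __name__ == "__main__":
    ratio = playback_ratio(60, 67)           # transpose up a fifth
    print(ratio, stretch_factor(2.0, 0.5, ratio))
```

Speeding a sample up by an octave (ratio 2) halves its duration, so the stretch factor compensates for whatever length remains relative to the MIDI note.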

The second version of my piece added some additional stereo effects by pre-generating 20 random pannings for each file. With randomly applied comb filters and amplitude variations, a bit more reverb and human feel were created.

Acousmatic study version 1

Acousmatic study version 2

The third version was a much bigger change. Here the notes of both channels are first divided into 4 groups according to pitch. Each group covers approximately one octave in the MIDI file.

Then the first group (lowest notes) is mapped to 5 different kick samples, the second to 6 snares, the third to percussive sounds such as agogo, conga, clap and cowbell and the fourth group to cymbals and hats, using about 20 samples in total. A similar filter and effect chain is used here for stereo enhancement, with the difference that each channel is finely tuned. The 4 resulting audio files are then assigned to the 4 left audio channels, with the lower frequency channels sorted to the center and the higher frequency channels sorted to the sides. The same audio files are used for the other 4 channels, but additional delays are applied to add movement to the multi-channel experience.

Acousmatic study version 3

The 8-channel file was downmixed to 2 channels in 2 versions, one with the OM-SoX downmix function and the other with a Binauralix setup with 8 speakers.

Acousmatic study version 3 – Binauralix render

Extension of the acousmatic study – 3D 5th-order Ambisonics

The idea with this extension was to create a 36-channel creative experience of the same piece, so the starting point was version 3, which only has 8 channels.

Starting point version 3

I wanted to do something simple, but also use the 3D speaker configuration in a creative way to further emphasize the energy and movement that the piece itself had already gained. Of course, the idea of using a signal as a source for modulating 3D movement or energy came to mind. But I had no idea how…

Plugin “ambix_encoder_i8_o5 (8 -> 36 chan)”

While researching the Ambix Ambisonic Plugin (VST) Suite, I came across the plugin “ambix_encoder_i8_o5 (8 -> 36 chan)”. This seemed to fit perfectly due to the matching number of input and output channels. In Ambisonics, space/motion is translated from 2 parameters: Azimuth and Elevation. Energy, on the other hand, can be translated into many parameters, but I found that it is best expressed with the Source Width parameter because it uses the 3D speaker configuration to actually “just” increase or decrease the energy.

Knowing which parameters to modulate, I started experimenting with using different tracks as the source. To be honest, I was very happy that the plugin not only provided very interesting sound results, but also visual feedback in real time. When using both, I focused on having good visual feedback on what was going on in the audio piece as a whole.

Visual feedback – video

Channel 2 as modulation source for azimuth

This helped me to select channel 2 for Azimuth, channel 3 for Source Width and channel 4 for Elevation. If we trace these channels back to the original input MIDI file, we can see that channel 2 is assigned notes in the range of 110 to 220 Hz, channel 3 notes in the range of 220 to 440 Hz and channel 4 notes in the range of 440 to 20000 Hz. In my opinion, this type of separation worked very well, also because the sub-bass frequencies (e.g. kick) were not modulated and were not needed for this. This meant that the main rhythm of the piece could remain as a separate element without affecting the space or the energy modulations, and I think that somehow held the piece together.
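
The assignment of notes to channels by frequency band can be sketched as follows. The band edges 110, 220 and 440 Hz are taken from the text; that channel 1 receives everything below 110 Hz is an assumption, as are the function names.

```python
BAND_EDGES_HZ = [110.0, 220.0, 440.0]    # lower edges of channels 2, 3 and 4

def midi_to_hz(note: float) -> float:
    """Equal-tempered frequency of a MIDI note number (A4 = 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)

def channel_for_note(note: float) -> int:
    """Return the channel (1..4) whose frequency band contains the note."""
    freq = midi_to_hz(note)
    return 1 + sum(freq >= edge for edge in BAND_EDGES_HZ)

if __name__ == "__main__":
    for note in (36, 45, 57, 69, 84):
        print(note, round(midi_to_hz(note), 1), channel_for_note(note))
```

Each band spans one octave except the open-ended lowest and highest ones, which matches the earlier description that each group covers approximately one octave.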

Acousmatic study version 4 – 36 channels, 3D 5th-order Ambisonics – file was too big to upload

Acousmatic study version 4 – Binaural render

By admin

Spectral Select: An acousmatic 3D audio study

 

 

Abstract:
Spectral Select explores the spectral content of one sample and the amplitude curve of a second sample and unites them in a new musical context. The meditative character of the output created by iteration is both contrasted and structured by louder amplitude peaks.
In a revised version, Spectral Select was spatialized in Ambisonics HOA-5 format.

Supervisor: Prof. Dr. Marlon Schumacher

A study by: Anselm Weber
Winter semester 2021/22
University of Music, Karlsruhe

 


About the study:
In which forms of expression is the connection between frequency and amplitude expressed? Are both areas intrinsically connected and if so, what could be approaches to redesigning this order?
Such questions have occupied me for some time. That’s why the attempt to redesign them is the core topic of Spectral Select.
I was inspired by AudioSculpt from IRCAM, which we got to know in our course: “Symbolic Sound Processing and Analysis/Synthesis” together with Prof. Dr. Marlon Schumacher and Brandon L. Snyder and which we partially rebuilt.
Spectral Select works on a similar principle, but instead of having a user work out interesting areas within a spectrum of a sample, it was decided to use a second audio sample. This additional sample (from now on referred to as “amplitude sound” in the course of this article) determines how the first sample (from now on referred to as “spectral sound”) is to be processed by OM-SoX.
To achieve this, two loops are used:
First, individual amplitude peaks are extracted from the amplitude sound in the first loop, the “peakloop”. This analysis is then used in the heart of the patch, the “choosefreq” loop, to select interesting sub-ranges from the spectral sound. Loud peaks filter narrower bands from higher frequency ranges and form a contrast to weaker peaks, which filter somewhat broader bands from lower frequency ranges.

peakloop – Analysis
choosefreq Loop – Audio Processing
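
The relationship between peak amplitude and filter band can be illustrated with a small mapping function. The concrete ranges (200 Hz to 8 kHz, 2 octaves down to a quarter octave) are invented for this sketch and are not the values used in the patch; only the direction of the mapping follows the description above.

```python
def band_for_peak(peak, f_low=200.0, f_high=8000.0, bw_wide=2.0, bw_narrow=0.25):
    """Map a normalized peak amplitude (0..1) to (center frequency in Hz, bandwidth in octaves)."""
    p = min(max(peak, 0.0), 1.0)                  # clamp to the valid range
    center = f_low * (f_high / f_low) ** p        # louder peaks -> higher center frequency
    width = bw_wide + (bw_narrow - bw_wide) * p   # louder peaks -> narrower band
    return center, width

if __name__ == "__main__":
    for peak in (0.1, 0.5, 0.9):
        center, width = band_for_peak(peak)
        print(peak, round(center, 1), round(width, 2))
```

The exponential interpolation of the center frequency is a design choice of this sketch, made so that equal amplitude steps correspond to equal musical intervals.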


How small the respective iteration steps are affects both the length and the resolution of the overall output. Depending on the sample material, a large number of short grains or fewer but longer subsections can be created. However, both of these parameters can be selected freely and independently of each other.
In the enclosed piece, for example, a relatively high resolution (i.e. an increased number of iteration steps) was chosen in combination with a longer duration of the cut sample. This creates a rather meditative character, whereby no two sections will be 100% identical, as there are constantly minimal changes among the peak amplitudes of the amplitude sound.
The still relatively raw result of this algorithm is the first version of my acousmatic study.

Acousmatic study version 1


The subsequent revision step was primarily aimed at working out the differences between the individual iteration steps more precisely. For this purpose, a series of effects were used, which in turn behave differently depending on the peak amplitude of the amplitude sound. To make this possible, the series of effects was integrated directly into the peak loop.

Acousmatic study version 2


In the third and final revision step, the audio was spatialized to 8 channels.
The individual channels sound into each other and change their position in a clockwise direction. This means that the basic character of the piece remains the same, but it is now also possible to follow the “working through” of the choosefreq loop spatially. To maintain this spatiality, the output was then converted to binaural stereo for the upload using Binauralix.

Acousmatic study version 3 – Binaural

 

Spectral Select – Ambisonics

In the course of a further revision, Spectral Select was re-spatialized using the spatialization class “Hoa-Trajectory” from OM-Prisma and converted to the Ambisonics format.
To ensure that this step fits in well conceptually and sonically with the previous edits, the amplitude sound should also play an important role in the spatial position.
The possibilities for spatializing sounds with the help of OpenMusic and OM-Prisma are numerous. In the end, it was decided to work with Hoa-Trajectory. Here, the sound source is not bound to a fixed position in space and can be described with a trajectory that is scaled to the total duration of the audio input.

Spatialization with HOA.TRAJECTORY

 

 

The trajectory is created depending on the amplitude analysis in the previous step.
A simple, three-dimensional circular movement, which spirals downwards, is perturbed with a more complex, two-dimensional curve. The Y-values of the more complex curve correspond to the analyzed amplitude values of the amplitude sound.
Depending on the scaling of the amplitude curve, this results in more or less pronounced deviations in the circular motion. Higher amplitude values therefore ensure more extensive movements in space.
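
One possible way to build such a trajectory is sketched below. How the patch actually combines the two curves is not documented here, so the radial scaling used in this example is an assumption, as are all names and default values.

```python
import math

def spiral_trajectory(amplitudes, turns=3.0, radius=1.0,
                      z_top=1.0, z_bottom=-1.0, depth=0.5):
    """Return a list of (x, y, z) points: a downward spiral perturbed by amplitude values."""
    n = len(amplitudes)
    points = []
    for i, amp in enumerate(amplitudes):
        t = i / (n - 1) if n > 1 else 0.0          # normalized time 0..1
        angle = 2.0 * math.pi * turns * t          # circular movement
        r = radius * (1.0 + depth * amp)           # assumption: amplitude widens the circle
        z = z_top + (z_bottom - z_top) * t         # spiral downwards
        points.append((r * math.cos(angle), r * math.sin(angle), z))
    return points

if __name__ == "__main__":
    for point in spiral_trajectory([0.0, 0.2, 0.9, 0.4, 0.1], turns=1.0):
        print(tuple(round(c, 3) for c in point))
```

Increasing `depth` corresponds to scaling the amplitude curve up, which produces larger excursions from the circular path.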

 

 


It is interesting to note that OM-Prisma also takes Doppler effects into account. As a result, it is also audible that at higher amplitude values, more extreme distances to the listening position are covered in the same time. This step therefore has a direct influence on the timbre of the entire piece.
Depending on the scaling of the trajectory, fast movements can be strongly overemphasized, but artifacts can also occur (if the distance is too great).
To get a better impression, 2 different runs of the algorithm with different distances to the listener follow.

 

Version with extreme Doppler effects which can result in artifacts – binaural stereo

Version with closer distance and more moderate Doppler effects – binaural stereo

 

In contrast to the previous sound examples, the spectral sound and amplitude sound have been replaced in this example. This is a longer sound file for analyzing the amplitudes and a less distorted drone as a spectral sound.
The idea behind this project is to experiment with different sound files anyway.
Therefore, the old algorithm has been reworked to offer more flexibility with different sound files:

Revised scalable version of the old algorithm for selecting from the spectral sound

In addition, a randomized selection is now made from the spectral sound on the time axis. As a result, any shaping context should come from the magnitude of the amplitude sound and any timbre should be extracted from the spectral sound.

 

By Brandon Snyder

Integrating ML with DSP Frameworks for Transcription and Synthesis in CAC

 

A link to download the applications can be found at the end of this blogpost. This project was also presented as a paper at the 2022 International Conference on Technologies for Music Notation and Representation (TENOR 2022).

Modularity in Sound Synthesis Tools

This blogpost walks through the structure and usage of two applications of machine learning (ML) methods for sound notation and synthesis. The first application is a modular sample replacement engine that uses a supervised classification algorithm to segment and transcribe a drum beat, and then reconstruct that same drum beat with different samples. The second application is a texture synthesis engine that uses an unsupervised clustering algorithm to analyze and sort large numbers of audio files.

The applications were developed in OpenMusic using the OM-SoX modular synthesis/analysis framework, so that they could be as modular as possible: they can be customized, extended, and integrated into a user’s own OpenMusic workflow. We believe this modularity offers something new to the community of ML and sound synthesis/analysis tools currently available. The approach to sound synthesis and analysis used here involves reading and querying many separate audio files. Such an approach can be encompassed by the larger term of “corpus-based concatenative synthesis/analysis,” for which there are already several effective tools: the Caterpillar System, Audioguide, and OM-Pursuit. Additionally, OM-AI, ml.*, and zsa.descriptors are existing toolkits that integrate ML methods into Computer-Aided Composition (CAC) environments. While these tools are very precise, their internal workings are not immediately clear. By making our applications modular, we mean that they can be edited, extended and integrated into existing CAC programs. It also means that they can be opened up, examined, and reverse-engineered for a user’s own education.

One example of this is in figure 1, our audio analysis engine. Audio descriptors are implemented as subpatches in lambda mode, and can be selected as needed for the input audio. 

Figure 1: Interchangeable audio descriptors are set as patches in lambda mode. Here, a patch extracting 13 MFCCs is being used.

Another example is in figure 2, a customizable distance function in our texture synthesis application. This is the ML clustering algorithm that drives the application. Being a patch built from smaller OpenMusic objects, it is not only a tool for visualizing the algorithm at work, it also allows a user to edit it. For example, the n-dimension euclidean distance function could be substituted with another distance function, if needed.

Figure 2: A simple k-means clustering algorithm, built within an OpenMusic abstraction. The distance function takes the form of a subpatcher in lambda mode.
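
The idea of a swappable distance function can be illustrated with a minimal k-means sketch in Python. It is not a transcription of the OpenMusic patch; the function names, the random initialization and the fixed iteration count are assumptions made for this example.

```python
import math
import random

def euclidean(a, b):
    """n-dimensional Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, distance=euclidean, iterations=20, seed=0):
    """Cluster feature vectors; `distance` can be replaced by any two-argument function."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]   # start from k random points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: distance(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

if __name__ == "__main__":
    features = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
    print(kmeans(features, 2))
```

Passing a different function as `distance`, for example a Manhattan distance, changes the assignment step only; the mean-based centroid update is strictly justified for the Euclidean case.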

With the modularity of the project introduced, we will move on to the two specific applications on the next page.

By Brandon Snyder

Machine Learning in Music: One Application with Voice and Live Electronics

Up until this past August, my impressions of what machine learning could be used for were mostly functional, detached from any aesthetic reference point within my artistic practice. Cars recognizing stop signs, radiologists detecting malignant lesions in tissue; these are the first things to come to my mind. There is definitely an art behind programming these tasks. However, it wasn’t clear to me yet how machine learning could relate to my world of contemporary concert music. Therefore, when I participated in Artemi-Maria Gioti’s machine learning workshop at impuls Academy 2021, my primary interest was to make personal artistic connections to this body of research, and to see in what ways I could interrogate my underlying aesthetic assumptions in artistic applications of machine learning. The purpose of this text is to share with you the connections I made. I will walk through the composition process of my piece Shepherd for voice and live electronics, using it as a frame to touch upon basic machine learning theories and methods, as well as outline how I aesthetically reacted to them. I will not go deep into the technicalities of machine learning; there are far more qualified people than I for that specific task. However, I will say that the technical content of this blogpost is heavily inspired by Artemi-Maria Gioti, who led this workshop and whose research covers the creative applications of machine learning in a much deeper way. A further dive into the already rich world of machine learning and music can be begun at her website.

A fundamental definition of machine learning can be framed around the idea of improvement through experience. As computer scientist Tom M. Mitchell describes it, “The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience.” (Mitchell, T. (1998). Machine Learning. McGraw-Hill.). This premise of ‘improvement’ already confronted me with non-trivial questions. For example, if machine learning is utilized to create an improvising duo partner, what exactly does the computer understand as ‘good’ or ‘bad’ improvisation, as it gains experience? Before even beginning to build a robust machine learning algorithm, answering this preliminary question is an entire undertaking in and of itself. In my piece Shepherd, the electronics were trained to recognize the sound of my voice, specifically whether I was whispering, talking, yelling, or being silent. However, my goal was not to create a perfectly accurate recognition algorithm. Rather, I wanted the effectiveness and the ineffectiveness of the algorithm to both play equal roles in achieving the piece’s concept. Shepherd is a performance piece that takes after a metaphor from Jesus in the Christian Bible: sheep recognize a shepherd by the sound of their voice (John 10). The electronics react to my voice in a way that is simultaneously certain and uncertain. It is a reflection, through performance, on the nuances of spiritual faith, the way uncertainty necessarily partakes in the formation of conviction and belief. Here the electronics were not a functional instrument (something designed to be controlled by my voice), but rather were functioning more as a second player (a duo partner, reacting to my voice with a level of unpredictability).

Concretely, in the program, the electronics return two separate answers for every input they are given (see figure 1). They give a decisive classification answer (“this is ‘silence’”, “this is ‘whispering’”, “this is ‘talking’”, etc.), and they give an indecisive, erratic answer via regression (‘silence: 0.833; whispering: 0.126; talking: 0.201; yelling: 0.044’). Importantly for this concept of conceiving belief through doubt, the classification answer is derived from the regression answer. The decisive answer (classification) was generally stable in its changes over time, while the indecisive answer (regression) moved more quickly and erratically. Overall, this provided useful material for creating dynamic control of the actual digital sounds that the electronics produced. But before touching on the DSP, I want to outline how exactly these machine learning algorithms operate, how the electronics learn and evaluate the sound of my voice.

Figure 1: Max MSP and Wekinator (off-screen) analyze the MFCCs of an audio input to give two outputs on the nature of that input. The first output is from a regression algorithm, the second is from a classification algorithm.

In order for the electronics to evaluate my input voice, they first need a training set, a collection of data extracted from audio of my voice, which they can use to ‘learn’ my voice. An important technical point is that the machine learning algorithm never observes actual audio data. With training and testing data, the algorithm is always looking at numerical data (here called ‘descriptors’) extracted from the audio. This is one reason machine learning algorithms can work in realtime, even with audio. As I alluded to, my voice recognition program is underpinned by two machine learning concepts: classification and regression. A classification algorithm will return a discrete value from its input data. In my case, those values are ‘silence’, ‘whispering’, ‘talking’, and ‘yelling’. To make a training set then, I recorded audio of each of these classes (4 audio files in total), and extracted MFCCs (Mel-Frequency Cepstral Coefficients) from them. MFCCs are a representation of a sound’s spectral energy calibrated to the range of typical human auditory perception, and are already commonly used in speech recognition programs, music-information retrieval applications, and other applications based around timbre-recognition.
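For readers who want to see what “extracting MFCCs” involves, here is a minimal textbook sketch in plain Python: a windowed power spectrum, triangular filters spaced on the mel scale, a logarithm, and a DCT. It is not the Zsa.descriptors implementation; the frame length, the 26 filters, the Hann window, and all function names are assumptions chosen only for illustration.

```python
import cmath
import math

def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def power_spectrum(frame):
    """Naive DFT power spectrum of a Hann-windowed frame (bins 0..N/2)."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2)
    return spectrum

def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters whose edges are evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (num_filters + 1))
                for i in range(num_filters + 2)]
    bins = [hz * n_fft / sample_rate for hz in edges_hz]  # fractional bin positions
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            if lo < k <= mid:
                row.append((k - lo) / (mid - lo))   # rising slope
            elif mid < k < hi:
                row.append((hi - k) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def mfcc(frame, sample_rate, num_filters=26, num_coeffs=13):
    """Return num_coeffs MFCCs for one audio frame (assumed parameters)."""
    spec = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    # Small floor avoids log(0) on silent frames.
    log_energies = [math.log(sum(w * p for w, p in zip(row, spec)) + 1e-10)
                    for row in bank]
    # DCT-II decorrelates the log filterbank energies.
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / num_filters)
                for j, e in enumerate(log_energies))
            for c in range(num_coeffs)]
```

Calling `mfcc(frame, 16000)` on a 256-sample frame returns a list of 13 numbers; a list like this, not the audio itself, is what the learning algorithm sees.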

I used the Max MSP library Zsa.descriptors to calculate my MFCCs. I also experimented with other audio descriptors such as spectral centroid, spectral flatness, and amplitude peaks, as well as varying numbers of MFCCs. Eventually I discovered that my algorithm was most accurate when 13 MFCCs were the only descriptor, and that description data was taken only about five times a second. I realized that, on a micro-level timescale, my four classes had a lot of similarity. For example, the word ‘synthesizer’ carried lots of ‘s’ noise, which is virtually the same when whispered as when talked. Because of this, extracting data at an intentionally slower rate gave the algorithm a more general picture of each of my voice-classes, allowing these micro-moments of similarity to be smoothed out.
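One way to arrive at roughly five descriptor frames per second is to average consecutive analysis frames in blocks. The post does not say whether the data was averaged or simply sampled less often, so the averaging below is an assumption, as are the function name and the example rates.

```python
def average_blocks(frames, frames_per_block):
    """Average consecutive descriptor frames block by block.

    frames: list of equal-length lists (e.g. 13 MFCCs per frame).
    Leftover frames that do not fill a whole block are dropped.
    """
    averaged = []
    for start in range(0, len(frames) - frames_per_block + 1, frames_per_block):
        block = frames[start:start + frames_per_block]
        # zip(*block) groups the values of each coefficient across the block.
        averaged.append([sum(column) / len(block) for column in zip(*block)])
    return averaged
```

With a hypothetical analysis rate of 50 frames per second, `frames_per_block=10` would yield five averaged frames per second, so a brief ‘s’ sound contributes only a fraction of each value.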

The standard algorithm used for my voice recognition concept was classification. However, my classification algorithm was actually built using a second common machine learning algorithm: regression. As I mentioned before, I wanted to build into my electronics a level of ‘indecision’, something erratic that would contrast the stable nature of a standard classification algorithm. Rather than returning discrete values, a regression algorithm gives a new ‘predictive’ value, based on a function derived from the training set data. In the context of my piece, the regression algorithm does not return a specific voice-class. Rather, it gives four percentage values, each corresponding to how close or far my input is to each of the four voice-classes. Therefore, though I may be whispering, the algorithm does not say whether I am whispering or not. It merely tells me how close or far away I am from the ‘whispering’ data that it has been trained on.

I used a regression algorithm in Wekinator, a simple and powerful machine learning tool, to build my model (see figure 2). Input audio was analyzed in Max MSP, and the descriptor data was sent via OSC to Wekinator. Wekinator built the predictive regression model from this data and then sent output back to Max MSP to be used for DSP control. In Max, I made my own version of a classification algorithm based on this regression data.
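To make the OSC step less abstract, the sketch below builds a raw OSC message carrying a list of floats and sends it over UDP, using only the Python standard library. In the piece this job was done inside Max MSP, not Python. The port 6448 and the address `/wek/inputs` are what I understand to be Wekinator's default input settings; treat them as assumptions and check them against your own Wekinator project.

```python
import socket
import struct

def osc_pad(data: bytes) -> bytes:
    """Null-terminate and pad to a multiple of 4 bytes, as OSC strings require."""
    data += b"\x00"
    return data + b"\x00" * (-len(data) % 4)

def osc_float_message(address: str, values) -> bytes:
    """Encode an OSC message whose arguments are all 32-bit floats."""
    type_tags = "," + "f" * len(values)
    payload = b"".join(struct.pack(">f", v) for v in values)  # big-endian floats
    return osc_pad(address.encode("ascii")) + osc_pad(type_tags.encode("ascii")) + payload

def send_inputs(values, host="127.0.0.1", port=6448, address="/wek/inputs"):
    """Send one descriptor frame over UDP (host, port, address are assumed defaults)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(osc_float_message(address, values), (host, port))
```

A frame of 13 MFCCs would be sent with `send_inputs(coeffs)`, once per descriptor frame.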

Figure 2: Wekinator is evaluating MFCC data from Max MSP and returning 4 values from 0.0-1.0, indicating the input’s similarity to the four voice classes (silence, whispering, talking, yelling). The evaluation is a regression model trained on 752 data samples. 
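A simple way to derive a stable classification from erratic regression values is to pick the highest-scoring class, but only switch once a new class has won for several frames in a row. The post does not describe how the Max patch actually did this, so the hold rule, the class name `StableClassifier`, and the `hold_frames` value are all hypothetical.

```python
class StableClassifier:
    """Turn per-class regression scores into a slowly changing class label."""

    def __init__(self, hold_frames=3):
        self.hold_frames = hold_frames  # assumed value, not from the piece
        self.current = None             # label currently reported
        self.candidate = None           # challenger label being counted
        self.count = 0                  # consecutive wins of the challenger

    def update(self, scores):
        """scores: dict mapping class name to a value between 0.0 and 1.0."""
        winner = max(scores, key=scores.get)
        if self.current is None:
            self.current = winner       # first frame: accept immediately
        if winner == self.current:
            self.candidate, self.count = None, 0
            return self.current
        if winner == self.candidate:
            self.count += 1
        else:
            self.candidate, self.count = winner, 1
        if self.count >= self.hold_frames:
            self.current, self.candidate, self.count = winner, None, 0
        return self.current
```

Fed the example values from earlier in the text (`{"silence": 0.833, "whispering": 0.126, "talking": 0.201, "yelling": 0.044}`), `update` reports `"silence"`, and a single stray frame in which ‘talking’ scores highest does not change the label.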

All this algorithm-building once again returns me to my original concern. How can I make an aesthetic connection with these concepts? As I mentioned, this piece, Shepherd, is for my solo voice and live electronics. In the piece I stand alone on a stage, switching through different fictional personas (a speaker at a farming convention, a disgruntled restaurant chef, a compilation video of Danny Wolfers saying the word ‘synthesizer,’ and a preacher), and the electronics react to these different characters by switching through their own set of personas (sheep; a whispering, whimpering sous chef; a literal synthesizer; and a compilation of Christian music). Both the electronics and I change our personas in reaction to each other. I exercise some level of control over the electronics, but not total control. As I said earlier, the performance of the piece is a reflection on the intertwinement of conviction and doubt, decision and indecision, within spiritual faith. Within this concept, the idea of a machine ‘improving’ towards ‘perfection’ is no longer an effective framework. In the concept, and consequently in the music I attempted to make, stable belief (classification) and unstable indecision (regression) were equal contributors towards the musical relationship between myself and the electronics.

Based on how my voice was classified, the electronics operated one of four DSP modules. The individual parameters of a given module were controlled by the erratic output data of the regression algorithm (see figure 3). For example, when my voice was classified as silent, a granular synthesizer would create textures of sheep-like noises. Within that synthesizer, the percentage levels of whispering and talking ‘detected’ within the silence would manipulate the pitch shifting in the synthesizer (see figure 4). In this way, the music was not just four distinct sound modules. The regression algorithm allowed for each module to bend and flex in certain directions, as my voice subtly suggested hints of one voice class from within another. For example, in one section I alternate rapidly between the persona of a farmer talking at a farming convention, and a chef whispering in frustration at his sous chef. The electronics moved accordingly between my whispering and talking DSP modules. But also, as my whispering became more frustrated and exasperated, the electronics would output higher levels of talking in its regression algorithm. Thus, the electronics react to the internal drama of my theatrical performances.

Figure 3: The classification data would trigger one of four DSP modules. A given DSP module would receive the regression values for all four vocal classes. These four values would control the parameters of the DSP module.

Figure 4: Parameter window for the granular synth triggered when the electronics classify my voice as ‘silent’. The amount of whispering and talking detected in the silence would control the pitch of the grain. The amount of silence detected in the silence controlled the grain’s duration. Because this value is relatively static during actual silence from my voice, a level of artificial duration manipulation (seen at the top of the window) was programmed.
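The routing in figures 3 and 4 can be sketched as a lookup from class label to module, with the regression values scaled onto that module's parameters. Only the ‘silence’ module is described in enough detail to sketch; the parameter names, the numeric ranges, and the choice that whispering raises the pitch while talking lowers it are invented here for illustration, and the other three modules are placeholders.

```python
def scale(value, out_min, out_max):
    """Map a 0.0-1.0 regression value linearly onto a parameter range (clamped)."""
    value = min(1.0, max(0.0, value))
    return out_min + value * (out_max - out_min)

def silence_module(scores):
    """Granular synth settings for the 'silence' class (hypothetical ranges)."""
    return {
        "module": "granular_sheep",
        # Whispering pushes the pitch up, talking pulls it down (assumed mapping).
        "pitch_shift_semitones": scale(scores["whispering"], 0.0, 12.0)
                                 - scale(scores["talking"], 0.0, 12.0),
        # The amount of silence detected sets the grain duration.
        "grain_duration_ms": scale(scores["silence"], 20.0, 200.0),
    }

def placeholder_module(name):
    """Stand-in for modules the text does not describe in detail."""
    def module(scores):
        return {"module": name, "controls": dict(scores)}
    return module

MODULES = {
    "silence": silence_module,
    "whispering": placeholder_module("whispering_module"),
    "talking": placeholder_module("talking_module"),
    "yelling": placeholder_module("yelling_module"),
}

def route(label, scores):
    """Pick the DSP module from the class label and feed it all four scores."""
    return MODULES[label](scores)
```

The point of the design is that the class label only selects the module, while every module still receives all four regression values, so hints of one voice class can colour the sound of another.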

I want to return to Tom Mitchell’s thesis that machine learning involves computers improving automatically through experience. If Shepherd is a voice recognition tool, then it is inefficient at improvement. However, Shepherd was not conceived as a tool. Rather, creating Shepherd was more so a cultivation of a relationship between my voice and the electronics. The electronics were more of a duo partner, and less of an instrument. To put this more concretely, I was never looking for ‘accurate’ results from the machine. As I programmed, I was searching for results that illustrated Shepherd’s artistic concept of belief intertwined with doubt. In this way, ‘improving’ the piece did not mean improving the algorithm’s accuracy. It meant ‘improving’ the relationship between myself and the electronics. One positive from this approach is that the compositional process was never separated from the programming of the electronics. Both developed in tandem. Composing this piece brought me to the realization that creative applications of machine learning can be applied at every level of its discourse. If you are interested in hearing a recording of this performance, a bootleg recording of the premiere can be found here.

https://youtu.be/LFQnpp5Uzbg

References:

  • Artemi-Maria Gioti – composer and artistic researcher working in the field of artificial intelligence. 
  • Wekinator – free, open-source software created by Rebecca Fiebrink that uses machine learning to create musical instruments, game interfaces, computer vision, and other tools in sound and animation.
  • Zsa.descriptors – library for real-time sound descriptor analysis for Max MSP developed by Mikhail Malt and Emmanuel Jourdan.
  • NYU Music and Audio Research Laboratory – Free online resources and datasets.
  • AIMC – conference on artificial intelligence and musical creativity.
  • OM-Pursuit – Dictionary-based sound modelling for computer-aided composition in Open Music.

 

ByChristophe Weis

Supercollider Renderer