A reinforcement learning agent for live sound improvisation
This paper presents a novel approach for developing customized autonomous agents for musical improvisation. The model we propose employs reinforcement learning to adaptively control the parameters of a sound synthesizer in response to live audio from a musician. The agent is trained on a corpus of audio files that exemplify the musician’s instrument and stylistic repertoire. During training, the agent listens and learns to imitate the incoming sound according to a set of perceptual descriptors by continuously adjusting the parameters of the synthesizer it controls. To achieve this objective, the agent learns specific strategies that are characteristic of its autonomous behavior in a live interaction. In the paper, we provide a detailed description of the design and implementation of the model for the agent and discuss its application in three selected scenarios.
Musical Agent, Reinforcement Learning, Improvisation
Free music improvisation performances develop as a balance between convergent and divergent elements. While playing, musicians continuously coordinate through listening, creating musical structures that evolve in time. Cognitive models of improvisation have outlined how musicians coordinate using interaction strategies which are often based on alternating segments of tension and release. When describing their contributions at specific moments of improvisation, musicians often describe how what they played adopts, augments, or contrasts with the ideas of the other musicians [1][2][3]. This is especially evident in duos, in which musicians explicitly report playing “with”, “against” or “without” the other musician [4]. Such collaborative decision-making happens through continuous positive and negative feedback, based on mutual reinforcement or divergence [5].
Recent human-machine co-creative systems for musical improvisation have employed improvisational strategies identified in psychological studies to implement complex interactive behavior [6][7]. However, the factors that determine successful coordination between a human and a machine in this creative setting have not yet been clearly identified.
This paper introduces a novel approach for developing human-agent systems for musical improvisation based on reinforcement learning. The agent can learn to control an arbitrary set of parameters of a sound synthesizer, continuously modifying parameters determining the timbre of the generated sound. The agent adjusts synthesis parameters in response to the live human musician it is listening to. In particular, the way in which the agent responds to the musician is determined by the selected reward function used during training. The reward function we use aims to maximize the similarity between musician- and agent-generated sound with respect to selected perceptually-related audio descriptors. In this way, the model learns to operate the synthesizer it is trained with, adaptively matching specific aspects of the musician’s sound as explicitly defined during training.
The agent can be trained to control any sound synthesizer using a corpus derived from musicians playing any instrument. Consequently, the sound produced by the agent may significantly differ from the musician's sound due to the different instruments being operated. We leverage the state and action concepts of reinforcement learning to provide a degree of autonomous agency to the trained system. The agent learns to change the parameters of the synthesizer gradually, in small steps. Therefore, to match the input sound, the agent must pass through several states, each generating a different timbre. This approach introduces a variable delay between musician and agent which, when significant, can be perceived as the agent responding to, rather than reacting to, the improvising musician, and as a degree of autonomy of the agent itself. Open-source code developed for this work is available at the project’s repository.
A live algorithm is defined as “an autonomous machine that interacts with musicians in an improvised setting” [8]. It is a type of interactive music system [9] which is able to analyze incoming audio from a live musician and respond through an algorithmic process by controlling a sound synthesis model in real-time. The structure of a live algorithm can be decomposed into analysis, patterning/reasoning and synthesis components.
The behavior of a live algorithm is determined by how it listens to and interprets the input of a human musician; it is therefore defined by the choice of machine listening technique, the model structure and its environment. In general, the inputs and outputs of these systems can be in the form of audio or symbols such as control parameters or music notation [10]. Recent approaches employ multi-granular analysis, considering different time scales [11], and/or multi-modal analysis, combining audio, symbolic notation, and embodied data [12][13], to describe incoming audio starting from the computation of acoustic or perceptual features. An alternative method is based on the latent encoding of raw audio, which uses learned representations instead of features extracted explicitly with digital signal processing techniques [14][15].
Several patterning/reasoning algorithms for co-improvisation systems have been proposed and studied, with the most popular being rule-based models [16], evolutionary computation [17][18] and Markov models [19][20][21]. In recent years, corpus-based approaches with machine learning have been developed using sequence models [22][13], convolutional neural networks [23] and factor oracle algorithms [11] [24][25]. These corpus-based improvisational systems are often trained using similarity metrics between the incoming sound and previous musical knowledge defined by a musical corpus or learned representations. In these systems, the incoming audio is analyzed and matched to segments of the corpus or to generated material according to an arbitrarily complex mapping algorithm.
The choice of patterning algorithm largely affects a system’s behavior in a live setting. Algorithmic behaviors can be analyzed according to the musician’s perceptions while playing with a system, and they are often described in terms of perceived creative agency [26]. In particular, successful computer improvisers are those whose behaviors are perceived as similar to those of human performers, demonstrating autonomy, novelty, participation and leadership [8]. These characteristics can be expressed through a balance of novelty and stylistic imitation [27]. Achieving a balance between these two aspects is mentioned as the objective of several sequence-based systems [28][29]. Characteristic behaviors of improvising machines have been identified by Young & Bown [30] as shadowing, mirroring, coupling or negotiation, and implemented by Thelle & Pasquier in their SpireMuse system [31]. Other objectives of co-improvisation systems are human-machine coadaptation [13] and contextual relevance [32].
Reinforcement learning is a machine learning technique in which an agent learns to solve a problem through repeated interaction with its environment [33]. In reinforcement learning, the machine learning model represents an agent that observes the state of its environment, influences it through actions, and receives a reward quantifying how well its behavior satisfies the given goal.
The reinforcement learning paradigm shares several modeling principles with the idea of live algorithms, since both involve an agent interacting with an external environment to satisfy a goal. However, unlike in game-playing applications of reinforcement learning, musically motivated goals can be hard to identify and to define explicitly in the quantifiable terms [35] that are necessary to reward the agent during the training phase.
A reinforcement learning model of musical tasks involves the development of a musically-motivated feedback function. Using a human in a training feedback loop is not feasible because the bandwidth at which training episodes can be completed would be extremely low. Murray-Rust et al. [36] propose three feedback criteria for reinforcement signals: satisfaction of internal goals, appreciation of fellow participants and memetic success. Similar criteria have been applied by Collins in the Improvagent system [35], in which reinforcement signals are based on the quality of musical prediction and the degree of influence in the interaction. Rewards explicitly based on music theory have been employed in the context of jazz improvisation [37] and context-based rewards have been used in combination with a factor oracle algorithm in the Omax system as a way to steer the attention towards preferred musical materials [38].
According to Drummond (2009) [39], interactive music systems can be categorized as score-driven or performance-driven. In score-driven systems, the time structure of the composition is embedded in the system, which follows the performer and triggers events defined in time. In performance-driven systems, on the other hand, the agency of the algorithm depends solely on the analysis of the material played by the performer and the nature of the algorithm’s response. In improvisation, musicians do not follow a predetermined score, therefore the performance-driven approach is the only viable solution. In this context, autonomous computer-based co-improvisers often employ machine listening techniques to understand the musician’s behavior and decide how to act musically. This approach reflects the nature of collective improvised music, in which each musician’s contribution develops in response to what is being played by the others.
Our interaction scenario focuses on a live musician improvising with a computer-based agent that controls a synthesizer. The agent's objective is to perform a matching task, striving to produce sounds as similar as possible to those of the live musician, within the bounds of feasibility. This challenge arises because the human and the agent may be playing instruments with significantly different timbres. Furthermore, the agent's ability to control the sound synthesizer is limited, as its sonic palette is typically restricted to a subset of synthesis parameters that it has been trained to manage. Consequently, the agent attempts to imitate the musician based on a set of perceptual features that are defined as parameters of the reward function. The choice of a matching objective is deliberate, as similarity-based functions are prevalent in driving musical agents. They are widely used in corpus-based methods for concatenative synthesis [40] and in algorithms designed for the dynamic exploration of synthesis parameter spaces [41].
Differently from agents based on supervised learning, in which the matching task generates an immediate mapping, responses generated with reinforcement learning are based on a strategy developed by the agent during training. In reinforcement learning, an agent can develop different strategies to optimize the same objective with the same training process. This provides opportunities for the agent to develop characteristic and occasionally unexpected behaviors, which can be described as lying between shadowing and mirroring, according to Young & Bown’s terminology [30]. Indeed, even when a sound match is theoretically possible, it is achieved by navigating through a sequence of states that grant the agent a degree of autonomy in the unfolding of its sonic response. A data-flow diagram of the live system is shown in Figure 1.
As illustrated in Figure 1, the state observed by the agent combines the perceptual descriptors computed from the musician’s audio and from the synthesizer’s output with the current values of the synthesis parameters.
The approach we take to train the agent that controls the parameters of the synthesizer, along with the implementation details provided later, is both generic and modular. Consequently, it permits experimenting with training agents on arbitrary sound descriptors, synthesis engines, reward functions, and musicians playing various instruments with diverse stylistic repertoires. The specific instrument and stylistic repertoire that the agent learns to match are represented in the sound corpus provided by the musician, which is used exclusively during the training phase.
In reinforcement learning, the problem is modeled as an agent interacting with its environment, with goals defined by a reward function that it seeks to optimize. In our scenario, these elements are modeled as detailed in the respective subsections that follow.
The environment is composed of a live musician and a synthesizer. The live musician generates the target sound, which the agent tries to match. The sound is analyzed according to a series of sound descriptors, which extract specific acoustic-perceptual characteristics from the sound’s timbre. In our current implementation, the following descriptors can be utilized:
Loudness: loudness (dB), true-peak (dB), loudness (linear RMS) [43];
Mel-Frequency Cepstral Coefficients (MFCC): 7 coefficients [44];
Spectral shape: centroid (Hz), spread (Hz), skewness (ratio), kurtosis (ratio), rolloff (Hz), flatness (dB), crest (dB) [45];
Chroma: coefficients describing the spectral energy in 12 bins corresponding to the semitones of the musical octave in western tonal harmony [46].
The specific selection of descriptors depends on the synthesizer the agent controls and on the musician’s sound; in theory, any descriptor could be employed depending on the musical or sonic objective. These descriptors are extracted separately from the live audio of the human musician and from the sound generated by the agent, resulting in two feature vectors that are compared by the reward function.
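For illustration only, the following sketch shows the kind of per-window feature vector involved in this comparison, using the librosa library as a stand-in for the real-time analysis toolchain described later in the paper; the function name, settings, and the choice of library are assumptions, and only a subset of the descriptors listed above is computed.

```python
import librosa
import numpy as np

def extract_descriptors(y, sr=44100, hop_length=128):
    """Illustrative descriptor extraction (loudness proxy, 7 MFCCs,
    spectral centroid, 12 chroma bins), one feature vector per window.
    The actual system computes these descriptors in real time in Pure Data."""
    rms = librosa.feature.rms(y=y, hop_length=hop_length)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=7, hop_length=hop_length)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr, hop_length=hop_length)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=hop_length)
    return np.vstack([rms, mfcc, centroid, chroma]).T  # shape: (n_windows, 21)
```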
The agent is an artificial neural network which takes the state observation as input and outputs an action, that is, a vector of discrete increments to be applied to the synthesis parameters it controls.
The synthesis parameters are updated at each step by adding the new action vector, multiplied by the step size, to the parameter values of the previous step.
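A minimal sketch of this update rule is shown below; the function name, the step size value, and the clipping to the normalized [0, 1] range are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def update_parameters(params, action, step_size=0.01):
    """Apply a discrete action to the synthesis parameters.

    `action` holds one increment per parameter (e.g. -1, 0 or +1); the
    result is kept inside the normalized [0, 1] parameter range.
    """
    return np.clip(params + step_size * np.asarray(action), 0.0, 1.0)
```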
The reinforcement learning agent we employed utilizes a Deep Q-Network (DQN) [47]. A DQN uses a neural network to estimate the expected cumulative reward (Q-value) of each possible action in a given state. The specific architecture we chose is a feedforward, fully connected network with two hidden layers, each consisting of 64 neurons with a Rectified Linear Unit (ReLU) activation function.
The reward function provides the agent with feedback on how its actions have affected the environment [34]. A positive reward motivates the agent to repeat a similar action in the same environmental conditions. The reward value is computed as a function of the current state, comparing the sound generated by the agent with the target sound produced by the musician.
In the current design, the vector of synthesis parameters set by the agent determines the features of the sound it generates, which are compared with the features of the target sound.
The function we use to compare the two feature vectors is a weighted root mean square error (WRMSE): the smaller the WRMSE between the agent’s and the musician’s descriptors, the larger the reward.
We implemented a conditional reward because preliminary experiments indicated that augmenting the reward function with a substantial bonus for a WRMSE near zero significantly accelerates the learning process. Therefore, in our reward function, we add a positive coefficient to the reward whenever the WRMSE falls below a small threshold.
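One possible formulation of such a conditional reward is sketched below; the weighting, the bonus coefficient and the threshold are illustrative assumptions, not the values used in our experiments.

```python
import numpy as np

def reward(musician_feats, agent_feats, weights, bonus=1.0, eps=1e-2):
    """Conditional reward based on a weighted RMSE (WRMSE) between the
    normalized feature vectors of the musician and the agent. The negated
    error is the base reward; a positive bonus is added when the error is
    close to zero."""
    err = np.sqrt(np.average((musician_feats - agent_feats) ** 2, weights=weights))
    return -err + (bonus if err < eps else 0.0)
```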
In reinforcement learning, the training process involves the agent repeatedly interacting with the environment. Each cycle in which the agent reaches an ending state is called an episode. During each step of an episode, the agent observes the current state of the environment, selects an action according to its current policy, and receives a reward together with the updated state.
The model is trained on a curated sound corpus consisting of audio files recorded from the musician, which vary in size and duration. The purpose of this corpus is to tailor the agent's training to the timbre of the musical instrument and to a repertoire that aligns with what is anticipated in the musician’s live performance.
In the live environment, changing the parameters alters the sound, which is analyzed to update the environment’s state. However, generating and analyzing sound at each training step is computationally expensive, and integrating a typical sound synthesizer (often standalone hardware, software, or coded in a real-time audio programming language) within a reinforcement learning framework is a complex task. To overcome these issues, we replace the synthesizer with a lookup table that maps synthesis parameters to the associated sound features. This table allows us to update the environment's state efficiently when an action enacts new parameters, by retrieving the relevant features from the table. We construct this table using the actual synthesizer that the agent will control, calculating the perceptual acoustic features for every possible combination of parameters controlled by the agent. To manage computational and memory constraints, we vary the parameters in discrete steps of a selected size.
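The sketch below outlines how such a lookup table could be built and queried; the function names, the quantization step and the `synth_render`/`analyze` callables (standing in for the synthesizer rendering and the feature analysis) are assumptions for illustration.

```python
import itertools
import numpy as np

def build_lookup_table(synth_render, analyze, n_params, step=0.1):
    """For every quantized parameter combination, render a short sound with
    the actual synthesizer and store the averaged descriptors."""
    grid = np.linspace(0.0, 1.0, int(round(1 / step)) + 1)
    table = {}
    for combo in itertools.product(grid, repeat=n_params):
        audio = synth_render(combo)                    # e.g. a short rendering
        table[tuple(np.round(combo, 6))] = analyze(audio)
    return table

def lookup_features(table, params, step=0.1):
    """Retrieve the features for a parameter vector by snapping it to the grid."""
    key = tuple(np.round(np.round(np.asarray(params) / step) * step, 6))
    return table[key]
```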
In the training process, each episode is based on a randomly chosen audio file from the corpus, with each step representing the features computed from a single window of that file. An episode corresponds to a sound file in the corpus and terminates at the end of the file. The agent's task is to identify the combination of parameters that best replicates the selected subset of musician-defined features. Given that the musician’s sound can potentially change continuously, the agent must develop a policy capable of adapting to the rate of change of the features computed from the training corpus files. To promote generalization and robustness in the learned policy, episodes whose target features are extracted directly from the sound corpus alternate with episodes whose target features are retrieved from a randomly selected entry in the lookup table, in which case the duration of the episode is fixed to 6000 steps.
The agent is trained using the Experience Replay algorithm [48]; at each training iteration, the agent interacts with the environment over several episodes and stores the resulting observations, actions, and rewards at each step in a replay buffer. The agent then samples batches of data from the replay buffer to update the prediction network.
The system is composed of three main components: the agent, the training environment and the live environment. The agent is implemented in Python using the Stable Baselines 3 library [49]. The offline training environment is also implemented in Python, utilizing the Gymnasium library [50]; features from the sound corpus are extracted using Pure Data (PD) patches executed offline. The synthesizer controlled by the agent is implemented as a PD abstraction, which is used both to compute the lookup table for training and in the live environment, where the agent, running in a Python server, communicates with PD via Open Sound Control (OSC).
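To make this setup concrete, the following sketch shows how such a training environment and agent could be assembled with Gymnasium and Stable Baselines 3. All names, dimensions and hyperparameters are illustrative assumptions, the helper functions come from the sketches above, and the enumeration of per-parameter increments into a single Discrete space is one possible way to satisfy the DQN implementation's action-space requirement; the project's actual code may differ.

```python
import itertools
import gymnasium as gym
import numpy as np
import torch as th
from gymnasium import spaces
from stable_baselines3 import DQN

class MatchingEnv(gym.Env):
    """Simplified sketch of the offline training environment.

    The observation concatenates the target features, the agent's current
    features (read from the lookup table) and the synthesis parameters.
    Reuses update_parameters(), reward() and lookup_features() from the
    sketches above."""

    def __init__(self, table, targets, n_params, n_feats, step=0.1):
        self.table, self.targets, self.step = table, targets, step
        self.n_params = n_params
        self.actions = list(itertools.product([-1, 0, 1], repeat=n_params))
        self.action_space = spaces.Discrete(len(self.actions))
        self.observation_space = spaces.Box(
            -np.inf, np.inf, shape=(2 * n_feats + n_params,), dtype=np.float32)

    def _obs(self):
        t = min(self.t, len(self.target) - 1)
        feats = lookup_features(self.table, self.params, self.step)
        return np.concatenate([self.target[t], feats, self.params]).astype(np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.target = self.targets[self.np_random.integers(len(self.targets))]
        self.t = 0
        self.params = np.full(self.n_params, 0.5)  # start mid-range
        return self._obs(), {}

    def step(self, action):
        self.params = update_parameters(self.params, self.actions[action], self.step)
        feats = lookup_features(self.table, self.params, self.step)
        r = reward(self.target[self.t], feats, weights=np.ones(len(feats)))
        self.t += 1
        return self._obs(), r, self.t >= len(self.target), False, {}

# Usage (assuming `table` and `targets` were prepared as sketched earlier):
# env = MatchingEnv(table, targets, n_params=3, n_feats=7)
# model = DQN("MlpPolicy", env,
#             policy_kwargs=dict(net_arch=[64, 64], activation_fn=th.nn.ReLU))
# model.learn(total_timesteps=200_000)
```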
The perceptual acoustic features described in the Environment section are computed using the PD implementation of the FluCoMa toolkit [51]. We employ FluCoMa in both the offline training and the live environments to ensure consistency. Features are computed with fixed settings, using a window size of 256 samples, a hop size of 128 samples, and a sampling rate of 44.1 kHz.
The lookup table for the synthesizer is generated by recording a 2-second sound file for each possible combination of the synth parameters. The features associated with each combination of synthesis parameters are averages calculated over the 2-second recording, which is analyzed using the same descriptors as the corpus analysis. Before training the agent, the features computed from the corpus and those in the lookup table are normalized to have zero mean and unit standard deviation. The time required to compute the lookup table grows exponentially with the number of variable synthesis parameters and increases as their quantization step size is reduced.
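As a back-of-the-envelope example, assuming three controlled parameters quantized with a step size of 0.1 and 2-second renderings (illustrative values, not necessarily those used in the experiments):

```python
values_per_param = int(1 / 0.1) + 1        # 11 quantized values in [0, 1]
combinations = values_per_param ** 3       # 1331 parameter combinations
minutes_of_audio = combinations * 2 / 60   # roughly 44 minutes to render and analyze
print(combinations, round(minutes_of_audio, 1))
```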
In the live environment, the trained agent operates within a Python script that responds to queries from a Pure Data (PD) patch. Messages and data are exchanged using the OSC protocol. The PD patch analyzes live audio input from the musician and also hosts the synthesizer that is controlled by the trained agent. The sound signals from the musician and the synthesizer are analyzed separately in PD, using the selected perceptual audio descriptors. These descriptors, along with the current parameters of the synthesizer, are sent to the agent via OSC, which then prompts the agent to respond.
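A minimal version of the Python side of this loop could look like the following, using the python-osc package. The OSC addresses, ports, message layout, constants and file name are assumptions standing in for the project's actual protocol, and the sketch reuses update_parameters() from above.

```python
import itertools
import numpy as np
from pythonosc.dispatcher import Dispatcher
from pythonosc.osc_server import BlockingOSCUDPServer
from pythonosc.udp_client import SimpleUDPClient
from stable_baselines3 import DQN

N_PARAMS, STEP = 3, 0.1                               # assumed values
ACTIONS = list(itertools.product([-1, 0, 1], repeat=N_PARAMS))
model = DQN.load("trained_agent.zip")                 # hypothetical file name
client = SimpleUDPClient("127.0.0.1", 9001)           # back to the PD patch

def on_state(address, *values):
    """Handle one analysis frame: [musician feats | agent feats | synth params]."""
    obs = np.asarray(values, dtype=np.float32)
    action, _ = model.predict(obs, deterministic=True)
    # Apply the selected increments and send the new parameters to the synth.
    params = update_parameters(obs[-N_PARAMS:], ACTIONS[np.asarray(action).item()], STEP)
    client.send_message("/synth/params", [float(p) for p in params])

dispatcher = Dispatcher()
dispatcher.map("/state", on_state)                    # PD sends frames to this address
BlockingOSCUDPServer(("127.0.0.1", 9000), dispatcher).serve_forever()
```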
The theoretical minimum latency of the agent's response to the musician (disregarding the negligible computation time) depends on the window size used for computing the FluCoMa descriptors. In the experiments detailed in this paper, the window size is 256 samples, which corresponds to a latency of 5.805 ms at a 44.1 kHz sampling rate. The rate at which new feature vectors are sent to the agent, and a new set of parameters is communicated to the synthesizer, is determined by the hop size between consecutive analysis windows. In the current implementation we have fixed the hop size to 128 samples, which, at a sampling rate of 44.1 kHz, corresponds to a parameter update rate of approximately 344.5 Hz, or one update every 2.902 ms. In the current system design, the hop size used in the live environment must match the one utilized for analyzing the corpus and generating the lookup table for the synthesizer.
The agent was trained and evaluated on three selected scenarios, each employing a different synthesis model varying in parameter complexity, sound association, and the number of continuous control parameters.
Tone generator. A simple sine wave oscillator with two control parameters: frequency and amplitude. The frequency ranges from 20 to 2500 Hz, while the amplitude ranges between 0 and 1. We utilize this synthesizer as a baseline to evaluate both the environment and the learning algorithm.
Frequency modulation synthesizer. Two sinusoidal oscillators, a carrier and a modulator, with three control parameters: carrier frequency, harmonicity, and modulation index. The carrier frequency ranges from 1 to 400 Hz, while both harmonicity and the modulation index range from 0 to 5. The harmonicity parameter determines the frequency relationship between the partials and the carrier, affecting the spectral distribution, whereas the modulation index influences the intensity, or spectral energy, of the partials relative to the carrier.
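For illustration, a minimal two-oscillator FM sketch consistent with this description is given below; it follows the textbook FM formulation and is not the project's Pure Data implementation.

```python
import numpy as np

def fm_tone(carrier_freq, harmonicity, mod_index, dur=1.0, sr=44100):
    """Two-operator FM: the modulator frequency is harmonicity * carrier_freq,
    and the modulation index scales its contribution to the carrier's phase."""
    t = np.arange(int(dur * sr)) / sr
    modulator = np.sin(2 * np.pi * harmonicity * carrier_freq * t)
    return np.sin(2 * np.pi * carrier_freq * t + mod_index * modulator)
```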
Granular synthesizer. A standard implementation of a granular synthesizer with three control parameters: density, grain starting position and grain duration [52]. The synthesizer processes a guitar sample that is randomly selected from GuitarSet [53]. The density parameter, ranging from 1 to 40 Hz, determines the rate at which new grains are played. The starting position specifies the sample within the file where the grain begins, and the grain duration sets the duration of each grain in milliseconds, with a range from 25 to 1000 ms.
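As a rough illustration of the three parameters just described, the sketch below schedules Hann-windowed grains from a source array; the normalized start position, the function name and the omission of grain pitch, overlap handling and envelope details are simplifying assumptions relative to the actual PD patch.

```python
import numpy as np

def granulate(source, density_hz, start_pos, grain_dur_ms, out_dur=2.0, sr=44100):
    """Trigger a new grain every 1/density_hz seconds; each grain reads
    grain_dur_ms milliseconds of the source starting at the normalized
    position start_pos (0-1) and is shaped by a Hann window."""
    out = np.zeros(int(out_dur * sr))
    grain_len = int(grain_dur_ms / 1000 * sr)
    start = int(start_pos * max(len(source) - grain_len, 1))
    grain = source[start:start + grain_len]
    grain = grain * np.hanning(len(grain))
    hop = max(int(sr / density_hz), 1)
    for onset in range(0, max(len(out) - len(grain), 0), hop):
        out[onset:onset + len(grain)] += grain
    return out
```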
The ranges of all the synthesis parameters are normalized between 0 and 1 to facilitate training.
In the three experiments, the agent controlling each synthesizer was trained using a corpus of sounds from the same instrument it aims to control. Consequently, both the musician and the agent play the same instrument. This approach was chosen because the agent's goal is to match the timbre of the musician’s instrument precisely. By using the same instrument, the agent has the potential to achieve theoretically perfect timbre matching, thus allowing us to thoroughly test the effectiveness of the reward function. The reward and WRMSE are used to measure the model's performance across 10 episodes in a test environment, each lasting 6000 steps; the mean and standard deviation of both metrics were computed over this span and are reported for each experiment.
The tone generator features controllable parameters for frequency and amplitude. Three distinct agents were trained to match specific attributes: loudness (measured by linear RMS), spectral centroid (measured in Hz), and a combination of both. The normalized unit range of both parameters was quantized with a fixed step size.
Table 1. Mean and standard deviation of reward and WRMSE over 10 test episodes for three agents trained to control parameters of the tone generator using distinct sets of acoustic features.

| Matched features | | Reward | WRMSE |
|---|---|---|---|
| Loudness (linear RMS) | Mean | 0.643 | 0.043 |
| | Std | 0.067 | 0.005 |
| Spectral centroid (Hz) | Mean | 0.474 | 0.057 |
| | Std | 0.186 | 0.006 |
| Loudness (linear RMS) and spectral centroid (Hz) | Mean | 0.520 | 0.109 |
| | Std | 0.104 | 0.029 |
Results show that matching loudness is easier for the model than matching the spectral centroid, and that pursuing a combined objective yields better results than focusing on the spectral centroid alone.
The examples in Video 1 show that the agent can adjust the synthesizer's parameters to match the loudness and spectral centroid of the sound source it listens to. The third example demonstrates that the agent, though trained to match a basic tone generator, responds appropriately to a musician's performance even though the timbre has significant differences from the one used for training.
The frequency modulation synthesizer features controllable carrier frequency, harmonicity and modulation index. The harmonicity and modulation index primarily influence the timbre by determining the spacing and amplitude of the partials, respectively. Since the model lacks direct control over amplitude, it produces a continuous, sustained sound. Five distinct agents were trained to match specific attributes: loudness (measured by linear RMS) and spectral centroid (measured in Hz), MFCC, spectral shape, chroma, and a combination of MFCC, spectral shape and chroma. The normalized unit ranges of carrier frequency, harmonicity, and modulation index were quantized with a fixed step size.
Table 2. Mean and standard deviation of reward and WRMSE over 10 test episodes for five agents trained to control parameters of the frequency modulation synthesizer using distinct sets of acoustic features.

| Matched features | | Reward | WRMSE |
|---|---|---|---|
| Loudness (linear RMS) and spectral centroid (Hz) | Mean | 0.163 | 0.265 |
| | Std | 0.070 | 0.035 |
| MFCC | Mean | -0.527 | 0.591 |
| | Std | 0.198 | 0.027 |
| Spectral shape | Mean | -0.169 | 0.375 |
| | Std | 0.136 | 0.029 |
| Chroma | Mean | -0.357 | 0.575 |
| | Std | 0.183 | 0.124 |
| MFCC, Spectral shape and Chroma | Mean | 0.010 | 0.411 |
| | Std | 0.036 | 0.020 |
In this case, the relationship between the synthesis parameters and sound descriptors is highly complex and nonlinear. Consequently, we observe better performance when training the agent with a combination of perceptual audio descriptors. Spectral shape descriptors more effectively capture changes in timbre compared to MFCCs and chroma. However, using a combination of MFCCs, chroma, and spectral shape descriptors in the matching objective leads to improved overall timbre matching.
In Video 2, it is evident that although the agent successfully imitates the incoming sound's timbre, the controlled parameters continue to oscillate even after a close match is achieved. This oscillation stems from the larger step size used in generating the synthesizer's lookup table compared to the step size in the previous example. Consequently, the agent often struggles to find a match sufficiently close to the incoming sound, resulting in an apparent oscillation between two close matches. Another factor contributing to the parameter instability, even when the musician's sound remains constant, is the complex non-linear relationship between the synthesis parameters and timbre, which confounds the agent's decision-making regarding the necessary parameter adjustments to match the incoming audio.
The granular synthesizer features controllable density, grain start and grain duration. The timbre of this synthesizer can only be uniquely defined after choosing the sound source for granulation, making the lookup table sound source-dependent. The grain start parameter, with its extensive range, plays a central role in determining the timbre. Five distinct agents were trained to match the specific attributes previously utilized for the frequency modulation synthesizer. The normalized unit ranges of the three parameters were quantized with a fixed step size.
Table 3. Mean and standard deviation of reward and WRMSE over 10 test episodes for five agents trained to control parameters of the granular synthesizer using distinct sets of acoustic features.

| Matched features | | Reward | WRMSE |
|---|---|---|---|
| Loudness (linear RMS) and spectral centroid (Hz) | Mean | -10.508 | 0.661 |
| | Std | 7.160 | 0.096 |
| MFCC | Mean | -0.588 | 0.403 |
| | Std | 0.214 | 0.092 |
| Spectral shape | Mean | -3.486 | 0.653 |
| | Std | 1.621 | 0.184 |
| Chroma | Mean | -9.301 | 0.365 |
| | Std | 10.754 | 0.186 |
| MFCC, Spectral shape and Chroma | Mean | -0.755 | 0.474 |
| | Std | 0.275 | 0.062 |
The best performance observed in this case is obtained by training on MFCCs, followed by the matching objective based on all the descriptors. Unlike the previous case, in which combining descriptors led to better convergence, for this synthesizer adding spectral shape and chroma to the MFCCs makes convergence more difficult. Therefore, as expected, the optimal selection of perceptual audio descriptors for training the agent is specific to each synthesizer.
It should be noted that an agent that performs well according to the metrics in the tables does not necessarily behave as desired during live interaction and may not align with the musician's preferences. An important consideration when choosing matching descriptors is the specific musical objective. For instance, using loudness as a matching descriptor may result in an envelope-follower behavior, as seen in Video 1, which might be unsuitable for some musical contexts. Similarly, selecting chroma descriptors can cause the agent to follow only the musician's pitch classes, ignoring signal amplitude, especially in this configuration where the agent is limited to controlling synthesis parameters rather than also manipulating pitch or velocity. Moreover, it is not certain that training with the chosen synthesizer and descriptors will lead to convergence; hence, convergence should be considered alongside musical objectives as a crucial criterion in this process. For this reason, quantitative testing can assist in deciding which agent should be used among a set of agents trained on different descriptors.
The first example in Video 3 demonstrates that the agent can identify parameter combinations that produce a sound similar to that of the musician, even when using different synthesis parameters. The agent quickly adapts to changes in the incoming sound, employing various strategies based on its initial parameter setting. This distinctive behavior, achieved through the reinforcement learning approach, enables the agent to discover and explore unexpected solutions. The third example illustrates how this agent can create a textural element that matches and adapts to an evolving soundscape, generated using an additive synthesizer and gradually modulated by low-frequency oscillators.
Experiments demonstrate that the agents trained using the reinforcement learning approach we propose are able to match perceptual properties of an incoming sound in a dynamic scenario representative of a musical performance. The cases presented in this paper provide a baseline that outlines the basic properties of this system, which is necessary for further development and testing in more complex settings.
Overall, our experiments show that the agents are able to learn policies with a wide variety of matching descriptors, and that the choice of descriptors crucially affects the model’s performances. Therefore, the combination of descriptors used has to be tailored to the characteristics of the synthesis model and the sound corpus.
Agents trained on a corpus composed of the same sounds they can produce are flexible and able to adapt to timbres that are quite different from the training corpus, such as the string ensemble in Video 1 and the additive synthesizer in Video 3. However, previous experiments we conducted using a corpus whose timbre differs substantially from the sounds the agent can produce have shown that the model does not always converge when there is little overlap between the features in the corpus and those in the lookup table. This is due to the current formulation of the reward function, which penalizes large differences in timbre: if the synthesizer can never obtain a satisfactory match, the agent always receives a negative reward, hindering convergence. In future experiments we will adapt the reward function to handle large differences between the agent’s and the musician’s timbre, and to take into account how the timbre evolves in time. This might involve developing rewards with explicitly musical objectives rooted in a specific music practice, or more generic criteria such as novelty and exploration.
We identify two main bottlenecks in the model described in this paper: the computation of the lookup table and the discrete action space. Both significantly slow the convergence of the training process and limit the number of parameters the agent can control. To address these issues, in future experiments we plan to employ a continuous action space with a variable step size instead of a discrete one, and to experiment with replacing the lookup table either with live sound generation and descriptor computation during training, or with a machine learning model that estimates the descriptors corresponding to a set of synthesis parameters. We anticipate that this approach will help resolve other issues, such as the agent's instability in finding the optimal match for the musician’s sound and the non-smooth response of the synthesis control.
Several aspects of the model need further investigation. The modeling approach and the training process we propose have many possible alternatives which have not yet been explored. Recent advancements in reinforcement learning such as self-play [54], intrinsic motivation [55] and curiosity-based learning [56] hold potential for application in live algorithms for musical improvisation, and they could be integrated into our model’s training methodology. In the future, we plan to further test the model with these options, employing more complex synthesis engines and wider sound corpora.
This work is funded by the Department of Musicology at the University of Oslo as part of the first author's project for his doctoral research fellowship. The research presented herein relies on open-source coding frameworks and data that are publicly available. We have released the code developed for this project as open-source software to ensure reproducibility. No human participants were involved in the research. The authors report no conflicts of interest.