
Deep Steps: A Generative AI Step Sequencer

Designing the interaction and guiding further development through user feedback for a generative AI step sequencer

Published on Aug 29, 2024

Abstract

Deep Steps is a MIDI step sequencer standalone application with an integrated generative neural network. The neural network can be quickly and easily trained using the user’s own audio loops, opening training data up to be part of the creative process. The design philosophy is to have a real-time interaction involving generative AI in a musical interface, which can be part of an electronic music production workflow. The system was subjected to a quantitative user study as part of a user-centric design process. Firstly, this assessed its overall usability as a software system and a creative tool. Secondly, this tested user responses to two sets of user-interface control conditions common to similar generative systems. The findings indicate positive scores for the system’s usability and highlight several strengths and drawbacks of the two common control paradigms. These findings will be used to inform the future development of the system.

Author Keywords

Creative AI, Co-Creativity, Human-Computer Interaction, Generative AI

CCS Concepts

•Applied computing~Arts and humanities~Sound and music computing•Human-centered computing~Human computer interaction (HCI)~HCI design and evaluation methods~User studies•Human-centered computing~Human computer interaction (HCI)~HCI design and evaluation methods~Usability testing•Computing methodologies~Machine learning~Machine learning approaches~Neural networks

Introduction

The application of deep learning techniques has notably advanced the field of music generation [1][2]. Many techniques, covered by Briot et al. [3], are now established as state of the art and best practice, and are able to generate convincing pieces of long-form music automatically. Briot et al. also highlight that the majority of the systems they review operate in an “offline” manner and do not exhibit much in the way of user control. Therefore, how users can potentially interact with these types of systems remains open for exploration.

For music interfaces specifically, as reviewed by Jourdan and Caramiaux [4], machine learning (ML) techniques have become increasingly prominent, more recently exhibiting a trend towards generative deep learning. However, Jourdan and Caramiaux also highlight that these systems usually follow the established ML workflow of providing an already-trained model to the user; training data and model parameters are rarely exposed to user control. In terms of music technology, this is analogous to early drum machines, which had only preset sounds and rhythms. The history of drum machines [5] can be defined by their iterative opening of parameters to user control, each time allowing for new forms of creative interaction, arguably culminating in the introduction of digital sampler technology, which allowed complete freedom over sound and programmability.

Exposing ML parameters as part of a real-time creative musical interaction is a key exploration here. Presented here is a user-centric design process for the development of Deep Steps. Deep Steps is a generative MIDI step sequencer with a user-trainable neural network implemented as a stand-alone application. For interaction with the neural network, users are able to train the model using their own corpus of audio loops, choose a number of training epochs, and trigger the generation of new output through the user interface. While the neural network generates the rhythmic elements of the sequence, the user retains control of pitch, key, tempo, gate length, and other musical parameters. A demo improvised performance using Deep Steps is available here.

This paper begins by outlining the background and motivation behind Deep Steps, highlighting its integration of AI into human creative processes in electronic music production. Following this, we detail the implementation of Deep Steps, which explores a design philosophy focused on accessibility. Prioritizing user-friendly, real-time human-ML interactions [6], the aim is to contribute to musicians' creative workflows.

We then present the results of a preliminary user study conducted to assess Deep Steps's usability and the effectiveness of its generative controls. The initial findings indicate that it is suitably usable both as a software system and as a creative tool. This encourages the system's future development and has provided insight into which control experiences different musicians find engaging while using a generative electronic music production system.

Background

Algorithmic composition practices have a long history pre-dating the computer [7] and are becoming ever more present in the broader field of electronic music production. Deterministic processes such as Euclidean rhythms [8] and pitch quantisation, as well as stochastic processes, are visible in communities associated with audio programming and modular synthesis. This can be exemplified by the popularity of Eurorack modules such as the Music Thing Turing Machine, Mutable Instruments Grids and Marbles, the ALM Pamela’s Workout range, and many more, all based on the algorithmic music techniques mentioned above. It is also illustrated in the recent development of the Torso T1, a hardware sequencer dedicated solely to algorithmic processes as part of a real-time interaction. Furthermore, Ableton Live 12 now features native integration of algorithmic music generation in the form of MIDI Generators. While many of these processes have existed as Max4Live devices for some time, their native implementation implies a shift towards wider adoption.

In these forms of practice, while the musical system is responsible for the specifics of the music created, the system's design and the rules it follows become part of the human musician's creative process. Consequently, the musician's expertise and creativity are shifted towards designing and implementing the system, exercising overarching meta-control, and curating the system's output. This concept describes well the intentions behind Deep Steps and its emphasis on music AI’s user-trainability.

Musical HCI

When considering human-computer interaction (HCI) principles for long-term engagement in musical interfaces, Wallis et al. [9] note the heuristics of immediacy and autonomy. Judged against these heuristics, many existing generative systems display high levels of immediacy, potentially generating entire pieces of music with little input from the user, for instance via a text prompt or a single button press. However, this is not always an engaging interaction for the music practitioner: much or all of the activity and creative act has been abstracted away from them, and they cannot exert sufficient control over the process, which infringes on their operational freedom.

At the other end of the spectrum, as noted by Fiebrink and Caramiaux [10], many other ML implementations require significant computational resources, massive datasets, and potentially advanced specialist expertise or software. As such, they lack the necessary immediacy to be widely appealing to musicians.

Human-AI Co-Creation

The nature of co-creativity with AI agents is increasingly being examined[11][12][13], notably as part of the field of HCI. For Wu[14], co-creativity refers to “the ability for human and AI to co-live and co-create by playing to each other’s strengths to achieve more.” Specifically for music, Pasquier et al.[15] introduce the idea of “musical metacreation,” whereby elements of the musical output result from an automated system. These are referred to as “musical agents,” and the ways in which they manifest vary considerably [16].

A notable commonality between musical HCI and AI co-creativity is that they differentiate themselves from classical HCI and AI by “the notion of creative tasks for which there is no clear ‘best’ outcomes or optimal solutions”[15]. The creative activity and output are done for their own sake, with their end goals comparatively “ill-defined.”

Interaction and Control

The notion of control in the context of musical human-AI interactions is an emerging area of research [17]. Of relevance here, firstly, is M4L.RhythmVAE from Tokui [18], a Max4Live device that generates rhythmic drum patterns using a Variational Autoencoder (VAE). Like the system presented here, M4L.RhythmVAE is user-trainable. It also follows a similar control paradigm, with a two-dimensional latent space controlled via an X-Y pad. Another notable system is Latent Drummer from Warren and Çamcı [19], in which user input is achieved by drawing on a touch surface, similarly feeding values into the latent dimensions of a VAE. Both of these systems have been evaluated with users: M4L.RhythmVAE underwent a summative assessment during a workshop and performance, and Latent Drummer was iteratively assessed via a series of performances.

The other common control paradigm is that of single-button generation, often after user input of seed data. This is evident in Google Magenta[20] and its associated “Magenta Studio” plugins for Ableton Live. While these plugins showcase a range of generative AI techniques, their models are pre-trained, and parameters are inaccessible to the user. Their interactions are often based on “Generate” buttons with limited other user controls that revolve around steering the output.

User-Centric Approaches

User experience is the focus for Louie et al. [21] when evaluating a deep generative system, though the participants are all specifically novice users. Of great relevance here is Bougueng Tchemeube et al. [22], who evaluate a generative system in a music production environment for experienced users. Parts of the methodology used there have been replicated here. These, again, are summative evaluations of existing systems. Similar to Magenta Studio, these systems also have a single button for triggering generation plus controls for steering.

Also relevant is Thelle [23], which explores co-creativity with a generative ML system. This is examined experientially through a six-month process in which musicians collaborated with musical agents trained on corpora of their own musical output. This interaction is notable because the musical agents respond in real time to the musicians’ musical input via a form of “machine listening.” The results indicate a shift in perspective and an increased acceptance of the musical agents over the course of the evaluation, as the agents were tweaked to interact better with the human musicians.

In these works, user evaluation is often a generalist exploration of the systems. These evaluations tend to be summative, representing a final assessment of a finished system, and the creators can thus fix and presuppose the interactions. Nowhere is the control of the systems scrutinised by an outside user. Here, while the user study is used to test the overall usability of the system, it is also intended to test specific control and interaction paradigms to inform ongoing development.

Implementation

The Deep Steps system is a standalone application built with the openFrameworks C++ creative computing toolkit. Pure Data is embedded to handle the sequencer logic, and Python is embedded to run the neural network. The application has a MIDI input for synchronising to an external clock source, such as a DAW, and a MIDI output that can be connected to a virtual instrument, plugin, hardware instrument, etc. The implementation is open and ongoing, with the source code available on GitHub.

Data

Users can train the neural network on their own corpus of audio files. The corpus is processed using the aubio library to detect onsets in the audio files, which are then transformed into many-hot encoded arrays via a Python script, as outlined in [3]. This is done in segments of one bar. Firstly, onsets are rounded to the nearest sixteenth note to determine whether each step is on or off, as in standard step sequencers. They are also rounded to a 48-pulses-per-quarter-note (PPQN) time base to extract micro-timing sub-step “groove.” The encoding of the groove is based on the work found in [18] and [24], whereby distances between 32nd-note intervals are represented as continuous values between 0 and 1.
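As an illustration of this encoding, below is a minimal Python sketch of how a single bar might be transformed into the 16 step on/off values and 16 micro-timing offsets described above. It assumes the onset times (e.g. from aubio's onset detector) and the loop's tempo are already known; the function name, the clamping to a 32nd note, and the convention that 0.5 represents an on-grid step are assumptions for this sketch rather than details of the actual script.

```python
import numpy as np

def encode_bar(onset_times_s, bpm, steps_per_bar=16, ppqn=48):
    """Encode one bar of onsets into 16 step on/off values and 16 micro-timing
    offsets (hypothetical sketch of the encoding described above)."""
    sixteenth = 60.0 / bpm / 4.0            # duration of one 16th-note step in seconds
    pulses_per_16th = ppqn // 4             # 48 PPQN -> 12 pulses per 16th note
    hits = np.zeros(steps_per_bar)          # many-hot step on/off encoding
    offsets = np.full(steps_per_bar, 0.5)   # 0.5 assumed to mean "exactly on the grid"

    for t in onset_times_s:                 # onset times relative to the bar start
        nearest = round(t / sixteenth)      # nearest 16th-note step
        step = nearest % steps_per_bar
        hits[step] = 1.0
        # Round the deviation from the grid to the 48-PPQN time base and
        # clamp it to +/- a 32nd note (half a 16th note).
        dev_pulses = round((t - nearest * sixteenth) / (sixteenth / pulses_per_16th))
        dev_pulses = max(-pulses_per_16th // 2, min(pulses_per_16th // 2, dev_pulses))
        offsets[step] = 0.5 + dev_pulses / pulses_per_16th  # continuous value in [0, 1]

    return hits, offsets

# Example: four onsets at 120 BPM, the third slightly behind the grid
hits, offsets = encode_bar([0.0, 0.5, 1.02, 1.5], bpm=120)
```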

Model

The neural network architecture is an applied implementation of a stacked autoencoder, as described in [3], implemented using code from the ML-From-Scratch library. When training, the architecture comprises three successive fully connected layers of encoding and decoding, with a four-dimensional bottleneck layer. To improve training stability and speed, batch normalization [25] is used for every hidden layer. When used for generation, values are passed into the bottleneck layer and fed forward through the decoder stage to generate the rhythmic parts at the output, i.e., the 16 step on/off values and 16 offset values for micro-timings. These values are then passed into the relevant areas of the sequencer logic. The architecture of the neural network is fixed and was arrived at through previous iterative development for this purpose [26]; it is displayed in Figure 1.

Figure 1: Deep Steps’ stacked autoencoder architecture, with two Fully Connected (FC) encoding and decoding layers and a 4-dimensional Bottleneck (BN) layer. Input and output sizes of 32 are based on 16 step on/off values plus 16 values for the 48 PPQN micro-timings. After training, values can be fed through the decoder stage to generate the rhythmic parts of a sequence: 16 step on/off values and 16 offset values for micro-timings.
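To make the generation path in Figure 1 concrete, the following NumPy sketch feeds a 4-dimensional latent vector through a placeholder decoder and splits the 32-dimensional output into step on/off values and micro-timing offsets. The hidden-layer size, activations, thresholding, and random stand-in weights are assumptions for illustration; the actual implementation uses trained parameters via the ML-From-Scratch library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder decoder weights standing in for trained parameters.
# The 16-unit hidden layer is an assumed size; only the 4-d bottleneck
# and the 32-d output are taken from Figure 1.
W1, b1 = rng.normal(size=(4, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 32)), np.zeros(32)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode(latent):
    """Feed a 4-d bottleneck vector forward through the decoder stage and split
    the 32-d output into 16 step on/off values and 16 micro-timing offsets."""
    h = np.maximum(0.0, latent @ W1 + b1)   # hidden layer (ReLU assumed)
    out = sigmoid(h @ W2 + b2)              # squash outputs into [0, 1]
    hits = (out[:16] > 0.5).astype(int)     # threshold into step on/off values
    offsets = out[16:]                      # continuous micro-timing offsets
    return hits, offsets

hits, offsets = decode(rng.uniform(-1.0, 1.0, size=4))
```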

Previous work has focused on constructing the model architecture such that it can be used with relatively small datasets (example datasets are 168 and 39 audio loops) and generally takes only a few seconds to train, depending on the number of epochs. This affordance is important for making the implementation accessible to music producers.

User Interface

The application itself provides a user interface for interaction. Key aspects include the traditional step sequencer interface with vertical sliders as per-step controls for pitch, underneath which are step on/off indicators. The user is in control of pitches per step, while rhythmic aspects of the sequence are generated by the neural network.

Figure 2: Deep Steps UI Condition 1. This was the first UI tested and uses a “Generate” button visible on the left of the interface. This button feeds random values into the neural network’s decoder stage and instantly generates a rhythmic part for the sequence. The vertical sliders control the pitch as in a standard step sequencer. The UI also includes controls for training the neural network in the bottom left, such as adding audio loops for training data, data processing, and training for a specific number of epochs.

Two user interface (UI) conditions were tested here. The first UI condition afforded the user a “generate” button (Figure 2) to create a new rhythmic part instantly. This control paradigm aligns with several other generative ML music systems and prioritises immediacy. The second control condition provided four continuous sliders (Figure 3) to pass values individually into the model’s bottleneck layer dimensions. Theoretically, this gives the user a more instrument-like control over the generation, prioritising autonomy.

Figure 3: Deep Steps UI Condition 2. This second UI affords the user four continuous sliders to feed values manually into the bottleneck layer. These are then fed forward through the decoder to generate the rhythmic part. In giving the user continuous sliders, the intention was to provide an interaction more similar to other electronic music instruments. All other user controls are the same. The white box numbers are for MIDI CC mapping of the four sliders; this aspect was not included in the study but is intended for future iterations.
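The difference between the two conditions can be summarised in code: both feed the same decoder, but Condition 1 samples random latent values while Condition 2 passes the slider positions directly. The sketch below reuses the placeholder decoder idea from the Model section; the latent range of [-1, 1] is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed placeholder decoder weights standing in for a trained model,
# as in the decoder sketch in the Model section.
W1, b1 = rng.normal(size=(4, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 32)), np.zeros(32)

def decode(latent):
    h = np.maximum(0.0, latent @ W1 + b1)
    out = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
    return (out[:16] > 0.5).astype(int), out[16:]

def generate_button():
    """UI Condition 1: feed random values into the bottleneck layer,
    instantly producing a new rhythmic part (latent range assumed)."""
    return decode(rng.uniform(-1.0, 1.0, size=4))

def slider_update(slider_values):
    """UI Condition 2: the four continuous sliders supply the latent values
    directly, so the same slider positions always recall the same part."""
    return decode(np.asarray(slider_values, dtype=float))

hits, offsets = generate_button()
hits, offsets = slider_update([0.2, -0.7, 0.9, 0.0])
```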

Further user controls include horizontal sliders for “Gate length,” which controls the length of the outputted MIDI note, and “Sub-Steps Scaling,” which scales the amount of micro-timing “groove” offset applied to the sequence. On the right are pitch controls for musical keys and a choice between retaining the full chromatic scale or constraining pitch to a major or minor scale.
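Where these controls meet the generated values, a small amount of arithmetic turns steps and offsets into note timings. The sketch below shows one hedged interpretation of how “Gate length” and “Sub-Steps Scaling” might be applied on a 48-PPQN clock; the function and parameter names are illustrative and do not reflect the actual Pure Data sequencer logic.

```python
def schedule_notes(hits, offsets, gate_length=0.5, substep_scaling=1.0, ppqn=48):
    """Turn generated step on/off values and micro-timing offsets into
    (start_pulse, duration_pulses) pairs on a 48-PPQN clock.

    gate_length:     fraction of a 16th note the MIDI note is held for (0-1).
    substep_scaling: how much of the micro-timing "groove" offset is applied (0-1).
    """
    pulses_per_16th = ppqn // 4
    notes = []
    for step, (on, off) in enumerate(zip(hits, offsets)):
        if not on:
            continue
        # An offset of 0.5 means "on the grid"; scale the deviation around it.
        deviation = (off - 0.5) * pulses_per_16th * substep_scaling
        start = round(step * pulses_per_16th + deviation)
        duration = max(1, round(gate_length * pulses_per_16th))
        notes.append((start, duration))
    return notes

# Example: two active steps, the second pushed slightly late
print(schedule_notes([1, 0, 0, 0, 1, 0, 0, 0],
                     [0.5, 0.5, 0.5, 0.5, 0.667, 0.5, 0.5, 0.5]))
# -> [(0, 6), (50, 6)]
```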

At the bottom of the UI are the controls for user training the model. “Open Corpus” opens the folder into which the user can add their audio loops, “Make Datasets” processes and encodes the corpus, and “Train” trains the model for the number of epochs specified in the “Epochs” control.

Experiments

Drawing on a “user-centric design” philosophy essential to third-wave HCI[27], an initial iteration of the Deep Steps system was subjected to a preliminary user study with 11 participants, capturing Likert-scale ratings (1-5) to inform the system's further development and future evaluation.

Setup

The experiment was conducted on a MacBook Pro laptop running the Deep Steps application and Ableton Live. Ableton Live was configured to send MIDI clock data and receive the MIDI output from Deep Steps via an IAC bus, along with a basic setup involving a default synthesizer virtual instrument to process MIDI data from Deep Steps and a simple drum pattern.

Deep Steps was set up with an untrained neural network. Participants were provided with two sets of sample audio loops for experimentation: one set comprised 39 guitar loops, and the other contained 168 synthesiser loops. This software configuration was standardized across all participants.

Before the study started, all participants were given a brief demonstration to familiarize them with the application's functionalities.

Procedure

Participants were invited to use the system for this preliminary user study and provide feedback on its usability via questionnaires. The participants were recruited either directly or via mailout. The only prerequisite for participation was some experience with music production. The study took place in person, with the participants using the author’s laptop to run the setup.

Participants were given the option of bringing their own corpus of audio loops or using the ones provided. In the study, participants interacted with the system in two different test conditions.

The surveys used in the study are based on the methodology found in Bougueng Tchemeube et al.[22] in their usability section and are as follows:

  • The Standard System Usability Scale (SUS) from Brooke [28] captures ten items on a 1-5 Likert scale and generates a score of 0-100 for the system overall (a scoring sketch follows this list). A score below 50 is considered unacceptable, between 50 and 70 marginal, and above 70 acceptable for the usability of the system.

  • The Creativity Support Index (CSI) from Cherry and Latulipe [29] captures twelve agreement statements on a 1-10 basis and a paired-factor comparison test, both of which cover the factors of Collaboration, Enjoyment, Exploration, Expressiveness, Immersion, and Results Worth Effort. As in [22], since the system here involves no human collaboration, the agreement statements relating to Collaboration have been excluded. Similarly, the agreement statements were captured on a 1-5 Likert scale.
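For reference, SUS totals such as those reported below follow Brooke’s standard scoring, sketched here with hypothetical responses:

```python
def sus_score(responses):
    """Standard SUS scoring: for ten 1-5 responses, odd-numbered items contribute
    (response - 1), even-numbered items contribute (5 - response), and the sum
    is multiplied by 2.5 to give a 0-100 score."""
    assert len(responses) == 10
    total = sum((r - 1) if i % 2 == 1 else (5 - r)
                for i, r in enumerate(responses, start=1))
    return total * 2.5

# Hypothetical, fairly positive set of responses
print(sus_score([4, 2, 5, 1, 4, 2, 5, 2, 4, 1]))  # 85.0
```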

Following the brief application demonstration, participants were instructed to freely explore creating music using the setup. Participants were given ten minutes to explore each UI condition, completing the SUS and CSI questionnaires after each one. They were invited to use the setup under each condition as they saw fit as part of a music-making scenario, including:

  • Training and re-training the neural network with different datasets and numbers of epochs.

  • Generating rhythmic material using the relevant control paradigm for each condition.

  • Creating melodic content using the step sequencer’s pitch sliders, quantised to a scale if desired.

  • Any other techniques associated with music production, such as changing tempo, drum accompaniment, using effects, etc.

Results

The following sub-sections outline the results from each part of the evaluation methodology. The mean is provided for high-level overviews of total scores, with the standard deviation (SD), which indicates the spread of values around the mean, given in brackets.

Standard System Usability Scale (SUS)

The mean SUS score for “Condition 1” with the “Generate” button was 81.36/100 (SD = 12.47), while the score for “Condition 2” with the latent value sliders was 74.31/100 (SD = 19.27). Both control paradigms, therefore, scored acceptable levels of usability.

Figure 4: SUS Total Results for Condition 1 (C1) and Condition 2 (C2). This boxplot illustrates the distribution of the total SUS scores under each UI condition. The box itself outlines the middle 50% of the users’ results, while the lines display the lowest and highest scores. Circles are outliers in the data.

SUS Total Scores

Figure 4 shows the greater consistency of results for Condition 1. This aligns with the mean total results, implying it was a more usable system for these participants. Opinions on the usability of Condition 2, while still acceptable by the test metrics, are much more varied.

Creativity Support Index (CSI)

The CSI captures data relating to different “factors” relevant to creativity: Collaboration, Enjoyment, Exploration, Expressiveness, Immersion, and Results Worth Effort. The participants responded to agreement statements rating the system under each condition with respect to these factors; these are the “factor scores.” Every possible combination of factors was also paired, and participants were asked to choose which factor in each pair was more important; the number of times a factor was picked forms the “factor count” for each participant. Factor scores are then weighted by their counts to produce a “weighted factor score.” It should be noted that this is an adaptation of the methodology in [29], detailed further in the Limitations and Future Work section.

Tabulated below as in [29] are the mean scores for each element of the CSI.

CSI Condition 1 - Avg. Score total: 72.5/100 (SD = 14.29)

| Scale                      | Avg. Factor Counts (SD) | Avg. Factor Score (SD) | Avg. Weighted Factor Score (SD) |
|----------------------------|-------------------------|------------------------|---------------------------------|
| Results Worth Effort (RWE) | 1.18 (1.32)             | 4.45 (0.68)            | 32.27 (26.64)                   |
| Exploration                | 2.18 (0.87)             | 3.63 (0.80)            | 16.72 (3.43)                    |
| Immersion                  | 2.09 (1.13)             | 3.9 (0.53)             | 33.09 (21.02)                   |
| Expressiveness             | 1.63 (1.28)             | 3.18 (0.75)            | 20.72 (18.4)                    |
| Enjoyment                  | 2.9 (0.83)              | 4.45 (0.68)            | 51.27 (14.94)                   |

CSI Condition 2 - Avg. Score total: 71.6/100 (SD = 10.988)

| Scale                      | Avg. Factor Counts (SD) | Avg. Factor Score (SD) | Avg. Weighted Factor Score (SD) |
|----------------------------|-------------------------|------------------------|---------------------------------|
| Results Worth Effort (RWE) | 1.18 (1.32)             | 3.9 (0.7)              | 18.9 (21.99)                    |
| Exploration                | 2.18 (0.87)             | 4 (0.63)               | 18.18 (2.96)                    |
| Immersion                  | 2.09 (1.14)             | 3.8 (0.75)             | 32.72 (19.82)                   |
| Expressiveness             | 1.63 (1.29)             | 3.63 (1.36)            | 21.81 (21.04)                   |
| Enjoyment                  | 2.9 (0.83)              | 4.54 (0.68)            | 51.63 (12.06)                   |

The averaged total scores display no discernible preference between the two conditions as a creative tool. As [29] maps the output scores to academic grading (above 90 being an “A” and below 50 an “F”), both of these conditions can be considered to have achieved acceptable scores.

Figure 5: Users responded to agreement statements on a 1-5 rating of the system under two UI conditions, in relation to five factors relevant to creativity: Enjoyment, Expressivity, Immersion, Exploration, and Results Worth Effort (RWE). The mean of these scores is shown in this radar chart.

Figure 6: Users’ responses to the 1-5 agreement statements, rating the system across Enjoyment, Expressivity, Immersion, Exploration, and Results Worth Effort (RWE) under two UI conditions. Each user’s response is plotted in these charts, with each colour representing a different user.

Considering the factor scores as in [22], these illustrate the users' general impressions of the system. Across both conditions, on average (Figure 5), all factors score favorably, with scores of around four. The only exception is Expressiveness in Condition 1, with an average score of 3.18. Considering each factor individually (Figures 6, 7 & 8), the Enjoyment and Results Worth Effort scores were consistently high, while the other factors showed greater variation in their responses.

Figure 7: CSI Factor Scores (Condition 1, FS1). User responses to the 1-5 agreement statements assessing UI Condition 1 across Enjoyment, Expressivity, Immersion, Exploration, and Results Worth Effort (RWE). The boxplot illustrates the distribution of the responses. The box itself outlines the middle 50% of the users’ results, while the lines display the lowest and highest scores. Circles are outliers in the data.

Figure 8: CSI Factor Scores (Condition 2, FS2). User responses to the 1-5 agreement statements assessing UI Condition 2 across Enjoyment, Expressivity, Immersion, Exploration, and Results Worth Effort (RWE). The boxplot illustrates the distribution of the responses. The box itself outlines the middle 50% of the users’ results, while the lines display the lowest and highest scores. Circles are outliers in the data.

Considering the weighting of the paired-factor comparisons (Figures 9, 10 & 11), Enjoyment can be observed as the key factor for this group of participants and is highlighted as a strength of the system under both conditions. Expressiveness again averages poorly, though some outliers score it well. Results Worth Effort appears to be a divisive factor among the participants, with many users scoring the system highly for it but, on average, deeming it the least important factor.

Figure 9: The scores for each factor, Enjoyment, Expressivity, Immersion, Exploration, and Results Worth Effort (RWE), are weighted by the number of times the user chose it in a paired-factor comparison. This indicates how well the system supported that particular user’s creativity. Here, the mean of each of these weighted scores under the two UI conditions is shown in the radar charts.

Figure 10: Weighted Factor Scores (Condition 1, WS1). Weighting the factor scores for UI Condition 1 by the number of times the user chose the factor in the paired-factor comparison displays the system’s success across the different factors and how important they are to the user. This yields a different data distribution than in Figure 7, with clearer distinctions between the factors.

Figure 11: Weighted Factor Scores (Condition 2, WS2). Weighting the factor scores for UI Condition 2 by the number of times the user chose the factor in the paired-factor comparison yields a different data distribution than in Figure 8.

Discussion

Taking the overall scores to assess the system’s general usability, both conditions scored above the acceptable threshold in the surveys. This is an encouraging outcome for an in-development piece of software, as it implies that its core principles and functionality are potentially valid. In the following, we discuss the outcomes under four emerging themes: the participants’ interaction with the user interface, neural network generation and training, music production workflows, and the study’s limitations and future work.

User Interface Interaction

The survey results illustrate users' differing experiences under the two test conditions and the nature of the types of interactions they each lend themselves to. Condition 1, with its instant “generate” button, is grounded in being an immediate interaction. It allows participants to generate ideas quickly and then concentrate on other aspects of the music-making process, such as changing the step pitches, changing the tempo, adjusting the instrument timbre, etc. It achieved relative consistency in its scoring, as illustrated by its lower standard deviation of SUS scores.

The sliders in Condition 2 were a far more divisive control paradigm. This may stem from the arbitrary nature of feeding values into latent space. Sliders imply more autonomy and operational freedom, though how users reacted to this differed greatly. Many users welcomed the implied expansion of direct control, finding the instant generation of Condition 1 to be less and less engaging throughout their time with it. The ability to manually control values and see the generation change was more satisfying for many and represented greater authorial ownership over the material produced. Yet, others found that the sliders created an unnecessary complication and that building a mental map of the interaction was difficult and distracting. These participants are visible in the SUS box plot, scoring Condition 2 notably lower than most others. As such, this condition has, on average, scored lower.

Throughout initial development, the intuition had been that the sliders would offer a more engaging and instrument-like interaction for musicians. The user study highlights strengths and drawbacks in both types of control, which has challenged this intuition and altered the course of future development.

A point of frustration noted occasionally for Condition 1 was the inability to return to a previously generated part, a characteristic of algorithmic processes involving elements of stochasticity. This resulted in some users recording sequences they liked as MIDI before moving forward. As the stacked autoencoder architecture is deterministic, generated parts can be recalled simply by feeding it the same values; however, this affordance was only available via the sliders. As the generate button feeds random values through the decoder, the immediacy it offers is enabled by a stochastic technique within the user interaction and co-creative process rather than within the model itself. This, combined with the control and recall provided by the sliders, suggests a justification for both types of user control.

Neural Network Generation and Training

For similar generative tasks, it is common to employ architectures such as the Variational Autoencoder (VAE) or Generative Adversarial Network (GAN) [3]. While its generative capabilities are more limited than these, an advantage of the stacked autoencoder architecture used here is that it could be trained within a few seconds under the typical usage exhibited by the users. This allowed for the intended interaction of training with different datasets and re-training with different numbers of epochs. Generally, when a user became bored or unsatisfied with the generative output of the model, they would train it again to achieve different results, the training in effect becoming part of the creative interaction.

When considering the neural network more closely, many users remarked on how quickly it trained and generated. Those more familiar with ML were curious about training it “poorly” with a very low number of epochs (1-5) or pushing its capabilities by training for a significantly higher number, such as 500. For all participants, however, the model’s accuracy was of little concern; their focus was on the musical outputs they were able to achieve.

Music Production Workflows

On several occasions, participants were observed focusing on music production tasks other than generating material. Beyond what has already been mentioned, users were observed adding drum accompaniment, changing the timbre of instruments, and recording MIDI parts from the sequencer that they liked. It should be emphasised at this point that Deep Steps is intended to be part of a larger music production workflow. At its core, the system is still a step sequencer, and while it has been augmented with the neural network, it still exhibits many of the limitations associated with that interface. The results do imply, however, that although the neural network integration introduced initially unfamiliar workflows, the system can successfully fulfill this intention.

Overall, participants’ verbal feedback revolved around desires for additional controls and visual feedback, including resetting the pitch of all steps, modulating the generative controls, a readout of pitch values, and an alternative to sliders for pitch input. Where possible, these will be implemented in the next iteration.

Limitations and Future Work

In the full CSI methodology, factor counts are multiplied by factor scores to give a weighted factor score. The weighted factor scores are then summed together. The highest sum possible here is 300. This is then divided by 3 to output a final total score out of 100.

Following the CSI usage in [22], the agreement statements and pair-factor comparisons relating to collaboration were excluded, and only one agreement statement per factor was captured on a 1-5 Likert scale. For reasons of standardization and comparison, however, Cherry and Latulipe recommend against adapting the methodology, including excluding a factor, as it interferes with the weighted scoring. As such, this final step is absent from [22].

Weighting the factors was nonetheless desired and enlightening for user-centric design. In removing the Collaboration factor and scaling up the factor scores to align with Cherry and Latulipe, the highest possible weighted score was 200; dividing this by 2 gave a score out of 100, similar to the prescribed methodology. As this is an adaptation of the metric, it should be considered individualistic and not entirely comparable to other systems using the full CSI methodology. Relatedly, the user experience of the UI conditions was not measured against an existing baseline, such as a standard step sequencer, nor were the participants “blinded”. Though neither of the utilised methodologies requires this, it is nonetheless an avenue worth pursuing in future work.
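As a concrete illustration of this adapted scoring, the sketch below weights one hypothetical participant’s five factor ratings by their paired-comparison counts and divides the sum by two. The ×4 scaling of the 1-5 ratings towards the CSI’s 20-point factor scores is an assumption made for this sketch, not a prescription of the exact scaling used in the study.

```python
FACTORS = ["Enjoyment", "Exploration", "Expressiveness", "Immersion",
           "Results Worth Effort"]

def adapted_csi(ratings, pair_choices, scale=4):
    """Adapted CSI total with Collaboration removed.

    ratings:      one 1-5 agreement rating per factor.
    pair_choices: the winning factor in each of the C(5,2) = 10 paired comparisons.
    scale:        assumed scaling of 1-5 ratings towards 20-point factor scores.
    """
    counts = {f: 0 for f in FACTORS}
    for chosen in pair_choices:
        counts[chosen] += 1
    weighted = {f: counts[f] * ratings[f] * scale for f in FACTORS}
    # Counts sum to 10, so with all ratings at 5 the weighted sum is 10 * 20 = 200;
    # dividing by 2 gives a total out of 100.
    return sum(weighted.values()) / 2.0, weighted

ratings = {"Enjoyment": 5, "Exploration": 4, "Expressiveness": 3,
           "Immersion": 4, "Results Worth Effort": 4}
pair_choices = (["Enjoyment"] * 4 + ["Exploration"] * 3 +
                ["Immersion"] * 2 + ["Expressiveness"])
total, weighted = adapted_csi(ratings, pair_choices)  # total = 86.0
```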

In general, this user study has illuminated several areas that need to be addressed in the future development of Deep Steps. Firstly, the findings imply that the application is usable both as a system and as a creative tool, validating the grounding principle that training data and ML parameters can be part of an engaging musical interaction. Secondly, they imply that there is an argument to be made for both control paradigms: the lack of a clear overall preference among participants illustrated the strengths and weaknesses of both ways of interacting with a generative system. Rather than continuing further development using the author’s intuition alone, this study has been illuminating for guiding future work toward appealing to a broader group of musicians.

As a result of this work, development will continue on the Deep Steps application, implementing the findings from this study into the next iteration of the software.

Although this methodology has served well as a means of user-centric iterative design, the amount of time each participant has spent with the system does not allow for assessing its potential for long-term engagement. As such, the next iteration of the system will be subjected to a more longitudinal qualitative study. The findings here, however, will also be used to inform the framing of this future study.

Conclusions

This paper introduced the Deep Steps application, a generative MIDI step sequencer with a user-trainable neural network. We presented its design sensibilities, implementation as a stand-alone application, and a preliminary user study. We found that this initial system iteration performed acceptably as software and a creative tool. The study also uncovered findings relating to responses to user interface controls that ran counter to the author’s intuition. These findings will inform future iterations of the system’s design and future user experience studies.

Ethics Statement

Our system was designed to utilize compact models that could be trained in real-time using a small amount of data. In doing so, we prioritized accessibility and minimized the environmental footprint of neural network computations. We entrusted the aspects of data privacy and ethics to the users by enabling them to train the model with their own datasets. The user study was conducted with complete anonymity and with the participants' informed consent.
