We present a small-scale framework for generating phrase-level symbolic music with limited training data. The human evaluation shows that our framework outperforms the state-of-the-art transformer-based models in terms of overall quality and rhythmic regularity.
Music is organized into a hierarchical structure with repetition at different time scales, from beats, phrases, and themes to sections. Transformer-based models successfully generate music that exhibits some level of periodicity in minute-long excerpts. Recent large language models have extended this coherence to a longer range in certain types of music. Nevertheless, it remains challenging to impose necessary musical structure constraints, such as the recurrence of sections within a movement of a sonata. Since music composition is highly constrained by structure, an alternative approach is to generate musical phrases guided by a rule-based structure representation. In this study, we contribute a novel framework that generates complete and coherent musical phrases using an encoder-decoder architecture. We also propose REMI Lite, which tailors REMI[1], a music representation with a quantized time grid, to learn phrase endings in classical music while maintaining rhythmic regularity. Human evaluation shows that our approach achieves high generation quality when trained with only 195 piano sonata movements. Our novel finger-tapping experiment further validates that the representation allows for producing music with steady perceived beats.
AI Music Generation, Music Representation, AI Music Evaluation
Automatic music generation has gained much attention since transformer-based models such as Music Transformer[2] and MuseNet are capable of modeling long-term dependencies in musical sequences. Variants of transformers such as Transformer XL[3], Perceiver AR[4], LLaMA[5], and LLaMA2[6] have further extended the context window to full-length music. While these powerful new tools have yielded state-of-the-art performance in modeling coherence over longer sequences, recent efforts in generating music conditioned on melodies, chords, rhythms, high-level text descriptions, etc. have demonstrated the potential to produce more desirable results by incorporating human-interpretable controls.
Controllable music features in existing models are mainly manifested as latent representations of these features or as conditional inputs in a format compatible with the music representation. Beyond harmony, melody, and rhythm, which are commonly used in controllable music generation, structural coherence has yet to be explored in depth. Structure is a fundamental compositional feature in music, especially in classical music, which generally has a clear musical form with a hierarchical grouping of recurring units including sections, phrases, ideas, motives, etc. For example, sonata form follows a strict three-part overall structure, namely Exposition, Development, and Recapitulation. The Recapitulation is essentially a repeat of the Exposition, without the modulation to the dominant key at the end of the section[7]. In addition, musical structure plays an important role in music perception. Psychological and neural evidence suggests that hierarchical perception of music is organized in a bottom-up fashion, from beat and meter to phrases and sections. When listening to naturalistic classical music, audiences are first entrained to low-level rhythmic structures such as steady beats and meters, and are then kept engaged by the tension and release constantly induced by violations of expectation at boundaries between phrase structures[8][9][10][11][12]. However, music from recent music generation systems is often strikingly different from human-composed music in terms of structure[13]. Learning high-level structural coherence relies on all relevant contexts from an input window extended to the entire piece of music. Taking the piano sonatas in this study as an example, the input sequence for a whole movement reaches a maximum of 21,299 tokens in MIDI-like[14][2] tokenization, which is challenging because the memory complexity for intermediate information in transformer-based models is quadratic in the input length.
While high-level structural coherence is a bottleneck in computation, it guides real-world music composition. Music is highly constrained with internally connected sections and recurring phrases, which makes it possible to break down the computational requirements into phrase-level generation and phrase-relation modeling on inputs with reasonable length. Therefore, we start with improving the quality of single phrase generation.
A phrase is defined as the smallest musical unit that conveys a complete musical thought[15]. The end of a phrase is marked by a cadence, usually perceived as a rhythmic or melodic articulation, which is yet to be carefully modeled. Phrases also exhibit a clear beat-meter structure and remarkable rhythmic regularity, which we seek to address in this work.
We propose a novel framework for polyphonic, phrase-level symbolic music generation, targeting piano sonatas of the Classical era. Inspired by the music inpainting task in real-world composition, the framework is designed to finish a musical phrase conditioned on primer bars at positions not limited to the beginning. The framework entails a new score-based representation, REMI Lite, and an encoder-decoder model, with the bidirectional encoder enhancing within-phrase coherence and the auto-regressive decoder generating new sequences. The model is fine-tuned on automatically segmented phrases to learn phrase endings. Evaluation of the proposed framework includes a self-report survey on the overall impression of the generated samples and a novel finger-tapping task to assess rhythmic regularity.
Our major contributions are as follows:
We propose a framework for single-phrase, score-based, symbolic music generation that achieves higher quality than a Music Transformer baseline with a sequence length of 512 trained on a small dataset. The proposed framework outperforms the baseline in both reconstruction loss and human evaluation.
We adapt REMI tokenization to our corpus and develop REMI Lite which allows for flexible meter representation, higher temporal resolution, and phrase-ending learning.
We propose a novel finger-tapping approach for examining perceived rhythmic regularity in AI-generated music, which mitigates the information bias possibly induced by conventional self-report approaches.
Automatic music generation dates back to 1959[16]. In recent years, transformer-based models have become the mainstream tools in this area since the attention mechanism[17] was introduced into music modeling by Music Transformer[2]. Currently, music generation systems are mainly improved from the following perspectives:
modeling long-range dependencies;
optimizing music representations for different types of inputs;
imposing controls over high-level components such as melody contour, rhythmic structure, chord progression, emotions, etc.
As music performance can easily contain hundreds of thousands of elements even after compression, the key challenge is to enable long-range pattern recognition without adding extraneous computational complexity. Initial breakthroughs by Music Transformer[2] opened the way to modeling longer-term coherence for symbolic music generation with relative positional encoding. Other approaches reduce the cost of handling long sequences by truncating attention[3], introducing sparsity into the attention[18], or using cross-attention to compress the computation[19][20][21]. PopMAG models the harmonic progression in prolonged contexts for accompaniment generation in pop songs using a Recurrent Transformer[22]; Pop Music Transformer[1] generates 12-bar-long piano pop song excerpts with Transformer-XL[3]; Perceiver AR extends attention from multiple bars to the entire piece, successfully generating high-level structure to some degree[4]. However, these are still far from reproducing section-level repetition and a clear beat-bar-phrase hierarchy that makes sense for music composition and perception.
The input to a music generation system can be either audio or symbolic music. We limit our discussion to symbolic music in this work. Pitch, velocity, track, and temporal information are four types of attributes usually encoded in symbolic music representations. The temporal information entails note onsets, offsets, or durations. We categorize these representations by whether temporal information is quantized into smaller units. Representations using absolute time, such as MIDI-like, support expressive performance. MIDI-like was first introduced in Performance RNN[14] and improved in Music Transformer[2] with richer velocity levels and higher temporal resolution to reconstruct dynamics and grooves in music performance. Quantizing the temporal information, as in Multi-Track Music (MMM)[23], Compound Words (CW)[24], Revamped MIDI-derived events (REMI)[1], and REMI+[25], largely enhances rhythmic consistency in music modeling and generation. These representations also incorporate bar information estimated by beat and downbeat tracking to impose metrical structure.
Efforts have been made to increase the controllability of the models. Controlling conditions are imposed as latent codes learned by a secondary model, such as a Variational Auto-Encoder (VAE)[26], on top of the backbone generation model: Music FaderNet[27] generates music guided by high-level valence and arousal through latent clustering; FIGARO[25] combines description-to-sequence encodings either learned or provided by experts; Wei et al.[28] disentangle representations for rhythmic pattern and chord progression for conditional music generation. Recently, Large Language Models (LLMs) have emerged as a viable path for text-to-music generation. By leveraging the ability of LLaMA2[6] to understand music theory in the form of text, ChatMusician[29] is now capable of composing well-structured and full-length music conditioned on text descriptions and on chords, motifs, and melodies in text-compatible ABC notation, but it still faces limitations such as predominantly generating music in the style of Irish folk music, resulting from the significant portion of training data in this genre. Another downside of pre-trained LLMs for music generation is that fine-tuning LLMs on non-text data can be hard, and not all music can be effectively represented in a text-compatible way. For example, ABC notation, adopted in ChatMusician, is only suitable for representing part-by-part music, not polyphonic music without clear voicing.
We propose a novel framework for phrase-level symbolic music generation based on musical scores in Humdrum notation. In the training stage, scores are tokenized and chunked into event segments for pre-training the model. The model is then fine-tuned on event phrases to learn phrase endings. During the inference stage, the model generates original event sequences conditioned on event prompts, which can be one or multiple bars of notes at any position in a phrase. The output event segments are then converted to MIDI and audio files for playback. Figure 1 shows the diagram of the proposed framework.
This section describes how we designed the new representation, REMI Lite, for compositional purposes. We also discuss how we adapt REMI[1], an existing representation using a position-bar time grid, to our corpus.
We seek to impose a beat-bar-phrase hierarchy in REMI Lite. It is important to highlight that the beat-bar structure is readily available from the scores in our dataset, which eliminates the quantization error that beat and downbeat tracking would otherwise introduce; downbeat estimation in classical music is notoriously difficult.
As in REMI, notes are represented by sets of Position, Pitch, and Duration tokens.
We increase the temporal resolution of note onsets and durations in REMI Lite. Instead of dividing a bar into 16 intervals as in REMI, we quantize note onsets and durations with a minimum unit of 1/24 or 1/32 of a quarter note, which allows for capturing fine-grained rhythmic changes in classical music. Figure 2 shows how notes are tokenized.
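As a concrete illustration, the sketch below snaps an onset or duration expressed in quarter notes onto the 1/24 or 1/32 grid. How REMI Lite selects between the two grids for a given note is not specified here, so the nearest-grid rule in this sketch is an assumption.

```python
from fractions import Fraction

GRIDS = (Fraction(1, 24), Fraction(1, 32))  # minimum units, in quarter notes


def quantize(value_in_quarters, grids=GRIDS):
    """Snap an onset or duration (in quarter notes) to the nearest point on
    either grid, keeping whichever grid yields the smaller rounding error."""
    best = None
    for unit in grids:
        steps = round(value_in_quarters / unit)
        snapped = steps * unit
        err = abs(snapped - value_in_quarters)
        if best is None or err < best[0]:
            best = (err, snapped, unit)
    return best[1], best[2]


# e.g. a triplet eighth (1/3 of a quarter) falls exactly on the 1/24 grid:
# quantize(Fraction(1, 3)) -> (Fraction(1, 3), Fraction(1, 24))
```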
We adopt the
A musical phrase consists of multiple bars and is equivalent to a sentence in written language. As there are no explicit symbols marking the end of a phrase, like a period in a sentence, we apply an
We assume that time signature and tempo are consistent within a phrase in music composition, and thereby prepend a pair of Tempo and Time Signature tokens to each sequence.
Expressive dynamics (i.e., velocity) and articulations are excluded because they are less relevant to compositional structure. We remove trills and grace notes to simplify the representation. We skip the
In this study, phrase-level generation is approached as an inpainting problem, i.e., in-filling missing bars in a phrase conditioned on the bars given as a primer, so as to use context from both directions. Inspired by bidirectional masked language models such as BART[30] and MASS[31], the backbone sequence-to-sequence model is implemented as a bidirectional encoder trained on masked data and a left-to-right autoregressive decoder. The model is trained by optimizing the reconstruction loss computed on the masked part. This also represents a new attempt to generate musical phrases with a progressive masking strategy.
We adopt the standard Transformer encoder-decoder architecture and adapt it to our data by (1) reducing the number of layers in the encoder and decoder from 6 to 4, because our dataset is much smaller than most datasets for language tasks, and (2) replacing the non-linear activation with GELU.
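For reference, a minimal PyTorch sketch of such an adapted configuration follows. The layer count and GELU activation come from the description above; the remaining hyperparameters (model width, number of heads, feed-forward size) are illustrative assumptions, not the exact settings used.

```python
import torch.nn as nn

# Standard encoder-decoder Transformer, adapted as described above.
model = nn.Transformer(
    d_model=512,            # illustrative assumption
    nhead=8,                # illustrative assumption
    num_encoder_layers=4,   # reduced from the standard 6
    num_decoder_layers=4,   # reduced from the standard 6
    dim_feedforward=2048,   # illustrative assumption
    activation="gelu",      # GELU instead of ReLU
    batch_first=True,
)
```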
A major contribution is the bar-level masking strategy. Instead of masking individual tokens, we mask out consecutive tokens within one or multiple bars. The model is trained with two types of masking: either 50% of the bars are selected at random, or a run of consecutive bars in the middle of a phrase is masked. The masked sequences are fed to the encoder, and the tokens in the masked part are fed to the decoder for auto-regressive generation. Note that the inputs to the encoder and decoder both start with a pair of tempo and time signature tokens. Loss is computed for masked positions only. See Figure 4 for more details. The masking strategy is designed for the inpainting task. It also effectively augments the dataset when the model is trained on a limited number of movements.
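The sketch below illustrates the two bar-level masking modes on a sequence organized as a list of per-bar token lists. The placeholder token name, the span length used for the middle-of-phrase mode, and the omission of the prepended Tempo/Time Signature pair are assumptions made for brevity.

```python
import random

MASK = "<bar_mask>"  # hypothetical placeholder token for a masked bar


def mask_bars(bars, strategy="random", ratio=0.5):
    """Bar-level masking sketch. `bars` is a list of per-bar token lists.
    'random': mask ~50% of the bars chosen at random;
    'middle': mask a run of consecutive bars in the middle of the phrase
    (the 50% span length here is an assumption)."""
    n = len(bars)
    if strategy == "random":
        masked_ids = set(random.sample(range(n), k=max(1, round(n * ratio))))
    else:  # 'middle'
        span = max(1, round(n * ratio))
        start = (n - span) // 2
        masked_ids = set(range(start, start + span))

    encoder_in, decoder_target = [], []
    for i, bar in enumerate(bars):
        if i in masked_ids:
            encoder_in.append([MASK])    # whole bar replaced by a placeholder
            decoder_target.extend(bar)   # decoder reconstructs the masked tokens
        else:
            encoder_in.append(bar)
    return encoder_in, decoder_target
```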
The model is first pre-trained on event segments regardless of phrase endings. To increase the number of samples in the training set, we chunk the event sequence of a full-length movement into overlapping 8-bar segments with a hop size of 2 bars. Since these segments are not sampled as phrases, the
The phrases are annotated automatically using an unsupervised segmentation approach which detects phrase boundaries by grouping similar patterns[32]. Specifically, we use a triplet mining method to leverage common audio descriptors for clustering[33]. We compute beat-synchronized log-scaled mel spectrogram feature patches on the audio signals converted from the scores and train an encoder to refine the audio features. The encoder is a 3-layer convolutional neural network that assigns high similarity values to pairs of feature patches sampled from the same phrase and low values otherwise. The embeddings returned by the model are then used as audio descriptors for clustering. The detected phrase boundaries are finally mapped back onto the scores.
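A minimal sketch of the beat-synchronized log-mel feature extraction is given below, using librosa on audio rendered from the scores. The sampling rate, number of mel bands, and aggregation function are assumptions, and the CNN encoder and triplet mining used for clustering are omitted.

```python
import librosa
import numpy as np


def beat_sync_logmel(audio_path, n_mels=64):
    """Beat-synchronized log-scaled mel spectrogram patches (a sketch;
    hyperparameters are assumptions, and the learned encoder is omitted)."""
    y, sr = librosa.load(audio_path, sr=22050)
    # Log-scaled mel spectrogram
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_S = librosa.power_to_db(S, ref=np.max)
    # Track beats, then aggregate frames between consecutive beats
    _, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    return librosa.util.sync(log_S, beat_frames, aggregate=np.median)
```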
For inference, the encoder can be primed with one or multiple bars at any position within a phrase. The encoder receives a sequence of primer notes, with
The dataset consists of 244 piano sonata movements by four composers of the Classical era, Mozart, Beethoven, Haydn, and Scarlatti, drawn from the KernScores library, a collection of digital music scores in Humdrum **kern format[34]. Note that for Beethoven, we only use sonatas composed before 1810. Table 1 shows the number of movements from each composer after removing scores that cannot be parsed. We divide the movements with an 80%-20% training-validation split.
| Composer | Number of Movements |
|---|---|
| Mozart | 69 |
| Beethoven | 85 |
| Haydn | 25 |
| Scarlatti | 65 |
The music in this dataset is highly consistent in style and shares a uniform structure. We further reduce the data variance by transposing all movements to C major or C minor based on the key signature in the score. This enables the model to learn relative harmonic relationships and chord progressions regardless of the key. To avoid introducing octave variance during key transposition, we raise the pitch for movements in G, G#, A, A#, and B and lower the pitch for the rest.
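The sketch below encodes the transposition rule stated above as a semitone shift per tonic pitch class; how enharmonic spellings are normalized is not specified, so mapping every tonic to a sharp-based pitch class is an assumption.

```python
PITCH_CLASS = {"C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
               "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11}


def semitone_shift_to_c(tonic: str) -> int:
    """Semitone shift that transposes a movement to C major / C minor.
    Movements in G, G#, A, A#, and B are raised to the C above; all others
    are lowered, which keeps every shift within a tritone and avoids
    introducing octave variance."""
    pc = PITCH_CLASS[tonic]
    if pc == 0:
        return 0
    return 12 - pc if pc >= PITCH_CLASS["G"] else -pc


# e.g. semitone_shift_to_c("A") == 3 (raise), semitone_shift_to_c("E") == -4 (lower)
```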
We consider two variants of Transformer architectures with different tokenization strategies as baselines for evaluation. In Baselines 1 and 2, we faithfully follow the original settings of Music Transformer with global attention, except that we apply MIDI-like tokenization to the converted MIDI files in Baseline 1 and use REMI Lite in Baseline 2. In Baseline 3, we keep the original BERT[35] architecture and its data corruption strategy but use the REMI Lite representation.
To ensure that the baselines have the same number of parameters as our proposed model, all models are trained with
Table 2 reports the note-wise validation negative log-likelihood (NLL) of all models on our dataset. NLL is improved by using the Encoder-Decoder model, REMI Lite tokenization, and Bar-level masking when trained on a small dataset. To our surprise, although Baseline 3 reaches the second-lowest NLL among all approaches, it fails to generate meaningful outputs.
| | Model | Representation | Validation NLL |
|---|---|---|---|
| Baseline 1 | Music Transformer | MIDI-like | 1.4096 |
| Baseline 2 | Music Transformer | REMI Lite | 1.2669 |
| Baseline 3 | BERT (Encoder-Decoder w/ Dynamic Mask) | REMI Lite | 0.7487 |
| Ours | Encoder-Decoder w/ Bar-level Mask | REMI Lite | 0.6799 |
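For clarity, the reconstruction loss restricted to masked positions can be computed as in the sketch below. The tensor shapes and mean-over-masked-tokens convention are assumptions; the paper reports note-wise values, which may aggregate tokens differently.

```python
import torch.nn.functional as F


def masked_nll(logits, targets, loss_mask):
    """Negative log-likelihood over masked positions only.
    logits: (batch, seq, vocab); targets: (batch, seq);
    loss_mask: (batch, seq) boolean, True where the token lies in a masked bar."""
    per_token = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    return (per_token * loss_mask).sum() / loss_mask.sum()
```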
Human evaluation is still the most reliable way to assess whether AI-generated music is comparable to naturalistic music. A “good” generated music excerpt is expected to be consistent with the primer sequence regarding overall impression, tempo, and metrical structure. To this end, we carried out an online listening test with two independent sessions, investigating perceived quality and rhythmic regularity respectively.
We selected 10 musical excerpts with different tempi and time signatures from sonatas in the validation set and converted them to audio signals to form a Reference set. The duration of the excerpts ranges from 10 to 20 seconds. The tempi of the Reference excerpts are set as in Table 3 to ensure that the beat inter-onset intervals (IOIs) fall within the 200-1800 ms range suitable for in-phase tapping[36].
| | Time Signature | Tempo (BPM) |
|---|---|---|
| Excerpt 1 | 4/4 | 120 |
| Excerpt 2 | 2/4 | 120 |
| Excerpt 3 | 4/4 | 120 |
| Excerpt 4 | 4/4 | 120 |
| Excerpt 5 | 3/4 | 132 |
| Excerpt 6 | 4/4 | 120 |
| Excerpt 7 | 2/4 | 108 |
| Excerpt 8 | 4/4 | 144 |
| Excerpt 9 | 3/4 | 72 |
| Excerpt 10 | 2/4 | 96 |
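As a quick check on the tempi in Table 3, the beat IOI in milliseconds is 60000 / BPM, so the slowest excerpt (72 BPM) has an IOI of about 833 ms and the fastest (144 BPM) about 417 ms, both well inside the 200-1800 ms range:

```python
tempi = [120, 120, 120, 120, 132, 120, 108, 144, 72, 96]  # BPM, from Table 3
iois_ms = [60000 / bpm for bpm in tempi]
assert all(200 <= ioi <= 1800 for ioi in iois_ms)  # in-phase tapping range
```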
Two Experimental sets are generated by the proposed framework and Baseline 1 as continuations of the first 2 bars of the Reference excerpts. To take full advantage of the proposed framework and improve generation quality, the last bar of the Reference excerpt is provided as additional information to the proposed framework. We provide an extra bar preceding the Reference excerpt to Baseline 1 so that both models are conditioned on 3 bars of primer. Figure 6 explains how the test excerpts are generated. The Experimental excerpts are expected to have the same tempo and time signature as the corresponding Reference excerpts, since they share the same primers.
We introduce an Anchor set for rejecting data that do not qualify for further statistical analysis. An Anchor is a pitch-transposed Reference; it therefore has a rhythm identical to the Reference and sounds more similar to the Reference than the generated samples do.
Session 1 is an online survey on the perceived quality of the generated music. Participants listened to a pair of musical excerpts and were asked to provide a 5-point scale rating (Bad, Poor, Fair, Good, and Excellent) on how well the two excerpts matched each other based on their first impression.
There were 5 runs in Session 1, with each run containing a group of excerpts sharing the same primer. The participant compared the Reference excerpt to one Anchor and two Experimental excerpts, yielding 3 pairwise comparisons per run. To avoid fatiguing participants with too many samples, we randomly presented only 5 groups, resulting in 15 comparisons per participant. We obtained 114 valid ratings from 13 participants after rejecting ratings from runs in which the match between the Reference and the Anchor was rated as Bad, Poor, or Fair. Figure 7 shows the overall statistics of the ratings. The improvement in overall perceived quality of the proposed framework over Baseline 1 is statistically significant.
Session 2 examines perceived rhythmic regularity via a finger-tapping task. While commonly used self-report methods may suffer from biased responses, our approach directly reflects how beats are perceived, as finger-tapping becomes less accurate when the beat is difficult to find[37]. Although finger-tapping has long been used to study sensorimotor synchronization to musical rhythmic patterns, this is the first time it has been used to evaluate AI music generation systems.
Participants were asked to tap the beats of the music on a computer keyboard the way they would normally tap their feet when listening to music. We developed a web app, Music Eval, to collect the finger-tapping onsets. There were 5 runs in Session 2. For each run, subjects performed finger-tapping on a group of excerpts sharing the same primer, first on an Anchor excerpt, followed by two Experimental excerpts presented in random order. To avoid fatiguing the participants, we randomly presented 5 groups, resulting in 15 excerpts per participant. We analyzed the Asynchrony from each model, defined as the difference between the inter-tap intervals (ITIs) and the ground-truth beat inter-onset intervals (IOIs).
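A minimal sketch of the Asynchrony computation as defined above; the sign convention and units (seconds) are assumptions.

```python
import numpy as np


def asynchrony(tap_onsets_s, beat_ioi_s):
    """Asynchrony: inter-tap intervals (ITIs) minus the ground-truth beat
    inter-onset interval (IOI). tap_onsets_s are tap times in seconds."""
    itis = np.diff(np.asarray(tap_onsets_s, dtype=float))
    return itis - beat_ioi_s


# e.g. tapping perfectly at 120 BPM (IOI = 0.5 s) yields zero asynchrony:
# asynchrony([0.0, 0.5, 1.0, 1.5], 0.5) -> array([0., 0., 0.])
```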
Six participants were recruited for Session 2. After excluding invalid tappings, i.e., runs in which the participant tapped to the Anchor in an unexpected way, we computed the F-statistic on the Asynchrony of Baseline 1 and the proposed framework. Table 4 shows the F-statistics and the FDR-corrected intra-person p-values.
| | F-statistic | p-value (FDR-corrected) |
|---|---|---|
| Subject 1 | 2.3501 | 3.50e-05*** |
| Subject 2 | 9.2764 | 3.33e-16*** |
| Subject 3 | 11.2413 | 3.33e-16*** |
| Subject 4 | 0.9825 | 0.5320 |
| Subject 5 | 4.4155 | 0.0008*** |
| Subject 6 | 3.4250 | 6.16e-09*** |
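The exact test is not specified above beyond an F-statistic on the Asynchrony; one plausible reading is a per-subject variance-ratio F-test followed by Benjamini-Hochberg FDR correction across subjects, sketched below as an assumption rather than the authors' implementation.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests


def variance_ratio_f(async_baseline, async_proposed):
    """One possible F-test: ratio of Asynchrony variances (baseline over
    proposed) with a one-sided p-value. This is an assumption about the
    exact test, which the text does not spell out."""
    f = np.var(async_baseline, ddof=1) / np.var(async_proposed, ddof=1)
    dfn, dfd = len(async_baseline) - 1, len(async_proposed) - 1
    return f, stats.f.sf(f, dfn, dfd)


# Across subjects, correct the p-values with Benjamini-Hochberg FDR:
# rejected, p_fdr, _, _ = multipletests(p_values, method="fdr_bh")
```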
In this paper, we present a small-scale framework suitable for modeling phrase-level symbolic music with limited data. The proposed framework outperforms state-of-the-art transformer-based models trained on the same dataset in terms of reconstruction loss, overall impression, and rhythmic regularity. We attribute the improvement over the baselines to the encoder-decoder structure, the REMI Lite representation, and the bar-level masking strategy adopted during training: the bidirectional encoder, equipped with the Bar-Pad tokens in REMI Lite, allows for incorporating both forward and backward context in training and inference, and the progressive bar-level masking strategy effectively augments the dataset when the model is trained with only 195 sonata movements. Moreover, our finger-tapping experiment provides a novel way to measure human responses to AI-generated music while mitigating self-report bias in human evaluation. The generated samples exhibit a compelling phrase-level hierarchy, suggesting new possibilities for composing well-structured, full-length, sophisticated music without the aid of pre-trained LLMs and extensive training data. This leaves us enthusiastic about future research in this direction.
There are several possible extensions to this work. First, the model could be optimized to retain the capability of producing highly coherent outputs with less primer information. Second, adding dynamics, articulations, and grace notes, which are currently excluded from the music representation, could improve the artistic interpretation and expressiveness of the output. Lastly, the current framework still lacks control over melody, rhythm, and chords. As classical music has a rich rhythmic structure and logical harmonic progression, we leave this topic for future studies.
As a final remark, we hope that the proposed framework will benefit the community by resolving the discrepancy between the enormous amount of data required for training LLMs and the inadequacy of data available for music modeling.
The model in this work was trained on part of a publicly available dataset, KernScores.
This study involved human participants in evaluating the quality of music generated by the proposed framework. The evaluation study was reviewed and approved by the Committee for the Protection of Human Subjects at Dartmouth College (CPHS #: STUDY00032901), and informed consent was obtained from each participant. The evaluation entailed two independent sessions conducted online. Participants could choose to attend either or both sessions. Volunteers were recruited within Dartmouth College via email. Participation was voluntary and anonymous. No identifiable information was collected during the study.
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
We performed the following preprocessing steps before calculating the Asynchrony:
Remove jittery taps with
Reject invalid runs. For each run in Session 2, participants performed finger-tapping on an Anchor and two Experimental excerpts. The Anchor is used to reject a run if (1) the tapping pattern does not belong to any of the coordination modes shown in Figure 8 and (2) the standard deviation of the ITIs is larger than half of the IOI. The tapping pattern is estimated from the average ITI (see the sketch below).
Adjust the IOI according to the estimated coordination mode.
Concatenate valid ITIs for every participant.
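A sketch of the rejection and adjustment steps above; the set of coordination modes (every beat, every other beat, twice per beat) and the tolerance for recognizing a mode are assumptions, since Figure 8 is not reproduced here.

```python
import numpy as np

# Assumed coordination modes: on the beat, every other beat, twice per beat.
MODE_RATIOS = (1.0, 2.0, 0.5)


def validate_run(tap_onsets_s, beat_ioi_s, mode_tol=0.2):
    """Estimate the coordination mode from the mean ITI, adjust the IOI
    accordingly, and keep the run only if the mode is recognizable and the
    ITI standard deviation stays below half of the adjusted IOI."""
    itis = np.diff(np.asarray(tap_onsets_s, dtype=float))
    mean_iti = itis.mean()
    # Pick the mode whose expected interval best explains the mean ITI
    ratio = min(MODE_RATIOS, key=lambda r: abs(mean_iti - r * beat_ioi_s))
    adjusted_ioi = ratio * beat_ioi_s
    mode_ok = abs(mean_iti - adjusted_ioi) < mode_tol * adjusted_ioi  # tolerance is an assumption
    stable = itis.std(ddof=1) < 0.5 * adjusted_ioi
    return (mode_ok and stable), adjusted_ioi
```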