This paper introduces a Transformer-based melody harmonization method aimed at generating chord progressions with varying levels of rhythmic complexity.
Recently, deep learning methods have demonstrated remarkable performance in automatic melody harmonization. However, most approaches do not focus on generating flexible harmonic rhythms. This study introduces a metric called "rhythmic complexity" to measure the level of complexity in chord rhythm patterns and proposes a transformer-based melody harmonization method that uses latent space regularization to explicitly control musical attributes and generate chord progressions with varying levels of rhythmic complexity. The proposed method generates chord progressions with diverse rhythms by controlling interpretable parameters and achieves rhythm-controllable melody harmonization. The performance of the proposed method was evaluated by comparing the rhythmic complexity of chord progressions generated by humans and by the proposed method. Four measures were used, along with six commonly used metrics for evaluating melody harmonization. Experimental results show that the proposed method can generate chord progressions with a diverse range of rhythmic patterns comparable to those created by humans. Furthermore, the proposed method outperforms existing methods in terms of controllability due to the strong correlation between the controllable musical attributes and the generated results. The proposed method can be leveraged to create high-quality music compositions and is expected to benefit music producers, composers, and enthusiasts alike.
In recent years, rapid developments have been made in deep learning, which has resulted in significant improvements in the quality of automatically generated music. This study focused on automatic melody harmonization, which generates chord progressions based on a given melody.
Previous studies on melody harmonization using neural networks have primarily utilized recurrent neural networks (RNNs). Lim et al. [1] used bidirectional long short-term memory (BLSTM) to predict the chords for each bar from 24 major/minor triads. Yeh et al. [2] extended this method to predict chords and chord functions. However, when using automatic melody harmonization to support music production, it is desirable to freely control the musical features of the generated results, such as the genre, rhythm, and complexity, according to the intended use of the composer. Therefore, some studies control their model and generate chords according to the conditions of melody and user instructions. Chen et al. proposed SurpriseNet [3], a model based on a conditional variational autoencoder (CVAE) [4] that incorporates a surprise contour calculated from the transition probability in a Markov chain. Ryu et al. proposed three transformer-based models for chord generation: standard transformer (STHarm), variational transformer (VTHarm), and regularized variational transformer (rVTHarm). Notably, rVTHarm allows the control of the number of chord types in the generated chord progression [5].
Although these studies are valid, the chords they generate have several limitations, and the types and numbers of chords remain limited. In particular, most of these methods generate chord sequences with a fixed harmonic rhythm, in which each chord lasts a bar or half a bar. The pace of chord progression affects the rhythm and overall impression conveyed in a song because changes in chord duration can alter the melody structure [6].
In this study, we introduce a measure called "rhythmic complexity" to quantify the level of complexity in chord rhythm patterns. The inspiration for rhythmic complexity comes from cognitive music theory, where the melody is not viewed as a flat sequence of notes but as a hierarchical structure. Rhythmic complexity is created by determining where chords are placed within a measure; we simplified this structure to five levels. We also propose a transformer-based melody harmonization method that utilizes latent space regularization to control musical attributes explicitly. This method generates chord progressions with varying levels of rhythmic complexity, enabling rhythm-controllable melody harmonization.
Experiments were conducted to measure the quality of the generated chord progressions by comparing the rhythmic complexity of chord progressions generated by humans and the proposed method using four measures along with six commonly used metrics for evaluating melody harmonization. Experimental results show that our method can generate chord progressions with a diverse range of rhythmic patterns comparable to those created by humans. Furthermore, we quantitatively evaluated how well the generation reflects the control by computing the correlation between the control attributes and generation results. The results show a strong correlation between the controllable musical attributes and the generated results, confirming that the proposed method outperforms previous methods in terms of controllability. Our contributions are as follows: (1) proposing a novel measure called rhythmic complexity and introducing metrics to evaluate the rhythm of chords close to human cognition, and (2) incorporating it into a transformer-based melody harmonization method for explicit control of musical attributes, yielding superior controllability compared to previous methods.
The remainder of this paper is organized as follows. Section 2 gives an overview of the related works. Section 3 presents the methods used in this study. Section 4 presents the experimental setup. Section 5 presents the study findings and their discussion. Section 6 summarizes the results and presents the conclusions drawn from them.
Traditionally, the task of generating chord progressions has been approached using methods such as hidden Markov models (HMM) [7][8][9] and Genetic Algorithms (GA) [10]. Tsushima et al. proposed a hierarchical structure of chords based on the probabilistic context-free grammar (PCFG) of chords and developed a metrical Markov model to estimate a variable-length chord sequence for a given melody [9].
Recently, deep learning models have become the predominant approach in this area. Lim et al. [1] first proposed a model based on a BLSTM network that generated chords from 24 triads for each bar. Yeh et al. [2] extended Lim's model to predict a chord from 48 triads for each half-bar and to predict chord functions in addition to chord labels. Objective metrics for assessing the coherence and diversity of the created chord sequences have also been proposed.
In addition, several studies controlled their models using a variational autoencoder (VAE) [11] and generated chords according to the melody conditions and user instructions. SurpriseNet [3], introduced by Chen et al., uses BLSTM and CVAE to generate chord progressions for each half-bar through user-controllable conditions. By adding a surprise contour calculated from entropy as an input, the surprisingness of the generated chord progressions could be controlled. Ryu et al. proposed rVTHarm [5], a regularized variational transformer-based model for the controllable generation of chords that can control the number of chord types, called chord coverage (CC), in the generated chord progression.
Previous neural-network-based music generation studies have mostly used RNNs. However, in recent years, interest in transformer-based methods has increased [12]. Compared with RNNs, transformers are better equipped to handle relationships between temporally distant elements and are capable of capturing repeating structures that exist in music at various scales.
The Music Transformer [13] is a successful model for symbolic music generation that achieves long-term coherence and structure through a self-attention mechanism. Pop Music Transformer utilized Transformer-XL and introduced a novel data representation called revamped MIDI-derived events (REMI). REMI enables the easy learning of regular rhythm patterns in popular music by including information such as bar lines and positions within a measure in its sequence [14]. Jiang et al. [15] proposed a hierarchical Transformer VAE that uses self-attention blocks to learn contextual melodic representations, thereby enabling the model to control the melodic and rhythmic contexts. Wu et al. proposed MuseMorphose [16], a Transformer and VAE-based model to accomplish style transfer of long pop piano pieces. This model allows users to specify musical attributes such as rhythmic intensity and polyphony at the level of individual bars.
Several studies have focused on music generation under user-controllable conditions. In contrast, our work focuses on generating music by controlling interpretable parameters.
The geodesic latent space regularization method proposed by Hadjeres et al. achieved some controllability for music data by encoding an attribute along a single dimension of the latent space [17]. However, whether this model can handle more complex or multiple attributes simultaneously has not been demonstrated. MIDI-VAE [18] is a model based on VAE that enables the transfer of musical styles between compositions. In addition, this model can interpolate short musical pieces, produce medleys, and create a blend of complete songs.
Pati and Lerch proposed a loss function that encodes the selected musical attributes along specific dimensions of the latent space for regularization of a latent variable to correlate with multiple musical attributes [19]. The users could interactively control the attributes, similar to the current research's focus. Furthermore, Music FaderNets [20] proposed by Tan et al. use the latent regularization method introduced by Pati et al. [19] to encode a multi-dimensional, regularized latent space instead of a single dimension for each low-level feature, increasing flexibility.
This section details the model architecture and the measure of rhythmic complexity. Our transformer-based method generates flexible rhythmic chord progressions by introducing "rhythmic complexity" and latent space regularization [19], providing understandable controls over the musical attributes of the generation.
For the data representation of the melody and chords, we adopted a relatively simple method based on REMI [14]. Table 1 lists the events.
Table 1: REMI-based event representation. The ranges are given in brackets.
Event | Description |
---|---|
Note | The onset of a note in a melody (12 types). |
Chord | The onset of a chord (96 types). |
Rest | Expressing a rest. |
Duration | The duration of a note in a melody and chord. |
Position | The position within a measure. |
Bar | The beginning of a measure. |
EOS | End-of-Sentence |
Padding | Excluded from attention calculation. |
A melody sequence is represented by a combination of three events: the position, note, and duration events, indicating the position within a measure, pitch, and length of a note, respectively. The sequence of these three tokens, repeated in succession, represents a monophonic melody. Similarly, the chord sequence combines a position event, a chord event indicating the chord type, and a duration event. The duration event represents the length of a note or chord using quarter notes as a unit. In addition, a bar event marks the beginning of a measure, and a padding event adjusts the input data to the model's dimensions.
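The event layout described above can be illustrated with a minimal sketch. The token spellings (`Position_4`, `Note_0`, and so on), the use of sixteenth-note steps for positions, and the helper names `encode_melody` and `pad` are assumptions made for illustration; they are not taken from the implementation used in this study.

```python
from typing import List, Tuple

# One melody note: (position in the measure, pitch class 0-11, duration in quarter notes).
# Positions are assumed to be sixteenth-note steps (0-15); this is an illustrative choice.
NoteTuple = Tuple[int, int, float]


def encode_melody(bars: List[List[NoteTuple]]) -> List[str]:
    """Flatten bars of notes into a Bar / Position / Note / Duration token sequence."""
    tokens: List[str] = []
    for bar in bars:
        tokens.append("Bar")  # marks the beginning of a measure
        for position, pitch_class, duration in bar:
            tokens.append(f"Position_{position}")
            tokens.append(f"Note_{pitch_class}")
            tokens.append(f"Duration_{duration}")
    tokens.append("EOS")
    return tokens


def pad(tokens: List[str], length: int, pad_token: str = "Padding") -> List[str]:
    """Right-pad a token sequence to a fixed length (no truncation is performed)."""
    return tokens + [pad_token] * max(0, length - len(tokens))


if __name__ == "__main__":
    # Two quarter notes (C on beat 1, E on beat 2) in a single bar.
    print(pad(encode_melody([[(0, 0, 1.0), (4, 4, 1.0)]]), 10))
```

A chord sequence would follow the same pattern with a chord token in place of the note token.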
The dataset contains various types of chords. However, in this study, we follow Sun et al. [21] and consider 96 types of chords, including major, minor, augmented, diminished, suspended, major seventh, minor seventh, and dominant seventh chords.
The architecture of the modified rVTHarm model is shown in Figure 1. Our model consists of three key modules: melody encoder, context encoder, and chord decoder, denoted by
The context encoder uses the input chord sequences and outputs a compressed latent variable
However, training with our model alone cannot provide intuitive control over chord progression. Therefore, following rVTHarm, we utilize the latent space regularization method proposed by Pati et al. [19] for the loss function to control the musical attributes of the generated results. In this study, we select the number of chord types in the progression, called chord coverage, and rhythmic complexity (Section 3.3) as musical attributes.
To compute this loss function, an attribute distance matrix
where
where MSE is the mean squared error. The formulation in Eq. 6 is expected to make the values of the regularized dimension follow changes in the attribute values, such as a transition from high to low.
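As a rough illustration of this kind of regularization, the sketch below assumes the pairwise-distance form described by Pati and Lerch [19]: differences along the regularized latent dimension are squashed with a tanh and compared against the sign of the corresponding attribute differences, here using the mean squared error named above. The function name, the scaling parameter `delta`, and the plain-Python formulation are assumptions for illustration and may differ from the exact loss used in this study.

```python
import math
from typing import Sequence


def sign(x: float) -> int:
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def attribute_regularization_loss(
    z_reg: Sequence[float], attrs: Sequence[float], delta: float = 1.0
) -> float:
    """MSE between tanh(delta * latent differences) and sign(attribute differences).

    z_reg: values of the regularized latent dimension, one per sample in the batch.
    attrs: attribute values (e.g., chord coverage) for the same samples.
    delta: assumed hyperparameter controlling the spread of the latent values.
    """
    n = len(z_reg)
    if n == 0 or n != len(attrs):
        raise ValueError("z_reg and attrs must be non-empty and of equal length")
    total = 0.0
    for i in range(n):
        for j in range(n):
            d_latent = z_reg[i] - z_reg[j]
            d_attr = attrs[i] - attrs[j]
            total += (math.tanh(delta * d_latent) - sign(d_attr)) ** 2
    return total / (n * n)


if __name__ == "__main__":
    attrs = [1.0, 2.0, 3.0]
    print(attribute_regularization_loss([-2.0, 0.0, 2.0], attrs))  # latent order matches
    print(attribute_regularization_loss([2.0, 0.0, -2.0], attrs))  # latent order reversed
```

The loss is small when samples with larger attribute values also have larger values in the regularized dimension, which is what makes that dimension usable as a control.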
The training objective is expressed using Eq. 7:
where
Finally, we present the hyperparameters and training conditions used in the model. The melody- and chord-embedding sizes were set to 256. A hidden size of 256 was used, with four attention heads and four attention blocks. The latent variable size
The regularization method of Pati et al. [19] can compute the regularization loss if the attribute values are computable. Therefore, we introduce a measure called "rhythmic complexity," which measures the level of complexity in chord rhythm patterns and can be converted into attribute values.
A chord is a group of three or more notes that sound together and are built on a root note to form a harmonic unit. Chord pitch events govern the time until the next pitch event. Thus, by focusing on the temporal position of the onset of each event, the relationships between the sound durations can be treated as rhythmic patterns created by placing events at specific positions. The regular occurrence of events at certain intervals produces a pulse, and the interval between the pulses is called the beat. In this study, we aim to create rhythmic complexity by determining the location of events within a beat.
The inspiration for rhythmic complexity comes from cognitive music theory [22], which considers melody as a hierarchical structure rather than a flat sequence of notes. The positions at which events can be placed are obtained by dividing the beat; various positions are obtained by repeatedly dividing the beat by prime numbers. In music, binary or ternary divisions are commonly used; divisions by five or more are rarely used because they become too complex. For instance, five can be expressed as a combination of binary and ternary divisions (2+3), and seven can be expressed as a combination of binary, binary, and ternary divisions (2+2+3). Therefore, the basis for generating rhythmic patterns can be simplified into binary and ternary divisions. In this study, we focused only on the binary division for simplicity.
An overview of the division method is presented in Figure 2. Level 1 represents the position without division. Levels 2 to 5 represent the positions obtained by one, two, three, and four divisions, respectively. For example, in a 4/4 time signature, Levels 1 and 5 represent the positions obtained with a whole note and a sixteenth note, respectively. While it is possible to consider Levels 6 and higher, notes shorter than a thirty-second note are not significant as rhythmic elements of chords; thus, we limit our consideration to Level 5. Consequently, there are 16 positions within a measure, each with a different significance in the rhythmic pattern.
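The binary division described above can be sketched as a function that maps each of the 16 positions in a measure to its level. The function name `metrical_level` and the indexing of positions as sixteenth-note steps from 0 to 15 are assumptions for illustration; the scores assigned to each level are not reproduced here.

```python
def metrical_level(position: int, steps_per_measure: int = 16) -> int:
    """Return the division level (1-5 for 16 steps) of a position within a measure.

    Level 1 is the undivided measure (the downbeat); each further binary
    division of the measure adds one level.
    """
    if not 0 <= position < steps_per_measure:
        raise ValueError("position must lie within the measure")
    level = 1
    span = steps_per_measure
    # Halve the span until the position falls on a boundary of that span.
    while position % span != 0:
        span //= 2
        level += 1
    return level


if __name__ == "__main__":
    for pos in range(16):
        print(pos, metrical_level(pos))
```

Under this indexing, position 0 is Level 1, position 8 (the half-measure point) is Level 2, positions 4 and 12 are Level 3, the remaining even positions are Level 4, and all odd positions are Level 5.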
In practice, rhythmic complexity is determined by assigning scores to positions within a measure, as presented in Table 2. These scores are calculated based on the depths of the levels shown in Figure 2.
Table 2: Metrical complexity score (MCS).
Objective evaluations of the proposed method were conducted. In this section, we describe the experimental setup.
In our experiments, we utilized the Hooktheory Lead Sheet Dataset (HLSD) [23] that provides a diverse range of rhythmic chord sequences to evaluate our proposed method. The dataset was collected from a user-contributed platform, Hooktheory, and consisted of high-quality human-composed melodies along with corresponding chord accompaniments.
After keeping only scores with a 4/4 time signature and removing those with no chord or melody, the remaining 16,110 scores were divided into training, validation, and test sets at a ratio of 8:1:1.
The proposed method was evaluated using the six objective metrics proposed in [2]. The first three metrics measure the quality of the chord progression, and the last three measure the harmonicity between the melody and chords:
Chord histogram entropy (CHE): The entropy of the chord histogram.
Chord coverage (CC): The number of chord types in the chord sequence.
Chord tonal distance (CTD): The tonal distance between adjacent chords when they are represented by 6-D feature vectors [24].
Chord tone to non-chord tone ratio (CTnCTR): The ratio of chord tones to non-chord tones.
Pitch consonance score (PCS): The average of the consonance scores based on pitch intervals between the melody note and corresponding chord notes.
Melody-chord tonal distance (MCTD): The tonal distance between a melody note and its aligned chord when they are represented by 6-D feature vectors.
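The first two metrics in this list are simple enough to sketch directly. The example below assumes chords are given as label strings and uses the natural logarithm for the entropy; the logarithm base and the function names are assumptions for illustration, not details confirmed by [2] as quoted here.

```python
import math
from collections import Counter
from typing import Sequence


def chord_histogram_entropy(chords: Sequence[str]) -> float:
    """Entropy of the normalized histogram of chord labels (natural log assumed)."""
    counts = Counter(chords)
    total = sum(counts.values())
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log(p)
    return entropy


def chord_coverage(chords: Sequence[str]) -> int:
    """Number of distinct chord labels in the sequence."""
    return len(set(chords))


if __name__ == "__main__":
    progression = ["C", "G", "Am", "F", "C"]
    print(chord_histogram_entropy(progression), chord_coverage(progression))
```

A progression that repeats a single chord has zero entropy and a coverage of one, whereas a progression that spreads evenly over many chord types scores high on both.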
The temporal regularity and perceptibility of the beat pattern determine rhythmic complexity. A rhythm with regular and equidistant beats has low complexity, whereas a sequence of sounds with no perceivable order has high complexity. In addition, as the duration of chords decreases, the amount of information per unit time increases, increasing the complexity. Sound grouping is more likely when sounds are temporally close, resulting in higher complexity. Based on the premise that diverse rhythms increase complexity, this study considers the following four components that complicate rhythms: types of note values, rhythm density, position within a measure, and relative duration of adjacent chords.
Because the six metrics above do not evaluate the rhythm of the chords, we propose four additional metrics.
Harmonic rhythm histogram entropy (HRHE): This metric calculates the entropy of the histogram of |R|, which counts the frequency of chord durations within the chord sequence.
Harmonic rhythm density (HRD): The number of chords per measure.
Metrical complexity score (MCS): The average score calculated from the position within a measure, as listed in Table 2.
Chord duration ratio (CDR): The relative duration of adjacent chords. The complexity increases as the relative duration of adjacent chords increases; however, integer multiples are considered less complex. For instance, the duration ratio between a half note and a quarter note (2:1) is less complex than that between a dotted quarter note and a quarter note (3/2:1). This metric is calculated as follows:
where
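Of the four rhythm metrics, HRHE and HRD can be sketched from their descriptions alone. The example below assumes chord durations are given in quarter-note units and uses the natural logarithm; the function names and the logarithm base are illustrative assumptions. MCS and CDR are omitted because they depend on the scores in Table 2 and on the CDR formula.

```python
import math
from collections import Counter
from typing import Sequence


def harmonic_rhythm_histogram_entropy(durations: Sequence[float]) -> float:
    """Entropy of the histogram of chord durations (natural log assumed)."""
    counts = Counter(durations)
    total = sum(counts.values())
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log(p)
    return entropy


def harmonic_rhythm_density(num_chords: int, num_measures: int) -> float:
    """Average number of chords per measure."""
    if num_measures <= 0:
        raise ValueError("num_measures must be positive")
    return num_chords / num_measures


if __name__ == "__main__":
    # Four measures of 4/4: two whole-measure chords, then four half-measure chords.
    durations = [4.0, 4.0, 2.0, 2.0, 2.0, 2.0]
    print(harmonic_rhythm_histogram_entropy(durations))
    print(harmonic_rhythm_density(len(durations), 4))
```

A progression with one chord per measure throughout gives an HRHE of zero and an HRD of one, which is the pattern a fixed harmonic rhythm produces.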
The objective evaluation results with the six evaluation metrics proposed in [2] are presented in Table 3, and the results with the evaluation metrics for rhythmic complexity are listed in Table 4. Our goal is to control melody harmonization intuitively using interpretable parameters. In this study, we assume that users can manipulate two parameters:
First, we examine the proposed model using different values of
Table 3: Objective evaluation results using six evaluation metrics [2].
| Metrics | CHE | CC | CTD | CTnCTR | PCS | MCTD |
|---|---|---|---|---|---|---|
|  | 0.552 | 1.918 | 0.885 | 0.663 | 1.044 | 1.475 |
|  | 1.047 | 3.177 | 1.062 | 0.643 | 0.980 | 1.498 |
|  | 1.184 | 3.945 | 1.098 | 0.642 | 0.955 | 1.497 |
| Human | 1.497 | 5.450 | 1.038 | 0.807 | 1.535 | 1.365 |
In the generated samples shown in Figure 3, the number of chord types used in the generated chord progressions increases as
Next, we examine our model using different values of
Table 4: Objective evaluation results using the evaluation metrics for rhythmic complexity.
| Metrics | HRHE | HRD | MCS | CDR |
|---|---|---|---|---|
|  | 0.000 | 0.998 | 0.000 | 0.999 |
|  | 0.018 | 1.371 | 0.372 | 1.016 |
|  | 0.529 | 2.662 | 4.007 | 2.098 |
| Human | 0.703 | 1.693 | 1.627 | 1.746 |
In the generated samples in Figure 4, the positions within the measure of the generated chords become more diverse as parameter
However, as observed from Table 5 (bottom), a strong negative correlation exists between the controlled parameters and the evaluation metrics, indicating that controlling one attribute affects other uncontrolled attributes.
Table 5: Comparison of correlation coefficients between parameters and musical attributes (CC and MCS). Generated results are from different values of
Metric | rVTHarm [5] | Ours | Ours |
---|---|---|---|
CC | 0.451 | 0.993 | -0.909 |
MCS | N/A | -0.965 | 0.905 |
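A correlation of this kind between a control parameter and a measured attribute can be computed as sketched below. The text does not state which correlation coefficient was used, so the Pearson coefficient here is an assumption, and the parameter values and attribute values in the example are made-up placeholders rather than results from this study.

```python
import math
from typing import Sequence


def pearson_correlation(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Placeholder control values and the attribute measured on each generated sample.
    control_values = [-2.0, -1.0, 0.0, 1.0, 2.0]
    measured_attribute = [1.0, 2.0, 2.5, 3.5, 4.0]
    print(pearson_correlation(control_values, measured_attribute))
```

A coefficient near 1 means the attribute rises steadily with the control parameter, and a coefficient near -1 means it falls, as happens for the attribute that is not being controlled.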
This study introduced a novel measure of "rhythmic complexity" and proposed a melody harmonization technique based on a transformer architecture that explicitly manipulates musical characteristics, particularly the complexity level of chord rhythm patterns. Controlling interpretable parameters enables the proposed method to generate chord progressions with diverse rhythms comparable to those created by humans. Compared to previous methods, the proposed approach demonstrates superior controllability, with a high correlation between the controllable musical attributes and the generated outcomes. These results indicate that the proposed method is a valuable resource for music composition and production.
In the future, we will aim to improve the proposed model further to enable the independent control of multiple musical attributes. Furthermore, it is worth exploring methods that consider the melody structure. While this study focused on the rhythmic patterns of chords, the structure of the melody is also essential for musical appeal. Developing a method that clearly expresses the characteristics of a melody, such as utilizing pre-trained models, would enable more advanced music generation.
This study did not include human participants, animals, or other sensitive data. Ethics approval was not required for this study. The authors declare no conflicts of interest.