
Phrase-Level Symbolic Music Generation

We present a small-scale framework for generating phrase-level symbolic music with limited training data. Human evaluation shows that our framework outperforms state-of-the-art transformer-based models in overall quality and rhythmic regularity.

Published on Aug 29, 2024

Abstract

Music is organized into a hierarchical structure with repetition at different time scales, from beats, phrases, and themes to sections. Transformer-based models successfully generate music that exhibits some level of periodicity in minute-long excerpts. Recent large language models have extended the coherence to a longer range in certain types of music. Nevertheless, it is still challenging to impose necessary musical structure constraints, such as the recurrence of sections within a movement of a sonata. Since music composition is highly constrained by structure, an alternative approach is to generate musical phrases guided by a rule-based structure representation. In this study, we contribute a novel framework that generates complete and coherent musical phrases using an encoder-decoder architecture. We also propose REMI Lite, which tailors REMI[1], a music representation with a quantized time grid, to learn phrase endings in classical music while maintaining rhythmic regularity. Human evaluation shows that our approach achieves high generation quality when trained on only 195 piano sonata movements. Our novel finger-tapping experiment also validates that the representation allows for producing music with steady perceived beats.

Author Keywords

AI Music Generation, Music Representation, AI Music Evaluation

Introduction

Automatic music generation has gained much attention since transformer-based models such as Music Transformer[2] and MuseNet1 proved capable of modeling long-term dependencies in musical sequences. Variants of transformers such as Transformer XL[3], Perceiver AR[4], LLaMA[5], and LLaMA2[6] have further extended the context window to full-length music. While these powerful new tools have yielded state-of-the-art performance in modeling coherence over longer sequences, recent efforts in generating music conditioned on melodies, chords, rhythms, high-level text descriptions, etc. have demonstrated the potential to produce more desirable results by incorporating human-interpretable controls.

Controllable music features in existing models are mainly manifested as latent representations of these features or as conditional inputs in a format compatible with the music representation. Beyond harmony, melody, and rhythm, which are commonly used in controllable music generation, structural coherence has yet to be explored in depth. Structure is a fundamental compositional feature of music, especially in classical music, which generally has a clear musical form with a hierarchical grouping of recurring units including sections, phrases, ideas, motives, etc. For example, sonata form follows a strict three-part overall structure, namely Exposition, Development, and Recapitulation; the Recapitulation is essentially a repeat of the Exposition without the modulation to the dominant key at the end of the section[7]. In addition, musical structure plays an important role in music perception. Psychological and neural evidence suggests that hierarchical perception of music is organized in a bottom-up fashion, from beat and meter to phrases and sections. When listening to naturalistic classical music, audiences are first entrained to low-level rhythmic structures such as steady beats and meters, then kept engaged by the tension and release constantly induced by violations of expectation at boundaries between phrase structures[8][9][10][11][12]. However, music from recent music generation systems is often strikingly different from human-composed music in terms of structure[13]. Learning high-level structural coherence relies on all relevant context from an input window extended to the entire piece of music. Taking the piano sonatas in this study as an example, the input sequence for a whole movement contains up to 21,299 tokens under MIDI-like[14][2] tokenization, which is challenging because the memory complexity of intermediate information in transformer-based models is quadratic in the input length.

While high-level structural coherence is a computational bottleneck, it guides real-world music composition. Music is highly constrained, with internally connected sections and recurring phrases, which makes it possible to break the computational requirements down into phrase-level generation and phrase-relation modeling over inputs of reasonable length. Therefore, we start by improving the quality of single-phrase generation.

A phrase is defined as the smallest musical unit that conveys a complete musical thought[15]. The end of a phrase is marked by a cadence, usually perceived as a rhythmic or melodic articulation, which is yet to be carefully modeled. Phrases also exhibit a clear beat-meter structure and remarkable rhythmic regularity, which we seek to address in this work.

We propose a novel framework for polyphonic, phrase-level symbolic music generation, targeting piano sonatas from the Classical era. Inspired by the music inpainting task in real-world composition, the framework is designed to finish a musical phrase conditioned on primer bars at positions not limited to the beginning. The framework entails a new score-based representation, REMI Lite, and an encoder-decoder model, with the bidirectional encoder enhancing within-phrase coherence and the auto-regressive decoder generating new sequences. The model is fine-tuned on automatically segmented phrases to learn phrase endings. Evaluation of the proposed framework includes a self-report survey on the overall impression of the generated samples and a novel finger-tapping task to assess rhythmic regularity.

Our major contributions are as follows:

  1. We propose a framework for single-phrase, score-based, symbolic music generation that achieves higher quality than a Music Transformer baseline with a sequence length of 512 when trained on a small dataset. The proposed framework outperforms the baseline in both reconstruction loss and human evaluation.

  2. We adapt REMI tokenization to our corpus and develop REMI Lite which allows for flexible meter representation, higher temporal resolution, and phrase-ending learning.

  3. We propose a novel finger-tapping approach for examining perceived rhythmic regularity in AI-generated music, which mitigates the information bias possibly induced by regular self-report approaches.

Related Works

Automatic music generation dates back to 1959[16]. In recent years, transformer-based models have become the mainstream tools in this area since the attention mechanism[17] was introduced into music modeling by Music Transformer[2]. Currently, music generation systems are mainly improved from the following perspectives:

  1. modeling long-range dependencies;

  2. optimizing music representations for different types of inputs;

  3. imposing controls over high-level components such as melody contour, rhythmic structure, chord progression, emotions, etc.

As music performances can easily contain hundreds of thousands of elements even after compression, the key challenge is to enable long-range pattern recognition without adding extraneous computational complexity. Initial breakthroughs by Music Transformer[2] opened the way to modeling longer-term coherence in symbolic music generation with relative positional encoding. Other approaches reduce the cost of handling long sequences by truncating attention[3], introducing sparsity into the attention[18], or using cross-attention to compress the computation[19][20][21]. PopMAG models harmonic progression over prolonged contexts for accompaniment generation in pop songs using a recurrent Transformer[22]; Pop Music Transformer[1] generates 12-bar piano pop song excerpts with Transformer-XL[3]; Perceiver AR extends attention from multiple bars to the entire piece, successfully generating high-level structure to some degree[4]. However, these models are still far from reproducing section-level repetition and the clear beat-bar-phrase hierarchy that matters for music composition and perception.

The input to a music generation system can be either audio or symbolic music; we limit our discussion to symbolic music in this work. Pitch, velocity, track, and temporal information are the four types of attributes usually encoded in symbolic music representations. The temporal information entails note onsets, offsets, or durations. We categorize these representations by whether temporal information is quantized into smaller units. Representations using absolute time, such as MIDI-like, support expressive performance. MIDI-like was first introduced in Performance RNN[14] and improved in Music Transformer[2] with richer velocity levels and higher temporal resolution to reconstruct dynamics and grooves in music performance. Quantizing the temporal information, as in Multi-Track Music (MMM)[23], Compound Words (CW)[24], Revamped MIDI-derived events (REMI)[1], and REMI+[25], largely enhances rhythmic consistency in music modeling and generation. These representations also incorporate bar information estimated by beat and downbeat tracking to impose metrical structure.

Efforts have also been made to increase the controllability of these models. Control conditions are imposed as latent codes learned by a secondary model, such as a Variational Auto-Encoder (VAE)[26], on top of the backbone generation model: Music FaderNet[27] generates music guided by high-level valence and arousal through latent clustering; FIGARO[25] combines description-to-sequence encodings either learned or provided by experts; Wei et al.[28] disentangle representations of rhythmic pattern and chord progression for conditional music generation. Recently, Large Language Models (LLMs) have emerged as a viable path for text-to-music generation. By leveraging the ability of LLaMA2[6] to understand music theory expressed as text, ChatMusician[29] is capable of composing well-structured, full-length music conditioned on text descriptions as well as chords, motifs, and melodies in text-compatible ABC notation, but it still faces limitations such as predominantly generating music in the style of Irish folk music, a consequence of the significant portion of training data in that genre. Another downside of pre-trained LLMs for music generation is that fine-tuning LLMs on non-text data can be difficult, and not all music can be effectively represented in a text-compatible way. For example, ABC notation, adopted in ChatMusician, is only suitable for presenting part-by-part music, not polyphonic music without clear voicing.

Methods and Materials

Proposed Framework

We propose a novel framework for phrase-level symbolic music generation based on musical scores in Humdrum notation. In the training stage, scores are tokenized and chunked into event segments for pre-training the model. The model is then fine-tuned on event phrases to learn the phrase endings. During the inference stage, the model generates original event sequences conditioned on event prompts, which can be one or multiple bars of notes at any position in a phrase. The output event segments are then converted to MIDI and audio files for playback. Figure 1 shows a diagram of the proposed framework.

Figure 1

The proposed framework. The black, gray, and dotted lines mark the pre-training, fine-tuning, and inference workflow respectively.

Symbolic Music Representation

This section describes how we designed the new representation, REMI Lite, for compositional purposes. We also discuss how we adapt REMI[1], an existing representation using a position-bar time grid, to our corpus.

We seek to impose a beat-bar-phrase hierarchy in REMI Lite. It is important to highlight that the beat-bar structure is readily available from scores in our dataset, which eliminates the extraneous quantization error from beat and downbeat tracking since downbeat estimation in classical music is notoriously difficult.

Note Representation

As in REMI, notes are represented by sets of Note-On, Pitch, and Duration tokens, and are arranged in the ascending order of Note-On tokens. Notes with the same onset share one Note-On token.

We increase the temporal resolution of note onsets and durations in REMI Lite. Instead of dividing a bar into 16 intervals as in REMI, we quantize note onsets and durations with a minimum unit of 1/24 or 1/32 of a quarter note, which allows for capturing fine-grained rhythmic changes in classical music. Figure 2 shows how notes are tokenized.

Figure 2

An example of note representation in REMI Lite. The highlighted notes are encoded as tokens on the right.
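To make the token layout concrete, the following Python sketch maps notes onto Note-On, Pitch, and Duration tokens on a quantized grid. It is a minimal illustration under our own assumptions: the token strings, the choice of a 1/24 grid, and the input format are ours, not the exact implementation.

```python
from fractions import Fraction

# Assumed grid: 1/24 of a quarter note (the text also allows 1/32);
# token strings are illustrative, not the authors' exact vocabulary.
GRID = Fraction(1, 24)  # fraction of a quarter note

def quantize(value_in_quarters) -> int:
    """Snap an onset or duration (in quarter notes) to the nearest grid step."""
    return round(Fraction(value_in_quarters) / GRID)

def tokenize_notes(notes):
    """notes: list of (onset_in_quarters, pitch, duration_in_quarters).
    Notes sharing an onset share one Note-On token, as described above."""
    tokens, last_onset = [], None
    for onset, pitch, duration in sorted(notes, key=lambda n: n[0]):
        q_onset = quantize(onset)
        if q_onset != last_onset:
            tokens.append(f"Note-On_{q_onset}")
            last_onset = q_onset
        tokens.append(f"Pitch_{pitch}")
        tokens.append(f"Duration_{quantize(duration)}")
    return tokens

# Example: a C major triad on beat 1 followed by an eighth note on beat 2
print(tokenize_notes([(0, 60, 1), (0, 64, 1), (0, 67, 1), (1, 72, Fraction(1, 2))]))
```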

Bar Representation

We adopt the Bar tokens from REMI to mark bar lines. To determine the number of tokens to be predicted during the inference stage, each bar is padded to a fixed length with Bar-Pad tokens. In this study, the bar length is set to 64 tokens, which covers 95.41% of the bars in our dataset. Bar-Pad tokens are not applied to the first and the last bar in a phrase. Note that Bar-Pad tokens should be distinguished from regular padding tokens, as we only compute the reconstruction loss over the Bar-Pad tokens.

Phrase Representation

A musical phrase consists of multiple bars and is equivalent to a sentence in written language. As there are no explicit symbols marking the end of a phrase, like a period in a sentence, we apply an End-of-Phrase token to make the model aware of the phrase endings.

We assume that time signature and tempo are consistent within a phrase in music composition, and thereby prepend a pair of Tempo and Time Signature tokens to the beginning of each phrase. Figure 3 shows how phrases are encoded.

Figure 3

An example of how a phrase is encoded. A phrase starts with a pair of time-signature and tempo tokens, followed by notes and Bar-Pad tokens separated by Bar tokens, and ends with an End-of-Phrase token. There are 64 tokens between two Bar tokens.
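A minimal sketch of the phrase layout just described, assuming a 64-token bar budget and the illustrative token strings from the previous sketch; the exact vocabulary and padding details are our assumptions.

```python
BAR_LEN = 64  # fixed token budget between two Bar tokens, as described above

def encode_phrase(bars, tempo, time_signature):
    """bars: list of per-bar token lists (Note-On/Pitch/Duration tokens).
    Returns the phrase sequence: header, bars separated by Bar tokens,
    inner bars padded with Bar-Pad, and a closing End-of-Phrase token."""
    seq = [f"Tempo_{tempo}", f"Time-Signature_{time_signature}"]
    for i, bar_tokens in enumerate(bars):
        seq.append("Bar")
        seq.extend(bar_tokens)
        # The first and last bars are not padded; per the text above,
        # Bar-Pad differs from regular padding in how the loss treats it.
        if 0 < i < len(bars) - 1:
            seq.extend(["Bar-Pad"] * (BAR_LEN - len(bar_tokens)))
    seq.append("End-of-Phrase")
    return seq
```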

Expressive dynamics, i.e. velocity, and articulations are excluded because they are less relevant to compositional structure. We remove trills and grace notes to simplify the representation. We skip the Chord event tokens because harmonic features can be picked up through local dependencies of notes by the transformer and other sequential models. There is no representation for tracks, as our corpus is limited to piano music. REMI Lite uses fewer tokens to represent a note, thereby compressing longer musical sequences into a limited input length.

Model Architecture

In this study, phrase-level generation is approached as an inpainting problem, i.e. in-filling the missing bars in a phrase conditioned on the bars given as a primer, to use context from both directions. Inspired by bidirectional masked language models such as BART[30] and MASS[31], the backbone sequence-to-sequence model is implemented as a bidirectional encoder trained over masked data and a left-to-right autoregressive decoder. The model is trained by optimizing the reconstruction loss computed on the masked part. It is also a new attempt to generate musical phrases with a progressive masking strategy.

We adopt the standard Transformer encoder-decoder architecture and adapt it to our data by (1) reducing the number of layers in the encoder and decoder from 6 to 4, because our dataset is much smaller than most datasets for language tasks, and (2) replacing the non-linear activation with GELU.

The major contribution is the bar-level masking strategy. Instead of masking individual tokens, we mask out consecutive tokens within one or multiple bars. The model is trained with two types of masking: either 50% of the bars are selected at random, or a block of consecutive bars in the middle of a phrase is selected. The masked sequences are fed to the encoder, and the tokens from the masked part are fed to the decoder for auto-regressive generation. Note that inputs to both the encoder and the decoder start with a pair of tempo and time signature tokens. Loss is computed for masked positions only. See Figure 4 for more details. The masking strategy is designed for the inpainting task; it also effectively augments the dataset when training on a limited number of movements.

Figure 4

Two types of masking applied during the training stage. We omit the tempo and time signature tokens for display purposes only. Each block in the input sequences stands for all the tokens within a bar.
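The two masking schemes in Figure 4 could be sketched as follows. Replacing every token in a masked bar with a Mask token, the 50% ratio, and the helper names are our illustrative assumptions, not the exact implementation.

```python
import random

def mask_random_bars(bars, ratio=0.5):
    """Randomly mask 50% of the bars; returns (encoder_input_bars, target_bars)."""
    k = max(1, int(len(bars) * ratio))
    masked = set(random.sample(range(len(bars)), k))
    enc = [["Mask"] * len(b) if i in masked else b for i, b in enumerate(bars)]
    tgt = [b for i, b in enumerate(bars) if i in masked]
    return enc, tgt

def mask_middle_bars(bars, span=2):
    """Mask a block of consecutive bars in the middle of the phrase."""
    start = (len(bars) - span) // 2
    masked = set(range(start, start + span))
    enc = [["Mask"] * len(b) if i in masked else b for i, b in enumerate(bars)]
    tgt = [b for i, b in enumerate(bars) if i in masked]
    return enc, tgt

# The encoder sees enc (flattened, with the tempo/time-signature header);
# the decoder reproduces tgt, and the loss covers masked positions only.
```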

Training

The model is first pre-trained on event segments regardless of phrase endings. To increase the number of samples in the training set, we chunk the event sequence of a full-length movement into overlapping 8-bar segments with a hop size of 2 bars, as sketched below. Since these segments are not sampled as phrases, the End-of-Phrase tokens are not appended during the pre-training stage. We then fine-tune the model on phrases to learn the phrase endings.
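A sketch of the pre-training chunking step; the per-bar list representation and function name are our assumptions, while the 8-bar window and 2-bar hop follow the text.

```python
def chunk_into_segments(bars, segment_len=8, hop=2):
    """Slice a full movement (a list of per-bar token lists) into overlapping
    8-bar segments with a 2-bar hop for pre-training. End-of-Phrase tokens
    are deliberately not appended to these segments."""
    return [bars[start:start + segment_len]
            for start in range(0, len(bars) - segment_len + 1, hop)]

# e.g. a 20-bar movement yields (20 - 8) // 2 + 1 = 7 overlapping segments
```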

Phrase Segmentation

The phrases are annotated automatically using an unsupervised segmentation approach that detects phrase boundaries by grouping similar patterns[32]. Specifically, we use a triplet mining method that leverages common audio descriptors for clustering[33]. We computed beat-synchronized log-scaled Mel spectrogram feature patches on the audio signals converted from the scores and trained an encoder on top of these features. The encoder is a 3-layer convolutional neural network that assigns high similarity values to pairs of feature patches sampled from the same phrase and low values otherwise. The embeddings returned by the model are then used as audio descriptors for clustering. Finally, we map the detected phrase boundaries back onto the scores.
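The segmentation encoder might look roughly like the sketch below. Only the 3-layer convolutional structure, the beat-synchronized log-mel input, and the triplet objective come from the description above; the layer widths, patch size, embedding dimension, and margin are placeholder assumptions.

```python
import torch
import torch.nn as nn

class PatchEncoder(nn.Module):
    """A 3-layer CNN mapping beat-synchronized log-mel patches to embeddings.
    Patch shape (1, n_mels, n_beats) and embedding size are assumed here."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, emb_dim),
        )

    def forward(self, x):
        return self.net(x)

# Triplet objective: patches from the same phrase (anchor, positive) should
# embed closer together than patches from different phrases (anchor, negative).
encoder = PatchEncoder()
triplet_loss = nn.TripletMarginLoss(margin=1.0)
anchor, positive, negative = (torch.randn(8, 1, 80, 16) for _ in range(3))
loss = triplet_loss(encoder(anchor), encoder(positive), encoder(negative))
```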

Inference

For inference, the encoder can be primed with one or multiple bars at any position within a phrase. The encoder receives a sequence of primer notes, with Mask tokens placed at the other positions. Unlike most existing generative models, the primer does not have to be the beginning of a phrase. The decoder generates output auto-regressively as a continuation of the decoder input, as shown in Figure 5. The input to the decoder is essentially all the note tokens before the masked part of the encoder input. In the extreme case where no primer notes are given, the model is equivalent to a decoder-only auto-regressive model.

Figure 5

Examples of inputs to the encoder and decoder in the inference stage. The decoder continues the generation starting from the masked part of the encoder input. When the first bar of the encoder input is masked, as in (b), the tempo and time signature tokens serve as the BOS token in the decoder input, indicating the beginning of auto-regressive generation.
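A sketch of how the encoder and decoder inputs could be assembled at inference time, following Figure 5; the data structures and helper name are ours, with token strings as in the earlier sketches.

```python
def build_inference_inputs(primer_bars, total_bars, header, bar_len=64):
    """primer_bars: dict {bar_index: token list} for the bars given as primer;
    total_bars: number of bars in the phrase to be generated;
    header: [Tempo, Time-Signature] tokens prepended to both inputs."""
    encoder_input = list(header)
    for i in range(total_bars):
        encoder_input.append("Bar")
        encoder_input.extend(primer_bars.get(i, ["Mask"] * bar_len))
    # The decoder is primed with everything before the first masked bar;
    # if the first bar is masked, the header alone acts as the BOS.
    first_masked = min((i for i in range(total_bars) if i not in primer_bars),
                       default=total_bars)
    decoder_input = list(header)
    for i in range(first_masked):
        decoder_input.append("Bar")
        decoder_input.extend(primer_bars[i])
    return encoder_input, decoder_input
```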

Dataset

The dataset consists of 244 piano sonata movements from four Classical-era composers (Mozart, Beethoven, Haydn, and Scarlatti), drawn from the KernScores2 library, a collection of digital music scores with Humdrum notation in **kern format[34]. Note that for Beethoven, we only use sonatas composed before 1810. Table 1 shows the number of movements from each composer after removing scores that cannot be parsed. We split the movements into 80% training and 20% validation.

Table 1

| Composer | Number of Movements |
| --- | --- |
| Mozart | 69 |
| Beethoven | 85 |
| Haydn | 25 |
| Scarlatti | 65 |

The music in this dataset is highly consistent in style and shares uniformity in structure. We further reduce the data variance by transposing all the movements to C major or C minor based on the key signature on the scores. This enables the model to learn relative harmonic relationships and chord progressions regardless of the key. To avoid introducing octave variance during key transposition, we raise the pitch for movements in G, G#, A, A#, and B and lower the pitch for the rest.
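The transposition rule can be written down directly; pitch classes are relative to C, and the helper below is our own formulation of the rule stated above.

```python
# Keys G (7) through B (11) are shifted up to the C above; everything else
# (C# through F#) is shifted down to the C below, so no movement moves by
# more than a tritone.
def transpose_interval(key_pitch_class: int) -> int:
    """Return the signed semitone shift that maps the key to C."""
    if key_pitch_class == 0:          # already in C
        return 0
    if key_pitch_class >= 7:          # G, G#, A, A#, B -> raise
        return 12 - key_pitch_class
    return -key_pitch_class           # C# .. F# -> lower

assert transpose_interval(9) == 3     # A major -> up a minor third to C
assert transpose_interval(4) == -4    # E major -> down a major third to C
```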

Results

Baselines and Model Settings

We consider two variants of the Transformer architecture with different tokenization strategies as baselines for evaluation. In Baselines 1 and 2, we faithfully follow the original settings of Music Transformer with global attention, except that we apply MIDI-like tokenization to the converted MIDI files in Baseline 1 and use REMI Lite in Baseline 2. In Baseline 3, we keep the original BERT[35] architecture and its data corruption strategy but use the REMI Lite representation.

To ensure that the baselines have the same number of parameters as our proposed model, all models are trained with N = 8 (or N_encoder = N_decoder = 4) layers, M = 8 attention heads, and a dropout rate of d = 0.1. The hidden dimensions of the attention layers and feedforward layers are set to d_hidden = 768 and d_ff = 1024 respectively. We reduce the input sequence length to 512, which is long enough to handle 8-bar segments in the REMI Lite representation. All models are trained on an NVIDIA V100 GPU with a batch size of 48.
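For reference, the reported hyperparameters correspond roughly to the PyTorch configuration below; using nn.Transformer as the concrete module is our assumption, not necessarily how the models were implemented.

```python
import torch.nn as nn

# As reported: 4 encoder + 4 decoder layers, 8 heads, d_hidden = 768,
# d_ff = 1024, dropout = 0.1, GELU activation, max input length 512.
model = nn.Transformer(
    d_model=768,
    nhead=8,
    num_encoder_layers=4,
    num_decoder_layers=4,
    dim_feedforward=1024,
    dropout=0.1,
    activation="gelu",
    batch_first=True,
)
```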

Table 2 reports the note-wise validation negative log-likelihood (NLL) of all models on our dataset. NLL is improved by using the Encoder-Decoder model, REMI Lite tokenization, and Bar-level masking when trained on a small dataset. To our surprise, although Baseline 3 reaches the second-lowest NLL among all approaches, it fails to generate meaningful outputs.

Table 2

| | Model | Representation | Validation NLL |
| --- | --- | --- | --- |
| Baseline 1 | Music Transformer | MIDI-like | 1.4096 |
| Baseline 2 | Music Transformer | REMI Lite | 1.2669 |
| Baseline 3 | BERT (Encoder-Decoder w/ Dynamic Mask) | REMI Lite | 0.7487 |
| Ours | Encoder-Decoder w/ Bar-level Mask | REMI Lite | 0.6799 |

Evaluation

Human evaluation is still the most reliable way to assess whether AI-generated music is comparable to naturalistic music. A “good” generated music excerpt is expected to be consistent with the primer sequence regarding overall impression, tempo, and metrical structure. To this end, we carried out an online listening test with two independent sessions, investigating perceived quality and rhythmic regularity respectively.

Stimuli

We selected 10 musical excerpts with different tempi and time signatures from sonatas in the validation set and converted them to audio signals to form the Reference set. The duration of the excerpts ranges from 10 to 20 seconds. The tempi of the Reference excerpts are set as in Table 3 to ensure that the beat inter-onset intervals (IOIs) fall within the 200-1800 msec range suitable for in-phase tapping[36].

Table 3

| | Time Signature | Tempo (BPM) |
| --- | --- | --- |
| Excerpt 1 | 4/4 | 120 |
| Excerpt 2 | 2/4 | 120 |
| Excerpt 3 | 4/4 | 120 |
| Excerpt 4 | 4/4 | 120 |
| Excerpt 5 | 3/4 | 132 |
| Excerpt 6 | 4/4 | 120 |
| Excerpt 7 | 2/4 | 108 |
| Excerpt 8 | 4/4 | 144 |
| Excerpt 9 | 3/4 | 72 |
| Excerpt 10 | 2/4 | 96 |

Two Experimental sets are generated by the proposed framework and Baseline 1 as continuations of the first 2 bars of the Reference excerpts. To take full advantage of the proposed framework and improve generation quality, the last bar of the Reference excerpt is provided as additional information to the proposed framework. We provide an extra bar before the Reference excerpt to Baseline 1 to ensure that both models are conditioned on 3 bars of primer. Figure 6 explains how test excerpts are generated. The Experimental excerpts are expected to have the same tempo and time signature as the corresponding Reference excerpts, since they share the same primers.

Figure 6

Generation of musical excerpts for evaluation, conditioned on primer notes, for the proposed framework and Baseline 1.

We introduce an Anchor set for rejecting data that do not qualify for further statistical analysis. An Anchor is a pitch-transposed Reference; it thus has an identical rhythm to the Reference and sounds more similar to the Reference than the generated samples do.

Session 1

Session 1 is an online survey3 on the perceived quality of the generated music. Participants listened to a pair of musical excerpts and were asked to provide a 5-point scale rating (Bad, Poor, Fair, Good, and Excellent) on how well the two excerpts matched each other based on their first impression.

There were 5 runs in Session 1, with each run containing a group of excerpts sharing the same primer. The participant compared the Reference excerpt to one Anchor and two Experimental excerpts, yielding 3 comparisons per run. To avoid fatiguing subjects with too many samples, we randomly presented only 5 groups, resulting in 15 comparisons per participant. We obtained 114 ratings from 13 participants after rejecting invalid ratings, i.e., runs in which the match between the Reference and the Anchor was rated Bad, Poor, or Fair. Figure 7 shows the overall statistics of the ratings. The improvement in overall perceived quality of the proposed framework over Baseline 1 is statistically significant (p < 0.001). See Appendix A for more details on the data analysis.

Figure 7

Excerpts generated from the proposed framework were rated as a better match of the Reference than Baseline 1. Error bars show the standard deviation of the ratings.

Session 2

Session 2 examines the perceived rhythmic regularity via a finger-tapping task. While commonly used self-report methods may suffer from biased responses, our approach directly reflects how beats are perceived, as finger-tapping is less accurate when a beat is difficult to find[37]. Although finger-tapping has long been used to study sensorimotor synchronization to musical rhythmic patterns, this is the first time it has been used to evaluate an AI music generation system.

Participants were asked to tap the beats of the music on a computer keyboard the way they would normally tap their feet when listening to music. We developed a web app, Music Eval, to collect the finger-tapping onsets. There were 5 runs in Session 2. For each run, subjects performed finger-tapping on a group of excerpts with the same primer, first on an Anchor excerpt, followed by the two Experimental excerpts presented in random order. To avoid fatiguing the participants, we randomly presented 5 groups, resulting in 15 excerpts per participant. We analyzed the Asynchrony, defined as the difference between intertap intervals (ITIs) and ground-truth beat inter-onset intervals (IOIs), for the different models.

Six participants were recruited for Session 2. After excluding invalid tappings, i.e., runs in which the participant tapped to the Anchor in an unexpected way, we computed the F-statistic on the Asynchrony of Baseline 1 and the proposed framework. Table 4 shows the FDR-corrected intra-person F-values and p-values. The Asynchrony of our proposed framework is significantly smaller (p < 0.001) than that of Baseline 1 for 5 out of 6 participants, suggesting that beats in the music generated by the proposed framework are steadier and easier to perceive. See Appendix A for more details on data processing and analysis.

Table 4

| | F-value | p-value |
| --- | --- | --- |
| Subject 1 | 2.3501 | 3.50e-05*** |
| Subject 2 | 9.2764 | 3.33e-16*** |
| Subject 3 | 11.2413 | 3.33e-16*** |
| Subject 4 | 0.9825 | 0.5320 |
| Subject 5 | 4.4155 | 0.0008*** |
| Subject 6 | 3.4250 | 6.16e-09*** |
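A sketch of the statistical test as we understand it: a variance-ratio F-test on the Asynchrony of the two systems, with Benjamini-Hochberg FDR correction across participants. Interpreting the F-statistic as a variance ratio and the specific library calls are our assumptions, not details stated in the paper.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

def asynchrony_f_test(async_baseline, async_proposed):
    """One-sided variance-ratio test: is the baseline Asynchrony more variable?"""
    f = np.var(async_baseline, ddof=1) / np.var(async_proposed, ddof=1)
    p = stats.f.sf(f, len(async_baseline) - 1, len(async_proposed) - 1)
    return f, p

# Synthetic per-subject Asynchrony arrays, just to exercise the pipeline.
rng = np.random.default_rng(0)
pairs = [(rng.normal(0, 0.08, 60), rng.normal(0, 0.04, 60)) for _ in range(6)]
p_values = [asynchrony_f_test(baseline, proposed)[1] for baseline, proposed in pairs]
rejected, p_corrected, _, _ = multipletests(p_values, method="fdr_bh")
```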

Conclusion

In this paper, we present a small-scale framework suitable for modeling phrase-level symbolic music with limited data. The proposed framework outperforms state-of-the-art transformer-based models trained on the same dataset in terms of reconstruction loss, overall impression, and rhythmic regularity. We attribute the improvement over the baselines to the encoder-decoder structure, the REMI Lite representation, and the bar-level masking strategy adopted during the training stage: the bidirectional encoder, equipped with the Bar-Pad tokens in REMI Lite, incorporates context from both directions during training and inference, and the progressive bar-level masking strategy effectively augments the dataset when the model is trained with only 195 sonata movements. Moreover, our finger-tapping experiment provides a novel way to measure human responses to AI-generated music while mitigating self-report bias in human evaluation. The generated samples exhibit a compelling phrase-level hierarchy, suggesting new possibilities for composing well-structured, full-length, sophisticated music without the aid of pre-trained LLMs and extensive training data. This leaves us enthusiastic about future research in this direction.

There are several possible extensions to this work. Firstly, the model could be optimized to retain its ability to produce highly coherent outputs with less primer information. Secondly, adding dynamics, articulation, and grace notes, which are currently excluded from the music representation, could improve the artistic interpretation and expressiveness of the output. Lastly, the current framework still lacks control over melody, rhythm, and chords. As classical music has a rich rhythmic structure and logical harmonic progressions, we leave this topic for future studies.

As a final remark, we hope that the proposed framework will benefit the community by resolving the discrepancy between the enormous amount of data required for training LLMs and the inadequacy of data available for music modeling.

Ethics Statement

Data Privacy

The model in this work was trained on part of a publicly available dataset, KernScores.

Human Participants Involvement

This study involved human participants in evaluating the quality of music generated by the proposed framework. The evaluation study was reviewed and approved by the Committee for the Protection of Human Subjects at Dartmouth College (CPHS #: STUDY00032901), and informed consent was obtained from each participant. The evaluation entailed two independent sessions conducted online. Participants could choose to attend either or both sessions. Volunteers were recruited from within Dartmouth College via email. Participation was voluntary and anonymous. No identifiable information was collected during the study.

Potential Conflicts of Interest

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Appendix

A. Human Evaluation Data Analysis

Session 1

Session 2

We performed the following preprocessing steps before calculating the Asynchrony:

  1. Remove jittery taps with ITI < 0.01 second.

  2. Reject invalid runs. For each run in Session 2, participants performed finger-tapping on an Anchor and two Experimental excerpts. The Anchor is used to reject a run if (1) the tapping pattern does not belong to any of the coordination modes shown in Figure 8 and (2) the standard deviation of ITIs is larger than half of the IOI. The tapping pattern is estimated based on the average ITIs.

  3. Adjust the IOI according to the estimated coordination mode.

  4. Concatenate valid ITIs for every participant.

Figure 8

Common coordination modes in finger-tapping tasks. Thin vertical bars stand for taps. Interonset interval (IOI) and intertap interval (ITI) are indicated.
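The preprocessing steps above can be sketched as follows; the thresholds follow the listed steps, while the array handling, the mode-ratio argument, and applying only the standard-deviation criterion for run validity are our simplifications.

```python
import numpy as np

def preprocess_itis(tap_onsets):
    """Step 1: compute intertap intervals and drop jittery taps (ITI < 0.01 s)."""
    itis = np.diff(np.sort(np.asarray(tap_onsets, dtype=float)))
    return itis[itis >= 0.01]

def run_is_valid(anchor_itis, beat_ioi):
    """Step 2 (partial): reject the run if the Anchor ITIs are too irregular."""
    return np.std(anchor_itis) <= 0.5 * beat_ioi

def asynchrony(itis, beat_ioi, mode_ratio=1.0):
    """Steps 3-4: adjust the IOI to the estimated coordination mode
    (e.g. mode_ratio=2.0 if the participant tapped every other beat)
    and return the per-tap Asynchrony (ITI minus adjusted IOI)."""
    return itis - beat_ioi * mode_ratio
```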
