The use of AI and traditional Chinese instruments in sonic art.
This paper explores the innovative integration of Large Audio AI Models (LMs) with traditional Chinese instruments in the concert version of the work "Prelude: To Listening". Through the use of AI-generated fixed media for the sanxian, sanshin, and Chinese percussions, the project pushes the boundaries of sonic art and mixed music. The study examines the creative potential of LMs, such as AudioLDM2 [1] and AudioCraft [2], in generating novel sounds that deviate from exact replication, thereby offering fresh avenues for artistic expression. The paper highlights the conceptual innovation of using film subtitles as musical prompts, which imbues the audio with additional narrative and emotional layers. It also discusses the challenges of debugging LMs and the limitations of controlling audio features through text, emphasizing the need for more sophisticated methods to bridge text inputs and audio outputs. The performance, led by artist Ryo Ikeshiro and supported by a team of collaborators, showcases the potential of LMs in the evolution of experimental music and sound arts. The code and audio are available at https://github.com/prelude-to-listening/concert.
Large Audio AI Models, audio generation, mixed music, fixed-media electronics, Sonic Art
Set in Hong Kong, "Prelude: To Listening" is a concert that transcends eras through live instrument performance (Video 2, 3) and field recording (film section, Video 1), tapping into historical echoes. The instrumental section harmoniously merges the sonorous sanxian, sanshin, and Chinese percussions with the avant-garde Large AI Models, offering novel soundscapes. The interdisciplinary nature of this paper encapsulates the essence of experimental music by challenging the norms of sound replication and embracing the unpredictability of AI in the creative process.
The film section includes Ambisonics field recordings with video and text. It takes the audience on a journey through sites of historical and cultural significance in Sok Kwu Wan, Lamma Island, Hong Kong. The film explores the possibilities of experiencing traces of the past and the future in audiovisual footage related to its colonial and traditional heritage. The subtitles serve as narration and are highly suggestive, emphasizing connections—real or imagined—between the present sounds and the contextual information of past and future events.
The sanxian, a traditional Chinese stringed instrument, is known for its long history and distinctive three-string configuration. Its body, typically made from wood, is complemented by silk or nylon strings that produce a warm, resonant sound. The sanshin, a similar instrument hailing from Okinawa, Japan, has a slightly smaller frame and is often used in local folk music, characterized by its clear and bright tone. Both instruments are played by plucking or strumming the strings, and they are highly valued for their versatility and expressive capabilities.
Chinese percussion instruments, on the other hand, showcase the richness of the country's musical heritage. They include a wide array of objects designed to produce rhythm and texture, such as the xiaocha (a set of metallic cymbals), gu (a barrel-shaped drum), and paigu (a set of tuned drums). These percussion instruments are played using various techniques, including striking with mallets or hands, shaking, or scraping, to create a diverse range of sounds that can convey intricate patterns and dynamics. The construction of these instruments is deeply rooted in cultural symbolism and craftsmanship, with materials and designs often reflecting regional characteristics and historical evolution.
This paper contributes to the discourse on AI in sound arts by highlighting the positive effects of using LMs for audio generation. It showcases how the imperfections in AI's ability to replicate sounds can lead to innovative sonic expressions. The paper also delves into the conceptual innovation of using film subtitles as musical prompts, which introduces a new dimension of narrative and emotion to the generated audio. Furthermore, it explores the seamless integration of AI-generated fixed media with traditional instruments, demonstrating the potential for AI to complement and enhance cultural music traditions.
The reflections in this paper provide valuable insights into the complexities of debugging LMs for audio generation and the challenges of achieving precise control over audio features using textual inputs. It underscores the need for more sophisticated methods to bridge the gap between text inputs and audio outputs, and the importance of expanding the diversity of training data to cater to the vast array of musical genres and instruments.
As we unfold the narrative of "Prelude: To Listening", this paper invites readers to appreciate the convergence of traditional artistry and AI innovation.
Our approach to Large Audio Models in the "Prelude: To Listening" project involves a meticulous process (Figure 2) to achieve the desired fixed-media electronics. Initially, we craft prompts by distilling and refining subtitles from the film itself, as shown in Figure 1. We work with a suite of five principal models, each in a distinct configuration: AudioLDM2 (original, music, and large checkpoints) and AudioCraft (AudioGen, and MusicGen in G major at 76 bpm). These prompts are systematically fed into the models in sequence, yielding audio outputs for each. The iterative process of refining prompts and tweaking the structures and parameters of the models continues until our artists consider the produced sounds satisfactory.
In the next phase, we curate the best outputs from the generated audio, selecting those that align with our artistic vision. Subsequently, we perform post-processing on the selected audio clips, which may include adjusting pitch, tempo, and other parameters only if necessary to enhance the auditory experience.
Finally, we execute the instrumental accompaniment through a two-phase approach: offline and real-time. During the offline phase, we integrate the AI-generated audio with pre-recorded tracks of sanshin and Chinese percussions from our musicians. Subsequently, the real-time phase features the sanxian artist delivering a live performance, synchronized with the remixed audio during the film's interludes, thereby creating an immersive and interactive experience for the "Prelude: To Listening" performance.
We propose a distinctive approach to prompt design for audio generation by leveraging subtitles from the "Prelude: To Listening" film. This innovative method enriches the prompts with vibrant scene descriptions, which serve to contextualize and emotionally charge the audio output of LMs.
The prompts crafted from the film's subtitles diverge from traditional audio LM prompts that directly describe sound characteristics, such as "Loud bus roaring and voices" or "Nature environmental noise with various bird vocalization, high fidelity, children playing far away and light wind". Instead, our prompts encourage the LMs to transcend mere technical replication of sounds, aiming to create an immersive and emotionally resonant accompaniment for sanxian, sanshin, and Chinese percussions.
For example, prompts such as "We are listening across time" and "We may even be able to hear the cacophony from the previous weeks" provide a framework that is both temporally and spatially rich. This guides the LMs to generate audio that is not just a reflection of the scenes but also carries the cultural and historical weight of the narratives described, perfectly suited for complementing the instruments’ unique tonal qualities and cultural heritage.
The use of these narrative-driven prompts showcases our dedication to exploring new frontiers in audio generation for traditional instruments like sanxian, sanshin, and Chinese percussions. By infusing the prompts with story and emotion, we enable the LMs to generate audio that is not only technically coherent but also artistically and culturally aligned with the instruments’ rich musical legacy. This approach offers a fresh perspective on the creative potential of LMs in enhancing traditional music and experimental sound arts.
Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can differ significantly from those of other types. To move towards a unified perspective on audio generation, the work in [1] proposes AudioLDM2, a framework that applies the same learning method to speech, music, and sound effect generation. The framework introduces a general representation of audio, called the "language of audio" (LOA). Any audio can be translated into LOA based on AudioMAE [3], a self-supervised pre-trained representation learning model. In the generation process, the framework translates other modalities into LOA using a GPT-2 model and performs self-supervised audio generation learning with a latent diffusion model conditioned on LOA. This framework naturally brings advantages such as in-context learning abilities and reusable self-supervised pre-trained AudioMAE and latent diffusion models. Experiments on the major benchmarks of text-to-audio, text-to-music, and text-to-speech demonstrate performance that is state-of-the-art or competitive with previous approaches.
Checkpoint Selection:
audioldm2: 350M UNet for general text-to-audio.
audioldm2-large: 750M UNet for larger-scale text-to-audio.
audioldm2-music: 350M UNet for text-to-music.
Prompt Construction:
Create a prompt from the film subtitles.
Add a negative prompt for quality guidance.
Inference Control:
Adjust num_inference_steps for audio quality control.
Vary guidance_scale to balance prompt adherence and sound quality.
Waveform Evaluation:
Generate and evaluate multiple waveforms with different seeds.
Set num_waveforms_per_prompt for batch generation and automatic scoring.
Rank the generated audio based on the scoring results.
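To ground this workflow, the following is a minimal sketch assuming the Hugging Face diffusers port of AudioLDM2; the checkpoint choice, prompt text, parameter values, and output path are illustrative stand-ins rather than our exact production settings.

```python
# Minimal sketch of the workflow above (checkpoint selection, prompt construction,
# inference control, waveform evaluation), assuming the diffusers port of AudioLDM2.
import torch
import scipy.io.wavfile
from diffusers import AudioLDM2Pipeline

# Checkpoint selection: "cvssp/audioldm2", "cvssp/audioldm2-large", or "cvssp/audioldm2-music".
pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2-music", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Prompt construction: a subtitle-derived prompt plus a negative prompt for quality guidance.
prompt = "We are listening across time"            # distilled from the film subtitles
negative_prompt = "low quality, average quality"   # steers generation away from poor audio

# Inference control and waveform evaluation.
result = pipe(
    prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=200,     # more steps generally improve audio quality
    guidance_scale=3.5,          # balances prompt adherence against sound quality
    audio_length_in_s=10.0,
    num_waveforms_per_prompt=4,  # batch generation; candidates are scored against
                                 # the prompt automatically and returned best-first
)

# Keep the top-ranked waveform (the diffusers AudioLDM2 pipeline outputs 16 kHz audio).
scipy.io.wavfile.write("prompt_candidate.wav", rate=16000, data=result.audios[0])
```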
Our exploration into the intricate workflows of AudioLDM2 involves an analysis of token importance within the large model. Utilizing ECCO [4], a Python library built on PyTorch and Transformers, we gain insights by visualizing attention patterns generated by GPT-2 during the following two-phase transition of the universal audio representation, LOA (Language of Audio); a rough code sketch follows the outline.
Conditioning Information to LOA Translation with GPT-2:
Input prompts undergo encoding through CLAP's text-branch and Flan-T5's encoder.
Text embeddings are computed and projected to a shared space using AudioLDM2ProjectionModel.
GPT-2 auto-regressively predicts eight new embedding vectors based on the projected embeddings.
LOA to Audio Generation with the Latent Diffusion Model:
Generated embedding vectors and Flan-T5 text embeddings become crucial cross-attention conditioning elements in the Latent Diffusion Model (LDM).
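As a rough illustration of the token-importance analysis, the sketch below loads a plain GPT-2 model with ECCO and inspects which input tokens most influence the generated continuation, standing in for the GPT-2 stage inside AudioLDM2 rather than instrumenting the full pipeline. The attribution call is an assumption: its name differs across ECCO releases (saliency() in earlier versions, primary_attributions() in later ones), so it should be checked against the installed version.

```python
# Hypothetical sketch: token-importance visualization with ECCO on a standalone
# GPT-2 model, as a proxy for the GPT-2 stage of AudioLDM2. Method names,
# especially the attribution call, may differ between ECCO versions.
import ecco

# Load GPT-2 with activation capture enabled so attributions can be computed.
lm = ecco.from_pretrained("gpt2", activations=True)

prompt = "We may even be able to hear the cacophony from the previous weeks"

# Auto-regressively generate a handful of tokens, mirroring the eight embedding
# vectors predicted during the LOA translation step.
output = lm.generate(prompt, generate=8, do_sample=False)

# Visualize how strongly each input token influenced each generated token.
# Earlier ECCO releases expose this as output.saliency() instead.
output.primary_attributions()
```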
AudioCraft powers audio compression and generation research and consists of three models: MusicGen [2], AudioGen [5], and EnCodec [6]. MusicGen, which was trained with Meta-owned and specifically licensed music, generates music from text-based user inputs, while AudioGen, trained on public sound effects, generates audio from text-based user inputs. EnCodec, typically used as the foundation of MusicGen and AudioGen, is a state-of-the-art, real-time, high-fidelity audio codec that leverages neural networks to compress any kind of audio and reconstruct the original signal with high fidelity. AudioCraft further includes a diffusion-based approach [7] that reconstructs audio from the EnCodec compressed representation with fewer artifacts.
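To make the codec's role concrete, a minimal compress-and-reconstruct round trip might look as follows, assuming the Hugging Face transformers port of EnCodec; the checkpoint name and the synthetic test tone are illustrative.

```python
# Sketch: EnCodec round trip (compress to discrete codes, then reconstruct),
# assuming the transformers port; in MusicGen/AudioGen these codes are what the
# language model actually generates.
import numpy as np
import torch
from transformers import AutoProcessor, EncodecModel

model = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")

# One second of a 440 Hz sine wave as a stand-in for real audio.
sr = processor.sampling_rate  # 24 kHz for this checkpoint
t = np.linspace(0, 1, sr, endpoint=False)
audio = 0.5 * np.sin(2 * np.pi * 440 * t)

inputs = processor(raw_audio=audio, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    encoded = model.encode(inputs["input_values"], inputs["padding_mask"])
    reconstructed = model.decode(encoded.audio_codes, encoded.audio_scales,
                                 inputs["padding_mask"])[0]
print(reconstructed.shape)  # reconstructed waveform, approximately the original length
```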
Our refinements enhance the AudioCraft framework in several key aspects (refer to the Appendix, "Wiki of AudioCraft", for parameter explanations; a minimal sketch of the segment and overlap logic follows this list):
Increase prompt segments: From 10 to 15 segments.
Extend the duration for each segment: From 30 seconds to 300 seconds.
Prolong overlaps: From 9 seconds to 60 seconds.
Enhance user control: Introduce the ability to input individual segment durations and overlaps.
Automatic calculation: Implement a feature to automatically calculate the overall duration and timings based on user-defined segment durations, repeats, and overlaps.
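As noted above, the sketch below illustrates the segment and overlap mechanics with MusicGen from AudioCraft, using generate_continuation so that each new segment references the tail of the previous one. The checkpoint name, prompts, and per-segment durations and overlaps are illustrative placeholders rather than our production configuration, and the stitching logic is a simplified version of what the enhanced interface automates.

```python
# Minimal sketch (not our production script): multi-prompt MusicGen generation
# where each segment overlaps the previous one, in the spirit of the
# AudioCraft refinements listed above.
import torch
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-medium")  # illustrative checkpoint

# Illustrative per-segment settings; real runs use subtitle-derived prompts and
# user-defined durations/overlaps (up to 15 segments, 300 s each, 60 s overlaps).
segments = [
    {"prompt": "We are listening across time", "duration": 30, "overlap": 0},
    {"prompt": "The cacophony from the previous weeks", "duration": 30, "overlap": 10},
    {"prompt": "An ominous prelude to the Dragon Boat Race", "duration": 30, "overlap": 10},
]

sr = model.sample_rate  # 32 kHz for MusicGen
wav = None
for seg in segments:
    model.set_generation_params(duration=seg["duration"])
    if wav is None:
        # First segment: plain text-to-music generation.
        wav = model.generate([seg["prompt"]])[0]
    else:
        # Later segments: condition on the last `overlap` seconds of audio so far.
        tail = wav[..., -int(seg["overlap"] * sr):]
        cont = model.generate_continuation(tail.unsqueeze(0), sr, [seg["prompt"]])[0]
        # The continuation re-emits the overlap region; keep only the new part.
        wav = torch.cat([wav, cont[..., tail.shape[-1]:]], dim=-1)

audio_write("multi_prompt_sketch", wav.cpu(), sr, strategy="loudness")
```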
Our artists conducted a meticulous evaluation of the AI-generated audio, employing a multifaceted set of metrics to assess the quality and effectiveness of the soundscapes produced. The evaluation criteria included:
Sound Quality: This metric assessed the overall richness and cleanliness of the audio, focusing on the presence of unwanted noise and the suitability of the audio's texture, ensuring a pleasant and engaging auditory experience.
Semantic Relevance: The audio was evaluated for its relevance to the textual prompts, measuring how well the generated sounds matched the intended narrative themes and emotional undertones as conveyed by the subtitles.
Experimental Music Perspective: The judgment considered the innovativeness and creativity of the audio from the vantage point of experimental music, appreciating the AI's ability to produce sounds that, while not traditionally replicative, offered new dimensions in sonic exploration.
Instrumental Blend: A critical evaluation was made regarding the integration of AI-generated sounds with the sanxian, sanshin, and Chinese percussions, to determine the seamlessness of their blend and the overall harmony within the composition.
Through our iterative evaluation and refinements (Sec. 3.3), we meticulously honed the prompts and the models, culminating in the selection of the finest iteration for our fixed-media electronics in "Prelude: To Listening," as detailed in Sec. 3.2. The spectrograms are plotted using Librosa [8], with amplitude spectra converted to decibels and displayed on a logarithmic frequency axis. The audio is uniformly sampled at a rate of 48 kHz.
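For reference, the spectrogram rendering can be reproduced with a short Librosa/Matplotlib routine along these lines; the file path and title are placeholders.

```python
# Sketch of the spectrogram plotting described above: amplitude spectrum in dB
# on a logarithmic frequency axis, with audio loaded at 48 kHz.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("prompt_3.wav", sr=48000)  # placeholder path; resampled to 48 kHz
S_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)

fig, ax = plt.subplots(figsize=(8, 3))
img = librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="log", ax=ax)
fig.colorbar(img, ax=ax, format="%+2.0f dB")
ax.set_title("prompt_3 (generated audio)")
plt.tight_layout()
plt.savefig("prompt_3_spectrogram.png", dpi=150)
```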
Generated audio examples: [prompt_1_1], [prompt_1_2], [prompt_1_3], [prompt_2_1], [prompt_2_2], [prompt_2_3], [prompt_2_4], [prompt_2], [prompt_3], [prompt_4_2], [prompt_4_3], [prompt_4_4].
Offline accompaniment: generative audio + sanshin + Chinese percussions → remixed recordings (i.e., fixed-media electronics).
For the real-time accompaniment, featuring the live sanxian with the fixed-media electronics, please refer to the performance (Video 2, 3).
The iterative refinement of our generative audio was a transformative process, deeply influenced by the multifaceted evaluation criteria. Starting with Sound Quality, we meticulously polished the audio to achieve a cleaner and richer texture, reducing unwanted noise and enhancing the tonal depth for a more engaging auditory experience.
For Semantic Relevance, we fine-tuned the prompts to ensure the generated sounds closely mirrored the narrative prompts, thus making the audio more contextually resonant and emotionally aligned with the subtitles' suggested themes.
Through the lens of Experimental Music Perspective, we encouraged the AI to explore beyond conventional boundaries, leading to a more innovative and expressive audio output that offered fresh sonic dimensions, even if it meant embracing a degree of creative 'noise' that traditional standards might reject.
In terms of Instrumental Blend, we adjusted the integration of AI-generated sounds with traditional instruments to create a seamless and harmonious blend, ensuring that each note contributed to the overall composition's cohesive and symphonic quality.
Taking the example "[prompt_3]" from the bay scenes, we initially encountered a serene calm in the recording. However, after several refinements of the prompt and the model parameters, we guided the AI to imagine and generate the energetic and bustling sounds of the Dragon Boat Race, which were not present during the original recording. The result was an audio transformation that turned the tranquil bay into a vibrant arena, providing an ominous prelude to the imagined raucousness of the event, and ultimately serving as a fitting resolution to the scene's auditory narrative.
We observe that most generative AI models are designed to replicate and recreate existing sounds with precision. However, in the realm of experimental music, the objective often diverges from this norm. As such, when AI systems like AudioLDM2 and AudioCraft do not perfectly replicate known sounds, they inadvertently produce novel audio that can be intriguing and valuable from an experimental music perspective. These "failures" to reproduce conventional sound characteristics open up new avenues for sonic exploration and artistic expression.
The use of film subtitles as prompts for the instrumental section's fixed-media electronics introduced a fresh and innovative approach to the creative process. This method allowed for a more nuanced and contextual generation of audio, as the prompts provided not only a description of the sounds but also an emotional and narrative framework that the AI could interpret.
The interactive process between human and machine was a key positive effect. The computer's response to human-crafted prompts, followed by the musician's reaction and improvisation to the computer-generated audio, created a dynamic and evolving musical dialogue. This back-and-forth interaction is emblematic of the collaborative nature of experimental music, where boundaries are blurred, and the roles of writer, performer, and AI become intertwined.
In terms of sound, the AI-generated accompaniments blended remarkably well with the elements of sanxian, sanshin, and Chinese percussions. The generative models produced sounds that, while not exact replicas of traditional Chinese music, complemented the sanxian's timbre and the percussive textures in a way that was aesthetically pleasing and contextually appropriate for sonic art, experimental music, and contemporary music settings.
The AI's ability to generate sounds that harmonize with traditional instruments showcases the potential for AI to be integrated into diverse musical contexts. It suggests that AI can be a flexible tool in the hands of experimental musicians, capable of contributing to the creation of new soundscapes that resonate with cultural and musical traditions while also breaking new ground.
One of the core challenges in working with Large AI Models for audio generation is the difficulty in debugging and understanding the direct relationship between input text prompts and output audio. While sound artists are interested in the transparency of how a given text prompt shapes the resulting audio, the process is not always straightforward.
As discussed in Section 2.2.2, we introduced an experiment to visualize token importance, aiming to shed light on how specific words in the text prompt might influence the audio output. However, this is just one aspect of the audio generation process.
AudioLDM2 employs AudioMAE, a self-supervised pre-trained model, to extract essential acoustic features from audio. These features, including pitch, timbre, and rhythm, are critical in guiding the latent diffusion model to produce realistic audio. This step indicates that the output audio is influenced not only by the text prompt but also by the pretraining data and the architecture of AudioMAE, which can affect the quality and authenticity of the generated sounds.
Moreover, AudioLDM2's generation process incorporates an element of randomness. The same text prompt might lead to different audio outputs on different occasions due to the latent diffusion model's sampling from a Gaussian distribution and the GPT-2 model's potential to generate varying sequences of Language of Audio (LOA). While this randomness can introduce diversity and a degree of creativity into the output, it also introduces inconsistency and instability.
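When consistency matters more than variety, this randomness can at least be pinned down by seeding the sampler; the short snippet below shows the idea for the diffusers AudioLDM2 pipeline, with the checkpoint and prompt as placeholders.

```python
# Sketch: fixing the random seed so repeated runs of the same prompt draw the
# same Gaussian noise and therefore return the same waveform.
import torch
from diffusers import AudioLDM2Pipeline

pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2").to("cuda")

generator = torch.Generator(device="cuda").manual_seed(42)  # same seed, same output
audio = pipe(
    "We are listening across time",
    num_inference_steps=100,
    audio_length_in_s=10.0,
    generator=generator,
).audios[0]
```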
In essence, the input prompt plays a significant role in determining the semantic content and general characteristics of the output audio. However, it is not the sole factor at play. The interplay between the input prompt, acoustic features extracted by AudioMAE, the model's pretraining data, architecture, and the inherent randomness of the generation process all contribute to the final audio output.
To generate audio that aligns with specific artistic visions, sound artists may need to engage in an iterative process of experimentation. This involves adjusting text prompts, tweaking model parameters, and evaluating the outcomes to achieve the desired results.
Generative Large Audio Models typically rely on text inputs to produce output, which presents a unique challenge when the goal is to control the specific features of the generated audio. In our project, we observed that while text prompts can guide the overall direction of the audio, achieving precise control over the output's characteristics is not straightforward.
For instance, consider the prompt: "Knowing that Cantonese Chinese operas such as 'Six States Installation of Minister' were performed for the benefit of the Tin Hau Temple behind us may colour how we listen to this field recording."
Initially, this prompt lacked the specific descriptor "Cantonese Chinese", which led to generated audio that did not closely resemble the intended Cantonese Chinese operas. Recognizing this shortfall, we refined the prompt to include the descriptive adjective, hoping it would lead to a more accurate representation of the desired sound.
However, even with this adjustment, the generated audio did not align well with the expected outcome of traditional Chinese operas. It was only through additional manipulation of the audio, such as slowing it down by half and lowering the pitch by an octave, that the result began to approach the sonic qualities of Chinese music.
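The post-generation manipulation mentioned here is straightforward signal processing; a minimal sketch using Librosa might look as follows (the file path is a placeholder, and the exact settings we used were tuned by ear).

```python
# Sketch: slowing generated audio to half speed and lowering it by an octave,
# the kind of post-processing applied to approach the desired sonic character.
import librosa
import soundfile as sf

y, sr = librosa.load("generated_opera.wav", sr=None)  # placeholder path, native sample rate

y_slow = librosa.effects.time_stretch(y, rate=0.5)               # half speed (twice as long)
y_low = librosa.effects.pitch_shift(y_slow, sr=sr, n_steps=-12)  # down one octave

sf.write("generated_opera_processed.wav", y_low, sr)
```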
This experience highlights the limitations of using text inputs to generate audio with detailed sonic precision. While LMs can interpret and respond to textual cues, the translation of these cues into specific audio features is not always direct or predictable. The process often requires trial and error, and sometimes post-generation audio processing, to achieve the desired sound.
This reflection points to the need for more nuanced methods of audio generation that can better bridge the gap between text inputs and audio outputs, particularly in the context of experimental music and sound arts where the stakes for creative expression and sonic accuracy are high. It also suggests that future developments in AI-generated audio may benefit from more sophisticated means of inputting and refining audio characteristics, beyond the textual prompts that currently dominate the field.
One of the critical reflections from our experience with AudioLDM2 and AudioCraft is the discrepancy between the training data available to generative models and the diverse range of genres and instruments that sound artists seek to explore. This became particularly evident in our work with the sanxian and sanshin, instruments that were not represented in the models' training data.
The absence of specific instrumental data presents a significant challenge when attempting to generate audio for unique or less common instruments. Despite our best efforts in crafting detailed and context-rich text prompts, the models were unable to produce the distinct sounds of sanxian and sanshin. This limitation suggests that future developments in AI-generated audio should prioritize expanding the range of instruments and genres represented in training data, enabling artists to push the boundaries of sound creation and explore new sonic territories.
As of March 2024, the landscape of audio generation has been transformed by advanced Large Audio AI Models such as AudioLM [9], AudioBox [10], ChatMusician [11], AudioGPT [12], WavJourney [13], and MusicLDM [14]. AudioLM, Google's initiative, adapts language modelling techniques to generate consistent and detailed audio. Meta's AudioBox creates a variety of sounds, offering control over voice styles and sound effects. ChatMusician, an open-source project, specializes in music generation, surpassing GPT-4 in music comprehension. AudioGPT combines large language models with audio models to produce multi-modal responses, including speech and music. WavJourney automates storytelling, crafting narratives with personalized speakers and dynamic soundtracks. MusicLDM addresses the challenge of generating novel music from text, ensuring originality and tackling plagiarism. These models not only push the boundaries of AI in audio but also open new avenues for creative expression and content generation.
Some might contend that our AI-generated audio sounds somewhat random and is affected by digital artifacts and noise, stemming from the unconventional prompts and their allusions to less common instruments. However, large audio models have primarily been trained on common instruments, natural sounds, and widely available music, and have been tested with conventional prompts describing sound characteristics or referencing Western instruments and popular genres. For instance, prompts like "A funky bass guitar grooving in sync with drums" or "A contemporary hip-hop beat with smooth rhymes and catchy hooks" are typical. Consequently, the models might lack knowledge of less common instruments, such as the sanxian and sanshin, and genres like Cantonese opera, and they may also be inadequate in evaluating prompts for such sounds.
Besides the models previously mentioned, some avant-garde initiatives (Figure 6) have surfaced, tackling a wide range of audio tasks.
In the series "Evaluating Audio AI LM models for Sound Design" by Ambient Art Styles, the author critically assesses the application of LMs in the realms of sound design and music production. Through a multi-part exploration, the author compares various models, including Meta/Facebook's MusicGen, Google's MusicLM, and Riffusion, using specific prompts to generate music and sound effects. The evaluation delves into the quality of the outputs, the influence of training data on the models' performances, and the potential for human-AI collaboration in the creative process. The series also scrutinizes the models' ability to understand complex prompts and produce coherent audio sequences.
The "Prelude: To Listening" project has showcased the innovative potential of integrating AI with traditional instruments, revealing a path forward for experimental music. Despite the current limitations in precise audio control from text inputs, this work has demonstrated the creative opportunities that arise when AI deviates from replication, producing unique soundscapes. The use of film subtitles as prompts has provided a richer context for AI-generated audio, enhancing the performance's narrative depth. Looking ahead, future work should focus on expanding the diversity of training data and developing more nuanced methods for audio generation. By refining the interplay between text inputs and acoustic features, future models can better capture the subtleties of traditional music, enabling artists to fully exploit the creative possibilities of AI in sound arts.
The work for the performance and this paper was supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. CityU 21602822). Additionally, we acknowledge the financial backing provided by the Arts Capacity Development Funding Scheme of the Government of the Hong Kong Special Administrative Region, which has been instrumental in supporting the NME15 project.
For the "Prelude: To Listening" project, we upheld ethical research practices, prioritizing accessibility, inclusion, and sustainability. Respecting data privacy and socio-economic fairness, we safeguarded personal information in accordance with GDPR and other data protection laws.
Informed consent was obtained from all human participants, with careful selection, fair remuneration, and considerate post-research engagement. Funding sources were disclosed to maintain transparency, and potential conflicts of interest were identified and reported to preserve the integrity and credibility of our research.
Multi-Prompt: This feature allows you to control the music by adding variation across different time segments (a worked example of the duration arithmetic follows the parameter list).
[Prompt Segments (number)]: The number of unique prompts to generate throughout the music generation.
[Repeat (number)]: How many times this prompt will repeat (instead of using another prompt segment for the same prompt).
[Duration (number)]: How long the generated music should be (in seconds).
[Overlap (number)]: How much each new segment references the previous segment (in seconds). For example, with an overlap of 20 s, each segment after the first references the last 20 s of the previous segment and generates only 10 s of new music if the segment duration is set to 30 s.
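To illustrate how the overall duration follows from these parameters (the automatic calculation mentioned in our AudioCraft refinements), a small helper could compute it as below; the segment values are made up, and we assume a repeat behaves like an additional segment with the same overlap.

```python
# Sketch: total output length for multi-prompt generation with overlaps.
# Each segment after the first re-uses `overlap` seconds of the previous
# segment, so it contributes only (duration - overlap) seconds of new audio.
def total_duration(segments):
    """segments: list of dicts with 'duration', 'repeat', and 'overlap' in seconds."""
    total = 0.0
    first = True
    for seg in segments:
        for _ in range(seg.get("repeat", 1)):
            if first:
                total += seg["duration"]
                first = False
            else:
                total += seg["duration"] - seg["overlap"]
    return total

# Example matching the overlap description above: a 30 s segment with a 20 s
# overlap adds only 10 s of new music.
print(total_duration([
    {"duration": 30, "repeat": 1, "overlap": 0},
    {"duration": 30, "repeat": 2, "overlap": 20},
]))  # 30 + 10 + 10 = 50 seconds
```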