AIMC2024 Performance Proposal
glemöhnic is a ~20-minute performance piece that utilises SpeechBrain, OpenAI and CoquiTTS voice and speech models to transcribe, synthesise and clone historical and real-time improvised vocal gestures.
This piece explores how extra-normal [1] vocal sounds trigger and provoke nonsensical AI-mediations of human vocality, by utilising non-text audio as input material for text-expectant AI speech recognition and synthesis models. The mediation of non-textual human voice gestures by these ASR models yields eclectic, bizarre and poetic nonsense, which is further utilised as textual input for text-to-speech synthesis and voice cloning models. CoquiTTS’ [2] XTTS_V2 model re-constructs the syllabic, phonemic and garbled poems into vocal clones that oscillate between their reference audio (the original audio dataset input) and the scraped audio data that the XTTS_V2 model has been trained on. The result of this is a collection of original and cloned audio samples that are utilised as sonic material in a live coded musical performance, using the strudelREPL platform.
To give a specific overview of how the AI model pipeline is implemented in glemöhnic, Image 1 below illustrates the overall flow of the models used; each stage is discussed in turn.
The pipeline begins with two parallel ASR branches. The first branch (orange in Image 1) utilises the SpeechBrain [3] wav2vec 2.0 ASR model. This model is an end-to-end system, comprising a tokenizer block pretrained on the CommonVoice dataset and Facebook's acoustic model, Wav2Vec2-Large-LV60. The SpeechBrain model builds on Wav2Vec2 with the addition of two deep neural network layers and further fine-tuning on CommonVoice. The output representation is then parsed by a CTC decoder, which learns the alignment between the input audio and the output token sequences. The second branch (pink in Image 1) is OpenAI's [4] Whisper ASR model, a multilingual speech recognition model pretrained on 680,000 hours of audio and corresponding transcripts scraped from the internet [5].
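As an illustration of how these two branches can be invoked, the following minimal Python sketch loads a SpeechBrain CommonVoice-English wav2vec 2.0 checkpoint and a Whisper model and transcribes a single vocal gesture. The checkpoint name, Whisper model size and file path are illustrative assumptions, not necessarily the exact configuration used in the piece.

```python
# Minimal sketch of the two ASR branches (model choices and paths are assumptions).
import whisper                                    # pip install openai-whisper
from speechbrain.inference.ASR import EncoderASR  # speechbrain >= 1.0; older releases import from speechbrain.pretrained

# Branch 1 (orange): wav2vec 2.0 acoustic model + CTC decoding, fine-tuned on CommonVoice English.
sb_asr = EncoderASR.from_hparams(
    source="speechbrain/asr-wav2vec2-commonvoice-en",
    savedir="pretrained_models/asr-wav2vec2-commonvoice-en",
)

# Branch 2 (pink): OpenAI Whisper, multilingual.
whisper_model = whisper.load_model("medium")

audio_path = "gestures/sigh_01.wav"  # hypothetical non-text vocal gesture
print("SpeechBrain:", sb_asr.transcribe_file(audio_path))
print("Whisper:", whisper_model.transcribe(audio_path)["text"])
```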
The purpose of using these two models in parallel is to expose the different quirks that SpeechBrain and Whisper exhibit as a result of their respective training datasets. The SpeechBrain model is pretrained on an English-language corpus; it has difficulty parsing non-text-based speech, which triggers the attention mechanism to enter phonemic “death-loops”. Whisper, in comparison, is pretrained on a multilingual corpus of scraped audio and corresponding transcripts, which yields unexpected mappings between phonemic, textual and timbral audio input and the original scraped audio dataset. This is illustrated below in Image 2.
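The parallel transcription stage could be batched over a folder of vocal gestures and written to CSV files, one per model, ready for the cloning stage, roughly as in the sketch below. The folder layout and CSV columns are assumptions for illustration rather than the piece's actual file format.

```python
# Hypothetical batch step: transcribe each gesture with both models, then write one CSV per model.
import csv
from pathlib import Path

import whisper
from speechbrain.inference.ASR import EncoderASR

sb_asr = EncoderASR.from_hparams(
    source="speechbrain/asr-wav2vec2-commonvoice-en",
    savedir="pretrained_models/asr-wav2vec2-commonvoice-en",
)
whisper_model = whisper.load_model("medium")

gesture_dir = Path("gestures")    # assumed folder of non-text vocal recordings
out_dir = Path("transcriptions")  # assumed output folder for the CSV transcriptions
out_dir.mkdir(exist_ok=True)

results = {"speechbrain": [], "whisper": []}
for wav in sorted(gesture_dir.glob("*.wav")):
    results["speechbrain"].append((str(wav), sb_asr.transcribe_file(str(wav))))
    results["whisper"].append((str(wav), whisper_model.transcribe(str(wav))["text"].strip()))

# Keeping each branch in its own CSV preserves the two models' distinct quirks downstream.
for model_name, rows in results.items():
    with open(out_dir / f"{model_name}.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["audio_file", "text"])
        writer.writerows(rows)
```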
The next stage in the pipeline (green in Image 1) is CoquiTTS’ XTTS_V2 cloning and text-to-speech synthesis model. This receives and loads the individual CSV transcriptions produced by the Whisper and SpeechBrain models and pairs them with their parent audio file (blue in Image 1). The transcriptions are used as prompt material for the XTTS_V2 model, and the parent audio is used as a speaker reference, so that the prompt material is synthesised and cloned according to the parent audio. The resulting output is individual audio material, or a bank of it, that is then used as samples in the final pipeline stage (yellow in Image 1). In this final stage, the samples are manipulated in real time in the strudel platform. The overall intention of this real-time manipulation is to engage with a form of AI voice to construct sound collages inspired by Dada [6] [7] [8], grounded within rhythmic and gestural conventions of improvised voice.
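A minimal sketch of the cloning stage, using Coqui TTS's Python API for the XTTS_v2 model, is given below; the CSV layout and output paths continue the assumptions from the previous sketch, and the actual implementation in the piece may differ.

```python
# Sketch of the XTTS_v2 cloning stage: each garbled transcription is re-synthesised
# with its parent audio file as the voice-cloning speaker reference.
import csv
from pathlib import Path

from TTS.api import TTS  # pip install TTS (Coqui TTS)

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

out_dir = Path("clones")
out_dir.mkdir(exist_ok=True)

with open("transcriptions/whisper.csv", newline="") as f:  # assumed CSV from the ASR stage
    for i, row in enumerate(csv.DictReader(f)):
        if not row["text"].strip():
            continue  # skip gestures that produced an empty transcription
        tts.tts_to_file(
            text=row["text"],               # nonsense transcription as the TTS prompt
            speaker_wav=row["audio_file"],  # parent audio as the cloning reference
            language="en",
            file_path=str(out_dir / f"whisper_clone_{i:03d}.wav"),
        )
```

The resulting files form the bank of original and cloned samples that is then loaded into strudel for live manipulation.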
Thematically, glemöhnic engages with the conference theme through its engagement with AI models that have been built, trained and deployed with certain intentionalities: namely, to recognise, transcribe and synthesise human speech. It further ties into sound art and historical art movements around language and text-as-sound, such as Dada and zaum.
glemöhnic would work best in Performance 2 at Oxford University. It could also work well at the Performance 3 club night at the Old Fire Station, as it utilises the strudel live coding platform to interact with the output XTTS cloned audio.
The performer will provide:
own laptop
microphones
audio transmitter and receiver pack
audio interface
power plugs and UK power adaptors.
The performer can also bring all of her own audio cables, if required. The audio interface has two female XLR outputs, one for each stereo channel, which should go to the PA system.
The performer requires from the venue:
a standing-height table that would suitably fit a laptop, audio interface, and the transmitter and receiver pack
a stereo PA system (ideally with subwoofer) and basic theatrical lighting.
What secrets and poetry might be contained in our burps? Our giggles?
glemöhnic explores how our vocal identities are pulled from us and re-formed into new vocal identities, through feeding a sigh or cough into text-oriented AI voice models and allowing them to unravel and warp.
This performance utilises open-source speech recognition and synthesis models from SpeechBrain, OpenAI and CoquiTTS to transcribe, synthesise, clone and mutate real-time vocal sounds and historical voice datasets. From this unravelling of non-text voice sounds by text-expectant AI models, scraped and forgotten voices are sung into being and rise from the dusty depths of scraped datasets. These new sounding bodies are reunited with their parent voice bodies, to create converging and diverging tapestries inspired by Dada and zaum.
glemöhnic catalyses the wayward mutations of human voice that occur when wordless voice is fed to word-dependent AI models.
Some smaller audio excerpts are included below, as demonstration.
Kelsey Cotton is a vocalist-artist-mover working with experimental music, Musical Artificial Intelligence, electronic textiles, soft-robotics, and Human-Computer Interaction. As a researcher, Kelsey is fascinated with pushing the limits of musical bodies, with her recent work delving deeper into designing artifacts which harness, augment and fuse different physiologies. She is passionate about somatic interaction, the potential for intersomatic experiences between fleshy and synthetic bodies, and first-person feminist perspectives of musical AI. Kelsey is currently undertaking PhD studies in Interactive Music and AI at Chalmers University of Technology in Gothenburg, Sweden.
This work has utilised open-source, pretrained AI models and the first author’s own personal dataset. No human participants (other than the first author) were recruited for this study, and no sensitive data were collected. This paper intends to contribute to musical AI research and to support future research within this community. The environmental impact of this work is predicted to be minimal, since the computation required to create it was comparable to daily personal computer usage. The accessibility of the technology in this work is limited only by general access to computers and computational development frameworks.
This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program—Humanity and Society (WASP-HS), funded by the Marianne and Marcus Wallenberg Foundation and the Marcus and Amalia Wallenberg Foundation.