
Stochastic Pirate Radio (KSPR): Generative AI applied to simulate commercial radio

🦜➡️🏴‍☠️

Published on Aug 29, 2024

Abstract

This paper (a product of artistic research) engages with the following challenge: combine publicly available generative AI tools to simulate a commercial radio station, complete with dialogue, news and advertisements, and music programming. Our five success criteria for the “station” are: 1) it runs autonomously; 2) it features diverse content; 3) its content is generated and assembled in faster than real-time; 4) it sounds like commercial radio; and 5) it is engaging for longer than its novelty factor. We consider a variety of generative AI systems for text and dialogue, synthesizing expressive speech, and generating music audio. We describe our engineered pipeline and illustrate its components with several audio examples. We compare our results to other “endless” streams of content. Our resulting stream — “Stochastic Pirate Radio (KSPR)” — can be heard here: https://www.youtube.com/@KSPRStochasticPirateRadio.

Author Keywords

Generative AI, streaming, entertainment, content synthesis, community radio

Introduction

Generative AI has made remarkable progress in synthesizing text, image, speech and music (content more generally) that is difficult to distinguish from human-created (or original) content. But have these tools reached such a level of quality that they can be orchestrated to automatically create an engaging experience? In this work (a product of artistic research) we explore this question in the context of commercial radio: radio programming featuring personalities presenting chatter, news, weather, advertisements, and music programming. We aim to make the resulting “station” run autonomously, feature diverse programmes that are generated faster than real-time, and be engaging for longer than a “novelty factor”. Our station name, “Stochastic Pirate Radio (KSPR)”, references the work of Bender et al., “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” [1].

The main engineering of KSPR unfolded over three half-day “hackathons” with several members of the MUSAiC project (Sturm, Cros Vila, Dalmazzo, Ammeroti, Casini, and Kanhov). Sturm directed the work. Cros Vila worked on the music programming. Dalmazzo worked on the text and dialogue synthesis. Ammeroti integrated all the modules into the final generation-to-broadcast pipeline. Further critical reflections on the work were contributed by Sturm, Casini and Kanhov. The “pirate” aspect of the station comes from three things: 1) its unauthorized use of generated musical material scraped from online AI music services; 2) it is not monetized; and 3) our identities are concealed.1 KSPR can be heard here: https://www.youtube.com/@KSPRStochasticPirateRadio.

In the next section we survey a variety of “endless” streams of content generated by AI. We then describe the requirements and parameters of KSPR, referring to practices of media production [2]. We consider many generative AI tools available for synthesizing text, speech and music, and discuss their suitability for satisfying the station requirements. We describe in detail the engineered KSPR pipeline, and reflect on our experience of the engineering process as well as of the resulting simulated community radio station. We conclude by considering broader perspectives, such as cultural practices, aesthetics and ethics around generative AI.

Endless content streams by generative AI

We now survey a variety of endless content streams where generative AI appears to play a significant role. An early example is the 2015 twitch.tv stream of Google’s Deep Dream generating constantly zooming visuals that morph in response to user-specified text, such as “stone wall” and “tarantula”. An example appears below, with the user prompt shown top right.

An artificial neural network dreaming - Twitch plays Large Scale Deep neural net (LSD net)

Another relatively early example is DADABOTS’ RELENTLESS DOPPELGANGER \m/ \m/ \m/ \m/ \m/ \m/ \m/ \m/ \m/ \m/ \m/ \m/ \m/ \m/, which has been generating a continuous stream of metal music audio since September 4, 2019 — and is still streaming. This is shown below:

RELENTLESS DOPPELGANGER \m/ \m/ \m/ \m/ \m/ \m/ \m/ \m/ \m/ \m/ \m/ \m/ \m/ \m/

“AInfinite.TV (https://www.ainfinite.tv) - Endless AI Generated Video & Music | AI Visuals” appears to have been broadcasting live on YouTube since January 30, 2024: https://www.youtube.com/@AInfiniteTV. In a February 7, 2024, press release [3] the creator of AInfinite.TV, Steve Mills, writes: “Created in collaboration with a film and television producer/executive, and a notable creator talent agent, AInfinite.TV is committed to redefining entertainment through its pioneering 24/7 stream of AI-generated content.” Currently, the stream consists of concatenated generated music recordings about 40-80 seconds long, most being electronic beat-based music. The accompanying video consists of concatenated generated clips that are slightly animated (e.g., slow pans), are about 4 seconds long, and typically have little connection to the music being heard. Prominently featured are dramatic landscapes, faces, and textures (e.g., colored smoke).

Personalized generated music, or soundscapes, are offered by some music AI services, such as Endel. The Endel manifesto reads, “We’re not evolving fast enough. Our bodies and minds are not fit for the new world we live in. Information overload is destroying our psyche… We need new technology to help our bodies and brains adapt to the new world. … Endel takes in various internal and external inputs and creates an optimal environment for your current context, state, and goal.” The positive impacts of Endel soundscapes on focus have been studied experimentally [4].

Endless streams of generated conversation have also been explored. For instance, an artificial conversation between German film-maker Werner Herzog and Slovenian philosopher Slavoj Žižek has been broadcast since late 2022 at https://www.infiniteconversation.com. On Twitch, one can find a stream of generated Seinfeld (https://www.twitch.tv/watchmeforever), and an endless debate between deepfakes of Donald Trump and Joe Biden (https://www.twitch.tv/trumporbiden2024, now replaced with Trump vs. Harris at https://www.twitch.tv/trumporkamala2024). The stream of Seinfeld in particular has been streamed for 9,475 hours, with a total watch time of 2.19 million hours.2 AI-generated podcasts are also beginning to appear [5]. One of the first was “The Joe Rogan AI Experience”, a podcast created by a fan without Rogan’s consent to use his voice or those of his guests.

Finally, the video streaming service Twitch offers interactive possibilities with users and AI-generated content. In a chat window, users can enter text which is then used to generate speaking avatars of Santa (“Made possible by contributions from The Singularity Group”, but no longer live), Jesus (“Welcome, my children! I’m AI Jesus, here to answer your questions 24/7”), and Satan (“Lucifer has started streaming in order to set the record straight about his character and history as well as propaganda spread by a certain AI Jesus”). “How is it manifested?” is a continuous Twitch stream of documentary-like audio-visual shorts generated based on user input. We must also mention the Twitch stream of generated discussions about recently published artificial intelligence research papers.

KSPR: Requirements and Parameters

Image 1

KSPR Logo (generated by Stable Diffusion from a prompt involving “parrot” and “pirate”)

We now describe the requirements and parameters for KSPR. Our five success criteria are:

  1. runs autonomously

  2. features diverse content

  3. content is generated and assembled faster than real-time

  4. sounds like commercial radio

  5. is engaging for longer than its “novelty factor”.

In order for KSPR to run completely by itself, we can implement a master script that performs the duties of a director and a producer, i.e., those overseeing ideation, research, recording, production, scheduling, and finally broadcast [2]. To make the content diverse, KSPR can feature dialogue between radio personalities, news and weather announcements, music mixes, and advertisements. Much of the content can be generated by available AI tools, but since the station will run on a local computer with one GPU we must carefully gauge the latency of each component of the production. Finally, many different commercial radio stations can be heard via Radio Garden.3 Such stations can provide a comparison point for KSPR in terms of sound and programme quality, as well as overall engagement. Engagement can be measured by the number of listeners/subscribers of the station, but until the station is advertised more broadly, we can listen ourselves to get a feeling for how boring the experience is.

Considerations of Generative AI Tools

We now consider several existing generative AI tools for text, speech and music synthesis that are relevant for the content generation needed by KSPR.

Text and dialogue

There exist several free and stable frameworks for generating the text of a conversation between radio personalities at KSPR. We wish to achieve a logical chain of conversation that follows specific topics and has fluid and natural transitions between them. The generated text needs to maintain a conversational structure without losing track of who is speaking, changing the topic before being instructed to do so, or bringing additional characters into the conversation.

Large language models (LLMs) are suitable platforms for achieving these goals. One LLM we consider is ChatGPT 3.5, which is available via an API for Python development.4 Another is Mistral, available as a collection of pre-trained LLMs that can be deployed locally. Generation speed is not considered at this stage, as it is similar among the different versions of Mistral — in the range of 10 to 15 seconds (depending on GPU capabilities). We compare these LLMs according to the following three criteria:

  1. Disk Space: The disk space required by the LLM.

  2. RAM: The amount of memory that is needed to load and run the LLM locally.

  3. Quality: The quality of text generated by the LLM with regard to a dialogue following a given topic and a flowing conversation between two characters, which we subjectively rate using three levels of quality:

    1. Low: The generated dialogue is disjointed, lacks logical flow, frequently deviates from the given topic, and feels unnatural. Characters may behave inconsistently.

    2. Medium: The dialogue generally follows the topic and maintains a reasonable flow, but may have occasional lapses in coherence or context. Characters are mostly consistent, though there might be minor changes in characters, or new characters appearing without any particular reason.

    3. High: The generated dialogue is highly coherent, stays on topic, and has a natural flow. Characters maintain consistent personalities and the interaction feels authentic.

    These quality levels include the following elements:

    1. Coherence: This refers to the logical connection between ideas and statements in the dialogue. A coherent conversation flows smoothly from one point to the next, with each statement relating meaningfully to previous ones and the overall topic.

    2. Context: This involves the LLM's appearance of understanding and maintaining the setting, background, and circumstances of the conversation. Good context management means the dialogue remains relevant to the given scenario and doesn't introduce inappropriate or unrelated elements.

    3. Consistency: This refers to how well the LLM maintains consistent character traits, knowledge, and personalities throughout the dialogue. High consistency means characters don't suddenly change their behavior, opinions, or background information in ways that contradict earlier statements.

    4. Topic Adherence: This refers to how well the generated dialogue stays focused on the given topic without veering off into unrelated subjects.

    5. Natural Flow: This refers to whether the conversation progresses in a way that feels natural and realistic, similar to how real people would converse.

The table below shows our findings for the LLMs we test (ChatGPT 3.5 and Mistral 7B):

Table 1

| Name | Disk Space | RAM | Quality | Notes |
| --- | --- | --- | --- | --- |
| ChatGPT 3.5 | - | - | High | Unavailable locally |
| mistral-7b-v0.1 Q4_K_S | 4.14 GB | 6.64 GB | Medium | Somewhat unstable for two-character dialogue |
| mistral-7b-v0.1 Q4_K_M | 4.37 GB | 6.87 GB | Medium-high | |
| mistral-7b-v0.1 Q5_0 | 5.00 GB | 7.50 GB | Medium-high | Legacy (requires CUDA versions < 12.0) |
| mistral-7b-v0.1 Q5_K_S | 5.00 GB | 7.50 GB | Medium | Somewhat unstable for two-character dialogue |
| mistral-7b-v0.1 Q5_K_M | 5.13 GB | 7.63 GB | High | |
| mistral-7b-v0.1 Q8_0 | 7.70 GB | 10.20 GB | High | |

KSPR uses Mistral Q4_K_M because we find the quality of its generated text high enough and its resource needs fit the requirements of KSPR.
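For illustration, a minimal sketch of running such a quantized checkpoint locally, assuming the llama-cpp-python bindings (the bindings and checkpoint file name are our assumptions for illustration, not a description of the exact KSPR code):

```python
from llama_cpp import Llama

# Load a 4-bit quantized Mistral checkpoint (file name illustrative; any
# mistral-7b GGUF Q4_K_M build from llama.cpp works the same way).
llm = Llama(model_path="mistral-7b-v0.1.Q4_K_M.gguf", n_ctx=4096)

out = llm(
    "Generate a one-paragraph introduction for a radio programme "
    "called 'Stochastic Pirate Radio'.",
    max_tokens=512,
    temperature=0.8,
)
print(out["choices"][0]["text"])
```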

Expressive Speech synthesis

We compare four text-to-speech (TTS) frameworks for synthesizing the voices of the radio commentators: Bark, Matcha, Tortoise-tts and XTTS. For each framework, we consider its latency, audio quality, and the expressiveness of the results. Latency is the time it takes to synthesize speech. If the ratio of the latency to the playback duration of the speech is less than or equal to one, the synthesis is called realtime. We judge audio quality and expressiveness qualitatively by listening to the sound quality (paying attention to synthetic artifacts), and listening to the naturalness of the speech (paying attention to rate, rhythm, tone, and energy). The table below shows our observations of each system.

Table 2

| Framework | Latency | Audio Quality | Expressiveness |
| --- | --- | --- | --- |
| Bark | Non-realtime | Medium | Medium |
| Matcha | Realtime | Low | Low |
| Tortoise-tts | Realtime (low quality) & non-realtime (high) | Low to high | High |
| XTTS | Realtime | High | Medium-low |

We find that faster models tend to synthesize speech with a stricter pace, less tonal deviation, and unvaried loudness. Tortoise-tts has settings for trading off latency against quality. KSPR uses Tortoise-tts to simulate an engaging conversation between the commentators, since we find it creates synthesized voices with more variability in the basic prosodic features, as well as interjections and laughs.
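For illustration, a minimal sketch of synthesizing one line with the tortoise-tts package, where the preset argument is the latency/quality trade-off mentioned above (the package, voice name, and settings are assumptions for illustration):

```python
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voices

tts = TextToSpeech()
# "train_dotrice" is one of the package's bundled voices (name illustrative)
voice_samples, conditioning_latents = load_voices(["train_dotrice"])

# preset trades latency for quality: "ultra_fast", "fast", "standard", "high_quality"
speech = tts.tts_with_preset(
    "Welcome back to Stochastic Pirate Radio!",
    voice_samples=voice_samples,
    conditioning_latents=conditioning_latents,
    preset="fast",
)
# Tortoise returns a (1, 1, N) tensor at 24 kHz
torchaudio.save("line.wav", speech.squeeze(0).cpu(), 24000)
```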

Music synthesis

Music for KSPR can come from a variety of different online AI music generation services, such as Suno AI, Boomy, and Stable Audio. We review each of these AI music services below.

Suno AI, launched in late 2023, leverages AI to generate music, speech, and sound effects from textual prompts. Users provide a written description of the song, with or without lyrics, and receive two audio recordings, each up to 1 minute 20 seconds long, often featuring vocals singing or rapping. A recording can be extended by using a continuation feature, which adds additional segments.5

Boomy, established in 2019, offers an AI-powered music generation platform for users to create original songs, which can then be distributed to digital service providers to stream. Users can select from pre-set styles such as Electronic Dance, Rap Beats, Lo-Fi, or Global Groove to customize their compositions. Boomy does not generate vocals, but users can add vocals by recording themselves singing.6

Stable Audio (https://www.stableaudio.com/), developed by Stability AI, employs generative AI technology for music creation. Users input a text prompt and a duration, and the system generates high-quality stereo tracks at 44.1 kHz. The tool uses latent diffusion audio models trained on data from AudioSparx, a prominent music library. In their prompts, users can specify details such as genres, instruments, moods, and specific musical terms to guide the AI in creating the desired audio output.7

Engineered pipeline at KSPR

Image 2

The production pipeline for Stochastic Pirate Radio (KSPR).

The KSPR pipeline is diagrammed in Image 2. The radio programming is directed by a generator, which is a Python script that creates a radio session through multiple steps. A session is a compilation of various individual coherent segments (programmes, e.g. an advertisement, a news update, a weather report, a music mix, and a talk show). The generator follows a fixed schedule, for example:

  1. KSPR station ID jingle

  2. Introduction dialogue between KSPR personalities

  3. Advertisement

  4. KSPR station ID jingle

  5. Talk radio dialogue between KSPR personalities

  6. KSPR station ID jingle

  7. Music mix

  8. KSPR station ID jingle

  9. Weather and news reportage

  10. Advertisement

  11. KSPR station ID jingle

  12. Music mix

The first step of the generator is to decide the contents of the radio session by choosing at random a music genre, a topic of discussion (from a list), and what news to present. In the case of music programmes, a mix of a specified duration and genre is created. The LLM Mistral is prompted to generate a script, which is then saved. The generator uses a “session” object to keep track of important details for content generation, e.g., the current date and time, the scripts of previous programmes, and the current position in the schedule being generated. This memory is a way to bring coherence between KSPR programmes.
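A minimal sketch of such a session object, assuming a Python dataclass (field names illustrative, not the actual KSPR source):

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Session:
    start_time: datetime               # current date and time for the scripts
    topic: str                         # today's randomly drawn topic
    music_genre: str                   # genre for the music mixes
    schedule_position: int = 0         # where we are in the fixed schedule
    previous_scripts: list[str] = field(default_factory=list)

    def remember(self, script: str) -> None:
        # Feed earlier programme scripts back into later prompts for coherence.
        self.previous_scripts.append(script)
```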

Next, the generator converts the generated scripts into speech using the Tortoise-TTS model, with different voices for different speakers. The generated audio excerpts are then processed and concatenated into one radio session. In particular, for the programme types “Talk”, “Advertisement” and “Weather”, we layer in background music generated with Suno AI. We apply dynamic range compression to the speech to simulate speaking into a condenser microphone, and then normalize the result. Each radio programme opens with a background track for 2 seconds, followed by a volume fade to -14 dB over another 2 seconds, with the speech signal mixed in. Here is an example:

Audio 1
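The mixing step described above might look like the following sketch, assuming pydub (file paths, compression settings and the speech entry point are illustrative):

```python
from pydub import AudioSegment
from pydub.effects import compress_dynamic_range, normalize

speech = AudioSegment.from_wav("talk_segment.wav")
speech = normalize(compress_dynamic_range(speech))   # "condenser mic" treatment

bed = AudioSegment.from_wav("suno_background.wav")
# Open with 2 s of music at full level, then fade to -14 dB over 2 s ...
bed = bed.fade(to_gain=-14.0, start=2000, duration=2000)
# ... and mix the speech in as the fade begins.
programme = bed.overlay(speech, position=2000)
programme.export("programme.wav", format="wav")
```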

The generation and broadcasting pipeline is managed as follows, diagrammed in Image 3 and sketched in code below:

  1. a first complete session that is ready to broadcast is generated;

  2. two parallel processes start:

    • a playback process that reproduces the latest generated session;

    • a generator process that starts the aforementioned radio pipeline;

  3. depending on which of those two processes terminates first:

    • if the generator produces a new session before the current session stops playing, the program waits for the playback to terminate;

    • if the playback stops before a new session is available, a pre-recorded emergency segment starts playing while waiting for the generator process to terminate;

  4. when all processes have terminated, the emergency segment is killed (if playing) and the program goes back to step 1.

Image 3

The broadcasting pipeline for the continuous radio stream.
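A minimal sketch of this supervision loop, assuming Python multiprocessing (function bodies, device name and file paths are illustrative stand-ins, not the actual KSPR source):

```python
import multiprocessing as mp
import shutil
import subprocess

def play(path: str) -> None:
    # aplay writes to the ALSA loopback device that OBS captures
    # (device name illustrative).
    subprocess.run(["aplay", "-D", "hw:Loopback,0", path], check=False)

def generate_session(out_path: str) -> None:
    # Stand-in for the full generation pipeline (LLM -> TTS -> mixing).
    shutil.copy("prebuilt_session.wav", out_path)

def broadcast_forever() -> None:
    generate_session("current.wav")                     # step 1
    while True:
        playback = mp.Process(target=play, args=("current.wav",))
        generator = mp.Process(target=generate_session, args=("next.wav",))
        playback.start(); generator.start()             # step 2
        playback.join()                                 # wait for playback
        if generator.is_alive():                        # step 3: no new session yet
            filler = mp.Process(target=play, args=("emergency.wav",))
            filler.start()                              # emergency segment
            generator.join()
            filler.terminate()                          # step 4: kill the filler
        else:
            generator.join()
        shutil.move("next.wav", "current.wav")          # back to step 1

if __name__ == "__main__":
    broadcast_forever()
```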

We broadcast KSPR via Open Broadcaster Software (OBS), streaming to a dedicated YouTube account. The stream uses the radio logo (Image 1) as its only visuals. OBS is configured to use a specific audio device as its audio source: a non-physical loopback device we created beforehand. The playback process uses the program aplay (available through the ALSA audio suite) to reproduce the generated audio on that device, which is then captured by OBS.

In the case that no content is ready to be broadcast, the generator transmits a pre-generated segment to fill in while the generator process creates the new session. This segment presents the following message synthesized using Tortoise-TTS:

We are experiencing technical difficulties. Our programs will begin again shortly. We are sorry for the inconvenience.

This message is followed by an AI-generated music mix in the style “jazz”. As soon as the generator process terminates, the emergency segment transitions into the new session.

We now describe how each kind of content is generated.

KSPR Station ID jingle

A radio station ID jingle is a brief audio piece identifying the station, typically featuring catchy music, slogans and sound effects. We generate several station ID jingles offline using Suno AI. For instance, when we specify the style prompt, “Radio call sign, speech with male speaker”, and lyrics

You’re listening to Stochastic Pirate Radio. (Stochastic Pirate Radio!) ((Stochastic Pirate Radio!)) (((Stochastic Pirate Radio!)))

Suno AI produced the following output:

Audio 2

An example station ID jingle for KSPR.

Each radio station ID jingle is selected at random by the generator from a folder of several pre-generated jingles.

Introduction dialogue

The script for the introduction to each programme is generated by the LLM Mistral 7b Instruct v2 prompted with the following:

You are a radio conductor. Introduce a radio program called 'Stochastic Pirate Radio'. Today is DATE, time is TIME. You are "The Radio Guy". Introduce yourself and introduce the day's topic, which is TOPIC. Introduce what music we'll listen to today, which is MUSIC music. The segment should be 1 minute long and in English. Introduce the next segment, which is NEXT_PROGRAM. Use the tag <narrator> to specify who the speaker is.

The DATE and TIME fields contain information from the “session” object. The TOPIC is selected at random by the generator from a list of 500 topics we pre-generated using ChatGPT 3.5. The MUSIC description comes from the labels of the Jamendo dataset (described below in the Music mix subsection). The NEXT_PROGRAM comes from the pre-defined schedule. The generator then synthesizes the script using Tortoise-TTS through the Coqui TTS library. Here is an example introduction:

Audio 3
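For illustration, filling the prompt template above might look like the following sketch (template abbreviated with [...]; field values illustrative):

```python
from datetime import datetime

# Abbreviated version of the prompt above; the full text is as given.
INTRO_TEMPLATE = (
    "You are a radio conductor. Introduce a radio program called "
    "'Stochastic Pirate Radio'. Today is {date}, time is {time}. [...] "
    "the day's topic, which is {topic}. [...] {music} music. [...] "
    "Introduce the next segment, which is {next_program}. "
    "Use the tag <narrator> to specify who the speaker is."
)

now = datetime.now()
prompt = INTRO_TEMPLATE.format(
    date=now.strftime("%A, %B %d, %Y"),
    time=now.strftime("%H:%M"),
    topic="urban beekeeping",      # drawn at random from the pre-generated topics
    music="jazz",                  # a Jamendo genre label
    next_program="Advertisement",  # from the fixed schedule
)
# prompt is then sent to the quantized Mistral model for script generation
```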

Talk radio dialogue

As for the introduction dialogue, the script for the talk radio content is generated by prompting the LLM:

Generate a radio conversation of two commenters, Anna and Phillip. They are discussing TOPIC. Don't bring more people into the conversation, only two voices. Only refer to them by their names, avoid adding time or other elements. Start with greetings to the audience. The conversation should contain at least 1000 words. Following are earlier conversations from today's PREVIOUS TOPIC. Avoid reusing the same topics, but reference what they said.

(PREVIOUS CONVERSATIONS)

...

(END OF PREVIOUS CONVERSATIONS).

Use the tags 'Philip:' and 'Anna:' to indicate who is speaking. Introduce what is coming next, which is TOPIC.

Each line of the script is then synthesized with Tortoise-TTS using a vocal model specific to the relevant speaker, and the resulting clips are automatically stitched together into a dialogue. Finally, this is mixed with a pre-generated background music track from Suno AI, looped as needed.8 Here is an example of a synthesized conversation:

Audio 4

An excerpt from a generated conversation between two KSPR personalities.
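A minimal sketch of this parse-synthesize-stitch step, assuming pydub (the synthesize() helper is a hypothetical stand-in for the Tortoise-TTS call; voice names illustrative):

```python
import re
from pydub import AudioSegment

VOICES = {"Anna": "voice_a", "Philip": "voice_b"}  # voice names illustrative

def synthesize(text: str, voice: str) -> AudioSegment:
    # Stand-in for the Tortoise-TTS call sketched earlier; returns silence here.
    return AudioSegment.silent(duration=1000)

def stitch_dialogue(script: str) -> AudioSegment:
    dialogue = AudioSegment.silent(duration=500)
    for line in script.splitlines():
        # Match the 'Anna:' / 'Philip:' speaker tags requested in the prompt.
        match = re.match(r"^(Anna|Philip):\s*(.+)$", line.strip())
        if match:
            speaker, text = match.groups()
            clip = synthesize(text, VOICES[speaker])
            dialogue += clip + AudioSegment.silent(duration=300)  # short pause
    return dialogue
```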

News and Weather Reportages

The script for news reportage comes from prompting the LLM with:

You are presenting the news as part of a radio program. The program is called Today’s News. Your name is The News Guy. Today is DATE. Present and discuss the following news:

NEWS TOPIC 1
NEWS TOPIC 2

Use the tag <narrator> to specify who the speaker is.

The fields for the news topics are filled in with news stories retrieved from an RSS feed selected randomly from a collection of publicly available ones (see BBC, New York Times and Sky News).
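For illustration, retrieving the headlines might look like the following sketch, assuming the feedparser package (feed URLs are examples of the kind used):

```python
import random
import feedparser

# Publicly available RSS feeds (URLs are examples).
FEEDS = [
    "http://feeds.bbci.co.uk/news/rss.xml",
    "https://rss.nytimes.com/services/xml/rss/nyt/World.xml",
]

feed = feedparser.parse(random.choice(FEEDS))
news_topics = [entry.title for entry in feed.entries[:2]]  # fills NEWS TOPIC 1 and 2
```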

The script for weather reportage comes from prompting the LLM with:

Create a weather forecast for a radio program. Today is DATE. State the weather for today and for the upcoming days. The segment should be in English and at most 3 minutes long. Express temperatures in Celsius. Use the tag <narrator> to specify who the speaker is.

The generated scripts for news and weather are then synthesized with Tortoise-TTS. An example news and weather report is the following:

Audio 5

Advertisements

To create the text of an advertisement, the generator prompts the LLM with:

Create an advertisement for a fictional product in a radio program. State the product name and what it can be used for. You can include price, contact information, slogans as you wish. Make the ad relevant to the day's topic, which is TOPIC. The advertisement should be in English and at most 1 minute long. Use the tag <narrator> to specify who the speaker is.

The TOPIC field comes from the session object. As for “Talk”, the generator synthesizes the script with Tortoise-TTS and mixes it with a background music track pre-generated by Suno AI (looped as needed to fill the time). An example advertisement is given by the following:

Audio 6

Music mixes

All music appearing in a KSPR music mix involves automated curation from a collection of 3,264 recordings (each between 30 and 158 seconds long, for a total duration of 34 hours, 26 minutes and 22 seconds). We downloaded this collection from recordings made publicly available on the Suno AI website and its associated Discord server. We process the collection by passing each recording through the openly available Essentia MTG-Jamendo genre classification model, which represents each recording as a vector of 87 weights, each associated with a tag, such as “60s”, “alternative”, “bossanova”, and “choir”. The higher the weight, the more relevant the tag according to the classifier. These tag vectors are stored in a metadata file which can be queried.

The generator curates recordings from our collection according to a randomly selected Jamendo “style” and the tag vectors. The probability of selecting any particular recording depends on its weight in the given “style” tag. For instance, a “rock” style music mix will consist mostly of recordings that have a significant weight in “rock”, but could also feature recordings that do not have such a significant weight in “rock”. Finally, the generator concatenates the curated recordings, with linear fading between each, to create a continuous mix of some pre-specified duration.
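A minimal sketch of this curation and concatenation, assuming pydub and a JSON metadata file (file format, paths and the crossfade length are illustrative):

```python
import json
import random
from pydub import AudioSegment

# Metadata maps each recording to its 87-tag weight vector (format illustrative):
# {"song1.mp3": {"rock": 0.71, "60s": 0.02, ...}, ...}
metadata = json.load(open("tag_vectors.json"))
style = "rock"  # randomly selected Jamendo "style"

paths = list(metadata)
weights = [metadata[p].get(style, 0.0) for p in paths]  # selection probabilities

mix = AudioSegment.empty()
target_ms = 10 * 60 * 1000  # e.g., a 10-minute mix
while len(mix) < target_ms:
    path = random.choices(paths, weights=weights, k=1)[0]
    track = AudioSegment.from_mp3(path)
    # Concatenate with a 2 s linear crossfade between recordings.
    mix = track if len(mix) == 0 else mix.append(track, crossfade=2000)
mix.export("music_mix.wav", format="wav")
```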

Qualitative and technical challenges

The process of engineering KSPR presented both qualitative and technical challenges. From a qualitative perspective, our aim was to ensure that generated conversations and segments were coherent not only within themselves but also throughout the whole session. Our first attempts at prompting Mistral resulted in plausible conversations or excerpts that were completely disconnected from each other, with no memory of previous speakers or topics. We solved this issue by explicitly keeping track of relevant information outside of the model and prompting it accordingly for each programme. To add an additional layer of realism and complexity, we also feed previously generated excerpts back into the model, which allows the model to reference itself and re-address or re-interpret previous topics of conversation. One issue arising from this approach is the occasional replication of some conversations when the conversation topic is the same. To solve this issue and add more variability in general, we randomize the topic of conversation each time such content is to be generated. We find that this produces conversations that are less repetitive and more interesting.

From a technical perspective, we note that while LLMs become more powerful with time, they remain unreliable at following explicit formatting instructions unless finetuned for the purpose. In our case, Mistral consistently ignored our directives not to include any non-speech material in brackets (e.g., "[lively music playing]" or "[chuckles]"). For the system to work with a certain degree of reliability, we find that very precise prompting is necessary (e.g., "use the <narrator> tag"), and that we also need to filter out unwanted material post-inference.
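For illustration, such post-inference filtering might be as simple as the following sketch (patterns illustrative):

```python
import re

def clean_script(script: str) -> str:
    # Strip bracketed stage directions like "[lively music playing]" or
    # "[chuckles]" that the model inserts despite instructions.
    script = re.sub(r"\[[^\]]*\]", "", script)
    # Drop parenthesized asides as well.
    script = re.sub(r"\([^)]*\)", "", script)
    # Remove lines left empty by the substitutions.
    return "\n".join(line.strip() for line in script.splitlines() if line.strip())
```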

Other issues come from the amount of GPU memory needed to load and use the models. We developed the entire radio on a workstation equipped with an NVIDIA GeForce GTX 1080 GPU with 8 GB of memory. This prevented us from loading all models at once; there was not even enough space to allocate Mistral alongside the necessary libraries (e.g., PyTorch, Transformers). To solve this issue, we load a lossy 4-bit quantized version of Mistral and divide the pipeline into a text generation step and an audio generation step. We can then load the models separately and exploit batching and caching when generating.

Another limitation is the inference time needed by Tortoise-TTS. In our case, it takes more time to synthesize speech than the duration of the synthesized speech. This means that we cannot generate and stream content directly, but need to plan sessions ahead of time and make sure there is enough time for the next session to be ready while the current one is playing. An easy solution was to include more music in the mixes, since the task of retrieving and concatenating the pre-generated music is quick (a 10-minute mix can be produced in a few seconds). Thus, by lengthening the currently playing session with more music, we can ensure that the next radio session will be ready in time.
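For illustration, the required music padding can be computed as in the following sketch (the real-time factor is an assumed estimate, not a value measured in this work):

```python
# Assumed real-time factor for the TTS step: seconds of synthesis
# per second of speech (an illustrative estimate).
TTS_RTF = 2.5

def music_padding_seconds(next_speech_s: float, current_session_s: float) -> float:
    """Extra music to append to the current session so the next one is ready
    in time; retrieving and concatenating pre-generated music is near-instant."""
    generation_time = next_speech_s * TTS_RTF
    return max(0.0, generation_time - current_session_s)

# e.g., 8 minutes of speech to synthesize while a 15-minute session plays:
print(music_padding_seconds(8 * 60, 15 * 60))  # 300.0 s of music needed
```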

Reflections

The survey above presents some contemporary examples of endless content streams generated by AI. The (so far) endless stream of metal music by DADABOTS toys with the thought of “eliminating humans from black metal” [6] and produces riffs and vocal sections that sound human-made, at least to an untrained ear, but still have an uncanny aspect to them. Content parroting real people, such as the infinite conversation between Herzog and Žižek, and the Biden vs. Trump debate, further raises questions about the extent to which content can be regenerated from training data, and at what point it loses its attachment to the real world. The interactive aspects of asking questions of Jesus or Satan on Twitch have a more comedic tone, as people often try to push the limits of what these figures can say in a playful (sometimes nefarious) way by asking strange questions.

Incentives for exploring endless AI streams come from companies that want to be recognized as being at the bleeding edge of innovation, and whose main drive is to reshape and monetize the market of content streaming, whether that is music, podcasts or video content. The founder of AInfinite.TV, Steve Mills, is an example of such an enterprise promising constantly renewed content. Mills writes in a press release [3]:

Our platform throws open the doors to a universe of limitless visual and musical styles around the clock, crafted by AI and constantly remixed. No two viewing experiences will ever be the same — it’s like stepping into a portal of infinite creative expression. […] Forget stale reruns and storylines. AInfiniteTV is a gateway to the future, where AI fueled creativity pushes boundaries and ignites your imagination.

The videos AInfinite.TV streams simulate commercial content, yet without featuring any real products. Likely having been trained on stock video footage, which is itself disconnected from any real products or commercials, the content of AInfinite.TV is essentially empty and detached from reality, and presents an uncanny zeitgeist of our current era. Like AInfinite.TV, KSPR is full of “empty” content detached from “real” conversations and music — although the news reportage remains connected to reality — which creates a strong dissonance with the rest of the radio. For further contextualization, we turn to a concept that can provide some understanding of the uncanny feeling generative AI can produce: sonic hauntology.

Listening to KSPR we perceive each model’s outputs as new, but behind them shines the ghostly presence of their training data, and in turn of the world that created it. Each element of a programme — the hosts’ voices and their conversations, the music mixes, the advertisements, etc. — is simultaneously appearing for the first time and “returning” from a snapshot of the past. This presence is what Derrida called hauntology [7]. Fisher extends these ideas with sonic hauntology [8] as a way to understand a current within the British underground electronic music scene of the 2000s based on reuses of recorded music. Writing in 2020, Rubinstein [9] provides this definition:

Music was made strange and uncanny by sonic hauntology’s techniques of stitching, superimposing, and redirecting discordant trajectories from music’s cultural timeline—putting different eras of recorded history back together in ways that destabilized their legibility.

Rubinstein then puts forward the idea that AI-generated music is potentially a new version of sonic hauntology:

[…] AI-generated music […] bears the greatest potential to reinitiate the spectral energies of sonic hauntology and its technological uncanny once again. Using deep learning algorithms, neural nets, and other computational methods to appropriate the sonic archive, AI music resituates sonic hauntology’s technological uncanny within a contemporary context that is dominated by public anxiety over AI’s increasing power and ubiquity.

While it is arguable whether the generative AI tools used by KSPR have any intention of invoking the “technological uncanny”, the outputs from each tool in our system certainly feel like it. That these tools lack such intention is perhaps involuntary, as opposed to the musicians who constituted the sonic hauntology current, but Rubinstein finds this just as meaningful an example of the concept, calling Sony’s Flow Machines song “Daddy’s Car” “Hauntology par excellence”. Moreover, KSPR being an amalgam of different generative AI systems, each with clear idiosyncrasies, amplifies the uncanny. The “out of joint” nature of hauntology is present equally in the music, the voice and the technology featured in KSPR. In Rubinstein’s words:

[…] AI music’s uneasy listening carries the capacity to estrange its listeners from artificial intelligence’s exclusive articulation to capital accumulation, and opens a doorway to imagine it for alternative uses in an alternative world.

KSPR feels like a familiar experience, but one where every element is slightly out of place through a series of aesthetic and logical non-sequiturs. The out-of-joint nature of hauntology permeates the content as it is detached from any real contexts or events. Much like in the British underground electronic music scene of the 2000s, as described by Fisher, “There is no attempt to smooth away the textural discrepancy between the crackly sample and the rest of the recording” [10]. In engineering KSPR we do not try to smooth the textural discrepancies (imperfections) between our AI components, but rather embrace them as an aesthetic.

Looking far into the future capabilities of a system like ours, we can draw from the poetic imagery conjured by Khan [11] in relation to artificial super intelligence. In particular, we can recognize elements of KSPR in two of the metaphors she describes: Frontline and Scaffolding. The Frontline metaphor hinges on the existence of a "tension barrier" created by AI. While the semantics of war are not relevant to something as mundane as an AI radio, picturing an ever-moving front of possibility and friction with existing cultural practices captures both the conflictual relationship with artists and institutions and the shift in what AI can and cannot do. The Scaffolding metaphor is intended as a platform with certain boundaries for AI to grow and evolve on, even after humans are out of the picture, like a plant on a trellis. We can imagine an evolution of KSPR which is allowed to write new content based on a set of rules and access to real-world information of some kind. Thus the system could self-sustain within the scaffold we put in place and lead to unprecedented results.

Conclusion

This paper (a product of artistic research) discusses how we have orchestrated publicly available generative AI tools to simulate a commercial radio station, complete with dialogue, news, weather, advertisements, and music programming. Our station KSPR runs autonomously and in theory is endless; in reality it is constrained by the infrastructure — from our own computer running the generation-to-broadcast pipeline, to YouTube maintaining the live feed. Nonetheless, the outlook for longevity is good given that KSPR is produced at a university and that the live YouTube feed of DADABOTS’ RELENTLESS DOPPELGANGER has been active for nearly five years.

Each programme of KSPR features diverse content, and is generated and assembled faster than real-time. Directions in which KSPR could develop include the implementation of narrative arcs, such that dialogue across programmes is linked by a dramatic story line (e.g., a budding romance between the radio personalities). KSPR could also integrate caller questions into the programme, or incantations from religious figures. To our ears, KSPR sounds like commercial radio even though there is no real corporate entity behind it. It remains to be seen whether KSPR is engaging for longer than its novelty factor to listeners who, unlike us, have no stake in its development — but the proof is in the listening. Furthermore, why someone would listen to KSPR is an open question. One suggestion is that KSPR could give listeners an up-to-date view of the state of the art in generative AI for speech and music synthesis. Further promotion of KSPR will create opportunities to explore these open questions.

Acknowledgments

This paper is an outcome of MUSAiC, a project that has received funding from the European Research Council under the European Union’s Horizon 2020 research and innovation program (Grant agreement No. 864189).

Ethics Statement

There are many important ethical dimensions to this work. One is that KSPR might be realistic enough that it can be misinterpreted as a real radio station. To mitigate this risk we have added the following disclaimer to the live feed:

KSPR is an orchestration of a variety of state-of-the-art generative artificial intelligence systems running autonomously. All of the content heard on KSPR is completely fabricated by artificial intelligence. Any similarity to actual persons, living or dead, comes from the training data of those systems — which is out of the control of KSPR. One should not plan their day according to the weather reportage heard at KSPR. While the news reportage may be based on real events, its delivery is purely synthetic. Some of the music heard at KSPR may contain inappropriate topics and lyrics. Listener discretion is advised.

Another ethical dimension is in the text-to-speech synthesis models we employ. First, we use the Tortoise-TTS model to synthesize speech using voice targets based on data collected from real people (in our case, Tom Hanks and Emma Stone). In the dialogue, we have altered the names of the radio personalities to Philip and Anna. We found these particular vocal models to be the most expressive of those available, but they can be substituted with less problematic models. Second, TTS models can mispronounce words, which might be interpreted as offensive. Any mispronunciation is unintentional. Finally, all of the dialogue at KSPR is in English, in part because of the quality of the associated language models (script writing and TTS synthesis), but also because the locale of KSPR was envisioned as being in an English-speaking place. This, however, is not a requirement.

Another ethical dimension is in the compilation of the KSPR music collection, which was assembled by scraping material generated by users of the Suno AI Discord. This collection and its broadcast by KSPR likely fall afoul of the Terms of Service of Suno AI; however, KSPR is decidedly not a commercial enterprise in that it does not monetize the produced content or the engineered pipeline. It is intended, first, to be a playful exploration of the aesthetics of AI-generated content built using open-source, publicly available technology, and second, a provocation about the use of generative AI for creating such a thing. Nonetheless, the KSPR music collection could be swapped entirely with music generated by other models, such as MusicGen.

Along these same lines, a problematic aspect is that the music we scraped from Suno AI users includes content (lyrics) that is controversial (e.g., making fun of psychological disorders).9 That the radio pipeline, including the assembly of music mixes based on mid-level acoustic features, is autonomous does not free the radio creators of responsibility. We have thus taken efforts to curate the music in the collection, filtering out materials we can find whose lyrics match terms in the word list specified in the Profanity Check python package. Some songs feature lyrics in non-English languages, however, which might also be highly controversial but evade our editorializing in cases where we do not have the language ability or cultural sensitivity to understand them.
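For illustration, a minimal sketch of such a filter, assuming the profanity-check Python package (lyric loading omitted; how the package flags text is internal to it):

```python
from profanity_check import predict

def is_acceptable(lyrics: str) -> bool:
    # predict() returns 1 for text the package flags as profane/offensive.
    return predict([lyrics])[0] == 0

# Keep only tracks whose (English) lyrics pass the filter.
collection = {"song1.mp3": "lyrics ...", "song2.mp3": "lyrics ..."}
kept = {path: text for path, text in collection.items() if is_acceptable(text)}
```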

In contrast to the endless interactive Twitch content streams reviewed above, where users can interact in realtime with characters through a chat interface, KSPR does not include such functionality. This mitigates the risk of nefarious content being introduced into the pipeline. Furthermore, the appropriateness of the content generated in the pipeline for text and dialogue is not editorialized, and so depends on the “guardrails” added to the generative AI systems we employ.

Another ethical dimension involves sustainability. KSPR, being an autonomously produced and broadcast stream, takes energy to run. The carbon footprint of KSPR is greatly reduced by its use of pre-generated music. However, we do not believe KSPR needs to be running much longer than is necessary for the review of this paper. To reduce its carbon footprint, KSPR will eventually be taken off the air, but a few recordings of what the experience is like will be preserved.

None of the authors of this paper have financial or non-financial conflicts of interest. No one has a financial stake in any of the companies producing the models used herein. This work was financed by public money from the European Research Council under the European Union’s Horizon 2020 research and innovation program (Grant agreement No. 864189).

