Press Clipping
02/13/2019
Article
Why is this Valentine’s song made by an AI app so awful?

Do you hate AI as a buzzword? Do you despise the millennial whoop? Do you cringe every time Valentine’s Day arrives? Well – get ready for all those things you hate in one place. But hang in there – there’s a moral to this story.

Now, really, the song is bad. Like laugh-out-loud bad. Here’s iOS app Amadeus Code “composing” a song for Valentine’s Day, which says love much in the way a half-melted milk chocolate heart does, but – well, I’ll let you listen, millennial pop cliches and all:

Fortunately this comes after yesterday’s quite stimulating ideas from a Google research team – proof that you might actually use machine learning for stuff you want, like improved groove quantization and rhythm humanization. In case you missed that:

Now, as a trained composer / musicologist, I do find this sort of exercise fascinating. And on reflection, I think the failure of this app tells us a lot – not just about machines, but about humans. Here’s what I mean.

Amadeus Code is an interesting idea – a “songwriting assistant” powered by machine learning, delivered as an app. And it seems machine learning could generate, for example, smarter auto accompaniment tools or harmonizers. Traditionally, those technologies have been driven by rigid heuristics that sound “off” to our ears, because they aren’t able to adequately follow harmonic changes in the way a human would. Machine learning could – well, theoretically, with the right dataset and interpretation – make those tools work more effectively. (I won’t re-hash an explanation of neural network machine learning, since I got into that in yesterday’s article on Magenta Studio.)

https://amadeuscode.com/

You might well find some usefulness from Amadeus, too.

This particular example does not sound useful, though. It sounds soulless and horrible.

Okay, so what happened here? Music theory at least cheers me up even when Valentine’s Day brings me down. Here’s what the developers sent CDM in a pre-packaged press release:

We wanted to create a song with a specific singer in mind, and for this demo, it was Taylor Swift. With that in mind, here are the parameters we set in the app.

Bpm set to slow to create a pop ballad
To give the verses a rhythmic feel, the note length settings were set to “short” and also since her vocals have great presence below C, the note range was also set from low~mid range.
For the chorus, to give contrast to the rhythmic verses, the note lengths were set longer and a wider note range was set to give a dynamic range overall.

After re-generating a few ideas in the app, the midi file was exported and handed to an arranger who made the track.

Wait – how exactly is Taylor Swift in there, you say?

Taylor’s vocal range is somewhere in the range of C#3-G5. The key of the song created with Amadeus Code was raised a half step in order to accommodate this range making the song F3-D5.

From the exported midi, 90% of the topline was used. The rest of the 10% was edited by the human arranger/producer: The bass and harmony files are 100% from the AC midi files.
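For the curious, the range arithmetic in that press release is easy to check. Here is a minimal sketch, assuming the common convention that middle C is C4 = MIDI note 60 (the release doesn't say which octave numbering it uses). The helper names are hypothetical and have nothing to do with Amadeus Code's own software:

```python
# Semitone offsets of the natural notes within one octave.
NOTE_OFFSETS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}
SHARP_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def note_to_midi(name: str) -> int:
    """Convert a name like 'C#3' to a MIDI number (assumption: C4 = 60)."""
    letter, rest = name[0].upper(), name[1:]
    accidental = 0
    # Consume any sharps (#) or flats (b) that follow the letter.
    while rest and rest[0] in "#b":
        accidental += 1 if rest[0] == "#" else -1
        rest = rest[1:]
    octave = int(rest)
    return 12 * (octave + 1) + NOTE_OFFSETS[letter] + accidental


def midi_to_note(number: int) -> str:
    """Convert a MIDI number back to a note name, spelled with sharps."""
    return f"{SHARP_NAMES[number % 12]}{number // 12 - 1}"


def transpose(low: str, high: str, semitones: int) -> tuple[str, str]:
    """Shift a (low, high) melody range by a number of semitones."""
    return (
        midi_to_note(note_to_midi(low) + semitones),
        midi_to_note(note_to_midi(high) + semitones),
    )


def fits(melody: tuple[str, str], voice: tuple[str, str]) -> bool:
    """True if the melody range lies entirely inside the vocal range."""
    return (
        note_to_midi(voice[0]) <= note_to_midi(melody[0])
        and note_to_midi(melody[1]) <= note_to_midi(voice[1])
    )


voice = ("C#3", "G5")   # vocal range quoted in the press release
final = ("F3", "D5")    # melody range after raising the key a half step

print(transpose(*final, -1))  # range before the half-step raise: ('E3', 'C#5')
print(fits(final, voice))     # True
```

So the melody as generated would have spanned roughly E3 to C#5, and the transposed version sits inside the quoted vocal range. That's arithmetic, mind you, not songwriting.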

Now, first – these results are really impressive. I don’t think traditional melodic models – theoretical and mathematical in nature – are capable of generating anything like this. They’ll tend to fit melodic material into a continuous line, and as a result will come out fairly featureless.

No, what’s compelling here is not so much that this sounds like Taylor Swift, or that it sounds like a computer, as that it sounds like one of those awful commercial music beds trying to be a faux Taylor Swift song. It’s gotten some of the repetition, some of the basic syncopation, and oh yeah, that awful overused millennial whoop. It sounds like a parody, perhaps because partly it is – the machine learning has repeated the most recognizable cliches from these melodic materials, strung together, and then that was further selected / arranged by humans who did the same. (If the machines had been left alone without as much human intervention, I suspect the results wouldn’t be as good.)

In fact, it picks up Swift’s tics – some of the funny syncopations and repetitions – but without stringing them together, like watching someone do a bad impression. (That’s still impressive, though, as it does represent one element of learning – if a crude one.)

To understand why this matters, we’re going to have to listen to a real Taylor Swift song. Let’s take this one:

Okay, first, the fact that the real Taylor Swift song has words is not a trivial detail. Adding words means adding prosody – so elements like intonation, tone, stress, and rhythm. To the extent those elements have resurfaced as musical elements in the machine learning-generated example, they’ve done so in a way that no longer is attached to meaning.

No amount of analysis, machine or human, can be generative of lyrical prosody for the simple reason that analysis alone doesn’t give you intention and play. A lyricist will make decisions based on past experience and on the desired effect of the song, and because there’s no real right or wrong to how to do that, they can play around with our expectations.

Part of the reason we should stop using AI as a term is that artificial intelligence implies decision making, and these kinds of models can’t make decisions. (I did say “AI” again because it fits into the headline. Or, uh, oops, I did it again. AI lyricists can’t yet hammer “oops” as an interjection or learn the playful setting of that line – again, sorry.)

Now, you can hate the Taylor Swift song if you like. But it’s catchy not because of a predictable set of pop music rules so much as its unpredictability and irregularity – the very things machine learning models of melodic space are trying to remove in order to create smooth interpolations. In fact, most of the melody of “Blank Space” is a repeated tonic note over the chord progression. Repetition and rhythm are also combined into repeated motives – something else these simple melodic models can’t generate, by design. (Well, you’ll hear basic repetition, but making a relationship between repeated motives again will require a human.)

It may sound like I’m dismissing computer analysis. I’m actually saying something more (maybe) radical – I’m saying part of the mistake here is assuming an analytical model will work as a generative model. Not just a machine model – any model.

This mistake is familiar, because almost everyone who has ever studied music theory has made the same mistake. (Theory teachers then have to listen to the results, which are often about as much fun as these AI results.)

Music theory analysis can lead you to a deeper understanding of how music works, and how the mechanical elements of music interrelate. But it’s tough to turn an analytical model into a generative model, because the “generating” process involves decisions based on intention. If the machine learning models sometimes sound like a first-year graduate composition student, that may be because that student is steeped in the analysis but not in the experience of decision making. But that’s important. The machine learning model won’t get better, because while it can keep learning, it can’t really make decisions. It can’t learn from what it’s learned, as you can.

Yes, yes, app developers – I can hear you aren’t sold yet.

For a sense of why this can go deep, let’s turn back to this same Taylor Swift song. The band Imagine Dragons picked it up and did a cover, and, well, the chord progression will sound more familiar than before.

As it happens, in a different live take I heard the lead singer comment (unironically) that he really loves Swift’s melodic writing.

But, oh yeah, even though pop music recycles elements like chord progressions and even groove (there’s the analytic part), the results take on singular personalities (there’s the human-generative side).

“Stand by Me” dispenses with some of the tics of our current pop age – millennial whoops, I’m looking at you – and at least as well as you can with the English language, hits some emotional meaning of the words in the way they’re set musically. It’s not a mathematical average of a bunch of tunes, either. It’s a reference to a particular song that meant something to its composer and singer, Ben E. King.

This is his voice, not just the emergent results of a model. It’s a singer recalling a spiritual that hit him with those same three words, which sets a particular psalm from the Bible. So yes, drum machines have no soul – at least until we give them one.

“Sure,” you say, “but couldn’t the machine learning eventually learn how to set the words ‘stand by me’ to music?” No, it can’t – because there are too many possibilities for exactly the same words in the same range in the same meter. Think about it: how many ways can you say these three words?

“Stand by me.”

Where do you put the emphasis, the pitch? There’s prosody. What melody do you use? Keep in mind just how different Taylor Swift and Ben E. King were, even with the same harmonic structure. “Stand,” the word, is repeated as a suspension – a dissonant note – above the tonic.

And even those observations still lie in the realm of analysis. The texture of this coming out of someone’s vocal cords, the nuances to their performance – that never happens the same way twice.

Analyzing this will not tell you how to write a song like this. But it will throw light on each decision, make you hear it that much more deeply – which is why we teach analysis, and why we don’t worry that it will rob music of its magic. It means you’ll really listen to this song and what it’s saying, listen to how mournful that song is.

And that’s what a love song really is:

If the sky that we look upon
Should tumble and fall
Or the mountain should crumble to the sea
I won’t cry, I won’t cry
No, I won’t shed a tear
Just as long as you stand
Stand by me

The number of different ways to set this short snippet of text is already daunting. Nuances of that, then nuances of the performance, change the meaning and emotional impact. Machine learning, for its part, is working only from a crude transcription – and even detailed audio may not help. You could also write pages and pages of different analyses. Generating materials from analytical models is a great way to test models – and that makes all of this exciting. BUT transcriptions don’t model composition because they don’t model meanings or intent. So you shouldn’t expect meaningful output – not now, and not in the future either.

Stand by me.

Now that’s writing a love song.

So happy Valentine’s Day, humans. And machines.

PS – let’s give credit to the songwriters, and a gentle reminder that we each have something to sing that only we can:
Singer Ben E. King, Best Known For ‘Stand By Me,’ Dies At 76 [NPR]