AI Speech

PUBLISHED ON FEB 12, 2023 - LAST UPDATE FEB 12, 2023 — GENERAL

I recently found out about the ElevenLabs Speech Synthesis beta from a colleague at work (Martin Adams). It’s a really neat tool that, given some text, can create a really good human-sounding narration of it.

So I decided to have a play, and that’s how the audio got added to the AI Wars article on this site.

I signed up for the beta on a paid tier, then recorded a roughly two-minute audio clip of me reading an article I happened to have open in my browser, a piece on St James Place by Henry Taper. It was a plain, unedited first read, complete with stumbles and pauses.

My first training sample

Uploading that into the instant voice cloning on ElevenLabs gave me the voice that’s used to read the article on AI. The instant voice cloning was indeed pretty instant. As was generating speech from my text.
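(For the curious: ElevenLabs also exposes instant cloning over its HTTP API. This is a hedged sketch based on the public API docs, not what I actually used — I did all of this through the web UI — and the voice name and file path are just illustrative. It only assembles the request pieces in the shape the `requests` library expects for a multipart upload:)

```python
from pathlib import Path

API_BASE = "https://api.elevenlabs.io/v1"

def build_voice_add_request(name: str, sample_paths: list[str]):
    """Return (url, headers, form_data, files) for an instant-clone upload."""
    url = f"{API_BASE}/voices/add"
    headers = {"xi-api-key": "YOUR_API_KEY"}  # placeholder - use your own key
    data = {"name": name}
    # One ("files", (filename, content, mime)) tuple per training sample,
    # the multipart format `requests` expects. For a real call, replace
    # the empty bytes with Path(p).read_bytes().
    files = [("files", (Path(p).name, b"", "audio/mpeg")) for p in sample_paths]
    return url, headers, data, files

# A real call would then be roughly:
#   import requests
#   url, headers, data, files = build_voice_add_request("me", ["first-read.mp3"])
#   requests.post(url, headers=headers, data=data, files=files)
```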

That’s pretty cool, but it also made me wonder: what if I read something different, with more practice and different emotion? How different would “my” voice generated by the AI be? The first thing was to find some content I’d be more familiar with and could read differently.

So next I trained a new Instant Voice. I decided to read the start of Chapter 2 of The Wintersmith by Terry Pratchett. It’s a Young Adult book, I’m familiar with it, and it’s a totally different tone. I tried to smile while I read it.

The Wintersmith- Chapter 2

OK, so it does sound like I’m reading a much younger children’s book to my children (who are far too old for that), but it’s me and it’s a very different sample.

So I took that voice and threw the first two paragraphs of my AI Wars article into it:

first test, default settings

The first test sounds pretty monotone, not much like how I was actually reading. For the shared article I first tested, I’d changed the stability to 45%. So let’s try that setting:

45% stability

That sounds a lot more natural. So I then took both training samples and uploaded them to create a new voice based on both together, again going with my somewhat arbitrary 45% stability.

Two Samples

That’s worse: some unnatural pause points, and some natural ones missing. But it really does sound a lot like me. So much so that I don’t like listening to it. I think I probably need to find suitable source material for a good sample, practice reading it a few times, do a good recording, and then play further with the Stability and Clarity settings. Dialling the Clarity setting up to 80% from the default 45%, for example, produces this:
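(These sliders map to the same knobs in the ElevenLabs HTTP API, where the UI percentages become 0–1 floats and Clarity is called `similarity_boost`. A hedged sketch, assuming the endpoint and field names from the public API docs — I only used the web UI myself — with the values simply mirroring the experiments above:)

```python
API_BASE = "https://api.elevenlabs.io/v1"

def build_tts_request(voice_id: str, text: str,
                      stability: float = 0.45,
                      similarity_boost: float = 0.80):
    """Return (url, headers, json_body) for a text-to-speech call."""
    url = f"{API_BASE}/text-to-speech/{voice_id}"
    headers = {
        "xi-api-key": "YOUR_API_KEY",  # placeholder - use your own key
        "Content-Type": "application/json",
    }
    body = {
        "text": text,
        "voice_settings": {
            "stability": stability,                # lower = more expressive
            "similarity_boost": similarity_boost,  # the "Clarity" slider
        },
    }
    return url, headers, body

# Actually generating audio would then be roughly:
#   import requests
#   url, headers, body = build_tts_request("my-voice-id", "Hello there.")
#   audio_bytes = requests.post(url, headers=headers, json=body).content
```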

Last test

It’s incredibly interesting tech, and it’s clear that some playing with the generation tool can let you produce chunks of great reading to stitch together, much faster than working with a real voice actor. So, of course, here is this article, read by my latest “me”:

Listen to this full article
TAGS: AI