
Tymely News
Open Source AI TTS Audiobooks
I'd like to share a little about my more recent audiobook projects, because my research has recently struck gold that may be of value to other authors who want to produce audiobooks, but prefer to steer clear of the big AI companies.
Why AI?
At the very beginning of my attempts to make audiobooks from my novels, I tried recording myself reading. My first experiment was an abysmal failure. Since I didn't have the slightest clue what I was doing, it took me three hours to produce one paragraph that sounded good.
I used Audacity for that experiment, which is my go-to tool for manually recording and manipulating audio.
Here's a quick list of things that went wrong that day:
- I'm not a voice actor and the learning curve was steep
  - My audio levels were all over the place and I had to re-record the same lines, over and over
  - This is something more experience could solve, however
- I'm not an audio engineer and the learning curve was steep
  - I tried to watch the VU meter built into Audacity, to avoid red-lining my mic, leading to divided focus
  - Again, experience could solve this
- My mic is not pro quality
  - If I'd had a better mic, I probably would have had an easier time getting good results
  - Fixing this would cost money I don't have
- My home is too noisy and wrong for audio work
  - The air conditioners in the background, fans, etc.
  - Winter would have been a better time, but the heater is still an issue
  - This is fixable, but I would need supplies to build an audio booth, which costs money I don't have
Paying a voice actor to read was never an option, because I don't have the money. It costs a minimum of $900 to pay someone to read a book and even more if you want it to be read well, with no guarantee of sales to recoup the investment.
In the end, I gave up on the project for a while, until a friend pointed out that there are AI voices available. That's when my research brought me across three different companies peddling AI voices to authors, allowing their work to be sold in online stores without massive up-front expense:
- Apple
- OpenAI
There's also ElevenLabs, but I don't include them in the list, because when I was looking into them, they had no existing deals for distribution of the results, which is an important distinction.
There are others, but those are the names I know.
Why Open Source?
All of the options from big companies come with the same downsides:
Exclusivity: Almost every online store that accepts AI-read audiobooks only allows one specific company's software to be used for reading. For example, Apple only accepts their own software. This issue really grinds my gears, because I love open source software and these short-sighted policies leave it out in the cold, for no reason at all. I did eventually find some stores that allow open source audiobooks, but I'll cover them in a later section.
Lack of Control: I would have to surrender control of the process of generating the audio master files to cloud servers owned by massive mega-corporations. That does not appeal to me, because it will be in my hands, on my own computer, or not at all. I will not put up with the rug being pulled out from under me by the changing whims of the CEO of a mega-corporation.
Predatory Pricing: Most of the viable options for authors are offered up supposedly for free, bound up with those exclusivity deals. However, I've seen enough of life to know that almost nothing of such substantial value can ever actually be free and such deals should always be carefully examined. The up-front price may be zero, but it will cost you something later. For some people, that may be worth it, but in this case, I have serious doubts about so much computing power remaining free forever.
Now here's the reason I call that pricing model predatory: today it's free, but it can't possibly stay that way forever. I would compare it to a drug dealer that says, "The first taste is free!" They can afford to do that, because they know exactly how addictive their product is, but after the customer is hooked, oh my, the price just goes up!
I expect the current AI bubble to burst soon and then all of those free offers will evaporate, leaving authors with unfinished files they have to pay to complete, lest they waste lots of invested time. Alternatively, authors will come back to do a new audiobook and find out they can't do it for free anymore, because the investors of those huge companies demand a return on their investment.
The nature of this rug-pulling behavior disgusts me, but I'm certain it's coming. To quote the movie WarGames, "The only winning move is not to play."
These are the reasons I choose open source, every day of every week. My computer exists to serve me, not the corporate overlords of a company that shall remain nameless, which is currently turning one of the most popular operating systems into an ad-serving machine. Are people actually paying for an operating system full of ads? Ridiculous, but that's what my reading of news items tells me.
I stick with Linux for a reason. It has never stabbed me in the back and I know it never will.
So it is with the software I use. I do my best to avoid using software that doesn't run directly on my computer, because the cloud is not something I can control and that rug can be pulled out from under me at any time.
I try to limit the exceptions to tools that only work via the internet, like email, news sites, social media and distribution of my novels.
Some Useful Tools (TTS Engines)
In the end, I decided to try building audiobooks with AI-based TTS (text-to-speech) engines I can run locally. The ones I'm using are nice, because they sound like natural human voices, though to a varying degree.
Here are the best open source tools I've found, so far:
- Kokoro-tts
  - This has become my favorite, because the pronunciation is almost perfect and it can mix its existing voices together
  - It excels at reading well
- Chatterbox
  - This is a zero-shot TTS engine, meaning you supply a voice sample and it reads by producing a voice that mimics it
  - It also excels at reading well, but is too slow for my overall needs
  - My favorite part of this engine is the voice changer, allowing an audio clip of one voice to be morphed into that of another
  - That retains the accent of the first, but makes it sound like the second
  - This can be used separately and that's much faster than using this engine to read
- Piper
  - Piper has a massive library of existing voices (more than 1000!) and excels at real-time reading
  - Unfortunately, the quality isn't the best
- Parler-tts
  - Produces CD quality audio (44.1 kHz) and can produce voices from a text description of a manner of speaking, allowing for some hint of mood
  - Parler often produces noisy audio clips, but there are prompts that can reduce or eliminate the noise, at least most of the time
  - The biggest downside is the extreme slowness: without a GPU for acceleration, it's unbearably slow
  - This could easily be used to produce random voices based on description to feed into Chatterbox's voice changer, giving a more permanent variation
There are many others out there, but these are the best I've found and been able to get running reliably.
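To give a rough idea of what "mixing voices together" can mean, here is a minimal sketch of a weighted blend. It assumes each voice is stored as a list of numbers (a style vector), which is my understanding of how Kokoro's voice files work, but treat that as an assumption. The function name `blend_voices` and the toy numbers are made up for illustration and are not part of any TTS library.

```python
from typing import List, Sequence


def blend_voices(a: Sequence[float], b: Sequence[float], weight_a: float = 0.5) -> List[float]:
    """Return a weighted average of two equal-length voice vectors.

    ASSUMPTION: voices are plain numeric vectors. Real voice files would be
    loaded by the TTS engine; the values used below are toy data.
    """
    if len(a) != len(b):
        raise ValueError("voice vectors must have the same length")
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must be between 0 and 1")
    weight_b = 1.0 - weight_a
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]


if __name__ == "__main__":
    voice_one = [0.0, 1.0, 2.0]  # toy stand-in for a strongly accented voice
    voice_two = [1.0, 1.0, 0.0]  # toy stand-in for a neutral voice
    # 75% of the first voice, 25% of the second: a "softer" version of the first.
    print(blend_voices(voice_one, voice_two, weight_a=0.75))
```

A weight near 1.0 keeps most of the first voice, while 0.5 gives an even mix of both.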
My Current Process
I've begun getting some very good results with both customization and quality through the following process:
- Choose a character to find a voice for
- Pick a Piper voice that fits the character's tone
  - At this point, all that matters is the overall sound
  - Don't bother with matching accent
- Filter noise with RNNoise, if required
  - This is an AI noise filter
  - I've found that the default settings are utter garbage
    - I think that's the reason this tool has such a bad reputation for distorting voices
  - Instead, set VAD % between 90% and 99%
    - 99% is for the least noise reduction and 90% is the threshold beyond which I consider a noisy voice unusable
- Pitch shift it with SoX if it's not exactly right
  - Don't worry if the voice sounds artificial from the pitch change, because it won't matter
- Generate a sample reading some Harvard sentences
  - Two of them in a row is about the right length
  - Setting length scale to roughly 1.25 will be helpful for the next step, but vary it as needed, since some voices are naturally faster or slower
- Feed that sample into Chatterbox to produce a new sample
  - This should sound far better than the original version of the Piper voice
- Save the resulting audio clip for the long term
  - It will later be fed into Chatterbox's voice changer
- If you like the voice as-is, go ahead and use Chatterbox for further reading
  - I personally find Chatterbox a bit slow and like to play with voice mannerisms, so I normally keep going
- The really beautiful part of this is what happens with pitch shifts of Piper voices
  - They sound natural when Chatterbox mimics them!
  - Mimicking a pitch-shifted sample made to sound like a bad imitation of a child will produce results that resemble a real child
  - That's the magic of a wisely-wielded AI process: garbage in, real data out
- Examine Kokoro voices
  - Now is the time to match accent and mannerisms
  - Sometimes, mixing two voices together is the best option, allowing for softer versions of accents, or unique mannerisms
- Use Kokoro to read text, producing audio clips
- Feed audio from Kokoro through Chatterbox's voice changer to match the reading to the sample that fits the tone of the character
  - It will retain the accent and mannerisms from Kokoro
  - It will use the tone of the original Piper voice, but should sound far better
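As a rough sketch, the Piper and SoX steps from the list above could be scripted something like this. It only builds the command lines, so they can be inspected before anything runs. The Piper flags (`--model`, `--speaker`, `--length_scale`, `--output_file`) are assumptions based on the Piper command-line tool as I know it, so check them against `piper --help` on your install. The model and file names are placeholders, and the Chatterbox and RNNoise steps are left out entirely.

```python
import shlex
import subprocess
from typing import List, Optional

# Two Harvard sentences in a row, used as the sample text.
HARVARD_SAMPLE = (
    "The birch canoe slid on the smooth planks. "
    "Glue the sheet to the dark blue background."
)


def piper_argv(model: str, out_wav: str, speaker: Optional[int] = None,
               length_scale: float = 1.25) -> List[str]:
    """Build a Piper command line. ASSUMPTION: flag names match your Piper build."""
    argv = ["piper", "--model", model, "--output_file", out_wav,
            "--length_scale", str(length_scale)]
    if speaker is not None:
        argv += ["--speaker", str(speaker)]
    return argv


def sox_pitch_argv(in_wav: str, out_wav: str, cents: int) -> List[str]:
    """Build a SoX command that shifts pitch by `cents` (hundredths of a semitone)."""
    return ["sox", in_wav, out_wav, "pitch", str(cents)]


def run(argv: List[str], stdin_text: Optional[str] = None) -> None:
    """Run a command, optionally feeding text on stdin (Piper reads its text there)."""
    subprocess.run(argv, input=stdin_text, text=True, check=True)


if __name__ == "__main__":
    # Placeholder file names; nothing is executed here, the commands are only printed.
    speak = piper_argv("en_US-libritts-high.onnx", "sample.wav", speaker=800)
    shift = sox_pitch_argv("sample.wav", "sample_low.wav", -150)
    print(shlex.join(speak))
    print(shlex.join(shift))
    # To actually generate audio (requires piper and sox to be installed):
    # run(speak, stdin_text=HARVARD_SAMPLE)
    # run(shift)
```

Keeping the command builders separate from `run` makes it easy to print and double-check a command before committing to a long batch.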
Now comes the real fun detail: because any Kokoro voice fed through the voice changer will sound like the same person, I have the option to alter mannerisms and accent as I see fit. This makes room for some slight emotion.
For example, I've been using a particular Kokoro voice along with a slight speed increase for an angry/excited version, while I use another voice at normal speed for calm.
I'll demonstrate with a character from the current novel I'm aiming to make an audiobook for, She Hunts Demons.
Clayton Simmons is the private detective who is the partner of the main character. In the book, he normally speaks with an American accent, but at one point he gets upset and slips back to his native British accent, while at other times, he gets angry. He eventually makes a bargain with a demon, allowing him to take on a heavily-muscled cat-man form that's seven to eight feet tall, depending on how much punishment he takes. That's four voice variations for one character.
For the sake of a demonstration, some examples of the resulting voices follow, reading a test phrase:
"A rainbow is a meteorological phenomenon that is caused by reflection, refraction and dispersion of light in water droplets resulting in a spectrum of light appearing in the sky."
Simmons Calm
Simmons in his default calm state. I used Kokoro's 'am_liam' voice for accent and mannerisms. The Piper voice used for tone is speaker 800 of the LibriTTS model and that doesn't change between character states.
Simmons Angry/Excited
This is Simmons when he's angry or excited. I used Kokoro's 'am_adam' voice for accent and mannerisms. The difference is subtle, but it should add a little something to the book.
Simmons British
This is Simmons when he gets emotional and drops back to his native accent. I used Kokoro's 'bm_daniel' voice for accent and mannerisms.
Simmons Cat-Man
Last of all is Simmons in his cat-man form. I started from the angry version, but made one small change: the Piper voice was pitched down by 150 hundredths of a semitone via SoX, to make him deep and menacing.
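For anyone who wants to keep track of variations like these in their own scripts, the four Simmons voices can be written down as a small lookup table. This is only a sketch: the `VoiceVariant` class and its field names are made up for illustration, and the speed value of 1.1 is a placeholder for "a slight speed increase", not a measured setting. The helper at the bottom shows what a shift in cents means as a frequency ratio (1200 cents is one octave).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceVariant:
    """One state of a character. Field names are illustrative, not from any library."""
    kokoro_voice: str           # supplies accent and mannerisms
    speed: float = 1.0          # reading speed multiplier (1.0 = normal)
    piper_pitch_cents: int = 0  # pitch shift applied to the Piper tone sample


# The Piper tone sample (speaker 800 of the LibriTTS model) is shared by all states.
SIMMONS = {
    "calm": VoiceVariant("am_liam"),
    "angry": VoiceVariant("am_adam", speed=1.1),  # 1.1 is a placeholder value
    "british": VoiceVariant("bm_daniel"),
    # Cat-man starts from the angry version, with the tone pitched down 150 cents.
    "cat_man": VoiceVariant("am_adam", speed=1.1, piper_pitch_cents=-150),
}


def cents_to_ratio(cents: float) -> float:
    """Convert a pitch shift in cents to a frequency ratio."""
    return 2.0 ** (cents / 1200.0)


if __name__ == "__main__":
    for state, variant in SIMMONS.items():
        ratio = cents_to_ratio(variant.piper_pitch_cents)
        print(f"{state}: {variant.kokoro_voice}, speed {variant.speed}, "
              f"pitch ratio {ratio:.3f}")
```

A shift of -150 cents works out to a frequency ratio of roughly 0.917, so the cat-man tone sits about 8% lower than the original.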
Where to Publish?
All of that is well and good, but you might be asking the same question I asked for close to a year: where the heck can I publish AI-read audiobooks made with open source software?
I've found only two options, so far:
- Itch.io
  - You can publish almost anything in the Itch store, but it focuses on games
- Bandcamp.com
  - Bandcamp focuses on music, but they also carry audiobooks
  - Their policies don't preclude AI-read books, which was a real shocker to me
  - I only found out because there are already some on the platform
    - Mostly public domain books
    - They might have been using ElevenLabs software, but I'm uncertain
I've got two audiobooks on Bandcamp at the moment:
- Starwitch
  - As of the time of writing, this was produced strictly using Piper and RNNoise
  - That required a lot of extra work to generate lines over and over until the pronunciation was passable
  - I'm eager to rebuild this one with my new multi-engine technique, but that's going to have to wait until She Hunts Demons is done
- The Most Powerful Words
  - This is a short story I used as a testbed for modifying my personal audiobook software to handle more TTS engines than just Piper
  - It uses only basic Kokoro voices, though some are pitch-shifted
  - I ought to also redo this one
In Closing
I hope you find this information useful for your own audio and voice projects. I hope it helps you produce quality audiobooks of your own.
I really would like to see a future in which AI models that run on local hardware are the norm, rather than the exception, because this cloud obsession everyone has these days is an unsettling trend. Why run software in the cloud, when local hardware can do the job? Why rent when you can own? After all, most people who do real electronic work do so on a computer, so why rent another one on top of that?
I may be forced to generate chapters of my audiobooks with overnight runs, since my GPU isn't suitable for running these AI models, but it still beats letting some mega-corporation control my fate.
Even if you read your own books, perhaps Chatterbox's voice changer could be used to spice up your audiobooks with character voices? That's something I'd like to give a shot, once I learn more about voice acting.
Tags: audio, audiobooks