These days, the average person spends far more of their time reading than they did 20 years or even just a decade ago. Just think about it for a second. We spend most of our time at work facing text in one form or another, and when we return home from the daily grind, we turn to social media feeds for our entertainment. Yet along with this text overload came many other means of receiving information, both in video form (e.g. YouTube, Netflix) and in audio form (e.g. audiobooks, podcasts).
As a developer, I have always found the idea of turning text into speech exciting. I remember when I first downloaded the Magic Goody translator program, which had a feature to pronounce words and even read entire sentences out loud. This was my first real exposure to Text To Speech (TTS) technology, and my mind was well and truly captivated.
Unfortunately, I was unable to pursue the matter much further, as my online search capabilities were considerably limited at the time, and any time spent in internet cafés was charged at a premium rate. Nonetheless, I did somehow find out that the required API, the robotic-sounding Microsoft Speech API, already existed on Microsoft Windows. This satisfied my initial curiosity, and after a while, I stopped thinking about the potential applications.
An Investigation Into Text to Speech Technology
My next investigation into TTS technology was again fuelled by curiosity, as I sought to understand where exactly TTS was today. This was largely due to the fact that I was presented with a Qumo Libro II e-reader, which had the ability to convert text to audio. What I learnt truly fascinated me. Not only was the e-reader built upon Linux itself, but there were a number of other solutions with ready-made TTS engines. Nonetheless, because the voice still had a distinctly robotic cadence, I had no real desire to try it.
As time went on, however, and new cloud technologies began to emerge, I kept a close watch on new products from Amazon Web Services (AWS). Then along came Amazon Polly, which was presented in 2016 at the annual AWS re:Invent conference. It was all incredibly exciting! The voice samples presented there had lost their robotic quality and were impressively fluid in their pronunciation. One of the example applications showcased at the conference was a WordPress plugin, which allowed users to give their eyes a rest and listen to articles instead.
At the time, I was already listening to a number of audiobooks and podcasts, so I wanted to be able to pair Polly with other resources that were unrelated to the WordPress platform. Unfortunately, however, at that point, there was no available documentation — and worst of all, as it turned out, I had no access to their API either. Consequently, I was never able to implement this concept and had to put it in my metaphorical idea box :).
Several years passed before news finally came that the number of voices available in Amazon Polly had increased, and my enthusiasm was promptly fired up again.
This time around, the following tasks had to be addressed:
1) Convert an article on a site to text
2) Use Amazon Polly to produce an audio recording of that text
And as a bonus: pay as little as possible for AWS usage and try to stay within the Free Tier as well :).
Solving Text to Speech Technology Problems
In order to solve the first problem, I wanted to use the same method used by modern browsers when switching over to their Reader mode. Yet this was quickly followed by disappointment, as each browser had gone its own way and implemented the feature however it pleased.
Nevertheless, thanks to Mozilla, Firefox’s Reader View functionality had been moved into a separate library. After reading the documentation and writing a small sample, I concluded that this would solve my issue.
The resulting architecture was approximately the following: HTTP API -> AWS Lambda -> S3.
- HTTP API, created with the AWS API Gateway (HTTP Listener) – this received the request and invoked the logic that I wrote for converting an article on a site into text
- AWS Lambda – this contained the logic to convert an article on a site into text and save the resulting text to AWS S3
- AWS S3 – for Textual Data Storage (.txt)
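The article-to-text step above can be sketched roughly as follows. This is only an illustration under stated assumptions: the real extraction was done with Mozilla’s Reader View library, whereas the naive tag stripper here is a simple stand-in, and the class name, method names and the "articles/" key scheme are all hypothetical.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical helper: a naive stand-in for the Reader View extraction step. */
final class ArticleText {

    private ArticleText() {}

    /** Drops script/style blocks and tags, then collapses whitespace. Not a real Reader View algorithm. */
    static String toPlainText(String html) {
        String text = html
                .replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ") // remove non-content blocks
                .replaceAll("(?s)<[^>]+>", " ")                          // remove the remaining tags
                .replace("&nbsp;", " ")
                .replace("&amp;", "&");
        return text.replaceAll("\\s+", " ").trim();
    }

    /** Derives an object key for the .txt file from the article URL (assumed naming scheme). */
    static String objectKeyFor(String articleUrl) {
        URI uri = URI.create(articleUrl);
        String path = uri.getPath() == null ? "" : uri.getPath();
        String slug = path.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9]+", "-")   // anything that is not a letter or digit becomes "-"
                .replaceAll("(^-+|-+$)", "");    // trim leading and trailing dashes
        if (slug.isEmpty()) {
            slug = "index";
        }
        return "articles/" + uri.getHost() + "/" + slug + ".txt";
    }
}
```

In a real Lambda handler, the plain text returned by `toPlainText` would then be uploaded to S3 under the key returned by `objectKeyFor`; that upload is left out here so the sketch needs nothing beyond the standard library.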
Bingo. Not only did this solve the first issue entirely, but it also laid the groundwork to tackle the second problem as well. Since I had already decided to run everything on AWS Lambda, I opted for its Java 11 runtime. I also decided that the text file created on S3 should serve as the trigger to start converting text to audio, as I wanted to connect these two tasks with one another.
As a result, we were left with the S3 -> Lambda -> Polly -> S3 architecture, in which:
- S3 creates an event (trigger) – when the previous Lambda creates a file, a second Lambda is triggered, which is then able to work with Polly
- AWS Lambda – reads a file from S3 and sends it over to Polly
- AWS Polly – converts the file to audio
- AWS S3 – saves the resulting mp3 file with a recording of the website’s article
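One detail worth sketching in this second pipeline is that the Lambda both reacts to S3 and writes back to S3. If the text and audio files share a bucket, the trigger should only react to the .txt files, otherwise the freshly written mp3 could fire the same Lambda again. The snippet below is a minimal sketch of that guard in Java; the class name and the "articles/" and "audio/" prefixes are assumptions for illustration, not the layout actually used.

```java
import java.util.Optional;

/** Hypothetical helper: maps a text object key to the key of its audio file. */
final class AudioKeys {

    private static final String TEXT_PREFIX = "articles/";  // assumed prefix for text files
    private static final String AUDIO_PREFIX = "audio/";    // assumed prefix for audio files
    private static final String TEXT_SUFFIX = ".txt";
    private static final String AUDIO_SUFFIX = ".mp3";

    private AudioKeys() {}

    /**
     * Returns the mp3 key for a .txt key, or empty if the object is not a text file
     * and should therefore be ignored by the text-to-audio Lambda.
     */
    static Optional<String> audioKeyFor(String textKey) {
        if (textKey == null
                || !textKey.endsWith(TEXT_SUFFIX)
                || textKey.length() == TEXT_SUFFIX.length()) {
            return Optional.empty();
        }
        String base = textKey.substring(0, textKey.length() - TEXT_SUFFIX.length());
        if (base.startsWith(TEXT_PREFIX)) {
            base = AUDIO_PREFIX + base.substring(TEXT_PREFIX.length());
        }
        return Optional.of(base + AUDIO_SUFFIX);
    }
}
```

S3 event notifications can also be restricted by key prefix and suffix in the trigger configuration itself, so a check like this would mostly serve as a second line of defence inside the handler.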
And that was that for a while, as I was finally able to listen to articles :).
Dealing With Other Issues
I should also mention that I did run into a few problems. First, converting a website to text requires a lot of memory and time, so I turned to Lambda’s configuration to raise both the timeout and the memory limit. Second, Amazon Polly has specific character limits in place, and thus larger texts are processed as background jobs. This leads to the problem of not knowing exactly when each voice-over will become available. I do hope that this can be solved in the near future, but I was unable to find anything about it in the documentation.
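One possible way to work around the character limit, which is not what was done here, is to split the text into pieces that each fit into a single synchronous request. The Java sketch below shows such a splitter under stated assumptions: the class name is hypothetical, and the limit passed in is a placeholder that should be checked against the current Amazon Polly quotas.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits text into chunks that each fit under a character limit. */
final class TextChunker {

    private TextChunker() {}

    /** Splits at sentence ends where possible, then at whitespace, then at the hard limit. */
    static List<String> split(String text, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> chunks = new ArrayList<>();
        String rest = text.trim();
        while (rest.length() > maxChars) {
            int cut = findCut(rest, maxChars);
            chunks.add(rest.substring(0, cut).trim());
            rest = rest.substring(cut).trim();
        }
        if (!rest.isEmpty()) {
            chunks.add(rest);
        }
        return chunks;
    }

    /** Returns a cut index in [1, maxChars]; the caller guarantees s.length() > maxChars. */
    private static int findCut(String s, int maxChars) {
        // Prefer the last sentence end (punctuation followed by whitespace) within the limit.
        for (int i = maxChars - 1; i >= 0; i--) {
            char c = s.charAt(i);
            if ((c == '.' || c == '!' || c == '?') && Character.isWhitespace(s.charAt(i + 1))) {
                return i + 1;
            }
        }
        // Otherwise fall back to the last whitespace within the limit.
        for (int i = maxChars; i > 0; i--) {
            if (Character.isWhitespace(s.charAt(i))) {
                return i;
            }
        }
        // No natural break at all: cut at the limit (this may split a word).
        return maxChars;
    }
}
```

A call such as `TextChunker.split(articleText, 3000)` would then yield pieces that can be synthesized one after another and joined afterwards, with 3000 being an assumed value rather than a confirmed limit. The trade-off is more requests and possibly audible seams between chunks, compared with a single background job.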
As for the results themselves, here is an example of our article on Git, in the very pleasant-sounding voice of Amy.
The Many Uses of Text to Speech Technology
At SPG, we believe that TTS technologies have great potential for application in a number of different fields:
- providing a voice-over for articles on sites, like one I recently saw on The Verge
- voice acting for call centre chatbots
- interactive voice interfaces (e.g. Alexa, Siri)
- NPCs in games
- and more!