When I started using the IBM Watson Speech to Text (STT) and Text to Speech (TTS) services for my Cognitive Candy project, I used the WAV file format. That was the easy choice, since WAV is a raw, uncompressed audio format that requires no additional software for encoding.
My test was simple. I sent Watson the text below and measured the time to get the resulting speech file. I repeated the test for each format Watson supports: WAV, FLAC and OGG. Further down are my results.
“I have been assigned to handle your order status request. I am sorry to inform you that the items you requested are back-ordered. We apologize for the inconvenience. We don’t know when those items will become available. Maybe next week but we are not sure at this time. Because we want you to be a happy customer, management has decided to give you a 50% discount!”
Stats: 370 characters sent; resulting speech is 27 sec long.
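The test procedure above can be sketched as a small timing harness. This is only an illustration, not the code I ran on Candy: `time_synthesis`, `compare_formats` and the `synthesize` callable are hypothetical names, and `synthesize` stands in for whatever function actually sends the text to Watson and returns the audio bytes. The MIME types are the `Accept` values I would expect to use for the three formats.

```python
import statistics
import time
from typing import Callable, Dict, List

# Assumed Accept values for the three formats compared in this post.
FORMATS: Dict[str, str] = {
    "WAV": "audio/wav",
    "FLAC": "audio/flac",
    "OGG": "audio/ogg;codecs=vorbis",
}


def time_synthesis(
    synthesize: Callable[[str, str], bytes],
    text: str,
    accept: str,
    runs: int = 5,
) -> dict:
    """Call synthesize(text, accept) `runs` times and collect size and latency."""
    latencies_ms: List[float] = []
    size_bytes = 0
    for _ in range(runs):
        start = time.perf_counter()
        audio = synthesize(text, accept)  # hypothetical wrapper around the TTS request
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
        size_bytes = len(audio)
    return {
        "size_kb": size_bytes / 1024.0,
        "latencies_ms": latencies_ms,
        "avg_latency_ms": statistics.mean(latencies_ms),
    }


def compare_formats(
    synthesize: Callable[[str, str], bytes], text: str, runs: int = 5
) -> Dict[str, dict]:
    """Run the same timing test once per audio format."""
    return {
        name: time_synthesis(synthesize, text, mime, runs)
        for name, mime in FORMATS.items()
    }
```

Because the timer wraps the whole call, the numbers include network time as well as the service's processing time, which matches how I measured latency from Candy's side.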
Summary of Results
| File Format | Size (kB) | Avg Latency (ms) |
|---|---|---|
| WAV | 1228.8 | 3523.8 |
| FLAC | 579 | 2134 |
| OGG | 231 | 1921 |
Here’s a nice chart showing speech format vs size.
I used typical default settings for each file format, so each format could potentially be tuned for better performance. This study is therefore likely only useful as a ballpark comparison.
Latency is the round trip (request/response) measured from Candy’s perspective (Candy is Raspberry Pi-based), so it includes the latency of my own network as well as Watson’s processing time.
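WAV files are very convenient to work with, but there are substantial benefits to moving to OGG Vorbis: in my test the OGG file was less than a fifth the size of the WAV file (231 kB vs. 1228.8 kB), and the average round trip dropped from about 3.5 s to about 1.9 s.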
For the average latency, I made 5 runs per format, with the following results:
| | WAV | FLAC | OGG |
|---|---|---|---|
| Transfer Size (kB) | 1228.8 | 579 | 231 |
| Average (ms) | 3523.8 | 2134 | 1921 |
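To put the numbers above in perspective, here is a short sketch that expresses each format's size and average latency as a percentage of the WAV baseline. It assumes the three value columns are in the order WAV, FLAC, OGG (the order the formats are listed in earlier); `relative_to_baseline` is a hypothetical helper name.

```python
# Values copied from the results table; column order WAV, FLAC, OGG is assumed.
results = {
    "WAV": {"size_kb": 1228.8, "avg_latency_ms": 3523.8},
    "FLAC": {"size_kb": 579.0, "avg_latency_ms": 2134.0},
    "OGG": {"size_kb": 231.0, "avg_latency_ms": 1921.0},
}


def relative_to_baseline(results: dict, baseline: str = "WAV") -> dict:
    """Express size and latency of each format as a percentage of the baseline."""
    base = results[baseline]
    return {
        name: {
            "size_pct": round(100 * r["size_kb"] / base["size_kb"], 1),
            "latency_pct": round(100 * r["avg_latency_ms"] / base["avg_latency_ms"], 1),
        }
        for name, r in results.items()
    }


for name, pct in relative_to_baseline(results).items():
    print(f"{name}: {pct['size_pct']}% of WAV size, {pct['latency_pct']}% of WAV latency")
```

With these figures, OGG comes out at roughly 19% of the WAV transfer size and roughly 55% of the WAV round-trip time, while FLAC lands at roughly 47% and 61% respectively.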