Comparison of WAV, FLAC and OGG audio formats: size and latency

When I started using the IBM Watson Speech to Text (STT) and Text to Speech (TTS) services for my Cognitive Candy project, I used the WAV file format. That was the easy choice, since WAV typically carries uncompressed audio and requires no additional software for encoding.

Once I got my application working, I started looking for ways to improve overall system latency, so I decided to study the benefits of moving to the FLAC and OGG Vorbis file formats.

My test was simple. I sent Watson the text below and measured the time to get the resulting speech file. I repeated the test for each format Watson supports: WAV, FLAC and OGG. Further down are my results.

“I have been assigned to handle your order status request.  I am sorry to inform you that the items you requested are back-ordered. We apologize for the inconvenience.  We don’t know when those items will become available. Maybe next week but we are not sure at this time. Because we want you to be a happy customer, management has decided to give you a 50% discount!”

Stats: 370 characters sent; the resulting speech is 27 seconds long.
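The measurement loop described above can be sketched roughly as follows. This is a minimal sketch, not the code I actually ran on Candy: `time_synthesis`, `fake_synthesize` and `ACCEPT_TYPES` are hypothetical names, the Accept values are assumptions about what the speech service expects, and the stand-in function only sleeps and returns dummy bytes so the example runs without a network connection. To time a real service, replace `fake_synthesize` with a function that sends the text and returns the complete audio response as bytes.

```python
import time
from statistics import mean

# Assumed mapping from format name to an HTTP Accept value; adjust it to
# whatever the speech service actually expects.
ACCEPT_TYPES = {
    "WAV": "audio/wav",
    "FLAC": "audio/flac",
    "OGG": "audio/ogg;codecs=vorbis",
}


def time_synthesis(synthesize, text, accept, runs=5):
    """Call synthesize(text, accept) `runs` times.

    Returns (latencies_ms, size_bytes), where size_bytes is the length of
    the last response body.
    """
    latencies_ms = []
    size_bytes = 0
    for _ in range(runs):
        start = time.perf_counter()
        audio = synthesize(text, accept)  # must block until all audio bytes arrive
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
        size_bytes = len(audio)
    return latencies_ms, size_bytes


def fake_synthesize(text, accept):
    """Hypothetical stand-in for the real request: sleeps, then returns dummy bytes."""
    time.sleep(0.01)
    return b"\x00" * (len(text) * 10)


if __name__ == "__main__":
    text = "I have been assigned to handle your order status request."
    for name, accept in ACCEPT_TYPES.items():
        latencies, size = time_synthesis(fake_synthesize, text, accept)
        print(f"{name}: {size / 1024:.1f} kB, avg {mean(latencies):.1f} ms")
```

Timing around the whole call, rather than only until the first byte arrives, matches the round-trip definition of latency used in this post.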

Summary of Results

File Format    Size (kB)    Avg Latency (ms)
WAV            1228.8       3523.8
FLAC           579          2134
OGG            231          1921

Here’s a nice chart showing speech format vs size.

Considerations:

I used typical, default settings for each file format, so each format could potentially be tuned for better performance. This study is therefore best read as a ballpark comparison.

Latency is the round trip (request/response) measured from Candy’s perspective (Candy is Raspberry Pi based), so it includes the latency of my own network as well as Watson’s processing time.

 

Conclusion

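It’s very convenient to work with WAV files, but there are substantial benefits to moving to OGG Vorbis. In this test the OGG file was about 81% smaller than the WAV file (231 kB vs. 1228.8 kB), and the average round-trip latency dropped by about 45% (1921 ms vs. 3523.8 ms).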

 

Reference

For the average latency, I had 5 runs with the following results:

Latency Experiment

Transfer Size (kB)    1228.8    579     231
Run #                 WAV       FLAC    OGG
1                     3145      2043    2066
2                     4414      1900    1719
3                     3648      2218    1663
4                     3157      2250    2070
5                     3255      2259    2087
Average (ms)          3523.8    2134    1921

OGG average latency as a share of WAV average latency: 54.52%
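For anyone who wants to check the arithmetic, the averages and the relative figures can be recomputed from the run data above. This is a small sketch: the numbers are copied from the table, while `summarize` and the dictionary key names are illustrative choices of mine.

```python
from statistics import mean

# Latency per run in ms, copied from the table above.
runs_ms = {
    "WAV": [3145, 4414, 3648, 3157, 3255],
    "FLAC": [2043, 1900, 2218, 2250, 2259],
    "OGG": [2066, 1719, 1663, 2070, 2087],
}
# Transfer size in kB, copied from the table above.
size_kb = {"WAV": 1228.8, "FLAC": 579, "OGG": 231}


def summarize(runs_ms, size_kb, baseline="WAV"):
    """Return the average latency and the size/latency percentages relative to the baseline."""
    avg = {fmt: mean(vals) for fmt, vals in runs_ms.items()}
    return {
        fmt: {
            "avg_ms": avg[fmt],
            "latency_pct_of_baseline": 100 * avg[fmt] / avg[baseline],
            "size_pct_of_baseline": 100 * size_kb[fmt] / size_kb[baseline],
        }
        for fmt in runs_ms
    }


if __name__ == "__main__":
    for fmt, row in summarize(runs_ms, size_kb).items():
        print(
            f"{fmt}: avg {row['avg_ms']:.1f} ms, "
            f"{row['latency_pct_of_baseline']:.2f}% of WAV latency, "
            f"{row['size_pct_of_baseline']:.1f}% of WAV size"
        )
```

With only five runs per format, these averages are sensitive to a single slow request (run 2 for WAV, for example), so they should be treated as rough indicators.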