A lesson learned from my Cognitive Candy project is that Candy’s response time is a key factor for a great user experience. When people talked to Candy, they expected ‘her’ to respond at the same cadence a person would. People’s excitement and engagement seemed to drop off quickly when the response time was too long.
Studies have shown [1] that 200 milliseconds is the typical ‘gap time’ when people take turns talking. And that’s where the challenge lies for a voice-enabled device like Candy. It’s very hard to perform all required operations in that time:
1. Listen to the user’s speech and convert it to text (Speech to Text).
2. Interpret the user’s intent (text input) and generate a response (text output).
3. Convert the text response to speech and play it (Text to Speech).
In this article, I focus on reducing the latency of step #3 and show that a caching strategy cuts it from 500-2500 ms to about 2 ms flat! 🙂
The diagram below explains the information flow of item #3. In essence, the control block manages the ‘speech cache’ and calls up the IBM Watson Text to Speech service in the cloud when needed.
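The control block's decision can be sketched in a few lines of JavaScript. This is a minimal illustration, not the flow itself: `getSpeechFile`, `fakeSynthesize`, and the generated file names are hypothetical, and `fakeSynthesize` merely stands in for the call to the Watson Text to Speech service.

```javascript
// Hypothetical control block: return the audio file for `text`,
// calling the (slow) cloud TTS service only on a cache miss.
// `synthesize` is a stand-in for the cloud call, not a real Watson API.
async function getSpeechFile(text, cache, synthesize) {
  const cached = cache.get(text);
  if (cached !== undefined) {
    return { file: cached, source: 'cache' }; // fast path
  }
  const file = await synthesize(text); // slow path: network round trip
  cache.set(text, file);
  return { file, source: 'cloud' };
}

// Demo with a fake synthesizer that only invents a file name.
async function demo() {
  let cloudCalls = 0;
  const fakeSynthesize = async () => {
    cloudCalls += 1;
    return `speech_${cloudCalls}.ogg`;
  };
  const cache = new Map();
  const first = await getSpeechFile('Hello world', cache, fakeSynthesize);
  const second = await getSpeechFile('Hello world', cache, fakeSynthesize);
  return { first, second, cloudCalls };
}

demo().then((result) => console.log(result));
```

The second request for the same text never reaches the synthesizer, which is where the latency saving comes from.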
Cache Performance
Let’s look at the performance improvements I got. I ran two tests: a short text (“Hello world”) and a long text (see below). The table below shows that the latency of the Watson cloud service grows with the text length, while the cache latency stays consistently around 2.2 ms regardless of length.
Measurement | Short Text | Long Text |
Cloud Latency | 463.2 ms | 2430.7 ms |
Cache Latency | 2.3 ms | 2.1 ms |
File Size | 8.7 KB | 198 KB |
Speech Duration | 1.28 sec | 22.3 sec |
Even for a simple ‘Hello world’, the conversion latency is ~500 ms. That alone is above the 200 ms response-time target I mentioned above. So the need for, and importance of, a TTS cache is obvious.
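To put the table's numbers in perspective, the short snippet below computes the speedup and checks each path against the 200 ms target. The latencies are the measured values from the table; the names `measurements`, `TARGET_GAP_MS`, and `summarize` are only illustrative.

```javascript
// Measured latencies from the table above, in milliseconds.
const measurements = {
  short: { cloudMs: 463.2, cacheMs: 2.3 },
  long: { cloudMs: 2430.7, cacheMs: 2.1 },
};

// Typical conversational gap time cited in the article.
const TARGET_GAP_MS = 200;

// Speedup of cache over cloud, and whether each path fits the target.
function summarize({ cloudMs, cacheMs }) {
  return {
    speedup: Math.round(cloudMs / cacheMs),
    cloudMeetsTarget: cloudMs <= TARGET_GAP_MS,
    cacheMeetsTarget: cacheMs <= TARGET_GAP_MS,
  };
}

for (const [name, m] of Object.entries(measurements)) {
  console.log(name, summarize(m));
}
```

The cache is roughly 200 times faster for the short text and over 1000 times faster for the long one, and only the cached path fits within the target.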
Long text: “I have been assigned to handle your order status request. I am sorry to inform you that the items you requested are back-ordered. We apologize for the inconvenience. We don’t know when those items will become available. Maybe next week but we are not sure at this time. Because we want you to be a happy customer, management has decided to give you a 50% discount!”
Implementing Text to Speech with Cache
Here’s my Node-RED flow that performs the TTS Cache Control.
How does it work? Operation of the Text to Speech with Cache
‘Cache Search’ looks for the incoming string in the cache object (not shown above, but the cache object is loaded from a file into a context variable).
If ‘Cache Search’ finds the incoming string, it passes the file name to be played to the ‘Play Audio’ node.
If not found, the string goes to Watson TTS, the resulting audio is saved as a file in local storage, and the cache index is updated (also saved as a file in local storage).
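Outside of Node-RED, the cache-miss path described above might look like the following sketch using Node's built-in `fs` module. The function names `saveToCache` and `loadIndex` are hypothetical, and the audio is a placeholder buffer. The timestamp-based file name and the `cache_speech.json` index name mirror what the flow's ‘name file’ and file nodes use.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

const INDEX_FILE = 'cache_speech.json';

// Load the cache index from disk, or start empty if none exists yet.
function loadIndex(cacheDir) {
  const indexPath = path.join(cacheDir, INDEX_FILE);
  if (!fs.existsSync(indexPath)) {
    return [];
  }
  return JSON.parse(fs.readFileSync(indexPath, 'utf8'));
}

// On a cache miss: store the audio, then record it in the index file.
function saveToCache(cacheDir, index, text, audioBuffer) {
  // Timestamp file name; two saves in the same millisecond would collide.
  const file = `${Date.now()}.ogg`;
  fs.writeFileSync(path.join(cacheDir, file), audioBuffer);
  index.push({ key: text, file });
  fs.writeFileSync(path.join(cacheDir, INDEX_FILE), JSON.stringify(index));
  return file;
}

// Usage: a temporary directory and fake audio bytes, for illustration only.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'cache_speech-'));
const demoIndex = loadIndex(demoDir);
saveToCache(demoDir, demoIndex, 'Hello world', Buffer.from('fake audio bytes'));
console.log(loadIndex(demoDir));
```

Writing the whole index back on every miss is simple and fine for a small cache. A larger cache would probably want a real key-value store instead.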
Here’s a sample cache index file:
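Judging by the ‘Cache Save’ and ‘reset cache file’ nodes in the flow, the index is an array of objects, each mapping the spoken text (`key`) to an audio file name (`file`). The entries below are invented for illustration and are not taken from a real cache. `findInCache` is a hypothetical helper that does the same linear, exact-match scan as the ‘Cache Search’ node.

```javascript
// Illustrative index contents; keys and file names are made up.
const sampleIndex = [
  { key: 'Hello world!', file: '1480000000001.ogg' },
  { key: 'How can I help you?', file: '1480000000002.ogg' },
];

// Linear scan with exact string comparison.
function findInCache(index, query) {
  const entry = index.find((e) => e.key === query);
  return entry ? entry.file : null;
}

console.log(findInCache(sampleIndex, 'Hello world!'));
console.log(findInCache(sampleIndex, 'Hello world'));
```

Because the match is exact, ‘Hello world’ and ‘Hello world!’ are different cache entries. Normalizing case and punctuation before the lookup would raise the hit rate, at the cost of occasionally reusing audio with slightly different intonation.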

Code:
[{"id":"3de6aca1.301b04","type":"exec","z":"5d32c0c0.5a8a3","command":"omxplayer","addpay":true,"append":"","useSpawn":"","timer":"","name":"Play","x":630,"y":80,"wires":[[],[],[]]},{"id":"45f20593.7f955c","type":"function","z":"5d32c0c0.5a8a3","name":"Play Audio","func":"\nbase = \"/tmp/cache_speech/\";\nmsg.payload = base + msg.file;\n\nreturn msg;","outputs":"1","noerr":0,"x":427.1428756713867,"y":219.999981880188,"wires":[["1704af8e.4f3a3"]]},{"id":"5051c394.5cb85c","type":"function","z":"5d32c0c0.5a8a3","name":"Cache Search","func":"msg.query = msg.payload;\n\n// for latency computation\nvar id = msg._msgid.replace('.',''); //remove '.' as that messes up flow.get;\nvar now = new Date().getTime();\nflow.set(id, now);\n// end of latency\n\nvar cache = flow.get ('cache_speech') || 0;\nif (cache === 0) \n{\n msg.cache = \"Cache not initialized.\";\n return [null, msg];\n}\n\nvar len = cache.length;\n\nfor (var idx=0;idx<len;idx++)\n{\n if (cache[idx].key === msg.query)\n {\n msg.file = cache[idx].file;\n msg.cache = \"Found in Cache\";\n return [msg, null];\n } \n}\n\nmsg.cache = \"Entry not in Cache\";\nreturn [null, msg];","outputs":"2","noerr":0,"x":202.14287567138672,"y":200.999981880188,"wires":[["45f20593.7f955c"],["efaeee4c.1a2a2"]]},{"id":"79ddfad6.6d7c74","type":"function","z":"5d32c0c0.5a8a3","name":"Latency","func":"\nvar id = msg._msgid;\nid = id.replace('.',''); //remove '.' 
as that messes up flow.get\nvar now = new Date().getTime();\n\n\nvar startTime = flow.get(id);\nvar dt = now - startTime;\nmsg.latency = \n{\n 'tts': dt + \"ms\", \n 'id':id, \n 'start':startTime,\n 'end':now\n};\nreturn msg;\n","outputs":1,"noerr":0,"x":640,"y":140,"wires":[["c1b220aa.57df8"]]},{"id":"ed5811e.f032df","type":"function","z":"5d32c0c0.5a8a3","name":"name file","func":"//generate the filename\nvar d = new Date();\nvar n = d.getTime();\nmsg.file = n + \".ogg\";\n\nreturn msg;\n","outputs":1,"noerr":0,"x":200,"y":340,"wires":[["6c548c82.edc9e4","45f20593.7f955c","839d03f6.a6ced"]]},{"id":"c1b220aa.57df8","type":"debug","z":"5d32c0c0.5a8a3","name":"tts latency","active":true,"console":"false","complete":"latency.tts","x":790,"y":140,"wires":[]},{"id":"efaeee4c.1a2a2","type":"watson-text-to-speech","z":"5d32c0c0.5a8a3","name":"Watson TTS","lang":"english","voice":"en-US_MichaelVoice","format":"audio/ogg; codecs=opus","x":190,"y":280,"wires":[["ed5811e.f032df"]]},{"id":"6c548c82.edc9e4","type":"function","z":"5d32c0c0.5a8a3","name":"Cache Save","func":"var cache = flow.get ('cache_speech') || 0;\nif (cache === 0) \n{\n return null;\n}\n\nvar newEntry =\n { \n 'key': msg.query,\n 'file': msg.file\n };\ncache.push(newEntry);\n\nflow.set('cache_speech', cache);\n\nmsg.payload = cache;\nreturn msg;","outputs":1,"noerr":0,"x":367.1428756713867,"y":339.999981880188,"wires":[["93641ef2.f6896"]]},{"id":"839d03f6.a6ced","type":"function","z":"5d32c0c0.5a8a3","name":"Save Speech","func":"base = \"/tmp/cache_speech/\";\nmsg.filename = base + msg.file;\nmsg.payload = msg.speech;\nreturn 
msg;","outputs":1,"noerr":0,"x":367.1428756713867,"y":379.999981880188,"wires":[["391f9a2c.8ac976"]]},{"id":"93641ef2.f6896","type":"file","z":"5d32c0c0.5a8a3","name":"cache_speech","filename":"/tmp/cache_speech/cache_speech.json","appendNewline":false,"createDir":false,"overwriteFile":"true","x":537.1428756713867,"y":339.999981880188,"wires":[]},{"id":"391f9a2c.8ac976","type":"file","z":"5d32c0c0.5a8a3","name":"SaveSpeech","filename":"","appendNewline":false,"createDir":false,"overwriteFile":"true","x":527.1428756713867,"y":379.999981880188,"wires":[]},{"id":"cd3b9857.dafc58","type":"function","z":"5d32c0c0.5a8a3","name":"reset cache file","func":"//TODO: write something to rm -rf *.ogg in the cache folder\n\nvar cache = \n [\n { \n key: \"cache test A\",\n file: \"abc.wav\"\n },\n { \n key: \"cache test B\",\n file: \"efg.wav\"\n \n }\n ];\nflow.set('cache_speech', cache);\nmsg.payload = cache;\nreturn msg;","outputs":"1","noerr":0,"x":300,"y":560,"wires":[["53652bb2.620434"]]},{"id":"36dc7f41.3d1ce","type":"inject","z":"5d32c0c0.5a8a3","name":"DELETE","topic":"","payload":"go","payloadType":"str","repeat":"","crontab":"","once":false,"x":120,"y":560,"wires":[["cd3b9857.dafc58"]]},{"id":"db331921.fbd6b8","type":"inject","z":"5d32c0c0.5a8a3","name":"SHOW","topic":"","payload":"this is a test","payloadType":"str","repeat":"","crontab":"","once":false,"x":110,"y":520,"wires":[["bf70964b.fd6418"]]},{"id":"bf70964b.fd6418","type":"function","z":"5d32c0c0.5a8a3","name":"dump mem cache","func":"var cache = flow.get ('cache_speech');\nmsg.payload = cache;\nreturn msg;","outputs":"1","noerr":0,"x":310,"y":520,"wires":[["c9e81f8c.c54a9"]]},{"id":"fa3d97e0.883498","type":"file 
in","z":"5d32c0c0.5a8a3","name":"cache_speech","filename":"/tmp/cache_speech/cache_speech.json","format":"utf8","x":300,"y":480,"wires":[["85419828.fe5188"]]},{"id":"1eecfdb9.9cf8a2","type":"inject","z":"5d32c0c0.5a8a3","name":"LOAD","topic":"","payload":"go","payloadType":"str","repeat":"","crontab":"","once":true,"x":110,"y":480,"wires":[["fa3d97e0.883498"]]},{"id":"51fb6bd8.3bdea4","type":"debug","z":"5d32c0c0.5a8a3","name":"","active":false,"console":"false","complete":"payload","x":670,"y":480,"wires":[]},{"id":"53652bb2.620434","type":"file","z":"5d32c0c0.5a8a3","name":"cache_speech","filename":"/tmp/cache_speech/cache_speech.json","appendNewline":false,"createDir":true,"overwriteFile":"true","x":480,"y":560,"wires":[]},{"id":"c9e81f8c.c54a9","type":"debug","z":"5d32c0c0.5a8a3","name":"","active":true,"console":"false","complete":"payload","x":490,"y":520,"wires":[]},{"id":"85419828.fe5188","type":"function","z":"5d32c0c0.5a8a3","name":"load cache","func":"var cache = JSON.parse(msg.payload);\n\nflow.set('cache_speech', cache);\n\nmsg.payload = cache;\n\nreturn msg;\n","outputs":"1","noerr":0,"x":490,"y":480,"wires":[["51fb6bd8.3bdea4"]]},{"id":"340934f.0a540cc","type":"comment","z":"5d32c0c0.5a8a3","name":"------------------------tts cache ---------------------------------------------------------------------------","info":"","x":370,"y":440,"wires":[]},{"id":"90933cae.df713","type":"inject","z":"5d32c0c0.5a8a3","name":"Hello world!","topic":"","payload":"Hello world!","payloadType":"str","repeat":"","crontab":"","once":false,"x":110,"y":60,"wires":[["985ccdc2.0cf15"]]},{"id":"647b59ff.a58d98","type":"inject","z":"5d32c0c0.5a8a3","name":"Long Text","topic":"","payload":"I have been assigned to handle your order status request. I am sorry to inform you that the items you requested are back-ordered. We apologize for the inconvenience. We don't know when those items will become available. Maybe next week but we are not sure at this time. 
Because we want you to be a happy customer, management has decided to give you a 50% discount!","payloadType":"str","repeat":"","crontab":"","once":false,"x":100,"y":100,"wires":[["985ccdc2.0cf15"]]},{"id":"985ccdc2.0cf15","type":"link out","z":"5d32c0c0.5a8a3","name":"","links":["636a1d0d.8f2214"],"x":255,"y":80,"wires":[]},{"id":"636a1d0d.8f2214","type":"link in","z":"5d32c0c0.5a8a3","name":"input","links":["985ccdc2.0cf15"],"x":75,"y":200,"wires":[["5051c394.5cb85c"]]},{"id":"1704af8e.4f3a3","type":"link out","z":"5d32c0c0.5a8a3","name":"","links":["bdc2f653.e17cc8"],"x":575,"y":220,"wires":[]},{"id":"bdc2f653.e17cc8","type":"link in","z":"5d32c0c0.5a8a3","name":"play","links":["1704af8e.4f3a3"],"x":535,"y":80,"wires":[["3de6aca1.301b04","79ddfad6.6d7c74"]]}]
References:
[1] The Incredible Thing We Do During Conversations http://www.theatlantic.com/science/archive/2016/01/the-incredible-thing-we-do-during-conversations/422439/