MIT Enterprise Forum: Advancing Voice Tech
Voice is on the verge of becoming a primary tech interface, according to speakers at the Advances in Speech Technology MIT Enterprise Forum in New York (February 15).
Key areas of development include the emerging field of "conversation design" – creating personalised and context-sensitive dialogues between humans and machines – and the rapidly advancing area of intelligent speech transcription. We round up the highlights:
- Voice-to-Visual Spectrum: Dan Padgett, head of conversation design at Google, said the company is building an ecosystem spanning voice-only (for example, smart speakers), intermodal (smartphones, for instance) and visual-only (like televisions) devices. The goal is to co-ordinate experiences across these platforms, Padgett explained: "If [users] ask about shirts at Gap, it might be better to move them to a screen and show what they look like," rather than verbally describing every shirt.
- Google also launched the petite Home Mini smart speaker last autumn, and announced screen-equipped smart displays similar to the Amazon Echo Show at CES last month, set to launch mid-year. Google Home devices are now available in 12 markets spanning eight languages.
- Cracking the Conversation Code: AI-powered speech recognition is advancing quickly, with a word-error rate of 4.9% in 2018 compared to 6.1% at the end of 2016, said Padgett – who describes his role as "teaching robots to talk to humans".
- While technology can recognise speech, understanding what users mean remains challenging. For example, 'Springfield' could refer to a town in Missouri, Massachusetts or elsewhere; while a request to play 'Yesterday' is likely to mean the Beatles track – but the Boyz II Men cover is also a popular version.
- It's crucial for voice-led tech to acknowledge ambiguity, let users clarify their intent, and remember user choices for the next encounter. Slowing down the interaction is preferable to getting it wrong, given users' low tolerance for error.
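The pattern described above – surface the ambiguity, let the user clarify, then remember the choice – can be sketched in a few lines of Python. This is purely illustrative: the `AMBIGUOUS` table, `preferences` store and `handle_play_request` function are hypothetical, not part of any assistant's actual implementation.

```python
# Illustrative sketch of a disambiguation flow with memory.
AMBIGUOUS = {
    "yesterday": ["Yesterday by The Beatles", "Yesterday by Boyz II Men"],
}

preferences = {}  # remembered choices, keyed by the ambiguous request


def handle_play_request(song, choose):
    """Resolve a play request; `choose` is a callback that asks the
    user to pick from a list of options (the clarifying question)."""
    key = song.lower()
    options = AMBIGUOUS.get(key)
    if options is None:
        return f"Playing {song}"          # unambiguous: just play it
    if key in preferences:
        return f"Playing {preferences[key]}"  # remembered from last time
    picked = choose(options)              # slow down and clarify intent
    preferences[key] = picked             # remember for the next encounter
    return f"Playing {picked}"
```

On the second request for the same song, the clarifying question is skipped entirely – trading a slightly slower first interaction for a faster, error-free one thereafter.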
- Designing Oral Dialogues: Designing for voice revolves around interactions that are "linear, always moving forward and ephemeral", said Padgett. Information must be structured to support easy recall. The "end-focus principle", for instance, dictates that old or known information comes first, with new facts placed at the end, where listeners are best able to retain them.
- Strides in Transcription Tech: Several companies showcased AI-fuelled speech transcription tech, primarily targeted at enterprise customers. Silicon Valley speech-tech start-up AISense is seeking to "translate all spoken conversations as usable data, regardless of linguistic difficulty", said Seamus McAteer, head of revenue and partnerships.
- Launched in February 2018, the company's Otter app records and transcribes audio – for example, from meetings or calls. AISense also powers a feature introduced in January 2018 by video-conferencing firm Zoom that offers automatic transcriptions of meetings held on the platform. This feature is likely to become standard for web conferencing in the next few years.
- Stenopoly, the first product from Philadelphia-based Lovoco, performs real-time bilingual captioning, translation and transcription. The company is primarily targeting the meetings industry and academia.
- Californian tech firm VoiceBase's software performs speech analytics, pulling out keywords and topics. A user can click on a keyword to find all instances of it in audio or video recordings, or search millions of voice conversations with a simple query. The company also uses deep learning to analyse conversations at scale and predict, for instance, which customers will convert in a given timeframe, or which transactions are fraudulent.
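Keyword search across recordings of the kind VoiceBase describes can be pictured as an inverted index over transcripts, mapping each spoken word to every recording and position where it occurs. The sketch below is a minimal illustration under the assumption that transcripts are plain word sequences; `build_index` and `search` are hypothetical names, not VoiceBase's API.

```python
# Illustrative sketch: a minimal inverted index over transcribed calls.
from collections import defaultdict


def build_index(transcripts):
    """transcripts: dict mapping recording_id -> transcribed text.
    Returns keyword -> list of (recording_id, word_position)."""
    index = defaultdict(list)
    for rec_id, text in transcripts.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].append((rec_id, pos))
    return index


def search(index, keyword):
    """Return every (recording, position) where the keyword was spoken."""
    return index.get(keyword.lower(), [])
```

A single query then surfaces every occurrence of a term across an entire archive of conversations – the same index could feed downstream analytics such as topic extraction.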
See 10 Tech Trends to Watch in 2018, CES 2018: Home Electronics and IFA Berlin 2017 for more on the rise of talkative technology. For an overview of voice-first marketing developments, read Advertising in the Alexa Era.