Speech Recognition in the Covid Era, an Interview with Jeff Adams, Founder and CEO of Cobalt Speech
Updated: May 17
I recently caught up with Jeff Adams, Founder and CEO of Cobalt Speech. Jeff is a pioneer in the voice/speech recognition industry with over 20 years of experience. Prior to founding Cobalt, he led the team at Amazon that developed Amazon Alexa. Today, Cobalt provides custom speech technologies for many applications including language learning, healthcare and agriculture. The company is developing voice/speech technologies beyond what is currently covered by Alexa, Google Assistant and Siri.
In our chat, we discussed these 3 main speech recognition trends amidst Covid:
We are making headway in using speech to detect Covid symptoms through biomarkers in people's voices. If successful, this would make Covid testing accessible to the masses.
We will increasingly see speech recognition tech used for devices in public places to reduce touching of shared screens.
Speech transcription will play a key role in the surge of video conferencing, even though accuracy still falls short of human transcription.
Let’s start with some industry backdrop. Voice and speech recognition is a $12B global market today and is expected to grow at 17% CAGR over the next 5 years. Can you break down that landscape for us?
I would segment the market into 3 main areas.
Speech Transcription. This is one of the most traditional applications of speech recognition, but there’s still a ton of demand for it. Use cases include transcribing phone calls, meetings, and lectures.
Conversational Interfaces. This is a pretty broad category, and refers to natural language/spoken interfaces. Functions that fall into this category are voice control of machinery, smart speakers, and chatbots.
Speaker Analytics. This is a newer area within speech technology that detects not only what a person is saying, but how they are saying it. Call centers use this to monitor customer satisfaction and auto manufacturers use this to detect driver fatigue.
What Covid-specific projects are you currently working on?
We’ve gotten a significant amount of Covid-specific project requests in the last 2 months. One proposal we received is for a Covid-checkpoint kiosk. It’s a kiosk that people would interact with when entering the workplace or other public space. A speech-based, no-touch solution is critical here because we don’t want people touching the same screen. The kiosk would ask you a series of questions about your health before confirming it is safe for you to enter.
We are also working on projects to directly recognize Covid symptoms through biomarkers in a patient’s speech. Cobalt spin-off company Canary Speech, a trailblazer in speech-based healthcare diagnosis, is making significant headway in these efforts. (Stay tuned next week for a separate interview with Canary Speech founder Henry O’Connell).
Zoom has seen an explosion in usage, adding 100 million daily meeting participants in the month of April to reach a total of 300 million. What role will Cobalt play in the surge of video conferencing as a whole?
We have a lot of customers who are approaching us to work on solutions to transcribe video conferences. I think a lot of businesses have taken a while to accept the severity of the situation, so developments have not been as quick to roll out as we may have hoped. We are working diligently to negotiate these projects and expect new partnerships over the next few months. (Note: Zoom currently has a live meeting transcription option powered by Otter.ai.)
Technically speaking, how is transcribing a meeting different from transcribing a voicemail?
Transcribing an in-person meeting is much more difficult than transcribing a voicemail. In a meeting, there are many participants and people tend to talk over each other. They stutter and hesitate more than they would in a voicemail. Participants usually speak around a central microphone and are varying distances away from it.
Video conferences are actually a bit easier. We already know who the separate speakers are because they are speaking into individual devices. Speakers are also less likely to interrupt each other on a video conference. Today, I’m not aware of any technologies that are as accurate as human ears in transcribing meetings.
Continuing on this topic, I agree that teams ultimately trust a human to interpret live meetings, but looping someone in just to take meeting notes seems like a waste of human capital. What’s the best approach for companies who want an accurate but efficient way of transcribing meetings?
I think the best solution in the near term is a hybrid approach: technology and human. This means letting the transcription technology do its work, then having a human monitor it and clean it up. This combination will be more accurate than either approach by itself.
What’s the primary measure of accuracy in speech recognition and what’s considered a good number?
The primary measure of accuracy in speech recognition is a metric called word error rate. This is the number of word errors (substitutions, insertions, and deletions) divided by the total number of words spoken. There seems to be a magic number of about 17% error. It’s not perfect, but this is the tipping point for our understanding. Anything less accurate than that, and it’s hard for people to understand the output. When it’s more accurate, something clicks and we can fill in the blanks.
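For readers curious how word error rate is actually computed, it is typically a word-level edit distance between the reference transcript and the system's output, divided by the reference length. A minimal sketch (the example sentences are invented for illustration):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting every reference word
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
    return d[len(ref)][len(hyp)] / len(ref)

reference = "let us start with some industry backdrop"
hypothesis = "let us start with sum industry back drop"
# Two substitutions plus one insertion over 7 reference words: 3/7 ≈ 0.43
print(round(word_error_rate(reference, hypothesis), 2))  # → 0.43
```

By this measure, the ~17% threshold Jeff mentions corresponds to a WER of about 0.17, i.e. roughly one word in six transcribed incorrectly.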
Where is technology today on recognizing speech from people with different accents and backgrounds?
Accents are no longer the red flag they used to be in speech, but rather, a yellow flag. We can still get very good results from people with different accents. It’s always going to be a challenge when there is more variation in the data set, but it’s certainly something we’re getting better at. Cobalt is working on a number of projects to make speech recognition more robust and inclusive.
What insights can you share on Speaker Analytics and what role will the automotive industry play? (Automotive is expected to be the biggest growth end market for speech/voice over the next 5 years.)
We’re experiencing a new age in Speaker Analytics. To recap, these are the solutions that detect not just what you say, but how you say it. As humans, we do this all the time with each other when we deduce a person’s mood/intention through cues in their voice.
Automotive has been using voice technology for a long time to eliminate distractions in the car. This includes helping to tune your radio or roll up your window. Now, automotive is using Speaker Analytics technology to detect if a driver is fatigued, intoxicated or otherwise unfit to drive. This is a great application that improves driver safety. Finally, in the age of self-driving cars, voice applications will extend functionality to provide entertainment to the passenger.
In conclusion, I want to check in on how the Cobalt team is doing. How many employees are on the Cobalt team today?
Today, we have 27 extremely talented employees. 27 is actually a lucky number for us because Cobalt is atomic element #27 on the periodic table.
Fun Founder Facts
1. Baking bread is Jeff’s favorite indoor activity during shelter-in-place.
2. Abe’s / Artichoke Pizza is Jeff’s favorite restaurant around UC Berkeley. Like me, Jeff is a UC Berkeley alum, but he has not been back in many years.
About the Founder
Jeff Adams is a prominent figure in speech & language technology and has over 20 years of industry experience. Until 2009, he worked at Nuance / Dragon where he was responsible for developing and improving speech recognition for Nuance’s “Dragon” dictation software. He presided over many of the critical improvements in the 1990s and 2000s that brought the technology into the mainstream and enabled widespread consumer adoption. Following that, Jeff joined Yap, a company specializing in voicemail-to-text transcription. Yap was eventually acquired by Amazon, so Jeff joined Amazon and went on to lead the team that developed Amazon Alexa. He left Amazon in 2014 to found Cobalt Speech and Language. He holds 25 patents in speech technology. Jeff is originally from Hayward/Pleasanton CA, and now resides in Tyngsboro, Massachusetts.
Cobalt is always looking for business development partners to explore new speech applications. Interested parties can reach out to Jeff at firstname.lastname@example.org.
About the Author
Bonnie Young runs the Amplified blog. She shares her insights on market trends and interviews founders that are shaking up the tech scene. Please reach out to her on LinkedIn with your questions and feedback.