Scott Stephenson got the idea for Deepgram, his speech-recognition startup, in an unlikely place — deep underground.
Four years ago, Stephenson was a postdoctoral researcher working on a project to detect “dark matter,” the term for the thus-far undetected collection of sub-atomic particles that theoretically comprises the vast majority of matter in the universe.
He and his team set up sensors at the former Homestake Gold Mine in South Dakota, some two miles below the surface of the Earth. (Experiments such as Stephenson’s that aim to detect rarely or never-before-seen particles are often conducted deep underground to help filter out more common particles and radiation.)
After getting their experiment up and running, Stephenson and his colleagues had little to do during the eight to 10 hours a day they’d spend down in the mine except monitor the equipment to make sure nothing was going wrong. That left them with plenty of spare time.
They played ping pong, mined bitcoin, and messed around to fill the hours. But then they started talking about finding a way to make a backup copy of their lives and design it so they could find and replay the most interesting moments. They figured the easiest way to do that at the time would be to record audio as they went about their days, talking, working, and interacting. So they started making the recordings, ending up with hundreds of hours’ worth of audio.
But they quickly realized that the recordings were full of lots of useless data — lots of noise, long periods of silence, and many uninteresting moments — and there wasn’t an easy way to find what was important within them. They tried using the speech recognition and transcription applications that were available at the time, but none were up to the task.
“Whatever we tried, it was just really bad or non-existent,” Stephenson told Business Insider on Friday. “And we were like, ‘Wow, I bet we could build something better than this.’”
Making sense of audio is similar to searching for dark matter
Given that computer scientists specializing in speech recognition have been working on the technology for nearly 70 years, that thought might seem a bit audacious. After all, Stephenson and his team were particle physicists, not speech recognition specialists. But it turns out that the two disciplines have a lot in common these days.
In searching for dark matter, Stephenson and his team were using sensors to collect wave forms and using machine learning techniques to sift through the data they’d collected to try to find particular ones. The wave forms their systems were combing through looked a lot like recordings of sound waves. Stephenson and his colleagues figured they could use techniques similar to what they used to try to detect dark matter to make sense of audio recordings. They built a prototype, and it worked. By happenstance, they got introduced to an investor, and Stephenson ended up forming Deepgram around the speech recognition technology they’d developed.
“We are not speech people, and we are outsiders in that way,” Stephenson said.
Speech recognition systems typically work in a linear, step-by-step fashion, he said. They might start by trying to strip out background noise. Then they might try to decipher the phonemes — the individual word sounds — in the audio. Next, they might try to guess the words based on language models. Then they might try to deduce phrases or sentences, based on other models of word usage.
By contrast, Deepgram’s system uses deep learning techniques to do all of those steps all at once, Stephenson said. Raw audio gets fed into it, and the system itself develops its own models of how to best decipher it, based on human training.
“You pump audio into it, you turn the crank on training, and then you end up with a world-class model on the other side,” he said.
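The contrast Stephenson describes can be sketched in code. This is purely illustrative — not Deepgram’s actual system — and every function name below is a stand-in stub for a real signal-processing or language-model component:

```python
# Illustrative stubs standing in for real components.
def denoise(audio):
    # Strip "noise" tokens, mimicking background-noise removal.
    return [s for s in audio if s != "noise"]

def detect_phonemes(audio):
    # Pretend to decode word sounds.
    return [s.upper() for s in audio]

def phonemes_to_words(phonemes):
    # Pretend to apply a language model.
    return list(phonemes)

def words_to_sentences(words):
    # Pretend to apply word-usage models.
    return " ".join(words)

def classic_pipeline(raw_audio):
    """Step-by-step approach: each stage hands its best guess to the
    next, so an error made early on compounds downstream."""
    clean = denoise(raw_audio)
    phonemes = detect_phonemes(clean)
    words = phonemes_to_words(phonemes)
    return words_to_sentences(words)

def end_to_end(raw_audio, model):
    """End-to-end approach: a single trained model maps raw audio
    straight to text, with all the stages learned jointly."""
    return model(raw_audio)

print(classic_pipeline(["hello", "noise", "world"]))
```

In the end-to-end version there are no hand-built intermediate stages to tune separately; training data shapes the whole mapping at once, which is the property Stephenson credits for the accuracy gains.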
Deepgram’s service can be customized for particular clients
Just by itself, Deepgram’s system is somewhat more accurate than those offered by the biggest players in the game — IBM, Nuance, Amazon, and Google, Stephenson said. But unlike those systems, Deepgram’s can easily create individual language models for each of the company’s customers and applications. When it uses those custom models, the system is significantly more accurate than those of other companies, going from more than 70% accuracy to more than 90%, he said.
“This is something that nobody else can offer,” he said.
For now, San Francisco-based Deepgram is focusing on corporate call centers and video conferencing services, offering quick and even real-time transcription services, Stephenson said. Among its customers are call center operator Genesys and conferencing system provider Poly. Its service can be used to monitor whether customer service agents are compliant with regulations and are following company guidelines when interacting with customers. It could also potentially be used in real time to monitor a support call and provide the agent, during the conversation, with information related to the problem they’re discussing with the customer, he said. The company charges customers based on the amount of audio they upload to its system.
Deepgram is working on a way to automatically detect the language speakers are using and adapt to different kinds of accents. The company is also working on a way to recognize different voices from lower-quality audio recordings.
Stephenson has believers in his technology and vision. On Wednesday, Deepgram announced it has raised $12 million in a Series A funding round led by Wing VC. Among its other backers is Nvidia.
The company, which has 40 people, plans to use the money to invest in research, engineering and marketing.
“Deepgram is a new product that’s not based on the old tech, but just works better: better, faster, more reliable, more accurate,” Stephenson said.
Here is the pitch deck Stephenson and Deepgram used to raise the new funding: