Google has disclosed plans to improve its Cloud Speech-to-Text API using the same speech recognition technology that powers Google Assistant and Search. The updated API (formerly known as the Cloud Speech API) is expected to improve voice recognition accuracy and reduce transcription errors by as much as 54 percent.
The announcement came last week in a Google blog post. The changes let developers enable Internet of Things (IoT) devices to respond to users, power voice response systems for call centers, and turn text-based media into speech.
The updates suggest that the Mountain View-based company is increasingly determined to bring AI-powered tools to its products.
Google has made a number of key improvements to the API. Chief among them, developers can now choose between several machine learning models. There are currently four speech recognition models: one for short queries and voice commands, one for audio from phone calls, one for audio from video, and a default model.
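Selecting a model amounts to naming it in the request configuration. The sketch below is a minimal illustration of that choice; the field names (`model`, `languageCode`, and so on) mirror the public API's JSON request format, but treat the exact shape here as an assumption rather than a verbatim API reference.

```python
# Hypothetical helper that builds a minimal Speech-to-Text request config,
# selecting one of the four recognition models described in the article.

def build_config(model: str, sample_rate_hz: int = 16000) -> dict:
    """Return a minimal recognition config that names a specific model."""
    supported = {
        "command_and_search",  # short queries and voice commands
        "phone_call",          # audio from phone calls
        "video",               # audio from video
        "default",             # general-purpose fallback
    }
    if model not in supported:
        raise ValueError(f"unknown model: {model}")
    return {
        "encoding": "LINEAR16",
        "sampleRateHertz": sample_rate_hz,
        "languageCode": "en-US",
        "model": model,
    }

# A config requesting the video-optimized model:
video_config = build_config("video")
```

In practice this dict would be sent as the `config` portion of a recognize request alongside the audio payload.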
The company has also included a video model in the update, built to process audio from videos and recordings with several speakers. The model uses machine learning similar to that behind YouTube captioning, and will reportedly reduce errors by as much as 64 percent.
There’s also the “enhanced phone_call” model, offered through one of Google’s first opt-in data logging programs. Customers who agree to let Google log their audio data to improve the system receive access to the enhanced model in return.
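For an opted-in project, requesting the enhanced model is a matter of setting a flag on the same configuration. The `useEnhanced` field below mirrors the public API's naming, but whether the request is honored depends on the project having joined the data logging program, so this is a sketch under that assumption.

```python
# Hypothetical config requesting the enhanced phone_call model.
# Assumption: the calling project has opted into Google's data logging
# program; otherwise the standard phone_call model would be used.

def enhanced_phone_config() -> dict:
    """Request the enhanced phone_call model for an opted-in project."""
    return {
        "languageCode": "en-US",
        "model": "phone_call",
        "useEnhanced": True,  # only honored for opted-in projects
    }

config = enhanced_phone_config()
```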
The updated Speech-to-Text API also rolled out a punctuation model. Google has admitted that transcriptions from the previous model were hard to read because of unusual punctuation (or the lack of it). To be fair, punctuating transcribed audio is very challenging, but Google says that its new Speech-to-Text model will produce transcriptions that are easier to read and understand, with fewer run-on sentences and more commas, question marks, and periods.
The improved punctuation is due to a new LSTM (long short-term memory) neural network, which automatically suggests the punctuation marks to use in the text. The feature can be a big help when taking notes by voice or transcribing conference calls.
The Speech-to-Text update also allows developers to tag transcribed video or audio with optional recognition metadata. While it’s not yet clear how this will benefit developers directly, Google says it will use the accumulated data to decide which features to prioritize.
The improvements Google made to its Speech-to-Text API seem to indicate that the company wants to attract more business users. Its new phone call and video transcription models appear to be particularly geared toward tasks like those carried out by call centers. The API can now support two to four speakers and takes into account background noises like hold music and line static.
The API can also be used to transcribe video broadcasts of sporting events like basketball. Sports broadcasts involve multiple audio sources: the hosts, player interviews, advertisements, crowd cheers, sound effects and other noise from the game itself.
In a blog post, Dan Aharon, Google’s Cloud AI product manager, noted that use of the Speech-to-Text API has been steadily increasing. He added that “Access to quality speech transcription technology opens up a world of possibilities for companies that want to connect with and learn from their users.”
Google’s video and “enhanced phone_call” models are now available for English transcription. Additional languages will follow.
Audio transcription will continue to cost $0.006 per 15 seconds, while the video model costs $0.012 per 15 seconds. However, companies can try the video model at a promotional rate of $0.006 per 15 seconds through May 31.
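At those rates, costs are easy to estimate. The sketch below computes a bill from a duration in seconds; the assumption that usage is billed in 15-second increments rounded up is an illustration, not a statement of Google's exact billing rules.

```python
import math

def transcription_cost(seconds: float, rate_per_15s: float = 0.006) -> float:
    """Estimate transcription cost, assuming billing in 15-second
    increments rounded up (rounding rule is an assumption)."""
    increments = math.ceil(seconds / 15)
    return round(increments * rate_per_15s, 6)

# One hour of audio at the standard rate:
# 3600 s -> 240 increments -> 240 * $0.006 = $1.44
hour_standard = transcription_cost(3600)
# The same hour with the video model at $0.012 per 15 s costs $2.88.
hour_video = transcription_cost(3600, rate_per_15s=0.012)
```

Under the promotional rate, an hour of video transcription would cost the same $1.44 as standard audio until May 31.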