A summary of the talk Lucian Precup and Aline Paponaud gave at the Open Source Experience conference, with a link to the video, screenshots, photos and more.
Last November we attended the first edition of the Open Source Experience (OSXP) conference, where we presented the Open Source ecosystem around Speech-To-Text technologies. Earlier this year, at Berlin Buzzwords - the conference dedicated to Search, Store and Streaming technologies - we presented our approach to improving Speech-To-Text technologies with Elasticsearch: Speech to text with Elasticsearch. OSXP gave us the opportunity to give a more general presentation and to share our knowledge in French.
Speech-To-Text (STT) technologies have greatly evolved thanks to Machine Learning and Deep Learning. They are now widely available through Cloud APIs from providers such as Google, Apple, Facebook, Amazon, Microsoft, etc. In most cases, everything is processed on the server side: the voice recording or media content is uploaded to the Cloud Service Provider and processed remotely.
This introduces a degree of platform dependency, since these STT functions are often available only within each provider's own ecosystem.
Open Source technologies in this area have also made a lot of progress over the last three years. There are more and more Open Source libraries, such as Kaldi (under the Apache license), which provides models, algorithms and recipes that can be used in other libraries. Vosk, a library also under the Apache license, adds support for more spoken languages and programming languages. The historical Open Source library, CMU Sphinx, has given way to the new Machine Learning powered technologies.
Mozilla launched Common Voice, a project integrating Deep Speech - a Deep Learning technology. Unlike other technologies that collect their users' data, the Common Voice project is designed around an opt-in model: you decide whether to donate your voice and your time to the community so that the technology can advance. Concretely, on the Common Voice website you can either read a text aloud or listen to a recording and validate its transcription; each contribution counts towards the daily goal.
Mozilla Research explains how Deep Learning works in the following picture. The Speech-To-Text model proposes transcriptions, validates them, then adjusts, reconfigures and learns. A large quantity of data is needed for the model to learn correctly and reach good quality, hence the call to the community.
The Vosk API, which is the most advanced of today's Open Source technologies, can transcribe spoken phrases in real time, entirely offline, using very small trained models. In our demo, we used a model under 50 MB for French and English. Extensions are possible, and we showed an approach for transcribing paragraphs spoken in several languages.
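To give an idea of what offline transcription with Vosk can look like, here is a minimal sketch in Python. It is not the code from our demo: the helper names (`transcribe_stream`, `read_chunks`, `transcribe_file`) and the chunk size are assumptions made for illustration, and the input is assumed to be a mono 16-bit PCM WAV file. The Vosk calls used (`Model`, `KaldiRecognizer`, `AcceptWaveform`, `Result`, `FinalResult`) come from the Vosk Python bindings.

```python
import json
import wave


def transcribe_stream(recognizer, chunks):
    """Feed raw audio chunks to a Vosk-style recognizer and return the text.

    `recognizer` only needs AcceptWaveform(), Result() and FinalResult(),
    which is the interface exposed by vosk.KaldiRecognizer.
    """
    pieces = []
    for chunk in chunks:
        # AcceptWaveform returns True when an utterance is complete.
        if recognizer.AcceptWaveform(chunk):
            text = json.loads(recognizer.Result()).get("text", "")
            if text:
                pieces.append(text)
    # Flush whatever audio is still buffered at the end of the stream.
    text = json.loads(recognizer.FinalResult()).get("text", "")
    if text:
        pieces.append(text)
    return " ".join(pieces)


def read_chunks(wav_path, frames_per_chunk=4000):
    """Yield successive blocks of frames from a WAV file (hypothetical helper)."""
    with wave.open(wav_path, "rb") as wav:
        while True:
            data = wav.readframes(frames_per_chunk)
            if not data:
                break
            yield data


def transcribe_file(wav_path, model_path):
    """Transcribe a mono 16-bit PCM WAV file with a local Vosk model."""
    # Imported here so the helpers above can be used without Vosk installed.
    from vosk import Model, KaldiRecognizer

    with wave.open(wav_path, "rb") as wav:
        sample_rate = wav.getframerate()
    recognizer = KaldiRecognizer(Model(model_path), sample_rate)
    return transcribe_stream(recognizer, read_chunks(wav_path))
```

Calling `transcribe_file("talk.wav", "path/to/model")` with a downloaded model directory would then return the transcript as a single string; both paths are placeholders. Because the audio is consumed chunk by chunk, the same loop can be fed from a microphone stream instead of a file.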
We identified several use cases for Speech-To-Text technologies: text transcription, indexing audio and video content within a search engine, documenting podcasts, making conferences and lectures accessible to everyone and also recognizing user queries in the context of a voice assistant.
We finished our presentation with a demonstration of Speech-To-Text technologies in the context of all.site, the collaborative search engine. Information is increasingly available as multimedia content, both on the Internet and internally on Intranets. With e-learning in particular, media content can sometimes outnumber text resources. Technologies such as the Vosk API allow all.site's crawlers to go beyond text and text metadata by exploring and indexing the information available in audio and video content.
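As an illustration of how a transcript could be prepared for a search engine, the sketch below groups word-level timings into time-coded segments and wraps them in a document ready to be indexed. This is not all.site's actual pipeline: the function names, the document fields (`url`, `transcript`, `segments`) and the one-second pause threshold are all hypothetical. The input shape (`word`, `start`, `end`) mirrors what Vosk returns in its `result` list when word-level output is enabled with `SetWords(True)`.

```python
def build_segments(words, max_gap=1.0):
    """Group timed words into segments, splitting on pauses longer than max_gap seconds."""
    groups = []
    current = []
    for word in words:
        # Start a new segment when the silence since the previous word is too long.
        if current and word["start"] - current[-1]["end"] > max_gap:
            groups.append(current)
            current = []
        current.append(word)
    if current:
        groups.append(current)
    return [
        {
            "start": group[0]["start"],
            "end": group[-1]["end"],
            "text": " ".join(w["word"] for w in group),
        }
        for group in groups
    ]


def to_index_document(media_url, words):
    """Build a search document for one media file (field names are hypothetical)."""
    segments = build_segments(words)
    return {
        "url": media_url,
        # Full text for ordinary keyword search.
        "transcript": " ".join(s["text"] for s in segments),
        # Time-coded segments so a hit can point to a position in the media.
        "segments": segments,
    }
```

Keeping the segments alongside the full transcript is one possible design choice: the full text serves regular queries, while the timings let a search result jump to the moment in the audio or video where the words were spoken.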
We would like to thank the conference organizers for their flawless service. We would also like to thank the audience for their warm welcome.
Below, you can find the recording of the conference, available thanks to OSXP’s live streaming.
The video: From voice to text, the power of the Open Source ecosystem (in French)