From voice to text, the power of the Open Source ecosystem - a look back at the OSXP conference

01/12/2021

Author(s):

Lucian Precup

Reading time: 4 minute(s)

A summary of the talk Lucian Precup and Aline Paponaud gave at the Open Source Experience conference, with a link to the video, screenshots, photos and more.

Last November we attended the first edition of the Open Source Experience (OSXP) conference, where we presented the Open Source ecosystem around Speech-To-Text technologies. Earlier this year, at Berlin Buzzwords - the conference dedicated to Search, Store and Streaming technologies - we presented our approach to improving Speech-To-Text technologies with Elasticsearch: Speech to text with Elasticsearch. OSXP gave us the opportunity to give a more general presentation and to share our knowledge in French.

Speech-To-Text (STT) technologies have evolved greatly thanks to Machine Learning and Deep Learning. They are now widely available through Cloud APIs from providers such as Google, Apple, Facebook, Amazon, Microsoft, etc. In most cases, everything is processed on the server side: the voice recording or media content is uploaded to the Cloud Service Provider and processed remotely.

This introduces a degree of platform dependency, since these STT functions are often available only within the provider's own ecosystem.

The Google Search functionality available on Google Chrome

Open Source technologies in this area have also made a lot of progress over the last three years. There are more and more Open Source libraries, such as Kaldi (under the Apache license), which provides models, algorithms and recipes that can be reused in other libraries. Vosk, a library also under the Apache license, adds support for more spoken languages and more programming languages. The historical Open Source library, CMU Sphinx, has given way to the new Machine Learning powered technologies.

Mozilla launched Common Voice, a project integrating Deep Speech - a Deep Learning technology. Unlike other technologies that collect their users' data, the Common Voice project is designed around an opt-in model: you decide to donate your voice and your time to the community so that the technology can advance. Concretely, on the Common Voice website you can read a text aloud, or listen to a recording and validate its transcription. Each contribution counts towards the daily goal.

The Deep Speech and Mozilla Common Voice projects

The way Deep Learning works is nicely explained in a picture by Mozilla Research. The Speech-To-Text model makes a proposal, checks it against the expected transcription, then adjusts its parameters and learns from the difference. A large quantity of data is needed for the model to learn correctly and reach good quality. Hence the call to the community.
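As a rough illustration of that propose, check and adjust loop, here is a minimal sketch in Python. It is a toy, not how Deep Speech is trained: it learns a single number from pairs of values, whereas a real STT model has millions of parameters and learns from hours of audio. The function name `train` and its parameters are made up for this example.

```python
def train(samples, learning_rate=0.01, epochs=200):
    """Fit y ~ weight * x by repeatedly predicting, measuring the error and adjusting."""
    weight = 0.0
    for _ in range(epochs):
        for x, target in samples:
            prediction = weight * x              # make a proposal
            error = prediction - target          # check it against the known answer
            weight -= learning_rate * error * x  # adjust so the error shrinks next time
    return weight


if __name__ == "__main__":
    # The hidden rule in this toy data set is y = 2 * x.
    print(train([(1, 2), (2, 4), (3, 6)]))
```

The more examples the loop sees, the closer the learned value gets to the rule hidden in the data, which is why projects like Common Voice need many contributions.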

The Vosk API, which is the most advanced of the Open Source technologies today, can transcribe spoken phrases in real time. And all this offline, with the help of very small trained models. In our demo, we used models under 50 MB for French and English. The approach can be extended: we showed one way to transcribe paragraphs spoken in several languages.
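To give an idea of what offline transcription with Vosk looks like, here is a minimal sketch in Python. It is not the code of our demo. It assumes the `vosk` package is installed and that a model has been downloaded and unpacked locally. The helper names (`transcribe_stream`, `read_wav_chunks`, `transcribe_file`) are ours. `Model`, `KaldiRecognizer`, `AcceptWaveform`, `Result` and `FinalResult` come from the Vosk Python bindings.

```python
import json
import wave


def transcribe_stream(recognizer, chunks):
    """Feed audio chunks to a Vosk-style recognizer and collect the text.

    `recognizer` is expected to behave like vosk.KaldiRecognizer:
    AcceptWaveform(bytes) returns True when an utterance is complete,
    Result() and FinalResult() return JSON strings with a "text" field.
    """
    pieces = []
    for chunk in chunks:
        if recognizer.AcceptWaveform(chunk):
            text = json.loads(recognizer.Result()).get("text", "")
            if text:
                pieces.append(text)
    # Flush whatever is still buffered at the end of the stream.
    tail = json.loads(recognizer.FinalResult()).get("text", "")
    if tail:
        pieces.append(tail)
    return " ".join(pieces)


def read_wav_chunks(path, frames_per_chunk=4000):
    """Yield raw PCM chunks from a WAV file (assumed mono, 16-bit PCM)."""
    with wave.open(path, "rb") as wav:
        while True:
            data = wav.readframes(frames_per_chunk)
            if not data:
                break
            yield data


def transcribe_file(model_dir, wav_path):
    """Transcribe a WAV file offline with a local Vosk model directory."""
    # Imported here so the helpers above can be used without vosk installed.
    from vosk import Model, KaldiRecognizer

    with wave.open(wav_path, "rb") as wav:
        sample_rate = wav.getframerate()
    recognizer = KaldiRecognizer(Model(model_dir), sample_rate)
    return transcribe_stream(recognizer, read_wav_chunks(wav_path))
```

Because the audio is fed chunk by chunk, the same loop works for a file or for a live microphone stream, and nothing leaves the machine. A call would look like `transcribe_file("path/to/model", "recording.wav")`, where both paths are placeholders.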

We identified several use cases for Speech-To-Text technologies: text transcription, indexing audio and video content within a search engine, documenting podcasts, making conferences and lectures accessible to everyone and also recognizing user queries in the context of a voice assistant.

We finished our presentation with a demonstration of Speech-To-Text technologies in the context of all.site, the collaborative search engine. Information is increasingly available as multimedia content, both on the Internet and on internal intranets. In e-learning in particular, media content can sometimes make up the majority of resources, outnumbering text. Technologies such as the Vosk API allow all.site's crawlers to go beyond text and text metadata, by exploring and indexing the information available in audio and video content.
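The idea of making media content searchable can be sketched with a toy in-memory index. This is not all.site's implementation, only an illustration under simple assumptions: each transcript is a list of words with their start time in seconds, which is the shape Vosk returns in the "result" field when word timings are enabled on the recognizer. The class `TranscriptIndex` and its methods are hypothetical names.

```python
from collections import defaultdict


class TranscriptIndex:
    """Toy inverted index: word -> {media_id: [start times in seconds]}."""

    def __init__(self):
        self._postings = defaultdict(lambda: defaultdict(list))

    def add(self, media_id, words):
        """Index one transcript given as dicts with "word" and "start" keys."""
        for entry in words:
            self._postings[entry["word"].lower()][media_id].append(entry["start"])

    def search(self, query):
        """Return {media_id: earliest matching time} for media containing every query term."""
        terms = query.lower().split()
        if not terms:
            return {}
        # .get avoids creating empty entries for unknown terms.
        per_term = [self._postings.get(term, {}) for term in terms]
        common = set(per_term[0]).intersection(*per_term[1:])
        return {
            media_id: min(min(hits[media_id]) for hits in per_term)
            for media_id in common
        }


if __name__ == "__main__":
    index = TranscriptIndex()
    index.add("talk.mp4", [{"word": "open", "start": 1.5}, {"word": "source", "start": 1.9}])
    print(index.search("open source"))
```

Keeping the timestamps next to each word means a text search result can point to the moment in the audio or video where the words are spoken. A production search engine would add tokenization, ranking and persistence on top of this.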

all.site screenshot: a text search result coming from multimedia content

We would like to thank the conference organizers for their flawless organization. We would also like to thank the audience for their warm welcome.

Aline Paponaud and Lucian Precup on stage at OSXP

Thanks to the audience for their warm welcome at OSXP

Below, you can find the recording of our talk, available thanks to OSXP's live streaming.

The video: From voice to text, the power of the Open Source ecosystem (in French)