In collaboration with the NRC, CRIM is developing audio indexing and speaker recognition tools adapted to Indigenous languages.
Through a long-term collaboration with the National Research Council of Canada (NRC), CRIM is applying audio indexing and speaker recognition technologies to Indigenous languages. The team will work with Indigenous community-based organizations and Indigenous communities across Canada.
The challenge: lack of content indexing
Over the years, hundreds of thousands of hours of speech have been recorded in various Indigenous languages. Unfortunately, these recordings are generally not annotated or indexed. Surprisingly, even the speech data currently being collected by Indigenous communities and linguists have this problem. Because researchers do not have the tools to segment speech data at the time of recording, the amount of unannotated data in Indigenous languages is steadily increasing.
CRIM, an expert in voice technologies
As a key participant in this vast pan-Canadian project, CRIM is carrying out two projects that will serve as the basis for the development of a dozen speech recognition systems adapted to the target languages.
Project 1 – Speech segmentation to facilitate data annotation
Our experts are developing simple tools to segment voice recordings.
- Voice activity detection separates audio files into voice and non-voice data. Our experts have developed and tested a detector based on a deep neural network trained on large amounts of speech in different languages;
- Speaker extraction identifies when a given speaker is talking, using a short sample of that speaker’s voice as the query. We have developed a system based on i-vectors and are currently enhancing it with a deep learning approach;
- Language identification determines which language is being spoken. We have created a tool that picks out spoken Inuktitut and Eastern Cree from among 32 languages, based on a 5-second sample.
These tools can be used by software that linguists are familiar with. Thus, they should facilitate the annotation of the speech being collected for a variety of languages.
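To make the idea of voice activity detection concrete, here is a minimal sketch of the input and output of such a tool. It is not the detector described above: a simple energy threshold stands in for the deep neural network, and the function names, frame length and threshold value are all assumptions chosen for illustration.

```python
import math

def frame_energies(samples, frame_len):
    """Return the mean squared amplitude of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies

def detect_voice_segments(samples, sample_rate, frame_ms=20, threshold=0.01):
    """Return (start_s, end_s) spans whose frame energy exceeds `threshold`.

    Assumption: an energy threshold replaces the neural network here.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    energies = frame_energies(samples, frame_len)
    segments = []
    seg_start = None  # index of the first frame of the current active span
    for i, energy in enumerate(energies):
        active = energy > threshold
        if active and seg_start is None:
            seg_start = i
        elif not active and seg_start is not None:
            segments.append((seg_start * frame_len / sample_rate,
                             i * frame_len / sample_rate))
            seg_start = None
    if seg_start is not None:  # signal ended while still active
        segments.append((seg_start * frame_len / sample_rate,
                         len(energies) * frame_len / sample_rate))
    return segments

# Synthetic test signal: 1 s of silence, 1 s of a 220 Hz tone, 1 s of silence.
sample_rate = 8000
silence = [0.0] * sample_rate
tone = [0.5 * math.sin(2 * math.pi * 220 * n / sample_rate)
        for n in range(sample_rate)]
signal = silence + tone + silence

segments = detect_voice_segments(signal, sample_rate)
print(segments)
```

On this synthetic signal the sketch reports a single active span from 1.0 s to 2.0 s. Real recordings contain noise, music and overlapping speakers, which is why a trained model is used in practice rather than a fixed threshold.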
Project 2 – Indexing tool for searching through content by keyword
We aim to build systems that will allow us to search for specific words or phrases in audio recordings in some Indigenous languages. This will not be full speech recognition, and we will not build systems that can produce high-quality transcriptions of everything that is said in a recording. Rather, the systems will support audio keyword search, so that users can quickly search long audio recordings for specific words or topics. To achieve this goal, we need to adapt the key components of speech recognition that model words, phonemes and speech sounds, and find their limitations when applied to Indigenous languages.
- We found that the usual word-based representations do not work for Inuktitut. In English, a vocabulary of 20,000 words is large enough that only 5% of the words in a new text are missing from the vocabulary. In contrast, our collection of Inuktitut documents contains a vocabulary of 1.3 million distinct words, yet in any new Inuktitut text, approximately 60% of the words have never been seen before, because Inuktitut is an agglutinative language. We are developing new approaches that allow us to model the rich vocabulary found in many of Canada’s Indigenous languages without relying on a fixed word list.
- We have been able to automatically produce phonetic transcriptions in Eastern Cree with less than 10% error, creating a system from scratch with only four hours of pre-transcribed material. This is accurate enough to help linguists in their race to document certain languages before there are no more speakers.
- We have shown that a speech recognizer trained on a large amount of English can find the exact position of words in audio recordings, even for Inuktitut and Cree texts, making it possible to create audio books with synchronized text for use as teaching materials and language learning applications.
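The 5% and 60% figures quoted above are out-of-vocabulary (OOV) rates: the share of words in a new text that do not appear in a known vocabulary. The sketch below shows how such a rate can be computed. The function name and the toy English sentences are assumptions for illustration only, not the project's data or code.

```python
def oov_rate(vocabulary, new_tokens):
    """Return the fraction of tokens in `new_tokens` absent from `vocabulary`."""
    if not new_tokens:
        return 0.0
    known = set(vocabulary)
    unseen = sum(1 for token in new_tokens if token not in known)
    return unseen / len(new_tokens)

# Toy example (hypothetical data): words seen so far versus a new sentence.
training_text = "the house is big the houses are big".split()
new_text = "the house was bigger".split()

rate = oov_rate(training_text, new_text)
print(f"OOV rate: {rate:.0%}")
```

In the toy example, two of the four new words ("was" and "bigger") are unseen, so the rate is 50%. In an agglutinative language, a single word can combine many morphemes, so new word forms keep appearing however large the word list grows.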
To date, our work has focused on Inuktitut and Cree data. The Pirurvik Centre is providing valuable assistance on the Inuktitut aspect of this project. We are now targeting other languages, such as Tsuut’ina and Michif, to explore their specific properties and ensure that our tools are applicable to a wide range of Indigenous languages.
Stay tuned for updates on this long-term project!
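Les technologies au service des langues autochtones [Technologies in the service of Indigenous languages] (L’actualité)
La revitalisation des langues autochtones, un travail de longue haleine [Revitalizing Indigenous languages, a long-term effort] (Radio-Canada)
De nouvelles technologies développées à Montréal pour préserver les langues autochtones [New technologies developed in Montreal to preserve Indigenous languages] (Radio-Canada)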