Application of CRIM's speech technologies to Indigenous languages

Indexation of Indigenous language audio recordings to enable keyword search

Revitalizing Indigenous languages, a long-term effort (Espaces Autochtones, September 5, 2019)

Project launch - Press release (December 5, 2018)

New technologies developed in Montréal to preserve Indigenous languages (Espaces Autochtones, December 6, 2018)

Project to segment and index audio recordings of Indigenous languages (Project description - NRC website)

The NRC's collaboration with CRIM is focused on applying audio indexing and speaker recognition technologies to Indigenous languages. Over the years, hundreds of thousands of hours of speech have been recorded in various Indigenous languages. Unfortunately, these recordings are typically not annotated or indexed. Surprisingly, even speech data being collected now by Indigenous communities and linguists have this problem: because researchers lack the tools for segmenting speech data as they are being recorded, the stock of unannotated speech data in Indigenous languages is constantly growing.


CRIM’s experts are tackling two aspects of this problem.

Speech segmentation for easier data annotation

We are developing simple tools to segment speech recordings.

  • Voice activity detection separates audio files into speech and non-speech data. We developed and tested a detector based on a deep neural network, trained on large amounts of speech in various languages;
  • Speaker retrieval is used to identify when a given speaker is talking, using a short sample of the speaker’s voice (query-by-example). We developed a system based on i-vectors and are currently improving it with a deep learning approach;
  • We created a language labelling tool that can identify spoken Inuktitut and East Cree among 32 languages, based on a 5-second sample.

These tools can be used by software that linguists are familiar with and should make annotation of speech currently being collected easier for a variety of languages.
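To make the idea of speech/non-speech segmentation concrete, here is a minimal sketch in Python. It is not the deep neural network detector described above: it swaps in a much simpler energy threshold, purely for illustration. The function names, the 20 ms frame length and the -35 dB threshold are assumptions, not values from the project.

```python
import math


def frame_energies_db(samples, frame_len):
    """Return the log energy (in dB) of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        mean_sq = sum(x * x for x in frame) / frame_len
        # Small constant avoids log10(0) on digital silence.
        energies.append(10.0 * math.log10(mean_sq + 1e-12))
    return energies


def detect_speech(samples, sample_rate, frame_ms=20, threshold_db=-35.0):
    """Return (start_s, end_s) segments whose frame energy exceeds threshold_db.

    Hypothetical energy-based stand-in for a trained detector;
    samples are floats in the range [-1.0, 1.0].
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    energies = frame_energies_db(samples, frame_len)
    segments = []
    seg_start = None
    for i, energy in enumerate(energies):
        t = i * frame_len / sample_rate
        if energy > threshold_db and seg_start is None:
            seg_start = t                      # a speech segment opens
        elif energy <= threshold_db and seg_start is not None:
            segments.append((seg_start, t))    # the segment closes
            seg_start = None
    if seg_start is not None:
        # Audio ended while still inside a segment.
        segments.append((seg_start, len(energies) * frame_len / sample_rate))
    return segments
```

The resulting list of (start, end) times is the kind of output that an annotation tool could display as a segmentation tier. A real detector would also need to cope with noise and music, which a fixed energy threshold cannot do.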


Automatic segmentation displayed in ELAN linguistic annotation software.

Indexation tool for keyword search in content

We also plan to build systems that will make it possible to search for particular words or phrases in audio recordings in some Indigenous languages. This will not be full speech recognition: we will not be creating systems that produce high-quality transcriptions of everything that was said in a recording. Rather, the systems will enable audio keyword search, so that users will be able to search quickly through long audio recordings for particular words or topics. To reach that goal, we must adapt the main components of speech recognition, which model words, phonemes and speech sounds, and find their limits when applied to Indigenous languages.

  • We found that the usual word-based representations do not work for Inuktitut. In English, a vocabulary of 20,000 words is large enough that only 5% of the words in a new text will be out of vocabulary. In contrast, our Inuktitut document collection contains a vocabulary of 1.3 million distinct words, and yet in any new Inuktitut text about 60% of the words have never been seen before, because of Inuktitut’s agglutinative structure. We are developing new approaches that can model the rich vocabulary observed in many Indigenous languages in Canada without relying on a limited set of words.

The out-of-vocabulary rate stays high even with very large vocabularies.
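The out-of-vocabulary rate discussed above is simply the share of words in a new text that are missing from a fixed vocabulary. The sketch below shows one way to compute it, assuming whitespace-tokenized text; the function names and the toy English sentences are illustrative assumptions, not project data.

```python
from collections import Counter


def build_vocabulary(training_tokens, max_size):
    """Keep the max_size most frequent tokens seen in the training text."""
    counts = Counter(training_tokens)
    return {word for word, _ in counts.most_common(max_size)}


def oov_rate(vocabulary, tokens):
    """Fraction of tokens in a new text that are absent from the vocabulary."""
    if not tokens:
        return 0.0
    unseen = sum(1 for token in tokens if token not in vocabulary)
    return unseen / len(tokens)


# Toy example (hypothetical data): two of the six new tokens are unseen.
vocab = build_vocabulary("the cat sat on the mat".split(), max_size=10)
rate = oov_rate(vocab, "the dog sat on the rug".split())
```

In a language where most words are built on the fly from many morphemes, growing `max_size` helps far less than it does for English, which is why the rate plateaus at a high value.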

  • We were able to automatically produce phonetic transcriptions of East Cree with less than 10% error, creating a system from scratch with only four hours of pre-transcribed material. This is accurate enough to help linguists in their race to document some languages before there are no speakers left.
  • We showed that a speech recognizer trained on large amounts of English speech can find exact word positions in audio recordings, even for Inuktitut and Cree texts. This makes it possible to create audio books with synchronized text, to be used as educational material and in language learning apps.
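The text does not say how the transcription error above is measured. A common choice for phonetic transcription is the phone error rate: the edit distance between the reference and the automatic phone sequences, divided by the reference length. The sketch below assumes that metric; the function names and the toy phone sequences are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of phone symbols."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference phone
                cur[j - 1] + 1,          # insertion of an extra phone
                prev[j - 1] + (r != h),  # substitution (free if phones match)
            ))
        prev = cur
    return prev[-1]


def phone_error_rate(ref, hyp):
    """Edit distance normalized by the number of reference phones."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Toy example (made-up phones): one substitution and one deletion
# against a five-phone reference.
reference = ["k", "a", "t", "a", "p"]
hypothesis = ["k", "a", "d", "a"]
per = phone_error_rate(reference, hypothesis)
```

Under this assumed metric, an error below 10% would mean fewer than one edit for every ten reference phones.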

Inuktitut text aligned with audio recording.

So far, our work has focused on Inuktitut and Cree data. The Pirurvik Centre is providing valuable assistance on the Inuktitut aspect of this project. We are now targeting other languages, such as Tsuut’ina and Michif, to explore their specific properties and ensure that our tools are applicable to a broad range of Indigenous languages.


