Application of CRIM's speech technologies to Indigenous languages

Indexing Indigenous language audio recordings to enable keyword search

La revitalisation des langues autochtones, un travail de longue haleine (Espaces Autochtones, September 5, 2019)

Project launch - Press release (December 5, 2018)

De nouvelles technologies développées à Montréal pour préserver les langues autochtones (Espaces Autochtones, December 6, 2018)

Project to segment and index audio recordings of Indigenous languages (Project description - NRC website)

The NRC's collaboration with CRIM is focused on applying audio indexing and speaker recognition technologies to Indigenous languages. Over the years, hundreds of thousands of hours of speech have been recorded in various Indigenous languages. Unfortunately, these recordings are typically not annotated or indexed. Surprisingly, even speech data being collected now by Indigenous communities and linguists have this problem: because researchers lack the tools for segmenting speech data as they are being recorded, the stock of unannotated speech data in Indigenous languages is constantly growing.


CRIM’s experts are tackling two aspects of this problem.

Speech segmentation for easier data annotation

We are developing simple tools for segmenting speech recordings.

  • Voice activity detection separates audio files into speech and non-speech data. We developed and tested a detector based on a deep neural network, trained on large amounts of speech in various languages;
  • Speaker retrieval is used to identify when a given speaker is talking, using a short sample of the speaker’s voice (query-by-example). We developed a system based on i-vectors and are currently improving it with a deep learning approach;
  • We created a language labelling tool that can pick out spoken Inuktitut and East Cree from among 32 languages, based on a 5-second sample.

These tools can be used by software that linguists are familiar with and should make annotation of speech currently being collected easier for a variety of languages.


Automatic segmentation displayed in ELAN linguistic annotation software.
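
The query-by-example speaker retrieval described above can be illustrated with a minimal sketch. It assumes each audio segment has already been reduced to a fixed-length speaker embedding (an i-vector or a neural embedding) and that segments are compared with the query by cosine similarity, a common scoring choice for such embeddings. The function names, the toy 3-dimensional vectors and the 0.8 threshold are hypothetical and are not taken from CRIM's system.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_speaker(query_embedding, segments, threshold=0.8):
    """Return (start, end, score) for segments that match the query speaker.

    segments: list of (start_sec, end_sec, embedding) tuples.
    threshold: hypothetical decision threshold; in practice it would be
    tuned on held-out data.
    """
    hits = []
    for start, end, embedding in segments:
        score = cosine_similarity(query_embedding, embedding)
        if score >= threshold:
            hits.append((start, end, score))
    return hits


# Toy embeddings for illustration only; real i-vectors typically have
# hundreds of dimensions and come from a trained extractor.
query = [1.0, 0.0, 0.0]
segments = [
    (0.0, 2.5, [0.9, 0.1, 0.0]),  # close to the query speaker
    (2.5, 6.0, [0.0, 1.0, 0.0]),  # a different speaker
    (6.0, 9.0, [0.8, 0.0, 0.2]),  # close to the query speaker
]

hits = retrieve_speaker(query, segments)
for start, end, score in hits:
    print(f"{start:.1f}-{end:.1f} s  score={score:.2f}")
```

The returned time spans are the kind of information that could be exported as a tier for annotation software; the export step itself is not shown here.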

Indexing tool for keyword search in audio content

We also plan to build systems that will make it possible to search for particular words or phrases in audio recordings in some Indigenous languages. This will not be full speech recognition and we will not be creating systems that are able to produce high-quality transcriptions of everything that was said in a recording. Rather, the systems will enable audio keyword search, so that users will be able to search quickly through long audio recordings for particular words or topics. To reach that goal, we must adapt the main components of speech recognition which model words, phonemes and speech sounds, and find their limits when applied to Indigenous languages.

  • We found that usual word-based representations do not work for Inuktitut. In English, a vocabulary of 20,000 words is large enough so that only 5% of the words in a new text will be out of the vocabulary. In contrast, our Inuktitut document collection contains a vocabulary of 1.3 million distinct words, and yet in any new Inuktitut text about 60% of the words have never been seen before, because of Inuktitut’s agglutinative language structure. We are developing new approaches that can model the rich vocabulary observed in many Indigenous languages in Canada without relying on a limited set of words.

The out-of-vocabulary rate stays high even with very large vocabularies.
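
The out-of-vocabulary (OOV) rate quoted above is the share of words in a new text that do not appear in the vocabulary built from existing text. The sketch below shows one simple way to compute it. It assumes whitespace tokenisation and uses a toy English example; the function name and the data are illustrative and are not the project's corpus or tooling.

```python
def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens that never occur among the training tokens."""
    if not test_tokens:
        return 0.0
    vocabulary = set(train_tokens)
    unseen = sum(1 for token in test_tokens if token not in vocabulary)
    return unseen / len(test_tokens)


# Toy data, split on whitespace (an assumption made for this sketch).
train = "the dog sees the house and the dog runs".split()
test = "the dog sees the big house".split()

rate = oov_rate(train, test)
print(f"OOV rate: {rate:.0%}")
```

In an agglutinative language, a single word can carry what English spreads over a whole phrase, so new word forms keep appearing and a rate computed this way stays high however large the vocabulary grows.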

  • We were able to automatically produce phonetic transcriptions of East Cree with less than 10% error, creating a system from scratch with only four hours of pre-transcribed material. This is accurate enough to help linguists in their race to document some languages before there are no speakers left.
  • We showed that a speech recognizer trained on large amounts of English can find exact word positions in audio recordings, even for Inuktitut and Cree texts, which makes it possible to create audio books with synchronized text to be used as educational material and in language learning apps.

Inuktitut text aligned with audio recording.
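
The text above does not say how the "less than 10% error" figure for phonetic transcription is measured. A common convention, assumed here, is the phone error rate: the edit distance between the reference and the automatic phone sequences, divided by the length of the reference. The sketch below uses placeholder symbols rather than real East Cree phones, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion from the reference
                current[j - 1] + 1,      # insertion into the hypothesis
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def phone_error_rate(ref, hyp):
    """Edit distance normalised by the reference length."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Placeholder symbols standing in for phones (not real transcriptions).
reference = ["a", "b", "c", "d", "e", "f"]
hypothesis = ["a", "b", "x", "d", "f"]  # one substitution, one deletion

per = phone_error_rate(reference, hypothesis)
print(f"Phone error rate: {per:.1%}")
```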

So far, our work has been focused on Inuktitut and Cree data. The Pirurvik Centre is providing valuable assistance on the Inuktitut aspect of this project. We are now targeting other languages, such as Tsuut’ina and Michif, to explore their specific properties and ensure that our tools are applicable to a broad range of Indigenous languages.



Recent Publications

  • An end-to-end approach for the verification problem: learning the right distance

  • The Indigenous Languages Technology Project at NRC Canada: an empowerment-oriented approach to developing language software