Application of CRIM's speech technologies to Indigenous languages


Indexation of Indigenous language audio recordings to enable keyword search

Project launch - Press release (December 5, 2018)

CBC article (In French - Espaces Autochtones - December 6, 2018)

The NRC's collaboration with CRIM is focused on applying audio indexing and speaker recognition technologies to Indigenous languages. Over the years, hundreds of thousands of hours of speech have been recorded in various Indigenous languages. Unfortunately, these recordings are typically not annotated or indexed. Surprisingly, even speech data being collected now by Indigenous communities and linguists have this problem: because researchers lack the tools for segmenting speech data as they are being recorded, the stock of unannotated speech data in Indigenous languages is constantly growing.

CRIM’s experts are tackling two aspects of this problem.


Speech segmentation for easier data annotation

We are developing simple tools for segmenting speech recordings.

  • Voice activity detection separates audio files into speech and non-speech segments. We developed and tested a detector based on a deep neural network, trained on large amounts of speech in various languages;
  • Speaker retrieval identifies when a given speaker is talking, using a short sample of the speaker’s voice (query-by-example). We developed a system based on i-vectors and are currently improving it with a deep learning approach;
  • We created a language labelling tool that distinguishes among 32 languages and can identify spoken Inuktitut and East Cree from a 5-second sample.

These tools can be used by software that linguists are familiar with and should make annotation of speech currently being collected easier for a variety of languages.
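To make the idea of voice activity detection concrete, here is a minimal sketch. It is not the deep neural network detector described above: it swaps in a much simpler energy-threshold rule, and the function names, the 20 ms frame length and the -35 dB threshold are all assumptions chosen for illustration. It only shows the input and output shape of the task: raw samples in, a list of (start, end) speech segments out.

```python
import math

def frame_energies_db(samples, frame_len):
    """Return the log-energy (in dB) of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        energies.append(10.0 * math.log10(power + 1e-12))  # avoid log(0)
    return energies

def detect_speech(samples, sample_rate, frame_ms=20, threshold_db=-35.0):
    """Return (start_s, end_s) segments whose frame energy exceeds threshold_db.

    Energy thresholding is a stand-in for a trained detector (assumption).
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    flags = [e > threshold_db for e in frame_energies_db(samples, frame_len)]
    segments, seg_start = [], None
    for i, is_speech in enumerate(flags + [False]):  # sentinel closes a final segment
        if is_speech and seg_start is None:
            seg_start = i
        elif not is_speech and seg_start is not None:
            segments.append((seg_start * frame_len / sample_rate,
                             i * frame_len / sample_rate))
            seg_start = None
    return segments

# Synthetic signal: 0.5 s of silence, 1 s of a 200 Hz tone, 0.5 s of silence.
sr = 8000
signal = ([0.0] * (sr // 2)
          + [0.5 * math.sin(2 * math.pi * 200 * n / sr) for n in range(sr)]
          + [0.0] * (sr // 2))
segments = detect_speech(signal, sr)
print(segments)
```

A real recording would need a more robust decision rule (background noise, music and breathing all defeat a fixed threshold), which is the motivation for a trained detector. The segment list is the kind of output an annotation tool can display as a tier.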

 

Automatic segmentation displayed in ELAN linguistic annotation software.
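Query-by-example speaker retrieval, as described above, can be pictured as ranking segments by how close their speaker embedding is to the embedding of the query sample. The sketch below assumes each segment has already been reduced to a fixed-length vector (an i-vector or a neural embedding) and uses cosine similarity, one common way to compare such vectors; the source does not say which scoring method is used. The vectors, segment names and the 0.8 threshold are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_speaker(query, segment_embeddings, threshold=0.8):
    """Return (segment_id, score) pairs scoring at or above threshold, best first."""
    scored = [(seg_id, cosine(query, emb))
              for seg_id, emb in segment_embeddings.items()]
    hits = [(seg_id, score) for seg_id, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Hypothetical 3-dimensional embeddings; real ones have hundreds of dimensions.
query = [0.9, 0.1, 0.3]
segment_embeddings = {
    "seg_01": [0.8, 0.2, 0.35],
    "seg_02": [-0.2, 0.9, 0.1],
    "seg_03": [0.95, 0.05, 0.25],
}
hits = retrieve_speaker(query, segment_embeddings)
for seg_id, score in hits:
    print(seg_id, round(score, 3))
```

The threshold trades missed segments against false matches and would have to be tuned on annotated data.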


Indexation tool for keyword search in content

We also plan to build systems that make it possible to search for particular words or phrases in audio recordings in some Indigenous languages. This will not be full speech recognition: we are not creating systems that produce high-quality transcriptions of everything said in a recording. Rather, the systems will enable audio keyword search, so that users can search quickly through long audio recordings for particular words or topics. To reach that goal, we must adapt the main components of speech recognition, which model words, phonemes and speech sounds, and find their limits when applied to Indigenous languages.

  • We found that usual word-based representations do not work for Inuktitut. In English, a vocabulary of 20,000 words is large enough so that only 5% of the words in a new text will be out of the vocabulary. In contrast, our Inuktitut document collection contains a vocabulary of 1.3 million distinct words, and yet in any new Inuktitut text about 60% of the words have never been seen before, because of Inuktitut’s agglutinative language structure. We are developing new approaches that can model the rich vocabulary observed in many Indigenous languages in Canada without relying on a limited set of words.

Out-of-vocabulary rate stays high even with very large vocabularies.
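The out-of-vocabulary (OOV) rate quoted above is simply the fraction of word tokens in a new text that do not appear in a known vocabulary. The sketch below computes it on a toy English example; the function name, whitespace tokenization and the sample sentences are assumptions for illustration, not the data or tooling used in the project.

```python
def oov_rate(vocabulary, new_tokens):
    """Fraction of tokens in new_tokens that are absent from vocabulary."""
    if not new_tokens:
        return 0.0
    unseen = sum(1 for token in new_tokens if token not in vocabulary)
    return unseen / len(new_tokens)

# Toy data (assumption): the vocabulary is every distinct word of a "training" text.
training_text = "the cat sat on the mat the dog sat on the rug"
new_text = "the cat chased the dog across the rug"

vocabulary = set(training_text.split())
rate = oov_rate(vocabulary, new_text.split())
print(f"OOV rate: {rate:.0%}")
```

In a language where a single word can carry the content of a whole English phrase, most word forms in a new text are new combinations, so this rate stays high no matter how much text the vocabulary is drawn from.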

  • We were able to automatically produce phonetic transcriptions of East Cree with less than 10% error, creating a system from scratch with only four hours of pre-transcribed material. This is accurate enough to help linguists in their race to document some languages before there are no speakers left.
  • We showed that a speech recognizer trained on large amounts of English speech can find exact word positions in audio recordings, even for Inuktitut and Cree texts. This makes it possible to create audio books with synchronized text for use as educational material and in language learning apps.

Inuktitut text aligned with audio recording.
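The source does not say how the error of the East Cree phonetic transcriptions was measured. A common choice, assumed here, is the phone error rate: the edit distance (substitutions, insertions and deletions) between the reference and automatic phone sequences, divided by the length of the reference. The phone strings below are made up and are not real East Cree.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, using one rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def phone_error_rate(ref, hyp):
    """Edit distance normalized by the number of reference phones."""
    return edit_distance(ref, hyp) / len(ref)

# Made-up phone sequences (assumption): one substitution and one deletion.
reference = "k a t i m a n u s i".split()
hypothesis = "k a t e m a n u s".split()
per = phone_error_rate(reference, hypothesis)
print(f"Phone error rate: {per:.0%}")
```

Under this measure, an error below 10% means fewer than one phone in ten needs correcting by hand.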

So far, our work has focused on Inuktitut and Cree data. The Pirurvik Centre is providing valuable assistance on the Inuktitut aspect of this project. We are now targeting other languages, such as Tsuut’ina and Michif, to explore their specific properties and ensure that our tools are applicable to a broad range of Indigenous languages.
