Abstract

    Open Access Research Article Article ID: AMP-5-149

    Development online models for automatic speech recognition systems with a low data level

    Mamyrbayev ОZh*, Oralbekova DO*, Alimhan K, Othman M and Zhumazhanov B

    Speech recognition is a rapidly growing field in machine learning. Conventional automatic speech recognition systems were built based on independent components, that is an acoustic model, a language model and a vocabulary, which were tuned and trained separately. The acoustic model is used to predict the context-dependent states of phonemes, and the language model and lexicon determine the most possible sequences of spoken phrases. The development of deep learning technologies has contributed to the improvement of other scientific areas, which includes speech recognition. Today, the most popular speech recognition systems are systems based on an end-to-end (E2E) structure, which trains the components of a traditional model simultaneously without isolating individual elements, representing the system as a single neural network. The E2E structure represents the system as one whole element, in contrast to the traditional one, which has several independent elements. The E2E system provides a direct mapping of acoustic signals in a sequence of labels without intermediate states, without the need for post-processing at the output, which makes it easy to implement. Today, the popular models are those that directly output the sequence of words based on the input sound in real-time, which are online end-to-end models. This article provides a detailed overview of popular online-based models for E2E systems such as RNN-T, Neural Transducer (NT) and Monotonic Chunkwise Attention (MoChA). It should be emphasized that online models for Kazakh speech recognition have not been developed at the moment. For low-resource languages, like the Kazakh language, the above models have not been studied. Thus, systems based on these models have been trained to recognize Kazakh speech. The results obtained showed that all three models work well for recognizing Kazakh speech without the use of external additions.

    Keywords:

    Published on: Aug 23, 2022 Pages: 107-111

    Full Text PDF Full Text HTML DOI: 10.17352/amp.000049
    CrossMark Publons Harvard Library HOLLIS Search IT Semantic Scholar Get Citation Base Search Scilit OAI-PMH ResearchGate Academic Microsoft GrowKudos Universite de Paris UW Libraries SJSU King Library SJSU King Library NUS Library McGill DET KGL BIBLiOTEK JCU Discovery Universidad De Lima WorldCat VU on WorldCat

    Indexing/Archiving

    Pinterest on AMP