TY  - BOOK
AU  - Ahmed Ismail Meawed Zahran
AU  - Aly Aly Fahmy
AU  - Khaled Wassif
AU  - Hanaa Mobarez
TI  - Enhancement of mispronunciation detection using deep learning techniques
U1  - 006.31 
PY  - 2024///
KW  - Deep Learning
KW  - Automatic pronunciation assessment
KW  - pronunciation scoring
KW  - pre-trained speech representations
KW  - self-supervised speech representation learning
N1  - Thesis (M.Sc.)-Cairo University, 2024; Bibliography: pages 104-111; Issued also as CD
N2  - In language learning applications, pronunciation assessment models are a necessary component for providing feedback on a user's pronunciation skills. The pronunciation scoring literature has depended largely on feature-based models such as Goodness-of-Pronunciation (GOP) and on deep-learning-based speech recognition. In the past few years, transformer-based self-supervised learning (SSL) has enabled the introduction of large pre-trained models that produce powerful contextualized speech representations, which have improved performance on several downstream tasks. We propose End-to-End Regressor (E2E-R), an end-to-end model for pronunciation scoring that is built by fine-tuning a pre-trained SSL model. E2E-R is developed using a two-stage approach. In the first stage, a pre-trained SSL model is fine-tuned on a phoneme recognition task, which results in a model that can produce accurate phoneme vector representations. In the second stage, a pronunciation scoring model is built using transfer learning. This model uses a Siamese neural network to compare pronounced phoneme representations with embeddings that represent the correct pronunciation of canonical phonemes. The result of the comparison is used as the pronunciation score. Experimental results show that our proposed model achieves a Pearson correlation coefficient (PCC) of 0.68 on the speechocean762 dataset, almost the same as the PCC achieved by the state-of-the-art GOPT (PAII-A), without the need for additional native speech data, feature engineering, or an external forced alignment module. To the best of our knowledge, this work represents the first use of a pre-trained SSL model in end-to-end phoneme-level pronunciation scoring.
ER  -