MARC View

000			07232namaa22004331i 4500
003			OSt
005			20250807111138.0
008			250623s2024 ua a|||frm||| 000 0 eng d
040			_aEG-GICUC _beng _cEG-GICUC _dEG-GICUC _erda
041	0		_aeng _beng _bara
049			_aDeposit
082	0	4	_a006.31
092			_a006.31 _221
097			_aM.Sc
099			_aCai01.20.03.M.Sc.2024.Ah.E
100	0		_aAhmed Ismail Meawed Zahran, _epreparation.
245	1	0	_aEnhancement of mispronunciation detection using deep learning techniques / _cby Ahmed Ismail Meawed Zahran ; Supervision Prof. Aly Aly Fahmy, Prof. Khaled Wassif, Dr. Hanaa Mobarez.
246	1	5	_aتحسين أساليب كشف أخطاء النطق باستخدام التعلم العميق
264		0	_c2024.
300			_a111 leaves : _billustrations ; _c30 cm. + _eCD.
336			_atext _2rdacontent
337			_aUnmediated _2rdamedia
338			_avolume _2rdacarrier
502			_aThesis (M.Sc)-Cairo University, 2024.
504			_aBibliography: pages 104-111.
520		3	_aIn language learning applications, pronunciation assessment models are a necessary component for providing feedback on a user’s pronunciation skills. Pronunciation scoring literature has been largely dependent on feature-based models like Goodness-of-Pronunciation (GOP) and deep-learning-based speech recognition. In the past few years, transformer-based self-supervised learning (SSL) has enabled the introduction of large pre-trained models that can be used to produce powerful contextualized speech representations, which have shown improvements in several downstream tasks. We propose End-to-End Regressor (E2E-R), an end-to-end model for pronunciation scoring that is built through fine-tuning a pre-trained SSL model. E2E-R is developed using a two-stage approach. In the first stage, a pre-trained SSL model is fine-tuned on a phoneme recognition task, which results in a model that can produce accurate phoneme vector representations. In the second stage, a pronunciation scoring model is built using transfer learning. This model utilizes a Siamese neural network to compare pronounced phoneme representations with embeddings that represent the correct pronunciation of canonical phonemes. The result of the comparison is used as the pronunciation score. Experimental results show that our proposed model achieves a Pearson correlation coefficient (PCC) of 0.68 on the speechocean762 dataset, which is almost the same as the PCC achieved by the state-of-the-art GOPT (PAII-A), without the need for additional native speech data, feature engineering, or an external forced alignment module. To the best of our knowledge, this work represents the first utilization of a pre-trained SSL model in end-to-end phoneme-level pronunciation scoring.
520		3	_aPronunciation assessment models are an essential component of language learning applications, as they help provide feedback on users' pronunciation skills. Research on pronunciation assessment relies largely on feature-based models such as Goodness-of-Pronunciation (GOP), as well as on deep-learning-based speech recognition techniques. In the past few years, self-supervised learning with transformer models has made large pre-trained models available; these models can be used to produce accurate speech representations that are shaped by the context of the speech. These representations have in turn helped improve many applications of speech recognition technology. In this thesis, we propose End-to-End Regressor (E2E-R), an end-to-end model for pronunciation assessment built by fine-tuning a model pre-trained with self-supervised learning. The construction of E2E-R can be divided into two stages. In the first stage, a model pre-trained with self-supervised learning is fine-tuned on a phoneme recognition task, yielding a model that can produce accurate phoneme representations. In the second stage, a pronunciation assessment model is built using transfer learning. This model uses a Siamese neural network to compare the representations of the phonemes the user pronounced with embeddings that represent the correct pronunciation of the phonemes the user should have pronounced. The similarity score resulting from this comparison is used as the assessment of the user's pronunciation. Experimental results show that the proposed model achieves a Pearson correlation coefficient (PCC) of 0.68 when tested on the speechocean762 dataset, which is approximately the same PCC achieved by GOPT (PAII-A), currently the best model for pronunciation assessment, without the need to collect datasets containing speech data in the users' native language, to resort to feature engineering, or to use additional models to obtain phoneme start and end positions. To the best of our knowledge, this work is the first use of a model pre-trained with self-supervised learning in end-to-end phoneme-level pronunciation assessment.
530			_aIssued also as CD.
546			_aText in English and abstract in Arabic & English.
650		0	_aDeep Learning
653		1	_aAutomatic pronunciation assessment _apronunciation scoring _apre-trained speech representations _aself-supervised speech representation learning
700	0		_aAly Aly Fahmy _ethesis advisor.
700	0		_aKhaled Wassif _ethesis advisor.
700	0		_aHanaa Mobarez _ethesis advisor.
900			_b01-01-2024 _cAly Aly Fahmy _cKhaled Wassif _cHanaa Mobarez _UCairo University _FFaculty of Computers and Artificial Intelligence _DDepartment of Computer Science
905			_aShimaa _eEman Ghareb
942			_2ddc _cTH _e21 _n0
999			_c172706