Panel data analysis using supervised machine learning techniques / by Omar Ahmed Mohamed Ahmed Afifi ; Supervised Prof. Salah Mahdy Ramadan, Dr. Amal Mohamed Abdel Fatah.

By:

Omar Ahmed Mohamed Ahmed Afifi [preparation.]

Contributor(s):

Material type:

Text

Language: English

Summary language: English, Arabic

Producer: 2025

Description: 72 leaves : illustrations ; 30 cm. + CD

Content type:

text

Media type:

Unmediated

Carrier type:

volume

Other title:

تحليل بيانات القطاع باستخدام تقنيات التعلم الآلي الخاضعة للإشراف [Added title page title]

Subject(s):

DDC classification:

006.31

Available additional physical forms:

Issued also as CD.

Dissertation note: Thesis (M.Sc)-Cairo University, 2025.


Holdings
Item type	Current library	Home library	Call number	Status	Barcode
Thesis	Theses Hall - First Floor	New Central Library - Cairo University	Cai01.18.04.M.Sc.2025.Om.P	Not for loan	01010110093534000

Bibliography: pages 64-69.

Panel data analysis allows researchers to achieve greater statistical validity in policy
analysis and program evaluation through more advanced research designs than cross-sectional data
models. Panel (or longitudinal) data refers to data collected from the same individuals across
multiple time periods. This data type consists of repeated time-series observations (𝑇) for a
large number of cross-sectional units (𝑁), such as countries, companies, or randomly chosen
individuals.

This thesis compares the three conventional models of panel data, referred to as
statistical panel models (Pooled OLS, Fixed Effects, and Random Effects), with three supervised
machine learning techniques (Support Vector Regression, Random Forest Regressor, and Gradient
Boosting Regressor) that have been used in the literature to model panel data. The comparison is
made in terms of prediction performance: each of the six models is fitted, diagnostic metrics
(MSE, Bias, AIC, and BIC) are calculated, and the resulting values are compared across the
models.
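As a rough illustration of the four diagnostic metrics named above, the following minimal sketch computes them for a single set of predictions. It assumes the common Gaussian least-squares forms of AIC and BIC (n·ln(RSS/n) plus a penalty) and takes bias to be the mean signed prediction error; the thesis may define these quantities differently, and the function name `diagnostics`, the parameter count, and the toy values are hypothetical.

```python
import math

def diagnostics(y_true, y_pred, n_params):
    """Return MSE, bias, AIC and BIC for one fitted model.

    Assumption: AIC/BIC use the Gaussian least-squares form
    n*ln(RSS/n) + penalty; the thesis may define them differently.
    """
    n = len(y_true)
    residuals = [yp - yt for yt, yp in zip(y_true, y_pred)]
    rss = sum(r * r for r in residuals)
    mse = rss / n
    bias = sum(residuals) / n  # mean signed prediction error
    aic = n * math.log(rss / n) + 2 * n_params
    bic = n * math.log(rss / n) + n_params * math.log(n)
    return {"MSE": mse, "Bias": bias, "AIC": aic, "BIC": bic}

# Hypothetical toy values, not data from the thesis.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 1.5, 3.5, 3.5]
result = diagnostics(y_true, y_pred, n_params=2)
print(result)
```

For a machine learning model the choice of `n_params` is itself a modeling decision (for example, an effective number of parameters), so AIC and BIC values are only comparable across models when that count is defined consistently.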

The first comparison is an empirical study that investigates the impact of education and
experience on individual wages using panel data from Greene (2008). This dataset was analyzed
using the six models: three classical statistical panel data models (POLS, FE, RE) and three
supervised machine learning techniques (SVR, RFR, GBR). The empirical results show that the
machine learning techniques outperform the statistical models across all evaluation metrics,
including Mean Squared Error (MSE), Bias, Akaike Information Criterion (AIC), and Bayesian
Information Criterion (BIC). Among the machine learning techniques, Gradient Boosting and
Support Vector Regression achieve the most accurate and efficient fits. The statistical models
exhibit relatively higher error and complexity, with the Fixed Effects model performing the worst
due to its exclusion of important time-invariant regressors.

The second comparison is based on a controlled simulation study using an assumed true
data-generating process (DGP), evaluated across 16 combinations of cross-sectional units (𝑁 =
10, 50, 100, 200) and time periods (𝑇 = 10, 50, 100, 200). Each scenario was simulated over 1000
iterations to obtain stable average metrics. The findings reveal that the statistical panel data
models, particularly Pooled OLS and Random Effects, consistently achieve near-zero bias across all
configurations, while Fixed Effects suffers from persistent bias due to model misspecification.
Meanwhile, the machine learning techniques demonstrate superior predictive performance,
achieving substantially lower Mean Squared Error (MSE), AIC, and BIC values, especially as the
panel size increases. Among the ML models, Gradient Boosting consistently provides the most
accurate and well-balanced results, highlighting its strength in capturing complex relationships
in data-rich panel structures.
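The shape of such a simulation study can be sketched as follows. This is only an illustration under stated assumptions: the DGP (a unit random effect plus one time-varying and one time-invariant regressor), the coefficient values, the reduced 2 x 2 grid, and the 20 replications are all hypothetical and are not taken from the thesis, and only Pooled OLS is fitted. The last step shows why a within (Fixed Effects) transformation cannot use a time-invariant regressor: demeaning by unit turns it into zeros.

```python
import random
from statistics import fmean

def simulate_panel(n_units, n_periods, rng):
    """Draw one balanced panel from a hypothetical DGP:
    y_it = 1.0 + 2.0*x_it + 0.5*z_i + a_i + e_it, with z_i time-invariant."""
    rows = []
    for i in range(n_units):
        z = rng.gauss(0, 1)   # time-invariant regressor
        a = rng.gauss(0, 1)   # unit-specific random effect
        for t in range(n_periods):
            x = rng.gauss(0, 1)
            y = 1.0 + 2.0 * x + 0.5 * z + a + rng.gauss(0, 1)
            rows.append((i, t, x, z, y))
    return rows

def pooled_ols_metrics(rows):
    """Fit y = b0 + b1*x by pooled OLS and return in-sample (MSE, bias)."""
    xs = [r[2] for r in rows]
    ys = [r[4] for r in rows]
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    errors = [b0 + b1 * x - y for x, y in zip(xs, ys)]
    return fmean(e * e for e in errors), fmean(errors)

def within_demean(rows, col):
    """Subtract each unit's time mean from column `col` (the FE transform)."""
    by_unit = {}
    for r in rows:
        by_unit.setdefault(r[0], []).append(r[col])
    means = {i: fmean(v) for i, v in by_unit.items()}
    return [r[col] - means[r[0]] for r in rows]

rng = random.Random(0)
grid = [(n, t) for n in (10, 50) for t in (10, 50)]  # reduced grid (assumption)
reps = 20                                            # the thesis uses 1000
summary = {}
for n, t in grid:
    runs = [pooled_ols_metrics(simulate_panel(n, t, rng)) for _ in range(reps)]
    # Average each metric over the replications for this (N, T) scenario.
    summary[(n, t)] = (fmean(m for m, _ in runs), fmean(b for _, b in runs))
    print(n, t, summary[(n, t)])

# A time-invariant regressor is wiped out by the within transformation.
z_demeaned = within_demean(simulate_panel(5, 4, rng), col=3)
print(max(abs(v) for v in z_demeaned))
```

In this sketch the in-sample bias of Pooled OLS is zero up to rounding because the fitted model includes an intercept, which is one simple reason a linear model can report near-zero average bias. Extending the loop to the other five models and to out-of-sample predictions would be needed to mirror the comparison described above.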

The final part of the thesis recommends, for future work, exploring machine learning
techniques other than the three used, introducing more values of 𝑁 and 𝑇 in the simulation,
running the simulation on different panel data settings (unbalanced, dynamic, etc.), and
repeating it with different DGPs to determine whether the comparison results change.

Summary (translated from Arabic): In this thesis, three conventional methods for estimating panel data, known as statistical panel models (pooled linear regression, the fixed effects model, and the random effects model), are compared with three supervised machine learning techniques (support vector regression, random forest regression, and gradient boosting regression) that have been used in the literature to model panel data. The comparison is made in terms of estimation accuracy by fitting each of the six models and computing several diagnostic metrics (such as mean squared error, bias, the Akaike criterion, and the Bayesian criterion), then comparing the values across the models. The first comparison uses a real-data example on the effect of years of experience and education on the wages of employed individuals. The empirical results show that the three machine learning techniques clearly outperform the classical panel models on all diagnostic metrics. The second comparison uses simulated data in 16 different combinations; each simulation experiment was run for 1000 replications to ensure stable averages of the diagnostic metrics. The machine learning techniques show higher bias in small samples, but it falls markedly as the data size grows, with gradient boosting regression performing best at reducing bias at the largest sample sizes. The machine learning techniques also outperform the classical models in mean squared error and in the Akaike and Bayesian criteria, especially as the data size increases.

Text in English and abstract in Arabic & English.
