Big data clustering : A mathematical programming framework / by Nayera Mostafa Ahmed ; Supervised Dr Mahmoud Mostafa Rashwan, Dr Ahmed El-Tabey Okasha.

By:

Nayera Mostafa Ahmed [preparation.]

Contributor(s):

Material type:

Text

Language: English

Summary language: English, Arabic

Producer: 2025

Description: 91 pages : illustrations ; 25 cm. + CD

Content type:

text

Media type:

Unmediated

Carrier type:

volume

Other title:

التحليل العنقودي للبيانات الضخمة : إطار برمجة رياضية [Added title page title]

Subject(s):

DDC classification:

519.5

Available additional physical forms:

Issued also as CD.

Dissertation note: Thesis (M.Sc)-Cairo University, 2025.


Holdings
Item type	Current library	Home library	Call number	Status	Barcode
Thesis	University Theses Hall - First Floor	New Central Library - Cairo University	Cai01.03.01.M.Sc.2025.Na.B	Not for loan	01010110093371000

Bibliography: pages 68-75.

Traditional k-means clustering faces computational bottlenecks with big data due to quadratic
complexity scaling. While parallel implementations exist, most employ heuristic load balancing
without mathematical optimization foundations. This thesis develops a mathematical
programming framework for parallel k-means (MP-PKmean) that explicitly models clustering
objectives and parallelization constraints within a unified optimization formulation.
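The record does not reproduce the formulation itself. As a rough sketch of what a unified model of this kind could look like, the block below combines the standard k-means objective with processor-assignment constraints. All symbols are assumptions introduced here, not the thesis's notation: a_i is data point i, c_k is the centroid of cluster k, x_{ik} = 1 if point i belongs to cluster k, y_{ip} = 1 if point i is handled by processor p, and the balance constraint lets processor loads differ by at most one point.

```latex
\begin{align}
\min_{x,\,y,\,c}\quad & \sum_{i=1}^{n}\sum_{k=1}^{K} x_{ik}\,\lVert a_i - c_k\rVert^2 \\
\text{s.t.}\quad & \sum_{k=1}^{K} x_{ik} = 1, && i = 1,\dots,n \\
& \sum_{p=1}^{P} y_{ip} = 1, && i = 1,\dots,n \\
& \left\lfloor n/P \right\rfloor \le \sum_{i=1}^{n} y_{ip} \le \left\lceil n/P \right\rceil, && p = 1,\dots,P \\
& x_{ik},\, y_{ip} \in \{0,1\}.
\end{align}
```

In this sketch the y variables do not appear in the objective, so any feasible processor assignment leaves the clustering cost unchanged. That is one simple way a parallel formulation can stay equivalent to the sequential one.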
The framework introduces binary decision variables for cluster and processor assignments with
constraints ensuring optimal workload distribution. Four theoretical guarantees are established:
mathematical equivalence with sequential algorithms, optimal load balancing, realistic speedup
bounds incorporating overhead analysis, and algorithmic equivalence.
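To make the equivalence and load-balancing claims concrete, here is a minimal Python sketch, not the thesis's implementation. It splits the points into chunks whose sizes differ by at most one, computes per-chunk cluster sums and counts, and merges them into new centroids. The function names (balanced_chunks, partial_stats, lloyd_step) are hypothetical, and the chunks are processed one after another in a single process; a real parallel version would dispatch each chunk to its own worker.

```python
from math import dist


def balanced_chunks(n_points, n_procs):
    """Split indices 0..n_points-1 into contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(n_points, n_procs)
    chunks, start = [], 0
    for p in range(n_procs):
        size = base + (1 if p < extra else 0)  # the first `extra` chunks get one more point
        chunks.append(range(start, start + size))
        start += size
    return chunks


def partial_stats(points, idx, centroids):
    """Work for one (simulated) processor: nearest-centroid assignment plus per-cluster sums and counts."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for i in idx:
        k = min(range(len(centroids)), key=lambda c: dist(points[i], centroids[c]))
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += points[i][d]
    return sums, counts


def lloyd_step(points, centroids, n_procs=1):
    """One k-means centroid update, built by merging the partial results of each chunk."""
    dim = len(centroids[0])
    total = [[0.0] * dim for _ in centroids]
    count = [0] * len(centroids)
    for idx in balanced_chunks(len(points), n_procs):
        sums, counts = partial_stats(points, idx, centroids)
        for k in range(len(centroids)):
            count[k] += counts[k]
            for d in range(dim):
                total[k][d] += sums[k][d]
    # A cluster that received no points keeps its previous centroid.
    return [
        [t / count[k] for t in total[k]] if count[k] else list(centroids[k])
        for k in range(len(centroids))
    ]
```

Because the merged sums and counts are the same quantities the sequential update uses, the new centroids match up to floating-point summation order. The chunking affects only who does the work, not the result.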
Comprehensive validation through 2,200 experiments across synthetic (1K-1M samples) and
real-world datasets from five domains (botanical, chemical, medical, cybersecurity, physics) spanning
150 to 5M samples demonstrates substantial performance improvements while preserving clustering
quality. Key findings: a 50,000-sample threshold for parallel benefits, maximum speedups of 2.79×
(synthetic) and 1.96× (real applications), optimal 37-54% efficiency with 4-core configurations,
and statistically equivalent clustering quality across all configurations.
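The speedup and efficiency figures are linked by the usual definition of parallel efficiency: speedup divided by the number of cores. The short Python sketch below applies that definition to the reported numbers. The function names are illustrative, and the 8-core pairing in the last line is an assumption, since the abstract does not say which core count produced the 2.79× maximum.

```python
def efficiency(speedup, n_cores):
    """Parallel efficiency: achieved speedup as a fraction of the ideal n_cores-fold speedup."""
    return speedup / n_cores


def speedup_from_efficiency(eff, n_cores):
    """Inverse relation: the speedup implied by a given efficiency on n_cores."""
    return eff * n_cores


# The reported 37-54% efficiency on 4 cores implies these speedups.
low = speedup_from_efficiency(0.37, 4)
high = speedup_from_efficiency(0.54, 4)
print(f"4 cores: {low:.2f}x to {high:.2f}x")

# Assumption: if the 2.79x maximum were measured on 8 cores, efficiency would be about 35%.
print(f"8 cores at 2.79x: {efficiency(2.79, 8):.1%}")
```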
The framework establishes deployment guidelines: sequential processing for <50K samples,
4-core configurations for 50K-500K samples (37-54% efficiency), and 8-core configurations for
>500K samples (speedups >2.5×). Cross-domain validation confirms universal applicability
determined by dataset size rather than domain characteristics.
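The deployment guidelines amount to a simple size-based rule, which can be sketched in Python as follows. The function name recommended_cores is hypothetical, and treating exactly 50,000 and exactly 500,000 samples as belonging to the 4-core band is one reading of the stated ranges.

```python
def recommended_cores(n_samples):
    """Map dataset size to a core count following the size bands stated in the abstract."""
    if n_samples < 50_000:
        return 1  # sequential processing: parallel overhead outweighs the gain
    if n_samples <= 500_000:
        return 4  # mid-sized data: the 37-54% efficiency band
    return 8      # large data: speedups above 2.5x are reported


for n in (150, 50_000, 500_000, 5_000_000):
    print(n, recommended_cores(n))
```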
This research bridges mathematical optimization theory with practical parallel computing,
providing theoretical rigor and empirical validation for scalable clustering solutions.

Summary (translated from the Arabic):

Traditional K-means faces severe computational difficulties with big data because of its quadratic computational complexity. Although parallel versions exist, most rely on approximate load distribution that lacks a mathematical foundation. This thesis aims to develop a mathematical programming framework for parallel K-means clustering (MP-PKmeans) that combines the clustering objective and the parallel-processing constraints in a unified formulation.
The proposed framework introduces binary decision variables for assigning points to clusters and to processors, with mathematical constraints that guarantee optimal load distribution. Four main theoretical properties are proved: mathematical equivalence with the sequential version; optimal load balancing with a maximum deviation of one point between processors; practical speedup bounds that account for synchronization and communication costs in line with parallel computing principles; and equivalence with sequential processing.
For empirical validation, 2,200 experiments were run on synthetic data (1,000 to 1 million samples) and on real data from five domains (botany, chemistry, medicine, cybersecurity, and physics) ranging from 150 to 5 million samples. The results show substantial performance improvements while preserving clustering quality. Key results: 50,000 samples identified as the threshold at which parallel processing begins to pay off; maximum speedups of 2.79× on synthetic data and 1.96× on real data; optimal efficiency of 37-54% with 4 processors on medium-sized data; and statistical equivalence of clustering quality between the parallel and sequential versions (p > 0.05).
The framework also provides practical deployment guidance: sequential processing for fewer than 50,000 samples, 4 processors for 50,000 to 500,000 samples, and 8 processors for larger data to achieve speedups above 2.5×. Experiments across the domains showed that data size, rather than data characteristics or dimensionality, is the decisive factor in performance, which confirms the framework's generality and broad applicability.
This study helps link mathematical optimization theory with practical parallel computing for big data, providing a rigorous theoretical basis and documented empirical support.

