An approach for big data integration / Randa Mohamed Abd El-ghafar Ata ; Supervision of Prof. Dr. Ali El-bastawissy, Mervat Gheith, Dr. Eman Nasr

By:

Randa Mohamed Abd El-ghafar Ata [preparation.]

Contributor(s):

Material type: Text

Language: English

Summary language: English, Arabic

Producer: 2022

Description: 167 Leaves : illustrations ; 30 cm. + CD

Content type: text

Media type: Unmediated

Carrier type: volume

Other title:

منهجيه لتكامل البيانات الكبيره ("An approach for big data integration") [Added title page title]

Subject(s):

DDC classification:

005.7

Available additional physical forms:

Issued also as CD

Dissertation note: Thesis (Ph.D)-Cairo University, 2022.

Summary: This thesis proposes two approaches. The first is an Efficient Multi-Phase Blocking Strategy (EMPBS) for Big Data (BD). It produces disjoint blocks and has lower time complexity than some other blocking techniques; in the implementation, EMPBS reduced the average number of comparisons by about 84%. The second is a novel and efficient Entity Resolution approach for BD. It uses several Natural Language Processing techniques, is implemented with Apache Spark, and consists of five consecutive phases. The approach is generic: it accepts different types of datasets and can integrate data from different sources. Feature vectors are generated with HashingTF, a fast and space-efficient way of vectorizing features. Applying Soundex and stemming before Locality Sensitive Hashing helps to shorten the features, so the feature vectors are more space-efficient, which improves running time. Scalability was tested with one, two, three, and four worker nodes. The evaluation shows that the approach can distribute the similarity computation and classification across the computational resources and scales with the available worker nodes.
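
The preprocessing idea in the summary (phonetic encoding of tokens before hashing them into a fixed-size feature vector) can be illustrated with a minimal Python sketch. This is not the thesis implementation: it uses plain Python rather than Apache Spark, a standard American Soundex routine, and a CRC32-based hashing trick as a stand-in for Spark's HashingTF; stemming and Locality Sensitive Hashing are omitted. The function names and the 16-slot vector size are assumptions made for illustration.

```python
import zlib

# Standard American Soundex digit classes (vowels, H, W, Y have no digit).
_SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(word: str) -> str:
    """Return the 4-character American Soundex code of `word`."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":  # H and W do not separate letters with equal codes
            prev = digit
    return (code + "000")[:4]


def encode_tokens(text: str) -> list[str]:
    """Split on whitespace and replace each token by its Soundex code."""
    return [soundex(tok) for tok in text.split() if soundex(tok)]


def hashed_tf(tokens: list[str], num_features: int = 16) -> list[int]:
    """Hashing-trick term frequencies (illustrative; not Spark's hash)."""
    vec = [0] * num_features
    for tok in tokens:
        # Map each token to a slot by hashing; collisions simply add up.
        vec[zlib.crc32(tok.encode("utf-8")) % num_features] += 1
    return vec


if __name__ == "__main__":
    for name in ("Robert Smith", "Rupert Smyth"):
        codes = encode_tokens(name)
        print(name, codes, hashed_tf(codes))
```

Under these assumptions, "Robert Smith" and "Rupert Smyth" reduce to the same pair of four-character codes and therefore to the same vector, which is the kind of feature shortening the summary describes.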

Holdings
Item type	Current library	Home library	Call number	Status	Barcode
Thesis	University Theses Hall - First Floor	New Central Library - Cairo University	Cai01.18.02.Ph.D.2022.Ra.A	Not for loan	01010110088516000

Bibliography: pages 151-163.

Text in English and abstract in Arabic & English.
