Image captioning using deep learning / by Asmaa Ahmed El-Sayed Osman ; Supervision Prof. Reda Abd-Elwahab Alkhoribi, Prof. Khaled Mostafa Elsayed, Dr. Mohamed A. Wahby Shalaby, Dr. Mona M. Soliman.

By:

Asmaa Ahmed El-Sayed Osman [preparation.]

Contributor(s):

Material type: Text

Language: English

Summary language: English, Arabic

Producer: 2024

Description: 91 leaves : illustrations ; 30 cm. + CD

Content type:

text

Media type:

Unmediated

Carrier type:

volume

Other title:

عنونة الصور بإستخدام التعلم العميق [Added title page title]

Subject(s):

DDC classification:

006.31

Available additional physical forms:

Issued also as CD.

Dissertation note: Thesis (Ph.D)-Cairo University, 2024.

Holdings
Item type	Current library	Home library	Call number	Status	Barcode
Thesis	Theses Hall - First Floor	New Central Library - Cairo University	Cai01.20.01.Ph.D.2024.As.I	Not for loan	01010110091025000

Bibliography: pages 83-91.

Image captioning is a major artificial intelligence research field that combines the visual interpretation of an image with a linguistic description of it. To caption an image properly, computer vision and language models must be used together so that the image can be described in a statement that is both expressive and brief.
Successful image captioning relies on acquiring as much information as feasible from the input image. One essential piece of this information is the topic with which the image is associated. To produce these topics, related work has relied on topic modeling techniques built from the caption text. The issue with topic modeling applied only to caption text is that the extracted topics ignore the image's semantic content. The concept modeling technique, on the other hand, takes both the caption data and the images into account when determining which concepts to extract. It can therefore be used in image captioning to capture the image context more completely and to use that context to produce more accurate image descriptions.
The image captioning task is challenging because it requires an in-depth understanding of the image's semantic properties. Furthermore, the captioning process should be able to represent this semantic information and hence generate human-like phrases. In this thesis, the concept modeling technique is used to propose novel concept-based models for image captioning. Concept modeling is applied to the dataset images and captions to extract a set of novel concept vectors and the image embeddings. The output of the concept model is then fed to the decoder. LSTMs have traditionally been used in related work to decode the caption embeddings and the concept vectors in order to predict the next word of the caption. Accordingly, the first proposed image captioning model combines the concept model with an LSTM. In this model, the concept vectors are passed to an LSTM together with the caption embeddings to learn the LSTM weights. Image features are then merged with the LSTM output to predict the next word.
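The data flow of the LSTM-based model can be illustrated with a minimal pure-Python sketch. The abstract does not say how the concept vectors are combined with the caption embeddings or how the image features are merged, so this sketch assumes concatenation in both places. All names, dimensions, and the random weights are hypothetical and are not taken from the thesis.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def lstm_step(x, h, c, W, U, b):
    """One LSTM step; gates are stacked as input, forget, cell, output."""
    n = len(h)
    z = [wx + uh + bias for wx, uh, bias in zip(matvec(W, x), matvec(U, h), b)]
    i = [sigmoid(v) for v in z[:n]]
    f = [sigmoid(v) for v in z[n:2 * n]]
    g = [math.tanh(v) for v in z[2 * n:3 * n]]
    o = [sigmoid(v) for v in z[3 * n:]]
    c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
    h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
    return h_new, c_new


def next_word_distribution(concept_vec, caption_embs, image_feats, params):
    W, U, b, V = params
    hidden = len(U[0])
    h, c = [0.0] * hidden, [0.0] * hidden
    for word_emb in caption_embs:
        # Assumption: the concept vector is concatenated to every word embedding.
        h, c = lstm_step(word_emb + concept_vec, h, c, W, U, b)
    # Assumption: "merging" is concatenation followed by a linear layer.
    merged = h + image_feats
    return softmax(matvec(V, merged))


# Toy dimensions and untrained random weights, for illustration only.
rng = random.Random(0)
EMB, CONCEPT, IMG, HIDDEN, VOCAB = 4, 3, 5, 6, 10
params = (
    random_matrix(4 * HIDDEN, EMB + CONCEPT, rng),  # input weights W
    random_matrix(4 * HIDDEN, HIDDEN, rng),         # recurrent weights U
    [0.0] * (4 * HIDDEN),                           # gate biases b
    random_matrix(VOCAB, HIDDEN + IMG, rng),        # output projection V
)
concept_vec = [rng.random() for _ in range(CONCEPT)]
caption_embs = [[rng.random() for _ in range(EMB)] for _ in range(3)]
image_feats = [rng.random() for _ in range(IMG)]
probs = next_word_distribution(concept_vec, caption_embs, image_feats, params)
```

Here `probs` is a probability distribution over a toy vocabulary of ten words. In a trained model the most probable entry would be taken as the next word of the caption.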
Transformers are used in image captioning to provide more accurate captions with less complexity, owing to their parallelization capabilities and their avoidance of recurrence. The conventional transformer uses one encoder to represent the image features and one decoder to decode the partial captions by attending to those features. To attend to the concept data in addition to the image features, this research work proposes two multi-encoder transformer architectures. A new encoder is added to the transformer to represent the concept information. Moreover, the transformer decoder is modified by adding an encoder-decoder attention layer so that attention can be paid to the concept information represented by the new encoder. The encoder of the first proposed multi-encoder architecture is identical to the standard transformer encoder, whereas the encoder of the second is inspired by the vision transformer encoder.
Concept-based image captioning models are then built on the proposed multi-encoder transformer architectures. The concept vectors and image embeddings are first derived from the concept model. The first transformer encoder then receives the concept vectors, and the second receives the image embeddings. Finally, the transformer decoder generates the next word of the caption from the information provided by the two encoders and the embeddings of the partial caption.
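The idea of a decoder that attends to two encoder outputs can be sketched in a few lines of pure Python. This is a simplified illustration, not the architecture from the thesis. It uses single-head scaled dot-product attention without learned projections, and it omits masked self-attention, layer normalization, and the feed-forward sublayer. The order of the two attention steps (image first, then concept) and all function names are assumptions.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, memory):
    """Scaled dot-product attention; the memory serves as both keys and values."""
    scale = math.sqrt(len(memory[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, m)) / scale for m in memory]
        weights = softmax(scores)
        outputs.append(
            [sum(w * m[d] for w, m in zip(weights, memory)) for d in range(len(q))]
        )
    return outputs


def residual_add(xs, ys):
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def dual_cross_attention(caption_states, image_memory, concept_memory):
    # Encoder-decoder attention over the image encoder output, as in a
    # conventional captioning transformer.
    states = residual_add(caption_states, attend(caption_states, image_memory))
    # Additional encoder-decoder attention over the concept encoder output.
    # Assumption: it is applied after the image attention.
    return residual_add(states, attend(states, concept_memory))


# Toy inputs: 2 partial-caption positions, 3 image tokens, 2 concept tokens,
# all with model dimension 4 (illustrative values only).
caption_states = [[0.1, 0.2, 0.3, 0.4], [0.0, 1.0, 0.0, 1.0]]
image_memory = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.5, 0.5, 0.5, 0.5]]
concept_memory = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
fused = dual_cross_attention(caption_states, image_memory, concept_memory)
```

Each row of `fused` corresponds to one position of the partial caption and now mixes information from both encoder outputs.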
The Microsoft COCO and Flickr30K datasets are used to compare the proposed models with state-of-the-art approaches using the standard evaluation metrics CIDEr, ROUGE, METEOR, BLEU, and SPICE. The proposed models are found to outperform state-of-the-art topic-based approaches while also being easier to implement and less computationally complex.
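As an illustration of how one of these metrics scores a caption, the sketch below computes the clipped (modified) n-gram precision that BLEU is built on. It is only one component of the metric: full BLEU combines these precisions over several n-gram orders with a geometric mean and applies a brevity penalty. The abstract does not state which implementation was used for evaluation, and the function names here are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate caption against reference captions."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    # For each n-gram, the largest count observed in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    # A candidate n-gram is credited at most as often as it occurs in a reference.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


candidate = "the the the the the the the".split()
references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]
unigram_precision = modified_precision(candidate, references, 1)
bigram_precision = modified_precision(candidate, references, 2)
```

In this example the word "the" appears seven times in the candidate but at most twice in any reference, so the unigram precision is clipped to 2/7, and no candidate bigram appears in the references, so the bigram precision is 0.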
In addition to the three proposed concept-based English image captioning models, the concept-based model using the vision-based multi-encoder transformer architecture has been applied to an Arabic dataset in order to validate it on a different language. The Flickr8k dataset with Arabic captions is used to evaluate the proposed Arabic image captioning model.

يعد التعليق على الصور أحد مجالات أبحاث الذكاء الاصطناعي الرئيسية التي تتضمن التفسير البصري والوصف اللغوي للصورة المقابلة. للتعليق على الصورة بشكل صحيح، يجب استخدام الرؤية الحاسوبية ونماذج اللغة لتكون لديك القدرة على وصف الصورة المقابلة في عبارة معبرة ومختصرة.
تعتمد التسميات التوضيحية الناجحة للصور على الحصول على أكبر قدر ممكن من المعلومات من الصورة الأصلية. أحد هذه الأجزاء الأساسية من المعرفة هو الموضوع الذي ترتبط به الصورة. من أجل إنتاج هذه الموضوعات، اعتمدت مناهج العمل ذات الصلة على تقنيات نمذجة الموضوع، والتي تم إنشاؤها من نص التسمية التوضيحية. تكمن مشكلة نمذجة الموضوع، والتي يتم تطبيقها فقط على نصوص التسميات التوضيحية لاستخراج الموضوعات، في أن الموضوعات تتجاهل المحتوى الدلالي للصورة. من ناحية أخرى، تأخذ تقنية نمذجة المفاهيم كلا من بيانات التسمية التوضيحية والصور في الاعتبار عند تحديد المفاهيم التي يجب استخراجها. يمكن استخدام تقنية نمذجة المفاهيم في التسميات التوضيحية للصور لالتقاط سياقات الصورة بالكامل والاستفادة من هذه السياقات لإنتاج أوصاف صور أكثر دقة.
تعد التسميات التوضيحية للصور مهمة صعبة لأنها تتطلب فهما متعمقا للخصائص الدلالية للصورة. علاوة على ذلك ، يجب أن تكون عملية التسميات التوضيحية للصور قادرة على تمثيل المعلومات الدلالية وبالتالي توليد عبارات شبيهة بوصف الإنسان. في هذه الأطروحة، يتم استخدام تقنية نمذجة المفاهيم لاقتراح نماذج جديدة قائمة على المفاهيم لشرح الصور. تستخدم تقنية نمذجة المفاهيم على صور مجموعة البيانات والتعليقات التوضيحية لاستخراج مجموعة من متجهات المفاهيم الجديدة وتضمين الصور. بعد ذلك، يتم إرسال إخراج نموذج المفهوم إلى وحدة فك الترميز. تم استخدام LSTM تقليديا في الأعمال ذات الصلة لفك تشفير تضمينات التسميات التوضيحية ومتجهات المفاهيم من أجل التنبؤ بالكلمة التالية من التسمية التوضيحية. لذلك، يتم اقتراح نموذج قائم على المفهوم أولا للتسميات التوضيحية للصور من خلال استخدام LSTM. في النموذج المقترح القائم على المفهوم باستخدام LSTM ، يتم تمرير متجهات المفهوم إلى LSTM بالإضافة إلى تضمين التسمية التوضيحية لمعرفة أوزان LSTM. ثم يتم دمج ميزات الصورة مع إخراج LSTM للتنبؤ بالكلمات التالية.
يستخدم المحول في التسميات التوضيحية للصور لتوفير تسميات توضيحية أكثر دقة مع تعقيد أقل بسبب قدراته على التوازي وتجاوز العودية. يستخدم المحول التقليدي برنامج تشفير لتمثيل ميزات الصورة ووحدة فك ترميز لفك تشفير التسميات التوضيحية الجزئية من خلال الانتباه إلى الميزات. للانتباه إلى بيانات المفهوم بالإضافة إلى ميزات الصورة، يقترح هذا العمل البحثي معماريتين للمحولات متعددة التشفير. تتم إضافة برنامج تشفير جديد إلى المحول ليعكس معلومات المفهوم. علاوة على ذلك ، يتم تعديل وحدة فك ترميز المحول عن طريق إضافة طبقة انتباه وحدة فك التشفير بحيث يمكن الانتباه إلى معلومات المفهوم التي يمثلها المشفر الجديد. يتطابق مشفر أول بنية محول متعدد التشفير المقترحة مع مشفر المحولات القياسي. ومع ذلك، فإن مشفر بنية محول التشفير المتعدد المقترح الثاني مستوحى من مشفر محول الرؤية.
تم اقتراح نماذج قائمة على المفهوم لتعليق الصور من خلال استخدام بنيات المحولات متعددة التشفير المقترحة. يتم اشتقاق متجهات المفهوم وميزات CLIP أولاً من نموذج المفهوم. بعد ذلك، يستقبل مشفر المحول الأول متجهات المفهوم ويستقبل مشفر المحول الثاني ميزات CLIP. بعد ذلك، يتم استخدام وحدة فك ترميز المحولات لتوليد الكلمات التالية في التسمية التوضيحية بناءً على المعلومات المقدمة من المشفرين وتضمين التسميات التوضيحية الجزئية.
يتم استخدام مجموعة بيانات Microsoft COCO لمقارنة النموذج المقترح بأحدث الأساليب من حيث مقاييس التقييم القياسية CIDEr وROUGE وMETEOR وBLEU وSPICE. وتبين أن النماذج المقترحة تتفوق في الأداء على أحدث الأساليب القائمة على الموضوع، في حين أنها أيضًا أسهل في التنفيذ وتتطلب تعقيدًا حسابيًا أقل. بالإضافة إلى نماذج التعليق على الصور الإنجليزية الثلاثة المقترحة القائمة على المفهوم، يتم تطبيق النموذج المقترح القائم على المفهوم باستخدام بنية محولات التشفير المتعددة القائمة على الرؤية على مجموعة البيانات العربية من أجل التحقق من صحة النموذج المقترح بلغة مختلفة. يتم استخدام مجموعة بيانات Flickr8k مع التسميات التوضيحية العربية لتقييم نموذج التسميات التوضيحية للصور العربية المقترح

Text in English and abstract in Arabic & English.
