Talk Title: Multimodal Machine Learning: Efficient Visual-Language Deep Learning Models for Image Captioning and Cross-Modal Retrieval
Speaker: Osolo Ian Raymond (苏罗)
Talk Introduction: Introduction to the topics
Key summary points: CNNs, RNNs, and Transformers in Computer Vision and Natural Language Processing; image captioning and cross-modal retrieval
We experience the world and react to it through a combination of senses such as sound, taste, and sight. Likewise, machine learning models, initially designed to handle a single modality (e.g., text, image, or speech), are increasingly required to handle multimodal data. This requires designing models that can process more than one modality at a time. Research applications include image captioning, Visual Question Answering, text-to-speech, and speech-to-text. In practical, real-life situations, such models could improve the lives of visually impaired people and make multitasking safer (e.g., hands-free texting and message reading). Handling multimodal data is a complex task because it requires bridging the gap between two very different feature representations. In this work, we report our research, focused mainly on image captioning and partly on cross-modal retrieval, in which we design and analyze multimodal models that employ CNN, RNN, and Transformer architectures.
Join me as I take you on a journey through the progress and advancements in deep learning as applied to visual-language modelling, focusing mainly on image captioning and also on cross-modal retrieval. Presented below are three relevant topic summaries:
Topic of Report 1: Making images matter more: A Fourier-Augmented Image-Captioning Transformer
Summary of Report 1:
Many vision-language models that output natural language, such as image-captioning models, use image features merely to ground the captions, and most of the model's good performance can be attributed to the language model, which does the heavy lifting. In this report, we propose a method to make the images matter more by using fast Fourier transforms to further break down the input features and extract more of their intrinsic salient information, resulting in more detailed yet concise captions. Furthermore, we analyze and provide insight into the use of fast Fourier transform features as alternatives or supplements to regional features for self-attention in image-captioning applications.
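To make the idea concrete, the following is a minimal sketch of Fourier-augmented image features: per-region feature vectors are decomposed with a fast Fourier transform and their magnitude spectrum is concatenated as supplementary input. The function name, shapes, and the simple concatenation scheme are illustrative assumptions, not the exact augmentation used in the report.

```python
import numpy as np

def fourier_augment(features: np.ndarray) -> np.ndarray:
    """Augment per-region image features with their Fourier magnitudes.

    `features` has shape (num_regions, dim). Illustrative sketch only,
    not the report's exact method.
    """
    # Fast Fourier transform along the feature dimension (real input).
    spectrum = np.fft.rfft(features, axis=-1)
    # The magnitude spectrum summarizes the energy of each frequency component.
    magnitude = np.abs(spectrum)
    # Concatenate the original features with their spectral summary.
    return np.concatenate([features, magnitude], axis=-1)

regions = np.random.rand(36, 512)   # e.g., 36 detected regions, 512-d each
augmented = fourier_augment(regions)
print(augmented.shape)              # (36, 512 + 512 // 2 + 1) = (36, 769)
```

The augmented features could then feed a transformer encoder in place of, or alongside, the raw regional features.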
Topic of Report 2: An analysis of the use of feed-forward sub-modules in a transformer-based visual-language multimodal environment
Summary of Report 2:
Transformers have become the go-to architecture for deep learning tasks in computer vision and natural language processing because of their state-of-the-art performance on most of those tasks. This good performance is mainly attributed to the self-attention mechanism, yet little research has investigated whether it is indeed responsible for most of it. In this report, we use image captioning as our application of choice to perform a comprehensive analysis of the effect of replacing the self-attention mechanism with feed-forward layers in both the image encoder and the text decoder. We investigate the effect on memory usage and sequence length, and our experiments reveal several surprising results. The report provides a qualitative analysis of the resulting captions and an empirical analysis of the evaluation metrics and memory usage, offering practical insight into the effect of this substitution in vision-language tasks while also demonstrating competitive results with the much simpler architecture.
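The contrast under study can be sketched as follows: self-attention mixes tokens with input-dependent weights, while a feed-forward substitute mixes them with a fixed learned matrix (in the spirit of MLP-Mixer-style token mixing). All names and dimensions here are illustrative assumptions; the report's actual architecture is not reproduced.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Standard scaled dot-product self-attention over a token sequence x."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Numerically stable softmax over each row of attention scores.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def feedforward_mixing(x, W_tokens):
    """Feed-forward substitute: mix tokens with a fixed learned matrix.

    Unlike attention, the mixing weights do not depend on the input, so
    there is no seq_len x seq_len attention map to materialize per input;
    the mixing matrix also ties the model to a fixed sequence length.
    """
    return W_tokens @ x

rng = np.random.default_rng(0)
seq_len, dim = 36, 64
x = rng.standard_normal((seq_len, dim))
Wq, Wk, Wv = (rng.standard_normal((dim, dim)) for _ in range(3))
W_tokens = rng.standard_normal((seq_len, seq_len))
attn_out = self_attention(x, Wq, Wk, Wv)        # shape (36, 64)
ff_out = feedforward_mixing(x, W_tokens)        # shape (36, 64)
```

Both produce an output of the same shape, which is what makes the drop-in substitution, and hence the comparison of memory usage and sequence-length behaviour, possible.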
Topic of Report 3: A Nonlinear Supervised Discrete Hashing framework for large-scale cross-modal retrieval
Summary of Report 3:
In cross-modal retrieval, the biggest issue is the large semantic gap between the feature distributions of heterogeneous data, which makes it very difficult to directly compute the relationships between different modalities. To bridge this heterogeneous gap, many techniques have been proposed to create an effective common latent representation of the heterogeneous modalities, so that similarities can be computed efficiently using standard distance metrics. Some shortcomings of current supervised cross-modal hashing methods will be discussed. Then, a novel hashing-based cross-modal retrieval method that uses food-ingredient retrieval as a proof of concept will be presented.
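The core idea of hashing-based cross-modal retrieval can be sketched as follows: each modality is projected into a shared binary code space, where similarity reduces to a cheap Hamming-distance comparison. The random projections below stand in for the learned nonlinear hashing functions; all dimensions and names are illustrative assumptions, not the proposed framework.

```python
import numpy as np

def hash_codes(features, projection):
    """Map real-valued features to binary codes via the sign of a projection.

    A stand-in for a learned nonlinear hashing function; the projection
    here is random, for illustration only.
    """
    return (features @ projection > 0).astype(np.uint8)

def hamming_distance(a, b):
    """Number of differing bits between two binary codes."""
    return int(np.count_nonzero(a != b))

rng = np.random.default_rng(42)
img_dim, txt_dim, code_bits = 512, 300, 64
# Separate projections map the heterogeneous modalities into one Hamming space.
P_img = rng.standard_normal((img_dim, code_bits))
P_txt = rng.standard_normal((txt_dim, code_bits))

image_code = hash_codes(rng.standard_normal(img_dim), P_img)    # e.g., a food photo
recipe_code = hash_codes(rng.standard_normal(txt_dim), P_txt)   # e.g., its ingredients
dist = hamming_distance(image_code, recipe_code)                # in [0, 64]
```

With supervised training, the projections would be learned so that matching image-ingredient pairs land close together in Hamming space, enabling efficient large-scale retrieval over compact binary codes.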
Speaker Biography: Osolo Ian Raymond received his BTech degree in Electrical Engineering from Nelson Mandela University, South Africa, and his M.Eng. degree in Software Engineering from Central South University, China, where he is currently pursuing a PhD degree in Computer Science Application & Technology. He has published papers in reputed ESCI/SCI journals, focusing on image captioning and cross-modal retrieval. His research interests include Machine Learning, specifically Deep Learning for Computer Vision, Natural Language Processing, and Embedded Systems.