What is the state-of-the-art and how does MediaVerse aspire to address the current limitations?
What is New Media? While there is no commonly accepted definition of the term, it is mainly used to describe new forms and formats of digital media that are primarily meant for distribution through the Internet and social media platforms. Such formats include, for instance, image memes, TikTok videos, 360° video content, and the like.
In recent years, New Media have come to dominate digital life. They enable people to express themselves in new and potentially more attractive and engaging ways. At the same time, the advent and rapid growth of deep learning has made available powerful algorithms and models capable of analysing and extracting semantic information from media content. Such algorithms can provide useful annotations of new media, which can then be utilized for practical purposes such as content retrieval, moderation and accessibility.
The scientific community has explored multiple avenues in the search for optimal approaches to semantic information extraction from media content. The past decade has been dominated by a family of learning algorithms that revolutionized the field, bringing important gains in accuracy and generalization on many traditional and modern tasks: neural networks. Convolutional Neural Networks (CNNs) are the most widely used type of model for visual content. For text content, Recurrent Neural Networks (RNNs), including Long Short-Term Memory (LSTM) networks, and Transformer architectures are widely used. It is worth noting that in 2020 the Vision Transformer (ViT) architecture, a non-CNN neural network, achieved state-of-the-art results for visual content analysis, indicating the growing adoption of Transformers by the multimedia community.
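To make the contrast between these architectures concrete, the core operation of the Transformer (and of ViT, which applies it to a sequence of image patches) is scaled dot-product self-attention. Below is a minimal, self-contained NumPy sketch of that operation, not tied to any particular library implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention, the core Transformer operation.

    Q, K, V: arrays of shape (seq_len, d); returns (output, attention weights).
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # pairwise query/key similarity
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy example: 4 "tokens" (words, or image patches as in ViT) with 8-dim embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(x, x, x)    # self-attention: Q = K = V
print(out.shape)        # (4, 8): one contextualized vector per token
print(w.sum(axis=-1))   # each row of attention weights sums to 1
```

In a full Transformer this operation is applied with learned projections of the input and repeated across multiple heads and layers; the sketch shows only the attention mechanism itself.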
The main paradigms to content-based semantic annotation include:
- concept annotation, in which the objective is to predict a single or multiple label(s) corresponding to concepts, like persons or cars, encountered in the content (e.g. image classification, video classification, emoji classification, emoji-based reaction prediction),
- object detection, in which the objective is to localize and classify objects in the content (e.g. MSCOCO concepts, logo detection, traffic sign detection, face detection),
- entity extraction, in which the objective is to recognize entities, such as landmarks or celebrities, either in text documents or inside visual content, and
- free text annotation, in which the objective is to generate a natural language caption that briefly describes the content (e.g. image captioning, video captioning).
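A key practical distinction within concept annotation is between single-label prediction (exactly one concept per item, via a softmax head) and multi-label prediction (any subset of concepts, via independent sigmoid heads). The sketch below contrasts the two; the label vocabulary and logit values are made up for illustration, whereas in a real system the logits would come from a trained backbone:

```python
import numpy as np

LABELS = ["person", "car", "dog"]            # hypothetical concept vocabulary

def softmax(logits):
    """Single-label head: concepts are mutually exclusive, probabilities sum to 1."""
    z = np.exp(logits - logits.max())
    return z / z.sum()

def sigmoid(logits):
    """Multi-label head: each concept is scored independently."""
    return 1.0 / (1.0 + np.exp(-logits))

logits = np.array([2.0, -1.0, 0.5])          # made-up scores from some backbone
single = softmax(logits)
multi = sigmoid(logits) > 0.5                # independent per-concept decisions

print({l: round(float(p), 3) for l, p in zip(LABELS, single)})
# {'person': 0.786, 'car': 0.039, 'dog': 0.175}
print([l for l, keep in zip(LABELS, multi) if keep])
# ['person', 'dog']
```

Note how the multi-label head can assign both "person" and "dog" to the same item, which the softmax head cannot; this is why multi-label heads are the usual choice for concept annotation of rich media content.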
In the New Media era, the main challenge in deploying such approaches is the absence of high-volume datasets with appropriate annotations for new data formats. Deep learning typically requires large amounts of annotated, relevant content in order to train reliable models, and producing such annotations is labor-intensive and in some cases prohibitive. To this end, a common practice is transfer learning, which leverages knowledge acquired in one domain in models targeting other, similar problems. Recently, billion-parameter models with very deep architectures, pre-trained on large-scale datasets, have been exploited to achieve high accuracy in new domains relatively quickly and without much training data. However, pre-training such deep models requires a tremendous number of operations that only powerful computing infrastructures (GPU/TPU clusters) can perform, making it inaccessible to most researchers and practitioners. Additionally, current computer vision and vision-language solutions provide rather generic and often outdated annotations, leading to irrelevant descriptions of new media. To alleviate this issue, one could leverage the wealth of user-generated content annotations available on social media platforms, while bearing in mind the noise and weak labeling that come with them, as well as the various kinds of bias that may be introduced by the data collection process.
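The transfer-learning idea can be sketched in a few lines on purely synthetic data: a "pretrained" feature extractor is kept frozen (stood in for here by a fixed random projection with a ReLU, rather than a real CNN or ViT), and only a small linear head is trained on a handful of labeled examples from the new domain. Everything in this sketch is illustrative; no real backbone or dataset is used:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for a frozen pretrained backbone; in practice this would be
# e.g. a CNN or ViT whose weights are not updated.
W_backbone = rng.normal(size=(16, 8))

def features(x):
    """Frozen feature extractor: only the small head below is trained."""
    return np.maximum(x @ W_backbone, 0.0)       # ReLU features

# Small synthetic dataset in the "new" domain; the label is a simple
# function of the frozen features, so a linear head can fit it.
X = rng.normal(size=(40, 16))
F = features(X)
F -= F.mean(axis=0)                              # center features
y = (F.sum(axis=1) > np.median(F.sum(axis=1))).astype(float)

# Train only a logistic-regression head on top of the frozen features.
w, b = np.zeros(F.shape[1]), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(F @ w + b)))       # sigmoid predictions
    grad = p - y                                 # gradient of cross-entropy loss
    w -= 0.1 * F.T @ grad / len(y)
    b -= 0.1 * grad.mean()

accuracy = float(((p > 0.5) == y).mean())
print(f"head-only training accuracy: {accuracy:.2f}")
```

Because only the small head is optimized, training is cheap and needs few labels, which is exactly what makes transfer learning attractive when large annotated datasets for new media formats do not exist.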
One of the most promising features of MediaVerse is the capability for its users to easily perform semantic annotation of new media content uploaded to its instances, through deep learning models and methodologies such as those described above. We envision a system able to provide relevant descriptions (for example, free-text descriptions, relevant tags or even vector representations) for all new media types, in order to make media content easier to discover. This is anything but a solved problem, and our work focuses on what information we can obtain from each new media type and how to obtain it optimally. To demonstrate how current state-of-the-art models perform on new media types, we present a small experiment in which widely used models are applied to a trending visual meme.
(image meme; source: The Washington Post)

| model [training set] | output |
|---|---|
| Show, Attend and Tell [MSCOCO] | "man sitting on top of chair with his phone" |
| Faster R-CNN Inception V2 [MSCOCO] | person: 99.7%, person: 99.4%, person: 91.0% (identifies correctly three out of four persons) |
| Inception V3 [ImageNet] | web_site: 90.6% |
| Xception [ImageNet] | web_site: 80.0% |
| EfficientNetB7 [ImageNet] | web_site: 75.0% |
The results, while reasonable to a certain extent*, are disappointing: none of the models recognizes the person (Bernie Sanders) or the context (humour, the US presidential inauguration). This example showcases the shortcomings of models that perform well on more traditional tasks and established benchmark datasets when they are applied to new media types and datasets, and it motivates us to adapt these models to be more effective in practical scenarios. Another limitation that MediaVerse seeks to overcome is the inability of any single system to provide labels for every niche subject or area. For that purpose, we plan to provide an easy-to-deploy training module, friendly enough for non-AI experts to train their own models and generate meaningful descriptions for their content.
* Considering that the algorithms “see” a title above an image and the surrounding color, it is not that strange that some of them predict “web_site”, but they are clearly fooled!