Learning in limited-data and limited-annotation scenarios

Islam, Ashraful
Electronic thesis
Electrical engineering
Deep learning has demonstrated its capability across a wide range of learning tasks. However, training deep models generally requires vast amounts of annotated data, and in many areas creating such datasets consumes considerable resources, time, and effort. This dramatically restricts the applicability of deep learning in many real-world scenarios. It is thus of paramount importance to develop deep learning models that can leverage the small amount of annotated data available.

In restricted domains with a limited supply of training data, transfer learning and few-shot learning can help mitigate the limited-data issue. In transfer learning, representations learned from a large dataset are re-purposed for another domain that has limited annotated data. The standard approach is to use an ImageNet-pretrained ResNet model as a feature extractor, and either train a classifier head on top of it or finetune the full network to predict the classes of the downstream dataset. However, we show that ResNet models trained with self-supervised losses, particularly self-supervised contrastive losses, can transfer better than those trained with cross-entropy loss. We study the transferability of the representations learned by different contrastive approaches on downstream linear evaluation, full-network transfer, few-shot recognition, and object detection tasks. The results show that contrastive approaches learn representations that transfer easily to different downstream tasks. We further observe that a joint objective combining a self-supervised contrastive loss with a cross-entropy or supervised-contrastive loss yields better transferability than the supervised counterparts alone.

Most self-supervised learning models rely on image classification tasks for evaluation.
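The self-supervised contrastive objective mentioned above can be sketched as an InfoNCE/NT-Xent loss: two augmented views of each image are pulled together in embedding space while all other images in the batch act as negatives. The following is an illustrative NumPy sketch under assumed conventions (function name, batch layout, and temperature are not from the thesis):

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.1):
    """InfoNCE (NT-Xent) loss over a batch of paired embeddings.

    z1, z2: (N, D) arrays of L2-normalized embeddings of two augmented
    views of the same N images. Matching rows are positives; every other
    row in the combined batch serves as a negative.
    """
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)           # (2N, D)
    sim = z @ z.T / temperature                    # scaled cosine similarities
    np.fill_diagonal(sim, -np.inf)                 # exclude self-similarity
    # each sample's positive partner: row i pairs with row i + N (and vice versa)
    pos = np.concatenate([np.arange(n) + n, np.arange(n)])
    logits = sim - sim.max(axis=1, keepdims=True)  # stabilize the softmax
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()
```

When the two views produce identical embeddings the positive pairs dominate the softmax and the loss is low; for unrelated embeddings the loss approaches the log of the batch size, which is what drives the views of the same image together during pre-training.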
Due to the global (i.e., single-label) nature of image classification, existing methods use a single feature representation of the image to guide their loss functions. However, many computer vision tasks (e.g., object detection, semantic segmentation, pose estimation) require spatially localized feature representations. We present a novel self-supervised learning method that imposes local consistency between corresponding regions of transformed versions of the same image, and that can be used alongside any self-supervised learning method with minimal computational overhead. With this approach we report significant improvements over existing self-supervised pre-training methods on object detection and semantic segmentation.

Next, we tackle cross-domain few-shot learning, where there is a large shift between a base dataset with many labeled samples and a target domain with few labeled examples and many unlabeled samples. The goal is to learn a feature extractor network from the labeled base dataset and then adapt its weights using the few labeled samples from the target dataset. We propose a simple dynamic-distillation-based approach that exploits unlabeled images from the target dataset: we impose consistency regularization by computing predictions for weakly augmented versions of the unlabeled images with a teacher network and matching them with the student network's predictions for strongly augmented versions of the same images. We show that the proposed network learns a representation that can be easily adapted to the target domain even though it has not been trained on target-specific classes during pretraining.

One particular domain where annotation is expensive is video action localization: it is very time-consuming to label the start and end frames of every action in an untrimmed video.
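The teacher-student consistency regularization described for the cross-domain few-shot setting can be sketched as follows. This is a minimal NumPy sketch under assumed details (the confidence threshold, EMA momentum, and function names are illustrative assumptions, not the thesis's exact formulation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def consistency_loss(student_logits_strong, teacher_logits_weak, threshold=0.8):
    """Cross-entropy between the teacher's predictions on weakly augmented
    images and the student's predictions on strongly augmented versions of
    the same images. Only confident teacher predictions contribute
    (confidence threshold is an assumed detail)."""
    t_prob = softmax(teacher_logits_weak)            # teacher target (no gradient in practice)
    mask = t_prob.max(axis=1) >= threshold           # keep confident predictions only
    s_logp = np.log(softmax(student_logits_strong) + 1e-12)
    ce = -(t_prob * s_logp).sum(axis=1)              # per-image cross-entropy
    return (ce * mask).sum() / max(mask.sum(), 1)

def ema_update(teacher_w, student_w, momentum=0.99):
    """The teacher tracks the student via an exponential moving average,
    making the distillation target 'dynamic' as training progresses."""
    return momentum * teacher_w + (1 - momentum) * student_w
```

The loss is small when the student agrees with the teacher on strongly augmented inputs and large when it disagrees, which is the signal that pulls the representation toward the unlabeled target domain.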
We develop weakly supervised action localization models that are trained on videos with only video-level action categories, without any temporal supervision, yet can predict the start and end frames of each action in the video. We develop two methods for this task.

In the first approach, we use deep metric learning to learn an embedding of video segments such that segments with similar actions lie close together in the embedding space and segments with different actions lie far apart. We propose a classification module that generates action labels for each segment in the video and a deep metric learning module that learns the similarity between different action instances, jointly optimizing a balanced binary cross-entropy loss and a metric loss with standard backpropagation.

In the second approach, we present a hybrid attention mechanism for weakly supervised temporal localization. We argue that existing multiple instance learning (MIL) approaches have a major limitation: they capture only the most discriminative frames of an action and ignore its full extent. Moreover, these methods cannot effectively model background activity, which plays an important role in localizing foreground activities. We develop a novel hybrid attention mechanism comprising temporal soft, semi-soft, and hard attention to address these issues.
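The MIL-style aggregation underlying such attention mechanisms can be sketched as follows: per-segment class scores are pooled into a video-level prediction via a temporal attention distribution, and a harder variant suppresses low-attention (background) segments. This NumPy sketch only illustrates the general soft/hard-attention idea; the shapes, thresholding rule, and function names are assumptions, not the thesis's actual model:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mil_video_scores(segment_logits, attention_logits):
    """Aggregate per-segment class logits into video-level predictions.

    segment_logits:   (T, C) class scores for each temporal segment.
    attention_logits: (T,)   foreground attention score per segment.

    Soft attention weights every segment; the hard variant keeps only
    the higher-attention half of the segments, loosely illustrating how
    suppressing background segments sharpens the foreground prediction.
    """
    att = softmax(attention_logits, axis=0)                 # soft attention over time
    video_logits = (att[:, None] * segment_logits).sum(axis=0)
    keep = attention_logits >= np.median(attention_logits)  # hard attention mask
    hard_logits = segment_logits[keep].mean(axis=0)
    return softmax(video_logits), softmax(hard_logits)
```

Because only the video-level prediction is supervised, the attention weights are what localize the action in time: segments that raise the correct class score receive high attention, and thresholding those weights yields start and end frames.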
May 2022
School of Engineering
Rensselaer Polytechnic Institute, Troy, NY