Research
Within the broad field of computer vision, I am particularly interested in applying LLMs to vision, multi-modal learning,
and robust, generalizable self-supervised learning. My Ph.D.
research focuses on learning robust representations for images and videos using limited
supervision.
My Master's thesis focused on automated generation of video descriptions.
Open Vocabulary Multi-Label Video Classification
Rohit Gupta,
Mamshad Nayeem Rizve,
Ashish Tawari,
Jayakrishnan Unnikrishnan,
Son Tran,
Mubarak Shah
(under review)
Pre-print (available on request) /
code (coming soon)
Pre-trained vision-language models (VLMs) have enabled significant progress in open vocabulary computer vision tasks such as image classification, object detection, and image segmentation. Some recent works have extended VLMs to zero-shot single-label action classification in videos. However, previous methods fall short of holistic video understanding, which requires the ability to simultaneously recognize multiple actions and entities in a video in a zero-shot setting. We formulate this problem as open vocabulary multi-label video classification and propose a method to adapt a pre-trained VLM such as CLIP to solve it. We leverage large language models (LLMs) to provide the VLM with semantic guidance about class labels, improving its open vocabulary performance through three key contributions. First, we propose an end-to-end trainable architecture that learns to prompt an LLM to generate soft attributes for the CLIP text encoder, enabling it to recognize novel classes. Second, we improve upon state-of-the-art temporal modeling techniques for CLIP's vision encoder to enhance its zero-shot classification performance. Third, we propose a recipe for generating useful class labels from large unlabeled video datasets and demonstrate that models trained with the added synthetic labels further boost open vocabulary classification performance. We demonstrate strong open vocabulary action and object recognition performance with a single model across multiple datasets.
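To illustrate the multi-label zero-shot setting (this is a minimal sketch with made-up embedding shapes, not the paper's code): each candidate label is scored independently against the video embedding with a per-label sigmoid, rather than forcing a single prediction via softmax as in single-label zero-shot classification.

```python
import numpy as np

def multilabel_zero_shot_scores(video_emb, text_embs, temperature=0.07):
    """Score each candidate label independently (multi-label zero-shot):
    cosine similarity between a video embedding and each label's text
    embedding, passed through a per-label sigmoid."""
    v = video_emb / np.linalg.norm(video_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = t @ v                                  # cosine similarity per label
    return 1.0 / (1.0 + np.exp(-sims / temperature))  # independent label scores

# Hypothetical usage: one 512-d video embedding, five candidate labels.
rng = np.random.default_rng(0)
scores = multilabel_zero_shot_scores(rng.normal(size=512), rng.normal(size=(5, 512)))
```

Because the sigmoid is applied per label, any number of labels can score highly at once, which is the behavior multi-label video classification needs.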
Class Prototypes based Contrastive Learning for Classifying Multi-Label and Fine-Grained Educational Videos
Rohit Gupta,
Anirban Roy,
Sujeong Kim,
Claire Christensen,
...,
Ajay Divakaran,
Mubarak Shah
CVPR, 2023 (to appear)
Pre-print /
code (coming soon)
We propose Class Prototype Contrastive Learning to address two key problems in practical fine-grained video
classification: (a) the multi-label nature of training data, and (b) fusing information effectively across modalities.
Our method achieves strong results on the COIN and YouTube-8M datasets. We also propose a novel video dataset
from the education domain with expert-annotated labels, designed such that understanding both video and speech
is essential for effective classification.
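A minimal numpy sketch of the core idea, under the assumption that each class keeps a learnable prototype vector (shapes and names here are hypothetical, not the paper's implementation): in the multi-label case, the prototype of every class a sample belongs to is treated as a positive.

```python
import numpy as np

def prototype_contrastive_loss(feat, prototypes, pos_labels, tau=0.1):
    """Multi-label prototype contrastive loss sketch: pull a sample's
    feature toward the prototypes of *all* its classes; all other class
    prototypes act as negatives."""
    f = feat / np.linalg.norm(feat)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = p @ f / tau                          # similarity to each prototype
    log_prob = logits - np.log(np.exp(logits).sum())  # log-softmax over classes
    return -log_prob[pos_labels].mean()           # average over positive classes

# Hypothetical usage: 128-d feature, 10 classes, sample labeled with classes 0 and 3.
rng = np.random.default_rng(1)
loss = prototype_contrastive_loss(rng.normal(size=128), rng.normal(size=(10, 128)), [0, 3])
```

Averaging the log-probability over all positive prototypes, instead of picking one, is what lets a single contrastive objective handle multi-label supervision.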
Contrastive Self-Supervised Learning Leads to Higher Adversarial Susceptibility
Rohit Gupta,
Naveed Akhtar,
Ajmal Mian,
Mubarak Shah
AAAI, 2023
arXiv /
code (coming soon)
Contrastive Self-Supervised Learning (CSL) results in significantly lower adversarial robustness
than supervised learning, even while achieving similar accuracy on clean data. We establish this
through extensive experiments and provide evidence that the lower robustness is caused
by the presence of false negative pairs during CSL training.
TCLR: Temporal Contrastive Learning for Video Representation
Ishan Dave,
Rohit Gupta,
Mamshad Nayeem Rizve,
Mubarak Shah
CVIU 219, June 2022
arXiv /
code
Unlike images, videos contain significant temporal variation in motion and appearance.
Hence, simple extensions of Contrastive Self-Supervised Learning (CSL) to videos fail to
capture temporal distinctiveness in the representation. We propose temporal contrastive losses that
learn temporally distinct representations, achieving significant gains on downstream video tasks.
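A simplified sketch of a temporal contrastive objective in this spirit (illustrative only, with hypothetical shapes; not the TCLR code): clips at the same timestep across two augmented views of a video are positives, while clips from other timesteps of the same video serve as negatives, which pushes features at different times apart.

```python
import numpy as np

def temporal_contrastive_loss(view_a, view_b, tau=0.1):
    """Sketch of a temporal contrastive loss over T clip features per view.
    Positives: same timestep across the two views. Negatives: all other
    timesteps, enforcing temporally distinct representations."""
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    logits = a @ b.T / tau                        # (T, T) cross-view similarities
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_prob).mean()              # matching-timestep positives

# Hypothetical usage: 4 clips per view, 64-d features each.
rng = np.random.default_rng(2)
loss = temporal_contrastive_loss(rng.normal(size=(4, 64)), rng.normal(size=(4, 64)))
```

Plain image-style CSL would instead treat all clips of a video as one positive set, so nothing discourages the representation from collapsing across time.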