Localized Vision-Language Matching for Open-vocabulary Object Detection
German Conference on Pattern Recognition (GCPR), 2022
Abstract: In this work, we propose an open-vocabulary object detection method that, based on image-caption pairs, learns to detect novel object classes along with a given set of known classes. It is a two-stage training approach: the first stage uses a location-guided image-caption matching technique to learn class labels for both novel and known classes in a weakly-supervised manner, and the second stage specializes the model for the object detection task using known-class annotations.
We show that a simple language model fits better than a large contextualized language model for detecting novel objects. Moreover, we introduce a consistency-regularization technique to better exploit image-caption pair information. Our method compares favorably to existing open-vocabulary detection approaches while being data-efficient.
Paper
Supplementary
Poster
BibTeX reference
@InProceedings{BMB22,
  author    = "M. Bravo and S. Mittal and T. Brox",
  title     = "Localized Vision-Language Matching for Open-vocabulary Object Detection",
  booktitle = "German Conference on Pattern Recognition (GCPR)",
  month     = " ",
  year      = "2022",
  keywords  = "Open-vocabulary Object Detection, Image-caption Matching, Weakly-supervised Learning, Multi-modal Training",
  url       = "http://lmbweb.informatik.uni-freiburg.de/Publications/2022/BMB22"
}