COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

Ging, Simon; Zolfaghari, Mohammadreza; Pirsiavash, H.; Brox, Thomas

Advances in Neural Information Processing Systems (NeurIPS), Curran Associates, Inc., Vol.33: 22605--22618, 2020

Abstract: Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics. In this paper, we propose a Cooperative Hierarchical Transformer (COOT) to leverage this hierarchy information and model the interactions between different levels of granularity and different modalities. The method consists of three major components: an attention-aware feature aggregation layer, which leverages the local temporal context (intra-level, e.g., within a clip); a contextual transformer to learn the interactions between low-level and high-level semantics (inter-level, e.g., clip-video, sentence-paragraph); and a cross-modal cycle-consistency loss to connect video and text. The resulting method compares favorably to the state of the art on several benchmarks while having few parameters.
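
To make the cycle-consistency idea in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The function names, the softmax over negative squared distances, and the squared index gap used as the penalty are all assumptions chosen for illustration. Each sentence embedding is mapped to a soft nearest neighbor among the clip embeddings, then cycled back to the sentence sequence, and the loss grows when the cycle lands far from the starting position.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def soft_nearest_neighbor(query, candidates):
    """Average of candidates, weighted by softmax(-squared distance to query)."""
    weights = softmax([-sq_dist(query, c) for c in candidates])
    dim = len(query)
    return [sum(w * c[d] for w, c in zip(weights, candidates)) for d in range(dim)]


def cycle_consistency_loss(sentences, clips):
    """Mean squared gap between each sentence index and the soft index
    reached after going sentence -> clips -> sentences (illustrative only)."""
    loss = 0.0
    for i, sentence in enumerate(sentences):
        # Step 1: soft nearest neighbor of the sentence among the clips.
        clip_nn = soft_nearest_neighbor(sentence, clips)
        # Step 2: cycle back and compute a soft position in the sentence list.
        back = softmax([-sq_dist(clip_nn, s) for s in sentences])
        soft_index = sum(w * k for k, w in enumerate(back))
        # Step 3: penalize landing away from the starting position.
        loss += (i - soft_index) ** 2
    return loss / len(sentences)


# Hypothetical toy embeddings: three sentences and three aligned clips.
sentences = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
aligned_clips = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
aligned_loss = cycle_consistency_loss(sentences, aligned_clips)
```

With well-separated, aligned embeddings the cycle returns to its starting index and the loss is close to zero. When the clip embeddings carry no information about the sentences, the soft index drifts toward the middle of the sequence and the loss increases.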

BibTeX reference

@InProceedings{GZB20,
  author       = "S. Ging and M. Zolfaghari and H. Pirsiavash and T. Brox",
  title        = "COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning",
  booktitle    = "Advances in Neural Information Processing Systems (NeurIPS)",
  volume       = "33",
  pages        = "22605--22618",
  year         = "2020",
  editor       = "H. Larochelle and M. Ranzato and R. Hadsell and M. F. Balcan and H. Lin",
  publisher    = "Curran Associates, Inc.",
  url          = "http://lmbweb.informatik.uni-freiburg.de/Publications/2020/GZB20"
}
