Multi-granularity Correspondence Learning from Long-term Noisy Videos

Yijie Lin, Jie Zhang, Zhenyu Huang, Jia Liu, Zujie Wen, Xi Peng
ICLR 2024 (Oral)

Abstract

Existing video-language studies mainly focus on learning short video clips, leaving long-term temporal dependencies rarely explored due to the prohibitively high computational cost of modeling long videos. To address this issue, one feasible solution is to learn the correspondence between video clips and captions, which, however, inevitably encounters the multi-granularity noisy correspondence (MNC) problem. To be specific, MNC refers to clip-caption misalignment (coarse-grained) and frame-word misalignment (fine-grained), which hinder temporal learning and video understanding. In this paper, we propose NOise Robust Temporal Optimal traNsport (Norton), which addresses MNC in a unified optimal transport (OT) framework. In brief, Norton employs video-paragraph and clip-caption contrastive losses to capture long-term dependencies based on OT. To address coarse-grained misalignment in the video-paragraph contrast, Norton filters out irrelevant clips and captions through an alignable prompt bucket and realigns asynchronous clip-caption pairs based on the transport distance. To address fine-grained misalignment, Norton incorporates a soft-maximum operator to identify crucial words and key frames. Additionally, Norton exploits potential faulty negative samples in the clip-caption contrast by rectifying the alignment target with the OT assignment, ensuring precise temporal modeling. Extensive experiments on video retrieval, videoQA, and action segmentation verify the effectiveness of our method.
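As a rough illustration of the last point, the in-batch clip-caption contrastive loss can use a soft alignment target built from an OT assignment instead of a one-hot identity target, so that likely faulty negatives are down-weighted. The sketch below is our own minimal PyTorch reading of that idea; the function name, the blending weight lam, and the row-normalization are illustrative assumptions, not the exact formulation in the paper.

      import torch
      import torch.nn.functional as F

      def ot_rectified_clip_caption_loss(sim, ot_plan, tau=0.07, lam=0.5):
          """Clip-caption contrastive loss with an OT-rectified soft target (sketch).

          sim:     (B, B) in-batch clip-to-caption similarities.
          ot_plan: (B, B) assignment from Sinkhorn/OT over the same batch.
          lam:     assumed blending weight between the one-hot and the OT target.
          """
          soft_target = ot_plan / ot_plan.sum(dim=1, keepdim=True)   # row-normalized OT assignment
          target = lam * torch.eye(sim.size(0), device=sim.device) + (1.0 - lam) * soft_target
          log_prob = F.log_softmax(sim / tau, dim=1)                 # clip -> caption direction
          return -(target * log_prob).sum(dim=1).mean()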


Method



Overview of our multi-granularity correspondence learning. We perform video-paragraph contrastive learning to capture long-term temporal correlations from a fine-to-coarse perspective. Specifically, we first apply the log-sum-exp operator to the frame-word similarity matrix to obtain the fine-grained similarity between each clip and caption. We then append an alignable prompt bucket to the clip-caption similarity matrix to filter out irrelevant clips and captions. Finally, by applying Sinkhorn iterations to the clip-caption similarity matrix, we tackle the asynchrony problem and obtain the optimal transport distance as the video-paragraph similarity.
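To make this fine-to-coarse pipeline concrete, below is a minimal PyTorch sketch: log-sum-exp pooling of frame-word similarities, an appended prompt bucket, and Sinkhorn iterations yielding a transport plan and the video-paragraph similarity. The function names, uniform marginals, the temperature alpha, and the constant bucket value p are our assumptions for illustration; please refer to the paper and code release for the exact formulation.

      import torch

      def clip_caption_sim(frame_feat, word_feat, alpha=0.1):
          """Fine-grained clip-caption similarity: a log-sum-exp (soft maximum)
          over the frame-word similarity matrix highlights key frames and crucial words."""
          sim = frame_feat @ word_feat.t()                     # (F, W), features assumed L2-normalized
          return alpha * torch.logsumexp(sim.flatten() / alpha, dim=0)

      def append_prompt_bucket(sim, p=0.0):
          """Append one extra row and column with a constant value p; mass transported
          to the bucket corresponds to clips/captions judged unalignable."""
          n, m = sim.shape
          sim = torch.cat([sim, torch.full((n, 1), p)], dim=1)
          sim = torch.cat([sim, torch.full((1, m + 1), p)], dim=0)
          return sim

      def sinkhorn(sim, eps=0.05, n_iter=50):
          """Entropic OT with uniform marginals (treat -sim as the cost); returns the plan."""
          n, m = sim.shape
          K = torch.exp(sim / eps)                             # Gibbs kernel
          a, b = torch.full((n,), 1.0 / n), torch.full((m,), 1.0 / m)
          u, v = torch.ones(n), torch.ones(m)
          for _ in range(n_iter):
              u = a / (K @ v)
              v = b / (K.t() @ u)
          return u[:, None] * K * v[None, :]

      def video_paragraph_sim(clip_feats, caption_feats):
          """Coarse-grained video-paragraph similarity as the total transported similarity."""
          S = torch.stack([torch.stack([clip_caption_sim(c, t) for t in caption_feats])
                           for c in clip_feats])               # (n_clips, n_captions)
          P = sinkhorn(append_prompt_bucket(S))
          return (P[:-1, :-1] * S).sum()                       # drop the bucket row/column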


Dataset

  • Training Dataset HowTo100M
    • Download our prepared data features from Baidu Cloud Disk (password: nk6e) and follow the instructions to process the data.

    • Video titles (77MB json, for reference only)
      Download
    • Video subtitles (Sentencified HTM)

      Sentencified HTM converts the original YouTube ASR (Automatic Speech Recognition) texts to full sentences using the method here.
      Download HTM-1.2M (9.9GB json) | Statistics

    • Video feature

      We use the HowTo100M pre-trained S3D (MIL-NCE) model to extract one video token per second from 30 fps video, following VideoCLIP, yielding around 465 GB of npy files (a minimal loading sketch is given after this list).

  • Evaluation Dataset

    The downstream datasets and annotation files (e.g., `msrvtt/MSRVTT_JSFUSION_test.csv`) are now available for download on Baidu Cloud Disk. Access them via this link: https://pan.baidu.com/s/1KM60oabsr8TflzsRLwy7xQ?pwd=6akb.

    • YouCookII
    • MSRVTT
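For orientation, here is a minimal sketch of how the per-second clip features and sentencified subtitles could be paired into clip-caption examples. The file names, the JSON fields (text, start, end), and the directory layout are hypothetical; adapt them to the actual files from the links above.

      import json
      import numpy as np

      # Hypothetical paths; the real layout follows the downloaded archives.
      FEATURE_FILE = "features/VIDEO_ID.npy"       # (T, d) S3D features, one row per second
      SUBTITLE_FILE = "htm_sentencified.json"      # {video_id: [{"text": ..., "start": ..., "end": ...}, ...]}

      features = np.load(FEATURE_FILE)             # per-second video tokens
      with open(SUBTITLE_FILE) as f:
          subtitles = json.load(f)

      pairs = []
      for sent in subtitles.get("VIDEO_ID", []):
          s, e = int(sent["start"]), int(np.ceil(sent["end"]))
          clip = features[s:e]                     # frame tokens covered by this caption's time span
          if len(clip) > 0:
              pairs.append((clip, sent["text"]))   # one clip-caption pair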


Comparison

Long video retrieval results, demonstrating the effectiveness of temporal learning. We compare our proposed Norton with three standard strategies: Cap. Avg. (Caption Average), DTW (Dynamic Time Warping), and OTAM (Ordered Temporal Alignment Module).



Visualization of re-alignment, demonstrating the robustness of our method. We compare our method with DTW and vanilla optimal transport.
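To make the DTW baseline concrete, below is a toy NumPy aligner over a clip-caption similarity matrix (our own illustrative code, not the evaluation script). DTW enforces a monotonic alignment path, so unlike the Sinkhorn-based realignment sketched in the Method section it cannot recover clips and captions that appear out of order.

      import numpy as np

      def dtw_align(sim):
          """Monotonic clip-caption alignment by dynamic time warping,
          maximizing the accumulated similarity along the path."""
          n, m = sim.shape
          acc = np.full((n + 1, m + 1), -np.inf)
          acc[0, 0] = 0.0
          for i in range(1, n + 1):
              for j in range(1, m + 1):
                  acc[i, j] = sim[i - 1, j - 1] + max(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
          # Backtrack the highest-scoring monotonic path.
          path, i, j = [], n, m
          while i > 0 and j > 0:
              path.append((i - 1, j - 1))
              k = int(np.argmax([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
              if k == 0:
                  i, j = i - 1, j - 1
              elif k == 1:
                  i -= 1
              else:
                  j -= 1
          return path[::-1]

      # Toy example: captions 1 and 2 are swapped relative to the clips.
      sim = np.array([[0.9, 0.1, 0.1],
                      [0.1, 0.1, 0.9],
                      [0.1, 0.9, 0.1]])
      print(dtw_align(sim))   # the monotonic path cannot recover the swapped pair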




Background of Noisy Correspondence

The noisy correspondence problem, i.e., mismatched data pairs, has garnered attention in diverse multi-modal applications, extending beyond video-text domains to encompass challenges in image-text retrieval (Huang et al., 2021; Qin et al., 2022; 2023; Han et al., 2023; Yang et al., 2023), cross-modal generation (Li et al., 2022), person re-identification (Yang et al., 2022), and graph matching (Lin et al., 2023).


BibTeX


      @inproceedings{lin2024norton,
        title={Multi-granularity Correspondence Learning from Long-term Noisy Videos},
        author={Lin, Yijie and Zhang, Jie and Huang, Zhenyu and Liu, Jia and Wen, Zujie and Peng, Xi},
        booktitle={Proceedings of the International Conference on Learning Representations},
        month={May},
        year={2024}
      }