WARP: An Efficient Engine for Multi-Vector Retrieval

¹ETH Zurich, ²UC Berkeley, ³Stanford University
SIGIR'25

*Work completed as a visiting student researcher at Stanford University.

Single-threaded CPU latency breakdown on LoTTE Pooled for (1) XTR's unoptimized reference implementation,
(2) a variant of XTR that we optimized, (3) the official ColBERTv2/PLAID system, and
(4) our proposed XTR/WARP.

Abstract

Multi-vector retrieval methods such as ColBERT and its recent variant, the ConteXtualized Token Retriever (XTR), offer high accuracy but face efficiency challenges at scale. To address this, we present WARP, a retrieval engine that substantially improves the efficiency of retrievers trained with the XTR objective through three key innovations:

  1. WARPSELECT for dynamic similarity imputation;
  2. implicit decompression, avoiding costly vector reconstruction during retrieval; and
  3. a two-stage reduction process for efficient score aggregation.
Combined with highly optimized C++ kernels, our system reduces end-to-end latency by 41x compared to XTR's reference implementation and achieves a 3x speedup over the ColBERTv2/PLAID engine, while preserving retrieval quality. WARP also reduces index sizes by a factor of 2x-4x compared to XTR, enabling deployment on memory-constrained devices.
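
To make the second and third ideas in the list above more concrete, here is a toy Python sketch. It is not WARP's implementation (the real kernels are written in C++), and everything in it is an assumption made for illustration: the function names `implicit_score` and `two_stage_reduce`, the residual layout (one bucket index per dimension, looked up in a shared `bucket_weights` table), and the `missing` list of imputed per-query-token scores, which is simply passed in rather than estimated as WARPSELECT would.

```python
from collections import defaultdict


def implicit_score(query, centroid_score, codes, bucket_weights):
    """Score a compressed token without rebuilding its vector.

    Assumption: token ~= centroid + residual, where
    residual[d] = bucket_weights[codes[d]]. By linearity of the dot product,
    q . token = q . centroid + sum_d q[d] * bucket_weights[codes[d]],
    so the precomputed centroid score can be reused for every token
    assigned to that centroid.
    """
    return centroid_score + sum(
        q * bucket_weights[c] for q, c in zip(query, codes)
    )


def two_stage_reduce(hits, missing, k):
    """Aggregate token-level hits into the top-k document scores.

    hits:    iterable of (query_token_index, doc_id, score)
    missing: missing[i] is the score imputed for query token i when a
             document has no retrieved hit for that token (hypothetical
             input; supplied by the caller in this sketch)
    """
    # Stage 1 (token level): keep the best score per (document, query token).
    best = defaultdict(dict)
    for qtok, doc, score in hits:
        prev = best[doc].get(qtok)
        if prev is None or score > prev:
            best[doc][qtok] = score

    # Stage 2 (document level): sum over all query tokens, falling back to
    # the imputed score for tokens with no hit in this document.
    totals = {
        doc: sum(per_tok.get(i, missing[i]) for i in range(len(missing)))
        for doc, per_tok in best.items()
    }
    # Highest score first; ties are broken by document id for determinism.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

For example, with `hits = [(0, "a", 0.9), (0, "a", 0.7), (1, "a", 0.5), (0, "b", 0.8)]` and `missing = [0.1, 0.2]`, document `"a"` keeps its best hit for each of the two query tokens, while document `"b"` has no hit for token 1 and receives the imputed 0.2 instead, so `"a"` ranks ahead of `"b"`.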

BibTeX

@inproceedings{scheerer2025warp,
  title = {WARP: An Efficient Engine for Multi-Vector Retrieval},
  author = {Scheerer, Jan Luca and Zaharia, Matei and Potts, Christopher and Alonso, Gustavo and Khattab, Omar},
  booktitle = {Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year = {2025}
}