A new data structure for the epsilon-approximate range emptiness problem

Author

Paul Cesaretti

Published

November 22, 2019

Constructing space-efficient data structures for answering approximate membership queries is a well-studied problem and one that is increasingly important in the era of big data. That is, given a set \(S\) of \(n\) elements from a large universe \([U] = \{ 0,2, ..., U -1 \}\), preprocess \(S\) so as to answer queries of the form, is \(x \in S\)? Since \(|S|\) is prohibitively large to fit directly into RAM, the approach is to give a sketch, or succinct summary, of the set that, while removing elements from S, gives good answers to specified queries. Hence, we allow such data structures to return a small fraction false positives in return for significant savings in space.

This talk will review a new data structure, along with it techniques and lower bounds that generalizes the functionality from single point queries to 1-D queries of intervals of length \(L\). Known as the \(\epsilon\)-approximate range emptiness problem, the paper by Mayank Goswami, Allan Grølund, Kasper Green Larsen, and Rasmus Pagh show that the space/error trade-off of a naive approach of querying a traditional Bloom filter \(L\) times cannot be improved asymptotically: Any data structure for answering approximate range emptiness queries on intervals of length up to \(L\) with false positive probability \(\epsilon\), must use space \(\Omega(n \lg(L/\epsilon)) − O(n)\) bits. While this does seem unfortunate, their data structure does answer such queries in \(O(1)\) time.