Learning to Estimate Low-frequency Tokens in Power-law Data Streams

A Bayesian nonparametric approach to count-min sketch under power-law data streams

The count-min sketch is a randomized data structure that provides estimates of tokens' frequencies in a large data stream, using a compressed representation of the data obtained by random hashing. In this paper, we rely on a recent Bayesian nonparametric (BNP) view of the count-min sketch to develop a novel learning-augmented count-min sketch under power-law data streams. We assume that tokens in the stream are drawn from an unknown discrete distribution, which is endowed with a normalized inverse Gaussian process (NIGP) prior. Then, using distributional properties of the NIGP, we compute the posterior distribution of a token's frequency in the stream, given the hashed data, and in turn the corresponding BNP estimates. Applications to synthetic and real data show that our approach achieves remarkable performance in the estimation of low-frequency tokens. This is a desirable feature in the context of natural language processing, where low-frequency tokens are common due to the power-law behaviour of the data.
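To fix ideas, the classical (non-Bayesian) count-min sketch underlying this work can be sketched as follows. This is a minimal illustration, not the paper's BNP estimator: it uses Python's built-in `hash` with per-row salts as a stand-in for the pairwise-independent hash families of the original construction, and the class and parameter names (`width`, `depth`) are ours.

```python
import random


class CountMinSketch:
    """Minimal count-min sketch: `depth` hash rows of size `width`.

    Each token is hashed into one counter per row; the point estimate is
    the minimum counter over rows, which upper-bounds the true frequency
    (collisions can only inflate counters, never decrease them).
    """

    def __init__(self, width=272, depth=5, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.depth = depth
        # One salt per row, so the rows behave like independent hash functions
        # (illustrative stand-in for a pairwise-independent family).
        self.salts = [rng.randrange(1 << 30) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, token, row):
        return hash((self.salts[row], token)) % self.width

    def update(self, token, count=1):
        """Add `count` occurrences of `token` to every row."""
        for r in range(self.depth):
            self.table[r][self._bucket(token, r)] += count

    def estimate(self, token):
        """Classical count-min estimate: minimum counter across rows."""
        return min(self.table[r][self._bucket(token, r)]
                   for r in range(self.depth))
```

The one-sided error of this estimator (it never underestimates, but hash collisions inflate counts of rare tokens) is precisely what motivates learning-augmented variants such as the BNP approach developed in the paper.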