# An Optimal Algorithm for $\ell _1$-Heavy Hitters in Insertion Streams and Related ProblemsAn Optimal Algorithm for -Heavy Hitters in Insertion Streams and Related Problems

CSA Department Indian Institute of Science
CSE Department IIT Kharagpur
School of Computer Science Carnegie Mellon University

ACM Trans. Algorithms, Vol. 15, No. 1, Article 2, Publication date: October 2018.
DOI: https://doi.org/10.1145/3264427

We give the first optimal bounds for returning the $\ell _1$-heavy hitters in a data stream of insertions, together with their approximate frequencies, closing a long line of work on this problem. For a stream of $m$ items in $\lbrace 1, 2, \ldots , n\rbrace$ and parameters , let $f_i$ denote the frequency of item $i$, i.e., the number of times item $i$ occurs in the stream. With arbitrarily large constant probability, our algorithm returns all items $i$ for which , returns no items $j$ for which , and returns approximations $\tilde{f}_i$ with for each item $i$ that it returns. Our algorithm uses $O(\varepsilon ^{-1} \log \varphi ^{-1} + \varphi ^{-1} \log n + \log \log m)$ bits of space, processes each stream update in $O(1)$ worst-case time, and can report its output in time linear in the output size. We also prove a lower bound, which implies that our algorithm is optimal up to a constant factor in its space complexity. A modification of our algorithm can be used to estimate the maximum frequency up to an additive $\varepsilon m$ error in the above amount of space, resolving Question 3 in the IITK 2006 Workshop on Algorithms for Data Streams for the case of $\ell _1$-heavy hitters. We also introduce several variants of the heavy hitters and maximum frequency problems, inspired by rank aggregation and voting schemes, and show how our techniques can be applied in such settings. Unlike the traditional heavy hitters problem, some of these variants look at comparisons between items rather than numerical values to determine the frequency of an item.

CCS Concepts: • Theory of computation → Sketching and sampling;

Additional Key Words and Phrases: Heavy hitters, streaming algorithms, voting, winner determination

ACM Reference format:
Arnab Bhattacharyya, Palash Dey, and David P. Woodruff. 2018. An Optimal Algorithm for $\ell _1$-Heavy Hitters in Insertion Streams and Related Problems. ACM Trans. Algorithms 15, 1, Article 2 (October 2018), 27 pages. https://doi.org/10.1145/3264427

## 1 INTRODUCTION

The data stream model has emerged as a standard model for processing massive data sets. Because of the sheer size of the data, traditional algorithms are no longer feasible, e.g., it may be hard or impossible to store the entire input, and algorithms need to run in linear or even sublinear time. Such algorithms typically need to be both randomized and approximate. Moreover, the data may not physically reside on any device, e.g., if it is Internet traffic, and so if the data is not stored by the algorithm, it may be impossible to recover it. Hence, many algorithms must work given only a single pass over the data. Applications of data streams include data warehousing [7, 29, 32, 35], network measurements [3, 20, 23, 28], sensor networks [9, 66], and compressed sensing [12, 31]. We refer the reader to recent surveys on the data stream model [18, 60, 61].

One of the oldest and most fundamental problems in the area of data streams is the problem of finding the $\ell _1$-heavy hitters (or simply, “heavy hitters”), also known as the top-$k$, most popular items, frequent items, elephants, or iceberg queries. Such algorithms can be used as subroutines in network flow identification at IP routers [28], association rules and frequent itemsets [2, 33, 36, 65, 69], iceberg queries and iceberg datacubes [7, 29, 32]. The survey [19] presents an overview of the state-of-the-art for this problem, from both theoretical and practical standpoints.

We now formally define the heavy hitters problem that we focus on in this article:

In the $(\varepsilon , \varphi)$-Heavy Hitters Problem, we are given parameters and a stream $a_1, \ldots , a_m$ of items $a_j \in \lbrace 1, 2, \ldots , n\rbrace .$ Let $f_i$ denote the number of occurrences of item $i$, i.e., its frequency. The algorithm should make one pass over the stream and at the end of the stream output a set $S \subseteq \lbrace 1, 2, \ldots , n\rbrace$ for which if , then $i \in S$, while if , then $i \notin S$. Further, for each item $i \in S$, the algorithm should output an estimate $\tilde{f}_i$ of the frequency $f_i$, which satisfies .

Note that other natural definitions of heavy hitters are possible and sometimes used. For example, $\ell _2$-heavy hitters are those items $i$ for which , and more generally, $\ell _p$-heavy hitters are those items $i$ for which . It is in this sense that Definition 1 corresponds to $\ell _1$-heavy hitters. While $\ell _p$-heavy hitters for $p\gt 1$ relax $\ell _1$-heavy hitters and algorithms for them have many interesting applications, we focus on the most direct and common formulation of the heavy hitters notion.

We are interested in algorithms that use as little space in bits as possible to solve the $(\varepsilon , \varphi)$-Heavy Hitters Problem. Further, we are also interested in minimizing the update time and reporting time of such algorithms. Here, the update time is defined to be the time the algorithm needs to update its data structure when processing a stream insertion. The reporting time is the time the algorithm needs to report the answer after having processed the stream. We allow the algorithm to be randomized and to succeed with probability at least $1-\delta$ for $0\lt \delta \lt 1$. We do not make any assumption on the ordering of the stream $a_1, \ldots , a_m$. This is desirable as often in applications one cannot assume a best-case or even a random order. We are also interested in the case when the length $m$ of the stream is not known in advance, and give algorithms in this more general setting.

The first algorithm for the $(\varepsilon , \varphi)$-Heavy Hitters Problem was given by Misra and Gries [55], who achieved $O(\varepsilon ^{-1} (\log n + \log m))$ bits of space for any $\varphi \gt \varepsilon$. This algorithm was rediscovered by Demaine et al. [23] and again by Karp et al. [39]. Another notable deterministic algorithm is space-saving [53], which is often implemented using randomization. Other than these algorithms, which are deterministic, there are also a number of randomized algorithms, such as the CountSketch [16], Count-Min sketch [21], sticky sampling [49], lossy counting [49], sample and hold [28], multi-stage bloom filters [15], and sketch-guided sampling [42]. Berinde et al. [6] show that using $O(k \varepsilon ^{-1} \log (mn))$ bits of space, one can achieve the stronger guarantee of reporting, for each item $i \in S$, $\tilde{f}_i$ with , where $F^{res(k)}_1 \lt m$ denotes the sum of frequencies of items in $\lbrace 1, 2, \ldots , n\rbrace$ excluding the frequencies of the $k$ most frequent items.

We emphasize that prior to our work the best known algorithms for the $(\varepsilon , \varphi)$-Heavy Hitters Problem used $O(\varepsilon ^{-1} (\log n + \log m))$ bits of space. Two previous lower bounds were known. The first is a lower bound of $\log (({n \atop 1/\varphi })) = \Omega (\varphi ^{-1} \log (\varphi n))$ bits, which comes from the fact that the output set $S$ can contain $\varphi ^{-1}$ items and it takes this many bits to encode them. The second lower bound is $\Omega (\varepsilon ^{-1})$, which follows from a folklore reduction from the randomized communication complexity of the Index problem. In this problem, there are two players, Alice and Bob. Alice has a bit string $x$ of length $(2\varepsilon)^{-1}$, while Bob has an index $i$. Alice creates a stream of length $(2\varepsilon)^{-1}$ consisting of one copy of each $j$ for which $x_j = 1$ and copies of a dummy item to fill the rest of the stream. She runs the heavy hitters streaming algorithm on her stream and sends the state of the algorithm to Bob. Bob appends $(2\varepsilon)^{-1}$ copies of the item $i$ to the stream and continues the execution of the algorithm. For , it holds that $i \in S$. Moreover, $f_i$ differs by an additive $\varepsilon m$ factor depending on whether $x_i = 1$ or $x_i = 0$. Therefore, by the randomized communication complexity of the Index problem [41], the -heavy hitters problem requires $\Omega (\varepsilon ^{-1})$ bits of space. Although this proof was for , no better lower bound is known for any $\varphi \gt \varepsilon$.

Thus, while the upper bound for the $(\varepsilon , \varphi)$-Heavy Hitters Problem is $O(\varepsilon ^{-1} (\log n + \log m))$ bits, the best-known lower bound is only $\Omega (\varphi ^{-1} \log n + \varepsilon ^{-1})$ bits. For constant $\varphi$, and $\log n \approx \varepsilon ^{-1}$, this represents a nearly quadratic gap in upper and lower bounds. Given the limited resources of devices that typically run heavy hitters algorithms, such as internet routers, this quadratic gap can be critical in applications.

A problem related to the $(\varepsilon , \varphi)$-Heavy Hitters Problem is estimating the maximum frequency in a data stream, also known as the $\ell _{\infty }$-norm. In the IITK 2006 Workshop on Algorithms for Data Streams, Open Question 3 asks for an algorithm to estimate the maximum frequency of any item up to an additive $\varepsilon m$ error using as little space as possible. The best-known space bound is still $O(\varepsilon ^{-1} \log n)$ bits, as stated in the original formulation of the question (note that the “$m$” in the question there corresponds to the “$n$” here). Note that if one can find an item whose frequency is the largest, up to an additive $\varepsilon m$ error, then one can solve this problem. The latter problem is independently interesting and corresponds to finding approximate plurality election winners in voting streams [24]. We refer to this problem as the $\varepsilon$-Maximum problem.

Finally, we note that there are many other variants of the $(\varepsilon , \varphi)$-Heavy Hitters Problem that one can consider. One simple variant of the above is to output an item of frequency within $\varepsilon m$ of the minimum frequency of any item in the universe. We refer to this as the $\varepsilon$-Minimum problem. This only makes sense for small universes, as otherwise outputting a random item typically works. This is useful when one wants to count the “number of dislikes” or in anomaly detection; see more motivation below. In other settings, one may not have numerical scores associated with the items, but rather, each stream update consists of a “ranking” or “total ordering” of all stream items. This may be the case in ranking aggregation on the web (see, e.g., References [48, 52]) or in voting streams (see, e.g., References [13, 17, 24, 73]). One may consider a variety of aggregation measures, such as the Borda score of an item $i$, which asks for the sum, over rankings, of the number of items $j \ne i$ for which $i$ is ranked ahead of $j$ in the ranking. Alternatively, one may consider the Maximin score of an item $i$, which asks for the minimum, over items $j \ne i$, of the number of rankings for which $i$ is ranked ahead of $j$. For these aggregation measures, one may be interested in finding an item whose score is an approximate maximum. This is the analogue of the $\varepsilon$-Maximum problem above. Or, one may be interested in listing all items whose score is above a threshold, which is the analogue of the $(\varepsilon , \varphi)$-Heavy Hitters Problem.

We give more motivation of these variants of heavy hitters in this section below and more precise definitions in Section 2.

### 1.1 Our Contributions

Our results are summarized in Table 1. We note that independently of this work and nearly parallelly, there have been improvements to the space complexity of the $\ell _2$-heavy hitters problem in insertion streams [11] and to the time complexity of the $\ell _1$-heavy hitters problem in turnstile streams [44].1 These works use very different techniques.

Table 1. The Bounds Hold for Constant Success Probability Algorithms and for $n$ Sufficiently Large in Terms of $\varepsilon$
Problem Space complexity
Upper bound Lower bound
$(\varepsilon , \varphi)$-Heavy hitters $O(\varepsilon ^{-1} \log \varphi ^{-1} + \varphi ^{-1} \log n + \log \log m )$ [Theorem 3 and 8] $\Omega (\varepsilon ^{-1} \log \varphi ^{-1} + \varphi ^{-1} \log n + \log \log m )$ [Theorem 10 and 15]
$\varepsilon$-Maximum and $\ell _{\infty }$-approximation $O(\varepsilon ^{-1} \log \varepsilon ^{-1} + \log n + \log \log m )$ [Theorem 2 and 8] $\Omega (\varepsilon ^{-1} \log \varepsilon ^{-1} + \log n + \log \log m )$ [Theorem 11 and 15]
$\varepsilon$-Minimum $O(\varepsilon ^{-1} \log \log \varepsilon ^{-1} + \log \log m )$ [Theorem 5 and 9] $\Omega (\varepsilon ^{-1} + \log \log m )$ [Theorem 12 and 15]
$\varepsilon$-Borda $O(n(\log \varepsilon ^{-1} + \log n) + \log \log m )$ [Theorem 6 and 9] $\Omega (n (\log \varepsilon ^{-1} + \log n) + \log \log m )$ [Theorem 13 and 15]
$\varepsilon$-Maximin $O(n \varepsilon ^{-2}\log ^2 n + \log \log m )$ [Theorem 7 and 9] $\Omega (n (\varepsilon ^{-2} + \log n) + \log \log m )$ [Theorem 14]

For the $(\varepsilon , \varphi)$-Heavy Hittersproblem and the $\varepsilon$-Maximum problem, we also achieve $O(1)$ update time and reporting time that is linear in the size of the output. The upper bound for $\varepsilon$-Borda (respectively, $\varepsilon$-Maximin) is for returning every item's Borda score (respectively, Maximin Score) up to an additive $\varepsilon mn$ (respectively, Additive $\varepsilon m$), while the lower bound for $\varepsilon$-Borda(respectively, $\varepsilon$-Maximin) is for returning only the approximate Borda score (respectively, Maximin Score) of an approximate maximum. Except for the $\varepsilon$-Minimum and $\varepsilon$-Maximin problems, the bounds are tight up to constant factors.

Our first contribution is an optimal algorithm and lower bound for the $(\varepsilon , \varphi)$-Heavy Hitters Problem. Namely, we show that there is a randomized algorithm with constant probability of success that solves this problem using

$O(\varepsilon ^{-1} \log \varphi ^{-1} + \varphi ^{-1} \log n + \log \log m)$
bits of space, and we prove a lower bound matching up to constant factors. In the unit-cost RAM model with $O(\log n)$ bit words, our algorithm has $O(1)$ update time and reporting time linear in the output size, under the standard assumptions that the length of the stream and universe size are at least $\text{poly}(\varepsilon ^{-1} \log (1/\varphi))$. Furthermore, we can achieve nearly the optimal space complexity even when the length $m$ of the stream is not known in advance. Although the results of Reference [6] achieve stronger error bounds in terms of the tail, which are useful for skewed streams, here we focus on the original formulation of the problem.

Next, we turn to the problem of estimating the maximum frequency in a data stream up to an additive $\varepsilon m$. We give an algorithm using

$O(\varepsilon ^{-1} \log \varepsilon ^{-1} + \log n + \log \log m)$
bits of space, improving the previous best algorithms, which required space at least $\Omega (\varepsilon ^{-1} \log n)$ bits, and show that our bound is tight. As an example setting of parameters, if $\varepsilon ^{-1} = \Theta (\log n)$ and $\log \log m = O(\log n)$, then our space complexity is $O(\log n \log \log n)$ bits, improving the previous $\Omega (\log ^2 n)$ bits of space algorithm. We also prove a lower bound showing our algorithm is optimal up to constant factors. This resolves Open Question 3 from the IITK 2006 Workshop on Algorithms for Data Streams in the case of insertion streams, for the case of “$\ell _1$-heavy hitters.” Our algorithm also returns the identity of the item with the approximate maximum frequency, solving the $\varepsilon$-Maximum problem.

We then focus on a number of variants of these problems. We first give nearly tight bounds for finding an item whose frequency is within $\varepsilon m$ of the minimum possible frequency. While this can be solved using our new algorithm for the $(\varepsilon , \varepsilon)$-Heavy Hitters Problem, this would incur $\Omega (\varepsilon ^{-1} \log \varepsilon ^{-1} + \log \log m)$ bits of space, whereas we give an algorithm using only $O(\varepsilon ^{-1} \log \log (\varepsilon ^{-1}) + \log \log m)$ bits of space. We also show a nearly matching $\Omega (\varepsilon ^{-1} + \log \log m)$ bits of space lower bound. We note that for this problem, a dependence on $n$ is not necessary, since if the number of possible items is sufficiently large, then outputting the identity of a random item among the first, say, $10\varepsilon ^{-1}$ items, is a correct solution with large constant probability.

Finally, we study variants of heavy hitter problems that are ranking-based. In this setting, each stream update consists of a total ordering of the $n$ universe items. For the $\varepsilon$-Borda problem, we give an algorithm using $O(n (\log \varepsilon ^{-1} + \log n) + \log \log m)$ bits of space to report the Borda score of every item up to an additive $\varepsilon m n$. We also show this is nearly optimal by proving an $\Omega (n (\log \varepsilon ^{-1} + \log n) + \log \log m)$ bit lower bound for the problem, even in the case when one is only interested in outputting an item maximum Borda score up to an additive $\varepsilon m n$. We remark that we need $\Omega (n\log n)$ space even to read an item in the stream for the $\varepsilon$-Borda and the $\varepsilon$-Maximin problems. For the $\varepsilon$-Maximin problem, we give an algorithm using $O(n \varepsilon ^{-2} \log ^2n + \log \log m)$ bits of space to report the maximin score of every item up to an additive $\varepsilon m$, and prove an $\Omega (n (\varepsilon ^{-2} + \log n) + \log \log m)$ bits of space lower bound even in the case when one is only interested in outputting the maximum maximin score up to an additive $\varepsilon m$. This shows that finding heavy hitters with respect to the maximin score is significantly more expensive than with respect to the Borda score.

### 1.2 Motivations for Variants of Heavy Hitters

While the $(\varepsilon , \varphi)$-Heavy Hitters and $\varepsilon$-Maximum problem are very well-studied in the data stream literature, the other variants introduced are not. We provide additional motivation for them here.

For the $\varepsilon$-Minimum problem, in our formulation, an item with frequency zero, i.e., one that does not occur in the stream, is a valid solution to the problem. In certain scenarios, this might not make sense, e.g., if a stream containing only a small fraction of IP addresses. However, in other scenarios, we argue this is a natural problem. For instance, consider an online portal where users register complaints about products. Here, minimum frequency items correspond to the “best” items. That is, such frequencies arise in the context of voting or more generally making a choice: in cases for which one does not have a strong preference for an item, but definitely does not like certain items, this problem applies, since the frequencies correspond to “number of dislikes.”

The $\varepsilon$-Minimum problem may also be useful for anomaly detection. Suppose one has a known set of sensors broadcasting information and one observes the “From:” field in the broadcasted packets. Sensors that send a small number of packets may be down or defective, and an algorithm for the $\varepsilon$-Minimum problem could find such sensors.

Finding items with maximum and minimum frequencies in a stream correspond to finding winners under plurality and veto voting rules, respectively, in the context of voting [10].2 The streaming aspect of voting could be crucial in applications like online polling [40], recommender systems [1, 34, 64] where the voters are providing their votes in a streaming fashion, and at every point in time, we would like to know the popular items. While in some elections, such as for political positions, the scale of the election may not be large enough to require a streaming algorithm, one key aspect of these latter voting-based problems is that they are rank-based, which is useful when numerical scores are not available. Orderings naturally arise in several applications—for instance, if a website has multiple parts, the order in which a user visits the parts given by its clickstream defines a voting, and for data mining and recommendation purposes, the website owner may be interested in aggregating the orderings across users. Motivated by this connection, we define similar problems for two other important voting rules, namely, Borda and maximin. The Borda scoring method finds its applications in a wide range of areas of artificial intelligence, for example, machine learning [14, 37, 63, 71], image processing [47, 56], information retrieval [4, 46, 62], and so on. The Maximin score is often used when the spread between the best and worst outcome is very large (see, e.g., p. 373 of Reference [59]). The maximin scoring method also has been used frequently in machine learning [38, 72], human computation [50, 51], and so on.

## 2 PRELIMINARIES

We denote the disjoint union of sets by $\sqcup$. We denote the set of all permutations of a set $\mathcal {U}$ by $\mathcal {L(U)}$. For a positive integer $\ell$, we denote the set $\lbrace 1, \ldots , \ell \rbrace$ by $[\ell ]$. In most places, we ignore floors and ceilings for the sake of notational simplicity.

### 2.1 Model of Input Data

The input data is an insertion-only stream of elements from some universe $\mathcal {U}$. In the context of voting, the input data is an insertion-only stream over the universe of all possible rankings (permutations).

### 2.2 Communication Complexity

We will use lower bounds on communication complexity of certain functions to prove space complexity lower bounds for our problems. Communication complexity of a function measures the number of bits that need to be exchanged between two players to compute a function whose input is split among those two players [74]. In a more restrictive one-way communication model, Alice, the first player, sends only one message to Bob, the second player, and Bob outputs the result. A protocol is a method that the players follow to compute certain functions of their input. Also the protocols can be randomized; in that case, the protocol needs to output correctly with probability at least $1-\delta$, for $\delta \in (0,1)$ (the probability is taken over the random coin tosses of the protocol). The randomized one-way communication complexity of a function $f$ with error probability $\delta$ is denoted by $\mathcal {R}_\delta ^{\text{1-way}}(f)$. Reference [43] is a standard reference for communication complexity.

### 2.3 Model of Computation

Our model of computation is the unit-cost RAM model on words of size $O(\log n)$, capable of generating uniformly random words and of performing arithmetic operations in $\lbrace +, -, \log _2\rbrace$ in one unit of time. We note that this model of computation has been widely used. We store an integer $C$ using a variable length array of Reference [8], which allows us to read and update $C$ in $O(1)$ time and $O(\log C)$ bits of space.

### 2.4 Universal Family of Hash Functions

A family of functions $\mathcal {H} = \lbrace h | h:A\rightarrow B \rbrace$ is called a universal family of hash functions if for all $a \ne b \in A$, , where $h$ is picked uniformly at random from $\mathcal {H}$.

We know that there exists a universal family of hash functions $\mathcal {H}$ from $[k]$ to $[\ell ]$ for every positive integer $\ell$ and every prime $k$ [45]. Moreover, $|\mathcal {H}|$, the size of $\mathcal {H}$, is $O(k^2)$.

### 2.5 Problem Definitions

We now formally define the problems we study here. Suppose we have $0 \lt \varepsilon \lt \varphi \lt 1$.

Given an insertion-only stream of length $m$ over a universe $\mathcal {U}$ of size $n$, find all items in $\mathcal {U}$ with frequency more than $\varphi m$, along with their frequencies up to an additive error of $\varepsilon m$, and report no items with frequency less than $(\varphi -\varepsilon)m$.

Given an insertion-only stream of length $m$ over a universe $\mathcal {U}$ of size $n$, find the maximum frequency up to an additive error of $\varepsilon m$.

Next, we define the minimum problem for $0 \lt \varepsilon \lt 1$.

Given an insertion-only stream of length $m$ over a universe $\mathcal {U}$ of size $n$, find the minimum frequency up to an additive error of $\varepsilon m$.

Next, we define related heavy hitters problems in the context of rank aggregation. The input is a stream of rankings (permutations) over an item set $\mathcal {U}$ for the problems below. The Borda score of an item $i$ is the sum, over all rankings, of the number of items $j \ne i$ for which $i$ is ranked ahead of $j$ in the ranking.

Given an insertion-only stream over a universe $\mathcal {L}(\mathcal {U})$ where $|\mathcal {U}|=n$, find all items with Borda score more than $\varphi mn$, along with their Borda score up to an additive error of $\varepsilon mn$, and report no items with Borda score less than $(\varphi -\varepsilon)mn$.

Given an insertion-only stream over a universe $\mathcal {L}(\mathcal {U})$ where $|\mathcal {U}|=n$, find the maximum Borda score up to an additive error of $\varepsilon mn$.

The maximin score of an item $i$ is the minimum, over all items $j \ne i$, of the number of rankings for which $i$ is ranked ahead of $j$.

Given an insertion-only stream over a universe $\mathcal {L}(\mathcal {U})$ where $|\mathcal {U}|=n$, find all items with maximin score more than $\varphi m$ along with their maximin score up to an additive error of $\varepsilon m$, and report no items with maximin score less than $(\varphi -\varepsilon)m$.

Given an insertion-only stream over a universe $\mathcal {L}(\mathcal {U})$ where $|\mathcal {U}|=n$, find the maximum maximin score up to an additive error of $\varepsilon m$.

Notice that the maximum possible Borda score of an item is $m(n-1) = \Theta (mn)$ and the maximum possible maximin score of an item is $m$. This justifies the approximation factors in Definitions 6 to 9. We note that finding an item with maximum Borda score within additive $\varepsilon m n$ or maximum maximin score within additive $\varepsilon m$ corresponds to finding an approximate winner of an election (more precisely, what is known as an $\varepsilon$-winner) [24, 25].

Let $r$ be a voting rule that is a function that takes as input a multi-set of rankings over a set of candidates and selects one candidate as the winner. Given a finite set ${\mathcal {C}}$ of candidates and a multi-set ${\mathcal {V}}$ of rankings over ${\mathcal {C}}$, a candidate $x\in {\mathcal {C}}$ is called an $\varepsilon$-winner () if $x$ can be made the winner (under the voting rule $r$) by changing at most $\varepsilon$ fraction of the rankings in ${\mathcal {V}}$.

## 3 ALGORITHMS

In this section, we present our upper bound results. Before describing specific algorithms, we record some claims for later use. Lemma 1 follows by checking whether we get all heads in $\log m$ tosses of a fair coin.

Suppose $m$ is a power of two.3 Then there is an algorithm $\mathcal {A}$ for choosing an item with probability that has space complexity of $O(\log \log m)$ bits and time complexity of $O(1)$ in the unit-cost RAM model.

We flip a fair coin $(\log _2 m)$ times and choose the item only if all the flips comes up “Heads.” We observe that we need only $O(\log \log m)$ bits to ensure that we toss at most $(\log _2 m)$ times.

We remark that the algorithm in Lemma 1 has optimal space complexity as shown in Proposition 2 in Appendix B.

Our second claim is a standard result for universal families of hash functions.

For $S\subseteq A$, $\delta \in (0,1)$, and universal family of hash functions :

For every $i\ne j \in S$, since $\mathcal {H}$ is a universal family of hash functions, we have . Now, we apply the union bound to get .

Our third claim is folklore and also follows from the celebrated DKW inequality [26]. We present a special case of the DKW inequality, which we will use to prove Lemma 3.

Let $\ell$ be a positive integer, $X\subseteq [\ell ]$ a finite multi-set, and $S$ a random sample from $X$ of size $k$. For $i\in [\ell ]$, let $f_i$ and $\hat{f}_i$ denote the frequency of $i$ in $X$ and $S$, respectively. Then, we have the following for any $\varepsilon \gt 0$:

Let $f_i$ and $\hat{f}_i$ be the frequencies of an item $i$ in a stream $\mathcal {S}$ and in a random sample $\mathcal {T}$ of size $r$ from $\mathcal {S}$, respectively. Then, for , with probability at least $1-\delta$, for every universe item $i$ simultaneously,

Follows immediately from Theorem 1.

For now, assume that the length of the stream is known in advance; we show in Section 3.5 how to remove this assumption.

### 3.1 List Heavy Hitters

For the $(\varepsilon ,\varphi)$-List heavy hitters problem, we present two algorithms. The first is slightly suboptimal, but simple conceptually and already constitutes a very large improvement in the space complexity over known algorithms. We expect that this algorithm could be useful in practice as well. The second algorithm is more complicated, building on ideas from the first algorithm, and achieves the optimal space complexity up to constant factors.

We note that both algorithms proceed by sampling $O(\varepsilon ^{-2} \log (1/\delta))$ stream items and updating a data structure as the stream progresses. In both cases, the time to update the data structure is bounded by $O(1/\varepsilon)$, and so, under the standard assumption that the length of the stream is at least $\textrm {poly}(\log (1/\delta) \varepsilon)$, the time to perform this update can be spread out across the next $O(1/\varepsilon)$ stream updates, since with large probability there will be no items sampled among these next $O(1/\varepsilon)$ stream updates. Therefore, we achieve worst-case update time of $O(1)$.4

3.1.1 A Simpler, Near-Optimal Algorithm.

Assume the stream length is known beforehand. Then there is a randomized one-pass algorithm $\mathcal {A}$ for the $(\varepsilon , \varphi)$-List heavy hitters problem that succeeds with probability at least $1-\delta$ using $O(\varepsilon ^{-1}(\log \varepsilon ^{-1} + \log \log \delta ^{-1}) + \varphi ^{-1}\log n + \log \log m )$ bits of space. Moreover, $\mathcal {A}$ has an update time of $O(1)$ and reporting time linear in its output size.

Overview. The overall idea is as follows. We sample $\ell =O(\varepsilon ^{-2})$ many items from the stream uniformly at random as well as hash the ids (the word “id” is short for identifier) of the sampled elements into a space of size $O(\varepsilon ^{-4})$. Now, both the stream length as well as the universe size are $\text{poly}(\varepsilon ^{-1})$. From Lemma 3, it suffices to solve the heavy hitters problem on the sampled stream. From Lemma 2, because the hash function is chosen from a universal family, the sampled elements have distinct hashed id's. We can then feed these elements into a standard Misra-Gries data structure with $\varepsilon ^{-1}$ counters, incurring space $O(\varepsilon ^{-1} \log \varepsilon ^{-1})$. Because we want to return the unhashed element id's for the heavy hitters, we also use $\log n$ space for recording the $\varphi ^{-1}$ top items according to the Misra-Gries data structure and output these when asked to report.

The pseudocode of our $(\varepsilon , \varphi)$-List heavy hitters algorithm is in Algorithm 1. By Lemma 3, if we select a subset $\mathcal {S}$ of size at least $\ell = 6\varepsilon ^{-2}{\log ({6}\delta ^{-1})}$ uniformly at random from the stream, then , where $f_i$ and $\hat{f_i}$ are the frequencies of item $i$ in the input stream and $\mathcal {S}$, respectively. First, we show that with the choice of $p$ in line 10 in Algorithm 1, the number of items sampled is at least $\ell$ and at most $11\ell$ with probability at least . Let $X_i$ be the indicator random variable of the event that the item $x_i$ is sampled for $i\in [m]$. Then the total number of items sampled $X = \sum _{i=1}^m X_i$. We have , since . Now, we have the following:

The inequality follows from the Chernoff bound and the value of $\ell$. From here onwards, we assume that the number of items sampled is in $[\ell , 11\ell ]$.

We use (a modified version of) the Misra-Gries algorithm [55] to estimate the frequencies of items in $\mathcal {S}$. The length of the table in the Misra-Gries algorithm is $\varepsilon ^{-1}$. We pick a hash function $h$ uniformly at random from a universal family of hash functions of size $|\mathcal {H}| = O(n^2)$. Note that picking a hash function $h$ uniformly at random from $\mathcal {H}$ can be done using $O(\log n)$ bits of space. Lemma 2 shows that there are no collisions in $\mathcal {S}$ under this hash function $h$ with probability at least $1-{\delta }/{3}$. From here onwards, we assume that there is no collision among the ids of the sampled items under the hash function $h$.

We modify the Misra-Gries algorithm as follows. Instead of storing the id of any item $x$ in the Misra-Gries table (table $\mathcal {T}_1$ in line 5 in Algorithm 1), we only store the hash $h(x)$ of the id $x$. We also store the ids (not the hash of the id) of the items with highest values in $\mathcal {T}_1$ in another table $\mathcal {T}_2$. Moreover, we always maintain the table $\mathcal {T}_2$ consistent with the table $\mathcal {T}_1$ in the sense that the $i{\text{th}}$ highest valued key in $\mathcal {T}_1$ is the hash of the $i{\text{th}}$ id in $\mathcal {T}_2$.

Upon picking an item $x$ with probability $p$, we create an entry corresponding to $h(x)$ in $\mathcal {T}_1$ and make its value one if there is space available in $\mathcal {T}_1$; decrement the value of every item in $\mathcal {T}_1$ by one if the table is already full; increment the entry in the table corresponding to $h(x)$ if $h(x)$ is already present in the table. When we decrement the value of every item in $\mathcal {T}_1$, the table $\mathcal {T}_2$ remains consistent, and we do not need to do anything else. Otherwise, there are three cases to consider. Case 1: $h(x)$ is not among the highest valued items in $\mathcal {T}_1$. In this case, we do not need to do anything else. Case 2: $h(x)$ was not among the highest valued items in $\mathcal {T}_1$ but now it is among the highest valued items in $\mathcal {T}_1$. In this case, the last item $y$ in $\mathcal {T}_2$ is no longer among the highest valued items in $\mathcal {T}_1$. We replace $y$ with $x$ in $\mathcal {T}_2$. Case 3: $h(x)$ was among the highest valued items in $\mathcal {T}_1$. When the stream finishes, we output the ids of all the items in table $\mathcal {T}_2$ along with the values corresponding to them in table $\mathcal {T}_1$. Correctness follows from the correctness of the Misra-Gries algorithm and the fact that there is no collision among the ids of the sampled items.

3.1.2 An Optimal Algorithm.

Assume the stream length is known beforehand. Then there is a randomized one-pass algorithm $\mathcal {A}$ for the $(\varepsilon , \varphi)$- List heavy hitters problem that succeeds with constant probability using $O(\varepsilon ^{-1}\log \varphi ^{-1} + \varphi ^{-1}\log n + \log \log m)$ bits of space. Moreover, $\mathcal {A}$ has an update time of $O(1)$ and reporting time linear in its output size.

Note that in this section, for the sake of simplicity, we ignore floors and ceilings and state the results for a constant error probability, omitting the explicit dependence on $\delta$.

Overview. As in the simpler algorithm, we sample $\ell = O(\varepsilon ^{-2})$ many stream elements and solve the $(\varepsilon /2,\varphi)$-List heavy hitters problem on this sampled stream. Also, the Misra-Gries algorithm for $(\varphi /2, \varphi)$-List heavy hitters returns a candidate set of $O(\varphi ^{-1})$ items containing all items of frequency at least $\varphi \ell$. It remains to count the frequencies of these $O(\varphi ^{-1})$ items with up to $\varepsilon \ell /2 = O(\varepsilon ^{-1})$ additive error, so that we can remove those whose frequency is less than $(\varphi - \varepsilon /2)\ell$.

Fix some item $i \in [n]$, and let $f_i$ be $i$’s count in the sampled stream. A natural approach to count $f_i$ approximately is to increment a counter probabilistically, instead of deterministically, at every occurrence of $i$. Suppose that we increment a counter with probability whenever item $i$ arrives in the stream. Let the value of the counter be $\hat{c}_i$, and let $\hat{f}_i = \hat{c}_i/p_i$. We see that and . It follows that if $p_i = \Theta (\varepsilon ^2 f_i)$, then $\mathsf {Var}[\hat{f}_i] = O(\varepsilon ^{-2})$, and hence, $\hat{f}_i$ is an unbiased estimator of $f_i$ with additive error $O(\varepsilon ^{-1})$ with constant probability. We call such a counter an accelerated counter as the probability of incrementing accelerates with increasing counts. For each $i$, we can maintain $O(\log \varphi ^{-1})$ accelerated counters independently and take their median to drive the probability of deviating by more than $O(\varepsilon ^{-1})$ down to $O(\varphi)$. So, with constant probability, the frequency for each of the $O(\varphi ^{-1})$ items in the Misra-Gries data structure is estimated within $O(\varepsilon ^{-1})$ error, as desired.

However, there are two immediate issues with this approach. The first problem is that we may need to keep counts for $\Omega (\ell) = \Omega (\varepsilon ^{-2})$ distinct items, which is too costly for our purposes. To get around this, we use a hash function from a universal family to hash the universe to a space of size $u = \Theta (\varepsilon ^{-1})$, and we work throughout with the hashed id's. We can then show that the space complexity for each iteration is $O(\varepsilon ^{-1})$. Also, the accelerated counters now estimate frequencies of hashed id's instead of actual items, but because of universality, the expected frequency of any hashed id is $\ell /u = O(\varepsilon ^{-1})$, our desired error bound.

The second issue is that we need a constant factor approximation of $f_i$, so that we can set $p_i$ to $\Theta (\varepsilon ^2 f_i)$. But because the algorithm needs to be one-pass, we cannot first compute $p_i$ in one pass and then run the accelerated counter in another. So, we divide the stream into epochs in which $f_i$ stays within a factor of 2, and use a different $p_i$ for each epoch. In particular, set $p_i^t = \varepsilon \cdot 2^{t}$ for . We want to keep a running estimate of $i$’s count to within a factor of 2 to know if the current epoch should be incremented. For this, we subsample each element of the stream with probability $\varepsilon$ independently and maintain exact counts for the observed hashed id's. It is easy to see that this requires only $O(\varepsilon ^{-1})$ bits in expectation. Consider any $i \in [u]$ and the prefix of the stream up to , and let $f_i(b)$ be $i$’s frequency in the prefix, let $\bar{c}_i(b)$ be $i$’s frequency among the samples in the prefix, and $\bar{f}_i(b) = \frac{\bar{c}_i(b)}{\varepsilon }$. We see that , Moreover, we show that for any $b \in [\ell ]$, $\bar{f}_i(b)$ is a 4-factor approximation of $f_i(b)$ with constant probability. By repeating $O(\log \varphi ^{-1})$ times independently and taking the median, the error probability can be driven down to $O(\varphi)$.

Now, for every hashed id $i \in [u]$, we need not one accelerated counter but $O(\log (\varepsilon f_i))$ many, one corresponding to each epoch $t$. When an element with hash id $i$ arrives at position $b$, we decide, based on $\bar{f}_i(b)$, the epoch $t$ it belongs to and then increment the $t$th accelerated counter with probability $p_i^t$. The storage cost over all $i$ is still $O(1/\varepsilon)$. Also, we iterate the whole set of accelerated counters $O(\log \varphi ^{-1})$ times, making the total storage cost $O(\varepsilon ^{-1}\log \varphi ^{-1})$.

Let $\hat{c}_{i,t}$ be the count in the accelerated counter for hash id $i$ and epoch $t$. Then, let $\hat{f}_i = \sum _t {\hat{c}_{i,t}}/{p_i^t}$. Clearly, . The variance is $O(\varepsilon ^{-2})$ in each epoch, and so, $\mathsf {Var}[\hat{f}_i]=O(\varepsilon ^{-2} \log \varepsilon ^{-1})$, not $O(\varepsilon ^{-2})$, which we wanted. This issue is fixed by a change in how the sampling probabilities are defined. We now go on to the formal proof.

Pseudocode appears in Algorithm 2. Note that the numerical constants are chosen for convenience of analysis and have not been optimized. Also, for the sake of simplicity, the pseudocode does not have the optimal reporting time, but it can be modified to achieve this; see the end of this proof for details.

By standard Chernoff bounds, with probability at least $99/100$, the length of the sampled stream . For $x \in [n]$, let $f_{\text{samp}}(x)$ be the frequency of $x$ in the sampled stream. By Lemma 3, with probability at least $9/10$, for all $x \in [n]$,

Now, fix $j \in [10 \log \varphi ^{-1}]$ and $x \in [n]$. Let $i = h_j(x)$ and $f_i = \sum _{y: h_j(y) = h_j(x)} f_{\text{samp}}(y)$. Then, for a random $h_j \in \mathcal {H}$, the expected value of $\frac{f_i}{s} - \frac{f_{\text{samp}}(x)}{s}$ is $\frac{\varepsilon }{100}$, since $\mathcal {H}$ is a universal mapping to a space of size $100\varepsilon ^{-1}$. Hence, using Markov's inequality and the above,

(1)

In Lemma 4 below, we show that for each $j \in [200 \log (12\varphi ^{-1})]$, with error probability at most $3/10$, $\hat{f}_j(x)$ (in line 23) estimates $f_i$ with additive error at most $5000\varepsilon ^{-1}$, hence estimating $\frac{f_i}{s}$ with additive error at most $\frac{\varepsilon }{2}$. Taking the median over $200 \log (12\varphi ^{-1})$ repetitions (line 24) makes the error probability go down to $\frac{\varphi }{6}$ using standard Chernoff bounds. Hence, by the union bound, with probability at least $2/3$, for each of the $2/\varphi$ keys $x$ with nonzero values in $\mathcal {T}_1$, we have an estimate of $\frac{f(x)}{m}$ within additive error $\varepsilon$, thus showing correctness.

Fix $x \in [n]$ and $j \in [200 \log 12\varphi ^{-1}]$, and let $i = h_j(x)$. Then, , where $\hat{f}_j$ is the quantity computed in line 23.

Index the sampled stream elements $1, 2, \dots , s$, and for $b \in [s]$, let $f_i(b)$ be the frequency of items with hash id $i$ restricted to the first $b$ elements of the sampled stream. Let $\bar{f}_i(b)$ denote the value of $\mathcal {T}_2[i,j]\cdot \varepsilon ^{-1}$ after the procedure Insert has been called for the first $b$ items of the sampled stream.

With probability at least $9/10$, for all $b \in [s]$ such that , $\bar{f}_i(b)$ is within a factor of 4 of $f_i(b)$.

Fix $b \in [s]$. Note that as $\mathcal {T}_2$ is incremented with rate $\varepsilon$. , and so by Chebyshev's inequality,

$\Pr [|\bar{f}_i(b) - f_i(b)| \gt f_i(b)/2] \lt \frac{4}{f_i(b) \varepsilon }.$
We now break the stream into chunks, apply this inequality to each chunk, and then take a union bound to conclude. Namely, for any integer , define $b_t$ to be the first $b$ such that if such a $b$ exists. Then,

So, with probability at least $9/10$, every $\bar{f}_i(b_t)$ and $f_i(b_t)$ are within a factor of 2 of each other for every . Since for every there exists a (namely the $t$ such that ) such that $f_i(b)$ is within a factor of ${2}$ from $f_i(b_t)$, the claim follows.

Assume the event in Claim 1 henceforth. Now, we are ready to analyze $\mathcal {T}_3$ and, in particular, $\hat{f}_j(x)$. First, observe that if $t\lt 0$ in line 15, at some position $b$ in the stream, then $\mathcal {T}_2[i,j]$ at that time must be at most 1,000, and so by standard Markov and Chernoff bounds, with probability at least 0.85,

(2)
Assume this event. Then, .

If the stream element at position $b$ causes an increment in $\mathcal {T}_3$ with probability $\varepsilon 2^t$ (in line 17), then , and so, . This must be the case for the highest $b = \bar{b}_t$ at which the count for $i$ in $\mathcal {T}_3$ increments at the $t$th slot. The number of such occurrences of $i$ is at most by Claim 1 (which can be applied, since $f_i(b) \gt 100\varepsilon ^{-1}$ by Equation (2)). So,

Elements inserted with probability 1 obviously do not contribute to the variance.

So, conditioning on the events mentioned, the probability that $\hat{f}_j(x)$ deviates from $f_i$ by more than $5{,}000\varepsilon ^{-1}$ is at most $1/50$. Removing all the conditioning yields what we wanted:

We next bound the space complexity.

With probability at least $2/3$, Algorithm 2 uses $O(\varepsilon ^{-1} \log \varphi ^{-1} + \varphi ^{-1} \log n + \log \log m)$ bits of storage, if $n = \omega (\varepsilon ^{-1})$.

The expected length of the sampled stream is $\ell = O(\varepsilon ^{-2})$. So, the number of bits stored in $\mathcal {T}_1$ is $O(\varphi ^{-1} \log n)$. For $\mathcal {T}_2$, note that in lines 13–15, for any given $j$, $\mathcal {T}_2$ is storing a total of $\varepsilon \ell = O(\varepsilon ^{-1})$ elements in expectation. So, for , there can be at most $O((\varepsilon 2^k)^{-1})$ hashed id's with counts between $2^k$ and $2^{k+1}$. Summing over all $k$’s and accounting for the empty cells gives $O(\varepsilon ^{-1})$ bits of storage, and so the total space requirement of $\mathcal {T}_2$ is $O(\varepsilon ^{-1} \log \varphi ^{-1})$.

The probability that a hashed id $i$ gets counted in table $\mathcal {T}_3$ is at most $10^{-6}\varepsilon ^3 \bar{f}_i^2(s)$ from line 15 and our definition of $\bar{f}_i$ above. Moreover, from Claim 1, we have that this is at most $16 \cdot 10^{-6} \varepsilon ^3 {f}_i^2(s)$ if $f_i \gt 100 \varepsilon ^{-1}$. Therefore, if $f_i = 2^k\cdot 100 \varepsilon ^{-1}$ with , then the expected value of a cell in $\mathcal {T}_3$ with first coordinate $i$ is at most $1{,}600\cdot 2^{2k} \varepsilon = 2^{O(k)}$. Taking into account that there are at most $O((\varepsilon 2^k)^{-1})$ many such id's $i$ and that the number of epochs $t$ associated with such an $i$ is at most $\log (16 \cdot 10^{-6} \varepsilon ^2 {f}_i^2) = O(\log (\varepsilon f_i)) = O(k)$ (from line 15), we get that the total space required for $\mathcal {T}_3$ is

\begin{align*} &\sum _{j = 1}^{O(\log \varphi ^{-1})} \left(O(\varepsilon ^{-1}) + \sum _{k=0}^\infty O((\varepsilon 2^k)^{-1}) \cdot O(k) \cdot O(k)\right) \\ &= O(\varepsilon ^{-1} \log \varphi ^{-1}), \end{align*}
where the first $O(\varepsilon ^{-1})$ term inside the summation is for the $i$’s with $f_i \lt 100 \varepsilon ^{-1}$. Since we have an expected space bound, we obtain a worst-case space bound with error probability $1/3$ by a Markov bound.

The space required for sampling is an additional $O(\log \log m)$, using Lemma 1.

We note that the space bound can be made worst case by aborting the algorithm if it tries to use more space.

The only remaining aspect of Theorem 3 is the time complexity. As observed in Section 3.1, the update time can be made $O(1)$ per insertion under the standard assumption of the stream being sufficiently long. The reporting time can also be made linear in the output by changing the bookkeeping a bit. Instead of computing $\hat{f}_j$ and $\hat{f}$ at reporting time, we can maintain them after every insertion. Although this apparently makes INSERT costlier, this is not true, in fact, because we can spread the cost over future stream insertions. The space complexity grows by a constant factor.

### 3.2 Maximum

By tweaking Algorithm 1 slightly, we get the following result for the $\varepsilon$-Maximum problem.

Assume the length of the stream is known beforehand. Then there is a randomized one-pass algorithm $\mathcal {A}$ for the $\varepsilon$- Maximum problem that succeeds with probability at least $1-\delta$ using bits of space. Moreover, the algorithm $\mathcal {A}$ has an update time of $O(1)$.

Instead of maintaining the table $\mathcal {T}_2$ in Algorithm 1, we just store the actual id of the item with maximum frequency in the sampled items.

### 3.3 Minimum

Assume the length of the stream is known beforehand. Then there is a randomized one-pass algorithm $\mathcal {A}$ for the $\varepsilon$-Minimum problem that succeeds with probability at least $1-\delta$ using bits of space. Moreover, the algorithm $\mathcal {A}$ has an update time of $O(1)$.

Overview. Pseudocode is provided in Algorithm 3. The idea behind our $\varepsilon$-Minimum problem is as follows. It is most easily explained by looking at the REPORT(x) procedure starting in line 13. In lines 14 and 15, we ask, Is the universe size $|U|$ significantly larger than $1/\varepsilon$? Note that if it is, then outputting a random item from $|U|$ is likely to be a solution. Otherwise, $|U|$ is $O(1/\varepsilon)$.

The next point is that if the number of distinct elements in the stream were smaller than $1/(\varepsilon \log (1/\varepsilon))$, then we could just store all the items together with their frequencies with $O(1/\varepsilon)$ bits of space. Indeed, we can first sample $O(1/\varepsilon ^2)$ stream elements so that all relative frequencies are preserved up to additive $\varepsilon$, thereby ensuring each frequency can be stored with $O(\log (1/\varepsilon))$ bits. Also, since the universe size is $O(1/\varepsilon)$, the item identifiers can also be stored with $O(\log (1/\varepsilon))$ bits. So if this part of the algorithm starts taking up too much space, we stop, and we know the number of distinct elements is at least $1/(\varepsilon \log (1/\varepsilon))$, which means that the minimum frequency is at most $O(m \varepsilon \log (1/\varepsilon))$. This is what is being implemented in steps 9-10 and 18-19 in the algorithm.

We can also ensure the minimum frequency is at least $\Omega (m \varepsilon / \log (1/\varepsilon))$. Indeed, by randomly sampling $O(\log (1/\varepsilon)/\varepsilon)$ stream elements, and maintaining a bit vector for whether or not each item in the universe occurs—which we can with $O(1/\varepsilon)$ bits of space, since $|U| = O(1/\varepsilon)$—any item with frequency at least $\Omega (\varepsilon m/ \log (1/\varepsilon))$ will be sampled and so if there is an entry in the bit vector that is empty, then we can just output that as our solution. This is what is being implemented in steps 8 and 16–17 of the algorithm.

Finally, we now know that the minimum frequency is at least $\Omega (m \varepsilon / \log (1/\varepsilon))$ and at most $O(m \varepsilon \log (1/\varepsilon))$. At this point if we randomly sample $O((\log ^6 1/\varepsilon)/\varepsilon)$ stream elements, then by Chernoff bounds all item frequencies are preserved up to a relative error factor of $(1 \pm 1/\log ^2 (1/\varepsilon))$, and in particular the relative minimum frequency is guaranteed to be preserved up to an additive $\varepsilon$. At this point, we just maintain the exact counts in the sampled stream but truncate them once they exceed $\textrm {poly}(\log (1/\varepsilon)))$ bits, since we know such counts do not correspond to the minimum. Thus, we only need $O(\log \log (1/\varepsilon))$ bits to represent their counts. This is implemented in step 11 and step 20 of the algorithm.

The pseudocode of our $\varepsilon$-Minimum algorithm is in Algorithm 3. If the size of the universe $|\mathcal {U}|$ is at least , then we return an item $x$ chosen from $\mathcal {U}$ uniformly at random. Note that there can be at most many items with frequency at least $\varepsilon m$. Hence, every item $x$ among other remaining many items has frequency less than $\varepsilon m$ and thus is a correct output of the instance. Thus, the probability that we answer correctly is at least $(1-\delta)$. From here on, let us assume .

Now, by the value of $p_j$, it follows from the proof of Theorem 2 that we can assume $\ell _j \lt |\mathcal {S}_j| \lt 11\ell _j$ for $j = 1, 2, 3$, which happens with probability at least . We first show that every item in $\mathcal {U}$ with frequency at least $\varepsilon m$ is sampled in $\mathcal {S}_1$ with probability at least . For that, let $X_i^j$ be the indicator random variable for the event that the $j{\text{th}}$ sample in $\mathcal {S}_1$ is item $i$ where $i\in \mathcal {U}$ is an item with frequency at least $\varepsilon m$. Let $\mathcal {H}\subset \mathcal {U}$ be the set of items with frequencies at least $\varepsilon m$. Then, we have the following:

Now, applying union bound, we get the following:
Hence, with probability at least , the output at line 17 is correct. Now, we show below that if the frequency of any item $x\in \mathcal {U}$ is at most , then $x\in \mathcal {S}_1$ with probability at least :

Hence, from here onwards, we assume that the frequency of every item in $\mathcal {U}$ is at least .

If the number of distinct elements is at most , then line 19 outputs the minimum frequency item up to an additive factor of $\varepsilon m$ due to Chernoff bound. Note that we need only bits of space for storing ids. Hence, $\mathcal {S}_2$ can be stored in space .

Now, we can assume that the number of distinct elements is at least . Hence, if $f(t)$ is the frequency of the item $t$ with minimum frequency, then we have .

Let $f_i$ be the frequency of item $i\in \mathcal {U}$, $e_i$ be the counter value of $i$ in $\mathcal {S}_3$, and . Now, again by applying Chernoff bound, we have the following for any fixed $i\in \mathcal {U}$:

Now, applying union bound, we get the following using the fact that :
Again, by applying Chernoff bound and union bound, we get the following:
Hence, the items with frequency more than are approximated up to a multiplicative factor of from below in $\mathcal {S}_3$. The counters of these items may be truncated. The other items with frequency at most are approximated up to relative error and thus up to an additive error of . The counters of these items would not get truncated. Hence, the item with minimum counter value in $\mathcal {S}_3$ is the item with minimum frequency up to an additive $\varepsilon m$.

We need bits of space for the bit vector $\mathcal {B}_1$ for the set $\mathcal {S}_1$. We need bits of space for the set $\mathcal {S}_2$ and bits of space for the set $\mathcal {S}_3$ (by the choice of truncation threshold). We need an additional $O\left(\log \log m \right)$ bits of space for sampling using Lemma 1. Moreover, using the data structure of Section 3.3 of [23] Algorithm 3 can be performed in $O(1)$ time. Alternatively, we may also use the strategy described in Section 3.1 of spreading update operations over several insertions to make the cost per insertion be $O(1)$.

### 3.4 Borda and Maximin

Assume the length of the stream is known beforehand. Then there is a randomized one-pass algorithm $\mathcal {A}$ for $(\varepsilon , \varphi)$- List Borda problem that succeeds with probability at least $1-\delta$ using $O(n(\log n + \log \frac{1}{\varepsilon } + \log \log \frac{1}{\delta } ) + \log \log m)$ bits of space.

Let $\ell = 6\varepsilon ^{-2} \log (6n\delta ^{-1})$ and . On each insertion of a vote $v$, select $v$ with probability $p$ and store for every $i \in [n]$, the number of candidates that candidate $i$ beats in the vote $v$. Keep these exact counts in a counter of length $n$.

Then it follows from the proof of Theorem 2 that with probability at least $(1-{\delta }/{3})$. Moreover, from a straightforward application of the Chernoff bound (see Reference [24]), it follows that if $\hat{s}(i)$ denotes the Borda score of candidate $i$ restricted to the sampled votes, then

$\Pr \left[\forall i \in [n], \left|\frac{m}{|\mathcal {S}|} \hat{s}(i) - s(i)\right| \lt \varepsilon m n\right] \gt 1-\delta .$

The space complexity for exactly storing the counts is $O(n \log (n \ell)) = O(n (\log n + \log \varepsilon ^{-1} + \log \log \delta ^{-1}))$ and the space for sampling the votes is $O(\log \log m)$ by Lemma 1.

Assume the length of the stream is known beforehand. Then there is a randomized one-pass algorithm $\mathcal {A}$ for the $(\varepsilon , \varphi)$- List maximin problem that succeeds with probability at least $1-\delta$ using $O(n \varepsilon ^{-2} \log ^2 n + n\varepsilon ^{-2} \log n \log \delta ^{-1} + \log \log m )$ bits of space.

Let and . We put the current vote in a set $\mathcal {S}$ with probability $p$. Then it follows from the proof of Theorem 2 that with probability at least . Suppose $|\mathcal {S}| = \ell _1$; let $\mathcal {S} = \lbrace v_i : i\in [\ell _1] \rbrace$ be the set of votes sampled. Let $D_\mathcal {E}(x,y)$ be the total number of votes in which $x$ beats $y$ and $D_\mathcal {S}(x, y)$) be the number of such votes in $\mathcal {S}$. Then by the choice of $\ell$ and the Chernoff bound (see Reference [24]), it follows that for every pair of candidates $x, y \in \mathcal {U}$. Note that each vote can be stored in $O(n\log n)$ bits of space. Hence, simply finding $D_\mathcal {S}(x,y)$ for every $x, y\in \mathcal {U}$ by storing $\mathcal {S}$ and returning all the items with maximin score at least $(\varphi - \varepsilon /2) \ell _1$ in $\mathcal {S}$ requires $O(n\varepsilon ^{-2} \log n (\log n + \log \delta ^{-1}) + \log \log m )$ bits of memory, with the additive $O(\log \log m)$ due to Lemma 1.

### 3.5 Unknown Stream Length

Now we consider the case when the length of the stream is not known beforehand. We present below an algorithm for $(\varepsilon , \varphi)$-List heavy hitters and $\varepsilon$-Maximum problems in the setting where the length of the stream is not known beforehand.

There is a randomized one-pass algorithm for the $(\varepsilon , \varphi)$-List heavy hitters and $\varepsilon$-Maximum problems with space complexity $O({\varepsilon ^{-1}}\log \varepsilon ^{-1} + \varphi ^{-1}\log n + \log \log m )$ bits and update time $O(1)$ even when the length of the stream is not known beforehand.

We describe below a randomized one-pass algorithm for the $(8\varepsilon , \varphi)$-List heavy hitters problem. We may assume that the length of the stream is at least ; otherwise, we use the algorithm in Theorem 2 and get the result. Now, we guess the length of the stream to be , but run an instance $\mathcal {I}_1$ of Algorithm 1 with at line 2. By the choice of the size of the sample (which is ), $\mathcal {I}_1$ outputs correctly with probability at least $(1-\delta)$, if the length of the stream is in . If the length of the stream exceeds , then we run another instance $\mathcal {I}_2$ of Algorithm 1 with at line 2. Again by the choice of the size of the sample, $\mathcal {I}_2$ outputs correctly with probability at least $(1-\delta)$, if the length of the stream is in . If the stream length exceeds , then we discard $\mathcal {I}_1$, free the space it uses, and run an instance $\mathcal {I}_3$ of Algorithm 1 with at line 2 and so on. At any point of time, we have at most two instances of Algorithm 1 running. When the stream ends, we return the output of the older of the instances we are currently running. We use the approximate counting method of Morris [58] to approximately count the length of the stream. We know that the Morris counter outputs correctly with probability using $O(\log \log m + k)$ bits of space at any point in time [30]. Also, since the Morris counter increases only when an item is read, it outputs correctly up to a factor of four at every position if it outputs correctly at positions $1, 2, 4, \ldots , 2^{\lfloor \log _2 m \rfloor }$; call this event $E$. Then we have by choosing and applying union bound over the positions $1, 2, 4, \ldots , 2^{\lfloor \log _2 m \rfloor }$. The correctness of the algorithm follows from the correctness of Algorithm 1 and the fact that we are discarding at most $\varepsilon m$ many items in the stream (by discarding a run of an instance of Algorithm 1). The space complexity and the $O(1)$ update time of the algorithm follow from Theorem 2, the choice of $k$ above, and the fact that we have at most two instances of Algorithm 1 currently running at any point of time.

The algorithm for the $\varepsilon$-Maximum problem is same as the algorithm above, except we use the algorithm in Theorem 4 instead of Algorithm 1.

Note that this proof technique does not seem to apply to our optimal Algorithm 2. Similar to Theorem 8, we get the following result for the other problems.

There are randomized one-pass algorithms for the $\varepsilon$- Minimum , $(\varepsilon ,\varphi)$- Borda , and $(\varepsilon ,\varphi)$- Maximin problems with space complexity , $O(n(\log n + \log \frac{1}{\varepsilon } + \log \log \frac{1}{\delta } ) + \log \log m )$, and bits, respectively, even when the length of the stream is not known beforehand. Moreover, the update time for $\varepsilon$- Minimum is $O(1)$.

## 4 HARDNESS

In this section, we prove space complexity lower bounds for the $\varepsilon$-Heavy hitters, $\varepsilon$-Minimum, $\varepsilon$-Borda, and $\varepsilon$-maximin problems. We present reductions from certain communication problems for proving space complexity lower bounds. Let us first introduce those communication problems with necessary results.

### 4.1 Communication Complexity

Let $t$ and $m$ be positive integers. Alice is given a string $x = (x_1, \ldots , x_t)\in [m]^t$. Bob is given an index $i\in [t]$. Bob has to output $x_i$.

The following is a well known result [43].

$\mathcal {R}_\delta ^{\text{1-way}}({\rm I\scriptstyle{NDEXING}}_{m,t}) = \Omega (t \log m)$ for constant $\delta \in (0,1)$.

Let $t$ and $m$ be positive integers. Alice is given a string $x = (x_1, \ldots , x_t)\in [m]^t$. Bob is given an integer $i\in [t]$ and $(x_1, \ldots , x_{i-1})$. Bob has to output $x_i$.

The following communication complexity lower bound result is due to Reference [27] by a simple extension of the arguments of Bar-Yossef et al. [5].

$\mathcal {R}_\delta ^{\text{1-way}}({\rm A\scriptstyle{UGMENTED-INDEXING}}_{m,t}) = \Omega ((1-\delta)t \log m)$ for any $\delta \lt 1 - \frac{3}{2m}$.

Reference [68] defines a communication problem called Perm, which we generalize to $\varepsilon$-Perm as follows.

Alice is given a permutation $\sigma$ over $[n]$, which is partitioned into many contiguous blocks. Bob is given an index $i\in [n]$ and has to output the block in $\sigma$ where $i$ belongs.

Our lower bound for $\varepsilon$-Perm matches the lower bound for Perm in Lemma 1 in Reference [68] when . For the proof, the reader may find useful some information theory facts described in Appendix A.

, for any constant .

Let us assume $\sigma$, the permutation Alice has, is uniformly distributed over the set of all permutations. Let $\tau _j$ denotes the block the item $j$ is in for $j\in [n]$, $\tau = (\tau _1, \ldots , \tau _n)$, and $\tau _{\lt j} = (\tau _1, \ldots , \tau _{j-1})$. Let $M(\tau)$ be Alice's message to Bob, which is a random variable depending on the randomness of $\sigma$ and the private coin tosses of Alice. Then, we have . Hence, it is enough to lower bound $I(M(\tau); \tau)$. Then we have the following by chain rule:

\begin{align*} I(M(\tau); \tau) &= \sum _{j=1}^n I(M(\tau); \tau _j | \tau _{\lt j})\\ &= \sum _{j=1}^n H(\tau _j | \tau _ \lt j) - H(\tau _j | M(\tau), \tau _\lt j)\\ &\geqslant \sum _{j=1}^n H(\tau _j | \tau _ \lt j) - \sum _{j=1}^n H(\tau _j | M(\tau))\\ &= H(\tau) - \sum _{j=1}^n H(\tau _j | M(\tau)). \end{align*}
The number of ways to partition $n$ items into blocks is , which is . Hence, we have . Now, we consider $H(\tau _j | M(\tau))$. By the correctness of the algorithm, Fano's inequality, we have . Hence, we have the following:

Finally, we consider the Greater-than problem.

Alice is given an integer $x\in [n]$ and Bob is given an integer $y\in [n], y\ne x$. Bob has to output 1 if $x\gt y$ and 0 otherwise.

The following result is due to References [54, 67]. We provide a simple proof of it that seems to be missing in the literature.5

$\mathcal {R}_\delta ^{\text{1-way}}({\rm G{\scriptstyle REATER-THAN}}_{n}) = \Omega (\log n)$, for every .

We reduce the Augmented-indexing $_{2,\lceil \log n\rceil + 1}$ problem to the Greater-than $_{n}$ problem thereby proving the result. Alice runs the Greater-than $_{n}$ protocol with its input number whose representation in binary is $a=(x_1x_2\ldots x_{\lceil \log n\rceil }1)_2$. Bob participates in the Greater-than $_{n}$ protocol with its input number whose representation in binary is $b=(x_1x_2\ldots x_{i-1}1\underbrace{0 \ldots 0}_{(\lceil \log n\rceil -i+1) 0^{\prime }s})_2$. Now $x_i=1$ if and only if $a\gt b.$

### 4.2 Reductions

We observe that a trivial bits lower bound for $(\varepsilon , \varphi)$-List heavy hitters, $(\varepsilon , \varphi)$-List borda, $(\varepsilon , \varphi)$-List maximin follows from the fact that any algorithm may need to output many items from the universe. Also, there is a trivial $\Omega (n \log n)$ lower bound for $(\varepsilon , \varphi)$-List borda and $(\varepsilon , \varphi)$-List maximin, because each stream item is a permutation on $[n]$, hence requiring $\Omega (n \log n)$ bits to read.

We show now a space complexity lower bound of $\Omega (\frac{1}{\varepsilon }\log \frac{1}{\varphi })$ bits for the $\varepsilon$-Heavy hitters problem.

Suppose the size of universe $n$ is at least and that $\varphi \gt 2\varepsilon$. Any randomized one pass $(\varepsilon ,\varphi)$- Heavy hitters algorithm with success probability at least $(1-\delta)$ must use bits of space, for constant $\delta \in (0,1)$.

Consider the Indexing problem where Alice is given a string and Bob is given an index . We assume $\varphi \gt 2\varepsilon$. The stream we generate is over (this is possible, since ).

Let $m$ be a large positive integer. Alice generates a stream of length $m/2$ by inserting $\varepsilon m$ copies of $(x_j, j)$ for each . Alice now sends the memory content of the algorithm to Bob. Bob resumes the run of the algorithm by generating another stream of length $m/2$ by inserting $(\varphi - \varepsilon)m$ copies of $(j,i)$ for each . The length of the stream is $m$, the frequency of the item $(x_i, i)$ is $\varphi m$, while the frequency of every other item is $(\varphi - \varepsilon)m$ or $\varepsilon m$. Hence, from the output of the $(\varepsilon ,\varphi)$-Heavy hitters algorithm, Bob knows $i$ with probability at least $(1-\delta)$. Now the result follows from Lemma 5, since $\frac{1}{\varepsilon } \log \frac{1}{\varphi -\varepsilon } \gt \frac{1}{\varepsilon } \log \frac{1}{\varphi }$.

We now use the same idea as in the proof of Theorem 10 to prove an $\Omega (\frac{1}{\varepsilon }\log \frac{1}{\varepsilon })$ space complexity lower bound for the $\varepsilon$-Maximum problem.

Suppose the size of universe $n$ is at least $\frac{1}{\varepsilon ^2}$. Any randomized one pass $\varepsilon$-Maximum algorithm with success probability at least $(1-\delta)$ must use $\Omega (\frac{1}{\varepsilon }\log \frac{1}{\varepsilon })$ bits of space, for constant $\delta \in (0,1)$.

Consider the Indexing problem where Alice is given a string and Bob is given an index . The stream we generate is over (this is possible, since ). Let $m$ be a large positive integer. Alice generates a stream of length in such a way that the frequency of every item in is at least $\lfloor {\varepsilon m}/{2}\rfloor$ and the frequency of any other item is 0. Alice now sends the memory content of the algorithm to Bob. Bob resumes the run of the algorithm by generating another stream of length ${m}/{2}$ in such a way that the frequency of every item in is at least $\lfloor {\varepsilon m}/{2}\rfloor$ and the frequency of any other item is 0. The frequency of the item $(x_i, i)$ is at least $\lfloor \varepsilon m\rfloor$ where as the frequency of every other item is at most $\lfloor {\varepsilon m}/{2}\rfloor$. Hence, the -Maximum algorithm must output $(x_i, i)$ with probability at least $(1-\delta)$. Now the result follows from Lemma 5.

For $\varepsilon$-Minimum, we prove a space complexity lower bound of bits.

Suppose the universe size $n$ is at least . Then any randomized one pass $\varepsilon$-Minimum algorithm must use bits of space.

We reduce from Indexing to -Minimum thereby proving the result. Let the inputs to Alice and Bob in Indexing be and an index , respectively. Alice and Bob generate a stream $\mathcal {S}$ over the universe . Alice puts two copies of item $j$ in $\mathcal {S}$ for every $j\in \mathcal {U}$ with $x_j=1$ and runs the -Minimum algorithm. Alice now sends the memory content of the algorithm to Bob. Bob resumes the run of the algorithm by putting two copies of every item in in the stream $\mathcal {S}$. Bob also puts one copy of in $\mathcal {S}$. Suppose the size of the support of is $\ell$. Since for every , we have the following. If $x_i=0$, then the -Minimum algorithm must output $i$ with probability at least $(1-\delta)$. If $x_i=1$, then the -Minimum algorithm must output with probability at least $(1-\delta)$. Now the result follows from Lemma 5.

We show next a bits space complexity lower bound for $\varepsilon$-Borda.

Any one pass algorithm for $\varepsilon$-Borda must use bits of space.

We reduce $\varepsilon$-Perm to $\varepsilon$-Borda. Suppose Alice has a permutation $\sigma$ over $[n]$ and Bob has an index $i\in [n]$. The item set of our reduced election is $\mathcal {U}= [n]\sqcup \mathcal {D}$, where $\mathcal {D}= \lbrace d_1, d_2, \ldots , d_{2n}\rbrace$. Alice generates a vote over the item set $\mathcal {U}$ from $\sigma$ as follows. The vote is where $\mathcal {B}_j$ for is defined as follows:

\begin{align*} \mathcal {B}_j &= d_{(j-1)2\varepsilon n + 1} \succ d_{(j-1)2\varepsilon n + 2} \succ \cdots \succ d_{(2j-1)\varepsilon n} \\ &\succ \sigma _{j\varepsilon n + 1} \succ \cdots \succ \sigma _{(j+1)\varepsilon n} \succ d_{(2j-1)\varepsilon + 1} \succ \cdots \succ d_{2j\varepsilon n}. \end{align*}

Alice runs the $\varepsilon$-Borda algorithm with the vote and sends the memory content to Bob. Let $\mathcal {D}_{-i} = \mathcal {D}\setminus \lbrace i\rbrace$, $\overrightarrow{\mathcal {D}_{-i}}$ be an arbitrary but fixed ordering of the items in $\mathcal {D}_{-i}$, and let $\overleftarrow{\mathcal {D}_{-i}}$ be the reverse ordering of $\overrightarrow{\mathcal {D}_{-i}}$. Bob resumes the algorithm by generating two votes each of the form $i\succ \overrightarrow{\mathcal {D}_{-i}}$ and $i\succ \overleftarrow{\mathcal {D}_{-i}}$. Let us call the resulting election $\mathcal {E}$. The number of votes $m$ in $\mathcal {E}$ is 5. The Borda score of the item $i$ is at least $12n$. The Borda score of every item $x\in \mathcal {U}$ is at most $9n$. Hence, for , the $\varepsilon$-Borda algorithm must output the item $i$. Moreover, it follows from the construction of that an $\varepsilon mn$ additive approximation of the Borda score of the item $i$ reveals the block where $i$ belongs in the $\varepsilon$-Perm instance.

We next give a nearly tight lower bound for the $\varepsilon$-maximin problem.

Any one-pass algorithm for $\varepsilon$-maximin requires $\Omega (n/\varepsilon ^2)$ memory bits of storage.

We reduce from Indexing. Let $\gamma = 1/\varepsilon ^2$. Suppose Alice has a string $y$ of length $(n-\gamma)\cdot \gamma$, partitioned into $n-\gamma$ blocks of length $\gamma$ each. Bob has an index $\ell = i + (j-\gamma -1)\cdot \gamma$ where $i \in [\gamma ], j \in \lbrace \gamma +1, \dots , n\rbrace$. The Indexing problem is to return $y_\ell$ for which there is a $\Omega (|y|) = \Omega (n/\varepsilon ^2)$ lower bound (Lemma 5).

The initial part of the reduction follows the construction in the proof of Theorem 6 in Reference [70], which we encapsulate in the following lemma.

Given $y$, Alice can construct a matrix $P \in \lbrace 0,1\rbrace ^{n \times \gamma }$ using public randomness, such that if $P^i$ and $P^j$ are the $i{\rm th}$ and $j{\rm th}$ rows of $P$, respectively, then with probability at least $2/3$, if $y_\ell = 1$ and if $y_\ell = 0$.

Let Alice construct $P$ according to Lemma 9 and then adjoin the bitwise complement of the matrix $P$ below $P$ to form the matrix $P^{\prime } \in \lbrace 0,1\rbrace ^{2n \times \gamma }$; note that each column of $P^{\prime }$ has exactly $n$ 1s and $n$ 0s. Now, we interpret each row of $P$ as a candidate and each column of $P$ as a vote in the following way: for each $v \in [\gamma ]$, vote $v$ has the candidates in $\lbrace c : P^{\prime }_{c,v}=1\rbrace$ in ascending order in the top $n$ positions and the rest of the candidates in ascending order in the bottom $n$ positions. Alice inserts these $\gamma$ votes into the stream and sends the state of the $\varepsilon$-Maximin algorithm to Bob as well as the Hamming weight of each row in $P^{\prime }$. Bob inserts $\gamma$ more votes, in each of which candidate $i$ comes first, candidate $j$ comes second, and the rest of the $2n-2$ candidates are in arbitrary order.

Note that because of Bob's votes, the maximin score of $j$ is the number of votes among the ones cast by Alice in which $j$ defeats $i$. Since $i \lt j$, in those columns $v$ where $P_{i,v} = P_{j,v}$, candidate $i$ beats candidate $j$. Thus, the set of votes in which $j$ defeats $i$ is $\lbrace v \mid P_{i,v} = 0, P_{j,v}=1\rbrace$. The size of this set is $\frac{1}{2}\left(\Delta (P^i, P^j) + |P^j| - |P^i|\right)$. Therefore, if Bob can estimate the maximin score of $j$ up to $\sqrt {\gamma }/4$ additive error, he can find $\Delta (P^i,P^j)$ up to $\sqrt {\gamma }/2$ additive error as Bob knows $|P^i|$ and $|P^j|$. This is enough, by Lemma 9, to solve the Indexing problem with probability at least $2/3$.

Finally, we show a space complexity lower bound that depends on the length of the stream $m$.

Any one pass algorithm for $\varepsilon$-Heavy hitters, $\varepsilon$-Minimum, $\varepsilon$-Borda, and $\varepsilon$-maximin must use $\Omega (\log \log m)$ memory bits, even if the stream is over a universe of size 2, for every $\varepsilon \lt \frac{1}{4}$.

It is enough to prove the result only for $\varepsilon$-Heavy hitters, since the other three problems reduce to $\varepsilon$-Heavy hitters for a universe of size 2. Suppose we have a randomized one pass $\varepsilon$-Heavy hitters algorithm that uses $s(m)$ bits of space. Using this algorithm, we will show a communication protocol for the Greater-than $_{m}$ problem whose communication complexity is $s(2^m)$, thereby proving the statement. The universal set is $\mathcal {U}=\lbrace 0,1\rbrace$. Alice generates a stream of $2^x$ many copies of the item 1. Alice now sends the memory content of the algorithm. Bob resumes the run of the algorithm by generating a stream of $2^y$ many copies of the item 0. If $x\gt y$, then the item 1 is the only $\varepsilon$-winner; whereas if $x\lt y$, then the item 0 is the only $\varepsilon$-winner.

## Information Theory Facts

For a discrete random variable $X$ with possible values $\lbrace x_1,x_2,\ldots ,x_n\rbrace$, the Shannon entropy of $X$ is defined as $H(X)=-\sum _{i=1}^{n}\Pr (X=x_i)\log _2 \Pr (X=x_i)$. Let $H_b(p)=-p\log _2 p-(1-p)\log _2 (1-p)$ denote the binary entropy function when $p \in (0,1)$. For two random variables $X$ and $Y$ with possible values $\lbrace x_1,x_2,\ldots ,x_n\rbrace$ and $\lbrace y_1,y_2,\ldots ,y_m\rbrace$, respectively, the conditional entropy of $X$ given $Y$ is defined as $H(X\ |\ Y)=\sum _{i,j}\Pr (X=x_i,Y=y_j)\log _2\frac{\Pr (Y=y_j)}{\Pr (X=x_i,Y=y_j)}$. Let $I(X; Y) = H(X) - H(X\ |\ Y) = H(Y) - H(Y\ |\ X)$ denote the mutual information between two random variables $X, Y$. Let $I(X; Y\ |\ Z)$ denote the mutual information between two random variables $X, Y$ conditioned on $Z$, i.e., $I(X ; Y\ |\ Z) = H(X\ |\ Z) - H(X\ |\ Y, Z)$. The following summarizes several basic properties of entropy and mutual information.

Let $X, Y, Z, W$ be random variables.

1. If $X$ takes value in $\lbrace 1,2, \ldots , m\rbrace$, then $H(X) \in [0, \log m]$.
2. and .
3. If $X$ and $Z$ are independent, then we have . Similarly, if $X, Z$ are independent given $W$, then .
4. (Chain rule of mutual information) $I(X, Y; Z) = I(X; Z) + I(Y; Z\ |\ X).$ More generally, for any random variables $X_1, X_2, \ldots , X_n, Y$, $\textstyle I(X_1, \ldots , X_n; Y) = \sum _{i = 1}^n I(X_i; Y\ |\ X_1, \ldots , X_{i-1})$. Thus, .
5. (Fano's inequality) Let $X$ be a random variable chosen from domain $\mathcal {X}$ according to distribution $\mu _X$, and $Y$ be a random variable chosen from domain $\mathcal {Y}$ according to distribution $\mu _Y$. For any reconstruction function $g : \mathcal {Y} \rightarrow \mathcal {X}$ with error $\delta _g$,

We refer readers to [22] for a nice introduction to information theory.

## APPENDIX B

We remark that the algorithm in Lemma 1 has optimal space complexity as shown in Proposition 2 below which may be of independent interest.

Any algorithm that chooses an item from a set of size $n$ with probability $p$ for , in unit cost RAM model must use $\Omega (\log \log m)$ bits of memory.

The algorithm generates $t$ bits uniformly at random (the number of bits it generates uniformly at random may also depend on the outcome of the previous random bits) and finally picks an item from the say $x$. Consider a run $\mathcal {R}$ of the algorithm where it chooses the item $x$ with smallest number of random bits getting generated; say it generates $t$ random bits in this run $\mathcal {R}$. This means that in any other run of the algorithm where the item $x$ is chosen, the algorithm must generate at least $t$ many random bits. Let the random bits generated in $\mathcal {R}$ be $r_1, \ldots , r_t$. Let $s_i$ be the memory content of the algorithm immediately after it generates $i^{th}$ random bit, for $i\in [t]$, in the run $\mathcal {R}$. First notice that if $t \lt \log _2 n$, then the probability with which the item $x$ is chosen is more than $\frac{1}{n}$, which would be a contradiction. Hence, . Now, we claim that all the $s_i$’s must be different. Indeed otherwise, let us assume $s_i = s_j$ for some $i\lt j$. Then the algorithm chooses the item $x$ after generating $t-(j-i)$ many random bits (which is strictly less than $t$) when the random bits being generated are $r_1, \ldots , r_i, r_{j+1}, \ldots , r_t$. This contradicts the assumption that the run $\mathcal {R}$ we started with chooses the item $x$ with smallest number of random bits generated.

## ACKNOWLEDGMENT

We thank Jelani Nelson for a helpful conversation that led us to discover an error in a previous version of the article.

## REFERENCES

• Gediminas Adomavicius and Alexander Tuzhilin. 2005. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Trans. Knowl. Data Eng. 17, 6 (2005), 734–749.
• Rakesh Agrawal and Ramakrishnan Srikant. 1994. Fast algorithms for mining association rules in large databases. In Proceedings of 20th International Conference on Very Large Data Bases (VLDB’94). 487–499.
• Arvind Arasu, Shivnath Babu, and Jennifer Widom. 2003. CQL: A language for continuous queries over streams and relations. In Proceedings of the 9th International Workshop on Database Programming Languages (DBPL’03). 1–19.
• Javed A. Aslam and Mark Montague. 2001. Models for metasearch. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 276–284.
• Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, and D. Sivakumar. 2002. An information statistics approach to data stream and communication complexity. In Proceedings of the 43rd Annual IEEE Symposium on Foundations of Computer Science (FOCS’02). IEEE, 209–218.
• Radu Berinde, Piotr Indyk, Graham Cormode, and Martin J. Strauss. 2010. Space-optimal heavy hitters with strong error bounds. ACM Trans. Database Syst. 35, 4 (2010), 26. DOI: http://dx.doi.org/10.1145/1862919.1862923
• Kevin S. Beyer and Raghu Ramakrishnan. 1999. Bottom-up computation of sparse and Iceberg CUBEs. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD’99). 359–370.
• Daniel K. Blandford and Guy E. Blelloch. 2008. Compact dictionaries for variable-length keys and data with applications. ACM Trans. Algor. 4, 2 (2008), 17.
• Philippe Bonnet, Johannes Gehrke, and Praveen Seshadri. 2001. Towards sensor database systems. In Proceedings of the 2nd International Conference on Mobile Data Management (MDM’01). 3–14.
• Felix Brandt, Vincent Conitzer, Ulle Endriss, Jérôme Lang, and Ariel Procaccia. 2015. Handbook of Computational Social Choice. Cambridge University Press.
• Vladimir Braverman, Stephen R. Chestnut, Nikita Ivkin, and David P. Woodruff. 2016. Beating countsketch for heavy hitters in insertion streams. Retrieved from http://arxiv.org/abs/1511.00661.
• E. J. Candes, J. Romberg, and T. Tao. 2006. Stable signal recovery from incomplete and inaccurate measurements. Commun. Pure Appl. Math. 59 (2006), 1207–1223.
• Ioannis Caragiannis and Ariel D. Procaccia. 2011. Voting almost maximizes social welfare despite limited communication. Artific. Intell. 175, 910 (2011), 1655–1671. DOI: http://dx.doi.org/10.1016/j.artint.2011.03.005
• Ben Carterette and Desislava Petkova. 2006. Learning a ranking from pairwise preferences. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 629–630.
• Yousra Chabchoub, Christine Fricker, and Hanene Mohamed. 2009. Analysis of a bloom filter algorithm via the supermarket model. In Proceedings of the 21st International Teletraffic Congress (ITC’09). 1–8.
• Moses Charikar, Kevin Chen, and Martin Farach-Colton. 2004. Finding frequent items in data streams. Theoret. Comput. Sci. 312, 1 (2004), 3–15.
• Vincent Conitzer and Tuomas Sandholm. 2005. Communication complexity of common voting rules. In Proceedings of the 6th ACM Conference on Electronic Commerce. ACM, 78–87.
• Graham Cormode. 2012. Sketch techniques for massive data. In Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches, Graham Cormode, Minos Garofalakis, Peter J. Haas, and Chris Jermaine (Eds.). Foundations and Trends in Databases, Vol. 4. Now Publishers Inc., Hanover, MA, 1–294. DOI: http://dx.doi.org/10.1561/1900000004
• Graham Cormode and Marios Hadjieleftheriou. 2008. Finding frequent items in data streams. Proc. VLDB Endow. 1, 2 (2008), 1530–1541.
• Graham Cormode, Flip Korn, S. Muthukrishnan, and Divesh Srivastava. 2008. Finding hierarchical heavy hitters in streaming data. Trans. Knowl. Discov. Data 1, 4 (2008).
• Graham Cormode and S. Muthukrishnan. 2005. An improved data stream summary: The count-min sketch and its applications. J. Algor. 55, 1 (2005), 58–75.
• Thomas M. Cover and Joy A. Thomas. 2012. Elements of Information Theory. John Wiley & Sons.
• Erik D. Demaine, Alejandro López-Ortiz, and J. Ian Munro. 2002. Frequency estimation of internet packet streams with limited space. In Proceedings of the European Symposium on Algorithms (ESA’02). Springer, 348–360.
• Palash Dey and Arnab Bhattacharyya. 2015. Sample complexity for winner prediction in elections. In Proceedings of the 14th International Conference on Autonomous Systems and Multiagent Systems (AAMAS’15).
• Palash Dey and Y. Narahari. 2015. Estimating the margin of victory of an election using sampling. In Proceedings of the International Joint Conferences on Artificial Intelligence (IJCAI’15). AAAI Press, 1120–1126.
• A. Dvoretzky, J. Kiefer, and J. Wolfowitz. 1956. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. Ann. Math. Stat. 27, 3 (1956), 642–669.
• Funda Ergün, Hossein Jowhari, and Mert Sağlam. 2010. Periodicity in streams. In Proceedings of the International Conference on Approximation Algorithms for Combinatorial Optimization Problems and International Conference on Randomization and Computation (APPROX-RANDOM’10). Springer, 545–559.
• Cristian Estan and George Varghese. 2003. New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice. ACM Trans. Comput. Syst. 21, 3 (2003), 270–313.
• Min Fang, Narayanan Shivakumar, Hector Garcia-Molina, Rajeev Motwani, and Jeffrey D. Ullman. 1998. Computing iceberg queries efficiently. In Proceedings of 24th International Conference on Very Large Data Bases (VLDB’98). 299–310.
• Philippe Flajolet. 1985. Approximate counting: A detailed analysis. BIT Numer.l Math. 25, 1 (1985), 113–134.
• Anna C. Gilbert, Martin J. Strauss, Joel A. Tropp, and Roman Vershynin. 2007. One sketch for all: Fast algorithms for compressed sensing. In Proceedings of the 39th Annual ACM Symposium on Theory of Computing. 237–246.
• Jiawei Han, Jian Pei, Guozhu Dong, and Ke Wang. 2001. Efficient computation of iceberg cubes with complex measures. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data. 1–12.
• Jiawei Han, Jian Pei, and Yiwen Yin. 2000. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. 1–12.
• Jonathan L. Herlocker, Joseph A. Konstan, Loren G. Terveen, and John T. Riedl. 2004. Evaluating collaborative filtering recommender systems. ACM Trans. Info. Syst. 22, 1 (2004), 5–53.
• John Hershberger, Nisheeth Shrivastava, Subhash Suri, and Csaba D. Tóth. 2005. Space complexity of hierarchical heavy hitters in multi-dimensional data streams. In Proceedings of the 24th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. 338–347.
• Christian Hidber. 1999. Online association rule mining. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD’99). 145–156.
• Tin Kam Ho, Jonathan J. Hull, and Sargur N. Srihari. 1994. Decision combination in multiple classifier systems. IEEE Trans. Pattern Anal. Mach. Intell. 16, 1 (1994), 66–75.
• Albert Jiang, Leandro Soriano Marcolino, Ariel D. Procaccia, Tuomas Sandholm, Nisarg Shah, and Milind Tambe. 2014. Diverse randomized agents vote to win. In Advances in Neural Information Processing Systems. 2573–2581.
• Richard M. Karp, Scott Shenker, and Christos H. Papadimitriou. 2003. A simple algorithm for finding frequent elements in streams and bags. ACM Trans. Database Syst. 28, 1 (2003), 51–55.
• Peter Kellner, Joe Twyman, and Anthony Wells. 2011. Polling voting intentions. In Political Communication in Britain. Springer, 94–108.
• Ilan Kremer, Noam Nisan, and Dana Ron. 1999. On randomized one-round communication complexity. Computation. Complex. 8, 1 (1999), 21–49.
• Abhishek Kumar and Jun (Jim) Xu. 2006. Sketch guided sampling—Using on-line estimates of flow size for adaptive data collection. In Proceedings of the 25th IEEE International Conference on Computer Communications (INFOCOM’06).
• Eyal Kushilevitz and Noam Nisan. 1997. Communication Complexity. Cambridge University Press, New York, NY.
• Kasper Green Larsen, Jelani Nelson, Huy Le Nguyen, and Mikkel Thorup. 2016. Optimal space heavy hitters with fast update and query. Retrieved from http://ita.ucsd.edu/workshop/16/files/abstract/abstract_3738.txt.
• Charles E. Leiserson, Ronald L. Rivest, Clifford Stein, and Thomas H. Cormen. 2001. Introduction to algorithms. MIT Press 5 (2001), 2–2.
• Hang Li. 2014. Learning to rank for information retrieval and natural language processing. Synth. Lect. Hum. Lang. Technol. 7, 3 (2014), 1–121.
• Alessandra Lumini and Loris Nanni. 2006. Detector of image orientation based on borda count. Pattern Recogn. Lett. 27, 3 (2006), 180–186.
• Nikos Mamoulis, Man Lung Yiu, Kit Hung Cheng, and David W. Cheung. 2007. Efficient top-k Aggregation of Ranked Inputs. ACM Trans. Database Syst. 32, 3 (2007), 19. DOI: http://dx.doi.org/10.1145/1272743.1272749
• Gurmeet Singh Manku and Rajeev Motwani. 2002. Approximate frequency counts over data streams. In Proceedings of the 28th International Conference on Very Large Data Bases. VLDB Endowment, 346–357.
• Andrew Mao, Ariel D. Procaccia, and Yiling Chen. 2012. Social choice for human computation. In Proceedings of the 4th Workshop on Human Computation (HCOMP’12).
• Andrew Mao, Ariel D. Procaccia, and Yiling Chen. 2013. Better human computation through principled voting. In Proceedings of of the 27th Conference on Artificial Intelligence (AAAI’13).
• Amélie Marian, Nicolas Bruno, and Luis Gravano. 2004. Evaluating Top-k Queries Over Web-Accessible Databases. ACM Trans. Database Syst. 29, 2 (2004), 319–362. DOI: http://dx.doi.org/10.1145/1005566.1005569
• Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. 2005. Efficient computation of frequent and top-k elements in data streams. In Proceedings of the 10th International Conference on Database Theory (ICDT’05). Springer-Verlag, Berlin, 398–412. DOI: http://dx.doi.org/10.1007/978-3-540-30570-5_27
• Peter Bro Miltersen, Noam Nisan, Shmuel Safra, and Avi Wigderson. 1998. On data structures and asymmetric communication complexity.J. Comput. Syst. Sci. 57, 1 (1998), 37–49. Retrieved from http://dblp.uni-trier.de/db/journals/jcss/jcss57.html#MiltersenNSW98.
• Jayadev Misra and David Gries. 1982. Finding repeated elements. Sci. Comput. Program. 2, 2 (1982), 143–152.
• Md. Maruf Monwar and Marina L. Gavrilova. 2009. Multimodal biometric system using rank-level fusion approach. IEEE Trans. Syst. Man Cybernet., Part B: Cybernet. 39, 4 (2009), 867–878.
• J. Strother Moore. 1981. J. Algor. (June 1981), 208–209.
• Robert Morris. 1978. Counting large numbers of events in small registers. Commun. ACM 21, 10 (1978), 840–842.
• D. J. Mullen and B. Roth. 1991. Decision Making: Its Logic and Practice. Rowman and Littlefield Publishers, Inc., Savage, MD.
• Shanmugavelayutham Muthukrishnan. 2005. Data Streams: Algorithms and Applications. Now Publishers Inc.
• Jelani Nelson. 2012. Sketching and streaming algorithms for processing massive data. XRDS: Crossroads ACM Mag. Students 19, 1 (2012), 14–19.
• Rabia Nuray and Fazli Can. 2006. Automatic ranking of information retrieval systems using data fusion. Info. Process. Manage. 42, 3 (2006), 595–614.
• Adarsh Prasad, Harsh Pareek, and Pradeep Ravikumar. 2015. Distributional rank aggregation, and an axiomatic analysis. In Proceedings of the 32nd International Conference on Machine Learning (ICML’15). 2104–2112.
• Paul Resnick and Hal R. Varian. 1997. Recommender systems. Commun. ACM 40, 3 (1997), 56–58.
• Ashok Savasere, Edward Omiecinski, and Shamkant B. Navathe. 1995. An efficient algorithm for mining association rules in large databases. In Proceedings of 21th International Conference on Very Large Data Bases (VLDB’95). 432–444.
• Nisheeth Shrivastava, Chiranjeeb Buragohain, Divyakant Agrawal, and Subhash Suri. 2004. Medians and beyond: New aggregation techniques for sensor networks. In Proceedings of the 2nd International Conference on Embedded Networked Sensor Systems (SenSys’04). 239–249.
• D. Smirnov. 1988. Shannon's Information Methods for Lower Bounds for Probabilistic Communication Complexity. Master's thesis. Moscow University.
• Xiaoming Sun and David P. Woodruff. 2015. Tight bounds for graph problems in insertion streams. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM’15). 435–448. DOI: http://dx.doi.org/10.4230/LIPIcs.APPROX-RANDOM.2015.435
• Hannu Toivonen. 1996. Sampling large databases for association rules. In Proceedings of 22th International Conference on Very Large Data Bases (VLDB’96). 134–145.
• Dirk Van Gucht, Ryan Williams, David P. Woodruff, and Qin Zhang. 2015. The communication complexity of distributed set-joins with applications to matrix multiplication. In Proceedings of the 34th ACM Symposium on Principles of Database Systems. ACM, 199–212.
• Maksims N. Volkovs and Richard S. Zemel. 2014. New learning methods for supervised and unsupervised preference aggregation. J. Mach. Learn. Res. 15, 1 (2014), 1135–1176.
• Gang Wang and Frederick H. Lochovsky. 2004. Feature selection with conditional mutual information maximin in text categorization. In Proceedings of the 13th ACM International Conference on Information and Knowledge Management. ACM, 342–349.
• Lirong Xia. 2012. Computing the margin of victory for various voting rules. In Proceedings of the 13th ACM Conference on Electronic Commerce. ACM, 982–999.
• Andrew Chi-Chih Yao. 1979. Some complexity questions related to distributive computing (preliminary report). In Proceedings of the 11th Annual ACM Symposium on Theory of Computing. ACM, 209–213.

## Footnotes

• 1In a turnstile stream, updates modify an underlying $n$-dimensional vector $x$ initialized at the zero vector; each update is of the form $x \leftarrow x+ e_i$ or $x \leftarrow x-e_i$, where $e_i$ is the $i$th standard unit vector. In an insertion stream, only updates of the form $x \leftarrow x+ e_i$ are allowed.
• 2In fact, the first work [57] to formally pose the heavy hitters problem couched it in the context of voting.
• 3In all our algorithms, whenever we pick an item with probability $p\gt 0$, we can assume, without loss of generality, that is a power of two. If not, then we replace $p$ with $p^{\prime }$ where is the largest power of two more than . This does not affect correctness and performance of our algorithms.
• 4We emphasize that this is stronger than an amortized guarantee, as on every insertion, the cost will be $O(1)$.
• 5A similar proof appears in Reference [41], but theirs gives a weaker lower bound.

Partially supported by DST Ramanujan grant DSTO1358 and DRDO Frontiers Project DRDO0687.

Work partially completed while at IISc. Partially supported by DST INSPIRE grant DST/INSPIRE/04/2016/001479.

Work partially completed while at IBM Almaden Research Center.

Authors’ addresses: A. Bhattacharyya, CSA Department Indian Institute of Science; email: arnabb@iisc.ac.in; P. Dey, A-204, Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, West Bengal-721302, India; email: palash.dey@cse.iitkgp.ac.in; D. P. Woodruff, School of Computer Science Carnegie Mellon University; email: dwoodruf@cmu.edu.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

©2018 Association for Computing Machinery. 1549-6325/2018/10-ART2 \$15.00
DOI: https://doi.org/10.1145/3264427

Publication History: Received September 2016; revised July 2018; accepted July 2018