Stanford University — CS154: Automata and Complexity
Luca Trevisan and Ryan Williams
Handout 2
1/26/2012
Notes on Streaming Algorithms
A streaming algorithm is an algorithm that receives its input as a “stream” of data,
and that proceeds by making only one pass through the data. As for any other kind of
algorithm, we want to design streaming algorithms that are fast and that use as little
memory as possible. A DFA is a streaming algorithm that uses a constant amount
of memory, processes each data item in its input stream in constant time, and solves
a decision problem defined on the input. In general, we are interested in streaming
algorithms for numerical computations, and other computations with a non-binary
output, and we are interested in algorithms whose memory use and processing time
per input item are not constant, as long as we can make them feasibly small even
for very large streams.
The streaming model is appropriate for settings in which the data to be processed
is not stored anywhere, but is generated dynamically and fed to the streaming
algorithm as it is being generated. Typical examples are the stream of measurements
from a sensor network, the stream of transactions from an online store, of search
strings submitted to a search engine, of page requests coming to a web server, and so on.
In this lecture we will see how to apply the Myhill-Nerode theorem and the ideas in
its proof to argue about memory lower bounds for streaming algorithms.
1 Most Frequent Element
An example of a simple but non-trivial streaming algorithm is given by the following
typical interview question: suppose we are allowed to make one pass through a sequence x1, . . . , xn of elements from a set Σ, and that we know that one value from Σ
occurs more than n/2 times in the sequence; find the most frequently repeated value
using only two variables, that is, a total memory of only log2 n + log2 |Σ| bits. (If you
haven’t seen this puzzle before, think about how you would do it.)
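For the record, the standard two-variable solution is the Boyer-Moore majority vote; here is a minimal Python sketch (skip it if you would rather solve the puzzle yourself). The candidate variable takes log2 |Σ| bits and the counter log2 n bits, matching the stated budget.

```python
def majority_candidate(stream):
    """Boyer-Moore majority vote: one pass, two variables.

    If some value occurs more than n/2 times in the stream, it is
    returned; without that guarantee the result is arbitrary."""
    candidate, count = None, 0
    for x in stream:
        if count == 0:
            candidate, count = x, 1   # adopt x as the new candidate
        elif x == candidate:
            count += 1                # x supports the current candidate
        else:
            count -= 1                # x cancels one supporting vote
    return candidate

# With a guaranteed majority element (3 occurs 6 times out of 9):
print(majority_candidate([3, 1, 3, 2, 3, 3, 4, 3, 3]))  # 3
```

The invariant is that the majority element, occurring more than n/2 times, can never be fully canceled out by the other elements.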
The problem of finding a most frequently occurring element (MFE) in a data stream is
an important one in many of the settings that motivate the streaming model. (Think
of keeping track of the page that receives the most hits, or the best selling item, etc.)
In general, that is, without the guarantee that there is an element occurring in a
majority of places in the sequence, there is, unfortunately, no memory-efficient way
to find a most frequent element.
We will show that every deterministic streaming algorithm that solves the MFE problem must use at least Ω(n · ℓ) bits of memory, where n is the length of the stream and
ℓ = log2 |Σ| is the bit-length of one element of Σ. We will assume 2^ℓ > n².
We will prove the memory lower bound by relating the amount of memory needed
by a streaming algorithm for MFE to the number of states needed by any DFA for a
related problem. Let’s identify Σ with the set {0, . . . , 2^ℓ − 1}.
Definition 1 For every n, Σ, define the language Ln,Σ ⊆ Σ^n of sequences x1, . . . , xn,
xi ∈ Σ, in which 0 is a most frequently occurring element.
Fact 2 Suppose that there is a deterministic streaming algorithm for MFE that uses
≤ m(n, Σ) bits of memory; then there is a DFA for Ln,Σ that has ≤ 2^m(n,Σ) states.
Proof: A streaming algorithm that uses m bits of memory has only 2^m distinct
internal states, and so we can construct an automaton with 2^m states that simulates
the execution of the algorithm; at the end of the sequence, we define the internal states
that lead the algorithm to output 0 to be accepting states for the DFA, and the others
to be non-accepting. This automaton recognizes the language Ln,Σ.
Now we can apply the Myhill-Nerode theorem to Ln,Σ .
Lemma 3 The language Ln,Σ has 2^Ω(nℓ) distinguishable strings.
Proof: For every subset S = {s1, . . . , s_{(n−5)/2}} ⊂ Σ − {0} of (n − 5)/2 nonzero elements
of Σ, consider the sequence

xS := 0, 0, 0, s1, s1, . . . , s_{(n−5)/2}, s_{(n−5)/2}

of length n − 2, where 0 is repeated 3 times, the elements of S are repeated twice each,
and no other element is present.
We see that all such sequences are distinguishable strings for Ln,Σ , because if we
consider any two different sequences xS and xT , and we take an element s which
belongs to S but not to T , we see that attaching (s, s) to xS gives us a sequence not
in Ln,Σ (because the most frequent element is s ≠ 0, which occurs 4 times), while
attaching (s, s) to xT gives us an element of Ln,Σ, because 0 occurs 3 times and all
other elements occur at most twice.
The number of distinguishable strings that we have constructed is

C(2^ℓ − 1, (n − 5)/2) ≥ ((2^ℓ − 1)/(e · (n − 5)/2))^((n−5)/2) ≥ 2^Ω(nℓ)

where we use the fact that C(N, K) ≥ (N/(eK))^K and the assumption that 2^ℓ > n²,
so that

(2^ℓ − 1)/(e · (n − 5)/2) ≥ Ω(2^(ℓ/2))
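As a sanity check (not part of the proof), the distinguishability argument can be verified on a tiny instance; the values n = 9, S = {1, 2}, T = {3, 4} below are illustrative:

```python
from collections import Counter

def most_frequent(seq):
    """Return the set of most frequently occurring elements of seq."""
    counts = Counter(seq)
    top = max(counts.values())
    return {x for x, c in counts.items() if c == top}

def x_of(subset):
    """Build x_S: three 0s, then each element of the subset twice."""
    seq = [0, 0, 0]
    for s in sorted(subset):
        seq += [s, s]
    return seq

n = 9                      # so (n - 5) / 2 = 2 elements per subset
S, T = {1, 2}, {3, 4}      # two distinct subsets of Sigma - {0}
xS, xT = x_of(S), x_of(T)
s = 1                      # an element of S that is not in T

print(most_frequent(xS + [s, s]))  # {1}: x_S . (s, s) is not in L
print(most_frequent(xT + [s, s]))  # {0}: x_T . (s, s) is in L
```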
Putting it all together, we have:
Theorem 4 Every deterministic streaming algorithm for the MFE problem requires
memory Ω(nℓ), where n is the length of the stream and ℓ is the bit-length of each data
item, assuming 2^ℓ > n².
2 Number of Distinct Elements
Another useful computation to make over a data stream is to figure out how many
distinct elements (DE) there are in the stream. For example, from a stream of page
requests we would like to know from how many distinct IP addresses we are getting
requests, as a first approximation of the number of unique readers.
For simplicity, we will continue to work with the assumption |Σ| = 2^ℓ > n².
To prove lower bounds for this problem, instead of reducing to a regular language
and finding distinguishable strings for it, we will reason directly about the streaming
algorithm.
Definition 5 We say that two streams x, y are distinguishable for a streaming problem P on inputs of length n, if there is a stream z such that all the correct answers
to problem P for input x · z are different from all the correct answers to problem P
for input y · z, and the streams x · z and y · z have length n.
So two streams x, y are distinguishable for the MFE problem if there is a stream z
such that none of the most frequent elements of x · z is also a most frequent element
of y · z. Two streams x, y are distinguishable for the DE problem if there is a stream
z such that the number of distinct elements in x · z is different from the number of
distinct elements of y · z.
The following fact gives us a way to prove memory lower bounds.
Lemma 6 Suppose that we can find D(n, Σ) strings in Σ∗ that are distinguishable
for problem P on inputs of length n. Then, every deterministic streaming algorithm
for P must use ≥ log2 D(n, Σ) bits of memory on inputs of length n.
Proof: Suppose that we have a streaming algorithm A that uses m(n, Σ) < log2 D(n, Σ)
bits of memory and solves P on inputs of length n. Then A has ≤ 2^m(n,Σ) < D(n, Σ)
distinct internal states, and there are two distinguishable strings x and y such that
A is in the same internal state after reading x and after reading y. This means that,
for every z, A gives the same output for the input x · z and the input y · z. So, for
every z, there must be an output that is correct both for the input x · z and the input
y · z, contradicting the distinguishability of x and y.
Note that we could have derived the lower bound for MFE from Lemmas 3 and 6
without reducing to a regular language.
Theorem 7 There are 2^Ω(nℓ) distinguishable strings for the DE problem on inputs of
length n, hence the DE problem requires Ω(nℓ) bits of memory for every deterministic
streaming algorithm.
Proof: For every subset S = {s1, . . . , s_{n/2}} ⊆ Σ of size n/2, consider the sequence
xS = s1, . . . , s_{n/2}. For every two sequences xS and xT, we can see that they are
distinguishable, because xS · xS has n/2 distinct elements, and xT · xS has strictly
more than n/2 distinct elements.
The number of distinguishable strings we have constructed is

C(2^ℓ, n/2) ≥ ((2 · 2^ℓ)/(en))^(n/2) ≥ 2^Ω(nℓ)
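A concrete check of the distinguishability argument, with an illustrative n = 8 and two specific subsets:

```python
n = 8
S = [1, 2, 3, 4]   # a subset of Sigma of size n/2
T = [3, 4, 5, 6]   # a different subset of size n/2
xS, xT = list(S), list(T)

print(len(set(xS + xS)))  # 4, i.e. n/2 distinct elements
print(len(set(xT + xS)))  # 6, strictly more than n/2: xS, xT distinguishable
```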
What about approximating the number of distinct elements?
Theorem 8 There are 2^Ω(nℓ) distinguishable strings for the problem of approximating
DE within a ±20% relative error on inputs of length n; hence even this approximate
version of the DE problem requires Ω(nℓ) bits of memory for every deterministic
streaming algorithm.
Proof: We will use the following fact without proof: under the usual assumption
2^ℓ > n², there is a collection S of 2^Ω(nℓ) subsets of Σ, each of size n/2, and such that
every two sets S, T ∈ S have at most n/10 elements in common.
Now, for every set S ∈ S, consider the sequence xS = s1, . . . , s_{n/2}. For every two
sequences xS and xT, we can see that they are distinguishable, because xS · xS has
n/2 distinct elements, and so the range of possible answers of a 20%-approximate
algorithm is between .4n and .6n, while xT · xS has at least .9n distinct elements and
so the range of valid answers is between .72n and 1.08n, so no answer can be valid
for both sequences.
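The disjointness of the two answer ranges is simple arithmetic; a quick check with an illustrative n = 100:

```python
n = 100
# xS . xS has exactly n/2 distinct elements, so any +-20% answer lies in:
lo_S, hi_S = 0.8 * (n / 2), 1.2 * (n / 2)
# xT . xS has at least 0.9 * n distinct elements, so any valid answer is at least:
lo_T = 0.8 * (0.9 * n)

print(hi_S < lo_T)  # True: no single answer is 20%-correct for both streams
```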
The DE problem, however, admits very efficient randomized approximate streaming
algorithms. For example, it is possible to achieve a 1% approximation with high
probability using only O(ℓ + log n) bits of memory.
The basic idea is to pick a random hash function h : Σ → [0, 1] that maps
data items to real numbers between 0 and 1. Then, given a sequence x1, . . . , xn, we
compute h(xi ) for each i, and store the minimum hash value m; the output is 1/m.
The point of the algorithm is that (assuming h to be a perfectly random hash function), the process of evaluating h(xi ) for each i and defining m to be the minimum is
the same probabilistic process as picking k random real numbers in [0, 1], where k is
the number of distinct elements in x1 , . . . , xn , then defining m to be the minimum.
The latter process is well understood, and m tends to be approximately 1/k, so that
1/m is an approximation to k.
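A toy Python version of this estimator; the dictionary that lazily assigns hash values is only a stand-in for a perfectly random hash function (it stores one entry per distinct element, so it gives up the memory savings that a small hash family recovers):

```python
import random

def estimate_distinct(stream, seed=0):
    """Estimate the number of distinct elements as 1 / min(h(x))."""
    rng = random.Random(seed)
    h = {}       # simulates a random hash function h : Sigma -> [0, 1]
    m = 1.0      # current minimum hash value seen so far
    for x in stream:
        if x not in h:
            h[x] = rng.random()   # lazily assign a uniform hash value
        m = min(m, h[x])
    return 1.0 / m

stream = [i % 100 for i in range(10000)]  # 100 distinct elements
print(estimate_distinct(stream))  # typically within a small factor of 100
```

Note that repeated occurrences of an element reuse the same hash value, which is exactly why the minimum depends only on the number of distinct elements.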
Here is a simplified version of the analysis: we want to show that there is a good
probability that the minimum of k random real numbers in the range [0, 1] is Ω(1/k)
and O(1/k). First, we see that the probability that the minimum is more than 5/k is

(1 − 5/k)^k ≤ e^(−5) < 0.007

and, by a union bound over the k numbers, the probability that the minimum is less
than 1/(10k) is at most k · (1/(10k)) = 1/10.
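A Monte Carlo check of these two bounds, with an illustrative k = 50:

```python
import random

rng = random.Random(1)
k, trials = 50, 20000
too_big = too_small = 0
for _ in range(trials):
    m = min(rng.random() for _ in range(k))   # minimum of k uniform values
    too_big += m > 5 / k
    too_small += m < 1 / (10 * k)

print(too_big / trials)    # close to (1 - 5/k)^k, below the e^-5 < 0.007 bound
print(too_small / trials)  # close to 1 - (1 - 1/(10k))^k, below the 1/10 bound
```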
The storage required to implement the algorithm is the memory used to store h, plus
the memory needed to store the current minimum. Real numbers are represented
with finite precision, which affects the algorithm negligibly, and h is picked not as a
completely random function (which would require storage space proportional to |Σ|)
but as a “pairwise independent” hash function, which requires storage O(log |Σ|). The
analysis of the completely random case needs to be adjusted to deal with the more
limited randomness properties of the hash function used in the implementation.
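A minimal sketch of such a pairwise-independent family is h_{a,b}(x) = ((ax + b) mod p)/p for a prime p; the specific prime below is an illustrative choice, assumed larger than |Σ|:

```python
import random

p = 2_147_483_647            # the Mersenne prime 2^31 - 1; assumed > |Sigma|
rng = random.Random(0)
a = rng.randrange(1, p)      # the pair (a, b) is all that must be stored:
b = rng.randrange(p)         # O(log |Sigma|) bits in total
# (the standard family also allows a = 0; we exclude it here for convenience)

def h(x):
    """Map an element x in {0, ..., p - 1} to a hash value in [0, 1)."""
    return ((a * x + b) % p) / p

print(0.0 <= h(12345) < 1.0)  # True
```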
Both the probability of finding a good approximation and the range of approximation
can be improved with various techniques.