Data Stream Algorithms Notes from a series of lectures by S. Muthu Muthukrishnan Guest Lecturer: Andrew McGregor The 2009 Barbados Workshop on Computational Complexity March 1st – March 8th, 2009 Organizer: Denis Thérien Scribes: Anıl Ada, Eric Allender, Arkadev Chattopadhyay, Matei David, Laszlo Egri, Faith Ellen, Ricard Gavaldà, Valentine Kabanets, Antonina Kolokolova, Michal Koucký, Franois Lemieux, Pierre McKenzie, Phuong Nguyen, Toniann Pitassi, Kenneth Regan, Nicole Schweikardt, Luc Segoufin, Pascal Tesson, Thomas Thierauf, Jacobo Torán. 1 2 Lecture 1. Data Streams Lecturer: S. Muthu Muthukrishnan Scribes: Anıl Ada and Jacobo Torán We start with a puzzle. Puzzle 1: Given an array A[1..n] of log n bit integers, sort them in place in O(n) time. 1.1 Motivation The algorithms we are going to describe act on massive data that arrive rapidly and cannot be stored. These algorithms work in few passes over the data and use limited space (less than linear in the input size). We start with three real life scenarios motivating the use of such algorithms. Example 1: Telephone call. Every time a cell-phone makes a call to another phone, several calls between switches are being made until the connection can be established. Every switch writes a record for the call over approx. 1000 Bytes. Since a switch can receive up to 500 million calls a day, this adds up to something like 1 Terabyte per month information. This is a massive amount of information but has to be analyzed for different purposes. An example is searching for drop calls trying to find out under what circumstances such drop calls happen. It is clear that for dealing with this problem we do not want to work with all the data, but just want to filter with a few passes the useful information. Example 2: The Internet. The Internet is made of a network of routers connected to each other, receiving and sending IP packets. Each IP packet contains a packet log including its source and destination addresses as well as other information that is used by the router to decide which link to take for sending it. The packet headers have to be processed at the rate at which they flow through the router. Each package takes about 8 nanoseconds to go through a router and modern routers can handle a few million packets per second. Keeping the whole information would need more than one Terabyte information per day and router. Statistical analysis of the traffic through the router can be done, but this has to be performed on line at nearly real time. Example 3: Web Search. Consider a company for placing publicity in the Web. Such a company has to analyze different possibilities trying to maximize for example the number of clicks they would get by placing an add for a certain price. For this they would have to analyze large amounts of data including information on web pages, numbers of page visitors, add prices and so on. Even if the company keeps a copy of the whole net, the analysis has to be done very rapidly since this information is continuously changing. Before we move on, here is another puzzle. Puzzle 2: Suppose there are n chairs around a circular table that are labelled from 0 to n − 1 in order. So chair i is in between chairs i − 1 and i + 1 mod n. There are two infinitely smart players 3 that play the following game. Initially Player 1 is sitting on chair 0. The game proceeds in rounds. In each round Player 1 chooses a number i from {1, 2, . . . , n − 1}, and then Player 2 chooses a direction left or right. 
Player 1 moves in that direction i steps and sits on the corresponding chair. Player 1’s goal is to sit on as many different chairs as possible while Player 2 is trying to minimize this quantity. Let f (n) denote the maximum number of different chairs that Player 1 can sit on. What is f (n)? Here are the solutions for some special cases. f (2) = 2 f (3) = 2 f (4) = 4 f (5) = 4 f (7) = 6 f (8) = 8 f (p) = p − 1 for p prime f (2k ) = 2k 1.2 Count-Min Sketch In this section we study a concrete data streaming question. Suppose there are n items and let F [1..n] be an array of size n. Index i of the array will correspond to item i. Initially all entries of F are 0. At each point in time, either an item i is added, in which case we increment F [i] by one, or an item is deleted, in which case we decrement F [i] by one. Thus, F [i] equals the number of copies of i in the data, or in other words, the frequency of i. We assume F [i] ≥ 0 at all times. As items are being added and deleted, we only have O(log n) space to work with, i.e. logarithmic in the space required to represent F explicitly. Here we think of the entries of F as words and we count space in terms of number of words. We would like to estimate F [i] at any given time. Our algorithm will be in terms of two parameters ǫ and δ. With 1 − δ probability, we want the error to be within a factor of ǫ. The algorithm is as follows. Pick log( 1δ ) hash functions hj : [n] → [e/ǫ] chosen uniformly at random from a family of pair-wise independent hash functions. We think of hj (i) as a bucket for i corresponding to the jth hash function. We keep a counter for each bucket, count(j, hj (i)). Initially all buckets are empty, or all counters are set to 0. Whenever an item i is inserted, we increment count(j, hj (i)) by 1 for all j. Whenever an item i is deleted, we decrement count(j, hj (i)) by 1 for all j (see Figure 1.1). Our estimation for F [i], denoted by F̃ [i], will be minj count(j, hj (i)). Claim 1. Let kF k = 1. F̃ [i] ≥ F [i]. P i F [i]. 2. F̃ [i] ≤ F [i] + ǫkF k with probability at least 1 − δ. 4 1 h1 e ǫ ··· +1 h2 .. . 2 +1 +1 +1 hlog( 1 ) δ +1 Figure 1.1: Each item is hashed to one cell in each row. Proof. The first part is clear. For the second part, denote by Xji the contribution of items other than i to the (j, hj (i))th cell (bucket) of Figure 1.2. It can be easily shown that ǫ E[Xji ] = kF k. e Then by Markov’s inequality,   Pr F̃ > F [i] + ǫkF k = Pr ∀j F [i] + Xji > F [i] + ǫkF k  = Pr ∀j Xji > eE[Xji ]  log 1/δ 1 ≤ 2 Thus, we conclude that we can estimate F [i] within an error of ǫkF k with probability at least 1 − δ using O((1/ǫ) log(1/δ)) space. Observe that this method is not effective for estimating F [i] for small F [i]. On the other hand, in most applications, one is interested in estimating the frequency of high frequency objects. It is possible to show that the above is tight, with respect to the space requirements, using a reduction from the communication complexity problem Index. In this problem, Alice holds an n bit string x ∈ {0, 1}n and Bob holds a log n bit string y ∈ {0, 1}log n . We view Bob’s input as an integer i ∈ [n]. We consider the one-way probabilistic communication model. Therefore only Alice is allowed to send Bob information. Given the information from Alice, Bob needs to determine the value xi . In this model, it is well known that Alice needs to send Ω(n) bits in order for Bob to determine xi with constant probability greater than 1/2. Lemma 1. 
In order to estimate F [i] within an error of ǫkF k with constant probability, one needs to use Ω(1/ǫ) space. 5 Proof. Given an instance of the Index problem (x, y), where x denotes Alice’s input, y denotes Bob’s input and |x| = n, choose ǫ such that n = 2ǫ1 . Construct the array F [0.. 2ǫ1 ] as follows. If xi = 1 then set F [i] = 2 and if xi = 0 then set F [i] = 0 and increment F [0] by 2 (initially F [0] = 0). With this construction, clearly we have kF k = 1/ǫ. Suppose we can estimate F [i] within error ǫkF k = 1 with constant probability and s space. This means we can determine the value of xi : if the estimate for F [i] is above 1 then xi = 1 and xi = 0 otherwise. Now the Ω(n) = Ω(1/ǫ) lower bound on the communication complexity of Index implies a lower bound of Ω(1/ǫ) for s. Homework: Given a data stream as an array A[1..n], how P can we estimate given another data stream B[1..n], how can we estimate i A[i]B[i]? References: [CM04], [Mut09] 6 P i A[i]2 ? If we are Lecture 2. Streaming Algorithms via Sampling Lecturer: S. Muthu Muthukrishnan Scribes: Faith Ellen and Toniann Pitassi 2.1 Estimating the number of distinct elements This lecture presents another technique for streaming algorithms, based on sampling. Definition 1. Let a1 , a2 , . . . , an denote a stream of items from some finite universe [1..m]. Let Dn = |{a1 , a2 , . . . , an }| be the number of distinct elements in the stream, and let Un be the number of unique items in the stream, i.e. the number of items that occur exactly once. Let F be the frequency vector, where F [i] is the number of times that item i occurs in the stream, for each i ∈ [1..m]. Then Dn is the number of nonzero entries in the frequency vector F and Un is the number of entries of F with value 1. Our goal is to get good estimates for Un /Dn and Dn . 2.1.1 Estimating Un /Dn First we will try to estimate Un /Dn . We assume that n is known. We can easily choose an item uniformly at random from the stream, by choosing each item with probability 1/n. Doing this k times in parallel gives us a sample of size k. The problem with this approach (uniform sampling from the data stream) is that heavy items, i.e. those with high frequency, are likely to appear many times. Since each such item doesn’t contribute to Un and only contributes to Dn once, such a sample is not helpful for estimating Un /Dn . Instead, we would like to be able to sample nearly uniformly from the set of (distinct) items in the stream, i.e. element x is chosen with probability close to 1/Dn . To do this, let h be a permutation of the universe chosen uniformly at random from among all such permutations. The idea is to remember the item s in the stream with the smallest value of h(s) seen so far and the number of times this item has occurred. Specifically, as we see each item ai in the stream, we compute h(ai ). If h(ai ) < h(s) (or i = 1), then s is set to ai and c(s) is set to 1. If h(ai ) = h(s), then increment c(s). If h(ai ) > h(s), then do nothing. Note that, for any subset S of the universe, each item in S is equally likely to be mapped to the smallest value by h among all the elements in S. In particular, each element in the set of items in the stream has probability 1/Dn of being chosen (i.e. mapped to the smallest value by h) and, thus, being the value of s at the end of the stream. At any point in the stream, c(s) is the number of times s has occurred so far in the stream, since we start counting it from its first occurrence. 
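This single-sample procedure is compact enough to write out. Below is a minimal sketch in Python (these notes otherwise contain no code): it tracks the item s with the smallest hash value seen so far together with its count c(s). The explicitly stored random permutation standing in for h is an assumption made only to keep the example self-contained; as discussed next, one would instead use an (approximately) min-wise hash family.

import random

class MinHashSampler:
    """Tracks the stream item s with the smallest h(s) seen so far and its count c(s)."""

    def __init__(self, universe_size, seed=None):
        rng = random.Random(seed)
        # Simulate a uniformly random permutation h of the universe [1..m].
        # Storing a full permutation is for illustration only; in practice one
        # uses an (approximately) min-wise hash family, as discussed below.
        values = list(range(universe_size))
        rng.shuffle(values)
        self.h = {x + 1: values[x] for x in range(universe_size)}
        self.s = None        # current sample
        self.count = 0       # occurrences of s since its first arrival

    def process(self, a):
        if self.s is None or self.h[a] < self.h[self.s]:
            self.s, self.count = a, 1        # new minimum: restart the counter
        elif a == self.s:                    # h(a) == h(s) iff a == s for a permutation
            self.count += 1
        # otherwise h(a) > h(s): do nothing

# Toy usage: s is uniform over the distinct items of the stream,
# and count == 1 exactly when the sampled item is unique.
stream = [5, 3, 5, 7, 3, 3, 9]
sampler = MinHashSampler(universe_size=10, seed=0)
for a in stream:
    sampler.process(a)
print(sampler.s, sampler.count)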
Doing this k times independently in parallel gives us a collection of samples s1 , . . . , sk of size k. We will choose k = O(log(1/δ)/ǫ2 ). Let c1 , . . . , ck be the number of times each of these 7 items occurs in the stream. Our estimate for Un /Dn will be #{i | ci = 1}/k. Since Prob[ci = 1] = Un /Dn , using Chernoff bounds it can be shown that with probability at least (1 − δ), (1 − ǫ)Un /Dn ≤ #{i | ci = 1}/k ≤ (1 + ǫ)Un /Dn . Thus, #{i | ci = 1}/k is a good estimator for Un /Dn . It’s not necessary to have chosen a random permutation from the set of all permutations. In fact, simply storing the chosen permutation takes too much space. It suffices to randomly choose a hash function from a family of functions such that, for any subset of the universe, every element in the subset S has the smallest hashed value for the same fraction (1/|S|) of the functions in the family. This is called minwise hashing and it was defined in a paper by Broder, et al. They proved that any family of hash functions with the minwise property must be very large. Indyk observed that an approximate version of the minwise property is sufficient. Specifically, for any subset S of the universe, each element in the subset has the smallest hashed value for at least a fraction 1/((1 + ǫ)|S|) of the functions in the family. There is a family of approximately minwise hash functions of size nO(log n) , so (log n)2 bits are sufficient to specify a function from this family. An application for estimating Un /Dn comes from identifying distributed denial of service attacks. One way these occur is when an adversary opens many connections in a network, but only sends a small number of packets on each. At any point in time, there are legitimately some connections on which only a small number of packets have been sent, for example, for newly opened connections. However, if the connections on which only a small number of packets have been sent is a large fraction of all connections, it is likely a distributed denial of service attack has occurred. 2.1.2 Estimating Dn Now we want to estimate Dn , the number of distinct elements in a1 , . . . , an ∈ [1..m]. Suppose we could determine, for any number t, whether Dn < t. To get an approximation to within a factor of 2, we could estimate Dn by determining whether Dn < 2i for all i = 1, . . . , log2 m. Specifically, we estimate Dn by 2k , where k = min{i|ci = 0}. If we do these tests in parallel, the time and space both increase by a factor of log2 m. To determine whether Dn < t, randomly pick a hash function h from [1..m] to [1..t]. Let c be the number of items that hash to bucket 1. We’ll say that Dn < t if c = 0 and say that Dn ≥ t if c > 0. To record this as we process the stream requires a single bit that tells us whether c is 0 or greater than 0. Specifically, for each item ai , if h(ai ) = 1, then we set this bit to 1. If Dn < t, then the probability that no items in the stream hash to bucket 1 (i.e. that c = 0) is (1 − 1/t)Dn > (1 − 1/t)t ≈ 1/e. If Dn > 2t, then the probability no items in the stream hash to bucket 1 (i.e. that c = 0) is (1−1/t)Dn < (1−1/t)2t ≈ 1/e2 . More precisely, using a Taylor series approximation, P r[c = 0|Dn ≥ (1 + ǫ)t] ≤ 1/e − ǫ/3 and P r[c = 0|Dn < (1 − ǫ)t] ≥ 1/e + ǫ/3. To improve the probability of being correct, repeat this several times in parallel and take majority answer. This give the following result. Theorem 1. 
It is possible to get an estimate t for Dn using O[(1/ǫ2 ) log(1/δ) log m] words of space such that Prob[(1 − ǫ)t ≤ Dn < (1 + ǫ)t] ≥ 1 − δ. 8 2.2 Extensions for inserts and deletes Exercise Extend the algorithm for approximating the number of distinct items to allow the stream to include item deletions as well as item insertions. The algorithm described above for sampling nearly uniformly from the set of (distinct) items in the stream doesn’t extend as easily to allow deletions. The problem is that if all occurrences of the item with the minimum hash value are deleted at some point in the stream, we need to replace that item with another item. However, information about other items that have appeared in the stream and the number of times each has occurred has been thrown away. For example, suppose in our sampling procedure all of the samples that we obtain happen to be items that are inserted but then later deleted. These samples will clearly be useless for estimating the quantities of interest. We’ll use a new trick that uses sums in addition to counts. Choose log2 m hash functions hj : [1..m] to [1..2j ], for j = 1, . . . , log2 m. For the multiset of items described by the current prefix of the stream, we will maintain the following information, for each j ∈ [1.. log2 m]: 1. Dj′ , which is an approximation to the number of distinct items that are mapped to location 1 by hj , 2. Sj , which is the exact sum of all items that are mapped to location 1 by hj , and 3. Cj , which is the exact number of items that are mapped to location 1 by hj . For each item ai in the stream, if hj (ai ) = 1, then Cj is incremented or decremented and ai is added to or subtracted from Sj , depending on whether ai is being inserted or deleted. The number of distinct elements is dynamic: at some point in the stream it could be large and then later on it could be small. Thus, we have log2 m hash functions and maintain the information for all of them. If there is a single distinct item in the current multiset that is mapped to location 1 by hj , then Sj /Cj is the identity of this item. Notice that, because Sj and Cj are maintained exactly, this works even if the number of distinct items in the current multiset is very large and later becomes 1. Suppose that Dj′ is always bounded below and above by (1 − ǫ) and (1 + ǫ) times the number of distinct items hashed to location 1 by hj , respectively, for some constant ǫ < 1. Then there is only 1 distinct item hashed to location 1 by hj , if and only if Dj′ = 1. If Dj′ = 1, then Sj /Cj can be returned as the sample. If there is no j such that Dj = 1, then no sample is output. If the hash functions are chosen randomly (from a good set of hash functions), then each distinct item is output with approximately equal probability. Instead of getting just one sample, for many applications, it is better to repeat this (1/ǫ2 ) log(1/δ) times in parallel, using independently chosen hash functions. We’ll call this the sampling data structure. Yesterday, we had an array F [1..m] keeping track of the number of occurrences of each of the possible items in the universe [1..m]. We calculated the heavy hitters (i.e. items i whoseP number of occurrences, F [i], is at somePconstant fraction of the total number of occurrences, m i=1 F [i]) Pleast m 2 F [i] , and quantiles. Today, we estimated the number of F [i], and estimated F [i], m i=1 i=1 distinct elements, i.e., #{i | F (i) > 0}. 
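The sampling data structure of Section 2.2 can be made concrete. The Python sketch below keeps, for each level j, the exact sum S_j and exact count C_j of the items hashing to bucket 1 of h_j. Two simplifications are assumptions of this illustration only: the per-level distinct count is tracked exactly (the lecture's point is that an approximate estimate D'_j suffices), and a salted built-in hash stands in for a properly chosen random hash family.

import random
from collections import Counter

class SamplingWithDeletions:
    """The sums-and-counts structure of Section 2.2.

    For each level j, items hashing to bucket 1 of h_j : [1..m] -> [1..2^j]
    contribute to an exact sum S_j and an exact count C_j.  When some level
    currently contains exactly one distinct item, that item is recovered as
    S_j / C_j.
    """

    def __init__(self, m, seed=None):
        rng = random.Random(seed)
        self.levels = max(1, m.bit_length())                 # about log2(m) levels
        self.salts = [rng.getrandbits(32) for _ in range(self.levels)]
        self.S = [0] * self.levels                           # exact sums
        self.C = [0] * self.levels                           # exact counts
        self.support = [Counter() for _ in range(self.levels)]  # illustration only

    def _in_bucket_one(self, j, x):
        # Stand-in for "h_j(x) == 1"; an item survives level j with probability ~ 2^-(j+1).
        return hash((self.salts[j], x)) % (2 ** (j + 1)) == 0

    def update(self, x, delta):
        """delta = +1 to insert item x, delta = -1 to delete one copy of x."""
        for j in range(self.levels):
            if self._in_bucket_one(j, x):
                self.S[j] += delta * x
                self.C[j] += delta
                self.support[j][x] += delta
                if self.support[j][x] == 0:
                    del self.support[j][x]

    def sample(self):
        """Return a distinct item of the current multiset if some level isolates
        exactly one distinct item, and None otherwise."""
        for j in range(self.levels):
            if len(self.support[j]) == 1:                    # D'_j == 1
                return self.S[j] // self.C[j]
        return None

# Toy usage: items that are inserted and then fully deleted cannot be sampled.
sketch = SamplingWithDeletions(m=64, seed=1)
for x in [7, 7, 12, 33]:
    sketch.update(x, +1)
sketch.update(12, -1)
print(sketch.sample())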
The following definition gives a more succinct array 9 for answering many of the questions that we’ve looked at so far (i.e., distinct elements, quantiles, number of heavy hitters.) Definition 2. Let I[1..k] be an array, where I[j] is the number of items that appear j times, i.e. the number of items with frequency j, and k ≤ n is the maximum number of times an item can occur. For example, I[1] is the number of unique Pk items, items that appear exactly once. Heavy hitters are items that have frequency at least φ i=1 I[i], for some constant φ. We’d like to apply the CM sketch directly to the I array. The problem is how to update I as we see each successive item in the stream. If we know how many times this item has previously been seen, we could decrement that entry of I and increment the following entry. However, we don’t know how to compute this directly from I. The sampling data structure as described above, which can be maintained as items are added and deleted, allows the entries of the I array to be approximated. 2.3 Homework Problems 1. A (directed or undirected) graph with n vertices and m < n2 distinct edges is presented as a stream. Each item of the stream is an edge, i.e. a pair of vertices (i, j). Each edge may occur any number of times in the stream. Edge deletions do not occur.PLet di be the number of distinct neighbors of vertex i. The goal is to approximate M2 = i d2i . It is called M2 since it is analogous to F2 from yesterday. The key difference is that M2 only counts a new item if it is distinct, i.e. it hasn’t appeared before. √ The best known algorithm for this problem uses space O((1/ǫ4 ) n log n). It can be obtained by combining two sketches, for example, the CM sketch and minwise hashing. (In general, the mixing and matching of different data structures can be useful.) The solution to this problem doesn’t depend on the input being a graph. The problem can be viewed as an array of values, where each input increments two array entries. Although the space bound is sublinear in n, we would like to use only (log n)O(1) space. This is open. 2. Sliding window version of sampling: Input a sequence of items, with no deletions. Maintain a sample uniformly chosen from among the set of distinct items in the last w items. The space used should be O(log w). Note that if minwise hashing is used and the last copy of the current item with minimum hashed value is about to leave the window, a new item will need to be chosen. 10 Lecture 3. Some Applications of CM-Sketch Lecturer: S. Muthu Muthukrishnan Scribes: Arkadev Chattopadhyay and Michal Koucký 3.1 Count-Min Sketch Prelude: Muthu is a big fan of movies. What we will see today is like the movie “The Usual Suspects” with Kevin Spacey: 12 years of research fit into one sketch. It will also take some characteristics of another great movie “Fireworks” by Takashi Beat Kitano. That movie has three threads which in the end meet. This lecture will have three threads. Problem 1 (from yesterday): Sort an array A[1, . . . , n] of log2 n-bit integers in place in linear time. √ √ Solution idea: With a bit of extra space, say O( n), one could run n-way radix sort to sort the array in O(n) time. Where do we get this extra space? Sort/re-arrange the elements according to the highest order bit. Now, we can save a bit per element by representing the highest order bit implicitly. This yields O(n/ log n) space to run the radix sort. The details are left to the reader. There are also other solutions. 
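As an illustration of the space-stealing idea in this solution, the Python sketch below carries out only the first step: it rearranges the array by the highest-order bit and then clears that bit, so the bit becomes implicit in an element's position and the freed n bits (about n/log n words) are available as working space for the radix sort. The radix sort itself is omitted, as in the notes.

def free_high_bits(A, bit_width):
    """Rearrange A so that all elements with high-order bit 0 precede those with
    high-order bit 1, then clear that bit.  Afterwards the high bit of every
    element is implied by its position relative to the returned boundary, so
    those n freed bits (about n / log n words) can serve as working space for
    the radix sort."""
    hi = 1 << (bit_width - 1)
    i, j = 0, len(A) - 1
    while i <= j:                       # in-place partition on the high bit
        if A[i] & hi == 0:
            i += 1
        else:
            A[i], A[j] = A[j], A[i]
            j -= 1
    boundary = i                        # A[:boundary] had high bit 0, A[boundary:] had high bit 1
    for k in range(len(A)):             # the bit is now redundant, so reclaim it
        A[k] &= ~hi
    return boundary

# Toy usage with 4-bit values: elements >= 8 move past the boundary and lose bit 3.
A = [9, 3, 14, 2, 8, 5]
print(free_high_bits(A, bit_width=4), A)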
Problem 2: We have a stream of items from the universe {1, . . . , n} and we want to keep a count F [x] of every single item x. We relax the problem so that we do not have to provide a precise count but only some approximation F̃ [x]: F [x] ≤ F̃ [x] ≤ F [x] + ǫ n X F [i]. i=1 Solution: For t that will be picked later, let q1 , . . . , qt are the first t primes. Hence, qt ≈ t ln t. We will keep t arrays of counters Fj [1, . . . , qj ], j = 1, . . . , t. All the counters will be set to zero at beginning and whenever an item x arrives we will increment all counters Fj [x mod qj ] by one. Define F̃ [x] = minj=1,...,t Fj [x mod qj ]. Claim 2. For any x ∈ {1, . . . , n}, n log2 n X F [x] ≤ F̃ [x] ≤ F [x] + F [i]. t i=1 Proof. The first inequality is trivial. For the second one note that for any x′ ∈ {1, . . . , n}, x′ 6= x, x′ mod qj = x mod qj for at most log2 n different j’s. This is implied by Chinese Reminder Theorem. Hence, at most log2 n counters corresponding to x may get incremented as a result of an arrival of x′ . Since this is true for all x′ 6= x, the counters corresponding to x may get over-counted 11 P by at most log2 n · x′ ∈{1,...,n}\{x} F [x′ ] in total. On average they get over-counted by at most log2 n P · x′ ∈{1,...,n}\{x} F [x′ ], so there must be at least one of the counters corresponding to x that t gets over-counted by no more than this number. 2 We choose t = logǫ2 n . This implies that we will use space O(t2 log t) = O( logǫ2 n log log n), where we measure the space in counters. This data structure is called Count-Min Sketch (CM sketch) and was introduced in G. Cormode, S. Muthukrishnan: An improved data stream summary: the count-min sketch and its applications. J. Algorithms 55(1):58-75 (2005). It is actually used in Sprinter routers. Intermezzo: Al Pacino’s second movie: Godfather (depending on how you count). Problem 3: We have two independent streams of elements from {1, . . . , n}. Call the frequency (count) of items in one of them A[1, . . . , n] and B[1, . . . , n] in the other one. Estimate X = Pn i=1 A[i] · B[i] with additive error ǫ · ||A||1 · ||B||1 . Solution: Again we use CM sketch for each of the streams: T A = (TjA [1, . . . , qj ])j=1,...,t and T B = (TjB [1, . . . , qj ])j=1,...,t , and we output estimate X̃ = min j=1,...,t X k=1 Claim 3. X ≤ X̃ ≤ X + TjA [k] · TjB [k]. log2 n ||A||1 · ||B||1. t Proof. The first inequality is trivial. For the second one note again that for any x, x′ ∈ {1, . . . , n}, x′ 6= x, x′ mod qj = x mod qj for at most log2 n different j’s. This means that the term P ′ A[x] · B[x ] contributes only to at most log2 n of the sums k=1 TjA [k] · TjB [k]. Hence again, the total over-estimate is bounded by log2 n · ||A||1 · ||B||1 and the average one by logt2 n ||A||1 · ||B||1. Clearly, there must be some j for which the over-estimate is at most the average. Choosing t = log2 n ǫ 2 gives the required accuracy of the estimate and requires space O( logǫ2 n log log n). Intermezzo: Mario is not happy: for vectors A = B = (1/n, 1/n, . . . , 1/n) the error is really large compare to the actual value. Well, the sketch works well for vectors concentrated on few elements. P Problem 4: A single stream A[1, . . . , n] of elements from {1, . . . , n}. Estimate F2 = ni=1 (A[i])2 . Solution: The previous problem provides a solution with an additive error ǫ||A||21. We can do better. So far, our CM sketch was deterministic, based on arithmetic modulo primes. In general one can take hash functions h1 , . . . , ht : {1, . . . , n} → {1, . . . 
, w} and keep a set of counters Tj [1, . . . , w], j = 1, . . . , t. On arrival of item x one increments counters Tj [hj (x)], j = 1, . . . , t. The hash functions hj are picked at random from some suitable family of hash functions. In such a case one wants to guarantee that for a given stream of data, the estimates derived from this CM sketch are good with high probability over the random choice of the hash functions. For the problem of estimating F2 we will use a family of four-wise independent hash functions. Our sketch 12 will consists of counters Tj [1, . . . , w], j = 1, . . . , t, for even w. To estimate F2 we calculate for each j w/2 X Yj = (Tj [2k − 1] − Tj [2k])2 , k=1 and we output the median X̃ of Yj ’s. Claim 4. For t = O(ln 1δ ) and w = 1 , ǫ2 |X̃ − F2 | ≤ ǫF2 with probability at least 1 − δ. Proof. Fix j ∈ {1, . . . , t}. First observe that E[Yj ] = F2 . To see this let us look at the contribution of terms A[x] · A[y] for x, y ∈ {1, . . . , n} to the expected value of Yj . Let us define a random variable fx,y so that for x 6= y, fx,y = 2 if hj (x) = hj (y), fx,y = −2 if hj (x) = 2k = hj (y) + 1 or hj (y) = 2k = hj (x) + 1 for some k, and fx,y = 0 otherwise. For x = y, fx,y = 1 always. Notice for x 6= y, fx,y = 2 with P probability 1/w and also fx,y = −2 with probability 1/w. It is straightforward to verify that Yj = x,y fx,y A[x] · A[y]. Clearly, if x 6= y then E[fx,y ] = 0. By linearity of expectation, E[Yj ] = X x,y E[fx,y ] · A[x] · A[y] = F2 . Now we show that V ar[Yj ] ≤  V ar[Yj ] = E  X x,y 8 2 F . w 2 fx,y A[x] · A[y] − X x !2  X fx,y A[x] · A[y]  = E  = E " !2  A[x] · A[x]  x6=y X x6=y,x′ 6=y ′ # fx,y · fx′ ,y′ · A[x] · A[y] · A[x′ ] · A[y ′ ] . For (x, y) 6= (x′ , y ′), x 6= y, x′ 6= y ′ E [fx,y · fx′ ,y′ · A[x] · A[y] · A[x′ ] · A[y ′ ]] = 0 13 because of the four-wise independence of hj . For (x, y) = (x′ , y ′), x 6= y, x′ 6= y ′   1 1 ′ ′ · 4 + · 4 A[x]2 · A[y]2 E [fx,y · fx′ ,y′ · A[x] · A[y] · A[x ] · A[y ]] = w w 8 2 · A[x] · A[y]2. = w Hence, 8 V ar[Yj ] ≤ w · X A[x]2 x !2 Applying Chebyshev’s inequality, and the fact that w = = 1 ǫ2 8 2 F . w 2 we get,   1 Pr |Yj − F2 | ≥ ǫF2 ≤ 8 Since each hash function is chosen independently, we can apply Chernoff bound to conclude that taking O(log(1/δ)) hash functions is enough to guarantee that the median of the Yj ’s gives an approximation of F2 with additive error less than ǫF2 with probability at least 1 − δ. Definition: Let A[1, . . . , n] be Pnthe count of items in a stream. For a constant φ < 1, item i is called a φ-heavy hitter if A[i] ≥ φ j=1 A[j]. Problem 5: Find all φ-heavy hitters of a stream. Solution: First we describe a procedure that finds all φ-heavy hitters given access to any sketching method. In this method, we form log n streams B0 , . . . Blog n−1 in the following way: i Bi [j] = j2 X A[k] k=(j−1)2i +1 This means that Bi [j] = Bi−1 [2j − 1] + Bi−1 [2j]. When a new element arrives in stream A, we update simultaneously the sketch of each Bi . Finally, in order to find φ-heavy hitters of A, we do a binary search by making hierarchical point queries on the log n streams that we created, in thePfollowing way: we start at Blog n−1 . We query Blog n−1 [1] and Blog n−1 [2]. If Blog n−1 [1] ≥ φ nk=1 A[k] = T (say), then we recursively check the two next level nodes Blog n−2 [1] and Blog n−2 [2] and so on. In other words, the recursive procedure is simply the following: if Bi [j] ≥ T , then descend into Bi [2j − 1] and Bi [2j]. If Bi [j] < T , then this path of recursion is terminated. 
If i = 0, and Bi [j] ≥ T , then we have found a heavy hitter. Clearly, this procedure finds all heavy hitters if the point queries worked correctly. The number of queries it makes can be calculated in the following way: for each i, Bi can have at most 1/φ heavy hitters and the algorithm queries at most twice the number of heavy-hitters of a stream. Thus, at most (2 log n/φ) point queries are made. 14 If we implement the probabilistic version of CM-sketch, as described in the solution to Problem 4 above, it is not hard to see that each point-query can be made to return an answer with positive additive error bounded by ǫ, with probability 1 − δ, by using roughly log(1/δ) pairwise1 independent hash functions, where each hash function has about O(1/ǫ) hash values. Such a sketch uses  1 O ǫ log(1/δ) space. For the application to our recursive scheme here for finding heavy hitters, we want that with probability at least (1 − δ), none of the at most (2 log n/φ) queries fail2 . Thus, using probabilistic  2 log n CM-sketch with space O 1ǫ log n log φδ and probability (1 − δ), we identify all φ-heavy hitters and not return any element whose count is less than (φ − ǫ)-fraction of the total count. Reference: [CM04] 1 Note for making point queries we just need pairwise independence as opposed to 4-wise independence used for estimating the second moment in the solution to Problem 4 before. 2 A query fails if it returns a value with additive error more than an ǫ-fraction of the total count. 15 Lecture 4. Graph and Geometry Problems in the Stream Model Lecturer: Andrew McGregor Scribes: Matei David and François Lemieux In the lectures so far, we considered numerical data streams. In this lecture, we consider streams describing graphs and metric spaces. A graph stream is a stream of edges E = {e1 , e2 , . . . , em } describing a graph G on n vertices. A geometric stream is a stream of points X = {p1 , p2 , . . . , pm } from some metric space (χ, d). We’re now interested in estimating properties of G or X, e.g., diameter, shortest paths, maximal matchings, convex hulls. This study is motivated by practical questions, e.g., edges in the graph can be pairs of IP addresses or phone numbers that communicate. In general, m is the number of edges. Unless stated otherwise, we’ll assume each edge appears only once in the stream. We’re interested in both directed and undirected graphs. We’re using Õ in our bounds, hiding dependence on polylogarithmic terms in m and n. Further, we assume single points can be stored in Õ(1) space and that the distance d(p1 , p2 ) can be computed easily if both p1 and p2 are stored in memory. The specific problems we consider are counting the number of triangles in a graph (Section 4.1), computing a matching in a graph (Section 4.2), clustering points (Section 4.3), and computing graph distances (Section 4.4). 4.1 Counting Triangles The Problem. Let T3 denote the number of triangles in a graph G. When G is presented as a stream of edges, we’re interested in estimating T3 up to a factor of (1 + ǫ) with probability 1 − δ, given the promise that T3 > t for some t. Warmup. We start with a simple algorithm using Õ(ǫ−2 (n3 /t) log δ −1 ) space. Note, this only improves on keeping all edges in O(n2 ) space when t = ω(n). 1. 2. 3. 4. pick 1/ǫ2 triples (u1 , v1 , w1), (u2 , v2 , w2 ), . . . 
as edges stream by, check that all 3 edges in every triple are present estimate T3 by the number of triples for which we found all 3 edges repeat steps 1-3 for log δ −1 times (in parallel), output the average of the estimates Note that the probability (ui , vi , wi ) is a triangle in G is precisely T3 / bounds yield the desired correctness bounds. n 3  . Standard Chernoff Theorem 2. To determine whether T3 > 0, Ω̃(n2 ) space is required, even for randomized algorithms. 16 Proof. We give a reduction from 2-player Set-Disjointness: Alice and Bob have a n × n matrices A, B, and they are trying to determine if ∃i, j such that A(i, j) = B(i, j) = 1. By [Raz92], this requires Ω(n2 ) bits even for protocols that are correct with probability 3/4. Suppose there is a space s algorithm for determining if T3 > 0. Let G be a graph on 3n vertices, with V = {u1 , . . . , un , v1 , . . . , vn , w1 , . . . , wn }, and initial edges E = {(ui, vi ) : i ∈ [n]}. Alice adds edges {(ui , wj ) : A(i, j) = 1}, and Bob adds edges {(vi , wj ) : B(i, j) = 1}. Alice starts simulating the algorithm until it processes the initial edges and her own, then communicates the memory of the algorithm to Bob, using s bits. He continues the simulation, eventually obtains the output of the algorithm, and announces it using one more bit. For correctness, observe that G contains a triangle (i.e., T3 > 0) iff the inputs to the protocol intersect. Observe that the lower bound works even for algorithms that are allowed several passes over the input stream. Theorem 3 (Sivakumar et. al). There is an algorithm using space Õ(ǫ−2 (nm/t)2 log δ −1 ). Proof. The algorithm reduces this problem to that of computing frequency moments of a related stream. Given the graph stream σ, construct a new stream σ ′ as follows: for every edge (u, v), generate all triples (u, v, w) for w ∈ V \ {u, v}. Denote by Ti the number of triples in V for which exactly i edges are present in G. Observe that the k-th frequency moment of σ ′ is X Fk (σ ′ ) = (#(u, v, w))k = 1 · T1 + 2k · T2 + 3k · T3 , (u,v,w) and that 1 3 · F1 + · F2 . 2 2 Hence, good approximations for F0 , F1 , F2 suffice to give an approximation for T3 . T3 = F0 − Theorem 4 (Buriol et. al). There is an algorithm using space Õ(ǫ−2 (nm/t) log δ −1 ). Proof. We can obtain a better algorithm using the following idea. 1. pick an edge ei = (u, v) uniformly at random from the stream 2. pick w uniformly at random from V \ {u, v} 3. if ej = (u, w) and ek = (v, w) for j, k > i exist, return 1, else return 0 To obtain an algorithm, we run this basic test many times in parallel, and we output the average of these runs, scaled by a certain amount. 4.2 Maximum Weight Matching The Problem. We now consider the Maximum Weight P Matching problem: Given a stream of weighted edges (e, we ), find M ⊂ E that maximizes e∈M we such that no two edes in M share an endpoint. 17 Warmup. Let’s us first find a 2-approximation for the unweighted case using only Õ(n) space. Given each edge (e, we ) from the stream, we must decide if we add it to our current matching. For , we consider all previously choosen edges that share an end point with (e, we ) and we compute the sum v of their weights. If we > v then we remove these edges from M and replace them with (e, we ). It is a simple exercice to show that the weight OPT of the optimal solution is at most twice the weight of any maximal matching. We will sketch the proof of the following result from [McG05]: √ Theorem 5. 
There is a 3+ 2-approximation algorithm for the Maximal Weight Matching problem that uses Õ(n) space . Before giving the algorithm, we mention that result has been improved by a series of recent results: 5.59... [Zel08] and 5.24... [Sar09] and that it is an open question to prove a lower bound or a much better result. Let γ be some parameter. The algorithm is the following: • At all time, we maintain a matching M • On seeing an edge (e, we ), suppose that e′ ∈ M and (maybe) e′′ ∈ M have a common end point with e • If we ≥ (1 + γ)(we′ + we′′ ) then replace e′ and e′′ by e in M. For the analysis, we use the following (macabre) definitions to describe the execution of the algorithm: • An edge e kills and edge e′ if e′ was removed from current matching when e arrived. • We say an edge is a survivor if it is in the final matching. • For survivor e, the trail of the deads is T (e) = C1 ∪ C1 ∪ · · · , where C0 = e and [ {edges killed by e′ } Ci = e′ ∈Ci−1 For any set of edges S we define w(S) = P e∈S we , where we is the weight of the edge e Lemma 2. Let S be the set of survivors and w(S) be the weight of the final matching. 1. w(T (S)) ≤ w(S)/γ 2. OPT ≤ (1 + γ)(w(T (S)) + 2w(S)) √ Put together this give OPT ≤ (1/γ + 3 + 2γ)w(S) and γ = 1/ 2 gives Theorem 5. Proof. 1. Observe first that the T (e) are disjoints. Hence, it suffices to observe that for each e ∈ S we have: X (1 + γ)w(T (e)) = (1 + γ)w(Ci ) = w(T (e) + we ) i≥1 18 2. We can charge the weights of edges in OPT to S ∪ T (S) such that each edge e ∈ T (S) is charged at most (1 + γ)w(e) and each edge e ∈ S is charged at most 2(1 + γ)w(e). More details are given in [FKM+ 05]. 4.3 K-Center Clustering Due to the lack of time, the topics discussed in this section and the next one have only been sketched. The Problem. We are given an integer k, and a stream of n distinct points X = (p1 , p2 , . . . , pn ) from a metric space (χ, d). We need to find a set of k points Y ⊆ X that minimizes maxi miny∈Y d(pi , y). Since we need to output k points, we consider the case where we have Ω(k) memory to store them. Warmup. The standard Greedy algorithm for this problem works in small space, and it obtains a 2-approximation if given the optimal value, OP T : set radius to 2 · OP T , then pick as a centre any node which is not covered by the previous centres. If only given bounds a ≤ OP T ≤ b on the optimal radius, one can obtain a (2 + ǫ) approximation algorithm by running the original algorithm in parallel with several values for OP T : a, (1 + ǫ)a, (1 + ǫ)2 a, . . . , b. This requires space Õ(k log(1+ǫ) (b/a)), which is not good when b/a is large. Theorem 6. [MK08a, Guh09] There exists a (2+ǫ)-approximation algorithm using space Õ(kǫ−1 log ǫ−1 ). 4.4 Distance Estimation in a Graph The Problem. We are given a stream with the (unweighted) edges from a graph G. This defines the shortest path metric dG : V ×V (where dG (u, v) is the shortest path between u and v in G.) The problem is to estimate dG (u, v) for some vertices u, v. We can consider the problem where u, v are known in advance of seeing the graph stream, and also when they are not known in advance. A common method for approximating graph distance is via the construction of a spanner. Definition 3. Given a graph G = (V, E), a t-spanner of G is a graph H = (V, E ′ ) such that for all u, v, dG (u, v) ≤ dH (u, v) ≤ t · dG (u, v). Theorem 7. [Elk07, Bas08] There is an algorithm that accept as input a stream of edges from a graph and that computes a 2t − 1 spanner using Õ(n1+1/t ) space. 19 Lecture 5. 
Functional Approximation Lecturer: S. Muthu Muthukrishnan Scribes: Laszlo Egri and Phuong Nguyen 5.1 Setting Let D be a subset of RN , called a dictionary. Given an input vector A ∈ RN and an natural number b ≥ 1, we wish to find b vectors D1 , D2 , . . . , Db from D so that min {kA − α1 ,α2 ,...,αb b X i=1 αi Di k2 : αi ∈ R for 1 ≤ i ≤ b} (5.1) is minimal. For each subset {D1 , D2 , . . . , Db } of D, the quantity (5.1) always exists (it is the distance from the vector A to the subspace generated by {D1 , D2 , . . . , Db }) and is also called the error of the subset {D1 , D2 , . . . , Db }. So here we are asking for the subset of size b with smallest error. For example, if b = N and D contains a basis D1 , D2 , . . . , DN for RN , then we can take this basis as our output. The minimal value for (5.1) for this output is 0 and is achieved by taking αi so that αi Di is the projection of A along the corresponding basis vector Di . For another example, suppose that D consists of an orthonormal basis for RN , and b < N. Then the error (5.1) is minimal when we take D1 , D2 , . . . , Db to be the b unit vectors with largest projections of A, and αi = A · Di . In the second example, computing a projection of A takes time O(N ). So the naive algorithm that computes all projections of A and then chooses b largest among them takes time O(N 2 ). The basic question is whether we can improve on this running time. We will show later that if D is some special basis (e.g., the Haar wavelet) then we need only linear time. In practice, for an application (e.g., audio, video, etc.) it is important to find the “right” basis that is suitable to the common kind of queries (i.e., A and b). 5.2 Arbitrary Dictionary We show that the general setting, where the dictionary is arbitrary, it is NP-hard even to estimate whether the minimal error (5.1) is 0. We do this by reducing the estimation problem to the exact set cover problem. Let U = {1, 2, . . . , n}, U is called the set of ground elements. Given a collection S1 , S2 , . . . , Sm of subsets of U and a number d ≤ n, the exact set cover problem is to find an exact cover of size ≤ d, i.e., a collection of d pairwise disjoint subsets Sk1 , Sk2 , . . . , Skd such that d [ Sk i = U i=1 20 (The set cover problem is defined in the same way but the subsets Ski are not required to be pairwise disjoint.) The problem is NP-hard to approximate, in the sense that for any given constant η < 1, there is a reduction from SAT to exact set cover so that • a YES instance of SAT results in an instance of exact set cover with a cover of size ≤ ηd, • a NO instance of SAT produces an instance of exact set cover with no cover of size ≥ d. We represent the inputs S1 , S2 , . . . , Sm to the exact set cover problem by an n × m matrix M, where Mi,j = 1 iff i ∈ Sj (Thus the j-th column of M is the characteristic vector of Sj .) For the reduction, consider the dictionary consisting of the characteristic vectors (also denoted by Sj ) of Sj (for 1 ≤ j ≤ m), A = ~1 (the all-1 vector of length n) and b = ηd. It is easy to see that an exact cover of size s ≤ b = ηd gives rise to a subset {D1 , D2 , . . . , Db } such that A= s X Di i=1 (Here we take α1 = α2 = . . . = αs = 1, αs+1 = . . . = αb = 0 The vectors Di are precisely those Ski that belong to the exact cover.) Consequently, if there is an exact cover of size ηd, then the minimal value of (5.1) is 0. Otherwise, if there are no exact cover of size ≥ d, then the error (5.1) of each subset {D1 , D2 , . . . 
, Db } is always at least the distance from A to the subspace generated by {D1 , D2 , . . . , Db }. Let h be the smallest distance from A to any subspace generated by b vectors in S, then h > 0. The error (5.1) is at least h, and hence is strictly greater than 0. 5.3 An orthogonal basis: Haar wavelet Suppose that N = 2k . An orthogonal basis for RN based on the Haar wavelet can be described as follows. Consider a fully balanced binary tree T of depth k with N = 2k leaves and N − 1 inner nodes. Each inner node in T is labeled with an integer n, 1 ≤ n < N in the following way: the root is labeled with 1, and for 2 ≤ n < N with binary notation n = nt−1 nt−2 . . . n0 (where 2 ≤ t ≤ k) the node labeled by n is the other endpoint of the path that starts at the root and follows the direction specified by the bits nt−2 , . . . , n0 . (Here if nt−2 is 0 then we follow the left child of the root, otherwise if nt−2 is 1 then we follow the right child of the root, etc.) 21 Number the leaves of T from left to right with 1, 2, . . . , N. For each n (1 ≤ n < N) the basic vector wn : wn = (u1 , u2 , . . . uN ) is defined so that for all index i:   0 ui = 1   −1 if i is not a descendant of n if i is a descendant of the left child of n if i is a descendant of the right child of n More precisely, For a node n, 1 ≤ n < N, let ℓ(n) (resp. r(n)) be the left-most (resp. right-most) leaf descendant of n, and t(n) be the number of the right-most leaf descendant of the left child of n. (Note that ℓ(n) ≤ t(n) < r(n).) Now define the basic vector wn = (0, 0, . . . , 0, 1, 1, . . . , 1, −1, −1, . . . , −1, 0, 0, . . . , 0) (5.2) where the 1’s are from position ℓ(n) to t(n), and the −1’s are from position t(n) + 1 to r(n). For example, for the root: w1 = (1, 1, . . . , 1, −1, −1, . . . , −1) where the first half of the coordinates are 1 and the other half are -1. For another example, w2k−1 = (1, −1, 0, 0, . . . , 0), wN −1 = (0, 0, . . . , 0, 1, −1) Finally, define wN = (1, 1, . . . , 1) It is easy to verify that hwn1 , wn2 i = 0 for all n1 6= n2 . So we have a set of N vectors that are orthogonal, and hence they form a basis for RN . We also assume that each vector wn is normalized so we get an orthonormal basis. 5.4 An efficient way of finding the b Haar wavelets with the largest projections Now, given a query A and b < N, we need to find a subset of basis vectors D1 , D2 , . . . , Db so that (5.1) is minimal. This is equivalent to finding such a subset D1 , D2 , . . . , Db that the projection of A on the subspace generated by D1 , D2 , . . . , Db is maximum. We will show that this can be done in time O(N log N) (better than the obvious algorithm that takes time Ω(N 2 )). Indeed, the running time can be improved to linear in N, but we will not prove it here. The inner nodes in the tree T above consist of k layers. For example, the root alone makes up the first layer, and all nodes n where n ≥ 2k−1 make up the k-th layer. Because the Haar basis is an orthogonal basis, we can solve the problem stated in the Section 5.1 by finding the b largest projections. We can do this by calculating the inner product with each of these N vectors. This would take O(N 2 ) time. The question is whether we can do it 22 faster. So far we have been looking at items that have been either inserted or deleted. We can now look at a simpler streaming model where there is a vector and you are looking at the vector left to right, just reading one character at a time. 
In other words, we are inserting the i-th component, the i + 1-th component, and so on. So we are going to take a signal A with N components, read it left to right, keep computing something which in the end will give us the b largest wavelet coefficients. More precisely, the idea comes from the following observation. For 1 ≤ i ≤ N let Ai be the vector A at time i, i.e. the vector whose first i coordinates are the same as that of A, and the remaining coordinates are 0. In other words, if A = (A[1], A[2], . . . , A[N]), then Ai = (A[1], . . . , A[i], 0, 0, . . .) Consider the path pi from the root to leaf i in the tree described above. Observe that if n is to the left of this path (i.e. r(n) < i), then the projection of A on wn is determined already by Ai : hA, wn i = hAi , wn i Thus, the high level idea of the algorithm is to compute recursively for i = 1, 2, . . . , N the b basis vectors wn , where n is to the left of the path pi , that give the largest value of hAi , wn i. For this, we will also have to maintain the dot products hAi , wm i for every node m that lie on the current path pi . Observe that to keep track of this information we need O(b + log(N)) space. Consider now inserting the (i + 1)-th element A[i + 1] (i.e. the (i + 1)-st step in the algorithm). Let pi+1 be the path from the root to the (i + 1)-th leaf node. We want to compute the inner product of the partial input vector (i.e. when only the first i + 1 components have been inserted) with each vector w corresponding to a node on the path pi+1 . For simplicity, we assume that the entries of the Haar basis vectors are 0, −1, 1, but note that actually, the Haar wavelets are normalized. There are three things to observe: 1. Observe that by the definition of the Haar wavelet basis, if n is on the path pi+1 then the (i + 1)-th component of wn is either 1 or −1. Assume that w is a node of both pi and pi+1 . In this case, if the (i + 1)-th element of wn is a 1 (−1), then hAi+1 , wn i = hAi , wn i ± A[i + 1] So to update hAi , wn i we simply need to add (subtract) A[i + 1] to (from) the current value. If w is a “new” node, i.e. it does not appear in pi , then hAi+1 , wn i = A[i + 1]. 2. Intuitively, the path we consider in the binary tree at each step is “moving” left to right. Consider a wavelet vector w ′ that corresponds to a node of the binary tree that is to the left of pi+1 . More formally, assume that for some j < i + 1, w ′ corresponds to a node n of pj , but n is not in pi+1 . Then the (i + 1)-th component of w ′ is 0 by the definition of the Haar wavelet basis and therefore the inner product of w ′ with A is not affected by A[i + 1]. 3. The inner product of A at time i with wavelet vectors that correspond to nodes which did not yet appear in any path is 0. 23 To keep track of the b largest coefficients, we need space O(b). We can use a heap to store this information. Observe that the time required is O(N log N) since N elements are inserted and whenever an element is inserted, we need to update O(log(N)) inner products along the path from the root to the i-th leaf. We note that there are other ways to do this in linear time. We consider now a slightly different problem. We want to place the previous problem into a real streaming context. In other words, we consider the input vector A and we allow insertions and deletions at any point. We are looking at the frequency of items, i.e. A[i] represents the frequency of item i. The query is to find the best b-term representation using the Haar wavelet basis. 
We discuss some informal ideas Muthu gave about this problem. Clearly, if we get information about A from left to right, we can just do what we did before. But that is not very useful. So observe that any time we update a particular element, it corresponds to updating the coefficients of log(N) wavelet vectors. We have N wavelet basis vectors, so consider the vector W that stores the coefficients of the basis vectors when the signal A is expressed in the Haar wavelet basis. Now you can think of updating an element in A as updating log(N) elements in W . In a sense, now we are facing the heavy hitter problem, i.e. we need the b largest elements of W . Using techniques we have seen before, it is possible to find b wavelet vectors whose linear combination (where the e such that the following coefficients are the inner products of the wavelet vectors with A) is R, OP T OP T e holds: k A − R k≤k A − Rb k +ǫ k A k2 , where Rb is the best b-term representation of A. e k≤ (1 + ǫ) k A − ROP T k. This There is also another algorithm that guarantees that k A − R b algorithm is more complicated and Muthu gave only the high-level intuition. The difficulty with directly getting the top k elements as A gets updated is the following. We can get the large ones using a CM sketch. But if there are a lot of “medium” ones, we will not be able to get them in a small space. But you can get them if you estimate the large ones, subtract it out from the signal (use linearity of CM sketch), look at the remaining signal. (You are not estimating the large ones exactly, so the rest of the signal has some error.) Try to find the heavy hitters again. You repeat and the residual error keeps on going down. With a reasonable amount of iteration, you will get the estimation. At a high level, it is a greedy algorithm. 5.5 A variation on the previous problem Let D be a dictionary consisting of Haar wavelets. Let A be a signal with N components. Let b be the number of vectors we want to combine: X Rb = αi Di . D1 ,...,Db ∈D In the previous sections, we wanted to minimize the error 2 k A − Rb k = N X i=1 (A[i] − Rb [i])2 . The following problem is open (in 2009 March). As before, the basis is the Haar wavelet basis. There is another vector π with positive entries that is also part of the input. The vector π has the 24 same number N of entries as the signal vector A and it is normalized to 1, i.e. problem is to minimize the “weighted error”: N X i=1 PN i=1 π[i] = 1. The π(i)(A[i] − Rb [i])2 . (Note that if all the entries of π are equal, then this is just the previous problem.) The problem is well-motivated: these representations are used to approximate signals. The question is that when you approximate a signal what do you do with it? In the database context, for example, people often look at queries for individual items. So usually databases keep record which items are asked more often than others, and this is what the vector π corresponds to. Some informal aspects of the problem: Recall that in the original problem, the coefficients were the inner products of the Haar wavelets with the signal A. It is no longer the case when we have weighted norms. When we do sparse approximation we don’t just have to come up with which vectors to choose but also we have to come up with the right choice of coefficients. It is harder to work with this, but we could make the problem easier as follows. We could assume that once we picked the vectors we use the coefficients only along that direction, i.e. 
we assume that the coefficients are inner products with the signal vector. If we do this then there is an interesting O(N 2 b2 ) dynamic programming algorithm. (This algorithm uses the binary tree we used before and benefits from the fact that each node in the binary tree has at most log(N) ancestors. This makes it possible to take a look at all possible subsets of the log(N) ancestors of a node in linear time.) Sparse approximation people use Haar wavelets because Haar wavelets work well for the signals they work with. But if you put arbitrary weights as we done above, then the Haar basis might not be the best basis. One question is: if we know the class of weights, which dictionary should we use? Another question would be: what weights would be good for those signals for which Haar wavelets give a good basis? Muthu mentioned that they can get good approximations when they use piecewise linear weights. You can also ask the same questions about a Fourier-dictionary. 25 Lecture 6. Compressed Sensing Lecturer: S. Muthu Muthukrishnan Scribes: Nicole Schweikardt and Luc Segoufin 6.1 The problem Assume a signal A ∈ Rn , for n large. We want to reconstruct A from linear measurements hA, ψi i, where each ψi is a vector in Rn , and hA, ψi i denotes the inner product of A and ψi . For suitably chosen ψi , n measurements suffice to fully reconstruct A (if the set of all ψi forms a basis of Rn ). However, we would like to do only k measurements for k << n. The question is which measurements should be done in order to minimize the error between what we measure and the actual value of A. We fix some notation necessary for describing the problem precisely: We assume that an orthonormal basis ψ1 , . . . , ψn of Rn is given. The dictionary Ψ is the n × n matrix the i-th row of which consists of the vector ψi . The measurements hA, ψi i, for i = 1, . . . , n, form the vector θ(A) := ΨA, the vector of coordinates of A respect to the basis ψ1 , . . . , ψn . Note that by the Pwith n orthonormality of Ψ one obtains that A = i=1 θi (A)ψi , where θi (A) denotes the i-th component of θ(A). In the area of sparse approximation theory one seeks for a representation of A that is sparse in the sense that it uses few coefficients. Formally, one looks for a set K ⊆ {1, . . . , n} of coefficients such that k = |K| << n such that for the vector X θi (A)ψi R(A, K) := i∈K the error ||A − R(A, K)||22 = mal, Pn − Ri (A, K))2 is as small as possible. Since Ψ is orthonorX ||A − R(A, K)||22 = θi (A)2 . i=1 (Ai i6∈K Thus, the error is minimized if K consists of the k coordinates of highest absolute value in the vector θ(A). In the following, we write θj1 , θj2 , . . . , θjn to denote the components of the vector θ(A), ordered in descending absolute value, i.e., ordered such that |θj1 | ≥ |θj2 | ≥ · · · ≥ |θjn |. k Furthermore, we write Ropt (A) to denote the vector R(A, K) where K is a set of size k for which the error is minimized, i.e., k X k Ropt (A) = θji ψji . (6.3) i=1 Of course, the optimal choice of K depends on the signal A which is not known in advance. The ultimate goal in compressed sensing can be described as follows: Identify a large class of signals A and a dictionary Ψ′ , described by a k × n matrix, such that instead of performing the n 26 measurements ΨA, already the k measurements Ψ′ A suffice for reconstructing a vector R(A) such that the error ||A − R(A)||22 is provably small on all signals A in the considered class. 
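The best k-term approximation of equation (6.3) is easy to state in code. The Python/NumPy sketch below accepts any orthonormal basis supplied as the rows of Psi (a random orthogonal matrix is used here purely as an example), keeps the k coefficients of θ = ΨA that are largest in absolute value, and checks that the squared error equals the energy of the dropped coefficients, as computed in Section 6.1.

import numpy as np

def best_k_term(A, Psi, k):
    """Best k-term approximation of A in the orthonormal basis whose rows are
    the vectors psi_i (equation (6.3)): keep the k coefficients of theta = Psi A
    of largest absolute value."""
    theta = Psi @ A                            # all n measurements <A, psi_i>
    keep = np.argsort(-np.abs(theta))[:k]      # indices j_1, ..., j_k
    return Psi[keep].T @ theta[keep], keep     # sum_i theta_{j_i} psi_{j_i}

# Sanity check on a random orthogonal matrix (its rows form an orthonormal basis):
# the squared error equals the energy of the dropped coefficients (Section 6.1).
rng = np.random.default_rng(0)
n, k = 16, 4
Psi, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = rng.standard_normal(n)
R, keep = best_k_term(A, Psi, k)
theta = Psi @ A
dropped = np.setdiff1d(np.arange(n), keep)
print(np.allclose(np.sum((A - R) ** 2), np.sum(theta[dropped] ** 2)))   # True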
A particluar class of signals for which results have been achieved in that direction is the class of p-compressible signals described in the next section. 6.2 p-compressible signals We assume that a dictionary Ψ is given. Furthermore, let us fix p to be a real number with 0 < p < 1. Definition 4. A signal A is called p-compressible (with respect to Ψ) iff for each i ∈ {1, . . . , n}, |θji | = O(i−1/p ). Obviously, if A is p-compressible, then ||A − k Ropt (A)||22 = n X i=k+1 θj2i ≤ Z n O((i−1/p )2 ) k+1 ≤ Cp · k 1−2/p k for a suitable number Cp . Thus, if we assume p to be fixed, the optimal error ||A − Ropt (A)||22 is of k k size at most Copt = O(k 1−2/p ) (for Copt := Cp · k 1−2/p ). The following result shows that for reconstructing a vector R such that the error ||A − R||22 is k of size O(Copt ), already k log n measurements suffice. Theorem 8 (Donoho 2006; Candès and Tao 2006). There exists a (k log n) × n matrix Ψ′ such that the following is true for all p-compressible signals A: when given the vector Ψ′ A ∈ Rk log n , one can reconstruct (in time polynomial in n) a vector R ∈ Rn such that ||A − R||22 = O(Ckopt). The proof details are beyond the scope of this lecture; the overall structure of the proof is by showing the existence of Ψ′ by proving the following: if Ψ′ is chosen to be the matrix T Ψ, where T is a random (k log n) × n matrix with entries in {−1, 1}, then the probability that Ψ′ satisfies the theorem will be nonzero. A crucial step in the proof is to use “the L1 trick”, i.e., to consider the L1 -norm || · ||1 instead of the L2 -norm || · ||2 and solve a suitable linear program. Note that in lecture #5 we already considered the particular case where Ψ is a Haar wavelet basis, and solved similar questions as that of Donoho and Candès and Tao for that particular case. 6.3 An explicit construction of Ψ′ Theorem 8 states that a matrix Ψ′ exists, and the proof of Theorem 8 shows that Ψ′ can be chosen as T Ψ, for a suitable (k log n) × n matrix T . The goal in this section is to give an explicit, deterministic construction of T . k Recall from Section 6.1 that θ(A) = ΨA. Our goal is to approximate the vector Ropt (A) from ′ equation (6.3) by a vector R that can be found using only the measurements Ψ A instead of using all the measurements ΨA. Clearly, if Ψ′ = T Ψ, then Ψ′ A = T ΨA = T θ(A). Since we want to 27 k use Ψ′ A = T θ(A) to find Ropt (A), we clearly should choose T in such a way that it picks up the k largest components (w.r.t. the absolute value) of θ(A). Note the striking similarity between this problem and the following combinatorial group testing problem: We have a set U = {1, . . . , n} of items and a set D of distinguished items, |D| ≤ k. We identify the items in D by performing “group tests” on subsets Si ⊆ U. The output of each group test is 0 or 1, revealing whether the subset Si contains at least one distinguished item, i.e., |Si ∩ D| ≥ 1. Collections of O(k log n)2 ) nonadaptive tests are known which identify each of the distinguished items precisely. For the special case where only k-support signals are considered (i.e., signals A where at most k of the components in θ(A) are nonzero), a solution of the combinatorial group testing problem almost immediately gives us a matrix T with the desired properties. For the more general case of p-compressible signals, the following is known. Theorem 9 (Cormode and Muthukrishnan, 2006). 
For the more general case of p-compressible signals, the following is known.

Theorem 9 (Cormode and Muthukrishnan, 2006). We can construct a poly(k, ε, log n) × n matrix T in time polynomial in k and n such that the following is true for the matrix Ψ′ := TΨ and for all p-compressible signals A: when given the vector Ψ′A, one can reconstruct a vector R ∈ Rⁿ such that ||A − R||₂² ≤ ||A − R^k_opt(A)||₂² + ε C^k_opt.

The construction of T in the proof of Theorem 9 is based on the following two facts (where [n] := {1, . . . , n}).

Fact 1 (strong k-separative sets). Given n and k, for l = k² log² n, one can find l sets S_1, . . . , S_l ⊆ [n] such that for all X ⊆ [n] with |X| ≤ k we have: for every x ∈ X there exists an i such that S_i ∩ X = {x}.

Fact 2 (k-separative sets). Given n and k, for m = k log² n, one can find m sets S_1, . . . , S_m ⊆ [n] such that for all X ⊆ [n] with |X| ≤ k there exists an i such that |S_i ∩ X| = 1.

Furthermore, we need the following notation for describing the matrix T:

1. Given a u × n matrix M and a v × n matrix N, M ⊕ N denotes the (u + v) × n matrix consisting of the rows of M followed by the rows of N.

2. Given a vector B ∈ Rⁿ and a u × n matrix M, B ⊗ M denotes the u × n matrix whose entry (i, j) is B_j · M[i, j]. If N is a v × n matrix, then N ⊗ M is the uv × n matrix obtained by applying the vector operation to each row of N, using ⊕ to merge the results.

3. The Hamming matrix H is the (log n) × n matrix such that column i is the binary coding of i; we add an extra row to H with 1 everywhere. With n = 8 this yields:

    1 1 1 1 0 0 0 0
    1 1 0 0 1 1 0 0
    1 0 1 0 1 0 1 0
    1 1 1 1 1 1 1 1

This basically corresponds to a binary search strategy for a set of size n.

Now set k′ as a suitable function of k and p, let m be the number corresponding to n and k′ as given by Fact 2, and let S′ be the corresponding k′-separative sets. Set k′′ := (k′ log n)², let l be the number corresponding to n and k′′ as given by Fact 1, and let S′′ be the corresponding strong k′′-separative sets. From S′ form the characteristic matrix M′ and from S′′ form the characteristic matrix M′′. Let T be the matrix (M′ ⊗ H) ⊕ M′′. The number of rows of this matrix is m · log n + l, which is poly(k, log n) by construction. The fact that this matrix has the desired properties can be found in [Cormode and Muthukrishnan, SIROCCO 2006].

6.4 Literature

A bibliography on compressed sensing can be found at http://dsp.rice.edu/cs. In particular, the following references were mentioned during the lecture:

• David Donoho: Compressed sensing. IEEE Trans. on Information Theory, 52(4), pp. 1289–1306, April 2006.

• Emmanuel Candès and Terence Tao: Near optimal signal recovery from random projections: Universal encoding strategies? IEEE Trans. on Information Theory, 52(12), pp. 5406–5425, December 2006.

• Graham Cormode and S. Muthukrishnan: Combinatorial Algorithms for Compressed Sensing. SIROCCO 2006, LNCS volume 4056, pp. 280–294, Springer-Verlag, 2006.

• S. Muthukrishnan: Some Algorithmic Problems and Results in Compressed Sensing. Forty-Fourth Annual Allerton Conference. (The article is available on Muthu's webpage at http://www.cs.rutgers.edu/~muthu/resrch_chrono.html)

Lecture 7. The Matador's Fate and the Mad Sofa Taster: Random Order and Multiple Passes
Lecturer: Andrew McGregor    Scribes: Pierre McKenzie and Pascal Tesson

7.1 Introduction

The previous lectures have focused on algorithms that get a single look at the input stream of data. This is the correct model in many applications, but in other contexts a (typically small) number of passes through the input may be allowed.
For example, it can be a reasonable model for massive distributed data. We want to understand the inherent trade-offs between the number of passes and the space complexity of algorithms for problems considered in previous lectures. We have also considered the space complexity of algorithms in a “doubly-worst-case” sense. We are of course assuming worst-case data but, implicitly, we have also assumed that the order of presentation of the data is chosen adversarially. This simplifies the analysis and provides guarantees on the space required for the computation but average-case analysis is often more appropriate: in many real-world applications, such as space-efficiently sampling salaries from a database of employees for example, data streams are relatively unstructured. We thus consider the complexity of algorithms on random-order streams and again seek to identify cases where this point of view provides significant gains in space-complexity. These gains sometimes come simply from a sharper analysis of algorithms discussed earlier but, more interestingly, we can also tweak existing algorithms to take full advantage of the random order model. Lower bounds for this model are obviously trickier but more meaningful from a practical point of view. 7.2 Basic Examples Smallest value not in the stream Consider first the task of identifying the smallest value x that is not occurring in a stream of length m consisting of values in [n]. Let us look at variants of the problem where some promise on the input is provided. Version 1: Promised that m = n − 1 and that all elements but x occur (i.e. all elements but x occur exactly once) In this case, the obvious solution is to keep a running sum S of the elements of the stream ∼ and get x = m(m + 1/2) − S. The space required is Θ(1) Version 2: Promised that all elements less than x occur exactly once. 30 ∼ In this case, the complexity is Θ(m1/p ) where p is the number of passes. Version 3: No promise. ∼ In this case, the complexity is Θ(m/p) where p is the number of passes. Increasing subsequence As a second example, consider the problem of finding an increasing subsequence of length k in the stream, given that such a subsequence exists. Liben-Nowell et al. [LNVZ06] gave an upper bound p of space complexity O(k 1+1/2 −1 ) which was later shown to be tight [GM09]. Medians and approximate medians Obviously, the assumption that the data arrives in a random order is often of little help. But there are classical examples for which the gains are significant. Consider the problem of finding the median of m elements in [n] using only polylog(m, n) space. We also consider the easier problem of finding a t-approximate median, i.e. an element x whose rank in the m elements is m/2 ± t. If we assume that the stream is given in an adversarial order, and if we impose a polylog(m, n) ∼ space bound, we can find a t-approximate median if t = Ω(m/polylog(m) and that √ bound is tight. However, if we assume that the stream is random-order, we can get t down to Ω( m). √ This bound is not known to be optimal but t-approximate medians cannot be computed for t = ω( 3 m). Suppose instead that we want to compute the exact median. The bounds above show that this is not possible in space polylog(m, n) even assuming random order. However the bounds are for 1-pass algorithms and given sufficiently many passes, we can identify the exact median without exceeding our space bound. 
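As a quick illustration (not from the lecture; plain Python, using the document's convention [n] = {1, . . . , n}), the running-sum solution for Version 1 of the missing-value problem is a one-liner: with m = n − 1, the missing value is x = (m+1)(m+2)/2 − S.

```python
def missing_value(stream, n):
    """Version 1: the stream contains every value of [n] = {1, ..., n} exactly
    once except for the missing x, so m = n - 1.  One pass, O(1) words."""
    s = 0
    for item in stream:                  # running sum
        s += item
    return n * (n + 1) // 2 - s          # x = (sum of [n]) - S = (m+1)(m+2)/2 - S

print(missing_value([5, 1, 4, 2], 5))    # -> 3
```

We now return to the exact median and make the pass/space trade-offs precise.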
Specifically, the number of passes needed/sufficient for this task it ∼ ∼ Θ(log m/ log log m) in the adversarial model and only Θ(log log m) in the random-order model. Theorem 10. In the adversarial order model, one can find an element of rank m/2±ǫm in a single ∼ pass and using O(1/ǫ) space. Moreover, one can find the exact median in O(log m/ log log m) ∼ passes using O(1) space. Proof. We have already discussed the one-pass result. In fact, we even showed that it is possible to find quantiles, i.e. find for any i ∈ [ǫ−1 ] an element of rank iǫm ± ǫm The multi-pass algorithm is built through repeated applications of this idea. In a first pass, we set ǫ = 1/ log m and find a and b with rank(a) = m/2 − 2/ log m ± m/ log m and rank(b) = m/2 + 2/ log m ± m/ log m Now in pass 2, we can find the precise rank of a and b and from there recurse on elements within the range [a, b]. Note that this range is of size at most m/ log m and every pair of passes similarly shrinks the range by a factor of log m. Hence O(log m/ log log m) passes are sufficient to find the median. 31 Let us now focus on lower bound arguments for this same problem. Theorem 11. Finding an element of rank m/2 ± mδ in a single pass requires Ω(m1−δ ) space. Proof. Once again the proof relies on a reduction from communication complexity. Specifically we look at the communication complexity of the I NDEX function: Alice is given x ∈ {0, 1}t, Bob is given j ∈ [t] and the function is defined as I NDEX(x, j) = xj . Note that the communication complexity of this problem is obviously O(log n) if Bob is allowed to speak first. However, if we consider one-way protocols in which Alice speaks first, then Ω(t) bits of communication are necessary (this is easy to show using a simple information-theoretic argument). We want to show that Alice and Bob can transform a single pass algorithm for the approximate median into a protocol for I NDEX. Given x, Alice creates the stream elements {2i + xi : i ∈ [t]} while Bob appends to the stream t − j copies of 0 and j − 1 copies of 2t + 2. Clearly, the median element in the resulting stream is 2j + xj and Alice can send to Bob the state of the computation after passing through her half of the stream. Hence, finding the exact median requires Ω(m) space. To get the more precise result about approximate medians, it suffices to generate 2mδ + 1 copies of each of the elements: any mδ -approximate median still has to be 2j + xj and since the resulting stream is of length O(tmδ ), the communication complexity lower bound translates into an Ω(m/mδ ) lower bound on the space complexity of the streaming algorithm. If we hope to obtain lower bounds in the multi-pass model using the same approach, we cannot simply rely on the lower bound for I NDEX. Intuitively, each pass through the stream forces Bob to send to Alice the state of the computation after the completion of the first pass and forces Alice to send to Bob the state of the computation after the completion of the first half of the second pass. Instead, we consider a three-party3 communication game in the “pointer-jumping” family. Alice is given a t × t matrix X, Bob is given y ∈ [t]t and Charlie is given i ∈ [t]. Let j ∈ [t] be defined as j = yi : the players’ goal is to compute Xij with an A → B → C → A → B → C protocol. (Alice speaks first, followed by Bob, . . .) Any ABCABC protocol for this function requires Ω(t) communication [NW93]. √ Theorem 12. Finding the exact median in two passes requires space Ω( m). Proof. 
The reduction from the pointer-jumping problem works as follows. Again, we think of Alice, Bob and Charlie as creators of, respectively, the first, second and last third of a stream. Therefore, a space-efficient 2-pass algorithm can be translated into a cheap ABCABC communication protocol. Let T > 2t + 2 and let ok = T (k − 1). Alice creates for each k ∈ [t], the elements Ak = {2ℓ + Xℓk : ℓ ∈ [t]} + ok . Bob creates for each k ∈ [t], the elements Bk = {t − yk copies of 0 and yk − 1 copies of B} + ok . In other words, Alice and Bob’s stream elements form t nonoverlapping blocks. Each such block has 2t − 1 elements and follows the pattern of the 1-pass reduction. Charlie on the other hand adds t(t − i) copies of 0 and t(i − 1) copies of Bot . It is convenient to also think of these as t blocks of (t − i) copies of 0s and t blocks of (i − 1) copies of 3 Note that the multi-party model considered here is the “number in hand model” and not the “number on the forehead” model. 32 Bot . The total number of blocks in our construction is 2t − 1 and the median element in the stream is the median element of the median block. The elements generated by Charlie guarantee that the median block is the ith one and by construction of the Ak and Bk , the median of this block is oi + 2j + Xij where j = yi . A 2-pass algorithm using space S yields an ABCABC protocol of cost 5S for the pointerchasing problem and since the stream above is of length O(t2 ), the communication complexity √ lower bound translates into an Ω( m) space lower bound for two-pass algorithms. Throughout our discussion on medians, we assumed that the length m of the stream is known in advance. This is crucial for the upper bounds: one can in fact show that when m is not known a priori, an algorithm for the exact median requires Ω(m) space even if the data is presented in sorted order. (Left as an exercise) 7.3 Taking advantage of random streams ∼ √ Theorem 13. In the random order model, one can find an element of rank m/2 ± O( m) in a ∼ stream of elements from [n] in a single pass using O(1) space. Proof. We split the stream into O(log m) segments of length O(m/ log m). We set a1 = −∞, b1 = +∞. At the ith step we process the ith segment. We enter the ith step thinking that we have ai and bi fulfilling rank(ai ) < m/2 < rank(bi ), and then • we pick in the ith segment the first c fulfilling ai < c < bi • we use the rest of the ith segment to compute an estimate r̃ of rank(c), by setting r̃ to O(log m) × the rank of c within the ith segment ∼ √ • if r̃ is within O( m) of m/2 then we output c, otherwise we proceed to step i + 1 after setting  (ai , c) if r̃ > m/2 (i.e. the median is likely below c) (ai+1 , bi+1 ) = (c, bi ) if r̃ < m/2. This algorithm finds an approximation to the median with high probability. The probability analysis uses a variant of Chernoff-Hoeffding bounds applied to sampling without replacement [GM09]. ∼ The algorithm manipulates a constant number of (log mn)-bit numbers so uses O(1) space. The above algorithm is more accurate than the CM sketch but the latter yields all the quantiles. By increasing the number of passes, yet a better approximation to the median can be ob∼ √ ∼ tained. A second pass can improve the approximation from ±O( m) to ±O(m1/4 ), a third pass ∼ to ±O(m1/8 ), and so on. One needs care to account for the fact that the input is not rerandomized ∼ at each pass [GK01]. The exact median can be found in space O(1) using O(log log m) passes [GM09]. 
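The following Python sketch mimics the one-pass algorithm just described on a randomly shuffled stream. The segment length and the acceptance threshold use illustrative constants, not the ones from the analysis in [GM09].

```python
import math, random

def approx_median_random_order(stream, m, c_tol=2.0):
    """One-pass median approximation for a randomly ordered stream (a sketch of
    the algorithm behind Theorem 13; the segment length and the acceptance
    threshold use illustrative constants, not the ones from the analysis)."""
    num_segs = max(2, int(math.log2(m)))
    seg_len = m // num_segs
    lo, hi = float("-inf"), float("inf")   # interval believed to contain the median
    answer = None
    it = iter(stream)
    for _ in range(num_segs):
        seg = [next(it) for _ in range(seg_len)]
        # first element of the segment strictly inside (lo, hi) is the candidate c
        cand = next((x for x in seg if lo < x < hi), None)
        if cand is None:
            continue
        rest = seg[seg.index(cand) + 1:]
        if not rest:
            continue
        # estimate rank(c) by scaling up its rank within the rest of the segment
        est_rank = sum(1 for x in rest if x < cand) * (m / len(rest))
        if abs(est_rank - m / 2) <= c_tol * math.sqrt(m) * math.log(m):
            return cand                    # close enough to the median: output c
        elif est_rank > m / 2:
            hi = cand                      # the median is likely below c
        else:
            lo = cand                      # the median is likely above c
        answer = cand
    return answer                          # fall back to the last candidate

random.seed(0)
data = list(range(100000))
random.shuffle(data)                       # the random-order assumption
print(approx_median_random_order(data, len(data)))   # close to 50000
```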
33 Turning to lower bounds to match the above theorem, the difficulty in extending the communication complexity technique to the random order model is to account for randomisation when ∼ partitioning the player inputs. It can be shown that approximating the median to ±O(mδ ) in one 1−3δ pass requires Ω(m 2 ) space. This is known to be tight when δ = 0 but an open problem is to 1 prove Ω(m 2 −δ ). See [GM09] for a refined statement that takes the number of passes into account. The bearing of the communication complexity of the Hamming distance problem (in which Alice and Bob want to estimate the Hamming distance between their respective n-bit strings) on the data stream complexity of computing frequency moments was not treated in these lectures. The 2-line story here is: • see [GM09], • see [CCM08]. 34 Lecture 8. Random order, one pass. Linear algebraic problems. Lecturer: Andrew McGregor and S. Muthu Muthukrishnan Antonina Kolokolova Scribes: Valentine Kabanets and 8.1 Random order √ Theorem 14. Finding the exact median requires Ω( m) space. Theorem 15. To find mδ -approximation of the median in one pass, in random order setting, re1−3δ quires space Ω(m 2 ). The lower bound is tight for δ = 0, but not known to be tight for δ = 1/2. 1 Problem 1. Improve this to Ω(m 2 −δ ). Proof of theorem 14. The proof is by reduction from communication problem INDEX. Suppose Alice has x ∈ {0, 1}t and Bob has j ∈ [t]. Claim 5. Even when x ∈R {0, 1}t (that is, picked uniformly at random from {0, 1}t ), any one-pass one way (Alice to Bob) protocol requires Ω(t) communication. Let Alice have A = {2i + xi | i ∈ [t]}, and let Bob have t − j copies of 0 (set B1 ) and j − 1 copies of 2t + 2 (set B2 ). Then, finding a median requires Alice to know j. This is the case of adversarial stream. For the random stream case, how can Alice and Bob simulate an algorithm on a random permutation of A, B1 , B2 ? They cannot do such a simulation, but they do an “almost random” stream. Start by adding to A, B1 , B2 t2 copies of B1 and B2 items. So the size of the set of 0s B1 becomes t2 + t − j, and of new B2 , |B2 | = t2 + j − 1. Using public randomness, decide ahead of time where elements of A appear. To Alice’s elements on Bob’s side, add random values yi. Alice guesses j = t/2, and randomly fills Bob’s places in her part of the stream with values 0 (small) and 2t + 2 (large) so that it is balanced by the end of her part of the stream. Bob knows the balance by the start of his stream, and fills in the rest of 0 and 2t + 2 to make the balance exact. Since t2 is large in comparison to t − j, j − 1, equal balance is ok. Finally, although Bob guesses his 2i + xi mostly incorrectly, he can recover. Reference: Guha, McGregor SiCOMP’09, and Chakrabarti, Cormode, McGregor STOC’08. Gap hamming: given two length n binary string, approximate hamming distance. There is a one-pass lower bound. 35 8.2 Linear algebraic problems 1. Compressed sensing: many parameters, time to reconstruct the signal, etc. There are tables of parameters in various paper. Pyotr Indyk has the largest table. 2. Algorithmic puzzle: given an array of n elements, find in-place a leftmost item that has a duplicate, if any. O(n2) is trivial, O(n log2 n) more interesting, if allowed to lose info (overwrite elements) can get O(n). 3. Different sparse approximation problem: given an array of size n and a number k, partition the array in k pieces. Approximate items in every interval by the average of the interval ai . 
P P Want to minimize the quantity: i j∈ith interval (ai − A[j])2 ; here, full memory is allowed; no streaming. Dynamic programming solution gives O(n2 k). But suppose the problem is generalized to two dimensions. What kind of partition can it be? There is hierarchical partition (every new line splits an existing block), simple partition (grid), or arbitrary partition. The problem is NP-hard for arbitrary partition, and is in P for hierarchical partition (using Dynamic Programming). Approximation algorithm converts the problem of arbitrary partition into a hierarchical partition.. Sparse approximation: intervals correspond to a dictionary. How to get streaming algorithms for this case? This problem is related to histogram representation problem. It is also related to the wavelets. 8.2.1 Linear algebraic problems Here we consider three linear algebraic problems: 1. Matrix multiplication. 2. L2 regression (least squares) 3. Low rank approximation. Matrix multiplication P Problem: given A : Rm×n , B : Rn×p , compute A · B. Frobenius norm ||x||F = i,j x2i,j : work in streaming world, only look at the space to keep the sketch what we track as the matrices are updated. Take a projection matrix S of dimensions (1/ǫ2 ) log 1/δ by n. Now AB = (AS T )(SB). Keep track of of (AS T ) (size ǫ12 log 1δ × m, (SB) : ǫ12 log 1δ × p. The probability P r(||AB − AS T SB||F ≤ ǫ||A||F ||B||F ) ≥ 1 − δ. This is similar to the approximation of additive error in wavelets, inner product of vectors. Expectation of the inner product is E[hSx, Syi] = hx, yi, variance V ar[hSx, Syi] ≤ 2ǫ2 ||x||22 ||y||22. This was proved earlier. From this, get E[AS T SB] = AB, and V ar(||AB − AS T SB||2F ) ≤ 2ǫ2 ||A||F ||B||F . 36 P r(min ||AB − ASiT Si B||F ≤ ǫ||A||F ||B||F ) > 1 − δ, 1...t where t ≈ 1 log δ1 – need at least 1 ǫ2 bits to solve with this level of accuracy. L2 regression (least squares) ? Now take A : Rn×d , n > d (often n >> d), b ∈ Rn . The problem is to solve Ax = b. That is, given points (observational) draw a line (fit to explain all points) minimizing the sum of squares of distances of points to the line. Z = minx∈Rd ||Ax − b||2 . Best known algorithm is O(nd2 ); this algorithm is based on numerical analysis. How do we solve this problem in the streaming world? Here, A and b are updated as new points come along. Want guarantees on x and Z. Can get the result: 1. z̃ ≤ (1 + ǫ)z 2. Can find x′opt such that ||SAxopt − Sb||2 ≤ (1 + ǫ)z. Take CM sketch, projection of A and d a projection of b, and solve the problem on them. For S, take d log × n CM sketch vectors. ǫ2 d log d d log d Solve minx ||SAx − Sb||2 . Size of SA is ǫ2 × d, of Sb is ǫ2 . This gives the d3 term in the expression. Use fast Johnson-Lindenstrauss transform. In streaming world, assume more SA. 3. ||xopt − x′opt ||2 ≤ ǫ2 , 2 σmin (A) the smallest eigenvalue. Low rank approximation There is a simple algorithm for the low rank approximation; surprising, knowing the history of the problem. Think of sites collecting information. There is a central site and k other sites. Let P Si (t) be the number of items seen by a (non-central) cite i by the time t. We are interested in i |Si (t)|. The P central site P gets information from all the rest of sites, and outputs a 0 if i |Si (t)| < (1 − ǫ)τ and output 1 if i |Si (t)| > τ , where τ is the central site’s threshold parameter. We are interested in minimizing the number of bits sent to the central site by the others. There is no communication from the central site back or between non-central sites. 
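Before continuing with this distributed counting question, here is a small numerical illustration of the sketched matrix product described above (Python with numpy; a single sketch of about 4/ε² rows is used in place of the (1/ε²) log(1/δ) independent repetitions, and the constant 4 is only headroom, not a claim from the lecture).

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 50, 10000, 40
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, p))

eps = 0.1
d = int(np.ceil(4 / eps ** 2))                 # one sketch; log(1/delta) repetitions
                                               # would boost the confidence
S = rng.choice([-1.0, 1.0], size=(d, n)) / np.sqrt(d)   # random projection matrix

AS = A @ S.T                                   # m x d sketch, maintainable as A is updated
SB = S @ B                                     # d x p sketch, maintainable as B is updated

err = np.linalg.norm(A @ B - AS @ SB, "fro")
bound = eps * np.linalg.norm(A, "fro") * np.linalg.norm(B, "fro")
print(err, bound)                              # err is usually well below the bound
```

Back to the distributed counting problem.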
All sites know τ and k. In case of 2 players, send 1 bit when seen τ /2 items: gives 2 bits of communication. Can it be generalized to k players (would that give k bits of communication)? 37 Lecture 9. Massive Unordered Distributed Data and Functional Monitoring Lecturer: S. Muthu Muthukrishnan Scribes: Eric Allender and Thomas Thierauf 9.1 Distributed Functional Monitoring In the distributed functional monitoring problem [CMY08] we have k sites each tracking their input. The sites can communicate with a designated central site, but not among each other. Let si (t) be the number of bits that site i has seen up to time t. The task of the central site is to monitor a given function f over the inputs s1 (t), . . . , sk (t) for all times t. The goal is to minimize the number of bits that are communicated between the sites and the central site. We consider the example where the function f is the sum of the values si (t). Define F1 = k X si (t). i=1 The central site is required to detect when F1 exceeds a certain threshold τ , i.e. we consider the function ( 1, if F1 > τ, c(t) = 0, otherwise. The interesting case is to compute the approximate version cA of c: for some given 0 < ε ≤ 1/4 the output of the central site at time t is defined as ( 1, if F1 > τ, cA (t) = 0, if F1 ≤ (1 − ε)τ. We do not care about the output if (1 − ε)τ < F1 ≤ τ . The problem of computing cA with these parameters is called the (k, F1 , τ, ε) functional monitoring problem. There are two trivial algorithms for it: 1. Each site communicates the central site every bit it sees. 2. Site i sends a bit to the central site each time that si (t) increases by τ (1 − ǫ)/k. The first solution would even allow the central site to compute the exact function c. However, the amount of communication is extremely high (τ bits). The second algorithm needs only about k bits of communication, but the error made in computing cA can be very large. In this lecture we show 38 Theorem 16. There is a randomized algorithm for (k, F1 , τ, ε) monitoring with error probability ≤ δ and O( ε12 log 1δ log log 1δ )) bits of communication. It is important to note that the number of bits of communication is independent of τ and k. Thus it scales as well as one could hope for. Consider the following algorithm, where we use a constant c to be determined later. Site i: Send a bit to the central cite with probability 1/k each time that si (t) increases by ε2 τ. ck Central site: Output 1 if the total number of bits received from the sites is   1 1 , ≥c 2− ε 2ε otherwise output 0. Define the random variable X = # bits the central site has received at some point of time. For the expected value of X we have the following upper and lower bound: 1 F1 cF1 = 2 , 2 k ε τ /ck ε τ 2 cF1 1 F1 − ε τ = 2 − c. E(X) ≥ 2 k ε τ /ck ετ E(X) ≤ (9.4) (9.5) For the variance of X we have ckF1 Var(X) ≤ ε2 τ  1 1 − 2 k k  cF1 . ε2 τ (9.6) c c cF1 ≤ (1 − ε) ≤ . ε2 τ ε2 ε2 (9.7) ≤ Case 1: F1 ≤ (1 − ε)τ . By equation (9.4) we have E(X) ≤ 39 The probability that the central site outputs 1 is Pr[X ≥ c   c 1 1 ] ≤ Pr[X ≥ E(X) − ] − by equation (9.7) 2 ε 2ε 2ε c = Pr[X − E(X) ≥ − ] 2ε cF1 (2ε)2 ≤ 2 by Chebyshev inequality and equation (9.6) ε τ c2 4F1 = τc 4(1 − ε)τ ≤ by assumption in case 1 τc 4 ≤ c Case 2: F1 > τ . By equation (9.5) we have E(X) ≥ c cF1 − c > 2 − c. 
2 ε τ ε (9.8) Then the probability that the central site does not output 1 is Pr[X < c   c 1 1 ] ≤ Pr[X < E(X) + c − ] − by equation (9.8) 2 ε 2ε 2ε c = Pr[X − E(X) < c − ] 2ε 1 cF1 by Chebyshev inequality and equation (9.6) ≤ 2 ε τ (c − 2εc )2 1 by assumption in case 2 ≤ c(ε − 12 )2 16 1 ≤ for ε ≤ . c 4 Choosing c = 64 makes the error probability ≤ 1/16 in case 1 and ≤ 1/4 in case 2. Hence the total the error probability is ≤ 1/3. The number of bits communicated is O(1/ε2). The error probability can be decreased to δ by running O(log 1δ ) independent instances of the algorithm. That is, we modify the protocol as follows: 40 2 ε Site i: Each time that si (t) increases by ck τ , the site makes t = O(log 1δ ) independent trials, indexed 1, . . . , t, to send a bit to the central site with probability 1/k for each trial. However, instead of just one bit, it sends the index j of each successful trial to the central site. Central site: The central site maintains a counter for each of the t trials, where it adds 1 to counter j whenever it receives message j from one of the sites. It outputs 1 if the majority of the t counters has a value ≥ 1 , otherwise it outputs 0. c ε12 − 2ε The number of bits communicated is thus O( ε12 log 1δ log log 1δ ) as claimed. 9.2 Massive Unordered Distributed Data We consider truly massive data sets, such as those that are generated by data sources as IP traffic logs, web page repositories, search query logs, retail and financial transactions, and other sources that consist of billions of items per day, and are accumulated over many days. The amount of data in these examples is so large that no single computer can make even a single pass over the data in a reasonable amount of time. Therefore the data is distributed in pieces over many machines. For example, Google’s MapReduce and Apache’s Hadoop are successful large scale distributed platforms that can process many terabytes of data at a time, distributed over hundreds of even thousands of machines. The machines process the pieces of data in parallel. Then they send their results to a central machine. Clearly, the amount of data that is sent to the central machine should be small, i.e. only poly-logarithmic in the input size. Since the distribution of the data pieces on the machines is unordered, order should have no impact on the result. Hence, in this lecture we consider a model for algorithms which is called massive, unordered, distributed (short: mud) algorithms. Mud algorithms consist of three functions: 1. a local function map that maps a single input data item x to a list of pairs of the form (k, v), where k is a key and v is a value. 2. an aggregate function reduce that gets as input the set S of all map(x) (over all data items x), and computes, for each k, some function of the pairs (k, v) that appear in S. Because the input data for reduce can be distributed on several machines, the function should be commutative and associative. 3. a function for a final post-processing step. This is not always needed. Examples As an example, we want to compute the number of links to a web-page. The data items x are web-pages and map(x) is defined to consist of all pairs (k, v), where key k is a URL that occurs 41 as a link in x and v is the number of times k occurs in x. The reduce-function simply computes, for each URL k, the sum of the values v such that (k, v) is produced during the map phase. This example does not use the post-processing step. 
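A minimal Python rendering of this link-counting example (the page data is hypothetical; the map and reduce functions are exactly as described above):

```python
from collections import defaultdict

def map_page(page_links):
    """map: a web page -> list of (key, value) pairs, where the key is a URL
    linked from the page and the value is how often it occurs as a link."""
    counts = defaultdict(int)
    for url in page_links:
        counts[url] += 1
    return list(counts.items())

def reduce_pairs(pairs):
    """reduce: commutative, associative aggregation of all (key, value) pairs."""
    totals = defaultdict(int)
    for url, v in pairs:
        totals[url] += v
    return dict(totals)

pages = [["a.html", "b.html", "a.html"],   # links found on page 1
         ["b.html"],                        # links found on page 2
         ["a.html", "c.html"]]              # links found on page 3

all_pairs = [kv for page in pages for kv in map_page(page)]
print(reduce_pairs(all_pairs))              # {'a.html': 3, 'b.html': 2, 'c.html': 1}
```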
As another example (using the post-processing step), consider the problem of computing the number of triangles in a graph x on n vertices, where the data is presented as a set of edges (u, w). Applying map(u, w) produces pairs (k, v) where the keys k are triples of nodes (i, j, k) with i < j < k, where {a, b} ⊆ {i, j, k}, and the values v are triples (bi,j , bi,k , bj,k ), where bi,j = 1, if (i, j) is an edge, and bi,j = 0 otherwise. The reduce-function computes the bitwise or of the values for each key. In the post-processing step, we output the number of keys k for which (k, (1, 1, 1)) is produced during the reduce phase. The number of keys that are used has a significant effect on the efficiency of the resulting algorithm. We will examine the computational power of the extreme case, where we have only one key. Mud algorithms with one key In the following we consider the special case where there is only one key, i.e. we can omit the key. Thus, in the map phase, each data item x produces map(x) which is communicated to the reduce phase. We call these “messages”. In the reduce phase, we apply an operator to the messages (in some order). More formally, the three functions of a mud-algorithm simplify as follows: The local function Φ : Σ → Q maps an input item to a message, the aggregator ⊕ : Q × Q → Q maps two messages to a single message, and the post-processing operator η : Q → Σ produces the final output. The output can depend on the order in which ⊕ is applied. Let T be an arbitrary binary tree circuit with n leaves. We use mT (x) to denote the q ∈ Q that results from applying ⊕ to the sequence Φ(x1 ), . . . , Φ(xn ) along the topology of T with an arbitrary permutation of these inputs as its leaves. The overall output of the mud algorithm is η(mT (x)), which is a function Σn → Σ. Notice that T is not part of the algorithm, but rather, the algorithm designer needs to make sure that η(mT (x)) is independent of T . We say that a mud algorithm computes a function f if η ◦ mT = f , for all trees T . The communication complexity of a mud algorithm is log |Q|, the number of bits needed to represent a message from one component to the next. The time, resp. space complexity of a mud algorithm is the maximum time resp. space complexity of its component functions. Let us compare mud algorithms with streaming algorithms. Formally, a streaming algorithm is given by a pair s = (σ, η), where σ : Q × Σ → Q maps a state and an input to a state. σ is an operator applied repeatedly to the input stream. η : Q → Σ converts the final state to the output. sq (x) denotes the state of the streaming algorithm after starting at state q and operating on the sequence x = x1 , . . . xn in that order, that is sq (x) = σ(σ(. . . σ(σ(q, x1 ), x2 ) . . . , xk−1 ), xn ). On input x ∈ Σn , the streaming algorithm computes η(s0 (x)), where 0 is the starting state. The communication complexity of a streaming algorithm is log |Q|, and the time, resp. space complexity is the maximum time resp. space complexity of σ and η. 42 Clearly, for every mud algorithm m = (Φ, ⊕, η) there is a equivalent streaming algorithm s = (σ, η)) of the same complexity by setting σ(q, x) = ⊕(q, Φ(x)) and maintaining η. The central question is whether the converse direction also holds. The problem is that a streaming algorithm gets its input sequentially, whereas for a mud algorithm, the input is unordered. Consider the problem of computing the number of occurrences of the first element in the input. This is trivial for a streaming algorithm. 
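To see the conversion σ(q, x) = ⊕(q, Φ(x)) in action, here is a toy mud algorithm for the symmetric function "sum of the inputs" together with its induced streaming algorithm (plain Python, not from the lecture; a random combination order stands in for an arbitrary tree T).

```python
import random

# A mud algorithm for the symmetric function "sum of the inputs":
phi = lambda x: x                  # local map       Phi : Sigma -> Q
oplus = lambda q1, q2: q1 + q2     # aggregator      (+) : Q x Q -> Q
eta = lambda q: q                  # post-processing eta : Q -> Sigma

def mud_on_random_tree(xs):
    """Apply the aggregator along an arbitrary (here random) combination order."""
    msgs = [phi(x) for x in xs]
    random.shuffle(msgs)                          # order must not matter
    while len(msgs) > 1:
        i = random.randrange(len(msgs) - 1)
        msgs[i:i + 2] = [oplus(msgs[i], msgs[i + 1])]
    return eta(msgs[0])

def induced_streaming(xs):
    """sigma(q, x) = oplus(q, phi(x)): the streaming algorithm built from the mud one."""
    q = 0                                         # starting state
    for x in xs:
        q = oplus(q, phi(x))
    return eta(q)

xs = [3, 1, 4, 1, 5, 9, 2, 6]
print(mud_on_random_tree(xs), induced_streaming(xs))   # 31 31, whatever the tree
```

With the two models side by side, we return to the first-element example.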
However, no mud algorithm can accomplish this because a mud algorithm cannot determine the first element in the input. Therefore we restrict our attention to symmetric functions. Here one can show that the models are equivalent in the following sense: Theorem 17. [FMS+ 08] For any symmetric function f : Σn → Σ computed by a g(n)-space and c(n)-communication streaming algorithm there exists a mud algorithm that computes f within space O(g 2(n)) and O(c(n)) communication. The proof of this theorem has much the same flavor of Savitch’s theorem and can be found in [FMS+ 08]. 43 Lecture 10. A Few Pending Items and Summary Lecturer: Andrew McGregor and S. Muthu Muthukrishnan Regan Scribes: Ricard Gavaldà and Ken 10.1 Four Things Andrew Still Wants to Say Andrew wanted to tell us about 14 things, which were eventually 4 because of time constraints: Computing spanners, estimating the entropy, a lower bound for F0 , and solving the k-center problem. 10.1.1 Computing Spanners Recall the definition of t-spanner of a graph G (alluded to in Lecture 4): It is a subgraph of G obtained by deleting edges such that no distance among any two vertices increases by a factor of more than t. Here is a simple algorithm for bulding a (t − 1)-spanner in one pass, over a stream of edges: When a new edge arrives, check whether it completes a cycle of length at most t. If it does,ignore it; otherwise, include it in the spanner. The resulting graph G′ is a t-spanner of the original graph G because for every edge (u, v) in G − G′ there must be a path from u to v in G′ of length t − 1, namely, the rest of the cycle that prevented us from adding (u, v) to G′ . Therefore, for every path of length k in G there is a path of length at most (t − 1)k in G′ with the same endpoints. Elkin [Elk07] proposed an algorithm using  this idea that computes a (2t − 1)-spanner in one 1+1/t pass using memory and total time Õ n . Furthermore, the expected time per item is O(1). 10.1.2 Estimating the Entropy Given a stream of elements in {1, . . . , m}, let mi be the frequency of element i in the stream. We would like to estimate the entropy of the stream, S= X mi i m log m . mi A first solution to this problem is simply to pick a random element in the stream, call it x, then count the occurrences of x from that point in the stream on, call this number R. Then output Ŝ = R log m m − (R − 1) log . R R−1 44 We claim that Ŝ is an estimator of the entropy S. Indeed, if we define f (r) = r log(m/r), X E[Ŝ] = Pr[R = r] · (f (r) − f (r − 1)) r = XX r i Pr[x = i] · Pr[R = r|x = i] (f (r) − f (r − 1)) mi X mi X 1 (f (r) − f (r − 1)) (and the sum telescopes, so) m m i r=1 i X mi 1 X mi m (f (mi ) − f (1)) = = log m mi m mi i i = One can show that the variance is also small when the entropy is large. A solution that works also when the entropy is small was given by Chakrabarti, Cormode, and McGregor [CCM07]. 10.1.3 A Lower Bound for (1 + ǫ)-approximation of F0 One can give a Ω(ǫ−2 ) lower bound, hence matching the algorithm that Muthu presented. The bound is a reduction from the communication complexity of the problem of estimating the Hamming distance among two streams. Let x, y be two n-bit strings, and say that Alice has x and Bob has y. Let Sx and Sy be the set of elements with characteristic vectors x and y. Then F0 (Sx ∪ Sy ) = |Sx | + |Sy | − |Sx ∩ Sy |. Jayram, √ Kumar, and Sivakumar [JKS35] showed that estimating the Hamming distance up to an additive n requires Ω(n) bits of one-way communication. 
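A sketch of this idea in Python (an illustration only; the distance function is supplied by the caller and the example metric is simply the absolute difference on the line):

```python
def k_center_given_opt(points, k, opt, dist):
    """One pass over the points, 2-approximation when (a guess of) OPT is known:
    open a new center whenever a point is farther than 2*opt from every center.
    Stores only the centers; returns None if the guess forces more than k of them."""
    centers = []
    for p in points:
        if all(dist(p, c) > 2 * opt for c in centers):
            centers.append(p)
            if len(centers) > k:
                return None          # the guess opt was too small
    return centers

# toy example on the line, with dist = absolute difference
pts = [0.0, 0.4, 10.0, 10.3, 20.0, 19.8, 0.1]
print(k_center_given_opt(pts, 3, 0.5, lambda a, b: abs(a - b)))
# -> [0.0, 10.0, 20.0]; every point is within 2*0.5 of one of these centers
```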
From here, one can see that a one-pass algorithm that approximates F0 within multiplicative ǫ must use Ω(ǫ−2 ) bits of memory. Brody and Chakrabarti [BC09] have recently shown lower bounds for the multiround communication complexity of the gap hamming distance problem, which implies lower bounds for F0 , F1 , F2 , etc. in the multipass streaming model. 10.1.4 Solving the k-center problem Let us sketch the main trick for solving the k-center problem, discussed already in Lecture 4. Recall that the problem is: given n points p1 , . . . , pn from some metric space, we want to take k of them, y1 , . . . , yk , such the maximum distance of any pi to its closest yj is minimized. That is, so that each initial point is d-away from its closest yj , for minimal d. We observe first that if we are told the optimal distance OP T in advance, we can give a 2approximation algorithm easily: We get the first point p1 . We ignore all subsequent points within radius 2OP T of it, and keep the first one that is not as a new center. We keep opening centers as necessary, and ignore all points already 2OP T close to one center. If OP T is achieved by some k points, we give a 2OP T solution with no more than k points. 45 Similarly, if we only have a guess g with OP T ≤ g ≤ (1 + ǫ)OP T . we can give a 2(1 + ǫ)approximation. When we have no guess at all, we could of course try the algorithm above in parallel all possible guesses (spacing them out by about (1 + ǫ)). The problem is that instances with too large a guess will use too much memory by themselves, so we have to be more cautious. We proceed as follows: 1. Look at the first k + 1 points, and find its best k-clustering; this gives a lower bound a on OP T . 2. Run the algorithm above with g = (1 + ǫ)i a, for i = 0 . . . a/ǫ. If one of these O(1/ǫ) distances goes well, take the first one that goes well and we have a 2(1 + ǫ) approximation. If none goes well, this means that after examining j points p1 , . . . , pj the algorithm is trying to open a (k + 1)-th center besides the points y1 , . . . , yk it has already picked. We realize now that we should have picked a guess g > a/ǫ. But observe that all points p1 , . . . , pj are within 2g of some yi . The crucial claim is that by keeping only these yi and ignoring the previous points we can still compute a reasonable approximation to the best k-clustering: Claim 6. If the cheapest clustering of p1 , . . . , pj , pj+1, . . . , pn has cost OP T , then the cheapest clustering of y1 , . . . , yk , pj+1 , . . . , pn has cost OP T + 2g. Therefore, if we cluster y1 , . . . , yk , pj+1 , . . . , pn we get a (roughly) (1 + ǫ)-approximation to the best clustering of p1 , . . . , pn . We therefore use y1 , . . . yk as seeds for the next iterations, using larger guesses, of the form (g + a/ǫ) · (1 + ǫ)i . We can do this by recycling the space already used, rather than using new space. This is due to McClutchin and Khuller [MK08b] and Guha [Guh09]. 10.2 Summary of the Course [Muthu’s summary actually came before Andrew’s items, but we thought it better to put it last—or not quite last—and what actually came dead last was the solution to the k-center problem, which we’ve put first. RG+KWR] Lectures 1 and 2 were on CM sketches, applied to: • point queries, such as the number mx of times an item x appeared; • heavy hitters, i.e. which items appear markedly frequently; • medians and other quantiles; • dot products; 46 • quantities such as F2 = P x m2x . 
The primary objective was to use O( 1ǫ log 1δ ) space, where ǫ quantifies the target accuracy and δ is the probability of missing the target accuracy. Usually the accuracy involves approximating one of these statistics to within an additive term, ±ǫn, where n is the size of the stream—or for some, with relative error within a factor of (1 + ǫ). Having an extra poly-log(n) factor was generally fine, but having something like ǫ2 in the denominator was less welcome. We saw one example later √ involving sampling for distinct out-neighbors in graph streams where terms like O( ǫ14 n log 1δ ) (needing improvement!) entered the analysis. Lecture 3 was on “Inverse Problems,” meaning that instead of the “forward distribution” f (x) = mx , one works with its “inverse” f −1 (i) = the number of items that appear i times. A core task here is to maintain a sample S that approximates a sample S ′ taken uniformly over the set of distinct elements in the stream, even in the presence of deletions as well as insertions. Then we can estimate the same statistics as in lecture 1, though with ǫ2 in the denominator of space usage—coming from fully storing samples of size involving that factor. A prime ingredient here is minwise hashing, and calculations involve estimating the number of distinct elements, represented P as the zeroth moment F0 = x f (x)0 . Lecture 4 (by Andrew) focused on graph streams, usually presenting a (multi-)graph G as a streamed list of its edges. Estimating the number of triangles in G was niftily reduced to estimating the moments F0 , F1 , and F2 . This lecture introduced techniques for obtaining matching lower bounds via communication complexity, here by reduction from set-disjointness for the case of telling apart when G has 0 triangles from 1-or-more. Bounds did not care about polynomials in log n: “Õ is your friend.” Another core task is retaining a set M of edges, no two sharing an endpoint (i.e. a partial statistic, such as the number of such edges Pmatching), that maximizes some ′ e or a weighted sum e∈M we . Given a new edge e from the stream that conflicts with some edge e ∈ M, the nub is when and whether to discard e in favor of e′ . Then geometric streams were introduced, meaning streams of points from a general metric space, such as higher-dimensional Euclidean space. The k-center problem is to retain k points from the stream so as to minimize the distance from any other point to some chosen point. (Technically this means minimizing maxp miny d(p, y), where d is the metric and y ranges over the set of k chosen points—can we coin the word “minimaximin”?) The shortest-path metric on graphs led to merging the two ideas of streams. A key problem was finding an edge-induced subgraph H on the same vertex set V such that the distances restricted to H are within a multiplicative α factor of the distances in G. This was also a case where allowing multiple passes over the stream gives notably better results. Lectures 5 and 6 covered sparse approximation problems and the interesting, historically central notion of Haar wavelets. The goal is to approximate a signal (represented as a vector) with high fidelity using as few terms as possible over the Haar basis. This connected to the hot topic of compressed sensing, in which we are still trying to represent a vector by a small number of terms over a basis, but the goal is to do so with as few (physical) measurements as possible. 
Compressed sensing is more than just a theory of linear projections, because not all projections are the same— they have different costs. In a digital camera, each pixel performs a measurement, but generally each measurement involves some cost to the system—hence ideas from compressed sensing en47 gendered the technological goal of a “Single-Pixel Camera.” It is cool that considerations from streaming played a role in the development of this field. Lecture 7 (by Andrew) presented the random-order stream model and multi-pass algorithms. In the random-order model, the underlying input (say, edges in a graph) is chosen adversarily but items are presented in a (uniformly) random order. In both models (random order and multipass) some problems such as the median become slightly easier, but they don’t help much for the frequency moments F0 , F1 , F − 2, . . . . We presented some lower bounds in the multipass model by reducing to the Index Problem and the Pointer Chasing technique. In Lecture 8, Linear Algebra Problems, we discussed mainly the Matrix Product and the Least Squares problems. For the matrix product, we saw that keeping CM-sketches for the matrix rows, and multiplying the sketches when needed, gave us an approximation with memory linear rather than quadratic in n. For the least squares problem, we saw that keeping CM-vectors led to a good approximation algorithm not only in the streaming model but also in the standard sense. Lecture 9 presented Map-Reduce and discussed distributed streaming computations. The leading example was the “continuous problem” in which we have many sites and a central site that must output 1 when the number of items collected among all sites exceeds a given threshold. This led to the definition of the MUD (Massive Unordered Data) model and its relation to streaming. 10.3 Some Topics Not Covered Finally, let us conclude by mentioning some of the many topics not covered. [Muthu only listed the following bulleted ones, but I’ve supplied descriptive details for each (so any bugs are on me not him), and have also added one more example at the end from a very recent paper. KWR] • (More) Results on Geometric Streams: In geometric streams, the items are not single values but rather vectors in k-dimensional space. Per their main mention on p37 of [Mut09], the main effect is that instead of having a linear spectrum of values, one must also specify and maintain a partitioning of space into geometrically-determined grids, according to some metric. The resulting notion of “bucket” requires one “to maintain quite sophisticated statistics on points within each of the buckets in small space. This entails using various (non-)standard arguments and norm estimation methods within each of the buckets.” Topics here include: core sets, which are relatively small subsets C of the stream point set S such that the convex hull of C is a good approximation to the convex hull of S. Again, one must specify a spatial metric to quantify the goodness of the approximation. Another is where one wishes the median in each dimension of the points in C to be a good approximation to the respective median in S; this is called the problem of k-medians. Bi-chromatic matching, where one is given equal-sized point sets A, B and needs to find a bijection f : A −→ B that optimizes some metric-dependent cost function on the pairs (a, f (a)), also depends delicately on the space and the metric. A 2006 presentation by S. Suri [Sur06] covers some of the technical issues. • String Streams. 
A stream can be viewed as a string for properties that are order-dependent, such as the length of the longest common subsequence between two strings. One can also 48 picture streams of strings—distinguishing them from streams of k-vectors by considering problems in which the ordering of vector components matter or string-edit operations may be applied. Geometric issues also come into play here. • Probabilistic Streams. This does not merely mean random presentation of the items in a stream according to some distribution, but refers to situations in which the items themselves are not determined in advance but rather drawn from a distribution D, where D may or may not be known to the algorithm. One motivation is representing streams of sensor measurements that are subject to uncertainties. One can regard D as an ensemble over possible input streams. [Our two presenters wrote a major paper on this, “Estimating Aggregate Properties on Probabilistic Streams,” http://arxiv.org/abs/cs.DS/0612031, and then joined forces with T.S. Jayram (who initiated the topic) and E. Vee for [JMMV07].] • Sliding Window Models. This can be treated as a “dynamical complexity” version of the standard streaming model. The issue is not so much what storage and other privileges may be granted to the algorithm for the last N items it sees (for some N), but more the necessity to maintain statistics on the last N items seen and update them quickly when new items are presented and older ones slide outside the window. A detailed survey is [DM07]. • Machine Learning Applications. Many data-mining and other learning applications must operate within the parameters of streaming: a few looks at large data, no place to store it locally. Another avenue considers not just the quality of the statistical estimates obtained, as we have mostly done here, but also their robustness when the estimates are used inside a statistical inferencing application. João Gama of Portugal has written and edited some papers and books on this field. The considerations of streaming can also be applied to other computational models, for instance various kinds of protocols and proof systems. For one example, the last topic above can include analyzing the standard PAC learning model under streaming restrictions. For another, the new paper “Best-Order Streaming Model” by Atish Das Sarma, Richard Lipton, and Dampon Nahongkai [SLN09], which is publicly available from the first author’s site, pictures a prover having control over the order in which the input stream is presented to the verifier. This resembles best-case models discussed above, except that the requirement that the prover cannot cheat on “no”-instances and the full dependence of the ordering on details of the input can differ from particulars of the others and their associated communication models. Consider the task of proving that a big undirected graph G with vertices labeled 1, . . . , n has a perfect matching. In the “yes” case, the prover orders the stream to begin with the n/2 edges of a perfect matching, then sends a separator symbol, and then sends the rest of the graph. The verifier still needs to check that the n-many vertex labels seen before the separator are all distinct, indeed fill out 1, . . . , n. They give a randomized protocol needing only O(log n) space, but show by reduction from a lower bound for set-disjointness in a variant-of-best-case communication model that any deterministic verifier needs Ω(n) space (for graphs presented as streams of edges). 
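For concreteness, here is one standard way a small-space verifier could check that prefix, using a polynomial-identity fingerprint modulo a large prime; this illustrates the kind of randomized O(log n)-space check involved, and is not necessarily the exact protocol of [SLN09].

```python
import random

def verify_matching_prefix(prefix_labels, n, trials=3):
    """Check, in a single pass over the prefix, that the vertex labels of the
    claimed perfect matching are exactly 1..n, using a polynomial-identity
    fingerprint modulo a large prime.  Each trial keeps O(log n) bits."""
    p = 2 ** 61 - 1                               # a large prime modulus
    for _ in range(trials):
        r = random.randrange(p)
        prod_stream, count = 1, 0
        for lab in prefix_labels:                 # the labels seen before the separator
            prod_stream = prod_stream * ((r - lab) % p) % p
            count += 1
        prod_full = 1                             # fingerprint of the multiset {1, ..., n}
        for i in range(1, n + 1):
            prod_full = prod_full * ((r - i) % p) % p
        if count != n or prod_stream != prod_full:
            return False
    return True

# prover's prefix: the endpoints of a perfect matching on 6 vertices
print(verify_matching_prefix([1, 4, 2, 6, 3, 5], 6))   # True
print(verify_matching_prefix([1, 4, 2, 6, 3, 3], 6))   # False (label 3 repeated)
```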
For graph connectivity, which has Ω(n) space bounds even for randomized algorithms in worst-case and random-case streaming models, they give an O(log2 n)-space best-order proof system. For non-bipartiteness, the simple idea is to begin with an 49 odd cycle, but proving bipartiteness space-efficiently remains an open problem in their model. The motivation comes from “cloud-computing” situations in which it is reasonable to suppose that the server has the knowledge and computational resources needed to optimize the order of presentation of the data to best advantage for the verifier or learner. Whether we have optimized our notes stream for learning the material is left for you to decide! 50 Bibliography [Bas08] Surender Baswana. Streaming algorithm for graph spanners - single pass and constant processing time per edge. Inf. Process. Lett., 106(3):110–114, 2008. [BC09] Joshua Brody and Amit Chakrabarti. A multi-round communication lower bound for gap hamming and some consequences. CoRR abs/0902.2399:, 2009. [CCM07] Amit Chakrabarti, Graham Cormode, and Andrew McGregor. A near-optimal algorithm for computing the entropy of a stream. In Proc. SODA’07, volume SIAM Press, pages 328–335, 2007. [CCM08] Amit Chakrabarti, Graham Cormode, and Andrew McGregor. Robust lower bounds for communication and stream computation. In STOC, pages 641–650, 2008. [CM04] Graham Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. J. Algorithms, 55:29–38, 2004. [CMY08] G. Cormode, S. Muthukrishnan, and K. Yi. Algorithms for distributed functional monitoring. In 19th ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1076–1085, 2008. [DM07] Mayur Datar and Rajeev Motwani. The sliding-window computation model and results. In Advances in Database Systems. Springer US, 2007. Chapter 8. [Elk07] Michael Elkin. Streaming and fully dynamic centralized algorithms for constructing and maintaining sparse spanners. In ICALP, pages 716–727, 2007. [FKM+ 05] Joan Feigenbaum, Sampath Kannan, Andrew McGregor, Siddharth Suri, and Jian Zhang. On graph problems in a semi-streaming model. Theor. Comput. Sci., 348(23):207–216, 2005. [FMS+ 08] J. Feldman, S. Muthukrishnan, A. Sidiropoulos, C. Stein, and Z. Svitkina. On the complexity of processing massive, unordered, distributed data. In 19th ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 710–719, 2008. [GK01] Michael Greenwald and Sanjeev Khanna. Space-efficient online computation of quantile summaries. In SIGMOD Conference, pages 58–66, 2001. 51 [GM09] Sudipto Guha and Andrew McGregor. Stream order and order statistics: Quantile estimation in random-order streams. SIAM Journal on Computing, 38(5):2044–2059, 2009. [Guh09] Sudipto Guha. Tight results for summarizing data streams. In Proc. ICDT’09, 2009. [JKS35] T. S. Jayram, Ravi Kumar, and D. Sivakumar. The one-way communication complexity of hamming distance. 4, 2008:Theory of Computing, 129–135. [JMMV07] T.S. Jayram, A. McGregor, S. Muthukrishnan, and E. Vee. Estimating statistical aggregates on probabilistic data streams. In Proceedings of PODS’07, 2007. [LNVZ06] David Liben-Nowell, Erik Vee, and An Zhu. Finding longest increasing and common subsequences in streaming data. J. Comb. Optim., 11(2):155–175, 2006. [McG05] A. McGregor. Finding graph matchings in data stream. In Proceedings of the 8th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems, pages 170–181, 2005. [MK08a] Richard Matthew McCutchen and Samir Khuller. 
Streaming algorithms for k-center clustering with outliers and with anonymity. In APPROX-RANDOM, pages 165–178, 2008. [MK08b] Richard Matthew McCutchen and Samir Khuller. Streaming algorithms for k-center clustering with outliers and with anonymity. In Proc. APPROX-RANDOM 2008, pages 165–178, 2008. [Mut09] S. Muthu Muthukrishnan. Data Streams: Algorithms and Applications, 2009. [NW93] Noam Nisan and Avi Wigderson. Rounds in communication complexity revisited. SIAM J. Comput., 22(1):211–219, 1993. [Raz92] Alexander A. Razborov. On the distributional complexity of disjointness. Theor. Comput. Sci., 106(2):385–390, 1992. [Sar09] Artish Das Sarma. Distributed streaming: The power of communication. Manuscript, 2009. [SLN09] Atish Das Sarma, Richard J. Lipton, and Danupon Nanongkai. Best-order streaming model. In Proceedings of TAMC’09, 2009. to appear. [Sur06] S. Suri. Fishing for patterns in (shallow) geometric streams, 2006. Presentation, 2006 IIT Kanpur Workshop on Algorithms for Data Streams. [Zel08] Mariano Zelke. Weighted matching in the semi-streaming model. In 25th International Symposium on Theoretical Aspects of Computer Science (STACS 2008), pages 669– 680, 2008. 52