Crash Course on Data Stream Algorithms
Part I: Basic Definitions and Numerical Streams
Andrew McGregor, University of Massachusetts Amherst

Goals of the Crash Course

- Goal: Give a flavor for the theoretical results and techniques from the hundreds of papers on the design and analysis of stream algorithms. "When we abstract away the application-specific details, what are the basic algorithmic ideas and challenges in stream processing? What is and isn't possible?"
- Disclaimer: The talks will be theoretical/mathematical but shouldn't require much in the way of prerequisites.
- Request:
  - If you get bored, ask questions...
  - If you get lost, ask questions...
  - If you'd like to ask questions, ask questions...

Outline

- Basic Definitions
- Sampling
- Sketching
- Counting Distinct Items
- Summary of Some Other Results
Data Stream Model

- Stream: $m$ elements from a universe of size $n$, e.g.,
  $$\langle x_1, x_2, \ldots, x_m \rangle = 3, 5, 3, 7, 5, 4, \ldots$$
- Goal: Compute a function of the stream, e.g., the median, the number of distinct elements, the longest increasing sequence.
- Catch:
  1. Limited working memory, sublinear in $n$ and $m$
  2. Access the data sequentially
  3. Process each element quickly
- The model has origins in the 70s but has become popular in the last ten years because of a growing theory and wide applicability.

Why's it become popular?

- Practical Appeal:
  - Faster networks, cheaper data storage, and ubiquitous data-logging result in massive amounts of data to be processed.
  - Applications include network monitoring, query planning, I/O efficiency for massive data, sensor network aggregation...
- Theoretical Appeal:
  - Problems that are easy to state but hard to solve.
  - Links to communication complexity, compressed sensing, embeddings, pseudo-random generators, approximation...

Sampling and Statistics

- Sampling is a general technique for tackling massive amounts of data.
- Example: To compute the median packet size of some IP packets, we could just sample some and use the median of the sample as an estimate for the true median. Statistical arguments relate the size of the sample to the accuracy of the estimate.
- Challenge: But how do you take a sample from a stream of unknown length, or from a "sliding window"?

Reservoir Sampling

- Problem: Find a uniform sample $s$ from a stream of unknown length.
- Algorithm (a sketch in code follows this slide):
  - Initially, $s = x_1$.
  - On seeing the $t$-th element, set $s \leftarrow x_t$ with probability $1/t$.
- Analysis: What's the probability that $s = x_i$ at some time $t \ge i$?
  $$\Pr[s = x_i] = \frac{1}{i} \times \left(1 - \frac{1}{i+1}\right) \times \cdots \times \left(1 - \frac{1}{t}\right) = \frac{1}{t}$$
- To get $k$ samples we use $O(k \log n)$ bits of space.
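A minimal Python sketch of the reservoir sampler above; the function name is illustrative and `random.random()` stands in for the algorithm's coin flips.

```python
import random

def reservoir_sample(stream):
    """Return a uniform sample from a stream of unknown length."""
    sample = None
    for t, x in enumerate(stream, start=1):
        # Keep x_t with probability 1/t; by induction, each of the first
        # t elements is the current sample with probability exactly 1/t.
        if random.random() < 1.0 / t:
            sample = x
    return sample

print(reservoir_sample([3, 5, 3, 7, 5, 4]))
```

To get $k$ samples, one can simply run $k$ independent copies of this loop.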
Priority Sampling for Sliding Windows

- Problem: Maintain a uniform sample from the last $w$ items.
- Algorithm (a sketch in code follows this slide):
  1. For each $x_i$, pick a random value $v_i \in (0,1)$.
  2. In a window $\langle x_{j-w+1}, \ldots, x_j \rangle$, return the value $x_i$ with the smallest $v_i$.
  3. To do this, maintain the set $S$ of all elements in the sliding window whose $v$ value is minimal among subsequent values.
- Analysis:
  - The probability that the $j$-th oldest element is in $S$ is $1/j$, so the expected number of items in $S$ is $1/w + 1/(w-1) + \cdots + 1/1 = O(\log w)$.
  - Hence, the algorithm only uses $O(\log w \log n)$ bits of memory.
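One possible Python sketch of the sliding-window sampler, assuming a sample is reported after every arrival; all names are illustrative. $S$ is kept as a deque ordered by arrival time with strictly increasing tags, so its front is always the element with the smallest tag in the window.

```python
import random
from collections import deque

def sliding_window_sample(stream, w):
    """Yield, after each arrival, a uniform sample from the last w items."""
    S = deque()  # entries (index, tag, value); tags strictly increasing
    for j, x in enumerate(stream):
        v = random.random()  # the random tag v_j in (0, 1)
        # Discard elements whose tag is not minimal among later tags:
        # they can never again be the window minimum.
        while S and S[-1][1] >= v:
            S.pop()
        S.append((j, v, x))
        # Drop elements that have expired from the window.
        while S[0][0] <= j - w:
            S.popleft()
        yield S[0][2]  # element with the smallest tag in the window

for sample in sliding_window_sample([3, 5, 3, 7, 5, 4], w=3):
    print(sample)
```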
Other Types of Sampling

- Universe sampling: For a random $i \in_R [n]$, compute $f_i = |\{j : x_j = i\}|$.
- Minwise hashing: Sample $i \in_R \{i : \text{there exists } j \text{ such that } x_j = i\}$.
- AMS sampling: Sample $x_j$ for $j \in_R [m]$ and compute $r = |\{j' \ge j : x_{j'} = x_j\}|$. This is handy when estimating quantities like $\sum_i g(f_i)$ because
  $$\mathbb{E}\left[m\big(g(r) - g(r-1)\big)\right] = \sum_i g(f_i)$$

Sketching

- Sketching is another general technique for processing streams.
- Basic idea: Apply a linear projection "on the fly" that takes high-dimensional data to a lower-dimensional space. Post-process the lower-dimensional image to estimate the quantities of interest.

Estimating the difference between two streams

- Input: A stream interleaving elements from two sources (say, red and blue), $\langle x_1, x_2, \ldots, x_m \rangle \in ([n] \cup [n])^m$.
- Goal: Estimate the difference between the distribution of red values and blue values, e.g.,
  $$\sum_{i \in [n]} |f_i - g_i|$$
  where $f_i$ is the number of red occurrences of value $i$ and $g_i$ is the number of blue occurrences.

p-Stable Distributions and Algorithm

- Defn: A $p$-stable distribution $\mu$ has the following property: for $X, Y, Z \sim \mu$ and $a, b \in \mathbb{R}$,
  $$aX + bY \sim (|a|^p + |b|^p)^{1/p} Z$$
  E.g., the Gaussian is 2-stable and the Cauchy distribution is 1-stable.
- Algorithm (a toy implementation follows this slide):
  - Generate a random matrix $A \in \mathbb{R}^{k \times n}$ where $A_{ij} \sim \text{Cauchy}$ and $k = O(\epsilon^{-2})$.
  - Compute the sketches $Af$ and $Ag$ incrementally.
  - Return $\mathrm{median}(|t_1|, \ldots, |t_k|)$ where $t = Af - Ag$.
- Analysis:
  - By the 1-stability property, for $Z_i \sim \text{Cauchy}$,
    $$|t_i| = \Big|\sum_j A_{i,j}(f_j - g_j)\Big| \sim |Z_i| \sum_j |f_j - g_j|$$
  - For $k = O(\epsilon^{-2})$, since $\mathrm{median}(|Z_i|) = 1$, with high probability,
    $$(1-\epsilon) \sum_j |f_j - g_j| \le \mathrm{median}(|t_1|, \ldots, |t_k|) \le (1+\epsilon) \sum_j |f_j - g_j|$$
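A toy Python version of the Cauchy sketch, assuming NumPy and materializing $A$ explicitly for clarity; a real streaming implementation would generate the entries of $A$ pseudo-randomly on the fly rather than store a $k \times n$ matrix. The names and parameter choices here are illustrative.

```python
import numpy as np

def l1_sketch_demo(red, blue, n, k=400, seed=0):
    """Estimate sum_i |f_i - g_i| between a red and a blue stream."""
    rng = np.random.default_rng(seed)
    A = rng.standard_cauchy((k, n))  # A_ij ~ Cauchy (1-stable)
    sketch = np.zeros(k)
    for i in red:    # streaming update: t += A e_i for a red arrival
        sketch += A[:, i]
    for i in blue:   # ...and t -= A e_i for a blue arrival
        sketch -= A[:, i]
    # Each |t_i| ~ |Z_i| * sum_j |f_j - g_j| and median(|Z_i|) = 1.
    return np.median(np.abs(sketch))

# True L1 distance between the frequency vectors below is 4.
print(l1_sketch_demo(red=[3, 5, 3, 7], blue=[5, 4], n=10))
```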
A Useful Multi-Purpose Sketch: Count-Min Sketch

- Heavy Hitters: Find all $i$ such that $f_i \ge \phi m$.
- Range Sums: Estimate $\sum_{i \le k \le j} f_k$ when $i, j$ aren't known in advance.
- Find $k$-Quantiles: Find values $q_0, \ldots, q_k$ such that $q_0 = 0$, $q_k = n$, and
  $$\sum_{i \le q_{j-1}} f_i < \frac{jm}{k} \le \sum_{i \le q_j} f_i$$
- Algorithm: Count-Min Sketch (a sketch in code follows this slide)
  - Maintain an array of counters $c_{i,j}$ for $i \in [d]$ and $j \in [w]$.
  - Construct $d$ random hash functions $h_1, h_2, \ldots, h_d : [n] \to [w]$.
  - Update counters: On seeing value $v$, increment $c_{i,h_i(v)}$ for each $i \in [d]$.
  - To get an estimate of $f_v$, return $\tilde{f}_v = \min_i c_{i,h_i(v)}$.
- Analysis: For $d = O(\log 1/\delta)$ and $w = O(1/\epsilon)$,
  $$\Pr\left[f_v \le \tilde{f}_v \le f_v + \epsilon m\right] \ge 1 - \delta$$
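A short Python sketch of the Count-Min update and query logic. The hash functions $h_i$ are simulated here with Python's built-in `hash` and per-row salts; a real implementation would use a pairwise-independent hash family. Class and method names are illustrative.

```python
import random

class CountMinSketch:
    """d x w array of counters with one hash function per row."""

    def __init__(self, d, w, seed=0):
        self.d, self.w = d, w
        rnd = random.Random(seed)
        # Per-row salts standing in for random hash functions h_1..h_d.
        self.salts = [rnd.getrandbits(64) for _ in range(d)]
        self.counts = [[0] * w for _ in range(d)]

    def _h(self, i, v):
        return hash((self.salts[i], v)) % self.w

    def update(self, v):
        # On seeing value v, increment c[i][h_i(v)] for every row i.
        for i in range(self.d):
            self.counts[i][self._h(i, v)] += 1

    def estimate(self, v):
        # Collisions only inflate counters, so the minimum over the
        # rows never underestimates the true frequency f_v.
        return min(self.counts[i][self._h(i, v)] for i in range(self.d))

cm = CountMinSketch(d=5, w=64)
for x in [3, 5, 3, 7, 5, 4, 3]:
    cm.update(x)
print(cm.estimate(3))  # true frequency is 3
```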
Counting Distinct Elements

- Input: Stream $\langle x_1, x_2, \ldots, x_m \rangle \in [n]^m$.
- Goal: Estimate the number of distinct values in the stream up to a multiplicative factor $(1+\epsilon)$ with high probability.

Algorithm

- Algorithm (a toy version in code follows the analysis):
  1. Apply a random hash function $h : [n] \to [0,1]$ to each element.
  2. Compute $\phi$, the $t$-th smallest hash value seen, where $t = 21/\epsilon^2$.
  3. Return $\tilde{r} = t/\phi$ as an estimate for $r$, the number of distinct items.
- Analysis:
  1. The algorithm uses $O(\epsilon^{-2} \log n)$ bits of space.
  2. We'll show the estimate has good accuracy with reasonable probability:
     $$\Pr[|\tilde{r} - r| \le \epsilon r] \ge 9/10$$

Accuracy Analysis

1. Suppose the distinct items are $a_1, \ldots, a_r$.
2. Over-estimation:
   $$\Pr[\tilde{r} \ge (1+\epsilon)r] = \Pr[t/\phi \ge (1+\epsilon)r] = \Pr\left[\phi \le \frac{t}{r(1+\epsilon)}\right]$$
3. Let $X_i = \mathbb{1}\left[h(a_i) \le \frac{t}{r(1+\epsilon)}\right]$ and $X = \sum_i X_i$, so $\mathbb{E}[X] = t/(1+\epsilon)$. Then
   $$\Pr\left[\phi \le \frac{t}{r(1+\epsilon)}\right] = \Pr[X \ge t] = \Pr[X \ge (1+\epsilon)\mathbb{E}[X]]$$
4. By a Chebyshev analysis,
   $$\Pr[X \ge (1+\epsilon)\mathbb{E}[X]] \le \frac{1}{\epsilon^2 \mathbb{E}[X]} \le 1/20$$
5. Under-estimation: A similar analysis shows $\Pr[\tilde{r} \le (1-\epsilon)r] \le 1/20$.
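A toy Python version of the $t$-th-smallest-hash algorithm, assuming a seeded built-in `hash` stands in for the random hash function $h$. Maintaining the $t$ smallest values with a set and `max` costs $O(t)$ per update and is only for illustration; a max-heap would be the idiomatic choice. All names are illustrative.

```python
import random

def distinct_estimate(stream, eps=0.1, seed=0):
    """Estimate the number of distinct items as t / (t-th smallest hash)."""
    t = int(21 / eps**2)  # t = 21/eps^2, as on the slide
    salt = random.Random(seed).getrandbits(64)

    def h(x):
        # Simulated random hash of x into (0, 1].
        return (hash((salt, x)) % (2**61 - 1) + 1) / 2**61

    smallest = set()  # the (at most) t smallest distinct hash values
    for x in stream:
        smallest.add(h(x))
        if len(smallest) > t:
            smallest.discard(max(smallest))  # O(t); use a heap in practice
    if len(smallest) < t:
        return len(smallest)  # fewer than t distinct items: exact count
    return t / max(smallest)  # max(smallest) is phi, the t-th smallest

stream = [random.randrange(10**6) for _ in range(10**4)]
print(distinct_estimate(stream), len(set(stream)))
```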
Some Other Results

Correlations:
- Input: $\langle (x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m) \rangle$
- Goal: Estimate the strength of the correlation between $x$ and $y$ via the distance between the joint distribution and the product of the marginals.
- Result: $(1+\epsilon)$ approximation in $\tilde{O}(\epsilon^{-O(1)})$ space.

Linear Regression:
- Input: Stream defines a matrix $A \in \mathbb{R}^{n \times d}$ and a vector $b \in \mathbb{R}^{n \times 1}$.
- Goal: Find $x$ such that $\|Ax - b\|_2$ is minimized.
- Result: $(1+\epsilon)$ estimation in $\tilde{O}(d^2 \epsilon^{-1})$ space.

Some More Other Results

Histograms:
- Input: $\langle x_1, x_2, \ldots, x_m \rangle \in [n]^m$
- Goal: Determine the $B$-bucket histogram $H : [m] \to \mathbb{R}$ minimizing $\sum_{i \in [m]} (x_i - H(i))^2$.
- Result: $(1+\epsilon)$ estimation in $\tilde{O}(B^2 \epsilon^{-1})$ space.

Transpositions and Increasing Subsequences:
- Input: $\langle x_1, x_2, \ldots, x_m \rangle \in [n]^m$
- Goal: Estimate the number of transpositions $|\{i < j : x_i > x_j\}|$.
- Goal: Estimate the length of the longest increasing subsequence.
- Results: $(1+\epsilon)$ approximation in $\tilde{O}(\epsilon^{-1})$ and $\tilde{O}(\epsilon^{-1}\sqrt{n})$ space, respectively.

Thanks!

- Blog: http://polylogblog.wordpress.com
- Lectures: Piotr Indyk, MIT, http://stellar.mit.edu/S/course/6/fa07/6.895/
- Books: "Data Streams: Algorithms and Applications", S. Muthukrishnan (2005); "Algorithms and Complexity of Stream Processing", A. McGregor and S. Muthukrishnan (forthcoming).