Streaming Algorithms, etc.
Piotr Indyk
MIT

Data Streams
• A data stream is a (massive) sequence of data
  – Too large to store (on disk, in memory, cache, etc.)
• Examples:
  – Network traffic (source/destination)
  – Database transactions
  – Sensor networks
  – Satellite data feed
  – …
• Approaches:
  – Ignore it
  – Develop algorithms for dealing with such data

Example application: Monitoring Network Traffic
• Router routes packets (many packets)
  – Where do they come from?
  – Where do they go to?
• Ideally, would like to maintain a traffic matrix x[.,.]
  – For each (src,dst) packet, increment x_{src,dst}
  – Requires way too much space! (2^32 × 2^32 entries)
  – Need to maintain a compressed version of the matrix
[Figure: traffic matrix x, rows indexed by source, columns by destination]

This course
• Systematic introduction to the area
  – Emphasis on common themes
  – Connections between streaming, sketching, compressed sensing, communication complexity, …
  – Second of its kind (previous edition from Fall ’07: see my web page at MIT)
• Style: algorithmic/theoretical…
  – Background in linear algebra and probability assumed

Topics
• Streaming model. Estimating distinct elements (L0 norm)
• Estimating L2 norm (AMS), Johnson-Lindenstrauss
• Lp norm (p<2), other norms, entropy
• Heavy hitters: L1 norm, L2 norm, sparse approximations
• Sparse recovery via LP decoding
• Lower bounds: communication complexity, indexing, L2 norm
• Options: MST, bi-chromatic matching, insertions-only streams, Fourier sampling, …

Plan For This Lecture
• Introduce the data stream model(s)
• Basic algorithms
  – Estimating the number of distinct elements in a stream
  – Intro to frequency moments and norms

Basic Data Stream Model
• Single pass over the data: i1, i2, …, in
  – Typically, we assume n is known
• Bounded storage (typically n^α or log^c n)
  – Units of storage: bits, words or “elements” (e.g., points, nodes/edges)
• Fast processing time per element
• Randomness OK (in fact, almost always necessary)

8 2 1 9 1 9 2 4 6 3 9 4 2 3 4 2 3 8 5 2 5 6 ...

Counting Distinct Elements
• Stream elements: numbers from {1…m}
• Goal: estimate the number of distinct elements DE in the stream
  – Up to a factor 1±ε
  – With probability 1-P
• Simpler goal: for a given T>0, provide an algorithm which, with probability 1-P:
  – Answers YES, if DE > (1+ε)T
  – Answers NO, if DE < (1-ε)T
• Run, in parallel, the algorithm with T = 1, 1+ε, (1+ε)^2, …, n
  – Total space multiplied by log_{1+ε} n ≈ log(n)/ε
  – Probability of failure multiplied by the same factor

Vector Interpretation
Stream: 8 2 1 9 1 9 2 4 4 9 4 2 5 4 2 5 8 5 2 5
[Figure: vector x with coordinates 1…9, one counter per possible element]
• Initially, x = 0
• Insertion of i is interpreted as x_i = x_i + 1
• Want to estimate DE(x) = ||x||_0

Estimating DE(x)
[Figure: vector x with coordinates 1…9; the coordinates belonging to a random set S are marked (here T=4)]
• Choose a random set S of coordinates
  – For each i, we have Pr[i∈S] = 1/T
• Maintain Sum_S(x) = Σ_{i∈S} x_i
• Estimation algorithm A:
  – YES, if Sum_S(x) > 0
  – NO, if Sum_S(x) = 0
• Analysis:
  – Pr = Pr[Sum_S(x)=0] = (1-1/T)^DE
  – For T “large enough”: (1-1/T)^DE ≈ e^{-DE/T}
  – Using calculus, for ε small enough:
    • If DE > (1+ε)T, then Pr ≈ e^{-(1+ε)} < 1/e - ε/3
    • If DE < (1-ε)T, then Pr ≈ e^{-(1-ε)} > 1/e + ε/3
[Plot: Pr[Sum_S(x)=0] = (1-1/T)^DE as a function of DE, crossing 1/e around DE = T]
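To make the test concrete, here is a minimal Python sketch of algorithm A. The class name ThresholdDE, the salted built-in hash, and the update/answer interface are illustration choices, not part of the lecture: the random set S is simulated by hashing, so Pr[i∈S] = 1/T holds (idealizing the hash as fully random) without ever storing S, and the only state kept is the single counter Sum_S(x).

```python
import random

class ThresholdDE:
    """Algorithm A: one randomized test of whether DE(x) exceeds a threshold T.

    The random set S = {i : h(i) = 1} is simulated with a salted hash, so
    Pr[i in S] = 1/T without ever storing S. The only state kept is the
    single counter Sum_S(x).
    """

    def __init__(self, T):
        self.T = max(1, int(T))              # threshold, used as hash modulus
        self.salt = random.getrandbits(64)   # fixes one random choice of S
        self.sum_s = 0                       # Sum_S(x) = sum_{i in S} x_i

    def _in_s(self, i):
        return hash((self.salt, i)) % self.T == 0   # Pr[i in S] = 1/T

    def update(self, i, delta=1):
        # Insertion of element i means x_i += delta; delta = -1 is a deletion,
        # which works because Sum_S(x) is linear in x (turnstile model).
        if self._in_s(i):
            self.sum_s += delta

    def answer(self):
        # YES (True)  <=> Sum_S(x) > 0, suggesting DE > (1+eps)T
        # NO  (False) <=> Sum_S(x) = 0, suggesting DE < (1-eps)T
        return self.sum_s > 0
```

A single instance errs with constant probability (the gap around 1/e is only ±ε/3), which is exactly what algorithm B amplifies next.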
Estimating DE(x) ctd.
• We have Algorithm A:
  – If DE > (1+ε)T, then Pr < 1/e - ε/3
  – If DE < (1-ε)T, then Pr > 1/e + ε/3
• Algorithm B:
  – Select sets S1 … Sk, k = O(log(1/P)/ε^2)
  – Let Z = number of Sum_{Sj}(x) that are equal to 0
  – By the Chernoff bound (defined in the interlude below), with probability > 1-P:
    • If DE > (1+ε)T, then Z < k/e
    • If DE < (1-ε)T, then Z > k/e
• Total space: O( log(n)/ε × log(1/P)/ε^2 ) numbers in range 0…n
  – Can remove the log(n)/ε factor
• Bibliographic note: [Flajolet-Martin’85]

Interlude – Chernoff bound
• Let Z1 … Zk be i.i.d. Bernoulli variables, with Pr[Zj=1] = p
• Let Z = ∑_j Zj
• For any 1 > ε > 0, we have
  Pr[ |E[Z]-Z| > εE[Z] ] ≤ 2 exp( -ε^2 E[Z]/3 )

Comments
• Implementing S:
  – Choose a hash function h: {1…m} -> {1…T}
  – Define S = {i: h(i)=1}
• Implementing h:
  – Pseudorandom generators. More later.
• Better algorithms known:
  – Theory: O( log(1/ε)/ε^2 + log n ) bits [Bar-Yossef-Jayram-Kumar-Sivakumar-Trevisan’02]
  – Practice: 128 bytes suffice for all works of Shakespeare, ε ≈ 10% [Durand-Flajolet’03]

More comments
• The algorithm uses “linear sketches” Sum_{Sj}(x) = Σ_{i∈Sj} x_i
• Can implement decrements x_i = x_i - 1
  – I.e., the stream can contain deletions of elements (as long as x ≥ 0)
  – Other names: dynamic model, turnstile model

More General Problem
• What other functions of a vector x can we maintain in small space?
• Lp norms: ||x||_p = ( ∑_i |x_i|^p )^{1/p}
  – We also have ||x||_∞ = max_i |x_i|
  – … and ||x||_0 = DE(x), since ||x||_p^p = ∑_i |x_i|^p → DE(x) as p → 0 (each nonzero |x_i|^p tends to 1)
• Alternatively: frequency moments Fp = p-th power of the Lp norm (exception: F0 = L0)
• How much space do you need to estimate ||x||_p (for const. ε)?
• Theorem:
  – For p∈[0,2]: polylog n space suffices
  – For p>2: n^{1-2/p} polylog n space suffices and is necessary
  [Alon-Matias-Szegedy’96, Feigenbaum-Kannan-Strauss-Viswanathan’99, Indyk’00, Coppersmith-Kumar’04, Ganguly’04, Bar-Yossef-Jayram-Kumar-Sivakumar’02,’03, Saks-Sun’03, Indyk-Woodruff’05]
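Putting the distinct-elements pieces of this lecture together, here is a hedged end-to-end sketch. It reuses the illustrative ThresholdDE class from the earlier sketch; the class name DistinctElements and the constant 3 inside k are my own choices — the lecture only fixes the asymptotics k = O(log(1/P)/ε^2). Algorithm B runs k independent copies of A and compares Z to k/e, and the outer grid T = 1, (1+ε), (1+ε)^2, …, n turns the threshold test into a 1±ε estimate.

```python
import math

class DistinctElements:
    """A sketch of the full estimator: algorithm B (k copies of algorithm A,
    compare Z to k/e) run over the grid T = 1, (1+eps), (1+eps)^2, ..., n.
    Reuses the illustrative ThresholdDE class from the earlier sketch."""

    def __init__(self, n, eps=0.1, fail_prob=0.05):
        # k = O(log(1/P)/eps^2); the constant 3 is an illustrative choice
        k = math.ceil(3 * math.log(1 / fail_prob) / eps**2)
        self.batches = []                 # one batch of k copies per threshold T
        T = 1.0
        while T <= n:
            self.batches.append((T, [ThresholdDE(T) for _ in range(k)]))
            T *= 1 + eps

    def update(self, i, delta=1):
        # Feed the element to every copy; the sketch is linear, so deletions
        # (delta = -1) are also supported as long as x stays nonnegative.
        for _, copies in self.batches:
            for c in copies:
                c.update(i, delta)

    def estimate(self):
        # For each T, algorithm B says YES iff Z < k/e, where Z counts the
        # copies with Sum_S(x) = 0. With high probability the YES answers
        # form a prefix of the grid; return the largest T that still says YES.
        best = 1.0
        for T, copies in self.batches:
            Z = sum(1 for c in copies if c.sum_s == 0)
            if Z < len(copies) / math.e:
                best = T
        return best
```

This naive version pays the log(n)/ε factor in both space and update time; the slides note it can be removed — one way is to reuse a single hash function whose value determines membership in all the thresholds' sets at once (nested sets), in the spirit of [Flajolet-Martin’85].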
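A quick, unoptimized usage check under the same illustrative assumptions. The coarse parameters (ε = 0.3, P = 0.2) are chosen only so the pure-Python demo finishes quickly; the ±ε/3 margin in the lecture's analysis assumes smaller ε.

```python
import random

random.seed(0)
stream = [random.randrange(1, 501) for _ in range(5000)]   # ~500 distinct values

de = DistinctElements(n=5000, eps=0.3, fail_prob=0.2)      # coarse, to keep the demo fast
for i in stream:
    de.update(i)

print(f"estimate ≈ {de.estimate():.0f}, true DE = {len(set(stream))}")
# Expected: with probability >= 1-P the estimate is within roughly a (1±eps) factor.
```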