Under consideration for publication in Knowledge and Information Systems A General Streaming Algorithm for Pattern Discovery Debprakash Patnaik1 , Srivatsan Laxman2 , Badrish Chandramouli3 , and Naren Ramakrishnan4 1 Amazon.com, Seattle, WA 98109, USA; 2 Microsoft Research, Bangalore 560080, India; 3 Microsoft Research, Redmond, WA, USA; 4 Department of Computer Science, Virginia Tech, VA 24061, USA Abstract. Discovering frequent patterns over event sequences is an important data mining problem. Existing methods typically require multiple passes over the data, rendering them unsuitable for streaming contexts. We present the first streaming algorithm for mining frequent patterns over a window of recent events in the stream. We derive approximation guarantees for our algorithm in terms of: (i) the separation of frequent patterns from infrequent ones, and (ii) the rate of change of stream characteristics. Our parameterization of the problem provides a new sweet spot in the tradeoff between making distributional assumptions over the stream and algorithmic efficiencies of mining. We illustrate how this yields significant benefits when mining practical streams from neuroscience and telecommunications logs. Keywords: Event Sequences; Data Streams; Frequent patterns; Pattern Discovery; Streaming Algorithms; Approximation Algorithms 1. Introduction Application contexts (Patnaik, Marwah, Sharma and Ramakrishnan, 2009; Ramakrishnan, Patnaik and Sreedharan, 2009) in telecommunications, neuroscience, sustainability, and intelligence analysis feature massive data streams (Muthukrishnan, 2005) with ‘firehose’-like rates of arrival. In many cases, we need to analyze such streams at speeds comparable to their generation rate. In neuroscience, Received Feb 01, 2013 Revised Feb 01, 2013 Accepted Feb 01, 2013 2 D. Patnaik et al one goal is to track spike trains from multi-electrode arrays (Patnaik, Laxman and Ramakrishnan, 2009) with a view to identify cascading circuits of neuronal firing patterns. In telecommunications, network traffic and call logs must be analyzed on a continual basis to detect attacks or other malicious activity. The common theme in all these scenarios is the need to mine patterns (i.e., a succession of events occurring frequently, but not necessarily consecutively (Mannila et al., 1997)) from dynamic and evolving streams. Algorithms for pattern mining over streams have become increasingly popular over the recent past (Manku and Motwani, 2002; Wong and Fu, 2006; Calders et al., 2007; Cheng et al., 2008). Manku and Motwani (Manku and Motwani, 2002) introduced a lossy counting algorithm for approximate frequency counting over streams, with no assumptions on the stream. Their focus on a worst-case setting often leads to stringent threshold requirements. At the other extreme, algorithms such as (Wong and Fu, 2006) provide significant efficiencies in mining but make strong assumptions such as i.i.d distribution of symbols in a stream. In the course of analyzing some real-world datasets, we were motivated to develop new methods as existing methods are unable to process streams at the rate and quality guarantees desired (see Sec. 6 for some examples). Furthermore, established stream mining algorithms are almost entirely focused on itemset mining (and, modulo a few isolated exceptions, just the counting phase of it) whereas we are interested in mining general patterns. Our specific contributions are as follows: – We present the first general algorithm for mining patterns in a stream. 
Unlike prior streaming algorithms that focus almost exclusively on counting, we provide solutions for both candidate generation and counting over a stream. – Although our work is geared towards pattern mining, we adopt a black-box model of a pattern mining algorithm. In other words, our approach can encapsulate and wrap around any pattern discovery algorithm to enable it to accommodate streaming data. This significantly generalizes the scope and applicability of our approach as a general methodology to streamify existing pattern discovery algorithms. We illustrate here this generality of our approach by focusing on two pattern classes—itemsets and episodes—prevalent in data mining research. – Devoid of any statistical assumptions on the stream (e.g., independence of event symbols or otherwise), we develop a novel error characterization for streaming patterns by identifying and tracking two key properties of the stream, viz. maximum rate of change and top-k separation. We demonstrate how the use of these two properties enables novel algorithmic optimizations, such as the idea of borders to amortize work as the stream is tracked. – We demonstrate successful applications in neuroscience and telecommunications log analysis, and illustrate significant benefits in runtime, memory usage, and the scales of data that can be mined. We compare against pattern-mining adaptations of two typical algorithms (Wong and Fu, 2006) from the streaming itemsets literature. 2. Preliminaries We consider the data mining task of discovering frequent patterns over a timestamped sequence of data records (Laxman and Sastry, 2006). The data se- A General Streaming Algorithm for Pattern Discovery 3 quence is denoted as h(e1 , τ1 ), (e2 , τ2 ), . . . , (en , τn )i, where each data record ei , may represent different kinds of data depending on the class of patterns being considered. For example, ei denotes a transaction in the frequent itemsets setting (Agrawal and Srikant, 1994), an event in the frequent episodes setting (Mannila et al., 1997), a graph in the frequent subgraph mining setting (Yan and Han, 2003), etc. The techniques developed in this paper are relevant in all of these settings. In our experimental work, we focus on two concrete classes of patterns; frequent episodes and frequent itemsets; we briefly summarize the associated formalisms for these in the paragraphs below. Frequent Episodes: In the framework of frequent episodes (Mannila et al., 1997), an event sequence is denoted as h(e1 , τ1 ), . . . , (en , τn )i, where (ei , τi ) represents the ith event; ei is drawn from a finite alphabet E of symbols (called event-types) and τi denotes the time-stamp of the ith event, with τi+1 ≥ τi , i = 1, . . . , (n − 1). An `-node episodes α is defined by a triple α = (Vα , <α , gα ), where Vα = {v1 , . . . , v` } is a collection of ` nodes, <α is a partial order over Vα and gα : Vα → E is a map that assigns an event-type gα (v) to each node v ∈ Vα . An occurrence of an episode α is a map h : Vα → {1, . . . , n} such that eh(v) = gα (v) for all v ∈ Vα and for all pairs of nodes v, v 0 ∈ Vα such that v <α v 0 the map h ensures that τh(v) < τh(v0 ) . Two occurrences of an episode are non-overlapped (Achar et al., 2012) if no event corresponding to one appears in-between the events corresponding to the other. The maximum number of non-overlapped occurrences of an episode is defined as its frequency in the event sequence. 
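To make the frequency definition concrete, here is a small illustrative sketch (ours, not taken from the paper; all names are hypothetical) of counting the maximum number of non-overlapped occurrences for the special case of a serial episode, i.e., an episode whose partial order is a total order over its nodes. A single greedy left-to-right scan suffices in this case; general partial-order episodes need the more elaborate counting algorithms of (Achar et al., 2012).

    # Minimal sketch (assumes a serial episode; names are illustrative only).
    def count_nonoverlapped_serial(events, episode):
        """events: iterable of (event_type, timestamp) pairs in time order;
        episode: list of event-types that must occur in this order."""
        count, i, last_time = 0, 0, None   # i = index of the next event-type awaited
        for etype, t in events:
            if etype == episode[i] and (i == 0 or t > last_time):
                last_time, i = t, i + 1
                if i == len(episode):      # a complete occurrence has been found
                    count += 1
                    i, last_time = 0, None # restart so occurrences do not overlap
        return count

    # Example: A -> B -> C occurs twice (non-overlapped) in this sequence.
    seq = [('A', 1), ('B', 2), ('C', 3), ('A', 4), ('B', 5), ('C', 6)]
    print(count_nonoverlapped_serial(seq, ['A', 'B', 'C']))   # prints 2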
Frequent Itemsets: The frequent itemsets mining framework is concerned with the classical market basket problem (Agrawal and Srikant, 1994), where each data record ei can be viewed as a transaction of items purchased at a grocery store. Frequent itemsets will then refer to groups of itemsets frequently purchased together in the given data set of transactions. In general, the task in frequent pattern discovery is to find all patterns whose frequency exceeds a user-defined threshold. Apriori-style level-wise1 algorithms (Agrawal and Srikant, 1994; Mannila et al., 1997; Yan and Han, 2003; Achar et al., 2012) are typically applicable in this setting. An important variant is topk pattern mining (see (Wang et al., 2005) for definitions in the itemsets mining context), where, rather than a frequency threshold, the user supplies the number of most frequent patterns needed. Definition 1 (Top-k patterns of size `). The set of top-k patterns of size ` is defined as the collection of all `-size patterns with frequency greater than or equal to the frequency f k of the k th most frequent `-size pattern in the given data sequence. The number of top-k `-size patterns can exceed k, although the number of `size patterns with frequencies strictly greater than f k is at most (k − 1). In 1 Level-wise algorithms start with patterns of size 1 and with each increasing level estimate frequent patterns of the next size. D. Patnaik et al Events 4 Window Ws [t] Batch Bs Time Fig. 1. A sliding window model for pattern mining over data streams: Bs is the most recent batch of records that arrived in the stream and Ws is the window of interest over which the user wants to determine the set of frequent patterns. general, top-k mining can be difficult to solve without knowledge of a good lowerbound for f k ; for relatively short data sequences the following simple solution works well-enough: start mining at a high threshold and progressively lower the threshold until the desired number of top patterns are returned. 3. Problem Statement The data available (referred to as a data stream) is in the form of a potentially infinite sequence of data records: D = h(e1 , τ1 ), (e2 , τ2 ), . . . , (ei , τi ), . . . , (en , τn ), . . .i (1) Our goal is to find all patterns that were frequent in the recent past; for this, we consider a sliding window model2 for the window of interest. In this model, the user wants to determine patterns that were frequent over a (historical) window of fixed-size terminating at the current time-tick. As new records arrive in the stream, the user’s window of interest shifts, and the data mining task is to next report the frequent patterns in the new window of interest. We consider the case where the window of interest is very large and cannot be stored and processed in-memory. This straightaway precludes the use of standard multi-pass algorithms for frequent pattern discovery over the window of interest. We organize the records in the stream into smaller batches such that at any given time only the latest incoming batch is stored and processed in memory. This is illustrated in Fig. 1. The current window of interest is denoted by Ws and the most recent batch, Bs , consists of records in D that occurred between times (s − 1)Tb and sTb where Tb is the time-span of each batch and s is the batch number (s = 1, 2, . . .) . The frequency of a pattern α in a batch Bs is referred to as its batch frequency f s (α). The current window of interest, Ws , consists of m consecutive batches ending in batch Bs , i.e. 
Ws = ⟨Bs−m+1, Bs−m+2, . . . , Bs⟩    (2)

Definition 2 (Window Frequency). The frequency of a pattern α over window Ws, referred to as its window frequency and denoted by f Ws (α), is defined as the sum of batch frequencies of α in Ws. Thus, if f j (α) denotes the batch frequency of α in batch Bj, then the window frequency of α is given by f Ws (α) = Σ_{Bj ∈ Ws} f j (α).

2 Other models such as the landmark and time-fading models have also been studied (Cheng et al., 2008) but we do not consider them here.

Fig. 2. Batch frequencies in Example 1.
Pattern   B1   B2   B3   B4
WXYZ      12    0    0   12
PQRS      10    0    9    0
ABCD       8    9    8   10
MNOP       8   10    8    8
EFGH       0   15    0   10
IJKL       0   12   11    0

Table 1. Window frequencies in Example 1.
Pattern       ABCD  MNOP  EFGH  WXYZ  IJKL  PQRS
Window Freq     35    34    25    24    23    19

In summary, we are given a data stream (D), a time-span for batches (Tb), the number of consecutive batches that constitute the current window of interest (m), the desired size of frequent patterns (ℓ), the desired number of most frequent patterns (k), and the problem is to discover the top-k patterns in the current window without actually having the entire window in memory.

Problem 1 (Streaming Top-k Mining). For each new batch, Bs, of records in the stream, find all ℓ-size patterns in the corresponding window of interest, Ws, whose window frequencies are greater than or equal to the window frequency, fsk, of the k th most frequent ℓ-size pattern in Ws.

Example 1 (Window Top-k v/s Batch Top-k). Let W be a window of four batches B1 through B4. The patterns in each batch with corresponding batch frequencies are listed in Fig. 2. The corresponding window frequencies (the sum of each pattern's batch frequencies) are listed in Table 1. The top-2 patterns in B1 are (PQRS) and (WXYZ). Similarly (EFGH) and (IJKL) are the top-2 patterns in B2, and so on. (ABCD) and (MNOP) have the highest window frequencies but never appear in the top-2 of any batch – these patterns would ‘fly below the radar’ and go undetected if we considered only the top-2 patterns in every batch as candidates for the top-2 patterns over W. This example can be easily generalized to any number of batches and any k.

It is also clear that the size of the batches can play a critical role with regard to the quality of approximation of top-k. At one end of the spectrum the batch-size can be as large as the window. We would get exact top-k results but such a scheme would be obviously impractical. At the other end the algorithm could consume events one at a time (batch-size = 1), updating the top-k results over the window in an online fashion. However, we expect the quality of approximation to be poor in this case since no local pattern statistics can be estimated reliably over batches of size 1. Thus, we suggest using batches of sufficient size, limited, of course, by the number of events that can be stored and processed in-memory. This will allow us to exploit the slowly changing statistics across batches to get better top-k approximations over the window.

Example 1 highlights the main challenge in the streaming top-k mining problem: we can only store/process the most recent batch of records in the window of interest and the batchwise top-k may not contain sufficient information to compute the top-k over the entire window.
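The window frequency of Definition 2 is straightforward to maintain once per-batch frequencies are available. The following is a minimal sketch (our own illustration, not the paper's implementation) that keeps the per-batch counts of tracked patterns for the last m batches and sums them to answer window top-k queries; which patterns are worth tracking in every batch is precisely the question taken up in Sections 4 and 5.

    # Minimal sketch (illustrative): window counts as sums of batch counts.
    from collections import Counter, deque

    class WindowCounts:
        def __init__(self, m):
            self.m = m
            self.batches = deque()     # one Counter per batch, oldest first
            self.window = Counter()    # running window sums

        def add_batch(self, batch_freqs):
            """batch_freqs: dict {pattern: frequency in the newest batch}."""
            if len(self.batches) == self.m:          # slide the window
                self.window.subtract(self.batches.popleft())
                self.window += Counter()             # drop non-positive entries
            self.batches.append(Counter(batch_freqs))
            self.window.update(batch_freqs)

        def top_k(self, k):
            return self.window.most_common(k)

For the data of Example 1, feeding the four batch-frequency tables of Fig. 2 into add_batch and then calling top_k(2) returns (ABCD) and (MNOP), even though neither pattern is in the top-2 of any individual batch.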
It is obviously not possible to track all patterns (both frequent and infrequent) in every batch since the pattern space is typically very large. This brings us to the question of which patterns to track in every batch – how deep must we search within each batch for patterns that have potential to become top-k over the window? We develop the formalism needed to answer this question. 4. Persistence and Top-k Approximation We identify two important properties of the underlying data stream which influence the design and analysis of our algorithms. These are stated in Definitions 3 & 4 below. Definition 3 (Maximum Rate of Change, ∆). Maximum rate of change ∆(> 0) is defined as the maximum change in batch frequency of any pattern, α, across any pair of consecutive batches, Bs and Bs+1 , i.e., ∀α, s, we have |f s+1 (α) − f s (α)| ≤ ∆. (3) Intuitively, ∆ controls the extent of change from one batch to the next. While it is trivially bounded by the number of records arriving per batch, it is often much smaller in-practice. Definition 4 (Top-k Separation of (ϕ, )). A batch Bs of records is said to have a top-k separation of (ϕ, ), ϕ ≥ 0,  ≥ 0, if it contains at most (1 + )k patterns with batch frequencies of (fks − ϕ∆) or more, where fks is the batch frequency of the k th most-frequent pattern in Bs and ∆ is the maximum rate of change. This is essentially a measure of how well-separated the frequencies of the topk patterns are relative to the rest of the patterns. We expect to see roughly k patterns with batch frequencies of at least fsk and the separation is considered to be high (or good) if lowering the threshold from fsk to (fks − ϕ∆) only brings-in very few additional patterns, i.e.  remains small as ϕ increases. Top-k separation of any batch Bs is characterized by, not one but, several pairs of (ϕ, ) since ϕ and  are functionally related:  is typically close to zero if ϕ = 0, while we have k roughly the size of the class of `-size patterns (minus k) if ϕ∆ ≥ fsk . Note that  is a non-decreasing function of ϕ and that top-k separation is measured relative to the maximum rate of change ∆. We now use the maximum rate of change property to design efficient streaming algorithms for top-k pattern mining and show that top-k separation plays a pivotal role in determining the quality of approximation that our algorithms achieve. Lemma 1. The batch frequencies of the k th most-frequent patterns in any pair of consecutive batches cannot differ by more than the maximum rate of change ∆, i.e., for every batch Bs , we must have |fks+1 − fks | ≤ ∆. (4) A General Streaming Algorithm for Pattern Discovery 7 Proof. There exist at least k patterns in Bs with batch frequency greater than or equal to fks (by definition). Hence, there exist at least k patterns in Bs+1 with batch frequency greater than or equal to (fks − ∆) (since frequency of any pattern can decrease by at most ∆ going from Bs to Bs+1 ). Hence we must have fks+1 ≥ (fks − ∆). Similarly, there can be at most (k − 1) patterns in Bs+1 with batch frequency strictly greater than (fks + ∆). Hence we must also have fks+1 ≤ (fks + ∆). The above lemma follows directly from: (i) there are at least k patterns with frequencies no less than fks , and (ii) the batch frequency of any pattern can increase or decrease by no more than ∆ when going from one batch to the next. Our next observation is that if the batch frequency of a pattern is known relative to fks in the current batch Bs , we can bound its frequency in any later batch Bs+r . Lemma 2. 
Consider two batches, Bs and Bs+r , r ∈ Z, located r batches away from each other. Under a maximum rate of change of ∆ the batch frequency of any pattern α in Bs+r must satisfy the following: 1. If f s (α) ≥ fks , then f s+r (α) ≥ fks+r − 2|r|∆ 2. If f s (α) < fks , then f s+r (α) < fks+r + 2|r|∆ Proof. Since ∆ is the maximum rate of change, we have f s+r (α) ≥ (f s (α)−|r|∆) and from Lemma 1, we have fks+r ≤ (fks + |r|∆). Therefore, if f s (α) ≥ fks , then f s+r (α) + |r|∆ ≥ f s (α) ≥ fks ≥ fks+r − |r|∆ which implies f s+r (α) ≥ fks+r − 2|r|∆. Similarly, if f s (α) < fks , then f s+r (α) − |r|∆ ≤ f s (α) < fks ≤ fks+r + |r|∆ which implies f s+r (α) < fks+r + 2|r|∆. Lemma 2 gives us a way to track patterns that have potential to be in the topk of future batches. This is an important property which our algorithm exploits and we recorded this as a remark below. Remark 1. The top-k patterns of batch, Bs+r , r ∈ Z, must have batch frequencies of at least (fks − 2|r|∆) in batch Bs . Specifically, the top-k patterns of Bs+1 must have batch frequencies of at least (fks − 2∆) in Bs . 0 Proof. Assume that we know fks for the batch Bs0 . If a pattern α is to belong 0 0 to the set of top-k frequent patterns in the batch Bs0 +r , then f s +r (α) ≥ fks +r 0 (where fs0 +r (α) and fks +r are unknown for r > 0). 0 0 Substituting s = s0 + r in Lemma 2, we get f s (α) ≥ fks − 2|r|∆. The maximum rate of change property leads to a necessary condition, in the form of a minimum batch-wise frequency, for a pattern α to be in the top-k over a window Ws . Theorem 1 (Exact Top-k over Ws ). A pattern, α, can be a top-k pattern 0 0 over window Ws only if its batch frequencies satisfy f s (α) ≥ fks − 2(m − 1)∆ ∀Bs0 ∈ Ws . 8 D. Patnaik et al Proof. To prove this, we show that if a pattern fails this threshold in even one batch of Ws , then there exist k or more patterns which can have a greater window 0 0 frequency. Consider a pattern β for which f s (β) < fks − 2(m − 1)∆ in batch Bs0 ∈ Ws . Let α be any top-k pattern of Bs0 . Consider two patterns α and β 0 0 0 0 such that in a batch Bs0 f s (α) ≥ fks and f s (β) < fks − 2(m − 1)∆. In any other batch Bp ∈ Ws , we have 0 f p (α) ≥ f s (α) − |p − s0 |∆ 0 ≥ fks − |p − s0 |∆ (5) and 0 f p (β) ≤ f s (β) + |p − s0 |∆ 0 < (fks − 2(m − 1)∆) + |p − s0 |∆ (6) 0 Applying |p − s | ≤ (m − 1) to the above, we get 0 f p (α) ≥ fks − (m − 1)∆ > f p (β) (7) This implies f Ws (β) < f Ws (α) for every top-k pattern α of Bs0 . Since there are at least k top-k patterns in Bs0 , β cannot be a top-k pattern over the window Ws . 0 Mining patterns with frequency threshold (fks − 2(m − 1)∆) in each batch Bs0 gives complete counts of all top-k patterns in the window Ws where m is the 0 number of batches in the window, fks is the frequency of the kth most frequent pattern in batch Bs0 ∈ Ws and ∆ is the continuity parameter. Based on Theorem 1 we have the following simple algorithm for obtaining the top-k patterns over a window: Use a traditional level-wise approach to find all patterns with a batch frequency of at least (f1k − 2(m − 1)∆) in the first batch (B1 ), accumulate their corresponding batch frequencies over all m batches of Ws and report the patterns with the k highest window frequencies over Ws . This approach is guaranteed to return the exact top-k patterns over Ws . In order to report the top-k over the next sliding window Ws+1 , we need to consider all patterns with batch frequency of at least (f2k − 2(m − 1)∆) in the second batch and track them over all batches of Ws+1 , and so on. 
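As a quick numerical illustration (the numbers below are made up and not from the paper's experiments), the batchwise threshold mandated by Theorem 1 can fall far below fks when the window is long or the stream changes quickly, which is what makes exact mining expensive:

    # Illustrative only: Theorem 1's batchwise threshold for exact top-k mining.
    def exact_topk_threshold(f_k_s, m, delta):
        return f_k_s - 2 * (m - 1) * delta

    print(exact_topk_threshold(f_k_s=100, m=10, delta=5))   # 100 - 2*9*5 = 10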
Thus, an exact solution to Problem 1 would require running a level-wise pattern mining algorithm in every batch, Bs , s = 1, 2, . . ., with a frequency threshold of (fsk − 2(m − 1)∆). 4.1. Class of (v, k)-Persistent patterns Theorem 1 characterizes the minimum batchwise computation needed in order to obtain the exact top-k patterns over a sliding window. This is effective when ∆ and m are small (compared to fsk ). However, the batchwise frequency thresholds can become very low in other settings, making the processing time per-batch as well as the number of patterns to track over the window to become impractically high. To address this issue, we introduce a new class of patterns called (v, k)-persistent patterns which can be computed efficiently by employing higher batchwise thresholds. Further, we show that these patterns can be used to approximate the true top-k patterns over the window and the quality of approximation is characterized in terms of the top-k separation property (cf. Definition 4). A General Streaming Algorithm for Pattern Discovery 9 Definition 5 ((v, k)-Persistent pattern). A pattern is said to be (v, k)-persistent over window Ws if it is a top-k pattern in at least v batches of Ws . Problem 2 (Mining (v, k)-Persistent patterns). For each new batch, Bs , of records in the stream, find all `-size (v, k)-persistent patterns in the corresponding window of interest, Ws . Theorem 2. A pattern, α, can be (v, k)-persistent over the window Ws only if 0 0 its batch frequencies satisfy f s (α) ≥ (fks −2(m−v)∆) for every batch Bs0 ∈ Ws . Proof. Let α be (v, k)-persistent over Ws and let Vα denote the set of batches in Ws in which α is in the top-k. For any Bq ∈ / Vα there exists Bb ∈ Vα that p(q) is nearest to Bq . Since |Vα | ≥ v, we must have |b p(q) − q| ≤ (m − v). Applying Lemma 2 we then get f q (α) ≥ fkq − 2(m − v)∆ for all Bq ∈ / Vα . Theorem 2 gives us the necessary conditions for computing all (v, k)-persistent patterns over sliding windows in the stream. The batchwise threshold required for (v, k)-persistent patterns depends on the parameter v. For v = 1, the threshold coincides with the threshold for exact top-k in Theorem 1. The threshold increases linearly with v and is highest at v = m (when the batchwise threshold is same as the corresponding batchwise top-k frequency). The algorithm for discovering (v, k)-persistent patterns follows the same general lines as the one described earlier for exact top-k mining, only that we now apply higher batchwise thresholds: For each new batch, Bs , entering the stream, use a standard level-wise pattern mining algorithm to find all patterns with batch frequency of at least (fsk − 2(m − v)∆). We provide more details of our algorithm later in Sec. 5. First, we investigate the quality of approximation of top-k that (v, k)-persistent patterns offer and show that the number of errors is closely related to the degree of top-k separation. 4.1.1. Top-k Approximation The main idea here is that, under a maximum rate of change ∆ and a topk separation of (ϕ, ), there cannot be too many distinct patterns which are not (v, k)-persistent while still having sufficiently high window frequencies. To this end, we first compute a lower-bound (fL ) on the window frequencies of (v, k)-persistent patterns and an upper-bound (fU ) on the window frequencies of patterns that are not (v, k)-persistent (cf. Lemmas 3 & 4). Lemma 3. 
If pattern α is (v, k)-persistent over a window, Ws , then its window frequency, f Ws (α), must satisfy the following lower-bound: X 0 def f Ws (α) ≥ fks − (m − v)(m − v + 1)∆ = fL (8) B s0 Proof. Consider pattern α that is (v, k)-persistent over Ws and let Vα denote the batches of Ws in which α is in the top-k. The window frequency of α can be 10 D. Patnaik et al written as f Ws (α) = f p (α) + X Bp ∈Vα ≥ X Bp ∈Vα = f q (α) X Bq ∈Ws \Vα fkp fkq − 2|b p(q) − q|∆ X + Bq ∈Ws \Vα 0 fks X − Bs0 ∈Ws X 2|b p(q) − q|∆ (9) Bq ∈Ws \Vα where Bb ∈ Vα denotes the batch nearest Bq where α is in the top-k. Since p(q) |Ws \ Vα | ≤ (m − v), we must have X |b p(q) − q| ≤ (1 + 2 + · · · + (m − v)) Bq ∈Ws \Vα = 1 (m − v)(m − v + 1) 2 (10) Putting together (9) and (10) gives us the lemma. Similar arguments give us the next lemma about the maximum frequency of patterns that are not (v, k)-persistent (Full proofs are available in (Patnaik et al., 2012)). Lemma 4. If pattern β is not (v, k)-persistent over a window, Ws , then its window frequency, f Ws (β), must satisfy the following upper-bound: X 0 def f Ws (β) < fks + v(v + 1)∆ = fU (11) B s0 Proof. Consider pattern β that is not (v, k)-persistent over Ws and let Vβ denote the batches of Ws in which β is in the top-k. The window frequency of β can be written as: X X f Ws (β) = f p (β) + f q (β) Bp ∈Vβ < X Bp ∈Vβ = Bq ∈Ws \Vβ fkp + 2|b p(q) − q|∆ + X fkq Bq ∈Ws \Vβ 0 fks X Bs0 ∈Ws + X 2|b q (p) − p|∆ (12) Bp ∈Vβ where Bb ∈ Ws \ Vβ denotes the batch nearest Bp where β is not in the top-k. q (p) Since |Vβ | < v, we must have X |b q (p) − p| ≤ (1 + 2 + · · · + (v − 1)) Bp ∈Vβ = 1 v(v + 1) 2 Putting together (12) and (13) gives us the lemma. (13) A General Streaming Algorithm for Pattern Discovery 11 It turns out that fU > fL ∀v, 1 ≤ v ≤ m, and hence there is always a possibility for some patterns which are not (v, k)-persistent to end up with higher window frequencies than one or more (v, k)-persistent patterns. We observed a specific instance of this kind of ‘mixing’ in our motivating example as well (cf. Example 1). This brings us to the top-k separation property that we introduced in Definition 4. Intuitively, if there is sufficient separation of the top-k patterns from the rest of the patterns in every batch, then we would expect to see very little mixing. As we shall see, this separation need not occur exactly at k th most-frequent pattern in every batch, somewhere close to it is sufficient to achieve a good top-k approximation. Definition 6 (Band Gap patterns, Gϕ ). In any batch Bs0 ∈ Ws , the halfopen frequency interval [fsk0 − ϕ∆, fsk0 ) is called the band gap of Bs0 . The corresponding set, Gϕ , of band gap patterns over the window Ws , is defined as the collection of all patterns with batch frequencies in the band gap of at least one Bs0 ∈ Ws . The main feature of Gϕ is that, if ϕ is large-enough, then the only patterns which are not (v, k)-persistent but that can still mix with (v, k)-persistent patterns are those belonging to Gϕ . This is stated formally in the next lemma. The proof, omitted here, can be found in (Patnaik et al., 2012). v Lemma 5. If ϕ2 > max{1, (1 − m )(m − v + 1)}, then any pattern β that is not (v, k)-persistent over Ws , can have f Ws (β) ≥ fL only if β ∈ Gϕ . Proof. If a pattern β is not (v, k)-persistent over Ws then there exists a batch Bs0 ∈ Ws where β is not in the top-k. Further, if β ∈ / Gϕ then we must have fs0 (β) < fsk0 − ϕ∆. 
Since ϕ > 2, β cannot be in the top-k of any neighboring batch of Bs0 , and hence, it will stay below fsk0 − ϕ∆ for all Bs0 ∈ Ws , i.e., X f Ws (β) < fsk0 − mϕ∆. Bs0 ∈Ws The Lemma follows from the given condition ϕ 2 > (1 − v m )(m − v + 1). The number of patterns in Gϕ is controlled by the top-k separation property, and since many of the non-persistent patterns which can mix with persistent ones must spend not one, but several batches in the band gap, the number of unique patterns that can cause such errors is bounded. Theorem 3 is our main result about quality of top-k approximation that (v, k)-persistence can achieve. Theorem 3 (Quality of Top-k Approximation). Let every batch Bs0 ∈ Ws v have a top-k separation of (ϕ, ) with ϕ2 > max{1, (1 − m )(m − v + 1)}. Let P denote the set of all (v, k)-persistent patterns over Ws . If |P| ≥ k, then the topk patterns over Ws can be determined from P with an error of no more than   √ km patterns, where µ = min{m − v + 1, ϕ2 , 21 ( 1 + 2mϕ − 1)}. µ Proof. By top-k separation, we have a maximum of (1 + )k patterns in any batch Bs0 ∈ Ws , with batch frequencies greater than or equal to fsk0 − ϕ∆. Since at least k of these must belong to the top-k of the Bs0 , there are no more than k patterns that can belong to the band gap of Bs0 . Thus, there can be no more than a total of km patterns over all m batches of Ws that can belong to Gϕ . Consider any β ∈ / P with f Ws (β) ≥ fL – these are the only patterns whose 12 D. Patnaik et al window frequencies can exceed that of any α ∈ P (since fL is the minimum window frequency of any α). If µ denotes the minimum number  of batches in which β belongs to the band gap, then there can be at most km µ such distinct β. Thus, if |P| ≥ k,we can  determine the set of top-k patterns over Ws with km error no more than patterns. µ There are now two cases to consider to determine µ: (i) β is in the top-k of some batch, and (ii) β is not in the top-k of any batch. Case (i): Let β be in the top-k of Bs0 ∈ Ws . Let Bs00 ∈ Ws be t batches away 00 from Bs0 . Using Lemma 2 we get f s (β) ≥ fsk00 − 2t∆. The minimum t for which  (fsk00 − 2t∆ < fsk − ϕ∆) is ϕ2 . Since β ∈ / P, β is below the top-k in at least (m − v + 1) batches. Hence β stays in the band gap of at least min{m − v + 1, ϕ2 } batches of Ws . Case (ii): Let VG denote the set of batches in Ws where β lies in the band gap and let |VG | = g. Since β does not belong to top-k of any batch, it must stay below the band gap in all the (m − g) batches of (Ws \ VG ). Since ∆ is the maximum rate of change, the window frequency of β can be written as follows: X X f q (β) f Ws (β) = f p (β) + Bp ∈VG < X Bq ∈Ws \VG p f (β) + Bp ∈VG (fqk − ϕ∆) X (14) Bq ∈Ws \VG Let Bb denote the batch in Ws \ VG that is nearest to Bp ∈ VG . Then we have: q (p) q (p) f p (β) ≤ f b (β) + |p − qb(p)|∆ k < fb − ϕ∆ + |p − qb(p)|∆ q (p) < fpk − ϕ∆ + 2|p − qb(p)|∆ (15) where the second inequality holds because β is below the band gap in Bb and q (p) (15) follows from Lemma 1. Using (15) in (14) we get X X f Ws (β) < fsk0 − mϕ∆ + 2|p − qb(p)|∆ Bs0 ∈Ws < X Bp ∈VG fsk0 − mϕ∆ + 2(1 + 2 + · · · + g)∆ Bs0 ∈Ws = X fsk0 − mϕ∆ + g(g + 1)∆ = UB (16) Bs0 ∈Ws The smallest g for which (f Ws (β) ≥ fL ) is feasible can be obtained by setting v )(m − v + 1), UB ≥ fL implies UB ≥ fL . Since ϕ2 > (1 − m X Bs0 ∈Ws fsk0 − mϕ∆ + g(g + 1)∆ > X Bs0 ∈Ws fsk0 − mϕ∆ 2 √ Solving for g, we get g ≥ 21 ( 1 + 2mϕ − 1). Combining cases (i) and (ii), we get √ µ = min{m − v + 1, ϕ2 , 21 ( 1 + 2mϕ − 1)}. 
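Theorem 3 states that, under a top-k separation of (ϕ, ε) in every batch, at most ⌈εkm/µ⌉ of the patterns reported from P can fail to be true top-k patterns over the window. The helper below (our own illustration, not part of the paper's algorithms) evaluates µ and this bound for concrete parameter values.

    import math

    # Illustrative helper for the bound of Theorem 3; eps is the separation
    # parameter of Definition 4 and phi must satisfy the theorem's condition
    # phi/2 > max(1, (1 - v/m) * (m - v + 1)).
    def theorem3_error_bound(k, m, v, phi, eps):
        mu = min(m - v + 1,
                 math.ceil(phi / 2.0),
                 0.5 * (math.sqrt(1 + 2 * m * phi) - 1))
        return math.ceil(eps * k * m / mu)

    # Made-up values: k=25, m=10, v=5, phi=8, eps=0.2 satisfy the condition
    # (phi/2 = 4 > 3) and give mu = 4, so at most 13 erroneous patterns.
    print(theorem3_error_bound(25, 10, 5, 8, 0.2))   # prints 13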
Theorem 3 shows the relationship between the extent of top-k separation required and the quality of top-k approximation that can be obtained through (v, k)-persistent patterns. In general, µ (which is the minimum of three factors) increases with ⌈ϕ/2⌉ until the latter starts to dominate the other two factors, namely, (m − v + 1) and ½(√(1 + 2mϕ) − 1). The theorem also brings out the tension between the persistence parameter v and the quality of approximation. At smaller values of v, the algorithm mines ‘deeper’ within each batch and so we expect fewer errors with respect to the true top-k episodes. On the other hand, deeper mining within batches is computationally more intensive, with the required effort approaching that of exact top-k mining as v approaches 1. Finally, we use Theorem 3 to derive error-bounds for three special cases: first for v = 1, when the batchwise threshold is the same as that for exact top-k mining as per Theorem 1; second for v = m, when the batchwise threshold is simply the batch frequency of the k th most-frequent pattern in the batch; and third, for v = ⌈(m+1)/2⌉, when the batchwise threshold lies midway between the thresholds of the first two cases. (Proofs are detailed in (Patnaik et al., 2012).)

Corollary 1. Let every batch Bs′ ∈ Ws have a top-k separation of (ϕ, ε) and let Ws contain at least m ≥ 2 batches. Let P denote the set of all (v, k)-persistent patterns over Ws. If we have |P| ≥ k, then the maximum error-rate in the top-k patterns derived from P, for three different choices of v, is given by:
1. ⌈εkm/(m − 1)⌉ for v = 1, if ϕ/2 > (m − 1)
2. (εkm) for v = m, if ϕ/2 > 1
3. ⌈4εkm²/(m² − 1)⌉ for v = ⌈(m+1)/2⌉, if ϕ/2 > (1/m)⌈(m−1)/2⌉⌈(m+1)/2⌉.

Proof. We show the proof only for v = ⌈(m+1)/2⌉; the cases of v = 1 and v = m are obtained immediately upon application of Theorem 3. Fixing v = ⌈(m+1)/2⌉ implies (m − v) = ⌊(m−1)/2⌋. For m ≥ 2, ϕ/2 > (1/m)⌈(m−1)/2⌉⌈(m+1)/2⌉ implies ϕ/2 > max{1, (1 − v/m)(m − v + 1)}. Let tmin = min{m − v + 1, ⌈ϕ/2⌉}. The minimum value of tmin is governed by
tmin ≥ min{ m − v + 1, (1/m)⌈(m−1)/2⌉⌈(m+1)/2⌉ } = (1/m)⌈(m−1)/2⌉⌈(m+1)/2⌉ ≥ (m² − 1)/(4m).    (17)
Let gmin = ½(√(1 + 2mϕ) − 1). ϕ/2 > (1/m)⌈(m−1)/2⌉⌈(m+1)/2⌉ implies gmin > ⌈(m−1)/2⌉. From Theorem 3 we have µ = min{tmin, gmin} ≥ (m² − 1)/(4m), and hence the number of errors is no more than ⌈4εkm²/(m² − 1)⌉.

Using v = 1 we make roughly εk errors by considering only persistent patterns for the final output, while the same batchwise threshold can give us the exact top-k as per Theorem 1. On the other hand, at v = m, the batchwise thresholds are higher (the algorithm will run faster) but the number of errors grows linearly with m. Note that the (ϕ, ε)-separation needed for v = 1 is much higher than for v = m. The case of v = ⌈(m+1)/2⌉ lies in-between, with roughly 4εk errors under reasonable separation conditions.

Fig. 3. The set of frequent patterns can be incrementally updated as new batches arrive: when a new batch Bs arrives, patterns are ranked by decreasing frequency around the threshold fks; some patterns of Fs−1 are no longer frequent (−) while new frequent patterns enter Fs (+).

5. Algorithm

In this section we present an efficient algorithm for incrementally mining patterns with frequency at least (fks − θ) in batch Bs, for each batch in the stream. The threshold parameter θ is set to 2(m − v)∆ for mining (v, k)-persistent patterns (see Theorem 2) and to 2(m − 1)∆ for exact top-k mining (see Theorem 1).
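Before detailing the algorithm, the following skeleton (our own sketch; levelwise_mine and count_patterns are stand-ins for any black-box Apriori-style miner and counter) shows how the threshold θ is used per batch: the k-th highest frequency among the previous batch's frequent patterns estimates fks (Remark 1), the miner is run at the lowered threshold fks − θ, and the results feed the window counts. Algorithm 1 below refines exactly this loop with border sets so that the pattern lattice is updated incrementally instead of being mined from scratch.

    # Skeleton only (illustrative); reuses WindowCounts from the earlier sketch.
    def stream_topk(batches, k, l, m, v, delta, levelwise_mine, count_patterns,
                    exact=False):
        """levelwise_mine(batch, l, threshold) -> {pattern: freq} (black box);
        count_patterns(patterns, batch) -> {pattern: freq} (counting only)."""
        def kth_highest(freqs):
            vals = sorted(freqs.values(), reverse=True)
            return vals[k - 1] if len(vals) >= k else (vals[-1] if vals else 1)

        theta = 2 * (m - 1) * delta if exact else 2 * (m - v) * delta
        window, prev = WindowCounts(m), None
        for batch in batches:
            if prev is None:                       # first batch: mine from scratch
                freqs = levelwise_mine(batch, l, threshold=1)
            else:                                  # recount last batch's patterns
                freqs = count_patterns(prev.keys(), batch)
            f_k = kth_highest(freqs)               # estimate of f_k^s
            prev = levelwise_mine(batch, l, threshold=max(f_k - theta, 1))
            window.add_batch(prev)
            yield window.top_k(k)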
A trivial (brute-force) algorithm to find all patterns with frequency greater than (fks −θ) in Bs is as follows: Apply an Apriori-based pattern mining algorithm (e.g. (Mannila et al., 1997; Achar et al., 2012)) on batch Bs . If the number of patterns of size-` is less than k, the support threshold is decreased and the mining repeated until at least k `-size patterns are found. At this point fks is known. The mining process is then repeated once more with the frequency threshold (fks − θ). Doing this entire procedure for every new batch is expensive and wasteful. After seeing the first batch of the data, whenever a new batch arrives we have information about the patterns that were frequent in the previous batch. This can be exploited to incrementally and efficiently update the set of frequent patterns in the new batch. The intuition is that the frequencies of most patterns do not change much from one batch to the next. As a result only a small number of previously frequent patterns fall below the new support threshold in the new batch; similarly, some new patterns may become frequent. This is illustrated in Figure 3. In order to efficiently find these sets of patterns, we need to maintain additional information that allows us to avoid full-blown candidate generation in each batch. We show that this state information is a by-product of an Aprioribased algorithm and therefore any extra processing is unnecessary. Frequent patterns are discovered level-wise, in ascending order of pattern-size. An Apriori procedure alternates between counting and candidate generation. First a set C i of candidate i-size patterns is determined by combining frequent (i − 1)-size patterns from the previous level. Then the data is scanned for deter- A General Streaming Algorithm for Pattern Discovery 15 mining frequencies of the candidates, from which the frequent i-size patterns are obtained. We note that all candidate patterns that are not frequent constitute the negative border of the frequent pattern lattice (Aumann et al., 1999). This is because, a candidate is generated only when all its subpatterns are frequent. The usual approach is to discard the border. For our purposes, the border contains information required to identify changes in the frequent patterns from one batch to the next3 . The pseudocode for incrementally mining frequent patterns in batches is listed in Algorithm 1. The inputs to the algorithm are: (i) Number k of top patterns desired, (ii) New incoming batch Bs of records in the stream, (iii) Lat∗ ∗ tice of frequent (Fs−1 ) and border patterns (Bs−1 ) from previous batch, and (iv) threshold parameter θ. Frequent and border patterns of size-i, with respect to frequency threshold fks − θ, are denoted by the sets Fsi and Bsi respectively (Fs∗ and Bs∗ denote the corresponding sets for all pattern sizes). For the first batch B1 (lines 1–3) the top-k patterns are found by a bruteforce method, i.e., by mining with a progressively lower frequency threshold until at least k patterns of size ` are found. Once fk1 is determined, F1∗ and B1∗ are obtained using an Apriori procedure. For subsequent batches, we do not need a brute-force method to determine ` fks . Based on Remark 1, if θ ≥ 2∆, Fs−1 from batch Bs−1 contains every potential top-k pattern of the next batch Bs . Therefore simply updating the counts of all ` patterns in Fs−1 in the new batch Bs and picking the k th highest frequency gives s fk (lines 4–6). 
To update the lattice of frequent and border patterns (lines 7– 24) the procedure starts from the bottom (size-1 patterns). The data is scanned to determine the frequency of new candidates together with the frequent and border patterns from the lattice (line 11). In the first level (patterns of size 1), ` the candidate set is empty. After counting, the patterns from Fs−1 that continue ` to be frequent in the new batch are added to Fs (lines 13–14). But if a pattern is no longer frequent it is marked as a border set and all its super patterns are deleted (lines 15–17). This ensures that only border patterns are retained in the lattice. In the border and new candidate sets, any pattern that is found to be i (lines 18–21). The remaining infrequent frequent is added to both Fsi and Fnew patterns belong to the border because, otherwise, they would have at least one infrequent subpattern and would have been deleted at a previous level; hence, these infrequent patterns are added to Bs` (lines 22–23). Finally, the candidate generation step (line 24) is required to fill out the missing parts of the frequent pattern lattice. We want to avoid a full blown candidate generation. Note that if a pattern is frequent in Bs−1 and Bs then all its subpat` ` terns are also frequent in both Bs and Bs−1 . Any new pattern (6∈ Fs−1 ∪ Bs−1 ) that turns frequent in Bs , therefore, must have at least one subpattern that was not frequent in Bs−1 but is frequent in Bs . All such patterns are listed in i Fnew . Hence the candidate generation step (line 24) for the next level generates i only candidates with at least one subpattern ∈ Fnew . This reduces the number of candidates generated at each level without compromising completeness of the results. The space and time complexity of candidate generation is now i i O(|Fnew |.|Fsi |) instead of O(|Fsi |2 ) and in most practical cases |Fnew |  |Fsi |. 3 Border sets were employed by (Aumann et al., 1999) for efficient mining of dynamic databases. Multiple passes over older data are needed for any new frequent itemsets, which is not feasible in a streaming context. 16 D. Patnaik et al Algorithm 1 Persistent pattern miner: Mine top-k v-persistent patterns. Input: Number k of top patterns; New batch Bs of events; ∗ , B∗ Current lattice of frequent & border patterns (Fs−1 s−1 ); Threshold parameter θ (set to 2(m − v)∆ for (v, k)-persistence, 2(m − 1)∆ for exact top-k) Output: Lattice of frequent and border patterns (Fs∗ , Bs∗ ) after Bs 1: if s = 1 then 2: Determine fk1 (brute force) 3: Compute (F1∗ , B1∗ ) ← Apriori(B1 , `, fk1 − θ) 4: else ` 5: CountFrequency(Fs−1 , Bs ) s ` 6: Determine fk (based on patterns in Fs−1 ) 1 7: Initialize C ← φ (new candidates of size 1) 8: for i ← 1, . . . , ` do 9: Initialize Fsi ← φ, Bsi ← φ i 10: Initialize Fnew ← φ (new frequent patterns of size i) i i 11: CountFrequency(Fs−1 ∪ Bs−1 ∪ C i , Bs ) i 12: for α ∈ Fs−1 do 13: if f s (α) ≥ fks − θ then 14: Update Fsi ← Fsi ∪ {α} 15: else 16: Update Bsi ← Bsi ∪ {α} ∗ , B∗ 17: Delete super-patterns of α from (Fs−1 s−1 ) i 18: for α ∈ Bs−1 ∪ C i do 19: if f s (α) ≥ fks − θ then 20: Update Fsi ← Fsi ∪ {α} i i 21: Update Fnew ← Fnew ∪ {α} 22: else 23: Update Bsi ← Bsi ∪ {α} i 24: C i+1 ← GenerateCandidate(Fnew , Fsi ) 25: return (Fs∗ , Bs∗ ) Later in the experiments section, we show how border-sets help our algorithm run very fast. For a window Ws ending in the batch Bs , the set of output patterns can be obtained by picking the top-k most frequent patterns from the set Fs` . 
Each pattern also maintains a list that stores its batch-wise counts in the last m batches. The window frequency is obtained by adding these entries together. The output patterns are listed in decreasing order of their window counts.

Example 2. In this example we illustrate the procedure for incrementally updating the frequent patterns lattice as a new batch Bs is processed (see Figure 4). Figure 4(A) shows the lattice of frequent and border patterns found in the batch Bs−1 (the figure does not show all the subpatterns at each level, only some of them). ABCD is a 4-size frequent pattern in the lattice. In the new batch Bs, the pattern ABCD is no longer frequent. The pattern CDXY appears as a new frequent pattern. The pattern lattice in Bs is shown in Figure 4(B). In the new batch Bs, AB falls out of the frequent set. AB now becomes the new border and all its super-patterns, namely ABC, BCD and ABCD, are deleted from the lattice. At level 2, the border pattern XY turns frequent in Bs. This allows us to generate DXY as a new 3-size candidate. At level 3, DXY is also found to be frequent and is combined with CDX, which is also frequent in Bs, to generate CDXY as a 4-size candidate. Finally, at level 4, CDXY is found to be frequent. This shows that border sets can be used to fill out the parts of the pattern lattice that become frequent in the new data.

Fig. 4. Incremental lattice update for the next batch Bs given the lattice of frequent and border patterns in Bs−1: panel (A) shows the four-level lattice for Bs−1 (with, e.g., ABCD frequent and XY in the border sets); panel (B) shows the updated lattice for Bs, where AB has moved to the border, its super-patterns have been deleted, and XY, DXY and CDXY have become frequent.

5.1. Estimating ∆ dynamically

The parameter ∆ in the bounded rate change assumption is a critical parameter in the entire formulation. Unfortunately, the choice of the correct value for ∆ is highly data-dependent. In the streaming setting, the characteristics of the data can change over time. Hence one predetermined value of ∆ cannot be provided in any intuitive way. Therefore we estimate ∆ from the frequencies of ℓ-size patterns in consecutive batches. We compute the differences in frequencies of patterns that are common to consecutive batches. Specifically, we consider the value at the 75th percentile as an estimate of ∆. We avoid using the maximum change as it tends to be noisy. A few patterns exhibiting large changes in frequency can skew the estimate and adversely affect the mining procedure.

6. Results

6.1. Experimental Setup

We show experimental results for mining two different pattern classes, namely, episodes and itemsets. For episode mining we have one synthetic and two real data streams, from two very different domains: experimental neuroscience and telecom networks. In neuroscience, we consider (1) synthetic neuronal spike trains based on mathematical models of interconnected neurons, with each neuron modeled as an inhomogeneous Poisson process (Achar et al., 2012), and
(2) real neuronal spiking activity from dissociated cortical cultures, recorded using multi-electrode arrays at Steve Potter's lab, Georgia Tech (Wagenaar et al., 2006). The third kind of data we consider are call data records from a large European telecom network. Each record describes whether the associated call was voice or data, an international or local call, in-network or out-of-network, and of long or short duration. Pattern discovery over such data can uncover hidden, previously unknown trends in usage patterns, evolving risk of customer churn, etc. In the case of itemsets we present results on two publicly available datasets: Kosarak and T10I4D100K. The Kosarak dataset contains (anonymized) clickstream data of a Hungarian on-line news portal, while the T10I4D100K dataset was generated using the itemset simulator from the IBM Almaden Quest research group. Both can be downloaded from the Frequent Itemset Mining Dataset Repository http://fimi.ua.ac.be/data. The data length in terms of number of events or transactions and the alphabet size of each of these datasets is shown in Table 2(a). Table 2(b) gives the list of parameters and the values used in experiments that we do not explicitly mention in the text.

Table 2. Experimental setup.
(a) Datasets used in the experiments.
Dataset        Alphabet-size   Data-length   Type
Call-logs                  8    42,030,149   Events
Neuroscience              58    12,070,634   Events
Synthetic                515    25,295,474   Events
Kosarak               41,270       990,000   Itemsets
T10I4D100K               870       100,000   Itemsets
(b) Parameter set-up.
Parameter                                                                   Value
k (in Top-k patterns)                                                       25
Number of batches in a window, m                                            10
Batch-size (as number of events)                                            10^6
Error parameter (applicable to Chernoff-based and Lossy counting methods)   0.001
Parameter v (applicable to Persistence miner)                               m/2

We compare the persistent pattern miner against two methods from the itemset mining literature (Wong and Fu, 2006): one that fixes batchwise thresholds based on Chernoff bounds under an i.i.d. assumption over the event stream, and a second based on a sliding-window version of the Lossy Counting algorithm (Manku and Motwani, 2002). These methods were designed for frequent itemset mining and therefore we adapt them suitably for mining episodes. We modify the support threshold computation using the Chernoff bound for episodes, since the total number of distinct episodes is bounded by the size of the largest transaction, while for itemsets it is the alphabet size that determines this. As mentioned earlier, we are not aware of any streaming algorithms that directly address top-k episode mining over sliding windows of data.

Estimating ∆ dynamically: ∆ is a critical parameter for our persistent pattern miner. Unfortunately, the choice of the correct value is highly data-dependent and the characteristics of the data can change over time. One predetermined value of ∆ cannot be provided in any intuitive way. Therefore we estimate ∆ from the frequencies of ℓ-size patterns in consecutive batches by computing the change in frequency of patterns and using the 75th percentile as the estimate. We avoid using the maximum change as it tends to be noisy.

6.2. Quality of top-k mining

We present aggregate comparisons of the three competing methods in Table 3. These datasets provide different levels of difficulty for the mining algorithms.

Mining episodes: Tables 3(a) & 3(c) show high f-scores4 for the synthetic and real neuroscience data for all methods (our method performs best in both cases). On the other hand, we find that all methods report very low f-scores on the call-logs data (see Table 3(b)). The characteristics of this data do not allow one to infer window top-k from batches (using limited computation). But our proposed method nearly doubles the f-score with identical memory and CPU usage on this real dataset.
It is worth mentioning that the competing methods reported close to 100% accuracies but do not perform that well on more realistic datasets. In the case of the synthetic data (see Table 3(c)) the characteristics are very similar to those of the neuroscience dataset.

Mining itemsets: Tables 3(d) and 3(e) show that for itemset data all three competing methods perform identically with f-score = 100. But the run-times of Persistent pattern mining are significantly better than those of the other methods, with nearly the same or less memory requirement.

6.3. Computation efficiency comparisons

Table 3 shows that we do better than both competing methods in most cases (and never significantly worse than either) with respect to time and memory.

6.3.1. Effect of parameters on performance

In Figure 5, we see all three methods outperform the reference method (the standard multi-pass apriori based miner) by at least an order of magnitude in terms of both run-times and memory. In Figure 5(a)-(c), we show the effect of increasing k on all the methods. The accuracy of the Lossy-counting algorithm drops with increasing k, while that of the Chernoff based method and the Persistence miner remain unchanged. The Persistence miner has lower runtimes for all choices of k while having a memory footprint comparable to the other two methods. With increasing window-size (m = 5, 10, 15 and batch-size = 10^6 events), we observe better f-scores for the Persistence miner, but this increase is not significant enough and can be caused by data characteristics alone. The runtimes and memory of the Persistence miner remain nearly constant. This is important for streaming algorithms, as the runtimes and memory of the standard multi-pass algorithm increase (roughly) linearly with window size.

4 f-score = 2 · (precision · recall)/(precision + recall)

Table 3. Aggregate performance comparison of different algorithms.
(a) Dataset: Neuroscience, Size of patterns = 4, Pattern class: Episodes
Top-k Miner             F Score   Runtime (sec)   Memory (MB)
Chernoff-bound based      92.0           456.0        251.39
Lossy-counting based      92.0          208.25         158.5
Persistent pattern       97.23          217.64         64.51
(b) Dataset: Call-logs, Size of patterns = 6, Pattern class: Episodes
Top-k Miner             F Score   Runtime (sec)   Memory (MB)
Chernoff-bound based     32.17           11.87         66.14
Lossy-counting based     24.18            3.29         56.87
Persistent pattern       49.47            3.34          67.7
(c) Dataset: Synthetic data, Size of patterns = 4, Pattern class: Episodes
Top-k Miner             F Score   Runtime (sec)   Memory (MB)
Chernoff-bound based      92.7           14.91          43.1
Lossy-counting based      92.7            6.96          32.0
Persistent pattern        96.2            4.98         34.43
(d) Dataset: Kosarak, Size of patterns = 4, batch-size = 10,000 transactions, v = 9, m = 10, Pattern class: Itemsets
Top-k Miner             F Score   Runtime (sec)   Memory (MB)
Chernoff-bound based       100           32.18         31.85
Lossy-counting based       100           15.57         25.87
Persistent pattern         100            1.93         25.54
(e) Dataset: T10I4D100K, Size of patterns = 4, batch-size = 10,000 transactions, k = 10, v = 9, m = 10, Pattern class: Itemsets
Top-k Miner             F Score   Runtime (sec)   Memory (MB)
Chernoff-bound based       100          662.05        170.77
Lossy-counting based       100          335.71         87.86
Persistent pattern         100           73.38        116.27

6.3.2. Utility of Border Sets

For patterns with slowly changing frequency we show in Section 5 that using border-sets to incrementally update the frequent patterns lattice results in a candidate generation complexity of O(|Fnew^i| · |Fs^i|) instead of O(|Fs^i|^2), and in most practical cases |Fnew^i| ≪ |Fs^i|. Table 4 demonstrates the speedup in runtime achieved by using border-sets.
We implemented two versions of our Persistence miner. In one we utilize border-sets to incrementally update the lattice, whereas in the other we rebuild the frequent pattern lattice from scratch in every new batch. The same batch-wise frequency thresholds, as dictated by Theorem 2, were used in both. We ran the experiment on our call-logs dataset and the T10I4D100K dataset, and for various parameter settings of our algorithm we observed a speedup of ≈ 4× resulting from the use of border-sets.

Fig. 5. Performance with different parameters (Dataset: Call logs): (a) F-Score vs. k, (b) Runtime vs. k, (c) Memory vs. k, (d) F-Score vs. m, (e) Runtime vs. m, (f) Memory vs. m, comparing the standard multi-pass miner, the Chernoff-bound based and Lossy-counting based methods, and the Persistence miner.

6.4. Adapting to Dynamic data

In this experiment we show that the Persistence miner adapts faster than the competing algorithms to changes in the underlying data characteristics. We demonstrate this using synthetic data generated with the multi-neuronal simulator based on (Achar et al., 2012). The simulation model was adapted to update the connection strengths dynamically while generating synthetic data. This allowed us to change the top-k episode patterns slowly over the length of the simulated data. We embedded 25 randomly picked episodes with time-varying arrival rates. The arrival rate of each episode was changed by varying the connection strength between consecutive event types in the episode from 0.1 to 0.9 as a ramp function of time. By setting a different periodicity of the ramp function for each episode we were able to change the top-k frequent episodes over time. Figure 6(a) shows the performance of the different methods in terms of the f-score computed after the arrival of each new batch of events, but for the top-k patterns over the window. The ground truth is again the output of the standard multi-pass apriori algorithm that is allowed access to the entire window of events. The f-score curves of both competing methods are almost always below that of the Persistence miner, while the runtimes for the Persistence miner are always lower than those of the competing methods (see Figure 6(b)).

Fig. 6. Comparison of different methods in tracking top-k patterns over a dynamically changing event stream: (a) f-score comparison and (b) runtime, plotted against sliding window position (Parameters used: k = 25, m = 10, Persistence miner: v = 0.5, competing methods: error = 1.0e-4).

Table 4. Utility of border set.
(a) Dataset: Call-logs, Size of patterns = 6, Parameter v = m/2, Pattern class: Episodes
Window size   Runtime (border-set)   Runtime (no-border-set)   Memory (border-set)   Memory (no-border-set)
5                             2.48                     12.95                 67.55                   66.21
10                            3.13                     11.41                  67.7                   66.67
15                            3.47                     13.76                 67.82                   67.02
(b) Dataset: Call-logs, Size of patterns = 6, window size m = 10, Pattern class: Episodes
Parameter v   Runtime (border-set)   Runtime (no-border-set)   Memory (border-set)   Memory (no-border-set)
0                             2.78                     11.98                  67.7                   66.67
5                             3.13                     11.41                  67.7                   66.67
10                            3.21                     10.85                 67.69                    57.5
(c) Dataset: T10I4D100K, Size of patterns = 4, window size m = 10, Pattern class: Itemsets
Parameter v   Runtime (border-set)   Runtime (no-border-set)   Memory (border-set)   Memory (no-border-set)
9                            73.38                    285.46                116.27                   79.89

The Lossy-counting based method is the slowest at an error parameter of 0.0001. The effective batch-wise thresholds of both the Lossy counting and Chernoff-bound based algorithms were very similar, leading to identical performance of the two competing methods in this experiment. The main reason for the better tracking by the Persistence miner is that the output of the algorithm filters out all non-(v, k)-persistent patterns. This acts in favor of the Persistence miner, as the patterns most likely to gather sufficient support to be in the top-k are also likely to be persistent. The use of border-sets in the Persistence miner explains the lower runtimes.

6.5. Correlation of f-score with theoretical error

Can we compute theoretical errors (a data-dependent quantity) and gauge how well we perform? This could be invaluable in a real-world streaming data setting. In this experiment we try to establish the usefulness of the theoretical analysis proposed in the paper. The main power of the theory is its ability to predict the error in the output set at the end of each batch. Unlike the other methods, we compute the error bounds using the data characteristics, and they are dynamically updated as new data arrives. The error guarantees of both the Lossy counting and Chernoff based methods are static. In Figure 7, we plot the error bound obtained using Theorem 3 and the f-score computed with respect to the reference method (standard multi-pass apriori) in a 2D histogram. According to the theory, different pairs of (ϕ, ε) yield a different error bound in every batch. In our experiment we pick the smallest error bound over the allowable range of ϕ (and the corresponding ε) in each batch and plot it against the corresponding f-score. The histogram is expected to show a negative correlation between f-score and our predicted error bound, i.e., the predicted error for high-f-score top-k results should be low and vice versa. The correlation is not very evident in the plot. The histogram shows higher density in the top-left part of the plot, which is a mild indication that a high f-score has a correspondingly low predicted error.

Fig. 7. 2D histogram of predicted error (x-axis) vs. f-score (y-axis). (Dataset: Call logs)
7. Related Work

Traditionally, data stream management systems (DSMSs) (Babcock et al., 2002) have been used mainly for detecting patterns and conditions that were mined from offline data using traditional multi-pass frequent-pattern, frequent-itemset, sequential-pattern, and motif-detection techniques. However, as real-time data acquisition becomes ubiquitous, there is a growing need to derive insights directly from high-volume streams, at low latency and under limited memory, using a DSMS (Chandramouli et al., 2012).

Mining Itemsets over Streams. The literature on streaming algorithms for pattern discovery is dominated by techniques from the frequent itemset mining context (Manku and Motwani, 2002; Karp et al., 2003; Chang and Lee, 2003; Chang and Lee, 2004; Jin and Agrawal, 2005; Wong and Fu, 2006; Calders et al., 2007; Cheng et al., 2008; Lam et al., 2011), and we are unaware of any algorithms for pattern mining over event streams that possess the generality of the mining approach presented here. Manku and Motwani (Manku and Motwani, 2002) proposed the lossy counting algorithm for approximate frequency counting in the landmark model (i.e., all events seen until the current time constitute the window of interest). One of its main drawbacks is that the algorithm must essentially operate at a threshold of ε to provide ε error guarantees, which is impractical in many real-world scenarios (a simplified sketch of this counting idea appears below, after the discussion of motif mining). Karp et al. (Karp et al., 2003) also propose a one-pass streaming algorithm for finding frequent items, and these ideas were extended to itemsets by Jin and Agrawal (Jin and Agrawal, 2005), but all these methods require even more space than lossy counting. Mendes et al. (Mendes, Ding and Han, 2008) extend the pattern growth algorithm (PrefixSpan) (Pei et al., 2001) for mining sequential patterns to incorporate the idea of lossy counting. Chang and Lee (Chang and Lee, 2004) and Wong and Fu (Wong and Fu, 2006) extend lossy counting to sliding windows and the top-k setting, respectively. New frequency measures for itemsets over streams have also been proposed (e.g., (Calders et al., 2007; Lam et al., 2011)), but these methods are heavily specialized toward the itemset context. There has also been renewed recent interest in adapting pattern matching (Chandramouli et al., 2010; Agrawal et al., 2008) (as opposed to mining) algorithms to the streaming context; it is not obvious how to extend these to accommodate patterns in a manner that supports both candidate generation and counting.

Mining Time-Series Motifs. An online algorithm for mining time-series motifs is proposed by Mueen and Keogh (Mueen and Keogh, 2010). The algorithm uses an interesting data structure to find a pair of approximately repeating subsequences in a window, using the Euclidean distance measure to assess the similarity of the motif sequences. Unfortunately, this notion does not extend naturally to discrete patterns. Further, this motif mining formulation does not explicitly make use of a support or frequency threshold, and it returns exactly one pair of motifs, namely the pair that is closest in terms of distance.
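For reference, the core of the lossy counting idea for single items can be sketched as follows. This is a simplified illustration of the counting phase only, not the full algorithm of Manku and Motwani (2002) and not our method; the item-level version is shown because it is the simplest to state.

```python
import math

def lossy_count(stream, epsilon):
    """Simplified lossy counting over a stream of items (after Manku and Motwani, 2002).

    Maintains (count, delta) per tracked item, where delta bounds the undercount
    incurred before the item was first inserted; entries that cannot reach a
    frequency of epsilon * N are pruned at each bucket boundary.
    """
    bucket_width = int(math.ceil(1.0 / epsilon))
    counts = {}                    # item -> [count, delta]
    n = 0
    current_bucket = 1
    for item in stream:
        n += 1
        if item in counts:
            counts[item][0] += 1
        else:
            counts[item] = [1, current_bucket - 1]
        if n % bucket_width == 0:  # bucket boundary: prune infrequent entries
            for it in list(counts):
                count, delta = counts[it]
                if count + delta <= current_bucket:
                    del counts[it]
            current_bucket += 1
    return counts, n

# Items with count >= (support - epsilon) * n are reported as frequent,
# with an undercount of at most epsilon * n.
```

The threshold behavior criticized above is visible here: to guarantee an error of ε, the structure must track items down to a frequency of ε, which is what makes very small ε impractical.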
Sequential Patterns and Episode Mining. An episode, or more generally a partial-order pattern, can be thought of as a generalization of an itemset in which each item is not confined to occur within the same transaction (i.e., at the same time tick) and there is additional structure in the form of an ordering over the events or items. In serial patterns, events must occur in exactly one particular order, whereas partial-order patterns allow multiple orderings; in addition, a pattern may contain repeated event-types. This loosely coupled structure of events in a pattern results in a narrower separation between the frequencies of true and noisy patterns (i.e., those resulting from random co-occurrences of events) and quickly leads to a combinatorial explosion of candidates when mining at low frequency thresholds. Most of the itemset literature does not deal with the problem of candidate generation; the focus is on counting rather than on efficient candidate generation schemes. While pattern mining in offline multi-pass scenarios has been researched (Laxman, 2006; Achar et al., 2012), this paper is the first to explore ways of doing both counting and candidate generation efficiently in a streaming setting. We devise algorithms that can operate at frequency thresholds as high as possible while still providing guarantees on the frequent patterns they output.

8. Conclusions

While pattern mining in offline multi-pass scenarios has been researched (Agrawal and Srikant, 1994; Mannila et al., 1997; Achar et al., 2012), this paper is the first to explore ways of doing both counting and candidate generation efficiently in a streaming setting. The key ingredients of our approach are the parameterization of the streaming algorithm using data-driven quantities, such as the rate of change and the top-k separation. This leads to algorithms that can operate at high frequency thresholds with theoretical guarantees on the quality of the top-k approximation. Further, our algorithms employ border sets in a novel way, ensuring minimal computational redundancy as we process adjacent (potentially similar) batches. Experiments demonstrate the effectiveness of our algorithms for two popular pattern classes, namely itemsets and episodes. The approach presented here is applicable to any general pattern class provided we have black-box access to a level-wise mining algorithm for that class.

Acknowledgements

This work is supported in part by US National Science Foundation grants IIS-0905313 and CCF-0937133, and by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center (DoI/NBC) contract number D12PC000337. The US Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the US Government.

References

Achar, A. et al. (2012), 'Discovering injective episodes with general partial orders', Data Mining and Knowl. Dsc. 25(1), 67–108.
Agrawal, J. et al. (2008), Efficient pattern matching over event streams, in 'Proc. of the 2008 ACM SIGMOD Intl. Conf. on Management of Data', SIGMOD '08, pp. 147–160.
Agrawal, R. and Srikant, R. (1994), Fast algorithms for mining association rules in large databases, in 'Proceedings of the 20th International Conference on Very Large Data Bases', pp. 487–499.
Aumann, Y. et al. (1999), 'Borders: An efficient algorithm for association generation in dynamic databases', J. of Intl. Info. Syst. (JIIS) (1), 61–73.
Babcock, B. et al. (2002), Models and issues in data stream systems, in 'Proc. of the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems', pp. 1–16.
Calders, T. et al. (2007), Mining frequent itemsets in a stream, in 'Proc. of the 7th IEEE Intl. Conf. on Data Mining (ICDM)', pp. 83–92.
Chandramouli, B. et al. (2010), 'High-performance dynamic pattern matching over disordered streams', Proc. VLDB Endow. 3(1-2), 220–231.
Chandramouli, B. et al. (2012), Temporal analytics on big data for web advertising, in 'Proc. of the Intl. Conf. on Data Engg. (ICDE)'.
Chang, J. H. and Lee, W. S. (2003), Finding recent frequent itemsets adaptively over online data streams, in 'Proc. of the 9th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining (KDD)', pp. 487–492.
Chang, J. H. and Lee, W. S. (2004), 'A sliding window method for finding recently frequent itemsets over online data streams', J. Inf. Sci. Eng. 20(4), 753–762.
Cheng, J. et al. (2008), 'A survey on algorithms for mining frequent itemsets over data streams', Knowl. and Info. Systems 16(1), 1–27.
Jin, R. and Agrawal, G. (2005), An algorithm for in-core frequent itemset mining on streaming data, in 'Proc. of the 5th IEEE Intl. Conf. on Data Mining (ICDM)', pp. 210–217.
Karp, R. M. et al. (2003), 'A simple algorithm for finding frequent elements in streams and bags', ACM Trans. Database Syst. 28, 51–55.
Lam, H. T. et al. (2011), Online discovery of top-k similar motifs in time series data, in 'SIAM Intl. Conf. on Data Mining', pp. 1004–1015.
Laxman, S. (2006), Discovering frequent episodes: Fast algorithms, connections with HMMs and generalizations, PhD thesis, IISc, Bangalore, India.
Laxman, S. and Sastry, P. S. (2006), 'A survey of temporal data mining', Sādhanā, Academy Proceedings in Engineering Sciences 31, 173–198.
Manku, G. S. and Motwani, R. (2002), Approximate frequency counts over data streams, in 'Proc. of the 28th Intl. Conf. on Very Large Data Bases (VLDB)', pp. 346–357.
Mannila, H. et al. (1997), 'Discovery of frequent episodes in event sequences', Data Mining and Knowl. Dsc. 1(3), 259–289.
Mendes, L., Ding, B. and Han, J. (2008), Stream sequential pattern mining with precise error bounds, in 'Proc. of the 8th IEEE Intl. Conf. on Data Mining (ICDM)', pp. 941–946.
Mueen, A. and Keogh, E. (2010), Online discovery and maintenance of time series motifs, in 'Proc. of the 16th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining (KDD)'.
Muthukrishnan, S. (2005), 'Data streams: Algorithms and applications', Foundations and Trends in Theoretical Computer Science 1(2).
Patnaik, D., Laxman, S. and Ramakrishnan, N. (2009), Discovering excitatory networks from discrete event streams with applications to neuronal spike train analysis, in 'Proc. of the 9th IEEE Intl. Conf. on Data Mining (ICDM)'.
Patnaik, D., Marwah, M., Sharma, R. and Ramakrishnan, N. (2009), Sustainable operation and management of data center chillers using temporal data mining, in 'Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining', pp. 1305–1314.
Patnaik, D. et al. (2012), 'Streaming algorithms for pattern discovery over dynamically changing event sequences', CoRR abs/1205.4477.
Pei, J. et al. (2001), PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth, in 'Proc. of the 17th Intl. Conf. on Data Engg. (ICDE)', pp. 215–224.
Ramakrishnan, N., Patnaik, D. and Sreedharan, V. (2009), 'Temporal process discovery in many guises', IEEE Computer 42(8), 97–101.
Wagenaar, D. A. et al. (2006), 'An extremely rich repertoire of bursting patterns during the development of cortical cultures', BMC Neuroscience.
Wang, J. et al. (2005), 'TFP: An efficient algorithm for mining top-k frequent closed itemsets', IEEE Trans. on Knowledge and Data Engg. 17(5), 652–664.
Wong, R. C.-W. and Fu, A. W.-C. (2006), 'Mining top-k frequent itemsets from data streams', Data Mining and Knowl. Dsc. 13, 193–217.
Yan, X. and Han, J. (2003), CloseGraph: Mining closed frequent subgraph patterns, in 'Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'03)'.

Author Biographies

Debprakash Patnaik is currently an engineer in the Search and Discovery group at Amazon.com. He completed his Ph.D. in Computer Science at Virginia Tech, USA. He received his bachelor's degree from Veer Surendra Sai University of Technology, Orissa, India, in 2002 and his master's degree from the Indian Institute of Science, Bangalore, in 2006. He has also worked as a researcher in the Diagnosis and Prognosis group at General Motors Research, Bangalore. His research interests include temporal data mining, probabilistic graphical models, and applications in neuroscience.

Srivatsan Laxman is a researcher at Microsoft Research India, Bangalore, and is associated with two research areas, Machine Learning & Optimization and Security & Privacy. He works broadly in data mining, machine learning, and data privacy. Srivatsan received his Ph.D. in Electrical Engineering in 2006 from the Indian Institute of Science, Bangalore.

Badrish Chandramouli is a researcher in the database group at Microsoft Research. He is broadly interested in databases and distributed systems, with particular interests in stream processing and temporal analytics, cloud computing, database query processing and optimization, and publish/subscribe systems. Badrish works primarily on the streams project at Microsoft Research, Redmond, which ships commercially as Microsoft StreamInsight.

Naren Ramakrishnan is the Thomas L. Phillips Professor of Engineering at Virginia Tech. He works with domain scientists and engineers to develop software systems for scientific knowledge discovery. Application domains in his research have spanned protein design, biochemical networks, neuro-informatics, plant physiology, wireless communications, and sustainable design of data centers. Ramakrishnan received his Ph.D. in computer sciences from Purdue University. He is an ACM Distinguished Scientist.

Correspondence and offprint requests to: Debprakash Patnaik, Amazon.com, WA 98109, USA. Email: patnaikd@amazon.com