Under consideration for publication in Knowledge and Information Systems A General Streaming Algorithm for Pattern Discovery Debprakash Patnaik1 , Srivatsan Laxman2 , Badrish Chandramouli3 , and Naren Ramakrishnan4 1 Amazon.com, Seattle, WA 98109, USA; 2 Microsoft Research, Bangalore 560080, India; 3 Microsoft Research, Redmond, WA, USA; 4 Department of Computer Science, Virginia Tech, VA 24061, USA Abstract. Discovering frequent patterns over event sequences is an important data mining problem. Existing methods typically require multiple passes over the data, rendering them unsuitable for streaming contexts. We present the first streaming algorithm for mining frequent patterns over a window of recent events in the stream. We derive approximation guarantees for our algorithm in terms of: (i) the separation of frequent patterns from infrequent ones, and (ii) the rate of change of stream characteristics. Our parameterization of the problem provides a new sweet spot in the tradeoff between making distributional assumptions over the stream and algorithmic efficiencies of mining. We illustrate how this yields significant benefits when mining practical streams from neuroscience and telecommunications logs. Keywords: Event Sequences; Data Streams; Frequent patterns; Pattern Discovery; Streaming Algorithms; Approximation Algorithms 1. Introduction Application contexts (Patnaik, Marwah, Sharma and Ramakrishnan, 2009; Ramakrishnan, Patnaik and Sreedharan, 2009) in telecommunications, neuroscience, sustainability, and intelligence analysis feature massive data streams (Muthukrishnan, 2005) with ‘firehose’-like rates of arrival. In many cases, we need to analyze such streams at speeds comparable to their generation rate. In neuroscience, Received Feb 01, 2013 Revised Feb 01, 2013 Accepted Feb 01, 2013 2 D. Patnaik et al one goal is to track spike trains from multi-electrode arrays (Patnaik, Laxman and Ramakrishnan, 2009) with a view to identify cascading circuits of neuronal firing patterns. In telecommunications, network traffic and call logs must be analyzed on a continual basis to detect attacks or other malicious activity. The common theme in all these scenarios is the need to mine patterns (i.e., a succession of events occurring frequently, but not necessarily consecutively (Mannila et al., 1997)) from dynamic and evolving streams. Algorithms for pattern mining over streams have become increasingly popular over the recent past (Manku and Motwani, 2002; Wong and Fu, 2006; Calders et al., 2007; Cheng et al., 2008). Manku and Motwani (Manku and Motwani, 2002) introduced a lossy counting algorithm for approximate frequency counting over streams, with no assumptions on the stream. Their focus on a worst-case setting often leads to stringent threshold requirements. At the other extreme, algorithms such as (Wong and Fu, 2006) provide significant efficiencies in mining but make strong assumptions such as i.i.d distribution of symbols in a stream. In the course of analyzing some real-world datasets, we were motivated to develop new methods as existing methods are unable to process streams at the rate and quality guarantees desired (see Sec. 6 for some examples). Furthermore, established stream mining algorithms are almost entirely focused on itemset mining (and, modulo a few isolated exceptions, just the counting phase of it) whereas we are interested in mining general patterns. Our specific contributions are as follows: – We present the first general algorithm for mining patterns in a stream. 
Unlike prior streaming algorithms that focus almost exclusively on counting, we provide solutions for both candidate generation and counting over a stream. – Although our work is geared towards pattern mining, we adopt a black-box model of a pattern mining algorithm. In other words, our approach can encapsulate and wrap around any pattern discovery algorithm to enable it to accommodate streaming data. This significantly generalizes the scope and applicability of our approach as a general methodology to streamify existing pattern discovery algorithms. We illustrate here this generality of our approach by focusing on two pattern classes—itemsets and episodes—prevalent in data mining research. – Devoid of any statistical assumptions on the stream (e.g., independence of event symbols or otherwise), we develop a novel error characterization for streaming patterns by identifying and tracking two key properties of the stream, viz. maximum rate of change and top-k separation. We demonstrate how the use of these two properties enables novel algorithmic optimizations, such as the idea of borders to amortize work as the stream is tracked. – We demonstrate successful applications in neuroscience and telecommunications log analysis, and illustrate significant benefits in runtime, memory usage, and the scales of data that can be mined. We compare against pattern-mining adaptations of two typical algorithms (Wong and Fu, 2006) from the streaming itemsets literature. 2. Preliminaries We consider the data mining task of discovering frequent patterns over a timestamped sequence of data records (Laxman and Sastry, 2006). The data se- A General Streaming Algorithm for Pattern Discovery 3 quence is denoted as h(e1 , τ1 ), (e2 , τ2 ), . . . , (en , τn )i, where each data record ei , may represent different kinds of data depending on the class of patterns being considered. For example, ei denotes a transaction in the frequent itemsets setting (Agrawal and Srikant, 1994), an event in the frequent episodes setting (Mannila et al., 1997), a graph in the frequent subgraph mining setting (Yan and Han, 2003), etc. The techniques developed in this paper are relevant in all of these settings. In our experimental work, we focus on two concrete classes of patterns; frequent episodes and frequent itemsets; we briefly summarize the associated formalisms for these in the paragraphs below. Frequent Episodes: In the framework of frequent episodes (Mannila et al., 1997), an event sequence is denoted as h(e1 , τ1 ), . . . , (en , τn )i, where (ei , τi ) represents the ith event; ei is drawn from a finite alphabet E of symbols (called event-types) and τi denotes the time-stamp of the ith event, with τi+1 ≥ τi , i = 1, . . . , (n − 1). An `-node episodes α is defined by a triple α = (Vα , <α , gα ), where Vα = {v1 , . . . , v` } is a collection of ` nodes, <α is a partial order over Vα and gα : Vα → E is a map that assigns an event-type gα (v) to each node v ∈ Vα . An occurrence of an episode α is a map h : Vα → {1, . . . , n} such that eh(v) = gα (v) for all v ∈ Vα and for all pairs of nodes v, v 0 ∈ Vα such that v <α v 0 the map h ensures that τh(v) < τh(v0 ) . Two occurrences of an episode are non-overlapped (Achar et al., 2012) if no event corresponding to one appears in-between the events corresponding to the other. The maximum number of non-overlapped occurrences of an episode is defined as its frequency in the event sequence. 
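To make the frequency definition concrete, here is a small illustrative sketch (ours, not taken from the paper; all names are hypothetical) of counting the maximum number of non-overlapped occurrences for the special case of a serial episode, i.e., an episode whose partial order is a total order over its nodes. A single greedy left-to-right scan suffices in this case; general partial-order episodes need the more elaborate counting algorithms of (Achar et al., 2012).

    # Minimal sketch (assumes a serial episode; names are illustrative only).
    def count_nonoverlapped_serial(events, episode):
        """events: iterable of (event_type, timestamp) pairs in time order;
        episode: list of event-types that must occur in this order."""
        count, i, last_time = 0, 0, None   # i = index of the next event-type awaited
        for etype, t in events:
            if etype == episode[i] and (i == 0 or t > last_time):
                last_time, i = t, i + 1
                if i == len(episode):      # a complete occurrence has been found
                    count += 1
                    i, last_time = 0, None # restart so occurrences do not overlap
        return count

    # Example: A -> B -> C occurs twice (non-overlapped) in this sequence.
    seq = [('A', 1), ('B', 2), ('C', 3), ('A', 4), ('B', 5), ('C', 6)]
    print(count_nonoverlapped_serial(seq, ['A', 'B', 'C']))   # prints 2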
Frequent Itemsets: The frequent itemsets mining framework is concerned with the classical market basket problem (Agrawal and Srikant, 1994), where each data record ei can be viewed as a transaction of items purchased at a grocery store. Frequent itemsets will then refer to groups of itemsets frequently purchased together in the given data set of transactions. In general, the task in frequent pattern discovery is to find all patterns whose frequency exceeds a user-defined threshold. Apriori-style level-wise1 algorithms (Agrawal and Srikant, 1994; Mannila et al., 1997; Yan and Han, 2003; Achar et al., 2012) are typically applicable in this setting. An important variant is topk pattern mining (see (Wang et al., 2005) for definitions in the itemsets mining context), where, rather than a frequency threshold, the user supplies the number of most frequent patterns needed. Definition 1 (Top-k patterns of size `). The set of top-k patterns of size ` is defined as the collection of all `-size patterns with frequency greater than or equal to the frequency f k of the k th most frequent `-size pattern in the given data sequence. The number of top-k `-size patterns can exceed k, although the number of `size patterns with frequencies strictly greater than f k is at most (k − 1). In 1 Level-wise algorithms start with patterns of size 1 and with each increasing level estimate frequent patterns of the next size. D. Patnaik et al Events 4 Window Ws [t] Batch Bs Time Fig. 1. A sliding window model for pattern mining over data streams: Bs is the most recent batch of records that arrived in the stream and Ws is the window of interest over which the user wants to determine the set of frequent patterns. general, top-k mining can be difficult to solve without knowledge of a good lowerbound for f k ; for relatively short data sequences the following simple solution works well-enough: start mining at a high threshold and progressively lower the threshold until the desired number of top patterns are returned. 3. Problem Statement The data available (referred to as a data stream) is in the form of a potentially infinite sequence of data records: D = h(e1 , τ1 ), (e2 , τ2 ), . . . , (ei , τi ), . . . , (en , τn ), . . .i (1) Our goal is to find all patterns that were frequent in the recent past; for this, we consider a sliding window model2 for the window of interest. In this model, the user wants to determine patterns that were frequent over a (historical) window of fixed-size terminating at the current time-tick. As new records arrive in the stream, the user’s window of interest shifts, and the data mining task is to next report the frequent patterns in the new window of interest. We consider the case where the window of interest is very large and cannot be stored and processed in-memory. This straightaway precludes the use of standard multi-pass algorithms for frequent pattern discovery over the window of interest. We organize the records in the stream into smaller batches such that at any given time only the latest incoming batch is stored and processed in memory. This is illustrated in Fig. 1. The current window of interest is denoted by Ws and the most recent batch, Bs , consists of records in D that occurred between times (s − 1)Tb and sTb where Tb is the time-span of each batch and s is the batch number (s = 1, 2, . . .) . The frequency of a pattern α in a batch Bs is referred to as its batch frequency f s (α). The current window of interest, Ws , consists of m consecutive batches ending in batch Bs , i.e. 
Ws = ⟨Bs−m+1, Bs−m+2, . . . , Bs⟩    (2)

Definition 2 (Window Frequency). The frequency of a pattern α over window Ws, referred to as its window frequency and denoted by f Ws (α), is defined as the sum of batch frequencies of α in Ws. Thus, if f j (α) denotes the batch frequency of α in batch Bj, then the window frequency of α is given by f Ws (α) = Σ_{Bj ∈ Ws} f j (α).

2 Other models such as the landmark and time-fading models have also been studied (Cheng et al., 2008) but we do not consider them here.

Fig. 2. Batch frequencies in Example 1.
Pattern   B1   B2   B3   B4
WXYZ      12    0    0   12
PQRS      10    0    9    0
ABCD       8    9    8   10
MNOP       8   10    8    8
EFGH       0   15    0   10
IJKL       0   12   11    0

Table 1. Window frequencies in Example 1.
Pattern       ABCD  MNOP  EFGH  WXYZ  IJKL  PQRS
Window Freq     35    34    25    24    23    19

In summary, we are given a data stream (D), a time-span for batches (Tb), the number of consecutive batches that constitute the current window of interest (m), the desired size of frequent patterns (ℓ), the desired number of most frequent patterns (k), and the problem is to discover the top-k patterns in the current window without actually having the entire window in memory.

Problem 1 (Streaming Top-k Mining). For each new batch, Bs, of records in the stream, find all ℓ-size patterns in the corresponding window of interest, Ws, whose window frequencies are greater than or equal to the window frequency, fsk, of the k th most frequent ℓ-size pattern in Ws.

Example 1 (Window Top-k v/s Batch Top-k). Let W be a window of four batches B1 through B4. The patterns in each batch with corresponding batch frequencies are listed in Fig. 2. The corresponding window frequencies (the sum of each pattern's batch frequencies) are listed in Table 1. The top-2 patterns in B1 are (PQRS) and (WXYZ). Similarly (EFGH) and (IJKL) are the top-2 patterns in B2, and so on. (ABCD) and (MNOP) have the highest window frequencies but never appear in the top-2 of any batch – these patterns would ‘fly below the radar’ and go undetected if we considered only the top-2 patterns in every batch as candidates for the top-2 patterns over W. This example can be easily generalized to any number of batches and any k.

It is also clear that the size of the batches can play a critical role with regard to the quality of approximation of top-k. At one end of the spectrum the batch-size can be as large as the window. We would get exact top-k results but such a scheme would be obviously impractical. At the other end the algorithm could consume events one at a time (batch-size = 1), updating the top-k results over the window in an online fashion. However, we expect the quality of approximation to be poor in this case since no local pattern statistics can be estimated reliably over batches of size 1. Thus, we suggest using batches of sufficient size, limited, of course, by the number of events that can be stored and processed in-memory. This will allow us to exploit the slowly changing statistics across batches to get better top-k approximations over the window.

Example 1 highlights the main challenge in the streaming top-k mining problem: we can only store/process the most recent batch of records in the window of interest and the batchwise top-k may not contain sufficient information to compute the top-k over the entire window.
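The window frequency of Definition 2 is straightforward to maintain once per-batch frequencies are available. The following is a minimal sketch (our own illustration, not the paper's implementation) that keeps the per-batch counts of tracked patterns for the last m batches and sums them to answer window top-k queries; which patterns are worth tracking in every batch is precisely the question taken up in Sections 4 and 5.

    # Minimal sketch (illustrative): window counts as sums of batch counts.
    from collections import Counter, deque

    class WindowCounts:
        def __init__(self, m):
            self.m = m
            self.batches = deque()     # one Counter per batch, oldest first
            self.window = Counter()    # running window sums

        def add_batch(self, batch_freqs):
            """batch_freqs: dict {pattern: frequency in the newest batch}."""
            if len(self.batches) == self.m:          # slide the window
                self.window.subtract(self.batches.popleft())
                self.window += Counter()             # drop non-positive entries
            self.batches.append(Counter(batch_freqs))
            self.window.update(batch_freqs)

        def top_k(self, k):
            return self.window.most_common(k)

For the data of Example 1, feeding the four batch-frequency tables of Fig. 2 into add_batch and then calling top_k(2) returns (ABCD) and (MNOP), even though neither pattern is in the top-2 of any individual batch.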
It is obviously not possible to track all patterns (both frequent and infrequent) in every batch since the pattern space is typically very large. This brings us to the question of which patterns to track in every batch – how deep must we search within each batch for patterns that have potential to become top-k over the window? We develop the formalism needed to answer this question. 4. Persistence and Top-k Approximation We identify two important properties of the underlying data stream which influence the design and analysis of our algorithms. These are stated in Definitions 3 & 4 below. Definition 3 (Maximum Rate of Change, ∆). Maximum rate of change ∆(> 0) is defined as the maximum change in batch frequency of any pattern, α, across any pair of consecutive batches, Bs and Bs+1 , i.e., ∀α, s, we have |f s+1 (α) − f s (α)| ≤ ∆. (3) Intuitively, ∆ controls the extent of change from one batch to the next. While it is trivially bounded by the number of records arriving per batch, it is often much smaller in-practice. Definition 4 (Top-k Separation of (ϕ, )). A batch Bs of records is said to have a top-k separation of (ϕ, ), ϕ ≥ 0,  ≥ 0, if it contains at most (1 + )k patterns with batch frequencies of (fks − ϕ∆) or more, where fks is the batch frequency of the k th most-frequent pattern in Bs and ∆ is the maximum rate of change. This is essentially a measure of how well-separated the frequencies of the topk patterns are relative to the rest of the patterns. We expect to see roughly k patterns with batch frequencies of at least fsk and the separation is considered to be high (or good) if lowering the threshold from fsk to (fks − ϕ∆) only brings-in very few additional patterns, i.e.  remains small as ϕ increases. Top-k separation of any batch Bs is characterized by, not one but, several pairs of (ϕ, ) since ϕ and  are functionally related:  is typically close to zero if ϕ = 0, while we have k roughly the size of the class of `-size patterns (minus k) if ϕ∆ ≥ fsk . Note that  is a non-decreasing function of ϕ and that top-k separation is measured relative to the maximum rate of change ∆. We now use the maximum rate of change property to design efficient streaming algorithms for top-k pattern mining and show that top-k separation plays a pivotal role in determining the quality of approximation that our algorithms achieve. Lemma 1. The batch frequencies of the k th most-frequent patterns in any pair of consecutive batches cannot differ by more than the maximum rate of change ∆, i.e., for every batch Bs , we must have |fks+1 − fks | ≤ ∆. (4) A General Streaming Algorithm for Pattern Discovery 7 Proof. There exist at least k patterns in Bs with batch frequency greater than or equal to fks (by definition). Hence, there exist at least k patterns in Bs+1 with batch frequency greater than or equal to (fks − ∆) (since frequency of any pattern can decrease by at most ∆ going from Bs to Bs+1 ). Hence we must have fks+1 ≥ (fks − ∆). Similarly, there can be at most (k − 1) patterns in Bs+1 with batch frequency strictly greater than (fks + ∆). Hence we must also have fks+1 ≤ (fks + ∆). The above lemma follows directly from: (i) there are at least k patterns with frequencies no less than fks , and (ii) the batch frequency of any pattern can increase or decrease by no more than ∆ when going from one batch to the next. Our next observation is that if the batch frequency of a pattern is known relative to fks in the current batch Bs , we can bound its frequency in any later batch Bs+r . Lemma 2. 
Consider two batches, Bs and Bs+r , r ∈ Z, located r batches away from each other. Under a maximum rate of change of ∆ the batch frequency of any pattern α in Bs+r must satisfy the following: 1. If f s (α) ≥ fks , then f s+r (α) ≥ fks+r − 2|r|∆ 2. If f s (α) < fks , then f s+r (α) < fks+r + 2|r|∆ Proof. Since ∆ is the maximum rate of change, we have f s+r (α) ≥ (f s (α)−|r|∆) and from Lemma 1, we have fks+r ≤ (fks + |r|∆). Therefore, if f s (α) ≥ fks , then f s+r (α) + |r|∆ ≥ f s (α) ≥ fks ≥ fks+r − |r|∆ which implies f s+r (α) ≥ fks+r − 2|r|∆. Similarly, if f s (α) < fks , then f s+r (α) − |r|∆ ≤ f s (α) < fks ≤ fks+r + |r|∆ which implies f s+r (α) < fks+r + 2|r|∆. Lemma 2 gives us a way to track patterns that have potential to be in the topk of future batches. This is an important property which our algorithm exploits and we recorded this as a remark below. Remark 1. The top-k patterns of batch, Bs+r , r ∈ Z, must have batch frequencies of at least (fks − 2|r|∆) in batch Bs . Specifically, the top-k patterns of Bs+1 must have batch frequencies of at least (fks − 2∆) in Bs . 0 Proof. Assume that we know fks for the batch Bs0 . If a pattern α is to belong 0 0 to the set of top-k frequent patterns in the batch Bs0 +r , then f s +r (α) ≥ fks +r 0 (where fs0 +r (α) and fks +r are unknown for r > 0). 0 0 Substituting s = s0 + r in Lemma 2, we get f s (α) ≥ fks − 2|r|∆. The maximum rate of change property leads to a necessary condition, in the form of a minimum batch-wise frequency, for a pattern α to be in the top-k over a window Ws . Theorem 1 (Exact Top-k over Ws ). A pattern, α, can be a top-k pattern 0 0 over window Ws only if its batch frequencies satisfy f s (α) ≥ fks − 2(m − 1)∆ ∀Bs0 ∈ Ws . 8 D. Patnaik et al Proof. To prove this, we show that if a pattern fails this threshold in even one batch of Ws , then there exist k or more patterns which can have a greater window 0 0 frequency. Consider a pattern β for which f s (β) < fks − 2(m − 1)∆ in batch Bs0 ∈ Ws . Let α be any top-k pattern of Bs0 . Consider two patterns α and β 0 0 0 0 such that in a batch Bs0 f s (α) ≥ fks and f s (β) < fks − 2(m − 1)∆. In any other batch Bp ∈ Ws , we have 0 f p (α) ≥ f s (α) − |p − s0 |∆ 0 ≥ fks − |p − s0 |∆ (5) and 0 f p (β) ≤ f s (β) + |p − s0 |∆ 0 < (fks − 2(m − 1)∆) + |p − s0 |∆ (6) 0 Applying |p − s | ≤ (m − 1) to the above, we get 0 f p (α) ≥ fks − (m − 1)∆ > f p (β) (7) This implies f Ws (β) < f Ws (α) for every top-k pattern α of Bs0 . Since there are at least k top-k patterns in Bs0 , β cannot be a top-k pattern over the window Ws . 0 Mining patterns with frequency threshold (fks − 2(m − 1)∆) in each batch Bs0 gives complete counts of all top-k patterns in the window Ws where m is the 0 number of batches in the window, fks is the frequency of the kth most frequent pattern in batch Bs0 ∈ Ws and ∆ is the continuity parameter. Based on Theorem 1 we have the following simple algorithm for obtaining the top-k patterns over a window: Use a traditional level-wise approach to find all patterns with a batch frequency of at least (f1k − 2(m − 1)∆) in the first batch (B1 ), accumulate their corresponding batch frequencies over all m batches of Ws and report the patterns with the k highest window frequencies over Ws . This approach is guaranteed to return the exact top-k patterns over Ws . In order to report the top-k over the next sliding window Ws+1 , we need to consider all patterns with batch frequency of at least (f2k − 2(m − 1)∆) in the second batch and track them over all batches of Ws+1 , and so on. 
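As a quick numerical illustration (the numbers below are made up and not from the paper's experiments), the batchwise threshold mandated by Theorem 1 can fall far below fks when the window is long or the stream changes quickly, which is what makes exact mining expensive:

    # Illustrative only: Theorem 1's batchwise threshold for exact top-k mining.
    def exact_topk_threshold(f_k_s, m, delta):
        return f_k_s - 2 * (m - 1) * delta

    print(exact_topk_threshold(f_k_s=100, m=10, delta=5))   # 100 - 2*9*5 = 10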
Thus, an exact solution to Problem 1 would require running a level-wise pattern mining algorithm in every batch, Bs , s = 1, 2, . . ., with a frequency threshold of (fsk − 2(m − 1)∆). 4.1. Class of (v, k)-Persistent patterns Theorem 1 characterizes the minimum batchwise computation needed in order to obtain the exact top-k patterns over a sliding window. This is effective when ∆ and m are small (compared to fsk ). However, the batchwise frequency thresholds can become very low in other settings, making the processing time per-batch as well as the number of patterns to track over the window to become impractically high. To address this issue, we introduce a new class of patterns called (v, k)-persistent patterns which can be computed efficiently by employing higher batchwise thresholds. Further, we show that these patterns can be used to approximate the true top-k patterns over the window and the quality of approximation is characterized in terms of the top-k separation property (cf. Definition 4). A General Streaming Algorithm for Pattern Discovery 9 Definition 5 ((v, k)-Persistent pattern). A pattern is said to be (v, k)-persistent over window Ws if it is a top-k pattern in at least v batches of Ws . Problem 2 (Mining (v, k)-Persistent patterns). For each new batch, Bs , of records in the stream, find all `-size (v, k)-persistent patterns in the corresponding window of interest, Ws . Theorem 2. A pattern, α, can be (v, k)-persistent over the window Ws only if 0 0 its batch frequencies satisfy f s (α) ≥ (fks −2(m−v)∆) for every batch Bs0 ∈ Ws . Proof. Let α be (v, k)-persistent over Ws and let Vα denote the set of batches in Ws in which α is in the top-k. For any Bq ∈ / Vα there exists Bb ∈ Vα that p(q) is nearest to Bq . Since |Vα | ≥ v, we must have |b p(q) − q| ≤ (m − v). Applying Lemma 2 we then get f q (α) ≥ fkq − 2(m − v)∆ for all Bq ∈ / Vα . Theorem 2 gives us the necessary conditions for computing all (v, k)-persistent patterns over sliding windows in the stream. The batchwise threshold required for (v, k)-persistent patterns depends on the parameter v. For v = 1, the threshold coincides with the threshold for exact top-k in Theorem 1. The threshold increases linearly with v and is highest at v = m (when the batchwise threshold is same as the corresponding batchwise top-k frequency). The algorithm for discovering (v, k)-persistent patterns follows the same general lines as the one described earlier for exact top-k mining, only that we now apply higher batchwise thresholds: For each new batch, Bs , entering the stream, use a standard level-wise pattern mining algorithm to find all patterns with batch frequency of at least (fsk − 2(m − v)∆). We provide more details of our algorithm later in Sec. 5. First, we investigate the quality of approximation of top-k that (v, k)-persistent patterns offer and show that the number of errors is closely related to the degree of top-k separation. 4.1.1. Top-k Approximation The main idea here is that, under a maximum rate of change ∆ and a topk separation of (ϕ, ), there cannot be too many distinct patterns which are not (v, k)-persistent while still having sufficiently high window frequencies. To this end, we first compute a lower-bound (fL ) on the window frequencies of (v, k)-persistent patterns and an upper-bound (fU ) on the window frequencies of patterns that are not (v, k)-persistent (cf. Lemmas 3 & 4). Lemma 3. 
If pattern α is (v, k)-persistent over a window, Ws , then its window frequency, f Ws (α), must satisfy the following lower-bound: X 0 def f Ws (α) ≥ fks − (m − v)(m − v + 1)∆ = fL (8) B s0 Proof. Consider pattern α that is (v, k)-persistent over Ws and let Vα denote the batches of Ws in which α is in the top-k. The window frequency of α can be 10 D. Patnaik et al written as f Ws (α) = f p (α) + X Bp ∈Vα ≥ X Bp ∈Vα = f q (α) X Bq ∈Ws \Vα fkp fkq − 2|b p(q) − q|∆ X + Bq ∈Ws \Vα 0 fks X − Bs0 ∈Ws X 2|b p(q) − q|∆ (9) Bq ∈Ws \Vα where Bb ∈ Vα denotes the batch nearest Bq where α is in the top-k. Since p(q) |Ws \ Vα | ≤ (m − v), we must have X |b p(q) − q| ≤ (1 + 2 + · · · + (m − v)) Bq ∈Ws \Vα = 1 (m − v)(m − v + 1) 2 (10) Putting together (9) and (10) gives us the lemma. Similar arguments give us the next lemma about the maximum frequency of patterns that are not (v, k)-persistent (Full proofs are available in (Patnaik et al., 2012)). Lemma 4. If pattern β is not (v, k)-persistent over a window, Ws , then its window frequency, f Ws (β), must satisfy the following upper-bound: X 0 def f Ws (β) < fks + v(v + 1)∆ = fU (11) B s0 Proof. Consider pattern β that is not (v, k)-persistent over Ws and let Vβ denote the batches of Ws in which β is in the top-k. The window frequency of β can be written as: X X f Ws (β) = f p (β) + f q (β) Bp ∈Vβ < X Bp ∈Vβ = Bq ∈Ws \Vβ fkp + 2|b p(q) − q|∆ + X fkq Bq ∈Ws \Vβ 0 fks X Bs0 ∈Ws + X 2|b q (p) − p|∆ (12) Bp ∈Vβ where Bb ∈ Ws \ Vβ denotes the batch nearest Bp where β is not in the top-k. q (p) Since |Vβ | < v, we must have X |b q (p) − p| ≤ (1 + 2 + · · · + (v − 1)) Bp ∈Vβ = 1 v(v + 1) 2 Putting together (12) and (13) gives us the lemma. (13) A General Streaming Algorithm for Pattern Discovery 11 It turns out that fU > fL ∀v, 1 ≤ v ≤ m, and hence there is always a possibility for some patterns which are not (v, k)-persistent to end up with higher window frequencies than one or more (v, k)-persistent patterns. We observed a specific instance of this kind of ‘mixing’ in our motivating example as well (cf. Example 1). This brings us to the top-k separation property that we introduced in Definition 4. Intuitively, if there is sufficient separation of the top-k patterns from the rest of the patterns in every batch, then we would expect to see very little mixing. As we shall see, this separation need not occur exactly at k th most-frequent pattern in every batch, somewhere close to it is sufficient to achieve a good top-k approximation. Definition 6 (Band Gap patterns, Gϕ ). In any batch Bs0 ∈ Ws , the halfopen frequency interval [fsk0 − ϕ∆, fsk0 ) is called the band gap of Bs0 . The corresponding set, Gϕ , of band gap patterns over the window Ws , is defined as the collection of all patterns with batch frequencies in the band gap of at least one Bs0 ∈ Ws . The main feature of Gϕ is that, if ϕ is large-enough, then the only patterns which are not (v, k)-persistent but that can still mix with (v, k)-persistent patterns are those belonging to Gϕ . This is stated formally in the next lemma. The proof, omitted here, can be found in (Patnaik et al., 2012). v Lemma 5. If ϕ2 > max{1, (1 − m )(m − v + 1)}, then any pattern β that is not (v, k)-persistent over Ws , can have f Ws (β) ≥ fL only if β ∈ Gϕ . Proof. If a pattern β is not (v, k)-persistent over Ws then there exists a batch Bs0 ∈ Ws where β is not in the top-k. Further, if β ∈ / Gϕ then we must have fs0 (β) < fsk0 − ϕ∆. 
Since ϕ > 2, β cannot be in the top-k of any neighboring batch of Bs0 , and hence, it will stay below fsk0 − ϕ∆ for all Bs0 ∈ Ws , i.e., X f Ws (β) < fsk0 − mϕ∆. Bs0 ∈Ws The Lemma follows from the given condition ϕ 2 > (1 − v m )(m − v + 1). The number of patterns in Gϕ is controlled by the top-k separation property, and since many of the non-persistent patterns which can mix with persistent ones must spend not one, but several batches in the band gap, the number of unique patterns that can cause such errors is bounded. Theorem 3 is our main result about quality of top-k approximation that (v, k)-persistence can achieve. Theorem 3 (Quality of Top-k Approximation). Let every batch Bs0 ∈ Ws v have a top-k separation of (ϕ, ) with ϕ2 > max{1, (1 − m )(m − v + 1)}. Let P denote the set of all (v, k)-persistent patterns over Ws . If |P| ≥ k, then the topk patterns over Ws can be determined from P with an error of no more than   √ km patterns, where µ = min{m − v + 1, ϕ2 , 21 ( 1 + 2mϕ − 1)}. µ Proof. By top-k separation, we have a maximum of (1 + )k patterns in any batch Bs0 ∈ Ws , with batch frequencies greater than or equal to fsk0 − ϕ∆. Since at least k of these must belong to the top-k of the Bs0 , there are no more than k patterns that can belong to the band gap of Bs0 . Thus, there can be no more than a total of km patterns over all m batches of Ws that can belong to Gϕ . Consider any β ∈ / P with f Ws (β) ≥ fL – these are the only patterns whose 12 D. Patnaik et al window frequencies can exceed that of any α ∈ P (since fL is the minimum window frequency of any α). If µ denotes the minimum number  of batches in which β belongs to the band gap, then there can be at most km µ such distinct β. Thus, if |P| ≥ k,we can  determine the set of top-k patterns over Ws with km error no more than patterns. µ There are now two cases to consider to determine µ: (i) β is in the top-k of some batch, and (ii) β is not in the top-k of any batch. Case (i): Let β be in the top-k of Bs0 ∈ Ws . Let Bs00 ∈ Ws be t batches away 00 from Bs0 . Using Lemma 2 we get f s (β) ≥ fsk00 − 2t∆. The minimum t for which  (fsk00 − 2t∆ < fsk − ϕ∆) is ϕ2 . Since β ∈ / P, β is below the top-k in at least (m − v + 1) batches. Hence β stays in the band gap of at least min{m − v + 1, ϕ2 } batches of Ws . Case (ii): Let VG denote the set of batches in Ws where β lies in the band gap and let |VG | = g. Since β does not belong to top-k of any batch, it must stay below the band gap in all the (m − g) batches of (Ws \ VG ). Since ∆ is the maximum rate of change, the window frequency of β can be written as follows: X X f q (β) f Ws (β) = f p (β) + Bp ∈VG < X Bq ∈Ws \VG p f (β) + Bp ∈VG (fqk − ϕ∆) X (14) Bq ∈Ws \VG Let Bb denote the batch in Ws \ VG that is nearest to Bp ∈ VG . Then we have: q (p) q (p) f p (β) ≤ f b (β) + |p − qb(p)|∆ k < fb − ϕ∆ + |p − qb(p)|∆ q (p) < fpk − ϕ∆ + 2|p − qb(p)|∆ (15) where the second inequality holds because β is below the band gap in Bb and q (p) (15) follows from Lemma 1. Using (15) in (14) we get X X f Ws (β) < fsk0 − mϕ∆ + 2|p − qb(p)|∆ Bs0 ∈Ws < X Bp ∈VG fsk0 − mϕ∆ + 2(1 + 2 + · · · + g)∆ Bs0 ∈Ws = X fsk0 − mϕ∆ + g(g + 1)∆ = UB (16) Bs0 ∈Ws The smallest g for which (f Ws (β) ≥ fL ) is feasible can be obtained by setting v )(m − v + 1), UB ≥ fL implies UB ≥ fL . Since ϕ2 > (1 − m X Bs0 ∈Ws fsk0 − mϕ∆ + g(g + 1)∆ > X Bs0 ∈Ws fsk0 − mϕ∆ 2 √ Solving for g, we get g ≥ 21 ( 1 + 2mϕ − 1). Combining cases (i) and (ii), we get √ µ = min{m − v + 1, ϕ2 , 21 ( 1 + 2mϕ − 1)}. 
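Theorem 3 states that, under a top-k separation of (ϕ, ε) in every batch, at most ⌈εkm/µ⌉ of the patterns reported from P can fail to be true top-k patterns over the window. The helper below (our own illustration, not part of the paper's algorithms) evaluates µ and this bound for concrete parameter values.

    import math

    # Illustrative helper for the bound of Theorem 3; eps is the separation
    # parameter of Definition 4 and phi must satisfy the theorem's condition
    # phi/2 > max(1, (1 - v/m) * (m - v + 1)).
    def theorem3_error_bound(k, m, v, phi, eps):
        mu = min(m - v + 1,
                 math.ceil(phi / 2.0),
                 0.5 * (math.sqrt(1 + 2 * m * phi) - 1))
        return math.ceil(eps * k * m / mu)

    # Made-up values: k=25, m=10, v=5, phi=8, eps=0.2 satisfy the condition
    # (phi/2 = 4 > 3) and give mu = 4, so at most 13 erroneous patterns.
    print(theorem3_error_bound(25, 10, 5, 8, 0.2))   # prints 13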
Theorem 3 shows the relationship between the extent of top-k separation required and the quality of top-k approximation that can be obtained through (v, k)-persistent patterns. In general, µ (which is the minimum of three factors) increases with ⌈ϕ/2⌉ until the latter starts to dominate the other two factors, namely, (m − v + 1) and ½(√(1 + 2mϕ) − 1). The theorem also brings out the tension between the persistence parameter v and the quality of approximation. At smaller values of v, the algorithm mines ‘deeper’ within each batch and so we expect fewer errors with respect to the true top-k episodes. On the other hand, deeper mining within batches is computationally more intensive, with the required effort approaching that of exact top-k mining as v approaches 1. Finally, we use Theorem 3 to derive error-bounds for three special cases: first for v = 1, when the batchwise threshold is the same as that for exact top-k mining as per Theorem 1; second for v = m, when the batchwise threshold is simply the batch frequency of the k th most-frequent pattern in the batch; and third, for v = ⌈(m+1)/2⌉, when the batchwise threshold lies midway between the thresholds of the first two cases. (Proofs are detailed in (Patnaik et al., 2012).)

Corollary 1. Let every batch Bs′ ∈ Ws have a top-k separation of (ϕ, ε) and let Ws contain at least m ≥ 2 batches. Let P denote the set of all (v, k)-persistent patterns over Ws. If we have |P| ≥ k, then the maximum error-rate in the top-k patterns derived from P, for three different choices of v, is given by:
1. ⌈εkm/(m − 1)⌉ for v = 1, if ϕ/2 > (m − 1)
2. (εkm) for v = m, if ϕ/2 > 1
3. ⌈4εkm²/(m² − 1)⌉ for v = ⌈(m+1)/2⌉, if ϕ/2 > (1/m)⌈(m−1)/2⌉⌈(m+1)/2⌉.

Proof. We show the proof only for v = ⌈(m+1)/2⌉; the cases of v = 1 and v = m are obtained immediately upon application of Theorem 3. Fixing v = ⌈(m+1)/2⌉ implies (m − v) = ⌊(m−1)/2⌋. For m ≥ 2, ϕ/2 > (1/m)⌈(m−1)/2⌉⌈(m+1)/2⌉ implies ϕ/2 > max{1, (1 − v/m)(m − v + 1)}. Let tmin = min{m − v + 1, ⌈ϕ/2⌉}. The minimum value of tmin is governed by
tmin ≥ min{ m − v + 1, (1/m)⌈(m−1)/2⌉⌈(m+1)/2⌉ } = (1/m)⌈(m−1)/2⌉⌈(m+1)/2⌉ ≥ (m² − 1)/(4m).    (17)
Let gmin = ½(√(1 + 2mϕ) − 1). ϕ/2 > (1/m)⌈(m−1)/2⌉⌈(m+1)/2⌉ implies gmin > ⌈(m−1)/2⌉. From Theorem 3 we have µ = min{tmin, gmin} ≥ (m² − 1)/(4m), and hence the number of errors is no more than ⌈4εkm²/(m² − 1)⌉.

Using v = 1 we make roughly εk errors by considering only persistent patterns for the final output, while the same batchwise threshold can give us the exact top-k as per Theorem 1. On the other hand, at v = m, the batchwise thresholds are higher (the algorithm will run faster) but the number of errors grows linearly with m. Note that the (ϕ, ε)-separation needed for v = 1 is much higher than for v = m. The case of v = ⌈(m+1)/2⌉ lies in-between, with roughly 4εk errors under reasonable separation conditions.

Fig. 3. The set of frequent patterns can be incrementally updated as new batches arrive: when a new batch Bs arrives, patterns are ranked by decreasing frequency around the threshold fks; some patterns of Fs−1 are no longer frequent (−) while new frequent patterns enter Fs (+).

5. Algorithm

In this section we present an efficient algorithm for incrementally mining patterns with frequency at least (fks − θ) in batch Bs, for each batch in the stream. The threshold parameter θ is set to 2(m − v)∆ for mining (v, k)-persistent patterns (see Theorem 2) and to 2(m − 1)∆ for exact top-k mining (see Theorem 1).
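Before detailing the algorithm, the following skeleton (our own sketch; levelwise_mine and count_patterns are stand-ins for any black-box Apriori-style miner and counter) shows how the threshold θ is used per batch: the k-th highest frequency among the previous batch's frequent patterns estimates fks (Remark 1), the miner is run at the lowered threshold fks − θ, and the results feed the window counts. Algorithm 1 below refines exactly this loop with border sets so that the pattern lattice is updated incrementally instead of being mined from scratch.

    # Skeleton only (illustrative); reuses WindowCounts from the earlier sketch.
    def stream_topk(batches, k, l, m, v, delta, levelwise_mine, count_patterns,
                    exact=False):
        """levelwise_mine(batch, l, threshold) -> {pattern: freq} (black box);
        count_patterns(patterns, batch) -> {pattern: freq} (counting only)."""
        def kth_highest(freqs):
            vals = sorted(freqs.values(), reverse=True)
            return vals[k - 1] if len(vals) >= k else (vals[-1] if vals else 1)

        theta = 2 * (m - 1) * delta if exact else 2 * (m - v) * delta
        window, prev = WindowCounts(m), None
        for batch in batches:
            if prev is None:                       # first batch: mine from scratch
                freqs = levelwise_mine(batch, l, threshold=1)
            else:                                  # recount last batch's patterns
                freqs = count_patterns(prev.keys(), batch)
            f_k = kth_highest(freqs)               # estimate of f_k^s
            prev = levelwise_mine(batch, l, threshold=max(f_k - theta, 1))
            window.add_batch(prev)
            yield window.top_k(k)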
A trivial (brute-force) algorithm to find all patterns with frequency greater than (fks −θ) in Bs is as follows: Apply an Apriori-based pattern mining algorithm (e.g. (Mannila et al., 1997; Achar et al., 2012)) on batch Bs . If the number of patterns of size-` is less than k, the support threshold is decreased and the mining repeated until at least k `-size patterns are found. At this point fks is known. The mining process is then repeated once more with the frequency threshold (fks − θ). Doing this entire procedure for every new batch is expensive and wasteful. After seeing the first batch of the data, whenever a new batch arrives we have information about the patterns that were frequent in the previous batch. This can be exploited to incrementally and efficiently update the set of frequent patterns in the new batch. The intuition is that the frequencies of most patterns do not change much from one batch to the next. As a result only a small number of previously frequent patterns fall below the new support threshold in the new batch; similarly, some new patterns may become frequent. This is illustrated in Figure 3. In order to efficiently find these sets of patterns, we need to maintain additional information that allows us to avoid full-blown candidate generation in each batch. We show that this state information is a by-product of an Aprioribased algorithm and therefore any extra processing is unnecessary. Frequent patterns are discovered level-wise, in ascending order of pattern-size. An Apriori procedure alternates between counting and candidate generation. First a set C i of candidate i-size patterns is determined by combining frequent (i − 1)-size patterns from the previous level. Then the data is scanned for deter- A General Streaming Algorithm for Pattern Discovery 15 mining frequencies of the candidates, from which the frequent i-size patterns are obtained. We note that all candidate patterns that are not frequent constitute the negative border of the frequent pattern lattice (Aumann et al., 1999). This is because, a candidate is generated only when all its subpatterns are frequent. The usual approach is to discard the border. For our purposes, the border contains information required to identify changes in the frequent patterns from one batch to the next3 . The pseudocode for incrementally mining frequent patterns in batches is listed in Algorithm 1. The inputs to the algorithm are: (i) Number k of top patterns desired, (ii) New incoming batch Bs of records in the stream, (iii) Lat∗ ∗ tice of frequent (Fs−1 ) and border patterns (Bs−1 ) from previous batch, and (iv) threshold parameter θ. Frequent and border patterns of size-i, with respect to frequency threshold fks − θ, are denoted by the sets Fsi and Bsi respectively (Fs∗ and Bs∗ denote the corresponding sets for all pattern sizes). For the first batch B1 (lines 1–3) the top-k patterns are found by a bruteforce method, i.e., by mining with a progressively lower frequency threshold until at least k patterns of size ` are found. Once fk1 is determined, F1∗ and B1∗ are obtained using an Apriori procedure. For subsequent batches, we do not need a brute-force method to determine ` fks . Based on Remark 1, if θ ≥ 2∆, Fs−1 from batch Bs−1 contains every potential top-k pattern of the next batch Bs . Therefore simply updating the counts of all ` patterns in Fs−1 in the new batch Bs and picking the k th highest frequency gives s fk (lines 4–6). 
To update the lattice of frequent and border patterns (lines 7– 24) the procedure starts from the bottom (size-1 patterns). The data is scanned to determine the frequency of new candidates together with the frequent and border patterns from the lattice (line 11). In the first level (patterns of size 1), ` the candidate set is empty. After counting, the patterns from Fs−1 that continue ` to be frequent in the new batch are added to Fs (lines 13–14). But if a pattern is no longer frequent it is marked as a border set and all its super patterns are deleted (lines 15–17). This ensures that only border patterns are retained in the lattice. In the border and new candidate sets, any pattern that is found to be i (lines 18–21). The remaining infrequent frequent is added to both Fsi and Fnew patterns belong to the border because, otherwise, they would have at least one infrequent subpattern and would have been deleted at a previous level; hence, these infrequent patterns are added to Bs` (lines 22–23). Finally, the candidate generation step (line 24) is required to fill out the missing parts of the frequent pattern lattice. We want to avoid a full blown candidate generation. Note that if a pattern is frequent in Bs−1 and Bs then all its subpat` ` terns are also frequent in both Bs and Bs−1 . Any new pattern (6∈ Fs−1 ∪ Bs−1 ) that turns frequent in Bs , therefore, must have at least one subpattern that was not frequent in Bs−1 but is frequent in Bs . All such patterns are listed in i Fnew . Hence the candidate generation step (line 24) for the next level generates i only candidates with at least one subpattern ∈ Fnew . This reduces the number of candidates generated at each level without compromising completeness of the results. The space and time complexity of candidate generation is now i i O(|Fnew |.|Fsi |) instead of O(|Fsi |2 ) and in most practical cases |Fnew |  |Fsi |. 3 Border sets were employed by (Aumann et al., 1999) for efficient mining of dynamic databases. Multiple passes over older data are needed for any new frequent itemsets, which is not feasible in a streaming context. 16 D. Patnaik et al Algorithm 1 Persistent pattern miner: Mine top-k v-persistent patterns. Input: Number k of top patterns; New batch Bs of events; ∗ , B∗ Current lattice of frequent & border patterns (Fs−1 s−1 ); Threshold parameter θ (set to 2(m − v)∆ for (v, k)-persistence, 2(m − 1)∆ for exact top-k) Output: Lattice of frequent and border patterns (Fs∗ , Bs∗ ) after Bs 1: if s = 1 then 2: Determine fk1 (brute force) 3: Compute (F1∗ , B1∗ ) ← Apriori(B1 , `, fk1 − θ) 4: else ` 5: CountFrequency(Fs−1 , Bs ) s ` 6: Determine fk (based on patterns in Fs−1 ) 1 7: Initialize C ← φ (new candidates of size 1) 8: for i ← 1, . . . , ` do 9: Initialize Fsi ← φ, Bsi ← φ i 10: Initialize Fnew ← φ (new frequent patterns of size i) i i 11: CountFrequency(Fs−1 ∪ Bs−1 ∪ C i , Bs ) i 12: for α ∈ Fs−1 do 13: if f s (α) ≥ fks − θ then 14: Update Fsi ← Fsi ∪ {α} 15: else 16: Update Bsi ← Bsi ∪ {α} ∗ , B∗ 17: Delete super-patterns of α from (Fs−1 s−1 ) i 18: for α ∈ Bs−1 ∪ C i do 19: if f s (α) ≥ fks − θ then 20: Update Fsi ← Fsi ∪ {α} i i 21: Update Fnew ← Fnew ∪ {α} 22: else 23: Update Bsi ← Bsi ∪ {α} i 24: C i+1 ← GenerateCandidate(Fnew , Fsi ) 25: return (Fs∗ , Bs∗ ) Later in the experiments section, we show how border-sets help our algorithm run very fast. For a window Ws ending in the batch Bs , the set of output patterns can be obtained by picking the top-k most frequent patterns from the set Fs` . 
Each pattern also maintains a list that stores its batch-wise counts in the last m batches. The window frequency is obtained by adding these entries together. The output patterns are listed in decreasing order of their window counts.

Example 2. In this example we illustrate the procedure for incrementally updating the frequent patterns lattice as a new batch Bs is processed (see Figure 4). Figure 4(A) shows the lattice of frequent and border patterns found in the batch Bs−1 (the figure does not show all the subpatterns at each level, only some of them). ABCD is a 4-size frequent pattern in the lattice. In the new batch Bs, the pattern ABCD is no longer frequent. The pattern CDXY appears as a new frequent pattern. The pattern lattice in Bs is shown in Figure 4(B). In the new batch Bs, AB falls out of the frequent set. AB now becomes the new border and all its super-patterns, namely ABC, BCD and ABCD, are deleted from the lattice. At level 2, the border pattern XY turns frequent in Bs. This allows us to generate DXY as a new 3-size candidate. At level 3, DXY is also found to be frequent and is combined with CDX, which is also frequent in Bs, to generate CDXY as a 4-size candidate. Finally, at level 4, CDXY is found to be frequent. This shows that border sets can be used to fill out the parts of the pattern lattice that become frequent in the new data.

Fig. 4. Incremental lattice update for the next batch Bs given the lattice of frequent and border patterns in Bs−1: panel (A) shows the four-level lattice for Bs−1 (with, e.g., ABCD frequent and XY in the border sets); panel (B) shows the updated lattice for Bs, where AB has moved to the border, its super-patterns have been deleted, and XY, DXY and CDXY have become frequent.

5.1. Estimating ∆ dynamically

The parameter ∆ in the bounded rate change assumption is a critical parameter in the entire formulation. Unfortunately, the choice of the correct value for ∆ is highly data-dependent. In the streaming setting, the characteristics of the data can change over time. Hence one predetermined value of ∆ cannot be provided in any intuitive way. Therefore we estimate ∆ from the frequencies of ℓ-size patterns in consecutive batches. We compute the differences in frequencies of patterns that are common to consecutive batches. Specifically, we consider the value at the 75th percentile as an estimate of ∆. We avoid using the maximum change as it tends to be noisy. A few patterns exhibiting large changes in frequency can skew the estimate and adversely affect the mining procedure.

6. Results

6.1. Experimental Setup

We show experimental results for mining two different pattern classes, namely, episodes and itemsets. For episode mining we have one synthetic and two real data streams, from two very different domains: experimental neuroscience and telecom networks. In neuroscience, we consider (1) synthetic neuronal spike trains based on mathematical models of interconnected neurons, with each neuron modeled as an inhomogeneous Poisson process (Achar et al., 2012), and
(2) real neuronal spiking activity from dissociated cortical cultures, recorded using multi-electrode arrays at Steve Potter's lab, Georgia Tech (Wagenaar et al., 2006). The third kind of data we consider are call data records from a large European telecom network. Each record describes whether the associated call was voice or data, an international or local call, in-network or out-of-network, and of long or short duration. Pattern discovery over such data can uncover hidden, previously unknown trends in usage patterns, evolving risk of customer churn, etc. In the case of itemsets we present results on two publicly available datasets: Kosarak and T10I4D100K. The Kosarak dataset contains (anonymized) clickstream data of a Hungarian on-line news portal, while the T10I4D100K dataset was generated using the itemset simulator from the IBM Almaden Quest research group. Both can be downloaded from the Frequent Itemset Mining Dataset Repository http://fimi.ua.ac.be/data. The data length in terms of number of events or transactions and the alphabet size of each of these datasets is shown in Table 2(a). Table 2(b) gives the list of parameters and the values used in experiments that we do not explicitly mention in the text.

Table 2. Experimental setup.
(a) Datasets used in the experiments.
Dataset        Alphabet-size   Data-length   Type
Call-logs                  8    42,030,149   Events
Neuroscience              58    12,070,634   Events
Synthetic                515    25,295,474   Events
Kosarak               41,270       990,000   Itemsets
T10I4D100K               870       100,000   Itemsets
(b) Parameter set-up.
Parameter                                                                   Value
k (in Top-k patterns)                                                       25
Number of batches in a window, m                                            10
Batch-size (as number of events)                                            10^6
Error parameter (applicable to Chernoff-based and Lossy counting methods)   0.001
Parameter v (applicable to Persistence miner)                               m/2

We compare the persistent pattern miner against two methods from the itemset mining literature (Wong and Fu, 2006): one that fixes batchwise thresholds based on Chernoff bounds under an i.i.d. assumption over the event stream, and a second based on a sliding-window version of the Lossy Counting algorithm (Manku and Motwani, 2002). These methods were designed for frequent itemset mining and therefore we adapt them suitably for mining episodes. We modify the support threshold computation using the Chernoff bound for episodes, since the total number of distinct episodes is bounded by the size of the largest transaction, while for itemsets it is the alphabet size that determines this. As mentioned earlier, we are not aware of any streaming algorithms that directly address top-k episode mining over sliding windows of data.

Estimating ∆ dynamically: ∆ is a critical parameter for our persistent pattern miner. Unfortunately, the choice of the correct value is highly data-dependent and the characteristics of the data can change over time. One predetermined value of ∆ cannot be provided in any intuitive way. Therefore we estimate ∆ from the frequencies of ℓ-size patterns in consecutive batches by computing the change in frequency of patterns and using the 75th percentile as the estimate. We avoid using the maximum change as it tends to be noisy.

6.2. Quality of top-k mining

We present aggregate comparisons of the three competing methods in Table 3. These datasets provide different levels of difficulty for the mining algorithms.

Mining episodes: Tables 3(a) & 3(c) show high f-scores4 for the synthetic and real neuroscience data for all methods (our method performs best in both cases). On the other hand, we find that all methods report very low f-scores on the call-logs data (see Table 3(b)). The characteristics of this data do not allow one to infer window top-k from batches (using limited computation). But our proposed method nearly doubles the f-score with identical memory and CPU usage on this real dataset.
It is worth mentioning that the competing methods reported close to 100% accuracies but do not perform that well on more realistic datasets. In the case of the synthetic data (see Table 3(c)) the characteristics are very similar to those of the neuroscience dataset.

Mining itemsets: Tables 3(d) and 3(e) show that for itemset data all three competing methods perform identically with f-score = 100. But the run-times of Persistent pattern mining are significantly better than those of the other methods, with nearly the same or less memory requirement.

6.3. Computation efficiency comparisons

Table 3 shows that we do better than both competing methods in most cases (and never significantly worse than either) with respect to time and memory.

6.3.1. Effect of parameters on performance

In Figure 5, we see all three methods outperform the reference method (the standard multi-pass apriori based miner) by at least an order of magnitude in terms of both run-times and memory. In Figure 5(a)-(c), we show the effect of increasing k on all the methods. The accuracy of the Lossy-counting algorithm drops with increasing k, while that of the Chernoff based method and the Persistence miner remain unchanged. The Persistence miner has lower runtimes for all choices of k while having a memory footprint comparable to the other two methods. With increasing window-size (m = 5, 10, 15 and batch-size = 10^6 events), we observe better f-scores for the Persistence miner, but this increase is not significant enough and can be caused by data characteristics alone. The runtimes and memory of the Persistence miner remain nearly constant. This is important for streaming algorithms, as the runtimes and memory of the standard multi-pass algorithm increase (roughly) linearly with window size.

4 f-score = 2 · (precision · recall)/(precision + recall)

Table 3. Aggregate performance comparison of different algorithms.
(a) Dataset: Neuroscience, Size of patterns = 4, Pattern class: Episodes
Top-k Miner             F Score   Runtime (sec)   Memory (MB)
Chernoff-bound based      92.0           456.0        251.39
Lossy-counting based      92.0          208.25         158.5
Persistent pattern       97.23          217.64         64.51
(b) Dataset: Call-logs, Size of patterns = 6, Pattern class: Episodes
Top-k Miner             F Score   Runtime (sec)   Memory (MB)
Chernoff-bound based     32.17           11.87         66.14
Lossy-counting based     24.18            3.29         56.87
Persistent pattern       49.47            3.34          67.7
(c) Dataset: Synthetic data, Size of patterns = 4, Pattern class: Episodes
Top-k Miner             F Score   Runtime (sec)   Memory (MB)
Chernoff-bound based      92.7           14.91          43.1
Lossy-counting based      92.7            6.96          32.0
Persistent pattern        96.2            4.98         34.43
(d) Dataset: Kosarak, Size of patterns = 4, batch-size = 10,000 transactions, v = 9, m = 10, Pattern class: Itemsets
Top-k Miner             F Score   Runtime (sec)   Memory (MB)
Chernoff-bound based       100           32.18         31.85
Lossy-counting based       100           15.57         25.87
Persistent pattern         100            1.93         25.54
(e) Dataset: T10I4D100K, Size of patterns = 4, batch-size = 10,000 transactions, k = 10, v = 9, m = 10, Pattern class: Itemsets
Top-k Miner             F Score   Runtime (sec)   Memory (MB)
Chernoff-bound based       100          662.05        170.77
Lossy-counting based       100          335.71         87.86
Persistent pattern         100           73.38        116.27

6.3.2. Utility of Border Sets

For patterns with slowly changing frequency we show in Section 5 that using border-sets to incrementally update the frequent patterns lattice results in a candidate generation complexity of O(|Fnew^i| · |Fs^i|) instead of O(|Fs^i|^2), and in most practical cases |Fnew^i| ≪ |Fs^i|. Table 4 demonstrates the speedup in runtime achieved by using border-sets.
We implemented two versions of our Persistence miner. In one we utilize border-sets to incrementally update the lattice, whereas in the other we rebuild the frequent pattern lattice from scratch in every new batch. The same batch-wise frequency thresholds, as dictated by Theorem 2, were used in both. We ran the experiment on our call-logs dataset and the T10I4D100K dataset, and for various parameter settings of our algorithm we observed a speedup of ≈ 4× resulting from the use of border-sets.

Fig. 5. Performance with different parameters (Dataset: Call logs): (a) F-Score vs. k, (b) Runtime vs. k, (c) Memory vs. k, (d) F-Score vs. m, (e) Runtime vs. m, (f) Memory vs. m, comparing the standard multi-pass miner, the Chernoff-bound based and Lossy-counting based methods, and the Persistence miner.

6.4. Adapting to Dynamic data

In this experiment we show that the Persistence miner adapts faster than the competing algorithms to changes in the underlying data characteristics. We demonstrate this using synthetic data generated with the multi-neuronal simulator based on (Achar et al., 2012). The simulation model was adapted to update the connection strengths dynamically while generating synthetic data. This allowed us to change the top-k episode patterns slowly over the length of the simulated data. We embedded 25 randomly picked episodes with time-varying arrival rates. The arrival rate of each episode was changed by varying the connection strength between consecutive event types in the episode from 0.1 to 0.9 as a ramp function of time. By setting a different periodicity of the ramp function for each episode we were able to change the top-k frequent episodes over time. Figure 6(a) shows the performance of the different methods in terms of the f-score computed after the arrival of each new batch of events, but for the top-k patterns over the window. The ground truth is again the output of the standard multi-pass apriori algorithm that is allowed access to the entire window of events. The f-score curves of both competing methods are almost always below that of the Persistence miner, while the runtimes for the Persistence miner are always lower than those of the competing methods (see Figure 6(b)).

Fig. 6. Comparison of different methods in tracking top-k patterns over a dynamically changing event stream: (a) f-score comparison and (b) runtime, plotted against sliding window position (Parameters used: k = 25, m = 10, Persistence miner: v = 0.5, competing methods: error = 1.0e-4).

Table 4. Utility of border set.
(a) Dataset: Call-logs, Size of patterns = 6, Parameter v = m/2, Pattern class: Episodes
Window size   Runtime (border-set)   Runtime (no-border-set)   Memory (border-set)   Memory (no-border-set)
5                             2.48                     12.95                 67.55                   66.21
10                            3.13                     11.41                  67.7                   66.67
15                            3.47                     13.76                 67.82                   67.02
(b) Dataset: Call-logs, Size of patterns = 6, window size m = 10, Pattern class: Episodes
Parameter v   Runtime (border-set)   Runtime (no-border-set)   Memory (border-set)   Memory (no-border-set)
0                             2.78                     11.98                  67.7                   66.67
5                             3.13                     11.41                  67.7                   66.67
10                            3.21                     10.85                 67.69                    57.5
(c) Dataset: T10I4D100K, Size of patterns = 4, window size m = 10, Pattern class: Itemsets
Parameter v   Runtime (border-set)   Runtime (no-border-set)   Memory (border-set)   Memory (no-border-set)
9                            73.38                    285.46                116.27                   79.89

The Lossy-counting based method is the slowest at an error parameter of 0.0001. The effective batch-wise thresholds of both the Lossy counting and Chernoff-bound based algorithms were very similar, leading to identical performance of the two competing methods in this experiment. The main reason for the better tracking by the Persistence miner is that the output of the algorithm filters out all non-(v, k)-persistent patterns. This acts in favor of the Persistence miner, as the patterns most likely to gather sufficient support to be in the top-k are also likely to be persistent. The use of border-sets in the Persistence miner explains the lower runtimes.

6.5. Correlation of f-score with theoretical error

Can we compute theoretical errors (a data-dependent quantity) and gauge how well we perform? This could be invaluable in a real-world streaming data setting. In this experiment we try to establish the usefulness of the theoretical analysis proposed in the paper. The main power of the theory is its ability to predict the error in the output set at the end of each batch. Unlike the other methods, we compute the error bounds using the data characteristics, and they are dynamically updated as new data arrives. The error guarantees of both the Lossy counting and Chernoff based methods are static. In Figure 7, we plot the error bound obtained using Theorem 3 and the f-score computed with respect to the reference method (standard multi-pass apriori) in a 2D histogram. According to the theory, different pairs of (ϕ, ε) yield a different error bound in every batch. In our experiment we pick the smallest error bound over the allowable range of ϕ (and the corresponding ε) in each batch and plot it against the corresponding f-score. The histogram is expected to show a negative correlation between f-score and our predicted error bound, i.e., the predicted error for high-f-score top-k results should be low and vice versa. The correlation is not very evident in the plot. The histogram shows higher density in the top-left part of the plot, which is a mild indication that a high f-score has a correspondingly low predicted error.

Fig. 7. 2D histogram of predicted error (x-axis) vs. f-score (y-axis). (Dataset: Call logs)
7. Related Work

Traditionally, data stream management systems (DSMSs) (Babcock et al., 2002) have been used mainly for detecting patterns and conditions that were mined from offline data using traditional multi-pass frequent-pattern, frequent-itemset, sequential-pattern, and motif-detection techniques. However, as real-time data acquisition becomes ubiquitous, there is a growing need to derive insights directly from high-volume streams, at low latency and under limited memory, using a DSMS (Chandramouli et al., 2012).

Mining Itemsets over Streams. The literature on streaming algorithms for pattern discovery is dominated by techniques from the frequent itemset mining context (Manku and Motwani, 2002; Karp et al., 2003; Chang and Lee, 2003; Chang and Lee, 2004; Jin and Agrawal, 2005; Wong and Fu, 2006; Calders et al., 2007; Cheng et al., 2008; Lam et al., 2011), and we are unaware of any algorithms for pattern mining over event streams that possess the generality of the mining approach presented here. Manku and Motwani (Manku and Motwani, 2002) proposed the lossy counting algorithm for approximate frequency counting in the landmark model (i.e., all events seen until the current time constitute the window of interest). One of its main drawbacks is that the algorithm must essentially operate at a threshold of ε to provide ε error guarantees, which is impractical in many real-world scenarios (a simplified sketch of this counting idea appears below, after the discussion of motif mining). Karp et al. (Karp et al., 2003) also propose a one-pass streaming algorithm for finding frequent items, and these ideas were extended to itemsets by Jin and Agrawal (Jin and Agrawal, 2005), but all these methods require even more space than lossy counting. Mendes et al. (Mendes, Ding and Han, 2008) extend the pattern growth algorithm (PrefixSpan) (Pei et al., 2001) for mining sequential patterns to incorporate the idea of lossy counting. Chang and Lee (Chang and Lee, 2004) and Wong and Fu (Wong and Fu, 2006) extend lossy counting to sliding windows and the top-k setting, respectively. New frequency measures for itemsets over streams have also been proposed (e.g., (Calders et al., 2007; Lam et al., 2011)), but these methods are heavily specialized toward the itemset context. There has also been renewed recent interest in adapting pattern matching (Chandramouli et al., 2010; Agrawal et al., 2008) (as opposed to mining) algorithms to the streaming context; it is not obvious how to extend these to accommodate patterns in a manner that supports both candidate generation and counting.

Mining Time-Series Motifs. An online algorithm for mining time-series motifs is proposed by Mueen and Keogh (Mueen and Keogh, 2010). The algorithm uses an interesting data structure to find a pair of approximately repeating subsequences in a window, using the Euclidean distance measure to assess the similarity of the motif sequences. Unfortunately, this notion does not extend naturally to discrete patterns. Further, this motif mining formulation does not explicitly make use of a support or frequency threshold, and it returns exactly one pair of motifs, namely the pair that is closest in terms of distance.
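For reference, the core of the lossy counting idea for single items can be sketched as follows. This is a simplified illustration of the counting phase only, not the full algorithm of Manku and Motwani (2002) and not our method; the item-level version is shown because it is the simplest to state.

```python
import math

def lossy_count(stream, epsilon):
    """Simplified lossy counting over a stream of items (after Manku and Motwani, 2002).

    Maintains (count, delta) per tracked item, where delta bounds the undercount
    incurred before the item was first inserted; entries that cannot reach a
    frequency of epsilon * N are pruned at each bucket boundary.
    """
    bucket_width = int(math.ceil(1.0 / epsilon))
    counts = {}                    # item -> [count, delta]
    n = 0
    current_bucket = 1
    for item in stream:
        n += 1
        if item in counts:
            counts[item][0] += 1
        else:
            counts[item] = [1, current_bucket - 1]
        if n % bucket_width == 0:  # bucket boundary: prune infrequent entries
            for it in list(counts):
                count, delta = counts[it]
                if count + delta <= current_bucket:
                    del counts[it]
            current_bucket += 1
    return counts, n

# Items with count >= (support - epsilon) * n are reported as frequent,
# with an undercount of at most epsilon * n.
```

The threshold behavior criticized above is visible here: to guarantee an error of ε, the structure must track items down to a frequency of ε, which is what makes very small ε impractical.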
Sequential Patterns and Episode Mining. An episode, or more generally a partial-order pattern, can be thought of as a generalization of an itemset in which each item is not confined to occur within the same transaction (i.e., at the same time tick) and there is additional structure in the form of an ordering over the events or items. In serial patterns, events must occur in exactly one particular order, whereas partial-order patterns allow multiple orderings; in addition, a pattern may contain repeated event-types. This loosely coupled structure of events in a pattern results in a narrower separation between the frequencies of true and noisy patterns (i.e., those resulting from random co-occurrences of events) and quickly leads to a combinatorial explosion of candidates when mining at low frequency thresholds. Most of the itemset literature does not deal with the problem of candidate generation; the focus is on counting rather than on efficient candidate generation schemes. While pattern mining in offline multi-pass scenarios has been researched (Laxman, 2006; Achar et al., 2012), this paper is the first to explore ways of doing both counting and candidate generation efficiently in a streaming setting. We devise algorithms that can operate at frequency thresholds as high as possible while still providing guarantees on the frequent patterns they output.

8. Conclusions

While pattern mining in offline multi-pass scenarios has been researched (Agrawal and Srikant, 1994; Mannila et al., 1997; Achar et al., 2012), this paper is the first to explore ways of doing both counting and candidate generation efficiently in a streaming setting. The key ingredients of our approach are the parameterization of the streaming algorithm using data-driven quantities, such as the rate of change and the top-k separation. This leads to algorithms that can operate at high frequency thresholds with theoretical guarantees on the quality of the top-k approximation. Further, our algorithms employ border sets in a novel way, ensuring minimal computational redundancy as we process adjacent (potentially similar) batches. Experiments demonstrate the effectiveness of our algorithms for two popular pattern classes, namely itemsets and episodes. The approach presented here is applicable to any general pattern class provided we have black-box access to a level-wise mining algorithm for that class.

Acknowledgements

This work is supported in part by US National Science Foundation grants IIS-0905313 and CCF-0937133, and by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center (DoI/NBC) contract number D12PC000337. The US Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the US Government.

References

Achar, A. et al. (2012), 'Discovering injective episodes with general partial orders', Data Mining and Knowl. Dsc. 25(1), 67–108.
Agrawal, J. et al. (2008), Efficient pattern matching over event streams, in 'Proc. of the 2008 ACM SIGMOD Intl. Conf. on Management of Data', SIGMOD '08, pp. 147–160.
Agrawal, R. and Srikant, R. (1994), Fast algorithms for mining association rules in large databases, in 'Proceedings of the 20th International Conference on Very Large Data Bases', pp. 487–499.
Aumann, Y. et al. (1999), 'Borders: An efficient algorithm for association generation in dynamic databases', J. of Intl. Info. Syst. (JIIS) (1), 61–73.
Babcock, B. et al. (2002), Models and issues in data stream systems, in 'Proc. of the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems', pp. 1–16.
Calders, T. et al. (2007), Mining frequent itemsets in a stream, in 'Proc. of the 7th IEEE Intl. Conf. on Data Mining (ICDM)', pp. 83–92.
Chandramouli, B. et al. (2010), 'High-performance dynamic pattern matching over disordered streams', Proc. VLDB Endow. 3(1-2), 220–231.
Chandramouli, B. et al. (2012), Temporal analytics on big data for web advertising, in 'Proc. of the Intl. Conf. on Data Engg. (ICDE)'.
Chang, J. H. and Lee, W. S. (2003), Finding recent frequent itemsets adaptively over online data streams, in 'Proc. of the 9th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining (KDD)', pp. 487–492.
Chang, J. H. and Lee, W. S. (2004), 'A sliding window method for finding recently frequent itemsets over online data streams', J. Inf. Sci. Eng. 20(4), 753–762.
Cheng, J. et al. (2008), 'A survey on algorithms for mining frequent itemsets over data streams', Knowl. and Info. Systems 16(1), 1–27.
Jin, R. and Agrawal, G. (2005), An algorithm for in-core frequent itemset mining on streaming data, in 'Proc. of the 5th IEEE Intl. Conf. on Data Mining (ICDM)', pp. 210–217.
Karp, R. M. et al. (2003), 'A simple algorithm for finding frequent elements in streams and bags', ACM Trans. Database Syst. 28, 51–55.
Lam, H. T. et al. (2011), Online discovery of top-k similar motifs in time series data, in 'SIAM Intl. Conf. on Data Mining', pp. 1004–1015.
Laxman, S. (2006), Discovering frequent episodes: Fast algorithms, connections with HMMs and generalizations, PhD thesis, IISc, Bangalore, India.
Laxman, S. and Sastry, P. S. (2006), 'A survey of temporal data mining', Sādhanā, Academy Proceedings in Engineering Sciences 31, 173–198.
Manku, G. S. and Motwani, R. (2002), Approximate frequency counts over data streams, in 'Proc. of the 28th Intl. Conf. on Very Large Data Bases (VLDB)', pp. 346–357.
Mannila, H. et al. (1997), 'Discovery of frequent episodes in event sequences', Data Mining and Knowl. Dsc. 1(3), 259–289.
Mendes, L., Ding, B. and Han, J. (2008), Stream sequential pattern mining with precise error bounds, in 'Proc. of the 8th IEEE Intl. Conf. on Data Mining (ICDM)', pp. 941–946.
Mueen, A. and Keogh, E. (2010), Online discovery and maintenance of time series motifs, in 'Proc. of the 16th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining (KDD)'.
Muthukrishnan, S. (2005), 'Data streams: Algorithms and applications', Foundations and Trends in Theoretical Computer Science 1(2).
Patnaik, D., Laxman, S. and Ramakrishnan, N. (2009), Discovering excitatory networks from discrete event streams with applications to neuronal spike train analysis, in 'Proc. of the 9th IEEE Intl. Conf. on Data Mining (ICDM)'.
Patnaik, D., Marwah, M., Sharma, R. and Ramakrishnan, N. (2009), Sustainable operation and management of data center chillers using temporal data mining, in 'Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining', pp. 1305–1314.
Patnaik, D. et al. (2012), 'Streaming algorithms for pattern discovery over dynamically changing event sequences', CoRR abs/1205.4477.
Pei, J. et al. (2001), PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth, in 'Proc. of the 17th Intl. Conf. on Data Engg. (ICDE)', pp. 215–224.
Ramakrishnan, N., Patnaik, D. and Sreedharan, V. (2009), 'Temporal process discovery in many guises', IEEE Computer 42(8), 97–101.
Wagenaar, D. A. et al. (2006), 'An extremely rich repertoire of bursting patterns during the development of cortical cultures', BMC Neuroscience.
Wang, J. et al. (2005), 'TFP: An efficient algorithm for mining top-k frequent closed itemsets', IEEE Trans. on Knowledge and Data Engg. 17(5), 652–664.
Wong, R. C.-W. and Fu, A. W.-C. (2006), 'Mining top-k frequent itemsets from data streams', Data Mining and Knowl. Dsc. 13, 193–217.
Yan, X. and Han, J. (2003), CloseGraph: Mining closed frequent subgraph patterns, in 'Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'03)'.

Author Biographies

Debprakash Patnaik is currently an engineer in the Search and Discovery group at Amazon.com. He completed his Ph.D. in Computer Science at Virginia Tech, USA. He received his bachelor's degree from Veer Surendra Sai University of Technology, Orissa, India, in 2002 and his master's degree from the Indian Institute of Science, Bangalore, in 2006. He has also worked as a researcher in the Diagnosis and Prognosis group at General Motors Research, Bangalore. His research interests include temporal data mining, probabilistic graphical models, and applications in neuroscience.

Srivatsan Laxman is a researcher at Microsoft Research India, Bangalore, and is associated with two research areas, Machine Learning & Optimization and Security & Privacy. He works broadly in data mining, machine learning, and data privacy. Srivatsan received his Ph.D. in Electrical Engineering in 2006 from the Indian Institute of Science, Bangalore.

Badrish Chandramouli is a researcher in the database group at Microsoft Research. He is broadly interested in databases and distributed systems, with particular interests in stream processing and temporal analytics, cloud computing, database query processing and optimization, and publish/subscribe systems. Badrish works primarily on the streams project at Microsoft Research, Redmond, which ships commercially as Microsoft StreamInsight.

Naren Ramakrishnan is the Thomas L. Phillips Professor of Engineering at Virginia Tech. He works with domain scientists and engineers to develop software systems for scientific knowledge discovery. Application domains in his research have spanned protein design, biochemical networks, neuro-informatics, plant physiology, wireless communications, and sustainable design of data centers. Ramakrishnan received his Ph.D. in computer sciences from Purdue University. He is an ACM Distinguished Scientist.

Correspondence and offprint requests to: Debprakash Patnaik, Amazon.com, WA 98109, USA. Email: patnaikd@amazon.com