Data Stream Mining: A Review of Learning Methods and Frameworks

Svitlana Volkova
Center for Language and Speech Processing
Johns Hopkins University
svitlana@cs.jhu.edu

October 12, 2012

Abstract

The goal of this paper is to review methods, algorithms and frameworks for processing and analyzing real-time data streams. We first give an overview of classic streaming algorithms and show how they can be applied to solve large-scale natural language processing (NLP) tasks. We next discuss differences between stream- and batch-mode learning and analyze two stream-based learning frameworks – Jerboa and MOA. Finally, we describe existing applications for learning from social media streams.

1 Introduction

Data stream mining is a relatively new research area that is, unfortunately, less studied than data mining and learning in the traditional batch mode. The majority of evaluations in the traditional batch mode use fewer than 1 million training examples. In contrast, stream-based learning models are designed to handle very large (potentially infinite) streams of examples. Moreover, demonstrating system performance on a small amount of data is not a convincing case for capacity when learning from recently emerged streaming applications, including processing massive text streams [Bif, 2012].

A variety of natural language processing tasks require making decisions in real time, at high speed, on a continuous flow of text data. One of the reasons for this growing demand is the rapid growth of social media services – most notably Twitter1 and Facebook2 – that allow mining useful, up-to-date information from public threads in real time. For example, real-time opinion mining and sentiment analysis in social media streams is becoming increasingly popular [Pang & Lee, 2008]. Other NLP tasks that require processing text streams in real time include named entity disambiguation [Sarmento et al., 2009] and statistical machine translation [Levenberg et al., 2011].
In this paper we review the existing algorithms for processing text streams and give a comprehensive analysis of two stream-based learning frameworks. We structure the paper as follows: in Section 2 we review classic streaming algorithms and show examples of applying these algorithms to solve large-scale NLP tasks; in Section 3 we discuss stream-based learning, where we first enumerate the differences between stream and batch learning, then present two stream-based learning frameworks and, finally, give examples of stream-based learning from social media streams.

1 www.twitter.com
2 www.facebook.com

2 Classic Streaming Algorithms for NLP Tasks

A data stream is a real-time, continuous, ordered sequence of items σ. As defined in [Golab & Özsu, 2003], it is impossible to control the order in which items arrive or to store a stream in its entirety. The goal of streaming algorithms is to compute some function of a massively long sequence of items (tokens) σ = <a1, a2, a3, ..., am> under limited time and space constraints. The foundations of streaming algorithms and a comprehensive overview of streaming systems and applications are given in [Muthukrishnan, 2005]. General-purpose streaming algorithms – finding frequent items deterministically (the Misra-Gries algorithm) and via sketching (the count-min algorithm), estimating the number of distinct items using hashing (the BJKST algorithm), and graph streaming algorithms – are discussed in detail in [Chakrabarti, 2009] as part of his streaming course at Dartmouth College. Other streaming courses with excellent notes are taught by Indyk at MIT [Indyk, 2007] and McGregor at UMass Amherst [McGregor, 2011].
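To make the flavor of these algorithms concrete, the Misra-Gries frequent-items algorithm fits in a few lines: it guarantees that any item occurring more than m/k times in a stream of m items survives among at most k-1 counters. The sketch below is our own minimal illustration (the function and variable names are ours, not taken from the cited course notes):

```python
def misra_gries(stream, k):
    """Misra-Gries frequent items: any item occurring more than
    len(stream)/k times is guaranteed to appear among the counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement every counter; drop counters that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# 'a' occurs 5 of 9 times, i.e. more than 9/3, so with k=3
# it is guaranteed to survive as a candidate.
tokens = "a b a c a b a d a".split()
print(misra_gries(tokens, 3))
```

Note that surviving counters are only candidates: their counts are underestimates, and a second pass (or a sketch such as count-min) is needed to verify exact frequencies.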
Finally, a recent ACM publication in the XRDS magazine gives a concise and up-to-date overview of sketching and streaming algorithms for processing massive data and presents classic streaming algorithms such as probabilistic counting (Morris' algorithm), frequent items (the MJRTY algorithm), and linear sketches [Nelson, 2012].

Classic streaming algorithms are designed to process many kinds of data streams for a variety of applications, including large-scale scientific experiments, search engines, online content delivery, and consumer tracking for large retailers. Although classic streaming algorithms are not designed to learn and predict latent attributes from the data stream, they can be reused to reduce the space and time requirements of large-scale NLP tasks, as shown in the examples below:

- Goyal et al. applied the classic streaming algorithm from [Manku & Motwani, 2002], which computes approximate frequency counts of frequently occurring n-grams, to a large-scale SMT task by constructing high-order approximate n-gram language models over billions of words [Goyal et al., 2009].

- Goyal and Daumé III explored sketching techniques, specifically the count-min sketch algorithm from [Cormode & Muthukrishnan, 2005], to solve three important NLP tasks: predicting word semantic orientations, distributional approaches to word similarity, and unsupervised dependency parsing [Goyal & Daumé, 2011].

More examples of streaming and randomized algorithms, as well as stream-based learning approaches and frameworks designed to solve large-scale NLP tasks, are discussed in Section 3.

3 Learning from Streaming Text Data

3.1 Stream vs. Batch Learning

State-of-the-art machine learning algorithms for clustering and classification use batch procedures and are not designed to make predictions in streaming mode due to the following assumptions [Bishop, 2006]:

a) examples are independent and identically distributed (i.i.d.)
random variables;

b) examples are randomly generated from a stationary probability distribution D.

Real-time decision models are dynamic models that use unbounded training sets where training examples are not i.i.d. To make decisions in real time, learning algorithms must have the following capabilities [Gama et al., 2009]:

a) deal with data whose distribution changes over time;
b) track changes and update the decision model accordingly;
c) incorporate new data at the speed it arrives;
d) use limited resources – time and memory.

3.2 Stream-based Learning Frameworks

In this section we analyze two frameworks3 for stream-based clustering and classification of massive data.

1. Jerboa is a toolkit of randomized and streaming algorithms for large-scale language mining tasks developed by Benjamin Van Durme [Van Durme, 2012a]. The Jerboa toolkit uses a variety of efficient filtering, sampling and counting techniques, well known in the database community, for processing streaming data: Bloom filters [Bloom, 1970] and reservoir sampling [Vitter, 1985] to process data quickly and efficiently; Morris-style [Morris, 1978], Talbot Osborne Morris Bloom (TOMB) and reservoir counters [Van Durme & Lall, 2011] to keep track of very large key sets of n-gram types; and Locality Sensitive Hashing (LSH) methods [Van Durme & Lall, 2010] to reduce memory use. The Jerboa toolkit was used effectively for streaming analysis of discourse participants [Van Durme, 2012b]: a standard bag-of-words (BOW) classification model for predicting user gender from user utterances is decomposed into a series of streaming updates, and the reservoir counting algorithm [Van Durme & Lall, 2011] is used as an approximation method that compresses the memory requirements of the classifier by 75% with no significant cost in accuracy.

2.
MOA (Massive Online Analysis) is an open-source software environment for implementing algorithms and running experiments for online learning from massive, potentially infinite, evolving data streams [Bifet et al., 2010]. The figure shows the typical use of a stream classification algorithm: (1) read the next available example from the stream (req. 1); (2) process the example and update the data structures within the memory (req. 2) and time (req. 3) bounds; (3) be ready to accept the next example and to predict the class of unseen examples (req. 4).

3 To the best of our knowledge, these are the only publicly available stream-based learning frameworks that contain state-of-the-art learning algorithms for streaming-mode prediction.

MOA permits evaluation of data stream learning algorithms on large streams, on the order of 10–100 million examples. MOA allows joining and filtering streams and contains several stream classification methods – Naive Bayes, Decision Stump, Hoeffding Tree, Bagging, Boosting – and clustering methods: StreamKM++, CluStream, ClusTree, DenStream, D-Stream and CobWeb [Bifet et al., 2010]. MOA was used effectively for detecting sentiment change in Twitter streaming data using a frequent item mining approach [Bifet et al., 2011] and other stream mining methods, including Multinomial Naive Bayes, Stochastic Gradient Descent and Hoeffding Tree [Bifet & Frank, 2010]. Finally, according to Bifet [Bif, 2012], another way to speed up the training of streaming learners is to distribute the training process, for example using the Hadoop MapReduce4 programming model to process data in parallel on large clusters. The Apache S4 platform [Neumeyer et al., 2010], inspired by the MapReduce model, is designed to process data streams.

3.3 Learning from Social Media Streams

Many applications exist for analyzing the massive volumes of user-generated content (UGC) produced daily by microblogging and social networking services.
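The per-example cycle described in Section 3.2 (read the next example, update within resource bounds, stay ready to predict) can be sketched with an incremental Multinomial Naive Bayes learner, one of the methods MOA applies to Twitter sentiment streams [Bifet & Frank, 2010]. The code below is our own minimal Python illustration of that cycle, not MOA's actual API (MOA itself is written in Java), and the toy sentiment stream is invented for the example:

```python
from collections import defaultdict
import math

class StreamingNaiveBayes:
    """Multinomial Naive Bayes updated one example at a time:
    constant work per example and no stored training set."""
    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing
        self.class_counts = defaultdict(int)   # examples seen per class
        self.word_counts = defaultdict(lambda: defaultdict(int))
        self.total_words = defaultdict(int)    # tokens seen per class
        self.vocab = set()

    def update(self, tokens, label):
        # req. 2/3: bounded, incremental update of sufficient statistics.
        self.class_counts[label] += 1
        for t in tokens:
            self.word_counts[label][t] += 1
            self.total_words[label] += 1
            self.vocab.add(t)

    def predict(self, tokens):
        # req. 4: ready to classify unseen examples at any point.
        best, best_score = None, float("-inf")
        n = sum(self.class_counts.values())
        v = len(self.vocab) or 1
        for label, count in self.class_counts.items():
            score = math.log(count / n)
            denom = self.total_words[label] + self.smoothing * v
            for t in tokens:
                score += math.log(
                    (self.word_counts[label][t] + self.smoothing) / denom)
            if score > best_score:
                best, best_score = label, score
        return best

clf = StreamingNaiveBayes()
stream = [("great movie loved it".split(), "pos"),
          ("terrible outage awful service".split(), "neg"),
          ("loved the service".split(), "pos")]
for tokens, label in stream:       # req. 1: read the next example
    clf.update(tokens, label)
print(clf.predict("awful terrible movie".split()))  # → 'neg'
```

Because each example is discarded after updating the counts, memory grows only with the vocabulary and number of classes, which is what makes this family of learners usable on unbounded streams.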
In this section we review several examples of recently emerged online services for social media stream mining.

- Cloud4Trends is a system for analyzing social network trends and dynamics in real time using a clustering approach [Vakali et al., 2012]. The trend detection application has a 3-tier design: (1) data collection in streaming fashion; (2) application of an online clustering technique to the data to detect recent trends; and (3) refinement and ranking of clusters for data visualization. Cloud4Trends runs over the VENUS-C infrastructure5.

- SPOONS (Swift Perceptions of Online Negative Situations) is a system for detecting Netflix service outages via real-time stream analysis [Augustine et al., 2012]. The system does not perform multiclass classification in a classic streaming fashion; instead it incrementally retrains a variety of WEKA learners, including Naive Bayes, Bayes Net, J48 and KNN [Hall et al., 2009], on new batches of training data to predict Netflix service outages.

Due to space limitations, we cannot describe all streaming systems that process social media. However, we can enumerate the most popular services, including Google Hot Trends6, IceRocket7, BingSocial8, SocialMention9 and Topsy10.

4 http://hadoop.apache.org/
5 http://www.venus-c.eu
6 http://www.google.com/trends/hottrends
7 http://www.icerocket.com/popular/
8 http://www.bing.com/social
9 http://socialmention.com/trends/
10 http://topsy.com/

4 Summary

In this paper we reviewed learning methods and tools for mining massive, infinite data streams. We presented classic streaming algorithms and demonstrated how they can be reused to reduce space and processing time for large-scale NLP problems. We analyzed two stream-based learning frameworks – Jerboa and MOA – and described the sampling, counting and filtering techniques they use to reduce memory size and increase the efficiency of computations while learning from data streams in real time.
We also demonstrated the typical use of a stream classification algorithm in MOA and presented a set of classification and clustering algorithms designed to make predictions in streaming fashion on millions of examples. The major property of stream-based learning algorithms is processing data that arrives at high speed under very strict constraints of space and time. Finally, we described existing applications for analyzing massive data from social media streams – Cloud4Trends and SPOONS.

References

[Bif, 2012] (2012). Mining Big Data in Real Time. New York, NY, USA: ACM.

[Augustine et al., 2012] Augustine, E., Cushing, C., Dekhtyar, A., McEntee, K., Paterson, K., & Tognetti, M. (2012). Outage detection via real-time social stream analysis: leveraging the power of online complaints. In Proceedings of the 21st International Conference Companion on World Wide Web, WWW '12 Companion (pp. 13–22). New York, NY, USA: ACM.

[Bifet & Frank, 2010] Bifet, A. & Frank, E. (2010). Sentiment knowledge discovery in Twitter streaming data. In Proceedings of the 13th International Conference on Discovery Science, DS'10 (pp. 1–15). Berlin, Heidelberg: Springer-Verlag.

[Bifet et al., 2011] Bifet, A., Holmes, G., Pfahringer, B., & Gavaldà, R. (2011). Detecting sentiment change in Twitter streaming data. Journal of Machine Learning Research - Proceedings Track, 17, 5–11.

[Bifet et al., 2010] Bifet, A., Holmes, G., Pfahringer, B., Kranen, P., Kremer, H., Jansen, T., & Seidl, T. (2010). MOA: Massive Online Analysis, a framework for stream classification and clustering. Journal of Machine Learning Research - Proceedings Track, 11, 44–50.

[Bishop, 2006] Bishop, C. M. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc.

[Bloom, 1970] Bloom, B. H. (1970). Space/time trade-offs in hash coding with allowable errors. Commun. ACM, 13(7), 422–426.

[Chakrabarti, 2009] Chakrabarti, A. (2009).
Data Stream Algorithms. Lecture Notes. http://www.cs.dartmouth.edu/~ac/Teach/CS49-Fall11/.

[Cormode & Muthukrishnan, 2005] Cormode, G. & Muthukrishnan, S. (2005). An improved data stream summary: the count-min sketch and its applications. J. Algorithms, 55(1), 58–75.

[Gama et al., 2009] Gama, J., Sebastião, R., & Rodrigues, P. P. (2009). Issues in evaluation of stream learning algorithms. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09 (pp. 329–338). New York, NY, USA: ACM.

[Golab & Özsu, 2003] Golab, L. & Özsu, M. T. (2003). Processing sliding window multi-joins in continuous queries over data streams. In VLDB (pp. 500–511).

[Goyal & Daumé, 2011] Goyal, A. & Daumé, III, H. (2011). Approximate scalable bounded space sketch for large data NLP. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11 (pp. 250–261). Stroudsburg, PA, USA: Association for Computational Linguistics.

[Goyal et al., 2009] Goyal, A., Daumé, III, H., & Venkatasubramanian, S. (2009). Streaming for large scale NLP: language modeling. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL '09 (pp. 512–520). Stroudsburg, PA, USA: Association for Computational Linguistics.

[Hall et al., 2009] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: an update. SIGKDD Explor. Newsl., 11(1), 10–18.

[Indyk, 2007] Indyk, P. (2007). Sketching, Streaming and Sub-linear Space Algorithms. Lecture Notes. http://stellar.mit.edu/S/course/6/fa07/6.895/materials.html.

[Levenberg et al., 2011] Levenberg, A., Osborne, M., & Matthews, D. (2011). Multiple-stream language models for statistical machine translation. In Proceedings of the Sixth Workshop on Statistical Machine Translation, WMT '11 (pp. 177–186).
Stroudsburg, PA, USA: Association for Computational Linguistics.

[Manku & Motwani, 2002] Manku, G. S. & Motwani, R. (2002). Approximate frequency counts over data streams. In Proceedings of the 28th International Conference on Very Large Data Bases, VLDB '02 (pp. 346–357). VLDB Endowment.

[McGregor, 2011] McGregor, A. (2011). More Advanced Algorithms. Lecture Notes. http://people.cs.umass.edu/~mcgregor/courses/CS711S12/.

[Morris, 1978] Morris, R. (1978). Counting large numbers of events in small registers. Commun. ACM, 21(10), 840–842.

[Muthukrishnan, 2005] Muthukrishnan, S. (2005). Data streams: algorithms and applications. Found. Trends Theor. Comput. Sci., 1(2), 117–236.

[Nelson, 2012] Nelson, J. (2012). Sketching and streaming algorithms for processing massive data. XRDS, 19(1), 14–19.

[Neumeyer et al., 2010] Neumeyer, L., Robbins, B., Nair, A., & Kesari, A. (2010). S4: Distributed stream computing platform. In Data Mining Workshops (ICDMW), 2010 IEEE International Conference on (pp. 170–177).

[Pang & Lee, 2008] Pang, B. & Lee, L. (2008). Opinion mining and sentiment analysis. Found. Trends Inf. Retr., 2(1-2), 1–135.

[Sarmento et al., 2009] Sarmento, L., Kehlenbeck, A., Oliveira, E., & Ungar, L. (2009). An approach to web-scale named-entity disambiguation. In Proceedings of the 6th International Conference on Machine Learning and Data Mining in Pattern Recognition, MLDM '09 (pp. 689–703). Berlin, Heidelberg: Springer-Verlag.

[Vakali et al., 2012] Vakali, A., Giatsoglou, M., & Antaris, S. (2012). Social networking trends and dynamics detection via a cloud-based framework design. In Proceedings of the 21st International Conference Companion on World Wide Web, WWW '12 Companion (pp. 1213–1220). New York, NY, USA: ACM.

[Van Durme, 2012a] Van Durme, B. (2012a). Jerboa: A Toolkit for Randomized and Streaming Algorithms. Technical report, Human Language Technology Center of Excellence.

[Van Durme, 2012b] Van Durme, B. (2012b).
Streaming analysis of discourse participants. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (pp. 48–58). Jeju Island, Korea: Association for Computational Linguistics.

[Van Durme & Lall, 2010] Van Durme, B. & Lall, A. (2010). Online generation of locality sensitive hash signatures. In Proceedings of the ACL 2010 Conference Short Papers, ACLShort '10 (pp. 231–235). Stroudsburg, PA, USA: Association for Computational Linguistics.

[Van Durme & Lall, 2011] Van Durme, B. & Lall, A. (2011). Efficient online locality sensitive hashing via reservoir counting. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, HLT '11 (pp. 18–23). Stroudsburg, PA, USA: Association for Computational Linguistics.

[Vitter, 1985] Vitter, J. S. (1985). Random sampling with a reservoir. ACM Trans. Math. Softw., 11(1), 37–57.