Abstract
In this paper, we present two novel decaying operators for Telco Big Data (TBD), coined TBD-DP and CTBD-DP, that are founded on the notion of Data Postdiction. Unlike data prediction, which aims to make a statement about the future value of some tuple, data postdiction, as we formulate it, aims to make a statement about the past value of some tuple that no longer exists because it had to be deleted to free up disk space. TBD-DP relies on existing Machine Learning (ML) algorithms to abstract TBD into compact models that can be stored and queried when necessary. The TBD-DP operator has two conceptual phases: (i) in an offline phase, it uses an LSTM-based hierarchical ML algorithm to learn a tree of models (coined the TBD-DP tree) over time and space; (ii) in an online phase, it uses the TBD-DP tree to recover data within a certain accuracy. Additionally, we provide three decaying focus methods that can be plugged into the proposed operators: (i) FIFO-amnesia, which is based on the time at which the tuple was created; (ii) SPATIAL-amnesia, which is based on the location of the cellular tower associated with the tuple; and (iii) UNIFORM-amnesia, which randomly picks the tuples to be decayed. Similarly, CTBD-DP enables the decaying of streaming data, using the TBD-DP tree to extend and update the stored models. In our experimental setup, we measure the efficiency of the proposed operators using a ∼10 GB anonymized real telco network trace. Our experimental results with TensorFlow over HDFS are encouraging: they show that TBD-DP saves an order of magnitude in storage space while maintaining high accuracy on the recovered data. Our experiments also show that CTBD-DP improves the accuracy over streaming data.
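The three decaying focus methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `TelcoTuple` record, its field names, the function names, and the parameter `k` (number of tuples to decay) are hypothetical. In particular, the abstract only states that SPATIAL-amnesia depends on the cellular tower's location; the criterion used here (decay the tuples whose tower lies farthest from an assumed focus point `focus_xy`) is an assumption made for illustration.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TelcoTuple:
    """Hypothetical record layout; the abstract does not define a schema."""
    created_at: int    # creation time, e.g. epoch seconds
    tower_xy: tuple    # (x, y) location of the associated cell tower
    value: float       # measured value carried by the tuple


def fifo_amnesia(tuples, k):
    """FIFO-amnesia: select the k oldest tuples by creation time."""
    return sorted(tuples, key=lambda t: t.created_at)[:k]


def spatial_amnesia(tuples, k, focus_xy):
    """SPATIAL-amnesia (assumed criterion): select the k tuples whose
    tower is farthest from focus_xy."""
    return sorted(
        tuples,
        key=lambda t: math.dist(t.tower_xy, focus_xy),
        reverse=True,
    )[:k]


def uniform_amnesia(tuples, k, seed=None):
    """UNIFORM-amnesia: select k tuples uniformly at random, without
    replacement (fewer if the input holds fewer than k tuples)."""
    pool = list(tuples)
    rng = random.Random(seed)
    return rng.sample(pool, min(k, len(pool)))
```

Each function returns the tuples chosen for decaying, so any of them could be passed to an operator as an interchangeable selection policy; in the paper's setting the selected tuples would then be abstracted into models rather than simply dropped.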
Notes
TBD Awareness, https://tbd.cs.ucy.ac.cy/
Cite this article
Costa, C., Konstantinidis, A., Charalampous, A. et al. Continuous decaying of telco big data with data postdiction. Geoinformatica 23, 533–557 (2019). https://doi.org/10.1007/s10707-019-00364-z
DOI: https://doi.org/10.1007/s10707-019-00364-z