research-article

Smile: a system to support machine learning on EEG data at scale

Published: 01 August 2019

Abstract

To reduce the risk of neural injury from seizures and spare neurologists from spending hours manually reviewing EEG recordings, it is critical to automatically detect and classify "interictal-ictal continuum" (IIC) patterns in EEG data. However, existing IIC classification techniques have been shown to be neither accurate nor robust enough for clinical use, because high-quality labels of EEG segments are lacking as training data. Obtaining high-quality labeled data has traditionally been a manual process carried out by trained clinicians, and it can be tedious, time-consuming, and error-prone. In this work, we propose Smile, an industrial-scale system that provides an end-to-end solution to the IIC pattern classification problem. The core components of Smile include a visualization-based time series labeling module and a deep-learning-based active learning module. The labeling module enables users to explore and label 350 million EEG segments (30 TB) at interactive speed. Its multiple coordinated views allow users to examine the EEG signals in the time domain and the frequency domain simultaneously. The active learning module first trains a deep neural network that classifies IIC patterns by automatically extracting both the local features of each segment and the long-term dynamics of the EEG signals. Then, leveraging the output of the deep learning model, it selects the EEG segments that can best improve the model and prompts clinicians to label them. This process is iterated until the clinicians and the models show a high degree of agreement. Our initial experimental results show that Smile allows clinicians to label EEG segments at will with a response time below 500 ms, and that the accuracy of the model improves progressively as more high-quality labels are acquired over time.
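
The select-label-retrain cycle described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state how Smile scores segments or measures agreement, so everything here is an assumption made for illustration: segments are ranked by predictive entropy (a common uncertainty-sampling heuristic, not necessarily the paper's criterion), agreement is the fraction of newly labeled segments where the model's top class matches the clinician's label, and the names `select_for_labeling`, `agreement`, `active_learning_loop`, `train`, `predict`, and `ask_clinician` are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(predictions, labeled_ids, batch_size):
    """Return ids of the unlabeled segments the model is least sure about.

    Assumption: uncertainty is measured by predictive entropy.
    """
    candidates = [
        (entropy(probs), seg_id)
        for seg_id, probs in predictions.items()
        if seg_id not in labeled_ids
    ]
    candidates.sort(reverse=True)  # highest entropy first
    return [seg_id for _, seg_id in candidates[:batch_size]]


def agreement(predictions, labels):
    """Fraction of labeled segments whose top predicted class equals the label."""
    if not labels:
        return 0.0
    hits = 0
    for seg_id, label in labels.items():
        probs = predictions[seg_id]
        top_class = max(range(len(probs)), key=probs.__getitem__)
        hits += top_class == label
    return hits / len(labels)


def active_learning_loop(train, predict, ask_clinician, segment_ids,
                         batch_size=2, target_agreement=0.9, max_rounds=10):
    """Iterate: train, pick uncertain segments, ask for labels, check agreement.

    `train`, `predict`, and `ask_clinician` are caller-supplied callables
    (hypothetical stand-ins for the model and the labeling interface).
    """
    labels = {}
    for _ in range(max_rounds):
        model = train(labels)
        predictions = {s: predict(model, s) for s in segment_ids}
        batch = select_for_labeling(predictions, labels, batch_size)
        if not batch:
            break  # nothing left to label
        new_labels = {s: ask_clinician(s) for s in batch}
        labels.update(new_labels)
        # Compare the model's pre-label predictions with the fresh labels.
        if agreement(predictions, new_labels) >= target_agreement:
            break
    return labels
```

Agreement is checked on the freshly labeled batch, using predictions made before those labels were seen, so that the stopping test is not computed on data the model was already trained on.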

Published In

Proceedings of the VLDB Endowment, Volume 12, Issue 12
August 2019
547 pages

Publisher

VLDB Endowment