Prior Work#

Prior to our work, research on deep-learning-based EEG decoding was limited:

  • Few studies compared to published feature-based decoding results

  • Most EEG DL architectures had only 1-3 convolutional layers and included fully-connected layers with many parameters

  • Most work only considered very restricted frequency ranges

  • Most studies only compared few design choices and training strategies

Prior to 2017, when the first work presented in this thesis was published, there was only limited literature on EEG decoding with deep learning. In this chapter, I outline which decoding problems, input representations, network architectures, hyperparameter choices and visualizations were evaluated in prior work, based on the literature review we presented in Schirrmeister et al. [2017].

Decoding Problems and Baselines#

Table 1 Decoding problems in deep-learning EEG decoding studies prior to our work. Studies counted as having a published baseline compared their decoding results to an external baseline result published by other authors.#

| Decoding problem | Number of studies | With published baseline |
|---|---|---|
| Imagined or Executed Movement | 6 | 2 |
| Oddball/P300 | 5 | 1 |
| Epilepsy-related | 4 | 2 |
| Music Rhythm | 2 | 0 |
| Memory Performance/Cognitive Load | 2 | 0 |
| Driver Performance | 1 | 0 |

The most widely studied decoding problems were movement-related, such as decoding which body part (hand, feet, etc.) a person is moving or imagining moving (see Table 1). Of the 19 studies we identified at the time, only 5 compared their decoding results to an externally published baseline result, limiting the insights about deep-learning EEG decoding performance. We therefore decided to compare deep-learning EEG decoding to a strong feature-based baseline (see Filter Bank Common Spatial Patterns and Filterbank Network) on widely researched movement-related decoding tasks.

Input Domains and Frequency Ranges#

import re

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import seaborn
from myst_nb import glue

seaborn.set_palette('colorblind')
seaborn.set_style('darkgrid')

%matplotlib inline
%config InlineBackend.figure_format = 'png'
matplotlib.rcParams['font.size'] = 14
# Input domain and frequency range reported by each of the 19 prior studies
a = np.array(['Time,  8–30 Hz ', 'Time, 0.1–40 Hz ', 'Time, 0.05–15 Hz ',
       'Time, 0.3–20 Hz ', 'Frequency, 6–30 Hz ', ' Frequency, 0–200 Hz ',
       'Time,  1–50 Hz ', ' Time,  0–100 HZ ',
       'Frequency, mean amplitude for 0–7 Hz, 7–14 Hz, 14–49 Hz ',
       'Time, 0.5–50 Hz ', 'Time,  0–128 Hz ',
       ' Frequency, mean power for 4–7 Hz, 8–13 Hz, 13–30 Hz ',
       'Time, 0.5–30Hz ', 'Time, 0.1–50 Hz ',
       'Frequency, 4–40 Hz, using FBCSP ',
       ' Time and frequency evaluated, 0-200 Hz ', 'Frequency, 8–30 Hz ',
       'Time, 0.15–200 Hz ', ' Time, 0.1-20 Hz '])
domain_strings = [s.split(',')[0] for s in a]
# Lowest and highest frequency (Hz) mentioned in each description; the last number
# is taken as the upper limit so that multi-band entries
# (e.g. "0–7 Hz, 7–14 Hz, 14–49 Hz") end at their highest band
start_fs = [float(re.sub(r'[a-z ]+',r'', re.split(r'[–-–-]'," ".join(s.split(',')[1:]))[0])) for s in a]
end_fs = [float(re.sub(r'[a-z HZFBCSP]+',r'', re.split(r'[–-–-]'," ".join(s.split(',')[1:]))[-1])) for s in a]
domain_strings = np.array(domain_strings)
start_fs = np.array(start_fs)
end_fs = np.array(end_fs)

freq_mask = np.array(['freq' in s.lower() for s in domain_strings])
time_mask = np.array(['time' in s.lower() for s in domain_strings])

fig = plt.figure(figsize=(8,4))
color = seaborn.color_palette()[0]
# One vertical line per study, from its lowest to its highest frequency,
# grouped by input domain and sorted by upper frequency
i_sort = np.flatnonzero(time_mask)[np.argsort(end_fs[time_mask])]
for i, (d,s,e) in enumerate(zip(
        domain_strings[i_sort], start_fs[i_sort], end_fs[i_sort])):
    offset = 0.6*i/len(i_sort) - 0.3
    plt.plot([offset,offset] , [s, e], marker='o', alpha=1, color=color, ls='-')
i_sort = np.flatnonzero(freq_mask)[np.argsort(end_fs[freq_mask])]
for i, (d,s,e) in enumerate(zip(
        domain_strings[i_sort], start_fs[i_sort], end_fs[i_sort])):
    offset = 0.6*i/len(i_sort) + 0.7
    plt.plot([offset,offset] , [s, e], marker='o', alpha=1, color=color, ls='-')

plt.xlim(-0.5,1.5)
plt.xlabel("Input domain")
plt.ylabel("Frequency [Hz]")
plt.xticks([0,1], ["Time", "Frequency"], rotation=45)
plt.title("Input domains and frequency ranges in prior work", y=1.05)
plt.yticks([0,25,50,75,100,150,200])
glue('input_domain_fig', fig)
plt.close(fig)
None

Fig. 1 Input domains and frequency ranges in prior work. Grey lines represent the frequency ranges of individual studies. Note that many studies only included frequencies below 50 Hz, and some used very restricted ranges (alpha/beta band).#

Deep networks can either decode directly from the time-domain EEG or process the data in the frequency domain, for example after a Fourier transformation. 12 of the prior studies used time-domain inputs, 6 used frequency-domain inputs and one used both. We decided to work directly in the time domain, as the deep networks should in principle be able to learn how to extract any needed spectral information from the time-domain input.
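
The following minimal sketch illustrates the two input representations; the array shapes and names are made up for illustration and not taken from our study:

```python
import numpy as np

# Illustrative sketch: two input representations for the same EEG window.
# `eeg` stands in for a real (n_channels x n_times) EEG crop sampled at `sfreq` Hz.
sfreq = 250
n_channels, n_times = 44, 1000
rng = np.random.default_rng(0)
eeg = rng.standard_normal((n_channels, n_times))

# Time-domain input: the raw samples are fed directly to the network.
time_input = eeg

# Frequency-domain input: e.g., the log amplitude spectrum per channel after an FFT.
amplitudes = np.abs(np.fft.rfft(eeg, axis=-1))
freqs = np.fft.rfftfreq(n_times, d=1.0 / sfreq)
freq_input = np.log(amplitudes + 1e-6)

print(time_input.shape, freq_input.shape, freqs.max())  # (44, 1000) (44, 501) 125.0
```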

Most prior studies working in the time domain used only frequencies below 50 Hz. We were interested in how well deep networks can also extract less commonly used higher-frequency components of the EEG signal. For that, we used a sampling rate of 250 Hz, which allowed us to analyze frequencies up to the Nyquist frequency of 125 Hz. As a suitable dataset where high-frequency information may help decoding, we included our high-gamma dataset in the study, since it was recorded specifically to allow extraction of higher-frequency (>50 Hz) information from scalp EEG [Schirrmeister et al., 2017].
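
As a small illustration of the relation between sampling rate and usable frequency range, the sketch below band-pass filters a signal to a high-gamma range; the band limits and filter settings are illustrative assumptions, not the exact preprocessing used in our studies:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

# With a 250 Hz sampling rate, frequencies up to the Nyquist frequency
# of fs / 2 = 125 Hz are representable in the recorded signal.
fs = 250.0
nyquist = fs / 2.0  # 125.0 Hz

# Illustrative band-pass isolating a high-gamma range (here 50-120 Hz).
sos = butter(4, [50.0, 120.0], btype='bandpass', fs=fs, output='sos')

rng = np.random.default_rng(1)
eeg = rng.standard_normal((44, 5 * int(fs)))  # stand-in (channels x time) EEG, 5 s
high_gamma = sosfiltfilt(sos, eeg, axis=-1)   # higher-frequency component per channel
print(nyquist, high_gamma.shape)
```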

Network Architectures#

# Number of convolutional/dense layers ("conv/dense") per prior architecture;
# parenthetical notes describe special variants
ls = np.array([' 2/2 ', ' 3/1 ', ' 2/2 ', ' 3/2 ', ' 1/1 ', ' 1/2 ', ' 1/3 ',
       ' 1–2/2 ', ' 3/1 (+ LSTM as postprocessor) ', ' 4/3 ', ' 1-3/1-3 ',
       ' 3–7/2 (+ LSTM or other temporal post-processing (see design choices)) ',
       ' 2/1 ', ' 3/3 (Spatio-temporal regularization) ',
       ' 2/2 (Final fully connected layer uses concatenated output by convolutional and fully connected layers) ',
       ' 1-2/1 ',
       '2/0 (Convolutional deep belief network, separately trained RBF-SVM classifier) ',
       ' 3/1 (Convolutional layers trained as convolutional stacked autoencoder with target prior) ',
       ' 2/2 '])

# Parse "conv/dense" layer counts; ranges like "3–7" yield a low and a high value
conv_ls = [l.split('/')[0] for l in ls]
low_conv_ls = [int(re.split(r'[–-]', c)[0]) for c in conv_ls]
high_conv_ls = [int(re.split(r'[–-]', c)[-1]) for c in conv_ls]
dense_ls = [l.split('/')[1] for l in ls]
low_dense_ls = [int(re.split(r'[–-]', c[:8])[0][:2]) for c in dense_ls]
high_dense_ls = [int(re.split(r'[–-]', c[:8])[-1][:2]) for c in dense_ls]

# Expand ranges so each tried layer count is counted once per study
all_conv_ls = np.concatenate([np.arange(low_c, high_c+1) for low_c, high_c in zip(low_conv_ls, high_conv_ls)])
all_dense_ls = np.concatenate([np.arange(low_c, high_c+1) for low_c, high_c in zip(low_dense_ls, high_dense_ls)])
bincount_conv = np.bincount(all_conv_ls)
bincount_dense = np.bincount(all_dense_ls)
rng = np.random.RandomState(98349384)
color = seaborn.color_palette()[0]
fig = plt.figure(figsize=(8,4))
# Dotted lines: layer-count range tried within one study (horizontally jittered);
# large markers and "Nx" labels: how many architectures used that layer count overall
for low_c, high_c in zip(low_conv_ls, high_conv_ls):
    offset = rng.randn(1) * 0.1
    tried_cs = np.arange(low_c, high_c+1)
    plt.plot([offset,] * len(tried_cs), tried_cs, marker='o', alpha=0.5, color=color, ls=':')
    
for i_c, n_c in enumerate(bincount_conv):
    plt.scatter(0.4, i_c, color=color, s=n_c*40)
    plt.text(0.535, i_c, str(n_c)+ "x", ha='left', va='center')

for low_c, high_c in zip(low_dense_ls, high_dense_ls):
    offset = 1 + rng.randn(1) * 0.1
    tried_cs = np.arange(low_c, high_c+1)
    plt.plot([offset,] * len(tried_cs), tried_cs, marker='o', alpha=0.5, color=color, ls=':')
    
for i_c, n_c in enumerate(bincount_dense):
    plt.scatter(1.4, i_c, color=color, s=n_c*40)
    plt.text(1.535, i_c, str(n_c)+ "x", ha='left', va='center')

plt.xlim(-0.5,2)
plt.xlabel("Type of layer")
plt.ylabel("Number of layers")
plt.xticks([0,1], ["Convolutional", "Dense"], rotation=45)
plt.yticks([1,2,3,4,5,6,7]);
plt.title("Number of layers in prior works' architectures", y=1.05)
glue('layernum_fig', fig)
plt.close(fig)
None

Fig. 2 Number of layers in prior work. Small grey markers represent individual architectures. Dashed lines indicate different numbers of layers investigated within a single study (e.g., one study investigated 3–7 convolutional layers). Larger grey markers indicate the total number of occurrences of that layer count over all studies (e.g., 9 architectures used 2 convolutional layers). Note that most architectures used only 1–3 convolutional layers.#

The architectures used in prior work typically included only up to 3 convolutional layers, with only 2 studies considering more. As network architectures in other domains tend to be much deeper, we also evaluated architectures with a larger number of layers in our work. Several architectures from prior work also included fully-connected layers with a large number of parameters, which had fallen out of favor in computer-vision deep-learning architectures due to their large compute and memory requirements and little accuracy benefit. Our architectures do not include traditional fully-connected layers with a large number of parameters.
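
To illustrate why such fully-connected layers dominate parameter counts, the following back-of-the-envelope comparison uses made-up dimensions that are not taken from any specific prior architecture:

```python
# Illustrative parameter-count comparison: a fully-connected layer applied to a
# flattened EEG input versus a temporal convolution with weight sharing.
n_channels, n_times, n_units = 44, 500, 128

# Fully-connected layer: one weight per (input value, output unit) pair.
fc_params = n_channels * n_times * n_units + n_units           # weights + biases

# Temporal convolution: the same kernel weights are reused at every time step.
n_filters, kernel_len = 40, 25
conv_params = n_filters * n_channels * kernel_len + n_filters  # weights + biases

print(f"fully-connected: {fc_params:,} parameters")   # 2,816,128
print(f"convolutional:   {conv_params:,} parameters") # 44,040
```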

Hyperparameter Evaluations#

Table 2 Design choices and training strategies evaluated in prior deep-learning EEG decoding studies.#

| Study | Design choices | Training strategies |
|---|---|---|
| [Lawhern et al., 2016] | Kernel sizes | |
| [Sun et al., 2016] | Different time windows | |
| [Tabar and Halici, 2017] | Addition of six-layer stacked autoencoder on ConvNet features; kernel sizes | |
| [Liang et al., 2016] | Different subdivisions of frequency range; different lengths of time crops | Transfer learning with auxiliary non-epilepsy datasets |
| [Hajinoroozi et al., 2016] | Replacement of convolutional layers by restricted Boltzmann machines with slightly varied network architecture | |
| [Antoniades et al., 2016] | 1 or 2 convolutional layers | |
| [Page et al., 2016] | | Cross-subject supervised training, within-subject finetuning of fully connected layers |
| [Bashivan et al., 2016] | Number of convolutional layers; temporal processing of ConvNet output by max pooling, temporal convolution, LSTM or temporal convolution + LSTM | |
| [Stober, 2016] | Kernel sizes | Pretraining first layer as convolutional autoencoder with different constraints |
| [Sakhavi et al., 2015] | Combination of ConvNet and MLP (trained on different features) vs. only ConvNet vs. only MLP | |
| [Stober et al., 2014] | Best values from automatic hyperparameter optimization: frequency cutoff, one vs. two layers, kernel sizes, number of channels, pooling width | Best values from automatic hyperparameter optimization: learning rate, learning rate decay, momentum, final momentum |
| [Wang et al., 2013] | | Partially supervised CSA |
| [Cecotti and Graser, 2011] | Electrode subset (fixed or automatically determined); using only one spatial filter; different ensembling strategies | |

Prior work varied widely in how many design choices and training strategies were compared. Six of the studies did not compare any design choices or training-strategy hyperparameters. The other 13 studies evaluated different hyperparameters, with kernel size being the most commonly evaluated one (see Table 2). Only one study evaluated a wider range of hyperparameters [Stober et al., 2014]. To fill this gap, we compared a wider range of design choices and training strategies and specifically evaluated whether design choices and training strategies that had improved computer-vision architectures also lead to improvements in EEG decoding.

Visualizations#

Table 3 Visualizations presented in prior work.#

| Study | Visualization type(s) | Visualization findings |
|---|---|---|
| [Sun et al., 2016] | Weights (spatial) | Largest weights found over prefrontal and temporal cortex |
| [Manor et al., 2016] | Weights; activations; saliency maps by gradient | Weights showed typical P300 distribution; activations were high at plausible times (300–500 ms); saliency maps showed plausible spatio-temporal plots |
| [Tabar and Halici, 2017] | Weights (spatial + frequential) | Some weights represented the difference of values of two electrodes on different sides of the head |
| [Liang et al., 2016] | Weights; clustering of weights | Clusters of weights showed typical frequency band subdivision (delta, theta, alpha, beta, gamma) |
| [Antoniades et al., 2016] | Weights; correlation of weights and interictal epileptic discharges (IED); activations | Weights increasingly correlated with IED waveforms with increasing number of training iterations; second layer captured more complex and well-defined epileptic shapes than first layer; IEDs led to highly synchronized activations for neighbouring electrodes |
| [Thodoroff et al., 2016] | Input occlusion and effect on prediction accuracy | Allowed localization of areas critical for seizures |
| [Shamwell et al., 2016] | Weights (spatial) | Some filter weights had expected topographic distributions for P300; other filters had large weights on areas not traditionally associated with P300 |
| [Bashivan et al., 2016] | Inputs that maximally activate a given filter; activations of these inputs; "deconvolution" for these inputs | Different filters were sensitive to different frequency bands; later layers had more spatially localized activations; learned features had noticeable links to well-known electrophysiological markers of cognitive load |
| [Stober, 2016] | Weights (spatial + 3 timesteps, pretrained as autoencoder) | Different constraints led to different weights; one type of constraint could enforce weights that are similar across subjects; another type led to weights with similar spatial topographies under different architectural configurations and preprocessings |
| [Manor and Geva, 2015] | Weights; mean and single-trial activations | Spatiotemporal regularization led to softer peaks in weights; spatial weights showed typical P300 distribution; activations mostly had peaks at typical times (300–400 ms) |
| [Cecotti and Graser, 2011] | Weights | Spatial filters were similar for different architectures; spatial filters differed (more focal, more diffuse) between subjects |

Visualizations can help to understand what information the networks extract from the EEG signal. 11 of the 19 prior studies presented visualizations, mostly focusing on analyzing weights and activations (see Table 3). In our work, we first investigated to what extent the networks extract spectral features known to work well for movement-related decoding, see Perturbation Visualization. Later, we also developed more sophisticated visualization methods and applied them to pathology decoding, see Invertible Networks and Understanding Pathology Decoding With Invertible Networks.
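
To convey the idea behind such an input-perturbation analysis, the simplified sketch below scales the amplitude of one frequency band in the input and measures the change in the model output; the `model`, band and scaling factor are illustrative stand-ins, not the exact procedure used in the thesis:

```python
import numpy as np

def band_perturbation_effect(model, eeg, sfreq, band, scale=1.2):
    """Scale the amplitude of one frequency band in the input and return the
    change in the model output. `model` is assumed to be a callable mapping a
    (channels x time) array to a score; simplified illustration only."""
    spectrum = np.fft.rfft(eeg, axis=-1)
    freqs = np.fft.rfftfreq(eeg.shape[-1], d=1.0 / sfreq)
    band_mask = (freqs >= band[0]) & (freqs < band[1])
    perturbed_spectrum = spectrum.copy()
    perturbed_spectrum[..., band_mask] *= scale  # amplify this band
    perturbed = np.fft.irfft(perturbed_spectrum, n=eeg.shape[-1], axis=-1)
    return model(perturbed) - model(eeg)

# Usage sketch with a dummy "model" (mean absolute amplitude of the input):
dummy_model = lambda x: np.abs(x).mean()
rng = np.random.default_rng(2)
eeg = rng.standard_normal((44, 1000))
print(band_perturbation_effect(dummy_model, eeg, sfreq=250, band=(70, 90)))
```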

Open Questions

  • How do ConvNets perform on well-researched EEG movement-related decoding tasks against strong feature-based baselines?

  • How do shallower and deeper architectures compare?

  • How do design choices and training strategies affect the decoding performance?

  • What features do the deep networks learn on the EEG signals?

  • Do they learn to use higher-frequency (>50 Hz) information?