TensorFlow Speech Commands
The dataset we used [data1] consists of 64,721 short recordings of speech commands. Each recording is an utterance, 1 second or shorter, of one of 30 words. Following the steps outlined in a speech recognition tutorial [Pai2], we resampled the recordings from the original sample rate of 16 kHz to 8 kHz; by the Nyquist criterion this preserves frequencies up to 4 kHz, below which most of the essential frequency information for speech lies [confirmation needed]. We split the data randomly into validation (1,000 recordings), test (7,000 recordings), and training (56,721 recordings) sets, and used the training set for training, the validation set for tuning the hyperparameters, and the test set only for evaluating the best version of each network type.
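The resampling step above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the random waveform stands in for a real recording, and the choice of `scipy.signal.resample` is an assumption (the tutorial may use a different resampling routine).

```python
import numpy as np
from scipy.signal import resample

sr_orig, sr_new = 16000, 8000

# Hypothetical 1-second recording at the original 16 kHz sample rate.
recording = np.random.randn(sr_orig).astype(np.float32)

# Downsample to 8 kHz: the resampled signal keeps sr_new samples per second,
# discarding frequency content above the new Nyquist limit of 4 kHz.
downsampled = resample(recording, sr_new)
```

After this step each 1-second clip is an array of 8,000 amplitude samples.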
Convolutional neural networks (CNNs), which have proven remarkably effective in tasks like visual classification, have also been used successfully in speech recognition [Sainath3], [Microsoft4]. CNNs are powerful feature extractors: in the task of speech recognition, their ability to learn structure from the data makes them robust to noise and to variation in speakers' frequency patterns [Sainath3].
Our first approach was guided by Aravind Pai's tutorial [Pai2] on CNNs for sound classification. This approach essentially treats each sound recording as a 1-dimensional image and classifies it into one of the 30 categories with a CNN built from only convolutional, max pooling, dropout, and dense layers. The architecture consists of 4 convolutional layers, each followed by max pooling of size 3 and dropout of 30%, then a dense layer with 256 units and ReLU activation, a dense layer with 128 units and Softmax activation, and an output layer with Softmax activation. We trained this network, which contains 1,614,078 trainable parameters, for 150 epochs (21 minutes on a GPU-accelerated runtime on Google Colab) and achieved an accuracy of 81.8% on the test set.
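The architecture described above can be sketched in Keras as below. The pooling size (3), dropout rate (30%), dense widths (256 and 128), and layer ordering follow the text; the filter counts and kernel sizes are assumptions, so the parameter count of this sketch will not match the 1,614,078 reported for the actual network.

```python
from tensorflow.keras import layers, models

def build_raw_audio_cnn(input_len=8000, n_classes=30):
    """Sketch of a 1D CNN over raw waveforms, per the text's description."""
    model = models.Sequential([layers.Input(shape=(input_len, 1))])
    # 4 convolutional layers, each followed by max pooling (size 3) and 30% dropout.
    # Filter counts and kernel size here are illustrative assumptions.
    for filters in (8, 16, 32, 64):
        model.add(layers.Conv1D(filters, kernel_size=13, activation="relu"))
        model.add(layers.MaxPooling1D(pool_size=3))
        model.add(layers.Dropout(0.3))
    model.add(layers.Flatten())
    model.add(layers.Dense(256, activation="relu"))
    model.add(layers.Dense(128, activation="softmax"))
    model.add(layers.Dense(n_classes, activation="softmax"))
    return model
```

With 8,000-sample inputs, the network outputs a probability distribution over the 30 word classes.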
After achieving satisfactory results on the speech classification task with convolution on raw sound data, we wanted to find out whether providing the data to the network in a different format would lead to higher performance. A common method of preprocessing speech data is to compute Mel spectrograms of the raw sound. The Mel spectrogram computation is a nonlinear transformation that takes a raw sound file – a 1D array of amplitudes – and converts it into a sequence of Mel coefficient vectors – coefficients that are considered to model well the way humans perceive differences in frequencies [Mel reference5].
We converted each 1-second recording into a sequence of 16 time steps, with 128 Mel coefficients at each step.
We were curious whether an approach that treats the sound data sequentially – as opposed to treating it as an image – would provide better results. A natural way to handle sequences is with recurrent units – units that use the output of one time step as input for the next. The particular unit we used is the long short-term memory (LSTM) unit – a type of recurrent network with an input gate, an output gate, and a forget gate, which control the flow of information and help the unit retain information over longer periods of time [LSTM6].
For this task, we kept the convolution stack from the previous network, as CNNs are good feature extractors [Microsoft speech4], and added an LSTM layer with 100 units on top, replacing the two hidden dense layers. As a result, the network has fewer trainable parameters than the Mel CNN: 188,694.
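The combination described above – convolutional feature extraction feeding a 100-unit LSTM that replaces the hidden dense layers – can be sketched as follows. The conv stack here (depth, filter counts, same-padding so the 16-step sequence survives) is an illustrative assumption, so the parameter count will differ from the reported 188,694; only the 100-unit LSTM and the 30-way softmax output follow the text.

```python
from tensorflow.keras import layers, models

def build_cnn_lstm(time_steps=16, n_mels=128, n_classes=30):
    """Sketch: conv feature extractor over Mel frames, then a 100-unit LSTM."""
    model = models.Sequential([layers.Input(shape=(time_steps, n_mels))])
    # Convolution along the time axis; 'same' padding preserves the 16 steps
    # so the LSTM still receives a sequence. Filter counts are assumptions.
    for filters in (32, 64):
        model.add(layers.Conv1D(filters, kernel_size=3, padding="same",
                                activation="relu"))
        model.add(layers.Dropout(0.3))
    # The LSTM replaces the two hidden dense layers of the earlier network.
    model.add(layers.LSTM(100))
    model.add(layers.Dense(n_classes, activation="softmax"))
    return model
```

Each input is the (16, 128) Mel coefficient sequence produced in the preprocessing step.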
While working on this project, we discussed our approach with Maan Qraitem.
 P. Warden. "Speech Commands: A public dataset for single-word speech recognition," 2017. Available from http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz
 A. Pai. "Learn how to Build your own Speech-to-Text Model (using Python)," 2019. Retrieved from https://www.analyticsvidhya.com/blog/2019/07/learn-build-first-speech-to-text-model-python/.
 T. N. Sainath, A. Mohamed, B. Kingsbury, and B. Ramabhadran. "Deep Convolutional Neural Networks for LVCSR." In: IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.
 L. Deng, J. Li, J.-T. Huang, K. Yao, D. Yu, F. Seide, and Y. Gong. "Recent advances in deep learning for speech research at Microsoft." In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 8604–8608, 2013.
 S. S. Stevens, J. Volkmann, and E. B. Newman. "A scale for the measurement of the psychological magnitude pitch." Journal of the Acoustical Society of America, volume 8, issue 3, pp. 185–190, 1937.
 S. Hochreiter and J. Schmidhuber. "Long Short-Term Memory." In: Neural Computation, volume 9, issue 8, pp. 1735–1780, 1997.