I’m going to write a neural network to classify/detect dual-tone multi-frequency (DTMF) signals. Then I’m going to demonstrate a common DSP technique (one that modern telephony systems still use). I actually don’t know what to expect, but I’ll do some validation and look for ways to break each approach.

Alright, let’s get started with our deep learning tone detector. Where do you start with deep learning? Labeled data. That’s a point for your grandad’s Goertzel filter. With a DSP textbook you can look up simple formulas and use existing implementations and be done. The End.

OK, but what if you *had* a reason to train something? Maybe your problem is noise robustness, or frequency perturbations caused by a new codec, something that means you need to compute several filters for each frequency plus complex logic to reject false positives. Meh. OK, just stay with me…

Labeled data might be hard to collect, but there’s good news. Because there are published definitions we could, in theory, use simple trigonometry functions to generate waves with varying frequencies and amplitudes, sum a couple of frequencies, and add some noise. Maybe we can find a corpus of audio that we know doesn’t contain these tones to ensure that we don’t get unnecessary false positives.

This isn’t going to be perfect (it’s really just for demonstration purposes), and I’m going to leave it to the reader to think about how they might make this more robust, or which controls I’m being fast and loose with.

Once we figure this out, we’ll need it to test both solutions, so it’s a good place to start.

I’m not going to get too into the weeds of tone generation. You can get the gist from the Wikipedia article: https://en.wikipedia.org/wiki/Sine_wave

One thing we do want to be able to do is mutate the signal a little: adding noise, wiggling the frequencies slightly, and modulating the amplitude. Different environments will have different effects on the signal. I’m not guaranteeing that just adding some basic white noise does much for us, but it should be enough for demonstration purposes.

First we need the frequencies for DTMF signals. They can be found here: https://en.wikipedia.org/wiki/Dual-tone_multi-frequency_signaling

I’m going to distill the frequencies into a basic lookup table (a dictionary in Python):

import numpy as np
import matplotlib.pyplot as plt
from collections import OrderedDict

# Tuples compare element-wise with ==, so we can match (low, high) frequency
# pairs directly. These are the standard DTMF (row, column) pairs in Hz;
# keys 10 and 11 stand in for * and # here.
tone_lookup = OrderedDict({
              0: (941, 1336),
              1: (697, 1209),
              2: (697, 1336),
              3: (697, 1477),
              4: (770, 1209),
              5: (770, 1336),
              6: (770, 1477),
              7: (852, 1209),
              8: (852, 1336),
              9: (852, 1477),
              10: (941, 1209),  # *
              11: (941, 1477)   # #
            })
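As a quick illustration of that comment, here’s the membership test that get_tones will rely on later:

print((697, 1209) in tone_lookup.values())  # True: the pair for key 1
print((697, 1210) in tone_lookup.values())  # False: not an exact DTMF pair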


To build a signal, we need to create sine waves of a given frequency. While we’re at it, we add a small random phase offset to move the audio in and out of phase, and jitter the frequency a little. Then we combine the two waves and normalize.

def freq_wave(freq, amplitude=1.0, phase=0.0):
    # Half a second of sample indices at the 8 kHz telephony rate.
    timeseries = np.arange(4000)
    return amplitude * np.sin(2 * np.pi * freq * timeseries / 8000 + phase)

def single_freq(freq, var=0.005, amplitude=0.4):
    # Jitter the frequency by a small fraction of itself and randomize the phase.
    freq_noise = np.random.normal(0, var * freq)
    phase = np.random.uniform(0, 0.03)
    return freq_wave(freq + freq_noise, phase=phase, amplitude=amplitude)

def build_tone(f1, f2, var=0.005, a=0.4):
    # Sum the two sinusoids and normalize the peak to ~1.
    waveform = single_freq(f1, var=var, amplitude=a) + single_freq(f2, var=var, amplitude=a)
    return waveform / (waveform.max() + 1e-8)
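A quick sanity check of these helpers:

wave = build_tone(*tone_lookup[5])
print(wave.shape)  # (4000,): half a second at 8,000 samples per second
print(wave.max())  # ~1.0 after normalization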

The main workhorse, the get_tones function, randomly decides which tone to create, and 10% of the time it creates tones we don’t care about as counterexamples. We sort those random frequencies before testing whether they’re in the lookup table, in case we accidentally create a frequency pair we do care about.

Then we build the label set, and return the labels and waveform. This is a little contrived, but it should work well enough.

def get_tones(printed=False):
    if np.random.uniform(0, 1) < 0.1:
        rgh = True  # flag: we're using a random frequency pair as a counterexample
        x1 = np.random.randint(100, 3200)
        x2 = np.random.randint(100, 3200)
        x1, x2 = sorted([x1, x2])  # sorted so the frequency tuple matches the lookup table order
    else:
        rgh = False
        tone = np.random.choice(np.arange(12), 1)
        x1, x2 = tone_lookup[tone[0]]

    amp = np.random.normal(0.3, 0.2, 1)
    tone_waveform = build_tone(x1, x2, a=amp)

    # make the tones randomly positioned in time
    pads = np.random.randint(0, 2000)

    labels = np.zeros(6000)

    if (x1, x2) in tone_lookup.values():
        if rgh:
            print('randomly got here')
        # native python argmax: the key whose frequency tuple equals (x1, x2)
        target_index, frequency_pair = max(tone_lookup.items(), key=lambda x: (x1, x2) == x[1])
        labels[pads:(4000 + pads)] = target_index + 1

    # Downsample the labels naively to align with the convolutions' pooled outputs.
    # Two lines so the rate can change per layer without making the logic terse.
    labels = labels[::2]
    labels = labels[::2]

    padded_wave = np.pad(tone_waveform, [pads, 2000 - pads], 'constant', constant_values=(0, 0))
    padded_wave += np.random.normal(0, 0.01, size=padded_wave.shape)

    if printed:
        print(x1, x2)
        fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1, figsize=(18, 8))
        ax1.plot(padded_wave)
        ax1.set_xlim((0, 6000))
        ax2.plot(labels)
        ax2.set_xlim((0, 1500))
    return labels, padded_wave
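The fit call further down consumes a batch generator. Here’s a minimal sketch of what fit_gen might look like (an assumption on my part; it just stacks same-length synthetic examples into (batch, time, 1) arrays):

def fit_gen(batch_size):
    # Minimal sketch: batch synthetic examples for fit_generator.
    # Labels keep a trailing dim of 1 for sparse_categorical_crossentropy.
    while True:
        examples = [get_tones() for _ in range(batch_size)]
        labels = np.stack([l for l, _ in examples])[..., None]
        waves = np.stack([w for _, w in examples])[..., None]
        yield waves, labels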

OK. I think that will work for creating our tones. Now we can talk about features and neural networks. The Goertzel filter will operate on the waveform without any transformation of the signal first. One option would be to compute an FFT whose resolution assures that each frequency in our list is a primary contributor to its own frequency band, but that’s an extra transform, and if you’re going to do that, you should just use the FFT instead of a neural network, right?
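To make that aside concrete, here’s a rough sketch of the FFT route using freq_wave from above. A 4000-sample window at 8 kHz gives 2 Hz bin spacing, so each DTMF frequency straddles a pair of adjacent bins:

# Tone for key 1 (697 + 1209 Hz), no jitter, so the peaks are predictable.
spectrum = np.abs(np.fft.rfft(freq_wave(697) + freq_wave(1209)))
bins = np.fft.rfftfreq(4000, d=1/8000)
print(bins[spectrum > spectrum.max() / 2])  # peaks cluster around 697 and 1209 Hz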

So what kind of neural network can we use to identify frequencies in a timeseries? You might be tempted to use an RNN. That sounds perfect, right? You wouldn’t be wrong, but it’s also a little slow computation-wise, and maybe a little boring. We’re testing this against a Goertzel *filter*, so I want to explore a network architecture that takes advantage of translation-invariant filters. We’re going to use a 1-dimensional convolutional network. It’s faster to train, and I think it will work just fine.

We need to decide on a few parameters. How many filters? Window size? How many layers? Regularization? Pooling? Turns out I cheated and ran a few architectures first. I started with a low number of filters on a single layer, with a kernel size spanning a few periods of the lowest frequency, and built up from there. I landed on the parameters below, which converge pretty well after a couple epochs and yield ~99% accuracy.

I’m going to use Keras as a shortcut to build the Tensorflow network:

from keras.models import Model
from keras.layers import Input, Conv1D, Dense, TimeDistributed, MaxPool1D
from keras.optimizers import RMSprop
from keras.regularizers import l2

# Telephony signals are usually mu-law encoded at 8,000 samples per second.
sig = Input([None, 1])
convs = Conv1D(18, int(8000/40), kernel_regularizer=l2(1e-3), padding='same')(sig)
convs = MaxPool1D(pool_size=2)(convs)
convs = Conv1D(18, int(8000/40), padding='same', kernel_regularizer=l2(1e-3))(convs)
convs = MaxPool1D(pool_size=2)(convs)
# 13 classes: 0 = no tone, 1-12 = the DTMF keys. Softmax pairs with the
# sparse categorical crossentropy loss below.
out = TimeDistributed(Dense(13, kernel_regularizer=l2(1e-3), activation='softmax'))(convs)

model = Model(inputs=[sig], outputs=[out])
model.summary()

model.compile(optimizer=RMSprop(), loss='sparse_categorical_crossentropy', metrics=['accuracy'])

model.fit_generator(fit_gen(32), validation_data=fit_gen(16), validation_steps=100,
                    steps_per_epoch=500, epochs=10)

'''
Layer (type) Output Shape Param #
=================================================================
input_3 (InputLayer) (None, None, 1) 0
_________________________________________________________________
conv1d_5 (Conv1D) (None, None, 18) 3618
_________________________________________________________________
max_pooling1d_5 (MaxPooling1 (None, None, 18) 0
_________________________________________________________________
conv1d_6 (Conv1D) (None, None, 18) 64818
_________________________________________________________________
max_pooling1d_6 (MaxPooling1 (None, None, 18) 0
_________________________________________________________________
time_distributed_3 (TimeDist (None, None, 13) 247
=================================================================
Total params: 68,683
Trainable params: 68,683
Non-trainable params: 0
_________________________________________________________________

Epoch 1/10
500/500 [==============================] - 685s - loss: 0.3074 - acc: 0.8555 - val_loss: 0.0818 - val_acc: 0.9933
Epoch 2/10
500/500 [==============================] - 696s - loss: 0.0834 - acc: 0.9908 - val_loss: 0.0668 - val_acc: 0.9951
Epoch 3/10
500/500 [==============================] - 713s - loss: 0.0681 - acc: 0.9932 - val_loss: 0.0484 - val_acc: 0.9978
Epoch 4/10
500/500 [==============================] - 684s - loss: 0.0643 - acc: 0.9930 - val_loss: 0.0521 - val_acc: 0.9963
Epoch 5/10
500/500 [==============================] - 683s - loss: 0.0546 - acc: 0.9949 - val_loss: 0.0538 - val_acc: 0.9961
Epoch 6/10
500/500 [==============================] - 713s - loss: 0.0523 - acc: 0.9951 - val_loss: 0.0664 - val_acc: 0.9896
Epoch 7/10
500/500 [==============================] - 715s - loss: 0.0541 - acc: 0.9951 - val_loss: 0.0447 - val_acc: 0.9978
Epoch 8/10
500/500 [==============================] - 688s - loss: 0.0465 - acc: 0.9961 - val_loss: 0.0539 - val_acc: 0.9958
Epoch 9/10
500/500 [==============================] - 671s - loss: 0.0452 - acc: 0.9960 - val_loss: 0.0408 - val_acc: 0.9968
Epoch 10/10
500/500 [==============================] - 662s - loss: 0.0470 - acc: 0.9955 - val_loss: 0.0660 - val_acc: 0.9932
'''
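To spot-check the trained model on a fresh synthetic example (a minimal sketch using the pieces defined above):

labels, wave = get_tones()
# predict expects (batch, time, channels); output classes: 0 = no tone, 1-12 = keys
pred = model.predict(wave[None, :, None])
print(pred.shape)                  # (1, 1500, 13) after two rounds of pooling
print(np.unique(pred.argmax(-1)))  # predicted classes present in this clip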

So the neural network is running somewhere around 99% validation accuracy, and it trained in just under two hours on the CPU.

If I were going to characterize some features of this approach:

  • You could train something fairly flexible and noise robust; hopefully the network is learning its own rejection criteria
  • Given appropriate counterexamples, you could potentially learn sequences of tones fairly easily
  • Presumably you could learn more sophisticated signals than dual-tones, or an arbitrarily large set of signals, using a similar architecture

If I were going to talk about the challenges of this approach:

  • You might never know for sure what adversarial observations could trigger your filter
  • It’s unlikely that you could define a network that’s as efficient as a simple Goertzel filter for any single signal set
  • You have to synthesize or collect labeled examples
  • The math and signal processing difficulty is functionally equivalent to the simpler DSP filter techniques

We’ll see how much bad code I have to write to do this with a Goertzel filter, and whether the filter’s frequency magnitude response is wide enough to handle the same noise and amplitude modulations I’ve defined here.