## Chapter 2 Neural Networks For Speech Recognition## 2.1 IntroductionNeural Networks are, in essence, biologically inspired networks since they are based on the current understanding of the biological nervous system. In essence they are comprised of a network of densely interconnected simple processing elements which perform in a manner analogous to the most elementary functions of a biological neuron.A brief history of the development of neural networks and a basic introduction to their theory is outlined in this chapter. Reduced connectivity neural networks are discussed and the scaly architecture neural network is described in detail. ## 2.2 The PerceptronThe idea of the simple neuron model first emerged in the 1940s with the work of McCulloch and Pitts [37]. The cybernetics movement which ensued attempted to combine biology, psychology, engineering and mathematics resulting in architectures for networks of neurons which would perform a number of tasks. In 1949, Hebb's book [17] put forward the theory of neural networks developing "internal representations" related to experience.In the 1950s, research continued initially into the development of networks to perform specific tasks but this changed and the goal became to develop machines that could learn. By the end of that decade there had been a lack of significant developments and work in this field diminished considerably. In the 1960s, interest was revived with the publication of a book by Rosenblatt [47] where he defined the concept of the perceptrons and laid down many theories about them. In their simplest form, these processing elements, also known as, nodes or artificial neurons, have the structure illustrated in figure 2.1. It was proved theoretically that a perceptron could learn to perform any task as long as it was possible to program it to do so. A set of inputs ( X to _{1}X) is applied to each node
representing the inputs from the outside world or, alternatively,
they may be outputs from other nodes. Each input is multiplied by a weight (_{n}W
to _{1}W) associated with the node input to which it is connected and the
weighted inputs are then summed together.
_{n}A threshold value ( C) local for each node is added to the weighted summation and the
resulting sum is then passed through a hard limiting activation function (F).
The output of a node is thereforeThe perceptron effectively splits the input patterns into two distinct regions with one region being represented by a 1 on the output and the other a 0. Rosenblatt's training algorithm for the perceptron would converge if the input patterns to the perceptron were linearly separable. The perceptron would therefore approximate the decision boundary between the two classes of outputs. Perceptrons were successfully trained to perform certain tasks but there were failures that could not be overcome. Minsky and Papert pointed out the serious shortcomings of perceptrons [34] and interest in the study of neural networks again declined. The Exclusive-Or (Ex-Or) function is a major illustration of the limitation of perceptrons. For the ex-or function an output of 1 is generated if the inputs are {0,1} or {1,0} and an output of 0 is generated if the inputs are {0,0} or {1,1}. This is not a linearly separable function so the perceptron cannot learn it. A more complicated decision surface is required here and it was found that a curved decision surface is required to separate the two classes of inputs. ## 2.3 The Multi-Layer PerceptronMinsky and Papert had proposed a solution to the problem posed by functions such as the ex-or. They suggested that an extra layer of nodes with non-linear activation functions could be introduced. The output would now be a non-linear combination of the inputs so more complicated decision surfaces could be represented. The problem that remained was that no training algorithm was available to train such a network of perceptrons at the time.During the 1970s more research turned towards the representation of knowledge and away from learning and many new ideas were developed. Then, in the 1980s, there was a resurgence of interest in neural networks and it was during this time that an effective algorithm, called back propagation, for the training of multi-layer perceptron (MLP) structures was developed [48]. 2.3.1 The Error Back Propagation Algorithm
Error back propagation is a gradient descent algorithm where weights and biases are adjusted to minimize a cost function equal to the mean square error in the network. For a 3-layer neural network with N input nodes and M
output nodes, the network's weights are initially set to small random values. An
input/output vector pair p is presented to the network with input vectorx_{p0} , x_{p1}, ..., x_{pN-1}t_{0} , t_{1}, ..., t_{M-1}From this input vector an output vector is produced by the network which can then be compared to the target output vector. If there is no difference between the produced and target output vectors no learning takes place. Otherwise the weights are changed to reduce the difference. The weights are adapted using a recursive algorithm which starts at the output nodes and works back to the hidden layer. The error in the network when training pair p is presented is defined as;t is the target value for the _{pj}jth
element of the output pattern from the training pair p.o is the actual value produced by the network for the _{pj}j
element of the output pattern when the_{th}input pattern from the training pair p is presented to its
input.The overall error is therefore j iswis the weight from the _{ji}(t)ith node of the previous layer to the
jthnode at time twhen the input/outputpair p is presented to the network.ois the output of the _{pi}ith node of the previous layer.
A non-linear activation function is employed in each node such that the output of node j is;To implement a gradient descent the negative derivative of Ewith respect
to _{p}w must be proportional to the change in the weight _{ij}w,
_{ij}
Therefore,_{pwji}.As mentioned earlier, the negative derivative of E with respect to
_{p}w must be proportional to the change in the weight _{ij}w,
_{ij}D to implement a gradient descent._{p}Therefore, is a small constant known as the learning rate. Then, w_{ji} (t+1) = w_{ji} (t) + hd_{pjfj'}(net_{pj})o_{pi}where : w is the weight from the _{ji}(t+1)ith node to the jth after adjustment.Equation (2.14) is known as the standard delta rule and defines how weights are changed after the presentation of a training pair. The activation function must be non-linear since, otherwise, the neural network would perform a linear transformation at each layer and could therefore be reduced to its equivalent single layer network. The effectiveness of the extra layer of perceptrons is then lost. The activation must also be differentiable as required by equation (2.11). The sigmoid function is the one most often used since it meets all the requirements. 2.3.2 The Fully Connected Neural Network
The most common form of neural network is the 3-layer, fully connected, feed forward MLP, an example of which is shown in figure 2.2. The nodes are arranged in 3 layers; an input layer, a hidden layer and an output layer with inputs flowing in the forward direction from input layer to output layer through the hidden layer except during training. In this type of network, the inputs of every node in the hidden layer are connected to the outputs of every node in the input layer and the inputs of every node in the output layer are connected to the outputs of every node in the hidden layer. The nodes in the input layer are used to monitor the external signals input to the neural network and the neurons in the output layer are used to make the final decision and transmit the signal produced to the outside world. The hidden layer is where the major problem solving occurs. The values of the weights on the nodes of the hidden layer constitute the 'internal representations' of the net and are based on the networks experience of the training data. The more hidden units there are the more complex a decision surface is formed. However, if there are two many hidden units the network will possess too much modeling power and will learn the training set along with any inherent idiosyncrasies. It will then have poor generalization and may not be able to recognize new data not contained in the training set. Some rules of thumb are offered in determining a good number of hidden units for neural networks but few seem to be based on any solid theory. At the time the work in this thesis was begun the most common method of determining an optimal number of hidden units was trial and error. Much research has been carried out since and other methods have been presented as an alternative. Most go hand in hand with methods to reduce the connectivity of the network and are described in more detail in the next section of this chapter. Neural networks are trained using training pairs which consist of an attribute or feature vector and a class vector. The input feature vector is applied to the network and the output of the network calculated. The output is compared with the target output (the class vector) and the difference between the two fed back to the network. A training algorithm is used to adjust the weights of each node to minimize the error. Feature vectors are applied sequentially and the weights adjusted until the error for the whole training set is at an acceptably low level. During the training process weights will gradually converge towards the values which will match the input feature vector to the desired output. The most common approach to training is to use a gradient descent algorithm such as the error back propagation algorithm described earlier in section 2.3.1. Neural networks have many potential benefits. They are massively parallel, which provides high computation rates. They contain a degree of robustness compared to sequential computers due to the large number of nodes whose connections are primarily local. Overall performance need not be impaired significantly due to damage to a few nodes or links. There is also the improvement in performance with time which makes them suitable for tasks such as the one in question here of speech recognition [31]. Two of the major problems in speech recognition have been due to fluctuations on the speech pattern time axis and spectral pattern variation [51]. Neural networks have proved to be useful tools for the task of discriminating between different categories of spectral patterns such as those obtained for speech signals. Many of these advantages are not applicable to neural networks which are implemented on sequential computers, as is most often the case. The benefits are realized when parallel computing technology is employed. There has been much research on the implementation of neural networks on highly parallel computing architectures and typical neural network algorithms map successfully onto SIMD (Single Instruction stream, Multiple Data streams) architectures[39]. It should be noted the HMMs have also been implemented in parallel bringing many of the associated advantages. There are now several different types of neural networks and training algorithms. Some, like Kohonen's self organizing network [25], have been applied to speech recognition. However, the work in this thesis is concerned with a special class MLP neural structure trained with the error back propagation algorithm. 2.3.3 The Scaly Neural Network Architecture
A problem with fully connected neural networks is their size. When it comes to the practical implementation of neural networks size becomes an important factor. In hardware implementation this becomes a problem of scaling and the number of components required. In the case of computer simulation the problem comes with the large computational cost. The larger and more complex the network the longer it takes to train and once trained it takes longer for the network to perform its recognition task. For this work the concern is with the computational cost. In general, it is not known what size of network works best for a given task and this problem is unlikely to be resolved since each task demands different capabilities from the network [21]. As with the decision about the number of hidden units, the size of network tends to be decided on a trial and error basis and the designers own experience. One such approach is to start with a fully connected network and then gradually prune it. This involves removing those weights which are close to zero and contribute little to the solution [4]. Another approach is to base the network architecture on prior knowledge of the structure of the input data. The network topology can then be arranged to reflect the data structure. Examples of this approach are Demichelis et al [10], Hampshire and Waibel [15], Sakoe et al [51] and Krause and Hackbarth [26]. All of these structures are similar in that the input nodes are grouped into zones which are connected to one node or a group of nodes in the hidden layer. Krause and Hackbarth used a 'scaly' type architecture to reduce the number of connections in a network. They showed that performance of a neural network does not necessarily improve as the number of connections between the nodes is increased. This architecture uses overlapped zones and was shown to support high recognition rates for isolated word recognition. Figure 2.3 shows an example of a neural network where scaly architecture has been applied between the input layer and the hidden layer. This is the approach adopted for the work here since the localized structure of the input zones is somewhat analogous to the cochlear processing which occurs in the human ear [16]. A preliminary investigation into the ability of a scaly architecture neural network was carried out by A.D. Smith at the University of Newcastle Upon Tyne [53]. Smith's work suggested that further investigation of this network was required to better determine the effect of changing the parameters of the network. The work in this thesis is the continuation of that preliminary work. The input data is processed (chapter 3) and presented to the network such that successive feature vectors or frames are presented to the network inputs, each coefficient of the feature vectors being presented to one of the input nodes. The hidden nodes are also grouped into frames of the same size as those in the input layer. The input frames are further grouped into input zones with each frame of hidden nodes being assigned to a zone of input nodes. Each hidden node is connected to its equivalent node in the input frames of the zone associated with that hidden frame. The input zones overlap such that some of the input frames connected to one hidden frame will also be connected to the adjacent hidden frame. In the example network shown in figure 2.3, the network is capable of taking input patterns of 35 feature vectors containing 8 coefficients. To accommodate this, an architecture with 280 input nodes is required. The scaly neural network therefore consists of an input layer with 35 frames each of 8 nodes. The number of frames in a zone is taken as 10 frames with an overlap of 5 frames so the number of frames required in the hidden layer is 6. The hidden layer therefore consists of 6 frames each of 8 nodes equaling 48 nodes in total. The number of output classes is 26, so 26 nodes are required in the output layer. A disadvantage of the reduced connectivity is that some of the robustness of the neural network is lost but large savings in computational cost are gained. An example of the saving in computational cost can be demonstrated by comparing the number of operations required to calculate the output of the scaly network shown in figure 2.3 and a fully connected network with the same size layers. The output of each node is calculated using equation 2.1 and the number of operations required is shown in table 2.1. In this case, the computational cost is reduced by nearly two-thirds by employing the scaly architecture. It should be mentioned that several new methods of determining the optimal number of hidden units and the interconnectivity of units within a neural network have been developed since the work in this thesis was begun. Tresp, Hollatz and Ahmad [57] demonstrate the use of rule-based knowledge to determine the structure of the network prior to training. With this approach the network already has some knowledge of the task for which it is being developed that can be refined and altered with training. Neural networks have been developed where the network architecture is determined by the learning algorithm during the training phase. They use incremental learning algorithms which modify the architecture by removing or adding hidden units and/or the connections between them. The "Grow And Learn" (GAL) algorithm [1] starts from a simple network structure and adds units and/or connections to decrease the error. As new units are added some of the previously added units may become redundant and can be eliminated. Levin, Leen and Moody [29] introduce an algorithm which prunes the network after training has taken place by removing the nodes and connections which have the least effect on error. ## 2.4 SummaryAn introduction to the basic theory of neural networks in general and scaly neural networks in particular has been given in this chapter. This thesis deals with the performance of scaly networks for the task of speech recognition and how they compare with fully connected networks in the same task. The next chapter will deal with the specific problem of preprocessing the speech signals so that they can be presented in an appropriate form to the neural network inputs. |