

Chapter 5 Investigation Of The Number Of Inputs To The Scaly Neural Network Architecture



5.1 Introduction

As mentioned previously, in Section 2.2, neural network size becomes an important factor when it comes to practical implementation. In Chapter 4, a reduced neural network architecture type called a scaly neural network was applied to the recognition of the English alphabet. It was shown to be able to recognize small vocabularies, consisting of subsets of letters from the English alphabet, on a level comparable with a fully connected neural network architecture. However, the performance of the scaly network fell significantly when applied to the much harder task of recognizing the whole English alphabet. The performance of the fully connected network only fell slightly in comparison. Only one possible version of the scaly network architecture was investigated in Chapter 4 and many more permutations of this architecture can be implemented. In the work presented in this chapter the scaly neural network architecture is examined in more detail by investigating the number of input neurons to the network. An optimization of the number of input neurons for the task of learning a difficult subset of letters from the English alphabet is determined. The results obtained using this subset of letters are then compared with those obtained using the whole alphabet to establish whether the subset in question gives a good indication of how performance will be on the whole alphabet.


5.2 Experimental Procedure

The first feature of the scaly architecture to be investigated is the number of inputs to the network. All other parameters of the architecture are held constant while the number of inputs to the neural network is increased in steps and the resulting change in performance is observed. Speaker independent recognition is investigated, as opposed to the multiple speaker recognition explored in Chapter 4. For speaker independent recognition the network is tested with examples of speech from speakers it has never "heard" before. This form of recognition is more widely applicable in the real world, so it is more desirable for the network to perform well on it.
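The distinction between multiple speaker and speaker independent testing comes down to how the utterances are partitioned. The following is a minimal sketch, assuming each utterance carries a speaker identifier; the record layout, field names and speaker IDs are hypothetical and are not taken from the database used in this work.

```python
def split_by_speaker(utterances, test_speakers):
    """Partition utterances so that no test speaker appears in the training set."""
    test_speakers = set(test_speakers)
    train = [u for u in utterances if u["speaker"] not in test_speakers]
    test = [u for u in utterances if u["speaker"] in test_speakers]
    return train, test


# Hypothetical records, for illustration only.
utterances = [
    {"speaker": "s1", "letter": "B"},
    {"speaker": "s1", "letter": "E"},
    {"speaker": "s2", "letter": "B"},
    {"speaker": "s3", "letter": "E"},
]

train, test = split_by_speaker(utterances, test_speakers={"s3"})

# Speaker independence: the two sets share no speakers.
train_ids = {u["speaker"] for u in train}
test_ids = {u["speaker"] for u in test}
print(train_ids.isdisjoint(test_ids))
```

In a multiple speaker experiment, by contrast, the same speakers would contribute different utterances to both sets.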

From Chapter 4, it can be seen that there is a large drop in performance when networks are trained to recognize all the letters of the English alphabet compared to when they are trained to recognize the letters 'A' and 'B' only or 'A', 'B' and 'C' only. Because of the large number of simulations to be carried out, it is desirable to use a smaller subset of letters to gauge the network performance, since it takes a relatively long time to train a network to recognize the whole alphabet. However, the subsets of letters used in Chapter 4 did not give a good indication of how the same network would perform on the much harder task of recognizing the whole English alphabet. It was suggested, in the conclusion to Chapter 4, that there may have been too much of a jump in the difficulty of the task between recognizing these subsets and recognizing the whole alphabet. A subset of letters which represents a harder learning task than the letters 'A' and 'B' or 'A', 'B' and 'C' is employed here. The English alphabet contains several such subsets (see Section 1.2) and the networks are trained to recognize one of these, the E-set, {'B', 'C', 'D', 'E', 'G', 'P', 'T', 'V'}. It can then be determined whether the E-set gives a better indication of how the network will perform with recognition of the whole alphabet.

From Chapter 4 it can also be seen that the performance of the Trace Segmentation (TS) algorithm is comparable to that of the Dynamic Time Warping (DTW) algorithm and in some instances it performs better. The TS algorithm is much simpler computationally and does not require the use of a reference pattern. The TS algorithm is, therefore, employed as the time alignment algorithm to preprocess the speech signal so that it is in a suitable form to be input to the scaly neural network.
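The idea behind trace segmentation can be illustrated with a short sketch. It follows one common formulation, in which the utterance is treated as a trajectory through feature space and resampled at points equally spaced along the length of that trajectory. The details of the TS implementation used in this work are not given in this chapter, so the function name, the Euclidean distance measure and the linear interpolation below are all assumptions.

```python
import math


def trace_segmentation(frames, n_out):
    """Resample a list of feature vectors to n_out vectors that are
    equally spaced along the trace (path length) of the utterance."""
    if len(frames) < 2 or n_out < 2:
        raise ValueError("need at least two input frames and two output frames")

    # Cumulative path length up to each input frame (Euclidean distance assumed).
    cumulative = [0.0]
    for previous, current in zip(frames, frames[1:]):
        cumulative.append(cumulative[-1] + math.dist(previous, current))
    total = cumulative[-1]

    resampled = []
    segment = 0
    for k in range(n_out):
        position = total * k / (n_out - 1)
        # Advance to the segment of the trace that contains this position.
        while segment < len(frames) - 2 and cumulative[segment + 1] < position:
            segment += 1
        span = cumulative[segment + 1] - cumulative[segment]
        weight = 0.0 if span == 0.0 else (position - cumulative[segment]) / span
        start, end = frames[segment], frames[segment + 1]
        resampled.append([a + weight * (b - a) for a, b in zip(start, end)])
    return resampled


# Toy one-dimensional example: the repeated frame (a steady-state region)
# contributes no trace length, so it does not use up any output frames.
toy = trace_segmentation([[0.0], [1.0], [1.0], [3.0]], 4)
print(toy)
```

No reference pattern is needed because each utterance is normalized against its own trace, which is the property that makes TS cheaper than DTW.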

Woodland suggests setting the target value for the output node representing the current input pattern class to 0.9 and the target value for the other output nodes to 0.1 [61]. This should improve generalization since, during training, the network is not pushed to reach the more customary output levels of 0.0 and 1.0. Pushing the network in this way could lead to overlearning of the task, which it is important to avoid for speaker independent recognition. If overlearning occurred, the network would become specialized in recognizing the speech samples with which it was trained and would not perform well on samples of speech from "new" speakers it has never been presented with before. Woodland's suggestion is investigated further by using three sets of output node target values for each activation function. For the sigmoid01 function the three sets of output node target values investigated are {0, 1}, {0.1, 0.9} and {0.01, 0.99}, with the higher value indicating the current input pattern category and the lower value indicating the categories to which the input pattern does not belong. For the sigmoid11 function the three sets of output values investigated are {-1, 1}, {-0.9, 0.9} and {-0.99, 0.99}, and for the standard sigmoid function the three sets are {-1.71, 1.71}, {-1.70, 1.70} and {-1.60, 1.60}.
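A brief sketch may help to show what "pushing" the network means. It builds a target vector for one E-set letter and computes the net input an output node would need in order to reach a given target. It assumes that sigmoid01 is the logistic function 1 / (1 + exp(-x)); that form, and the helper names, are assumptions rather than definitions taken from this chapter.

```python
import math

E_SET = ["B", "C", "D", "E", "G", "P", "T", "V"]


def make_targets(class_index, n_classes, low, high):
    """Target vector: `high` for the current class, `low` for all other classes."""
    targets = [low] * n_classes
    targets[class_index] = high
    return targets


def logistic_net_input(target):
    """Net input x at which 1 / (1 + exp(-x)) equals `target` (assumed sigmoid01 form).

    Targets of exactly 0 or 1 lie on the asymptotes and need an infinite net input.
    """
    if target <= 0.0:
        return -math.inf
    if target >= 1.0:
        return math.inf
    return math.log(target / (1.0 - target))


# Target vector for the letter 'D' using the {0.1, 0.9} set.
targets_for_d = make_targets(E_SET.index("D"), len(E_SET), 0.1, 0.9)
print(targets_for_d)

# Net input required at the "on" node for each of the three high target values.
for high in (0.9, 0.99, 1.0):
    print(high, logistic_net_input(high))
```

Under this assumption a target of 0.9 is reached with a modest net input, 0.99 needs roughly twice as much, and 1.0 can never be reached exactly, so training towards it keeps driving the weights upwards.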

As previously mentioned, variation of the number of inputs to the network is investigated to determine an optimum which can then be used when investigating the other parameters of the scaly neural architecture. The size of the input zones and the overlap between them are kept the same for all the networks, with a zone size of 5 frames and an overlap of 3 frames being employed. A smaller zone size than was used for the work in Chapter 4 is employed so that networks with relatively small numbers of input neurons can be investigated here.

Networks with eight different sizes of input layer are trained to recognize the E-set: 11 input frames (88 input neurons); 15 input frames (120 input neurons); 21 input frames (168 input neurons); 25 input frames (200 input neurons); 31 input frames (248 input neurons); 35 input frames (280 input neurons); 41 input frames (328 input neurons) and 45 input frames (360 input neurons). The aim is to investigate the network input layer in sizes which step up as equally as possible. A 5 frame step in size of input layer was aimed at but restrictions on the network due to the requirements imposed in keeping the rest of the network parameters the same meant the steps in size had to be alternated between 4 and 6 input frames. Each network is trained over 2500 training sweeps with e = 0.01 and using the activation function sigmoid01. The same procedure is then followed using the activation functions sigmoid11 and the standard sigmoid function. The same procedure is then carried out again this time using the whole English Alphabet in place of the E-set.
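The arithmetic behind these input layer sizes can be checked with a short sketch. The figure of 8 input neurons per frame is inferred from the sizes quoted above (88 neurons for 11 frames). The zone count assumes that zones of 5 frames advance by 2 frames (zone size minus overlap) and must cover the input layer exactly; that tiling rule is an assumption about the scaly architecture, not a statement from the text.

```python
FEATURES_PER_FRAME = 8  # inferred: 88 input neurons / 11 input frames
ZONE_SIZE = 5           # frames per input zone
ZONE_OVERLAP = 3        # frames shared by adjacent zones
INPUT_SIZES = (11, 15, 21, 25, 31, 35, 41, 45)


def input_neurons(frames):
    """Number of input neurons for an input layer of `frames` frames."""
    return frames * FEATURES_PER_FRAME


def zone_count(frames, zone_size=ZONE_SIZE, overlap=ZONE_OVERLAP):
    """Number of overlapping zones that exactly cover `frames` frames (assumed tiling)."""
    stride = zone_size - overlap
    if frames < zone_size or (frames - zone_size) % stride != 0:
        raise ValueError(f"{frames} frames cannot be tiled exactly by the zones")
    return (frames - zone_size) // stride + 1


for frames in INPUT_SIZES:
    print(frames, input_neurons(frames), zone_count(frames))
```

Under this tiling assumption only odd frame counts are admissible, which would be one possible reason why a uniform 5 frame step could not be used and the steps alternate between 4 and 6 frames.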


5.3 Simulation Results For The E-Set

Tables 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7 and 5.8 show the performance and error after 2500 sweeps for eight networks of different input layer sizes using the three possible activation functions.

Performance indicates the percentage of all the utterances presented to the network which the network identifies correctly. On the training data the sigmoid11 and standard sigmoid activation functions achieve the best performances, attaining over 99% at best. The sigmoid01 function only manages a performance of 95.08% at best on the training data. However, it is the sigmoid01 function which attains the best performances on the test data. At best the sigmoid01 activation function achieves a performance of 66.69% on the test data, while the best performance of the sigmoid11 function on the test data is 63.17% and that of the standard sigmoid is 55.21%.

Error indicates the difference between the desired outputs and the actual outputs. A similar pattern is seen with the error rate results as was seen with performance. The sigmoid11 function achieves error values of less than 0.01777 on the training data and the standard sigmoid achieves values of less than 0.10143. The sigmoid01 function achieves error values of less than 0.02452 for the training data which is not as good as the sigmoid11 function but it is better than the results obtained with the standard sigmoid function. As happened with the results for performance the sigmoid01 function clearly gives the best error results on the test data with error values of less than 0.02698 while the sigmoid11 function achieves error values less than 0.05528 and the standard sigmoid achieves errors of less than 0.11602.
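The two measures can be sketched as follows. Performance is computed as defined above, taking the output node with the largest value as the network's decision. The exact error formula is not stated in this chapter, so the mean squared difference used here is an assumption, as are the function names and the toy values.

```python
def performance(outputs, labels):
    """Percentage of utterances whose largest output node matches the true class."""
    correct = 0
    for output, label in zip(outputs, labels):
        predicted = max(range(len(output)), key=output.__getitem__)
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)


def mean_squared_error(outputs, targets):
    """Mean squared difference between actual and desired outputs (assumed error measure)."""
    total = 0.0
    count = 0
    for output, target in zip(outputs, targets):
        for actual, desired in zip(output, target):
            total += (actual - desired) ** 2
            count += 1
    return total / count


# Toy two-class example, for illustration only (not data from the tables).
toy_outputs = [[0.1, 0.8], [0.7, 0.2], [0.4, 0.6]]
toy_labels = [1, 0, 0]
toy_targets = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]

print(performance(toy_outputs, toy_labels))
print(mean_squared_error(toy_outputs, toy_targets))
```

The two measures need not move together in general: an utterance can be classified correctly while still contributing a large error if the winning output is far from its target.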

Figure 5.1 shows how performance changes as the size of input layer is increased for the three possible sets of node target values on the training data with the activation function sigmoid01. Figure 5.2 shows the performance on a network with the same parameters but this time recognizing the test data. Figures 5.3 and 5.4 show the error values obtained as the size of input layer is increased for the three possible sets of node target values.

As can be seen from the graphs, there is not much difference in performance and error when the node target value sets {0, 1} and {0.01, 0.99} are used, but there is a big difference between these and the performance and error achieved with node target values {0.1, 0.9}. This set of output node target values achieves much lower performance and higher error and is therefore the least suitable of the three when using the sigmoid01 activation function.


Table 5.1 Error And Performance Of Network Of Input Layer Size 11 Frames In Recognizing The E-Set After 2500 Training Sweeps

Table 5.2 Error And Performance Of Network Of Input Layer Size 15 Frames In Recognizing The E-Set After 2500 Training Sweeps

Table 5.3 Error And Performance Of Network Of Input Layer Size 21 Frames In Recognizing The E-Set After 2500 Training Sweeps

Table 5.4 Error And Performance Of Network Of Input Layer Size 25 Frames In Recognizing The E-Set After 2500 Training Sweeps

Table 5.5 Error And Performance Of Network Of Input Layer Size 31 Frames In Recognizing The E-Set After 2500 Training Sweeps

Table 5.6 Error And Performance Of Network Of Input Layer Size 35 Frames In Recognizing The E-Set After 2500 Training Sweeps

Table 5.7 Error And Performance Of Network Of Input Layer Size 41 Frames In Recognizing The E-Set After 2500 Training Sweeps

Table 5.8 Error And Performance Of Network Of Input Layer Size 45 Frames In Recognizing The E-Set After 2500 Training Sweeps



As the number of frames in the input layer is increased there is a definite pattern of increasing performance and decreasing error. The improvement levels off until a point is reached where the increase in performance is no longer significant. In fact, the performance starts to fall off as the size of the input layer is increased further. A peak in performance occurs at around 35 input frames for both the training and test data when the target node values of {0, 1} and {0.01, 0.99} are used. The peak is around 30 input frames for {0.1, 0.9}, and in all cases the peak is more pronounced in the test data. The same is true for the error values: a point is reached where an increase in input layer size does not yield a significant drop in error. For all the target node values, the best error values are achieved at the same input layer size as the best performance.

Figures 5.5, 5.6, 5.7 and 5.8 show how the error and performance change as the input layer size is increased when the sigmoid11 activation function is used. In this case there is not such a big difference between the results achieved when the different sets of output node target values are employed. The difference is more pronounced on the test data with the performance of the node target values {-0.9, 0.9} being slightly less than the other two. The error rates this set attains are slightly higher in most cases indicating that this is the least suitable set to use. There is not much difference in the results obtained using the other two sets of output node target values {-1, 1} and {-0.99, 0.99}.

The same pattern for performance and error is seen as was observed for the sigmoid01 function. Performance increases and error decreases as the size of the input layer is increased until a point is reached where the change is no longer significant. The fall off in performance observed for the sigmoid01 function when the input layer size is increased beyond 35 frames is not present here, but the increase in performance after this point is very small.

Figures 5.9, 5.10, 5.11 and 5.12 show how the error and performance change as the input layer size is increased when the standard sigmoid activation function is used. In this set of results there is not much difference between the results obtained when the two sets of node target values {-1.71, 1.71} and {-1.70, 1.70} are used. The difference is more pronounced when the set of values {-1.61, 1.61} is used. Overall it achieves lower performances and higher error values than the other two and is therefore the least suitable of the three.


Figure 5.1 : Performance Of Network On Training Data With Sigmoid01 Activation Function

Figure 5.2 : Performance Of Network On Test Data With Sigmoid01 Activation Function

Figure 5.3 : Network Error On Training Data With Sigmoid01 Activation Function

Figure 5.4 : Network Error On Test Data With Sigmoid01 Activation Function

Figure 5.5 : Performance Of Network On Training Data With Sigmoid11 Activation Function

Figure 5.6 : Performance Of Network On Test Data With Sigmoid11 Activation Function

Figure 5.7 : Network Error On Training Data With Sigmoid11 Activation Function

Figure 5.8 : Network Error On Test Data With Sigmoid11 Activation Function

Figure 5.9 : Performance Of Network On Training Data With Standard Sigmoid Activation Function

Figure 5.10 : Performance Of Network On Test Data With Standard Sigmoid Activation Function

Figure 5.11 : Network Error On Training Data With Standard Sigmoid Activation Function

Figure 5.12 : Network Error On Test Data With Standard Sigmoid Activation Function



Again it is observed that performance increases and error decreases as the size of the input layer is increased. There is no fall off in performance on the training data set when the size of the input layer is increased beyond 35 frames, but a fall off can be observed with the test data. The performance on the training set does not increase significantly beyond 35 frames, and the error for both the training and test sets does not fall significantly beyond an input layer size of 35 frames.


5.4 Simulation Results For The Whole Alphabet

Tables 5.9, 5.10, 5.11, 5.12, 5.13, 5.14, 5.15 and 5.16 show the performance and error after 2500 sweeps for eight networks of different input layer sizes using the three possible activation functions. In this set of simulations the neural network is trained to recognize the whole English alphabet, as opposed to just the E-set as was the case for the simulations described in Section 5.3.

The sigmoid11 activation function achieves the best results on the training data, attaining performances of over 90%. The sigmoid01 function only manages to achieve performances of around 85-86% at best on the training data, but it gives better results on the test data set. The best performance achieved by the sigmoid11 function on the test data is 69.47%, while the sigmoid01 function achieves performances of over 70%. In all cases, the performance of the standard sigmoid function is inferior to that achieved by the other two activation functions.

The results obtained for error in all the networks reflect those obtained for performance. The neural networks which give good performances also give low error rates and those which perform poorly give high error rates.


Table 5.9 Error And Performance Of Network Of Input Layer Size 11 Frames In Recognizing The Whole English Alphabet After 2500 Training Sweeps

Table 5.10 Error And Performance Of Network Of Input Layer Size 15 Frames In Recognizing The Whole English Alphabet After 2500 Training Sweeps

Table 5.11 Error And Performance Of Network Of Input Layer Size 21 Frames In Recognizing The Whole English Alphabet After 2500 Training Sweeps

Table 5.12 Error And Performance Of Network Of Input Layer Size 25 Frames In Recognizing The Whole English Alphabet After 2500 Training Sweeps

Table 5.13 Error And Performance Of Network Of Input Layer Size 31 Frames In Recognizing The Whole English Alphabet After 2500 Training Sweeps

Table 5.14 Error And Performance Of Network Of Input Layer Size 35 Frames In Recognizing The Whole English Alphabet After 2500 Training Sweeps

Table 5.15 Error And Performance Of Network Of Input Layer Size 41 Frames In Recognizing The Whole English Alphabet After 2500 Training Sweeps

Table 5.16 Error And Performance Of Network Of Input Layer Size 45 Frames In Recognizing The Whole English Alphabet After 2500 Training Sweeps



Figure 5.13 shows how the performance of a network on the training data set changes as the size of the input layer of the network is increased. In this case the sigmoid01 activation function is employed and results are shown for three different sets of output node target values. Figure 5.14 shows how the performance on the test data is affected by changing the size of the input layer, and Figures 5.15 and 5.16 show how the error changes for the training data set and the test data set respectively.

A definite pattern can be seen in these graphs. As the number of frames in the input layer is increased the performance will, initially, increase and the error will decrease. A point is reached, at around 30-35 frames, where no significant improvement in the results can be seen. When the input layer size is increased beyond this point any improvement in performance or error is very small and eventually performance starts to fall off while the error increases.

It can be seen that the {0, 1} and {0.01, 0.99} sets give very similar results to each other but those achieved by the {0.1, 0.9} set include much lower performances and much higher error rates than the other two.

The results obtained for the sigmoid11 and standard sigmoid activation functions are graphed in the same manner as those for the sigmoid01 function. These graphs are shown in Figures 5.17 through 5.24. The same pattern is seen in these results as was observed for the sigmoid01 function. Performance initially increases and error decreases as the size of the input layer is increased. No significant improvement in the results is seen beyond 30-35 frames and, in fact, beyond this point the performance falls off and the error increases.


Figure 5.13 : Performance Of Network On Training Data With Sigmoid01 Activation Function

Figure 5.14 : Performance Of Network On Test Data With Sigmoid01 Activation Function

Figure 5.15 : Network Error On Training Data With Sigmoid01 Activation Function

Figure 5.16 : Network Error On Test Data With Activation Function Sigmoid01

Figure 5.17 : Performance Of Network On Training Data With Sigmoid11 Activation Function

Figure 5.18 : Performance Of Network On Test Data With Sigmoid11 Activation Function

Figure 5.19 : Network Error On Training Data With Sigmoid11 Activation Function

Figure 5.20 : Network Error On Test Data With Activation Function Sigmoid11

Figure 5.21 : Performance Of Network On Training Data With Standard Sigmoid Activation Function

Figure 5.22 : Performance Of Network On Test Data With Standard Sigmoid Activation Function

Figure 5.23 : Network Error On Training Data With Standard Sigmoid Activation Function

Figure 5.24 : Network Error On Test Data With Activation Function Standard Sigmoid




5.5 Discussion

5.5.1 Effect Of The Architecture

In all cases investigated, optimal performance and minimal error are achieved by an input layer size of 35 frames. From the results obtained it can be observed that performance increases and error decreases as the input layer size is increased. This remains true until the input layer size reaches around 30 frames, by which point the change in performance and error with input layer size has become very small. Beyond an input layer size of 35 frames any improvement in the results is very small and requires a large increase in network complexity to be achieved. When the size of the input layer is increased even further, the performance starts to fall off and the error increases in some cases. It can be concluded that a network with an input layer size of 35 frames gives optimal performance and minimal error without incurring too large a computational cost or increase in the complexity of the network. The convergence of the networks can be examined by plotting performance and error versus the number of training sweeps. Several examples of these plots are provided in Appendix A (A.15, A.16, A.17, A.18, A.19 and A.20). It is observed that when the number of inputs to the neural network is low the performance and error fluctuate throughout training, with the performance following an upward trend and the error following a downward trend. As the number of inputs to the neural network increases the convergence is much smoother, with much less fluctuation occurring during training.
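The trade-off described above, between performance and network complexity, can be expressed as a simple selection rule: choose the smallest input layer whose performance lies within a chosen tolerance of the best observed. The sketch below is only an illustration of that rule. The function name and the tolerance are hypothetical, and the scores are placeholder values invented for the example, not results from Tables 5.1 to 5.16.

```python
def smallest_sufficient_size(sizes, scores, tolerance):
    """Smallest input layer size whose score is within `tolerance` of the best score."""
    best = max(scores)
    for size, score in sorted(zip(sizes, scores)):
        if best - score <= tolerance:
            return size
    raise ValueError("no sizes given")


# Input layer sizes in frames, as used in this chapter.
sizes = [11, 15, 21, 25, 31, 35, 41, 45]

# Placeholder performance values (%), invented for illustration only.
placeholder_scores = [40.0, 48.0, 55.0, 59.0, 62.0, 64.0, 63.5, 63.0]

chosen = smallest_sufficient_size(sizes, placeholder_scores, tolerance=1.0)
print(chosen)
```

A larger tolerance favours smaller networks, while a tolerance of zero simply returns the size with the best score.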


5.5.2 Effect Of Activation Function

The sigmoid11 function gives the best results on the training data set for the task of recognizing either the E-set or the whole English alphabet. However, the sigmoid01 function gives the best results on the test data set for both of these tasks. The results obtained on the test data set are the most significant, since the purpose of this investigation was to determine which of the available activation functions gives the best performance for speaker independent recognition. For this reason the sigmoid01 function is identified as the best choice of the three possible activation functions. It should, however, be noted that the statements from the previous chapter are still relevant. The differences in performance achieved with the different activation functions may simply be due to different learning rates and the fact that the networks have not fully converged when training is ended.


5.5.3 Effect Of Node Target Values

For each activation function, one set of output node target values is clearly the poorest choice and can be discarded from further investigation. For the sigmoid01 function it is the set {0.1, 0.9}, for the sigmoid11 function it is the set {-0.9, 0.9} and for the standard sigmoid function it is the set {-1.61, 1.61}. There is very little observable difference between the results obtained for the other two output node target value sets explored for each activation function. The set of output node target values which performs poorest in each case also results in more fluctuation of performance and error during training. The other two choices result in much smoother convergence of the performance and error, which can be seen most clearly in the networks with more input units.


5.5.4 Use Of the E-Set

The subset of letters from the English Alphabet known as the E-set was investigated. The E-set consists of the letters B, C, D, E, G, P, T and V which are relatively hard to distinguish from each other because they sound similar. In the simulations described in this chapter, the networks and parameters which gave best results for the E-set also gave optimal results in the task of recognizing the whole English alphabet. There was a significant drop in performance between the two tasks when the training data was being recognized but the results for the test data were comparable. In some cases, networks which gave good performance when recognizing the E-set performed even better when recognizing the whole alphabet. It would appear, therefore, that the performance of a network on the E-set is a good gauge of how the network will perform in recognizing the whole alphabet.


5.6 Conclusions

The scaly type architecture neural network, when used in conjunction with the trace segmentation nonlinear time alignment algorithm, is shown to be promising for speaker independent recognition of letters of the English alphabet. The sigmoid activation function over the interval 0 to 1 appears to be the most suitable for speaker independent recognition when used with node target values of {0, 1} or {0.01, 0.99}. For the task in question, an input layer size of around 35 frames proves sufficient for optimal performance on the training and test data. The E-set proved to be very suitable for testing options to indicate which will yield the best results on the whole English alphabet.

