

Chapter 6 Further Investigation Of The Scaly Neural Network Architecture



6.1 Introduction

In Chapter 5 the input layer size of the scaly neural network architecture was investigated for recognition of the E-set from the English alphabet. The performance of a series of networks with a range of input layer sizes was examined while the other variable features of the scaly architecture were kept constant. It was determined that, to achieve the best possible performance for speaker independent recognition without using too complex a network, an optimal size for the input layer was 35 frames. It was also established that the sigmoid01 activation function was the most effective for speaker independent recognition.

Variation of the output node target values was examined by testing how each of the networks performed when three different sets of target values were applied on the output nodes. For each of the activation functions available two sets of values were identified which gave comparable performances. One set of values out of each three was found to perform poorly compared to the other two and could be omitted from any further investigation.

Performance was then determined on the full English alphabet to determine whether the E-set gave good indication of how any network architecture would perform on recognizing the whole alphabet. It was found that networks and features which gave the best results for recognizing the E-set also gave the best results for the full English alphabet. The E-set can therefore be used to test properties of a network and the results will indicate which of those will work best for the full alphabet.

There are more features of the scaly neural network architecture which can be varied and may lead to improved performance. In the work presented in this chapter, the number of hidden neurons, the input zone size and the input zone overlap associated with the scaly architecture are varied and optimized to give the best performance on recognition of the English alphabet.


6.2 Experimental Procedure

Three features of the scaly architecture are examined in this chapter: the number of neurons in the hidden layer, the size of the input zones and the size of the overlap of the input zones. All other architectural parameters are kept the same while the amount by which the input zones are overlapped is increased and the change in performance is observed. This process is carried out for all possible sizes of input zone. In carrying out this process, all possible sizes of hidden layer are encountered for a scaly neural network with an input layer size of 35 frames. As in the work described in Chapter 5, speaker independent recognition is investigated, since this form of recognition is more widely applicable in the real world and it is therefore more desirable for the network to achieve good performance with it.
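The relationship between input zone size, zone overlap and hidden layer size can be sketched as follows. The formula is inferred from the two configurations reported in Section 6.3 (a zone of 16 frames with an overlap of 15 gives 20 hidden frames; a zone of 2 frames with an overlap of 1 gives 34), and is not a definition taken from the text. The function name `hidden_frames` and the rule that zones must tile the input exactly are assumptions made for illustration.

```python
def hidden_frames(n_input, zone, overlap):
    """Number of hidden frames when zones of `zone` input frames, each
    sharing `overlap` frames with its neighbour, cover `n_input` frames.

    Returns None when the zones do not tile the input exactly
    (an assumed validity rule, not one stated in the text).
    """
    if not (0 <= overlap < zone <= n_input):
        return None
    step = zone - overlap            # frames by which each zone is shifted
    if (n_input - zone) % step != 0:
        return None
    return (n_input - zone) // step + 1


# Every exactly-tiling (zone, overlap) pair for a 35-frame input layer,
# starting from the smallest zone size of 2 frames.
configs = {}
for zone in range(2, 36):
    for overlap in range(0, zone):
        h = hidden_frames(35, zone, overlap)
        if h is not None:
            configs[(zone, overlap)] = h

print(hidden_frames(35, 16, 15))      # 20, as reported in Section 6.3
print(hidden_frames(35, 2, 1))        # 34, as reported in Section 6.3
print(sorted(set(configs.values())))  # hidden layer sizes reached by the sweep
```

Under these assumptions, sweeping every zone size and overlap reaches every hidden layer size from 1 to 34 frames, which is consistent with the statement that all possible hidden layer sizes are encountered.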

From Chapter 5, it was determined that investigating the performance of networks on the recognition of the E-set from the English alphabet gives a good indication of how a network will perform on recognition of the whole alphabet. A large number of simulations have to be carried out in examining how varying the size of the input zones and the amount by which they are overlapped affects the performance of the neural network. It is therefore desirable to use a smaller subset of letters to gauge the network performance, since the network can then be trained and tested in a shorter amount of time. For this reason the neural networks are trained to recognize the E-set of letters, and the results from these simulations can be used to determine an optimal network structure which will also give good performance in the task of recognizing the whole English alphabet.

From Chapter 4 it was established that the performance of the Trace Segmentation (TS) algorithm is comparable to that of the Dynamic Time Warping (DTW) algorithm and in some instances it performs better. The ability of networks to achieve good performance when the TS algorithm is employed was further demonstrated in Chapter 5. There the algorithm was used in a speaker independent recognition system and promising results were obtained for the recognition of spoken letters from the English alphabet. The TS algorithm is much simpler computationally than the DTW algorithm and does not require the use of a reference pattern. The TS algorithm is, therefore, employed again as a time alignment algorithm to preprocess the speech signal so that it is in a suitable form to be input to the scaly neural network.
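The general idea of trace segmentation can be sketched as below: the utterance is treated as a path through feature space, and a fixed number of frames is taken at equal distances along that path, so no reference pattern is needed. This is a generic illustration and not the implementation described in Chapter 4. The function name `trace_segment`, the Euclidean distance measure and the linear interpolation between neighbouring frames are all assumptions.

```python
import math


def trace_segment(frames, n_out):
    """Resample a sequence of feature vectors to `n_out` frames spaced
    equally along the trace (cumulative path length) of the sequence."""
    if n_out < 2 or len(frames) < 2:
        raise ValueError("need at least two input and two output frames")

    # Cumulative distance travelled through feature space at each frame.
    cum = [0.0]
    for prev, cur in zip(frames, frames[1:]):
        cum.append(cum[-1] + math.dist(prev, cur))
    total = cum[-1]
    if total == 0.0:                       # all frames identical
        return [list(frames[0]) for _ in range(n_out)]

    out = []
    j = 0                                  # index of the current segment
    for k in range(n_out):
        target = total * k / (n_out - 1)   # position along the trace
        while j < len(cum) - 2 and cum[j + 1] < target:
            j += 1
        seg = cum[j + 1] - cum[j]
        t = 0.0 if seg == 0.0 else min(1.0, (target - cum[j]) / seg)
        out.append([a + t * (b - a) for a, b in zip(frames[j], frames[j + 1])])
    return out


# Hypothetical three-frame utterance with one feature per frame,
# resampled to four frames.
print(trace_segment([[0.0], [1.0], [3.0]], 4))
```

In the setting of this chapter, `n_out` would be 35 so that every utterance fills the 35-frame input layer.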

Three possible sigmoid activation functions were investigated in Chapter 5 for the speaker independent recognition of the English alphabet. The sigmoid01 function did not give the best performance overall but it did yield the best performance and lowest errors on the test data. This is the more important result in determining which function should be used, because when a speaker independent system is put into practical use it is the performance on data it has never seen before that is most crucial. For this reason the sigmoid01 activation function is employed in the following set of simulations.

As mentioned in Chapter 5, Woodland suggests setting the target value for the output node of the network representing the current input pattern class to 0.9 and the target value for the other output nodes to 0.1 [61]. With these values the networks are expected to be less prone to overlearning of a task and therefore to give better generalization performance. This is significant in speaker independent recognition because the network should not learn to recognize the speech it is trained with too well. The network must be capable of dealing with the variations that occur in speech when the same word or letter is spoken by different people. If the network becomes too specialized in the recognition of the speech it is trained with, it will perform poorly in recognizing the same words or letters when spoken by voices it has never "heard" before. Three sets of node target values were investigated for each sigmoid activation function. For the sigmoid01 function the sets {0, 1}, {0.1, 0.9} and {0.01, 0.99} were investigated. The {0.1, 0.9} set was discarded since its performance was much poorer than that of the other two. There was little to distinguish the performances of the other two, so both are used in the following simulations to determine whether the use of {0.01, 0.99} in place of the more usual {0, 1} provides any significant improvement in network performance.
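The effect of the different target sets can be illustrated with a small sketch. It assumes that sigmoid01 is the standard logistic function with outputs between 0 and 1, which the text does not state explicitly. The helper names `make_targets` and `required_net_input` are hypothetical, and the four-class example is illustrative only.

```python
import math


def sigmoid01(x):
    """Logistic function with outputs in (0, 1); the assumed form of sigmoid01."""
    return 1.0 / (1.0 + math.exp(-x))


def make_targets(class_index, n_classes, low, high):
    """Target vector with `high` on the correct output node, `low` elsewhere."""
    return [high if i == class_index else low for i in range(n_classes)]


def required_net_input(target):
    """Net input at which sigmoid01 equals `target` (0 < target < 1)."""
    return math.log(target / (1.0 - target))


# Target vectors for class 2 of a hypothetical four-class problem.
for low, high in ((0.0, 1.0), (0.01, 0.99), (0.1, 0.9)):
    print(make_targets(2, 4, low, high))

# Net input an output node needs in order to hit the "high" target exactly.
for target in (0.9, 0.99):
    print(target, round(required_net_input(target), 3))
```

Under this assumption a target of 0.9 is reached at a net input of about 2.2 and a target of 0.99 at about 4.6, whereas a target of exactly 1 is never reached for any finite net input, so training towards {0, 1} keeps pushing the weights to grow. This is one common way of explaining why softened targets are expected to reduce overlearning.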

The size of the input layer was investigated in Chapter 5 and it was found that, on the test data in speaker independent mode, an input layer size of 35 frames produced the best performance and the lowest errors with minimal complexity of network architecture. For this reason, the size of the input layer is kept constant at 35 frames for all of the simulations described in this chapter. It should be noted that 35 frames is around the average size of the speech samples contained in the British Telecom English alphabet database.


6.3 Simulation Results

Tables 6.1, 6.2, 6.3 and 6.4 show the results obtained for each scaly network configuration on both data sets after 2500 training sweeps.


Table 6.1 Error And Performance Of Various Scaly Network Architectures In Recognizing The E-Set After 2500 Training Sweeps

[Error And Performance Of Various Scaly Network Architectures In Recognizing 
         The E-Set After 2500 Training Sweeps]



Table 6.2 Error And Performance Of Various Scaly Network Architectures In Recognizing The E-Set After 2500 Training Sweeps

[Error And Performance Of Various Scaly Network Architectures In Recognizing 
         The E-Set After 2500 Training Sweeps]



Table 6.3 Error And Performance Of Various Scaly Network Architectures In Recognizing The E-Set After 2500 Training Sweeps

[Error And Performance Of Various Scaly Network Architectures In Recognizing 
         The E-Set After 2500 Training Sweeps]



Table 6.4 Error And Performance Of Various Scaly Network Architectures In Recognizing The E-Set After 2500 Training Sweeps

[Error And Performance Of Various Scaly Network Architectures In Recognizing 
         The E-Set After 2500 Training Sweeps]



From these results it can be seen that the best performance obtained on the training data is 98.14% with the output node target values {0.01, 0.99} and 97.98% with the values {0, 1}. The network architecture that attains these performances has a zone size of 16 frames and an overlap of 15 frames. The number of hidden frames in this architecture is 20 and the total number of weights in the network is 6906. This configuration also achieves the lowest error values with the {0.01, 0.99} set achieving an error value of 0.00285 and the {0, 1} set achieving 0.00247.

The best performance achieved on the test data is 66.69% with the {0.01, 0.99} set and 67.68% with the {0, 1} set. The lowest error values obtained by these sets are 0.02220 and 0.02208 respectively. These optimal values are achieved by a different network configuration from that which gives the best results on the training data set. A scaly network architecture with a zone size of 2 frames and an overlap of 1 frame gives the best results on the test data set. This architecture has 34 hidden frames and a total of 7914 weights in the network.


6.3.1 Size Of Input Zone

The results described in the previous section are graphed in two different ways. First they are graphed to investigate how performance and error are affected by the size of the input zone, and then to examine how they are affected by the overlap of the input zones.

Several examples of how the performance and error change with input zone size when the input zone overlap is kept constant are shown in Figures 6.1 through 6.10. Figure 6.1 shows how performance changes with input zone size when the input zone overlap is kept constant at 3 frames. Figure 6.2 shows how the error changes as the size of the input zone is increased and the size of the overlap is kept constant at 3. As the input zone size increases the performance decreases significantly for both the training data and the test data. The error values achieved increase significantly as the size of the input zones is increased. This is true for both sets of output node target values.


[Performance Of Scaly Networks With Input Zone Overlap = 3]

Figure 6.1 : Performance Of Scaly Networks With Input Zone Overlap = 3



[Error In Scaly Networks With Input Zone Overlap = 3]

Figure 6.2 : Error In Scaly Networks With Input Zone Overlap = 3



When the size of the input zone is small there is not much difference in the performances or error values attained with the different sets of output node target values. As the input zone size is increased the difference becomes much more significant, but the performances and error values are poor and so are of no interest.

Figures 6.3 through 6.10 show performance and error versus input zone size for increasing sizes of overlap of the input zones. Several graphs were chosen which show good examples of the overall pattern seen in all cases. Performance always decreases as the input zone size is increased and the error values always increase as the input zone size is increased.

The effect of using different output node target values is similar for all the sizes of zone overlap. The difference in performance and error is always very small when smaller sizes of input zone are used. In some cases, even for the larger sizes of input zone, the performance and error curves never diverge by much. This happens once zone overlaps of greater than 15 frames are investigated. Beyond this point, the performance and error versus zone size curves for the two output node target values differ by very small amounts.

Where significant divergence does occur between the performance and error curves resulting from the use of the two different output node target value sets it is the {0, 1} set which gives the better results. As mentioned previously, the divergence occurs at larger input zone sizes where performance has fallen off as compared to that achieved at smaller input zone sizes and these cases are of no interest since optimal performance is the goal.


[Performance Of Scaly Networks With Input Zone Overlap = 5]

Figure 6.3 : Performance Of Scaly Networks With Input Zone Overlap = 5



[Error In Scaly Networks With Input Zone Overlap = 5]

Figure 6.4 : Error In Scaly Networks With Input Zone Overlap = 5



[Performance Of Scaly Networks With Input Zone Overlap = 7]

Figure 6.5 : Performance Of Scaly Networks With Input Zone Overlap = 7



[Error In Scaly Networks With Input Zone Overlap = 7]

Figure 6.6 : Error In Scaly Networks With Input Zone Overlap = 7



[Performance Of Scaly Networks With Input Zone Overlap = 11]

Figure 6.7 : Performance Of Scaly Networks With Input Zone Overlap = 11



[Error In Scaly Networks With Input Zone Overlap = 11]

Figure 6.8 : Error In Scaly Networks With Input Zone Overlap = 11



[Performance Of Scaly Networks With Input Zone Overlap = 17]

Figure 6.9 : Performance Of Scaly Networks With Input Zone Overlap = 17



[Error In Scaly Networks With Input Zone Overlap = 17]

Figure 6.10 : Error In Scaly Networks With Input Zone Overlap = 17



6.3.2 Size Of Input Zone Overlap

Figure 6.11 shows how performance is affected by the size of the overlap of the input zones when the input zone size is kept constant at 11 frames. Figure 6.12 shows how the error values change as the overlap of the input zones is increased for a constant input zone size of 11 frames. As the input zone overlap is increased the performance increases and the error rate decreases. This is the case for both the training data and the test data and occurs for both of the possible sets of output node target values.

There is very little difference in the performance or the error of the networks when either the {0, 1} or the {0.01, 0.99} output node target value sets is used. Neither of these sets of values gives significantly better results on either the training or the test data.

Figures 6.13 through 6.18 show the same results for increasingly larger sizes of input zone. Several graphs were chosen which show good examples of the overall pattern seen in all cases. In all cases the same pattern is seen where performance increases and error rates fall off as the size of the overlap between the input zones is increased.

In most cases there is very little divergence between the curves produced with the two sets of output node target values, on either the training data or the test data. In the few cases where a noteworthy difference in performance or error occurs between the two sets it is, for the most part, the {0, 1} set which produces the better result. Significant differences occur when the size of the input zone overlap is low and poorer performance and error values are being obtained, so the result is of no relevance when looking at networks which give optimal performance.


[Performance Of Scaly Networks With Input Zone Size = 11]

Figure 6.11 : Performance Of Scaly Networks With Input Zone Size = 11



[Error In Scaly Networks With Input Zone Size = 11]

Figure 6.12 : Error In Scaly Networks With Input Zone Size = 11



[Performance Of Scaly Networks With Input Zone Size = 15]

Figure 6.13 : Performance Of Scaly Networks With Input Zone Size = 15



[Error In Scaly Networks With Input Zone Size = 15]

Figure 6.14 : Error In Scaly Networks With Input Zone Size = 15



[Performance Of Scaly Networks With Input Zone Size = 19]

Figure 6.15 : Performance Of Scaly Networks With Input Zone Size = 19



[Error In Scaly Networks With Input Zone Size = 19]

Figure 6.16 : Error In Scaly Networks With Input Zone Size = 19



[Performance Of Scaly Networks With Input Zone Size = 23]

Figure 6.17 : Performance Of Scaly Networks With Input Zone Size = 23



[Error In Scaly Networks With Input Zone Size = 23]

Figure 6.18 : Error In Scaly Networks With Input Zone Size = 23



6.4 Discussion

The purpose of this investigation was to determine an optimal scaly architecture neural network for the task of recognizing spoken letters from the English alphabet. Performance must be as high as possible and the error must be minimized on the test data set. It is required that the best results are obtained on the test data because the goal is to achieve the best performance in speaker independent recognition so the test data results are more significant.


6.4.1 Size Of Input Zone

The general trend seen for the input zone is that performance decreases and error values rise as the input zone size is increased while the size of the input zone overlap is kept constant. In two particular cases of input zone overlap this pattern does not appear. These cases occur when the input zone overlap is equal to 0 frames and 31 frames. Here the performance of the neural architecture increases as the input zone size is increased and the error falls off. The performances achieved in these cases are lower than the best performances achieved elsewhere, so they are not of interest in this investigation and can be eliminated from any further study.

In all other cases it can be stated that better performance and lower error rates can be obtained by having as small an input zone as possible with any particular size of input zone overlap. This observation supports the suggestion that speech is localized [16]. The results seen here indicate that, in this case, the speech data is highly localized. Recognition of portions of speech is principally dependent on the data immediately prior to and after the portion of speech in question. This is true for the initial processing of the speech data carried out between the input layer and the hidden layer. Here very small portions of the speech signal are being processed when the input zone size is small. The network is fully connected between the hidden layer and the output layer so the final processing and the decision about which letter is being input to the network is dependent upon the complete speech signal of the utterance being present.
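The locality described above can be made concrete by listing which input frames each hidden frame is connected to. The sketch below assumes that zones are laid out from the first input frame with a fixed shift of zone size minus overlap; the function name `zone_spans` and the 0-based frame indices are choices made for illustration.

```python
def zone_spans(n_input, zone, overlap):
    """Return the (first, last) input-frame indices, 0-based and inclusive,
    seen by each hidden frame of a locally connected layer."""
    step = zone - overlap
    if step <= 0 or zone > n_input:
        raise ValueError("need 0 <= overlap < zone <= n_input")
    return [(start, start + zone - 1)
            for start in range(0, n_input - zone + 1, step)]


small = zone_spans(35, 2, 1)     # best configuration on the test data
large = zone_spans(35, 16, 15)   # best configuration on the training data

print(len(small), small[:3])     # each hidden frame sees two adjacent frames
print(len(large), large[:3])     # each hidden frame sees sixteen frames
```

With a zone of 2 frames, every hidden frame depends only on a frame and its immediate neighbour, and it is the fully connected output layer that combines these local views into a decision about the whole utterance.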


6.4.2 Size Of Input Zone Overlap

When the size of the input zone is kept constant and the size of the input zone overlap is increased it is observed that the performance of the neural network increases. The error rates are seen to decrease as the size of the input zone overlap is increased.

As mentioned in the previous section, recognition of portions of speech is highly dependent upon the data immediately prior to and after the portion in question. Regardless of the size of the zone, the larger the overlap between the zones, the more data is used in processing that portion of speech, so performance is expected to improve, and this is what is observed. The smaller input zones, which provide better performance as discussed in Section 6.4.1, interconnect only small portions of speech and give good results. The more these zones are overlapped, the more significant data is included in processing the data in that zone and in passing information on to the hidden layer of the neural network. The larger input zones do not give results as good as the smaller zones, but their performance still improves the more these zones are overlapped.
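One way to see why a larger overlap brings more data to bear on each portion of speech is to count how many zones each input frame belongs to. The sketch below does this for an input zone size of 11 frames; the function name `zones_per_frame` and the particular overlaps tried are illustrative assumptions, chosen so that the zones tile the 35-frame input exactly.

```python
def zones_per_frame(n_input, zone, overlap):
    """For each input frame, count the zones (hidden frames) it feeds."""
    step = zone - overlap
    if step <= 0 or zone > n_input:
        raise ValueError("need 0 <= overlap < zone <= n_input")
    counts = [0] * n_input
    for start in range(0, n_input - zone + 1, step):
        for i in range(start, start + zone):
            counts[i] += 1
    return counts


for overlap in (3, 7, 10):
    counts = zones_per_frame(35, 11, overlap)
    print(overlap, max(counts), sum(counts))
```

As the overlap grows, each frame contributes to more hidden frames (at most 2 with an overlap of 3, but 11 with an overlap of 10), so every stretch of speech is examined in more contexts before the output layer makes its decision.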


6.4.3 Size Of Hidden Layer

In Section 6.4.2 it was observed that, to obtain optimal performance with any size of input zone, as large an overlap as possible must be used. In all cases, this is one frame less than the size of that zone. The results shown in Figures 6.19 and 6.20 are obtained by taking each possible size of input zone and using an overlap of one frame less than the input zone size. As the input zone size is increased the number of frames in the hidden layer is reduced and the effect of the size of the hidden layer on the performance of the network can be observed.
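The link between zone size and hidden layer size at maximum overlap can be written as a short worked relation. The symbols N (input frames), Z (zone size), V (overlap) and H (hidden frames) are introduced here for illustration, and the general formula is inferred from the configurations reported in Section 6.3 rather than quoted from the text.

```latex
H = \frac{N - Z}{Z - V} + 1,
\qquad
V = Z - 1 \;\Rightarrow\; H = N - Z + 1 .
% With N = 35: Z = 2 gives H = 34, Z = 16 gives H = 20, Z = 35 gives H = 1.
```

With the overlap fixed at its maximum, each additional frame of zone size therefore removes exactly one frame from the hidden layer.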

It can be seen that performance does stay fairly high for all sizes of hidden layer when recognizing the training data set. There is a fall off in performance at either end of the graph when the size of the hidden layer is either very small or very large. With the test data set it can be seen that the better performance is achieved as the size of the hidden layer is increased. The best performance on the testing data is in fact when the hidden layer is as large as possible which is 34 frames. The same is also true for the error rates with the lowest error incurred on the test data occurring when the size of the hidden layer is 34 frames.


[Performance Of Scaly Networks]

Figure 6.19 : Performance Of Scaly Networks



[Error In Scaly Networks]

Figure 6.20 : Error In Scaly Networks



From the findings described in Sections 6.4.1 and 6.4.2 it can be concluded that, for optimal speaker independent recognition using a scaly neural network, the size of the input zone must be kept as small as possible and the overlap must be as large as possible for that particular input zone size. The smallest input zone size possible with a scaly architecture is 2 frames and the largest overlap that can be implemented with this size of input zone is 1 frame. This is also the network architecture that results in a hidden layer size of 34 frames, which gave the best results on the test data as seen in Figures 6.19 and 6.20.

In section 6.3 it was stated that the best overall performance achieved on the test data was 67.68% with the output node target value set {0, 1} and 66.69% with the {0.01, 0.99} set. These results were those achieved by the network with an input zone size of 2 frames and an overlap of 1 frame. It was also stated that this network achieved the lowest error rates with the {0, 1} set achieving 0.02208 and the {0.01, 0.99} set achieving 0.0222.


6.4.4 Output Node Target Values

It was observed in section 6.3 that very little divergence is seen in the curves obtained for the two sets of output node target values examined here. This is true for the results of performance and error. Where any divergence in the results does occur it is the {0, 1} set which gives the best results in the majority of cases. This does not agree with Woodland's theory that the use of a set of values, other than the more usual {0, 1}, can result in better performance especially in speaker independent recognition [61]. Woodland used the values 0.1 and 0.9 as the target values on the output of a neural network. The use of these alternative values should result in there being less chance of overlearning the task for which the network is being trained. Overlearning leads to the network being too specialized in the recognition of the speech samples it has already been presented during training and it therefore performs poorly on examples of speech it has never seen before.

In Figures 6.19 and 6.20, which contain the best results obtained with the scaly architecture, there is very little difference between the results obtained using the two different sets of values. With the training data the two curves are almost on top of one another, and with the test data they are very close, although the {0, 1} set gives slightly better results overall. The same is true of the error rates, with the results on the training data being very close and, on the test data, the {0, 1} set giving slightly lower error rates overall.

6.5 Conclusions

In using a scaly architecture neural network for recognition of letters from the English alphabet it is found that the input zone size should be kept as small as possible and the overlap of the input zones should be kept as large as possible for the input zone size being implemented. When this is true for both of these properties of the scaly architecture, the largest size of hidden layer for that particular input layer size is implemented. Investigation of the use of different output target value sets demonstrated that no significantly better results were achieved using the alternative output values to the more usual 0 and 1. Where the network architecture was configured such that the performance and error were optimized, using 0 and 1 as the target values on the output nodes actually provided the better results.

There is a significant difference in performance and error rates obtained with the test data as compared to the training data. Several examples of how performance and error change during training are presented in Appendix A (Figures A.21 through A.32). Performance and error are plotted against the number of training sweeps and it can be seen that convergence is taking place. The rate of learning with the test data falls off more rapidly than with the training data, so the results are much poorer. This is more pronounced in the networks with fewer hidden units, where the rate of learning on the test data goes to approximately zero very early on while performance on the training data is still increasing. This could be the result of the network learning the training data so well that it performs very poorly on the test data. It has become so specialized at recognition of the speech examples in the training data that it cannot generalize to include samples of speech that it has never seen before but that do belong to the categories it is being trained to recognize.

A scaly network with an input zone size of 2 frames, a zone overlap of 1 frame and a hidden layer size of 34 frames was found to give the best overall performance and error for speaker independent recognition. A performance of 67.68% was achieved on the test data set, which is the most significant result for a speaker independent recognition system.

