Algorithms and Architectures

Edited by Cornelius T. Leondes

Elsevier Trade Monographs, 1997

ISBN: 9780080498980, 460 pages



Statistical Theories of Learning in Radial Basis Function Networks


Jason A.S. Freeman    Centre for Cognitive Science, University of Edinburgh, Edinburgh EH8 9LW, United Kingdom

Mark J.L. Orr    Centre for Cognitive Science, University of Edinburgh, Edinburgh EH8 9LW, United Kingdom

David Saad    Department of Computer Science and Applied Mathematics, University of Aston, Birmingham B4 7ET, United Kingdom

I INTRODUCTION


There are many heuristic techniques described in the neural network literature for performing various tasks within the supervised learning paradigm, such as optimizing training, selecting an appropriately sized network, and predicting how much data will be required to achieve a particular generalization performance. The aim of this chapter is to explore these issues in a theoretically well-founded manner for the radial basis function (RBF) network. We will be concerned with issues such as using cross-validation to select network size, growing networks, regularization, and calculating the average and worst-case generalization performance. Two RBF training paradigms will be considered: one in which the hidden units are fixed on the basis of statistical properties of the data, and one in which the hidden units adapt continuously throughout the training period. We also probe the evolution of the learning process over time to examine, for instance, the specialization of the hidden units.

A RADIAL BASIS FUNCTION NETWORK


RBF networks have been successfully employed in many real-world tasks, in which they have proved to be a valuable alternative to multilayer perceptrons (MLPs). These tasks include chaotic time-series prediction [1], speech recognition [2], and data classification [3]. Furthermore, the RBF network is a universal approximator for continuous functions, given a sufficient number of hidden units [4]. The RBF architecture consists of a two-layer fully connected network (see Fig. 1), with an input layer which performs no computation. For simplicity, we use a single output node throughout the chapter; it computes a linear combination of the outputs of the hidden units, parametrized by the weights w between the hidden and output layers. The defining feature of an RBF network, as opposed to other neural networks, is that its basis functions (the transfer functions of the hidden units) are radially symmetric.

Figure 1 The radial basis function network. Each of the N components of the input vector ξ feeds forward to K basis functions whose outputs are linearly combined with weights $\{w_b\}_{b=1}^{K}$ into the network output f(ξ).

The function computed by a general RBF network is therefore of the form

$$f(\boldsymbol{\xi};\mathbf{w}) = \sum_{b=1}^{K} w_b\, s_b(\boldsymbol{\xi}), \tag{1}$$

where ξ is the vector applied to the input units and $s_b$ denotes basis function b.

The most common choice for the basis functions is the Gaussian, in which case the function computed becomes

$$f(\boldsymbol{\xi};\mathbf{w}) = \sum_{b=1}^{K} w_b \exp\!\left(-\frac{\|\boldsymbol{\xi} - \mathbf{m}_b\|^2}{2\sigma_b^2}\right), \tag{2}$$

where each hidden node is parametrized by two quantities: a center $\mathbf{m}_b$ in input space, corresponding to the vector defined by the weights between the node and the input nodes, and a width $\sigma_b$.
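As a concrete sketch of the forward pass in Eq. (2), a few lines of NumPy suffice (the function and variable names here are ours, chosen for illustration, not taken from the chapter):

```python
import numpy as np

def rbf_forward(xi, centers, widths, weights):
    """Gaussian RBF network output, Eq. (2):
    f(xi) = sum_b w_b * exp(-||xi - m_b||^2 / (2 sigma_b^2)).

    xi: (N,) input vector; centers: (K, N); widths, weights: (K,).
    """
    sq_dist = np.sum((centers - xi) ** 2, axis=1)          # ||xi - m_b||^2 for each b
    activations = np.exp(-sq_dist / (2.0 * widths ** 2))   # basis function outputs s_b(xi)
    return float(weights @ activations)                    # linear combination, Eq. (1)
```

Note that the hidden layer involves no trainable nonlinearity beyond the distance computation; all K basis outputs are combined by a single dot product.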

Other possibilities include using Cauchy functions and multiquadrics. Functions that decrease in value as one moves toward the periphery are most frequently utilized; this issue is discussed in Section II.

There are two commonly employed methods for training RBFs. One approach involves fixing the parameters of the hidden layer (both the basis function centers and widths) using an unsupervised technique such as clustering, setting a center on each data point of the training set, or even picking random values (for a review, see [5]). Only the hidden-to-output weights are adaptable, which makes the problem linear in those weights. Although fast to train, this approach often results in suboptimal networks because the basis function centers are set to fixed values. This method is explored in Section II, in which methods of selecting and training optimally sized networks using techniques such as cross-validation and ridge regression are discussed. Forward selection, an advanced method of selecting the centers from a large fixed pool, is also explored. The performance that can be expected from fixed-hidden-layer networks is calculated in Section III, using both Bayesian and probably approximately correct (PAC) frameworks.
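A minimal sketch of this first approach (our own illustration, with assumed helper names: centers picked at random from the training inputs, a single common width, and a linear least-squares fit of the output weights):

```python
import numpy as np

def design_matrix(X, centers, width):
    """S[p, b] = exp(-||xi^p - m_b||^2 / (2 width^2)) (Gaussian basis)."""
    sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * width ** 2))

def fit_fixed_hidden_rbf(X, y, n_centers, width, seed=0):
    """Fix the hidden layer (centers = random training points, shared
    width), then fit only the hidden-to-output weights, which is a
    linear least-squares problem."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_centers, replace=False)]
    S = design_matrix(X, centers, width)
    weights, *_ = np.linalg.lstsq(S, y, rcond=None)
    return centers, weights
```

Because only the output weights adapt, training reduces to one matrix factorization; the price, as noted above, is that badly placed centers cannot be corrected afterwards.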

The alternative is to adapt the hidden-layer parameters, either just the center positions or both center positions and widths. This renders the problem nonlinear in the adaptable parameters, and hence requires an optimization technique, such as gradient descent, to estimate these parameters. The second approach is computationally more expensive, but usually leads to greater accuracy of approximation. The generalization error that can be expected from this approach can be calculated from a worst-case perspective, under the assumption that the algorithm finds the best solution given the available data (see Section III). It is perhaps more useful to know the average performance, rather than the worst-case result, and this is explored in Section IV. This average-case approach provides a complete description of the learning process, formulated in terms of the overlaps between vectors in the system, and so can be used to study the phenomenology of the learning process, such as the specialization of the hidden units.
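As an illustrative sketch of the second approach (our own simplification: widths are held fixed and plain batch gradient descent is used), one descent step on the sum-squared error with respect to both the output weights and the center positions might look like:

```python
import numpy as np

def sse(X, y, centers, widths, weights):
    """Sum-squared error of a Gaussian RBF network over the batch."""
    diff = X[:, None, :] - centers[None, :, :]                    # (P, K, N)
    act = np.exp(-(diff ** 2).sum(axis=2) / (2.0 * widths ** 2))  # (P, K)
    return float(((act @ weights - y) ** 2).sum())

def gradient_step(X, y, centers, widths, weights, lr=1e-2):
    """One batch gradient-descent step adapting the output weights and
    the center positions (widths kept fixed in this sketch)."""
    diff = X[:, None, :] - centers[None, :, :]
    act = np.exp(-(diff ** 2).sum(axis=2) / (2.0 * widths ** 2))
    err = act @ weights - y                                       # (P,) residuals
    grad_w = 2.0 * act.T @ err
    # dE/dm_b = 2 sum_p err_p * w_b * act_pb * (xi^p - m_b) / sigma_b^2
    grad_m = 2.0 * np.einsum('p,pb,pbn->bn', err, act, diff)
    grad_m *= (weights / widths ** 2)[:, None]
    return centers - lr * grad_m, weights - lr * grad_w
```

Each step now costs a full pass over the data, and the error surface has local minima, which is the computational price of the extra accuracy.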

II LEARNING IN RADIAL BASIS FUNCTION NETWORKS


A SUPERVISED LEARNING


In supervised learning problems we try to fit a model of the unknown target function to a training set D consisting of noisy sampled input–output pairs:

$$D = \left\{\left(\boldsymbol{\xi}^p, \hat{y}^p\right)\right\}_{p=1}^{P}. \tag{3}$$

The caret (hat) in $\hat{y}^p$ indicates that this value is a sample of a stochastic variable, $y^p$, which has a mean, $\bar{y}^p$, and a variance, $\sigma_p^2$. If we generated a new training set with the same input points, $\{\boldsymbol{\xi}^p\}_{p=1}^{P}$, we would get a new set of output values, $\{\hat{y}^p\}_{p=1}^{P}$, because of the random sampling. The outputs are not completely random, however, and it is their deterministic part, as a function of the input, which we seek to estimate in supervised learning.
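This sampling assumption is easy to make concrete (a toy sketch; the target function and noise level are ours): each output is the deterministic target value plus zero-mean noise, so regenerating the set with the same inputs changes only the noise.

```python
import numpy as np

def sample_training_set(target, inputs, noise_std, seed=None):
    """Draw yhat^p = ybar(xi^p) + noise of variance noise_std^2, as in
    Eq. (3); the mean function ybar is what learning estimates."""
    rng = np.random.default_rng(seed)
    X = np.asarray(inputs, dtype=float)
    y_hat = target(X) + noise_std * rng.standard_normal(len(X))
    return list(zip(X, y_hat))
```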

If the weights, $\{w_b\}_{b=1}^{K}$, which appear in the model provided by an RBF network [defined by Eq. (1)], were the only part of the network to adapt during training, then this model would be linear. That would imply a unique minimum of the usual sum-squared-error cost function,

$$E(\mathbf{w}; D) = \sum_{p=1}^{P} \left(f(\boldsymbol{\xi}^p; \mathbf{w}) - \hat{y}^p\right)^2, \tag{4}$$

which can be found by a straightforward computation (the bulk of which is the inversion of a square matrix of size K). There would be no confusion caused by local minima and no need for computationally expensive gradient descent algorithms. Of course, the difficulty is in determining the right set of basis functions, $\{s_b\}_{b=1}^{K}$, to use in the model (1). More likely than not, if the training set is ignored when choosing the basis functions, we will end up having too many or too few of them, putting them in the wrong places, or giving them the wrong sizes. For this reason we have to allow other model parameters (as well as the weights) to adapt during learning, and this inevitably leads to some kind of nonlinear algorithm involving something more complicated than just a matrix inverse.
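The "straightforward computation" for the linear case is just the normal equations; a sketch, assuming the design matrix S (with S[p, b] = s_b(ξ^p)) has full column rank:

```python
import numpy as np

def solve_output_weights(S, y_hat):
    """Unique minimiser of the sum-squared error, Eq. (4), for fixed
    basis functions: solve (S^T S) w = S^T yhat. The bulk of the work
    is the solve/inversion involving the K x K matrix S^T S."""
    A = S.T @ S                          # K x K
    return np.linalg.solve(A, S.T @ y_hat)
```

In practice an orthogonal factorization (as used by `numpy.linalg.lstsq`) is numerically preferable to forming S^T S explicitly, but the operation count is the same order.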

However, as we shall see, even though we cannot escape nonlinearity in the learning problem, we are not thereby restricted to algorithms which construct a vector space of dimension equal to the number of adaptable parameters and search it for a good local minimum of the cost function (the usual approach with neural networks). This section investigates alternative approaches in which the linear character of the underlying model is to the fore in both the analysis (using linear algebra) and the implementation (using matrix computations).

The section is divided as follows. It begins with some review material before describing the main learning algorithms. First, Section II.B reminds us why, if the model were linear, the cost function would have a single minimum and how it could be found with a single matrix inversion. Section II.C describes bias and variance, the two main sources of error in supervised learning, and the trade-off which occurs between them. Section II.D describes some cost functions, such as generalized cross-validation (GCV), which are better than sum-squared error for effective generalization. This completes the review material, and the next two subsections describe two learning algorithms, both modern refinements of techniques from linear regression theory. The first is ridge regression (Section II.E), a crude type of regularization, which balances bias and variance by varying the amount of smoothing until GCV is minimized. The second is forward selection (Section II.F), which balances bias and variance by...