Deep Learning: Deep Feedforward Networks (Part 4)

Architecture Design

The word architecture refers to the overall structure of the network: how many units it should have and how these units should be connected to each other.
Most neural networks are organized into groups of units called layers. Most neural network architectures arrange these layers in a chain structure, with each layer being a function of the layer that preceded it. In this structure, the first layer is given by

h^{(1)} = g^{(1)}\left( W^{(1)\top} x + b^{(1)} \right)

the second layer is given by
h^{(2)} = g^{(2)}\left( W^{(2)\top} h^{(1)} + b^{(2)} \right)

and so on.
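As a concrete illustration of this chain structure, here is a minimal NumPy sketch of a two-layer forward pass. The layer sizes, the tanh hidden activation, and the identity output activation are arbitrary choices for the example, not prescribed by the text.

```python
import numpy as np

def layer(h_prev, W, b, g):
    # One chain layer: h = g(W^T h_prev + b)
    return g(W.T @ h_prev + b)

rng = np.random.default_rng(0)

# Example sizes: 3 inputs -> 4 hidden units -> 2 outputs (arbitrary)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 2)), np.zeros(2)

x = rng.normal(size=3)
h1 = layer(x, W1, b1, np.tanh)       # first layer, g^(1) = tanh
y = layer(h1, W2, b2, lambda z: z)   # second (output) layer, linear g^(2)
print(y.shape)  # (2,)
```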
In these chain-based architectures, the main architectural considerations are to choose the depth of the network and the width of each layer. As we will see, a network with even one hidden layer is sufficient to fit the training set. Deeper networks are often able to use far fewer units per layer and far fewer parameters, and often generalize better to the test set, but they are also often harder to optimize. The ideal network architecture for a task must be found via experimentation guided by monitoring the validation set error.
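A rough sketch of what this experimentation can look like in practice, assuming a toy regression task and scikit-learn's MLPRegressor (the candidate depths and widths are arbitrary): each candidate architecture is trained and the one with the lowest validation error is kept.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(2000, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=2000)  # toy regression task

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

best = None
for depth in (1, 2, 3):            # number of hidden layers
    for width in (4, 16, 64):      # units per hidden layer
        net = MLPRegressor(hidden_layer_sizes=(width,) * depth,
                           max_iter=2000, random_state=0)
        net.fit(X_tr, y_tr)
        val_err = np.mean((net.predict(X_val) - y_val) ** 2)
        if best is None or val_err < best[0]:
            best = (val_err, depth, width)

print("best (validation MSE, depth, width):", best)
```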

Universal Approximation Properties and Depth

  • At first glance, we might presume that learning a nonlinear function requires designing a specialized model family for the kind of nonlinearity we want to learn. Fortunately, feedforward networks with hidden layers provide a universal approximation framework.
  • Specifically, the universal approximation theorem:
    A feedforward network with a linear output layer and at least one hidden layer with any “squashing” activation function (such as the logistic sigmoid activation function) can approximate any Borel measurable function from one finite-dimensional space to another with any desired non-zero amount of error, provided that the network is given enough hidden units.
  • For our purposes it suffices to say that any continuous function on a closed and bounded subset of R^n is Borel measurable and therefore may be approximated by a neural network.
  • A neural network may also approximate any function mapping from any finite dimensional discrete space to another.
  • The universal approximation theorem means that regardless of what function we are trying to learn, we know that a large MLP will be able to represent this function.
  • However, we are not guaranteed that the training algorithm will be able to learn that function. Even if the MLP is able to represent the function, learning can fail for two different reasons.
    (1) First, the optimization algorithm used for training may not be able to find the value of the parameters that corresponds to the desired function.
    (2) Second, the training algorithm might choose the wrong function due to overfitting.

  • The “no free lunch” theorem shows that there is no universally superior machine learning algorithm.

  • There is no universal procedure for examining a training set of specific examples and choosing a function that will generalize to points not in the training set.

In summary, a feedforward network with a single hidden layer is sufficient to represent any function, but the layer may be infeasibly large and may fail to learn and generalize correctly. In many circumstances, using deeper models can reduce the number of units required to represent the desired function and can reduce the amount of generalization error.
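As a small numerical illustration of the role of width, the sketch below approximates a continuous function on a bounded interval with a single hidden layer of logistic sigmoid units, fitting only the linear output layer on top of randomly drawn hidden weights. This is a weaker setting than the theorem (which allows all weights to be chosen), but the approximation error still tends to shrink as the hidden layer is made wider; the target function and the widths tried are arbitrary choices for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 400).reshape(-1, 1)
f = np.sin(2 * x) + 0.5 * x  # a continuous target on a closed, bounded set

for width in (2, 8, 32, 128):
    # One hidden layer of sigmoid units with random weights; only the
    # linear output layer (plus a bias column) is fit, by least squares.
    W = rng.normal(size=(1, width))
    b = rng.normal(size=width)
    H = np.column_stack([sigmoid(x @ W + b), np.ones(len(x))])
    w_out, *_ = np.linalg.lstsq(H, f, rcond=None)
    err = np.max(np.abs(H @ w_out - f))
    print(f"width={width:4d}  max |error| = {err:.3f}")
```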

  • In many cases, the number of hidden units required by the shallow model is exponential in n. Many modern neural networks use rectified linear units. Leshno et al. (1993) demonstrated that shallow networks with a broad family of non-polynomial activation functions, including rectified linear units, have universal approximation properties, but these results do not address the questions of depth or efficiency: they specify only that a sufficiently wide rectifier network could represent any function.

  • Functions representable with a deep rectifier net can require an exponential number of hidden units with a shallow (one hidden layer) network. Montufar et al. (2014) showed that piecewise linear networks (which can be obtained from rectifier nonlinearities or maxout units) can represent functions with a number of regions that is exponential in the depth of the network.

  • More precisely, the main theorem in Montufar et al. (2014) states that the number of linear regions carved out by a deep rectifier network with d inputs, depth l, and n units per hidden layer, is

    O\left( \binom{n}{d}^{d(l-1)} n^{d} \right)

    i.e., exponential in the depth l.

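To get a sense of the growth this bound implies, here is a small sketch that simply evaluates the expression inside the O(·) for a few depths; the particular values of d and n are arbitrary choices for the example.

```python
from math import comb

def region_bound(d, n, l):
    # Evaluate C(n, d)^(d*(l-1)) * n^d, the quantity inside the
    # O(.) bound of Montufar et al. (2014).
    return comb(n, d) ** (d * (l - 1)) * n ** d

d, n = 2, 8  # number of inputs and units per hidden layer (arbitrary)
for l in (1, 2, 3, 4):
    print(f"depth {l}: bound on number of linear regions = {region_bound(d, n, l):,}")
```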

Any time we choose a specific machine learning algorithm, we are implicitly stating some set of prior beliefs we have about what kind of function the algorithm should learn.
(1) Choosing a deep model encodes a very general belief that the function we want to learn should involve composition of several simpler functions. This can be interpreted from a representation learning point of view as saying that we believe the learning problem consists of discovering a set of underlying factors of variation that can in turn be described in terms of other, simpler underlying factors of variation.
(2) Alternately, we can interpret the use of a deep architecture as expressing a belief that the function we want to learn is a computer program consisting of multiple steps, where each step makes use of the previous step’s output.
(3) Empirically, greater depth does seem to result in better generalization for a wide variety of tasks.


Other Architectural Considerations

Many neural network architectures have been developed for specific tasks. Specialized architectures for computer vision called convolutional networks are described in Chapter 9. Feedforward networks may also be generalized to recurrent neural networks for sequence processing, described in Chapter 10, which have their own architectural considerations.

  • In general, the layers need not be connected in a chain, even though this is the most common practice. Many architectures build a main chain but then add extra architectural features to it, such as skip connections going from layer i to layer i + 2 or higher. These skip connections make it easier for the gradient to flow from output layers to layers nearer the input (see the sketch after this list).
  • Another key consideration of architecture design is exactly how to connect a pair of layers to each other. In the default neural network layer described by a linear transformation via a matrix W, every input unit is connected to every output unit. Many specialized networks in the chapters ahead have fewer connections, so that each unit in the input layer is connected to only a small subset of units in the output layer. These strategies for reducing the number of connections reduce the number of parameters and the amount of computation required to evaluate the network, but are often highly problem-dependent.
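As a rough illustration of the skip-connection idea from the list above, here is a minimal NumPy sketch of a forward pass in which layer 3 receives the output of layer 1 in addition to that of layer 2 (a skip from layer i to layer i + 2). The common layer width and the tanh activations are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8  # keep every layer the same width so the skip path can be added directly

W1, W2, W3 = (rng.normal(scale=0.5, size=(dim, dim)) for _ in range(3))
b1, b2, b3 = np.zeros(dim), np.zeros(dim), np.zeros(dim)

x = rng.normal(size=dim)
h1 = np.tanh(W1 @ x + b1)        # layer 1
h2 = np.tanh(W2 @ h1 + b2)       # layer 2
h3 = np.tanh(W3 @ h2 + b3 + h1)  # layer 3 also receives h1 via a skip connection
print(h3.shape)  # (8,)
```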