Naoki Yokoyama

Why Do We Need Activation Functions?

  • In general, a standard neuron learns a set of weights and a bias, which it uses to linearly transform its input into its output.
  • But a neural net made up entirely of these neurons can only compute a single linear function of its input, no matter how many layers it has, so the extra layers add nothing (see the sketch after this list).
  • An activation function is responsible for deciding whether to “activate” each node, turning it on or off.
  • With an adequate net architecture, this introduction of nonlinearity will allow our net to go from a simple linear function to a universal function approximator!
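
To see why depth alone does not help, here is a minimal NumPy sketch (the layer sizes and random weights are arbitrary choices for illustration): two stacked linear layers collapse into a single linear layer, and only a nonlinearity between them prevents the collapse.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))                      # example input
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=(3,))
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=(2,))

# Two linear layers stacked with no activation in between.
two_layers = W2 @ (W1 @ x + b1) + b2

# The exact same mapping as one linear layer: W = W2 W1, b = W2 b1 + b2.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))      # True: the second layer added nothing

# Inserting a nonlinearity (here ReLU) between the layers breaks the collapse.
nonlinear = W2 @ np.maximum(0, W1 @ x + b1) + b2
```
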
What Does an Activation Function Look Like?

  • Simplest activation: output 1 if the input passes a certain learned threshold, and 0 otherwise.
  • Problem: makes it difficult to learn weights.
  • Why? Backpropagation relies heavily on gradients to update the weights, and the gradient of this activation is zero everywhere except at the threshold, where it is undefined (see the sketch below).
  • So we need activation functions that provide useful gradients...
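
As a quick illustration (the threshold and the test inputs below are arbitrary), here is what backpropagation sees with a hard-threshold activation: the gradient is zero at every point away from the threshold, so there is no signal to push the weights in any direction.

```python
import numpy as np

def step(x, threshold=0.0):
    # Hard-threshold activation: 1 above the threshold, 0 otherwise.
    return (x > threshold).astype(float)

x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
eps = 1e-4
# Central-difference estimate of the derivative at each point.
numerical_grad = (step(x + eps) - step(x - eps)) / (2 * eps)

print(step(x))          # [0. 0. 0. 1. 1. 1.]
print(numerical_grad)   # [0. 0. 0. 0. 0. 0.] -- no gradient to learn from
```
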

  • Sigmoids are popular because they are both nonlinear and differentiable. They squash the input into a value between 0 and 1, similar to a boolean function.
  • However, the output is always positive, which may not be desirable.
  • Tanh activation is similar to sigmoid, but is centered about 0 (range is [-1, 1]), solving the problem of positive-only outputs.
  • But both activations suffer from the vanishing gradient problem: towards the extremes of their inputs, the curves flatten out, making the gradient vanish and preventing the weights from being updated (as the sketch below shows).
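
A small sketch of this saturation (the input values are arbitrary): the derivatives of sigmoid and tanh shrink rapidly as the input moves away from zero, which is exactly the vanishing gradient problem described above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)           # derivative of sigmoid

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2   # derivative of tanh

x = np.array([0.0, 2.0, 5.0, 10.0])
print(d_sigmoid(x))   # roughly [0.25, 0.105, 0.0066, 0.000045] -- vanishing
print(d_tanh(x))      # roughly [1.0, 0.071, 0.00018, 8e-9]     -- vanishing even faster
```
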
ReLU and Leaky ReLU

  • The Rectified Linear Unit (ReLU) is nonlinear and differentiable everywhere except at 0.
  • The ReLU function is just max{0,x}; it will pass the input if it is positive, and 0 if not.
  • This ensures that the gradient exists for values larger than 0 (see image).
  • However, if the input is below 0, the gradient is zero, which can again result in a dead neuron that no longer learns.
  • Leaky ReLU fixes this by adding a small slope below 0, giving the neuron a chance to revive over time if backpropagation wills it to do so (see the sketch below).
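
A minimal sketch of both activations (the 0.01 leak slope is a common but arbitrary choice): ReLU zeroes out negative inputs and their gradient, while Leaky ReLU keeps a small nonzero gradient below zero so the neuron can still receive updates.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def relu_grad(x):
    return (x > 0).astype(float)           # 0 for negative inputs: dead neuron risk

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)     # small but nonzero gradient below 0

x = np.array([-3.0, -0.5, 0.5, 3.0])
print(relu(x), relu_grad(x))               # [0. 0. 0.5 3.]          [0. 0. 1. 1.]
print(leaky_relu(x), leaky_relu_grad(x))   # [-0.03 -0.005 0.5 3.]   [0.01 0.01 1. 1.]
```
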
Softmax

  • Returns a vector of values in [0, 1] that sums to 1.
  • Useful for outputting probabilities over multiple classes, e.g., classifying an image as one of several types of animal.
  • However, some models, such as YOLOv3, instead stick to independent logistic classifiers (sigmoids) that produce a separate probability for each class (see the sketch below).
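
A minimal sketch of the two options (the logit values are made up): softmax turns a vector of scores into probabilities that sum to 1, while independent sigmoids score each class on its own, as in the YOLOv3-style approach mentioned above.

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)    # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

logits = np.array([2.0, 1.0, 0.1])   # e.g. scores for [cat, dog, bird]
print(softmax(logits))               # roughly [0.659, 0.242, 0.099], sums to 1
print(sigmoid(logits))               # roughly [0.881, 0.731, 0.525], each independent
```
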